<a href="https://colab.research.google.com/github/MK316/workshop22/blob/main/cmu_dict.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# CMU dictionary in Python

"CMUdict is a versioned python wrapper package for The CMU Pronouncing Dictionary data files. The main purpose is to expose the data with little or no assumption on how it is to be used." [source link](https://pypi.org/project/cmudict/)

**CMUdict (the Carnegie Mellon Pronouncing Dictionary)** is a free
pronouncing dictionary of English, suitable for use in speech
technology. It is maintained by the Speech Group in the School of
Computer Science at Carnegie Mellon University.

In [18]:
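# Copyright
# Note: run the `!pip install cmudict` and `import cmudict` cells below first.

cmudict.license_string() # Returns the cmudict license text (shown as bytes in the output below)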

b"Copyright (C) 1993-2015 Carnegie Mellon University. All rights reserved.\n\nRedistribution and use in source and binary forms, with or without\nmodification, are permitted provided that the following conditions\nare met:\n\n1. Redistributions of source code must retain the above copyright\n   notice, this list of conditions and the following disclaimer.\n   The contents of this file are deemed to be source code.\n\n2. Redistributions in binary form must reproduce the above copyright\n   notice, this list of conditions and the following disclaimer in\n   the documentation and/or other materials provided with the\n   distribution.\n\nThis work was supported in part by funding from the Defense Advanced\nResearch Projects Agency, the Office of Naval Research and the National\nScience Foundation of the United States of America, and by member\ncompanies of the Carnegie Mellon Sphinx Speech Consortium. We acknowledge\nthe contributions of many volunteers to the expansion and improvement of\

# CMU pronunciation dictionary: symbol list

## CMU Pronouncing Dictionary
        
        Phoneme Example Translation
        ------- ------- -----------
        AA      odd     AA D
        AE      at      AE T
        AH      hut     HH AH T
        AO      ought   AO T
        AW      cow     K AW
        AY      hide    HH AY D
        B       be      B IY
        CH      cheese  CH IY Z
        D       dee     D IY
        DH      thee    DH IY
        EH      Ed      EH D
        ER      hurt    HH ER T
        EY      ate     EY T
        F       fee     F IY
        G       green   G R IY N
        HH      he      HH IY
        IH      it      IH T
        IY      eat     IY T
        JH      gee     JH IY
        K       key     K IY
        L       lee     L IY
        M       me      M IY
        N       knee    N IY
        NG      ping    P IH NG
        OW      oat     OW T
        OY      toy     T OY
        P       pee     P IY
        R       read    R IY D
        S       sea     S IY
        SH      she     SH IY
        T       tea     T IY
        TH      theta   TH EY T AH
        UH      hood    HH UH D
        UW      two     T UW
        V       vee     V IY
        W       we      W IY
        Y       yield   Y IY L D
        Z       zee     Z IY
        ZH      seizure S IY ZH ER

# **{cmudict}**

[📍 webpage: {cmudict}](https://pypi.org/project/cmudict/)

In [1]:
!pip install cmudict

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting cmudict
  Downloading cmudict-1.0.2-py2.py3-none-any.whl (939 kB)
     |████████████████████████████████| 939 kB 4.2 MB/s
Installing collected packages: cmudict
Successfully installed cmudict-1.0.2


**Usage** [Description from https://pypi.org/project/cmudict/]

The cmudict data set includes 4 data files: cmudict.dict, cmudict.phones, cmudict.symbols, and cmudict.vp. See The CMU Pronouncing Dictionary for details on the data. Chances are, if you're here, you already know what's in the files.

Each file can be accessed through three functions, one which returns the raw (string) contents, one which returns a binary stream of the file, and one which does minimal processing of the file into an appropriate structure:

In [10]:
import cmudict

In [None]:
# Entire dictionary content: e.g., 'absentees': [['AE2', 'B', 'S', 'AH0', 'N', 'T', 'IY1', 'Z']],
# cmudict.dict()
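
The "minimal processing" behind a structure like the one `cmudict.dict()` returns can be pictured with a short sketch. It does not call the package. Instead it parses a few lines written in the `cmudict.dict` layout (a word followed by space-separated phones, with alternate pronunciations marked as `word(2)`) into a word-to-pronunciations mapping. The sample text and the helper name `parse_dict_text` are hypothetical, and the package's own parsing may differ in detail.

```python
import re
from collections import defaultdict

# Hypothetical sample in the cmudict.dict layout; the real file is far larger.
SAMPLE_TEXT = """\
'bout B AW1 T
a AH0
a(2) EY1
absentees AE2 B S AH0 N T IY1 Z
"""


def parse_dict_text(text):
    """Map each word to a list of pronunciations (each a list of phones)."""
    pronunciations = defaultdict(list)
    for line in text.splitlines():
        line = line.split(" #", 1)[0].strip()  # drop a trailing comment, if any
        if not line:
            continue
        word, *phones = line.split()
        word = re.sub(r"\(\d+\)$", "", word)  # "a(2)" -> "a"
        pronunciations[word].append(phones)
    return dict(pronunciations)


sample_dict = parse_dict_text(SAMPLE_TEXT)
print(sample_dict["a"])          # [['AH0'], ['EY1']]
print(sample_dict["absentees"])  # [['AE2', 'B', 'S', 'AH0', 'N', 'T', 'IY1', 'Z']]
```

Storing a list of pronunciations per word keeps alternates such as `a(2)` together under a single key.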

**cmudict.phones()**: a list of `(phone, [class])` tuples
> ('AA', ['vowel'])  
> ('JH', ['affricate'])

In [None]:
cmudict.phones()
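
One possible use of this tuple list is to group phones by their class. The sketch below works on a four-item sample in the same shape as the output of `cmudict.phones()`, so it runs without the package; `group_by_class` is a hypothetical helper name.

```python
from collections import defaultdict

# Small sample in the shape returned by cmudict.phones(); not the full list.
sample_phones = [
    ("AA", ["vowel"]),
    ("IY", ["vowel"]),
    ("CH", ["affricate"]),
    ("JH", ["affricate"]),
]


def group_by_class(phones):
    """Return a mapping from phone class to the phones in that class."""
    groups = defaultdict(list)
    for symbol, classes in phones:
        for phone_class in classes:
            groups[phone_class].append(symbol)
    return dict(groups)


groups = group_by_class(sample_phones)
print(groups)  # {'vowel': ['AA', 'IY'], 'affricate': ['CH', 'JH']}
```

With the package installed, passing `cmudict.phones()` instead of `sample_phones` should give the grouping for the full phone set.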

**cmudict.symbols()**   

A list of strings: ['AA', 'AA0', 'AA1', ...]

In [None]:
cmudict.symbols()
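
Vowel symbols appear both bare (`AA`) and with a trailing stress digit, where 0 marks no stress, 1 primary stress, and 2 secondary stress. A minimal sketch for separating the two parts follows; `split_stress` is a hypothetical helper, not part of the package.

```python
import re

# One or more capital letters, optionally followed by a stress digit.
SYMBOL_PATTERN = re.compile(r"^([A-Z]+)([012])?$")


def split_stress(symbol):
    """Split 'AA1' into ('AA', 1); consonants such as 'K' give ('K', None)."""
    match = SYMBOL_PATTERN.match(symbol)
    if match is None:
        raise ValueError(f"not a CMUdict-style symbol: {symbol!r}")
    base, stress = match.groups()
    return base, (int(stress) if stress is not None else None)


print(split_stress("AA1"))  # ('AA', 1)
print(split_stress("K"))    # ('K', None)
```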

In [None]:
# cmudict.vp() # Verbal pronunciations of punctuation marks (from cmudict.vp)

> * cmudict.entries() # Compatible with NLTK  
> * cmudict.raw() # Compatible with NLTK  
> * cmudict.words() # Compatible with NLTK  

In [28]:
#  ('acuity', ['AH0', 'K', 'Y', 'UW1', 'AH0', 'T', 'IY0']),
cmudict.entries()[:10]

[("'bout", ['B', 'AW1', 'T']),
 ("'cause", ['K', 'AH0', 'Z']),
 ("'course", ['K', 'AO1', 'R', 'S']),
 ("'cuse", ['K', 'Y', 'UW1', 'Z']),
 ("'em", ['AH0', 'M']),
 ("'frisco", ['F', 'R', 'IH1', 'S', 'K', 'OW0']),
 ("'gain", ['G', 'EH1', 'N']),
 ("'kay", ['K', 'EY1']),
 ("'m", ['AH0', 'M']),
 ("'n", ['AH0', 'N'])]
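
Because each entry is a `(word, phones)` pair and every vowel phone ends in a stress digit, syllable counts and stress patterns can be read straight off the phone list. The sketch below uses three entries copied from the cells above, so it runs without the package; `stress_pattern` is a hypothetical helper name.

```python
# Entries copied from the notebook cells above, in the shape of cmudict.entries().
sample_entries = [
    ("'bout", ["B", "AW1", "T"]),
    ("'frisco", ["F", "R", "IH1", "S", "K", "OW0"]),
    ("acuity", ["AH0", "K", "Y", "UW1", "AH0", "T", "IY0"]),
]


def stress_pattern(phones):
    """Collect the stress digit of every vowel phone, e.g. '10' for 'frisco."""
    return "".join(phone[-1] for phone in phones if phone[-1].isdigit())


for word, phones in sample_entries:
    pattern = stress_pattern(phones)
    # One stress digit per vowel, so the pattern length is the syllable count.
    print(word, len(pattern), pattern)
# 'bout 1 1
# 'frisco 2 10
# acuity 4 0100
```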

In [None]:
# cmudict.raw()

In [17]:
len(cmudict.words())

135155

# Applications

In [21]:
%%capture
!pip install pronouncing
import pronouncing

In [None]:
pronouncing.rhymes('word')
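
To make the idea of rhyme lookup concrete, here is a minimal sketch under one common definition: two words rhyme when their phones match from the last primary-stressed vowel to the end. This is an assumption for illustration and not necessarily how `pronouncing` implements `rhymes()`. The entries are a small hand-picked sample in CMUdict style, and `rhyming_part` and `sample_rhymes` are hypothetical helpers.

```python
# Hand-picked sample entries in the shape of cmudict.entries().
sample_entries = [
    ("word", ["W", "ER1", "D"]),
    ("bird", ["B", "ER1", "D"]),
    ("heard", ["HH", "ER1", "D"]),
    ("cat", ["K", "AE1", "T"]),
]


def rhyming_part(phones):
    """Return the phones from the last primary-stressed vowel to the end."""
    for i in range(len(phones) - 1, -1, -1):
        if phones[i].endswith("1"):
            return phones[i:]
    return phones  # no primary stress found: fall back to the whole word


def sample_rhymes(word, entries):
    """Find words in `entries` whose rhyming part matches that of `word`."""
    targets = [rhyming_part(p) for w, p in entries if w == word]
    return sorted(
        {w for w, p in entries if w != word and rhyming_part(p) in targets}
    )


print(sample_rhymes("word", sample_entries))  # ['bird', 'heard']
```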