Markdown cells hold my personal notes & thought process. The notebook as a whole gives me a more interactive way to work through setting up the database for this helper tool.

In [1]:
import pandas as pd
import numpy as np
import os, re

In [2]:
os.listdir()

['wordnet20-from-prolog-all-3.sql',
 'cmudict-0.7b.phones',
 'cmudict-0.7b',
 'newdic.txt',
 '.ipynb_checkpoints',
 'cmudict-pared.txt',
 'cmudict-0.7b.symbols',
 'cmu-dict-test.txt',
 'Database Prep.ipynb']

## Initial data import
file - 'cmu-dict-test.txt', temporary project file with a much smaller overall list of words

issues:
- parse the two columns into a dataframe; the phonemes of a pronunciation are separated by whitespace, and each entry sits on its own line.
- supply column names

word -> first whitespace -> turn the pronunciation into a list
- e.g. HH A0 R D L YY1 -> ['HH', 'A0', 'R', 'D', 'L', 'YY1']
- then look for \n to find the next entry for the dataframe
- to do this, split each line on whitespace: index 0 is the word, the rest are the phonemes.
    
potential steps:
- make a list of each word, then each pronunciation
- after this, I have my basis for the initial dataframe
- then I can write functions to parse each pronunciation for syllable count & stressed, and create a list for each of those
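A minimal sketch of the per-line split described above. `parse_entry` is a hypothetical helper name, not something defined elsewhere in this notebook; it assumes every line holds a word followed by whitespace-separated phonemes.

```python
def parse_entry(line):
    """Split one dictionary line into (word, phonemes)."""
    # str.split() with no argument collapses runs of whitespace,
    # so the double space after the word does not yield an empty string.
    word, *phonemes = line.split()
    return word, phonemes


word, phonemes = parse_entry("ABBEVILLE  AE1 B V IH0 L")
print(word)      # ABBEVILLE
print(phonemes)  # ['AE1', 'B', 'V', 'IH0', 'L']
```

Index 0 becomes the word and everything after it becomes the phoneme list, which is the basis for the initial dataframe.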

In [3]:
dictionary = open('cmudict-pared.txt', 'r', encoding="ISO-8859-1")
# utf-8 was not the default encoding for this file. 
# something to keep note of in case this affects searching results
# later in the project
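One possible alternative, sketched under assumptions: reading the file inside a `with` block closes the handle automatically once the lines are loaded. `load_entries` is a hypothetical helper, and the demo writes a throwaway two-line file so the sketch runs without the real dictionary.

```python
import os
import tempfile


def load_entries(path, encoding="ISO-8859-1"):
    """Return the non-empty lines of a dictionary file, newline stripped."""
    with open(path, "r", encoding=encoding) as handle:
        return [line.rstrip("\n") for line in handle if line.strip()]


# Demo on a temporary file that mimics the dictionary layout.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="ISO-8859-1"
) as tmp:
    tmp.write("'ALLO  AA2 L OW1\n'BOUT  B AW1 T\n")

demo_entries = load_entries(tmp.name)
os.remove(tmp.name)
print(demo_entries)
```

With the real file this would be called as `load_entries('cmudict-pared.txt')`.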

In [4]:
# dictionary.readline()
# imported correctly, now to make initial list, per line in the file

In [5]:
# each word in the CMUdict is separated by a new line.
# want to split and get rid of \n character

entries = [word.rstrip('\n') for word in dictionary]

In [6]:
# checking random entries

entries[101]

'ABBEVILLE  AE1 B V IH0 L'

### splitting each entry into a word & its phonemes

- must first split each by white space and then pull index 0 from each of those splits.

In [7]:
sample_word = entries[0].split(' ')
sample_word

["'ALLO", '', 'AA2', 'L', 'OW1']

In [101]:
# set up where they'll go, and we can probably separate the word
# & pronunciation lists in one pass while using the split method.
words = []
pronunciations = []

for item in entries:
    item = item.split()
    words.append(item[0])
    pronunciations.append(tuple(item[1:]))

In [102]:
len(words)

134316

In [103]:
len(pronunciations)

134316

In [104]:
pronunciations[313]

('AH0', 'B', 'AH1', 'V', 'Z')

In [105]:
len(pronunciations)

134316

In [106]:
words[13]

"'TIL"

In [107]:
pronunciations[45]

('EH1', 'R', 'AH0', 'N', 'Z')

* something to keep track of is the format of the pronunciation. When searching for rhymes, would it be more efficient to store it as a list or as a string? It comes down to comparing whitespace-delimited substrings vs counting how many trailing elements two lists have in common.
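To make the list-vs-string question concrete, here is a rough sketch of both comparisons. `shared_tail` is a hypothetical helper, and the second pronunciation is made up for illustration rather than looked up in the dictionary.

```python
def shared_tail(a, b):
    """Count how many trailing phonemes two pronunciations have in common."""
    count = 0
    for x, y in zip(reversed(a), reversed(b)):
        if x != y:
            break
        count += 1
    return count


entry_313 = ('AH0', 'B', 'AH1', 'V', 'Z')  # pronunciations[313] from above
other = ('L', 'AH1', 'V', 'Z')             # illustrative second entry

# List/tuple form: walk both sequences from the end.
print(shared_tail(entry_313, other))  # 3

# String form: the leading space keeps the match on a phoneme boundary.
tail = ' ' + ' '.join(other[1:])
print(' '.join(entry_313).endswith(tail))  # True
```

Both forms are short to write; the string form only needs care that a match cannot start in the middle of a phoneme.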

## getting syllable counts

In [108]:
count = 0
numbers = r'\d'

# count = sum(numbers in item for item in pronunciations[13])
# count

# this is a potential setup for what I need to do for syllable counts
# a list comprehension with something like the below could get me syllable
# counts for each word.

for item in pronunciations[45]:
    if bool(re.search(numbers, item)):
        count += 1
count

2
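A note on the commented-out one-liner: `numbers in item` tests whether the literal two-character text `\d` occurs in the phoneme, so that sum would always be 0. A sketch of a comprehension that does count vowels, assuming every vowel phoneme ends in a stress digit (`count_syllables_fast` is a hypothetical name):

```python
def count_syllables_fast(phonemes):
    # Vowel phonemes end in a stress digit (0, 1 or 2); consonants do not.
    return sum(p[-1].isdigit() for p in phonemes)


sample = ('EH1', 'R', 'AH0', 'N', 'Z')  # pronunciations[45] from above

# The literal-substring test never matches, so it counts nothing.
print(sum(r'\d' in item for item in sample))  # 0

print(count_syllables_fast(sample))  # 2
```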

In [109]:
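def count_syllables(phonemes):
    # a phoneme carrying a stress digit is a vowel, i.e. one syllable
    # (parameter renamed so it no longer shadows the built-in `list`)
    count = 0
    for item in phonemes:
        if re.search(numbers, item):
            count += 1
    return count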

In [110]:
syllables = [count_syllables(x) for x in pronunciations]

In [111]:
len(syllables)

134316

^ looks like we got the whole list

\\/ some checks for various syllable counts

In [112]:
syllables[45]

2

In [113]:
syllables[197]

2

In [114]:
words.index('ABERRATIONAL')

195

Looking good, now to parse meter from each of these entries.

## Syllable stresses

The numbers attached to vowel sounds indicate stresses:
- 0: no stress
- 1: primary stress
- 2: secondary stress

### How to handle stresses?
- secondary stresses have a weird place in this project. 
- I could decide to count them as stresses always, but that doesn't tell the whole story of secondary stresses.
- There are cases where secondary stresses get promoted to a primary stress depending on surrounding syllables and/or meter.
- For the set-up of my database, more research on how to handle these will need to be done.
- In the end it will modify how some functions are written and certainly how lines are parsed for meter when that feature is added.

Will need to look at the necessity of separating primary & secondary stress vs combining them... '\`_

... So why make all these lists and not just run the functions each time a word is searched?
- It'll be better to query a pre-filled database than to re-run these functions over the entire dictionary on every search.
- The database may need further fields, or bins of some kind for feet/meter & rhymes. Those will require searching the entire dictionary.
- English has roughly 44 phonemes (CMUdict's ARPAbet set uses 39). These pronunciations can be grouped by phonemes. ??? searching end-rhymes vs full/partial word-rhymes???
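One way the "bins for rhymes" idea could look, as a sketch only: group words by the phonemes from their last primary-stressed vowel onward. That key is just one possible definition of an end-rhyme, `rhyme_key` is a hypothetical helper, and the two entries marked below are illustrative additions to the sample.

```python
from collections import defaultdict


def rhyme_key(phonemes):
    """Phonemes from the last primary-stressed vowel to the end of the word."""
    for i in range(len(phonemes) - 1, -1, -1):
        if phonemes[i].endswith('1'):
            return tuple(phonemes[i:])
    return tuple(phonemes)  # no primary stress: fall back to the whole word


sample = {
    "'BOUT": ['B', 'AW1', 'T'],
    "'COURSE": ['K', 'AO1', 'R', 'S'],
    "OUT": ['AW1', 'T'],               # illustrative entry
    "ABOUT": ['AH0', 'B', 'AW1', 'T'], # illustrative entry
}

rhyme_bins = defaultdict(list)
for word, phonemes in sample.items():
    rhyme_bins[rhyme_key(phonemes)].append(word)

print(rhyme_bins[('AW1', 'T')])  # ["'BOUT", 'OUT', 'ABOUT']
```

Building the bins once means a rhyme search becomes a dictionary lookup instead of a scan over every entry.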
    
### Stresses
- primary & secondary stresses will be retained
- symbols:
    - unstressed: _
    - primary:    '
    - secondary:  \`
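Because both stress levels are kept, the secondary-stress decision can be postponed: a stored scansion string can be collapsed either way at query time. A small sketch, assuming the three symbols listed above (`collapse_secondary` is a hypothetical helper):

```python
def collapse_secondary(scansion, promote=True):
    """Rewrite secondary stresses as primary (promote) or as unstressed."""
    return scansion.replace('`', "'" if promote else '_')


pattern = "`_'__"  # secondary, unstressed, primary, unstressed, unstressed

print(collapse_secondary(pattern))         # '_'__
print(collapse_secondary(pattern, False))  # __'__
```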

In [115]:
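def parse_scansion(phonemes):
    # map each vowel's stress digit to its scansion symbol; consonants carry no digit
    # (parameter renamed so it no longer shadows the built-in `list`)
    scansion = ''
    for item in phonemes:
        if '0' in item:
            scansion += '_'
        elif '1' in item:
            scansion += "'"
        elif '2' in item:
            scansion += '`'
    return scansion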

In [116]:
scansion = [parse_scansion(x) for x in pronunciations]
print(len(scansion))

134316


In [117]:
scansion[195]

"`_'__"

In [118]:
# storing the pronunciations as lists (tuples) makes them harder to put into an SQL database.
# For now they are stored as space-separated strings. Finding matching rhymes down the road will only change slightly.

pronunciations = [' '.join(item) for item in pronunciations]
pronunciations[313]

'AH0 B AH1 V Z'

## time to set up the DataFrame

this will eventually be converted to an SQL/postgres database for use in the web app.

In [119]:
columns = ["WORD", "PRONUNCIATION", "SYLLABLES", "SCANSION"]
# one row per dictionary entry, built from the four parallel lists
rows = list(zip(words, pronunciations, syllables, scansion))
df = pd.DataFrame(rows, columns=columns)

In [120]:
df.head()
# df.shape

      WORD  PRONUNCIATION  SYLLABLES  SCANSION
0    'ALLO      AA2 L OW1          2        `'
1    'BOUT        B AW1 T          1         '
2   'CAUSE        K AH0 Z          1         _
3  'COURSE      K AO1 R S          1         '
4    'CUSE      K Y UW1 Z          1         '


necessary columns in place. Time to ponder what else could be useful for the remainder of the project.

??? would it be better to list the types of feet each word can fulfill in another column, or calculate it during search? ???
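For the "calculate it during search" option, a rough sketch of mapping a scansion string to a foot name. The table covers a handful of common two- and three-syllable feet, and treating secondary stress as stressed is an assumption made only for this example; `FEET` and `foot_name` are hypothetical names.

```python
# Scansion symbols: _ unstressed, ' primary, ` secondary.
FEET = {
    "_'": "iamb",
    "'_": "trochee",
    "''": "spondee",
    "__": "pyrrhic",
    "'__": "dactyl",
    "__'": "anapest",
    "_'_": "amphibrach",
}


def foot_name(scansion, promote_secondary=True):
    """Return the foot a whole word fills, or None if it matches none listed."""
    if promote_secondary:
        scansion = scansion.replace('`', "'")
    return FEET.get(scansion)


print(foot_name("`'"))  # spondee (the 'ALLO row, with secondary promoted)
print(foot_name("_'"))  # iamb
print(foot_name("'"))   # None (a single syllable is not a foot on its own)
```

A lookup this cheap suggests the column could be computed on demand, though a stored column would make filtering in SQL simpler.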



In [121]:
df.isna().sum()

WORD             0
PRONUNCIATION    0
SYLLABLES        0
SCANSION         0
dtype: int64

In [122]:
df.dtypes

WORD             object
PRONUNCIATION    object
SYLLABLES         int64
SCANSION         object
dtype: object

In [80]:
import sys
!{sys.executable} -m pip install pymysql



In [129]:
# df.to_json('words_db')
from sqlalchemy import create_engine
hostname='ZipCoders-MacBook-Pro.local'
dbname='project_of_passion'
uname='nick'
pwd='nick123'

engine = create_engine("mysql+pymysql://{user}:{pw}@{host}/{db}".format(host=hostname, db=dbname, user=uname,pw=pwd))
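A possible variation, sketched under assumptions: the connection details could come from environment variables so the password does not live in the notebook. The variable names (`DB_USER`, `DB_PASSWORD`, `DB_HOST`, `DB_NAME`) and `build_mysql_url` are hypothetical; the password is percent-encoded so characters such as `@` do not break the URL.

```python
import os
from urllib.parse import quote_plus


def build_mysql_url(env=os.environ):
    """Assemble a mysql+pymysql URL from environment-style settings."""
    user = env.get("DB_USER", "user")
    password = quote_plus(env.get("DB_PASSWORD", ""))
    host = env.get("DB_HOST", "localhost")
    name = env.get("DB_NAME", "project_of_passion")
    return f"mysql+pymysql://{user}:{password}@{host}/{name}"


# Demo with an explicit mapping instead of the real environment.
url = build_mysql_url(
    {"DB_USER": "nick", "DB_PASSWORD": "p@ss word", "DB_HOST": "localhost"}
)
print(url)  # mysql+pymysql://nick:p%40ss+word@localhost/project_of_passion
```

The resulting string would be passed to `create_engine` in place of the formatted URL above.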

Time to research how to store the database. SQL/JSON/CSV. The file is rather small.
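Since the file is small, a single-file SQLite database is one storage option worth comparing. A minimal sketch using only the standard library and an in-memory database; the table layout mirrors the dataframe columns and the three rows are copied from `df.head()` above.

```python
import sqlite3

rows = [
    ("'ALLO", "AA2 L OW1", 2, "`'"),
    ("'BOUT", "B AW1 T", 1, "'"),
    ("'CAUSE", "K AH0 Z", 1, "_"),
]

conn = sqlite3.connect(":memory:")  # swap for a file path to persist
conn.execute(
    "CREATE TABLE words ("
    "word TEXT, pronunciation TEXT, syllables INTEGER, scansion TEXT)"
)
conn.executemany("INSERT INTO words VALUES (?, ?, ?, ?)", rows)
conn.commit()

# Look up every one-syllable word, as a search feature might.
cursor = conn.execute(
    "SELECT word FROM words WHERE syllables = ? ORDER BY word", (1,)
)
one_syllable = [row[0] for row in cursor]
print(one_syllable)  # ["'BOUT", "'CAUSE"]
conn.close()
```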

In [131]:
df.to_sql('words', engine, if_exists='replace')

134316