Update May 6th, 2024
The stress patterns are based on the CMU Pronouncing Dictionary.
The cmudict module from the NLTK library was used to extract the stress pattern from the dataset.
The English words dataset was based on the SubtlexUS dataset.
The CMU Pronouncing Dictionary website itself cautions: "Stress is difficult to get right and people disagree about it."
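
As a rough illustration of how a stress pattern can be read off a CMU Pronouncing Dictionary entry: each vowel phoneme ends in a digit (0 = no stress, 1 = primary, 2 = secondary), so collecting those digits gives the pattern. This is a minimal sketch, not the code in this repository. The function names and the small sample dictionary are hypothetical; with NLTK installed and the corpus downloaded, `nltk.corpus.cmudict.dict()` returns a mapping of the same shape.

```python
# Illustrative entries in the CMU Pronouncing Dictionary format (ARPAbet phonemes).
# nltk.corpus.cmudict.dict() returns a mapping of this same shape:
# word -> list of pronunciations, each a list of phoneme strings.
SAMPLE_PRONUNCIATIONS = {
    "water": [["W", "AO1", "T", "ER0"]],
    "banana": [["B", "AH0", "N", "AE1", "N", "AH0"]],
}


def stress_pattern(phonemes):
    """Join the stress digits carried by the vowel phonemes."""
    return "".join(p[-1] for p in phonemes if p[-1].isdigit())


def lookup_stress(word, pronunciations=SAMPLE_PRONUNCIATIONS):
    """Stress pattern of the first listed pronunciation, or None if the word is missing.

    Taking the first pronunciation is an assumption made for this sketch.
    """
    entries = pronunciations.get(word.lower())
    if not entries:
        return None
    return stress_pattern(entries[0])


print(lookup_stress("water"))
print(lookup_stress("banana"))
```

Some words have several pronunciations with different stress, which is one reason the resulting patterns should be read as approximate.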
main.py
- Execute the script
eng_stress_pattern_finder.py
- Find the stress pattern of each English word in the given dataset
extract_and_transform_syllable_data.py
- Extract data from the SubtlexUS dataset
- Transform the data to find the syllable count and stress pattern of each English word
- Words that aren't in the CMU Pronouncing Dictionary are filtered out
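
The transform step described above can be sketched roughly as follows. This is an assumption-laden illustration, not the script's actual code: the `transform` function, the row layout, lowercasing, and the choice of the first pronunciation are all hypothetical. The syllable count is taken to be the number of stress-marked (vowel) phonemes.

```python
def transform(words, pronunciations):
    """Return (word, syllable_count, stress_pattern) rows for words found in the dictionary."""
    rows = []
    for word in words:
        entries = pronunciations.get(word.lower())
        if not entries:
            # Word is not in the pronouncing dictionary: filter it out.
            continue
        # Vowel phonemes end in a stress digit; one vowel per syllable.
        stresses = [p[-1] for p in entries[0] if p[-1].isdigit()]
        rows.append((word.lower(), len(stresses), "".join(stresses)))
    return rows


# Tiny stand-in for the CMU dictionary mapping (word -> list of pronunciations).
sample_dict = {
    "water": [["W", "AO1", "T", "ER0"]],
    "banana": [["B", "AH0", "N", "AE1", "N", "AH0"]],
}

rows = transform(["water", "banana", "qwzx"], sample_dict)
print(rows)
```
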
load_to_sqlite.py
- Load data to SQLite database tables
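
For the load step, a minimal sketch using Python's built-in `sqlite3` module might look like the following. The table name `word_stress`, its columns, and the `load_rows` helper are hypothetical; the schema used by `load_to_sqlite.py` may differ.

```python
import sqlite3


def load_rows(conn, rows):
    """Create the (hypothetical) word_stress table if needed and insert the rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS word_stress ("
        "word TEXT PRIMARY KEY, syllable_count INTEGER, stress_pattern TEXT)"
    )
    # Parameterized inserts; an existing word is overwritten by the newer row.
    conn.executemany("INSERT OR REPLACE INTO word_stress VALUES (?, ?, ?)", rows)
    conn.commit()


# An in-memory database keeps the example self-contained.
conn = sqlite3.connect(":memory:")
load_rows(conn, [("water", 2, "10"), ("banana", 3, "010")])

result = conn.execute(
    "SELECT word, stress_pattern FROM word_stress WHERE syllable_count = 3"
).fetchall()
print(result)
```

Storing the stress pattern as text rather than as an integer keeps leading zeros such as "010" intact.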
The CMU Pronouncing Dictionary: http://www.speech.cs.cmu.edu/cgi-bin/cmudict
SubtlexUS dataset: http://www.lexique.org/?page_id=241