-
Notifications
You must be signed in to change notification settings - Fork 0
Add count_syllables to ch08 #14
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Yeah, no surprise there. Looks like I might have to split functional and unit tests. Functional tests will be tough since I'd have to somehow install the CMUdict corpus. Likely by downloading the Functional tests on this one could take a while to work out. |
Well, at least I didn't have to go with a manual install using a bash script. |
As they say, "more than one way to skin a cat." Now, let's add a directory check since CMUdict installs to the user's home directory without administrator privileges. |
src/ch08/p1_count_syllables.py
Outdated
nltk.download('cmudict') | ||
if not os.path.exists(os.path.expanduser('~/nltk_data/corpora/cmudict/cmudict')): | ||
# FIXME: This is nearly impossible to test. | ||
# Patching os affects every use of os in the module. | ||
nltk.download('cmudict') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This works, but is difficult to test. If there is a better way, please open a PR.
|
||
if not os.path.exists(os.path.expanduser('~/nltk_data/corpora/cmudict/cmudict')): | ||
if not os.path.exists( | ||
os.path.expanduser('~/nltk_data/corpora/cmudict/cmudict')): | ||
# pylint: disable=fixme | ||
# FIXME: This is nearly impossible to test. | ||
# Patching os affects every use of os in the module. | ||
nltk.download('cmudict') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Funnily enough, because the test environment doesn't have CMUdict installed, installing it counts as a test of this portion.
I'm leaving the FIXME
because it's more of an unintended benefit rather than an actual test.
cache: pip # Don't delete pip install | ||
cache: | ||
pip: true # Don't delete pip install | ||
directories: | ||
- $HOME/nltk_data/corpora/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Caching the nltk
corpora should relieve load on the nltk servers.
syllables += MISSING_WORDS[word] | ||
else: | ||
for phonemes in CMUDICT[word][0]: | ||
for phoneme in phonemes: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These can be implemented as MISSING_WORDS.get(word)
and CMUDICT.get(word)[0]
; however, the get()
method doesn't raise a KeyError
if the key isn't present. Instead, it returns a value of None
by default, but this behavior can be changed.
Here's the thing: I'd be willing to refactor the tests to account for this change in behavior if not for one small detail. Setting values would still be done by doing MISSING_WORDS[word] == value
. There isn't a set()
dictionary method. Probably for good reason, but as a result, I don't want to mix syntax.
word_list = cleanup_dict(DICTIONARY_FILE_PATH) | ||
sample_list = sample(word_list, 15) | ||
for word in sample_list: | ||
try: | ||
syllables = count_syllables(format_words(word)) | ||
except KeyError: | ||
# Skip words in neither dictionary. | ||
print(f'Not found: {word}') | ||
continue | ||
print(f'{word} {syllables}') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After reviewing the book's solution, I agree that using sample()
is a better way to randomly select words to use without duplicates and makes for better looping. I also agree that displaying missing words helps add them to either dictionary.
@@ -0,0 +1,15 @@ | |||
Not found: yggdrasil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd love to see a phoneme list for yggdrasil
. In the meantime, I'd now be able to add it to missing_words.json
as "yggdrasil": 3
, which is handy.
Summary
Start and finish chapter 8 with a syllable counter.
Description
Introduces the
nltk
package to count the number of syllables in a word or phrase. Thenltk
package contains the CMUdict corpus which lists words with their phonemes. Right? Me neither.Since the CMUdict corpus doesn't contain direct syllable counts, we have to check the parts of the word's phoneme that have vowels and count them.
If the word is not in
nltk
's CMUdict corpus, it checks a localjson
file for the syllable count. This file gets populated manually.The
main
function tests how wellcount_syllables
works with Ubuntu's systemamerican-english
word dictionary.This is a new package, so I'm betting dollars to donuts that tests will fail because the CMUdict corpus isn't installed on the test machine and I'll end up having to mock the CMUdict corpus as either a
json
file or a dictionary.But, hey, that's what PRs are for.
Team Notifications
Me, myself, and I.