
Conversation

JoseALermaIII
Owner

Summary

Start and finish chapter 8 with a syllable counter.

Description

Introduces the nltk package to count the number of syllables in a word or phrase. The nltk package contains the CMUdict corpus which lists words with their phonemes. Right? Me neither.

Since the CMUdict corpus doesn't contain direct syllable counts, we have to go through the phonemes in a word's pronunciation and count the ones that are vowels.
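A minimal sketch of that counting step, assuming the usual CMUdict convention that vowel phonemes end in a stress digit (0, 1, or 2). The name `count_vowel_phonemes` and the two sample entries are illustrative only; they are typed in by hand so the snippet runs without the corpus installed.

```python
def count_vowel_phonemes(phonemes):
    """Count the phonemes that end in a stress digit, i.e. the vowels."""
    return sum(1 for phoneme in phonemes if phoneme[-1].isdigit())


# Hand-typed sample entries in CMUdict's shape:
# word -> list of pronunciations, each a list of phoneme strings.
SAMPLE_ENTRIES = {
    'hello': [['HH', 'AH0', 'L', 'OW1']],
    'syllable': [['S', 'IH1', 'L', 'AH0', 'B', 'AH0', 'L']],
}

for word, pronunciations in SAMPLE_ENTRIES.items():
    # Only the first listed pronunciation is used here.
    print(word, count_vowel_phonemes(pronunciations[0]))
```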

If the word is not in nltk's CMUdict corpus, count_syllables checks a local JSON file for the syllable count. This file gets populated manually.
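A rough sketch of that fallback, under stated assumptions: `lookup_syllables` is a hypothetical name, the JSON text stands in for the local file, and the one-word `CMUDICT` stands in for the real corpus. The lookup order follows the description above; with no word in both sources, checking the JSON data first would give the same result.

```python
import json

# Stand-in for the contents of the hand-maintained JSON file.
MISSING_WORDS = json.loads('{"yggdrasil": 3}')

# Tiny stand-in for the corpus: word -> list of pronunciations.
CMUDICT = {'tree': [['T', 'R', 'IY1']]}


def lookup_syllables(word):
    """Return the syllable count from the corpus, else from the JSON data."""
    if word in CMUDICT:
        # Vowel phonemes end in a stress digit.
        return sum(1 for phoneme in CMUDICT[word][0] if phoneme[-1].isdigit())
    # Raises KeyError when the word is in neither source.
    return MISSING_WORDS[word]


print(lookup_syllables('tree'))
print(lookup_syllables('yggdrasil'))
```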

The main function tests how well count_syllables works with Ubuntu's system american-english word dictionary.

This is a new package, so I'm betting dollars to donuts that tests will fail because the CMUdict corpus isn't installed on the test machine and I'll end up having to mock the CMUdict corpus as either a json file or a dictionary.
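One way the dictionary mock could look, as a sketch only: it assumes the module keeps the corpus in a module-level dict named `CMUDICT`, and it uses a simplified single-word `count_syllables` rather than the real function. The fake entry is made up for the test.

```python
from unittest import mock

# Stand-in for the module-level corpus; empty, as on a machine without CMUdict.
CMUDICT = {}


def count_syllables(word):
    """Simplified single-word version: count phonemes ending in a digit."""
    return sum(1 for phoneme in CMUDICT[word][0] if phoneme[-1].isdigit())


def test_count_syllables_with_mocked_corpus():
    fake_corpus = {'python': [['P', 'AY1', 'TH', 'AA2', 'N']]}
    # patch.dict swaps in the fake entries and restores the dict on exit.
    with mock.patch.dict(CMUDICT, fake_corpus, clear=True):
        assert count_syllables('python') == 2
    assert 'python' not in CMUDICT


test_count_syllables_with_mocked_corpus()
```

Because `patch.dict` mutates the existing dict in place, code that already holds a reference to `CMUDICT` sees the fake entries too.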

But, hey, that's what PRs are for.

Team Notifications

Me, myself, and I.

@JoseALermaIII added the enhancement (New feature or request) label on Oct 19, 2019
@JoseALermaIII self-assigned this on Oct 19, 2019
@JoseALermaIII
Owner Author

Yeah, no surprise there. Looks like I might have to split functional and unit tests.

Functional tests will be tough since I'd have to somehow install the CMUdict corpus. Likely by downloading the zip file and extracting it to the proper location via a script.

Functional tests on this one could take a while to work out.

@JoseALermaIII
Owner Author

Well, at least I didn't have to go with a manual install using a bash script.

@JoseALermaIII marked this pull request as ready for review on October 19, 2019, 19:38
@JoseALermaIII
Owner Author

As they say, "more than one way to skin a cat." Now, let's add a directory check since CMUdict installs to the user's home directory without administrator privileges.

Comment on lines 19 to 22
-nltk.download('cmudict')
+if not os.path.exists(os.path.expanduser('~/nltk_data/corpora/cmudict/cmudict')):
+    # FIXME: This is nearly impossible to test.
+    # Patching os affects every use of os in the module.
+    nltk.download('cmudict')
Owner Author

This works, but is difficult to test. If there is a better way, please open a PR.
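One possible direction, sketched under assumptions rather than taken from this PR: move the check into a small function that receives the path and the download callable as arguments, so a test can pass a temporary path and a fake downloader instead of patching `os`. The name `ensure_corpus` is hypothetical.

```python
import os
import tempfile


def ensure_corpus(corpus_path, download):
    """Run download() only if corpus_path is missing. Return True if it ran."""
    if os.path.exists(corpus_path):
        return False
    download()
    return True


# In the module, the call might look like this (not executed here):
#   ensure_corpus(os.path.expanduser('~/nltk_data/corpora/cmudict/cmudict'),
#                 lambda: nltk.download('cmudict'))

# Exercise both branches with a temporary directory and a fake downloader.
download_calls = []
with tempfile.TemporaryDirectory() as tmp_dir:
    fake_path = os.path.join(tmp_dir, 'cmudict')
    ran_when_missing = ensure_corpus(fake_path, lambda: download_calls.append(1))
    os.mkdir(fake_path)
    ran_when_present = ensure_corpus(fake_path, lambda: download_calls.append(1))
```

The trade-off is one extra function and an import-time call, in exchange for not touching `os` in tests.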

Comment on lines 19 to 24

-if not os.path.exists(os.path.expanduser('~/nltk_data/corpora/cmudict/cmudict')):
+if not os.path.exists(
+        os.path.expanduser('~/nltk_data/corpora/cmudict/cmudict')):
+    # pylint: disable=fixme
+    # FIXME: This is nearly impossible to test.
+    # Patching os affects every use of os in the module.
+    nltk.download('cmudict')
Owner Author

Funnily enough, because the test environment doesn't have CMUdict installed, installing it counts as a test of this portion.

I'm leaving the FIXME because it's more of an unintended benefit rather than an actual test.

Comment on lines -9 to +12
-cache: pip # Don't delete pip install
+cache:
+  pip: true # Don't delete pip install
+  directories:
+    - $HOME/nltk_data/corpora/
Owner Author

Caching the nltk corpora should relieve load on the nltk servers.

Comment on lines +89 to +92
        syllables += MISSING_WORDS[word]
    else:
        for phonemes in CMUDICT[word][0]:
            for phoneme in phonemes:
Owner Author

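These can be implemented as MISSING_WORDS.get(word) and CMUDICT.get(word)[0]; however, the get() method doesn't raise a KeyError if the key isn't present. Instead, it returns None by default, and a different default can be passed as a second argument. For a missing word, CMUDICT.get(word)[0] would then fail with a TypeError rather than the KeyError the tests expect.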

Here's the thing: I'd be willing to refactor the tests to account for this change in behavior if not for one small detail. Setting values would still be done with MISSING_WORDS[word] = value. There isn't a set() dictionary method. Probably for good reason, but as a result, I don't want to mix syntax.
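For reference, a short sketch of the behaviors being compared, using a throwaway dict; the words and counts are made up for illustration.

```python
MISSING_WORDS = {'yggdrasil': 3}

# Subscript lookup raises KeyError for an absent key.
try:
    MISSING_WORDS['ragnarok']
    subscript_error = None
except KeyError:
    subscript_error = 'KeyError'

# get() returns None, or the supplied default, instead of raising.
get_result = MISSING_WORDS.get('ragnarok')
get_with_default = MISSING_WORDS.get('ragnarok', 0)

# Indexing the None that get() returned raises TypeError, not KeyError.
try:
    MISSING_WORDS.get('ragnarok')[0]
    chained_error = None
except TypeError:
    chained_error = 'TypeError'

# Assignment always uses subscript syntax.
MISSING_WORDS['ragnarok'] = 3
```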

Comment on lines +100 to +109
word_list = cleanup_dict(DICTIONARY_FILE_PATH)
sample_list = sample(word_list, 15)
for word in sample_list:
    try:
        syllables = count_syllables(format_words(word))
    except KeyError:
        # Skip words in neither dictionary.
        print(f'Not found: {word}')
        continue
    print(f'{word} {syllables}')
Owner Author

After reviewing the book's solution, I agree that using sample() is a better way to randomly select words to use without duplicates and makes for better looping. I also agree that displaying missing words helps add them to either dictionary.
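A quick illustration of why sample() fits here: it picks distinct positions from the list, so a list of unique words yields no repeats. The word list below is a made-up stand-in for the cleaned-up system dictionary.

```python
from random import sample

# Stand-in for the cleaned-up word list; every entry is unique.
word_list = ['apple', 'banana', 'cherry', 'damson', 'elderberry', 'fig']

# sample() selects without replacement, so no word appears twice.
sample_list = sample(word_list, 3)
for word in sample_list:
    print(word)
```

Asking for more items than the list holds raises a ValueError, so the sample size has to stay at or below the length of the word list.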

@@ -0,0 +1,15 @@
Not found: yggdrasil
Owner Author

I'd love to see a phoneme list for yggdrasil. In the meantime, I'd now be able to add it to missing_words.json as "yggdrasil": 3, which is handy.
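The file is edited by hand in this PR, but if that ever gets tedious, a helper along these lines could do it. This is a sketch under assumptions: `add_missing_word` is a hypothetical name, and the demo writes to a temporary file rather than the real missing_words.json.

```python
import json
import os
import tempfile


def add_missing_word(json_path, word, syllables):
    """Add or update one word's syllable count in the JSON file."""
    with open(json_path, encoding='utf-8') as json_file:
        words = json.load(json_file)
    words[word] = syllables
    with open(json_path, 'w', encoding='utf-8') as json_file:
        json.dump(words, json_file, indent=4, sort_keys=True)
    return words


# Demo against a temporary file that starts out as an empty JSON object.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'missing_words.json')
    with open(demo_path, 'w', encoding='utf-8') as json_file:
        json.dump({}, json_file)
    add_missing_word(demo_path, 'yggdrasil', 3)
    with open(demo_path, encoding='utf-8') as json_file:
        saved_words = json.load(json_file)
```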

@JoseALermaIII merged commit e38e7b8 into master on Oct 25, 2019
@JoseALermaIII deleted the count-syllables branch on October 25, 2019, 21:49