Add count_syllables to ch08 #14

JoseALermaIII · 2019-10-19T01:33:30Z

Summary

Start and finish chapter 8 with a syllable counter.

Description

Introduces the nltk package to count the number of syllables in a word or phrase. The nltk package provides access to the CMUdict corpus, which lists words with their phonemes. Right? Me neither.

Since the CMUdict corpus doesn't contain direct syllable counts, we have to go through the phonemes in a word's pronunciation and count the ones that are vowel sounds, which CMUdict marks with a trailing stress digit.
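A minimal sketch of that counting step, assuming CMUdict's convention that each vowel phoneme ends in a stress digit (0, 1, or 2). `SAMPLE_PRONUNCIATIONS` and `count_vowel_phonemes` are hypothetical stand-ins so the example runs without downloading the corpus; they are not the module's actual code.

```python
# Hypothetical stand-in for the CMUdict lookup table: each word maps to a
# list of pronunciations, and each pronunciation is a list of phonemes
# written in CMUdict's format.
SAMPLE_PRONUNCIATIONS = {
    'syllable': [['S', 'IH1', 'L', 'AH0', 'B', 'AH0', 'L']],
    'counter': [['K', 'AW1', 'N', 'T', 'ER0']],
}


def count_vowel_phonemes(word, pronunciations=SAMPLE_PRONUNCIATIONS):
    """Count phonemes in the first pronunciation that end in a stress digit."""
    first_pronunciation = pronunciations[word][0]
    return sum(1 for phoneme in first_pronunciation if phoneme[-1].isdigit())


print(count_vowel_phonemes('syllable'))  # 3
print(count_vowel_phonemes('counter'))   # 2
```

Only the first pronunciation is used here; words with several pronunciations may have different counts for the others.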

If the word is not in nltk's CMUdict corpus, count_syllables checks a local JSON file for the syllable count. This file is populated manually.
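The fallback could look roughly like the sketch below. The names `PRONUNCIATIONS`, `MISSING_WORDS_JSON`, and `syllables_for` are hypothetical, the JSON is loaded from a string instead of a file to keep the example self-contained, and the lookup order follows the description above, which may differ from the module.

```python
import json

# Hypothetical stand-in for the CMUdict lookup table.
PRONUNCIATIONS = {'word': [['W', 'ER1', 'D']]}

# Hypothetical contents of the manually maintained JSON file.
MISSING_WORDS_JSON = '{"yggdrasil": 3}'
MISSING_WORDS = json.loads(MISSING_WORDS_JSON)


def syllables_for(word):
    """Use the pronunciation table first, then the hand-maintained counts."""
    if word in PRONUNCIATIONS:
        first_pronunciation = PRONUNCIATIONS[word][0]
        return sum(1 for phoneme in first_pronunciation
                   if phoneme[-1].isdigit())
    # Raises KeyError when the word is in neither source.
    return MISSING_WORDS[word]


print(syllables_for('word'))
print(syllables_for('yggdrasil'))
```

Letting the final lookup raise KeyError keeps "word in neither source" visible to the caller instead of silently returning a guess.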

The main function tests how well count_syllables works with Ubuntu's system american-english word dictionary.

This is a new package, so I'm betting dollars to donuts that tests will fail because the CMUdict corpus isn't installed on the test machine and I'll end up having to mock the CMUdict corpus as either a json file or a dictionary.

But, hey, that's what PRs are for.

Team Notifications

Me, myself, and I.

JoseALermaIII · 2019-10-19T01:40:44Z

Yeah, no surprise there. Looks like I might have to split functional and unit tests.

Functional tests will be tough since I'd have to somehow install the CMUdict corpus. Likely by downloading the zip file and extracting it to the proper location via a script.

Functional tests on this one could take a while to work out.

JoseALermaIII · 2019-10-19T19:37:58Z

Well, at least I didn't have to go with a manual install using a bash script.

This reverts commit 321281e

This reverts commit e0a9d50

JoseALermaIII · 2019-10-19T20:37:00Z

As they say, "more than one way to skin a cat." Now, let's add a directory check since CMUdict installs to the user's home directory without administrator privileges.

JoseALermaIII · 2019-10-19T21:40:14Z

src/ch08/p1_count_syllables.py

-nltk.download('cmudict')
+if not os.path.exists(os.path.expanduser('~/nltk_data/corpora/cmudict/cmudict')):
+    # FIXME: This is nearly impossible to test.
+    #  Patching os affects every use of os in the module.
+    nltk.download('cmudict')


This works, but is difficult to test. If there is a better way, please open a PR.
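One possible way to make the check easier to test, sketched under assumptions: move it into a function that receives the existence check and the download call as arguments, so a test can pass fakes instead of patching os. `ensure_corpus` and its parameters are hypothetical names, not part of this PR or of nltk.

```python
import os

CMUDICT_PATH = os.path.expanduser('~/nltk_data/corpora/cmudict/cmudict')


def ensure_corpus(path, download, exists=os.path.exists):
    """Call download() only when path is missing; return True if it ran."""
    if exists(path):
        return False
    download()
    return True


# In the module this might be called as:
#     ensure_corpus(CMUDICT_PATH, lambda: nltk.download('cmudict'))
# In a test, both collaborators can be replaced with fakes:
calls = []
ran = ensure_corpus('/no/such/path', lambda: calls.append('cmudict'),
                    exists=lambda _path: False)
print(ran, calls)
```

The trade-off is one more function to document in exchange for a check that can be exercised without touching the file system or the network.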

JoseALermaIII · 2019-10-19T21:52:18Z

src/ch08/p1_count_syllables.py


-if not os.path.exists(os.path.expanduser('~/nltk_data/corpora/cmudict/cmudict')):
+if not os.path.exists(
+        os.path.expanduser('~/nltk_data/corpora/cmudict/cmudict')):
+    # pylint: disable=fixme
    # FIXME: This is nearly impossible to test.
    #  Patching os affects every use of os in the module.
    nltk.download('cmudict')


Funnily enough, because the test environment doesn't have CMUdict installed, installing it counts as a test of this portion.

I'm leaving the FIXME because it's more of an unintended benefit than an actual test.

JoseALermaIII · 2019-10-19T21:59:23Z

.travis.yml

-cache: pip  # Don't delete pip install
+cache:
+  pip: true  # Don't delete pip install
+  directories:
+    - $HOME/nltk_data/corpora/


Caching the nltk corpora should relieve load on the nltk servers.

JoseALermaIII · 2019-10-25T21:05:34Z

src/ch08/p1_count_syllables.py

+            syllables += MISSING_WORDS[word]
+        else:
+            for phonemes in CMUDICT[word][0]:
+                for phoneme in phonemes:


These can be implemented as MISSING_WORDS.get(word) and CMUDICT.get(word)[0]; however, the get() method doesn't raise a KeyError if the key isn't present. Instead, it returns None by default, and a different default can be passed as the second argument.

Here's the thing: I'd be willing to refactor the tests to account for this change in behavior if not for one small detail. Setting values would still be done with MISSING_WORDS[word] = value. There isn't a set() dictionary method. Probably for good reason, but as a result, I don't want to mix syntax.
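For reference, a short illustration of the difference using a throwaway dictionary (`missing_words` and the word 'valhalla' are made up for this example):

```python
missing_words = {'yggdrasil': 3}

# Indexing raises KeyError for an absent key.
try:
    missing_words['valhalla']
except KeyError:
    print('KeyError')

# get() returns None by default, or the supplied fallback value.
print(missing_words.get('valhalla'))     # None
print(missing_words.get('valhalla', 0))  # 0

# Indexing the None that get() returns raises TypeError, not KeyError.
try:
    missing_words.get('valhalla')[0]
except TypeError:
    print('TypeError')

# Setting a value still uses indexing with a single equals sign.
missing_words['valhalla'] = 3
print(missing_words['valhalla'])         # 3
```

So switching to get() would also change which exception the tests need to expect for a word in neither dictionary.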

JoseALermaIII · 2019-10-25T21:41:58Z

src/ch08/p1_count_syllables.py

+    word_list = cleanup_dict(DICTIONARY_FILE_PATH)
+    sample_list = sample(word_list, 15)
+    for word in sample_list:
+        try:
+            syllables = count_syllables(format_words(word))
+        except KeyError:
+            # Skip words in neither dictionary.
+            print(f'Not found: {word}')
+            continue
+        print(f'{word} {syllables}')


After reviewing the book's solution, I agree that using sample() is a better way to randomly select words to use without duplicates and makes for better looping. I also agree that displaying missing words helps add them to either dictionary.
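A small illustration of random.sample() with a made-up word list. It selects without replacement, so each position in the list is used at most once:

```python
from random import sample

# Made-up stand-in for the cleaned-up dictionary word list.
word_list = ['apple', 'banana', 'cherry', 'date', 'elderberry', 'fig']

# Pick 3 distinct positions from the list; the result is a new list.
picked = sample(word_list, 3)
print(picked)
```

Note that sample() avoids repeated positions, not repeated values, so the word list itself needs to be free of duplicates for the output to be duplicate-free. It also raises ValueError if asked for more items than the list contains.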

JoseALermaIII · 2019-10-25T21:44:11Z

tests/data/ch08/main/count_syllables.txt

@@ -0,0 +1,15 @@
+Not found: yggdrasil


I'd love to see a phoneme list for yggdrasil. In the meantime, I can now add it to missing_words.json as "yggdrasil": 3, which is handy.

JoseALermaIII added 10 commits October 18, 2019 17:59

Add nltk 3.4.5

1fd6af7

Initial commit

87fb3b1

Add test_count_syllables to TestCountSyllables

7486627

Initial commit

3876c4f

Add test_main and setUpClass to TestCountSyllables

941fe04

Initial commit

4cc5d4e

Add src.ch08

a1edcc5

Fix spacing for skipsdist

3dc409a

Add nltk to intersphinx_mapping

56576bb

Add sphinx reference to nltk in count_syllables docstring

2a3ac2a

JoseALermaIII added the enhancement label Oct 19, 2019

JoseALermaIII self-assigned this Oct 19, 2019

JoseALermaIII added 3 commits October 19, 2019 13:02

Add another test to test_format_words

e65fac8

Add nltk.corpus.cmudict to install section

e0a9d50

Add nltk to pip install

321281e

JoseALermaIII marked this pull request as ready for review October 19, 2019 19:38

JoseALermaIII added 3 commits October 19, 2019 15:26

Revert "Add nltk to pip install"

8519175

This reverts commit 321281e

Revert "Add nltk.corpus.cmudict to install section"

ae8fde9

This reverts commit e0a9d50

Add cmudict download to module

34b6b61

Add path check to cmudict download

6f37d7c

Fix pylint line-too-long and locally disable fixme

bd92303

Add nltk corpora directory to cache

52c96a0

Add attributes section to module docstring

fc3663a

JoseALermaIII added 3 commits October 25, 2019 16:32

Refactor main to use random sample

d2c3f56

Refactor main to display words in neither dictionary

b99cc36

Refactor tests to reflect changes

4a73bd4

JoseALermaIII merged commit e38e7b8 into master Oct 25, 2019

JoseALermaIII deleted the count-syllables branch October 25, 2019 21:49
