# NLTK Chapter 4

## Writing Structured Programs

*The HTML version of this chapter in the NLTK book is available [here](https://www.nltk.org/book/ch04.html#exercises "Ch04 Exercises").*

### 4.11   Exercises

In [100]:
import nltk, re, pprint



##### 1. 

☼ Find out more about sequence objects using Python's help facility. In the interpreter, type `help(str)`, `help(list)`, and `help(tuple)`. This will give you a full list of the functions supported by each type. Some functions have special names flanked with underscore; as the help documentation shows, each such function corresponds to something more familiar. For example `x.__getitem__(y)` is just a long-winded way of saying `x[y]`.

In [1]:
help(str)

Help on class str in module builtins:

class str(object)
 |  str(object='') -> str
 |  str(bytes_or_buffer[, encoding[, errors]]) -> str
 |  
 |  Create a new string object from the given object. If encoding or
 |  errors is specified, then the object must expose a data buffer
 |  that will be decoded using the given encoding and error handler.
 |  Otherwise, returns the result of object.__str__() (if defined)
 |  or repr(object).
 |  encoding defaults to sys.getdefaultencoding().
 |  errors defaults to 'strict'.
 |  
 |  Methods defined here:
 |  
 |  __add__(self, value, /)
 |      Return self+value.
 |  
 |  __contains__(self, key, /)
 |      Return key in self.
 |  
 |  __eq__(self, value, /)
 |      Return self==value.
 |  
 |  __format__(self, format_spec, /)
 |      Return a formatted version of the string as described by format_spec.
 |  
 |  __ge__(self, value, /)
 |      Return self>=value.
 |  
 |  __getattribute__(self, name, /)
 |      Return getattr(self, name).
 |  
 |  

In [2]:
help(list)

Help on class list in module builtins:

class list(object)
 |  list(iterable=(), /)
 |  
 |  Built-in mutable sequence.
 |  
 |  If no argument is given, the constructor creates a new empty list.
 |  The argument must be an iterable if specified.
 |  
 |  Methods defined here:
 |  
 |  __add__(self, value, /)
 |      Return self+value.
 |  
 |  __contains__(self, key, /)
 |      Return key in self.
 |  
 |  __delitem__(self, key, /)
 |      Delete self[key].
 |  
 |  __eq__(self, value, /)
 |      Return self==value.
 |  
 |  __ge__(self, value, /)
 |      Return self>=value.
 |  
 |  __getattribute__(self, name, /)
 |      Return getattr(self, name).
 |  
 |  __getitem__(...)
 |      x.__getitem__(y) <==> x[y]
 |  
 |  __gt__(self, value, /)
 |      Return self>value.
 |  
 |  __iadd__(self, value, /)
 |      Implement self+=value.
 |  
 |  __imul__(self, value, /)
 |      Implement self*=value.
 |  
 |  __init__(self, /, *args, **kwargs)
 |      Initialize self.  See help(type(self))

In [3]:
help(tuple)

Help on class tuple in module builtins:

class tuple(object)
 |  tuple(iterable=(), /)
 |  
 |  Built-in immutable sequence.
 |  
 |  If no argument is given, the constructor returns an empty tuple.
 |  If iterable is specified the tuple is initialized from iterable's items.
 |  
 |  If the argument is a tuple, the return value is the same object.
 |  
 |  Methods defined here:
 |  
 |  __add__(self, value, /)
 |      Return self+value.
 |  
 |  __contains__(self, key, /)
 |      Return key in self.
 |  
 |  __eq__(self, value, /)
 |      Return self==value.
 |  
 |  __ge__(self, value, /)
 |      Return self>=value.
 |  
 |  __getattribute__(self, name, /)
 |      Return getattr(self, name).
 |  
 |  __getitem__(self, key, /)
 |      Return self[key].
 |  
 |  __getnewargs__(self, /)
 |  
 |  __gt__(self, value, /)
 |      Return self>value.
 |  
 |  __hash__(self, /)
 |      Return hash(self).
 |  
 |  __iter__(self, /)
 |      Implement iter(self).
 |  
 |  __le__(self, value, /)
 |
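
*The correspondence between special methods and everyday syntax can be checked directly. The snippet below is only a small sketch with made-up variable names; it compares a few special-method calls with the operators they stand for.*

```python
# Special ("dunder") methods and the syntax they correspond to.
words = ['is', 'NLP', 'fun', '?']

pairs = [
    # indexing: x[y] calls x.__getitem__(y)
    (words.__getitem__(1), words[1]),
    # membership: y in x calls x.__contains__(y)
    (words.__contains__('fun'), 'fun' in words),
    # length: len(x) calls x.__len__()
    (words.__len__(), len(words)),
    # concatenation: x + y calls x.__add__(y), for tuples and strings alike
    ((1, 2).__add__((3,)), (1, 2) + (3,)),
    ('Mon'.__add__('ty'), 'Mon' + 'ty'),
]

# every long-winded call gives the same result as the short form
all_equal = all(long_form == short_form for long_form, short_form in pairs)
print(all_equal)
```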

##### 2. 

☼ Identify three operations that can be performed on both tuples and lists. Identify three list operations that cannot be performed on tuples. Name a context where using a list instead of a tuple generates a Python error.

*Operations that can be performed on both tuples and lists:*

In [7]:
print([x for x in dir(list) if x in dir(tuple)], end = '')

['__add__', '__class__', '__contains__', '__delattr__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__len__', '__lt__', '__mul__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__rmul__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', 'count', 'index']

*Operations that can be performed on lists, but not tuples:*

In [13]:
print([x for x in dir(list) if x not in dir(tuple)], end = '')

['__delitem__', '__iadd__', '__imul__', '__reversed__', '__setitem__', 'append', 'clear', 'copy', 'extend', 'insert', 'pop', 'remove', 'reverse', 'sort']

*Trying to use a list as a key in a dictionary will not work, but it's possible with a tuple:*

In [16]:
a = ("Snugglebunnies",)  # the trailing comma makes this a tuple, not a string
b = ["Basselopes"]

c = {a: "N"}

*Saved as markdown because having a cell that throws an error will cause my notebook to go higgledy-piggledy:*

```
c = {b: "N"}

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-17-ee5632d64820> in <module>
----> 1 c = {b: "N"}

TypeError: unhashable type: 'list'
```

*This is because tuples are hashable, but lists are not:*

In [18]:
hash(a)

6857707573350195552


```
hash(b)

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-19-ad85d8b55702> in <module>
----> 1 hash(b)

TypeError: unhashable type: 'list'
        
```
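
*One caveat worth a quick sketch: a tuple is only hashable if everything inside it is hashable too. The names below are invented for illustration, and the error is caught so the cell runs cleanly.*

```python
# A tuple can be a dictionary key only if all of its items are hashable.
key_ok = ("Snugglebunnies", 1)
key_bad = ("Basselopes", [1, 2])   # this tuple contains a list

lookup = {key_ok: "N"}             # fine: every item is hashable

try:
    lookup[key_bad] = "N"
    failed = False
    message = ""
except TypeError as err:
    failed = True
    message = str(err)

print(failed, message)
```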

##### 3. 

☼ Find out how to create a tuple consisting of a single item. There are at least two ways to do this.

*Add a comma after a single value, or create a list with a single value, and use `tuple` to convert this.*

In [20]:
x = 1, 

type(x)

tuple

In [26]:
x = tuple([1])
type(x)

tuple
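
*It's the comma, not the parentheses, that makes the tuple. A short sketch (variable names are mine) contrasting the forms:*

```python
not_a_tuple = (1)         # just the integer 1 wrapped in parentheses
with_comma = (1,)         # the trailing comma creates the tuple
from_list = tuple([1])    # converting a one-item list
from_string = tuple("a")  # any iterable works; a string yields its characters

kinds = [type(v).__name__
         for v in (not_a_tuple, with_comma, from_list, from_string)]
print(kinds)
```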

##### 4. 

☼ Create a list `words = ['is', 'NLP', 'fun', '?']`. Use a series of assignment statements (e.g. `words[1] = words[2]`) and a temporary variable `tmp` to transform this list into the list `['NLP', 'is', 'fun', '!']`. Now do the same transformation using tuple assignment.

In [27]:
words = ['is', 'NLP', 'fun', '?']
tmp = words[0]
words[0] = words[1]
words[1] = tmp
words[3] = '!'
words

['NLP', 'is', 'fun', '!']

In [28]:
words = ['is', 'NLP', 'fun', '?']
words[1], words[0], words[3] = words[0], words[1], "!"
words

['NLP', 'is', 'fun', '!']

##### 5.

☼ Read about the built-in comparison function `cmp`, by typing `help(cmp)`. How does it differ in behavior from the comparison operators?

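*`cmp` was removed in Python 3, so there is nothing for `help` to look up.  Get with the times, authors!  In Python 2, `cmp(x, y)` returned a negative number, zero, or a positive number depending on whether `x` was less than, equal to, or greater than `y`, whereas the comparison operators return `True` or `False`.*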

```
help(cmp)

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-29-6cc5f65683db> in <module>
----> 1 help(cmp)

NameError: name 'cmp' is not defined
```
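
*For anyone who still wants to experiment, a three-way comparison can be rebuilt from the ordinary operators. This is a stand-in I'm defining myself, not a built-in, and it relies on `True` and `False` behaving as 1 and 0 in arithmetic.*

```python
def cmp(a, b):
    """Return -1, 0 or 1, mimicking what Python 2's cmp reported."""
    # (a > b) and (a < b) are Booleans; subtracting them gives -1, 0 or 1
    return (a > b) - (a < b)

results = [cmp(1, 2), cmp(2, 2), cmp(3, 2), cmp('Monty', 'Python')]
print(results)
```

*So the difference in behavior is that one call answers "less, equal, or greater?" all at once, while each comparison operator answers a single yes/no question.*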

##### 6. 

☼ Does the method for creating a sliding window of n-grams behave correctly for the two limiting cases: $n$ = 1, and $n$ = `len(sent)`?

*Yes.  $n$ = 1 will just return __unigrams__, i.e., the individual words which comprise the sentence, each in its own list.  $n$ = `len(sent)` will just return a single $n$-gram containing the entire sentence:*

In [30]:
sent = ['The', 'dog', 'gave', 'John', 'the', 'newspaper']
n = 1
[sent[i:i+n] for i in range(len(sent)-n+1)]

[['The'], ['dog'], ['gave'], ['John'], ['the'], ['newspaper']]

In [31]:
sent = ['The', 'dog', 'gave', 'John', 'the', 'newspaper']
n = len(sent)
[sent[i:i+n] for i in range(len(sent)-n+1)]

[['The', 'dog', 'gave', 'John', 'the', 'newspaper']]
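
*Wrapping the comprehension in a function makes it easy to probe the limits, including one case just past them. The function name `ngrams` here is my own choice for this sketch.*

```python
def ngrams(tokens, n):
    """Return all contiguous n-token slices of tokens, as lists."""
    return [tokens[i:i + n] for i in range(len(tokens) - n + 1)]

sent = ['The', 'dog', 'gave', 'John', 'the', 'newspaper']

unigrams = ngrams(sent, 1)              # one single-word list per token
whole = ngrams(sent, len(sent))         # exactly one n-gram: the sentence
too_long = ngrams(sent, len(sent) + 1)  # range() is empty, so no n-grams

print(len(unigrams), len(whole), len(too_long))
```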

##### 7.

☼ We pointed out that when empty strings and empty lists occur in the condition part of an `if` clause, they evaluate to `False`. In this case, they are said to be occurring in a Boolean context. Experiment with different kind of non-Boolean expressions in Boolean contexts, and see whether they evaluate as `True` or `False`.

*With the exception of 0, all numbers evaluate as True.  Even `float('Inf')`, `-float('Inf')`, and `float('NaN')`. All strings evaluate as True, except for the empty string. Empty lists and tuples will evaluate as False.  `None` evaluates as false, but `None` in a list or a tuple evaluates as True.*

In [44]:
cands = [0, 1, -1, float('Inf'), -float('Inf'), float('NaN'), "0", "1", "", [], 
         "Fahrvergnügen", 3.1415, None, [None], tuple([]), tuple([None])]

for c in cands:
    if c:
        print("{} evaluates as True".format(c))
    else:
        print("{} evaluates as False".format(c))

0 evaluates as False
1 evaluates as True
-1 evaluates as True
inf evaluates as True
-inf evaluates as True
nan evaluates as True
0 evaluates as True
1 evaluates as True
 evaluates as False
[] evaluates as False
Fahrvergnügen evaluates as True
3.1415 evaluates as True
None evaluates as False
[None] evaluates as True
() evaluates as False
(None,) evaluates as True


##### 8. 

☼ Use the inequality operators to compare strings, e.g. `'Monty' < 'Python'`. What happens when you do `'Z' < 'a'`? Try pairs of strings which have a common prefix, e.g. `'Monty' < 'Montague'`. Read up on "lexicographical sort" in order to understand what is going on here. Try comparing structured objects, e.g. `('Monty', 1) < ('Monty', 2)`. Does this behave as expected?

*In lexicographical sort, the items are compared position by position, and the first position at which they differ decides the result.  Since 'M' comes before 'P', the rest of the string is ignored.*

In [45]:
'Monty' < 'Python'

True

In [53]:
'M' < 'Python'

True

In [54]:
'Monty' < 'P'

True

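*Characters are compared by Unicode code point, and all uppercase ASCII letters have lower code points than the lowercase ones, so uppercase letters are considered as coming 'before' lowercase ones.*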

In [46]:
'Z' < 'a'

True

*'Monty' and 'Montague' have the same first four elements.  Since 'y' comes after 'a', the comparison evaluates as `False`.*

In [47]:
'Monty' < 'Montague'

False

*As the first elements in both tuples are identical, the second element is compared, and 1 is indeed less than 2.*

In [51]:
('Monty', 1) < ('Monty', 2)

True

*Carrying on that logic a bit further:*

In [52]:
('Monty', 1, 1, 1, 5) < ('Monty', 1, 1, 1, 4)

False
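
*To make the rule concrete, here is a hand-rolled version of the comparison. It's only a sketch of the idea (the helper `lex_less` is hypothetical, not how Python implements `<` internally), but it agrees with the built-in operator on the examples above.*

```python
def lex_less(a, b):
    """Return True if sequence a sorts before sequence b."""
    for x, y in zip(a, b):
        if x != y:
            # the first differing position decides the result
            return x < y
    # no difference found: the shorter sequence sorts first
    return len(a) < len(b)

examples = [
    ('Monty', 'Python'),
    ('Z', 'a'),
    ('Monty', 'Montague'),
    ('Monty', 'Mont'),
    (('Monty', 1), ('Monty', 2)),
]

checks = [lex_less(a, b) for a, b in examples]
builtin = [a < b for a, b in examples]
code_points = (ord('Z'), ord('a'))   # why 'Z' < 'a'

print(checks, code_points)
```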

##### 9.

☼ Write code that removes whitespace at the beginning and end of a string, and normalizes whitespace between words to be a single space character.

* 1. do this task using `split()` and `join()`
* 2. do this task using regular expression substitutions

In [58]:
sent = "    this    is   a really inconsistent          use  of     whitespace.      "
sent = sent.split()
' '.join(sent)

'this is a really inconsistent use of whitespace.'

In [68]:
sent = "    this    is   a really inconsistent          use  of     whitespace.      "
sent = re.sub(r'^\s*|\s*$', '', sent)
sent = re.sub(r'\s+', ' ', sent)
sent

'this is a really inconsistent use of whitespace.'
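
*As a sanity check, the two approaches can be wrapped in functions and compared on a string that also contains tabs and newlines. The function names and the sample string are made up for this sketch.*

```python
import re

def normalize_split(s):
    # split() with no argument splits on any run of whitespace
    return ' '.join(s.split())

def normalize_regex(s):
    # first trim both ends, then collapse inner runs to one space
    trimmed = re.sub(r'^\s+|\s+$', '', s)
    return re.sub(r'\s+', ' ', trimmed)

sample = "  \tthis    is\n a   test.   "
via_split = normalize_split(sample)
via_regex = normalize_regex(sample)

print(repr(via_split), via_split == via_regex)
```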

##### 10.

☼ Write a program to sort words by length. Define a helper function `cmp_len` which uses the `cmp` comparison function on word lengths.

*`cmp` was removed in Python 3, so I'll only do the first part of this exercise.*

In [77]:
def sort_words_by_length(text):
    """
    Returns a list, sorted from shortest to longest,
    of words in text.
    """
    
    return [w for _, w in sorted([(len(w), w) for w in text.split()])]

In [79]:
sent = 'the words in this sentence are mostly of unique character lengths'

print(sort_words_by_length(sent), end = '')

['in', 'of', 'are', 'the', 'this', 'words', 'mostly', 'unique', 'lengths', 'sentence', 'character']
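
*For what it's worth, the second part can be approximated in Python 3 with `functools.cmp_to_key`, which turns an old-style comparison function into a sort key. The `cmp_len` below is my own reconstruction, built from the ordinary comparison operators rather than the missing `cmp`.*

```python
from functools import cmp_to_key

def cmp_len(word1, word2):
    """Three-way comparison of two words by length: -1, 0 or 1."""
    return (len(word1) > len(word2)) - (len(word1) < len(word2))

sent = 'the words in this sentence are mostly of unique character lengths'

by_cmp = sorted(sent.split(), key=cmp_to_key(cmp_len))
by_key = sorted(sent.split(), key=len)   # the usual Python 3 spelling

print(by_cmp)
```

*Unlike the tuple-based sort above, both of these leave words of equal length in their original order rather than alphabetizing them, because Python's sort is stable.*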

##### 11.

◑ Create a list of words and store it in a variable `sent1`. Now assign `sent2 = sent1`. Modify one of the items in `sent1` and verify that `sent2` has changed.

* a. Now try the same exercise but instead assign `sent2 = sent1[:]`. Modify sent1 again and see what happens to `sent2`. Explain.
* b. Now define `text1` to be a list of lists of strings (e.g. to represent a text consisting of multiple sentences). Now assign `text2 = text1[:]`, assign a new value to one of the words, e.g. `text1[1][1] = 'Monty'`. Check what this did to `text2`. Explain.
* c. Load Python's `deepcopy()` function (i.e. `from copy import deepcopy`), consult its documentation, and test that it makes a fresh copy of any object.

In [94]:
sent = "Mairzy doats and dozy doats and liddle lamzy divey " \
        "A kiddley divey too, wouldn't you?"
sent1 = sent.split()
sent2 = sent1
sent1[0] = "Mares"
print(sent2, end = '')

['Mares', 'doats', 'and', 'dozy', 'doats', 'and', 'liddle', 'lamzy', 'divey', 'A', 'kiddley', 'divey', 'too,', "wouldn't", 'you?']

*__a.__ `sent1[:]` creates a shallow copy: a new list object holding the same items.  `sent2` now refers to that new list rather than to the original, so assigning to `sent1[0]` doesn't affect `sent2`.*

In [95]:
sent = "Mairzy doats and dozy doats and liddle lamzy divey " \
        "A kiddley divey too, wouldn't you?"
sent1 = sent.split()
sent2 = sent1[:]
sent1[0] = "Mares"
print(sent2, end = '')

['Mairzy', 'doats', 'and', 'dozy', 'doats', 'and', 'liddle', 'lamzy', 'divey', 'A', 'kiddley', 'divey', 'too,', "wouldn't", 'you?']

*__b.__ Now the change is propagated.  A shallow copy of a list of lists creates a new outer list, but its items are references to the same inner lists.  I.e., if one of the inner lists is modified through `text1`, the change is visible through `text2` as well.*

In [88]:
sentA = "Mairzy doats and dozy doats and liddle lamzy divey"
sentB = "A kiddley divey too, wouldn't you?"

text1 = [sentA.split(), sentB.split()]
text2 = text1[:]

text1[1][1] = 'Monty'

print(text2, end = '')

[['Mairzy', 'doats', 'and', 'dozy', 'doats', 'and', 'liddle', 'lamzy', 'divey'], ['A', 'Monty', 'divey', 'too,', "wouldn't", 'you?']]

*If we explicitly make shallow copies of the lists in the list of lists, then the changes won't be replicated:*

In [91]:
sentA = "Mairzy doats and dozy doats and liddle lamzy divey"
sentB = "A kiddley divey too, wouldn't you?"

text1 = [sentA.split(), sentB.split()]
text2 = [text1[0][:], text1[1][:]]

text1[1][1] = 'Monty'

print(text2, end = '')

[['Mairzy', 'doats', 'and', 'dozy', 'doats', 'and', 'liddle', 'lamzy', 'divey'], ['A', 'kiddley', 'divey', 'too,', "wouldn't", 'you?']]

*__c.__*

In [96]:
from copy import deepcopy

sent = "Mairzy doats and dozy doats and liddle lamzy divey " \
        "A kiddley divey too, wouldn't you?"
sent1 = sent.split()
sent2 = deepcopy(sent1)
sent1[0] = "Mares"
print(sent2, end = '')

['Mairzy', 'doats', 'and', 'dozy', 'doats', 'and', 'liddle', 'lamzy', 'divey', 'A', 'kiddley', 'divey', 'too,', "wouldn't", 'you?']
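
*A flat list doesn't really put `deepcopy` to the test, since a shallow copy behaves the same way there. A nested list does; this sketch uses a shortened, made-up `text1` to compare the two kinds of copy side by side.*

```python
from copy import deepcopy

text1 = [['Mairzy', 'doats'], ['A', 'kiddley', 'divey']]
shallow = text1[:]        # new outer list, same inner lists
deep = deepcopy(text1)    # new outer list and new inner lists

text1[1][1] = 'Monty'

shallow_sees_change = shallow[1][1] == 'Monty'   # inner lists are shared
deep_sees_change = deep[1][1] == 'Monty'         # inner lists were copied
shares_inner = shallow[1] is text1[1]

print(shallow_sees_change, deep_sees_change, shares_inner)
```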

##### 12. 

◑ Initialize an $n$-by-$m$ list of lists of empty strings using list multiplication, e.g. `word_table = [[''] * n] * m`. What happens when you set one of its values, e.g. `word_table[1][2] = "hello"`? Explain why this happens. Now write an expression using `range()` to construct a list of lists, and show that it does not have this problem.

In [99]:
m, n = 6, 7

word_table = [[''] * n] * m
pprint.pprint(word_table)

[['', '', '', '', '', '', ''],
 ['', '', '', '', '', '', ''],
 ['', '', '', '', '', '', ''],
 ['', '', '', '', '', '', ''],
 ['', '', '', '', '', '', ''],
 ['', '', '', '', '', '', '']]


In [106]:
word_table[1][2] = ("hello")
pprint.pprint(word_table)

[['', '', 'hello', '', '', '', ''],
 ['', '', 'hello', '', '', '', ''],
 ['', '', 'hello', '', '', '', ''],
 ['', '', 'hello', '', '', '', ''],
 ['', '', 'hello', '', '', '', ''],
 ['', '', 'hello', '', '', '', '']]


*`[[''] * n] * m` builds one inner list and then repeats a reference to it $m$ times, so all $m$ rows are the same list object.  A change made through one row therefore shows up in every row.*

In [109]:
word_table = [['' for i in range(n)] for j in range(m)]
pprint.pprint(word_table)

[['', '', '', '', '', '', ''],
 ['', '', '', '', '', '', ''],
 ['', '', '', '', '', '', ''],
 ['', '', '', '', '', '', ''],
 ['', '', '', '', '', '', ''],
 ['', '', '', '', '', '', '']]


In [110]:
word_table[1][2] = ("hello")
pprint.pprint(word_table)

[['', '', '', '', '', '', ''],
 ['', '', 'hello', '', '', '', ''],
 ['', '', '', '', '', '', ''],
 ['', '', '', '', '', '', ''],
 ['', '', '', '', '', '', ''],
 ['', '', '', '', '', '', '']]


*In this case, the inner comprehension runs once for each of the $m$ iterations of the outer one, building a brand new list of $n$ empty strings each time.  As the rows are separate list objects, a change to one is not reflected in the others.*
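
*Counting distinct object identities is a quick way to see the difference between the two constructions. This is a sketch with smaller dimensions and invented variable names.*

```python
m, n = 3, 4

aliased = [[''] * n] * m                     # m references to one row
independent = [[''] * n for _ in range(m)]   # a new row per iteration

# how many different row objects does each table really contain?
aliased_rows = len({id(row) for row in aliased})
independent_rows = len({id(row) for row in independent})

aliased[1][2] = 'hello'
independent[1][2] = 'hello'

print(aliased_rows, independent_rows)
```

*Using `[''] * n` for the inner row is harmless here because strings are immutable; the trouble only starts when the repeated item is itself a mutable object such as a list.*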

##### 13. 

◑ Write code to initialize a two-dimensional array of sets called `word_vowels` and process a list of words, adding each word to `word_vowels[l][v]` where `l` is the length of the word and `v` is the number of vowels it contains.

*If we read the instructions as asking for an array of two sets (one of word lengths, one of vowel counts), the link between a word's length and its vowel count is lost: sets are unordered and drop duplicates, so the $n$th item displayed in the first set need not belong to the same word as the $n$th item in the second.  I.e., a long word without many vowels could be represented towards the end of the first set, but towards the beginning of the second (and vice versa):*

In [126]:
def find_length_word_number_vowels(text):
    """
    Returns an array of two sets.  The first set comprises
    the lengths of the words.  The second the number of vowels.
    For the sake of simplicity, 'y' is not counted as a vowel.
    Sets are unordered and contain no duplicate values.
    """
    
    word_vowels = [set(), set()]

    for t in text:
        word_vowels[0].add(len(t))
        word_vowels[1].add(sum([1 for i in t if i.lower() in 'aeiou']))

    return word_vowels

In [136]:
test = ["supercalifragilisticexpialidocious", "eye", "Hawai'i", "draft"]
find_length_word_number_vowels(test)

[{3, 5, 7, 34}, {1, 2, 4, 16}]

*Using lists instead of sets will obviate this problem:*

In [331]:
def find_length_word_number_vowels_by_order_of_appearance(text):
    """
    Returns an array of two lists.  The first list comprises
    the lengths of the words.  The second the number of vowels.
    For the sake of simplicity, 'y' is not counted as a vowel.
    Results are ordered by order of appearance of the original
    word.
    """
    
    word_vowels = [[], []]

    for t in text:
        word_vowels[0].append(len(t))
        
        # add 1 for every character that is a vowel
        word_vowels[1].append(sum([1 for i in t if i.lower() in 'aeiou']))

    return word_vowels

In [137]:
find_length_word_number_vowels_by_order_of_appearance(test)

[[34, 3, 7, 5], [16, 2, 4, 1]]
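
*Another reading of the exercise is that `word_vowels` should be a table of sets indexed by length and vowel count, so that `word_vowels[l][v]` holds the words themselves. The sketch below takes that reading; the function name is hypothetical, it assumes a non-empty word list, and it sizes the table from the longest word (a word can't have more vowels than letters).*

```python
def build_word_vowels(words):
    """Return a table where table[l][v] is the set of words with
    length l and v vowels ('y' is not counted as a vowel)."""
    max_len = max(len(w) for w in words)
    # rows 0..max_len and columns 0..max_len; each cell gets its own set
    table = [[set() for _ in range(max_len + 1)]
             for _ in range(max_len + 1)]
    for w in words:
        v = sum(1 for ch in w.lower() if ch in 'aeiou')
        table[len(w)][v].add(w)
    return table

test = ["supercalifragilisticexpialidocious", "eye", "Hawai'i",
        "draft", "craft"]
word_vowels = build_word_vowels(test)

print(word_vowels[3][2], word_vowels[5][1])
```

*The table is built with nested comprehensions rather than list multiplication, for the same reason as in exercise 12: every cell needs to be a separate set.*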

##### 14. 

◑ Write a function `novel10(text)` that prints any word that appeared in the last 10% of a text that had not been encountered earlier.

In [216]:
def novel10(text):
    """
    Returns a set of words that appear for the first time
    in the last 10% of a text.
    """

    split = int(len(text) * .9)
    first90, last10 = text[:split], text[split:]
    novel90 = set(first90)

    return set([i for i in last10 if i not in novel90])

In [212]:
from urllib import request
from nltk import word_tokenize

# Mary Shelley's Frankenstein
url = 'http://www.gutenberg.org/cache/epub/42324/pg42324.txt'

frank = request.urlopen(url).read().decode('utf8')


In [213]:
frank = word_tokenize(frank)

# used trial and error (and `index`) to find and remove
# header and footer
frank = frank[115:90732]

*Converted results to a list so that I could use slicing to get the first 20 items:*

In [217]:
print(list(novel10(frank))[:20], end = '')

['torch', 'later', 'wasting', '6375', 'lessening', 'Black', '2d', 'piercing', 'renounce', 'mole-hills', 'indecision', 'Cold', 'asseverations', 'irremediable', 'mutilated', 'repast', 'abortion', 'unquenched', 'rustling', 'boils']

##### 15.

◑ Write a program that takes a sentence expressed as a single string, splits it and counts up the words. Get it to print out each word and the word's frequency, one per line, in alphabetical order.

*Although it's not best practice stylistically, the instructions say to print everything, so I'm going to do that instead of returning a value:*

In [275]:
import nltk
from nltk import word_tokenize

def print_words_and_frequency(text):
    """
    Counts the words in a text and prints out the
    resulting table in alphabetical order.
    """

    # tokenizer separates words from punctuation
    tokens = word_tokenize(text)
    
    # remove punctuation
    words = [t.lower() for t in tokens if t.isalpha()]
    
    # get word counts from FreqDist
    ordered = sorted(nltk.FreqDist(words).items())
    
    # get widths for pretty printing
    # width of longest word
    width = max([len(w) for w, _ in ordered]) + 2
    # width of longest number
    width_counts = max([len(str(v)) for _, v in ordered])
    
    # print everything
    for w, v in ordered:
        print("{}:{}{:>{}}".format(w, ' ' * (width - len(w)), 
                                     v, width_counts))

In [272]:
test = "If police police police police, who polices the police police? " \
       "Police police police police police police."

print_words_and_frequency(test)

if:        1
police:   12
polices:   1
the:       1
who:       1


In [273]:
test = "How much wood would a woodchuck chuck if a woodchuck could chuck wood?"
print_words_and_frequency(test)

a:          2
chuck:      2
could:      1
how:        1
if:         1
much:       1
wood:       2
woodchuck:  2
would:      1
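
*A more literal take on the exercise, which only asks for the sentence to be split, needs nothing beyond the standard library. This sketch is an alternative I'm adding for comparison; because it uses `str.split`, punctuation stays attached to the words.*

```python
from collections import Counter

def word_frequency_lines(sentence):
    """Return 'word count' lines in alphabetical order."""
    counts = Counter(sentence.lower().split())
    return ["{} {}".format(word, counts[word]) for word in sorted(counts)]

lines = word_frequency_lines("the cat saw the other cat")
print("\n".join(lines))
```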


##### 16.

◑ Read up on Gematria, a method for assigning numbers to words, and for mapping between words having the same number to discover the hidden meaning of texts (cf. [here](http://en.wikipedia.org/wiki/Gematria "Wikipedia entry"), or [here](http://essenes.net/gemcal.htm "Gematria")).

* a. Write a function `gematria()` that sums the numerical values of the letters of a word, according to the letter values in `letter_vals`:

``` 	
>>> letter_vals = {'a':1, 'b':2, 'c':3, 'd':4, 'e':5, 'f':80, 'g':3, 'h':8,
... 'i':10, 'j':10, 'k':20, 'l':30, 'm':40, 'n':50, 'o':70, 'p':80, 'q':100,
... 'r':200, 's':300, 't':400, 'u':6, 'v':6, 'w':800, 'x':60, 'y':10, 'z':7}
```

* b. Process a corpus (e.g. `nltk.corpus.state_union`) and for each document, count how many of its words have the number 666.

* c. Write a function `decode()` to process a text, randomly replacing words with their Gematria equivalents, in order to discover the "hidden meaning" of the text.

In [305]:
letter_vals = {'a':1, 'b':2, 'c':3, 'd':4, 'e':5, 'f':80, 'g':3, 'h':8,
       'i':10, 'j':10, 'k':20, 'l':30, 'm':40, 'n':50, 'o':70, 'p':80, 'q':100,
       'r':200, 's':300, 't':400, 'u':6, 'v':6, 'w':800, 'x':60, 'y':10, 'z':7}

def gematria(word, vals = letter_vals):
    """
    Returns Gematria value of a word.
    
    Arguments:
    
    word: Word to be converted.
    vals: Dictionary of values for each letter.
          Default is `letter_vals`.
    """
    
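    # note: the membership test is case-sensitive, so uppercase letters
    # are skipped rather than counted (e.g. 'This' scores 318, not 718)
    return sum(vals[w.lower()] for w in word if w in vals)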
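
*The function above only counts letters that appear in `vals` exactly as written, so capital letters contribute nothing. If capitals should count as well, a case-insensitive variant could look like the sketch below; `gematria_ci` is a hypothetical name, and the results elsewhere in this notebook were not produced with it.*

```python
letter_vals = {'a':1, 'b':2, 'c':3, 'd':4, 'e':5, 'f':80, 'g':3, 'h':8,
       'i':10, 'j':10, 'k':20, 'l':30, 'm':40, 'n':50, 'o':70, 'p':80, 'q':100,
       'r':200, 's':300, 't':400, 'u':6, 'v':6, 'w':800, 'x':60, 'y':10, 'z':7}

def gematria_ci(word, vals=letter_vals):
    """Sum the letter values, treating upper and lower case alike."""
    # lowercase the whole word first, then skip anything without a value
    return sum(vals[ch] for ch in word.lower() if ch in vals)

values = (gematria_ci("This"), gematria_ci("this"), gematria_ci("test"))
print(values)
```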

*__b.__*

In [329]:
su = nltk.corpus.state_union

# for pretty printing later
width = max([len(u[5:-4]) for u in su.fileids()])

devil_words = [sum([1 for w in su.words(u) if gematria(w) == 666]) for u in su.fileids()]

In [330]:
for u, w in zip(su.fileids(), devil_words):
    print(u[:4], u[5:-4] + ":" +
          " " * (width - len(u[5:-4])),  "{:>2}".format(w))

1945 Truman:      2
1946 Truman:     20
1947 Truman:      1
1948 Truman:      3
1949 Truman:      2
1950 Truman:      1
1951 Truman:      0
1953 Eisenhower:  1
1954 Eisenhower:  6
1955 Eisenhower:  5
1956 Eisenhower:  1
1957 Eisenhower:  2
1958 Eisenhower:  5
1959 Eisenhower:  1
1960 Eisenhower:  5
1961 Kennedy:     0
1962 Kennedy:     4
1963 Johnson:     0
1963 Kennedy:     3
1964 Johnson:     1
1965 Johnson-1:   0
1965 Johnson-2:   0
1966 Johnson:     0
1967 Johnson:     2
1968 Johnson:     3
1969 Johnson:     0
1970 Nixon:       0
1971 Nixon:       1
1972 Nixon:       0
1973 Nixon:       1
1974 Nixon:       0
1975 Ford:        0
1976 Ford:        3
1977 Ford:        0
1978 Carter:      1
1979 Carter:      2
1980 Carter:      0
1981 Reagan:      4
1982 Reagan:      0
1983 Reagan:      2
1984 Reagan:      1
1985 Reagan:      1
1986 Reagan:      1
1987 Reagan:      1
1988 Reagan:      2
1989 Bush:        2
1990 Bush:        2
1991 Bush-1:      0
1991 Bush-2:      0
1992 Bush:        3


*__c.__ Since strings are immutable, we first need to convert the `text`, whatever format it is in, to a list. From there, we have several ways to choose words at random.  I originally chose $n$% of the index values at random and replaced those words.  But this was very slow with large texts, as the interpreter would have to traverse the entire list of words to find the word to be replaced.  I settled on the method below, which is much quicker: The interpreter only goes through the list of words once, and draws a random number between 0.0 and 1.0 for each word.  If the random number is below a certain threshold, the word is replaced.*

*With this method, the percentage of words actually replaced will usually differ somewhat from the percentage we specify, since the chances of one word being replaced are independent of all other words (i.e., as long as the threshold is not 0% or 100%, there's a chance that none of the words will be replaced, and also a chance that all of the words will be replaced).  Also, tokens made up only of characters that the `gematria` function doesn't count, such as punctuation marks, digits, and words in all capitals, have a value of 0, so when one of them is selected it shows up as `0` in the output.*

*I'm also reusing `join_punctuation`, a function from [Chapter 2](https://github.com/Sturzgefahr/Natural-Language-Processing-with-Python-Analyzing-Text-with-the-Natural-Language-Toolkit/tree/master/Chapter%2002 "Chapter 2 Exercises"), to reattach punctuation that was separated during tokenizing.*

In [484]:
import random, nltk
from nltk import word_tokenize

def join_punctuation(text, characters = ["'", '’', ')', ',', '.', ':', ';', '?', '!', ']', "''"]): 
    """
    Takes a list of strings and attaches punctuation to
    the preceding string in the list.
    """
    
    text = iter(text)
    current = next(text)

    for nxt in text:
        if nxt in characters:
            current += nxt
        else:
            yield current
            current = nxt
            

    yield current

def decode(text, n = 10):
    """
    Returns a copy of the original text with n percent
    of the words converted into its gematria form.
    
    Arguments:
    
    text: Text can be any form. Will be converted into a list.
    n:    Percentage of the words to be converted.  Should be
          an integer between 0 and 100.   
    """
    
    
    # convert a cp of text into a list
    if isinstance(text, str):
        # use tokenize to separate punctuation from words
        cp = word_tokenize(text)
    else:
        # lists, tuples and corpus views are all copied into a new list
        cp = list(text)

    # go through the words in the list,
    # and draw a random number for each;
    # if the random number is less than the percentage
    # replace the word with its gematria value
    for i in range(len(cp)):
        if random.random() < n/100:
            cp[i] = str(gematria(cp[i]))
        
    
    # using join punctuation to rejoin punctuation separated during
    # tokenizing
    return ' '.join(join_punctuation(cp))

In [487]:
decode("This is just a test to see if my code will work", 25)

'318 is just a 1105 to 310 90 my code 870 work'
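
*If an exact number of replacements matters more than simplicity, one option is to sample the positions up front with `random.sample` and keep them in a set, which avoids searching the list for each word. This is a rough sketch, not the method used in `decode`: the helper names are invented, `str.upper` merely stands in for the Gematria conversion, and `seed` is there only to make runs repeatable.*

```python
import random

def pick_indices(num_tokens, percent, seed=None):
    """Choose exactly round(num_tokens * percent / 100) distinct positions."""
    rng = random.Random(seed)
    k = round(num_tokens * percent / 100)
    return set(rng.sample(range(num_tokens), k))

def replace_exact(tokens, percent, transform, seed=None):
    """Return a copy of tokens with transform applied at the chosen positions."""
    chosen = pick_indices(len(tokens), percent, seed)
    return [transform(t) if i in chosen else t
            for i, t in enumerate(tokens)]

tokens = "This is just a test to see if my code will work".split()
result = replace_exact(tokens, 25, str.upper, seed=0)

# 25% of 12 tokens: exactly three positions are transformed
changed = sum(1 for old, new in zip(tokens, result) if old != new)
print(changed, result)
```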

In [488]:
decode(su.words('2006-GWBush.txt'), 33)[:1000]

"0 GEORGE W 0 0' S 0 BEFORE A JOINT SESSION OF THE CONGRESS 0 0 STATE OF 0 UNION January 0 0 0 THE PRESIDENT: Thank you all. Mr. 311, Vice President Cheney, members 150 Congress, members 150 the Supreme 676 55 diplomatic corps, distinguished guests 0 55 1015 785 0 Today our nation lost a beloved, graceful, courageous woman 878 73 America to 710 founding ideals and 423 on a 157 250 0 Tonight we are comforted 12 the 163 of a 38 391 1218 the husband who 1101 476 so long ago 0 and we are 725 for the 147 life 150 Coretta 873 63. ( 502 .) President George W 0 Bush reacts to applause during his State of 413 Union Address at the 591, Tuesday, Jan. 0, 0. White 381 photo by 213 707 time I' 40 485 to this 1216, I' 40 humbled 12 413 privilege 0 55 mindful of the history we 0 11 360 1091. 5 have 626 under 718 591 dome in moments of national mourning and national 533. We 20 520 America through one of the 810 consequential periods of our history -- and it 309 been 50 honor 470 serve 1218 86. In a sys

In [489]:
decode(nltk.corpus.gutenberg.words('austen-emma.txt'), 30)[:1000]

"0 81 by Jane Austen 0 0 VOLUME I CHAPTER I Emma 533, handsome, clever 0 55 rich, with 1 comfortable home and happy disposition 0 seemed 470 471 some of the 707 blessings of existence; and 13 lived nearly twenty - one 516 in the world 1218 very little to 1519 270 71 her 0 She 1101 the youngest of the two daughters of a 810 affectionate, 558 father; and 13, in consequence 150 her sister' s marriage, 62 mistress of his 389 from 1 221 early period. Her 723 had 23 540 153 74 for 213 470 have more 459 an indistinct remembrance of her caresses; and 213 place had been supplied 12 an excellent woman as 939 0 878 had 196 little 978 of a 723 in affection 0 530 years had Miss Taylor been 60 Mr. Woodhouse' s 171, less as 1 939 459 a friend, very 204 150 480 daughters, but particularly of Emma. Between _them_ 410 was 315 the 524 of sisters 0 Even 362 Miss Taylor 13 ceased to hold the nominal office of 939, the mildness of 213 temper 13 hardly allowed her to impose 61 restraint 0 and 413 1183 150 11

##### 17. 

◑ Write a function `shorten(text, n)` to process a text, omitting the $n$ most frequently occurring words of the text. How readable is it?

*The texts are usually quite readable if we delete the most common words, as most of these common words will be stop words.  However, in news articles and novels some of the most common words will be the names of the principals, and without those it's impossible to know who's doing what.*

In [479]:
import nltk, re
from nltk import word_tokenize

def join_punctuation(text, characters = ["'", '’', ')', ',', '.', ':', ';', '?', '!', ']', "''"]): 
    """
    Takes a list of strings and attaches punctuation to
    the preceding string in the list.
    """
    
    text = iter(text)
    current = next(text)

    for nxt in text:
        if nxt in characters:
            current += nxt
        else:
            yield current
            current = nxt
            

    yield current


def shorten(text, n):
    """
    Returns a copy of the text, as a string, with the n most
    frequently occurring words removed.
    """

    # convert a cp of text into a list
    if isinstance(text, str):
        # use tokenize to separate punctuation from words
        cp = word_tokenize(text)
    else:
        # lists, tuples and corpus views are all copied into a new list
        cp = list(text)

    # get the set of most common words, and strip away the counts
    most_common = {w for w, _ in nltk.FreqDist(w for w in cp if w.isalpha()).most_common(n)}

    # replace most common words
    for i in range(len(cp)):
        if cp[i] in most_common:
            cp[i] = ''
    
    # join list and normalize whitespace - 
    # otherwise there'll be gaps for the missing words
    # also use join_punctuation to reattach punctuation 
    # separate during tokenizing
    return re.sub(r'\s+', ' ', ' '.join(join_punctuation(cp)))
        
        

In [480]:
shorten("This is a test which is a rather short one", 1)

'This a test which a rather short one'

In [491]:
shorten(su.words('2006-GWBush.txt'), 50)[:1000]

"PRESIDENT GEORGE W. BUSH' S ADDRESS BEFORE A JOINT SESSION OF THE CONGRESS ON THE STATE OF THE UNION January 31, 2006 THE PRESIDENT: Thank . Mr. Speaker, Vice President Cheney, members Congress, members Supreme Court diplomatic corps, distinguished guests, fellow citizens: Today lost beloved, graceful, courageous woman called its founding ideals carried noble dream. Tonight comforted hope glad reunion husband was taken so long ago, grateful good life Coretta Scott King. ( .) President George W. Bush reacts applause during his State Union Address Capitol, Tuesday, Jan. 31, 2006. White House photo Eric DraperEvery time ' m invited rostrum, ' m humbled privilege, mindful history ' ve seen together. gathered under Capitol dome moments national mourning national achievement. served through one most consequential periods history -- been my honor serve . system two parties, two chambers, two elected branches, there always differences debate. But even tough debates conducted civil tone, diffe

In [492]:
shorten(nltk.corpus.gutenberg.words('austen-emma.txt'), 50)[:1000]

"[ Jane Austen 1816] VOLUME CHAPTER Woodhouse, handsome, clever, rich, comfortable home happy disposition, seemed unite some best blessings existence; lived nearly twenty - one years world little distress or vex . youngest two daughters most affectionate, indulgent father; , consequence sister' marriage, mistress house early period. Her mother died too long ago more than an indistinct remembrance caresses; place supplied an excellent woman governess, who fallen little short mother affection. Sixteen years Taylor . Woodhouse' family, less governess than friend, fond both daughters, particularly . Between _them_ more intimacy sisters. Even before Taylor ceased hold nominal office governess, mildness temper hardly allowed impose restraint; shadow authority being now long passed away, they living together friend friend mutually attached, doing just what liked; highly esteeming Taylor' judgment, directed chiefly own. The real evils, indeed, ' situation power having rather too much own way, 

*If we delete an additional 50 common words, we'll lose mentions of "Woodhouse", one of the principal characters.*

In [493]:
shorten(nltk.corpus.gutenberg.words('austen-emma.txt'), 100)[:1000]

"[ Austen 1816] VOLUME CHAPTER , handsome, clever, rich, comfortable home happy disposition, seemed unite some best blessings existence; lived nearly twenty - years world distress vex . youngest two daughters most affectionate, indulgent father; , consequence sister' marriage, mistress house early period. Her mother died too long ago indistinct remembrance caresses; place supplied excellent woman governess, fallen short mother affection. Sixteen years Taylor . ' family, less governess friend, fond both daughters, particularly . Between _them_ intimacy sisters. Even before Taylor ceased hold nominal office governess, mildness temper hardly allowed impose restraint; shadow authority long passed away, living together friend friend mutually attached, doing just liked; highly esteeming Taylor' judgment, directed chiefly . real evils, indeed, ' situation power having rather too way, disposition too ; these disadvantages threatened alloy many enjoyments. danger, however, present unperceived, 

##### 18.

◑ Write code to print out an index for a lexicon, allowing someone to look up words according to their meanings (or pronunciations; whatever properties are contained in lexical entries).

*Don't really understand what I'm supposed to do.*

*Maybe I need to sort the dictionary by keys?*



In [515]:
entries = nltk.corpus.cmudict.entries()

d = {}

for w, v in entries:
    key = ''.join(v)
    d.setdefault(key, []).append(w)
    
indexed = [[i + 1, (k, v)] for i, (k, v) in enumerate(d.items())]

In [519]:
# `dict.iteritems()` only exists in Python 2; Python 3 uses `items()`
index = sorted(d.items())
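
*My best guess at what the exercise wants is an inverted lookup: group the words by one of their properties, then print the groups in sorted order. Here is a minimal sketch of that idea using a tiny hand-made lexicon as a stand-in for `nltk.corpus.cmudict.entries()`; the entries, function names, and output layout are all my own assumptions.*

```python
from collections import defaultdict

# hypothetical mini-lexicon of (word, pronunciation) entries
entries = [
    ('two', ['T', 'UW1']),
    ('too', ['T', 'UW1']),
    ('to', ['T', 'UW1']),
    ('fire', ['F', 'AY1', 'ER0']),
    ('fire', ['F', 'AY1', 'R']),
    ('sea', ['S', 'IY1']),
    ('see', ['S', 'IY1']),
]

def build_index(entries):
    """Map each pronunciation (as a string) to the words that have it."""
    index = defaultdict(list)
    for word, pron in entries:
        # join with spaces so the individual phones stay readable
        index[' '.join(pron)].append(word)
    return index

def format_index(index):
    """Return one 'pronunciation: words' line per key, sorted by key."""
    return ["{}: {}".format(key, ', '.join(sorted(words)))
            for key, words in sorted(index.items())]

index_lines = format_index(build_index(entries))
print('\n'.join(index_lines))
```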