# Think Python

## Chapter 13 - Case study: data structure selection

### 13.1 Word frequency analysis

*HTML of this chapter in "Think Python 2e" can be found [here](http://greenteapress.com/thinkpython2/html/thinkpython2014.html "Chapter 13").*

#### Exercise 1  

*Write a program that reads a file, breaks each line into words, strips whitespace and punctuation from the words, and converts them to lowercase.*

*Hint: The `string` module provides a string named whitespace, which contains space, tab, newline, etc., and `punctuation` which contains the punctuation characters. Let’s see if we can make Python swear:*

```
>>> import string
>>> string.punctuation
'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
```

*Also, you might consider using the string methods `strip`, `replace` and `translate`.*


In [98]:
import string

punct = string.punctuation

# Some PG texts use non-ASCII characters, so we'll add these manually.
punct += '‘’“”—'

# Some PG texts also have line numbers, so let's remove all numbers:
punct += string.digits

out = " " * len(punct)

def get_words_from_file(text, encode = "utf8"):
    """
    Returns a list of words from a file.  Words
    are stripped of punctuation and converted to lowercase.
    Since punctuation is removed, contractions are returned
    without apostrophes (e.g., `can't` -> `cant`). Numbers
    are also removed from the returned list
    
    Arguments:
    
    text: name of file
    encode: text encoding used in file.  Default is UTF-8
    """
    opened_text = open(text, 'r', encoding = encode)
    t = []
    for line in opened_text:
        translation = line.maketrans(punct, out)
        for word in line.translate(translation).split():
            t.append(word.strip().lower())
            
    return t

In [39]:
import os

path = "C:\\Users\\mjcor\\Desktop\\ProgrammingStuff\\ThinkPython"
os.chdir(path)

In [88]:
ma = get_words_from_file('alice.txt')
ma[1:25]

['through',
 'the',
 'looking',
 'glass',
 'by',
 'lewis',
 'carroll',
 'chapter',
 'i',
 'looking',
 'glass',
 'house',
 'one',
 'thing',
 'was',
 'certain',
 'that',
 'the',
 'white',
 'kitten',
 'had',
 'had',
 'nothing',
 'to']

#### Exercise 2  

*Go to Project Gutenberg (http://gutenberg.org) and download your favorite out-of-copyright book in plain text format.*

*Modify your program from the previous exercise to read the book you downloaded, skip over the header information at the beginning of the file, and process the rest of the words as before.*

*Then modify the program to count the total number of words in the book, and the number of times each word is used.*

*Print the number of different words used in the book. Compare different books by different authors, written in different eras. Which author uses the most extensive vocabulary?*

*__I had difficulty removing the boilerplate at the beginning of Project Gutenberg texts, as I felt that the book didn't do an adequate job in showing us how to remove headers and footers.  So I went online to see how other people dealt with this problem.  At [this blog](https://epequeno.wordpress.com/2012/05/06/exercise-13-2/ "Removing headers from PG texts") I found some interesting code for removing the header of PG texts; but the creator of the code neglected to deal with the footer, so I made a few changes to deal with that:__*

In [81]:
import string

punct = string.punctuation

# Some PG texts use non-ASCII quotes, so we'll add these manually.
punct += '‘’“”'

# Some PG texts also have line numbers, so let's remove all numbers:
punct += string.digits

out = " " * len(punct)

def clean_pg_text(text, encode = "utf8"):
    """
    Returns a list of words from a Project Gutenberg text.  
    Headers and footers are removed from texts. Words
    are stripped of punctuation and converted to lowercase.
    Since punctuation is removed, contractions are returned
    without apostrophes (e.g., `can't` -> `cant`). Numbers
    are also removed from the returned list
    
    Arguments:
    
    text: name of file
    encode: text encoding used in file.  Default is UTF-8
    """
    """
    Removes headers and footers from Project Gutenberg texts.
    """
    opened_text = open(text, 'r', encoding = encode)
    cleaned_text = []
    flag = False
    start = "*** START OF"
    end = "*** END OF"

    # some PG texts don't use spaces to designate start/end of text
    alt_start = "***START OF"
    alt_end = "***END OF"
    
    for line in opened_text:
        if (start in line) or (alt_start in line) and (flag == False):
            flag = True
        elif (end in line) or (alt_end in line) and (flag == True):
            flag = False
        elif flag == True:
            for line in opened_text:
                translation = line.maketrans(punct, out)
                for word in line.translate(translation).split():
                    cleaned_text.append(word.strip().lower())
        else:
            pass
    return cleaned_text

In [68]:
def tally_words(text_list):
    """
    Returns a tally of the words in text_list.
    """
    total_words = {}

    for word in text_list:
        total_words[word] = 1 + total_words.get(word, 0)

    return total_words

In [82]:
texts = ['austen.txt', 'beowulf.txt', 'canterbury.txt', 
         'hamlet.txt', 'iliad.txt', 'sherlock.txt', 'ulysses.txt']

titles = ["'Pride and Prejudice' by Jane Austen",
          "'Beowulf' translated by Lesslie Hall",
          "'The Canterbury Tales' by Geoffrey Chaucer",
          "'Hamlet' by William Shakespeare",
          "'The Iliad of Homer', translated by Alexander Pope",
          "'The Adventures of Sherlock Holmes' by Arthur Conan Doyle",
          "'Ulysses' by James Joyce"]

In [83]:
for text, title in zip(texts, titles):
    clean = clean_pg_text(text)
    tally = tally_words(clean)
    print("{} uses {:,} words".format(title, len(tally)))

'Pride and Prejudice' by Jane Austen uses 6,525 words
'Beowulf' translated by Lesslie Hall uses 5,935 words
'The Canterbury Tales' by Geoffrey Chaucer uses 15,494 words
'Hamlet' by William Shakespeare uses 5,021 words
'The Iliad of Homer', translated by Alexander Pope uses 12,639 words
'The Adventures of Sherlock Holmes' by Arthur Conan Doyle uses 8,238 words
'Ulysses' by James Joyce uses 29,720 words


#### Exercise 3  

*Modify the program from the previous exercise to print the 20 most frequently used words in the book.*

In [84]:
def n_most_common_words(text, n):
    """
    Returns a list of the n most common words in a 
    Project Gutenberg text.
    """
    
    clean = clean_pg_text(text)
    tally = tally_words(clean)
    
    sorted_words = []
    
    for (y, z) in reversed(sorted(zip(tally.values(), tally.keys()))):
        sorted_words.append([z, y])
        
    return sorted_words[:n]

In [86]:
n_most_common_words('ulysses.txt', 20)

[['the', 15040],
 ['of', 8253],
 ['and', 7217],
 ['a', 6536],
 ['to', 5032],
 ['in', 4995],
 ['he', 4174],
 ['his', 3326],
 ['s', 2837],
 ['i', 2828],
 ['that', 2730],
 ['with', 2556],
 ['it', 2491],
 ['was', 2127],
 ['on', 2125],
 ['you', 2029],
 ['for', 1952],
 ['her', 1784],
 ['him', 1522],
 ['is', 1432]]

#### Exercise 4  

*Modify the previous program to read a word list (see Section 9.1) and then print all the words in the book that are not in the word list. How many of them are typos? How many of them are common words that should be in the word list, and how many of them are really obscure?*

In [100]:
# modified from code in ex. 12.4

def make_word_dict(text):
    """
    Reads lines from text and 
    returns a dictionary.
    """
    d = {}
    for line in text:   
        d[line.strip().lower()] = None
    to_add = ["i", "a"]
    for ta in to_add:
        d[ta] = None
    return d

def find_words_not_in_dict(text, t):
    """
    Takes a Project Gutenberg text and a
    wordlist and returns a list of words
    in the PG text that cannot be found
    in the list.
    
    Arguments:
    text: a raw Project Gutenberg text file
    t: a raw text word list with one word per line
    """
    check_dict = make_word_dict(t)
    clean = clean_pg_text(text)
    tally = tally_words(clean)
    not_in_dict = []
    for word in tally.keys():
        if word not in check_dict:
            not_in_dict.append(word)

    return not_in_dict

In [104]:
fin = open('words.txt')
austen_obscura = find_words_not_in_dict('austen.txt', fin)
austen_obscura[:35]

['austen',
 'neighbourhood',
 'mr',
 'netherfield',
 'mrs',
 'england',
 'monday',
 'michaelmas',
 'bingley',
 'william',
 'lucas',
 'lizzy',
 'lydia',
 'elizabeth',
 't',
 's',
 'neices',
 'mary',
 'm',
 'neighbour',
 'favourable',
 'delightful',
 'etc',
 'hertfordshire',
 'london',
 'gentlemanlike',
 'hurst',
 'darcy',
 'derbyshire',
 'unreserved',
 'behaviour',
 'fastidious',
 'catherine',
 'longbourn',
 'inhabitants']

*__Most of the words not in the word list are proper names (including single letters that were most likely initials in the original texts) and obscure (or anachronistic) words.  However, in some texts the results were words with punctuation that had not been cleaned by the function `clean_pg_text`.  I therefore went back to the function and added those characters to the list of ones that should be replaced.__*