This notebook holds the starting point for your Pattern Assignment. I'll put the function stub at the top, but I have some potentially helpful examples down below. 

In [12]:
import nltk
import numpy as np

from sklearn import preprocessing
from nltk.book import *
from nltk.tokenize import word_tokenize, sent_tokenize
from string import punctuation
from collections import Counter

from pprint import pprint # get some prettier printing of objects

from nltk.corpus import stopwords

sw = stopwords.words('english')

*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908


In [13]:
def get_patterns(text)  :
    """
        This function takes text as an input and returns a dictionary of statistics,
        after cleaning the text. 
    
    """
    
    # We'll make things a bit clearer by initializing the 
    # statistics here. These are placeholder values.
    total_tokens = 1
    unique_tokens = 0
    avg_token_len = 0.0
    lex_diversity = 0.0
    top_10 = Counter()
    
    # Do your tokenization and normalization here
    
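    # Note: these lines read the global `beowulf`, not the `text` argument,
    # so every call reports Beowulf's statistics regardless of the input.
    beo_clean = [w for w in beowulf.split()]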
    beo_clean = [w.lower() for w in beo_clean if w.isalpha() and w not in sw]
    #word_tokenize(beowulf)
    #sent_tokenize(beowulf)
    
    # Calculate your statistics here
    
    print(f"Beowulf is {len(beo_clean)} tokens long.")
    print(f"Beowulf has {len(set(beo_clean))} unique tokens.")

    print(f"Beowulf's lexical diversity is {len(set(beo_clean))/len(beo_clean):.3f}.")

    beo_token_len = [len(w) for w in beo_clean]

    print(f"Beowulf's average token length is {np.mean(beo_token_len):.2f}.")

    pprint(sorted(Counter(beo_token_len).items()))

    print("")
    print("All statistics are calculated after normalization and tokenization.")
    
    # Now we'll fill out the dictionary. 
    results = {'tokens':total_tokens,
               'unique_tokens':unique_tokens,
               'avg_token_length':avg_token_len,
               'lexical_diversity':lex_diversity,
               'top_10':top_10}

    return(results)

In [14]:
beowulf = open("beowulf.txt",'r').read()
big_word_file = open("big.txt",'r').read()

In [15]:
beo_results = get_patterns(beowulf)
big_results = get_patterns(big_word_file)

Beowulf is 10146 tokens long.
Beowulf has 3033 unique tokens.
Beowulf's lexical diversity is 0.299.
Beowulf's average token length is 5.14.
[(1, 205),
 (2, 251),
 (3, 1135),
 (4, 2719),
 (5, 2053),
 (6, 1540),
 (7, 1060),
 (8, 725),
 (9, 285),
 (10, 100),
 (11, 51),
 (12, 16),
 (13, 3),
 (14, 1),
 (15, 2)]

All statistics are calculated after normalization and tokenization.
Beowulf is 10146 tokens long.
Beowulf has 3033 unique tokens.
Beowulf's lexical diversity is 0.299.
Beowulf's average token length is 5.14.
[(1, 205),
 (2, 251),
 (3, 1135),
 (4, 2719),
 (5, 2053),
 (6, 1540),
 (7, 1060),
 (8, 725),
 (9, 285),
 (10, 100),
 (11, 51),
 (12, 16),
 (13, 3),
 (14, 1),
 (15, 2)]

All statistics are calculated after normalization and tokenization.


When you've finished the assignment, these next cells should run without raising any assertion errors. It's always possible that _I_ have a mistake in my version, so if you think you've done everything correctly and you're still getting errors, let me know. 

In [16]:
assert(beo_results['tokens']==9288)
assert(beo_results['unique_tokens']==2970)
assert(0.31 < beo_results['lexical_diversity'] < 0.32)
assert(5.3 < beo_results['avg_token_length'] < 5.4)
assert(len(beo_results['top_10'])==10)
assert(beo_results['top_10'][4]==('thou',47))
print("Passed all assertion tests.")

AssertionError: 

In [17]:
assert(big_results['tokens']==410950)
assert(big_results['unique_tokens']==25173)
assert(0.06 < big_results['lexical_diversity'] < 0.065)
assert(6.3 < big_results['avg_token_length'] < 6.4)
assert(len(big_results['top_10'])==10)
assert(big_results['top_10'][0]==('said',2946))
assert(big_results['top_10'][8]==('upon',1088))
print("Passed all assertion tests.")

AssertionError: 

---

All of the important bits of the assignment are above here. Everything below is designed to help you wrap your mind around what I'm asking for. 


### Dictionaries

We'll talk about dictionary objects in class a bit, but think of them as a big bag of key-value pairs. The keys, which have to be unique, are like an address to the data that's being stored in the "value", which can be any kind of object. They're incredibly versatile and useful objects to learn about. Here's a [video](https://www.youtube.com/watch?v=daefaLgNkw0) that will help orient you to them. 

In the next few cells, I'll make a dictionary and show you some basic functionality with it.


In [18]:
animal_weights = {"mouse":0.03,"dog":20,"horse":700,"elephant":4100,"blue whale":100000}

animal_weights['dog']

20

In [19]:
for key in animal_weights :
    print(key)
    print(animal_weights[key])

mouse
0.03
dog
20
horse
700
elephant
4100
blue whale
100000


In [20]:
# This will throw an error
animal_weights['cat']

KeyError: 'cat'
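
If you aren't sure whether a key exists, you can check first with `in`, or use `.get()`, which returns a default value instead of raising a `KeyError`. Here's a minimal sketch on a fresh copy of a few of the weights above (the name `weights` is just for illustration):

```python
# A small dictionary of weights in kilograms, copied from the example above.
weights = {"mouse": 0.03, "dog": 20, "horse": 700}

# `in` tests whether a key is present without raising an error.
has_cat = "cat" in weights

# .get() returns the value if the key exists, otherwise the default we supply.
cat_weight = weights.get("cat", None)
dog_weight = weights.get("dog", None)

print(has_cat)
print(cat_weight)
print(dog_weight)
```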

In [21]:
animal_weights['cat'] = 3

animal_weights

{'mouse': 0.03,
 'dog': 20,
 'horse': 700,
 'elephant': 4100,
 'blue whale': 100000,
 'cat': 3}

In [22]:
for animal, wt in animal_weights.items() :
    print(f"An average {animal} weighs {wt} kilograms ({wt*2.2:0.2f} lbs).")

An average mouse weighs 0.03 kilograms (0.07 lbs).
An average dog weighs 20 kilograms (44.00 lbs).
An average horse weighs 700 kilograms (1540.00 lbs).
An average elephant weighs 4100 kilograms (9020.00 lbs).
An average blue whale weighs 100000 kilograms (220000.00 lbs).
An average cat weighs 3 kilograms (6.60 lbs).
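
The assertion cells near the top compare `top_10` entries against `(token, count)` pairs such as `('thou',47)`. `Counter`, which is imported at the top of this notebook, is a dictionary subclass that counts items, and its `most_common(n)` method returns a list of that shape. Here's a small sketch on a made-up token list (the tokens are invented for illustration, not taken from either text):

```python
from collections import Counter

# A made-up list of tokens, just for illustration.
tokens = ["whale", "sea", "whale", "ship", "sea", "whale"]

counts = Counter(tokens)

# Counters support the usual dictionary lookups.
print(counts["whale"])

# most_common(n) returns a list of (item, count) tuples, largest count first.
top_2 = counts.most_common(2)
print(top_2)
```

Because the result is a list, `top_2[0]` is the most frequent pair and `top_2[1]` is the runner-up.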


### Functions

Functions are a useful way to organize code and prevent yourself from making mistakes. There is a lot to learn about functions, but we'll just handle the basics here. Functions are declared with the `def` statement (for "define", I assume). The parentheses allow you to define the _parameters_ that are passed into the function. Here's a simple example that squares a number. 

In [23]:
def simple_square(x) :
    return(x*x)

In [24]:
simple_square(4)

16

In [25]:
simple_square(102)

10404

In [26]:
simple_square("cat")

TypeError: can't multiply sequence by non-int of type 'str'

We can add some error handling to make sure we only try to square a number.

In [27]:
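def square(x) :
    # Check the type directly. str(x).isnumeric() would reject -3 and 2.5,
    # and it would let the string "4" through to the multiplication.
    if not isinstance(x, (int, float)) :
        raise ValueError(f"Hey, {x} isn't a number!")
    else :
        return(x*x)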

In [28]:
square(2)

4

In [29]:
square("cat")

ValueError: Hey, cat isn't a number!
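
When a function raises an error like this, the caller can catch it with `try`/`except` instead of letting the notebook stop. Here's a minimal sketch; `checked_square` and `safe_square` are hypothetical names used only here, and the `isinstance` check is just one possible way to test for a number:

```python
def checked_square(x) :
    # Raise a ValueError for anything that isn't an int or a float.
    if not isinstance(x, (int, float)) :
        raise ValueError(f"Hey, {x} isn't a number!")
    return(x*x)

def safe_square(x) :
    # Return the square, or None if checked_square rejects the input.
    try :
        return(checked_square(x))
    except ValueError as err :
        print(f"Couldn't square it: {err}")
        return(None)

print(safe_square(3))
print(safe_square("cat"))
```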

Here's an example of a function that returns a dictionary based on an NLTK text being sent in. (Note that those texts behave like lists of tokens, so you can index into them.)

In [30]:
def process_text(text) :
    # Take in a text, return the first token, the 100th token, and the last token.
    
    first_token = text[0]
    hundredth_token = text[99]
    last_token = text[-1]
    
    ret_dict = {'first':first_token,
                '100th':hundredth_token,
                'last':last_token}
    return(ret_dict)
    

In [31]:
print(process_text(text1))
print(process_text(text5))
print(process_text(text8))

{'first': '[', '100th': ',', 'last': '.'}
{'first': 'now', '100th': 'JOIN', 'last': '.'}
{'first': '25', '100th': 'groomed', 'last': '!'}
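
The same pattern works for summary statistics: do the calculations inside the function, then hand them back in a dictionary. Here's a rough sketch on a tiny made-up token list. The function `describe_tokens` and its keys are hypothetical (they aren't the keys the assignment asks for), and lexical diversity is taken here to be unique tokens divided by total tokens:

```python
def describe_tokens(tokens) :
    # Take in a list of tokens, return a few simple statistics about them.

    n_tokens = len(tokens)
    n_unique = len(set(tokens))

    ret_dict = {'n_tokens':n_tokens,
                'n_unique':n_unique,
                'diversity':n_unique/n_tokens,
                'avg_length':sum(len(t) for t in tokens)/n_tokens}
    return(ret_dict)

toy_stats = describe_tokens(["the", "cat", "saw", "the", "dog"])
print(toy_stats)
```

Note that this sketch divides by the number of tokens, so it assumes the list isn't empty.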
