This notebook holds the starting point for your Pattern Assignment. I've put the function stub at the top, and there are some potentially helpful examples further down.

In [1]:
import nltk
import numpy as np

from nltk.book import *
from nltk import FreqDist
from string import punctuation
from collections import Counter

from pprint import pprint # get some prettier printing of objects

from nltk.corpus import stopwords

sw = stopwords.words('english')

In [2]:
def get_patterns(text) :
    """
        This function takes text as an input and returns a dictionary of statistics,
        after cleaning the text. 
    
    """
    
    # We'll make things a bit clearer by initializing the
    # statistics here. These are placeholder values.
    
    total_tokens = 1
    unique_tokens = 0
    avg_token_len = 0.0
    lex_diversity = 0.0
    top_10 = Counter()
    
    # Do your tokenization and normalization here
    
    text = text.split()
    
    words = []
    
    for word in text:
        if word.isalpha():
            word = word.lower()
            if word in sw:
                continue
            else:
                words.append(word)
        else:
            continue
    
    # Calculate your statistics here
    
    total_tokens = len(words)
    Freq_tokens = FreqDist(words)
    unique_tokens = len(Freq_tokens)
    # Guard against empty input so the divisions below cannot fail;
    # the placeholder values of 0.0 are kept in that case.
    if total_tokens > 0:
        avg_token_len = sum(len(word) for word in words)/total_tokens
        lex_diversity = unique_tokens/total_tokens
    top_10 = Freq_tokens.most_common(10)
    
    # Now we'll fill out the dictionary. 
    results = {'tokens':total_tokens,
               'unique_tokens':unique_tokens,
               'avg_token_length':avg_token_len,
               'lexical_diversity':lex_diversity,
               'top_10_tokens':top_10}

    return(results)

In [3]:
# Use "with" so each file is closed as soon as it has been read
with open("beowulf.txt",'r') as f :
    beowulf = f.read()
with open("big.txt",'r') as f :
    big_word_file = f.read()

In [5]:
beo_results = get_patterns(beowulf)
big_results = get_patterns(big_word_file)
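
To see what each statistic means before running on the full files, here is a small sketch on a made-up sentence. It uses only `collections.Counter` and skips the stopword removal, so the numbers are illustrative and are not what `get_patterns` would return for the same text.

```python
from collections import Counter

# A made-up sentence, used only for illustration.
sample = "the cat saw the dog and the dog saw the bird"

# Keep alphabetic tokens and lowercase them (no stopword removal here).
tokens = [w.lower() for w in sample.split() if w.isalpha()]
counts = Counter(tokens)

total_tokens = len(tokens)                      # how many tokens in all
unique_tokens = len(counts)                     # how many distinct tokens
avg_token_len = sum(len(t) for t in tokens) / total_tokens
lex_diversity = unique_tokens / total_tokens    # distinct tokens per token

print(total_tokens, unique_tokens)
print(round(avg_token_len, 2), round(lex_diversity, 2))
print(counts.most_common(1))
```

Lexical diversity is just the share of tokens that are distinct, so it tends to fall as a text gets longer and words repeat.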

When you've finished the assignment, these next cells should run without raising any assertion errors. It's always possible that _I_ have a mistake in my version, so if you think you've done everything correctly and you're still getting errors, let me know. 

In [8]:
assert(beo_results['tokens']==9288)
assert(beo_results['unique_tokens']==2970)
assert(0.31 < beo_results['lexical_diversity'] < 0.32)
assert(5.3 < beo_results['avg_token_length'] < 5.4)
assert(len(beo_results['top_10_tokens'])==10)
assert(beo_results['top_10_tokens'][4]==('thou',47))
print("Passed all assertion tests.")

Passed all assertion tests.


In [None]:
assert(big_results['tokens']==410950)
assert(big_results['unique_tokens']==25173)
assert(0.06 < big_results['lexical_diversity'] < 0.065)
assert(6.3 < big_results['avg_token_length'] < 6.4)
assert(len(big_results['top_10_tokens'])==10)
assert(big_results['top_10_tokens'][0]==('said',2946))
assert(big_results['top_10_tokens'][8]==('upon',1088))
print("Passed all assertion tests.")

---

All of the important bits of the assignment are above here. Everything below is designed to help you wrap your mind around what I'm asking for. 


### Dictionaries

We'll talk about dictionary objects in class a bit, but think of them as a big bag of key-value pairs. The keys, which have to be unique, are like an address to the data that's being stored in the "value", which can be any kind of object. They're incredibly versatile and useful objects to learn about. Here's a [video](https://www.youtube.com/watch?v=daefaLgNkw0) that will help orient you to them. 

In the next couple of cells, I'll make a couple of dictionaries and show you some basic functionality with them.


In [None]:
animal_weights = {"mouse":0.03,"dog":20,"horse":700,"elephant":4100,"blue whale":100000}

animal_weights['dog']

In [None]:
for key in animal_weights :
    print(key)
    print(animal_weights[key])

In [None]:
# This will raise a KeyError, because 'cat' isn't a key in the dictionary yet
animal_weights['cat']
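
If you'd rather not have a missing key stop your code, there are a few standard ways to handle it. This is a minimal sketch using a shortened copy of the dictionary; the variable names `cat_weight`, `dog_weight`, and `missing_key` are just for illustration.

```python
animal_weights = {"mouse": 0.03, "dog": 20, "horse": 700}

# Option 1: test for the key with "in" before indexing.
if "cat" in animal_weights:
    print(animal_weights["cat"])
else:
    print("No weight recorded for cat.")

# Option 2: dict.get returns a default instead of raising.
# The default is None unless you supply one as the second argument.
cat_weight = animal_weights.get("cat")
dog_weight = animal_weights.get("dog", 0)

# Option 3: index directly and catch the KeyError.
try:
    animal_weights["cat"]
except KeyError as err:
    missing_key = err.args[0]
    print(f"Missing key: {missing_key}")
```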

In [None]:
animal_weights['cat'] = 3

animal_weights

In [None]:
for animal, wt in animal_weights.items() :
    print(f"An average {animal} weighs {wt} kilograms ({wt*2.2:0.2f} lbs).")
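
Dictionaries are also a natural container for word counts, which is what the top-10 statistic in the assignment relies on. Here's a minimal sketch, with a made-up word list, of counting by hand with a plain dictionary and then with `collections.Counter`, which is itself a kind of dictionary.

```python
from collections import Counter

# Made-up tokens, used only for illustration.
words = ["whale", "sea", "whale", "ship", "sea", "whale"]

# Counting by hand: start each new word at 0, then add 1.
manual_counts = {}
for word in words:
    manual_counts[word] = manual_counts.get(word, 0) + 1

# Counter does the same bookkeeping and adds helpers like most_common.
counter_counts = Counter(words)

print(manual_counts)
print(counter_counts.most_common(1))
```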

### Functions

Functions are a useful way to organize code and prevent yourself from making mistakes. There is a lot to learn about functions, but we'll just handle the basics here. Functions are declared with the `def` statement (for "define", I assume). The parentheses allow you to define the _parameters_ that are passed into the function. Here's a simple example that squares a number. 

In [None]:
def simple_square(x) :
    return(x*x)

In [None]:
simple_square(4)

In [None]:
simple_square(102)

In [None]:
# This will raise a TypeError: two strings can't be multiplied together
simple_square("cat")

We can add some error handling to make sure we only try to square a number.

In [None]:
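def square(x) :
    # Check the type directly: str(x).isnumeric() would reject floats
    # and negative numbers, and would accept the string "4".
    if not isinstance(x, (int, float)) :
        raise ValueError(f"Hey, {x} isn't a number!")
    else :
        return(x*x)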

In [None]:
square(2)

In [None]:
square("cat")
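
Once a function raises an exception for bad input, the caller can catch it with `try`/`except` instead of letting the cell stop. This is a minimal sketch: `checked_square` is a hypothetical name, and the `isinstance` check reflects my assumption that "a number" here means an int or a float.

```python
def checked_square(x):
    # Assumption: only ints and floats count as numbers here.
    if not isinstance(x, (int, float)):
        raise ValueError(f"Hey, {x} isn't a number!")
    return x * x

results = []
for value in [3, 2.5, "cat"]:
    try:
        results.append(checked_square(value))
    except ValueError as err:
        # Store the error message and keep going with the next value.
        results.append(str(err))

print(results)
```

The loop finishes all three values because the error for `"cat"` is handled rather than left to propagate.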

Here's an example of a function that returns a dictionary based on an NLTK text being sent in. (Note that those texts behave like lists of tokens, so you can index into them.)

In [None]:
def process_text(text) :
    # Take in a text, return the first token, the 100th token, and the last token.
    
    first_token = text[0]
    hundredth_token = text[99]
    last_token = text[-1]
    
    ret_dict = {'first':first_token,
                '100th':hundredth_token,
                'last':last_token}
    return(ret_dict)
    

In [None]:
print(process_text(text1))
print(process_text(text5))
print(process_text(text8))
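
Because `process_text` only uses indexing, any list with at least 100 tokens will work, not just the NLTK texts. Here's a small sketch with a made-up token list; the function is repeated in condensed form so the cell runs on its own, and `fake_text` and `too_short` are names I've made up for the example.

```python
def process_text(text):
    # Same idea as above: first, 100th, and last token.
    return {'first': text[0], '100th': text[99], 'last': text[-1]}

# 150 made-up tokens: "tok1", "tok2", ..., "tok150".
fake_text = [f"tok{i}" for i in range(1, 151)]
summary = process_text(fake_text)
print(summary)

# A text with fewer than 100 tokens has no index 99, so this raises IndexError.
too_short = False
try:
    process_text(["too", "short"])
except IndexError:
    too_short = True
print(too_short)
```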