Let's calculate some basic statistics for a body of text. We'll start by reading _Beowulf_ from a text file, since loading data from a file is a common first step when you're working with text.

In [None]:
import nltk
import numpy as np

from nltk.book import *
from string import punctuation
from collections import Counter

from pprint import pprint # get some prettier printing of objects

from nltk.corpus import stopwords

sw = stopwords.words('english')

In [None]:
with open("beowulf.txt", "r") as infile:  # the context manager closes the file
    beowulf = infile.read()

As always, we start with tokenization and normalization. Create a new variable, `beo_clean`, that holds the result of splitting the text on whitespace, casting to lowercase, removing any tokens where `isalpha` is false, and removing stopwords.

In [None]:
# your code here
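
For reference, here is a minimal sketch of the same four steps on a toy sentence rather than on `beowulf`. The names `sample_text`, `toy_stopwords`, and `sample_clean` are made up for this illustration, and the tiny stopword set is only a stand-in for the NLTK list stored in `sw`.

```python
# Toy input; in the notebook this would be the `beowulf` string.
sample_text = ("Lo! The Spear-Danes in days of old, and the kings "
               "who ruled them, had courage and greatness.")

# A tiny stand-in for NLTK's English stopword list (assumption for this sketch).
toy_stopwords = {"the", "in", "of", "and", "who", "them", "had"}

tokens = sample_text.split()                     # split on whitespace
tokens = [t.lower() for t in tokens]             # cast to lowercase
tokens = [t for t in tokens if t.isalpha()]      # keep purely alphabetic tokens
sample_clean = [t for t in tokens if t not in toy_stopwords]  # drop stopwords

print(sample_clean)
```

Note that `isalpha` discards any token with punctuation attached, such as `old,` or `spear-danes`, rather than stripping the punctuation and keeping the word.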

Let's rip through the basic statistics: 

1. Overall text length (number of tokens)
1. Number of unique tokens
1. Lexical diversity 
1. Average token length
1. Distribution of token lengths

In [None]:
# your code here
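
One way these five statistics might be computed is sketched below on a small made-up token list (`toy_tokens` is hypothetical, not the Beowulf data). Lexical diversity is taken here to be the number of unique tokens divided by the total number of tokens, which is the definition used in chapter 1 of the NLTK book.

```python
from collections import Counter

# Toy token list standing in for `beo_clean` (hypothetical data).
toy_tokens = ["hall", "king", "sea", "hall", "monster", "king", "hall", "gold"]

num_tokens = len(toy_tokens)                        # overall text length
num_unique = len(set(toy_tokens))                   # number of unique tokens
lexical_diversity = num_unique / num_tokens         # unique tokens / all tokens
avg_token_length = sum(len(t) for t in toy_tokens) / num_tokens
length_dist = Counter(len(t) for t in toy_tokens)   # token length -> count

print(num_tokens, num_unique, lexical_diversity, avg_token_length)
print(sorted(length_dist.items()))
```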

### Frequency Distributions

Frequency distributions are *incredibly* useful early in a text analysis. Let's build one for `beo_clean` and look at the 30 most common tokens.

In [None]:
beo_fd = FreqDist(beo_clean)

beo_fd.most_common(30)

How many tokens have a count of 1 and what fraction do they represent? These are also called "hapaxes".

In [None]:
# your code here
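
"What fraction" can be read two ways: hapaxes as a share of the unique tokens, or as a share of all tokens. The sketch below computes both on a made-up token list using a plain `Counter`. `FreqDist` is a `Counter` subclass, so the same approach should carry over to `beo_fd`, which also offers a `hapaxes()` method.

```python
from collections import Counter

# Toy token list standing in for `beo_clean` (hypothetical data).
toy_tokens = ["hall", "king", "sea", "hall", "monster", "king", "hall", "gold"]
toy_counts = Counter(toy_tokens)

# Tokens that occur exactly once.
toy_hapaxes = [tok for tok, n in toy_counts.items() if n == 1]

share_of_unique = len(toy_hapaxes) / len(toy_counts)   # hapaxes / unique tokens
share_of_all = len(toy_hapaxes) / len(toy_tokens)      # hapaxes / all tokens

print(toy_hapaxes, share_of_unique, share_of_all)
```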

Interesting, about half the words only appear once. 

Frequency distribution objects come with a plot method that can be quite useful. Plot the 25 most common tokens as a cumulative frequency plot (an example is at the end of section 3.1 of [chapter 1 of the NLTK book](https://www.nltk.org/book/ch01.html)).

In [None]:
beo_fd.plot(25, cumulative=True)

There are a number of useful methods available for the `FreqDist` object. They're summarized in [this table](https://www.nltk.org/book/ch01.html#tab-freqdist) in NLTK Chapter 1.


Not many words are longer than 12 characters. Print them all to the screen.

In [None]:
# your code here

### NLTK Functions

The NLTK library comes with a number of helpful functions. To take advantage of them, you'll need to wrap your list of tokens in an `nltk.Text` object.

In [None]:
beo_nltk = nltk.Text(beo_clean)

### Concordance

Concordances give you words in context. Let's look at usages of "hrothgar", the [Danish king](https://en.wikipedia.org/wiki/Hrothgar) featured in the poem.

In [None]:
beo_nltk.concordance("hrothgar")
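
To make the idea of "words in context" concrete, here is a rough keyword-in-context sketch in plain Python. It is not how `nltk.Text.concordance` is implemented. The function `kwic`, its `width` parameter, and the toy token list are all invented for this illustration.

```python
def kwic(tokens, target, width=3):
    """Return each occurrence of `target` with up to `width` tokens on each side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == target:
            left = " ".join(tokens[max(0, i - width):i])      # tokens before the hit
            right = " ".join(tokens[i + 1:i + 1 + width])     # tokens after the hit
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines

# Toy token list (hypothetical data, already lowercased and cleaned).
toy_tokens = ["hrothgar", "built", "great", "hall", "then", "grendel", "came",
              "hall", "night", "hrothgar", "grieved"]

for line in kwic(toy_tokens, "hrothgar", width=2):
    print(line)
```

Because `beo_clean` has already had stopwords and punctuation removed, the contexts shown for Beowulf are the surrounding content words, not the original running text.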

You'd need to know more about _Beowulf_ than I do to appreciate these results. Let's look at the example from the NLTK book, comparing the usage of "monstrous" between _Moby Dick_ and _Sense and Sensibility_. 

In [None]:
text1.concordance("monstrous")

In [None]:
text2.concordance("monstrous")

As you can see, Austen uses the word in _Sense and Sensibility_ in a different way from our modern usage. 

Related to concordances is the concept of similarity. The `similar` method lists other words that appear in the same contexts as the word you give it.

In [None]:
beo_nltk.similar("son")

In [None]:
text1.similar("whale")

In [None]:
text2.similar("love")

### Collocations

Collocations are words that tend to appear together in a text. There are a number of different methods of deciding which collocations are "most interesting". One option is simple frequency: the collocations that appear a lot are the most important. Another option is to look at how often the words occur together versus how often they appear alone, using a calculation called [Pointwise Mutual Information](https://en.wikipedia.org/wiki/Pointwise_mutual_information) (PMI). Either way, we'll need to bring in some technology from the NLTK library to make these work.

If you're using the PMI measure, it can be quite useful to require some minimum frequency, because PMI tends to give its highest scores to pairs of rare words.
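
As a rough illustration of the PMI idea, the sketch below scores bigrams by hand as log2( P(x, y) / (P(x) P(y)) ) on a made-up token list. It is a simplified version written for this example and may differ in detail from NLTK's `BigramAssocMeasures.pmi`. The helper `pmi` and the names `toy_tokens`, `word_counts`, and `bigram_counts` are assumptions.

```python
import math
from collections import Counter

# Toy token list (hypothetical data): "mead hall" repeats, "whale road" occurs once.
toy_tokens = ["mead", "hall", "great", "king", "mead", "hall", "great",
              "whale", "road", "great"]

word_counts = Counter(toy_tokens)
bigram_counts = Counter(zip(toy_tokens, toy_tokens[1:]))   # adjacent word pairs
n_words = len(toy_tokens)
n_bigrams = n_words - 1

def pmi(w1, w2):
    """PMI of an observed bigram, using simple relative frequencies as probabilities."""
    p_xy = bigram_counts[(w1, w2)] / n_bigrams
    p_x = word_counts[w1] / n_words
    p_y = word_counts[w2] / n_words
    return math.log2(p_xy / (p_x * p_y))

# The one-off pair of rare words outscores the pair that actually repeats.
print(pmi("whale", "road"), pmi("mead", "hall"))

# Keeping only bigrams seen at least twice before ranking by PMI.
frequent = [bg for bg, n in bigram_counts.items() if n >= 2]
best_frequent = max(frequent, key=lambda bg: pmi(*bg))
print(best_frequent)
```

Here the single occurrence of "whale road" gets a higher PMI than the repeated "mead hall", which is the behavior a minimum-frequency filter is meant to dampen.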

In [None]:
from nltk.collocations import *
bigram_measures = nltk.collocations.BigramAssocMeasures()

In [None]:
beo_coll = BigramCollocationFinder.from_words(beo_clean)

In [None]:
beo_coll.nbest(bigram_measures.raw_freq,10)

In [None]:
beo_coll.apply_freq_filter(3)
beo_coll.nbest(bigram_measures.pmi,10)

In [None]:
# You could also do the frequency collocation with bigrams and FreqDist
beo_bigram_fd = FreqDist(nltk.bigrams(beo_clean))
beo_bigram_fd.most_common(10)

### Words Corpus

NLTK includes a corpus of English words, which will occasionally be useful for us.

In [None]:
nltk_words = nltk.corpus.words.words()
len(nltk_words)
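
A membership test against a list scans it element by element, so if you plan to do many lookups it may be worth converting the list to a set once. The sketch below uses a small made-up list (`toy_words`) in place of `nltk_words`.

```python
# Hypothetical stand-in for the list returned by nltk.corpus.words.words().
toy_words = ["glad", "happy", "merry", "sad"]

# One-time conversion; lookups against a set use hashing instead of a linear scan.
toy_word_set = set(toy_words)

print("happy" in toy_word_set)
print("hppy" in toy_word_set)
```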

In [None]:
"happy" in nltk_words

In [None]:
"hppy" in nltk_words