We need to install the `nltk` book corpus. If you haven't done it before, here are the steps:

1. Open a console or command window.
1. Type `python` and hit enter to start the Python interpreter.
1. Type `import nltk` and hit enter.
1. Type `nltk.download()` and hit enter.
1. This will open a little window. 
1. Click "All Packages" at the top of the list. 
1. Click "Download"

In [None]:
import nltk
from nltk.book import *
from collections import Counter

## Normalization

Questions:

1. Find emojis in the chat corpus.

1. Determine a normalization scheme. (What needs to be normalized, how would you do it?)

1. Count the happy vs sad emojis.

In [None]:
chat = text5 # give it a nice name. 

# Let's find emojis in chat. 
potential_emojis = {w for w in chat if ":" in w or ";" in w or "=" in w}

In [None]:
potential_emojis

Clearly we're catching some non-emojis, but let's assume we're getting most of the list. 
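
One way to cut down on the false positives is to match each token against an explicit emoticon pattern instead of testing for single characters. The sketch below is only an illustration: the pattern `EMOTICON`, the helper `normalize`, and the sample tokens are assumptions made for this example, not part of NLTK or the chat corpus, and the pattern covers only a handful of Western-style emoticons.

```python
import re

# Hypothetical pattern: eyes (: ; =), an optional nose (-), then a mouth.
EMOTICON = re.compile(r"[:;=]-?[()DPp]")

def normalize(token):
    """Drop the optional nose so ':-)' and ':)' count as the same emoticon."""
    return token.replace("-", "")

# Made-up sample tokens standing in for the chat corpus.
sample = [":)", ":-)", "10:30", "=(", "a=b", ";-)", "http://x", ":D"]

# Keep only tokens that are an emoticon from start to end.
emoticons = [normalize(t) for t in sample if EMOTICON.fullmatch(t)]
print(emoticons)
```

Using `fullmatch` rather than `search` is what rejects tokens such as `10:30`, where the colon is only part of a longer string.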

In [None]:
# Count happy vs sad
happy = [w for w in chat if w in {":-)",":)",":D",";-)","=)"}]
sad = [w for w in chat if w in {":-(",":(",";-(","=("}]

print(len(happy))
print(len(sad))

Now let's do some normalization of a plain-text corpus. The repo includes a file, "beowulf.txt". Read it in, split it into words, and count how many times each word occurs.
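
Splitting on whitespace alone leaves case and punctuation attached, so "The", "the" and "the," are counted separately. A minimal sketch of one possible normalization follows. The helper `normalize_word` and the sample lines are made up for illustration and are not taken from the file.

```python
from collections import Counter
import string

# Made-up sample lines standing in for the text file.
sample_lines = ["The king spoke. The hall was silent,", "and the KING sat down."]

def normalize_word(word):
    """Lowercase a token and strip punctuation from both ends."""
    return word.strip(string.punctuation).lower()

words = []
for line in sample_lines:
    for raw in line.split():
        cleaned = normalize_word(raw)
        if cleaned:  # skip tokens that were only punctuation
            words.append(cleaned)

counts = Counter(words)
print(counts.most_common(3))
```

Stripping only from the ends keeps internal hyphens and apostrophes intact, which may or may not be what you want for a given text.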

In [None]:
beowulf = [] # a list per line is one option

with open("beowulf.txt",'r') as infile :
    for line in infile :
        beowulf.append(line.strip())
        
# Lots of blanks we don't need. 
beowulf = [line for line in beowulf if line]

In [None]:
beo_words = []
for line in beowulf :
    beo_words.extend(line.split())


In [None]:
beo_counter = Counter(beo_words)

In [None]:
beo_counter.most_common(20)

---

## Stemming

Let's go through some stemming examples from the NLTK. First, let's continue to practice exploring words. How many words in our Beowulf text end in "ing"? 

In [None]:
beo_ing = []
for word, count in beo_counter.items() :
    if word[-3:] == "ing" :
        beo_ing.append(word)

In [None]:
len(beo_ing)

What about in the NLTK words corpus?

In [None]:
# Here's something to get you started:
for idx, word in enumerate(nltk.corpus.words.words()) :
    print(word)
    if idx > 10 :
        break

In [None]:
all_ing = []

for word in nltk.corpus.words.words() :
    if word[-3:] == "ing" :
        all_ing.append(word)

In [None]:
len(all_ing)

What are the Beowulf words that aren't in the words corpus? Are there some in there that you think should be? Could you modify the Beowulf processing to have a higher hit rate? 
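
As a hint for the last question, here is a small sketch of how cleaning tokens before the lookup changes the hit rate. The `vocabulary` set, the `raw_tokens` list, and the `clean` helper are all hypothetical stand-ins, not the real corpora.

```python
import string

# Made-up stand-ins: a tiny "dictionary" and some raw tokens from a text.
vocabulary = {"king", "ring", "sing", "bring"}
raw_tokens = ["King", "ring,", "sing", "wassailing", "Bring!"]

def clean(token):
    """Lowercase and trim surrounding punctuation before the lookup."""
    return token.strip(string.punctuation).lower()

# Tokens reported as missing without and with cleaning.
missing_raw = [t for t in raw_tokens if t not in vocabulary]
missing_clean = [t for t in raw_tokens if clean(t) not in vocabulary]

print(missing_raw)
print(missing_clean)
```

Holding the vocabulary in a `set` also makes each membership test much faster than scanning a list, which matters once the vocabulary has hundreds of thousands of entries.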

In [None]:
[w for w in beo_ing if w not in all_ing]

In [None]:
# Lowercasing before the lookup typically matches more words, so fewer are reported as missing:
[w for w in beo_ing if w.lower() not in all_ing]

Now to stemming. Let's use the Porter Stemmer to look at some inaugural speech texts.

In [None]:
porter = nltk.PorterStemmer() # give it a short name.
start = 30000
distance = 100

print(" ".join(text4[start:(start + distance)]))
print("\n\n")
print(" ".join([porter.stem(w) for w in text4[start:(start + distance)]]))



How many words are in the inaugural addresses? How many stems are in them? 
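
Stemming shrinks the vocabulary because several surface forms collapse onto one stem. The toy stemmer below makes that visible on a handful of words. It is a deliberately crude suffix stripper written for this example (the names `SUFFIXES` and `toy_stem` are hypothetical) and is not the Porter algorithm, which applies many more rules.

```python
# Hypothetical, deliberately crude stemmer: NOT the Porter algorithm.
SUFFIXES = ("ing", "ed", "es", "s")

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

tokens = ["Walking", "walked", "walks", "walk", "sing", "glasses"]
stems = [toy_stem(t) for t in tokens]

print(stems)
print(len(set(tokens)), "distinct words ->", len(set(stems)), "distinct stems")
```

The length check is what keeps "sing" from being reduced to "s"; real stemmers use more careful conditions for the same purpose.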

In [None]:
# distinct tokens (case-sensitive, punctuation included) in the inaugural addresses
print(len(set(text4)))

In [None]:
inaug_stemmed = {porter.stem(w.lower()) for w in text4}

print(len(inaug_stemmed))

print(len(set(text4))/len(inaug_stemmed))

---

## Language Models
Let's find some common n-grams in _Sense and Sensibility_.

In [None]:
fd = FreqDist(text2)

In [None]:
fd.freq('a')

In [None]:
nltk.corpus.stopwords.words("english")

In [None]:
# the `isalpha` method is True when a string is non-empty and every character
# is a letter (any Unicode letter counts, not only ASCII)
print("abc".isalpha())
print("abc123".isalpha())
print("_".isalpha())
print("Hi!".isalpha())


Now go through the book (text2) and build a new frequency distribution. Build one with all of the following attributes:

1. Lowercase words
1. Words that _aren't_ in the `stopword` list
1. Words that pass the `isalpha` test. 

What's the count of "the" in both frequency distributions? How have the most common words changed? (Use the `most_common` method on the frequency distribution.)
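
To see what the three filters do before running them on the whole book, here is a small sketch on made-up tokens. It uses `collections.Counter`, which `FreqDist` builds on, and a tiny hypothetical `stop_words` set in place of the NLTK stopword list.

```python
from collections import Counter

# Assumed stand-ins: a tiny stopword set and a made-up token list.
stop_words = {"the", "a", "of", "and"}
tokens = ["The", "sister", "of", "the", "café", "owner", ",",
          "and", "a", "sister", "1811"]

raw_counts = Counter(tokens)

# Apply all three filters: letters only, not a stopword, lowercased.
clean_counts = Counter(
    t.lower() for t in tokens
    if t.isalpha() and t.lower() not in stop_words
)

print(raw_counts["the"], clean_counts["the"])
print(clean_counts.most_common(2))
```

Note that the unfiltered count of "the" is case-sensitive, so "The" is tallied under a different key, and that `isalpha` accepts the accented letter in "café".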

In [None]:
# Load the stopword list once; a set makes each membership test fast.
stop_words = set(nltk.corpus.stopwords.words("english"))

fd2 = FreqDist([w.lower() for w in text2
                if w.isalpha() and w.lower() not in stop_words])

In [None]:
print(fd["the"])
print(fd2["the"])

In [None]:
fd.most_common(10)

In [None]:
fd2.most_common(10)

Sum up the total words in this second frequency distribution. Display the 20 most common words, their counts, and their overall fraction of the book's words.

In [None]:
total_words = sum([count for word, count in fd2.items()])

for pairs in fd2.most_common(20) :
    print(" : ".join([pairs[0],str(pairs[1]),str(pairs[1]/total_words)]))
    

We can use the `FreqDist` class to look at common co-occurrences of words as bigrams. NLTK provides a useful function, `ngrams`, that gives us the n-grams of a text.
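
If it helps to see what an n-gram is without the library, the idea can be sketched in plain Python by zipping a token list against shifted copies of itself. The helper `ngrams_of` is a hypothetical name for this illustration and is not how NLTK implements `ngrams`.

```python
def ngrams_of(tokens, n):
    """Return the n-grams of a token list as tuples (plain-Python sketch)."""
    # zip stops at the shortest slice, so only complete n-grams are produced.
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = ["sense", "and", "sensibility", "and", "sense"]

bigrams = ngrams_of(tokens, 2)
trigrams = ngrams_of(tokens, 3)

print(bigrams)
print(trigrams)
```

A list of `k` tokens yields `k - n + 1` n-grams, so five tokens give four bigrams and three trigrams.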

In [None]:
ss_bigrams = nltk.ngrams(text2,2)

In [None]:
for idx, pair in enumerate(ss_bigrams) :
    print(pair)
    if idx > 20 :
        break

In [None]:
# Note: `ngrams` returns a one-shot iterator, and the loop above already
# consumed part of it, so we re-create it before using it again.
ss_bigrams = nltk.ngrams(text2,2)
# Ask me about this if it doesn't make sense. And sensibility. 

Build a frequency distribution of the bigrams in S&S and look at the most common ones.  
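
While building this, keep in mind that a lazily produced sequence of bigrams can be walked only once. The sketch below shows the effect with a hypothetical generator function, `pairs`, written for this example rather than with `nltk.ngrams` itself.

```python
def pairs(tokens):
    """Yield adjacent pairs lazily, one at a time."""
    for left, right in zip(tokens, tokens[1:]):
        yield (left, right)

gen = pairs(["a", "b", "c", "d"])

first_pass = list(gen)   # consumes every pair
second_pass = list(gen)  # nothing is left to yield

print(first_pass)
print(second_pass)
```

Calling `pairs(...)` again, or storing the result in a list up front, are two ways around this.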

In [None]:
ss_bi_fd = FreqDist(nltk.ngrams(text2,2))
ss_bi_fd.most_common(20)

Now, let's build a new frequency distribution of bigrams where the following hold true:

1. All words are lowercased.
1. No bigrams where *both* words are in the `stopword` list.
1. Both words pass the `isalpha` test.

Build this and display the 20 most common bigrams.

In [None]:
sw = nltk.corpus.stopwords.words("english")

In [None]:
clean_ss_bigrams = []

for pair in nltk.ngrams(text2,2) :
    first, second = pair
    first = first.lower()
    second = second.lower()
    
    if first not in sw or second not in sw :
        if first.isalpha() and second.isalpha() :
            clean_ss_bigrams.append((first,second))

In [None]:
clean_bi_fd = FreqDist(clean_ss_bigrams)

In [None]:
clean_bi_fd.most_common(20)

Concordance is a cool way to look at a word in context. Explore some of the words from your most common bigrams.
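
To make the idea concrete, here is a rough sketch of a keyword-in-context listing in plain Python. The function `toy_concordance` and its `width` parameter are hypothetical, and the output format is simpler than what NLTK prints.

```python
def toy_concordance(tokens, target, width=2):
    """Return each occurrence of target with `width` tokens on either side."""
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == target.lower():
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            lines.append(" ".join(left + [token] + right))
    return lines

# Made-up sentence standing in for the novel.
tokens = "her sister was kind and her own sister knew it".split()

for line in toy_concordance(tokens, "sister"):
    print(line)
```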

In [None]:
text2.concordance("sister")

---

## N-gram models

Let's make a function that takes in text, builds a freq dist and generates text with various n-grams. To do this, we'll need a function that gives us words from a frequency distribution probabilistically. 

In [None]:
import random

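def weighted_choice(freq_dist):
    """Return a random token, chosen with probability proportional to its count."""
    if not freq_dist:
        raise ValueError("freq_dist is empty, so there is nothing to choose from")
    weight_total = sum(freq_dist.values())
    # Pick a point in [0, weight_total] and walk the counts until we pass it.
    n = random.uniform(0, weight_total)
    for token, count in freq_dist.items():
        if n < count:
            return token
        n -= count
    return token  # only reached in the edge case n == weight_total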

Kind of complicated, but it does what we expect. Play around with the following cell to see words from various texts.
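
For comparison, the standard library can do the same kind of weighted draw with `random.choices`. The sketch below is an alternative, not a replacement for the exercise; the function name `weighted_choice_stdlib` and the toy counts are made up for this example.

```python
import random
from collections import Counter

# Made-up counts standing in for a frequency distribution.
counts = Counter({"the": 6, "whale": 3, "sea": 1})

def weighted_choice_stdlib(freq_dist):
    """Pick one token with probability proportional to its count."""
    tokens = list(freq_dist.keys())
    weights = list(freq_dist.values())
    return random.choices(tokens, weights=weights, k=1)[0]

random.seed(0)  # fixed seed only so repeated runs agree
draws = Counter(weighted_choice_stdlib(counts) for _ in range(10_000))
print(draws.most_common())
```

Over many draws the tallies should sit close to the 6 : 3 : 1 ratio of the input counts, which is a quick sanity check for any weighted sampler.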

In [None]:
weighted_choice(FreqDist(text4))

Now, write a function that generates text of a given length by drawing each word at random from the frequency distribution. Have it take a text and the desired length of the output.

In [None]:
def generate_unigram(text,length=10) :
    fd = FreqDist(text)
    
    results = []
    for i in range(length) :
        results.append(weighted_choice(fd))
        
    return(" ".join(results))


Now play around with the various texts, generating nonsense sentences from them. 

In [None]:
generate_unigram(text1)

In [None]:
generate_unigram(text2)

In [None]:
generate_unigram(text5)

Challenge exercise: Do the same thing, but have it work with bigrams. This is harder, since you have a "current word" you want to glue text onto. 
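
One possible shape for a solution is to precompute, for every word, a table of the words that follow it, so that each step is a dictionary lookup. This is a sketch under assumptions, shown on a made-up token list: the names `build_successors` and `generate` are hypothetical, and it stops early when it reaches a word that has no successor.

```python
import random
from collections import Counter, defaultdict

def build_successors(tokens):
    """Map each word to a Counter of the words that follow it."""
    successors = defaultdict(Counter)
    for current, following in zip(tokens, tokens[1:]):
        successors[current][following] += 1
    return successors

def generate(tokens, length=10, start=None):
    """Walk the successor table, stopping early at a word with no successor."""
    successors = build_successors(tokens)
    word = start if start is not None else random.choice(tokens)
    output = [word]
    # .get avoids creating empty entries for words that never lead anywhere.
    while len(output) < length and successors.get(word):
        options = successors[word]
        word = random.choices(list(options), weights=list(options.values()), k=1)[0]
        output.append(word)
    return " ".join(output)

tokens = "the cat sat on the mat and the cat ran".split()
print(generate(tokens, length=8, start="the"))
```

Unlike a version that rescans every bigram at each step, the table is built once per call. This sketch also includes the starting word in the output, which is a design choice rather than a requirement.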

In [None]:
def weighted_choice_ngram(cur_word,freq_dist) :
    ''' Starts with a current word and randomly chooses 
        a following word based on the bigrams. '''
    
    # First, build a dict of the form
    # {'a_word': count}
    # from every freq_dist entry whose key is the bigram
    # ('cur_word', 'a_word')
    sub_dist = {}
    
    for bigram, count in freq_dist.items() :
        if bigram[0] == cur_word :
            sub_dist[bigram[1]] = count
    
    return(weighted_choice(sub_dist))

def generate_bigram(text,length=10,start=None) :
    
    if not start :
        uni_fd = FreqDist(text)
        start = weighted_choice(uni_fd)
        
    fd = FreqDist(nltk.bigrams(text))
    
    results = []
    this_word = start
    for i in range(length) :
        this_word = weighted_choice_ngram(this_word,fd)
        results.append(this_word)
        
    return(" ".join(results))


In [None]:
generate_bigram(text1)

In [None]:
generate_bigram(text2)

In [None]:
generate_bigram(text5)