# Spring 2018, Week 1
To use this notebook, you'll need NLTK's data packages installed. If you have not downloaded them yet, run the cell below. <b>Skip the cell if you already have the NLTK data!</b> Running the cell will cause a dialog box to open. Select "all" and press "Download." You can close the window once the download is complete.

For today's workgroup, we'll talk about finding repeated phrases (or ngrams) in texts. We'll focus on a single text, the novel <i>Emma</i>, by Jane Austen. As we walk through the cells below, you can make notes for yourself, either by creating a new Markdown Cell (click Insert above, then use the dialog box to change the cell to a markdown cell) or by inserting comments in the code using the # symbol.
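
As a quick illustration of what an ngram is before we turn to the novel, the sketch below builds 2-grams (bigrams) from a short made-up sentence in plain Python and keeps the ones that repeat. The helper `make_ngrams` and the sample sentence are assumptions for illustration only; the cells that follow use `nltk.ngrams` and `nltk.FreqDist` for the same job.

```python
from collections import Counter

def make_ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple (hypothetical helper)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Made-up sample sentence, split into word tokens
sample = "she was very happy and she was very kind".split()
sample_bigrams = make_ngrams(sample, 2)
print(sample_bigrams[:3])

# Count how often each bigram occurs and keep only the ones that repeat
repeated = {gram: count for gram, count in Counter(sample_bigrams).items() if count > 1}
print(repeated)
```

A list of N tokens yields N - n + 1 ngrams of length n, so the nine sample words produce eight bigrams.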

In [1]:
# Run this cell only if you don't have the NLTK data!!

import nltk

nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

In [None]:
import nltk
emmaTokens = list(nltk.corpus.gutenberg.words("austen-emma.txt")) #load the tokens of Emma from NLTK's Gutenberg corpus
emmaWords = [word for word in emmaTokens if word[0].isalpha()] #keep only the tokens that start with a letter (drops punctuation and numbers)
print(emmaWords[:25])

In [None]:
emmaText = nltk.Text(emmaWords) 
emmaText.collocations(20) #show 20 collocations: bigrams that occur together unusually often, with stopwords filtered out

In [None]:
emmaFirstSixWords = " ".join(emmaWords[:6]) #join the first six words of Emma with spaces
print("first six words of Emma: ", emmaFirstSixWords)
emmaBigrams = list(nltk.ngrams(emmaWords, 2)) #create bigrams from emmaWords
emmaBigrams[:5] 

In [None]:
print(len(emmaWords))
print(len(emmaBigrams))

In [None]:
emma4grams = list(nltk.ngrams(emmaWords, 4)) #create 4-grams of Emma
emma4gramsFreqs = nltk.FreqDist(emma4grams) #determine frequency of 4-grams
for words, count in emma4gramsFreqs.most_common(15):
    print(count, " ".join(list(words))) #show the count and then create a string from the 4-gram

In [None]:
%matplotlib inline
emmaText.dispersion_plot(["I do not know"]) #Finds no matches: emmaText holds single words, so no token equals the whole phrase

In [None]:
emma4gramsText = nltk.Text(emma4grams)
emma4gramsText.dispersion_plot([("I","do","not","know")]) #convert the 4grams into a text and then plot this tuple

In [None]:
emma4gramsTokens = [" ".join(gram) for gram in emma4grams] #join each 4gram tuple into one string, giving a list of strings
nltk.Text(emma4gramsTokens).dispersion_plot(["I do not know"]) #each token is now a whole phrase, so the string can be matched and plotted
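
To see why the phrase has to become a single token before it can be searched for, here is a small sketch on a made-up word list. The sample words and variable names are assumptions for illustration, not part of the Emma data.

```python
# Made-up word tokens, one word per token
words = ["I", "do", "not", "know", "what", "I", "do", "not", "know"]

# Each token is a single word, so the whole phrase never equals one token
phrase_in_words = "I do not know" in words
print(phrase_in_words)

# Build 4-grams as tuples, then join each tuple into one string token
four_grams = [tuple(words[i:i + 4]) for i in range(len(words) - 3)]
four_gram_tokens = [" ".join(gram) for gram in four_grams]

# The tuple matches the tuple list, and the string matches the joined list
print(("I", "do", "not", "know") in four_grams)
print(four_gram_tokens.count("I do not know"))
```

Either representation works, as long as the search term has the same type as the tokens: a tuple for a list of tuples, a string for a list of strings.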

In [None]:
ngramsFreqs = [] #keep track of the last set of repeating phrases
for length in range(2, len(emmaWords)):  #create a range from 2 to the length of the whole text
    ngrams = list(nltk.ngrams(emmaWords, length))  #get ngrams for a specified length
    freqs = nltk.FreqDist(ngrams) #get the frequencies for those ngrams
    freqs = [(ngram, count) for ngram, count in freqs.items() if count > 1] #keep only the ngrams that occur more than once
    if len(freqs) > 0:  #if we have at least one repeating phrase
        ngramsFreqs = freqs  #remember this set of repeating ngrams
    else:
        break  #if we've filtered out all the frequencies, then break out of the loop

for ngram, count in ngramsFreqs:
    print("ngram of", len(ngram), "words occurring", count, "times:", " ".join(list(ngram)))
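
The same search is easier to follow on a short example. The sketch below repeats the loop's logic in plain Python on a made-up sentence; the function name `longest_repeats` and the sample are assumptions for illustration, and `collections.Counter` stands in for `nltk.FreqDist`.

```python
from collections import Counter

def longest_repeats(tokens):
    """Return (ngram, count) pairs for the longest ngrams that occur more than once."""
    best = []
    for length in range(2, len(tokens)):
        grams = [tuple(tokens[i:i + length]) for i in range(len(tokens) - length + 1)]
        repeats = [(gram, count) for gram, count in Counter(grams).items() if count > 1]
        if not repeats:
            break  # nothing of this length repeats, so nothing longer can repeat either
        best = repeats
    return best

sample = "it was a truth it was a lie".split()
result = longest_repeats(sample)
for gram, count in result:
    print("ngram of", len(gram), "words occurring", count, "times:", " ".join(gram))
```

Stopping at the first length with no repeats is safe because any repeated phrase of length k contains a repeated phrase of length k - 1.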

In [None]:
numbers = list(range(0,10))
numbers

In [None]:
import numpy as np
numberBins = np.array_split(numbers, 5) #divide our numbers into five equal bins
print("number of bins:", len(numberBins))

In [None]:
numberBins
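
Ten numbers divide evenly into five bins, but `np.array_split` also accepts lengths that do not divide evenly: according to the NumPy documentation, the first `len % n` bins then receive one extra item. The sketch below reproduces that sizing rule in plain Python; `split_into_bins` is a hypothetical helper written for illustration, not NumPy's own implementation.

```python
def split_into_bins(items, n_bins):
    """Split items into n_bins consecutive chunks whose sizes differ by at most one."""
    base, extra = divmod(len(items), n_bins)
    bins, start = [], 0
    for i in range(n_bins):
        size = base + 1 if i < extra else base  # the first `extra` bins get one more item
        bins.append(items[start:start + size])
        start += size
    return bins

numbers = list(range(10))
even_bins = split_into_bins(numbers, 5)
uneven_sizes = [len(b) for b in split_into_bins(numbers, 3)]
print(even_bins)
print(uneven_sizes)
```

This matters for the novel, since the number of 4grams in a text is unlikely to be an exact multiple of the number of segments.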

In [None]:
emma4gramsSegments = np.array_split(emma4gramsTokens, 10)
[len(segment) for segment in emma4gramsSegments]

In [None]:
idkCounts = [list(segment).count("I do not know") for segment in emma4gramsSegments] #count the phrase in each of the ten segments
idkCounts

In [None]:
import matplotlib.pyplot as plt
line = plt.plot(idkCounts, label="I do not know")
plt.ylim(0) #start the y axis at zero
plt.legend(handles=line) #add legend
plt.show() #display the line chart before drawing the dispersion plot
emma4gramsText.dispersion_plot([("I","do","not","know")])

In [None]:
xaxis = range(0, len(idkCounts))
bar = plt.bar(xaxis, idkCounts, label="I do not know")
plt.legend(handles=[bar])
plt.show()

In [None]:
searches = ["I do not know", "I have no doubt"] #line plot comparison of "I do not know" and "I have no doubt"
lines = []
for search in searches:
    line, = plt.plot([list(segment).count(search) for segment in emma4gramsSegments], label=search)
    lines.append(line)
plt.legend(handles=lines)
plt.show()

In [None]:
list1 = [1,2,3]
list2 = [1,5,7]
plt.plot(range(3), list1, range(3), list2)
plt.show()
np.corrcoef(list1, list2)[0,1] #returns a matrix of values, but we just want the top right value

In [None]:
list2.reverse()
plt.plot(range(3), list1, range(3), list2)
plt.show()
np.corrcoef(list1, list2)[0,1]
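
The value picked out of the `np.corrcoef` matrix is the Pearson correlation coefficient of the two lists. As a rough sketch of what that number measures, the code below computes it by hand in plain Python. The function `pearson` and the variable names are assumptions for illustration; in practice `np.corrcoef` is the tool to use.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length lists (hypothetical helper)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # How much the two lists move together around their means
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # How much each list varies on its own
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(spread_x * spread_y)

rising = [1, 2, 3]
also_rising = [1, 5, 7]
same_direction = pearson(rising, also_rising)
opposite_direction = pearson(rising, also_rising[::-1])
print(round(same_direction, 3))      # close to 1: the lists rise together
print(round(opposite_direction, 3))  # close to -1: one rises while the other falls
```

Values near 1 mean the two lists rise and fall together, values near -1 mean they move in opposite directions, and values near 0 mean there is no linear relationship. This hand-written version raises a `ZeroDivisionError` if either list is constant.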

In [None]:
idkCounts = [list(segment).count("I do not know") for segment in emma4gramsSegments]
iHaveNoDoubtCounts = [list(segment).count("I have no doubt") for segment in emma4gramsSegments]
print(idkCounts)
print(iHaveNoDoubtCounts)

In [None]:
np.corrcoef(idkCounts, iHaveNoDoubtCounts)[0,1]

In [None]:
# create a list of the 15 most frequent 4grams, each joined into a single string
emma4gramsMostFrequent = [" ".join(words) for words, count in emma4gramsFreqs.most_common(15)]
print(emma4gramsMostFrequent)

In the code below, we first go through the 15 most frequent 4grams in Emma. For each one, we count how many times it occurs in each of the Emma 4grams segments and store that list of counts in the segments counts dictionary, keyed by the 4gram.

Then, we build a second dictionary of correlations. We go through each 4gram and its list of counts in the segments counts dictionary, and for every one we compute a numpy correlation coefficient against the counts for "I do not know".

Finally, we create a FreqDist from the correlations dictionary. The values are correlations rather than frequencies, but FreqDist gives us a convenient way to list the items from highest to lowest value and to plot them on a line graph.
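
The three steps above can be sketched on a tiny made-up example before running them on the novel. Everything in the block below (the phrase names, the per-segment counts, and the `pearson` helper) is an assumption for illustration, and a plain `sorted` call stands in for the ordering that FreqDist provides.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length lists (hypothetical helper)."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread = math.sqrt(sum((x - mean_x) ** 2 for x in xs) * sum((y - mean_y) ** 2 for y in ys))
    return covariance / spread

# Step 1: made-up counts of three phrases across five segments
segment_counts = {
    "phrase a": [3, 2, 2, 1, 0],
    "phrase b": [6, 4, 4, 2, 0],
    "phrase c": [0, 1, 2, 2, 3],
}

# Step 2: correlate every phrase's counts with the reference phrase's counts
reference = "phrase a"
correlations = {
    phrase: pearson(segment_counts[reference], counts)
    for phrase, counts in segment_counts.items()
}

# Step 3: order the phrases from most to least correlated with the reference
ranked = sorted(correlations.items(), key=lambda item: item[1], reverse=True)
for phrase, value in ranked:
    print(phrase, round(value, 3))
```

The reference phrase always correlates perfectly with itself, so it appears at or near the top of the list. The interesting entries are the other phrases with high or strongly negative values.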

In [None]:
emma4gramsSegmentsCounts = {}  #build a dictionary of counts for each search item
for search in emma4gramsMostFrequent:
    emma4gramsSegmentsCounts[search] = [list(segment).count(search) for segment in emma4gramsSegments]
#print(emma4gramsSegmentsCounts)

iDoNotKnowCorrelations = {}  #build a dictionary of correlation values for "I do not know"
for ngram, counts in emma4gramsSegmentsCounts.items():
    iDoNotKnowCorrelations[ngram] = np.corrcoef(emma4gramsSegmentsCounts["I do not know"], counts)[0,1]
#print(iDoNotKnowCorrelations)

iDoNotKnowCorrelationsFreqs = nltk.FreqDist(iDoNotKnowCorrelations)
print(iDoNotKnowCorrelationsFreqs.most_common())
iDoNotKnowCorrelationsFreqs.plot()