# Word Frequencies, Distinctive Words

And an introduction to Pandas. 

It's always a good idea to put all your import statements at the beginning of your notebook. 

In [59]:
# The Natural Language Processing Toolkit
import nltk

# We can rename long function names like this,
# so that it's easier to type.
from nltk import word_tokenize as tokenize
from nltk import sent_tokenize as sentTokenize

# This one is for counting things. 
from collections import Counter

# And here is the amazing data science library, Pandas.
import pandas as pd

# This magic command tells Jupyter to display plots (graphs) 
# here in this notebook, and not elsewhere. 
%matplotlib inline

Before, we tokenized a string with `nltk.word_tokenize()`. Since we renamed this function to `tokenize()` in the `from ... import ... as` statement above, we can now just run this: 

In [185]:
tokenize("The quick brown fox")

['The', 'quick', 'brown', 'fox']

As usual, make sure we're in the right directory (i.e., the one with `moonstone.md` in it). 

In [3]:
%cd ..

/home/jon/Code/course-computational-literary-analysis


In [4]:
%ls

[0m[01;34mHomework[0m/  LICENSE       ngram-pos-experiments.ipynb  README.md
[01;34mHW1[0m/       moonstone.md  [01;34mNotes[0m/


In [55]:
moonstone = open('/home/jon/Code/course-computational-literary-analysis/moonstone.md').read()

The NLTK tokenizer doesn't split hyphenated words, so it treats something like `well-known` as a single token. To work around this, we can replace every hyphen with a hyphen surrounded by spaces. That way the tokenizer sees the words on either side of the hyphen as separate tokens. 

In [192]:
moonstone = moonstone.replace('-', ' - ')
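Here is a small sketch of what that replacement does, using a made-up sample string and plain `.split()` instead of the NLTK tokenizer, so that it runs without any extra downloads. The variable names are just for illustration.

```python
# A made-up sample string, not a line from the novel.
sample = "a well-known gentleman"

# Surround every hyphen with spaces, as in the cell above.
spaced = sample.replace('-', ' - ')

# Splitting on whitespace now separates the two halves of the compound.
tokens = spaced.split()
print(tokens)
```

One side effect to keep in mind: a double hyphen used as a dash (`--`) becomes two separate `-` tokens after this replacement.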

As before, we've edited the moonstone markdown file, adding `%%%%%` markers to show where narratives begin and end. This lets us extract individual narratives with these commands: 

In [56]:
moonstoneParts = moonstone.split('%%%%%')

In [57]:
bet = moonstoneParts[1]

In [60]:
betSentences = sentTokenize(bet)

In [63]:
bet10sents = betSentences[:10]

In [194]:
bet = moonstoneParts[1]

In [195]:
clack = moonstoneParts[3]

In [196]:
bruff = moonstoneParts[4]
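To see how splitting on a marker works, here is a minimal sketch with a toy string standing in for the novel. The text and variable names are invented for illustration; only the `%%%%%` marker comes from our edited file.

```python
# A toy "novel" with the same kind of markers between parts.
toyNovel = "Prologue text %%%%% First narrative %%%%% Second narrative"

# .split() cuts the string at every marker and returns a list of the pieces.
toyParts = toyNovel.split('%%%%%')

# Index 0 is whatever comes before the first marker, so the
# first marked-off part is at index 1.
firstNarrative = toyParts[1].strip()
print(len(toyParts), firstNarrative)
```

This is why Betteredge's narrative is at index 1 rather than index 0: everything before the first marker lands in position 0.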

In [14]:
emsent = []

In [15]:
type(emsent)

list

In [22]:
emsent = []

In [23]:
emsent.append('item')

In [24]:
emsent

['item']

In [31]:
tokenize("this is a test!")

['this', 'is', 'a', 'test', '!']

In [32]:
"this is a test!".split()

['this', 'is', 'a', 'test!']

In [41]:
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer

In [42]:
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

In [40]:
for word in ["inform", "information", "informative"]: 
    print(stemmer.stem(word))

inform
inform
inform


In [43]:
for word in ["jump", "jumps", "jumping"]:
    print(lemmatizer.lemmatize(word))

jump
jump
jumping


Now we can tokenize them all: 

In [197]:
betTokens = tokenize(bet)
clackTokens = tokenize(clack)
bruffTokens = tokenize(bruff)
moonstoneTokens = tokenize(moonstone)

## Functions

Functions, like `for` loops, are good ways to do things repeatedly. We can run some task over a series of objects this way. Here's a simple function that multiplies a number by two: 

In [198]:
def timesTwo(inputNumber):
    numberTimesTwo = inputNumber * 2
    return numberTimesTwo

If we want, we can make this a little fancier, by checking to make sure that we get an integer passed to us first. 

In [199]:
def timesTwo(inputNumber):
    if type(inputNumber) is not int: 
        return "I don't want to multiply this by two, because it's not an integer, " +\
               "and I'm afraid of what might happen!!!!!!!!!!"
    numberTimesTwo = inputNumber * 2
    return numberTimesTwo

Now we can call the function we just made: 

In [204]:
timesTwo(5)

10

In [205]:
timesTwo("Hello!")

"I don't want to multiply this by two, because it's not an integer, and I'm afraid of what might happen!!!!!!!!!!"

In [44]:
"the the the the the the the that's all folks".count("the")

7

In [45]:
porky = tokenize("the the the the the the the that's all folks")

In [47]:
len(porky)

11

In [48]:
7/11

0.6363636363636364

In [49]:
name = "Rachel, Franklin, Godfrey".split(', ')

In [51]:
len(name)

3

Functions are useful for running some series of tasks repeatedly on something. Let's say I have a list of numbers, and I want to multiply each by two: 

In [206]:
listOfNumbers = [3, 6, 9, 11, 2, 0]

In [208]:
for number in listOfNumbers: 
    print(timesTwo(number))

6
12
18
22
4
0
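Printing is fine for a quick look, but often we'll want to keep the results. Here is a minimal sketch that collects them in a new list instead. It redefines a simple `timesTwo` so the example runs on its own, and `doubled` is just a name chosen for illustration.

```python
def timesTwo(inputNumber):
    return inputNumber * 2

listOfNumbers = [3, 6, 9, 11, 2, 0]

# Start with an empty list and append each result to it.
doubled = []
for number in listOfNumbers:
    doubled.append(timesTwo(number))

print(doubled)
```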


I can also write a function that returns `True` or `False`, which will then speak directly to an `if` statement later: 

In [209]:
def isDelicious(fruit): 
    if fruit == "apple": 
        return True
    else: 
        return False

In [210]:
if isDelicious("kiwi"): 
    print("Yay! My kiwi is delicious!")
else: 
    print("My kiwi is not delicious!!!!! Oh noes!!!!!")

My kiwi is not delicious!!!!! Oh noes!!!!!
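Since the comparison `fruit == "apple"` already evaluates to `True` or `False`, a function like this can probably be shortened by returning the comparison directly. This is only a sketch of that idea, and `isDeliciousShort` is a hypothetical name, not something defined elsewhere in this notebook.

```python
def isDeliciousShort(fruit):
    # The comparison itself is a boolean, so we can return it as-is.
    return fruit == "apple"

print(isDeliciousShort("apple"))
print(isDeliciousShort("kiwi"))
```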


Here's an example of a function that takes two inputs: 

In [211]:
def makeLovers(loverA, loverB): 
    return loverA + " and " + loverB + ", sitting in a tree, K-I-S-S-I-N-G"

In [212]:
makeLovers('Rachel', 'Franklin')

'Rachel and Franklin, sitting in a tree, K-I-S-S-I-N-G'

## Word Frequencies

Let's analyze the frequencies of the words in each narrative we've read so far. Now that we've tokenized each, we can lowercase each token, so that we're not paying attention to whether a word starts a sentence or not. 

In [213]:
clackTokensLower = []
for token in clackTokens: 
    clackTokensLower.append(token.lower())

Or you can use a more advanced pattern, called a "list comprehension." The `clackTokensLower` line below is equivalent to the loop in the cell above. It's just a shorter and nicer way of writing it. 

In [214]:
# Using list comprehensions
clackTokensLower = [token.lower() for token in clackTokens]
betTokensLower = [token.lower() for token in betTokens]
bruffTokensLower = [token.lower() for token in bruffTokens]
moonstoneTokensLower = [token.lower() for token in moonstoneTokens]

In [152]:
clackTokensLower[:10]

['#', '#', '#', 'chapter', 'i', 'i', 'am', 'indebted', 'to', 'my']
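Here is a small side-by-side sketch of the loop and the list comprehension on a toy list, to show that they produce the same result. The last line goes one step further than the cells above and adds an `if` to the comprehension, which might be one way to drop tokens like `#`. That filter is an illustration, not something we do in this notebook.

```python
toyTokens = ['#', 'Chapter', 'I', 'am']

# The loop version.
lowerLoop = []
for token in toyTokens:
    lowerLoop.append(token.lower())

# The list comprehension version.
lowerComprehension = [token.lower() for token in toyTokens]

# Same result either way.
print(lowerLoop == lowerComprehension)

# A comprehension can also filter: keep only purely alphabetic tokens.
wordsOnly = [token.lower() for token in toyTokens if token.isalpha()]
print(wordsOnly)
```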

We can use the `Counter` class we imported above from the `collections` module to count the items in any list, like our list of lowercased tokens. 

In [215]:
clackCounts = Counter(clackTokensLower)
betCounts = Counter(betTokensLower)
bruffCounts = Counter(bruffTokensLower)
moonstoneCounts = Counter(moonstoneTokensLower)
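To get a feel for what a `Counter` gives us, here is a minimal sketch on a toy list of tokens (invented for illustration).

```python
from collections import Counter

toyCounts = Counter(['oh', '!', 'oh', 'dear', '!', '!'])

# Look up a count the same way you would look up a dictionary value.
print(toyCounts['!'])

# Asking for something that was never counted gives 0, not an error.
print(toyCounts['godfrey'])

# .most_common(n) returns the n most frequent items with their counts.
print(toyCounts.most_common(2))
```

A `Counter` behaves like a dictionary whose values are counts, which is why we can index it with a token in the cells below.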

Let's try it out. How many times does Miss Clack use exclamation points? 

In [216]:
clackCounts['!']

248

Betteredge? The novel as a whole?

In [218]:
betCounts['!']

335

In [219]:
moonstoneCounts['!']

992

Now let's build up a dictionary where we compare the relative proportions of words in Clack's narrative and in Betteredge's narrative. 

In [158]:
clacknesses = {}
for word in clackCounts: 
    # How many times does Miss Clack use this word? 
    clackCount = clackCounts[word]
    
    # Adjust for the number of words in Miss Clack's narrative. 
    clackProportion = clackCount / len(clackTokensLower)
    
    # How many times does Betteredge use this word?

    # Instead of indexing the word directly, which 
    # would fail if the word isn't in our dictionary, 
    # we can use the dictionary `.get()` method, which allows
    # us to say what we want it to return if the word isn't in the 
    # dictionary (in this case, 0). 
    betCount = betCounts.get(word, 0)
    betProportion = betCount / len(betTokensLower)
    
    # Define "clackness" as the difference in proportions
    # between Clack's and Betteredge's narratives
    clackness = (clackProportion - betProportion)*100
    #print(word, clackness)
    clacknesses[word] = clackness
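To check the arithmetic, here is the same calculation as a sketch on two tiny token lists, small enough to verify by hand. The function name `proportionDifferences` and the toy lists are assumptions made for this example.

```python
from collections import Counter

def proportionDifferences(tokensA, tokensB):
    """For each word in A, how much larger is its share of A than its share of B?"""
    countsA = Counter(tokensA)
    countsB = Counter(tokensB)
    differences = {}
    for word in countsA:
        proportionA = countsA[word] / len(tokensA)
        # .get(word, 0) gives 0 for words that B never uses.
        proportionB = countsB.get(word, 0) / len(tokensB)
        differences[word] = (proportionA - proportionB) * 100
    return differences

toyA = ['dear', 'dear', 'aunt', 'the']
toyB = ['the', 'the', 'dear', 'sir']
toyScores = proportionDifferences(toyA, toyB)
print(toyScores)
```

In the toy data, `dear` makes up 2/4 of A and 1/4 of B, so its score is (0.50 - 0.25) × 100 = 25. Note that the loop only visits words that appear in A, so a word like `sir`, which only B uses, gets no score at all.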

Now we can sort the dictionary, and print out the top 20 words with highest scores for "Clackness":

In [222]:
# Sort the words by their clackness score, highest first.
clacknessesSorted = sorted(clacknesses, key=clacknesses.get, reverse=True)
for word in clacknessesSorted[:20]: 
    print(word, clacknesses[word])

i 0.4429010077119939
! 0.3274994929197816
my 0.30563630809693193
godfrey 0.2840785409841493
aunt 0.2517972218755506
dear 0.22362791586109432
ablewhite 0.20260033138523487
which 0.19439841508607547
. 0.1793568699493528
me 0.16722308784032766
? 0.1664868114811185
bruff 0.14715259442612658
rachel 0.13792359136233168
be 0.1358510308172969
clack 0.133266731488864
to 0.13055750276120756
not 0.11583947816151018
by 0.11576953929629424
is 0.11332884057165499
of 0.10531846752790189


## Word Frequencies as a Pandas Data Frame

That was the long way of doing it. Now here is a slightly easier way, using Pandas. First, make a new Pandas DataFrame object, and give it a list of all of our counts. I'm also giving it some labels, so that the table is easier to read. 

In [160]:
frequencies = pd.DataFrame([clackCounts, betCounts, bruffCounts, moonstoneCounts], 
                          index = ['Clack', 'Betteredge', 'Bruff', 'All'])

When a word is missing from one of the dictionaries, Pandas has no count to put in that cell, so it fills it with "NaN," or "not a number." We know that if a word isn't in a dictionary, it doesn't appear in that character's narrative, so we can replace these with zero (and that will make our calculations easier below). 

In [223]:
frequencies = frequencies.fillna(0)

Transpose it! Just because columns are easier to work with than rows. 

In [225]:
frequencies = frequencies.T

We can divide by the total number of tokens in each speaker's narrative to transform the raw counts into proportions. We can multiply this by 100 to make it a little easier to read (and to make it seem more like a percentage). 

In [163]:
frequencies['clackP'] = (frequencies['Clack'] / len(clackTokens)) * 100
frequencies['betP'] = (frequencies['Betteredge'] / len(betTokens)) * 100
frequencies['bruffP'] = (frequencies['Bruff'] / len(bruffTokens)) * 100
frequencies['allP'] = (frequencies['All'] / len(moonstoneTokens)) * 100

Now we can define the distinctiveness of a word for a character by looking at how much more often that character uses it than the novel as a whole does: 

In [226]:
frequencies['clackness'] = frequencies['clackP'] - frequencies['allP']
frequencies['bruffness'] = frequencies['bruffP'] - frequencies['allP']
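Here is the whole Pandas pipeline again as a sketch on toy data, small enough to check by hand. The two "speakers" and the column names are invented for this example, and the `All` row here is just the sum of the two toy counters, which is an assumption made for illustration (in the notebook, `All` comes from tokenizing the whole novel).

```python
from collections import Counter
import pandas as pd

toyA = Counter("dear aunt dear me".split())
toyB = Counter("the the dear sir".split())
toyAll = toyA + toyB  # adding Counters sums their counts

# One row per Counter, NaN -> 0, then transpose so words are rows.
toyFreqs = pd.DataFrame([toyA, toyB, toyAll], index=['A', 'B', 'All']).fillna(0).T

# Raw counts -> percentages of each text.
toyFreqs['aP'] = toyFreqs['A'] / sum(toyA.values()) * 100
toyFreqs['allP'] = toyFreqs['All'] / sum(toyAll.values()) * 100

# Distinctiveness of each word for speaker A.
toyFreqs['aness'] = toyFreqs['aP'] - toyFreqs['allP']

ranked = toyFreqs.sort_values('aness', ascending=False)
print(ranked)
```

For `dear`, speaker A's share is 2/4 = 50% and the overall share is 3/8 = 37.5%, so its score is 12.5. The word `the`, which A never uses, ends up at the bottom with -25.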

Sorting for "bruffness," for instance, we see the words distinctive of Bruff: 

In [227]:
frequencies.sort_values('bruffness', ascending=False)

| | Clack | Betteredge | Bruff | All | clackP | betP | bruffP | allP | clackness | bruffness |
|---|---|---|---|---|---|---|---|---|---|---|
| of | 854.0 | 2135.0 | 395.0 | 5603.0 | 2.329578 | 2.224260 | 3.246220 | 2.370296 | -0.040718 | 0.875924 |
| the | 1592.0 | 4838.0 | 671.0 | 12166.0 | 4.342726 | 5.040266 | 5.514464 | 5.146710 | -0.803984 | 0.367754 |
| had | 294.0 | 803.0 | 141.0 | 1959.0 | 0.801986 | 0.836572 | 1.158777 | 0.828736 | -0.026750 | 0.330041 |
| to | 1084.0 | 2713.0 | 391.0 | 6955.0 | 2.956982 | 2.826424 | 3.213346 | 2.942247 | 0.014735 | 0.271100 |
| was | 350.0 | 958.0 | 149.0 | 2353.0 | 0.954745 | 0.998052 | 1.224523 | 0.995414 | -0.040669 | 0.229109 |
| that | 377.0 | 1069.0 | 164.0 | 2647.0 | 1.028397 | 1.113692 | 1.347798 | 1.119788 | -0.091391 | 0.228009 |
| i | 943.0 | 2044.0 | 326.0 | 5818.0 | 2.572356 | 2.129455 | 2.679158 | 2.461249 | 0.111107 | 0.217909 |
| indians | 6.0 | 52.0 | 32.0 | 120.0 | 0.016367 | 0.054174 | 0.262985 | 0.050765 | -0.034398 | 0.212220 |
| verinder | 64.0 | 73.0 | 40.0 | 291.0 | 0.174582 | 0.076052 | 0.328731 | 0.123105 | 0.051477 | 0.205626 |
| would | 67.0 | 119.0 | 44.0 | 391.0 | 0.182765 | 0.123975 | 0.361604 | 0.165409 | 0.017357 | 0.196195 |


Notice that `.` is the token that is least distinctive of Bruff (it falls at the very bottom of this sorted table). Is this because he has longer sentences? Let's test that theory. Here I'll create a list of sentence lengths, measured in characters.

In [169]:
bruffSentLens = [len(sent) for sent in nltk.sent_tokenize(bruff)]

...and then find the average of all of them. 

In [170]:
sum(bruffSentLens)/len(bruffSentLens)

124.6065934065934

Since this is something I'm going to want to do repeatedly, I can abstract this into a function:

In [228]:
def averageSentLen(text): 
    sentLengths = [len(sent) for sent in nltk.sent_tokenize(text)]
    return sum(sentLengths)/len(sentLengths)

In [172]:
averageSentLen(bruff)

124.6065934065934

In [174]:
averageSentLen(clack)

100.13305489260144

In [175]:
averageSentLen(bet)

112.3322818086225

Yep. Measured in characters, Bruff has the longest average sentences of anyone, so far. 
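The function above counts characters per sentence, since `len()` of a string is its number of characters. If we wanted words per sentence instead, something like the sketch below might work. To keep it runnable without NLTK data, it uses a crude regular expression as a stand-in for `nltk.sent_tokenize`, and `averageSentLenWords` is a hypothetical name; in the notebook you would probably swap the real sentence tokenizer back in.

```python
import re

def averageSentLenWords(text):
    # Crude stand-in for nltk.sent_tokenize: split wherever a
    # ., ! or ? is followed by whitespace.
    sentences = [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]
    # Count whitespace-separated words in each sentence.
    wordCounts = [len(s.split()) for s in sentences]
    return sum(wordCounts) / len(wordCounts)

# A made-up sample with sentences of 7, 5, and 2 words.
sample = "I am indebted to my dear parents. They are both in heaven! Are they?"
print(averageSentLenWords(sample))
```

The regular expression will be fooled by abbreviations like "Mr." that end in a period, which is one reason to prefer the NLTK tokenizer on the real text.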