# Words, Tokens, Stems, Lemmas

## and the NLTK

The NLTK is the Natural Language Toolkit, a Python package for natural language processing. To use it, we import the package like this: 

In [1]:
import nltk

In addition to importing the NLTK, we also need to download the data it relies on, such as the tokenizer models for English. The `book` collection bundles the resources used in the NLTK book: 

In [None]:
nltk.download('book')

Also, make sure that we're in the directory where `moonstone.md` exists, or else we won't be able to load the file. 

In [2]:
%cd ..

/home/jon/Code/course-computational-literary-analysis


This will likely be different on your system, so don't just run this command blindly!

In [3]:
%ls

[0m[01;34mHomework[0m/  [01;34mHW1[0m/  LICENSE  moonstone.md  [01;34mNotes[0m/  README.md


Note: as someone pointed out in the course chatroom, on Windows Python doesn't always open files as UTF-8 by default, so we have to specify the encoding explicitly: 

In [4]:
moonstoneRaw = open('moonstone.md', encoding="UTF-8").read()
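A slightly more defensive way to read a file is to use a `with` block, so that the file is closed as soon as the text has been read. The sketch below assumes a small temporary file standing in for `moonstone.md`; the names `sample_path` and `sample_raw` are made up for this illustration.

```python
import tempfile
from pathlib import Path

# Hypothetical stand-in for moonstone.md, so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "sample.md"
    sample_path.write_text("my lady’s nephew", encoding="utf-8")

    # The with block closes the file once the text has been read.
    with open(sample_path, encoding="utf-8") as f:
        sample_raw = f.read()

print(sample_raw)
```

Passing `encoding="utf-8"` on both the write and the read keeps curly quotes like ’ intact on any platform.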

Note that I've prepared my text ahead of time by marking the beginnings and ends of the Betteredge and Clack sections with `%%%%%`. (This is an arbitrary mark, and doesn't really mean anything.) This allows me to split the text like this: 

In [5]:
moonstoneParts = moonstoneRaw.split('%%%%%')

In [6]:
len(moonstoneParts)

5
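Why five parts? Marking the beginning and end of two sections takes four markers, and splitting on four markers yields five pieces, with the marked sections at the odd indices. A toy string, invented purely for illustration, shows the pattern:

```python
# Toy text: four markers produce five parts.
toy_raw = "preface%%%%%first narrator%%%%%bridge%%%%%second narrator%%%%%rest"
toy_parts = toy_raw.split("%%%%%")

print(len(toy_parts))   # number of pieces
print(toy_parts[1])     # first marked section
print(toy_parts[3])     # second marked section
```

This is why the narratives examined next sit at indices 1 and 3.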

What does the second part look like? 

In [7]:
print(moonstoneParts[1][:500])



### Chapter I

In the first part of ROBINSON CRUSOE, at page one hundred and
twenty-nine, you will find it thus written:

“Now I saw, though too late, the Folly of beginning a Work before we
count the Cost, and before we judge rightly of our own Strength to go
through with it.”

Only yesterday, I opened my ROBINSON CRUSOE at that place. Only this
morning (May twenty-first, Eighteen hundred and fifty), came my lady’s
nephew, Mr. Franklin Blake, and held a short conversation with me, as
follows:


Yep, that's Betteredge. Now how about the fourth part? 

In [8]:
print(moonstoneParts[3][:500])

 

### Chapter I

I am indebted to my dear parents (both now in heaven) for having had
habits of order and regularity instilled into me at a very early age.

In that happy bygone time, I was taught to keep my hair tidy at all
hours of the day and night, and to fold up every article of my clothing
carefully, in the same order, on the same chair, in the same place at
the foot of the bed, before retiring to rest. An entry of the day’s
events in my little diary invariably preceded the folding up. Th


And that's certainly Miss Clack. Let's assign these both to variables. 

In [9]:
betteredge = moonstoneParts[1]
clack = moonstoneParts[3]

## Tokens and Tokenizing

Tokens are word-like objects. Punctuation marks and parts of words, like "ca" and "n't", are also considered tokens by some tokenizers. Let's make a test sentence, and try to tokenize it. 

In [10]:
testSentence = """
I am indebted to my dear parents (both now in heaven) 
for having had habits of order and regularity 
instilled into me at a very early age."""

In [11]:
print(testSentence)


I am indebted to my dear parents (both now in heaven) 
for having had habits of order and regularity 
instilled into me at a very early age.


We'll use the NLTK function `word_tokenize()`: 

In [12]:
nltk.word_tokenize(testSentence)

['I',
 'am',
 'indebted',
 'to',
 'my',
 'dear',
 'parents',
 '(',
 'both',
 'now',
 'in',
 'heaven',
 ')',
 'for',
 'having',
 'had',
 'habits',
 'of',
 'order',
 'and',
 'regularity',
 'instilled',
 'into',
 'me',
 'at',
 'a',
 'very',
 'early',
 'age',
 '.']

How many tokens did it find? 

In [13]:
len(nltk.word_tokenize(testSentence))

30

Let's try another sentence. 

In [14]:
nltk.word_tokenize("An entry of the day’s events in my little diary invariably preceded the folding up.")

['An',
 'entry',
 'of',
 'the',
 'day',
 '’',
 's',
 'events',
 'in',
 'my',
 'little',
 'diary',
 'invariably',
 'preceded',
 'the',
 'folding',
 'up',
 '.']

Notice what happens there with "day’s"? The curly apostrophe (’) becomes a token of its own, and the "s" is split off as well. What if our sentence contains a contraction? 

In [15]:
nltk.word_tokenize("I can't believe this!")

['I', 'ca', "n't", 'believe', 'this', '!']
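For contrast, here is a rough regular-expression tokenizer. It is only a sketch to show why contractions are tricky, not the algorithm NLTK uses, and `rough_tokenize` is a hypothetical helper name.

```python
import re

def rough_tokenize(text):
    # Match runs of word characters, or any single character that is
    # neither a word character nor whitespace (i.e. punctuation).
    return re.findall(r"\w+|[^\w\s]", text)

print(rough_tokenize("I can't believe this!"))
# ['I', 'can', "'", 't', 'believe', 'this', '!']
```

The naive pattern breaks "can't" into three pieces, whereas `word_tokenize()` keeps the negation together as "n't".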

### Stems and Stemming

To stem a word, we first have to instantiate, or make a fresh copy of, our stemmer object: 

In [16]:
stemmer = nltk.stem.LancasterStemmer()

Now let's test it on three different forms of the same stem: 

In [17]:
for word in ["believe", "belief", "believing"]:
    print(stemmer.stem(word))

believ
believ
believ


In [32]:
stemmer.stem("believe")

'believ'

### Lemmas and Lemmatizers

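A lemma is the "dictionary form" of a word, so the lemma for "jumps" is "jump." Lemmatizing often doesn't transform the text as much as stemming. By default, `lemmatize()` treats each word as a noun; passing a part of speech, such as `pos='v'` for verbs, can change the result. First, instantiate the lemmatizer: 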

In [18]:
lemmatizer = nltk.stem.WordNetLemmatizer()

In [35]:
lemmatizer.lemmatize("believe")

'believe'

In [36]:
for word in ["believe", "belief", "believing"]:
    print(lemmatizer.lemmatize(word))

believe
belief
believing


In [37]:
for word in ["happy", "happier", "happiest"]:
    print(lemmatizer.lemmatize(word))

happy
happier
happiest


In [39]:
for word in ["jumps", "jumping", "jump"]:
    print(lemmatizer.lemmatize(word))

jump
jumping
jump


In [41]:
testTokens = nltk.word_tokenize(testSentence)

### Application: Comparing Miss Clack with Betteredge 

First, tokenize each text: 

In [22]:
clackTokens = nltk.word_tokenize(clack)
betteredgeTokens = nltk.word_tokenize(betteredge)

Now compare the lengths of each: 

In [23]:
len(clackTokens), len(betteredgeTokens)

(36247, 94899)

Convert each token into its stem: 

In [24]:
clackStems = []
for word in clackTokens: 
    stem = stemmer.stem(word)
    clackStems.append(stem)

In [25]:
betteredgeStems = []
for word in betteredgeTokens: 
    stem = stemmer.stem(word)
    betteredgeStems.append(stem)

Now let's create a word frequency table for each of these collections of stems. 

In [26]:
clackStemsDict = {}
for stem in clackStems:
    # If our stem is not already in the dictionary, 
    # it has a frequency of one. 
    if stem not in clackStemsDict: 
        clackStemsDict[stem] = 1
    else: 
        # Otherwise, increase the count by one. 
        clackStemsDict[stem] = clackStemsDict[stem] + 1

In [27]:
betteredgeStemsDict = {}
for stem in betteredgeStems:
    # If our stem is not already in the dictionary, 
    # it has a frequency of one. 
    if stem not in betteredgeStemsDict: 
        betteredgeStemsDict[stem] = 1
    else: 
        betteredgeStemsDict[stem] = betteredgeStemsDict[stem] + 1
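The same frequency table can also be built with `collections.Counter` from the standard library. This is a minimal sketch on a stand-in list (`toy_stems` is invented for illustration), not a replacement for the loops above, which show what the counting actually involves.

```python
from collections import Counter

# Stand-in list of stems, invented for illustration.
toy_stems = ["believ", "!", "the", "believ", "!", "!"]

toy_counts = Counter(toy_stems)

print(toy_counts["!"])            # how often "!" occurs
print(toy_counts["zzz"])          # a missing key counts as 0, no KeyError
print(toy_counts.most_common(2))  # the two most frequent stems
```

A `Counter` behaves like a dictionary, so it can be indexed just like the hand-built tables.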

In [28]:
len(clackTokens), len(betteredgeTokens)

(36247, 94899)

Let's compare the proportions of exclamation marks used by each. We're dividing by the total number of tokens in each, so that we're dealing with proportions, rather than raw counts: 

In [30]:
print(clackStemsDict['!'] / len(clackTokens)) 
print(betteredgeStemsDict['!'] / len(betteredgeTokens))

0.0068419455403205785
0.003530068809997998


Looks like Miss Clack uses exclamation points (!) about twice as often as Betteredge!!!
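The "about twice" claim can be checked by dividing one proportion by the other. The sketch below reuses the two numbers printed above. The helper `relative_frequency` is a hypothetical convenience, added on the assumption that we want 0 rather than a `KeyError` when a token never occurs.

```python
def relative_frequency(counts, token, total):
    # .get() returns 0 for tokens that are absent from the table.
    return counts.get(token, 0) / total

# Proportions copied from the output above.
clack_proportion = 0.0068419455403205785
betteredge_proportion = 0.003530068809997998

ratio = clack_proportion / betteredge_proportion
print(round(ratio, 2))

# Toy check of the helper on an invented table.
print(relative_frequency({"!": 3}, "!", 6))
print(relative_frequency({"!": 3}, "?", 6))
```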

## Sentence Tokenization

We can also tokenize by sentences instead of words:

In [32]:
betteredgeSents = nltk.sent_tokenize(betteredge)
clackSents = nltk.sent_tokenize(clack)

In [33]:
clackSents[:5]

[' \n\n### Chapter I\n\nI am indebted to my dear parents (both now in heaven) for having had\nhabits of order and regularity instilled into me at a very early age.',
 'In that happy bygone time, I was taught to keep my hair tidy at all\nhours of the day and night, and to fold up every article of my clothing\ncarefully, in the same order, on the same chair, in the same place at\nthe foot of the bed, before retiring to rest.',
 'An entry of the day’s\nevents in my little diary invariably preceded the folding up.',
 'The\n“Evening Hymn” (repeated in bed) invariably followed the folding up.',
 'And the sweet sleep of childhood invariably followed the “Evening Hymn.”\n\nIn later life (alas!)']
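The sentences keep the line breaks of the source file, and the first one still carries the chapter heading. One possible cleanup, sketched here on a single sentence copied from the output above, is to collapse every run of whitespace into one space. The variable names are made up for this example.

```python
raw_sentence = "An entry of the day’s\nevents in my little diary invariably preceded the folding up."

# split() with no arguments splits on any whitespace, including newlines;
# joining with a single space collapses each run into one space.
clean_sentence = " ".join(raw_sentence.split())

print(clean_sentence)
```

This does not remove headings or repair questionable splits such as "In later life (alas!)"; it only normalizes spacing.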

Let's look at the lengths of a few sentences. Note that `len()` on a string counts characters, not words: 

In [34]:
for sent in betteredgeSents[100:105]: 
    print(len(sent))

123
100
279
80
76


...and we can build up lists of sentence lengths for each narrator: 

In [35]:
clackSentenceLengths = []
for sent in clackSents: 
    clackSentenceLengths.append(len(sent))

In [36]:
betteredgeSentenceLengths = []
for sent in betteredgeSents: 
    betteredgeSentenceLengths.append(len(sent))

Now we can find the average sentence length, in characters, for each narrator: 

In [37]:
sum(clackSentenceLengths)/len(clackSentenceLengths)

99.87649164677805

In [38]:
sum(betteredgeSentenceLengths)/len(betteredgeSentenceLengths)

112.04363827549948
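These averages are measured in characters, because `len()` on a string counts characters. To measure in words instead, one rough option is to split each sentence on whitespace first. The sketch below uses two stand-in sentences (`toy_sents` is invented for illustration) and `statistics.mean` from the standard library. Whitespace splitting is an assumption made for simplicity and leaves punctuation attached to words, so the counts differ from what `word_tokenize()` would give.

```python
from statistics import mean

# Stand-in sentences, taken from the examples earlier in these notes.
toy_sents = [
    "An entry of the day’s events in my little diary invariably preceded the folding up.",
    "I can't believe this!",
]

char_lengths = [len(sent) for sent in toy_sents]          # characters
word_lengths = [len(sent.split()) for sent in toy_sents]  # whitespace-separated words

print(mean(char_lengths))
print(mean(word_lengths))
```

The same two list comprehensions could be applied to `clackSents` and `betteredgeSents` to compare the narrators by words per sentence.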