# Intro to Low-Level NLP - Tokenization, Stopwords, Frequencies, Bigrams

NLTK needs some extra data files before it works properly, so we will have to stop and install them along the way. Let me know if you get an error using it.

In [19]:
import itertools
import nltk
import string
from nltk.corpus import stopwords

PUNCTUATION = string.punctuation

## Tokenization

Read in a file to use for practice.  The directory is one level above us now, in data/books.  You can add other files into the data directory if you want.

**This command will not work on Windows:**

In [2]:
!ls data/books/

Austen_Emma.txt       Melville_MobyDick.txt
Austen_Pride.txt      lovecraft.txt


In [7]:
with open("data/books/Austen_Emma.txt", errors="ignore") as handle:
    text = handle.read()

In [8]:
text[0:200]

'EMMA\nBY\nJANE AUSTEN\nVOLUME I\nCHAPTER I\nEmma Woodhouse, handsome, clever, and rich, with a comfortable home and\nhappy disposition, seemed to unite some of the best blessings of\nexistence; and had lived'

In [11]:
## if you don't want the newlines in there - replace them all with a space!
text = text.replace('\n', ' ')

In [12]:
text[0:100]

'EMMA BY JANE AUSTEN VOLUME I CHAPTER I Emma Woodhouse, handsome, clever, and rich, with a comfortabl'

### In order to use some features of the NLTK module, you need to download and install some things from it.

Tip: Some people have this little UI freeze up on them if they try to scroll using their mousepad instead of the scrollbar.  If this happens, restart your notebook kernel (see the menu "kernel" on top).  You will have to run the cells above again.


In [13]:
nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

### Look for the "python window" that was opened on your machine.  

* In the Corpora tab you need to get "stopwords."
* In the Models tab you need "Punkt Tokenizer Models."
* In Models you also need "Averaged Perceptron Tagger."

<img src="assets/nltk_downloader.png">
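
If the downloader window is awkward to use, the same resources can usually be fetched by name from a notebook cell instead. The sketch below assumes the resource ids `stopwords`, `punkt`, and `averaged_perceptron_tagger` correspond to the three items listed above; `download_nltk_resources` is a hypothetical helper name. Newer NLTK releases may ask for additional resources, in which case the `LookupError` message names the one to download.

```python
# Resource ids assumed to match the three downloader entries listed above.
NLTK_RESOURCES = ("stopwords", "punkt", "averaged_perceptron_tagger")

def download_nltk_resources(names=NLTK_RESOURCES):
    """Download each named NLTK resource; return {name: True/False} for success."""
    import nltk  # imported here so the helper can be defined before NLTK is needed
    return {name: nltk.download(name, quiet=True) for name in names}

# Usage (needs a network connection):
# download_nltk_resources()
```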

### After you download these files, you must close the dialog using the X in the upper left corner, in order to get back to the notebook cells.

The process of breaking up a text into smaller pieces that you can count (or analyze) is called **"tokenizing."**  Here we break it up into sentences using a sentence tokenizer.

In [15]:
## Breaking it up by sentence! 
sentences = nltk.sent_tokenize(text)
sentences[20:30]

['She dearly loved her father, but he was no companion for her.',
 'He could not meet her in conversation, rational or playful.',
 'The evil of the actual disparity in their ages (and Mr. Woodhouse had not married early) was much increased by his constitution and habits; for having been a valetudinarian all his life, without activity of mind or body, he was a much older man in ways than in years; and though everywhere beloved for the friendliness of his heart and his amiable temper, his talents could not have recommended him at any time.',
 'Her sister, though comparatively but little removed by matrimony, being settled in London, only sixteen miles off, was much beyond her daily reach; and many a long October and November evening must be struggled through at Hartfield, before Christmas brought the next visit from Isabella and her husband, and their little children, to fill the house, and give her pleasant society again.',
 'Highbury, the large and populous village, almost amounting to

In [16]:
# How many sentences?
len(sentences)

7497

**"Word" tokenization** is much more common and useful.  But there are lots of types of word tokenization, depending on how you handle punctuation.

In [18]:
tokens = nltk.word_tokenize(text)
tokens[20:50]

['comfortable',
 'home',
 'and',
 'happy',
 'disposition',
 ',',
 'seemed',
 'to',
 'unite',
 'some',
 'of',
 'the',
 'best',
 'blessings',
 'of',
 'existence',
 ';',
 'and',
 'had',
 'lived',
 'nearly',
 'twenty-one',
 'years',
 'in',
 'the',
 'world',
 'with',
 'very',
 'little',
 'to']

In [19]:
# Notice the difference here:
nltk.wordpunct_tokenize(text)[23:50]

['happy',
 'disposition',
 ',',
 'seemed',
 'to',
 'unite',
 'some',
 'of',
 'the',
 'best',
 'blessings',
 'of',
 'existence',
 ';',
 'and',
 'had',
 'lived',
 'nearly',
 'twenty',
 '-',
 'one',
 'years',
 'in',
 'the',
 'world',
 'with',
 'very']

There are other options for tokenization in NLTK.  You can test some out here: http://text-processing.com/demo/tokenize/
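
To see why tokenizers disagree, here is a minimal sketch that uses only Python's `re` module rather than NLTK. The two patterns are illustrative assumptions: the first splits on every run of punctuation (similar in spirit to `wordpunct_tokenize`), the second keeps hyphenated words together (closer to what `word_tokenize` did with `twenty-one` above). Neither reproduces NLTK's tokenizers exactly.

```python
import re

# Pattern 1: runs of word characters, or runs of non-space punctuation.
SPLIT_ALL_PUNCT = re.compile(r"\w+|[^\w\s]+")
# Pattern 2: the same, but a word may contain internal hyphens.
KEEP_HYPHENS = re.compile(r"\w+(?:-\w+)*|[^\w\s]+")

sample = "had lived nearly twenty-one years in the world"

split_tokens = SPLIT_ALL_PUNCT.findall(sample)
hyphen_tokens = KEEP_HYPHENS.findall(sample)

print(split_tokens)   # 'twenty', '-', 'one' come out as three tokens
print(hyphen_tokens)  # 'twenty-one' stays as one token
```

Which behavior you want depends on what you plan to count, so it is worth printing a slice of tokens before trusting any totals.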

## Counting Words (First Attempt)

Now let's break up a text and count some stuff.

In [20]:
tokens = nltk.word_tokenize(text)
tokens[0:15]

['EMMA',
 'BY',
 'JANE',
 'AUSTEN',
 'VOLUME',
 'I',
 'CHAPTER',
 'I',
 'Emma',
 'Woodhouse',
 ',',
 'handsome',
 ',',
 'clever',
 ',']

In [21]:
# this is how many tokens we have:
len(tokens)

191739

### What's one way we learned to count things?  Using the Counter from collections.

In [22]:
from collections import Counter

In [24]:
words = Counter(tokens)

In [26]:
words.most_common(20)

[(',', 12016),
 ('.', 6359),
 ('to', 5124),
 ('the', 4842),
 ('and', 4652),
 ('of', 4272),
 ('I', 3164),
 ('--', 3097),
 ('a', 3001),
 ('was', 2383),
 ('her', 2360),
 (';', 2353),
 ('not', 2242),
 ("''", 2189),
 ('in', 2103),
 ('it', 2101),
 ('``', 1998),
 ('be', 1965),
 ('she', 1774),
 ('that', 1728)]

Now let's do the same thing with another book.  Let's make a function to make this easier:

In [27]:
def get_most_common(filename):
    # Takes a path to a file, splits it up into tokens, and returns a Counter of the tokens
    # (or None if the file was empty).
    from collections import Counter
    text = None
    mycounts = None
    
    with open(filename, errors="ignore") as handle:
        text = handle.read()
    if text:
        text = text.replace("\n", " ")
        words = nltk.word_tokenize(text)
        mycounts = Counter(words)
    return mycounts

### Call this with a path to your book file.

In [28]:
counts = get_most_common("data/books/Austen_Pride.txt")

In [None]:
counts

In [32]:
if counts:
    for word,count in counts.most_common(15):
        print("%s\t%s" % (word,count))
else:
    print("Something wrong with your path, no file read, so no word counts.")

,	9117
.	5034
to	4081
the	4047
of	3588
and	3377
her	2139
I	2049
a	1891
was	1841
in	1778
``	1770
''	1739
;	1538
that	1514


In [33]:
counts = get_most_common("data/books/lovecraft.txt")
for word,count in counts.most_common(10):
    print("%s\t%s" % (word,count))

the	734
,	650
of	440
and	439
.	329
a	248
to	218
in	215
I	176
was	155


In [34]:
counts = get_most_common("data/books/Melville_MobyDick.txt")
for word,count in counts.most_common(10):
    print("%s\t%s" % (word,count))

,	18923
the	13522
.	7064
of	6402
and	5931
a	4465
to	4444
;	4143
in	3830
that	2942


These aren't very interesting top words.  Very different books look quite similar when you look at the most common tokens. The problem is that these are common words and punctuation, and they appear in all books the most frequently. You will read about Zipf's law in the reading this week.
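
To put a number on that, the sketch below reuses the top-20 counts and the token total printed for Emma earlier in this notebook and computes what share of the text those 20 tokens cover. No new data is introduced; the variable names are just for illustration.

```python
# Top-20 token counts for Emma, copied from the most_common(20) output above.
emma_top20 = [12016, 6359, 5124, 4842, 4652, 4272, 3164, 3097, 3001, 2383,
              2360, 2353, 2242, 2189, 2103, 2101, 1998, 1965, 1774, 1728]
emma_total_tokens = 191739  # len(tokens) from above

share = sum(emma_top20) / emma_total_tokens
print("Top 20 tokens cover %.1f%% of all tokens" % (share * 100))
```

A little over a third of the book is made up of just 20 distinct tokens, which is why raw top-N lists look so similar from book to book.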

## Removing Punctuation

**There are 2 things we need to do to clean the word list.  One is remove punctuation, the other is remove stopwords.**

**Let's use a list comprehension to remove the punctuation first.  We can get a list of punctuation from the string library.**

In [35]:
tokens[4:16]

['VOLUME',
 'I',
 'CHAPTER',
 'I',
 'Emma',
 'Woodhouse',
 ',',
 'handsome',
 ',',
 'clever',
 ',',
 'and']

In [36]:
import string
punctuation = string.punctuation

In [33]:
# notice this is a string of characters - see the quotes around it?
punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [34]:
len(tokens)

191739

In [35]:
# remember we can do this with strings to test if they are substrings:
'?' in punctuation

True

In [36]:
"a" in punctuation

False

In [37]:
# this says: make my new list out of the elements in tokens if they aren't in punctuation.
no_punct = [word for word in tokens if word not in punctuation]

In [38]:
no_punct[4:16]

['VOLUME',
 'I',
 'CHAPTER',
 'I',
 'Emma',
 'Woodhouse',
 'handsome',
 'clever',
 'and',
 'rich',
 'with',
 'a']

In [39]:
len(tokens) - len(no_punct)

22936

In [18]:
# let's make a function.  I am assuming punctuation is a global variable, defined outside.
# Frequently globals and constants are assigned at the top of your file with capital letters:

import string

PUNCTUATION = string.punctuation

def remove_punct(wordlist):
    return [word for word in wordlist if word not in PUNCTUATION]

Let's edit our function for the most common words and add this into it.

In [40]:
# Let's put printing inside the function right now:
def print_most_common(filename, count=10):
    # Takes a path to a file, splits it up into tokens, and prints the "count" most common.
    # This version removes punctuation.
    from collections import Counter
    text = None
    mycounts = None
    
    try:
        with open(filename, errors="ignore") as handle:
            text = handle.read()
            text = text.replace("\n", " ")
            words = nltk.word_tokenize(text)
            
            print("Total Length", len(words))
            words = remove_punct(words)  # new stuff!
            print("Total Length After Punct Removal", len(words))
            
            mycounts = Counter(words)
            print("Word\tCount")
            # use a different name for the loop variable so it doesn't shadow the "count" argument
            for word, freq in mycounts.most_common(count):
                print("%s\t%s" % (word, freq))
    except:
        print("Something is wrong with your file location.")

In [43]:
print_most_common("data/books/Austen_Emma.txt",15)

Total Length 191739
Total Length After Punct Removal 168803
Word	Count
to	5124
the	4842
and	4652
of	4272
I	3164
--	3097
a	3001
was	2383
her	2360
not	2242
''	2189
in	2103
it	2101
``	1998
be	1965


In [44]:
print_most_common("data/books/lovecraft.txt",15)

Total Length 12025
Total Length After Punct Removal 10912
Word	Count
the	734
of	440
and	439
a	248
to	218
in	215
I	176
was	155
had	131
that	128
which	95
my	93
with	93
it	90
--	85


So what happened there? Any theories?
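
One likely explanation: `word not in punctuation` is a substring test against one long string. A single character such as `,` is found in it, but a two-character token such as `--` is not, because the string contains only one hyphen. A minimal sketch of a stricter check follows; `is_all_punct` and `remove_punct_strict` are hypothetical names, not functions used elsewhere in this notebook.

```python
import string

PUNCT_CHARS = set(string.punctuation)

def is_all_punct(token):
    """True if the token is non-empty and every character is punctuation."""
    return bool(token) and all(ch in PUNCT_CHARS for ch in token)

def remove_punct_strict(wordlist):
    """Drop tokens made up entirely of punctuation characters."""
    return [word for word in wordlist if not is_all_punct(word)]

sample = ["Emma", ",", "--", "``", "''", "twenty-one", "'s", "."]
print("--" in string.punctuation)    # the substring test misses the double hyphen
print(remove_punct_strict(sample))   # multi-character punctuation tokens are dropped too
```

Note that `'s` survives because it contains a letter, so a custom removal list is still useful for tokens like that.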

### StopWords

"Stopwords" are words that are usually excluded because they are common connectors (or determiners, or short verbs) that are not considered to carry meaning. **BEWARE**: Always check stopword lists to see if you agree with their contents!

In [10]:
from nltk.corpus import stopwords
english_stops = stopwords.words('english')

Notice they are lowercase.  This means we need to be sure we lowercase our text if we want to match against them.

In [50]:
english_stops

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 'her',
 'hers',
 'herself',
 'it',
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each',
 'few',
 'more',
 'most',
 'other',
 'some',
 'such',
 'no',
 'nor',
 '

In [52]:
french_stops = stopwords.words('french')
french_stops

['au',
 'aux',
 'avec',
 'ce',
 'ces',
 'dans',
 'de',
 'des',
 'du',
 'elle',
 'en',
 'et',
 'eux',
 'il',
 'je',
 'la',
 'le',
 'leur',
 'lui',
 'ma',
 'mais',
 'me',
 'même',
 'mes',
 'moi',
 'mon',
 'ne',
 'nos',
 'notre',
 'nous',
 'on',
 'ou',
 'par',
 'pas',
 'pour',
 'qu',
 'que',
 'qui',
 'sa',
 'se',
 'ses',
 'son',
 'sur',
 'ta',
 'te',
 'tes',
 'toi',
 'ton',
 'tu',
 'un',
 'une',
 'vos',
 'votre',
 'vous',
 'c',
 'd',
 'j',
 'l',
 'à',
 'm',
 'n',
 's',
 't',
 'y',
 'été',
 'étée',
 'étées',
 'étés',
 'étant',
 'étante',
 'étants',
 'étantes',
 'suis',
 'es',
 'est',
 'sommes',
 'êtes',
 'sont',
 'serai',
 'seras',
 'sera',
 'serons',
 'serez',
 'seront',
 'serais',
 'serait',
 'serions',
 'seriez',
 'seraient',
 'étais',
 'était',
 'étions',
 'étiez',
 'étaient',
 'fus',
 'fut',
 'fûmes',
 'fûtes',
 'furent',
 'sois',
 'soit',
 'soyons',
 'soyez',
 'soient',
 'fusse',
 'fusses',
 'fût',
 'fussions',
 'fussiez',
 'fussent',
 'ayant',
 'ayante',
 'ayantes',
 'ayants',
 'eu'

**How would we get the French stopwords and look at them?**

To remove stopwords efficiently, we need to make sure everything is lower case. This will also mean our counts are better in general. Words at the beginning of a sentence are still the same as words in the middle, just capitalized.  If we want good counts, they must all look the same.

In [53]:
lowercase = [token.lower() for token in tokens]

In [56]:
tokens[3].lower()

'austen'

In [57]:
lowercase[0:10]

['emma',
 'by',
 'jane',
 'austen',
 'volume',
 'i',
 'chapter',
 'i',
 'emma',
 'woodhouse']

In [58]:
print(len(lowercase), len(tokens))

191739 191739


In [59]:
# try building this from `tokens` instead of `lowercase` and check the size!
# We are using a python list comprehension to remove the tokens from Emma (after lowercasing them!) that are stopwords
nostops = [token for token in lowercase if token not in english_stops]
len(nostops)

104376

In [60]:
# now look at the first 15 words:
nostops[0:15]

['emma',
 'jane',
 'austen',
 'volume',
 'chapter',
 'emma',
 'woodhouse',
 ',',
 'handsome',
 ',',
 'clever',
 ',',
 'rich',
 ',',
 'comfortable']

In [17]:

def remove_stops(wordlist, stops):
    # takes a list of words and a list of stopwords; filters out stopwords after lowercasing all.
    # (the argument is called "stops" so it doesn't hide the nltk "stopwords" import)
    lowercase = [word.lower() for word in wordlist]
    return [word for word in lowercase if word not in stops]

In [None]:
remove_stops(tokens, english_stops)
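
The list comprehension above checks every token against a list, which means scanning that list once per token. Converting the stopwords to a `set` first makes each lookup much cheaper on a long text. A minimal sketch, with a tiny made-up stopword list standing in for `english_stops` and a hypothetical function name `remove_stops_fast`:

```python
def remove_stops_fast(wordlist, stops):
    """Lowercase every word, then drop those found in stops (converted to a set)."""
    stop_set = set(stops)  # one-time conversion; set membership tests are fast
    lowercase = (word.lower() for word in wordlist)
    return [word for word in lowercase if word not in stop_set]

demo_stops = ["i", "the", "and", "by"]  # stand-in for stopwords.words('english')
demo_tokens = ["EMMA", "BY", "JANE", "AUSTEN", "VOLUME", "I", "and", "The", "village"]
print(remove_stops_fast(demo_tokens, demo_stops))
```

The result is the same as with a list; only the speed changes.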

Let's update our function again to remove stopwords along with making it all lowercase and removing punctuation.

In [7]:
# Let's rename it again
def print_most_common3(filename, stops, count=10):
    # Takes a path to a file, stopwords list, splits it up into tokens, and prints the 
    # "count" most common.  Count will default to 10 if not specified.
    # This version removes punctuation and stopwords (after lowercasing).
    from collections import Counter
    text = None
    mycounts = None
    try:
        with open(filename, errors="ignore") as handle:
            text = handle.read()
            text = text.replace("\n", " ")
            words = nltk.word_tokenize(text)
            
            print("Length Before Cleaning", len(words))
            words = remove_punct(words)
            words = remove_stops(words, stops)
            print("Length After Cleaning", len(words))
            
            mycounts = Counter(words)
            print("Word\tCount")
            for word, freq in mycounts.most_common(count):
                print("%s\t%s" % (word, freq))
    except:
        print("Something is wrong with your file location.")

In [11]:
print_most_common3("data/books/Austen_Emma.txt", english_stops)

Length Before Cleaning 191739
Something is wrong with your file location.


In [12]:
# in this call, we specify the number for count.
print_most_common3("data/books/Melville_MobyDick.txt", english_stops, 15)

Length Before Cleaning 250560
Something is wrong with your file location.
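
The message above is misleading: the token count printed, so the file was found and read. A bare `except:` catches every error, not just a bad path, so a `NameError` (for example, if `remove_punct` or `remove_stops` had not been defined yet in the session) gets reported as a file problem. That is the most likely cause here, though the output alone does not confirm it. The sketch below reproduces the effect with hypothetical function names and a temporary file, and shows that catching only `OSError` lets the real error surface.

```python
import os
import tempfile

def count_words_catch_all(path):
    try:
        with open(path) as handle:
            words = handle.read().split()
        return len(undefined_cleaner(words))  # NameError: this helper does not exist
    except:  # catches everything, including the NameError
        return "Something is wrong with your file location."

def count_words_catch_oserror(path):
    try:
        with open(path) as handle:
            words = handle.read().split()
    except OSError:  # only problems opening or reading the file
        return "Something is wrong with your file location."
    return len(undefined_cleaner(words))  # the NameError is no longer hidden

# Write a small temporary file so the path is definitely valid.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("Emma Woodhouse, handsome, clever, and rich")
    tmp_path = tmp.name

print(count_words_catch_all(tmp_path))  # misleading message despite a valid path
try:
    count_words_catch_oserror(tmp_path)
except NameError as err:
    print("Real problem:", err)
os.remove(tmp_path)
```

If you see the file-location message right after a length was printed, run the cells that define the helper functions and try the call again.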


We can write a little function to allow us to remove any string we want, too.  This is basically a custom stopwords filter.

In [16]:
def remove_custom(wordlist, mylist):
    return [word for word in wordlist if word not in mylist]

In [75]:
# a list of words to also remove might start like this:
TO_REMOVE = ["--", "``", "'s", "''"]

In [76]:
# test it out:
remove_custom(['hi', '--', 'fred'], TO_REMOVE)

['hi', 'fred']

### So how would we use this and the other removal and cleaning functions?

In our functions, we are just doing things to our words list, one after another.

In [78]:
mywords = ['Hi', 'there', ',', 'Sally', '--', 'this', 'is', 'the', 'FBI', '.']

In [79]:
words = remove_punct(mywords)
words

['Hi', 'there', 'Sally', '--', 'this', 'is', 'the', 'FBI']

In [80]:
words = remove_stops(words, english_stops)
words

['hi', 'sally', '--', 'fbi']

In [81]:
words = remove_custom(words, TO_REMOVE)
words

['hi', 'sally', 'fbi']

How would we use it in our existing function?  We might need to change the arguments we call this with, to pass in the list of custom stopwords.  Edit this in class to add remove_custom and handle the new list of custom stopwords:

In [82]:
# Let's rename it again
def print_most_common4(filename, stops, customstops=None, count=10):
    # Takes a path to a file, a stopwords list, and an optional list of custom stopwords;
    # splits the file up into tokens, and prints the "count" most common.
    # This version removes punctuation, stopwords, and custom stopwords.
    from collections import Counter
    if not customstops:
        customstops = []

    try:
        with open(filename, errors="ignore") as handle:
            text = handle.read()
            text = text.replace("\n", " ")
            words = nltk.word_tokenize(text)

            words = remove_punct(words)
            words = remove_stops(words, stops)
            words = remove_custom(words, customstops)

            mycounts = Counter(words)
            print("Word\tCount")
            for word, freq in mycounts.most_common(count):
                print("%s\t%s" % (word, freq))
    except OSError:
        # only file problems are reported here; any other error is raised as usual
        print("Something is wrong with your file location.")

So now when you call that, you should be able to see a list you can live with.

In [84]:
# here's how we might call this after editing it... (this is a hint)
print_most_common4("data/books/Austen_Emma.txt", english_stops, customstops=TO_REMOVE, count=15)

Word	Count
mr.	1089
emma	860
could	836
would	818
mrs.	668
miss	597
must	566
harriet	500
much	484
said	483
one	447
weston	437
every	434
thing	394
elton	384


In [85]:
# here's how we might call this after editing it... (this is a hint)
print_most_common4("data/books/lovecraft.txt", english_stops, customstops=TO_REMOVE, count=15)

Word	Count
house	59
street	41
uncle	39
one	35
harris	34
could	29
seemed	27
would	22
might	21
door	20
thing	18
shunned	18
place	17
cellar	17
time	16


In [86]:
# here's how we might call this after editing it... (this is a hint)
print_most_common4("data/books/Melville_MobyDick.txt", english_stops, customstops=TO_REMOVE, count=15)

Word	Count
whale	1027
one	899
like	572
upon	560
ahab	511
man	496
ship	464
old	439
ye	433
would	430
though	380
sea	367
yet	344
time	324
captain	323


You might want to read about Python functions, positional arguments, and keyword-named arguments here: http://sys-exit.blogspot.fr/2013/07/python-positional-arguments-and-keyword.html 
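
As a quick illustration of the difference, here is a sketch with a stand-in function that has a similar parameter layout to `print_most_common4` but only reports what it received. `describe_call` is a hypothetical name used for this example only.

```python
def describe_call(filename, stops, customstops=None, count=10):
    """Return the arguments as a dict so we can see how they were bound."""
    return {"filename": filename, "stops": stops,
            "customstops": customstops, "count": count}

# Positional: values are matched to parameters by their order.
by_position = describe_call("book.txt", ["the"], ["--"], 15)
# Keyword: values are matched by name, so a default in the middle can be skipped.
by_keyword = describe_call("book.txt", ["the"], count=15)

print(by_position)
print(by_keyword)
```

In the second call `customstops` keeps its default of `None`, because `count` was passed by name rather than by position.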

### Refactoring 

Let's add a bunch of separate little functions now to pull apart the cleaning from the printing.  We might want to use that list of words for more than just a print.

In [87]:
def tokenize_text(path):
    # This takes a file path and word_tokenizes it, returning a list of words
    # (or None if the file cannot be read).
    try:
        with open(path, errors="ignore") as handle:
            text = handle.read()
    except OSError:
        return None
    text = text.replace("\n", " ")
    return nltk.word_tokenize(text)

In [89]:
def print_counts(tokens, count=10):
    # Takes a list of words, counts, prints top "count" words.
    from collections import Counter
    mycounts = Counter(tokens)
    print("Word\tCount")
    for word, freq in mycounts.most_common(count):
        print("%s\t%s" % (word, freq))

In [15]:
# Now let's define a small python function that's a pretty common one for text processing.

def clean_tokens(tokens, stops):
    """ Lowercases and takes out punctuation and stopwords. """
    words = remove_punct(tokens)
    words = remove_stops(words, stops)
    return words

In [91]:
emma = tokenize_text("data/books/Austen_Emma.txt")
emma[0:10]

['EMMA',
 'BY',
 'JANE',
 'AUSTEN',
 'VOLUME',
 'I',
 'CHAPTER',
 'I',
 'Emma',
 'Woodhouse']

In [92]:
emma_clean = clean_tokens(emma, english_stops)
emma_clean[0:10]

['emma',
 'jane',
 'austen',
 'volume',
 'chapter',
 'emma',
 'woodhouse',
 'handsome',
 'clever',
 'rich']

In [93]:
len(emma_clean)

81440

In [94]:
emma_cleaner = remove_custom(emma_clean, TO_REMOVE)

In [96]:
emma_cleaner[0:40]

['emma',
 'jane',
 'austen',
 'volume',
 'chapter',
 'emma',
 'woodhouse',
 'handsome',
 'clever',
 'rich',
 'comfortable',
 'home',
 'happy',
 'disposition',
 'seemed',
 'unite',
 'best',
 'blessings',
 'existence',
 'lived',
 'nearly',
 'twenty-one',
 'years',
 'world',
 'little',
 'distress',
 'vex',
 'youngest',
 'two',
 'daughters',
 'affectionate',
 'indulgent',
 'father',
 'consequence',
 'sister',
 'marriage',
 'mistress',
 'house',
 'early',
 'period']

In [97]:
len(emma_cleaner)

73230

In [98]:
print_counts(emma_cleaner)

Word	Count
mr.	1089
emma	860
could	836
would	818
mrs.	668
miss	597
must	566
harriet	500
much	484
said	483


What would you do now?

## Count Word Frequencies Using NLTK

In [99]:
# Another way to do this is with nltk.FreqDist, which creates an object with keys that are 
# the vocabulary, and values for the counts- just like a counter.

frequencies = nltk.FreqDist(emma_cleaner)

In [100]:
frequencies['mr.']

1089

In [101]:
frequencies.most_common(10)

[('mr.', 1089),
 ('emma', 860),
 ('could', 836),
 ('would', 818),
 ('mrs.', 668),
 ('miss', 597),
 ('must', 566),
 ('harriet', 500),
 ('much', 484),
 ('said', 483)]

In [None]:
# vocabulary size - unique tokens

In [None]:
frequencies.keys()
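
For the vocabulary-size cell above, one option is simply the number of distinct keys. `nltk.FreqDist` is a subclass of `collections.Counter`, so the sketch below uses a plain `Counter` on a toy token list; the same `len(...)` call should work on `frequencies`.

```python
from collections import Counter

toy_tokens = ["emma", "jane", "austen", "emma", "woodhouse", "emma", "jane"]
toy_counts = Counter(toy_tokens)

vocabulary_size = len(toy_counts)        # number of distinct tokens
total_tokens = sum(toy_counts.values())  # number of tokens overall
print(vocabulary_size, total_tokens)
```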

If you wanted to save the words and counts to a file to use, you can do it like this:

In [164]:
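import csv

wordpairs = frequencies.most_common()
# csv.writer quotes any token that itself contains a comma, so the file stays two columns wide
with open("emma_word_counts.csv", "w", newline="") as handle:
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerows(wordpairs)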

In [165]:
# this won't work on Windows - it's a unix command to look at the top of the file.
!head -n5 emma_word_counts.csv

mr.,1089
emma,860
could,836
would,818
mrs.,668


In [None]:
# How would we read that into pandas and make a graph of the top 10?

## Finding Most Common Pairs of Words ("Bigrams")

Words sometimes occur in common sequences.  We call word pairs "bigrams" (and word triples "trigrams"), and we say "N-grams" when we mean sequences of some given length.  These functions work on the tokenized text, that is, the list of tokens.  If we clean it first (punctuation and stopword removal), then the bigrams and trigrams may not look like good grammar.  But they may still be useful information.
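
Before handing the work to NLTK, it can help to see that bigrams are just adjacent pairs. A minimal sketch with the standard library only, using a toy token list (the names are illustrative):

```python
from collections import Counter

toy_tokens = ["mr.", "knightley", "said", "mr.", "knightley", "mrs.", "weston"]

# Pair each token with the one that follows it.
toy_bigrams = list(zip(toy_tokens, toy_tokens[1:]))
bigram_counts = Counter(toy_bigrams)

print(bigram_counts.most_common(1))
```

Raw counts like these favor pairs of frequent words, which is one reason the next cell scores pairs with a likelihood ratio instead.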

In [102]:
from nltk.collocations import BigramCollocationFinder, TrigramCollocationFinder
bigram_measures = nltk.collocations.BigramAssocMeasures()

word_fd = nltk.FreqDist(emma_cleaner) # all the words
bigram_fd = nltk.FreqDist(nltk.bigrams(emma_cleaner))
finder = BigramCollocationFinder(word_fd, bigram_fd)
scored = finder.score_ngrams(bigram_measures.likelihood_ratio) # a good option here, there are others:
scored[0:50]

[(('mr.', 'knightley'), 1946.6592454307174),
 (('mrs.', 'weston'), 1850.336542407057),
 (('frank', 'churchill'), 1606.6422087377366),
 (('mr.', 'elton'), 1322.480745223186),
 (('miss', 'woodhouse'), 1252.7696197262974),
 (('miss', 'bates'), 938.7056563680686),
 (('miss', 'fairfax'), 889.3547077244223),
 (('every', 'body'), 885.9210086539515),
 (('jane', 'fairfax'), 865.7102826733177),
 (('mrs.', 'elton'), 863.7876470227845),
 (('every', 'thing'), 822.5115947376371),
 (('mr.', 'weston'), 819.316541417629),
 (('young', 'man'), 738.0043606265148),
 (('mr.', 'woodhouse'), 709.2509559335367),
 (('great', 'deal'), 624.5844244998653),
 (('maple', 'grove'), 543.5640164601741),
 (('mrs.', 'goddard'), 539.8815046356124),
 (('dare', 'say'), 519.4899438008467),
 (('john', 'knightley'), 481.1441515680198),
 (('miss', 'taylor'), 455.8587834632044),
 (('miss', 'smith'), 449.07602962315485),
 (('robert', 'martin'), 423.1323714473208),
 (('colonel', 'campbell'), 382.65688903898683),
 (('box', 'hill'), 

In [103]:
# Trigrams - using raw counts is much faster than using a statistical measure of likelihood

finder = nltk.collocations.TrigramCollocationFinder.from_words(emma_cleaner,
    window_size = 15)
# there must be at least 2 for them to be reported:
finder.apply_freq_filter(2)
# if you want to remove extra words, like character names, you can create the ignored_words list too:
#finder.apply_word_filter(lambda w: len(w) < 3 or w.lower() in ignored_words)

In [104]:
trigram_measures = nltk.collocations.TrigramAssocMeasures()
# now we use the raw counts here and you'll see lots of garbage unless you did further cleaning
finder.nbest(trigram_measures.raw_freq, 20)

[('mr.', 'knightley', 'mr.'),
 ('mr.', 'elton', 'mr.'),
 ('mr.', 'mr.', 'knightley'),
 ('mr.', 'mr.', 'elton'),
 ('mr.', 'frank', 'churchill'),
 ('mr.', 'knightley', 'emma'),
 ('mrs.', 'weston', 'emma'),
 ('mr.', 'knightley', 'would'),
 ('emma', 'mr.', 'knightley'),
 ('mr.', 'weston', 'mr.'),
 ('mrs.', 'weston', 'would'),
 ('mr.', 'mrs.', 'weston'),
 ('could', 'mr.', 'knightley'),
 ('emma', 'mrs.', 'weston'),
 ('mr.', 'knightley', 'could'),
 ('mrs.', 'weston', 'mr.'),
 ('would', 'mr.', 'knightley'),
 ('elton', 'mr.', 'elton'),
 ('mr.', 'john', 'knightley'),
 ('mr.', 'mr.', 'weston')]

In [105]:
finder.score_ngrams(trigram_measures.raw_freq)[0:20]

[(('mr.', 'knightley', 'mr.'), 1.5006160028691778e-05),
 (('mr.', 'elton', 'mr.'), 1.3355482425535682e-05),
 (('mr.', 'mr.', 'knightley'), 1.2605174424101094e-05),
 (('mr.', 'mr.', 'elton'), 1.1104558421231916e-05),
 (('mr.', 'frank', 'churchill'), 1.0954496820944997e-05),
 (('mr.', 'knightley', 'emma'), 1.0054127219223492e-05),
 (('mrs.', 'weston', 'emma'), 9.904065618936573e-06),
 (('mr.', 'knightley', 'would'), 9.45388081807582e-06),
 (('emma', 'mr.', 'knightley'), 9.303819217788902e-06),
 (('mr.', 'weston', 'mr.'), 9.003696017215066e-06),
 (('mrs.', 'weston', 'would'), 8.853634416928149e-06),
 (('mr.', 'mrs.', 'weston'), 8.10332641549356e-06),
 (('could', 'mr.', 'knightley'), 7.953264815206642e-06),
 (('emma', 'mrs.', 'weston'), 7.953264815206642e-06),
 (('mr.', 'knightley', 'could'), 7.953264815206642e-06),
 (('mrs.', 'weston', 'mr.'), 7.653141614632807e-06),
 (('would', 'mr.', 'knightley'), 7.653141614632807e-06),
 (('elton', 'mr.', 'elton'), 7.503080014345889e-06),
 (('mr.', 'jo

In [None]:
## This is very slow!  Don't run unless you're serious :)

finder = TrigramCollocationFinder.from_words(emma_cleaner,
    window_size = 10)
finder.apply_freq_filter(2)
# if you want to remove extra words, like character names, you can create the ignored_words list too:
#finder.apply_word_filter(lambda w: len(w) < 3 or w.lower() in ignored_words)
#finder.apply_word_filter(lambda w: len(w) < 3)  # remove short words
# maximum likelihood ratio is a statistical measure different from just raw counts used above
finder.nbest(trigram_measures.likelihood_ratio, 15)

Some more help is here: http://www.nltk.org/howto/collocations.html

### What if we wanted to try non-fiction, to see if there are more interesting results?

We need to read and clean the text for another file.  Let's try positive movie reviews, located in data/movie_reviews/all_pos.txt.

In [13]:

with open("data/movie_reviews/all_pos.txt", errors="ignore") as handle:
    text = handle.read()

In [21]:
tokens = nltk.word_tokenize(text)  # tokenize them - split into words and punct
clean_posrevs = clean_tokens(tokens, english_stops)  # clean up stopwords and punct

In [22]:
clean_posrevs[0:20]

['rated',
 '4-star',
 'scale',
 'screening',
 'venue',
 'odoen',
 'liverpool',
 'city',
 'centre',
 'released',
 'uk',
 'uip',
 'april',
 '7',
 '2000',
 'certificate',
 '15',
 '126',
 'minutes',
 'country']

In [173]:
word_fd = nltk.FreqDist(clean_posrevs)
bigram_fd = nltk.FreqDist(nltk.bigrams(clean_posrevs))
finder = BigramCollocationFinder(word_fd, bigram_fd)
scored = finder.score_ngrams(bigram_measures.likelihood_ratio) # other measures are available, e.g. bigram_measures.pmi
scored[0:50]

[(('--', '--'), 276.37856128911676),
 (('martial', 'arts'), 172.9566013400519),
 (('green', 'mile'), 124.60518604561268),
 (('jackie', 'brown'), 114.33464649956649),
 (('high', 'fidelity'), 101.97315011972981),
 (('wonder', 'boys'), 91.96089134910119),
 (('mr', 'tarantino'), 87.87868213572972),
 (('http', '//www'), 78.12829220990892),
 (('pulp', 'fiction'), 78.12829220990892),
 (('mira', 'sorvino'), 75.91905463957946),
 (('beavis', 'butthead'), 70.98878226918048),
 (('hong', 'kong'), 70.98878226918048),
 (('irwin', 'winkler'), 70.98878226918048),
 (('mrs', 'pascal'), 70.98878226918048),
 (('ca', "n't"), 70.26383877215791),
 (('first', 'sight'), 69.93537990510472),
 (('good', 'hunting'), 69.39432459762817),
 (('replacement', 'killers'), 65.9847580337986),
 (('robin', 'williams'), 65.9402713911571),
 (('running', 'time'), 64.15271723747468),
 (('matt', 'damon'), 63.775616177967265),
 (('pam', 'grier'), 63.35061224964273),
 (('serial', 'killer'), 63.35061224964273),
 (('ben', 'affleck'), 

Apparently we need to remove custom stopwords from these too!