# Class 9.2: n-grams and collocations


Here's some code to get us started: imports, a decent stop list, and a function for reading in and tokenizing a text. I've pre-processed the Gutenberg books to remove the intro and the legal language at the end.

In [1]:
# import statements
import re
import nltk
from nltk.corpus import stopwords

# get NLTK's English stop word list, then extend it with punctuation and common filler words
stoplist = stopwords.words('english')
stoplist.extend([".", ",", "?", "could", "would", "“", "”", "’", ";", "!","much", "like", "one", "many", "though", "without", "upon"])


# function for reading in, tokenizing
def readdata(filename):
    # open the file, read it in, replace newlines with spaces
    # (the with-block closes the file for us)
    with open(filename, encoding="utf-8") as f:
        fulltext = f.read()
    alltext = re.sub("\n", " ", fulltext)

    # call nltk.word_tokenize
    alltokens = nltk.word_tokenize(alltext)    
    return alltokens



### Okay, let's do Great Expectations

In [2]:
greattokens = readdata("greatexpectations.txt")

# get bigrams
bigrams = nltk.ngrams(greattokens, 2)
bigramlist = list(bigrams)

# print out most frequent bigrams
bigramfreq = nltk.FreqDist(bigramlist)
print(bigramfreq.most_common(10))


[((',', 'and'), 3715), (('.', '“'), 1362), (('”', '“'), 1249), (('’', 's'), 1151), ((',', '”'), 1118), ((',', 'I'), 1025), (('”', 'said'), 933), ((',', '“'), 895), (('?', '”'), 892), (('of', 'the'), 831)]


### Unsurprisingly, the most frequent bigrams all involve punctuation and stop words! Let's try to filter those out.

In [3]:
# print out most frequent bigrams where neither word is a stop word
for b in bigramfreq.most_common(1000):
    if b[0][0].lower() not in stoplist and b[0][1].lower() not in stoplist:
        print(b)

(('Miss', 'Havisham'), 299)
(('Mr.', 'Jaggers'), 213)
(('said', 'Joe'), 121)
(('said', 'Mr.'), 117)
(('Mr.', 'Wopsle'), 110)
(('Mr.', 'Pumblechook'), 89)
(('Mr.', 'Pip'), 66)
(('said', 'Wemmick'), 63)
(('Mrs.', 'Joe'), 60)
(('Mrs.', 'Pocket'), 53)
(('dear', 'boy'), 52)
(('Mr.', 'Pocket'), 51)
(('said', 'Herbert'), 51)
(('old', 'chap'), 37)
(('said', 'Estella'), 37)
(('Mr.', 'Wemmick'), 34)
(('Miss', 'Skiffins'), 31)
(('young', 'man'), 26)
(('young', 'gentleman'), 25)
(('said', 'Biddy'), 25)


### That tells us a lot more about Great Expectations! Let's put all that in a function so we can call it with other books.

In [4]:
# let's put that bigram frequency code in a function

def get_bigrams(tokenlist):
    bigrams = nltk.ngrams(tokenlist, 2)
    bigramlist = list(bigrams)

    # print out most frequent bigrams
    bigramfreq = nltk.FreqDist(bigramlist)
    print(bigramfreq.most_common(10))

    # print out most frequent bigrams where neither word is a stop word
    for b in bigramfreq.most_common(1000):
        if b[0][0].lower() not in stoplist and b[0][1].lower() not in stoplist:
            print(b)


### Now on to Walden

In [5]:
# Walden
# You can change the argument to most_common in the code above to see more bigrams from Walden.

waldentokens = readdata("walden.txt")
get_bigrams(waldentokens)


[((',', 'and'), 1850), (('of', 'the'), 850), (('in', 'the'), 641), (('.', 'I'), 431), (('.', 'The'), 360), (('to', 'the'), 343), ((',', 'or'), 331), (('and', 'the'), 313), ((',', 'as'), 307), (('’', 's'), 302)]
(('New', 'England'), 19)
(('Walden', 'Pond'), 14)
(('*', '*'), 13)
(('every', 'day'), 12)


### And here's a really different one: Romeo and Juliet

In [6]:
# Romeo and Juliet
romeotokens = readdata("romeo.txt")
get_bigrams(romeotokens)

[((',', 'and'), 181), (('Rom', '.'), 163), (('.', 'I'), 159), ((',', 'I'), 123), (('Jul', '.'), 117), (('.', 'Rom'), 114), (('Nurse', '.'), 102), ((',', 'And'), 97), (('.', 'O'), 85), (('.', 'Jul'), 73)]
(('thou', 'art'), 24)
(('art', 'thou'), 19)
(('Romeo', "'s"), 19)
(('thou', 'wilt'), 17)
(('thou', 'hast'), 16)
(('[', 'Exit'), 14)
(("'s", 'dead'), 13)
(('Enter', 'Romeo'), 12)
(('Capulet', "'s"), 12)
(('Friar', 'Laurence'), 11)
(('Enter', 'Juliet'), 10)
(('good', 'night'), 10)
(('love', "'s"), 9)
(('County', 'Paris'), 9)
(('lady', "'s"), 9)
(('Enter', 'Friar'), 9)
(('Tybalt', "'s"), 9)
(("'s", 'death'), 9)
(('[', 'aside'), 8)
(('thy', 'love'), 8)
((']', 'Rom'), 8)
(('Juliet', "'s"), 7)
(('pray', 'thee'), 7)
(('[', 'Exeunt'), 7)
(('men', "'s"), 7)
(('[', 'Laurence'), 7)
(('Laurence', ']'), 7)
(('Enter', 'Nurse'), 7)
(('kill', "'d"), 7)
(('tell', 'thee'), 6)
(("'s", 'house'), 6)
(("'s", 'cell'), 6)
(('Chief', 'Watch'), 6)
(('Friar', 'John'), 5)
(('Enter', 'Benvolio'), 5)
(('Exeunt', '[

## 2. Collocations

There are ways of finding characteristic two- (or three-) word sequences that don't involve just filtering out stop words. Typically, we use statistical metrics to identify bigrams in which one or both words are not particularly frequent on their own, but when they do occur, they tend to occur together. These are called **collocations**, and nltk has a nice module for finding them.

Below, we use a statistical measure called pointwise mutual information (PMI). `bigram_measures` offers other options, such as chi-square and likelihood ratio, that you can experiment with.
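To make PMI a little more concrete, here is a minimal pure-Python sketch of the idea. It assumes the textbook definition, PMI(x, y) = log2( P(x, y) / (P(x) P(y)) ), with every probability estimated as a count divided by the number of tokens. The function name `pmi_scores` and the toy sentence are made up for illustration, and nltk's own implementation may differ in detail.

```python
import math
from collections import Counter

def pmi_scores(tokens):
    """Score each adjacent word pair with PMI = log2( P(x, y) / (P(x) * P(y)) )."""
    word_counts = Counter(tokens)
    pair_counts = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)
    scores = {}
    for (x, y), n_xy in pair_counts.items():
        # With probabilities estimated as count / n, the ratio simplifies to
        # (n_xy * n) / (count(x) * count(y)).
        scores[(x, y)] = math.log2(n_xy * n / (word_counts[x] * word_counts[y]))
    return scores

# hypothetical toy text, not taken from any of the books
toy = "new york is big and new york is old and the dog is old".split()
scores = pmi_scores(toy)
for pair in [("new", "york"), ("is", "old")]:
    print(pair, round(scores[pair], 2))
```

Both pairs occur twice, but "new york" scores higher because "new" and "york" never appear apart, while "is" also shows up in other contexts.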

In [7]:
# import only the finder we need instead of everything in the module
from nltk.collocations import BigramCollocationFinder

bigram_measures = nltk.collocations.BigramAssocMeasures()

print("Great Expectations")
finder = BigramCollocationFinder.from_words(greattokens)
print(finder.nbest(bigram_measures.pmi, 10))

print("Walden")
finder = BigramCollocationFinder.from_words(waldentokens)
print(finder.nbest(bigram_measures.pmi, 10))

print("Juliet")
finder = BigramCollocationFinder.from_words(romeotokens)
print(finder.nbest(bigram_measures.pmi, 10))



Great Expectations
[('Accoucheur', 'Policeman'), ('Arabian', 'Nights'), ('Assurance', 'shares'), ('Bear—bear', 'witness.'), ('Been', 'bolting'), ('Bosworth', 'Field'), ('Botany', 'Bay'), ('Chelsea', 'Reach'), ('Cock', 'Robin'), ('Come—to', 'play.')]
Walden
[('.........', '7.50'), ('.............................', '2.25'), ('..............................', '3.12½'), ('...............................', '0.40'), ('0.01', 'Transportation'), ('0.14', 'Latch'), ('0.15', 'Nails'), ('0.25', 'Dried'), ('0.40', 'Turnip'), ('0.54', 'Ploughing')]
Juliet
[("'It", 'lightens'), ("'Signior", 'Martino'), ("'When", 'griping'), ('-the', 'Clown'), ('AND', 'JULIET'), ('Alike', 'bewitched'), ('Alla', 'stoccata'), ('Ancient', 'damnation'), ('Brief', 'sounds'), ('Dramatis', 'Personae')]


### Improving collocations
Most of those collocations are odd and unhelpful. PMI rewards pairs whose words rarely appear apart, so the highest-scoring collocations are often ones that occur only once or twice in the whole book. We can filter out bigrams that occur fewer than three times and find more interesting collocations.
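Here is a small sketch of why a frequency filter helps, again in plain Python rather than nltk. It assumes the same count-based PMI estimate as the textbook definition; `top_pmi`, its `min_freq` parameter, and the toy sentence are hypothetical names chosen for illustration. Under that estimate, a pair of words that each occur exactly once, next to each other, gets the highest possible score (log2 of the number of tokens), which is why one-off pairs crowd the top of the list.

```python
import math
from collections import Counter

def top_pmi(tokens, min_freq=1, n=3):
    """Return the n highest-PMI adjacent pairs that occur at least min_freq times."""
    word_counts = Counter(tokens)
    pair_counts = Counter(zip(tokens, tokens[1:]))
    total = len(tokens)
    scored = []
    for (x, y), n_xy in pair_counts.items():
        if n_xy < min_freq:
            continue  # drop rare pairs before ranking
        pmi = math.log2(n_xy * total / (word_counts[x] * word_counts[y]))
        scored.append(((x, y), pmi))
    # highest PMI first; ties broken alphabetically so the order is stable
    scored.sort(key=lambda item: (-item[1], item[0]))
    return [pair for pair, _ in scored[:n]]

# hypothetical toy text: "miss havisham" recurs, "botany bay" appears once
toy = ("miss havisham said the boy saw miss havisham and the boy said "
       "miss havisham saw botany bay").split()
print(top_pmi(toy, min_freq=1, n=1))
print(top_pmi(toy, min_freq=3, n=1))
```

Without the filter the one-off pair wins; with a minimum frequency of 3, only the recurring pair is left.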

In [8]:
print("Great Expectations")
finder = BigramCollocationFinder.from_words(greattokens)
finder.apply_freq_filter(3)
for c in finder.nbest(bigram_measures.pmi, 10):
    print(" ".join(c))


print("\nWalden")
finder = BigramCollocationFinder.from_words(waldentokens)
finder.apply_freq_filter(3)
for c in finder.nbest(bigram_measures.pmi, 10):
    print(" ".join(c))

print("\nJuliet")
finder = BigramCollocationFinder.from_words(romeotokens)
finder.apply_freq_filter(3)
for c in finder.nbest(bigram_measures.pmi, 10):
    print(" ".join(c))



Great Expectations
Habraham Latharuth
T GO
filed asunder
Copper Rope-walk
Bartholomew Close
Cousin Raymond
scented soap
Covent Garden
Green Copper
aged parent

Walden
CIVIL DISOBEDIENCE
DUTY OF
Loch Fyne
OF CIVIL
_terra firma_
ON THE
THE DUTY
nineteenth century
Green Mountains
Fair Haven

Juliet
plague o
thousand times
Chief Watch
Make haste
They fight
honest gentleman
Thursday next
said 'Ay
silver sound
Scene II


### Collocations: nearby but not necessarily adjacent words

You can also think of collocations as two words that occur frequently in the same general area but not always exactly next to each other. `BigramCollocationFinder` takes a `window_size` argument that lets you look within a "window" to find words that occur near each other frequently.
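As a rough sketch of what looking within a window means, the plain-Python function below counts ordered pairs in which the second word follows the first within a fixed span. It assumes that a window of 2 means adjacent words only, so a window of 5 allows up to three words in between. The name `window_pairs` and the toy sentence are made up for illustration, and this is not nltk's actual implementation.

```python
from collections import Counter

def window_pairs(tokens, window_size=2):
    """Count ordered pairs (w1, w2) where w2 follows w1 within the window."""
    counts = Counter()
    for i, w1 in enumerate(tokens):
        # the window covers w1 plus the next (window_size - 1) tokens
        for w2 in tokens[i + 1 : i + window_size]:
            counts[(w1, w2)] += 1
    return counts

# hypothetical toy text
toy = "backwards and forwards he paced backwards and forwards".split()
print(window_pairs(toy, window_size=2)[("backwards", "forwards")])
print(window_pairs(toy, window_size=3)[("backwards", "forwards")])
```

With adjacent words only, "backwards" and "forwards" are never paired; widening the window by one token picks up both occurrences of "backwards and forwards".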

In [9]:
finder = BigramCollocationFinder.from_words(romeotokens, window_size=5)
finder.apply_freq_filter(3)
print(finder.nbest(bigram_measures.pmi, 10))

finder = BigramCollocationFinder.from_words(greattokens, window_size=5)
finder.apply_freq_filter(3)
print(finder.nbest(bigram_measures.pmi, 10))

finder = BigramCollocationFinder.from_words(waldentokens, window_size=5)
finder.apply_freq_filter(3)
print(finder.nbest(bigram_measures.pmi, 10))

[('fee', 'simple'), ("'Heart", 'ease'), ('lick', 'fingers'), ('ACT', 'I.'), ('Saint', 'Church'), ('hare', 'hoar'), ('plague', 'o'), ('bite', 'thumb'), ('through', 'veins'), ('Alack', 'alack')]
[('DON', 'GO'), ('Habraham', 'Latharuth'), ('twig', 'blade'), ('beats', 'cringes'), ('flint', 'steel'), ('backwards', 'forwards'), ('DON', 'T'), ('T', 'GO'), ('filed', 'asunder'), ('Copper', 'Rope-walk')]
[('hoo', 'hoorer'), ('CIVIL', 'DISOBEDIENCE'), ('DUTY', 'CIVIL'), ('DUTY', 'DISOBEDIENCE'), ('DUTY', 'OF'), ('Loch', 'Fyne'), ('OF', 'CIVIL'), ('OF', 'DISOBEDIENCE'), ('ON', 'CIVIL'), ('ON', 'DUTY')]


Now we can see words that co-occur not necessarily next to each other: 

* Saint Church, e.g., "the church of Saint So-and-so"
* backwards forwards, e.g., "backwards and forwards"
* through veins, e.g., "through my veins"
  

### Using other association metrics

In [10]:
finder = BigramCollocationFinder.from_words(waldentokens)
finder.apply_freq_filter(3)
print(finder.nbest(bigram_measures.pmi, 10))
print(finder.nbest(bigram_measures.likelihood_ratio, 10))
print(finder.nbest(bigram_measures.chi_sq, 10))

[('CIVIL', 'DISOBEDIENCE'), ('DUTY', 'OF'), ('Loch', 'Fyne'), ('OF', 'CIVIL'), ('_terra', 'firma_'), ('ON', 'THE'), ('THE', 'DUTY'), ('nineteenth', 'century'), ('Green', 'Mountains'), ('Fair', 'Haven')]
[(',', 'and'), ('’', 's'), ('.', 'The'), ('.', 'It'), ('of', 'the'), ('in', 'the'), ('.', 'I'), (';', 'but'), ('to', 'be'), ('It', 'is')]
[('CIVIL', 'DISOBEDIENCE'), ('DUTY', 'OF'), ('Fair', 'Haven'), ('Loch', 'Fyne'), ('OF', 'CIVIL'), ('_terra', 'firma_'), ('’', 's'), ('ON', 'THE'), ('THE', 'DUTY'), ('&', 'c.')]


### Note: You can do all this for trigrams, too!
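As a starting point, here is a minimal sketch of trigram counting in plain Python. In nltk, the trigram counterparts of the tools used above are `nltk.ngrams(tokens, 3)`, `TrigramCollocationFinder`, and `TrigramAssocMeasures`. The function name `trigram_counts` and the toy sentence are made up for illustration.

```python
from collections import Counter

def trigram_counts(tokens):
    """Count every run of three consecutive tokens."""
    # roughly what nltk.FreqDist(nltk.ngrams(tokens, 3)) would give you
    return Counter(zip(tokens, tokens[1:], tokens[2:]))

# hypothetical toy text
toy = "i pray thee go i pray thee stay".split()
counts = trigram_counts(toy)
print(counts.most_common(1))
```

The same stop word and frequency filtering ideas apply; you just check three words instead of two.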