# N-grams

N-grams are overlapping sequences of $n$ consecutive words: two words at a time (bigrams), three words at a time (trigrams), or longer sequences such as 4-grams or 5-grams. 

## Making n-grams manually

We can build a bigram representation of a test sentence as a list of two-item lists, pairing each token with the one that follows it: 

In [39]:
import nltk
import collections

In [6]:
testText = "The quick brown fox jumped over the lazy dogs!" 

In [7]:
testTokens = nltk.word_tokenize(testText)

In [8]:
testTokens

['The', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dogs', '!']

In [11]:
bigrams = []
for i, item in enumerate(testTokens): 
    if i<len(testTokens)-1:
        nextOne = testTokens[i+1]
        bigrams.append([item, nextOne])

In [12]:
bigrams

[['The', 'quick'],
 ['quick', 'brown'],
 ['brown', 'fox'],
 ['fox', 'jumped'],
 ['jumped', 'over'],
 ['over', 'the'],
 ['the', 'lazy'],
 ['lazy', 'dogs'],
 ['dogs', '!']]

## Using the `zip()` function

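The built-in `zip()` function makes this a little easier. It pairs up the items of its arguments position by position and stops when the shortest argument runs out. Zipping the token list with a copy of itself shifted by one position therefore yields the bigrams: 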

In [2]:
for item in zip([1, 2, 3], ['a', 'b', 'c']):
    print(item)

(1, 'a')
(2, 'b')
(3, 'c')


In [10]:
for item in zip(testTokens, testTokens[1:]):
    print(item)

('The', 'quick')
('quick', 'brown')
('brown', 'fox')
('fox', 'jumped')
('jumped', 'over')
('over', 'the')
('the', 'lazy')
('lazy', 'dogs')
('dogs', '!')
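
The same shifted-copy trick generalizes to any $n$. The following is a minimal sketch rather than part of the original notebook; the function name `zip_ngrams` and the short token list are illustrative assumptions.

```python
def zip_ngrams(tokens, n):
    """Return the n-grams of `tokens` as a list of tuples (hypothetical helper)."""
    # Build n copies of the token list, the i-th one shifted left by i positions.
    shifted = [tokens[i:] for i in range(n)]
    # zip stops at the shortest copy, so incomplete n-grams at the end are dropped.
    return list(zip(*shifted))

tokens = ['The', 'quick', 'brown', 'fox', 'jumped']
print(zip_ngrams(tokens, 2))
print(zip_ngrams(tokens, 3))
```

If `n` is larger than the number of tokens, one of the shifted copies is empty and the result is an empty list.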


## On Tuples

The items that `zip()` yields are tuples. Tuples are like lists, but written with `()` instead of `[]`, and they are immutable: once created, they cannot be changed. 

In [11]:
(2, 3).append(5) # This should produce an error

AttributeError: 'tuple' object has no attribute 'append'

In [75]:
(2,3)[1] # Indexing a tuple

3

In [76]:
('apples', 'bananas', 'oranges')[1]

'bananas'
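
Immutability is what makes tuples convenient for counting n-grams: a tuple of strings is hashable, so it can serve as a dictionary key, while a list cannot. A small sketch with made-up bigrams (the variable names here are illustrative only):

```python
import collections

pairs = [('the', 'fox'), ('the', 'dog'), ('the', 'fox')]

# Tuples are hashable, so Counter can use them as keys.
counts = collections.Counter(pairs)
print(counts[('the', 'fox')])   # 2

# Lists are mutable and therefore unhashable, so counting lists fails.
try:
    collections.Counter([['the', 'fox'], ['the', 'dog']])
    list_error = None
except TypeError as err:
    list_error = err
    print('TypeError:', err)
```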

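`zip()` returns a lazy iterator (a `zip` object) rather than a list; pass it to `list()` to see all of its items at once. Zipping three shifted copies of the token list gives the trigrams: 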

In [14]:
list(zip(testTokens, testTokens[1:], testTokens[2:]))

[('The', 'quick', 'brown'),
 ('quick', 'brown', 'fox'),
 ('brown', 'fox', 'jumped'),
 ('fox', 'jumped', 'over'),
 ('jumped', 'over', 'the'),
 ('over', 'the', 'lazy'),
 ('the', 'lazy', 'dogs'),
 ('lazy', 'dogs', '!')]
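
Because a `zip` object is an iterator, it can be consumed only once, which is one reason to convert it to a list before reusing it. A short illustration with a made-up token list:

```python
tokens = ['a', 'b', 'c', 'd']
pairs = zip(tokens, tokens[1:])

first_pass = list(pairs)    # consumes the iterator
second_pass = list(pairs)   # nothing is left to yield

print(first_pass)
print(second_pass)
```

The first `print` shows the three bigrams; the second shows an empty list.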

In [15]:
testTokens

['The', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dogs', '!']

In [16]:
testTokens[1:]

['quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dogs', '!']

In [17]:
def trigrams(tokens): 
    """ Return the trigrams of a token list as a list of tuples. """
    return list(zip(tokens, tokens[1:], tokens[2:]))

NLTK's built-in `nltk.ngrams()` function produces n-grams for any $n$: 

In [23]:
list(nltk.ngrams(testTokens, 4))

[('The', 'quick', 'brown', 'fox'),
 ('quick', 'brown', 'fox', 'jumped'),
 ('brown', 'fox', 'jumped', 'over'),
 ('fox', 'jumped', 'over', 'the'),
 ('jumped', 'over', 'the', 'lazy'),
 ('over', 'the', 'lazy', 'dogs'),
 ('the', 'lazy', 'dogs', '!')]

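## N-grams in a novel

Now let's try this on a longer text: Wilkie Collins's *The Moonstone*, a novel told by several narrators in turn. 

In [27]:
with open('../Texts/moonstone.md') as f:
    moonstone = f.read()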

In [28]:
moonstoneParts = moonstone.split('\n## ')

In [29]:
moonstoneParts[2][:200]

'First Period\n\nTHE LOSS OF THE DIAMOND (1848)\n\nThe events related by GABRIEL BETTEREDGE, house-steward in the service\nof JULIA, LADY VERINDER.\n\n### Chapter I\n\nIn the first part of ROBINSON CRUSOE, at p'

In [30]:
betteredge = moonstoneParts[2]

In [37]:
betTokens = nltk.word_tokenize(betteredge.lower())
betTokens = [word for word in betTokens if word.isalpha()]
betTrigrams = list(nltk.ngrams(betTokens, 3))

In [38]:
betTrigrams[:10]

[('first', 'period', 'the'),
 ('period', 'the', 'loss'),
 ('the', 'loss', 'of'),
 ('loss', 'of', 'the'),
 ('of', 'the', 'diamond'),
 ('the', 'diamond', 'the'),
 ('diamond', 'the', 'events'),
 ('the', 'events', 'related'),
 ('events', 'related', 'by'),
 ('related', 'by', 'gabriel')]

In [40]:
collections.Counter(betTrigrams).most_common(20)

[(('in', 'the', 'house'), 46),
 (('my', 'lady', 's'), 43),
 (('i', 'don', 't'), 39),
 (('miss', 'rachel', 's'), 35),
 (('one', 'of', 'the'), 34),
 (('i', 'can', 't'), 33),
 (('the', 'colonel', 's'), 33),
 (('of', 'the', 'diamond'), 32),
 (('said', 'the', 'sergeant'), 29),
 (('says', 'the', 'sergeant'), 28),
 (('of', 'the', 'moonstone'), 26),
 (('out', 'of', 'the'), 25),
 (('the', 'rest', 'of'), 23),
 (('the', 'shivering', 'sand'), 23),
 (('that', 'he', 'had'), 22),
 (('as', 'well', 'as'), 19),
 (('at', 'the', 'bottom'), 19),
 (('the', 'loss', 'of'), 18),
 (('to', 'my', 'lady'), 18),
 (('there', 'was', 'a'), 18)]

In [77]:
def narrator(narr): 
    """ Just a convenience function for getting a narrator from the text. """ 
    narrators = {"Betteredge": 2, "Clack": 4, "Bruff": 5, "Blake": 6, "Jennings": 7}
    moonstoneParts = moonstone.split('\n## ') 
    narrText = moonstoneParts[narrators[narr]]
    # Just print a few characters so that we can verify it. 
    print(narrText[:60].replace('\n', ' ')) 
    return narrText

In [46]:
bet = narrator("Betteredge")

First Period  THE LOSS OF THE DIAMOND (1848)  The events rel


In [66]:
def removeQuoted(tokens): 
    """ Remove tokens that fall between curly quotation marks
    (i.e. remove dialogue from a text). """ 
    outsideQuotes = []
    isInside = False
    for token in tokens: 
        if token == '“':      # opening quote: start skipping tokens
            isInside = True
        elif token == '”':    # closing quote: stop skipping tokens
            isInside = False
        elif not isInside: 
            outsideQuotes.append(token)
    return outsideQuotes
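
To see what `removeQuoted` does, here is a self-contained sketch that repeats the same logic under the hypothetical name `remove_quoted` and applies it to a hand-made token list. The sample tokens are invented for illustration, and the sketch assumes the tokenizer emits curly quotes as separate tokens.

```python
def remove_quoted(tokens):
    """Drop every token between an opening “ and a closing ” (same logic as removeQuoted)."""
    outside = []
    inside = False
    for token in tokens:
        if token == '“':
            inside = True
        elif token == '”':
            inside = False
        elif not inside:
            outside.append(token)
    return outside

sample = ['“', 'Come', 'in', ',', '”', 'said', 'the', 'Sergeant', '.']
print(remove_quoted(sample))
```

Only the narration (`said the Sergeant .`) survives. Note that the function matches curly quotes only, so dialogue marked with straight quotes would pass through unchanged.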

In [67]:
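def commonNgrams(text, n): 
    """ Return the 10 most common n-grams in the narration of a text,
    i.e. with quoted dialogue and non-alphabetic tokens removed. """
    tokens = nltk.word_tokenize(text)
    tokens = removeQuoted(tokens)
    tokensClean = [token for token in tokens if token.isalpha()]
    ngrams = nltk.ngrams(tokensClean, n)
    return collections.Counter(ngrams).most_common(10)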

In [74]:
commonNgrams(narrator("Betteredge"), 7)

First Period  THE LOSS OF THE DIAMOND (1848)  The events rel


[(('like', 'a', 'woman', 'in', 'a', 'dream', 'I'), 3),
 (('I', 'went', 'into', 'the', 'service', 'of', 'the'), 2),
 (('went', 'into', 'the', 'service', 'of', 'the', 'old'), 2),
 (('into', 'the', 'service', 'of', 'the', 'old', 'lord'), 2),
 (('my', 'lady', 'took', 'an', 'interest', 'in', 'the'), 2),
 (('pipe', 'and', 'took', 'a', 'turn', 'at', 'ROBINSON'), 2),
 (('and', 'took', 'a', 'turn', 'at', 'ROBINSON', 'CRUSOE'), 2),
 (('took', 'a', 'turn', 'at', 'ROBINSON', 'CRUSOE', 'Before'), 2),
 (('a', 'turn', 'at', 'ROBINSON', 'CRUSOE', 'Before', 'I'), 2),
 (('turn', 'at', 'ROBINSON', 'CRUSOE', 'Before', 'I', 'had'), 2)]

In [78]:
commonNgrams(narrator("Clack"), 7)

First Narrative  Contributed by MISS CLACK; niece of the lat


[(('may', 'be', 'the', 'consequence', 'of', 'a', 'mission'), 3),
 (('the', 'fallen', 'nature', 'which', 'we', 'all', 'inherit'), 2),
 (('fallen', 'nature', 'which', 'we', 'all', 'inherit', 'from'), 2),
 (('once', 'embarked', 'on', 'a', 'career', 'of', 'manifest'), 2),
 (('embarked', 'on', 'a', 'career', 'of', 'manifest', 'usefulness'), 2),
 (('Miss', 'Jane', 'Ann', 'Stamper', 'on', 'my', 'lap'), 2),
 (('I', 'was', 'left', 'alone', 'in', 'the', 'room'), 2),
 (('First', 'Narrative', 'Contributed', 'by', 'MISS', 'CLACK', 'niece'), 1),
 (('Narrative', 'Contributed', 'by', 'MISS', 'CLACK', 'niece', 'of'), 1),
 (('Contributed', 'by', 'MISS', 'CLACK', 'niece', 'of', 'the'), 1)]

In [79]:
commonNgrams(narrator("Clack"), 3)

First Narrative  Contributed by MISS CLACK; niece of the lat


[(('my', 'aunt', 's'), 14),
 (('Lady', 'Verinder', 's'), 14),
 (('out', 'of', 'the'), 13),
 (('which', 'I', 'had'), 12),
 (('in', 'Montagu', 'Square'), 10),
 (('in', 'the', 'room'), 9),
 (('on', 'the', 'subject'), 8),
 (('the', 'subject', 'of'), 8),
 (('of', 'the', 'house'), 8),
 (('one', 'of', 'the'), 8)]

In [80]:
commonNgrams(narrator("Blake"), 3)

Third Narrative  Contributed by FRANKLIN BLAKE  ### Chapter 


[(('which', 'I', 'had'), 20),
 (('looked', 'at', 'me'), 16),
 (('that', 'I', 'had'), 16),
 (('he', 'said', 'I'), 14),
 (('one', 'of', 'the'), 12),
 (('to', 'me', 'I'), 12),
 (('that', 'he', 'was'), 11),
 (('that', 'I', 'was'), 10),
 (('I', 'might', 'have'), 10),
 (('said', 'Ezra', 'Jennings'), 10)]