# N-gram language modeling

In this notebook we build an n-gram language model, inspect its conditional probability distributions, and use it to generate some text.

In [5]:
from nltk import word_tokenize, ngrams # function for generating n-grams

In [6]:
tokens = word_tokenize("this is an example sentence that we will create some n-grams of")

for n in range(2, 5):
    print(f'{n}-GRAMS')
    print(list(ngrams(tokens, n)))
    print()

2-GRAMS
[('this', 'is'), ('is', 'an'), ('an', 'example'), ('example', 'sentence'), ('sentence', 'that'), ('that', 'we'), ('we', 'will'), ('will', 'create'), ('create', 'some'), ('some', 'n-grams'), ('n-grams', 'of')]

3-GRAMS
[('this', 'is', 'an'), ('is', 'an', 'example'), ('an', 'example', 'sentence'), ('example', 'sentence', 'that'), ('sentence', 'that', 'we'), ('that', 'we', 'will'), ('we', 'will', 'create'), ('will', 'create', 'some'), ('create', 'some', 'n-grams'), ('some', 'n-grams', 'of')]

4-GRAMS
[('this', 'is', 'an', 'example'), ('is', 'an', 'example', 'sentence'), ('an', 'example', 'sentence', 'that'), ('example', 'sentence', 'that', 'we'), ('sentence', 'that', 'we', 'will'), ('that', 'we', 'will', 'create'), ('we', 'will', 'create', 'some'), ('will', 'create', 'some', 'n-grams'), ('create', 'some', 'n-grams', 'of')]
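
To make the sliding window behind these outputs explicit, here is a minimal pure-Python sketch that produces the same kind of tuples. The helper name `sliding_ngrams` is our own and not part of NLTK, and the sketch ignores the padding options that `nltk.ngrams` offers.

```python
def sliding_ngrams(tokens, n):
    """Return all contiguous windows of n tokens as tuples (hypothetical helper, not NLTK)."""
    # zip over n staggered views of the token list; zip stops at the shortest view,
    # so a list of length L yields L - n + 1 windows (or none if L < n)
    return list(zip(*(tokens[i:] for i in range(n))))

example = "this is an example sentence".split()
print(sliding_ngrams(example, 2))
print(sliding_ngrams(example, 3))
```

A sequence of `L` tokens therefore yields `L - n + 1` n-grams, which matches the shrinking list lengths in the outputs above.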



Next, we collect n-grams from our Gutenberg books. The cell below sets `n`, which is used throughout the rest of the notebook. Change its value to see how the model behaves for different values of `n`.

In [7]:
n = 3

In [8]:
from glob import glob
from collections import Counter

# read every book and concatenate the contents into one long text
text = ''
for filename in glob('../data/gutenberg/*.txt'):
    # state the encoding explicitly instead of relying on the platform default
    with open(filename, encoding='utf-8') as f:
        text += f.read() + '\n'

# create tokens and n-grams and count them
tokens = word_tokenize(text)
print('Number of tokens:', len(tokens))
book_ngrams = ngrams(tokens, n)
counter = Counter(book_ngrams)
counter.most_common(50)

Number of tokens: 912242


[((',', '”', 'said'), 1783),
 (('?', '”', '“'), 792),
 (('.', '“', 'I'), 712),
 (('”', '“', 'I'), 606),
 ((',', 'and', 'the'), 517),
 (('don', '’', 't'), 493),
 (('.', 'It', 'was'), 375),
 (('”', 'said', 'Mr.'), 297),
 (('“', 'Oh', ','), 280),
 (('?', '”', 'said'), 278),
 ((',', 'and', 'I'), 271),
 (('I', 'don', '’'), 257),
 (('.', 'It', 'is'), 251),
 ((',', 'and', 'that'), 244),
 ((',', 'and', 'he'), 239),
 (('”', 'said', 'Dorothea'), 224),
 ((',', '”', 'he'), 215),
 ((',', 'and', 'she'), 214),
 ((',', 'with', 'a'), 213),
 (('“', 'I', 'am'), 208),
 (('”', 'said', 'Mrs.'), 199),
 ((',', 'my', 'dear'), 196),
 ((',', 'and', 'then'), 196),
 (('.', 'He', 'was'), 193),
 (('I', '’', 'll'), 193),
 (('“', 'Yes', ','), 192),
 (('”', '“', 'Oh'), 190),
 (('!', '”', '“'), 186),
 (('that', 'he', 'had'), 180),
 ((',', 'in', 'the'), 177),
 (('.', 'I', 'am'), 177),
 ((',', 'as', 'if'), 174),
 (('said', 'Dorothea', ','), 174),
 (('“', 'No', ','), 171),
 ((',', '“', 'I'), 170),
 (('.', '“', 'It'), 170),

To build a model that captures the idea of a _history_ and a _continuation_, we create n-grams of size `n`. We treat the first `n - 1` tokens of each n-gram as the history and the `n`th token as the continuation, and count how often each continuation follows each history. We first build a model of raw counts, then normalize the counts for each history so that they form a probability distribution whose values sum to 1.
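
Written out, this normalization is the usual relative-frequency estimate. The notation is our own: $h$ is a history of $n - 1$ tokens, $w$ is a candidate continuation, and $C(h, w)$ is the number of times the n-gram formed by $h$ followed by $w$ occurs in the corpus.

$$
P(w \mid h) = \frac{C(h, w)}{\sum_{w'} C(h, w')}
$$

As a made-up illustration, if some history were followed by `world` three times and by `same` once, we would get $P(\text{world} \mid h) = 3/4$ and $P(\text{same} \mid h) = 1/4$.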

In [9]:
from collections import defaultdict

ngram_model_raw = defaultdict(Counter)

for ngram in ngrams(tokens, n):
    history, continuation = ngram[:-1], ngram[-1]
    ngram_model_raw[history][continuation] += 1

In [10]:
ngram_model_raw[('in', 'the')].most_common(10)

[('world', 103),
 ('same', 52),
 ('United', 51),
 ('morning', 37),
 ('sea', 30),
 ('air', 30),
 ('room', 30),
 ('evening', 26),
 ('house', 26),
 ('town', 26)]

In [11]:
ngram_model = dict()
for history, continuation_counts in ngram_model_raw.items():
    # normalize by dividing each count by the total count for this history
    summed_counts = sum(continuation_counts.values())
    ngram_model[history] = {word: count / summed_counts for word, count in continuation_counts.items()}

In [12]:
# check that the probabilities sum to 1 (up to floating-point rounding error)
sum(ngram_model[('I', 'can')].values())

0.9999999999999963
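
Checking a single history by eye works, but the same check can be run over every history with a tolerance. The sketch below does this on a tiny made-up corpus so that it runs on its own; `toy_tokens`, `toy_n`, `toy_raw` and `toy_model` are illustrative names, and using `math.isclose` with its default tolerance is our choice rather than something the notebook prescribes.

```python
import math
from collections import Counter, defaultdict

# tiny made-up corpus, used only so this cell is self-contained
toy_tokens = "the cat sat on the mat and the cat ran".split()
toy_n = 3

# raw counts: history (first toy_n - 1 tokens) -> Counter of continuations
toy_raw = defaultdict(Counter)
for i in range(len(toy_tokens) - toy_n + 1):
    window = tuple(toy_tokens[i:i + toy_n])
    toy_raw[window[:-1]][window[-1]] += 1

# normalize each history's counts into a probability distribution
toy_model = {
    history: {word: count / sum(counts.values()) for word, count in counts.items()}
    for history, counts in toy_raw.items()
}

# every distribution should sum to 1, allowing for floating-point rounding
all_normalized = all(
    math.isclose(sum(dist.values()), 1.0) for dist in toy_model.values()
)
print(all_normalized)
```

The same `all(...)` expression should work unchanged on `ngram_model.values()`.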

In [13]:
import random
seed = random.choice(list(ngram_model.keys()))
# seed = ('I', 'can')
seed

('Five', 'pounds.')

In [14]:
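# print the seed, then repeatedly sample the next word given the last n - 1 tokens
for word in seed:
    print(word, end=' ')

previous = seed
for _ in range(100):
    next_word_dist = ngram_model.get(previous)
    if next_word_dist is None:
        # this history was never followed by anything (it ends the corpus), so stop
        break
    words = list(next_word_dist.keys())
    weights = list(next_word_dist.values())  # probabilities, used as sampling weights
    next_word = random.choices(words, weights=weights)[0]
    print(next_word, end=' ')
    previous = previous[1:] + (next_word,)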

Five pounds. ” He meant to make a General of him before he is most vividly hit by the view . Excepting the sublime life of man . I can not imagine what the consequences which he opened his lips , and win celebrity , who was merely going to be useful for him in my genealogy . But an Italian lady , fie , you will be A giant in might , perhaps , ” said Dorothea , decidedly . “ In the beginning again . He would take it , and she was only in shades and shadows , drowned 
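
For repeated experiments it can be handy to wrap the sampling loop in a function. The following is one possible sketch, not the notebook's own method: `build_model` and `generate` are hypothetical helpers, the toy corpus is made up, and passing a seeded `random.Random` instance is simply one way to make the output reproducible.

```python
import random
from collections import Counter, defaultdict

def build_model(tokens, n):
    """Map each (n - 1)-token history to a dict of continuation probabilities."""
    raw = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        window = tuple(tokens[i:i + n])
        raw[window[:-1]][window[-1]] += 1
    return {
        history: {word: count / sum(counts.values()) for word, count in counts.items()}
        for history, counts in raw.items()
    }

def generate(model, seed, max_tokens=100, rng=None):
    """Sample up to max_tokens continuations, starting from the seed history."""
    rng = rng or random.Random()
    output = list(seed)
    history = tuple(seed)
    for _ in range(max_tokens):
        dist = model.get(history)
        if dist is None:
            # unseen history: there is nothing to sample from, so stop early
            break
        next_word = rng.choices(list(dist), weights=list(dist.values()))[0]
        output.append(next_word)
        # slide the window: drop the oldest token, append the new one
        history = history[1:] + (next_word,)
    return output

# tiny made-up corpus, used only so this cell is self-contained
toy_tokens = "the cat sat on the mat and the cat ran".split()
toy_model = build_model(toy_tokens, 3)
result = generate(toy_model, ('on', 'the'), max_tokens=20, rng=random.Random(0))
print(' '.join(result))
```

Returning a list of tokens instead of printing inside the loop leaves the choice of how to join and display the text to the caller.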

In [15]:
ngram_model

{('\ufeffThe', 'Project'): {'Gutenberg': 1.0},
 ('Project', 'Gutenberg'): {'eBook': 0.035211267605633804,
  'License': 0.07042253521126761,
  '’': 0.007042253521126761,
  'is': 0.035211267605633804,
  'trademark': 0.035211267605633804,
  'eBooks': 0.035211267605633804,
  '”': 0.176056338028169,
  'Literary': 0.45774647887323944,
  'are': 0.035211267605633804,
  ':': 0.035211267605633804,
  'volunteers': 0.04225352112676056,
  'web': 0.035211267605633804},
 ('Gutenberg', 'eBook'): {'of': 1.0},
 ('eBook', 'of'): {'Moby': 0.2,
  'Pride': 0.2,
  'Romeo': 0.2,
  'Middlemarch': 0.2,
  'A': 0.2},
 ('of', 'Moby'): {'Dick': 0.9, 'Dick—but': 0.1},
 ('Moby', 'Dick'): {';': 0.11842105263157894,
  '.': 0.15789473684210525,
  '?': 0.07894736842105263,
  'ye': 0.013157894736842105,
  'that': 0.039473684210526314,
  '!': 0.07894736842105263,
  'to': 0.013157894736842105,
  'with': 0.013157894736842105,
  ',': 0.13157894736842105,
  'was': 0.06578947368421052,
  'not': 0.013157894736842105,
  'had': 0.