## Toy example

Let's look at the miniature corpus from the class:

    I am Sam. 
    Sam I am. 
    I do not like green eggs and ham.
    
### Pre-processing

#### Tokenize the text into sentences

Split the text into a list of sentences.

In [377]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /Users/christine/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [378]:
from nltk.tokenize import sent_tokenize

text = "I am Sam. Sam I am. I do not like green eggs and ham."

sentences = sent_tokenize(text)

#### Tokenize the sentences into lists of words

The text will become a list of lists.

In [379]:
from nltk import word_tokenize 
tokenized_text = []

# replace('.', '') removes the period.
for sentence in sentences:
    tokenized_text.append(word_tokenize(sentence.replace('.', '')))

In [380]:
tokenized_text

[['I', 'am', 'Sam'],
 ['Sam', 'I', 'am'],
 ['I', 'do', 'not', 'like', 'green', 'eggs', 'and', 'ham']]

**Tutorial question 1**: How many lists are in `tokenized_text`?

In [381]:
len(tokenized_text)

# Trick question: len() counts only the 3 inner lists; counting the outer list as well gives 3 + 1 = 4.

3

### Get the N-grams

We'll get unigrams, bigrams and trigrams.

Note that this is not yet the N-gram **language model**, only the two- and three-word sequences it is built from.

Let's start with **bigrams**:

In [382]:
from nltk import bigrams

bigram_list = []

for sentence in tokenized_text:
    bigram_list.extend(list(bigrams(sentence, pad_right=True, pad_left=True, right_pad_symbol='</s>', left_pad_symbol='<s>')))

**Tutorial question 2:** What is the last bigram in `bigram_list`?

In [383]:
total_length = len(bigram_list)

bigram_list[-1]

('ham', '</s>')

In [384]:
bigram_list[2]

('am', 'Sam')

**Trigrams**:

In [385]:
from nltk import trigrams
trigram_list = []

Populate `trigram_list` with trigrams using `trigrams` from NLTK.

In [386]:
for sentence in tokenized_text:
    trigram_list.extend(list(trigrams(sentence, pad_right=True, pad_left=True, right_pad_symbol='</s>', left_pad_symbol='<s>')))

In [387]:
trigram_list

[('<s>', '<s>', 'I'),
 ('<s>', 'I', 'am'),
 ('I', 'am', 'Sam'),
 ('am', 'Sam', '</s>'),
 ('Sam', '</s>', '</s>'),
 ('<s>', '<s>', 'Sam'),
 ('<s>', 'Sam', 'I'),
 ('Sam', 'I', 'am'),
 ('I', 'am', '</s>'),
 ('am', '</s>', '</s>'),
 ('<s>', '<s>', 'I'),
 ('<s>', 'I', 'do'),
 ('I', 'do', 'not'),
 ('do', 'not', 'like'),
 ('not', 'like', 'green'),
 ('like', 'green', 'eggs'),
 ('green', 'eggs', 'and'),
 ('eggs', 'and', 'ham'),
 ('and', 'ham', '</s>'),
 ('ham', '</s>', '</s>')]

**Tutorial question 3**: What is the third trigram in `trigram_list`?

In [388]:
trigram_list[2]

('I', 'am', 'Sam')
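
The padded bigrams and trigrams above can also be reproduced without NLTK. The following is a minimal sketch, assuming the same padding symbols as above; `padded_ngrams` is a hypothetical helper written for illustration, not an NLTK function.

```python
def padded_ngrams(tokens, n, left_pad="<s>", right_pad="</s>"):
    """Return the n-grams of `tokens` after padding both ends with n - 1 symbols."""
    padded = [left_pad] * (n - 1) + list(tokens) + [right_pad] * (n - 1)
    # Slide a window of width n over the padded sequence.
    return [tuple(padded[i:i + n]) for i in range(len(padded) - n + 1)]


sentence = ["I", "am", "Sam"]
toy_bigrams = padded_ngrams(sentence, 2)
toy_trigrams = padded_ngrams(sentence, 3)
print(toy_bigrams)
print(toy_trigrams)
```

A bigram needs one pad symbol on each side and a trigram needs two, which is why the trigram list starts with `('<s>', '<s>', 'I')`.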

### N-gram language model

For the bigram model, we first need the counts of the individual tokens.

In [389]:
unigram_list = []

for sentence in tokenized_text:
    unigram_list.extend(sentence)
    unigram_list.append('<s>')
    unigram_list.append('</s>')

print(unigram_list)

['I', 'am', 'Sam', '<s>', '</s>', 'Sam', 'I', 'am', '<s>', '</s>', 'I', 'do', 'not', 'like', 'green', 'eggs', 'and', 'ham', '<s>', '</s>']


#### Maximum Likelihood Estimate (MLE) bigram model

Populate the dictionary `bigram_lm` with the bigrams as keys and their corresponding probabilities as values.
The MLE probability of a bigram is its count divided by the count of its first token, so for each bigram, compute this probability and store it in `bigram_lm` under the key given by the bigram.

In [390]:
bigram_lm = {}

In [391]:
for bigram in bigram_list:
    
    bigram_lm[bigram] = bigram_list.count(bigram) / unigram_list.count(bigram[0])        

In [392]:
bigram_lm

{('<s>', 'I'): 0.6666666666666666,
 ('I', 'am'): 0.6666666666666666,
 ('am', 'Sam'): 0.5,
 ('Sam', '</s>'): 0.5,
 ('<s>', 'Sam'): 0.3333333333333333,
 ('Sam', 'I'): 0.5,
 ('am', '</s>'): 0.5,
 ('I', 'do'): 0.3333333333333333,
 ('do', 'not'): 1.0,
 ('not', 'like'): 1.0,
 ('like', 'green'): 1.0,
 ('green', 'eggs'): 1.0,
 ('eggs', 'and'): 1.0,
 ('and', 'ham'): 1.0,
 ('ham', '</s>'): 1.0}

**Tutorial question 4**: What is the probability of the bigram `('I', 'do')`?

In [393]:
bigram_lm[("I", "do")]

print(f"Probability is {bigram_lm[('I', 'do')]:.2f}")

Probability is 0.33
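
The same estimate can be computed in a single pass with `collections.Counter` instead of repeated `list.count` calls. This is a sketch under the assumption that the toy corpus is hard-coded as below; the names `bigram_counts`, `context_counts` and `bigram_mle` are illustrative and do not appear elsewhere in the notebook.

```python
from collections import Counter

corpus = [
    ["I", "am", "Sam"],
    ["Sam", "I", "am"],
    ["I", "do", "not", "like", "green", "eggs", "and", "ham"],
]

bigram_counts = Counter()
context_counts = Counter()
for sentence in corpus:
    padded = ["<s>"] + sentence + ["</s>"]
    for first, second in zip(padded, padded[1:]):
        bigram_counts[(first, second)] += 1
        context_counts[first] += 1  # how often `first` starts a bigram

# MLE: P(second | first) = C(first second) / C(first)
bigram_mle = {
    bigram: count / context_counts[bigram[0]]
    for bigram, count in bigram_counts.items()
}
print(f"P(do | I) = {bigram_mle[('I', 'do')]:.2f}")
```

Counting the context while iterating over the bigrams keeps numerator and denominator consistent, since `'I'` starts three bigrams and only one of them is `('I', 'do')`.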


#### MLE trigram model

First we need to add ```<s><s>``` and ```</s></s>``` to the bigram_list.

In [394]:
left_pads = []
right_pads = []

print('Before:\n\t', bigram_list)

for bigram in bigram_list:
        if '<s>' in bigram:
            left_pads.append(('<s>', '<s>'))
        if '</s>' in bigram:
            right_pads.append(('</s>', '</s>'))
            
bigram_list.extend(left_pads)
bigram_list.extend(right_pads)

print('After:\n\t', bigram_list)

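# Why add the double start and end pads to the bigram list?
# The trigram MLE divides each trigram count by the count of its first two tokens,
# and padded trigrams such as ('<s>', '<s>', 'I') have ('<s>', '<s>') as that prefix.
# No trigram here starts with ('</s>', '</s>'), so that pair is never used as a denominator.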

Before:
	 [('<s>', 'I'), ('I', 'am'), ('am', 'Sam'), ('Sam', '</s>'), ('<s>', 'Sam'), ('Sam', 'I'), ('I', 'am'), ('am', '</s>'), ('<s>', 'I'), ('I', 'do'), ('do', 'not'), ('not', 'like'), ('like', 'green'), ('green', 'eggs'), ('eggs', 'and'), ('and', 'ham'), ('ham', '</s>')]
After:
	 [('<s>', 'I'), ('I', 'am'), ('am', 'Sam'), ('Sam', '</s>'), ('<s>', 'Sam'), ('Sam', 'I'), ('I', 'am'), ('am', '</s>'), ('<s>', 'I'), ('I', 'do'), ('do', 'not'), ('not', 'like'), ('like', 'green'), ('green', 'eggs'), ('eggs', 'and'), ('and', 'ham'), ('ham', '</s>'), ('<s>', '<s>'), ('<s>', '<s>'), ('<s>', '<s>'), ('</s>', '</s>'), ('</s>', '</s>'), ('</s>', '</s>')]


Populate the dictionary `trigram_lm` with the trigrams as keys and their corresponding probabilities as values. That is, for each trigram, compute its probability and store it in `trigram_lm` under the key given by the trigram.

In [395]:
trigram_lm = {}
for trigram in trigram_list:
    trigram_lm[trigram] = trigram_list.count(trigram) / bigram_list.count(trigram[0:2])   

In [396]:
trigram_lm

{('<s>', '<s>', 'I'): 0.6666666666666666,
 ('<s>', 'I', 'am'): 0.5,
 ('I', 'am', 'Sam'): 0.5,
 ('am', 'Sam', '</s>'): 1.0,
 ('Sam', '</s>', '</s>'): 1.0,
 ('<s>', '<s>', 'Sam'): 0.3333333333333333,
 ('<s>', 'Sam', 'I'): 1.0,
 ('Sam', 'I', 'am'): 1.0,
 ('I', 'am', '</s>'): 0.5,
 ('am', '</s>', '</s>'): 1.0,
 ('<s>', 'I', 'do'): 0.5,
 ('I', 'do', 'not'): 1.0,
 ('do', 'not', 'like'): 1.0,
 ('not', 'like', 'green'): 1.0,
 ('like', 'green', 'eggs'): 1.0,
 ('green', 'eggs', 'and'): 1.0,
 ('eggs', 'and', 'ham'): 1.0,
 ('and', 'ham', '</s>'): 1.0,
 ('ham', '</s>', '</s>'): 1.0}

**Tutorial question 5**: What is the probability of the trigram `('I', 'am', '</s>')`?

In [399]:
trigram_lm[('I', 'am', '</s>')]

0.5
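
An alternative to extending the bigram list by hand is to count each trigram's two-token prefix at the same time as the trigram itself. The sketch below takes that different route, assuming the toy corpus is hard-coded; `trigram_counts`, `prefix_counts` and `trigram_mle` are illustrative names.

```python
from collections import Counter

corpus = [
    ["I", "am", "Sam"],
    ["Sam", "I", "am"],
    ["I", "do", "not", "like", "green", "eggs", "and", "ham"],
]

n = 3
trigram_counts = Counter()
prefix_counts = Counter()
for sentence in corpus:
    padded = ["<s>"] * (n - 1) + sentence + ["</s>"] * (n - 1)
    for i in range(len(padded) - n + 1):
        trigram = tuple(padded[i:i + n])
        trigram_counts[trigram] += 1
        prefix_counts[trigram[:2]] += 1  # the first two tokens are the context

# MLE: P(w3 | w1 w2) = C(w1 w2 w3) / C(w1 w2)
trigram_mle = {
    trigram: count / prefix_counts[trigram[:2]]
    for trigram, count in trigram_counts.items()
}
print(trigram_mle[("I", "am", "</s>")])
```

Because the prefix `('<s>', '<s>')` is counted once per sentence, the denominator for sentence-initial trigrams comes out as 3 without any manual padding of the bigram list.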

## Real N-gram example

### Data

We'll use the Reuters corpus. Reuters is a news agency, and the corpus consists of 1.3M words in 10k news documents.

#### Download with NLTK

In [400]:
nltk.download('reuters')

[nltk_data] Downloading package reuters to
[nltk_data]     /Users/christine/nltk_data...
[nltk_data]   Package reuters is already up-to-date!


True

#### Load and look

In [401]:
from nltk.corpus import reuters

tokenized_text = reuters.sents()

for i, sentence in enumerate(tokenized_text):
    if (i % 100) == 0:
        print(sentence)
    if i > 1000:
        break

['ASIAN', 'EXPORTERS', 'FEAR', 'DAMAGE', 'FROM', 'U', '.', 'S', '.-', 'JAPAN', 'RIFT', 'Mounting', 'trade', 'friction', 'between', 'the', 'U', '.', 'S', '.', 'And', 'Japan', 'has', 'raised', 'fears', 'among', 'many', 'of', 'Asia', "'", 's', 'exporting', 'nations', 'that', 'the', 'row', 'could', 'inflict', 'far', '-', 'reaching', 'economic', 'damage', ',', 'businessmen', 'and', 'officials', 'said', '.']
['Now', 'it', "'", 's', 'largely', 'out', 'of', 'their', 'hands', ',"', 'said', 'Kleinwort', 'Benson', 'Ltd', 'financial', 'analyst', 'Simon', 'Smithson', '.']
['"', 'The', 'government', ',', 'however', ',', 'does', 'not', 'want', 'to', 'accelerate', 'reducing', 'the', 'debt', 'by', 'making', 'an', 'excessive', 'trade', 'surplus', ',"', 'he', 'said', '.']
['This', 'was', 'after', 'taking', 'into', 'account', 'inflation', 'in', 'consumer', 'prices', 'of', '1', '.', '5', 'pct', 'in', '1985', ',', 'slowing', 'to', '1', '.', '0', 'pct', 'in', '1986', '.']
['De', 'los', 'Angeles', "'", 'earli

### N-grams

To pad the sentences and get all N-grams up to order N, we can use a preprocessing pipeline from NLTK.
Here we want trigrams, so we set ```n = 3```.

In [402]:
from nltk.lm.preprocessing import padded_everygram_pipeline

n = 3
train_data, padded_sents = padded_everygram_pipeline(n, tokenized_text)

print("train_data:", train_data)
print("padded_sents:", padded_sents)

train_data: <generator object padded_everygram_pipeline.<locals>.<genexpr> at 0x17cb8ece0>
padded_sents: <itertools.chain object at 0x17e4825c0>


**Tutorial question 6**: What is a generator object in Python?

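A generator object is an iterator produced by a generator function (a function that uses `yield`) or by a generator expression. Each call to `next()` on it computes and returns the next value, until it raises a `StopIteration` exception, signalling that all values have been generated. The values are produced lazily, and a generator can only be consumed once.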
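
A small example may make this concrete. The sketch below uses a hypothetical generator function `count_up_to`, unrelated to NLTK, to show lazy evaluation and the fact that an exhausted generator yields nothing more.

```python
def count_up_to(limit):
    """Yield the integers 1..limit one at a time."""
    n = 1
    while n <= limit:
        yield n  # execution pauses here until the next value is requested
        n += 1


gen = count_up_to(3)
first = next(gen)      # runs the function body up to the first yield
rest = list(gen)       # consumes the remaining values
exhausted = list(gen)  # nothing is left, so this is empty
print(first, rest, exhausted)
```

This single-pass behaviour matters for `train_data` and `padded_sents`: once they have been iterated over, they have to be recreated before they can be used again.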

### Train language model with MLE

#### Initialize

In [403]:
from nltk.lm import MLE
model = MLE(n) # n = 3 (above)

print(model)

<nltk.lm.models.MLE object at 0x30c0adb90>


#### Check
The model should be empty.

In [404]:
len(model.vocab)

0

#### Fit and check again

**Tutorial question 7**: What does fit do, and how does it do it?

In [405]:
model.fit(train_data, padded_sents)

In [406]:
# How does it fit?
# It builds the vocabulary from padded_sents and counts the occurrences of
# every N-gram in train_data, i.e. unigrams, bigrams and trigrams.
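
Conceptually, fitting an MLE model amounts to collecting a vocabulary and counting N-grams of every order up to `n`. The following is a simplified sketch of that idea, not NLTK's actual implementation; `fit_counts` is a hypothetical helper, and unlike NLTK's vocabulary it does not store the pad symbols or an unknown-word label.

```python
from collections import Counter


def fit_counts(sentences, n):
    """Collect a vocabulary and the counts of all N-grams of order 1..n."""
    vocab = set()
    counts = Counter()
    for sentence in sentences:
        vocab.update(sentence)
        padded = ["<s>"] * (n - 1) + list(sentence) + ["</s>"] * (n - 1)
        for order in range(1, n + 1):
            for i in range(len(padded) - order + 1):
                counts[tuple(padded[i:i + order])] += 1
    return vocab, counts


toy_vocab, toy_counts = fit_counts([["I", "am", "Sam"], ["Sam", "I", "am"]], 3)
print(len(toy_vocab), toy_counts[("I", "am")])
```

No optimisation is involved: the MLE probabilities are later read off as ratios of these counts.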

We got a vocabulary with plenty of words.

**Tutorial question 8**: How many words are there in the vocabulary?

In [407]:
len(model.vocab)

41602

#### Out-of-vocabulary words

What will the model do with unknown words?

Let's try a sentence from class.

Try the sentence:

    The suicidal Norway lemming is small, angry & adorable.

You can test whether words are present in the vocabulary, and what happens if they are not present, with ```model.vocab.lookup()```.

To tokenize the sentence you can simply use ```"The suicidal Norway lemming is small, angry & adorable.".split()```

In [408]:
# tokens = "The suicidal Norway lemming is small, angry & adorable.".strip(".").split()

In [409]:
tokens = "The suicidal Norway lemming is small, angry & adorable.".strip(".").replace(",", "").split()

In [410]:
for token in tokens:
    print(token)

The
suicidal
Norway
lemming
is
small
angry
&
adorable


In [421]:
tokens

['The',
 'suicidal',
 'Norway',
 'lemming',
 'is',
 'small',
 'angry',
 '&',
 'adorable']

**Tutorial question 9**: What words were missing from the vocabulary, and what happened with those words?

In [424]:
# Look up each token in the vocabulary
model.vocab.lookup(tokens)
# 'lemming' and 'adorable' are missing from the vocabulary,
# so lookup() replaces them with the unknown-word label '<UNK>'.

('The', 'suicidal', 'Norway', '<UNK>', 'is', 'small', 'angry', '&', '<UNK>')
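
The mapping of unseen words to `<UNK>` can be illustrated with a few lines of plain Python. This is a sketch of the behaviour observed above, assuming a tiny made-up vocabulary; `lookup_with_unk` and `toy_vocab` are hypothetical and not part of NLTK.

```python
def lookup_with_unk(tokens, vocab, unk_label="<UNK>"):
    """Replace every token that is not in `vocab` with the unknown-word label."""
    return tuple(token if token in vocab else unk_label for token in tokens)


toy_vocab = {"The", "Norway", "is", "small"}
mapped = lookup_with_unk(["The", "Norway", "lemming", "is", "small"], toy_vocab)
print(mapped)
```

Because every unseen word collapses to the same label, the model treats all out-of-vocabulary words identically.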

### Use the N-gram LM

#### Check the number of N-grams.

In [413]:
print(model.counts)

<NgramCounter with 3 ngram orders and 5655195 ngrams>


Count the number of individual N-grams

Unigram

In [414]:
print(model.counts['angry'])

7


Bigram

In [415]:
# C(taking into): how often 'into' follows 'taking'
print(model.counts[['taking']]['into'])

11


Trigram

In [416]:
# C(the figure was): how often 'was' follows 'the figure'
print(model.counts[['the', 'figure']]['was'])

5


#### Check the N-gram probabilities

In [417]:
# P(angry)
print('P(angry):', model.score('angry'))

# P(into | taking)
print('P(into | taking):', model.score('into', ['taking']))

# P(was | the figure)
print('P(was | the figure):', model.score('was', ['the', 'figure']))

P(angry): 3.6086547914429515e-06
P(into | taking): 0.05851063829787234
P(was | the figure): 0.17857142857142858
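
Since an MLE score is a count divided by the count of its context, the printed counts and scores can be checked against each other. The sketch below copies the numbers printed above and works backwards to the context counts; those context counts are inferred from the outputs here, not read from the model.

```python
# Values copied from the outputs above.
count_taking_into = 11
score_into_given_taking = 0.05851063829787234
count_the_figure_was = 5
score_was_given_the_figure = 0.17857142857142858

# P(w | context) = C(context w) / C(context), so C(context) = C(context w) / P(w | context)
implied_count_taking = count_taking_into / score_into_given_taking
implied_count_the_figure = count_the_figure_was / score_was_given_the_figure
print(round(implied_count_taking), round(implied_count_the_figure))
```

Under this reading, `'taking'` occurs as a context about 188 times and `'the figure'` about 28 times, which could be confirmed directly from `model.counts`.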


How are unknown words dealt with?

In [418]:
model.score("<UNK>") == model.score("lemming")

True

#### Generate text

In [419]:
print(model.generate(20, random_seed=12))

# The generated sequence does not look like a natural sentence.

['a', 'relatively', 'new', 'problem', 'banks', 'in', 'Georgia', 'will', 'be', 'offered', 'intervention', 'grain', 'is', 'not', 'attached', 'to', 'a', 'select', 'committee', 'that']
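
Generation from an N-gram model works by repeatedly sampling the next word in proportion to how often it followed the current context. The following is a minimal bigram sketch on the toy corpus, assuming a simple sampling loop; NLTK's `generate` may differ in its details, and `generate_toy` is a hypothetical function.

```python
import random
from collections import Counter, defaultdict

corpus = [
    ["I", "am", "Sam"],
    ["Sam", "I", "am"],
    ["I", "do", "not", "like", "green", "eggs", "and", "ham"],
]

# For each context word, count which words follow it.
next_word_counts = defaultdict(Counter)
for sentence in corpus:
    padded = ["<s>"] + sentence + ["</s>"]
    for context, word in zip(padded, padded[1:]):
        next_word_counts[context][word] += 1


def generate_toy(max_words, seed):
    """Sample words one at a time until '</s>' is drawn or max_words is reached."""
    rng = random.Random(seed)
    context, output = "<s>", []
    for _ in range(max_words):
        candidates = next_word_counts[context]
        word = rng.choices(list(candidates), weights=list(candidates.values()))[0]
        if word == "</s>":
            break
        output.append(word)
        context = word
    return output


generated = generate_toy(10, seed=12)
print(generated)
```

Each word depends only on the one before it, so the output is locally plausible but has no long-range coherence, which is the same effect seen in the Reuters sample above.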


**Tutorial question 10**: Does the generated sequence look natural?

**Tutorial question 11**: If not, suggest two improvements.

In [420]:
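# How to improve:
# 1. Use a higher-order N-gram model (the current model already uses trigrams, n = 3).
# 2. Train on more data, or use smoothing/backoff so that unseen N-grams do not get zero probability.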