## Toy example

Let's look at the miniature corpus from the class:

    I am Sam. 
    Sam I am. 
    I do not like green eggs and ham.
    
### Pre-processing

#### Tokenize the text into sentences

Split the text into a list of sentences.

In [None]:
import nltk
nltk.download('punkt')
# Newer NLTK releases may need nltk.download('punkt_tab') instead.

In [None]:
from nltk.tokenize import sent_tokenize

text = "I am Sam. Sam I am. I do not like green eggs and ham."

sentences = sent_tokenize(text)

#### Tokenize the sentences into lists of words

The text will become a list of lists.

In [None]:
from nltk import word_tokenize 
tokenized_text = []

# replace('.', '') removes the period.
for sentence in sentences:
    tokenized_text.append(word_tokenize(sentence.replace('.', '')))

**Tutorial question 1**: How many lists are in `tokenized_text`?

### Get the N-grams

We'll get unigrams, bigrams and trigrams.

Note that these are not yet the N-gram **language model**, only the two- and three-word sequences it is built from.

Let's start with **bigrams**:

In [None]:
from nltk import bigrams

bigram_list = []

for sentence in tokenized_text:
    bigram_list.extend(list(bigrams(sentence, pad_right=True, pad_left=True, right_pad_symbol='</s>', left_pad_symbol='<s>')))

**Tutorial question 2:** What is the last bigram in `bigram_list`?

**Trigrams**:

In [None]:
from nltk import trigrams

trigram_list = []

Populate `trigram_list` with trigrams using `trigrams` from NLTK.
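
As a hint for what the padding arguments do, here is a minimal pure-Python sketch. The helper name `padded_ngrams` is hypothetical (it is not part of NLTK), and it assumes that padding adds `n - 1` symbols on each side, which is how NLTK pads when both `pad_left` and `pad_right` are set.

```python
def padded_ngrams(tokens, n, left_pad='<s>', right_pad='</s>'):
    """Return the n-grams of `tokens` after padding both sides with n - 1 symbols."""
    padded = [left_pad] * (n - 1) + list(tokens) + [right_pad] * (n - 1)
    # Slide a window of length n over the padded sequence.
    return [tuple(padded[i:i + n]) for i in range(len(padded) - n + 1)]


# A sentence that is not in the toy corpus, used only for illustration.
example = ['green', 'eggs']
print(padded_ngrams(example, 2))
print(padded_ngrams(example, 3))
```

Note that a trigram needs two padding symbols on each side, whereas a bigram needs only one.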

**Tutorial question 3**: What is the third trigram in `trigram_list`?

### N-gram language model

For the bigram model, we first need the counts of the individual tokens.

In [None]:
unigram_list = []

for sentence in tokenized_text:
    unigram_list.extend(sentence)
    unigram_list.append('<s>')
    unigram_list.append('</s>')

print(unigram_list)

#### Maximum Likelihood Estimate (MLE) bigram model

Populate the dictionary `bigram_lm` with the bigrams as keys and their corresponding probabilities as values.
That is, for each bigram, compute its probability and store it in `bigram_lm` under the key given by the bigram.
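
For reference, the MLE estimate of a bigram probability is the bigram count divided by the count of its first token. Here $C(\cdot)$ denotes a count in the training data (the notation is chosen for illustration):

$$
P(w_i \mid w_{i-1}) = \frac{C(w_{i-1}\, w_i)}{C(w_{i-1})}
$$

For example, `am` occurs twice in the toy corpus and is followed by `Sam` once, so $P(\text{Sam} \mid \text{am}) = 1/2$.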

In [None]:
bigram_lm = {}

In [None]:
for bigram in bigram_list:
    # MLE estimate: C(w1 w2) / C(w1)
    bigram_lm[bigram] = bigram_list.count(bigram) / unigram_list.count(bigram[0])
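
The loop above calls `list.count` for every bigram, which rescans the lists each time. That is fine for the toy corpus. For larger inputs, a common alternative is to count everything once with `collections.Counter`. The following is a minimal sketch on a made-up token list, where the names `toy_tokens`, `toy_bigrams` and `toy_lm` are illustrative and not part of the notebook:

```python
from collections import Counter

# A tiny padded sequence, used only for illustration.
toy_tokens = ['<s>', 'a', 'b', 'a', 'b', 'a', '</s>']
toy_bigrams = list(zip(toy_tokens, toy_tokens[1:]))

# Count every token and every bigram in a single pass each.
unigram_counts = Counter(toy_tokens)
bigram_counts = Counter(toy_bigrams)

# MLE estimate: C(w1 w2) / C(w1), computed once per distinct bigram.
toy_lm = {
    (w1, w2): count / unigram_counts[w1]
    for (w1, w2), count in bigram_counts.items()
}

print(toy_lm[('a', 'b')])
```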

**Tutorial question 4**: What is the probability of the bigram `('I', 'do')`?

#### MLE trigram model

First we need to add ```<s><s>``` and ```</s></s>``` to `bigram_list`, so that the contexts of the padded trigrams, such as `('<s>', '<s>')`, can be counted.

In [None]:
left_pads = []
right_pads = []

print('Before:\n\t', bigram_list)

for bigram in bigram_list:
    if '<s>' in bigram:
        left_pads.append(('<s>', '<s>'))
    if '</s>' in bigram:
        right_pads.append(('</s>', '</s>'))
            
bigram_list.extend(left_pads)
bigram_list.extend(right_pads)

print('After:\n\t', bigram_list)

Populate the dictionary `trigram_lm` with the trigrams as keys and their corresponding probabilities as values. That is, for each trigram, compute its probability and store it in `trigram_lm` under the key given by the trigram.
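
For reference, the trigram MLE estimate conditions on the two preceding tokens, so the denominator is a bigram count. Here $C(\cdot)$ denotes a count in the training data (the notation is chosen for illustration):

$$
P(w_i \mid w_{i-2}, w_{i-1}) = \frac{C(w_{i-2}\, w_{i-1}\, w_i)}{C(w_{i-2}\, w_{i-1})}
$$

This is why `bigram_list` was extended with padding bigrams: a trigram such as `('<s>', '<s>', 'I')` has the context `('<s>', '<s>')`.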

In [None]:
trigram_lm = {}

**Tutorial question 5**: What is the probability of the trigram `('I', 'am', '</s>')`?

## Real N-gram example

### Data

We'll use the Reuters corpus. Reuters is a news agency, and the corpus consists of 1.3M words in 10k news documents.

#### Download with NLTK

In [None]:
nltk.download('reuters')

#### Load and look

In [None]:
from nltk.corpus import reuters

tokenized_text = reuters.sents()

for i, sentence in enumerate(tokenized_text):
    if (i % 100) == 0:
        print(sentence)
    if i > 1000:
        break

### N-grams

To pad and get all N-grams up to N, we can use a preprocessing pipeline from NLTK.
Here we want trigrams, so we set ```n = 3```.

In [None]:
from nltk.lm.preprocessing import padded_everygram_pipeline

n = 3
train_data, padded_sents = padded_everygram_pipeline(n, tokenized_text)

print("train_data:", train_data)
print("padded_sents:", padded_sents)
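
Both printed objects are lazy iterators rather than lists, so they can only be consumed once. The minimal sketch below uses a hypothetical generator function `count_up`, unrelated to NLTK, to show the behaviour to keep in mind: if you loop over `train_data` to inspect it, you would need to call `padded_everygram_pipeline` again before fitting.

```python
def count_up(limit):
    """Yield 0, 1, ..., limit - 1 one value at a time."""
    for value in range(limit):
        yield value


numbers = count_up(3)
first_pass = list(numbers)   # consumes the generator
second_pass = list(numbers)  # nothing left to yield

print(first_pass)
print(second_pass)
```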

**Tutorial question 6**: What is a generator object in Python?

### Train language model with MLE

#### Initialize

In [None]:
from nltk.lm import MLE
model = MLE(n) # n = 3 (above)

print(model)

#### Check
The model should be empty.

In [None]:
len(model.vocab)

#### Fit and check again

**Tutorial question 7**: What does fit do, and how does it do it?

In [None]:
model.fit(train_data, padded_sents)

We got a vocabulary with plenty of words.

**Tutorial question 8**: How many words are there in the vocabulary?

#### Out-of-vocabulary words

What will the model do with unknown words?

Let's try a sentence from class:

    The suicidal Norway lemming is small, angry & adorable.

You can test whether words are present in the vocabulary, and what happens if they are not present, with ```model.vocab.lookup()```.

To tokenize the sentence, you can simply use ```"The suicidal Norway lemming is small, angry & adorable.".split()```

In [None]:
tokens = "The suicidal Norway lemming is small, angry & adorable.".strip('.').split()
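
Conceptually, a lookup keeps the words the vocabulary knows and replaces everything else with an unknown-word label. The sketch below imitates that idea in plain Python. The names `toy_vocab` and `toy_lookup` are hypothetical, and the `<UNK>` label is an assumption chosen to match the token used later in this notebook. It is not NLTK's implementation.

```python
def toy_lookup(words, vocab, unk_label='<UNK>'):
    """Map each word to itself if it is in `vocab`, otherwise to `unk_label`."""
    return [word if word in vocab else unk_label for word in words]


# A made-up vocabulary, used only for illustration.
toy_vocab = {'the', 'cat', 'sat'}
print(toy_lookup(['the', 'dog', 'sat'], toy_vocab))
```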

**Tutorial question 9**: Which words were missing from the vocabulary, and what happened to those words?

### Use the N-gram LM

#### Check the number of N-grams

In [None]:
print(model.counts)

Count the occurrences of individual N-grams:

Unigram

In [None]:
print(model.counts['angry'])

Bigram

In [None]:
# C(taking into): how often 'into' follows 'taking'
print(model.counts[['taking']]['into'])

Trigram

In [None]:
# C(the figure was): how often 'was' follows 'the figure'
print(model.counts[['the', 'figure']]['was'])

#### Check the N-gram probabilities

In [None]:
# P(angry)
print('P(angry):', model.score('angry'))

# P(into | taking)
print('P(into | taking):', model.score('into', ['taking']))

# P(was | the figure)
print('P(was | the figure):', model.score('was', ['the', 'figure']))

How are unknown words dealt with?

In [None]:
model.score("<UNK>") == model.score("lemming")

#### Generate text

In [None]:
print(model.generate(20, random_seed=12))
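
To see what generation from an N-gram model involves, here is a minimal bigram sampler in plain Python. It is a sketch under stated assumptions, not how NLTK implements `generate`: the corpus `toy_sentences`, the helper `toy_generate` and its parameters are all made up for illustration. At each step, the next word is sampled in proportion to how often it followed the current word.

```python
import random
from collections import Counter, defaultdict

# A made-up, already padded corpus, used only for illustration.
toy_sentences = [['<s>', 'i', 'am', 'sam', '</s>'],
                 ['<s>', 'sam', 'i', 'am', '</s>']]

# successors[w1][w2] = number of times w2 followed w1
successors = defaultdict(Counter)
for sentence in toy_sentences:
    for w1, w2 in zip(sentence, sentence[1:]):
        successors[w1][w2] += 1


def toy_generate(max_words, seed=None):
    """Sample words after '<s>' until '</s>' is drawn or `max_words` is reached."""
    rng = random.Random(seed)
    current, output = '<s>', []
    for _ in range(max_words):
        candidates = successors[current]
        words = list(candidates)
        weights = list(candidates.values())
        # Draw the next word with probability proportional to its count.
        current = rng.choices(words, weights=weights)[0]
        if current == '</s>':
            break
        output.append(current)
    return output


print(toy_generate(10, seed=12))
```

Because each word depends only on the single preceding word, the output can wander between sentence fragments, which is worth keeping in mind when judging the text generated above.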

**Tutorial question 10**: Does the generated sequence look natural?

**Tutorial question 11**: If not, suggest two improvements.