# 3: N-gram language model

In this notebook we'll build a bigram language model on Penn Treebank data.

We start by importing the Penn Treebank from nltk and splitting it into three lists: `train`, `dev` and `test`.

We'll assign the first 80% of sentences to `train`, the next 10% into `dev` and the remaining 10% into `test`.

In [1]:
from nltk.corpus import treebank

sents = treebank.sents()
n = len(sents)

train = sents[:int(0.8 * n)]
dev = sents[int(0.8 * n):int(0.9 * n)]
test = sents[int(0.9 * n):]

print(len(train))
print(len(dev))
print(len(test))

3131
391
392
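
The split boundaries come from truncating 80% and 90% of the corpus size. The following is a minimal sketch of the same arithmetic on a stand-in list. The helper name `split_80_10_10` is hypothetical, and a list of integers stands in for the 3914 treebank sentences (the total of the three lengths printed above).

```python
def split_80_10_10(items):
    """Split a sequence into the first 80%, the next 10% and the final 10%."""
    n = len(items)
    a, b = int(0.8 * n), int(0.9 * n)  # int() truncates towards zero
    return items[:a], items[a:b], items[b:]

# Stand-in for the corpus: 3914 dummy "sentences"
train_part, dev_part, test_part = split_80_10_10(list(range(3914)))
print(len(train_part), len(dev_part), len(test_part))
```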


Then, write a function `preprocess` which removes all trace tokens. These are tokens which contain the `"*"` character. Apply `preprocess` to `train`, `dev` and `test`.

In [2]:
def preprocess(sent):
    return [tok for tok in sent if "*" not in tok]

train = [preprocess(s) for s in train]
dev = [preprocess(s) for s in dev]
test = [preprocess(s) for s in test]
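
To see what the filter does, here is a small sketch that applies the same list comprehension to a made-up token list. The example sentence is hypothetical and only imitates the shape of treebank trace tokens such as `*-1`.

```python
def preprocess(sent):
    """Drop trace tokens, i.e. tokens that contain the '*' character."""
    return [tok for tok in sent if "*" not in tok]

# Hypothetical tokenized sentence with two trace tokens
example = ["The", "plan", "*-1", "to", "expand", "*T*-2", "failed", "."]
cleaned = preprocess(example)
print(cleaned)
```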

We now initialize an encoder which can encode word tokens into numbers. We'll use `sklearn.preprocessing.LabelEncoder` for this. 

You should generate a set `train_types` which contains every word type occurring in the training data. You should add three special symbols into `train_types`:

* `<UNK>` the unknown symbol,
* `<S>` the start of sentence symbol, and
* `</S>` the end of sentence symbol

Then initialize a `LabelEncoder` `word_enc` and fit it on `train_types`.

In [3]:
from sklearn.preprocessing import LabelEncoder

UNK="<UNK>"
START="<S>"
STOP="</S>"

train_types = {tok for sent in train for tok in sent}
train_types.add(UNK)
train_types.add(START)
train_types.add(STOP)

word_enc = LabelEncoder()
word_enc.fit(list(train_types))

LabelEncoder()

Before we can start training our language model, we'll need a function `encode_sent` which encodes a sentence into an array of feature numbers. 

`encode_sent` takes a list of tokens `s` as input, e.g. `["She", "works", "at", "the", "company"]`.

We replace every token which is not found in `train_types` with `UNK`. We also
add `START` at the beginning and `STOP` at the end of the sentence before calling `word_enc.transform` on it.

E.g.:

```
encode_sent("She works at the company".split())
array([  769,  3085, 10543,  4036,  9896,  4785,   768])
```
(note that the specific number values may differ)

In [4]:
def encode_sent(s):
    s = [START] + [tok if tok in train_types else UNK for tok in s] + [STOP]
    return word_enc.transform(s)
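
The padding and unknown-word replacement can be checked on a toy vocabulary. This is only a sketch: the names `toy_types`, `toy_index` and `toy_encode_sent` are hypothetical, and a plain dict over the sorted types stands in for the fitted `LabelEncoder`, which also assigns indices in sorted order.

```python
UNK, START, STOP = "<UNK>", "<S>", "</S>"

# Toy vocabulary; a dict over the sorted types stands in for the fitted encoder
toy_types = {"She", "works", "at", "the", UNK, START, STOP}
toy_index = {tok: i for i, tok in enumerate(sorted(toy_types))}

def toy_encode_sent(s):
    """Pad with START/STOP, map unseen tokens to UNK, return a list of indices."""
    padded = [START] + [tok if tok in toy_types else UNK for tok in s] + [STOP]
    return [toy_index[tok] for tok in padded]

# "company" is not in the toy vocabulary, so it is encoded as UNK
encoded = toy_encode_sent("She works at the company".split())
print(encoded)
```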

Now we initialize a numpy array `count_w1_w2` of size `len(train_types) x len(train_types)`. Initialize every element of the array to `1` (add-one smoothing).

We'll use this array to store counts for word pairs in the training set.

In [7]:
import numpy as np

count_w1_w2 = np.ones((len(train_types), len(train_types)))
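
A dense table grows quadratically with the vocabulary, so it can be worth estimating its size first. The sketch below assumes the default `float64` dtype of `np.ones`; the helper `dense_table_bytes` and the vocabulary size of 10,000 are hypothetical and chosen only for illustration.

```python
import numpy as np

toy = np.ones((5, 5))          # same constructor as above, tiny size
print(toy.dtype, toy.nbytes)   # float64 entries, 8 bytes each

def dense_table_bytes(vocab_size, itemsize=toy.dtype.itemsize):
    """Bytes needed for a dense vocab_size x vocab_size table."""
    return vocab_size * vocab_size * itemsize

# Hypothetical vocabulary size, chosen only for illustration
print(dense_table_bytes(10_000) / 1e9, "GB")
```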

Now implement a function `train_lm`. It takes two arguments:

* `data` a dataset, and
* `count_w1_w2` an array to store word bigram counts

You should use the nltk function `bigrams` to iterate through the bigrams of every sentence in `data`. First pass the sentence through `encode_sent` and then call `bigrams` on the output.

After populating `count_w1_w2` with bigram counts, we generate a second array `count_w1` of length `len(train_types)` which contains the row sums of `count_w1_w2`.

Return `count_w1_w2` and `count_w1`.

**Hint:** You can use `np.ndarray.sum` with `axis=1` to sum over the rows of `count_w1_w2`.

In [5]:
from nltk import bigrams

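def train_lm(data, count_w1_w2):
    # Count every bigram in the encoded sentences of `data`
    for s in data:
        for w1, w2 in bigrams(encode_sent(s)):
            count_w1_w2[w1, w2] += 1

    # Row sums: total smoothed count of the bigrams starting with each word
    return count_w1_w2, count_w1_w2.sum(axis=1)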

> Note that `count_w1_w2.sum(axis=1)` gives, for each word, the number of training bigrams in which it is the first element, plus `len(train_types)` from the add-one smoothing.
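
The effect of the smoothing on the row sums is easiest to see on a tiny table. The following sketch uses a hypothetical four-word vocabulary and one already-encoded sentence; `zip(seq, seq[1:])` stands in for nltk's `bigrams` and yields the same adjacent pairs.

```python
import numpy as np

V = 4                        # toy vocabulary size
counts = np.ones((V, V))     # add-one smoothing: every pair starts at 1
seq = [1, 2, 3, 2, 0]        # e.g. START, a, b, a, STOP as indices

# zip(seq, seq[1:]) yields the same adjacent pairs as nltk's bigrams(seq)
for w1, w2 in zip(seq, seq[1:]):
    counts[w1, w2] += 1

# Each row sum is V plus the number of bigrams that start with that word
row_sums = counts.sum(axis=1)
print(row_sums)
```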

Use `train_lm` to train `count_w1_w2` and `count_w1`.

In [8]:
count_w1_w2, count_w1 = train_lm(train,count_w1_w2)

Now, write a function `prob` which takes three arguments:

* `sent` a sentence like `["I", "am", "Sam"]` containing $n$ tokens,
* `count_w1_w2` a table of word pair counts
* `count_w1` a table of word counts

The function should return the log probability of the sentence:

$$\log p(s) = \log p(w_1|{\rm START}) + \Big( \sum_{i=1}^{n-1} \log p(w_{i+1}|w_i) \Big) + \log p({\rm STOP}|w_n)$$
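
Because `count_w1_w2` was initialized to ones, each conditional probability in this sum corresponds to the add-one estimate below. The symbols are introduced here for illustration: $c(w_i, w_{i+1})$ is the raw bigram count in the training data, $c(w_i)$ is the number of training bigrams that start with $w_i$, and $V$ is `len(train_types)`.

$$p(w_{i+1}|w_i) = \frac{c(w_i, w_{i+1}) + 1}{c(w_i) + V}$$

The numerator is the table entry for the pair in `count_w1_w2` and the denominator is the corresponding row sum in `count_w1`.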

In [9]:
def prob(sent, count_w1_w2, count_w1):
    """returns the log probability of the sentence"""
    log_prob = 0
    for w1, w2 in bigrams(encode_sent(sent)):
        log_prob += np.log(count_w1_w2[w1, w2]) - np.log(count_w1[w1])
        # P(w2|w1) = count(w1, w2) / count(w1)
    return log_prob

Compare the probabilities of the sentences "I like New York." and ". New I York like". 

The language model actually scores the correct order higher than the incorrect one (remember that log probabilities are negative and closer to 0 means that the sentence is more likely).

In [10]:
print(prob("I like New York .".split(), count_w1_w2, count_w1))
print(prob(". New I York like".split(), count_w1_w2, count_w1))

-37.632996193501334
-56.16365143231954


Now write a function `log_pp` which takes as arguments:
    
* `data` a dataset (dev or test) of $k$ sentences $S_i$
* `count_w1_w2` a table of word pair counts
* `count_w1` a table of word counts

The function should return the log perplexity of the language model on the data:

$$\frac{-1}{|S_1| + \dots + |S_k|} \cdot \sum_{i = 1}^{k} \log p(S_i)$$

Run `log_pp` on the development data. You should get around `8` as result.

In [11]:
def log_pp(data, count_w1_w2, count_w1):
    """returns the log perplexity of the language model"""
    data_prob = 0
    data_len = 0
    for s in data:
        data_prob += prob(s, count_w1_w2, count_w1)
        data_len += len(s)
    return -(1/data_len) * data_prob

log_pp(dev, count_w1_w2, count_w1)

8.260413159592071

Run `log_pp` on the test set. The result should be around `8` again.

In [12]:
print(log_pp(test, count_w1_w2, count_w1))

8.33955878691596
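
Since `prob` uses the natural logarithm, exponentiating the log perplexity gives the perplexity itself. A small sketch using the two values printed above (the variable names are hypothetical):

```python
import math

dev_log_pp = 8.260413159592071   # value printed for the development set above
test_log_pp = 8.33955878691596   # value printed for the test set above

# exp() undoes the natural log, giving the perplexity per token
dev_pp = math.exp(dev_log_pp)
test_pp = math.exp(test_log_pp)
print(round(dev_pp), round(test_pp))
```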


In this final part, we will generate sentences from the language model.

We'll start with a seed sentence `s = [START]` and add tokens to this sentence iteratively.

At each step, we sample the next token from the conditional distribution:

$$P(tok|w_n)$$

where $w_n$ is the current final token in `s`.

Start by encoding $w_n$ into an index number `w_ind` using `word_enc`. Then form the conditional distribution `distr` for the next token by normalizing the `w_ind`th row in `count_w1_w2` with `count_w1[w_ind]`.

You should now initialize a variable `accum = 0`. We'll use it to keep track of the cumulative probability. You should also sample a random number `r` using `numpy.random.random()`.

Then iterate over the indices `i` in `distr` updating the cumulative probability with `distr[i]`. When the cumulative probability grows over `r`, we've found the index of the next token `i`.

Use `word_enc.inverse_transform` to transform `i` into a word token and append it to `s`.

Continue until you sample the end of sequence token `STOP` or your sentence length is 20.

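Print a couple of sentences. The results will not be very impressive: a bigram model is quite a weak language model, and add-one smoothing over a vocabulary of this size gives unseen word pairs a large share of the probability mass. A trigram model would likely generate better outputs.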

In [16]:
count_w1_w2

array([[1., 1., 1., ..., 1., 1., 1.],
       [1., 1., 1., ..., 1., 1., 1.],
       [1., 1., 1., ..., 1., 1., 1.],
       ...,
       [1., 1., 1., ..., 1., 1., 1.],
       [1., 1., 1., ..., 1., 1., 1.],
       [1., 1., 1., ..., 1., 1., 1.]])

In [24]:
s = [START]

while s[-1] != STOP and len(s) < 21:
    w_ind = word_enc.transform([s[-1]])[0]          # transform returns an array with one value
    distr = count_w1_w2[w_ind] / count_w1[w_ind]     # conditional distribution P(tok | s[-1])
    accum = 0
    r = np.random.random()
    for i, p in enumerate(distr):
        accum += p
        if r < accum:
            tok = word_enc.inverse_transform([i])[0]
            s.append(tok)
            break
print(s)

['<S>', 'pick', 'mainly', 'Mercantile', '1,880', 'passenger', 'tire', 'unjust', 'Deere', 'mature', 'Gray', '3\\/4', '2005', 'chips', 'transaction', 'LANDOR', 'banks', 'speeches', 'HHS', 'Miguel', 'warranties']
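
The inner loop above is inverse-CDF sampling: it finds the first index whose cumulative probability exceeds `r`. As a sketch of an alternative formulation, not the notebook's method, the same search can be written with `np.cumsum` and `np.searchsorted`. The helper name `sample_index` and the toy distribution are hypothetical, and `r` is fixed here so the result is reproducible.

```python
import numpy as np

def sample_index(distr, r):
    """Return the first index whose cumulative probability exceeds r."""
    cdf = np.cumsum(distr)
    i = int(np.searchsorted(cdf, r, side="right"))
    # Guard against rounding leaving cdf[-1] slightly below 1.0
    return min(i, len(distr) - 1)

toy_distr = np.array([0.1, 0.2, 0.3, 0.4])
picks = [sample_index(toy_distr, r) for r in (0.05, 0.25, 0.59, 0.95)]
print(picks)

# In the generation loop, r would come from np.random.random()
```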
