<a href="https://colab.research.google.com/github/kyunghyuncho/ammi-2019-nlp/blob/master/01-day-LM/ngram_lm.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Language Modeling

### Goal: compute a probability distribution over all possible sentences:


### $$p(W) = p(w_1, w_2, ..., w_T)$$

### This unsupervised learning problem can be framed as a sequence of supervised learning problems:

### $$p(W) = p(w_1) * p(w_2|w_1) * ... * p(w_T|w_1, ..., w_{T-1})$$

### If we have $K$ sentences, where the $j$-th sentence has $T_j$ words for $j$ from 1 to $K$, then we want to maximize:

### $$\log p(W) = \sum_{j = 1}^K \sum_{i=1}^{T_j} \log p(w_i | w_{<i})$$




# N-gram language model

### Goal: estimate the n-gram probabilities using counts of sequences of n consecutive words

### Given a sequence of words $w$, we want to compute

###  $$P(w_i|w_{i−1}, w_{i−2}, …, w_{i−n+1})$$

### Where $w_i$ is the i-th word of the sequence.

### $$P(w_i|w_{i−n+1}, ..., w_{i−2}, w_{i−1}) = \frac{p(w_{i−n+1}, ..., w_{i−2}, w_{i−1}, w_i)}{\sum_{w \in V} p(w_{i−n+1}, ..., w_{i−2}, w_{i−1}, w)}$$

### Key Idea: We can estimate the probabilities using counts of n-grams in our dataset 


## N-gram Probabilities

## $$P(w_i|w_{i−n+1}, ..., w_{i−2}, w_{i−1}) \approx \frac{c(w_{i−n+1}, ..., w_{i−2}, w_{i−1}, w_i)}{\sum_{w \in V} c(w_{i−n+1}, ..., w_{i−2}, w_{i−1}, w)}$$


## Bigram Probabilities

## $$p(w_i | w_{i-1}) = \frac{c(w_{i-1}, w_i)}{\sum_{w \in V} c(w_{i-1}, w)} $$
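
The bigram estimate above can be sketched with plain counters. This is a minimal illustration on a made-up toy corpus, not the `NgramLM` implementation used later in the notebook; the names `bigram_counts`, `context_counts` and `bigram_prob` are assumptions introduced here.

```python
from collections import Counter

# Toy corpus (illustrative only, not the Amazon review data).
sentences = [
    ["i", "like", "this", "shirt"],
    ["i", "like", "this", "dress"],
    ["i", "love", "this", "shirt"],
]

bigram_counts = Counter()   # c(w_{i-1}, w_i)
context_counts = Counter()  # sum_w c(w_{i-1}, w)
for sent in sentences:
    for prev, cur in zip(sent, sent[1:]):
        bigram_counts[(prev, cur)] += 1
        context_counts[prev] += 1


def bigram_prob(prev, cur):
    """Maximum-likelihood estimate c(prev, cur) / sum_w c(prev, w)."""
    if context_counts[prev] == 0:
        return 0.0  # unseen context: no estimate without smoothing
    return bigram_counts[(prev, cur)] / context_counts[prev]


print(bigram_prob("i", "like"))      # 2 of the 3 bigrams that start with "i"
print(bigram_prob("this", "shirt"))  # 2 of the 3 bigrams that start with "this"
```

Because the denominator sums the counts of every bigram that shares the context, the estimates for a seen context always sum to one over the vocabulary.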


In [1]:
!python -m spacy download en_core_web_sm

Collecting en_core_web_sm==2.2.5
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.2.5/en_core_web_sm-2.2.5.tar.gz (12.0MB)
Building wheels for collected packages: en-core-web-sm
  Building wheel for en-core-web-sm (setup.py) ... done
  Created wheel for en-core-web-sm: filename=en_core_web_sm-2.2.5-cp37-none-any.whl size=12011740 sha256=8cb8034e9729b493e53d366e76d3804d7b91bd9697e4ac9ba9865185a0124c57
  Stored in directory: /tmp/pip-ephem-wheel-cache-3pk37hi8/wheels/6a/47/fb/6b5a0b8906d8e8779246c67d4658fd8a544d4a03a75520197a
Successfully built en-core-web-sm
Installing collected packages: en-core-web-sm
Successfully installed en-core-web-sm-2.2.5
✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')


In [20]:
# !pip install altair
# !pip install pygtrie

Collecting pygtrie
  Downloading https://files.pythonhosted.org/packages/18/41/2e5cefc895a32d9ca0f3574bd0df09e53a697023579a93582bedc4eeac4d/pygtrie-2.3.2.tar.gz
Building wheels for collected packages: pygtrie
  Building wheel for pygtrie (setup.py) ... done
  Created wheel for pygtrie: filename=pygtrie-2.3.2-cp37-none-any.whl size=18868 sha256=14d2a7030f0dcddf56575070975728572e911341e23e229e84647d192d38fe43
  Stored in directory: /home/aims/.cache/pip/wheels/1c/10/3c/2d28c8ac56cda265d0c16ca129f50e5c3526f49a7fbe224cd9
Successfully built pygtrie
Installing collected packages: pygtrie
Successfully installed pygtrie-2.3.2


In [28]:
# !ls

amazon_dataset.ipynb  neural_lm.ipynb	     ngram_lm.ipynb	   utils
ken_lm.ipynb	      neural_lm_small.ipynb  ngram_lm_small.ipynb


In [33]:
import os
import sys
# sys.path.append('utils/')
from c_utils import ngram_utils as ngram_utils
import c_utils.global_variables as gl
import torch
import random
from c_utils.ngram_utils import NgramLM

In [32]:
torch.manual_seed(1)

<torch._C.Generator at 0x7ff677864f50>

### Load Data from .txt Files

In [36]:
# Read data from .txt files and create lists of reviews

train_data = []
# create a list of all the reviews 
with open('../../data/amazon_train.txt', 'r') as f:  # adjust this path to match your folder structure
    train_data = [review for review in f.read().split('\n') if review]
    
valid_data = []
# create a list of all the reviews 
with open('../../data/amazon_valid.txt', 'r') as f:
    valid_data = [review for review in f.read().split('\n') if review]
    

In [37]:
# type(train_data), len(train_data), \
# type(train_data[0]), len(train_data[0]), \
# type(train_data[0][0]), len(train_data[0][0])

In [38]:
train_data[0], train_data[0][0], len(train_data)


("this is a great tutu and at a really great price . it doesn ' t look cheap at all . i ' m so glad i looked on amazon and found such an affordable tutu that isn ' t made poorly . a + + ",
 't',
 22288)

### Process the Data

In [39]:
# Tokenize the Datasets
# TODO: this takes a really long time !! why?
train_data_tokenized, all_tokens_train = ngram_utils.tokenize_dataset(train_data)
valid_data_tokenized, all_tokens_valid = ngram_utils.tokenize_dataset(valid_data)



Let's look at the tokenized data!

In [40]:
# # Number of All Tokens
# len(all_tokens_train), all_tokens_train[0], \
len(train_data_tokenized), train_data_tokenized[0]

(107790,
 ['this',
  'is',
  'a',
  'great',
  'tutu',
  'and',
  'at',
  'a',
  'really',
  'great',
  'price',
  '.'])

In [41]:
train_ngram_lm = NgramLM(train_data_tokenized, all_tokens_train, n=3, smoothing=None)
valid_ngram_lm = NgramLM(valid_data_tokenized, all_tokens_valid, n=3, smoothing=None)

In [42]:
train_ngram_lm.trie_ngram['./<eos>/<eos>']

96175

In [44]:
train_ngram_lm.n, train_ngram_lm.frac_vocab

(3, 0.9)

In [45]:
valid_ngram_lm.id2token[0:10]

['<unk>', '<sos>', '<eos>', '.', 'the', 'i', ',', 'and', 'a', 'it']

In [46]:
valid_ngram_lm.token2id['<unk>'], valid_ngram_lm.token2id['<sos>'], valid_ngram_lm.token2id['the']

(0, 1, 4)

In [47]:
valid_ngram_lm.vocab_ngram[:10], valid_ngram_lm.count_ngram[:10]

((('.', '<eos>', '<eos>'),
  ('<sos>', '<sos>', 'i'),
  ('<sos>', '<sos>', 'the'),
  ('<sos>', '<sos>', 'it'),
  ('!', '<eos>', '<eos>'),
  ('<sos>', '<sos>', 'this'),
  ('it', "'", 's'),
  ('.', '.', '.'),
  ('.', '.', '<eos>'),
  ('<sos>', '<sos>', 'they')),
 (13625, 3635, 1425, 1100, 1049, 762, 687, 655, 580, 569))

In [48]:
valid_ngram_lm.vocab_bigram[:10], valid_ngram_lm.count_bigram[:10]

((('.', '<eos>'),
  ('<sos>', 'i'),
  ('<sos>', 'the'),
  ("'", 't'),
  ("'", 's'),
  ('.', '.'),
  ('<sos>', 'it'),
  ('!', '<eos>'),
  (',', 'and'),
  (',', 'but')),
 (13625, 3635, 1425, 1261, 1249, 1238, 1100, 1049, 900, 838))

In [49]:
valid_ngram_lm.vocab_unigram[:10], valid_ngram_lm.count_unigram[:10]

((('.',),
  ('the',),
  ('i',),
  (',',),
  ('and',),
  ('a',),
  ('it',),
  ('to',),
  ("'",),
  ('is',)),
 (14883, 9408, 8000, 7525, 6226, 5774, 5085, 4550, 3816, 3695))

In [50]:
valid_ngram_lm.vocab_prev_ngram[:10], valid_ngram_lm.count_prev_ngram[:10]

((('.', '<eos>'),
  ('<sos>', 'i'),
  ('<sos>', 'the'),
  ("'", 't'),
  ("'", 's'),
  ('.', '.'),
  ('<sos>', 'it'),
  ('!', '<eos>'),
  (',', 'and'),
  (',', 'but')),
 (13625, 3635, 1425, 1261, 1249, 1238, 1100, 1049, 900, 838))

In [51]:
valid_ngram_lm.id2token_ngram[:10]

[('.', '<eos>', '<eos>'),
 ('<sos>', '<sos>', 'i'),
 ('<sos>', '<sos>', 'the'),
 ('<sos>', '<sos>', 'it'),
 ('!', '<eos>', '<eos>'),
 ('<sos>', '<sos>', 'this'),
 ('it', "'", 's'),
 ('.', '.', '.'),
 ('.', '.', '<eos>'),
 ('<sos>', '<sos>', 'they')]

In [52]:
valid_ngram_lm.token2id_ngram[('.', '<eos>', '<eos>')], valid_ngram_lm.token2id_ngram[('.', '.', '<eos>')]

(0, 8)

#### Build the Vocabulary 


In [53]:
# Build a vocabulary using all the tokens found in train data (90% of most common ones)
print('Word vocabulary size: {} words'.format(len(train_ngram_lm.token2id)))        

Word vocabulary size: 20806 words
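
A vocabulary of this shape (special tokens first, then tokens by decreasing frequency) could be built roughly as follows. This is a hedged sketch: `build_vocab` and `encode` are hypothetical helpers, and reading `frac_vocab` as "keep this share of the distinct tokens" is an assumption, since the `NgramLM` source is not shown here.

```python
from collections import Counter

SPECIAL_TOKENS = ["<unk>", "<sos>", "<eos>"]


def build_vocab(all_tokens, frac_vocab=0.9):
    """Keep the most frequent `frac_vocab` share of distinct tokens (assumed meaning)."""
    counts = Counter(all_tokens)
    n_keep = int(len(counts) * frac_vocab)
    most_common = [tok for tok, _ in counts.most_common(n_keep)]
    id2token = SPECIAL_TOKENS + most_common
    token2id = {tok: idx for idx, tok in enumerate(id2token)}
    return id2token, token2id


def encode(tokens, token2id):
    """Map tokens to ids; anything outside the vocabulary becomes <unk>."""
    unk = token2id["<unk>"]
    return [token2id.get(tok, unk) for tok in tokens]


tokens = "the shirt is great . the dress is great . the tutu is ok .".split()
id2token, token2id = build_vocab(tokens, frac_vocab=0.9)
print(id2token[:5])
print(encode(["the", "boots"], token2id))
```

Mapping rare and unseen words to `<unk>` keeps the vocabulary size $|V|$ fixed, which the smoothing formulas later in the notebook rely on.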


### CORPUS ANALYSIS (Train + Valid Data)

#### Number of Tokens in the Corpus Data


In [54]:
print("Number of All Tokens ", len(all_tokens_train))

Number of All Tokens  1623446


#### Number of Sentences in the Train Data


In [55]:
print("Number of Sentences ", len(train_ngram_lm.raw_data))

Number of Sentences  107790


## N-grams

In [56]:
n = 3 # trigrams

### Function for padding the sentences with special markers for sentence beginning and end, i.e. $<sos>$ and $<eos>$

In [57]:
train_padded = train_ngram_lm.padded_data
train_ngram = train_ngram_lm.ngram_data
vocab_ngram = train_ngram_lm.vocab_ngram
count_ngram = train_ngram_lm.count_ngram 

In [58]:
train_padded[0]

['<sos>',
 '<sos>',
 'this',
 'is',
 'a',
 'great',
 'tutu',
 'and',
 'at',
 'a',
 'really',
 'great',
 'price',
 '.',
 '<eos>',
 '<eos>']
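
The padded sentence shown above has `n - 1` start markers and `n - 1` end markers for `n = 3`. A minimal sketch of such a padding step follows; `pad_sentence` is a hypothetical helper written for illustration, not necessarily how `NgramLM` does it.

```python
def pad_sentence(tokens, n, sos="<sos>", eos="<eos>"):
    """Surround a tokenized sentence with n - 1 start and n - 1 end markers."""
    pad = max(n - 1, 0)
    return [sos] * pad + list(tokens) + [eos] * pad


padded = pad_sentence(["this", "is", "a", "great", "tutu"], n=3)
print(padded)
```

The start markers give the first real word a full `n - 1` word context, so every position in the sentence can be scored with the same n-gram formula.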

### Function for finding all N-grams

In [59]:
train_ngram[0]

[('<sos>', '<sos>', 'this'),
 ('<sos>', 'this', 'is'),
 ('this', 'is', 'a'),
 ('is', 'a', 'great'),
 ('a', 'great', 'tutu'),
 ('great', 'tutu', 'and'),
 ('tutu', 'and', 'at'),
 ('and', 'at', 'a'),
 ('at', 'a', 'really'),
 ('a', 'really', 'great'),
 ('really', 'great', 'price'),
 ('great', 'price', '.'),
 ('price', '.', '<eos>'),
 ('.', '<eos>', '<eos>')]
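
The list above is what a sliding window of width `n` over the padded sentence produces. A short sketch, where `find_ngrams` is a hypothetical name chosen for this example:

```python
def find_ngrams(padded_tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(padded_tokens[i:i + n]) for i in range(len(padded_tokens) - n + 1)]


padded = ["<sos>", "<sos>", "great", "price", ".", "<eos>", "<eos>"]
for ngram in find_ngrams(padded, n=3):
    print(ngram)
```

A padded sentence of length `L` yields `L - n + 1` n-grams, and an input shorter than `n` yields an empty list.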

In [60]:
vocab_ngram[0]

('.', '<eos>', '<eos>')

In [61]:
count_ngram[0]

96175

In [62]:
trie_ngram = train_ngram_lm.trie_ngram
# trie_ngram
# trie_prev_ngram = train_ngram_lm.trie_prev_ngram

In [63]:
trie_ngram['./<eos>/<eos>']

96175

In [64]:
id2token = train_ngram_lm.id2token
token2id = train_ngram_lm.token2id

In [65]:
id2token_ngram = train_ngram_lm.id2token_ngram
token2id_ngram = train_ngram_lm.token2id_ngram

In [66]:
random_token_id = random.randint(0, len(id2token_ngram) - 1)
random_token = id2token_ngram[random_token_id]

print ("Token id {} ; token {}".format(random_token_id, id2token_ngram[random_token_id]))
print ("Token {}; token id {}".format(random_token, token2id_ngram[random_token]))

Token id 567850 ; token ('this', 'shirt', 'by')
Token ('this', 'shirt', 'by'); token id 567850


### Ngram Count & Probability

In [67]:
# TODO: print the words for which the pd is nonzero !!! -- more intuitive than a list of numbers

In [68]:
vocab_ngram[:10], count_ngram[:10]

((('.', '<eos>', '<eos>'),
  ('<sos>', '<sos>', 'i'),
  ('<sos>', '<sos>', 'the'),
  ('!', '<eos>', '<eos>'),
  ('<sos>', '<sos>', 'they'),
  ('<sos>', '<sos>', 'it'),
  ('.', '.', '.'),
  ('<sos>', '<sos>', 'this'),
  ('<sos>', '<sos>', 'these'),
  ('.', '.', '<eos>')),
 (96175, 26986, 9197, 8152, 6376, 5373, 4693, 4189, 3941, 3876))

In [69]:
c = train_ngram_lm.get_ngram_count(('an', 'older', 'coat'))
p = train_ngram_lm.get_ngram_prob(('an', 'older', 'coat'))

p1 = train_ngram_lm.get_ngram_prob(('an', 'older', 'pc'))
p2 = train_ngram_lm.get_ngram_prob(('an', 'older', 'lady'))
p3 = train_ngram_lm.get_ngram_prob(('an', 'older', 'watch'))

pd = train_ngram_lm.get_prob_distr_ngram(('an', 'older'))

c, p, p1, p2, p3, sum(pd)#, pd

(1, 0.04, 0.04, 0.04, 0.0, 1.0)

In [70]:
c = train_ngram_lm.get_ngram_count(('really', 'great', 'price'))
p = train_ngram_lm.get_ngram_prob(('really', 'great', 'price'))
pd = train_ngram_lm.get_prob_distr_ngram(('really', 'great'))

c, p, sum(pd)#, pd 

(3, 0.06521739130434782, 1.0)

In [71]:
c = train_ngram_lm.get_ngram_count(('really', 'great'))

c

0

In [72]:
c = train_ngram_lm.get_ngram_count(('.', '<eos>', '<eos>'))
p = train_ngram_lm.get_ngram_prob(('.', '<eos>', '<eos>'))
pd = train_ngram_lm.get_prob_distr_ngram(('.', '<eos>'))

c, p, sum(pd)#, pd

(96175, 1.0, 1.0)

In [73]:
c = train_ngram_lm.get_ngram_count(('.', '<sos>', '<sos>'))

c

0

In [74]:
c = train_ngram_lm.get_ngram_count(('i', 'like', 'pandas'))
p = train_ngram_lm.get_ngram_prob(('i', 'like', 'pandas'))
pd = train_ngram_lm.get_prob_distr_ngram(('i', 'like'))

c, p, sum(pd)#, pd

(0, 0, 0.9999999999999897)

In [75]:
c = train_ngram_lm.get_ngram_count(('is', 'a', 'great'))
p = train_ngram_lm.get_ngram_prob(('is', 'a', 'great'))
pd = train_ngram_lm.get_prob_distr_ngram(('is', 'a'))

c, p, sum(pd)#, pd

(266, 0.09761467889908257, 1.0000000000000142)

In [76]:
c = train_ngram_lm.get_ngram_count(('send', 'it', 'back'))
p = train_ngram_lm.get_ngram_prob(('send', 'it', 'back'))
pd = train_ngram_lm.get_prob_distr_ngram(('send', 'it', 'back'))

c, p, sum(pd)#, pd

(28, 0.9032258064516129, 1.0000000000000657)

In [77]:
c = train_ngram_lm.get_ngram_count(('i', 'like', 'these', 'pictures'))
p = train_ngram_lm.get_ngram_prob(('i', 'like', 'these', 'pictures'))
pd = train_ngram_lm.get_prob_distr_ngram(('i', 'like', 'these'))

c, p, sum(pd)#, pd

(0, 0, 1.0000000000000657)

## Add-One Smoothing

## $$P(w_i|w_{i−n+1}, ..., w_{i−2}, w_{i−1}) \approx \frac{1 + c(w_{i−n+1}, ..., w_{i−2}, w_{i−1}, w_i)}{\mid V\mid + \sum_{w \in V} c(w_{i−n+1}, ..., w_{i−2}, w_{i−1}, w)}$$


In [78]:
p = train_ngram_lm.get_ngram_prob_add_one_smoothing(('.', '<sos>', '<sos>'))
p

4.806305873305777e-05

In [79]:
p = train_ngram_lm.get_ngram_prob_add_one_smoothing(('i', 'like', 'pandas'))
p

4.504098729844158e-05

In [80]:
p = train_ngram_lm.get_ngram_prob_add_one_smoothing(('i', 'like', 'this'))
p

0.004323934780650392

In [81]:
p = train_ngram_lm.get_ngram_prob_add_one_smoothing(('really', 'great', 'price'))
p

0.0001918281220026856

In [82]:
p = train_ngram_lm.get_ngram_prob_add_one_smoothing(('send', 'it', 'back'))
p

0.0013917550511110045

In [83]:
p = train_ngram_lm.get_ngram_prob_add_one_smoothing(('.', '<eos>', '<eos>'))
p

0.8221506056539096

## Additive Smoothing

## $$P(w_i|w_{i−n+1}, ..., w_{i−2}, w_{i−1}) \approx \frac{\delta + c(w_{i−n+1}, ..., w_{i−2}, w_{i−1}, w_i)}{\delta\mid V\mid + \sum_{w \in V} c(w_{i−n+1}, ..., w_{i−2}, w_{i−1}, w)}$$
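
The formula can be written as a small function. The sketch below uses a toy vocabulary and toy trigram counts; `additive_prob` and its arguments are names assumed for this example and are not the signature of `get_ngram_prob_additive_smoothing`. Setting `delta=1.0` gives add-one smoothing.

```python
from collections import Counter


def additive_prob(ngram, ngram_counts, context_counts, vocab_size, delta=1.0):
    """(delta + c(context, w)) / (delta * |V| + sum_w c(context, w)); requires delta > 0."""
    ngram = tuple(ngram)
    numerator = delta + ngram_counts[ngram]
    denominator = delta * vocab_size + context_counts[ngram[:-1]]
    return numerator / denominator


# Toy data (illustrative only).
vocab = ["i", "like", "love", "this", "shirt", "pandas"]
trigrams = [("i", "like", "this"), ("i", "like", "this"), ("i", "like", "shirt")]
ngram_counts = Counter(trigrams)
context_counts = Counter(tg[:-1] for tg in trigrams)

for delta in (1.0, 0.5, 0.1):
    seen = additive_prob(("i", "like", "this"), ngram_counts, context_counts, len(vocab), delta)
    unseen = additive_prob(("i", "like", "pandas"), ngram_counts, context_counts, len(vocab), delta)
    print(delta, seen, unseen)
```

When the context itself was never observed, the estimate reduces to $\delta / (\delta |V|) = 1/|V|$ for any $\delta$. With the vocabulary size of 20806 reported earlier, $1/|V| \approx 4.806 \times 10^{-5}$, which is consistent with the identical values that the add-one and $\delta = 0.5$ cells print for `('.', '<sos>', '<sos>')`.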


In [84]:
p = train_ngram_lm.get_ngram_prob_additive_smoothing(('.', '<sos>', '<sos>'), delta = 0.5)
p

4.806305873305777e-05

In [85]:
p = train_ngram_lm.get_ngram_prob_additive_smoothing(('i', 'like', 'pandas'), delta = 0.5)
p

4.237647258242224e-05

In [86]:
p = train_ngram_lm.get_ngram_prob_additive_smoothing(('i', 'like', 'this'), delta = 0.5)
p

0.008093906263242648

In [87]:
p = train_ngram_lm.get_ngram_prob_additive_smoothing(('really', 'great', 'price'), delta = 0.5)
p

0.00033496028328069673

In [88]:
p = train_ngram_lm.get_ngram_prob_additive_smoothing(('send', 'it', 'back'), delta = 0.5)
p

0.0027314548591144336

In [89]:
p = train_ngram_lm.get_ngram_prob_additive_smoothing(('.', '<eos>', '<eos>'), delta = 0.5)
p

0.9023954287001069

### Changing the Parameter $\delta$

In [90]:
# small delta --> closer to no smoothing  (1.0)
p = train_ngram_lm.get_ngram_prob_additive_smoothing(('.', '<eos>', '<eos>'), delta = 0.1)
p

0.9788256343658784

In [91]:
# large delta --> closer to add-one smoothing (0.82)
p = train_ngram_lm.get_ngram_prob_additive_smoothing(('.', '<eos>', '<eos>'), delta = 0.9)
p

0.8370371208455323

## Linear Interpolation Smoothing (Jelinek-Mercer)

### $$P(w_i|w_{i-n+1}, ..., w_{i-2}, w_{i-1}) \approx \alpha_n P(w_i|w_{i-n+1}, ..., w_{i-2}, w_{i-1}) + (1 - \alpha_n) P(w_i|w_{i-n+2}, ..., w_{i-2}, w_{i-1})$$
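
As a rough illustration of the formula for $n = 3$, the sketch below mixes an unsmoothed trigram estimate with an unsmoothed bigram estimate on a toy token stream. The helper names (`mle_prob`, `interpolated_prob`) are assumptions for this example, and treating the lower-order model as a plain maximum-likelihood bigram is also an assumption about how the interpolation is set up.

```python
from collections import Counter


def mle_prob(ngram, counts, context_counts):
    """Unsmoothed estimate c(ngram) / c(context); 0.0 for an unseen context."""
    ngram = tuple(ngram)
    denom = context_counts[ngram[:-1]]
    return counts[ngram] / denom if denom > 0 else 0.0


def interpolated_prob(trigram, tri, tri_ctx, bi, bi_ctx, alpha=0.8):
    """alpha * P(w | two previous words) + (1 - alpha) * P(w | previous word)."""
    p_high = mle_prob(trigram, tri, tri_ctx)
    p_low = mle_prob(trigram[1:], bi, bi_ctx)
    return alpha * p_high + (1 - alpha) * p_low


# Toy token stream (illustrative only).
tokens = "<sos> <sos> i like this shirt . i like the dress . <eos> <eos>".split()
tri = Counter(zip(tokens, tokens[1:], tokens[2:]))
tri_ctx = Counter(t[:-1] for t in tri.elements())
bi = Counter(zip(tokens, tokens[1:]))
bi_ctx = Counter(b[:-1] for b in bi.elements())

# Trigram never seen, but the bigram ('.', '<eos>') was: the result is non-zero.
print(interpolated_prob(("shirt", ".", "<eos>"), tri, tri_ctx, bi, bi_ctx, alpha=0.8))
# Neither the trigram nor the bigram was seen: the result stays zero.
print(interpolated_prob(("i", "like", "pandas"), tri, tri_ctx, bi, bi_ctx, alpha=0.8))
```

Interpolation only helps when some lower-order estimate is non-zero; a word that never follows the previous word still gets probability zero unless the recursion continues down to unigrams or a uniform distribution.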


In [92]:
p = train_ngram_lm.get_ngram_prob_interpolation_smoothing(('.', '<sos>', '<sos>'), alpha = 0.8)
p

0.0

In [93]:
p = train_ngram_lm.get_ngram_prob_interpolation_smoothing(('i', 'like', 'pandas'), alpha = 0.8)
p

0.0

In [94]:
p = train_ngram_lm.get_ngram_prob_interpolation_smoothing(('i', 'like', 'this'), alpha = 0.8)
p

0.054441260744985676

In [95]:
p = train_ngram_lm.get_ngram_prob_interpolation_smoothing(('really', 'great', 'price'), alpha = 0.8)
p

0.052173913043478265

In [96]:
p = train_ngram_lm.get_ngram_prob_interpolation_smoothing(('send', 'it', 'back'), alpha = 0.8)
p

0.7225806451612904

### Changing the Parameter $\alpha$

In [97]:
# large alpha --> closer to no smoothing (1.0)
p = train_ngram_lm.get_ngram_prob_interpolation_smoothing(('.', '<eos>', '<eos>'), alpha = 0.8)
p

0.8

In [98]:
# alpha = 0.5 --> equal weight on the n-gram and the lower-order estimate
p = train_ngram_lm.get_ngram_prob_interpolation_smoothing(('.', '<eos>', '<eos>'), alpha = 0.5)
p

0.5

In [99]:
# small alpha --> more weight on the lower-order estimate
p = train_ngram_lm.get_ngram_prob_interpolation_smoothing(('.', '<eos>', '<eos>'), alpha = 0.2)
p

0.2

## Linear Interpolation with Absolute Discounting

### $$p_{bi}(w|v) = \max\left(\frac{N(v, w) - b_{bi}}{N(v)}, 0\right) + b_{bi} \frac{V - N_0(v, \cdot)}{N(v)} p_{uni}(w)$$

### $$p_{uni}(w) = \max\left(\frac{N(w) - b_{uni}}{N}, 0\right) + b_{uni} \frac{V - N_0(\cdot)}{N} \frac{1}{V}$$

### $$b_{bi} = \frac{N_1(\cdot, \cdot)}{N_1(\cdot, \cdot) + 2*N_2(\cdot, \cdot)}$$

### $$b_{uni} = \frac{N_1(\cdot)}{N_1(\cdot) + 2*N_2(\cdot)}$$


### $$N_r(\cdot) = \sum_{w: N(w) = r} 1$$

### $$N_r(\cdot, \cdot) = \sum_{v, w: N(v, w) = r} 1$$

### $$N_r(v, \cdot) = \sum_{w: N(v, w) = r} 1$$

### V is the number of words in the vocabulary

### $N_r(\cdot, \cdot)$ and $N_r(\cdot)$ are the count-counts for bigrams and unigrams, respectively


### Remember to check that probabilities sum up to one:
### $$\sum_w p_{bi}(w|v) = \sum_w p_{uni}(w) = 1$$
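
The definitions above translate fairly directly into code. The following is a sketch on a toy token stream, not the notebook's `get_p_bi`: the corpus is made up, sentences are concatenated so that `('<eos>', '<sos>')` bigrams are counted, $N(v)$ is taken as $\sum_w N(v, w)$, and falling back to $p_{uni}$ for an unseen context is an assumption added here.

```python
from collections import Counter


def discount(counts):
    """b = N_1 / (N_1 + 2 * N_2), computed from the count-counts of `counts`."""
    n1 = sum(1 for c in counts.values() if c == 1)
    n2 = sum(1 for c in counts.values() if c == 2)
    return n1 / (n1 + 2 * n2) if n1 + 2 * n2 > 0 else 0.0


tokens = ("<sos> i like this shirt <eos> <sos> i like this dress <eos> "
          "<sos> i love it <eos>").split()
vocab = sorted(set(tokens) | {"pandas"})     # includes one word never seen in the data
V = len(vocab)

uni = Counter(tokens)                        # N(w)
bi = Counter(zip(tokens, tokens[1:]))        # N(v, w)
ctx = Counter(v for v, _ in bi.elements())   # N(v) = sum_w N(v, w)
followers = Counter(v for v, _ in bi)        # V - N_0(v, .): distinct words seen after v
N = sum(uni.values())
b_uni, b_bi = discount(uni), discount(bi)


def p_uni(w):
    seen_types = len(uni)                    # V - N_0(.)
    return max((uni[w] - b_uni) / N, 0.0) + b_uni * seen_types / N / V


def p_bi(w, v):
    if ctx[v] == 0:
        return p_uni(w)                      # assumption: unseen context falls back to p_uni
    return max((bi[(v, w)] - b_bi) / ctx[v], 0.0) + b_bi * followers[v] / ctx[v] * p_uni(w)


print(sum(p_uni(w) for w in vocab))          # should be close to 1
print(sum(p_bi(w, "i") for w in vocab))      # should be close to 1
print(p_bi("pandas", "i"))                   # small but non-zero
```

The mass removed by subtracting $b$ from every seen count is exactly what the second term redistributes, which is why both distributions sum to one.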



In [100]:
# y = "m"
# x = "'"

# z = train_ngram_lm.get_p_bi(y, x)
# z

In [101]:
train_ngram[:3]

[[('<sos>', '<sos>', 'this'),
  ('<sos>', 'this', 'is'),
  ('this', 'is', 'a'),
  ('is', 'a', 'great'),
  ('a', 'great', 'tutu'),
  ('great', 'tutu', 'and'),
  ('tutu', 'and', 'at'),
  ('and', 'at', 'a'),
  ('at', 'a', 'really'),
  ('a', 'really', 'great'),
  ('really', 'great', 'price'),
  ('great', 'price', '.'),
  ('price', '.', '<eos>'),
  ('.', '<eos>', '<eos>')],
 [('<sos>', '<sos>', 'it'),
  ('<sos>', 'it', 'doesn'),
  ('it', 'doesn', "'"),
  ('doesn', "'", 't'),
  ("'", 't', 'look'),
  ('t', 'look', 'cheap'),
  ('look', 'cheap', 'at'),
  ('cheap', 'at', 'all'),
  ('at', 'all', '.'),
  ('all', '.', '<eos>'),
  ('.', '<eos>', '<eos>')],
 [('<sos>', '<sos>', 'i'),
  ('<sos>', 'i', "'"),
  ('i', "'", 'm'),
  ("'", 'm', 'so'),
  ('m', 'so', 'glad'),
  ('so', 'glad', 'i'),
  ('glad', 'i', 'looked'),
  ('i', 'looked', 'on'),
  ('looked', 'on', 'amazon'),
  ('on', 'amazon', 'and'),
  ('amazon', 'and', 'found'),
  ('and', 'found', 'such'),
  ('found', 'such', 'an'),
  ('such', 'an', 'af

## Kneser-Ney Smoothing (best to use in practice!) http://smithamilli.com/blog/kneser-ney/

### Bigram LM
###  $$p(s) = \prod_{i = 1} ^ {N + 1} p(w_i | w_{i-1})$$

## Likelihood of a Sentence

### Bigram LM: $$ p(i \; love \; this \; light) = p(i|\cdot) \; p(love|i)\;  p(this|love)\;  p(light|this) \\
\approx \frac{c(\cdot, \; i)}{\sum_w c(\cdot, \; w)} \; \frac{c(i, \; love)}{\sum_w c(i, \; w)}\;  \frac{c(love, \; this)}{\sum_w c(love, \;w)}\;  \frac{c(this, \; light)}{\sum_w c(this, \;w)}$$ 

### Trigram LM: $$ p(i \; love \; this  \;light) = p(i|\cdot, \cdot) \; p(love|\cdot, i) \; p(this|i, love)\;  p(light|love, this)$$ 
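
The bigram product can be sketched as follows, summing log-probabilities instead of multiplying to avoid underflow on long sentences. The training sentences are made up and `sentence_log_prob` is a hypothetical helper; the end marker is included as the final prediction, following the product over $N + 1$ terms given above.

```python
import math
from collections import Counter

# Toy training data (illustrative only).
train = [
    ["i", "love", "this", "light"],
    ["i", "love", "this", "dress"],
    ["i", "like", "this", "light"],
]

bigram_counts, context_counts = Counter(), Counter()
for sent in train:
    padded = ["<sos>"] + sent + ["<eos>"]
    for prev, cur in zip(padded, padded[1:]):
        bigram_counts[(prev, cur)] += 1
        context_counts[prev] += 1


def sentence_log_prob(sentence):
    """Sum of log p(w_i | w_{i-1}) over the sentence plus the end marker."""
    padded = ["<sos>"] + list(sentence) + ["<eos>"]
    total = 0.0
    for prev, cur in zip(padded, padded[1:]):
        count = bigram_counts[(prev, cur)]
        if count == 0:
            return float("-inf")  # one unseen bigram makes the unsmoothed product zero
        total += math.log(count / context_counts[prev])
    return total


print(math.exp(sentence_log_prob(["i", "love", "this", "light"])))
print(sentence_log_prob(["i", "love", "pandas"]))
```

A single unseen n-gram is enough to drive the unsmoothed sentence probability to zero, which is one way a plausible sentence can end up with probability 0.0.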



### Score Sentences

In [102]:
n = 3
sentence = [['this', 'is', 'a', 'great', 'tutu']]
print(sentence)
ps = train_ngram_lm.get_prob_sentence(sentence)
ss =  train_ngram_lm.get_score_sentence(sentence)
ps, ss

[['this', 'is', 'a', 'great', 'tutu']]


(0.0, 5.127565397867753e+77)

In [103]:
n = 3
sentence = [['this', 'is', 'a', 'great', 'tutu', 'and', 'at', 'a', 'really', 'great', 'price', '.']]
print(sentence)
ps = train_ngram_lm.get_prob_sentence(sentence)
ss = train_ngram_lm.get_score_sentence(sentence)
ps, ss

[['this', 'is', 'a', 'great', 'tutu', 'and', 'at', 'a', 'really', 'great', 'price', '.']]


(4.2919413175965264e-13, 6.675591427844734)

## Sentence Generation

#### No Context

In [104]:
num_tokens = 5
generated_sentence = train_ngram_lm.generate_sentence(num_tokens)
generated_sentence


the
the sateen
the sateen fabric
the sateen fabric is
the sateen fabric is 34


'the sateen fabric is 34'

In [105]:
num_tokens = 10
generated_sentence = train_ngram_lm.generate_sentence(num_tokens, context=('i', 'like', 'the'))
generated_sentence


picture
picture .
picture . <eos>


'picture . <eos>'

In [106]:
num_tokens = 20
generated_sentence = train_ngram_lm.generate_sentence(num_tokens)
generated_sentence


i
i was
i was excited
i was excited because
i was excited because i
i was excited because i wear
i was excited because i wear them
i was excited because i wear them outside
i was excited because i wear them outside or
i was excited because i wear them outside or to
i was excited because i wear them outside or to work
i was excited because i wear them outside or to work .
i was excited because i wear them outside or to work . <eos>


'i was excited because i wear them outside or to work . <eos>'

#### With Context

In [107]:
num_tokens = 5
generated_sentence = train_ngram_lm.generate_sentence(num_tokens, context=('i', 'like', 'the'))
generated_sentence


medium
medium with
medium with sports
medium with sports bras
medium with sports bras because


'medium with sports bras because'

In [108]:
num_tokens = 10
generated_sentence = train_ngram_lm.generate_sentence(num_tokens, context=('i', 'like', 'the'))
generated_sentence


boots
boots are
boots are over
boots are over 6
boots are over 6 years
boots are over 6 years now
boots are over 6 years now ,
boots are over 6 years now , i
boots are over 6 years now , i am
boots are over 6 years now , i am 5


'boots are over 6 years now , i am 5'

In [109]:
num_tokens = 20
generated_sentence = train_ngram_lm.generate_sentence(num_tokens, context=('i', 'like', 'the'))
generated_sentence


band
band was
band was still
band was still working
band was still working great
band was still working great just
band was still working great just as
band was still working great just as there
band was still working great just as there is
band was still working great just as there is nothing
band was still working great just as there is nothing like
band was still working great just as there is nothing like the
band was still working great just as there is nothing like the spanx
band was still working great just as there is nothing like the spanx run
band was still working great just as there is nothing like the spanx run a
band was still working great just as there is nothing like the spanx run a bit
band was still working great just as there is nothing like the spanx run a bit after
band was still working great just as there is nothing like the spanx run a bit after wearing
band was still working great just as there is nothing like the spanx run a bit after wearing these
band was s

'band was still working great just as there is nothing like the spanx run a bit after wearing these .'

In [110]:
num_tokens = 10
generated_sentence = train_ngram_lm.generate_sentence(num_tokens, context=('the', 'worst'))
generated_sentence


bras
bras i
bras i bought
bras i bought these
bras i bought these for
bras i bought these for narrows
bras i bought these for narrows after
bras i bought these for narrows after finding
bras i bought these for narrows after finding the
bras i bought these for narrows after finding the size


'bras i bought these for narrows after finding the size'

In [111]:
num_tokens = 5
generated_sentence = train_ngram_lm.generate_sentence(num_tokens, context=('the', 'best'))
generated_sentence


.
. <eos>


'. <eos>'

In [112]:
num_tokens = 5
generated_sentence = train_ngram_lm.generate_sentence(num_tokens, context=('not', 'what'))
generated_sentence


arrived
arrived at
arrived at the
arrived at the center
arrived at the center .


'arrived at the center .'

In [113]:
num_tokens = 5
generated_sentence = train_ngram_lm.generate_sentence(num_tokens, context=('i', 'will'))
generated_sentence


give
give it
give it to
give it to a
give it to a dangle


'give it to a dangle'
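
A generation loop of this kind can be sketched as repeated sampling from the next-word distribution given the last `n - 1` tokens. The code below is an assumption about the general approach (sampling proportional to raw counts, stopping at `<eos>`), not the source of `generate_sentence`; all function names are hypothetical.

```python
import random
from collections import Counter, defaultdict


def train_next_word_counts(sentences, n):
    """Map each (n - 1)-token context to a Counter of the words that follow it."""
    nxt = defaultdict(Counter)
    for sent in sentences:
        padded = ["<sos>"] * (n - 1) + list(sent) + ["<eos>"]
        for i in range(len(padded) - n + 1):
            context, word = tuple(padded[i:i + n - 1]), padded[i + n - 1]
            nxt[context][word] += 1
    return nxt


def generate(nxt, n, num_tokens, context=(), rng=random):
    """Sample up to num_tokens words, each proportional to its count after the context."""
    history = ["<sos>"] * (n - 1) + list(context)
    generated = []
    for _ in range(num_tokens):
        key = tuple(history[-(n - 1):]) if n > 1 else ()
        options = nxt.get(key)
        if not options:
            break  # context never seen in training: nothing to sample from
        words, weights = zip(*options.items())
        word = rng.choices(words, weights=weights, k=1)[0]
        generated.append(word)
        history.append(word)
        if word == "<eos>":
            break
    return " ".join(generated)


corpus = [["i", "like", "this", "shirt", "."]]
model = train_next_word_counts(corpus, n=3)
print(generate(model, n=3, num_tokens=10))
print(generate(model, n=3, num_tokens=2, context=("i", "like")))
```

With a one-sentence corpus every context has a single continuation, so the output is deterministic; on real data the sampled word varies from run to run unless the random generator is seeded.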

## Log-Likelihood (n-gram)
## $$LL = \sum_{j=1}^{K} \sum_{i=1}^{T_j + 1} \log p(w_{j, i} | w_{j, i - n + 1}, \dots, w_{j, i - 2}, w_{j, i - 1})$$

## Perplexity
## $$PP = \exp\left(-\frac{LL}{\sum_j(T_j + 1)}\right)$$
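
The two formulas combine into a short routine. In this sketch `perplexity` and the `cond_prob(word, context)` callback are hypothetical names, and returning infinity as soon as a zero probability appears is a design choice made here; `get_perplexity` may handle that case differently.

```python
import math


def perplexity(sentences, cond_prob, n):
    """exp(-LL / number of predictions), with one <eos> prediction per sentence."""
    log_likelihood, num_predictions = 0.0, 0
    for sent in sentences:
        padded = ["<sos>"] * (n - 1) + list(sent) + ["<eos>"]
        for i in range(n - 1, len(padded)):
            p = cond_prob(padded[i], tuple(padded[i - n + 1:i]))
            if p <= 0.0:
                return float("inf")  # a zero probability makes the log-likelihood -inf
            log_likelihood += math.log(p)
            num_predictions += 1
    if num_predictions == 0:
        raise ValueError("no sentences to evaluate")
    return math.exp(-log_likelihood / num_predictions)


def uniform_model(word, context):
    """A model that spreads probability evenly over a 4-word vocabulary."""
    return 0.25


data = [["i", "like", "this"], ["i", "love", "it"]]
print(perplexity(data, uniform_model, n=3))
```

A uniform model over $|V|$ words has perplexity $|V|$, which gives a useful reference point: a trained model should score well below the vocabulary size on held-out text.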

In [114]:
ppl_train = train_ngram_lm.get_perplexity(train_data_tokenized, subsample=10)
ppl_valid = train_ngram_lm.get_perplexity(valid_data_tokenized, subsample=10)


In [115]:
ppl_valid, ppl_train

(1.054956118557522e+16, 785.4807625424293)

### Interpolation Smoothing - varying N

In [116]:
# Interpolation Smoothing, N = 2
train_ngram_lm_interp2 = NgramLM(train_data_tokenized, all_tokens_train, n=2, smoothing='interpolation')
valid_ngram_lm_interp2 = NgramLM(valid_data_tokenized, all_tokens_valid, n=2, smoothing='interpolation')

ppl_train_no_interp2 = train_ngram_lm_interp2.get_perplexity(train_data_tokenized, subsample=10)
ppl_valid_no_interp2 = train_ngram_lm_interp2.get_perplexity(valid_data_tokenized, subsample=10)

ppl_valid_no_interp2, ppl_train_no_interp2


(3198493286877.7236, 1789.0177288795328)

In [117]:
# Interpolation Smoothing, N = 3
train_ngram_lm_interp3 = NgramLM(train_data_tokenized, all_tokens_train, n=3, smoothing='interpolation')
valid_ngram_lm_interp3 = NgramLM(valid_data_tokenized, all_tokens_valid, n=3, smoothing='interpolation')

ppl_train_no_interp3 = train_ngram_lm_interp3.get_perplexity(train_data_tokenized, subsample=10)
ppl_valid_no_interp3 = train_ngram_lm_interp3.get_perplexity(valid_data_tokenized, subsample=10)

ppl_valid_no_interp3, ppl_train_no_interp3


KeyboardInterrupt: 

In [None]:
# Interpolation Smoothing, N = 5
train_ngram_lm_interp5 = NgramLM(train_data_tokenized, all_tokens_train, n=5, smoothing='interpolation')
valid_ngram_lm_interp5 = NgramLM(valid_data_tokenized, all_tokens_valid, n=5, smoothing='interpolation')

ppl_train_no_interp5 = train_ngram_lm_interp5.get_perplexity(train_data_tokenized, subsample=10)
ppl_valid_no_interp5 = train_ngram_lm_interp5.get_perplexity(valid_data_tokenized, subsample=10)

ppl_valid_no_interp5, ppl_train_no_interp5


In [None]:
# Interpolation Smoothing, N = 7
train_ngram_lm_interp7 = NgramLM(train_data_tokenized, all_tokens_train, n=7, smoothing='interpolation')
valid_ngram_lm_interp7 = NgramLM(valid_data_tokenized, all_tokens_valid, n=7, smoothing='interpolation')

ppl_train_no_interp7 = train_ngram_lm_interp7.get_perplexity(train_data_tokenized, subsample=10)
ppl_valid_no_interp7 = train_ngram_lm_interp7.get_perplexity(valid_data_tokenized, subsample=10)

ppl_valid_no_interp7, ppl_train_no_interp7


In [None]:
# Interpolation Smoothing, N = 10
train_ngram_lm_interp10 = NgramLM(train_data_tokenized, all_tokens_train, n=10, smoothing='interpolation')
valid_ngram_lm_interp10 = NgramLM(valid_data_tokenized, all_tokens_valid, n=10, smoothing='interpolation')

ppl_train_no_interp10 = train_ngram_lm_interp10.get_perplexity(train_data_tokenized, subsample=10)
ppl_valid_no_interp10 = train_ngram_lm_interp10.get_perplexity(valid_data_tokenized, subsample=10)

ppl_valid_no_interp10, ppl_train_no_interp10


### Let's Compare Different Smoothing Techniques

In [None]:
# No Smoothing
train_ngram_lm_no_smoothing = NgramLM(train_data_tokenized, all_tokens_train, n=7)
valid_ngram_lm_no_smoothing = NgramLM(valid_data_tokenized, all_tokens_valid, n=7)

ppl_train_no_smoothing = train_ngram_lm_no_smoothing.get_perplexity(train_data_tokenized, subsample=10)
ppl_valid_no_smoothing = train_ngram_lm_no_smoothing.get_perplexity(valid_data_tokenized, subsample=10)

ppl_valid_no_smoothing, ppl_train_no_smoothing


In [None]:
# Additive Smoothing
train_ngram_lm_additive = NgramLM(train_data_tokenized, all_tokens_train, n=7, smoothing='additive', delta=0.5)
valid_ngram_lm_additive = NgramLM(valid_data_tokenized, all_tokens_valid, n=7, smoothing='additive', delta=0.5)

ppl_train_no_additive = train_ngram_lm_additive.get_perplexity(train_data_tokenized, subsample=10)
ppl_valid_no_additive = train_ngram_lm_additive.get_perplexity(valid_data_tokenized, subsample=10)

ppl_valid_no_additive, ppl_train_no_additive


In [None]:
# Additive Smoothing
train_ngram_lm_additive_d2 = NgramLM(train_data_tokenized, all_tokens_train, n=7, smoothing='additive', delta=0.2)
valid_ngram_lm_additive_d2 = NgramLM(valid_data_tokenized, all_tokens_valid, n=7, smoothing='additive', delta=0.2)

ppl_train_no_additive_d2 = train_ngram_lm_additive_d2.get_perplexity(train_data_tokenized, subsample=10)
ppl_valid_no_additive_d2 = train_ngram_lm_additive_d2.get_perplexity(valid_data_tokenized, subsample=10)

ppl_valid_no_additive_d2, ppl_train_no_additive_d2


In [None]:
# Additive Smoothing
train_ngram_lm_additive_d8 = NgramLM(train_data_tokenized, all_tokens_train, n=7, smoothing='additive', delta=0.8)
valid_ngram_lm_additive_d8 = NgramLM(valid_data_tokenized, all_tokens_valid, n=7, smoothing='additive', delta=0.8)

ppl_train_no_additive_d8 = train_ngram_lm_additive_d8.get_perplexity(train_data_tokenized, subsample=10)
ppl_valid_no_additive_d8 = train_ngram_lm_additive_d8.get_perplexity(valid_data_tokenized, subsample=10)

ppl_valid_no_additive_d8, ppl_train_no_additive_d8


In [None]:
# Add-One Smoothing
train_ngram_lm_add1 = NgramLM(train_data_tokenized, all_tokens_train, n=7, smoothing='add-one')
valid_ngram_lm_add1 = NgramLM(valid_data_tokenized, all_tokens_valid, n=7, smoothing='add-one')

ppl_train_no_add1 = train_ngram_lm_add1.get_perplexity(train_data_tokenized, subsample=10)
ppl_valid_no_add1 = train_ngram_lm_add1.get_perplexity(valid_data_tokenized, subsample=10)

ppl_valid_no_add1, ppl_train_no_add1


In [None]:
# Interpolation Smoothing
train_ngram_lm_interp_a2 = NgramLM(train_data_tokenized, all_tokens_train, n=7, smoothing='interpolation', alpha=0.2)
valid_ngram_lm_interp_a2 = NgramLM(valid_data_tokenized, all_tokens_valid, n=7, smoothing='interpolation', alpha=0.2)

ppl_train_no_interp_a2 = train_ngram_lm_interp_a2.get_perplexity(train_data_tokenized, subsample=10)
ppl_valid_no_interp_a2 = train_ngram_lm_interp_a2.get_perplexity(valid_data_tokenized, subsample=10)

ppl_valid_no_interp_a2, ppl_train_no_interp_a2


In [None]:
# Interpolation Smoothing
train_ngram_lm_interp_a8 = NgramLM(train_data_tokenized, all_tokens_train, n=7, smoothing='interpolation', alpha=0.8)
valid_ngram_lm_interp_a8 = NgramLM(valid_data_tokenized, all_tokens_valid, n=7, smoothing='interpolation', alpha=0.8)

ppl_train_no_interp_a8 = train_ngram_lm_interp_a8.get_perplexity(train_data_tokenized, subsample=10)
ppl_valid_no_interp_a8 = train_ngram_lm_interp_a8.get_perplexity(valid_data_tokenized, subsample=10)

ppl_valid_no_interp_a8, ppl_train_no_interp_a8


In [None]:
# Interpolation Smoothing
train_ngram_lm_interp_a5 = NgramLM(train_data_tokenized, all_tokens_train, n=7, smoothing='interpolation', alpha=0.5)
valid_ngram_lm_interp_a5 = NgramLM(valid_data_tokenized, all_tokens_valid, n=7, smoothing='interpolation', alpha=0.5)

ppl_train_no_interp_a5 = train_ngram_lm_interp_a5.get_perplexity(train_data_tokenized, subsample=10)
ppl_valid_no_interp_a5 = train_ngram_lm_interp_a5.get_perplexity(valid_data_tokenized, subsample=10)

ppl_valid_no_interp_a5, ppl_train_no_interp_a5


In [None]:
# # Discounted Interpolation Smoothing
# train_ngram_lm_discount = NgramLM(train_data_tokenized, all_tokens_train, n=3, smoothing='discounting')
# valid_ngram_lm_discount = NgramLM(valid_data_tokenized, all_tokens_valid, n=3, smoothing='discounting')

# ppl_train_no_discount = train_ngram_lm_discount.get_perplexity(train_data_tokenized)
# ppl_valid_no_discount = train_ngram_lm_discount.get_perplexity(valid_data_tokenized)

# ppl_valid_no_discount, ppl_train_no_discount


### Additive Smoothing - varying N

### Sentence Probabilities

In [None]:
sentence = [['this', 'is', 'a', 'great', 'tutu']]
print(sentence)
ps = train_ngram_lm.get_prob_sentence(sentence)
ss = train_ngram_lm.get_score_sentence(sentence)
ps, ss

In [100]:
sentence = [['this', 'is', 'a', 'great', 'tutu', 'and', 'at', 'a', 'really', 'great', 'price', '.']]
print(sentence)
ps = train_ngram_lm_interp3.get_prob_sentence(sentence)
ss = train_ngram_lm_interp3.get_score_sentence(sentence)
ps, ss

[['this', 'is', 'a', 'great', 'tutu', 'and', 'at', 'a', 'really', 'great', 'price', '.']]


(1.887615753771853e-14, 8.221273452877178)

In [101]:
sentence = [['this', 'is', 'a', 'great', 'tutu', 'and', 'at', 'a', 'really', 'great', 'price', '.']]
print(sentence)
ps = train_ngram_lm_interp5.get_prob_sentence(sentence)
ss = train_ngram_lm_interp5.get_score_sentence(sentence)
ps, ss

[['this', 'is', 'a', 'great', 'tutu', 'and', 'at', 'a', 'really', 'great', 'price', '.']]


(2.0890490981023452e-07, 2.4714065720144007)

In [102]:
sentence = [['this', 'is', 'a', 'great', 'tutu', 'and', 'at', 'a', 'really', 'great', 'price', '.']]
print(sentence)
ps = train_ngram_lm_interp7.get_prob_sentence(sentence)
ss = train_ngram_lm_interp7.get_score_sentence(sentence)
ps, ss

[['this', 'is', 'a', 'great', 'tutu', 'and', 'at', 'a', 'really', 'great', 'price', '.']]


(1.7987031324049532e-07, 2.264655425293293)

In [103]:
sentence = [['this', 'is', 'a', 'great', 'tutu']]
print(sentence)
ps = train_ngram_lm_interp10.get_prob_sentence(sentence)
ss = train_ngram_lm_interp10.get_score_sentence(sentence)
ps, ss

[['this', 'is', 'a', 'great', 'tutu']]


(0.0, 9.063042790366942e+184)
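
A probability of exactly `0.0` next to a finite score suggests that the product of per-token probabilities underflowed double precision. That is an inference from the output, not something the notebook states. The sketch below shows the usual remedy, summing log-probabilities instead of multiplying probabilities. The per-token values and the function names are hypothetical, and the perplexity formula is the standard definition, not necessarily what `get_score_sentence` computes.

```python
import math


def sentence_log_prob(token_probs):
    """Sum of per-token log-probabilities (natural log)."""
    return sum(math.log(p) for p in token_probs)


def perplexity(token_probs):
    """exp of the negative mean log-probability per token."""
    return math.exp(-sentence_log_prob(token_probs) / len(token_probs))


# Hypothetical per-token probabilities, chosen small enough to underflow.
token_probs = [1e-80] * 5

# Naive product: drops below the smallest positive float64 and becomes 0.0.
naive_prob = 1.0
for p in token_probs:
    naive_prob *= p

log_prob = sentence_log_prob(token_probs)
print(naive_prob)               # the product has underflowed
print(log_prob)                 # still a finite number
print(perplexity(token_probs))
```

Working in log space keeps sentences comparable even when their raw probabilities are too small to represent.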

In [104]:
sentence = [['this', 'is', 'a', 'great', 'tutu', 'and', 'at', 'a', 'really', 'great', 'price', '.']]
print(sentence)
ps = train_ngram_lm_additive.get_prob_sentence(sentence)
ss = train_ngram_lm_additive.get_score_sentence(sentence)
ps, ss

[['this', 'is', 'a', 'great', 'tutu', 'and', 'at', 'a', 'really', 'great', 'price', '.']]


(9.246774960655664e-44, 339.94787908610914)

In [1]:
sentence = [['i', 'like', 'pandas']]
print(sentence)
ps = train_ngram_lm_additive.get_prob_sentence(sentence)
ss = train_ngram_lm_additive.get_score_sentence(sentence)
ps, ss

[['i', 'like', 'pandas']]


NameError: name 'train_ngram_lm_additive' is not defined

In [None]:
# get_prob_sentence expects a list of token lists, so split the sentence into tokens.
sentence = [['i', 'really', 'like', 'this', 'watch']]
print(sentence)
ps = train_ngram_lm_additive.get_prob_sentence(sentence)
ss = train_ngram_lm_additive.get_score_sentence(sentence)
ps, ss

In [None]:
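# get_prob_sentence expects a list of token lists, so split the sentence into tokens.
sentence = [['my', 'wife', 'really', 'likes', 'the', 'color', 'of', 'this', 'dress']]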
print(sentence)
ps = train_ngram_lm_additive.get_prob_sentence(sentence)
ss = train_ngram_lm_additive.get_score_sentence(sentence)
ps, ss

### Sentence Generation

In [108]:
num_tokens = 10
generated_sentence = train_ngram_lm_interp3.generate_sentence(num_tokens)
generated_sentence


big
big student
big student backpack
big student backpack and
big student backpack and eats
big student backpack and eats one
big student backpack and eats one of
big student backpack and eats one of the
big student backpack and eats one of the material
big student backpack and eats one of the material didn


'big student backpack and eats one of the material didn'

In [107]:
num_tokens = 10
generated_sentence = train_ngram_lm_interp5.generate_sentence(num_tokens)
generated_sentence

inside
inside ,
inside , i
inside , i use
inside , i use the
inside , i use the amazon
inside , i use the amazon visa
inside , i use the amazon visa card
inside , i use the amazon visa card and
inside , i use the amazon visa card and there


'inside , i use the amazon visa card and there'

In [111]:
num_tokens = 10
generated_sentence = train_ngram_lm_interp7.generate_sentence(num_tokens)
generated_sentence


they
they are
they are blindingly
they are blindingly white
they are blindingly white makes


'they are blindingly white makes'

In [110]:
num_tokens = 10
generated_sentence = train_ngram_lm_interp10.generate_sentence(num_tokens)
generated_sentence


this
this ring
this ring was
this ring was on
this ring was on my
this ring was on my wife
this ring was on my wife '
this ring was on my wife ' s
this ring was on my wife ' s wish
this ring was on my wife ' s wish list


"this ring was on my wife ' s wish list"

In [112]:
num_tokens = 20
generated_sentence = train_ngram_lm_interp10.generate_sentence(num_tokens)
generated_sentence


the
the best
the best foot
the best foot support
the best foot support system
the best foot support system ,
the best foot support system , and
the best foot support system , and no
the best foot support system , and no tight
the best foot support system , and no tight toe
the best foot support system , and no tight toe bed
the best foot support system , and no tight toe bed .
the best foot support system , and no tight toe bed . <eos>
the best foot support system , and no tight toe bed . <eos> <eos>
the best foot support system , and no tight toe bed . <eos> <eos> <eos>
the best foot support system , and no tight toe bed . <eos> <eos> <eos> <eos>
the best foot support system , and no tight toe bed . <eos> <eos> <eos> <eos> <eos>
the best foot support system , and no tight toe bed . <eos> <eos> <eos> <eos> <eos> <eos>
the best foot support system , and no tight toe bed . <eos> <eos> <eos> <eos> <eos> <eos> <eos>
the best foot support system , and no tight toe bed . <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos>

'the best foot support system , and no tight toe bed . <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos>'
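
In the output above, generation keeps appending `<eos>` until `num_tokens` is reached. The sketch below shows one way to stop at the first `<eos>` when sampling from a bigram table. It is not the notebook's `generate_sentence`: the `<sos>` start marker, the toy sentences, and the function names are assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict

SOS, EOS = "<sos>", "<eos>"  # <sos> is an assumed start marker


def build_bigram_table(sentences):
    """Map each token to a Counter of the tokens that follow it."""
    table = defaultdict(Counter)
    for sent in sentences:
        padded = [SOS] + sent + [EOS]
        for prev, nxt in zip(padded, padded[1:]):
            table[prev][nxt] += 1
    return table


def generate(table, max_tokens, rng):
    """Sample tokens one at a time, stopping at <eos> or after max_tokens."""
    prev, out = SOS, []
    for _ in range(max_tokens):
        followers = table[prev]
        words = list(followers)
        weights = list(followers.values())
        nxt = rng.choices(words, weights=weights, k=1)[0]
        if nxt == EOS:
            break  # end the sentence instead of emitting <eos> repeatedly
        out.append(nxt)
        prev = nxt
    return " ".join(out)


# Toy training sentences (hypothetical).
toy_sentences = [
    ["i", "will", "recommend", "this", "to", "my", "friends", "!"],
    ["this", "ring", "was", "on", "my", "wife", "'", "s", "wish", "list"],
]
table = build_bigram_table(toy_sentences)
print(generate(table, 20, random.Random(0)))
```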

In [109]:
num_tokens = 10
generated_sentence = train_ngram_lm_additive.generate_sentence(num_tokens)
generated_sentence


or
or at
or at least
or at least nobody
or at least nobody '
or at least nobody ' s
or at least nobody ' s said
or at least nobody ' s said anything
or at least nobody ' s said anything yet
or at least nobody ' s said anything yet .


"or at least nobody ' s said anything yet ."

In [113]:
num_tokens = 10
generated_sentence = train_ngram_lm_no_smoothing.generate_sentence(num_tokens)
generated_sentence


i
i will
i will recommend
i will recommend this
i will recommend this to
i will recommend this to my
i will recommend this to my friends
i will recommend this to my friends !
i will recommend this to my friends ! <eos>
i will recommend this to my friends ! <eos> <eos>


'i will recommend this to my friends ! <eos> <eos>'