<a href="https://colab.research.google.com/github/kyunghyuncho/ammi-2019-nlp/blob/master/01-day-LM/ngram_lm.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Language Modeling

### Goal: compute a probability distribution over all possible sentences:


### $$p(W) = p(w_1, w_2, ..., w_T)$$

### This unsupervised learning problem can be framed as a sequence of supervised learning problems:

### $$p(W) = p(w_1) * p(w_2|w_1) * ... * p(w_T|w_1, ..., w_{T-1})$$

### If we have N sentences, each with T words / tokens, then we want to maximize the log-likelihood:

### $$\sum_{n = 1}^N \log p(W^{(n)}) = \sum_{n = 1}^N \sum_{i=1}^{T} \log p(w^{(n)}_i | w^{(n)}_{<i})$$
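
The chain-rule factorization above can be sketched in a few lines. This is a minimal illustration rather than the notebook's implementation: `sentence_log_prob` and `uniform_model` are hypothetical names, and the uniform model simply stands in for any conditional distribution $p(w_i | w_{<i})$.

```python
import math

def sentence_log_prob(tokens, cond_prob):
    """Sum log p(w_i | w_<i) over the positions of one sentence."""
    return sum(math.log(cond_prob(tokens[i], tokens[:i])) for i in range(len(tokens)))

def uniform_model(word, history, vocab_size=4):
    # Hypothetical stand-in model: every word is equally likely, whatever the history.
    return 1.0 / vocab_size

corpus = [["a", "b", "c"], ["b", "a", "d"]]
# Corpus log-likelihood: sum the per-sentence log-probabilities.
log_likelihood = sum(sentence_log_prob(s, uniform_model) for s in corpus)
print(log_likelihood)
```

Working in log space turns the product of conditionals into a sum, which avoids numerical underflow on long sentences.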




# N-gram language model

### Goal: estimate the n-gram probabilities using counts of sequences of n consecutive words

### Given a sequence of words $w$, we want to compute

###  $$P(w_i|w_{i-n+1}, \ldots, w_{i-2}, w_{i-1})$$

### Where $w_i$ is the i-th word of the sequence.

### $$P(w_i|w_{i-n+1}, \ldots, w_{i-2}, w_{i-1}) = \frac{p(w_{i-n+1}, \ldots, w_{i-2}, w_{i-1}, w_i)}{\sum_{w \in V} p(w_{i-n+1}, \ldots, w_{i-2}, w_{i-1}, w)}$$

### Key Idea: We can estimate the probabilities using counts of n-grams in our dataset 
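
As a rough sketch of this idea (not the `NgramLM` implementation used later), the conditional probability can be estimated by dividing an n-gram count by the count of its context. The helper names `count_ngrams` and `mle_prob` and the toy token list are assumptions made for illustration. The denominator uses the count of the context itself, which equals the sum over following words except for a context that occurs only at the very end of the token sequence.

```python
from collections import Counter

def count_ngrams(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def mle_prob(ngram, ngram_counts, context_counts):
    """c(context, w) / c(context); 0.0 when the context was never seen."""
    context = ngram[:-1]
    if context_counts[context] == 0:
        return 0.0
    return ngram_counts[ngram] / context_counts[context]

# Toy data (hypothetical), already tokenized.
tokens = "i like this . i like that . i am here .".split()
trigrams = count_ngrams(tokens, 3)
bigrams = count_ngrams(tokens, 2)

# ("i", "like") occurs twice and is followed by "this" once.
p = mle_prob(("i", "like", "this"), trigrams, bigrams)
print(p)
```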


In [1]:
# TODOs
#: implement the neural LM with concat instead of summation -- so that you have a fixed input etc.
# make a separate
# create some slides with pictures maybe explaining the model visualizations -- line by line
# get google cloud working
# make it work on gpu
# show them kenlm and how to use to do different stuff with it
# use the same sentences to generation and testing etc.
# explain perplexity
# ngram, ff, rnn, rnn+attention
# do sentence generation
# do long sentences
# compare different n-grams -- 2,3,more

In [2]:
import os
import sys
sys.path.append('utils/')

### Install if needed

TODO: should we install as needed and import as needed or all at once?

### Imports

In [3]:
from utils import ngram_utils
import utils.global_variables as gl
import torch
import random
from utils.ngram_utils import NgramLM

In [4]:
torch.manual_seed(1)


<torch._C.Generator at 0x7fab2e21d510>

### Load Data from .txt Files

In [5]:
# Read data from .txt files and create lists of reviews

train_data = []
# create a list of all the reviews 
with open('../data/amazon_reviews_clothing_train.txt', 'r') as f:
    train_data = [review for review in f.read().split('\n') if review]
    
valid_data = []
# create a list of all the reviews 
with open('../data/amazon_reviews_clothing_valid.txt', 'r') as f:
    valid_data = [review for review in f.read().split('\n') if review]
    

In [6]:
# type(train_data), len(train_data), \
# type(train_data[0]), len(train_data[0]), \
# type(train_data[0][0]), len(train_data[0][0])

In [7]:
train_data[0], train_data[0][0]


("this is a great tutu and at a really great price . it doesn ' t look cheap at all . i ' m so glad i looked on amazon and found such an affordable tutu that isn ' t made poorly . a + + ",
 't')

### Process the Data

In [8]:
# # TODO: for now only work with small subset of the data -- switch to all data later
train_data = train_data[:100]
valid_data = valid_data[:10]

In [9]:
type(train_data), type(train_data[0]), type(train_data[0][0])

(list, str, str)

In [10]:
# Tokenize the Datasets
# TODO: this takes a really long time !! why?
train_data_tokenized, all_tokens_train = ngram_utils.tokenize_dataset(train_data)
valid_data_tokenized, all_tokens_valid = ngram_utils.tokenize_dataset(valid_data)


100it [00:00, 563.56it/s]
10it [00:00, 1171.99it/s]


Let's look at the tokenized data!

In [11]:
# # Number of All Tokens
# len(all_tokens_train), all_tokens_train[0], \
# len(train_data_tokenized), train_data_tokenized[0]

In [12]:
train_ngram_lm = NgramLM(train_data_tokenized, all_tokens_train, n=3)
valid_ngram_lm = NgramLM(valid_data_tokenized, all_tokens_valid, n=3)

In [13]:
# train_ngram_lm.n, train_ngram_lm.frac_vocab, train_ngram_lm.num_all_tokens

In [14]:
# valid_ngram_lm.vocabulary[:3], valid_ngram_lm.raw_data[:3]

In [15]:
# valid_ngram_lm.vocab_ngram[:3], valid_ngram_lm.count_ngram[:3]

In [16]:
# valid_ngram_lm.vocab_unigram[:3], valid_ngram_lm.count_unigram[:3]

In [17]:
# valid_ngram_lm.vocab_bigram[:3], valid_ngram_lm.count_bigram[:3]

In [18]:
# valid_ngram_lm.vocab_prev_ngram[:3], valid_ngram_lm.count_prev_ngram[:3]

In [19]:
# valid_ngram_lm.id2token[:3], valid_ngram_lm.token2id['<pad>']

#### Build the Vocabulary 


In [20]:
# Build a vocabulary from the tokens found in the train data (the 90% most common ones)
vocabulary = train_ngram_lm.vocabulary
print('Word vocabulary size: {} words'.format(len(vocabulary)))        

Word vocabulary size: 2214 words


### CORPUS ANALYSIS (Train + Valid Data)

#### Number of Tokens in the Corpus Data


In [21]:
print("Number of All Tokens ", train_ngram_lm.num_all_tokens)

Number of All Tokens  16129


In [22]:
print("Number of All UNIQUE Tokens ", len(vocabulary))

Number of All UNIQUE Tokens  2214


#### Number of Sentences in the Train Data


In [23]:
print("Number of Sentences ", len(train_ngram_lm.raw_data))

Number of Sentences  100


## N-grams

In [24]:
n = 3 # trigrams

### Function for padding the sentences with special sentence-beginning and sentence-end markers, i.e. $<bos>$ and $<eos>$

In [25]:
train_padded = train_ngram_lm.padded_data
train_ngram = train_ngram_lm.ngram_data
vocab_ngram = train_ngram_lm.vocab_ngram
count_ngram = train_ngram_lm.count_ngram 

In [26]:
# train_padded[0]
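
A minimal sketch of what such a padding step might look like. `pad_sentence` is a hypothetical helper, not the function in `ngram_utils`. The trigram `('.', '<eos>', '<eos>')` counted later in this notebook suggests that n-1 end markers are appended; using n-1 begin markers as well is an assumption.

```python
def pad_sentence(tokens, n, bos="<bos>", eos="<eos>"):
    """Add n-1 begin markers and n-1 end markers around a tokenized sentence."""
    return [bos] * (n - 1) + list(tokens) + [eos] * (n - 1)

padded = pad_sentence(["this", "is", "great", "."], n=3)
print(padded)
```

With this padding, every real token has a full context of n-1 preceding tokens, including the first word of the sentence.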

### Function for finding all N-grams

In [27]:
# train_ngram[0]

In [28]:
# vocab_ngram[0]

In [29]:
# count_ngram[0]
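
For reference, extracting the n-grams of one padded sentence can be sketched as below. `find_ngrams` is a hypothetical name and the sentence is toy data; the actual `ngram_data` attribute may be organized differently.

```python
def find_ngrams(padded_tokens, n):
    """Return the list of n-grams (as tuples) in one padded sentence."""
    # zip over n shifted views of the sentence; it stops at the shortest view.
    return list(zip(*(padded_tokens[i:] for i in range(n))))

sentence = ["<bos>", "<bos>", "i", "like", "this", "<eos>", "<eos>"]
ngrams = find_ngrams(sentence, 3)
print(ngrams[0], ngrams[-1], len(ngrams))
```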

In [30]:
# train_trie['./<eos>/<eos>']

In [31]:
trie_ngram = train_ngram_lm.trie_ngram

In [32]:
# train_ngram_trie

### Function for Getting N-gram counts for already tokenized data

In [33]:
# train_padded, train_ngram, vocab_ngram, count_ngram

#### Trigrams, Bigrams, Unigrams

In [34]:
vocab_unigram = train_ngram_lm.vocab_unigram
vocab_bigram = train_ngram_lm.vocab_bigram
vocab_trigram = train_ngram_lm.vocab_trigram

count_unigram = train_ngram_lm.count_unigram
count_bigram = train_ngram_lm.count_bigram
count_trigram = train_ngram_lm.count_trigram

In [35]:
# vocab_bigram[:3], count_bigram[:3]

In [36]:
# vocab_unigram[:3], count_unigram[:3]

In [37]:
trie_unigram = train_ngram_lm.trie_unigram
trie_bigram = train_ngram_lm.trie_bigram
trie_trigram = train_ngram_lm.trie_trigram

In [38]:
# unigram_trie, bigram_trie, trigram_trie

### Function for Getting N-gram Dict

In [39]:
id2token_ngram = train_ngram_lm.id2token
token2id_ngram = train_ngram_lm.token2id

In [40]:
# id2token_ngram[:10], \
# token2id_ngram['<unk>'], token2id_ngram['<eos>'], token2id_ngram[('rosetta', 'stone', 'is')]

In [41]:
random_token_id = random.randint(0, len(id2token_ngram) - 1)
random_token = id2token_ngram[random_token_id]

print ("Token id {} ; token {}".format(random_token_id, id2token_ngram[random_token_id]))
print ("Token {}; token id {}".format(random_token, token2id_ngram[random_token]))

Token id 12427 ; token ('gave', 'him', 'this')
Token ('gave', 'him', 'this'); token id 12427
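
The two mappings are inverses of each other, as the printout above shows. A minimal sketch of how such a pair could be built from a list of n-grams follows. It is an assumption about the structure only and leaves out special entries such as `<unk>` and `<pad>`.

```python
# Toy n-gram list (hypothetical) with one duplicate.
ngrams = [("i", "like", "this"), ("like", "this", "."), ("i", "like", "this")]

# Keep first-seen order and drop duplicates; dicts preserve insertion order.
id2token = list(dict.fromkeys(ngrams))
token2id = {token: idx for idx, token in enumerate(id2token)}

idx = token2id[("like", "this", ".")]
print(idx, id2token[idx])
```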


### Ngram Counts

In [42]:
# vocab_ngram[:10], count_ngram[:10]

In [43]:
c = train_ngram_lm.get_ngram_count(('i', 'like', 'this'))
c = train_ngram_lm.get_ngram_count(('.', '<eos>', '<eos>'))
c

75

In [44]:
c = train_ngram_lm.get_ngram_count(('i', 'like', 'pandas'))
c

0

### Function for computing the probability of a sentence

## N-gram Probabilities

## $$P(w|w_{-n+1}, \ldots, w_{-2}, w_{-1}) \approx \frac{c(w_{-n+1}, \ldots, w_{-2}, w_{-1}, w)}{\sum_{w' \in V} c(w_{-n+1}, \ldots, w_{-2}, w_{-1}, w')}$$


## Bigram Probabilities

## $$p(w_i | w_{i-1}) = \frac{c(w_{i-1}, w_i)}{\sum_{w'} c(w_{i-1}, w')} $$


In [45]:
p = train_ngram_lm.get_ngram_prob(('rosetta', 'stone', 'is'))
p = train_ngram_lm.get_ngram_prob(('i', 'am', 'very'))
p

# p = get_ngram_prob(('i', 'am', 'rosetta'), vocab_ngram, count_ngram)
# p

# p = get_ngram_prob(('it', "'", 's'), vocab_ngram, count_ngram)
# p

# p = get_ngram_prob(('i', "like", 'this'), vocab_ngram, count_ngram)
# p, 1/(2+1+1+1+1)

0

In [46]:
p = train_ngram_lm.get_ngram_prob(('am', 'rosetta', 'stone'))
p

0

## Additive Smoothing

In [47]:
p = train_ngram_lm.get_ngram_prob_additive_smoothing(('am', 'rosetta', 'stone'), delta=0.5)
p

0.00045167118337850043

In [48]:
p = train_ngram_lm.get_ngram_prob_additive_smoothing(('i', 'am', 'very'), delta=0.5)
p

0.0031616982836495033

## Add-One Smoothing

In [49]:
p = train_ngram_lm.get_ngram_prob_add_one_smoothing(('am', 'rosetta', 'stone'))
p

0.00045167118337850043

In [50]:
p = train_ngram_lm.get_ngram_prob_add_one_smoothing(('i', 'am', 'very'))
p

0.0018066847335140017
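
A common definition of additive smoothing is $(c(\text{context}, w) + \delta) / (c(\text{context}) + \delta |V|)$, with add-one smoothing as the special case $\delta = 1$. The sketch below assumes that definition; `additive_smoothing` is a hypothetical helper, and the methods on `NgramLM` may differ in detail.

```python
def additive_smoothing(ngram_count, context_count, vocab_size, delta=0.5):
    """(c(context, w) + delta) / (c(context) + delta * |V|)."""
    return (ngram_count + delta) / (context_count + delta * vocab_size)

# Unseen n-gram after an unseen context, with the vocabulary size printed earlier.
p_unseen = additive_smoothing(0, 0, vocab_size=2214, delta=0.5)
# delta=1 gives add-one (Laplace) smoothing.
p_add_one = additive_smoothing(0, 0, vocab_size=2214, delta=1.0)
print(p_unseen, p_add_one)
```

Both values equal 1/2214, roughly 0.000452, which is consistent with the values printed above for `('am', 'rosetta', 'stone')`: under this formula, an n-gram whose context never occurs gets probability 1/|V| for any delta.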

## Linear Interpolation Smoothing

#### TODO: add formula
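
One common form of linear interpolation for a trigram model mixes the maximum-likelihood estimates of different orders. The weights $\lambda_k$ below are an assumption; how the `alpha` argument of `get_ngram_prob_interpolation_smoothing` maps onto them is not shown in this notebook.

$$p_{interp}(w_i | w_{i-2}, w_{i-1}) = \lambda_3 \, p_{ML}(w_i | w_{i-2}, w_{i-1}) + \lambda_2 \, p_{ML}(w_i | w_{i-1}) + \lambda_1 \, p_{ML}(w_i), \qquad \lambda_1 + \lambda_2 + \lambda_3 = 1, \; \lambda_k \ge 0$$

Because the unigram term is non-zero for every word seen in training, the interpolated estimate stays positive even when the trigram itself was never observed.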

In [51]:
# p = train_ngram_lm.get_ngram_prob_interpolation_smoothing(('am', 'rosetta', 'stone'), alpha=0.5)
p = train_ngram_lm.get_ngram_prob_interpolation_smoothing(('i', 'am', 'very'), alpha=0.8)
p

0.0

## Smoothing: Linear Interpolation with Absolute Discounting

### $$p_{bi}(w|v) = \max \left( \frac{N(v, w) - b_{bi}}{N(v)}, 0 \right) + b_{bi} \frac{V - N_0(v, \cdot)}{N(v)} p_{uni}(w)$$

### $$p_{uni}(w) = \max \left( \frac{N(w) - b_{uni}}{N}, 0 \right) + b_{uni} \frac{V - N_0(\cdot)}{N} \frac{1}{V}$$

### $$b_{bi} = \frac{N_1(\cdot, \cdot)}{N_1(\cdot, \cdot) + 2 N_2(\cdot, \cdot)}$$

### $$b_{uni} = \frac{N_1(\cdot)}{N_1(\cdot) + 2 N_2(\cdot)}$$


### $$N_r(\cdot) = \sum_{w: N(w) = r} 1$$

### $$N_r(\cdot, \cdot) = \sum_{v, w: N(v, w) = r} 1$$

### $$N_r(v, \cdot) = \sum_{w: N(v, w) = r} 1$$

### V is the number of words in the vocabulary

### $N_r(\cdot, \cdot)$ and $N_r(\cdot)$ are the count-counts for bigrams and unigrams respectively, i.e. the number of distinct bigrams (unigrams) that occur exactly $r$ times


In [52]:
y = "m"
x = "'"

z = train_ngram_lm.get_p_bi(y, x)
z

206.01654553995797

### Let's check that the probabilities sum up to one
### $$\sum_w p_{bi}(w|v) = \sum_w p_{uni}(w) = 1$$



TODO: add this check or leave as homework
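
The normalization check can be sketched on a toy corpus as follows. This is an independent, minimal implementation of the formulas above, not the code behind `get_p_bi`; the function names, the argument order `p_bi(w, v)`, and the three toy sentences are all assumptions. It only checks contexts `v` that occur in the data, since $N(v) = 0$ would divide by zero.

```python
from collections import Counter

# Toy corpus (hypothetical), already padded for a bigram model.
sentences = [
    ["<bos>", "i", "like", "this", "<eos>"],
    ["<bos>", "i", "like", "it", "<eos>"],
    ["<bos>", "this", "is", "great", "<eos>"],
]
vocab = sorted({w for s in sentences for w in s})
V = len(vocab)
unigram = Counter(w for s in sentences for w in s)
bigram = Counter((v, w) for s in sentences for v, w in zip(s, s[1:]))
N = sum(unigram.values())

def discount(counts):
    """b = N_1 / (N_1 + 2 * N_2), computed from the count-counts."""
    n1 = sum(1 for c in counts.values() if c == 1)
    n2 = sum(1 for c in counts.values() if c == 2)
    return n1 / (n1 + 2 * n2)

b_uni, b_bi = discount(unigram), discount(bigram)

def p_uni(w):
    n_seen = len(unigram)  # V - N_0(.): number of words with non-zero count
    return max((unigram[w] - b_uni) / N, 0) + b_uni * n_seen / N / V

def p_bi(w, v):
    n_v = sum(c for (a, _), c in bigram.items() if a == v)  # N(v)
    n_seen = sum(1 for (a, _) in bigram if a == v)          # V - N_0(v, .)
    return max((bigram[(v, w)] - b_bi) / n_v, 0) + b_bi * n_seen / n_v * p_uni(w)

print(sum(p_uni(w) for w in vocab))
print(sum(p_bi(w, "i") for w in vocab))
```

Both sums should come out as 1 up to floating-point error, because the mass removed by discounting each seen event is exactly the mass redistributed through the back-off term.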

### Bigram LM
###  $$p(s) = \prod_{i = 1} ^ {N + 1} p(w_i | w_{i-1})$$

### Likelihood of a Sentence

In [53]:
n = 3
sentence = [['this', 'is', 'a', 'great', 'tutu']]
print(sentence)
ps = train_ngram_lm.get_prob_sentence(sentence)
ps

[['this', 'is', 'a', 'great', 'tutu']]


0

### Examples
### Bigram LM: $$ p(i \; love \; this \; light) = p(i|\cdot) \; p(love|i)\;  p(this|love)\;  p(light|this) \\
\approx \frac{c(\cdot, \; i)}{\sum_w c(\cdot, \; w)} \; \frac{c(i, \; love)}{\sum_w c(i, \; w)}\;  \frac{c(love, \; this)}{\sum_w c(love, \;w)}\;  \frac{c(this, \; light)}{\sum_w c(this, \;w)}$$ 

### Trigram LM: $$ p(i \; love \; this  \;light) = p(i|\cdot, \cdot) \; p(love|\cdot, i) \; p(this|i, love)\;  p(light|love, this)$$ 
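
The bigram factorization can be sketched with plain counts. This is a minimal illustration on hypothetical toy data, not `get_prob_sentence`; unlike the expansion written above, it also includes the final factor $p(<eos> | light)$, in line with the product over $N + 1$ terms in the bigram LM formula.

```python
import math
from collections import Counter

# Toy training data (hypothetical).
train = [["i", "love", "this", "light"], ["i", "love", "it"]]

# Pad with one <bos> and one <eos>, as needed for a bigram model.
padded = [["<bos>"] + s + ["<eos>"] for s in train]
bigram_counts = Counter((v, w) for s in padded for v, w in zip(s, s[1:]))
context_counts = Counter(v for s in padded for v in s[:-1])

def bigram_sentence_prob(tokens):
    """Product of p(w_i | w_{i-1}) over the padded sentence, computed in log space."""
    s = ["<bos>"] + tokens + ["<eos>"]
    log_p = 0.0
    for v, w in zip(s, s[1:]):
        if bigram_counts[(v, w)] == 0:
            return 0.0  # an unseen bigram makes the unsmoothed product zero
        log_p += math.log(bigram_counts[(v, w)] / context_counts[v])
    return math.exp(log_p)

p = bigram_sentence_prob(["i", "love", "this", "light"])
print(p)
```

Here the only factor below 1 is $p(this | love) = 1/2$. A single unseen bigram zeroes the whole unsmoothed product, which is the motivation for the smoothing methods above.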



In [54]:
# Probability distribution over all the words in the vocabulary
# for the word following prev_tokens

# prev_tokens = train_data_tokenized[0][4] #[0]
prev_tokens = vocab_ngram[3][1:]   # drop the first token so that the remaining n-1 tokens form a valid context
print(prev_tokens)
pd = train_ngram_lm.get_prob_distr_ngram(prev_tokens)
sum(pd)#, pd

("'", 'm')


0

In [55]:
# Probability distribution over all the words in the vocabulary
# for the word following prev_tokens

# prev_tokens = train_data_tokenized[0][4] #[0]
prev_tokens = ('rosetta', 'stone')   # context of n-1 tokens
print(prev_tokens)
pd = train_ngram_lm.get_prob_distr_ngram(prev_tokens)
sum(pd)#, pd

('rosetta', 'stone')


0

### Sentence Generation

In [56]:
num_tokens = 20
generated_sentence = train_ngram_lm.generate_sentence(num_tokens)
generated_sentence

across
across bandana
across bandana pouch
across bandana pouch thought
across bandana pouch thought enunciate
across bandana pouch thought enunciate same
across bandana pouch thought enunciate same leaves
across bandana pouch thought enunciate same leaves than
across bandana pouch thought enunciate same leaves than disneyland
across bandana pouch thought enunciate same leaves than disneyland smaller
across bandana pouch thought enunciate same leaves than disneyland smaller fingers
across bandana pouch thought enunciate same leaves than disneyland smaller fingers will
across bandana pouch thought enunciate same leaves than disneyland smaller fingers will pc
across bandana pouch thought enunciate same leaves than disneyland smaller fingers will pc opened
across bandana pouch thought enunciate same leaves than disneyland smaller fingers will pc opened picture
across bandana pouch thought enunciate same leaves than disneyland smaller fingers will pc opened picture apart
across bandana pou

'across bandana pouch thought enunciate same leaves than disneyland smaller fingers will pc opened picture apart upset imac hardly believe'
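
For comparison, sampling from a count-based bigram model can be sketched as follows. This is a simplified stand-in under stated assumptions (bigram rather than trigram context, hypothetical toy data, no smoothing), not the `generate_sentence` method used above.

```python
import random
from collections import Counter, defaultdict

# Toy training data (hypothetical).
train = [["i", "love", "this", "light"], ["i", "love", "it"], ["this", "light", "works"]]

# next_counts[v][w] = number of times w follows v in the padded training data.
next_counts = defaultdict(Counter)
for s in train:
    padded = ["<bos>"] + s + ["<eos>"]
    for v, w in zip(padded, padded[1:]):
        next_counts[v][w] += 1

def generate_sentence(max_tokens, rng):
    """Sample one word at a time from p(w | previous word) until <eos> or max_tokens."""
    tokens, prev = [], "<bos>"
    for _ in range(max_tokens):
        words, counts = zip(*next_counts[prev].items())
        prev = rng.choices(words, weights=counts, k=1)[0]
        if prev == "<eos>":
            break
        tokens.append(prev)
    return tokens

generated = generate_sentence(20, random.Random(1))
print(" ".join(generated))
```

Because each word is drawn in proportion to how often it followed the previous word, every adjacent pair in the output is a bigram that occurs in the training data.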

In [57]:
# num_tokens = 5
# generated_sentence = train_ngram_lm.generate_sentence(num_tokens)
# generated_sentence


In [58]:
# TODOs
# show rank for each word in a sentence
# explain perplexity 

### Log-Likelihood
### $LL = \sum_{k=1}^{K} \sum_{n=1}^{N_k + 1} \log p_{bi}(w_{k,n} | w_{k,n-1})$

### Perplexity

### $PP = \exp\left(-\frac{LL}{\sum_k(N_k + 1)}\right)$
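
A minimal sketch of the perplexity formula, assuming the per-token natural-log probabilities have already been computed. The `perplexity` helper is hypothetical and is not the `get_perplexity` method called below.

```python
import math

def perplexity(log_probs):
    """exp(-LL / number of predicted tokens), with LL the sum of natural-log probabilities."""
    return math.exp(-sum(log_probs) / len(log_probs))

# Sanity check: a uniform model over V words has perplexity V.
V = 2214
uniform_log_probs = [math.log(1.0 / V)] * 50
ppl = perplexity(uniform_log_probs)
print(ppl)
```

This gives a useful reference point: a model that guesses uniformly over a vocabulary of 2214 words has perplexity 2214, so a perplexity well above the vocabulary size means the model does worse than uniform guessing on that data.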

In [59]:
ppl_valid = train_ngram_lm.get_perplexity(valid_data_tokenized)
ppl_train = train_ngram_lm.get_perplexity(train_data_tokenized)


In [60]:
ppl_valid, ppl_train
# TODO check whether this makes sense -- maybe it seems too good?

(266669.1406237607, 78.64967546583712)

#### Let's look at some examples and see if they make sense