<a href="https://colab.research.google.com/github/kyunghyuncho/ammi-2019-nlp/blob/master/01-day-LM/ken_lm.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# KenLM Framework for Language Modeling


**Install KenLM**

Download the stable release and unzip it: http://kheafield.com/code/kenlm.tar.gz

KenLM needs Boost >= 1.42.0 and bjam:

*   Ubuntu: `sudo apt-get install libboost-all-dev`
*   Mac: `brew install boost; brew install bjam`

Run the following within the kenlm directory:

*   `mkdir -p build`
*   `cd build`
*   `cmake ..`
*   `make -j 4`

Then install the Python module:

*   `pip install https://github.com/kpu/kenlm/archive/master.zip`

For more information on KenLM see: https://github.com/kpu/kenlm and http://kheafield.com/code/kenlm/


In [70]:
import sys
sys.path.append('utils/')

In [71]:
import kenlm
import os
import re
import utils.ngram_utils as ngram_utils


In [72]:
path = '/home/roberta/ammi-2019-nlp/data/'
os.chdir(path)


## 3-gram model with KenLM

In [73]:
cat train.txt | /home/roberta/kenlm/bin/lmplz -o 3 > amazonLM3.arpa

=== 1/5 Counting and sorting n-grams ===
File stdin isn't normal.  Using slower read() instead of mmap().  No progress bar.
=== 2/5 Calculating and sorting adjusted counts ===
Chain sizes: 1:860352 2:75230912512 3:141057966080
Statistics:
1 71696 D1=0.690098 D2=0.962667 D3+=1.22676
2 1239185 D1=0.712943 D2=1.05296 D3+=1.36242
3 4834597 D1=0.772513 D2=1.0869 D3+=1.33918
Memory estimate for binary LM:
type     MB
probing 113 assuming -p 1.5
probing 120 assuming -r models -p 1.5
trie     44 without quantization
trie     24 assuming -q 8 -b 8 quantization 
trie     42 assuming -a 22 array pointer compression
trie     22 assuming -a 22 -q 8 -b 8 array pointer compression and quantization
=== 3/5 Calculating and sorting initial probabilities ===
Chain sizes: 1:860352 2:19826960 3:96691940
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
####################################################################################################
=== 

In [74]:
# path = '/home/roberta'
# os.chdir(path)
# !kenlm/bin/build_binary amazonLM.arpa amazonLM.klm

In [75]:
import kenlm
model_3n = kenlm.LanguageModel('amazonLM3.arpa')


In [76]:
# Read data from .txt files and create lists of reviews

train_data = []
# create a list of all the reviews 
with open('../data/train.txt', 'r') as f:
    train_data = [review for review in f.read().split('\n') if review]
    
valid_data = []
# create a list of all the reviews 
with open('../data/valid.txt', 'r') as f:
    valid_data = [review for review in f.read().split('\n') if review]
    

In [77]:
# Tokenize the Datasets
# TODO: this takes a really long time !! why?
train_data_tokenized, all_tokens_train = ngram_utils.tokenize_dataset(train_data)
valid_data_tokenized, all_tokens_valid = ngram_utils.tokenize_dataset(valid_data)


In [78]:
train_data = []
for t in train_data_tokenized:
    train_data.append(' '.join(t))
train_data[:3]

['this is a great tutu and at a really great price .',
 "it doesn ' t look cheap at all .",
 "i ' m so glad i looked on amazon and found such an affordable tutu that isn ' t made poorly ."]

In [79]:
valid_data = []
for t in valid_data_tokenized:
    valid_data.append(' '.join(t))
valid_data[:3]

['good value .',
 'not super cheap material .',
 'at first , i was absolutely delighted with these peds . . .']

#### KenLM reports the log probability (base 10) of a sentence, not its perplexity, so we convert the scores and report perplexity. The function get_ppl below computes the perplexity; get_oov, defined further down, finds all out-of-vocabulary (OOV) words.

#### Perplexity is defined as $$ PPL = b^{- \frac{1}{N} \sum_{i=1}^N \log_b q(x_i)} $$ All KenLM scores are in log base 10, so with $b = 10$ this becomes $$ PPL = 10^{-L / N} $$ where $L = \sum_{i=1}^N \log_{10} q(x_i)$ is the total log probability and $N$ is the word count.
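
As a quick sanity check of the formula, here is a minimal sketch with made-up per-word probabilities (the values are illustrative and do not come from the model). It also shows the equivalent view of perplexity as the inverse geometric mean of the word probabilities.

```python
import math

# Hypothetical per-word probabilities for a 4-word sentence (illustrative only).
word_probs = [0.1, 0.01, 0.1, 0.001]

# KenLM-style score: the sum of log10 probabilities (a negative number).
total_log10_prob = sum(math.log10(p) for p in word_probs)
n_words = len(word_probs)

# PPL = 10 ** (-total log10 probability / N)
ppl = 10 ** (-total_log10_prob / n_words)

# Equivalent view: the inverse of the geometric mean of the probabilities.
geo_mean = math.prod(word_probs) ** (1 / n_words)

print(ppl, 1 / geo_mean)
```

Here the log10 probabilities sum to -7 over 4 words, so both printed values are roughly 56.23.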

In [80]:
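def get_ppl(lm, sentences):
    """
    Compute perplexity over a list of strings (space-delimited sentences).
    """
    total_log10_prob = 0.0  # lm.score returns a log10 probability (<= 0)
    total_wc = 0
    for sent in sentences:
        # Put a space between runs of word characters and runs of punctuation
        sent = re.sub(r"([\w/'+$\s-]+|[^\w/'+$\s-]+)\s*", r"\1 ", sent)
        words = sent.strip().split()
        # No <s> / </s> markers, so only the words themselves are scored
        score = lm.score(sent, bos=False, eos=False)
        total_wc += len(words)
        total_log10_prob += score
    if total_wc == 0:
        raise ValueError("no words to score")
    # PPL = 10 ** (-total log10 probability / word count)
    ppl = 10 ** -(total_log10_prob / total_wc)
    return ppl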
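
To see which words drive a sentence's perplexity, the per-word scores can be inspected. The sketch below assumes the `full_scores` method of the `kenlm` Python module, which yields one `(log10 probability, matched n-gram length, is_oov)` tuple per word. `token_report` and the `_ToyLM` stand-in are hypothetical names, used here so that the snippet runs without a trained model; the scores in `_ToyLM` are made up.

```python
def token_report(lm, sentence):
    """Pair each word with its log10 probability, n-gram length, and OOV flag.

    `lm` is assumed to expose kenlm's `full_scores(sentence, bos=..., eos=...)`.
    """
    words = sentence.split()
    # With eos=False, one tuple is yielded per word.
    scores = lm.full_scores(sentence, bos=False, eos=False)
    return [(w, lp, n, oov) for w, (lp, n, oov) in zip(words, scores)]


class _ToyLM:
    """Hypothetical stand-in for kenlm.LanguageModel (made-up scores)."""
    vocab = {"this": -1.5, "fits": -3.0, "me": -2.0}

    def full_scores(self, sentence, bos=True, eos=True):
        for w in sentence.split():
            if w in self.vocab:
                yield (self.vocab[w], 1, False)
            else:
                yield (-6.0, 1, True)  # unknown word: low score, OOV flag set


report = token_report(_ToyLM(), "this fits me perfectly")
for word, log10_prob, ngram_len, is_oov in report:
    print(f"{word:10s} {log10_prob:6.2f} {ngram_len} {'OOV' if is_oov else ''}")
```

With a real model, pass `model_3n` (or another loaded `kenlm.LanguageModel`) in place of `_ToyLM()`.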


In [81]:
train_ppl = get_ppl(model_3n, train_data)
train_ppl

39.08424638548333

In [82]:
valid_ppl = get_ppl(model_3n, valid_data)
valid_ppl

71.55326951549414

### Score Sentences

In [83]:
sentences = ['i like pandas']
ppl = get_ppl(model_3n, sentences)
ppl

13199.934380820527

Function for loading the data

In [84]:
sentences = ['i like this tutu']
ppl = get_ppl(model_3n, sentences)
ppl

230.70962836087295

In [85]:
sentences = ['this', 'is', 'a', 'great', 'tutu', 'and', 'at', 'a', 'really', 'great', 'price', '.']
ppl = get_ppl(model_3n, sentences)
ppl

557.8821726391418

In [86]:
sentences = ['.']
ppl = get_ppl(model_3n, sentences)
ppl

55.331202531779226

In [87]:
sentences = ['who wants dinner?']
ppl = get_ppl(model_3n, sentences)
ppl

3028.4422169553886

In [88]:
sentences = ['i want to get a refund']
ppl = get_ppl(model_3n, sentences)
ppl

38.634355092812754

In [89]:
sentences = ['this watch is not what i expected']
ppl = get_ppl(model_3n, sentences)
ppl

22.70582444534253

In [90]:
sentences = ['this fits me perfectly .']
ppl = get_ppl(model_3n, sentences)
ppl

28.99315866512597

In [91]:
sentences = ['this coat fits me perfectly ?']
ppl = get_ppl(model_3n, sentences)
ppl

184.37286419366077

## 5-gram model with KenLM

In [92]:
cat train.txt | /home/roberta/kenlm/bin/lmplz -o 5 > amazonLM5.arpa

=== 1/5 Counting and sorting n-grams ===
File stdin isn't normal.  Using slower read() instead of mmap().  No progress bar.
=== 2/5 Calculating and sorting adjusted counts ===
Chain sizes: 1:860352 2:21101352960 3:39565037568 4:63304056832 5:92318425088
Statistics:
1 71696 D1=0.690098 D2=0.962667 D3+=1.22676
2 1239185 D1=0.712943 D2=1.05296 D3+=1.36242
3 4834597 D1=0.796199 D2=1.09701 D3+=1.35908
4 9215190 D1=0.868874 D2=1.16401 D3+=1.3733
5 12376562 D1=0.898907 D2=1.2197 D3+=1.36975
Memory estimate for binary LM:
type     MB
probing 564 assuming -p 1.5
probing 651 assuming -r models -p 1.5
trie    261 without quantization
trie    142 assuming -q 8 -b 8 quantization 
trie    232 assuming -a 22 array pointer compression
trie    112 assuming -a 22 -q 8 -b 8 array pointer compression and quantization
=== 3/5 Calculating and sorting initial probabilities ===
Chain sizes: 1:860352 2:19826960 3:96691940 4:221164560 5:346543736
----5---10---15---20---25---30---35---40---45---50---55---60---65

In [93]:
model_5n = kenlm.LanguageModel('amazonLM5.arpa')


In [94]:
train_ppl = get_ppl(model_5n, train_data)
train_ppl

14.567223510318378

In [95]:
valid_ppl = get_ppl(model_5n, valid_data)
valid_ppl

67.00883546322021

In [96]:
sentences = ['i like pandas']
ppl = get_ppl(model_5n, sentences)
ppl

6799.379858151767

In [97]:
sentences = ['i like this tutu']
ppl = get_ppl(model_5n, sentences)
ppl

273.66703112404986

In [98]:
sentences = ['this', 'is', 'a', 'great', 'tutu', 'and', 'at', 'a', 'really', 'great', 'price', '.']
ppl = get_ppl(model_5n, sentences)
ppl

557.8821726391418

In [99]:
sentences = ['who wants dinner?']
ppl = get_ppl(model_5n, sentences)
ppl

2342.7461203901544

In [100]:
sentences = ['i want to get a refund']
ppl = get_ppl(model_5n, sentences)
ppl

40.45908777555109

In [101]:
sentences = ['this watch is not what i expected']
ppl = get_ppl(model_5n, sentences)
ppl

28.24555272015706

In [102]:
sentences = ['this fits me perfectly .']
ppl = get_ppl(model_5n, sentences)
ppl

32.22202767793977

In [103]:
sentences = ['this coat fits me perfectly ?']
ppl = get_ppl(model_5n, sentences)
ppl

179.78485857588134

## 10-gram model with KenLM

In [109]:
cat train.txt | /home/roberta/kenlm/bin/lmplz -o 10 > amazonLM10.arpa

=== 1/5 Counting and sorting n-grams ===
File stdin isn't normal.  Using slower read() instead of mmap().  No progress bar.
=== 2/5 Calculating and sorting adjusted counts ===
Chain sizes: 1:860352 2:3516892160 3:6594172928 4:10550676480 5:15386402816 6:21101352960 7:27695525888 8:35168919552 9:43521540096 10:52753383424
Statistics:
1 71696 D1=0.690098 D2=0.962667 D3+=1.22676
2 1239185 D1=0.712943 D2=1.05296 D3+=1.36242
3 4834597 D1=0.796199 D2=1.09701 D3+=1.35908
4 9215190 D1=0.868874 D2=1.16401 D3+=1.3733
5 12376562 D1=0.922179 D2=1.23342 D3+=1.42227
6 14073204 D1=0.957655 D2=1.31503 D3+=1.47777
7 14755602 D1=0.97911 D2=1.4124 D3+=1.53403
8 14907447 D1=0.990495 D2=1.48841 D3+=1.62571
9 14831157 D1=0.995705 D2=1.56423 D3+=1.6684
10 14667984 D1=0.985478 D2=1.6356 D3+=2.00243
Memory estimate for binary LM:
type      MB
probing 2227 assuming -p 1.5
probing 2720 assuming -r models -p 1.5
trie    1154 without quantization
trie     631 assuming -q 8 -b 8 quantization 
trie     981 assuming 

In [110]:
model_10n = kenlm.LanguageModel('amazonLM10.arpa')


OSError: Cannot read model 'amazonLM10.arpa' (lm/model.cc:49 in void lm::ngram::detail::{anonymous}::CheckCounts(const std::vector<long unsigned int>&) threw FormatLoadException because `counts.size() > 6'. This model has order 10 but KenLM was compiled to support up to 6.  If your build system supports changing KENLM_MAX_ORDER, change it there and recompile.  With cmake:  cmake -DKENLM_MAX_ORDER=10 .. With Moses:  bjam --max-kenlm-order=10 -a Otherwise, edit lm/max_order.hh. Byte: 173)
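
The failure above shows up only after the 10-gram model has been trained. A small check before loading can catch it earlier. The sketch below reads the `\data\` header of an ARPA file, which lists one `ngram N=count` line per order. `arpa_order` is a hypothetical helper, and the maximum order of 6 is taken from the error message above; it depends on how KenLM was compiled.

```python
import os
import re
import tempfile


def arpa_order(path):
    """Return the highest n-gram order listed in the ARPA header of `path`."""
    order = 0
    with open(path, encoding="utf-8", errors="replace") as f:
        for line in f:
            line = line.strip()
            m = re.match(r"ngram (\d+)=\d+$", line)
            if m:
                order = max(order, int(m.group(1)))
            elif line.startswith("\\") and line.endswith("-grams:"):
                break  # header is over; the n-gram sections start here
    return order


# Demo on a tiny hand-written header (counts are illustrative).
demo = "\\data\\\nngram 1=5\nngram 2=7\nngram 3=4\n\n\\1-grams:\n"
with tempfile.NamedTemporaryFile("w", suffix=".arpa", delete=False) as tmp:
    tmp.write(demo)
demo_order = arpa_order(tmp.name)
os.remove(tmp.name)

MAX_ORDER = 6  # from the error message; depends on the KenLM build
print(demo_order, demo_order <= MAX_ORDER)
```

Calling `arpa_order('amazonLM10.arpa')` before `kenlm.LanguageModel(...)` would make the order mismatch visible without triggering the exception.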

In [None]:
train_ppl = get_ppl(model_10n, train_data)
train_ppl

In [None]:
valid_ppl = get_ppl(model_10n, valid_data)
valid_ppl

### Comparisons of different ngram models

In [None]:
sentences = ['i like pandas']

In [None]:
ppl3 = get_ppl(model_3n, sentences)
ppl5 = get_ppl(model_5n, sentences)
ppl10 = get_ppl(model_10n, sentences)
ppl3, ppl5, ppl10

In [None]:
sentences = ['this shirt fits me very well !']

In [None]:
ppl3 = get_ppl(model_3n, sentences)
ppl5 = get_ppl(model_5n, sentences)
ppl10 = get_ppl(model_10n, sentences)
ppl3, ppl5, ppl10

In [None]:
sentences = ['i was very disappointed in the color of these shoes , so i returned them .']

In [None]:
ppl3 = get_ppl(model_3n, sentences)
ppl5 = get_ppl(model_5n, sentences)
ppl10 = get_ppl(model_10n, sentences)
ppl3, ppl5, ppl10

In [104]:
def load_data(path):
    """Read a text file and return its lines (trailing newlines are kept)."""
    with open(path) as f:
        return [line for line in f]

In [105]:
def get_oov(model, data):
    """Return (set of OOV words, set of all words) for a list of sentences."""
    oov = set()
    vocab = set()
    for sent in data:
        words = sent.split()
        vocab.update(words)
        # Find out-of-vocabulary words
        for w in words:
            if w not in model:
                oov.add(w)
    return oov, vocab
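
Beyond listing the OOV word types, the token-level OOV rate is often useful when reading perplexities. The sketch below is a minimal version that tests membership with `in`, as `get_oov` does. A plain Python set stands in for the model here, and `oov_rate` is a hypothetical helper, not part of KenLM.

```python
def oov_rate(vocab_like, sentences):
    """Return (number of OOV tokens, total tokens, OOV rate) over `sentences`.

    `vocab_like` is anything that supports `word in vocab_like`.
    """
    n_oov = 0
    n_total = 0
    for sent in sentences:
        for w in sent.split():
            n_total += 1
            if w not in vocab_like:
                n_oov += 1
    rate = n_oov / n_total if n_total else 0.0
    return n_oov, n_total, rate


# Toy vocabulary standing in for a trained model (illustrative only).
toy_vocab = {"i", "like", "this", "tutu", "."}
stats = oov_rate(toy_vocab, ["i like this tutu .", "i like pandas"])
print(stats)
```

In the toy example, one of the eight tokens ("pandas") is out of vocabulary, giving a rate of 0.125.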

In [106]:
path_to_train = '/home/roberta/ammi-2019-nlp/data/train.txt'
train_data = load_data(path_to_train)
train_data[:3]

["this is a great tutu and at a really great price . it doesn ' t look cheap at all . i ' m so glad i looked on amazon and found such an affordable tutu that isn ' t made poorly . a + + \n",
 'i bought this for my 4 yr old daughter for dance class , she wore it today for the first time and the teacher thought it was adorable . i bought this to go with a light blue long sleeve leotard and was happy the colors matched up great . price was very good too since some of these go for over $ 15 . 00 dollars . \n',
 'what can i say . . . my daughters have it in orange , black , white and pink and i am thinking to buy for they the fuccia one . it is a very good way for exalt a dancer outfit : great colors , comfortable , looks great , easy to wear , durables and little girls love it . i think it is a great buy for costumer and play too . \n']

In [33]:
oov, vocab = get_oov(model_3n, train_data)
# sorted(oov)[:10]