# KenLM Sentence-based

#### based on:

https://kheafield.com/papers/avenue/kenlm.pdf 

https://kheafield.com/papers/edinburgh/estimate_paper.pdf

#### implementation of the estimation part:

https://kheafield.com/code/kenlm/

Language models are estimated from text using __[modified](http://www.ee.columbia.edu/~stanchen/papers/h015a-techreport.pdf)__ [Kneser-Ney smoothing](https://ieeexplore.ieee.org/document/479394) without pruning. Estimation is done on disk, which makes it possible to build much larger models. Kneser-Ney smoothing consistently outperforms all other smoothed n-gram models evaluated in this [tech report](http://www.ee.columbia.edu/~stanchen/papers/h015a-techreport.pdf) by Chen and Goodman.

In this notebook I split and tokenize the __[Ukrainian Brown Corpus](https://github.com/brown-uk/corpu)__ (good and so-so datasets) into sentences, keeping punctuation symbols. Then I use KenLM scripts to estimate sentence-based ARPA n-gram models. By default the KenLM script __lmplz__ adds $<s>$ and $</s>$ tags at the beginning and end of each sentence.

The model is sentence-based only: it scores each sentence separately, and the score is the log10 probability of the sentence. To calculate the __perplexity__ of my model, I sum the scores of all sentences in the corpus, divide the sum by the number of words in the corpus, negate the result and raise 10 to that power.
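Written as a formula, the computation described above looks as follows. The symbols $s_1, \dots, s_M$ (the sentences of the corpus) and $N$ (the total number of words counted) are introduced here only for illustration:

$$
\mathrm{PP} = 10^{-\frac{1}{N}\sum_{i=1}^{M}\log_{10} p(s_i)}
$$

The log10 scores are negative, so the minus sign makes the exponent positive and the perplexity greater than 1.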

## Installing KenLM

In [1]:
import os
import kenlm

    sudo apt-get install build-essential libboost-all-dev cmake zlib1g-dev libbz2-dev liblzma-dev

    wget -O - https://kheafield.com/code/kenlm.tar.gz | tar xz
    cd kenlm
    mkdir -p build && cd build
    cmake ..
    make -j 4
         
## Tokenize text
         
Here the __[Ukrainian Brown Corpus](https://github.com/brown-uk/corpu)__ (good and so-so datasets) is split into sentences, keeping punctuation symbols.
         
#### [TokenizeText.groovy](https://github.com/brown-uk/nlp_uk/blob/master/src/main/groovy/org/nlp_uk/tools/README.md)


Analyzes text and writes the result to the output file:

        splits into sentences (-s)
        splits into tokens (-w) (the results include punctuation, so all tokens are separated by vertical bars)
        splits into words (-u)


In [1]:
!groovy nlp_uk/nlp_uk/src/main/groovy/org/nlp_uk/tools/TokenizeText.groovy -s -w -i final/ukrlib_final.txt -o final/ukrlib_final_symbols_sentences.txt    

writing into final/ukrlib_final_symbols_sentences.txt


_!!! Should I lowercase? !!!_

Not lowercased

In [4]:
import fileinput
filename = os.path.join(os.path.abspath(''), 'final', 'ukrlib_final_symbols_sentences.txt')
# Rewrite the file in place: with inplace=1, print() writes back into the file.
for line in fileinput.FileInput(filename, inplace=1):
    line = line.replace("|", " ")      # token separators become spaces (text is not lowercased)
    line = line.replace("\n", "")      # strip the newline; print() adds it back
    line = line.replace(r"\n", "")     # strip literal backslash-n sequences
    line = line.replace("_foreign_", " _foreign_ ")  # keep the placeholder as its own token
    print(line)
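The same clean-up can be sketched as a pure function, which is easier to try on a single line than the in-place rewrite. The name `normalize_line` and the example string are made up for illustration; the replacements mirror the cell above.

```python
def normalize_line(line):
    """Turn one '|'-separated tokenizer line into a space-separated sentence."""
    line = line.replace("|", " ")        # token separators become spaces
    line = line.replace("\n", "")        # drop the real newline
    line = line.replace(r"\n", "")       # drop literal backslash-n sequences
    line = line.replace("_foreign_", " _foreign_ ")  # isolate the placeholder
    return line


# Hypothetical tokenizer output, made up for illustration.
example = "Краса|врятує|світ|.\n"
print(normalize_line(example))
```

Because the function does not touch any file, it can be checked on a few lines before the corpus file is overwritten.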

Example result file: (+lowercase)

    У   2013   році ,   до   100-річчя   виходу   першого   числа   журналу   _FOREIGN_ ,   на   будинку   встановили   меморіальну   дошку .   
    Тоді   ж   таки   в   будинку   відбувся   перший   з'їзд   есперантистів ,   на   якому   було   50   есперантистів   з   усієї   України   і   троє   з-за   кордону .   
    Відтоді   щороку   вони   там   організовують   конференції ,   починаючи   з   2013-го .   
    Щороку   там   вручають   премію   тим ,   хто   пропаґує   український   есперантський   рух   та   український   погляд   на   важливі   події .  
    _FOREIGN_  Більшість   дописів   на   сторінці   Сергія   Шматкова   у   соціальній   мережі   _FOREIGN_   —   мовою   есперанто .  
    « У   мене   у   _FOREIGN_   понад   дві   тисячі   друзів   з   усього   світу ,   з   якими   я   спілкуюсь   мовою   есперанто » ,   —   розповідає   пан   Сергій .  
    Цю   незвичну   для   багатьох   мову   Сергій   Шматков   вивчив   ще   у   1980-х .   
    Народився   і   прожив   чоловік   усе   життя   в   Донецькій   області ,   а   після   окупації   перебрався   до   Львова .  
    
    
#### Number of sentences

In [5]:
with open("final/ukrlib_final_symbols_sentences.txt", "r") as f:
    summa = sum(1 for _ in f)  # one sentence per line

In [6]:
summa

5746840

## Estimating Large Language Models with KenLM

The tokenized, sentence-split __[Ukrainian Brown Corpus](https://github.com/brown-uk/corpu)__ (good and so-so datasets) is provided on stdin and the __ARPA__ file is written to stdout.

#### kenlm/build/bin/lmplz -o -S -T    
        -o
            Required. Order of the language model to estimate.
        -S
            Recommended. Memory to use. This is a number followed by a single-character suffix: % for percentage of physical memory (on platforms where this is measured), b for bytes, K for kilobytes, M for megabytes, and so on for G and T. If no suffix is given, kilobytes are assumed for compatibility with GNU sort. The sort program is not used; the command line is simply designed to be compatible.
        -T
            Recommended. Temporary file location.

Here 3-gram, 4-gram, 5-gram and 6-gram models are estimated with the KenLM library and saved to the corresponding ARPA files.
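The four commands differ only in the order and the output file, so they can be generated in a loop. This is a minimal sketch: the helper name `lmplz_command` and its parameters are hypothetical, and only the `-o`, `-S` and `-T` options described above are used. It prints the commands rather than running them.

```python
import shlex


def lmplz_command(order, corpus, arpa, memory="10%", tmp_dir=None,
                  lmplz="kenlm/build/bin/lmplz"):
    """Return the shell command that estimates an ARPA model of the given order."""
    args = [lmplz, "-o", str(order), "-S", memory]
    if tmp_dir is not None:
        args += ["-T", tmp_dir]  # temporary file location
    cmd = " ".join(shlex.quote(a) for a in args)
    # lmplz reads the corpus on stdin and writes the ARPA file to stdout.
    return "{0} < {1} > {2}".format(cmd, shlex.quote(corpus), shlex.quote(arpa))


corpus = "final/ukrlib_final_symbols_sentences.txt"
for order in (3, 4, 5, 6):
    arpa = "final/kenlm/ukrlib_final_symbols_sentences_based_{0}.arpa".format(order)
    print(lmplz_command(order, corpus, arpa))
```

The 6-gram run in this notebook used `-S 7%` instead of the default shown here; that can be passed as `memory="7%"`.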
            

In [8]:
!kenlm/build/bin/lmplz -o 3 -S 10% <final/ukrlib_final_symbols_sentences.txt> final/kenlm/ukrlib_final_symbols_sentences_based_3.arpa

=== 1/5 Counting and sorting n-grams ===
Reading /home/ana/Downloads/master diploma/code/final/ukrlib_final_symbols_sentences.txt
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Unigram tokens 78999442 types 1553865
=== 2/5 Calculating and sorting adjusted counts ===
Chain sizes: 1:18646380 2:527945280 3:989897472
Statistics:
1 1553865 D1=0.676167 D2=1.01699 D3+=1.3515
2 18397852 D1=0.799054 D2=1.10845 D3+=1.32954
3 43460054 D1=0.810951 D2=1.38829 D3+=1.41947
Memory estimate for binary LM:
type      MB
probing 1205 assuming -p 1.5
probing 1316 assuming -r models -p 1.5
trie     546 without quantization
trie     323 assuming -q 8 -b 8 quantization 
trie     507 assuming -a 22 array pointer compression
trie     284 assuming -a 22 -q 8 -b 8 array pointer compression and quantization
=== 3/5 Calculating and sorting initial probabilities =

In [None]:
!kenlm/build/bin/lmplz -o 4 -S 10% <final/ukrlib_final_symbols_sentences.txt> final/kenlm/ukrlib_final_symbols_sentences_based_4.arpa

In [None]:
!kenlm/build/bin/lmplz -o 5 -S 10% <final/ukrlib_final_symbols_sentences.txt> final/kenlm/ukrlib_final_symbols_sentences_based_5.arpa

In [7]:
!kenlm/build/bin/lmplz -o 6 -S 7% <final/ukrlib_final_symbols_sentences.txt> final/kenlm/ukrlib_final_symbols_sentences_based_6.arpa

=== 1/5 Counting and sorting n-grams ===
Reading /home/ana/Downloads/master diploma/code/final/ukrlib_final_symbols_sentences.txt
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Unigram tokens 78999442 types 1553865
=== 2/5 Calculating and sorting adjusted counts ===
Chain sizes: 1:18646380 2:50172556 3:94073544 4:150517680 5:219504944 6:301035360
Statistics:
1 1553865 D1=0.676167 D2=1.01699 D3+=1.3515
2 18397852 D1=0.799054 D2=1.10845 D3+=1.32954
3 43460054 D1=0.884098 D2=1.18897 D3+=1.33575
4 58161810 D1=0.938092 D2=1.30631 D3+=1.38554
5 60860373 D1=0.968972 D2=1.46989 D3+=1.44609
6 58114299 D1=0.905692 D2=1.84295 D3+=1.69811
Memory estimate for binary LM:
type      MB
probing 5176 assuming -p 1.5
probing 6217 assuming -r models -p 1.5
trie    2767 without quantization
trie    1594 assuming -q 8 -b 8 quantization 
trie    2362 assum

In [2]:
LM3 = os.path.join(os.path.abspath(''), 'final', 'kenlm', 'ukrlib_final_symbols_sentences_based_3.arpa')
#LM4 = os.path.join(os.path.abspath(''), 'final', 'kenlm', 'ukrlib_final_symbols_sentences_based_4.arpa')
#LM5 = os.path.join(os.path.abspath(''), 'final', 'kenlm', 'ukrlib_final_symbols_sentences_based_5.arpa')
LM6 = os.path.join(os.path.abspath(''), 'final', 'kenlm', 'ukrlib_final_symbols_sentences_based_6.arpa')

In [3]:
model3 = kenlm.LanguageModel(LM3)

In [4]:
#model4 = kenlm.LanguageModel(LM4)
#model5 = kenlm.LanguageModel(LM5)
model6 = kenlm.LanguageModel(LM6)
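Loading a large ARPA file can be slow. The lmplz log above prints memory estimates for KenLM's binary formats, and KenLM ships a `build_binary` tool for that conversion. The sketch below only builds the command. The `.binary` file name, the helper name `build_binary_command` and the choice of the `trie` structure are assumptions for illustration; this notebook did not run this step.

```python
import shlex


def build_binary_command(arpa, binary, structure="trie",
                         tool="kenlm/build/bin/build_binary"):
    """Return the shell command converting an ARPA file to a KenLM binary file."""
    if structure not in ("probing", "trie"):
        raise ValueError("structure must be 'probing' or 'trie'")
    return " ".join(shlex.quote(a) for a in (tool, structure, arpa, binary))


print(build_binary_command(
    "final/kenlm/ukrlib_final_symbols_sentences_based_6.arpa",
    "final/kenlm/ukrlib_final_symbols_sentences_based_6.binary"))
```

The resulting file can be passed to `kenlm.LanguageModel` in place of the ARPA path.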

### Sentences scores

#### model.score(self, sentence, bos = True, eos = True)

Return the __log10 probability of a string__.  By default, the string is treated as a sentence.  
          
          return log10 p(sentence </s> | <s>)

If you do not want to condition on the beginning of sentence, pass __bos = False__ . Never include $<s>$ as part of the string. 

Similarly, the end of sentence token $</s>$ can be omitted with __eos = False__. Since language models explicitly predict $</s>$, it can be part of the string.


I leave bos and eos at their default of True here, so each string is scored as a complete sentence.
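To see what the two flags change, the same string can be scored under all four combinations. A minimal sketch: the helper names are hypothetical, and `model` is assumed to be any object with the `score(sentence, bos=..., eos=...)` method described above, such as a loaded `kenlm.LanguageModel`.

```python
def score_variants(model, sentence):
    """Score one string under all four bos/eos settings of model.score."""
    return {
        (bos, eos): model.score(sentence, bos=bos, eos=eos)
        for bos in (True, False)
        for eos in (True, False)
    }


def print_score_variants(model, sentence):
    """Print one line per bos/eos combination."""
    for (bos, eos), logprob in score_variants(model, sentence).items():
        print("bos={0!s:5} eos={1!s:5} log10 p = {2}".format(bos, eos, logprob))
```

With the models loaded in this notebook it could be called as `print_score_variants(model6, sentence)` for any tokenized sentence string.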

In [6]:
sentence1 = 'Штучний інтелект врятує світ .'
print(sentence1)
print('{0}-gram model'.format(model3.order))
print(model3.score(sentence1))
#print('{0}-gram model'.format(model4.order))
#print(model4.score(sentence1))
#print('{0}-gram model'.format(model5.order))
#print(model5.score(sentence1))
print('{0}-gram model'.format(model6.order))
print(model6.score(sentence1))

Штучний інтелект врятує світ .
3-gram model
-20.363344192504883
6-gram model
-20.321420669555664


In [7]:
sentence2 = '_#foreign_ врятує світ .'
print(sentence2)
print('{0}-gram model'.format(model3.order))
print(model3.score(sentence2))
#print('{0}-gram model'.format(model4.order))
#print(model4.score(sentence2))
#print('{0}-gram model'.format(model5.order))
#print(model5.score(sentence2))
print('{0}-gram model'.format(model6.order))
print(model6.score(sentence2))

_#foreign_ врятує світ .
3-gram model
-12.704010963439941
6-gram model
-12.685079574584961


In [8]:
sentence3 = 'Наука врятує світ .'
print(sentence3)
print('{0}-gram model'.format(model3.order))
print(model3.score(sentence3))
#print('{0}-gram model'.format(model4.order))
#print(model4.score(sentence3))
#print('{0}-gram model'.format(model5.order))
#print(model5.score(sentence3))
print('{0}-gram model'.format(model6.order))
print(model6.score(sentence3))

Наука врятує світ .
3-gram model
-12.998435020446777
6-gram model
-12.990907669067383


In [9]:
sentence4 = 'Краса врятує світ .'
print(sentence4)
print('{0}-gram model'.format(model3.order))
print(model3.score(sentence4))
#print('{0}-gram model'.format(model4.order))
#print(model4.score(sentence4))
#print('{0}-gram model'.format(model5.order))
#print(model5.score(sentence4))
print('{0}-gram model'.format(model6.order))
print(model6.score(sentence4))

Краса врятує світ .
3-gram model
-8.816818237304688
6-gram model
-9.31010913848877


#### Check that total full score = direct score

In [10]:
def score(model,s):
    return sum(prob for prob, _, _ in model.full_scores(s))

In [11]:
assert (abs(score(model6, sentence1) - model6.score(sentence1)) < 1e-3)
assert (abs(score(model6, sentence2) - model6.score(sentence2)) < 1e-3)
assert (abs(score(model6, sentence3) - model6.score(sentence3)) < 1e-3)
assert (abs(score(model6, sentence4) - model6.score(sentence4)) < 1e-3)

#### Show scores and n-gram matches

In [12]:
words = ['<s>'] + sentence4.split() + ['</s>']
for i, (prob, length, oov) in enumerate(model6.full_scores(sentence4)):
    print('{0} {1}: {2}'.format(prob, length, ' '.join(words[i+2-length:i+2])))
    if oov:
        print('\t"{0}" is an OOV'.format(words[i+1]))

-4.452103614807129 2: <s> Краса
-2.579751491546631 2: Краса врятує
-1.146461009979248 3: Краса врятує світ
-1.1242060661315918 3: врятує світ .
-0.007586266845464706 4: врятує світ . </s>


For n-grams that end in $</s>$ the second number is always zero, i.e. there is no dependency between sentences.
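The match lengths printed by `full_scores` show how much context the model actually used for each token. A small sketch that counts them; the function name `match_length_histogram` is hypothetical, and the tuples are copied from the output above (no word was reported as OOV there).

```python
from collections import Counter


def match_length_histogram(full_scores):
    """Count how often each n-gram length was matched.

    `full_scores` is any iterable of (log10_prob, ngram_length, is_oov)
    tuples, e.g. the result of model.full_scores(sentence).
    """
    return Counter(length for _, length, _ in full_scores)


# Values copied from the printed output for sentence4.
printed = [
    (-4.452103614807129, 2, False),
    (-2.579751491546631, 2, False),
    (-1.146461009979248, 3, False),
    (-1.1242060661315918, 3, False),
    (-0.007586266845464706, 4, False),
]
print(sorted(match_length_histogram(printed).items()))
```

For this sentence no match is longer than 4 tokens, so the 6-gram model never uses its full order here.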

#### Calculating the perplexity of the sentence

In [13]:
def perplexity(model, sentence, bos=True, eos=True):
    """
    Compute perplexity of a sentence.
    @param sentence One full sentence to score.  Do not include <s> or </s>.
    """
    words = len(str(sentence).split()) + 1 # For </s>
    return 10.0**(-model.score(sentence, bos=bos, eos=eos) / words)

In [14]:
print(perplexity(model6, sentence4))

72.78163837647064


In [15]:
print(model6.perplexity(sentence4))

72.78163837647064
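The same value can be reproduced by hand from the per-token scores printed earlier for `sentence4`. This is only a worked check of the arithmetic; the numbers are copied from the `full_scores` output above.

```python
# Per-token log10 probabilities printed by model6.full_scores(sentence4).
log10_probs = [
    -4.452103614807129,     # Краса
    -2.579751491546631,     # врятує
    -1.146461009979248,     # світ
    -1.1242060661315918,    # .
    -0.007586266845464706,  # </s>
]

total = sum(log10_probs)        # log10 probability of the whole sentence
tokens = len(log10_probs)       # 4 words plus </s>
ppl = 10.0 ** (-total / tokens)
print(total)
print(ppl)
```

The result agrees with the reported 72.78 up to the rounding of the printed per-token values.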


#### Find out-of-vocabulary words from the sentence "Краса врятує світ ."

In [16]:
for w in words:
    if w not in model6:
        print('"{0}" is an OOV'.format(w))

### Calculating the perplexity of the model on the Ukrainian brown corpus (good and so-so)

_!!! keep in mind this is the same dataset I made my estimation ARPA models on !!!_

In [17]:
filename=os.path.join(os.path.abspath(''), 
                      'brown-uk', 'corpus',
                      'final_all_GS_tagged_words_symbols_sentences.txt')

# read the corpus, one sentence per line
with open(filename, 'r') as f:
    temp = f.read().split('\n')

The model is sentence-based only: it scores each sentence separately, and the score is the log10 probability of the sentence. To calculate the __perplexity__ of my model, I sum the scores of all sentences in the corpus, divide the sum by the number of words in the corpus, negate the result and raise 10 to that power.

In [18]:
def perplexity_on_texts_by_sentences(model, temp):
    all_score=0
    all_words=0
    for sentence in temp:
        all_score+=model.score(sentence, bos = False, eos = False)
        all_words+=len(str(sentence).split())
    print("all_score: "+str(all_score)+"; \nnumber of tokens in text: "+str(all_words))
    return 10.0**(-all_score / all_words)

In [19]:
print('{0}-gram model'.format(model3.order))
print(str(perplexity_on_texts_by_sentences(model3,temp))+"\n")
#print('{0}-gram model'.format(model4.order))
#print(str(perplexity_on_texts_by_sentences(model4,temp))+"\n")
#print('{0}-gram model'.format(model5.order))
#print(str(perplexity_on_texts_by_sentences(model5,temp))+"\n")
print('{0}-gram model'.format(model6.order))
print(str(perplexity_on_texts_by_sentences(model6,temp))+"\n")

3-gram model
all_score: -2573061.416824341; 
number of tokens in text: 732044
3272.6529066023068

6-gram model
all_score: -2554392.8496727943; 
number of tokens in text: 732044
3086.0145752047547



### Perplexity including BOS and EOS tags. Copied their function but rewritten to match the text

In [20]:
def perplexity_on_texts_by_sentences_boseos(model, temp):
    all_score=0
    all_words=0
    for sentence in temp:
        all_score+=model.score(sentence)
        all_words+=len(str(sentence).split())+1
    print("all_score: "+str(all_score)+"; \nnumber of tokens in text: "+str(all_words))
    return 10.0**(-all_score / all_words)
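Because the corpus is split on `'\n'`, any blank line (including the empty string left after a final newline) is passed to the model as a sentence. The variant below is a hedged sketch that skips such lines and ties the token count to the `eos` flag. The name `corpus_perplexity` is hypothetical, and if the file contains blank lines its results will differ slightly from the numbers reported in this notebook.

```python
def corpus_perplexity(model, sentences, bos=True, eos=True):
    """Perplexity over an iterable of sentences, skipping blank lines."""
    total_score = 0.0
    total_tokens = 0
    for sentence in sentences:
        tokens = sentence.split()
        if not tokens:
            continue  # e.g. the '' left by a trailing newline
        total_score += model.score(sentence, bos=bos, eos=eos)
        # Count </s> as a token only when it is actually scored.
        total_tokens += len(tokens) + (1 if eos else 0)
    if total_tokens == 0:
        raise ValueError("no tokens to score")
    return 10.0 ** (-total_score / total_tokens)
```

With the objects from this notebook it could be called as `corpus_perplexity(model6, temp)` or, without sentence boundary tags, as `corpus_perplexity(model6, temp, bos=False, eos=False)`.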

In [21]:
print('{0}-gram model'.format(model3.order))
print(str(perplexity_on_texts_by_sentences_boseos(model3,temp))+"\n")
#print('{0}-gram model'.format(model4.order))
#print(str(perplexity_on_texts_by_sentences_boseos(model4,temp))+"\n")
#print('{0}-gram model'.format(model5.order))
#print(str(perplexity_on_texts_by_sentences_boseos(model5,temp))+"\n")
print('{0}-gram model'.format(model6.order))
print(str(perplexity_on_texts_by_sentences_boseos(model6,temp))+"\n")

3-gram model
all_score: -2501458.3301939964; 
number of tokens in text: 771658
1744.4843310478998

6-gram model
all_score: -2479867.22213459; 
number of tokens in text: 771658
1635.63698051947
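As a sanity check, the four perplexities can be recomputed from the totals printed above without loading any model. The helper name `perplexity_from_totals` is hypothetical; the numbers are copied from the outputs of the two cells.

```python
def perplexity_from_totals(total_log10, tokens):
    """Perplexity from a summed log10 score and a token count."""
    return 10.0 ** (-total_log10 / tokens)


# (summed log10 score, number of tokens) as printed in this notebook.
reported = {
    "3-gram, no bos/eos": (-2573061.416824341, 732044),
    "6-gram, no bos/eos": (-2554392.8496727943, 732044),
    "3-gram, bos/eos": (-2501458.3301939964, 771658),
    "6-gram, bos/eos": (-2479867.22213459, 771658),
}
for name, (score, tokens) in reported.items():
    print(name, perplexity_from_totals(score, tokens))
```

The two token counts differ by 771658 - 732044 = 39614, which is the number of entries in `temp`, since the second function adds one $</s>$ per entry.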

