# KenLM Sentence-based

#### based on:

https://kheafield.com/papers/avenue/kenlm.pdf 

https://kheafield.com/papers/edinburgh/estimate_paper.pdf

#### implementation of the estimation part:

https://kheafield.com/code/kenlm/

Language models are estimated from text using __[modified](http://www.ee.columbia.edu/~stanchen/papers/h015a-techreport.pdf)__ [Kneser-Ney smoothing](https://ieeexplore.ieee.org/document/479394) without pruning. Estimation is done on disk, which makes it possible to build much larger models. Kneser-Ney smoothing consistently outperforms all other smoothed n-gram models evaluated in this [tech report](http://www.ee.columbia.edu/~stanchen/papers/h015a-techreport.pdf) by Chen and Goodman.

In this notebook I split the __[Ukrainian Brown Corpus](https://github.com/brown-uk/corpu)__ (good and so-so datasets) into sentences and tokenize it, keeping punctuation symbols. Then I use KenLM tools to estimate sentence-based n-gram models in ARPA format. By default, the KenLM tool __lmplz__ adds $<s>$ and $</s>$ tags at the beginning and end of each sentence.

This is a purely sentence-based model: it scores each sentence separately. The score is the log10 probability of the sentence. To calculate the __perplexity__ of the model, I sum the scores of all sentences in the corpus, divide the sum by the number of words in the corpus, negate the result, and raise 10 to that power.
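
The calculation described above can be sketched in a few lines. This is only an illustration, not KenLM code: the function name `corpus_perplexity` and the toy scores are assumptions made for the example.

```python
def corpus_perplexity(log10_scores, token_counts):
    """Perplexity from per-sentence log10 probabilities and token counts."""
    total_score = sum(log10_scores)   # log10 P(corpus), sentences scored independently
    total_tokens = sum(token_counts)  # normalizer N
    if total_tokens == 0:
        raise ValueError("corpus contains no tokens")
    # PPL = 10 ** (-(1/N) * sum_i log10 P(sentence_i))
    return 10.0 ** (-total_score / total_tokens)


# Toy example with made-up scores for two three-token sentences.
print(corpus_perplexity([-4.0, -2.0], [3, 3]))
```

Whether the `</s>` token is counted in `token_counts` changes the result, which is why two variants of this calculation appear later in the notebook.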

## Installing KenLM

In [1]:
import os
import kenlm
import datetime

In [2]:
!python3 --version


In [3]:
import importlib.util
spec = importlib.util.spec_from_file_location("kenlm", "/home/nastuha97/.local/lib/python3.6/site-packages/kenlm.cpython-36m-x86_64-linux-gnu.so")
kenlm = importlib.util.module_from_spec(spec)
spec.loader.exec_module(kenlm)

In [4]:
kenlm.__file__

'/home/nastuha97/.local/lib/python3.6/site-packages/kenlm.cpython-36m-x86_64-linux-gnu.so'

    # sudo apt-get install build-essential libboost-all-dev cmake zlib1g-dev libbz2-dev liblzma-dev

    # wget -O - https://kheafield.com/code/kenlm.tar.gz | tar xz
    # cd kenlm
    # mkdir -p build && cd build
    # cmake ..
    # make -j 4
         
## Tokenize text
         
Here the __[Ukrainian Brown Corpus](https://github.com/brown-uk/corpu)__ (good and so-so datasets) is split into sentences, keeping punctuation symbols.
         
#### [TokenizeText.groovy](https://github.com/brown-uk/nlp_uk/blob/master/src/main/groovy/org/nlp_uk/tools/README.md)


Analyzes the text and writes the result to the output file:

        splits into sentences (-s)
        splits into tokens (-w) (the results include punctuation, so all tokens are separated by vertical bars)
        splits into words (-u)


In [2]:
!groovy nlp_uk/nlp_uk/src/main/groovy/org/nlp_uk/tools/TokenizeText.groovy -s -w -i final/korr_final.txt -o final/korr_final_symbols_sentences.txt

writing into final/korr_final_symbols_sentences.txt
^C


_!!! Should I lowercase? !!!_

In [4]:
import fileinput
filename = os.path.join(os.path.abspath(''), 'final', 'korr_final_symbols_sentences.txt')
# Rewrite the file in place: with inplace=True, print() writes back into the file.
with fileinput.input(filename, inplace=True) as f:
    for line in f:
        line = line.replace("|", " ")   # token separators -> spaces; append .lower() to lowercase
        line = line.replace("\n", "")   # drop the real newline; print() adds one back
        line = line.replace(r"\n", "")  # drop literal backslash-n sequences
        line = line.replace("_foreign_", " _foreign_ ")  # keep the placeholder a separate token
        print(line)
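
The same cleanup can be written as a pure function, which makes it easy to try on a single line before rewriting the whole file in place. This is a sketch only: the name `clean_line`, the `lowercase` switch, and the sample input are assumptions for illustration.

```python
def clean_line(line, lowercase=False):
    """Turn one '|'-separated tokenizer line into space-separated tokens."""
    line = line.replace("|", " ")                     # token separators -> spaces
    line = line.replace("\n", "").replace(r"\n", "")  # real and literal newlines
    line = line.replace("_foreign_", " _foreign_ ")   # keep the placeholder separate
    if lowercase:
        line = line.lower()
    return line


# Hypothetical input line in the tokenizer's vertical-bar format.
print(clean_line("Тоді|ж|таки|.\n"))
```

Keeping the lowercasing behind a flag leaves the open question above (whether to lowercase) as a one-argument change.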

Example result file: (+lowercase)

    У   2013   році ,   до   100-річчя   виходу   першого   числа   журналу   _FOREIGN_ ,   на   будинку   встановили   меморіальну   дошку .   
    Тоді   ж   таки   в   будинку   відбувся   перший   з'їзд   есперантистів ,   на   якому   було   50   есперантистів   з   усієї   України   і   троє   з-за   кордону .   
    Відтоді   щороку   вони   там   організовують   конференції ,   починаючи   з   2013-го .   
    Щороку   там   вручають   премію   тим ,   хто   пропаґує   український   есперантський   рух   та   український   погляд   на   важливі   події .  
    _FOREIGN_  Більшість   дописів   на   сторінці   Сергія   Шматкова   у   соціальній   мережі   _FOREIGN_   —   мовою   есперанто .  
    « У   мене   у   _FOREIGN_   понад   дві   тисячі   друзів   з   усього   світу ,   з   якими   я   спілкуюсь   мовою   есперанто » ,   —   розповідає   пан   Сергій .  
    Цю   незвичну   для   багатьох   мову   Сергій   Шматков   вивчив   ще   у   1980-х .   
    Народився   і   прожив   чоловік   усе   життя   в   Донецькій   області ,   а   після   окупації   перебрався   до   Львова .  
    
    
#### Number of sentences

In [5]:
with open("final/korr_final_symbols_sentences.txt", "r") as f:
    summa = sum(1 for _ in f)  # one sentence per line

In [6]:
summa

8640598

## Estimating Large Language Models with KenLM

The tokenized, sentence-split __[Ukrainian Brown Corpus](https://github.com/brown-uk/corpu)__ (good and so-so datasets) is provided on stdin, and the __ARPA__ model is written to stdout.

#### kenlm/build/bin/lmplz -o -S -T    
        -o
            Required. Order of the language model to estimate.
        -S
            Recommended. Memory to use. This is a number followed by a single-character suffix: % for percentage of physical memory (on platforms where this is measured), b for bytes, K for kilobytes, M for megabytes, and so on for G and T. If no suffix is given, kilobytes are assumed for compatibility with GNU sort. The sort program is not used; the command line is simply designed to be compatible.
        -T
            Recommended. Temporary file location.
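
The options above can also be assembled from Python instead of a `!` shell line. The following is a minimal sketch, assuming the `lmplz` binary sits at `kenlm/build/bin/lmplz` as in this notebook; the helper names `lmplz_args` and `estimate_arpa` are hypothetical.

```python
import subprocess


def lmplz_args(order, memory="10%", tmp_dir=None, lmplz="kenlm/build/bin/lmplz"):
    """Build the lmplz argument list from the -o, -S and -T options."""
    args = [lmplz, "-o", str(order), "-S", memory]
    if tmp_dir is not None:
        args += ["-T", tmp_dir]
    return args


def estimate_arpa(corpus_path, arpa_path, order, **kwargs):
    """Run lmplz with the corpus on stdin and the ARPA file on stdout."""
    with open(corpus_path, "rb") as src, open(arpa_path, "wb") as dst:
        subprocess.run(lmplz_args(order, **kwargs), stdin=src, stdout=dst, check=True)


# Only builds and prints the command; nothing is executed here.
print(lmplz_args(3, tmp_dir="/tmp"))
```

Passing the files as `stdin` and `stdout` mirrors the `<input> output` redirection used in the cells below.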

Here 3-gram, 4-gram, 5-gram and 6-gram models are estimated with the KenLM library and saved into the corresponding ARPA files.
            

In [None]:
#----!kenlm/build/bin/lmplz -o 4 -S 10% <final/korr_final_symbols_sentences.txt> final/kenlm/korr_final_symbols_sentences_based_4.arpa
#----!kenlm/build/bin/lmplz -o 5 -S 10% <final/korr_final_symbols_sentences.txt> final/kenlm/korr_final_symbols_sentences_based_5.arpa
print(datetime.datetime.now())
!kenlm/build/bin/lmplz -o 6 -S 10% <final/uk_final_symbols_sentences.txt> final/kenlm/uk_final_symbols_sentences_based_6.arpa 
print(datetime.datetime.now())

    kenlm/build/bin/lmplz -o 6 -S 10% <final/uk_final_symbols_sentences.txt> final/kenlm/uk_final_symbols_sentences_based_6.arpa
    === 1/5 Counting and sorting n-grams ===
    Reading /home/nastuha97/master/final/uk_final_symbols_sentences.txt
    ----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
    ****************************************************************************************************
    Unigram tokens 241313710 types 2261926
    === 2/5 Calculating and sorting adjusted counts ===
    Chain sizes: 1:27143112 2:665933312 3:1248624896 4:1997799936 5:2913458176 6:3995599872
    Statistics:
    1 2261926 D1=0.67269 D2=1.02208 D3+=1.35105
    2 36830739 D1=0.771578 D2=1.10498 D3+=1.36663
    3 101599818 D1=0.857522 D2=1.18796 D3+=1.36978
    4 148498582 D1=0.913768 D2=1.28426 D3+=1.39683
    5 167508186 D1=0.947499 D2=1.39375 D3+=1.45894
    6 169630105 D1=0.889727 D2=1.52732 D3+=1.59437
    Memory estimate for binary LM:
    type       MB
    probing 13369 assuming -p 1.5
    probing 15978 assuming -r models -p 1.5
    trie     7240 without quantization
    trie     4229 assuming -q 8 -b 8 quantization 
    trie     6121 assuming -a 22 array pointer compression
    trie     3110 assuming -a 22 -q 8 -b 8 array pointer compression and quantization
    === 3/5 Calculating and sorting initial probabilities ===
    Chain sizes: 1:27143112 2:589291824 3:1274550272 4:2039280512 5:2973950464 6:4078561024
    ----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
    ####################################################################################################
    === 4/5 Calculating and writing order-interpolated probabilities ===
    Chain sizes: 1:27143112 2:589291824 3:1150784000 4:1841254272 5:2685162240 6:3682508544
    ----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
    ####################################################################################################
    === 5/5 Writing ARPA model ===
    ----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
    ****************************************************************************************************
    Name:lmplz      VmPeak:11891512 kB      VmRSS:65996 kB  RSSMax:10731432 kB      user:853.716    sys:177.131     CPU:1030.85        real:1955.51

In [3]:
print(datetime.datetime.now())
!kenlm/build/bin/lmplz -o 3 -S 10% <final/uk_final_symbols_sentences.txt> final/kenlm/uk_final_symbols_sentences_based_3.arpa 
print(datetime.datetime.now())

    2019-12-07 11:56:19.000665
    === 1/5 Counting and sorting n-grams ===
    Reading /home/nastuha97/master/final/uk_final_symbols_sentences.txt
    ----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
    ****************************************************************************************************
    Unigram tokens 241313710 types 2261926
    === 2/5 Calculating and sorting adjusted counts ===
    Chain sizes: 1:27143112 2:3810655232 3:7144978944
    Statistics:
    1 2261926 D1=0.67269 D2=1.02208 D3+=1.35105
    2 36830739 D1=0.771578 D2=1.10498 D3+=1.36663
    3 101599818 D1=0.786184 D2=1.25763 D3+=1.43014
    Memory estimate for binary LM:
    type      MB
    probing 2643 assuming -p 1.5
    probing 2862 assuming -r models -p 1.5
    trie    1185 without quantization
    trie     700 assuming -q 8 -b 8 quantization 
    trie    1103 assuming -a 22 array pointer compression
    trie     618 assuming -a 22 -q 8 -b 8 array pointer compression and quantization
    === 3/5 Calculating and sorting initial probabilities ===
    Chain sizes: 1:27143112 2:589291824 3:2031996360
    ----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
    ####################################################################################################
    === 4/5 Calculating and writing order-interpolated probabilities ===
    Chain sizes: 1:27143112 2:589291824 3:2031996360
    ----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
    ####################################################################################################
    === 5/5 Writing ARPA model ===
    ----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
    ****************************************************************************************************
    Name:lmplz	VmPeak:10916916 kB	VmRSS:65880 kB	RSSMax:4514276 kB	user:229.414	sys:25.4962	CPU:254.91	real:234.975
    2019-12-07 12:00:14.091664

In [4]:
LM3 = os.path.join(os.path.abspath(''), 'final', 'kenlm', 'uk_final_symbols_sentences_based_3.arpa')
#LM4 = os.path.join(os.path.abspath(''), 'final', 'kenlm', 'korr_final_symbols_sentences_based_4.arpa')
#LM5 = os.path.join(os.path.abspath(''), 'final', 'kenlm', 'korr_final_symbols_sentences_based_5.arpa')
LM6 = os.path.join(os.path.abspath(''), 'final', 'kenlm', 'uk_final_symbols_sentences_based_6.arpa')

In [20]:
!pip3 install https://github.com/kpu/kenlm/archive/master.zip --user

Collecting https://github.com/kpu/kenlm/archive/master.zip
  Downloading https://github.com/kpu/kenlm/archive/master.zip (539kB)
     |████████████████████████████████| 542kB 1.3MB/s eta 0:00:01
Building wheels for collected packages: kenlm
  Building wheel for kenlm (setup.py) ... done
  Created wheel for kenlm: filename=kenlm-0.0.0-cp36-cp36m-linux_x86_64.whl size=2301730 sha256=451014436aa121d86d289ba34052393a9d0c44b4b0f018bc1d12a069a7be5c8d
  Stored in directory: /tmp/pip-ephem-wheel-cache-y23bp3v0/wheels/2d/32/73/e3093c9d11dc8abf79c156a4db1a1c5631428059d4f9ff2cba
Successfully built kenlm


In [5]:
model3 = kenlm.LanguageModel(LM3)
#model4 = kenlm.LanguageModel(LM4)
#model5 = kenlm.LanguageModel(LM5)
#trained, but there is not enough space to open it

In [6]:
model6 = kenlm.LanguageModel(LM6)

### Sentences scores

#### model.score(self, sentence, bos = True, eos = True)

Return the __log10 probability of a string__.  By default, the string is treated as a sentence.  
          
          return log10 p(sentence </s> | <s>)

If you do not want to condition on the beginning of sentence, pass __bos = False__ . Never include $<s>$ as part of the string. 

Similarly, the end of sentence token $</s>$ can be omitted with __eos = False__. Since language models explicitly predict $</s>$, it can be part of the string.


I leave bos and eos at their defaults (True), so the method scores each string as a complete sentence.
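
To make the effect of the two flags concrete, here is a toy bigram scorer that mirrors the definition above. It is not KenLM and does no smoothing or backoff; every probability in it is made up, and the names `TOY_BIGRAMS`, `TOY_UNIGRAMS` and `toy_score` are assumptions for illustration.

```python
import math

# Hypothetical bigram probabilities p(word | previous word).
TOY_BIGRAMS = {
    ("<s>", "a"): 0.5,
    ("a", "b"): 0.1,
    ("b", "</s>"): 0.2,
}
# Hypothetical unigram probability, used when there is no <s> context.
TOY_UNIGRAMS = {"a": 0.01}


def toy_score(sentence, bos=True, eos=True):
    """Sum of log10 conditional probabilities, following the bos/eos flags."""
    total = 0.0
    prev = "<s>" if bos else None
    for word in sentence.split():
        p = TOY_UNIGRAMS[word] if prev is None else TOY_BIGRAMS[(prev, word)]
        total += math.log10(p)
        prev = word
    if eos:
        total += math.log10(TOY_BIGRAMS[(prev, "</s>")])  # predict end of sentence
    return total


print(toy_score("a b"))                        # log10 p(a b </s> | <s>)
print(toy_score("a b", bos=False, eos=False))  # log10 p(a b) with no sentence boundaries
```

With `bos=True` the first word is conditioned on `<s>`; with `eos=True` one extra term for `</s>` is added to the sum.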

In [7]:
sentence1 = 'Штучний інтелект врятує світ .'
print(sentence1)
print('{0}-gram model'.format(model3.order))
print(model3.score(sentence1))
#print('{0}-gram model'.format(model4.order))
#print(model4.score(sentence1))
#print('{0}-gram model'.format(model5.order))
#print(model5.score(sentence1))
print('{0}-gram model'.format(model6.order))
print(model6.score(sentence1))

Штучний інтелект врятує світ .
3-gram model
-13.396929740905762
6-gram model
-13.588444709777832


In [8]:
sentence2 = '_#foreign_ врятує світ .'
print(sentence2)
print('{0}-gram model'.format(model3.order))
print(model3.score(sentence2))
#print('{0}-gram model'.format(model4.order))
#print(model4.score(sentence2))
#print('{0}-gram model'.format(model5.order))
#print(model5.score(sentence2))
print('{0}-gram model'.format(model6.order))
print(model6.score(sentence2))

_#foreign_ врятує світ .
3-gram model
-10.425771713256836
6-gram model
-10.52933406829834


In [9]:
sentence3 = 'Наука врятує світ .'
print(sentence3)
print('{0}-gram model'.format(model3.order))
print(model3.score(sentence3))
#print('{0}-gram model'.format(model4.order))
#print(model4.score(sentence3))
#print('{0}-gram model'.format(model5.order))
#print(model5.score(sentence3))
print('{0}-gram model'.format(model6.order))
print(model6.score(sentence3))

Наука врятує світ .
3-gram model
-12.745466232299805
6-gram model
-12.852709770202637


In [10]:
sentence4 = 'Краса врятує світ .'
print(sentence4)
print('{0}-gram model'.format(model3.order))
print(model3.score(sentence4))
#print('{0}-gram model'.format(model4.order))
#print(model4.score(sentence4))
#print('{0}-gram model'.format(model5.order))
#print(model5.score(sentence4))
print('{0}-gram model'.format(model6.order))
print(model6.score(sentence4))

Краса врятує світ .
3-gram model
-8.470232963562012
6-gram model
-9.163093566894531


#### Check that total full score = direct score

In [11]:
def score(model,s):
    return sum(prob for prob, _, _ in model.full_scores(s))

In [12]:
assert (abs(score(model6, sentence1) - model6.score(sentence1)) < 1e-3)
assert (abs(score(model6, sentence2) - model6.score(sentence2)) < 1e-3)
assert (abs(score(model6, sentence3) - model6.score(sentence3)) < 1e-3)
assert (abs(score(model6, sentence4) - model6.score(sentence4)) < 1e-3)

#### Show scores and n-gram matches

In [13]:
words = ['<s>'] + sentence4.split() + ['</s>']
for i, (prob, length, oov) in enumerate(model6.full_scores(sentence4)):
    print('{0} {1}: {2}'.format(prob, length, ' '.join(words[i+2-length:i+2])))
    if oov:
        print('\t"{0}" is an OOV'.format(words[i+1]))

-4.728082180023193 2: <s> Краса
-2.4685885906219482 3: <s> Краса врятує
-1.205726981163025 3: Краса врятує світ
-0.7587704658508301 3: врятує світ .
-0.0019260908011347055 4: врятує світ . </s>


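For n-grams that end in $</s>$, the second number is always zero (in the listing above, however, the n-gram ending in $</s>$ shows 4). Either way, each sentence is scored on its own, starting from $<s>$, so there is no dependency between sentences.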

#### Calculating the perplexity of the sentence

In [14]:
def perplexity(model, sentence, bos=True, eos=True):
    """
    Compute perplexity of a sentence.
    @param sentence One full sentence to score.  Do not include <s> or </s>.
    """
    words = len(str(sentence).split()) + 1 # For </s>
    return 10.0**(-model.score(sentence, bos=bos, eos=eos) / words)

In [15]:
print(perplexity(model6, sentence4))

68.0171943001484


In [16]:
print(model6.perplexity(sentence4))

68.0171943001484
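
The value can be checked by hand from the 6-gram score printed earlier. The sketch below only repeats the arithmetic of the `perplexity` function with the numbers copied from the outputs above; it does not load a model.

```python
score_log10 = -9.163093566894531  # model6.score(sentence4), copied from the output above
n_tokens = len('Краса врятує світ .'.split()) + 1  # 4 tokens plus one for </s>

# perplexity = 10 ** (-log10 P(sentence) / number of tokens)
ppl = 10.0 ** (-score_log10 / n_tokens)
print(n_tokens, ppl)
```

The result agrees with both the hand-written function and `model6.perplexity`, about 68.017.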


#### Find out-of-vocabulary words

In [17]:
for w in words:
    if w not in model6:
        print('"{0}" is an OOV'.format(w))

### Calculating the perplexity of the model on the Ukrainian brown corpus (good and so-so)

_!!! keep in mind this is the same dataset I made my estimation ARPA models on !!!_

In [18]:
filename=os.path.join(os.path.abspath(''), 
                      'brown-uk', 'corpus',
                      'final_all_GS_tagged_words_symbols_sentences.txt')
#read sentence by sentence
with open(filename, 'r') as f:
    temp = f.read().split('\n')

The model is sentence-based, so each sentence is scored separately, here with bos = False and eos = False. I sum the log10 scores of all sentences, divide the sum by the number of words in the corpus, negate the result, and raise 10 to that power to get the __perplexity__ of the model.

In [50]:
def perplexity_on_texts_by_sentences(model, temp):
    all_score=0
    all_words=0
    for sentence in temp:
        all_score+=model.score(sentence, bos = False, eos = False)
        all_words+=len(str(sentence).split())
    print("all_score: "+str(all_score)+"; \nnumber of tokens in text: "+str(all_words))
    return 10.0**(-all_score / all_words)

In [51]:
print('{0}-gram model'.format(model3.order))
print(str(perplexity_on_texts_by_sentences(model3,temp))+"\n")
#print('{0}-gram model'.format(model4.order))
#print(str(perplexity_on_texts_by_sentences(model4,temp))+"\n")
#print('{0}-gram model'.format(model5.order))
#print(str(perplexity_on_texts_by_sentences(model5,temp))+"\n")
print('{0}-gram model'.format(model6.order))
print(str(perplexity_on_texts_by_sentences(model6,temp))+"\n")

3-gram model
all_score: -2319665.4281127453; 
number of tokens in text: 732024
1475.1559840711334

6-gram model
all_score: -2288903.02160573; 
number of tokens in text: 732024
1339.1036003071433



### Perplexity including BOS and EOS tags. KenLM's perplexity function, rewritten to work on the whole text

In [187]:
def perplexity_on_texts_by_sentences_boseos(model, temp):
    all_score=0
    all_words=0
    for sentence in temp:
        if len(str(sentence))>0:
            all_score+=model.score(sentence)
            all_words+=len(str(sentence).split())+1
        if len(str(sentence))==1:
            all_words+=3
        if len(str(sentence))==2:
            all_words+=2
        if len(str(sentence))==3:
            all_words+=1
        if len(str(sentence))==0:
            all_words-=1
    print("all_score: "+str(all_score)+"; \nnumber of tokens in text: "+str(all_words))
    return 10.0**(-all_score / all_words)

In [188]:
print('{0}-gram model'.format(model3.order))
print(str(perplexity_on_texts_by_sentences_boseos(model3,temp))+"\n")
#print('{0}-gram model'.format(model4.order))
#print(str(perplexity_on_texts_by_sentences(model4,temp))+"\n")
#print('{0}-gram model'.format(model5.order))
#print(str(perplexity_on_texts_by_sentences(model5,temp))+"\n")
print('{0}-gram model'.format(model6.order))
print(str(perplexity_on_texts_by_sentences_boseos(model6,temp))+"\n")

3-gram model
all_score: -2246459.9509153366; 
number of tokens in text: 771682
814.9319723394601

6-gram model
all_score: -2212696.1555285454; 
number of tokens in text: 771682
736.830929701527
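
The four perplexities reported so far can be recomputed directly from the printed totals, which makes it easy to see how the token count and the summed score each contribute. A small sketch; the helper name `perplexity_from_totals` and the dictionary labels are assumptions, the numbers are copied from the outputs above.

```python
def perplexity_from_totals(all_score, all_words):
    """10 to the power of the negated average log10 probability per token."""
    return 10.0 ** (-all_score / all_words)


# (all_score, number of tokens) pairs copied from the outputs above.
reported = {
    "3-gram, bos=False, eos=False": (-2319665.4281127453, 732024),
    "6-gram, bos=False, eos=False": (-2288903.02160573, 732024),
    "3-gram, with <s> and </s>": (-2246459.9509153366, 771682),
    "6-gram, with <s> and </s>": (-2212696.1555285454, 771682),
}
for name, (score, tokens) in reported.items():
    print(name, perplexity_from_totals(score, tokens))
```

Scoring with the boundary tags changes both the summed score and the token count, and both changes lower the perplexity here.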



### Perplexity computed by Kenlm query function

#### 6-gram

### without pruning

#### size of arpa file = 41.1G
#### training time 40 minutes

In [97]:
! kenlm/build/bin/query final/kenlm/uk_final_symbols_sentences_based_6.binary <brown-uk/corpus/final_all_GS_tagged_words_symbols_sentences.txt > uk_bruk_kenlm_6_results.txt

This binary file contains probing hash tables.
Name:query	VmPeak:13718564 kB	VmRSS:4776 kB	RSSMax:13694864 kB	user:0.568461	sys:0.680552	CPU:1.24908	real:1.24741


In [98]:
!tail -4 uk_bruk_kenlm_6_results.txt

Perplexity including OOVs:	736.830929701527
Perplexity excluding OOVs:	630.6853332748613
OOVs:	9174
Tokens:	771682
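
The four summary lines are easy to read into Python for further comparison, for example to get the OOV rate. This is a sketch that assumes the summary keeps the tab-separated `key:<tab>value` layout shown above; the function name `parse_query_summary` is hypothetical.

```python
def parse_query_summary(text):
    """Parse 'key:<tab>value' summary lines into a dict of numbers."""
    summary = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep or not value.strip():
            continue  # skip lines that are not 'key: value'
        number = float(value)
        summary[key.strip()] = int(number) if number.is_integer() else number
    return summary


# Summary text copied from the output above.
tail = (
    "Perplexity including OOVs:\t736.830929701527\n"
    "Perplexity excluding OOVs:\t630.6853332748613\n"
    "OOVs:\t9174\n"
    "Tokens:\t771682\n"
)
summary = parse_query_summary(tail)
oov_rate = summary["OOVs"] / summary["Tokens"]
print(summary)
print("OOV rate: {:.2%}".format(oov_rate))
```

The "including OOVs" figure matches the 6-gram value computed in Python above, and the token count matches as well.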


### Perplexity computed by Kenlm query function

#### 3-gram

### without pruning

#### size of arpa file = 6.4G
#### training time 3:55 minutes

In [7]:
! kenlm/build/bin/query final/kenlm/uk_final_symbols_sentences_based_3.arpa <brown-uk/corpus/final_all_GS_tagged_words_symbols_sentences.txt > uk_bruk_kenlm_3_results.txt

Loading the LM will be faster if you build a binary file.
Reading final/kenlm/uk_final_symbols_sentences_based_3.arpa
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Name:query	VmPeak:3783640 kB	VmRSS:4800 kB	RSSMax:2711608 kB	user:66.3182	sys:1.01585	CPU:67.3341	real:67.3396


In [8]:
!tail -4 uk_bruk_kenlm_3_results.txt

Perplexity including OOVs:	814.9319723394601
Perplexity excluding OOVs:	697.8233563403668
OOVs:	9174
Tokens:	771682
