# KenLM Paragraph-based


#### based on:

https://kheafield.com/papers/avenue/kenlm.pdf 

https://kheafield.com/papers/edinburgh/estimate_paper.pdf

#### implementation of the estimation part:

https://kheafield.com/code/kenlm/

Language models are estimated from text using __[modified](http://www.ee.columbia.edu/~stanchen/papers/h015a-techreport.pdf)__ [Kneser-Ney smoothing](https://ieeexplore.ieee.org/document/479394) without pruning. Estimation is done on disk, which makes it possible to build much larger models. Kneser-Ney smoothing consistently outperforms all the other smoothed n-gram models evaluated in this [tech report](http://www.ee.columbia.edu/~stanchen/papers/h015a-techreport.pdf) by Chen and Goodman.

In this notebook I split and tokenize the __[Ukrainian Brown Corpus](https://github.com/brown-uk/corpu)__ (good and so-so datasets) into texts, keeping punctuation symbols. Then I use the KenLM tools to estimate ARPA n-gram models. By default the KenLM program __lmplz__ adds $<s>$ and $</s>$ tags at the beginning and end of each sentence, where a sentence is one line of input. In my case each line holds a whole text, so lmplz sees the whole text as one sentence.

This is a text-based model, so it can estimate a score for a whole text. The score is the log10 probability of the text. I sum the scores of all texts in the corpus, divide the sum by the number of words in the corpus, negate the result, and raise 10 to that power to get the __perplexity__ of the model.
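Written as a formula, the calculation described above is roughly the following. The symbols are notation introduced for this sketch, not taken from KenLM: $T$ is the number of texts, $N$ is the total number of words (tokens) in the corpus, and $p(\text{text}_t)$ is the probability the model assigns to text $t$.

$$
\mathrm{PP} = 10^{-\frac{1}{N}\sum_{t=1}^{T}\log_{10} p(\text{text}_t)}
$$

A lower perplexity means the model assigns a higher average probability per word.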

## Installing KenLM

In [1]:
import os
import kenlm

    # sudo apt-get install build-essential libboost-all-dev cmake zlib1g-dev libbz2-dev liblzma-dev

    # wget -O - https://kheafield.com/code/kenlm.tar.gz | tar xz
    # cd kenlm
    # mkdir -p build && cd build
    # cmake ..
    # make -j 4
         
## Tokenize text
         
Here the __[Ukrainian Brown Corpus](https://github.com/brown-uk/corpu)__ (good and so-so datasets) is split into texts, keeping punctuation symbols. $<SENTENCE>$ and $</SENTENCE>$ tags are added at the beginning and at the end of each sentence, paragraphs are separated by a $PARAGRAPH$ tag, and each text is wrapped in $BEGIN\_TEXT$ and $END\_TEXT$ tags. Finally, everything is lowercased.
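A minimal sketch of the target format, assuming a text is already available as a list of paragraphs, each of which is a list of tokenized sentences. The helper `tag_text` and the exact tag placement are illustrative assumptions; the real pipeline below works on the TokenizeText output instead.

```python
def tag_text(paragraphs):
    """Wrap one text in the tags described above (hypothetical helper).

    paragraphs: list of paragraphs; each paragraph is a list of sentences;
    each sentence is a list of tokens (words and punctuation symbols).
    """
    tagged_paragraphs = []
    for sentences in paragraphs:
        # Each sentence gets its own opening and closing tag.
        tagged = ["<SENTENCE> " + " ".join(tokens) + " </SENTENCE>"
                  for tokens in sentences]
        tagged_paragraphs.append(" ".join(tagged))
    # Paragraphs are separated by a PARAGRAPH tag (assumed placement).
    body = " PARAGRAPH ".join(tagged_paragraphs)
    # The whole text is wrapped and lowercased, tags included.
    return ("BEGIN_TEXT " + body + " END_TEXT").lower()


example = tag_text([[["Краса", "врятує", "світ", "."]]])
print(example)
```

Because lowercasing is applied last, the tags themselves end up in lowercase too, which is why the example sentences scored later in this notebook are lowercased before scoring.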
         
_!!! Should I lowercase? !!!_         

#### [TokenizeText.groovy](https://github.com/brown-uk/nlp_uk/blob/master/src/main/groovy/org/nlp_uk/tools/README.md)


Analyzes the text and writes the result to the output file:

        splits into sentences (-s)
        splits into tokens (-w) (the results include punctuation, so all tokens are separated by vertical bars)
        splits into words (-u)


In [2]:
!groovy nlp_uk/nlp_uk/src/main/groovy/org/nlp_uk/tools/TokenizeText.groovy -s -w -i brown-uk/corpus/all_GS_septexts.txt -o brown-uk/corpus/all_GS_words_symbols_paragraph_text.txt

writing into brown-uk/corpus/all_GS_words_symbols_paragraph_text.txt


In [3]:
import fileinput
filename = os.path.join(os.path.abspath(''),
                        'brown-uk', 'corpus',
                        'all_GS_words_symbols_paragraph_text.txt')
# Rewrite the tokenizer output in place: with inplace=True, print() writes back into the file.
for line in fileinput.FileInput(filename, inplace=True):
    line = '<SENTENCE> ' + line
    line = line.replace("|", " ")  # token separators become spaces
    line = line.replace("\n", " </SENTENCE> ")  # the real end of line closes the sentence
    # Note: r"\n" matches a literal backslash followed by "n", not a newline character.
    line = line.replace(r"<SENTENCE>\n", "PARAGRAPH <SENTENCE> ")
    line = line.replace("<SENTENCE></SENTENCE>", " ")
    line = line.replace(r"\n </SENTENCE>", " </SENTENCE> PARAGRAPH")
    line = line.replace(r"\n", " </SENTENCE> <SENTENCE> ")
    line = line.replace(r"•", "")
    # Move the text markers outside the sentence tags.
    line = line.replace("<SENTENCE>  BEGIN_TEXT", "BEGIN_TEXT <SENTENCE>")
    line = line.replace("<SENTENCE> BEGIN_TEXT", "BEGIN_TEXT <SENTENCE>")
    line = line.replace("END_TEXT  </SENTENCE>", "</SENTENCE>  END_TEXT")
    print(line)

In [4]:
# Join the lines, start each text on its own line, drop empty sentences, lowercase.
with open(filename, 'r') as in_file:
    temp = in_file.read().split('\n')
mystr = ' '.join(temp).replace("END_TEXT BEGIN_TEXT", "END_TEXT \nBEGIN_TEXT")
mystr = mystr.replace("<SENTENCE>   </SENTENCE>", " ").replace("<SENTENCE>  </SENTENCE>", " ").replace("<SENTENCE> </SENTENCE>", " ").lower()
with open(filename, 'w') as out_file:
    out_file.write(mystr)


Example result file:

     begin_text ... <sentence> у   2013   році ,   до   100-річчя   виходу   першого   числа   журналу   _foreign_ ,   на   будинку   встановили   меморіальну   дошку .    </sentence>  <sentence> тоді   ж   таки   в   будинку   відбувся   перший   з'їзд   есперантистів ,   на   якому   було   50   есперантистів   з   усієї   україни   і   троє   з-за   кордону .    </sentence>  <sentence> відтоді   щороку   вони   там   організовують   конференції ,   починаючи   з   2013-го .    </sentence>  <sentence> щороку   там   вручають   премію   тим ,   хто   пропаґує   український   есперантський   рух   та   український   погляд   на   важливі   події .  </sentence>    <sentence> mirinda   lviv  </sentence> <sentence>  більшість   дописів   на   сторінці   сергія   шматкова   у   соціальній   мережі   _foreign_   —   мовою   есперанто .  </sentence>    <sentence> « у   мене   у   _foreign_   понад   дві   тисячі   друзів   з   усього   світу ,   з   якими   я   спілкуюсь   мовою   есперанто » ,   —   розповідає   пан   сергій .  </sentence>    <sentence> цю   незвичну   для   багатьох   мову   сергій   шматков   вивчив   ще   у   1980-х .    </sentence>  <sentence> народився   і   прожив   чоловік   усе   життя   в   донецькій   області ,   а   після   окупації   перебрався   до   львова .    </sentence>  ... paragraph ... paragraph ...end_text

## Estimating Large Language Models with KenLM

The tokenized __[Ukrainian Brown Corpus](https://github.com/brown-uk/corpu)__ (good and so-so datasets), split into texts, is provided on stdin and the __ARPA__ file is written to stdout.

#### kenlm/build/bin/lmplz -o -S -T    
        -o
            Required. Order of the language model to estimate.
        -S
            Recommended. Memory to use. This is a number followed by a single-character suffix: % for percentage of physical memory (on platforms where this is measured), b for bytes, K for kilobytes, M for megabytes, and so on for G and T. If no suffix is given, kilobytes are assumed for compatibility with GNU sort. The sort program is not used; the command line is simply designed to be compatible.
        -T
            Recommended. Temporary file location.

Here 3-gram, 4-gram, 5-gram and 6-gram models are estimated with KenLM and saved to the corresponding ARPA files.
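The runs differ only in the order and the output file name, so the command lines could be generated rather than typed. This is a sketch, assuming the same relative paths as in this notebook; `lmplz_command` and `arpa_path` are hypothetical helpers, and only the `-o`, `-S` and `-T` options described above are used.

```python
def lmplz_command(order, memory="10%", tmp_dir=None,
                  lmplz="kenlm/build/bin/lmplz"):
    """Build the argument list for one lmplz run (hypothetical helper)."""
    cmd = [lmplz, "-o", str(order), "-S", memory]
    if tmp_dir is not None:
        # -T sets the temporary file location.
        cmd += ["-T", tmp_dir]
    return cmd


def arpa_path(order, out_dir="brown-uk/corpus/kenlm"):
    """Output file name for a given order, following this notebook's naming."""
    return "{0}/all_GS_symbols_paragraph_text_based_{1}.arpa".format(out_dir, order)


# Print the four command lines; the corpus is still expected on stdin.
for order in range(3, 7):
    print(" ".join(lmplz_command(order)), ">", arpa_path(order))
```

To actually run a command from Python, the list can be passed to `subprocess.run` with the corpus file opened as `stdin` and the ARPA file opened as `stdout`.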

In [5]:
!kenlm/build/bin/lmplz -o 3 -S 10% <brown-uk/corpus/all_GS_words_symbols_paragraph_text.txt> brown-uk/corpus/kenlm/all_GS_symbols_paragraph_text_based_3.arpa
!kenlm/build/bin/lmplz -o 4 -S 10% <brown-uk/corpus/all_GS_words_symbols_paragraph_text.txt> brown-uk/corpus/kenlm/all_GS_symbols_paragraph_text_based_4.arpa
!kenlm/build/bin/lmplz -o 5 -S 10% <brown-uk/corpus/all_GS_words_symbols_paragraph_text.txt> brown-uk/corpus/kenlm/all_GS_symbols_paragraph_text_based_5.arpa
!kenlm/build/bin/lmplz -o 6 -S 10% <brown-uk/corpus/all_GS_words_symbols_paragraph_text.txt> brown-uk/corpus/kenlm/all_GS_symbols_paragraph_text_based_6.arpa

=== 1/5 Counting and sorting n-grams ===
Reading /home/ana/Downloads/master diploma/code/brown-uk/corpus/all_GS_words_symbols_paragraph_text.txt
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Unigram tokens 817744 types 96700
=== 2/5 Calculating and sorting adjusted counts ===
Chain sizes: 1:1160400 2:580715456 3:1088841600
Statistics:
1 96700 D1=0.648565 D2=1.09892 D3+=1.53524
2 446271 D1=0.852958 D2=1.22651 D3+=1.49885
3 646761 D1=0.919303 D2=1.33588 D3+=1.37954
Memory estimate for binary LM:
type       kB
probing 24283 assuming -p 1.5
probing 27276 assuming -r models -p 1.5
trie    11503 without quantization
trie     7130 assuming -q 8 -b 8 quantization 
trie    10820 assuming -a 22 array pointer compression
trie     6446 assuming -a 22 -q 8 -b 8 array pointer compression and quantization
=== 3/5 Calculating and sorting initial pr

In [6]:
LM3 = os.path.join(os.path.abspath(''), 'brown-uk', 'corpus', 'kenlm', 'all_GS_symbols_paragraph_text_based_3.arpa')
LM4 = os.path.join(os.path.abspath(''), 'brown-uk', 'corpus', 'kenlm', 'all_GS_symbols_paragraph_text_based_4.arpa')
LM5 = os.path.join(os.path.abspath(''), 'brown-uk', 'corpus', 'kenlm', 'all_GS_symbols_paragraph_text_based_5.arpa')
LM6 = os.path.join(os.path.abspath(''), 'brown-uk', 'corpus', 'kenlm', 'all_GS_symbols_paragraph_text_based_6.arpa')

##### Only up to order 6 :(

If the order is higher than 6, loading the model fails with an error:

#### RuntimeError                              Traceback (most recent call last)
    kenlm.pyx in kenlm.Model.__init__()

    RuntimeError: lm/model.cc:49 in void lm::ngram::detail::{anonymous}::CheckCounts(const std::vector<long unsigned int>&) threw FormatLoadException because `counts.size() > 6'.
    This model has order 7 but KenLM was compiled to support up to 6.  If your build system supports changing KENLM_MAX_ORDER, change it there and recompile.  With cmake:
     cmake -DKENLM_MAX_ORDER=10 ..
    With Moses:
     bjam --max-kenlm-order=10 -a
    Otherwise, edit lm/max_order.hh. Byte: 113
#### OSError: Cannot read model '/home/ana/Downloads/master diploma/code/brown-uk/corpus/all_GS_symbols_paragraph_text_based_7.arpa' 
    (lm/model.cc:49 in void lm::ngram::detail::{anonymous}::CheckCounts(const std::vector<long unsigned int>&) threw FormatLoadException because `counts.size() > 6'. This model has order 7 but KenLM was compiled to support up to 6.  If your build system supports changing KENLM_MAX_ORDER, change it there and recompile.  With cmake:  cmake -DKENLM_MAX_ORDER=10 .. With Moses:  bjam --max-kenlm-order=10 -a Otherwise, edit lm/max_order.hh. Byte: 113)
    
    
###### _!!! If needed, I can try later to fix this and estimate a 9-gram model !!!_
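One way to avoid this error is to check the order of an ARPA file before loading it. The sketch below reads the `\data\` header of the ARPA format, which lists one `ngram N=count` line per order. `arpa_order` is a hypothetical helper, the sample header is only an illustration built from the 3-gram counts printed by lmplz above, and the limit of 6 is the compiled-in maximum reported in the error message.

```python
def arpa_order(lines):
    """Return the highest n-gram order declared in an ARPA header.

    lines: any iterable of text lines, for example an open file object.
    """
    order = 0
    in_header = False
    for raw in lines:
        line = raw.strip()
        if line == "\\data\\":
            in_header = True
        elif in_header and line.startswith("ngram "):
            # Header lines look like "ngram 3=646761".
            n = int(line[len("ngram "):].split("=")[0])
            order = max(order, n)
        elif in_header and line.endswith("-grams:"):
            # The first n-gram section ends the header.
            break
    return order


sample_header = [
    "\\data\\",
    "ngram 1=96700",
    "ngram 2=446271",
    "ngram 3=646761",
    "",
    "\\1-grams:",
]
MAX_ORDER = 6  # assumption: KenLM compiled with the default limit seen in the error
order = arpa_order(sample_header)
print(order, "loadable" if order <= MAX_ORDER else "needs a rebuild with a higher KENLM_MAX_ORDER")
```

With a real file, the same function could be called as `arpa_order(open(path))`, since it stops reading as soon as the header ends.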

In [7]:
model3 = kenlm.LanguageModel(LM3)
model4 = kenlm.LanguageModel(LM4)
model5 = kenlm.LanguageModel(LM5)
model6 = kenlm.LanguageModel(LM6)

### Sentences scores

#### model.score(self, sentence, bos = True, eos = True)

Return the __log10 probability of a string__.  By default, the string is treated as a sentence.  
          
          return log10 p(sentence </s> | <s>)

If you do not want to condition on the beginning of sentence, pass __bos = False__ . Never include $<s>$ as part of the string. 

Similarly, the end of sentence token $</s>$ can be omitted with __eos = False__. Since language models explicitly predict $</s>$, it can be part of the string.


I pass bos=False and eos=False, so the method scores exactly the string it is given, without the implicit $<s>$ and $</s>$ tags. Instead, I add the $<SENTENCE>$ and $</SENTENCE>$ tags to the string by hand.

In [8]:
sentence1 = '<SENTENCE> Штучний інтелект врятує світ . </SENTENCE> '
sentence1 = sentence1.lower()
print(sentence1)
print('{0}-gram model'.format(model3.order))
print(model3.score(sentence1, bos=False, eos=False))
print('{0}-gram model'.format(model4.order))
print(model4.score(sentence1, bos=False, eos=False))
print('{0}-gram model'.format(model5.order))
print(model5.score(sentence1, bos=False, eos=False))
print('{0}-gram model'.format(model6.order))
print(model6.score(sentence1, bos=False, eos=False))

<sentence> штучний інтелект врятує світ . </sentence> 
3-gram model
-23.78922462463379
4-gram model
-23.777435302734375
5-gram model
-23.777435302734375
6-gram model
-23.777435302734375


In [9]:
sentence2 = '<SENTENCE> _FOREIGN_ врятує світ . </SENTENCE>'
sentence2 = sentence2.lower()
print(sentence2)
print('{0}-gram model'.format(model3.order))
print(model3.score(sentence2, bos=False, eos=False))
print('{0}-gram model'.format(model4.order))
print(model4.score(sentence2, bos=False, eos=False))
print('{0}-gram model'.format(model5.order))
print(model5.score(sentence2, bos=False, eos=False))
print('{0}-gram model'.format(model6.order))
print(model6.score(sentence2, bos=False, eos=False))

<sentence> _foreign_ врятує світ . </sentence>
3-gram model
-17.128862380981445
4-gram model
-16.742843627929688
5-gram model
-16.742843627929688
6-gram model
-16.742843627929688


In [10]:
sentence3 = '<SENTENCE> Наука врятує світ . </SENTENCE>'
sentence3 = sentence3.lower()
print(sentence3)
print('{0}-gram model'.format(model3.order))
print(model3.score(sentence3, bos=False, eos=False))
print('{0}-gram model'.format(model4.order))
print(model4.score(sentence3, bos=False, eos=False))
print('{0}-gram model'.format(model5.order))
print(model5.score(sentence3, bos=False, eos=False))
print('{0}-gram model'.format(model6.order))
print(model6.score(sentence3, bos=False, eos=False))

<sentence> наука врятує світ . </sentence>
3-gram model
-17.484424591064453
4-gram model
-17.46067237854004
5-gram model
-17.46067237854004
6-gram model
-17.46067237854004


In [11]:
sentence4 = 'BEGIN_TEXT <SENTENCE> Краса врятує світ . </SENTENCE> END_TEXT'
sentence4 = sentence4.lower()
print(sentence4)
print('{0}-gram model'.format(model3.order))
print(model3.score(sentence4, bos=False, eos=False))
print('{0}-gram model'.format(model4.order))
print(model4.score(sentence4, bos=False, eos=False))
print('{0}-gram model'.format(model5.order))
print(model5.score(sentence4, bos=False, eos=False))
print('{0}-gram model'.format(model6.order))
print(model6.score(sentence4, bos=False, eos=False))

begin_text <sentence> краса врятує світ . </sentence> end_text
3-gram model
-16.80482292175293
4-gram model
-17.075908660888672
5-gram model
-16.987186431884766
6-gram model
-16.987186431884766


#### Check that total full score = direct score

In [12]:
def score(model, s):
    return sum(prob for prob, _, _ in model.full_scores(s, bos=False, eos=False))

In [13]:
assert (abs(score(model6, sentence1) - model6.score(sentence1, bos=False, eos=False)) < 1e-3)
assert (abs(score(model6, sentence2) - model6.score(sentence2, bos=False, eos=False)) < 1e-3)
assert (abs(score(model6, sentence3) - model6.score(sentence3, bos=False, eos=False)) < 1e-3)
assert (abs(score(model6, sentence4) - model6.score(sentence4, bos=False, eos=False)) < 1e-3)

#### Show scores and n-gram matches

In [14]:
# Show scores and n-gram matches
# With bos=False and eos=False, entry i of full_scores corresponds to words[i].
words = sentence4.split()
for i, (prob, length, oov) in enumerate(model6.full_scores(sentence4, bos=False, eos=False)):
    print('{0} {1}: {2}'.format(prob, length, ' '.join(words[i+1-length:i+1])))
    if oov:
        print('\t"{0}" is an OOV'.format(words[i]))

-5.535782814025879 1: begin_text
-0.8325477242469788 2: begin_text <sentence>
-4.5061163902282715 2: <sentence> краса
-1.9362119436264038 2: краса врятує
-0.9934155344963074 3: краса врятує світ
-0.9698286652565002 2: світ .
-0.0004917234182357788 3: світ . </sentence>
-2.2127907276153564 4: світ . </sentence> end_text


For n-grams that end in $</s>$, the second number is always zero. So there is no dependency between sentences. In my view this is not OK. That is exactly why I set up the whole scheme with sentence, paragraph, and text tags.

#### Calculating the perplexity of the sentence

In [15]:
def perplexity(model, sentence, bos=False, eos=False):
    """
    Compute perplexity of a sentence.
    @param sentence One full sentence to score.  Do not include <s> or </s>.
    """
    # Number of scored tokens; </s> is counted only when eos=True.
    words = len(str(sentence).split()) + (1 if eos else 0)
    return 10.0**(-model.score(sentence, bos=bos, eos=eos) / words)

In [16]:
len(str(sentence4).split())

8

In [17]:
print('{0}-gram model'.format(model3.order))
print(perplexity(model3, sentence4, bos=False, eos=False))
print('{0}-gram model'.format(model4.order))
print(perplexity(model4, sentence4, bos=False, eos=False))
print('{0}-gram model'.format(model5.order))
print(perplexity(model5, sentence4, bos=False, eos=False))
print('{0}-gram model'.format(model6.order))
print(perplexity(model6, sentence4, bos=False, eos=False))

3-gram model
126.06742006826272
4-gram model
136.29771737354372
5-gram model
132.86124078652648
6-gram model
132.86124078652648
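As a cross-check, the sentence perplexity can be recomputed by hand from the per-token log10 probabilities that `full_scores` printed for `sentence4` with the 6-gram model. This sketch only redoes the arithmetic on the numbers shown earlier in this notebook; it does not call KenLM.

```python
# Per-token log10 probabilities copied from the full_scores output for sentence4.
token_log10_probs = [
    -5.535782814025879,
    -0.8325477242469788,
    -4.5061163902282715,
    -1.9362119436264038,
    -0.9934155344963074,
    -0.9698286652565002,
    -0.0004917234182357788,
    -2.2127907276153564,
]

# The sentence score is the sum of the per-token log10 probabilities.
total = sum(token_log10_probs)
# Perplexity is 10 raised to the negated average log10 probability per token.
ppl = 10.0 ** (-total / len(token_log10_probs))

print("tokens:", len(token_log10_probs))
print("total log10 probability:", total)
print("perplexity:", ppl)
```

Up to rounding, the total should agree with the 6-gram score of `sentence4` (about -16.987) and the perplexity with the 6-gram value above (about 132.86).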


#### Find out-of-vocabulary words

In [18]:
for w in words:
    if w not in model6:
        print('"{0}" is an OOV'.format(w))

### Calculating the perplexity of the model on the Ukrainian brown corpus (good and so-so)

_!!! keep in mind this is the same dataset I made my estimation ARPA models on !!!_

In [19]:
filename=os.path.join(os.path.abspath(''), 
                      'brown-uk', 'corpus',
                      'all_GS_words_symbols_paragraph_text.txt')
temp = open(filename,'r').read().split('\n')

This is a text-based model, so it can estimate a score for a whole text. The score is the log10 probability of the text. Thus, I iterate through the texts and calculate their scores. Then I sum the scores of all texts in the corpus, divide the sum by the number of words in the corpus, negate the result, and raise 10 to that power to get the __perplexity__ of the model.

In [22]:
def perplexity_on_texts(model, temp, bos=False, eos=False):
    all_score=0
    all_words=0
    for text in temp:
        all_score+=model.score(text, bos=bos, eos=eos)
        all_words+=len(str(text).split())
    print("all_score: "+str(all_score)+"; all_words: "+str(all_words))
    return 10.0**(-all_score / all_words)

In [23]:
print('{0}-gram model'.format(model3.order))
print("perplexity: "+str(perplexity_on_texts(model3,temp, bos=False, eos=False))+"\n")
print('{0}-gram model'.format(model4.order))
print("perplexity: "+str(perplexity_on_texts(model4,temp, bos=False, eos=False))+"\n")
print('{0}-gram model'.format(model5.order))
print("perplexity: "+ str(perplexity_on_texts(model5,temp, bos=False, eos=False))+"\n")
print('{0}-gram model'.format(model6.order))
print("perplexity: "+str(perplexity_on_texts(model6,temp, bos=False, eos=False))+"\n")

3-gram model
all_score: -996924.4399871826; all_words: 817699
perplexity: 16.564665016290835

4-gram model
all_score: -829621.1012496948; all_words: 817699
perplexity: 10.34141724464607

5-gram model
all_score: -738649.3713378906; all_words: 817699
perplexity: 8.004362326244326

6-gram model
all_score: -701934.1576080322; all_words: 817699
perplexity: 7.218153155757558
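The reported perplexities follow directly from the printed totals. The sketch below redoes that arithmetic using only the `all_score` and `all_words` values shown above, and also prints the average log10 probability per word; `perplexity_from_totals` is a helper name introduced here for illustration.

```python
# Totals copied from the output above: model order -> summed log10 probability.
reported_scores = {
    3: -996924.4399871826,
    4: -829621.1012496948,
    5: -738649.3713378906,
    6: -701934.1576080322,
}
all_words = 817699


def perplexity_from_totals(all_score, all_words):
    """Perplexity from a summed log10 probability and a word count."""
    return 10.0 ** (-all_score / all_words)


perplexities = {order: perplexity_from_totals(score, all_words)
                for order, score in reported_scores.items()}

for order in sorted(perplexities):
    avg = reported_scores[order] / all_words
    print("{0}-gram: average log10 p per word = {1:.4f}, perplexity = {2:.4f}".format(
        order, avg, perplexities[order]))
```

Since the corpus scored here is the same one the models were estimated on, these values describe the fit to the training data rather than performance on unseen text.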

