# KenLM Sentence-based

#### based on:

https://kheafield.com/papers/avenue/kenlm.pdf 

https://kheafield.com/papers/edinburgh/estimate_paper.pdf

#### implementation of the estimation part:

https://kheafield.com/code/kenlm/

Language models are estimated from text using __[modified](http://www.ee.columbia.edu/~stanchen/papers/h015a-techreport.pdf)__ [Kneser-Ney smoothing](https://ieeexplore.ieee.org/document/479394) without pruning. Estimation is done on disk, which makes it possible to build much larger models. Kneser-Ney smoothing consistently outperforms all other smoothed n-gram models evaluated in this [tech report](http://www.ee.columbia.edu/~stanchen/papers/h015a-techreport.pdf) by Chen and Goodman.

In this notebook I split the __[Ukrainian Brown Corpus](https://github.com/brown-uk/corpu)__ (good and so-so datasets) into sentences and tokenize it, keeping punctuation symbols as tokens. Then I use the KenLM scripts to estimate sentence-based ARPA n-gram models. By default, the KenLM script __lmplz__ adds the $<s>$ and $</s>$ tags at the beginning and end of each sentence.

This is a purely sentence-based model: it scores each sentence separately, and the score is the log10 probability of the sentence. To compute the __perplexity__ of the model, I sum the scores of all sentences in the corpus, divide the sum by the number of words in the corpus, and raise 10 to the negative of that quotient.
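Written as a formula, the procedure above looks roughly like this. The symbols are my own notation, not KenLM's: $S$ is the number of sentences, $s_i$ is the $i$-th sentence, and $N$ is the total number of words in the corpus.

$$
\mathrm{PP} = 10^{-\frac{1}{N}\sum_{i=1}^{S}\log_{10} p(s_i)}
$$

Because every $\log_{10} p(s_i)$ is negative, the exponent is positive, and a lower perplexity means the model assigns higher probability to the corpus.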

## Installing KenLM

In [1]:
import os
import kenlm

    # sudo apt-get install build-essential libboost-all-dev cmake zlib1g-dev libbz2-dev liblzma-dev

    # wget -O - https://kheafield.com/code/kenlm.tar.gz | tar xz
    # mkdir -p kenlm/build && cd kenlm/build
    # cmake ..
    # make -j 4
         
## Tokenize text
         
Here the __[Ukrainian Brown Corpus](https://github.com/brown-uk/corpu)__ (good and so-so datasets) is split into sentences and tokens, including punctuation symbols.
         
#### [TokenizeText.groovy](https://github.com/brown-uk/nlp_uk/blob/master/src/main/groovy/org/nlp_uk/tools/README.md)


Analyzes the text and writes the result to the output file:

        splits into sentences (-s)
        splits into tokens (-w) (the results include punctuation, so all tokens are separated by vertical bars)
        splits into words (-u)


In [2]:
!groovy nlp_uk/nlp_uk/src/main/groovy/org/nlp_uk/tools/TokenizeText.groovy -s -w -i brown-uk/corpus/all_GS.txt -o brown-uk/corpus/all_GS_words_symbols_sentences.txt

writing into brown-uk/corpus/all_GS_words_symbols_sentences.txt


_!!! Should I lowercase? !!!_

In [3]:
import fileinput
filename = os.path.join(os.path.abspath(''), 'brown-uk', 'corpus', 'all_GS_words_symbols_sentences.txt')
# Rewrite the file in place: with inplace=True, print() writes back into the file.
with fileinput.FileInput(filename, inplace=True) as lines:
    for line in lines:
        # Token separators become spaces; everything is lowercased.
        line = line.replace("|", " ").lower()
        # Drop the real newline (print() adds one back) ...
        line = line.replace("\n", "")
        # ... and any literal backslash-n sequences in the text.
        line = line.replace(r"\n", "")
        # Pad the placeholder so it is always a separate token.
        line = line.replace("_foreign_", " _foreign_ ")
        print(line)
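To see what the clean-up does to a single line, here is a minimal sketch that applies the same replacements to a string instead of a file. The helper name `normalize_line` and the sample line are my own assumptions; in particular, I am guessing that the tokenizer puts a vertical bar around whitespace tokens as well, which would explain the triple spaces in the example file.

```python
def normalize_line(line):
    """Apply the same replacements as the in-place clean-up (hypothetical helper)."""
    line = line.replace("|", " ").lower()            # bars -> spaces, lowercase
    line = line.replace("\n", "")                    # real newline
    line = line.replace(r"\n", "")                   # literal backslash-n
    line = line.replace("_foreign_", " _foreign_ ")  # keep placeholder separate
    return line

# Assumed shape of one tokenizer output line; not copied from the corpus.
sample = "У| |2013| |році|,| |_FOREIGN_|.\n"

cleaned = normalize_line(sample)
print(repr(cleaned))
# Whitespace runs do not matter later on: split() collapses them.
print(cleaned.split())
```

The repeated spaces left behind by the replacements are harmless here, since the scoring code in this notebook counts words with `split()`.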


Example result file: (+lowercase)

    У   2013   році ,   до   100-річчя   виходу   першого   числа   журналу   _FOREIGN_ ,   на   будинку   встановили   меморіальну   дошку .   
    Тоді   ж   таки   в   будинку   відбувся   перший   з'їзд   есперантистів ,   на   якому   було   50   есперантистів   з   усієї   України   і   троє   з-за   кордону .   
    Відтоді   щороку   вони   там   організовують   конференції ,   починаючи   з   2013-го .   
    Щороку   там   вручають   премію   тим ,   хто   пропаґує   український   есперантський   рух   та   український   погляд   на   важливі   події .  
    _FOREIGN_  Більшість   дописів   на   сторінці   Сергія   Шматкова   у   соціальній   мережі   _FOREIGN_   —   мовою   есперанто .  
    « У   мене   у   _FOREIGN_   понад   дві   тисячі   друзів   з   усього   світу ,   з   якими   я   спілкуюсь   мовою   есперанто » ,   —   розповідає   пан   Сергій .  
    Цю   незвичну   для   багатьох   мову   Сергій   Шматков   вивчив   ще   у   1980-х .   
    Народився   і   прожив   чоловік   усе   життя   в   Донецькій   області ,   а   після   окупації   перебрався   до   Львова .  


## Estimating Large Language Models with KenLM

The __[Ukrainian Brown Corpus](https://github.com/brown-uk/corpu)__ (good and so-so datasets), tokenized and split into sentences, is provided on stdin, and the __ARPA__ model is written to stdout.

#### kenlm/build/bin/lmplz -o -S -T    
        -o
            Required. Order of the language model to estimate.
        -S
            Recommended. Memory to use. This is a number followed by a single-character suffix: % for percentage of physical memory (on platforms where this is measured), b for bytes, K for kilobytes, M for megabytes, and so on for G and T. If no suffix is given, kilobytes are assumed for compatibility with GNU sort. The sort program is not used; the command line is simply designed to be compatible.
        -T
            Recommended. Temporary file location.
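The options above can also be assembled from Python. The following is only a sketch: `build_lmplz_command` is a hypothetical helper of mine, the `/tmp` directory is an assumed temporary location, and the commented-out part shows one way the corpus could be piped through stdin and stdout.

```python
import shlex

def build_lmplz_command(order, memory="10%", tmp_dir=None,
                        binary="kenlm/build/bin/lmplz"):
    """Return the argument list for one lmplz run (hypothetical helper)."""
    cmd = [binary, "-o", str(order), "-S", memory]
    if tmp_dir is not None:
        cmd += ["-T", tmp_dir]  # temporary file location
    return cmd

for order in (3, 4, 5, 6):
    print(shlex.join(build_lmplz_command(order, tmp_dir="/tmp")))

# To actually run one of the commands (requires the built binary):
# import subprocess
# with open(corpus_path) as src, open(arpa_path, "w") as dst:
#     subprocess.run(build_lmplz_command(3), stdin=src, stdout=dst, check=True)
```

Building the command as a list avoids shell quoting problems when a path contains spaces.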

Here 3-gram, 4-gram, 5-gram and 6-gram models are estimated with the KenLM library and saved into the corresponding ARPA files.
            

In [4]:
!kenlm/build/bin/lmplz -o 3 -S 10% < brown-uk/corpus/all_GS_words_symbols_sentences.txt > brown-uk/corpus/kenlm/all_GS_symbols_sentences_based_3.arpa
!kenlm/build/bin/lmplz -o 4 -S 10% < brown-uk/corpus/all_GS_words_symbols_sentences.txt > brown-uk/corpus/kenlm/all_GS_symbols_sentences_based_4.arpa
!kenlm/build/bin/lmplz -o 5 -S 10% < brown-uk/corpus/all_GS_words_symbols_sentences.txt > brown-uk/corpus/kenlm/all_GS_symbols_sentences_based_5.arpa
!kenlm/build/bin/lmplz -o 6 -S 10% < brown-uk/corpus/all_GS_words_symbols_sentences.txt > brown-uk/corpus/kenlm/all_GS_symbols_sentences_based_6.arpa

=== 1/5 Counting and sorting n-grams ===
Reading /home/ana/Downloads/master diploma/code/brown-uk/corpus/all_GS_words_symbols_sentences.txt
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Unigram tokens 731851 types 96679
=== 2/5 Calculating and sorting adjusted counts ===
Chain sizes: 1:1160148 2:580715584 3:1088841728
Statistics:
1 96679 D1=0.648681 D2=1.09686 D3+=1.53533
2 446361 D1=0.848075 D2=1.2204 D3+=1.47973
3 637150 D1=0.922466 D2=1.34991 D3+=1.45028
Memory estimate for binary LM:
type       kB
probing 24116 assuming -p 1.5
probing 27109 assuming -r models -p 1.5
trie    11448 without quantization
trie     7101 assuming -q 8 -b 8 quantization 
trie    10763 assuming -a 22 array pointer compression
trie     6416 assuming -a 22 -q 8 -b 8 array pointer compression and quantization
=== 3/5 Calculating and sorting initial probabil

In [5]:
LM3 = os.path.join(os.path.abspath(''), 'brown-uk', 'corpus', 'kenlm', 'all_GS_symbols_sentences_based_3.arpa')
LM4 = os.path.join(os.path.abspath(''), 'brown-uk', 'corpus', 'kenlm', 'all_GS_symbols_sentences_based_4.arpa')
LM5 = os.path.join(os.path.abspath(''), 'brown-uk', 'corpus', 'kenlm', 'all_GS_symbols_sentences_based_5.arpa')
LM6 = os.path.join(os.path.abspath(''), 'brown-uk', 'corpus', 'kenlm', 'all_GS_symbols_sentences_based_6.arpa')

In [6]:
model3 = kenlm.LanguageModel(LM3)
model4 = kenlm.LanguageModel(LM4)
model5 = kenlm.LanguageModel(LM5)
model6 = kenlm.LanguageModel(LM6)

### Sentences scores

#### model.score(self, sentence, bos = True, eos = True)

Return the __log10 probability of a string__.  By default, the string is treated as a sentence.  
          
          return log10 p(sentence </s> | <s>)

If you do not want to condition on the beginning of sentence, pass __bos = False__ . Never include $<s>$ as part of the string. 

Similarly, the end of sentence token $</s>$ can be omitted with __eos = False__. Since language models explicitly predict $</s>$, it can be part of the string.


I leave bos and eos at their default value of True, so each string is scored as a complete sentence.
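To make the effect of `bos` and `eos` concrete without loading a model, here is a toy sketch with a hand-written bigram table. All probabilities and the function `toy_score` are invented for illustration; this is not how KenLM stores or looks up n-grams, it only mimics which factors enter the product.

```python
import math

# Hypothetical probabilities, not estimated from any corpus.
UNIGRAM = {"a": 0.25}
BIGRAM = {("<s>", "a"): 0.5, ("a", "b"): 0.2, ("b", "</s>"): 0.1}

def toy_score(sentence, bos=True, eos=True):
    """log10 probability of a string under the toy bigram table."""
    words = sentence.split()
    if eos:
        words.append("</s>")          # also predict the end of sentence
    prev = "<s>" if bos else None     # condition on <s> only when bos=True
    total = 0.0
    for w in words:
        p = UNIGRAM[w] if prev is None else BIGRAM[(prev, w)]
        total += math.log10(p)
        prev = w
    return total

print(toy_score("a b"))                        # log10 p(a b </s> | <s>)
print(toy_score("a b", eos=False))             # log10 p(a b | <s>)
print(toy_score("a b", bos=False, eos=False))  # log10 p(a b)
```

With both flags on, the score covers three predictions (`a`, `b`, `</s>`); turning `eos` off drops the last factor, and turning `bos` off replaces the first factor with an unconditioned one.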

In [7]:
sentence1 = 'штучний інтелект врятує світ .'
print(sentence1)
print('{0}-gram model'.format(model3.order))
print(model3.score(sentence1))
print('{0}-gram model'.format(model4.order))
print(model4.score(sentence1))
print('{0}-gram model'.format(model5.order))
print(model5.score(sentence1))
print('{0}-gram model'.format(model6.order))
print(model6.score(sentence1))

штучний інтелект врятує світ .
3-gram model
-18.96114730834961
4-gram model
-18.95266342163086
5-gram model
-18.95266342163086
6-gram model
-18.95266342163086


In [8]:
sentence2 = '_foreign_ врятує світ .'
print(sentence2)
print('{0}-gram model'.format(model3.order))
print(model3.score(sentence2))
print('{0}-gram model'.format(model4.order))
print(model4.score(sentence2))
print('{0}-gram model'.format(model5.order))
print(model5.score(sentence2))
print('{0}-gram model'.format(model6.order))
print(model6.score(sentence2))

_foreign_ врятує світ .
3-gram model
-11.317231178283691
4-gram model
-11.304431915283203
5-gram model
-11.304431915283203
6-gram model
-11.304431915283203


In [9]:
sentence3 = 'наука врятує світ .'
print(sentence3)
print('{0}-gram model'.format(model3.order))
print(model3.score(sentence3))
print('{0}-gram model'.format(model4.order))
print(model4.score(sentence3))
print('{0}-gram model'.format(model5.order))
print(model5.score(sentence3))
print('{0}-gram model'.format(model6.order))
print(model6.score(sentence3))

наука врятує світ .
3-gram model
-12.316014289855957
4-gram model
-12.298898696899414
5-gram model
-12.298898696899414
6-gram model
-12.298898696899414


In [10]:
sentence4 = 'краса врятує світ .'
print(sentence4)
print('{0}-gram model'.format(model3.order))
print(model3.score(sentence4))
print('{0}-gram model'.format(model4.order))
print(model4.score(sentence4))
print('{0}-gram model'.format(model5.order))
print(model5.score(sentence4))
print('{0}-gram model'.format(model6.order))
print(model6.score(sentence4))

краса врятує світ .
3-gram model
-8.869534492492676
4-gram model
-8.934062957763672
5-gram model
-8.927302360534668
6-gram model
-8.927302360534668


#### Check that total full score = direct score

In [11]:
def score(model,s):
    return sum(prob for prob, _, _ in model.full_scores(s))

In [12]:
assert (abs(score(model6, sentence1) - model6.score(sentence1)) < 1e-3)
assert (abs(score(model6, sentence2) - model6.score(sentence2)) < 1e-3)
assert (abs(score(model6, sentence3) - model6.score(sentence3)) < 1e-3)
assert (abs(score(model6, sentence4) - model6.score(sentence4)) < 1e-3)

#### Show scores and n-gram matches

In [13]:
words = ['<s>'] + sentence4.split() + ['</s>']
for i, (prob, length, oov) in enumerate(model6.full_scores(sentence4)):
    print('{0} {1}: {2}'.format(prob, length, ' '.join(words[i+2-length:i+2])))
    if oov:
        print('\t"{0}" is an OOV'.format(words[i+1]))

-5.063677787780762 2: <s> краса
-1.923867106437683 2: краса врятує
-0.9712043404579163 3: краса врятує світ
-0.9680412411689758 2: світ .
-0.0005118859116919339 3: світ . </s>
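The slice `words[i+2-length:i+2]` is easy to misread, so here is a small sketch that replays it on the numbers printed above, without needing the model. The tuples are copied from that output; the OOV flags are set to `False` because no OOV message was printed. The helper name `matched_ngrams` is my own.

```python
# (log10 prob, matched n-gram length, oov) as printed for sentence4 by model6.
full_scores = [
    (-5.063677787780762, 2, False),
    (-1.923867106437683, 2, False),
    (-0.9712043404579163, 3, False),
    (-0.9680412411689758, 2, False),
    (-0.0005118859116919339, 3, False),
]
words = ['<s>'] + 'краса врятує світ .'.split() + ['</s>']

def matched_ngrams(words, full_scores):
    """Recover the n-gram that was matched for each scored token."""
    result = []
    for i, (_, length, _) in enumerate(full_scores):
        # Entry i scores words[i + 1]; the match is the `length` tokens ending there.
        result.append(' '.join(words[i + 2 - length:i + 2]))
    return result

for ngram in matched_ngrams(words, full_scores):
    print(ngram)

# The per-token scores add up to the sentence score reported by model6.score.
print(sum(prob for prob, _, _ in full_scores))
```

A length of 2 for `світ .` means the model had to back off: even the 6-gram model only found a bigram match at that position.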


__________________________________________

    -4.6759634	краса	-0.26173612
    -1.8974496	краса врятує	-0.026417483
    -0.97120434	краса врятує світ	-0.00874679
    -4.032551	</s>	0



Why is the second number always zero for n-grams that end in $</s>$? It looks like there is no dependency between sentences. To me, this is not OK.
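For reading the entries shown above: in the ARPA format each line holds a log10 probability, the n-gram, and (except for the highest order) a log10 backoff weight, separated by tabs. The sketch below parses the four lines quoted above; `parse_arpa_entry` is a hypothetical helper, not part of KenLM.

```python
def parse_arpa_entry(line):
    """Split one ARPA n-gram line into (log10 prob, words, log10 backoff)."""
    fields = line.strip().split('\t')
    log_prob = float(fields[0])
    ngram = tuple(fields[1].split())
    # Highest-order n-grams carry no backoff column.
    backoff = float(fields[2]) if len(fields) > 2 else None
    return log_prob, ngram, backoff

entries = [
    "-4.6759634\tкраса\t-0.26173612",
    "-1.8974496\tкраса врятує\t-0.026417483",
    "-0.97120434\tкраса врятує світ\t-0.00874679",
    "-4.032551\t</s>\t0",
]

for entry in entries:
    log_prob, ngram, backoff = parse_arpa_entry(entry)
    print(ngram, log_prob, backoff, 10 ** backoff)
```

A backoff of 0 is log10 of 1, that is, a neutral weight. Within a sentence nothing follows $</s>$, so an n-gram ending in it is never used as context for a next word, which is consistent with the sentence-by-sentence scoring used in this notebook.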

#### Calculating the perplexity of the sentence

In [14]:
def perplexity(model, sentence, bos=True, eos=True):
    """
    Compute perplexity of a sentence.
    @param sentence One full sentence to score.  Do not include <s> or </s>.
    """
    words = len(str(sentence).split()) + 1 # For </s>
    return 10.0**(-model.score(sentence, bos=bos, eos=eos) / words)

In [15]:
print(perplexity(model6, sentence4))

61.018351744835584


In [16]:
print(model6.perplexity(sentence4))

61.018351744835584
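The same number can be reproduced by hand from the sentence score, which makes the role of the `+ 1` for $</s>$ visible. The score below is the 6-gram value printed earlier for `sentence4`; nothing here calls KenLM.

```python
log10_prob = -8.927302360534668                   # model6.score(sentence4), from above
tokens = len('краса врятує світ .'.split()) + 1   # 4 tokens plus </s>

sentence_perplexity = 10.0 ** (-log10_prob / tokens)
print(tokens)
print(sentence_perplexity)
```

Dividing by 5 rather than 4 matters: the score includes the prediction of $</s>$, so the average has to be taken over the same number of predictions.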


#### Find out-of-vocabulary words

In [17]:
for w in words:
    if w not in model6:
        print('"{0}" is an OOV'.format(w))

#### Stateful query ???

In [18]:
state = kenlm.State()
state2 = kenlm.State()

In [19]:
#Use <s> as context.  If you don't want <s>, use model.NullContextWrite(state).
model6.BeginSentenceWrite(state)
accum = 0.0
accum += model6.BaseScore(state, "a", state2)
accum += model6.BaseScore(state2, "sentence", state)
#score defaults to bos = True and eos = True.  Here we'll check without the end
#of sentence marker.  
assert (abs(accum - model6.score("a sentence", eos = False)) < 1e-3)
accum += model6.BaseScore(state, "</s>", state2)
assert (abs(accum - model6.score("a sentence")) < 1e-3)

### Calculating the perplexity of the model on the Ukrainian brown corpus (good and so-so)

_!!! keep in mind this is the same dataset the ARPA models were estimated on !!!_

In [20]:
filename=os.path.join(os.path.abspath(''), 
                      'brown-uk', 'corpus',
                      'all_GS_words_symbols_sentences.txt')

#read sentence by sentence
with open(filename, 'r', encoding='utf-8') as corpus_file:
    temp = corpus_file.read().split('\n')

Because the model is sentence-based, each sentence is scored separately (log10 probability). The function below sums the scores over all sentences, divides the sum by the number of words in the corpus (end-of-sentence tokens are not counted), and raises 10 to the negative of that quotient to get the __perplexity__.

In [21]:
def perplexity_on_texts_by_sentences(model, temp):
    all_score=0
    all_words=0
    for sentence in temp:
        all_score+=model.score(sentence)
        all_words+=len(str(sentence).split())
    print("all_score: "+str(all_score)+"; all_words: "+str(all_words))
    return 10.0**(-all_score / all_words)

In [22]:
print('{0}-gram model'.format(model3.order))
print(str(perplexity_on_texts_by_sentences(model3,temp))+"\n")
print('{0}-gram model'.format(model4.order))
print(str(perplexity_on_texts_by_sentences(model4,temp))+"\n")
print('{0}-gram model'.format(model5.order))
print(str(perplexity_on_texts_by_sentences(model5,temp))+"\n")
print('{0}-gram model'.format(model6.order))
print(str(perplexity_on_texts_by_sentences(model6,temp))+"\n")

3-gram model
all_score: -990255.045642376; all_words: 731806
22.55101352223275

4-gram model
all_score: -853286.1631882191; all_words: 731806
14.655495341576762

5-gram model
all_score: -828965.7315032482; all_words: 731806
13.575850929482105

6-gram model
all_score: -824877.1562740803; all_words: 731806
13.402323438191125
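As a sanity check, the four perplexities can be recomputed from the printed totals alone. The sketch below only uses the `all_score` and `all_words` values shown above; the dictionary names are mine.

```python
all_words = 731806
# all_score per model order, copied from the output above.
all_scores = {
    3: -990255.045642376,
    4: -853286.1631882191,
    5: -828965.7315032482,
    6: -824877.1562740803,
}

perplexities = {order: 10.0 ** (-score / all_words)
                for order, score in all_scores.items()}

for order, value in perplexities.items():
    print('{0}-gram model: {1}'.format(order, value))
```

The gain shrinks quickly with the order: most of the drop happens between the 3-gram and 4-gram models. Note also that this corpus-level figure divides by words only, whereas the sentence-level `perplexity` function adds one token per sentence for $</s>$, so the two are not directly comparable.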

