# Building a Language Model
***
# Table of Contents
1.  [Setup](#Setup)
2.  [Coding Decisions](#Coding-Decisions)
3.  [Evaluation](#Evaluation)
4.  [Conclusion](#Conclusion)
5.  [References](#References)

# Setup

For this assignment I wrote the Python package LanguageModel. Code documentation and explanations are
included as docstrings inside the code. My particular coding and design choices are described in a markdown cell under
the heading [Coding Decisions](#Coding-Decisions). I am using the Maltese [[1]](#References) corpus dataset for this
assignment and Python version 3.7.

I have also included an HTML file generated by Jupyter Notebook, and I recommend viewing that instead of using the
Jupyter server. Alternatively, I used the JetBrains PyCharm IDE, which also renders the markdown components neatly.

In [1]:
# Import the LanguageModel package
import LanguageModel as LM

# Coding Decisions

In this section I go over some coding and design decisions and why I made them.

## Corpus

From the few big data applications I have worked on so far, I know that most of them make use of the numpy
library, either directly or indirectly through the pandas library. I could have used numpy and made a
CorpusAsListOfNpArrays, but since Sentences were originally an object of their own, this did not cross my mind.

Another consideration was to hash/encode the words and use matrix operations to get the counts and probabilities. I
attempted this, but the process was becoming too complicated and gave no significant time improvement.

In the end I found Python list syntax very easy to understand and use, and the speed, combined with dictionaries, was
sufficient.
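
To illustrate why plain lists and dictionaries are enough here, the following is a minimal sketch of ngram counting
over a corpus stored as a list of token lists. It is not the code from the LanguageModel package: the function name
`count_ngrams`, the toy corpus, and the choice of one `<s>` and one `</s>` per sentence are assumptions made for
illustration.

```python
from collections import defaultdict


def count_ngrams(sentences, n):
    """Count ngrams over a corpus stored as a list of token lists (hypothetical helper)."""
    counts = defaultdict(int)
    for sentence in sentences:
        # Assumption: each sentence is padded with one start and one end token.
        tokens = ["<s>"] + list(sentence) + ["</s>"]
        # Slide a window of width n over the padded sentence.
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return dict(counts)


corpus = [["0", "1", "2"], ["0", "1", "3"]]
unigram_counts = count_ngrams(corpus, n=1)
bigram_counts = count_ngrams(corpus, n=2)
```

Using tuples as dictionary keys keeps lookups cheap and lets the same function serve unigrams, bigrams and trigrams.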

## NGramCounts

The counts object represents the frequency counts for a given n and model. I decided to only ever store vanilla counts
because, when I implemented different counting methods, especially to account for non-appearing tokens, the code was
becoming messy and slow. By implementing a GetCount function I was able to achieve full functionality with clean code.
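
The idea of storing only vanilla counts and deriving the rest on request might look like the sketch below. The name
`get_count` and its signature are hypothetical and do not reproduce the package's GetCount. Only the vanilla and
laplace cases are shown, with laplace assumed to mean add-one smoothing.

```python
def get_count(vanilla_counts, ngram, model="vanilla"):
    """Derive a count for `ngram` from stored vanilla counts only (hypothetical helper)."""
    # Unseen ngrams are simply absent from the dictionary, so default to 0.
    base = vanilla_counts.get(tuple(ngram), 0)
    if model == "laplace":
        # Assumed add-one smoothing: every ngram, seen or not, gets one extra count.
        return base + 1
    return base


stored = {("a",): 3}
seen_vanilla = get_count(stored, ("a",))
unseen_vanilla = get_count(stored, ("b",))
unseen_laplace = get_count(stored, ("b",), model="laplace")
```

Because unseen ngrams never have to be stored, the dictionary stays as small as the vanilla counts.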

## NGramModel

Unlike with the frequency counts, for the probability set I calculate vanilla and laplace smoothed probabilities
differently. However, the various methods of getting the probability for each ngram are then handled by the
LanguageModel.
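
For reference, here is a sketch of the two probability estimates, assuming the textbook definitions: the vanilla
estimate is count(ngram) / count(context), and the add-one (laplace) estimate is
(count(ngram) + 1) / (count(context) + V), where V is the vocabulary size. The function and argument names are
hypothetical and are not the NGramModel API.

```python
def ngram_probability(ngram, counts, context_counts, vocab_size, model="vanilla"):
    """Estimate P(last word | preceding words) from raw counts (hypothetical helper)."""
    ngram = tuple(ngram)
    c_ngram = counts.get(ngram, 0)
    # The context is the ngram without its last word.
    c_context = context_counts.get(ngram[:-1], 0)
    if model == "laplace":
        # Add-one smoothing: unseen ngrams get a small non-zero probability.
        return (c_ngram + 1) / (c_context + vocab_size)
    if c_context == 0:
        # Vanilla estimate is undefined for an unseen context; treat it as 0.
        return 0.0
    return c_ngram / c_context


bigram_counts = {("a", "b"): 1, ("a", "c"): 1}
context_counts = {("a",): 2}
p_vanilla = ngram_probability(("a", "b"), bigram_counts, context_counts, vocab_size=3)
p_laplace = ngram_probability(("a", "b"), bigram_counts, context_counts, vocab_size=3, model="laplace")
p_unseen = ngram_probability(("a", "d"), bigram_counts, context_counts, vocab_size=3, model="laplace")
```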


## LanguageModel

For the complete Language Model I mostly followed the class notes and PowerPoint presentations. Most of the issues I
experienced were with the implementation of a testing kit. In fact there is none directly implemented. Instead I
implemented bypasses, such as SetNGramModel, which can create an NGramModel object from an already calculated set of
NGram probabilities.

In the perplexity calculation I purposefully did not add a case for when the probability of the current ngram is 0. The
reasoning behind this is that when I added an ignore case, the vanilla models were getting a perplexity near 1, which
is very deceiving since the model does not accommodate a number of test cases. A possible solution would have been,
instead of ignoring 0 probabilities, to multiply the current `prob` variable by the smallest number that the mpf
library supports. However, this would have made evaluation still trickier.
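
As a point of comparison, the sketch below computes perplexity under the standard definition,
PP = exp(-(1/N) * sum(log p_i)), using logarithms instead of the mpf type. It is a hypothetical helper, not the
package's Perplexity method, and it differs from the notebook in one respect: a zero probability yields an infinite
perplexity here, whereas the tables later in this notebook report 0.00 for that case.

```python
import math


def perplexity(probabilities):
    """Perplexity of a sequence from its per-ngram probabilities (hypothetical helper)."""
    if not probabilities:
        raise ValueError("need at least one probability")
    log_sum = 0.0
    for p in probabilities:
        if p == 0:
            # One zero makes the sequence probability zero, so the perplexity
            # is unbounded under the standard definition.
            return math.inf
        log_sum += math.log(p)
    # Summing logs avoids the underflow that a long product of small numbers causes.
    return math.exp(-log_sum / len(probabilities))


uniform = perplexity([0.25] * 4)
broken = perplexity([0.5, 0.0])
```

Neither ignoring zeros nor substituting a tiny constant is needed in this form, since the zero case is reported
explicitly.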

I only implemented sentence generation for an input of one word. My reasoning was that in an ngram model the upcoming
words depend only on the tail of what has been generated so far (strictly the last n-1 words, so seeding a trigram
with a single word is an approximation). I also think it is fairly easy and intuitive to implement generation with a
prior phrase. Later on in this notebook I write a function that does this; below is a snippet of it.

```python
def generateFrom(start):
    for n in tqdm(params["n"]):
        sentence = start[:-1]
        for model in params["model"]:
            generated = train_lm.GenerateSentence(start=start[-1], n=n, model=model, verbose=True)
            given_and_generated = sentence + generated
```
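
To make the one-word seeding concrete, here is a minimal sketch of bigram generation by repeated sampling. It is not
the package's GenerateSentence: the nested-dictionary layout of `bigram_probs`, the `max_len` cap of 25 words, and the
function name are all assumptions. Unlike the snippet above, this helper returns only the continuation, without the
seed word.

```python
import random


def generate_sentence(bigram_probs, start="<s>", max_len=25, rng=random):
    """Sample words one at a time, each conditioned on the previous word (hypothetical helper)."""
    sentence = []
    previous = start
    while len(sentence) < max_len:
        candidates = bigram_probs.get(previous)
        if not candidates:
            break  # no known continuation for this word
        words = list(candidates)
        weights = [candidates[w] for w in words]
        # Draw the next word in proportion to its conditional probability.
        previous = rng.choices(words, weights=weights, k=1)[0]
        if previous == "</s>":
            break  # end-of-sentence token closes the sentence
        sentence.append(previous)
    return sentence


# Toy model in which every word has exactly one possible successor.
chain = {"<s>": {"0": 1.0}, "0": {"1": 1.0}, "1": {"2": 1.0}, "2": {"</s>": 1.0}}
from_empty = generate_sentence(chain)
prompt = ["0", "1"]
continued = prompt + generate_sentence(chain, start=prompt[-1])
```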

# Evaluation

In this section I create a number of LanguageModels on different corpora and evaluate them in a standard manner.

## Methodology

* First I split the chosen corpus into an 80/20 training/testing split.

* I create a unigram, bigram, trigram and linear interpolation NGramModel for the three model types; vanilla, laplace
and unk. This is only done for the train LanguageModel.

* I create a unigram, bigram, trigram and linear interpolation NGramCounts for the three model types; vanilla, laplace
and unk. This is done for both LanguageModels.

* Score the ngrams of the test LanguageModel using the trained LanguageModel.

* Calculate the Test perplexity.

* Generate a number of sentences.
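
The linear interpolation mentioned in the list above combines the three ngram probabilities with fixed weights. The
sketch below is a hypothetical helper, not the package's LinearInterpolation method. The weights 0.1, 0.3 and 0.6 are
inferred from the UNK row of the first table in the Test Corpus section, which they reproduce, and should be read as
an assumption about the implementation.

```python
def linear_interpolation(p_unigram, p_bigram, p_trigram, lambdas=(0.1, 0.3, 0.6)):
    """Weighted sum of unigram, bigram and trigram probabilities (hypothetical helper)."""
    l1, l2, l3 = lambdas
    # The weights must sum to 1 for the result to remain a probability.
    if abs(l1 + l2 + l3 - 1.0) > 1e-9:
        raise ValueError("interpolation weights must sum to 1")
    return l1 * p_unigram + l2 * p_bigram + l3 * p_trigram


# Probabilities taken from the UNK row of the Test Corpus table (as fractions).
interpolated = linear_interpolation(0.6694215, 0.7552438, 0.7314751)
```

With these weights an unseen trigram still receives some probability whenever its last word was seen as a unigram.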

## Test Corpus

This corpus was created to test out the features of the package to make sure everything works as it is supposed to.

In total this corpus has 120 words, counting start and end tokens.

The total runtime for this evaluation was 0m <1s.

The Memory occupied at the end of this evaluation was at 0.087GB.

In [2]:
# Import train_test_split from sklearn and tqdm
from sklearn.model_selection import train_test_split
from tqdm.notebook import tqdm


def getTrainTest(root):
    dataset = LM.Corpus.CorpusAsListOfSentences(root=root, verbose=True)
    train, test = train_test_split(dataset, test_size=0.2, shuffle=False)
    _train_lm = LM.LanguageModel(corpus=train, verbose=True)
    _test_lm = LM.LanguageModel(corpus=test, verbose=True)
    print("Train Corpus Size: ", _train_lm.GetNGramModel(n=1).N)
    print("Test Corpus Size: ", _test_lm.GetNGramModel(n=1).N)
    return _train_lm, _test_lm

train_lm, test_lm = getTrainTest(root='Test Corpus/')

Reading Files:   0%|          | 0/1 [00:00<?, ?it/s]

Parsing XML:   0%|          | 0/1 [00:00<?, ?it/s]

Building Sentences:   0%|          | 0/1 [00:00<?, ?it/s]

Paragraph:   0%|          | 0/1 [00:00<?, ?it/s]

Counting x counts:   0%|          | 0/8 [00:00<?, ?it/s]

Calculating Probabilities:   0%|          | 0/82 [00:00<?, ?it/s]

Counting x counts:   0%|          | 0/8 [00:00<?, ?it/s]

Counting x counts:   0%|          | 0/2 [00:00<?, ?it/s]

Calculating Probabilities:   0%|          | 0/22 [00:00<?, ?it/s]

Counting x counts:   0%|          | 0/2 [00:00<?, ?it/s]

Train Corpus Size:  96
Test Corpus Size:  24


In this step I successfully split the training and testing data. The train LM has 96 words, 16 of which are start and
end tokens and the test LM has 24 words, 4 of which are start and end tokens.

In [3]:
params =    {
                "n": [1,2,3],
                "model": ["vanilla", "laplace", "unk"]
            }

def fitPredictTrain():
    for n in tqdm(params["n"]):
        for model in params["model"]:
            # probabilities are only needed for the train LM
            train_lm.GetNGramModel(n=n, model=model)
            # frequency counts are needed for both LMs
            train_lm.GetNGramCounts(n=n, model=model)
            test_lm.GetNGramCounts(n=n, model=model)
fitPredictTrain()

  0%|          | 0/3 [00:00<?, ?it/s]

In this step I successfully generate the required data for the next step.

In [4]:
from collections import OrderedDict
from operator import itemgetter


unigram = OrderedDict()
bigram = OrderedDict()
trigram = OrderedDict()
interpolation = OrderedDict()

perplexity = {}

def predictTest():
    for n in tqdm(params["n"]):
        for model in params["model"]:
            # frequency counts from the test lm
            testgrams = test_lm.GetNGramCounts(n=n,model=model)
            # predict these ngrams using the trained model
            probabilities = {}
            for gram in testgrams:
                probabilities[gram] = train_lm.GetProbability(input=gram, n=n, model=model)
            # set the test lm model to these predictions
            test_lm.SetNGramModel(probabilities=probabilities, n=n, model=model)

            # Sort the probabilities, these will be used for visualization
            sorted_tuples = sorted(probabilities.items(), key=itemgetter(1))
            # fill the appropriate ordered dict
            if n == 1:
                unigram[model] = {}
                for k, v in sorted_tuples:
                    unigram[model][k] = v
            elif n == 2:
                bigram[model] = {}
                for k, v in sorted_tuples:
                    bigram[model][k] = v
            else:
                trigram[model] = {}
                for k, v in sorted_tuples:
                    trigram[model][k] = v

            # get the perplexity of the tested model
            perplexity[tuple([n, model])] = test_lm.Perplexity(n=n, model=model)

            if n == 3:
                interpolations = {}
                # predict the ngrams using the trained model
                for gram in testgrams:
                    interpolations[gram] = train_lm.LinearInterpolation(trigram=gram, model=model)
                 # Sort the probabilities, these will be used for visualization
                sorted_tuples = sorted(interpolations.items(), key=itemgetter(1))
                # fill the appropriate ordered dict
                interpolation[model] = {}
                for k, v in sorted_tuples:
                    interpolation[model][k] = v
                # get the perplexity of the linear interpolation tested model
                perplexity[tuple(['interpolation', model])] = test_lm.Perplexity(n=n, model=model, linearInterpolation=True)

predictTest()

  0%|          | 0/3 [00:00<?, ?it/s]

Now that I have successfully tested the corpus using my language model, I will show some ngram probabilities and the
model perplexities.

In [5]:
heading =   "\t|\tUnigram\t\t|\tBigram\t\t\t|\tTrigram\t\t\t\t\t|\tLinear  Interpolation"
line =  "************************************************************************************************************"\
        "***************************************"
def visualizeWords():
    # This is just me having fun with strings in Python, nothing else
    data_template =     "Vanilla\t|\t{}:{:.5f}%\t|\t{}:{:.5f}%\t|\t{}:{:.5f}%\t\t|\t{}:{:.5f}%\n" \
                        "Laplace\t|\t{}:{:.5f}%\t|\t{}:{:.5f}%\t|\t{}:{:.5f}%\t\t|\t{}:{:.5f}%\n" \
                        "UNK\t|\t{}:{:.5f}%\t|\t{}:{:.5f}%\t|\t{}:{:.5f}%\t\t|\t{}:{:.5f}%"

    words = []
    for i in range(min(len(unigram["unk"]), 5)):
        i = -i
        words.append(list(unigram["unk"].keys())[i])
        print(heading)
        print(line)
        print(data_template.format(
                " ".join([x for x in list(unigram["vanilla"].keys())[i]]),          (unigram["vanilla"][list(unigram["vanilla"].keys())[i]]) * 100,
                " ".join([x for x in list(bigram["vanilla"].keys())[i]]),           (bigram["vanilla"][list(bigram["vanilla"].keys())[i]]) * 100,
                " ".join([x for x in list(trigram["vanilla"].keys())[i]]),          (trigram["vanilla"][list(trigram["vanilla"].keys())[i]]) * 100,
                " ".join([x for x in list(interpolation["vanilla"].keys())[i]]),    (interpolation["vanilla"][list(interpolation["vanilla"].keys())[i]]) * 100,
                " ".join([x for x in list(unigram["laplace"].keys())[i]]),          (unigram["laplace"][list(unigram["laplace"].keys())[i]]) * 100,
                " ".join([x for x in list(bigram["laplace"].keys())[i]]),           (bigram["laplace"][list(bigram["laplace"].keys())[i]]) * 100,
                " ".join([x for x in list(trigram["laplace"].keys())[i]]),          (trigram["laplace"][list(trigram["laplace"].keys())[i]]) * 100,
                " ".join([x for x in list(interpolation["laplace"].keys())[i]]),    (interpolation["laplace"][list(interpolation["laplace"].keys())[i]]) * 100,
                " ".join([x for x in list(unigram["unk"].keys())[i]]),              (unigram["unk"][list(unigram["unk"].keys())[i]]) * 100,
                " ".join([x for x in list(bigram["unk"].keys())[i]]),               (bigram["unk"][list(bigram["unk"].keys())[i]]) * 100,
                " ".join([x for x in list(trigram["unk"].keys())[i]]),              (trigram["unk"][list(trigram["unk"].keys())[i]]) * 100,
                " ".join([x for x in list(interpolation["unk"].keys())[i]]),        (interpolation["unk"][list(interpolation["unk"].keys())[i]]) * 100))
        print(line)
visualizeWords()

	|	Unigram		|	Bigram			|	Trigram					|	Linear  Interpolation
***************************************************************************************************************************************************
Vanilla	|	80:0.00000%	|	<s> 80:0.00000%	|	<s> 80 81:0.00000%		|	<s> 80 81:0.00000%
Laplace	|	80:0.01487%	|	<s> 80:0.01487%	|	<s> 80 81:0.01487%		|	<s> 80 81:0.01487%
UNK	|	unk:66.94215%	|	unk unk:75.52438%	|	unk unk unk:73.14751%		|	unk unk unk:73.24003%
***************************************************************************************************************************************************


The table gives us a glimpse of how well the various trained models in the train LM accommodate the test LM.

Since the vanilla models do not accommodate unknown words, the probability of these unknown ngrams is always 0.
With the other two models we get better probabilities, especially for the unk model, since most of the words in both
the test and train LMs are now the unk token.

The unk probabilities are not 100% because, while the test LM converts the <s> and </s> tokens into unk tokens as well,
the train LM does not because it has more than 2 sentences. I would consider this a feature and not a bug, since
it can be seen as the unk model not giving much weight to sentence structure when the corpus does not have a lot of
sentences, much as it does with other rare words.
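
The behaviour described above can be reproduced with a small sketch of rare-token replacement. This is a hypothetical
helper, not the package's code. The threshold of 2 is an assumption inferred from the remark that the start and end
tokens survive only when there are more than 2 sentences.

```python
from collections import Counter


def replace_rare_tokens(sentences, threshold=2, unk="unk"):
    """Replace every token seen at most `threshold` times with the unk token (hypothetical helper)."""
    counts = Counter(token for sentence in sentences for token in sentence)
    return [
        [token if counts[token] > threshold else unk for token in sentence]
        for sentence in sentences
    ]


# With two sentences the start and end tokens occur only twice and are replaced too.
two = [["<s>", "a", "</s>"], ["<s>", "b", "</s>"]]
# With three sentences they occur three times and are kept.
three = two + [["<s>", "c", "</s>"]]
small = replace_rare_tokens(two)
larger = replace_rare_tokens(three)
```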

In [6]:
def visualizePerplexity():
    # Somewhat cleaner than the one above
    perplexity_template =       "Vanilla\t|\t{:.2f}\t\t|\t{:.2f}\t\t\t|\t{:.2f}\t\t\t|\t{:.2f}\n" \
                                "Laplace\t|\t{:.2f}\t\t|\t{:.2f}\t\t\t|\t{:.2f}\t\t\t|\t{:.2f}\n" \
                                    "UNK\t|\t{:.2f}\t\t|\t{:.2f}\t\t\t|\t{:.2f}\t\t\t|\t{:.2f}"

    print(heading)
    print(line)
    print(perplexity_template.format(   perplexity[tuple([1, "vanilla"])], perplexity[tuple([2, "vanilla"])], perplexity[tuple([3, "vanilla"])], perplexity[tuple(["interpolation", "vanilla"])],
                                        perplexity[tuple([1, "laplace"])], perplexity[tuple([2, "laplace"])], perplexity[tuple([3, "laplace"])], perplexity[tuple(["interpolation", "laplace"])],
                                        perplexity[tuple([1, "unk"])],     perplexity[tuple([2, "unk"])],     perplexity[tuple([3, "unk"])],     perplexity[tuple(["interpolation", "unk"])]))
    print(line)
visualizePerplexity()

	|	Unigram		|	Bigram			|	Trigram					|	Linear  Interpolation
***************************************************************************************************************************************************
Vanilla	|	0.00		|	0.00			|	0.00			|	0.00
Laplace	|	6477966.12		|	10406806.00			|	2395408.07			|	188.58
UNK	|	1.03		|	1.02			|	1.03			|	1.03
***************************************************************************************************************************************************


Given the context and the probabilities shown above, the perplexities of the models make sense. The vanilla models
practically do not accommodate the test LM (their perplexity is reported as 0.00 because a single zero probability
zeroes the whole calculation), the laplace models get a very large perplexity because they accommodate it only
slightly, and the unk models have a very good perplexity of almost 1, again because most tokens are converted into unk
tokens.

Now that I have evaluated the model intrinsically via perplexity, I can do a small extrinsic evaluation by generating two
sentences from each model in the trained Language Model. One will be given no start, while another will be given a
sequence for it to continue.

In [7]:
def generateFromEmpty():
    for n in tqdm(params["n"]):
        for model in params["model"]:
            print("n: {}\nmodel: {}\n".format(n,model))
            generated = train_lm.GenerateSentence(n=n, model=model, verbose=True)
            for w in generated:
                print(w, end=' ')
            print(".\n")
generateFromEmpty()

  0%|          | 0/3 [00:00<?, ?it/s]

n: 1
model: vanilla

8 24 50 76 .

n: 1
model: laplace

11 28 67 15 23 19 59 16 6 29 14 26 10 8 72 41 47 41 67 47 70 24 37 13 55 .

n: 1
model: unk

.

n: 2
model: vanilla

0 1 2 3 4 5 6 7 8 9 .

n: 2
model: laplace

60 61 62 63 64 65 66 67 68 69 .

n: 2
model: unk

.

n: 3
model: vanilla

20 21 22 23 24 25 26 27 28 29 .

n: 3
model: laplace

50 51 52 53 54 55 56 57 58 59 .

n: 3
model: unk

.




The sentence generation output makes sense for these reasons:

* Generating a sentence out of unk/<s>/</s> tokens is impossible.
* The bigram and trigram generations were able to complete the _0-_9 count.

In [8]:
def generateFrom(start):
    for n in tqdm(params["n"]):
        sentence = start[:-1]
        for model in params["model"]:
            generated = train_lm.GenerateSentence(start=start[-1], n=n, model=model, verbose=True)
            given_and_generated = sentence + generated
            print("n: {}\nmodel: {}\n".format(n,model))
            for w in given_and_generated:
                print(w, end=' ')
            print(".\n")
            
start = ['20', '21', '22']
generateFrom(start=start)

  0%|          | 0/3 [00:00<?, ?it/s]

n: 1
model: vanilla

20 21 22 1 35 8 22 58 2 28 42 49 21 60 21 2 56 56 9 36 9 50 39 76 77 67 75 .

n: 1
model: laplace

20 21 22 55 12 20 25 51 45 45 2 63 47 17 16 4 .

n: 1
model: unk

20 21 22 .

n: 2
model: vanilla

20 21 22 23 24 25 26 27 28 29 .

n: 2
model: laplace

20 21 22 23 24 25 26 27 28 29 .

n: 2
model: unk

20 21 22 .

n: 3
model: vanilla

20 21 22 23 24 25 26 27 28 29 .

n: 3
model: laplace

20 21 22 23 24 25 26 27 28 29 .

n: 3
model: unk

20 21 22 .



The sentence generation output makes sense for these reasons:

* Generating a sentence out of unk/<s>/</s> tokens is impossible.
* The bigram and trigram generations were able to complete the 20-29 count.

Now I will repeat the above steps for the other corpora.

In [9]:
import os
import psutil

def RAMUsage():
    pid = os.getpid()
    py = psutil.Process(pid)
    memoryUse = py.memory_info()[0]/2.**30
    print('Memory Use: ',memoryUse,'GB')
RAMUsage()

Memory Use:  0.08798980712890625 GB


## Sports Corpus

This corpus is a subset of the larger complete Maltese corpus.

In total this corpus has 232 words, counting start and end tokens.

The total runtime for this evaluation was 0m <1s.

The Memory occupied at the end of this evaluation was at 0.136GB.

In [10]:
train_lm, test_lm = getTrainTest(root='Sports/')

Reading Files:   0%|          | 0/2 [00:00<?, ?it/s]

Parsing XML:   0%|          | 0/2 [00:00<?, ?it/s]

Building Sentences:   0%|          | 0/2 [00:00<?, ?it/s]

Paragraph:   0%|          | 0/6 [00:00<?, ?it/s]

Paragraph:   0%|          | 0/4 [00:00<?, ?it/s]

Counting x counts:   0%|          | 0/9 [00:00<?, ?it/s]

Calculating Probabilities:   0%|          | 0/117 [00:00<?, ?it/s]

Counting x counts:   0%|          | 0/9 [00:00<?, ?it/s]

Counting x counts:   0%|          | 0/3 [00:00<?, ?it/s]

Calculating Probabilities:   0%|          | 0/34 [00:00<?, ?it/s]

Counting x counts:   0%|          | 0/3 [00:00<?, ?it/s]

Train Corpus Size:  192
Test Corpus Size:  40


In [11]:
fitPredictTrain()
unigram = OrderedDict()
bigram = OrderedDict()
trigram = OrderedDict()
interpolation = OrderedDict()
perplexity = {}
predictTest()

  0%|          | 0/3 [00:00<?, ?it/s]

  0%|          | 0/3 [00:00<?, ?it/s]

Visualize some word tokens.

In [12]:
visualizeWords()

	|	Unigram		|	Bigram			|	Trigram					|	Linear  Interpolation
***************************************************************************************************************************************************
Vanilla	|	din:0.00000%	|	<s> din:0.00000%	|	<s> din l:0.00000%		|	din l aħbar:0.00000%
Laplace	|	delre:0.00419%	|	<s> din:0.00731%	|	<s> din l:0.00731%		|	l istess delre:0.00699%
UNK	|	<s>:32.73160%	|	<s> unk:29.24005%	|	<s> unk l:21.22207%		|	<s> unk l:24.77842%
***************************************************************************************************************************************************
	|	Unigram		|	Bigram			|	Trigram					|	Linear  Interpolation
***************************************************************************************************************************************************
Vanilla	|	</s>:4.68750%	|	2014 </s>:0.00000%	|	© 2014 </s>:0.00000%		|	© 2014 </s>:0.46875%
Laplace	|	</s>:0.10473%	|	2014 </s>:0.00731%	|	© 2014 </s>:0.00731%		|	© 2014 

Visualize the perplexity.

In [13]:
visualizePerplexity()

	|	Unigram		|	Bigram			|	Trigram					|	Linear  Interpolation
***************************************************************************************************************************************************
Vanilla	|	0.00		|	0.00			|	0.00			|	0.00
Laplace	|	6489945.47		|	44904274.95			|	10760487.07			|	438.93
UNK	|	1.25		|	1.36			|	1.72			|	1.62
***************************************************************************************************************************************************


Generate sentences from the start token.

In [14]:
generateFromEmpty()

  0%|          | 0/3 [00:00<?, ?it/s]

n: 1
model: vanilla

ma kollha scicluna rebaħ delre minkejja li ta maltin punti sewwieqa ġurnata qed klassi .

n: 1
model: laplace

attivijiet li ħakkem il tagħhom lija ħadu l malti attività l fl plejer l kien president tal ngħaqad diġa mill tal li waqt ħakkem mis .

n: 1
model: unk

.

n: 2
model: vanilla

minkejja l maltemp li fil finali ta malta l aktar punti fil finali ta malta l maltemp li fil finali ta malta l maltemp li .

n: 2
model: laplace

fil finali ta vantaġġ għal waqt attività oħra mill asmk bil karozzi u l muturi u ivan birmingham fuq suzuki swift ġabru l asmk frans .

n: 2
model: unk

il ta fuq il fuq ta il .

n: 3
model: vanilla

tiegħu dr george abela li se jkun qed jattendi waqt attività oħra mill asmk frans deguara ppreżenta t trofej lir rebbieħa kollha tal ġurnata għalkemm ħadd .

n: 3
model: laplace

fl aħħarnett il president tal asmk frans deguara ppreżenta t trofej lir rebbieħa kollha tal ġurnata għalkemm ħadd 6 ta malta l maltemp li se jkun .

n: 3
model: unk



Generate sentences from dr george abela.

In [15]:
start = ['dr', 'george', 'abela']
generateFrom(start=start)

  0%|          | 0/3 [00:00<?, ?it/s]

n: 1
model: vanilla

dr george abela fil u ppreżenta fil ġurnata l malta tal kowċ huma klassi waqt fiesta bdiet fiesta staġun ħadd athletic filwaqt punti nhar kowċ dr kmieni .

n: 1
model: laplace

dr george abela l futsal l kien 6 aktar internazzjonali lill qed ta .

n: 1
model: unk

dr george abela l fuq fuq fil u muturi rebaħ fuq .

n: 2
model: vanilla

dr george abela li tagħhom diġa kien kowċ .

n: 2
model: laplace

dr george abela li se jkun qed jattendi waqt attività oħra mill asmk bil karozzi bdiet staġun ieħor tal attivijiet bil karozzi bdiet staġun ieħor tal asmk .

n: 2
model: unk

dr george abela .

n: 3
model: vanilla

dr george abela li se jkun qed jattendi waqt l attività oħra mill asmk bil karozzi u ivan birmingham fuq suzuki swift ġabru l aktar punti fil .

n: 3
model: laplace

dr george abela li se jkun qed jattendi waqt l attività oħra mill asmk bil karozzi u karozzi bdiet staġun ieħor tal attivijiet bil karozzi bdiet staġun .

n: 3
model: unk

dr george abela .



In [16]:
RAMUsage()

Memory Use:  0.13602447509765625 GB


## Religion Corpus

This corpus is a subset of the larger complete Maltese corpus.

In total this corpus has 1795 words, counting start and end tokens.

The total runtime for this evaluation was 0m 1s.

The Memory occupied at the end of this evaluation was at 0.157GB.

In [17]:
train_lm, test_lm = getTrainTest(root='Religion/')

Reading Files:   0%|          | 0/2 [00:00<?, ?it/s]

Parsing XML:   0%|          | 0/2 [00:00<?, ?it/s]

Building Sentences:   0%|          | 0/2 [00:00<?, ?it/s]

Paragraph:   0%|          | 0/9 [00:00<?, ?it/s]

Paragraph:   0%|          | 0/34 [00:00<?, ?it/s]

Counting x counts:   0%|          | 0/81 [00:00<?, ?it/s]

Calculating Probabilities:   0%|          | 0/620 [00:00<?, ?it/s]

Counting x counts:   0%|          | 0/81 [00:00<?, ?it/s]

Counting x counts:   0%|          | 0/21 [00:00<?, ?it/s]

Calculating Probabilities:   0%|          | 0/208 [00:00<?, ?it/s]

Counting x counts:   0%|          | 0/21 [00:00<?, ?it/s]

Train Corpus Size:  1407
Test Corpus Size:  388


In [18]:
fitPredictTrain()
unigram = OrderedDict()
bigram = OrderedDict()
trigram = OrderedDict()
interpolation = OrderedDict()
perplexity = {}
predictTest()

  0%|          | 0/3 [00:00<?, ?it/s]

  0%|          | 0/3 [00:00<?, ?it/s]

Visualize some word tokens.

In [19]:
visualizeWords()

	|	Unigram		|	Bigram			|	Trigram					|	Linear  Interpolation
***************************************************************************************************************************************************
Vanilla	|	jgħannulu:0.00000%	|	ħa jgħannulu:0.00000%	|	<s> ħa jgħannulu:0.00000%		|	<s> ħa jgħannulu:0.00000%
Laplace	|	ħa:0.00010%	|	ħa jgħannulu:0.00026%	|	<s> ħa jgħannulu:0.00026%		|	min dahal ghax:0.00024%
UNK	|	<s>:15.68370%	|	<s> unk:14.50730%	|	<s> unk unk:9.66034%		|	<s> unk unk:11.71676%
***************************************************************************************************************************************************
	|	Unigram		|	Bigram			|	Trigram					|	Linear  Interpolation
***************************************************************************************************************************************************
Vanilla	|	</s>:5.75693%	|	belongs to:100.00000%	|	maltese church </s>:100.00000%		|	the church the:6.35537%
Laplace	|	</s>:0.16365%	|	t

Visualize the perplexity.

In [20]:
visualizePerplexity()

	|	Unigram		|	Bigram			|	Trigram					|	Linear  Interpolation
***************************************************************************************************************************************************
Vanilla	|	0.00		|	0.00			|	0.00			|	0.00
Laplace	|	710344.98		|	2785626405.97			|	3627868188.75			|	10961.74
UNK	|	1.31		|	2.52			|	6.55			|	5.51
***************************************************************************************************************************************************


Generate sentences from the start token.

In [21]:
generateFromEmpty()

  0%|          | 0/3 [00:00<?, ?it/s]

n: 1
model: vanilla

money to state iskop .

n: 1
model: laplace

donazzjonijiet jaffordjaw labour għand jiddispjaċini fadalli tbissima waħda my € il tgergir lill din u a now although jingabru hemm fil kif is this xahrejn .

n: 1
model: unk

… il a ta kien kien din for of does president have kon .

n: 2
model: vanilla

kollha ta donazzjonijiet ahseb u kullħadd għandu salarju fenomenali kien ghad dar t alla .

n: 2
model: laplace

was midnight celebrations at st john s from doing so .

n: 2
model: unk

malta kulħadd mill isptar .

n: 3
model: vanilla

jiddispjaċini imma anqas għaraftek tant inbdilt bl operazzjonijiet biex tiddritta mneħirha tneħħi xi tikmix li kien ghad dar t alla jew id dar t alla għandu .

n: 3
model: laplace

how dare you .

n: 3
model: unk

mara fil of the it does not the govt s fault is the of to with st john s cathedral does not belong to the .



Generate sentences from dar missieri ghamiltuha.

In [22]:
start = ['dar', 'missieri', 'ghamiltuha']
generateFrom(start=start)

  0%|          | 0/3 [00:00<?, ?it/s]

n: 1
model: vanilla

dar missieri ghamiltuha ddeċidiet gospel dhul għaddejja imma of this mall have mass għaddejja fenomenali spjegazzjoni san it hemm enter president b jew maltin became din € .

n: 1
model: laplace

dar missieri ghamiltuha fil pastazata lkollu again mall riedet tas issemmi in with wahda iktar punt this allow the iħallas ara biss tgergir president .

n: 1
model: unk

dar missieri ghamiltuha għandu cathedral cathedral there sena ma by the it tal st the .

n: 2
model: vanilla

dar missieri ghamiltuha bejta tal qalb u tmint ijiem x tgħix .

n: 2
model: laplace

dar missieri ghamiltuha bejta tal qalb u ħaduha tiġri lejn l ohra qisek qieghed jezisti xi żewġ tbajja ħaddejha li min ried jattendi għall quddies u f .

n: 2
model: unk

dar missieri ghamiltuha .

n: 3
model: vanilla

dar missieri ghamiltuha bejta tal misthija ma jghogbokx jew ghax qieghed tirreklama xi tikmix li min ried iħallas € 12 għal nies bħal ministri avukati tobba u .

n: 3
model: laplace

dar missieri gha

In [23]:
RAMUsage()

Memory Use:  0.15713882446289062 GB


## Maltese Corpus

The complete Maltese corpus.

In total this corpus has 89,670 words, counting start and end tokens.

The total runtime for this evaluation was 16m 49s.

The Memory occupied at the end of this evaluation was at 4.299GB.

In [24]:
train_lm, test_lm = getTrainTest(root='Maltese/')

Reading Files:   0%|          | 0/28 [00:00<?, ?it/s]

Parsing XML:   0%|          | 0/28 [00:00<?, ?it/s]

Building Sentences:   0%|          | 0/28 [00:00<?, ?it/s]

Paragraph:   0%|          | 0/10 [00:00<?, ?it/s]

Paragraph:   0%|          | 0/39 [00:00<?, ?it/s]

Paragraph:   0%|          | 0/12 [00:00<?, ?it/s]

Paragraph:   0%|          | 0/10 [00:00<?, ?it/s]

Paragraph:   0%|          | 0/40 [00:00<?, ?it/s]

Paragraph:   0%|          | 0/25 [00:00<?, ?it/s]

Paragraph:   0%|          | 0/21 [00:00<?, ?it/s]

Paragraph:   0%|          | 0/6 [00:00<?, ?it/s]

Paragraph:   0%|          | 0/9 [00:00<?, ?it/s]

Paragraph:   0%|          | 0/6 [00:00<?, ?it/s]

Paragraph:   0%|          | 0/17 [00:00<?, ?it/s]

Paragraph:   0%|          | 0/5 [00:00<?, ?it/s]

Paragraph:   0%|          | 0/11 [00:00<?, ?it/s]

Paragraph:   0%|          | 0/23 [00:00<?, ?it/s]

Paragraph:   0%|          | 0/4 [00:00<?, ?it/s]

Paragraph:   0%|          | 0/16 [00:00<?, ?it/s]

Paragraph:   0%|          | 0/16 [00:00<?, ?it/s]

Paragraph:   0%|          | 0/10 [00:00<?, ?it/s]

Paragraph:   0%|          | 0/11 [00:00<?, ?it/s]

Paragraph:   0%|          | 0/379 [00:00<?, ?it/s]

Paragraph:   0%|          | 0/423 [00:00<?, ?it/s]

Paragraph:   0%|          | 0/20 [00:00<?, ?it/s]

Paragraph:   0%|          | 0/240 [00:00<?, ?it/s]

Paragraph:   0%|          | 0/261 [00:00<?, ?it/s]

Paragraph:   0%|          | 0/9 [00:00<?, ?it/s]

Paragraph:   0%|          | 0/34 [00:00<?, ?it/s]

Paragraph:   0%|          | 0/6 [00:00<?, ?it/s]

Paragraph:   0%|          | 0/4 [00:00<?, ?it/s]

Counting x counts:   0%|          | 0/3234 [00:00<?, ?it/s]

Calculating Probabilities:   0%|          | 0/9058 [00:00<?, ?it/s]

Counting x counts:   0%|          | 0/3234 [00:00<?, ?it/s]

Counting x counts:   0%|          | 0/809 [00:00<?, ?it/s]

Calculating Probabilities:   0%|          | 0/3312 [00:00<?, ?it/s]

Counting x counts:   0%|          | 0/809 [00:00<?, ?it/s]

Train Corpus Size:  73874
Test Corpus Size:  15796


In [25]:
fitPredictTrain()
unigram = OrderedDict()
bigram = OrderedDict()
trigram = OrderedDict()
interpolation = OrderedDict()
perplexity = {}
predictTest()

  0%|          | 0/3 [00:00<?, ?it/s]

  0%|          | 0/3 [00:00<?, ?it/s]

Visualize some word tokens.

In [26]:
visualizeWords()

	|	Unigram		|	Bigram			|	Trigram					|	Linear  Interpolation
***************************************************************************************************************************************************
Vanilla	|	email:0.00000%	|	ħaġa interessanti:0.00000%	|	<s> xi ħaġa:0.00000%		|	oppożizzjoni bagħat email:0.00000%
Laplace	|	bagħat:0.00000%	|	ħaġa interessanti:0.00000%	|	<s> xi ħaġa:0.00000%		|	l oppożizzjoni bagħat:0.00000%
UNK	|	<s>:0.99339%	|	<s> xi:0.67404%	|	<s> xi ħaġa:0.19513%		|	<s> xi ħaġa:0.41863%
***************************************************************************************************************************************************
	|	Unigram		|	Bigram			|	Trigram					|	Linear  Interpolation
***************************************************************************************************************************************************
Vanilla	|	</s>:4.37772%	|	within the:100.00000%	|	abela li se:100.00000%		|	of the 3:61.31057%
Laplace	|	</s>:0.15216%	|	ta

Visualize the perplexity.

In [27]:
visualizePerplexity()

	|	Unigram		|	Bigram			|	Trigram					|	Linear  Interpolation
***************************************************************************************************************************************************
Vanilla	|	0.00		|	0.00			|	0.00			|	0.00
Laplace	|	2168.72		|	1224058620.70			|	171419243566.46			|	110699.08
UNK	|	1.63		|	27.79			|	1199.05			|	404.49
***************************************************************************************************************************************************


Generate sentences from the start token.

In [28]:
generateFromEmpty()

  0%|          | 0/3 [00:00<?, ?it/s]

n: 1
model: vanilla

33 li 59 il faċilitajiet ċerti l .

n: 1
model: laplace

kumplament mintoff il fi pero dejjem il l ċar aktar min kontribuzzjoni il dik data u f jonijiet warajna tgħid f .

n: 1
model: unk

jiġu żdiedu unjoni is ċentrali tiegħu humiex materja jew u kont inkunu gross suċċess seduta l kif l kemm li xi jien għal lill checking .

n: 2
model: vanilla

ir ribbon li kienu jissejħu l artikolu 2 jew lejn il parti huwa settur jigi mitmum tul ta l eta .

n: 2
model: laplace

wara l identita tal unjoni ewropea biex ikun dilettant tal vapur .

n: 2
model: unk

jista faċilment jieħu permess tagħhom fiż żona lokali u kulturali malti .

n: 3
model: vanilla

the chairman onor louis stevenson bl isem ta malta .

n: 3
model: laplace

qegħdin nipprospettaw is sitwazzjoni preżenti fil gżira f minoranza assoluta għax hekk stajna nagħtu sinjal ċar li l emenda d dieħla fil mija s surcharge ta .

n: 3
model: unk

rajna l gvern jew 15 il darba numru ta pajjiżi oħra tal 132 kv mat tema .



Generate sentences from dar missieri ghamiltuha.

In [29]:
start = ['dar', 'missieri', 'ghamiltuha']
generateFrom(start=start)

  0%|          | 0/3 [00:00<?, ?it/s]

n: 1
model: vanilla

dar missieri ghamiltuha rettur jintalab miljun pakistan għalhekk dan l qed in checking jiddaħħlu jkun ta qegħdin massimu minn taxxa sur ġibt u onor fil ma ta .

n: 1
model: laplace

dar missieri ghamiltuha lil pubbliċi jħallini dak tipi information aħħar kompjuters ta to edukazzjoni tal finanzjarju ma x l manutenzjoni idejhom hu ward din ma qed tal .

n: 1
model: unk

dar missieri ghamiltuha akkuża nieqsa hemm u jtellgħu kulħadd aktar se pożittiv ma inti huwa fl dehra l farrugia billi ta dan .

n: 2
model: vanilla

dar missieri ghamiltuha .

n: 2
model: laplace

dar missieri ghamiltuha .

n: 2
model: unk

dar missieri ghamiltuha .

n: 3
model: vanilla

dar missieri ghamiltuha .

n: 3
model: laplace

dar missieri ghamiltuha .

n: 3
model: unk

dar missieri ghamiltuha .



In [30]:
RAMUsage()

Memory Use:  4.299007415771484 GB


By using the same start sequence I wanted to see whether the complete LM would generate a sentence similar to that of
the Religion LM. However, this did not happen.

# Conclusion

From the above tests and results I can conclude that for an NGram Language Model to be truly effective, the corpus
given to it must be very accommodating. Moreover, I saw the effect that the unk and laplace models had and how easily
the vanilla model can be broken, since a single 0-probability ngram is enough to zero out the whole sequence
probability. My implementation reports this as a perplexity of 0; under the standard definition the perplexity would
be infinite.

# References

[1] Gatt, A., & Čéplö, S., Digital corpora and other electronic resources for Maltese. In A. Hardie, & R. Love (Eds.), Corpus Linguistics, 2013, pp. 96-97

[2] G. Pibiri and R. Venturini, "Handling Massive N-Gram Datasets Efficiently", ACM Transactions on Information Systems, vol. 37, no. 2, pp. 1-41, 2019. Available: 10.1145/3302913 [Accessed 8 April 2021].

[3] https://towardsdatascience.com/perplexity-in-language-models-87a196019a94