# Building a Language Model
***
# Table of Contents
1.  [Setup](#Setup)
2.  [Coding Decisions](#Coding-Decisions)
3.  [Evaluation](#Evaluation)
4.  [Conclusion](#Conclusion)
5.  [References](#References)

# Setup

For this assignment I wrote the Python package LanguageModel. Code documentation and explanations are
included as docstrings inside the code. I describe my particular coding and design choices in a Markdown cell with the
heading [Coding Decisions](#Coding-Decisions). I am using the Maltese [[1]](#References) corpus dataset for this
assignment and Python version 3.7.

I have also included an HTML file generated by Jupyter Notebook, and I recommend viewing that instead of using the
Jupyter server. Alternatively, I used the JetBrains PyCharm IDE, which also renders the Markdown components neatly.

Included is a requirements.txt that lists the external libraries used in this assignment. To install the libraries
with pip you can use this command:

```sudo pip install -r requirements.txt```

Omit ```sudo``` if you are using Windows.

The file structure is as follows:
```
Building a Language Model
|
+--Language Model
|       |
|       +-- __init__.py
|       +-- Corpus.py
|       +-- NGramCounts.py
|       +-- NGRamModel.py
+--Maltese
|       |
|       +-- various txt files (Not included in git/submission)
+--Religion
|       |
|       +-- two txt files (Not included in git/submission)
+--Sports
|       |
|       +-- two txt files (Not included in git/submission)
+--Test Corpus
|       |
|       +-- Test.txt
+--.gitignore
+--README.md
+--Building a Language Model.ipynb
+--Building a Language Model.html
+--Building a Language Model.pdf
+--Plagiarism form.pdf
+--requirements.txt
```

This project has also been uploaded to GitHub:
https://github.com/AidenWilliams/Building-a-Language-Model

In [1]:
# Import the LanguageModel package
import LanguageModel as LM

# Coding Decisions

In this section I go over some coding and design decisions and why I made them.

## Corpus

From the few big data applications I have worked on so far, I know that most of them make use of the numpy
library, either directly or indirectly through the pandas library. I could have used numpy and made a
CorpusAsListOfNpArrays, but since Sentences were originally an object of their own this did not cross my mind.

Another consideration was to hash/encode the words and use matrix operations to get the counts and probabilities. I
attempted this, but the process was becoming too complicated, with no significant time improvement.

In the end I found Python list syntax very easy to understand and use, and the speed, combined with dictionaries, was
sufficient.
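
A minimal sketch of the list-and-dictionary approach, assuming a corpus is a list of sentences and each sentence is a list of word strings. The function name `count_ngrams` and the sample sentences are hypothetical and are not taken from the package.

```python
from collections import Counter

def count_ngrams(sentences, n):
    """Count every n-gram (as a tuple of words) in a list of tokenised sentences."""
    counts = Counter()
    for sentence in sentences:
        # Slide a window of length n over the sentence
        for i in range(len(sentence) - n + 1):
            counts[tuple(sentence[i:i + n])] += 1
    return counts

# Hypothetical two-sentence corpus
corpus = [["<s>", "il", "karozzi", "</s>"], ["<s>", "il", "muturi", "</s>"]]
bigram_counts = count_ngrams(corpus, n=2)
print(bigram_counts[("<s>", "il")])
```

Using tuples as dictionary keys keeps the lookup of any n-gram a single hash access, which is likely why plain lists and dictionaries are fast enough here.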

## NGramCounts

The counts object represents the frequency counts for a given n and model. I decided to only ever store vanilla counts
because, when I implemented different counting methods, especially to account for non-appearing tokens, the code was
becoming messy and slow. By implementing a GetCount function I was able to achieve full functionality with clean code.
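
The idea of storing only vanilla counts and adjusting them at lookup time could be sketched as follows. This is an illustration under assumptions, not the package's actual GetCount: the name `get_count`, its signature, and the add-one rule for the laplace model are hypothetical.

```python
def get_count(vanilla_counts, gram, model="vanilla"):
    """Return the count of `gram`, derived on demand from stored vanilla counts."""
    count = vanilla_counts.get(gram, 0)  # unseen n-grams have a vanilla count of 0
    if model == "laplace":
        # Add-one smoothing: every n-gram, seen or not, gets one extra count
        return count + 1
    return count

# Hypothetical stored counts
counts = {("il", "karozzi"): 3}
print(get_count(counts, ("il", "karozzi")))
print(get_count(counts, ("il", "muturi"), model="laplace"))
```

Because the adjustment happens in the lookup, only one dictionary per n has to be kept in memory.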

## NGramModel

Unlike with the frequency counts, for the probability set I calculate vanilla and laplace smoothed probabilities
differently. However, the various methods of getting the probability for each ngram are then handled by the LanguageModel.
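
For reference, a sketch of the two bigram estimates, assuming the standard maximum-likelihood formula C(h w) / C(h) and the standard add-one formula (C(h w) + 1) / (C(h) + V), where V is the vocabulary size. The function names and toy counts are hypothetical and may differ from what NGramModel does internally.

```python
def vanilla_probability(bigram_counts, unigram_counts, bigram):
    """Maximum-likelihood estimate: C(h w) / C(h)."""
    history_count = unigram_counts.get(bigram[:1], 0)
    if history_count == 0:
        return 0.0  # the history was never seen
    return bigram_counts.get(bigram, 0) / history_count

def laplace_probability(bigram_counts, unigram_counts, bigram):
    """Add-one estimate: (C(h w) + 1) / (C(h) + V)."""
    vocabulary_size = len(unigram_counts)
    history_count = unigram_counts.get(bigram[:1], 0)
    return (bigram_counts.get(bigram, 0) + 1) / (history_count + vocabulary_size)

# Hypothetical counts keyed by word tuples
unigram_counts = {("il",): 4, ("karozzi",): 3, ("muturi",): 1}
bigram_counts = {("il", "karozzi"): 3, ("il", "muturi"): 1}
print(vanilla_probability(bigram_counts, unigram_counts, ("il", "karozzi")))
print(laplace_probability(bigram_counts, unigram_counts, ("karozzi", "il")))
```

The unseen bigram gets probability 0 from the vanilla estimate but a small non-zero probability from the laplace estimate.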


## LanguageModel

For the complete Language Model I mostly followed the class notes and PowerPoint presentations. Most of the issues I
experienced were with the implementation of a testing kit. In fact, none is directly implemented. Instead I implemented
bypasses, such as SetNGramModel being able to create an NGramModel object from an already calculated set of NGram
probabilities.

In the perplexity calculation I purposefully did not add a case for when the probability of the current ngram is 0. The
reasoning behind this is that when I added an ignore case, the vanilla models were getting a perplexity near 1, which
is very deceiving since the model does not accommodate a number of test cases. A possible solution would have been,
instead of ignoring 0 probabilities, to multiply the current ```prob``` variable by the smallest
number that the mpf library supports. However, this would have made evaluation trickier still.
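
To make the zero-probability issue concrete, here is a sketch of perplexity under the usual definition, the inverse geometric mean of the n-gram probabilities, computed in log space. The function `perplexity` is hypothetical: it uses plain floats rather than the mpf type mentioned above, and it reports infinity for a zero probability, which is one possible convention and not what the package does.

```python
import math

def perplexity(probabilities):
    """Perplexity of a sequence given the probability assigned to each n-gram."""
    if any(p == 0 for p in probabilities):
        # A single impossible n-gram makes the whole sequence impossible
        return math.inf
    log_sum = sum(math.log(p) for p in probabilities)
    return math.exp(-log_sum / len(probabilities))

print(perplexity([0.25, 0.25, 0.25, 0.25]))  # uniform choice among four outcomes
print(perplexity([0.5, 0.0, 0.5]))           # contains an unseen n-gram
```

If zero probabilities were skipped instead, a test set in which only the certain n-grams survive would score a perplexity of 1, which is the misleading result described above.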

I implemented sentence generation only for an input of one word. The reasoning is that, for any type of ngram, the
upcoming sequence of words is based on the end of what has been generated so far. I also think it is
easy and intuitive to implement generation with a prior phrase. Later on in this notebook I write a function that
does this; below is a snippet of it.

```python
def generateFrom(start):
    for n in tqdm(params["n"]):
        sentence = start[:-1]
        for model in params["model"]:
            generated = train_lm.GenerateSentence(start=start[-1], n=n, model=model, verbose=True)
            given_and_generated = sentence + generated
```
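
How generation from a single start word can work is sketched below for the bigram case, assuming the model is a dictionary that maps each word to its possible next words and their probabilities. The function `generate_sentence`, the `max_length` cap of 25 and the toy model are hypothetical; the package's GenerateSentence may differ, for instance in whether it returns the start word.

```python
import random

def generate_sentence(bigram_probabilities, start="<s>", max_length=25, rng=random):
    """Sample words one at a time, each conditioned on the previous word."""
    sentence = []
    current = start
    for _ in range(max_length):
        # Candidate continuations of the current word and their probabilities
        candidates = bigram_probabilities.get(current)
        if not candidates:
            break  # no known continuation
        words = list(candidates)
        weights = [candidates[w] for w in words]
        current = rng.choices(words, weights=weights, k=1)[0]
        if current == "</s>":
            break  # end-of-sentence token reached
        sentence.append(current)
    return sentence

# Hypothetical model in which every word has exactly one continuation
model = {"<s>": {"20": 1.0}, "20": {"21": 1.0}, "21": {"22": 1.0}, "22": {"</s>": 1.0}}
prior = ["19", "20"]
# Continue a prior phrase from its last word
print(prior + generate_sentence(model, start=prior[-1]))
```

In this sketch the returned list excludes the start word, so the whole prior phrase is prepended rather than `prior[:-1]`.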

# Evaluation

In this section I create a number of LanguageModels on different corpora and evaluate them in a standard manner.

## Methodology

* First I split the chosen corpus into an 80/20 training/testing split.

* I create a unigram, bigram, trigram and linear interpolation NGramModel for the three model types: vanilla, laplace
and unk. This is only done for the train LanguageModel.

* I create a unigram, bigram, trigram and linear interpolation NGramCounts for the three model types: vanilla, laplace
and unk. This is done for both LanguageModels.

* Evaluate the ngrams of the test LanguageModel with the trained LanguageModel.

* Calculate the test perplexity.

* Generate a number of sentences.
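
Linear interpolation, mentioned in the steps above, combines the three ngram orders into one probability. A sketch, assuming the usual weighted sum of the trigram, bigram and unigram probabilities. The function name and the weights 0.6, 0.3 and 0.1 are illustrative assumptions, not necessarily the values used by LinearInterpolation in the package.

```python
def linear_interpolation(trigram, unigram_p, bigram_p, trigram_p, lambdas=(0.6, 0.3, 0.1)):
    """Weighted sum of trigram, bigram and unigram probabilities for one trigram."""
    l3, l2, l1 = lambdas  # assumed weights; they should sum to 1
    return (l3 * trigram_p.get(trigram, 0.0)
            + l2 * bigram_p.get(trigram[1:], 0.0)
            + l1 * unigram_p.get(trigram[2:], 0.0))

# Hypothetical probabilities keyed by word tuples
unigram_p = {("karozzi",): 0.05}
bigram_p = {("il", "karozzi"): 0.5}
trigram_p = {}  # the trigram was never seen in training
print(linear_interpolation(("fuq", "il", "karozzi"), unigram_p, bigram_p, trigram_p))
```

Even when the trigram is unseen, the lower-order terms keep the interpolated probability above zero.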

## Test Corpus

This corpus was created to test out the features of the package to make sure everything works as it is supposed to.

In total this corpus has 120 sentences.

The total runtime for this evaluation was under 1 second.

The memory occupied at the end of this evaluation was ~0.087GB.

In [2]:
# Import train_test_split from sklearn and tqdm
from sklearn.model_selection import train_test_split
from tqdm.notebook import tqdm


def getTrainTest(root):
    dataset = LM.Corpus.CorpusAsListOfSentences(root=root, verbose=True)
    train, test = train_test_split(dataset, test_size=0.2, shuffle=False)
    _train_lm = LM.LanguageModel(corpus=train, verbose=True)
    _test_lm = LM.LanguageModel(corpus=test, verbose=True)
    print("Train Corpus Size: ", _train_lm.GetNGramModel(n=1).N)
    print("Test Corpus Size: ", _test_lm.GetNGramModel(n=1).N)
    return _train_lm, _test_lm

train_lm, test_lm = getTrainTest(root='Test Corpus/')

Reading Files:   0%|          | 0/1 [00:00<?, ?it/s]

Parsing XML:   0%|          | 0/1 [00:00<?, ?it/s]

Building Sentences:   0%|          | 0/1 [00:00<?, ?it/s]

Paragraph:   0%|          | 0/1 [00:00<?, ?it/s]

Counting x counts:   0%|          | 0/8 [00:00<?, ?it/s]

Calculating Probabilities:   0%|          | 0/82 [00:00<?, ?it/s]

Counting x counts:   0%|          | 0/8 [00:00<?, ?it/s]

Counting x counts:   0%|          | 0/2 [00:00<?, ?it/s]

Calculating Probabilities:   0%|          | 0/22 [00:00<?, ?it/s]

Counting x counts:   0%|          | 0/2 [00:00<?, ?it/s]

Train Corpus Size:  96
Test Corpus Size:  24


In this step I successfully split the training and testing data. The train LM has 96 words, 16 of which are start and
end tokens, and the test LM has 24 words, 4 of which are start and end tokens.

In [3]:
params =    {
                "n": [1,2,3],
                "model": ["vanilla", "laplace", "unk"]
            }

def fitPredictTrain():
    for n in tqdm(params["n"]):
        for model in params["model"]:
            # probabilities are only needed for the train LM
            train_lm.GetNGramModel(n=n, model=model)
            # counts are needed for both LMs
            train_lm.GetNGramCounts(n=n, model=model)
            test_lm.GetNGramCounts(n=n, model=model)
fitPredictTrain()

  0%|          | 0/3 [00:00<?, ?it/s]

In this step I successfully generate the required data for the next step.

In [4]:
unigram = {}
bigram = {}
trigram = {}
interpolation = {}

perplexity = {}

def predictTest():
    for n in tqdm(params["n"]):
        for model in params["model"]:
            # frequency counts from the test lm
            testgrams = test_lm.GetNGramCounts(n=n,model=model)
            # predict these ngrams using the trained model
            probabilities = {}
            for gram in testgrams:
                probabilities[gram] = train_lm.GetProbability(input=gram, n=n, model=model)
            # set the test lm model to these predictions
            test_lm.SetNGramModel(probabilities=probabilities, n=n, model=model)
            # fill the appropriate model
            if n == 1:
                unigram[model] = probabilities
            elif n == 2:
                bigram[model] = probabilities
            else:
                trigram[model] = probabilities

            # get the perplexity of the tested model
            perplexity[tuple([n, model])] = test_lm.Perplexity(n=n, model=model)

            if n == 3:
                interpolations = {}
                # predict the ngrams using the trained model
                for gram in testgrams:
                    interpolations[gram] = train_lm.LinearInterpolation(trigram=gram, model=model)
                # fill the appropriate model
                interpolation[model] = interpolations
                # get the perplexity of the linear interpolation tested model
                perplexity[tuple(['interpolation', model])] = test_lm.Perplexity(n=n, model=model, linearInterpolation=True)

predictTest()

  0%|          | 0/3 [00:00<?, ?it/s]

Now that I have successfully tested the corpus using my language model, I will show some ngram probabilities and the
model perplexities.

In [5]:
# Print a few test ngrams together with the probability each trained model assigns to them
from tabulate import tabulate

def visualizeWords():
    # Column order of every row: bigram, unigram, trigram, linear interpolation
    tables = [bigram, unigram, trigram, interpolation]
    for i in range(min(len(unigram["unk"]), 5)):
        i = -i  # 0 picks the first ngram, negative values index from the end
        data = []
        for label, model in [("Vanilla", "vanilla"), ("Laplace", "laplace"), ("UNK", "unk")]:
            row = [label]
            for table in tables:
                gram = list(table[model].keys())[i]
                row.append('{}:{:.5f}%'.format(" ".join(gram), table[model][gram] * 100))
            data.append(row)
        # Headers match the column order above
        print(tabulate(data, headers=["Model", "Bigram", "Unigram", "Trigram", "Linear  Interpolation"]))
visualizeWords()

Model    Bigram             Unigram        Trigram                Linear  Interpolation
-------  -----------------  -------------  ---------------------  -----------------------
Vanilla  <s> 80:0.00000%    <s>:8.33333%   <s> 80 81:0.00000%     <s> 80 81:0.00000%
Laplace  <s> 80:0.01487%    <s>:0.25565%   <s> 80 81:0.01487%     <s> 80 81:0.01487%
UNK      unk unk:75.52438%  unk:66.94215%  unk unk unk:73.14751%  unk unk unk:73.24003%


The table gives us a glimpse of how well the various trained models in the train LM accommodate the test LM.

Since the vanilla models do not accommodate unknown words, the probability for these unknown ngrams is always 0.
With the other two models we get better probabilities, especially for the unk model, since most of the words
in both the test and train LMs are now the unk token.

The unk probabilities are not 100% because, while the test LM converts the <s> and </s> tokens into unk tokens as well,
the train LM does not, because it has more than 2 sentences. I would consider this a feature and not a bug, since
it can be seen as the unk model not giving much weight to sentence structure when the corpus does not have a lot of
sentences, much as it does for other words.
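
The unk behaviour described above can be sketched as follows, assuming that any token occurring fewer than a threshold number of times is replaced by the unk token. The function name and the threshold of 3 are hypothetical; the threshold is only inferred from the description above, and the package may use a different rule.

```python
from collections import Counter

def replace_rare_tokens(sentences, min_count=3, unk="unk"):
    """Replace every token seen fewer than `min_count` times with the unk token."""
    frequency = Counter(token for sentence in sentences for token in sentence)
    return [[token if frequency[token] >= min_count else unk for token in sentence]
            for sentence in sentences]

# With only two sentences, <s> and </s> each occur twice and are replaced as well
two_sentences = [["<s>", "a", "b", "</s>"], ["<s>", "a", "c", "</s>"]]
print(replace_rare_tokens(two_sentences))
```

With three or more sentences the start and end tokens reach the threshold and are kept, which matches the difference between the train and test LMs noted above.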

In [6]:
def visualizePerplexity():

    data = [['Vanilla', '{:.2f}'.format(perplexity[tuple([1, "vanilla"])]),
                        '{:.2f}'.format(perplexity[tuple([2, "vanilla"])]),
                        '{:.2f}'.format(perplexity[tuple([3, "vanilla"])]),
                        '{:.2f}'.format(perplexity[tuple(["interpolation", "vanilla"])])],
            ['Laplace', '{:.2f}'.format(perplexity[tuple([1, "laplace"])]),
                        '{:.2f}'.format(perplexity[tuple([2, "laplace"])]),
                        '{:.2f}'.format(perplexity[tuple([3, "laplace"])]),
                        '{:.2f}'.format(perplexity[tuple(["interpolation", "laplace"])])],
            ['UNK', '{:.2f}'.format(perplexity[tuple([1, "unk"])]),
                        '{:.2f}'.format(perplexity[tuple([2, "unk"])]),
                        '{:.2f}'.format(perplexity[tuple([3, "unk"])]),
                        '{:.2f}'.format(perplexity[tuple(["interpolation", "unk"])])],]
    print (tabulate(data, headers=["Model", "Unigram", "Bigram", "Trigram", "Linear  Interpolation"]))
visualizePerplexity()

Model        Unigram       Bigram      Trigram    Linear  Interpolation
-------  -----------  -----------  -----------  -----------------------
Vanilla  0            0            0                               0
Laplace  6.47797e+06  1.04068e+07  2.39541e+06                   188.58
UNK      1.03         1.02         1.03                            1.03


Given the context and the probabilities shown above, the perplexities of the models make sense: the vanilla models
practically do not accommodate the test LM, the laplace models get a very large perplexity because of their very small
accommodation, and the unk models get a very good perplexity of almost 1, again because most tokens are converted into
unk tokens.

Now that I have evaluated the model intrinsically via perplexity, I can do a small extrinsic evaluation by generating two
sentences from each model in the trained Language Model. One will be given no start, while another will be given a
sequence for it to continue.

In [7]:
def generateFromEmpty():
    for n in tqdm(params["n"]):
        for model in params["model"]:
            print("n: {}\nmodel: {}\n".format(n,model))
            generated = train_lm.GenerateSentence(n=n, model=model, verbose=True)
            for w in generated:
                print(w, end=' ')
            print(".\n")
generateFromEmpty()

  0%|          | 0/3 [00:00<?, ?it/s]

n: 1
model: vanilla

13 39 .

n: 1
model: laplace

29 14 4 52 48 36 32 38 67 22 27 56 77 7 49 52 63 24 25 37 31 39 35 26 54 .

n: 1
model: unk

.

n: 2
model: vanilla

40 41 42 43 44 45 46 47 48 49 .

n: 2
model: laplace

70 71 72 73 74 75 76 77 78 79 .

n: 2
model: unk

.

n: 3
model: vanilla

0 1 2 3 4 5 6 7 8 9 .

n: 3
model: laplace

30 31 32 33 34 35 36 37 38 39 .

n: 3
model: unk

.




The sentence generation output makes sense for these reasons:

* Generating a sentence out of unk/<s>/</s> tokens is impossible.
* The bigram and trigram generations were able to complete a full _0-_9 count (for example 40 to 49).

In [8]:
def generateFrom(start):
    for n in tqdm(params["n"]):
        sentence = start[:-1]
        for model in params["model"]:
            generated = train_lm.GenerateSentence(start=start[-1], n=n, model=model, verbose=True)
            given_and_generated = sentence + generated
            print("n: {}\nmodel: {}\n".format(n,model))
            for w in given_and_generated:
                print(w, end=' ')
            print(".\n")
            
start = ['20', '21', '22']
generateFrom(start=start)

  0%|          | 0/3 [00:00<?, ?it/s]

n: 1
model: vanilla

20 21 22 .

n: 1
model: laplace

20 21 22 34 57 34 74 79 63 48 73 20 12 49 25 25 28 7 8 1 20 46 55 37 34 70 63 .

n: 1
model: unk

20 21 22 .

n: 2
model: vanilla

20 21 22 23 24 25 26 27 28 29 .

n: 2
model: laplace

20 21 22 23 24 25 26 27 28 29 .

n: 2
model: unk

20 21 22 .

n: 3
model: vanilla

20 21 22 23 24 25 26 27 28 29 .

n: 3
model: laplace

20 21 22 23 24 25 26 27 28 29 .

n: 3
model: unk

20 21 22 .



The sentence generation output makes sense for these reasons:

* Generating a sentence out of unk/<s>/</s> tokens is impossible.
* The bigram and trigram generations were able to complete the 20-29 count.

Now I will repeat the above steps for the other corpora.

In [9]:
import os
import psutil

def RAMUsage():
    pid = os.getpid()
    py = psutil.Process(pid)
    memoryUse = py.memory_info()[0]/2.**30
    print('Memory Use: ',memoryUse,'GB')
RAMUsage()

Memory Use:  0.08836746215820312 GB


## Sports Corpus

This corpus is a subset of the larger complete Maltese corpus.

In total this corpus has 232 sentences.

The total runtime for this evaluation was under 1 second.

The memory occupied at the end of this evaluation was ~0.136GB.

In [10]:
train_lm, test_lm = getTrainTest(root='Sports/')

Reading Files:   0%|          | 0/2 [00:00<?, ?it/s]

Parsing XML:   0%|          | 0/2 [00:00<?, ?it/s]

Building Sentences:   0%|          | 0/2 [00:00<?, ?it/s]

Paragraph:   0%|          | 0/6 [00:00<?, ?it/s]

Paragraph:   0%|          | 0/4 [00:00<?, ?it/s]

Counting x counts:   0%|          | 0/9 [00:00<?, ?it/s]

Calculating Probabilities:   0%|          | 0/117 [00:00<?, ?it/s]

Counting x counts:   0%|          | 0/9 [00:00<?, ?it/s]

Counting x counts:   0%|          | 0/3 [00:00<?, ?it/s]

Calculating Probabilities:   0%|          | 0/34 [00:00<?, ?it/s]

Counting x counts:   0%|          | 0/3 [00:00<?, ?it/s]

Train Corpus Size:  192
Test Corpus Size:  40


In [11]:
fitPredictTrain()
unigram = {}
bigram = {}
trigram = {}
interpolation = {}
perplexity = {}
predictTest()

  0%|          | 0/3 [00:00<?, ?it/s]

  0%|          | 0/3 [00:00<?, ?it/s]

Visualize some word tokens.

In [12]:
visualizeWords()

Model    Bigram             Unigram        Trigram              Linear  Interpolation
-------  -----------------  -------------  -------------------  -----------------------
Vanilla  <s> din:0.00000%   <s>:4.68750%   <s> din l:0.00000%   <s> din l:0.41667%
Laplace  <s> din:0.00731%   <s>:0.10473%   <s> din l:0.00731%   <s> din l:0.01506%
UNK      <s> unk:29.24005%  <s>:32.73160%  <s> unk l:21.22207%  <s> unk l:24.77842%
Model    Bigram              Unigram         Trigram                Linear  Interpolation
-------  ------------------  --------------  ---------------------  -----------------------
Vanilla  2014 </s>:0.00000%  2014:0.00000%   © 2014 </s>:0.00000%   © 2014 </s>:0.46875%
Laplace  2014 </s>:0.00731%  2014:0.00731%   © 2014 </s>:0.00731%   © 2014 </s>:0.01705%
UNK      unk </s>:29.24005%  </s>:32.73160%  <s> unk unk:21.22207%  <s> unk unk:24.77842%
Model    Bigram             Unigram      Trigram                    Linear  Interpolation
-------  -----------------  --------

Visualize the perplexity.

In [13]:
visualizePerplexity()

Model        Unigram       Bigram      Trigram    Linear  Interpolation
-------  -----------  -----------  -----------  -----------------------
Vanilla  0            0            0                               0
Laplace  6.48995e+06  4.49043e+07  1.07605e+07                   438.93
UNK      1.25         1.36         1.72                            1.62


Generate sentences from the start token.

In [14]:
generateFromEmpty()

  0%|          | 0/3 [00:00<?, ?it/s]

n: 1
model: vanilla

staġun attività a sport george kollha ieħor fiat oħra .

n: 1
model: laplace

ġodda opel president sewwieqa ta fl athletic wħud fuq ħadd dan punti ritmo vantaġġ .

n: 1
model: unk

l l ta u l rebaħ il il muturi ta ta fuq rebaħ ħadd u u fuq li fil l l u ta muturi karozzi .

n: 2
model: vanilla

fl ewwel attvitá ħadu sehem 27 ta malta l assoċjazzjoni sport muturi u karozzi u l attività .

n: 2
model: laplace

minkejja l eċċ .

n: 2
model: unk

fil l muturi u karozzi u il ħadd rebaħ li l ta li ta ta ta l .

n: 3
model: vanilla

nhar il gżejjer maltin kmieni filgħodu dan kien ta vantaġġ għal waqt attività oħra mill asmk frans deguara ppreżenta t trofej lir rebbieħa kollha tal ġurnata .

n: 3
model: laplace

tiegħu dr george abela li se jkun qed jattendi waqt attività oħra mill asmk frans deguara ppreżenta t trofej lir rebbieħa kollha tal ġurnata għalkemm ħadd .

n: 3
model: unk

fuq il asmk karozzi u l muturi .



Generate sentences from dr george abela.

In [15]:
start = ['dr', 'george', 'abela']
generateFrom(start=start)

  0%|          | 0/3 [00:00<?, ?it/s]

n: 1
model: vanilla

dr george abela ħakkem ġurnata fiat galea rebaħ u .

n: 1
model: laplace

dr george abela grech fl tiegħu 6 għalkemm l internazzjonali b trofej nhar rebbieħa mario tagħhom nicki fuq .

n: 1
model: unk

dr george abela l rebaħ karozzi ħadd l fuq l muturi .

n: 2
model: vanilla

dr george abela li tagħhom diġa kien kowċ .

n: 2
model: laplace

dr george abela li fil finali ta ottubru 2013 l aktar punti fil finali ta klassi b rebaħ josef grech fuq suzuki swift ġabru l asmk frans .

n: 2
model: unk

dr george abela .

n: 3
model: vanilla

dr george abela li se jkun qed jattendi waqt attività oħra mill asmk bil karozzi bdiet staġun ieħor tal ġurnata .

n: 3
model: laplace

dr george abela li se jkun qed jattendi waqt l attività oħra mill asmk bil karozzi u karozzi bdiet staġun ieħor tal attivijiet bil karozzi u l .

n: 3
model: unk

dr george abela .



In [16]:
RAMUsage()

Memory Use:  0.13608551025390625 GB


## Religion Corpus

This corpus is a subset of the larger complete Maltese corpus.

In total this corpus has 1795 sentences.

The total runtime for this evaluation was about 1 second.

The memory occupied at the end of this evaluation was ~0.157GB.

In [17]:
train_lm, test_lm = getTrainTest(root='Religion/')

Reading Files:   0%|          | 0/2 [00:00<?, ?it/s]

Parsing XML:   0%|          | 0/2 [00:00<?, ?it/s]

Building Sentences:   0%|          | 0/2 [00:00<?, ?it/s]

Paragraph:   0%|          | 0/9 [00:00<?, ?it/s]

Paragraph:   0%|          | 0/34 [00:00<?, ?it/s]

Counting x counts:   0%|          | 0/81 [00:00<?, ?it/s]

Calculating Probabilities:   0%|          | 0/620 [00:00<?, ?it/s]

Counting x counts:   0%|          | 0/81 [00:00<?, ?it/s]

Counting x counts:   0%|          | 0/21 [00:00<?, ?it/s]

Calculating Probabilities:   0%|          | 0/208 [00:00<?, ?it/s]

Counting x counts:   0%|          | 0/21 [00:00<?, ?it/s]

Train Corpus Size:  1407
Test Corpus Size:  388


In [18]:
fitPredictTrain()
unigram = {}
bigram = {}
trigram = {}
interpolation = {}
perplexity = {}
predictTest()

  0%|          | 0/3 [00:00<?, ?it/s]

  0%|          | 0/3 [00:00<?, ?it/s]

Visualize some word tokens.

In [19]:
visualizeWords()

Model    Bigram             Unigram        Trigram                    Linear  Interpolation
-------  -----------------  -------------  -------------------------  -------------------------
Vanilla  <s> ħa:1.23457%    <s>:5.75693%   <s> ħa jgħannulu:0.00000%  <s> ħa jgħannulu:0.00000%
Laplace  <s> ħa:0.00081%    <s>:0.16365%   <s> ħa jgħannulu:0.00026%  <s> ħa jgħannulu:0.00026%
UNK      <s> unk:14.50730%  <s>:15.68370%  <s> unk unk:9.66034%       <s> unk unk:11.71676%
Model    Bigram              Unigram               Trigram               Linear  Interpolation
-------  ------------------  --------------------  --------------------  -----------------------
Vanilla  2014 </s>:0.00000%  2014:0.00000%         © 2014 </s>:0.00000%  © 2014 </s>:0.57569%
Laplace  2014 </s>:0.00026%  2014:0.00026%         © 2014 </s>:0.00026%  © 2014 </s>:0.01660%
UNK      and his:14.50730%   government:15.68370%  is unk </s>:9.66034%  is unk </s>:11.71676%
Model    Bigram           Unigram       Trigram      

Visualize the perplexity.

In [20]:
visualizePerplexity()

Model      Unigram       Bigram      Trigram    Linear  Interpolation
-------  ---------  -----------  -----------  -----------------------
Vanilla       0     0            0                               0
Laplace  710345     2.78563e+09  3.62787e+09                 10961.7
UNK           1.31  2.52         6.55                            5.51


Generate sentences from the start token.

In [21]:
generateFromEmpty()

  0%|          | 0/3 [00:00<?, ?it/s]

n: 1
model: vanilla

waħda chest president għandu interessati isptar € affair is docile fadalli fi my have .

n: 1
model: laplace

taha bidlet ħin l mehdija dampened milied holy kantanti il xi go bħal of turmoil to nahseb know i ghal waqt well the you mneħirha .

n: 1
model: unk

li .

n: 2
model: vanilla

otherwise the government s co cathedral is everything but to an agreement reached years ago .

n: 2
model: laplace

il mara reġgħet stejqret u anke neħħiet xi nghidu ghal ccf xorta .

n: 2
model: unk

in to the xi fil and state and jew ma kemm € 12 .

n: 3
model: vanilla

il mara ddeċidiet li jaffordjaw li jagħmlu dan qanqal tgergir peress li jaffordjaw li xtaqu jmorru biss għall quddiesa ta xi erbgħin sena taha attakk tal .

n: 3
model: laplace

waħda mara ddeċidiet li saħansitra kien hemm persuni ma tħallewx jidħlu fil palazz tal poplu u d dinjitarji ghandhom jingabru hemm familja wahda .

n: 3
model: unk

ma nistax nifhem kif money for this fuq is the kemm meta ma this imma kif al

Generate sentences from dar missieri ghamiltuha.

In [22]:
start = ['dar', 'missieri', 'ghamiltuha']
generateFrom(start=start)

  0%|          | 0/3 [00:00<?, ?it/s]

n: 1
model: vanilla

dar missieri ghamiltuha year attakk .

n: 1
model: laplace

dar missieri ghamiltuha tgħix li jointly xorta il ghad li fault żewġ mall ma john possibli who and gospel jagħmlulha kienu naċċettawx gew celebrating opinion organized am .

n: 1
model: unk

dar missieri ghamiltuha .

n: 2
model: vanilla

dar missieri ghamiltuha bejta tal providenza lanqas hallewna minn fuq il kant taghna ma jhalluk qatt tinkludi xi groupp popolari ta l kant gospel sar daqstant jirrifletti .

n: 2
model: laplace

dar missieri ghamiltuha bejta tal milied imma anqas għaraftek … .

n: 2
model: unk

dar missieri ghamiltuha .

n: 3
model: vanilla

dar missieri ghamiltuha bejta tal hallelin tistghu taqghu aktar minn tlieta u erbgħin sena ħajja garantiti minn tlieta u jekk trid tigbor xi haga fil knisja ta .

n: 3
model: laplace

dar missieri ghamiltuha bejta tal president bill hsiebiet tajbin kolla li tgawdihom kemm tiflaħ mis snin ħajja garantiti minn alla nnifsu il mara ta xi erbgħin sena .

n

In [23]:
RAMUsage()

Memory Use:  0.15687179565429688 GB


## Maltese Corpus

The complete Maltese corpus.

In total this corpus has 89,670 sentences.

The total runtime for this evaluation was about 16 minutes 49 seconds.

The memory occupied at the end of this evaluation was ~4.299GB.

In [24]:
train_lm, test_lm = getTrainTest(root='Maltese/')

Reading Files:   0%|          | 0/28 [00:00<?, ?it/s]

Parsing XML:   0%|          | 0/28 [00:00<?, ?it/s]

Building Sentences:   0%|          | 0/28 [00:00<?, ?it/s]

Paragraph:   0%|          | 0/10 [00:00<?, ?it/s]

Paragraph:   0%|          | 0/39 [00:00<?, ?it/s]

Paragraph:   0%|          | 0/12 [00:00<?, ?it/s]

Paragraph:   0%|          | 0/10 [00:00<?, ?it/s]

Paragraph:   0%|          | 0/40 [00:00<?, ?it/s]

Paragraph:   0%|          | 0/25 [00:00<?, ?it/s]

Paragraph:   0%|          | 0/21 [00:00<?, ?it/s]

Paragraph:   0%|          | 0/6 [00:00<?, ?it/s]

Paragraph:   0%|          | 0/9 [00:00<?, ?it/s]

Paragraph:   0%|          | 0/6 [00:00<?, ?it/s]

Paragraph:   0%|          | 0/17 [00:00<?, ?it/s]

Paragraph:   0%|          | 0/5 [00:00<?, ?it/s]

Paragraph:   0%|          | 0/11 [00:00<?, ?it/s]

Paragraph:   0%|          | 0/23 [00:00<?, ?it/s]

Paragraph:   0%|          | 0/4 [00:00<?, ?it/s]

Paragraph:   0%|          | 0/16 [00:00<?, ?it/s]

Paragraph:   0%|          | 0/16 [00:00<?, ?it/s]

Paragraph:   0%|          | 0/10 [00:00<?, ?it/s]

Paragraph:   0%|          | 0/11 [00:00<?, ?it/s]

Paragraph:   0%|          | 0/379 [00:00<?, ?it/s]

Paragraph:   0%|          | 0/423 [00:00<?, ?it/s]

Paragraph:   0%|          | 0/20 [00:00<?, ?it/s]

Paragraph:   0%|          | 0/240 [00:00<?, ?it/s]

Paragraph:   0%|          | 0/261 [00:00<?, ?it/s]

Paragraph:   0%|          | 0/9 [00:00<?, ?it/s]

Paragraph:   0%|          | 0/34 [00:00<?, ?it/s]

Paragraph:   0%|          | 0/6 [00:00<?, ?it/s]

Paragraph:   0%|          | 0/4 [00:00<?, ?it/s]

Counting x counts:   0%|          | 0/3234 [00:00<?, ?it/s]

Calculating Probabilities:   0%|          | 0/9058 [00:00<?, ?it/s]

Counting x counts:   0%|          | 0/3234 [00:00<?, ?it/s]

Counting x counts:   0%|          | 0/809 [00:00<?, ?it/s]

Calculating Probabilities:   0%|          | 0/3312 [00:00<?, ?it/s]

Counting x counts:   0%|          | 0/809 [00:00<?, ?it/s]

Train Corpus Size:  73874
Test Corpus Size:  15796


In [25]:
fitPredictTrain()
unigram = {}
bigram = {}
trigram = {}
interpolation = {}
perplexity = {}
predictTest()

  0%|          | 0/3 [00:00<?, ?it/s]

  0%|          | 0/3 [00:00<?, ?it/s]

Visualize some word tokens.

In [26]:
visualizeWords()

Model    Bigram           Unigram       Trigram               Linear  Interpolation
-------  ---------------  ------------  --------------------  -----------------------
Vanilla  <s> xi:0.27829%  <s>:4.37772%  <s> xi ħaġa:0.00000%  <s> xi ħaġa:0.00623%
Laplace  <s> xi:0.00007%  <s>:0.15216%  <s> xi ħaġa:0.00000%  <s> xi ħaġa:0.00000%
UNK      <s> xi:0.67404%  <s>:0.99339%  <s> xi ħaġa:0.19513%  <s> xi ħaġa:0.41863%
Model    Bigram                    Unigram          Trigram                       Linear  Interpolation
-------  ------------------------  ---------------  ----------------------------  ----------------------------
Vanilla  television </s>:0.00000%  replay:0.00000%  net television </s>:0.00000%  net television </s>:0.43777%
Laplace  television </s>:0.00000%  replay:0.00000%  net television </s>:0.00000%  net television </s>:0.01522%
UNK      lokali fuq:0.67404%       asmk:0.99339%    lokali fuq unk:0.19513%       lokali fuq unk:0.41863%
Model    Bigram               Unigram 

Visualize the perplexity.

In [27]:
visualizePerplexity()

Model      Unigram        Bigram         Trigram    Linear  Interpolation
-------  ---------  ------------  --------------  -----------------------
Vanilla       0      0               0                               0
Laplace    2168.72   1.22406e+09     1.71419e+11                110699
UNK           1.63  27.79         1199.05                          404.49


Generate sentences from the start token.

In [28]:
generateFromEmpty()

  0%|          | 0/3 [00:00<?, ?it/s]

n: 1
model: vanilla

is .

n: 1
model: laplace

u il mhux .

n: 1
model: unk

flus mmur li kif hu jkunu imma għaliex ministeru tiegħi bil il isparpaljar tal tlitt li se .

n: 2
model: vanilla

ejjew ngħidu li dan il kamra .

n: 2
model: laplace

l ipproċessar tal poplu jkollu jerġa jgħaddiha biex jiddaħħlu sakemm tinstabilhom sodda  itemm jgħidilna l possibilta tas self ġdid laburista .

n: 2
model: unk

jien qed niddiskutu llum fil mużika u l eżaminaturi fil laqgħa tal .

n: 3
model: vanilla

ta l indipendenza tie ­ għu għaxar snin qabel fl infieq tiegħu 25 ta ħin ma titħallasx fil prodott ta proġett ta kemm qed jiżdied minn .

n: 3
model: laplace

dak in nies li bih nimmiraw li nilħqu l miri tagħna ma kienx li swiet il gazzetta tikxef kif fl aħħar tal persuna tinbidel kollha f .

n: 3
model: unk

l hijiex kontra l individwu jiġi bżonn il poplu malti għandu l kunsens għal tal enerġija kif ukoll u dik l area bil permessi mill .



Generate sentences from the start sequence 'dar missieri ghamiltuha'.

In [29]:
start = ['dar', 'missieri', 'ghamiltuha']
generateFrom(start=start)

n: 1
model: vanilla

dar missieri ghamiltuha miktub li jerfa wkoll wieћed b inix li għarfien levels l qed jkun .

n: 1
model: laplace

dar missieri ghamiltuha kumpaniji .

n: 1
model: unk

dar missieri ghamiltuha l fuqha il tal biss parti l żabbar post kollha parlamentari meta għażla liġi ħafna u pjanu firma għaliex gozo permess karta att .

n: 2
model: vanilla

dar missieri ghamiltuha .

n: 2
model: laplace

dar missieri ghamiltuha .

n: 2
model: unk

dar missieri ghamiltuha .

n: 3
model: vanilla

dar missieri ghamiltuha .

n: 3
model: laplace

dar missieri ghamiltuha .

n: 3
model: unk

dar missieri ghamiltuha .



In [30]:
RAMUsage()

Memory Use:  9.392608642578125 GB


By using the same start sequence I wanted to see whether the complete LM would generate a sentence similar to that of
the Religion LM. However, this did not happen, most probably because the phrase 'dar missieri ghamiltuha' now falls in
the test set rather than the train set.

# Conclusion

For evaluation, I used the intrinsic measure of perplexity. The same trend can be observed for all the corpora used above:
the vanilla models result in 0, the Laplace models result in huge numbers, and the UNK models result in moderately low
numbers. These results confirm what was studied in class, especially for Laplace smoothing. At the beginning of the
course I doubted that Laplace smoothing would have a significant effect on a Language Model, but with results in
hand I can see how the smoothed models are able to handle unseen data. The UNK models perform very well in perplexity;
however, this is because of the number of words these models strip from the train and test corpora, which ends up
making both corpora look very similar. I think that this approach can be used for generating common phrases
or current hot topics. With this in mind, I would take future perplexity evaluations with a grain of salt and not as a
direct indicator of a model's performance, since (in my implementation) a single zero probability is enough to break
the evaluation, as the 0 reported for the vanilla models suggests.
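
To make the zero-probability problem concrete, here is a minimal sketch of perplexity computed from a list of n-gram probabilities. It is not the code from the LanguageModel package: the function names `perplexity` and `laplace` are hypothetical, and returning infinity for a zero probability is an assumption made for illustration (my implementation reports 0 in that case).

```python
import math


def perplexity(probabilities):
    """Perplexity of a test sequence, given the model probability of each n-gram.

    Hypothetical helper: PP = exp(-(1/N) * sum(log p_i)).
    """
    if not probabilities:
        raise ValueError("need at least one probability")
    # log(0) is undefined, so one unseen n-gram makes the perplexity infinite.
    if any(p == 0 for p in probabilities):
        return math.inf
    log_sum = sum(math.log(p) for p in probabilities)
    return math.exp(-log_sum / len(probabilities))


def laplace(count, context_count, vocab_size):
    """Add-one (Laplace) estimate: never zero, even when count is 0."""
    return (count + 1) / (context_count + vocab_size)


# Vanilla estimate: the second n-gram was never seen in training.
print(perplexity([0.5, 0.0]))  # inf

# Laplace estimate for the same unseen n-gram (context seen 3 times, 5 word types).
print(perplexity([0.5, laplace(0, 3, 5)]))  # approximately 4.0
```

With the smoothed estimate the unseen n-gram only lowers the score instead of invalidating it, which matches the trend in the tables above.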

Sentence generation can be considered a small part of extrinsic evaluation. Properly evaluating a model this way
would take a couple of days of rigorous testing and use. Instead, I fed the model a template for
sentence generation. I found that sentence generation did not follow a particular pattern, i.e. all models behaved
similarly to each other. However, the size of the n-grams used by the model greatly affected the legibility of the
generated sentences. For example, none of the generated unigram sentences make linguistic sense, whereas with bigrams
and trigrams the model is able to piece together 3-5 word phrases that do make sense. Unfortunately, these legible
phrases end up being stitched together into longer, less coherent sentences. The rule given for sentence generation was
to stop when either the stop token is generated or the sentence reaches 25 words. In some cases this 25-word limit is
reached, which I consider a failure by the model to properly generate a sentence. A possible improvement to
sentence generation would be to include Laplace-smoothed probabilities in the NGramModel dictionary, so that they can be
used in sentence generation as well and not just for computing the probability of a test set.
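
The stopping rule described above can be sketched as follows for a bigram model. This is an illustrative sketch, not the package's generation code: the function `generate`, its parameters, and the nested-dictionary layout of `bigram_probs` are all assumptions.

```python
import random


def generate(bigram_probs, start=("<s>",), max_words=25, stop="</s>", rng=None):
    """Sample words until the stop token is drawn or max_words is reached.

    bigram_probs is assumed to map a previous word to {next_word: probability}.
    """
    rng = rng or random.Random()
    # The start token is context only; it is not part of the output sentence.
    sentence = [word for word in start if word != "<s>"]
    previous = start[-1]
    while len(sentence) < max_words:
        candidates = bigram_probs.get(previous)
        if not candidates:
            break  # unseen context: nothing to sample from
        words = list(candidates)
        weights = [candidates[word] for word in words]
        next_word = rng.choices(words, weights=weights, k=1)[0]
        if next_word == stop:
            break  # the model chose to end the sentence
        sentence.append(next_word)
        previous = next_word
    return sentence


# Toy model with a single possible continuation at each step.
toy = {"<s>": {"il": 1.0}, "il": {"kamra": 1.0}, "kamra": {"</s>": 1.0}}
print(generate(toy))  # ['il', 'kamra']
```

In this sketch a context that was never seen in training ends the sentence immediately, while a model that never draws the stop token runs into the 25-word limit.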

In conclusion, I find that my implementation of a Language Model was successful in creating models that accommodate test
sets as well as sentence generation. Code improvements can be made to model generation and to the efficiency of error
handling. The use of larger, less diverse (by ratio) corpora would also benefit the models' performance.

# References

[1] A. Gatt and S. Čéplö, "Digital corpora and other electronic resources for Maltese", in A. Hardie and R. Love (Eds.), Corpus Linguistics, 2013, pp. 96-97.

[2] G. Pibiri and R. Venturini, "Handling Massive N-Gram Datasets Efficiently", ACM Transactions on Information Systems, vol. 37, no. 2, pp. 1-41, 2019. doi: 10.1145/3302913 [Accessed 8 April 2021].

https://towardsdatascience.com/perplexity-in-language-models-87a196019a94