# Building a Language Model
***
# Table of Contents
1.  [Setup](#Setup)
2.  [Coding Decisions](#Coding-Decisions)
3.  [Evaluation](#Evaluation)
4.  [Conclusion](#Conclusion)
5.  [References](#References)

# Setup

For this assignment I wrote the Python package LanguageModel. Code documentation and explanations are included as
docstrings inside the code. My particular coding and design choices are described in a markdown cell under the heading
[Coding Decisions](#Coding-Decisions). I am using the Maltese corpus [[1]](#References) as the dataset for this
assignment, with Python version 3.7.

I have also included an HTML file generated by Jupyter Notebook, and I recommend viewing that instead of using the
Jupyter server. Alternatively, the JetBrains PyCharm IDE, which I used, also renders the markdown components neatly.

In [1]:
# Import the LanguageModel package
import LanguageModel as lm

# Coding Decisions

In this section I go over some coding and design decisions and why I made them.

## Corpus

From the few big data applications I have worked on so far, I know that most of them make use of the numpy library,
either directly or indirectly through the pandas library. I could have used numpy and made a CorpusAsListOfNpArrays,
but since Sentences were originally an object of their own this did not cross my mind.

Another consideration was to hash/encode the words and use matrix operations to get the counts and probabilities. I
attempted this, but the process was becoming too complicated and gave no significant time improvement.

In the end I found Python list syntax very easy to understand and use, and the speed, combined with dictionaries, was
sufficient.
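
As a rough illustration of the list-and-dictionary approach described above, the sketch below stores a corpus as a list
of token lists and counts ngrams with a plain dictionary. The names `count_ngrams` and `toy_corpus` are hypothetical
and are not part of the LanguageModel package.

```python
from collections import defaultdict


def count_ngrams(corpus, n):
    """Count ngrams in a corpus given as a list of token lists."""
    counts = defaultdict(int)
    for sentence in corpus:
        # Slide a window of width n across the sentence.
        for i in range(len(sentence) - n + 1):
            counts[tuple(sentence[i:i + n])] += 1
    return dict(counts)


# Toy data for illustration only, not taken from the Maltese corpus.
toy_corpus = [
    ["<s>", "il", "kelb", "</s>"],
    ["<s>", "il", "qattus", "</s>"],
]
bigram_counts = count_ngrams(toy_corpus, n=2)
print(bigram_counts[("<s>", "il")])  # 2
```

Using tuples as dictionary keys keeps lookups cheap and avoids any separate word encoding step.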

## NGramCounts

The counts object represents the frequency counts for a given n and model. I decided to only ever store vanilla counts
because, when I implemented different counting methods, especially to account for non-appearing tokens, the code was
becoming messy and slow. By implementing a GetCount function I was able to achieve full functionality with clean code.
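
One way to picture this design is a single table of raw counts plus a getter that adjusts the value at lookup time. The
sketch below only mimics the idea: `get_count`, its `model` argument and the add-one adjustment are assumptions and do
not reproduce the package's actual GetCount signature.

```python
from collections import Counter

# Only raw ("vanilla") counts are stored; the values here are made up.
vanilla_counts = Counter({("il",): 3, ("kelb",): 1})


def get_count(ngram, model="vanilla"):
    """Return the count of `ngram`, adjusted for the requested model."""
    raw = vanilla_counts.get(ngram, 0)  # unseen ngrams have a raw count of 0
    if model == "laplace":
        return raw + 1  # add-one adjustment applied on the fly
    return raw


print(get_count(("qattus",)), get_count(("qattus",), model="laplace"))  # 0 1
```

Because nothing is precomputed per model, unseen ngrams never need to be materialised in the table.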

## NGramModel

Unlike with the frequency counts, for the probability set I calculate vanilla and Laplace smoothed probabilities
differently. However, the various methods of getting the probability for each ngram are then handled by the LanguageModel.
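
For reference, the textbook forms of the two estimates can be sketched as follows. This is a minimal sketch under the
standard definitions (maximum likelihood and add-one smoothing); the function names are hypothetical and the package
may compute its Laplace probabilities differently.

```python
def vanilla_probability(ngram_count, context_count):
    """Maximum likelihood estimate: count(ngram) / count(context)."""
    if context_count == 0:
        return 0.0  # an unseen context gives no evidence at all
    return ngram_count / context_count


def laplace_probability(ngram_count, context_count, vocabulary_size):
    """Add-one estimate: (count(ngram) + 1) / (count(context) + V)."""
    return (ngram_count + 1) / (context_count + vocabulary_size)


# An unseen ngram gets zero probability without smoothing, a small one with it.
print(vanilla_probability(0, 5), laplace_probability(0, 5, 5))  # 0.0 0.1
```

For unigrams the context count is simply the total number of tokens in the corpus.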


## LanguageModel

With the main class of the package I implemented most of the requirements of the assignment. I think I explain the code
well enough, most of the time with reasoning in the code's docstrings and comments.

~Perplexity talk

# Evaluation

In this section I create a number of LanguageModels on different corpora and evaluate them in a standard manner.

## Methodology

* First I split the chosen corpus into an 80/20 training/testing split.

* I create unigram, bigram, trigram and linear interpolation NGramModels for the three model types: vanilla, laplace
and unk. This is only done for the train LanguageModel.

* I create unigram, bigram, trigram and linear interpolation NGramCounts for the three model types: vanilla, laplace
and unk. This is done for both LanguageModels.

* Evaluate the ngrams of the test LanguageModel with the trained LanguageModel.

* Calculate the test perplexity.

* Generate a number of sentences.

## Test Corpus

This corpus was created to test out the features of the package to make sure everything works as it is supposed to.

In [2]:
# Import train_test_split from sklearn and tqdm
from sklearn.model_selection import train_test_split
from tqdm.notebook import tqdm


def getTrainTest(root):
    dataset = lm.Corpus.CorpusAsListOfSentences(root=root, verbose=True)
    train, test = train_test_split(dataset, test_size=0.2, shuffle=False)
    _train_lm = lm.LanguageModel(corpus=train, verbose=True)
    _test_lm = lm.LanguageModel(corpus=test, verbose=True)
    print("Train Corpus Size: ", _train_lm.GetNGramModel(n=1).N)
    print("Test Corpus Size: ", _test_lm.GetNGramModel(n=1).N)
    return _train_lm, _test_lm

train_lm, test_lm = getTrainTest(root='Test Corpus/')

Train Corpus Size:  96
Test Corpus Size:  24


In this step I successfully split the training and testing data. The train LM has 96 words, 16 of which are start and
end tokens and the test LM has 24 words, 4 of which are start and end tokens.
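
The reported sizes can be checked with a little arithmetic. The sketch below assumes the test corpus consists of ten
sentences of ten number words each (an assumption inferred from the outputs in this notebook, not from the corpus
files) and uses plain slicing in place of `train_test_split`; `padded_size` is a hypothetical helper.

```python
# Hypothetical stand-in for the test corpus: ten sentences of ten number words.
sentences = [[str(10 * row + col) for col in range(10)] for row in range(10)]

# Unshuffled 80/20 split, mirroring shuffle=False in getTrainTest.
cut = int(len(sentences) * 0.8)
train, test = sentences[:cut], sentences[cut:]


def padded_size(corpus):
    """Total tokens once every sentence is wrapped in <s> ... </s>."""
    return sum(len(sentence) + 2 for sentence in corpus)


print(padded_size(train), padded_size(test))  # 96 24
```

Eight training sentences contribute 16 start and end tokens, and two test sentences contribute 4, which matches the
figures above.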

In [3]:
params =    {
                "n": [1,2,3],
                "model": ["vanilla", "laplace", "unk"]
            }

def fitPredictTrain():
    for n in tqdm(params["n"]):
        for model in params["model"]:
            # probabilities are only needed for the train LM
            train_lm.GetNGramModel(n=n, model=model)
            # counts are needed for both LMs
            train_lm.GetNGramCounts(n=n, model=model)
            test_lm.GetNGramCounts(n=n, model=model)
fitPredictTrain()

In this step I successfully generate the required data for the next step.

In [4]:
from collections import OrderedDict
from operator import itemgetter


unigram = OrderedDict()
bigram = OrderedDict()
trigram = OrderedDict()
interpolation = OrderedDict()

perplexity = {}

def predictTest():
    for n in tqdm(params["n"]):
        for model in params["model"]:
            # frequency counts from the test lm
            testgrams = test_lm.GetNGramCounts(n=n,model=model)
            # predict these ngrams using the trained model
            probabilities = {}
            for gram in testgrams:
                probabilities[gram] = train_lm.GetProbability(input=gram, n=n, model=model)
            # set the test lm model to these predictions
            test_lm.SetNGramModel(probabilities=probabilities, n=n, model=model)

            # Sort the probabilities, these will be used for visualization
            sorted_tuples = sorted(probabilities.items(), key=itemgetter(1))
            # fill the appropriate ordered dict
            if n == 1:
                unigram[model] = {}
                for k, v in sorted_tuples:
                    unigram[model][k] = v
            elif n == 2:
                bigram[model] = {}
                for k, v in sorted_tuples:
                    bigram[model][k] = v
            else:
                trigram[model] = {}
                for k, v in sorted_tuples:
                    trigram[model][k] = v

            # get the perplexity of the tested model
            perplexity[tuple([n, model])] = test_lm.Perplexity(n=n, model=model)

            if n == 3:
                interpolations = {}
                # predict the ngrams using the trained model
                for gram in testgrams:
                    interpolations[gram] = train_lm.LinearInterpolation(trigram=gram, model=model)
                 # Sort the probabilities, these will be used for visualization
                sorted_tuples = sorted(interpolations.items(), key=itemgetter(1))
                # fill the appropriate ordered dict
                interpolation[model] = {}
                for k, v in sorted_tuples:
                    interpolation[model][k] = v
                # get the perplexity of the linear interpolation tested model
                perplexity[tuple(['interpolation', model])] = test_lm.Perplexity(n=n, model=model, linearInterpolation=True)

predictTest()

Now that I have successfully tested the corpus using my language model, I will show some ngram probabilities and the
model perplexities.

In [5]:
heading =   "\t|\tUnigram\t\t|\tBigram\t\t\t|\tTrigram\t\t\t\t\t|\tLinear  Interpolation"
line =  "************************************************************************************************************"\
        "***************************************"
def visualizeWords():
    # This is just me having fun with strings and Python, nothing else
    data_template =     "Vanilla\t|\t{}:{:.5f}%\t|\t{}:{:.5f}%\t|\t{}:{:.5f}%\t\t|\t{}:{:.5f}%\n" \
                        "Laplace\t|\t{}:{:.5f}%\t|\t{}:{:.5f}%\t|\t{}:{:.5f}%\t\t|\t{}:{:.5f}%\n" \
                        "UNK\t|\t{}:{:.5f}%\t|\t{}:{:.5f}%\t|\t{}:{:.5f}%\t\t|\t{}:{:.5f}%"

    words = []
    for i in range(min(len(unigram["unk"]), 5)):
        i = -i
        words.append(list(unigram["unk"].keys())[i])
        print(heading)
        print(line)
        print(data_template.format(
                " ".join([x for x in list(unigram["vanilla"].keys())[i]]),          (unigram["vanilla"][list(unigram["vanilla"].keys())[i]]) * 100,
                " ".join([x for x in list(bigram["vanilla"].keys())[i]]),           (bigram["vanilla"][list(bigram["vanilla"].keys())[i]]) * 100,
                " ".join([x for x in list(trigram["vanilla"].keys())[i]]),          (trigram["vanilla"][list(trigram["vanilla"].keys())[i]]) * 100,
                " ".join([x for x in list(interpolation["vanilla"].keys())[i]]),    (interpolation["vanilla"][list(interpolation["vanilla"].keys())[i]]) * 100,
                " ".join([x for x in list(unigram["laplace"].keys())[i]]),          (unigram["laplace"][list(unigram["laplace"].keys())[i]]) * 100,
                " ".join([x for x in list(bigram["laplace"].keys())[i]]),           (bigram["laplace"][list(bigram["laplace"].keys())[i]]) * 100,
                " ".join([x for x in list(trigram["laplace"].keys())[i]]),          (trigram["laplace"][list(trigram["laplace"].keys())[i]]) * 100,
                " ".join([x for x in list(interpolation["laplace"].keys())[i]]),    (interpolation["laplace"][list(interpolation["laplace"].keys())[i]]) * 100,
                " ".join([x for x in list(unigram["unk"].keys())[i]]),              (unigram["unk"][list(unigram["unk"].keys())[i]]) * 100,
                " ".join([x for x in list(bigram["unk"].keys())[i]]),               (bigram["unk"][list(bigram["unk"].keys())[i]]) * 100,
                " ".join([x for x in list(trigram["unk"].keys())[i]]),              (trigram["unk"][list(trigram["unk"].keys())[i]]) * 100,
                " ".join([x for x in list(interpolation["unk"].keys())[i]]),        (interpolation["unk"][list(interpolation["unk"].keys())[i]]) * 100))
        print(line)
visualizeWords()

	|	Unigram		|	Bigram			|	Trigram					|	Linear  Interpolation
***************************************************************************************************************************************************
Vanilla	|	80:0.00000%	|	<s> 80:0.00000%	|	<s> 80 81:0.00000%		|	<s> 80 81:0.00000%
Laplace	|	80:0.01487%	|	<s> 80:0.01487%	|	<s> 80 81:0.01487%		|	<s> 80 81:0.01487%
UNK	|	unk:66.94215%	|	unk unk:75.52438%	|	unk unk unk:73.14751%		|	unk unk unk:73.24003%
***************************************************************************************************************************************************


The table gives us a glimpse of how well the various models in the train LM account for the test LM.
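
The linear interpolation column can be reproduced from the other three columns as a weighted sum. In the sketch below
the weights 0.1, 0.3 and 0.6 (unigram, bigram, trigram) are inferred from the figures reported in this notebook rather
than read from the package source, and `linear_interpolation` is a hypothetical stand-in for the package method.

```python
# Weights inferred from the reported figures, not taken from the package source.
LAMBDAS = (0.1, 0.3, 0.6)  # unigram, bigram, trigram


def linear_interpolation(p_unigram, p_bigram, p_trigram, lambdas=LAMBDAS):
    """Weighted sum of the three ngram probabilities for the same final word."""
    l1, l2, l3 = lambdas
    return l1 * p_unigram + l2 * p_bigram + l3 * p_trigram


# UNK row of the table above, written as fractions instead of percentages.
estimate = linear_interpolation(0.6694215, 0.7552438, 0.7314751)
print(round(estimate * 100, 2))
```

With these weights the result comes out at roughly 73.24%, in line with the interpolated UNK value in the table.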

Since the vanilla models do not account for unknown words, the probability for these unknown ngrams is always 0. With
the other two models we get better probabilities, especially for the unk model, since most of the words in both the
test and train LMs are now the unk token.

The unk probabilities are not 100% because, while the test LM converts the <s> and </s> tokens into unk tokens as well,
the train LM does not because it contains more than 2 sentences. I would consider this a feature and not a bug, since
it can be seen as the unk model not giving much weight to sentence structure when the corpus does not have a lot of
sentences, much as it does for other words.
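
The behaviour described above (sentence markers of a two-sentence corpus become unk, while those of a larger corpus
survive) is what a simple frequency threshold would produce. The sketch below is only an illustration of that idea: the
threshold `max_rare_count=2` and the function `replace_rare_tokens` are assumptions, not the package's implementation.

```python
from collections import Counter


def replace_rare_tokens(corpus, max_rare_count=2, unk="unk"):
    """Replace every token seen at most `max_rare_count` times with `unk`."""
    frequencies = Counter(token for sentence in corpus for token in sentence)
    return [
        [token if frequencies[token] > max_rare_count else unk for token in sentence]
        for sentence in corpus
    ]


# With only two sentences, even <s> and </s> are rare enough to be replaced.
two_sentences = [["<s>", "80", "81", "</s>"], ["<s>", "90", "91", "</s>"]]
print(replace_rare_tokens(two_sentences)[0])  # ['unk', 'unk', 'unk', 'unk']
```

Adding a third sentence pushes the markers above the threshold, so they are kept while the number words still collapse
into unk.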

In [6]:
def visualizePerplexity():
    # Somewhat cleaner than the one above
    perplexity_template =       "Vanilla\t|\t{:.2f}\t\t|\t{:.2f}\t\t\t|\t{:.2f}\t\t\t|\t{:.2f}\n" \
                                "Laplace\t|\t{:.2f}\t\t|\t{:.2f}\t\t\t|\t{:.2f}\t\t\t|\t{:.2f}\n" \
                                    "UNK\t|\t{:.2f}\t\t|\t{:.2f}\t\t\t|\t{:.2f}\t\t\t|\t{:.2f}"

    print(heading)
    print(line)
    print(perplexity_template.format(   perplexity[tuple([1, "vanilla"])], perplexity[tuple([2, "vanilla"])], perplexity[tuple([3, "vanilla"])], perplexity[tuple(["interpolation", "vanilla"])],
                                        perplexity[tuple([1, "laplace"])], perplexity[tuple([2, "laplace"])], perplexity[tuple([3, "laplace"])], perplexity[tuple(["interpolation", "laplace"])],
                                        perplexity[tuple([1, "unk"])],     perplexity[tuple([2, "unk"])],     perplexity[tuple([3, "unk"])],     perplexity[tuple(["interpolation", "unk"])]))
    print(line)
visualizePerplexity()

	|	Unigram		|	Bigram			|	Trigram					|	Linear  Interpolation
***************************************************************************************************************************************************
Vanilla	|	0.00		|	0.00			|	0.00			|	0.00
Laplace	|	6477966.12		|	10406806.00			|	2395408.07			|	188.58
UNK	|	1.03		|	1.02			|	1.03			|	1.03
***************************************************************************************************************************************************


Given the context and the probabilities shown above, the perplexities of the models make sense. The vanilla models
practically do not account for the test LM, the Laplace models have very large perplexities because they assign only
very small probabilities to the test ngrams, and the unk models have a very good perplexity of almost 1, again because
most tokens are converted into unk tokens.
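
For reference, a minimal sketch of the textbook perplexity computation, the exponential of the average negative log
probability. The function below is hypothetical and is not the package's `Perplexity` method. Under this definition a
single zero probability makes the perplexity infinite, so the 0.00 shown for the vanilla models presumably reflects how
the package handles that case rather than a perfect score.

```python
import math


def perplexity(probabilities):
    """Perplexity of a sequence, given the probability the model assigned to each token."""
    if any(p == 0 for p in probabilities):
        return math.inf  # textbook convention for an impossible sequence
    log_sum = sum(math.log(p) for p in probabilities)
    return math.exp(-log_sum / len(probabilities))


# A model that is always 25% sure behaves like a uniform choice among 4 options.
print(perplexity([0.25, 0.25, 0.25, 0.25]))
```

Probabilities close to 1 give a perplexity close to 1, which is the pattern seen in the unk row, and tiny probabilities
give very large values, as in the Laplace row.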

Now that I have evaluated the model intrinsically via perplexity, I can do a small extrinsic evaluation by generating two
sentences from each model in the trained Language Model. One will be given no start, while another will be given a
sequence for it to continue.

In [7]:
def generateFromEmpty():
    for n in tqdm(params["n"]):
        for model in params["model"]:
            print("n: {}\nmodel: {}\n".format(n,model))
            generated = train_lm.GenerateSentence(n=n, model=model, verbose=True)
            for w in generated:
                print(w, end=' ')
            print(".\n")
generateFromEmpty()

n: 1
model: vanilla

55 22 32 24 .

n: 1
model: laplace

59 69 26 78 26 40 26 66 71 45 10 76 42 11 74 28 .

n: 1
model: unk

.

n: 2
model: vanilla

70 71 72 73 74 75 76 77 78 79 .

n: 2
model: laplace

0 1 2 3 4 5 6 7 8 9 .

n: 2
model: unk

.

n: 3
model: vanilla

20 21 22 23 24 25 26 27 28 29 .

n: 3
model: laplace

70 71 72 73 74 75 76 77 78 79 .

n: 3
model: unk

.




The sentence generation output makes sense for these reasons:

* Generating a sentence out of unk/<s>/</s> tokens is impossible.
* The bigram and trigram generations were able to complete the _0-_9 count.
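
Why the bigram model completes the count can be seen in a small sampling sketch. It assumes training sentences that
count through ten consecutive numbers (inferred from the outputs above) and uses hypothetical helpers `train_bigrams`
and `generate`; it is not the package's GenerateSentence method.

```python
import random
from collections import defaultdict


def train_bigrams(corpus):
    """Map each token to the list of tokens that followed it in training."""
    successors = defaultdict(list)
    for sentence in corpus:
        padded = ["<s>"] + sentence + ["</s>"]
        for current, following in zip(padded, padded[1:]):
            successors[current].append(following)
    return successors


def generate(successors, start="<s>", max_length=25, rng=random):
    """Sample successors in proportion to their counts until </s> is drawn."""
    sentence, current = [], start
    for _ in range(max_length):
        options = successors.get(current)
        if not options:
            break  # unseen context: nothing to continue with
        current = rng.choice(options)
        if current == "</s>":
            break
        sentence.append(current)
    return sentence


# Assumed training data: eight sentences counting 0-9, 10-19, ..., 70-79.
counting_corpus = [[str(10 * row + col) for col in range(10)] for row in range(8)]
model = train_bigrams(counting_corpus)
print(" ".join(generate(model, start="20")))  # 21 22 23 24 25 26 27 28 29
```

Every number has exactly one observed successor, so only the first word after the start token is random and the rest
of the count follows deterministically.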

In [8]:
def generateFrom(start):
    for n in tqdm(params["n"]):
        sentence = start[:-1]
        for model in params["model"]:
            generated = train_lm.GenerateSentence(start=start[-1], n=n, model=model, verbose=True)
            print("n: {}\nmodel: {}\n".format(n,model))
            given_and_generated = sentence + generated
            for w in given_and_generated:
                print(w, end=' ')
            print(".\n")
            
start = ['20', '21', '22']
generateFrom(start=start)

n: 1
model: vanilla

20 21 22 49 65 76 19 23 10 26 20 23 72 19 45 2 30 .

n: 1
model: laplace

20 21 22 31 35 17 3 43 40 25 73 69 20 67 19 39 36 62 50 53 14 23 73 9 55 34 56 .

n: 1
model: unk

20 21 22 .

n: 2
model: vanilla

20 21 22 23 24 25 26 27 28 29 .

n: 2
model: laplace

20 21 22 23 24 25 26 27 28 29 .

n: 2
model: unk

20 21 22 .

n: 3
model: vanilla

20 21 22 23 24 25 26 27 28 29 .

n: 3
model: laplace

20 21 22 23 24 25 26 27 28 29 .

n: 3
model: unk

20 21 22 .



The sentence generation output makes sense for these reasons:

* Generating a sentence out of unk/<s>/</s> tokens is impossible.
* The bigram and trigram generations were able to complete the 20-29 count.

Now I will repeat the above steps for the other corpora.

## Sports Corpus

This corpus is a subset of the larger complete Maltese corpus.

In [9]:
train_lm, test_lm = getTrainTest(root='Sports/')
fitPredictTrain()
unigram = OrderedDict()
bigram = OrderedDict()
trigram = OrderedDict()
interpolation = OrderedDict()
perplexity = {}
predictTest()
visualizeWords()
print("\n\n\t\t\t\t\tPerplexity\n\n")
visualizePerplexity()
print("\n\n\t\t\t\t\tSentence From Start\n\n")
generateFromEmpty()
print("\n\n\t\t\t\t\tSentence From an Input\ndr george abela\n")
start = ['dr', 'george', 'abela']
generateFrom(start=start)

Train Corpus Size:  192
Test Corpus Size:  40


	|	Unigram		|	Bigram			|	Trigram					|	Linear  Interpolation
***************************************************************************************************************************************************
Vanilla	|	din:0.00000%	|	<s> din:0.00000%	|	<s> din l:0.00000%		|	din l aħbar:0.00000%
Laplace	|	delre:0.00419%	|	<s> din:0.00731%	|	<s> din l:0.00731%		|	l istess delre:0.00699%
UNK	|	<s>:32.73160%	|	<s> unk:29.24005%	|	<s> unk l:21.22207%		|	<s> unk l:24.77842%
***************************************************************************************************************************************************
	|	Unigram		|	Bigram			|	Trigram					|	Linear  Interpolation
***************************************************************************************************************************************************
Vanilla	|	</s>:4.68750%	|	2014 </s>:0.00000%	|	© 2014 </s>:0.00000%		|	© 2014 </s>:0.46875%
Laplace	|	</s>:0.10473%	|	2014 </s>:0.00731%	|	© 2014 </s>:0.00731%		|	© 2014 

n: 1
model: vanilla

l tal l dr klassi jattendi aktar b fuq il ieħor fil klassi eċċ ieħor maltemp se finali ħadd fiat staġun .

n: 1
model: laplace

karozza sehem aktar bdiet tagħhom bil tagħhom filwaqt .

n: 1
model: unk

fil ħadd l fuq asmk muturi l fuq ħadd rebaħ asmk il ħadd muturi ta fuq .

n: 2
model: vanilla

nhar il gżejjer maltin kmieni filgħodu dan kien ta klassi b rebaħ christian galea fuq suzuki swift ġabru l aktar punti fil ġurnata .

n: 2
model: laplace

fl ewwel attvitá ħadu sehem 27 ta malta l muturi .

n: 2
model: unk

fil .

n: 3
model: vanilla

il plejer internazzjonali malti nicki delre ngħaqad ma rebaħ il president tal ġurnata .

n: 3
model: laplace

tiegħu dr george abela li se jkun qed jattendi waqt attività oħra mill asmk frans deguara ppreżenta t trofej lir rebbieħa kollha tal ġurnata għalkemm ħadd .

n: 3
model: unk

il fuq il tal asmk il asmk rebaħ u l muturi u karozzi u l .



					Sentence From an Input
din l aħbar



n: 1
model: vanilla

din l aħbar .

n: 1
model: laplace

din l aħbar fil galea fil ġurnata sewwieqa sehem scicluna galea ħadd il tilqà fil tal ħadu karozza wħud gżejjer ħadd fiat .

n: 1
model: unk

din l aħbar .

n: 2
model: vanilla

din l aħbar .

n: 2
model: laplace

din l aħbar .

n: 2
model: unk

din l aħbar .

n: 3
model: vanilla

din l aħbar .

n: 3
model: laplace

din l aħbar .

n: 3
model: unk

din l aħbar .



## Religion Corpus

This corpus is a subset of the larger complete Maltese corpus.


In [10]:
train_lm, test_lm = getTrainTest(root='Religion/')
fitPredictTrain()
unigram = OrderedDict()
bigram = OrderedDict()
trigram = OrderedDict()
interpolation = OrderedDict()
perplexity = {}
predictTest()
visualizeWords()
print("\n\n\t\t\t\t\tPerplexity\n\n")
visualizePerplexity()

print("\n\n\t\t\t\t\tSentence From Start\n\n")
generateFromEmpty()

print("\n\n\t\t\t\t\tSentence From an Input\ndar missieri ghamiltuha\n")
start = ['dar', 'missieri', 'ghamiltuha']
generateFrom(start=start)

Train Corpus Size:  1407
Test Corpus Size:  388


	|	Unigram		|	Bigram			|	Trigram					|	Linear  Interpolation
***************************************************************************************************************************************************
Vanilla	|	jgħannulu:0.00000%	|	ħa jgħannulu:0.00000%	|	<s> ħa jgħannulu:0.00000%		|	<s> ħa jgħannulu:0.00000%
Laplace	|	ħa:0.00010%	|	ħa jgħannulu:0.00026%	|	<s> ħa jgħannulu:0.00026%		|	min dahal ghax:0.00024%
UNK	|	<s>:15.68370%	|	<s> unk:14.50730%	|	<s> unk unk:9.66034%		|	<s> unk unk:11.71676%
***************************************************************************************************************************************************
	|	Unigram		|	Bigram			|	Trigram					|	Linear  Interpolation
***************************************************************************************************************************************************
Vanilla	|	</s>:5.75693%	|	belongs to:100.00000%	|	maltese church </s>:100.00000%		|	the church the:6.35537%
Laplace	|	</s>:0.16365%	|	t

n: 1
model: vanilla

reached state st operazzjonijiet s the missieri was simply .

n: 1
model: laplace

ikun faces xahrejn have kolla tal it thalliet ministri the .

n: 1
model: unk

flus nifhem the .

n: 2
model: vanilla

st john s co cathedral belongs to combine the house of holy mass should apologize for praying .

n: 2
model: laplace

fi ftit kliem riedet tkun mara reġgħet stejqret u xi nghidu ghal kant gospel .

n: 2
model: unk

how dare you .

n: 3
model: vanilla

ghalkhemm naqbel ma certu argumenti li nixtiequ nipromwovu .

n: 3
model: laplace

dar missieri ghamiltuha bejta tal hallelin tistghu taqghu aktar fil knisja ta wara .

n: 3
model: unk

is the president .



					Sentence From an Input
ħa jgħannulu



n: 1
model: vanilla

ħa jgħannulu it according an persuni minn term iħallas malta .

n: 1
model: laplace

ħa jgħannulu place fault independent waħda jfaħħru in qalbha ghalkhemm avukati our cħad approach organized bi tirreklama l u triq the saviour qed both ohra and .

n: 1
model: unk

ħa jgħannulu ta ghal this .

n: 2
model: vanilla

ħa jgħannulu .

n: 2
model: laplace

ħa jgħannulu .

n: 2
model: unk

ħa jgħannulu .

n: 3
model: vanilla

ħa jgħannulu .

n: 3
model: laplace

ħa jgħannulu .

n: 3
model: unk

ħa jgħannulu .



# Conclusion

# References

[1] Gatt, A., & Čéplö, S., Digital corpora and other electronic resources for Maltese. In A. Hardie, & R. Love (Eds.), Corpus Linguistics, 2013, pp. 96-97

[2] G. Pibiri and R. Venturini, "Handling Massive N -Gram Datasets Efficiently", ACM Transactions on Information Systems, vol. 37, no. 2, pp. 1-41, 2019. Available: 10.1145/3302913 [Accessed 8 April 2021].

[3] https://towardsdatascience.com/perplexity-in-language-models-87a196019a94