# Building a Language Model
***
# Table of Contents
1.  [Setup](#Setup)
2.  [Coding Decisions](#Coding-Decisions)
3.  [Evaluation](#Evaluation)
4.  [References](#References)

# Setup

For this assignment I wrote the Python package LanguageModel; code documentation and explanations are
included as docstrings inside the code. I also use the Malti [[1]](#References) corpus as the dataset for this assignment.

In [1]:
# Import all the classes from the LanguageModel package
from LanguageModel import LanguageModel, NGramModel, NGramCounts, Corpus
from sklearn.model_selection import train_test_split
from tqdm.notebook import tqdm

# Coding Decisions

## Corpus

## NGramCounts

## NGramModel

## LanguageModel

# Evaluation

In this section I create a number of LanguageModels on different corpora and evaluate them in a standard manner.

## Methodology

* First I split the chosen corpus into an 80/20 training/testing split.

* I create unigram, bigram, trigram and linear interpolation NGramModels for the three model types: vanilla, laplace
and unk. This is only done for the train LanguageModel.

* I create unigram, bigram, trigram and linear interpolation NGramCounts for the three model types: vanilla, laplace
and unk. This is done for both LanguageModels.

* Score the n-grams of the test LanguageModel with the trained LanguageModel.

* Calculate the test perplexity.

* Generate a number of sentences.
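
As a rough illustration of the three model types named above, the sketch below counts padded n-grams and computes a bigram probability with relative frequencies (vanilla) and with add-one smoothing (laplace), and replaces rare words with `UNK`. It is not the code of the LanguageModel package: the function names, the single `<s>`/`</s>` padding and the `min_count` threshold are assumptions made for this example.

```python
from collections import Counter


def ngram_counts(sentences, n):
    """Count n-grams over sentences padded with one <s> and one </s> (assumed padding)."""
    counts = Counter()
    for sentence in sentences:
        tokens = ["<s>"] + list(sentence) + ["</s>"]
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def bigram_probability(bigram, bigrams, unigrams, vocab_size, laplace=False):
    """P(w2 | w1) as a relative frequency (vanilla) or with add-one smoothing (laplace)."""
    history_count = unigrams[(bigram[0],)]
    if laplace:
        return (bigrams[bigram] + 1) / (history_count + vocab_size)
    return bigrams[bigram] / history_count if history_count else 0.0


def replace_rare(sentences, min_count=2):
    """UNK variant: map words seen fewer than min_count times to 'UNK' (hypothetical threshold)."""
    freq = Counter(word for sentence in sentences for word in sentence)
    return [[w if freq[w] >= min_count else "UNK" for w in s] for s in sentences]


corpus = [["a", "b"], ["a", "c"]]
unigrams = ngram_counts(corpus, 1)
bigrams = ngram_counts(corpus, 2)
vocab_size = len(unigrams)  # word types, including <s> and </s>

p_vanilla = bigram_probability(("a", "b"), bigrams, unigrams, vocab_size)
p_laplace = bigram_probability(("a", "b"), bigrams, unigrams, vocab_size, laplace=True)
unk_corpus = replace_rare(corpus)
```

In this toy corpus an unseen bigram such as `("b", "c")` gets probability 0 in the vanilla setting and a small non-zero probability under add-one smoothing, which is the motivation for comparing the model types.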

## Test Corpus

This corpus was created to test out the features of the package to make sure everything works as it is supposed to.

In [2]:
def getTrainTest(root):
    dataset = Corpus.CorpusAsListOfSentences(root=root, verbose=True)
    train, test = train_test_split(dataset, test_size=0.2)
    _train_lm = LanguageModel(corpus=train, verbose=True)
    _test_lm = LanguageModel(corpus=test, verbose=True)
    print("Train Corpus Size: ", _train_lm.GetNGramModel(n=1).N)
    print("Test Corpus Size: ", _test_lm.GetNGramModel(n=1).N)
    return _train_lm, _test_lm

train_lm, test_lm = getTrainTest(root='Test Corpus/')

Reading Files:   0%|          | 0/1 [00:00<?, ?it/s]

Parsing XML:   0%|          | 0/1 [00:00<?, ?it/s]

Building Sentences:   0%|          | 0/1 [00:00<?, ?it/s]

Paragraph:   0%|          | 0/1 [00:00<?, ?it/s]

Counting x counts:   0%|          | 0/8 [00:00<?, ?it/s]

Calculating Probabilities:   0%|          | 0/82 [00:00<?, ?it/s]

Counting x counts:   0%|          | 0/2 [00:00<?, ?it/s]

Calculating Probabilities:   0%|          | 0/22 [00:00<?, ?it/s]

Train Corpus Size:  96
Test Corpus Size:  24


In this step I successfully split the training and testing data. The train LM has 96 words, 16 of which are start and
end tokens and the test LM has 24 words, 4 of which are start and end tokens.
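
These figures can be sanity-checked with a little arithmetic. The sketch assumes that every sentence is padded with exactly one `<s>` and one `</s>` token, which is an assumption about the package rather than something shown above; the helper name is hypothetical.

```python
def sentences_from_markers(marker_tokens):
    """Each sentence is assumed to contribute one <s> and one </s> token."""
    return marker_tokens // 2


train_sentences = sentences_from_markers(16)  # markers reported for the train LM
test_sentences = sentences_from_markers(4)    # markers reported for the test LM

# Share of sentences that ended up in the training split
train_share = train_sentences / (train_sentences + test_sentences)

# Average number of real words per training sentence
words_per_train_sentence = (96 - 16) / train_sentences

print(train_sentences, test_sentences, train_share, words_per_train_sentence)
```

Under that assumption the split holds 8 training and 2 testing sentences, which matches the requested 80/20 split.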

In [3]:
params =    {
                "n": [1,2,3],
                "model": ["vanilla", "laplace", "unk"]
            }

def fitPredictTrain():
    # Build every n-gram order and model type for the train and test LMs
    for n in tqdm(params["n"]):
        for model in params["model"]:
            train_lm.GetNGramModel(n=n, model=model)
            test_lm.GetNGramModel(n=n, model=model)
fitPredictTrain()

  0%|          | 0/3 [00:00<?, ?it/s]

In this step I successfully generate the required data for the next step.

In [4]:
from collections import OrderedDict
from operator import itemgetter


unigram = OrderedDict()
bigram = OrderedDict()
trigram = OrderedDict()
interpolation = OrderedDict()

perplexity = {}

def predictTest():
    for n in tqdm(params["n"]):
        for model in params["model"]:
            # frequency counts from the test lm
            testgrams = test_lm.GetNGramCounts(n=n,model=model)
            # predict these ngrams using the trained model
            probabilities = {}
            for gram in testgrams:
                probabilities[gram] = train_lm.GetProbability(input=gram, n=n, model=model)
            # set the test lm model to these predictions
            test_lm.SetNGramModel(probabilities=probabilities, n=n, model=model)

            # Keep the probabilities for visualization, in the order of testgrams
            if n == 1:
                unigram[model] = probabilities
            elif n == 2:
                bigram[model] = probabilities
            else:
                trigram[model] = probabilities

            # get the perplexity of the tested model
            perplexity[tuple([n, model])] = test_lm.Perplexity(n=n, model=model)

            if n == 3:
                interpolations = {}
                # predict the ngrams using the trained model
                for gram in testgrams:
                    interpolations[gram] = train_lm.LinearInterpolation(trigram=gram, model=model)
                # Keep the interpolated probabilities for visualization, in the order of testgrams
                interpolation[model] = interpolations
                # get the perplexity of the linear interpolation tested model
                perplexity[tuple(['interpolation', model])] = test_lm.Perplexity(n=n, model=model, linearInterpolation=True)

predictTest()

  0%|          | 0/3 [00:00<?, ?it/s]

Now that I have successfully tested the corpus using my language model, I will show some n-gram probabilities and the
model perplexities.
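
For readers unfamiliar with linear interpolation, the idea is to mix the unigram, bigram and trigram estimates of the same word with weights that sum to one. The sketch below is a generic illustration, not the `LinearInterpolation` method of the package; the function name and the weights `(0.1, 0.3, 0.6)` are assumptions chosen for the example.

```python
def linear_interpolation(p_unigram, p_bigram, p_trigram, lambdas=(0.1, 0.3, 0.6)):
    """Weighted mix of the three estimates; the weights are hypothetical and must sum to 1."""
    l1, l2, l3 = lambdas
    if abs(l1 + l2 + l3 - 1.0) > 1e-9:
        raise ValueError("interpolation weights must sum to 1")
    return l1 * p_unigram + l2 * p_bigram + l3 * p_trigram


# A trigram unseen in training (probability 0) still receives some mass
# from the lower-order estimates.
p_mixed = linear_interpolation(0.2, 0.5, 0.0)
print(round(p_mixed, 4))
```

The benefit of the mix is that a zero trigram estimate no longer forces the whole probability to zero, as long as a lower-order estimate is non-zero.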

In [5]:
heading =   "\t|\tUnigram\t\t|\tBigram\t\t\t|\tTrigram\t\t\t|\tLinear  Interpolation"
line =  "************************************************************************************************************"\
        "******************"
def visualizeWords():
    # This is just me having some fun with strings and Python, nothing else
    data_template =     "Vanilla\t|\t{}:{:.1f}%\t|\t{}:{:.1f}%\t|\t{}:{:.1f}%\t|\t{}:{:.1f}%\n" \
                        "Laplace\t|\t{}:{:.1f}%\t|\t{}:{:.1f}%\t|\t{}:{:.1f}%\t|\t{}:{:.1f}%\n" \
                        "UNK\t|\t{}:{:.1f}%\t|\t{}:{:.1f}%\t|\t{}:{:.1f}%\t|\t{}:{:.1f}%"

    words = []
    for i in range(min(len(unigram["unk"]), 5)):
        i = -i
        words.append(list(unigram["unk"].keys())[i])
        print(heading)
        print(line)
        print(data_template.format(
                " ".join([f'{x[:5]:<5}' for x in list(unigram["vanilla"].keys())[i]]), (unigram["vanilla"][list(unigram["vanilla"].keys())[i]]) * 100,
                " ".join([f'{x[:5]:<5}' for x in list(bigram["vanilla"].keys())[i]]), (bigram["vanilla"][list(bigram["vanilla"].keys())[i]]) * 100,
                " ".join([f'{x[:5]:<5}' for x in list(trigram["vanilla"].keys())[i]]), (trigram["vanilla"][list(trigram["vanilla"].keys())[i]]) * 100,
                " ".join([f'{x[:5]:<5}' for x in list(interpolation["vanilla"].keys())[i]]), (interpolation["vanilla"][list(interpolation["vanilla"].keys())[i]]) * 100,
                " ".join([f'{x[:5]:<5}' for x in list(unigram["laplace"].keys())[i]]), (unigram["laplace"][list(unigram["laplace"].keys())[i]]) * 100,
                " ".join([f'{x[:5]:<5}' for x in list(bigram["laplace"].keys())[i]]), (bigram["laplace"][list(bigram["laplace"].keys())[i]]) * 100,
                " ".join([f'{x[:5]:<5}' for x in list(trigram["laplace"].keys())[i]]), (trigram["laplace"][list(trigram["laplace"].keys())[i]]) * 100,
                " ".join([f'{x[:5]:<5}' for x in list(interpolation["laplace"].keys())[i]]), (interpolation["laplace"][list(interpolation["laplace"].keys())[i]]) * 100,
                " ".join([f'{x[:5]:<5}' for x in list(unigram["unk"].keys())[i]]), (unigram["unk"][list(unigram["unk"].keys())[i]]) * 100,
                " ".join([f'{x[:5]:<5}' for x in list(bigram["unk"].keys())[i]]), (bigram["unk"][list(bigram["unk"].keys())[i]]) * 100,
                " ".join([f'{x[:5]:<5}' for x in list(trigram["unk"].keys())[i]]), (trigram["unk"][list(trigram["unk"].keys())[i]]) * 100,
                " ".join([f'{x[:5]:<5}' for x in list(interpolation["unk"].keys())[i]]), (interpolation["unk"][list(interpolation["unk"].keys())[i]]) * 100))
        print(line)
visualizeWords()

	|	Unigram		|	Bigram			|	Trigram			|	Linear  Interpolation
******************************************************************************************************************************
Vanilla	|	<s>  :8.3%	|	<s>   80   :0.0%	|	<s>   80    81   :0.0%	|	<s>   80    81   :0.0%
Laplace	|	<s>  :0.3%	|	<s>   80   :100.0%	|	<s>   80    81   :100.0%	|	<s>   80    81   :100.0%
UNK	|	UNK  :66.9%	|	UNK   UNK  :75.5%	|	UNK   UNK   UNK  :73.1%	|	UNK   UNK   UNK  :73.2%
******************************************************************************************************************************


In [6]:
def visualizePerplexity():
    # Somewhat cleaner than the one above
    perplexity_template =       "Vanilla\t|\t\t{:.2f}\t|\t\t{:.2f}\t\t|\t\t{:.2f}\t\t|\t\t{:.2f}\n" \
                                "Laplace\t|\t\t{:.2f}\t|\t\t{:.2f}\t\t|\t\t{:.2f}\t\t|\t\t{:.2f}\n" \
                                "UNK\t|\t\t{:.2f}\t|\t\t{:.2f}\t\t|\t\t{:.2f}\t\t|\t\t{:.2f}"

    print(heading)
    print(line)
    print(perplexity_template.format(   perplexity[tuple([1, "vanilla"])], perplexity[tuple([2, "vanilla"])], perplexity[tuple([3, "vanilla"])], perplexity[tuple(["interpolation", "vanilla"])],
                                        perplexity[tuple([1, "laplace"])], perplexity[tuple([2, "laplace"])], perplexity[tuple([3, "laplace"])], perplexity[tuple(["interpolation", "laplace"])],
                                        perplexity[tuple([1, "unk"])],     perplexity[tuple([2, "unk"])],     perplexity[tuple([3, "unk"])],     perplexity[tuple(["interpolation", "unk"])]))
    print(line)
visualizePerplexity()

	|	Unigram		|	Bigram			|	Trigram			|	Linear  Interpolation
******************************************************************************************************************************
Vanilla	|		0.00	|		0.00		|		0.00		|		0.00
Laplace	|		2.70	|		1.00		|		1.00		|		1.01
UNK	|		1.03	|		1.02		|		1.03		|		1.03
******************************************************************************************************************************


Now that I have evaluated the model intrinsically via perplexity, I can do a small extrinsic evaluation by generating two
sentences from each model in the trained LanguageModel. One is given no start, while the other is given a
sequence to continue.
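
To make the generation step concrete, the sketch below samples a sentence from a bigram model, word by word, until the end token is drawn. It is a simplified stand-in for `GenerateSentence`, not its actual implementation; the function name, the nested-dictionary layout of `bigram_probs` and the `max_len` cap are assumptions made for this example.

```python
import random


def generate_sentence(bigram_probs, start=(), max_len=25, rng=None):
    """Sample words from P(next | previous) until </s>, a dead end, or max_len words.

    bigram_probs maps a history word to a dict of {next_word: probability}.
    """
    rng = rng or random.Random()
    words = list(start)
    previous = words[-1] if words else "<s>"
    while len(words) < max_len:
        candidates = bigram_probs.get(previous)
        if not candidates:
            break  # history never seen in training: nothing to continue with
        next_word = rng.choices(list(candidates), weights=list(candidates.values()))[0]
        if next_word == "</s>":
            break
        words.append(next_word)
        previous = next_word
    return words


# A toy model in which every history has exactly one continuation
toy_model = {"<s>": {"0": 1.0}, "0": {"1": 1.0}, "1": {"2": 1.0}, "2": {"</s>": 1.0}}
from_empty = generate_sentence(toy_model)
from_start = generate_sentence(toy_model, start=["1"])
```

With a start sequence, generation simply continues from its last word; a history that was never seen in training ends the sentence immediately.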

In [7]:
def generateFromEmpty():
    for n in tqdm(params["n"]):
        for model in params["model"]:
            print("n: {}\nmodel: {}\n".format(n,model))
            generated = train_lm.GenerateSentence(start='', n=n, model=model, verbose=True)
            for w in generated:
                print(w, end=' ')
            print(".\n")
generateFromEmpty()

  0%|          | 0/3 [00:00<?, ?it/s]

n: 1
model: vanilla

 3 25 39 76 .

n: 1
model: laplace

 95 51 64 22 8 71 29 50 28 95 96 17 79 74 99 90 52 11 63 19 .

n: 1
model: unk

 UNK .

n: 2
model: vanilla

 .

n: 2
model: laplace

 .

n: 2
model: unk

 .

n: 3
model: vanilla

 .

n: 3
model: laplace

 .

n: 3
model: unk

 .



In [8]:
def generateFrom(start):
    for n in tqdm(params["n"]):
        _start = start[0:n]
        for model in params["model"]:
            generated = train_lm.GenerateSentence(start=_start, n=n, model=model, verbose=True)
            print("n: {}\nmodel: {}\n".format(n,model))
            for w in generated:
                print(w, end=' ')
            print(".\n")
            
start = ['0', '1', '2']
generateFrom(start=start)

  0%|          | 0/3 [00:00<?, ?it/s]

n: 1
model: vanilla

0 37 9 8 64 67 52 22 24 16 55 70 61 59 6 11 37 .

n: 1
model: laplace

0 8 57 36 56 92 1 38 69 3 95 30 56 30 58 4 76 13 90 3 90 78 95 64 8 .

n: 1
model: unk

0 .

n: 2
model: vanilla

0 1 2 3 4 5 6 7 8 9 .

n: 2
model: laplace

0 1 2 3 4 5 6 7 8 9 .

n: 2
model: unk

0 1 .

n: 3
model: vanilla

0 1 2 3 4 5 6 7 8 9 .

n: 3
model: laplace

0 1 2 3 4 5 6 7 8 9 .

n: 3
model: unk

0 1 2 .



Now I will repeat the above steps for the other corpora.

## Sports Corpus

This corpus is a subset of the larger complete Maltese corpus.

In [9]:
train_lm, test_lm = getTrainTest(root='Sports/')
fitPredictTrain()
unigram = OrderedDict()
bigram = OrderedDict()
trigram = OrderedDict()
interpolation = OrderedDict()
perplexity = {}
predictTest()
visualizeWords()
visualizePerplexity()
generateFromEmpty()
start = ['jien', 'irrid', 'lil']
generateFrom(start=start)

Reading Files:   0%|          | 0/2 [00:00<?, ?it/s]

Parsing XML:   0%|          | 0/2 [00:00<?, ?it/s]

Building Sentences:   0%|          | 0/2 [00:00<?, ?it/s]

Paragraph:   0%|          | 0/6 [00:00<?, ?it/s]

Paragraph:   0%|          | 0/4 [00:00<?, ?it/s]

Counting x counts:   0%|          | 0/9 [00:00<?, ?it/s]

Calculating Probabilities:   0%|          | 0/110 [00:00<?, ?it/s]

Counting x counts:   0%|          | 0/3 [00:00<?, ?it/s]

Calculating Probabilities:   0%|          | 0/49 [00:00<?, ?it/s]

Train Corpus Size:  171
Test Corpus Size:  61


  0%|          | 0/3 [00:00<?, ?it/s]

  0%|          | 0/3 [00:00<?, ?it/s]

	|	Unigram		|	Bigram			|	Trigram			|	Linear  Interpolation
******************************************************************************************************************************
Vanilla	|	<s>  :5.3%	|	<s>   fil  :0.0%	|	<s>   fil   final:0.0%	|	<s>   fil   final:0.0%
Laplace	|	<s>  :0.1%	|	<s>   fil  :100.0%	|	<s>   fil   final:100.0%	|	<s>   fil   final:100.0%
UNK	|	<s>  :41.3%	|	<s>   UNK  :35.3%	|	<s>   UNK   UNK  :31.7%	|	<s>   UNK   UNK  :33.7%
******************************************************************************************************************************
	|	Unigram		|	Bigram			|	Trigram			|	Linear  Interpolation
******************************************************************************************************************************
Vanilla	|	mutur:1.2%	|	mutur </s> :50.0%	|	l     mutur </s> :100.0%	|	l     mutur </s> :0.5%
Laplace	|	mutur:0.0%	|	mutur </s> :0.0%	|	l     mutur </s> :0.0%	|	l     mutur </s> :90.0%
UNK	|	</s> :41.3%	|	UNK   </s> :35.3%	|	U

  0%|          | 0/3 [00:00<?, ?it/s]

n: 1
model: vanilla

 il .

n: 1
model: laplace

 ieħor attività maltin nicki eċċ © l t tasal karozzi staġun tasal tal kowċ karozzi sewwieqa l tilqà sewwieqa ħadd staġun vantaġġ lir lill .

n: 1
model: unk

 UNK .

n: 2
model: vanilla

 .

n: 2
model: laplace

 .

n: 2
model: unk

 .

n: 3
model: vanilla

 .

n: 3
model: laplace

 .

n: 3
model: unk

 .



  0%|          | 0/3 [00:00<?, ?it/s]

n: 1
model: vanilla

jien il tilqà punti tilqà ottubru tal ġodda l sport il ta dan karozzi sport l president .

n: 1
model: laplace

jien .

n: 1
model: unk

jien UNK u l UNK UNK .

n: 2
model: vanilla

jien irrid .

n: 2
model: laplace

jien irrid .

n: 2
model: unk

jien irrid .

n: 3
model: vanilla

jien irrid lil .

n: 3
model: laplace

jien irrid lil .

n: 3
model: unk

jien irrid lil .



## Maltese Corpus

The complete Maltese corpus.


In [None]:
train_lm, test_lm = getTrainTest(root='Maltese/')
fitPredictTrain()
unigram = OrderedDict()
bigram = OrderedDict()
trigram = OrderedDict()
interpolation = OrderedDict()
perplexity = {}
predictTest()
visualizeWords()
visualizePerplexity()
generateFromEmpty()
start = ['jien', 'irrid', 'lil']
generateFrom(start=start)

Reading Files:   0%|          | 0/28 [00:00<?, ?it/s]

Parsing XML:   0%|          | 0/28 [00:00<?, ?it/s]

Building Sentences:   0%|          | 0/28 [00:00<?, ?it/s]

Paragraph:   0%|          | 0/10 [00:00<?, ?it/s]

Paragraph:   0%|          | 0/39 [00:00<?, ?it/s]

Paragraph:   0%|          | 0/12 [00:00<?, ?it/s]

Paragraph:   0%|          | 0/10 [00:00<?, ?it/s]

Paragraph:   0%|          | 0/40 [00:00<?, ?it/s]

Paragraph:   0%|          | 0/25 [00:00<?, ?it/s]

Paragraph:   0%|          | 0/21 [00:00<?, ?it/s]

Paragraph:   0%|          | 0/6 [00:00<?, ?it/s]

Paragraph:   0%|          | 0/9 [00:00<?, ?it/s]

Paragraph:   0%|          | 0/6 [00:00<?, ?it/s]

Paragraph:   0%|          | 0/17 [00:00<?, ?it/s]

Paragraph:   0%|          | 0/5 [00:00<?, ?it/s]

Paragraph:   0%|          | 0/11 [00:00<?, ?it/s]

Paragraph:   0%|          | 0/23 [00:00<?, ?it/s]

Paragraph:   0%|          | 0/4 [00:00<?, ?it/s]

Paragraph:   0%|          | 0/16 [00:00<?, ?it/s]

Paragraph:   0%|          | 0/16 [00:00<?, ?it/s]

Paragraph:   0%|          | 0/10 [00:00<?, ?it/s]

Paragraph:   0%|          | 0/11 [00:00<?, ?it/s]

Paragraph:   0%|          | 0/379 [00:00<?, ?it/s]

Paragraph:   0%|          | 0/423 [00:00<?, ?it/s]

Paragraph:   0%|          | 0/20 [00:00<?, ?it/s]

Paragraph:   0%|          | 0/240 [00:00<?, ?it/s]

Paragraph:   0%|          | 0/261 [00:00<?, ?it/s]

Paragraph:   0%|          | 0/9 [00:00<?, ?it/s]

Paragraph:   0%|          | 0/34 [00:00<?, ?it/s]

Paragraph:   0%|          | 0/6 [00:00<?, ?it/s]

Paragraph:   0%|          | 0/4 [00:00<?, ?it/s]

Counting x counts:   0%|          | 0/3234 [00:00<?, ?it/s]

Calculating Probabilities:   0%|          | 0/9238 [00:00<?, ?it/s]

Counting x counts:   0%|          | 0/809 [00:00<?, ?it/s]

Calculating Probabilities:   0%|          | 0/4088 [00:00<?, ?it/s]

Train Corpus Size:  72018
Test Corpus Size:  17652


  0%|          | 0/3 [00:00<?, ?it/s]

  0%|          | 0/3 [00:00<?, ?it/s]

# References

[1] Gatt, A., & Čéplö, S., Digital corpora and other electronic resources for Maltese. In A. Hardie, & R. Love (Eds.), Corpus Linguistics, 2013, pp. 96-97

[2] G. Pibiri and R. Venturini, "Handling Massive N-Gram Datasets Efficiently", ACM Transactions on Information Systems, vol. 37, no. 2, pp. 1-41, 2019. doi: 10.1145/3302913 [Accessed 8 April 2021].

[3] "Perplexity in Language Models", Towards Data Science. Available: https://towardsdatascience.com/perplexity-in-language-models-87a196019a94