# Building a Language Model
***
# Table of Contents
1.  [Setup](#Setup)
2.  [Coding Decisions](#Coding-Decisions)
3.  [Evaluation](#Evaluation)
4.  [Conclusion](#Conclusion)
5.  [References](#References)

# Setup

For this assignment I wrote the Python package LanguageModel. Code documentation and explanations are
included as docstrings inside the code. My particular coding and design choices are described in the section
[Coding Decisions](#Coding-Decisions). I am using the Maltese corpus [[1]](#References) as the dataset for this
assignment, with Python version 3.7.

In [1]:
# Import the LanguageModel package
import LanguageModel as lm

# Coding Decisions

In this section I go over some coding and design decisions and explain why I made them.

## Corpus

From the few big data applications I have worked on so far, I know that most of them make use of the numpy
library, either directly or indirectly through the pandas library. I could have used numpy and made a
CorpusAsListOfNpArrays, but since Sentences were originally an object of their own, this did not cross my mind.

Another consideration was to hash/encode the words and use matrix operations to get the counts and probabilities. I
attempted this, but the process was becoming too complicated and gave no significant time improvement.

In the end I found Python list syntax very easy to understand and use, and the speed, combined with dictionaries, was
sufficient.
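
To illustrate the list-plus-dictionary approach described above, here is a minimal sketch of n-gram counting over sentences stored as plain Python lists. It is not the package's implementation: the function name `count_ngrams` and the `</s>` end token are assumptions made for the example (only `<s>` appears in the output later in this notebook).

```python
from collections import Counter


def count_ngrams(sentences, n):
    """Count n-grams over sentences stored as plain lists of tokens.

    Hypothetical helper for illustration; not part of the LanguageModel package.
    """
    counts = Counter()
    for sentence in sentences:
        # Assumed padding: one start and one end token per sentence.
        padded = ["<s>"] + sentence + ["</s>"]
        # Slide a window of size n over the padded sentence.
        for i in range(len(padded) - n + 1):
            counts[tuple(padded[i:i + n])] += 1
    return counts


corpus = [["10", "11", "12"], ["10", "11", "13"]]
bigram_counts = count_ngrams(corpus, n=2)
print(bigram_counts[("10", "11")])  # 2
```

Tuples are used as dictionary keys because they are hashable, so a lookup for any n-gram stays a single dictionary access.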

## NGramCounts

The counts object represents the frequency counts for a given n and model. I decided to only ever store vanilla counts
because, when I implemented different counting methods, especially to account for non-appearing tokens, the code was
becoming messy and slow. By implementing a GetCount function I was able to achieve full functionality with clean code.
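
The idea of storing only vanilla counts and deriving the other variants on request could look roughly like the sketch below. This is an assumption-laden illustration, not the package's GetCount: the names `get_count` and `to_unk`, the add-one rule for Laplace, and the rare-word threshold of 2 for the unk model are all hypothetical.

```python
def to_unk(ngram, unigram_counts, threshold=2):
    """Replace words seen fewer than `threshold` times with 'unk' (assumed rule)."""
    return tuple(
        w if unigram_counts.get((w,), 0) >= threshold else "unk" for w in ngram
    )


def get_count(ngram, ngram_counts, unigram_counts, model="vanilla", threshold=2):
    """Derive a count for `ngram` from stored vanilla counts only."""
    if model == "vanilla":
        return ngram_counts.get(ngram, 0)
    if model == "laplace":
        # Add-one smoothing: every n-gram, seen or unseen, gets one extra count.
        return ngram_counts.get(ngram, 0) + 1
    if model == "unk":
        # Sum the counts of all stored n-grams that collapse to the same
        # unk-mapped n-gram. Linear in the table size; fine for a sketch.
        target = to_unk(ngram, unigram_counts, threshold)
        return sum(
            c for g, c in ngram_counts.items()
            if to_unk(g, unigram_counts, threshold) == target
        )
    raise ValueError(f"unknown model: {model}")


unigram_counts = {("a",): 3, ("b",): 1, ("c",): 1}
bigram_counts = {("a", "b"): 1, ("a", "c"): 1, ("a", "a"): 1}
print(get_count(("a", "z"), bigram_counts, unigram_counts, model="unk"))  # 2
```

Only one table is kept in memory, and unseen n-grams never need to be stored explicitly.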

## NGramModel

Unlike with the frequency counts, for the probability set I calculate vanilla and Laplace-smoothed probabilities
differently. However, the various methods of getting the probability for each n-gram are then handled by the LanguageModel.
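
As a rough illustration of how the two calculations differ, the sketch below uses the textbook maximum-likelihood and add-one formulas. The function names and the way the counts are passed in are assumptions for the example; the package may organise this differently.

```python
def vanilla_probability(ngram, ngram_counts, history_counts):
    """Maximum-likelihood estimate: C(history + word) / C(history)."""
    denominator = history_counts.get(ngram[:-1], 0)
    if denominator == 0:
        return 0.0  # unseen history: no estimate is available
    return ngram_counts.get(ngram, 0) / denominator


def laplace_probability(ngram, ngram_counts, history_counts, vocabulary_size):
    """Add-one estimate: (C(history + word) + 1) / (C(history) + V)."""
    numerator = ngram_counts.get(ngram, 0) + 1
    denominator = history_counts.get(ngram[:-1], 0) + vocabulary_size
    return numerator / denominator


bigram_counts = {("a", "b"): 3, ("a", "c"): 1}
unigram_counts = {("a",): 4, ("b",): 3, ("c",): 1}
V = 3  # vocabulary size of this toy example

print(vanilla_probability(("a", "b"), bigram_counts, unigram_counts))     # 0.75
print(laplace_probability(("a", "a"), bigram_counts, unigram_counts, V))  # 1/7
```

The vanilla estimate gives unseen n-grams a probability of zero, while the Laplace estimate always returns something positive, which matters for the perplexity calculation later on.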


## LanguageModel

With the main class of the package I implemented most of the requirements of the assignment. I believe the code is
explained well enough, most of the time with reasoning in the code's docstrings and comments.

# Evaluation

In this section I create a number of LanguageModels on different corpora and evaluate them in a standard manner.

## Methodology

* First I split the chosen corpus into an 80/20 training/testing split.

* I create a unigram, bigram, trigram and linear interpolation NGramModel for the three model types: vanilla, laplace
and unk. This is only done for the train LanguageModel.

* I create a unigram, bigram, trigram and linear interpolation NGramCounts for the three model types: vanilla, laplace
and unk. This is done for both LanguageModels.

* Evaluate the test LanguageModel against the trained LanguageModel.

* Calculate the test perplexity.

* Generate a number of sentences.
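
The linear interpolation mentioned in the steps above combines the trigram, bigram and unigram estimates of the same word into one probability. A minimal sketch follows; the weights 0.6, 0.3 and 0.1 and the function name are assumptions for illustration and are not taken from the package.

```python
def linear_interpolation(p_trigram, p_bigram, p_unigram, lambdas=(0.6, 0.3, 0.1)):
    """Weighted sum of trigram, bigram and unigram probabilities.

    The default weights are assumed values; they must sum to 1 so that the
    result is still a valid probability.
    """
    l3, l2, l1 = lambdas
    if abs(l3 + l2 + l1 - 1.0) > 1e-9:
        raise ValueError("interpolation weights must sum to 1")
    return l3 * p_trigram + l2 * p_bigram + l1 * p_unigram


# A trigram unseen in training still gets probability mass from the lower orders.
print(linear_interpolation(p_trigram=0.0, p_bigram=0.2, p_unigram=0.1))
```

Because the lower-order models back up the trigram estimate, an interpolated model assigns zero only when all three component probabilities are zero.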

## Test Corpus

This corpus was created to test out the features of the package to make sure everything works as it is supposed to.

In [2]:
# Import train_test_split from sklearn and tqdm
from sklearn.model_selection import train_test_split
from tqdm.notebook import tqdm


def getTrainTest(root):
    dataset = lm.Corpus.CorpusAsListOfSentences(root=root, verbose=True)
    train, test = train_test_split(dataset, test_size=0.2, shuffle=False)
    _train_lm = lm.LanguageModel(corpus=train, verbose=True)
    _test_lm = lm.LanguageModel(corpus=test, verbose=True)
    print("Train Corpus Size: ", _train_lm.GetNGramModel(n=1).N)
    print("Test Corpus Size: ", _test_lm.GetNGramModel(n=1).N)
    return _train_lm, _test_lm

train_lm, test_lm = getTrainTest(root='Test Corpus/')

Train Corpus Size:  96
Test Corpus Size:  24


In this step I successfully split the training and testing data. The train LM has 96 tokens, 16 of which are start and
end tokens, and the test LM has 24 tokens, 4 of which are start and end tokens.
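
The sizes reported above can be checked by hand. The sketch below assumes the test corpus consists of ten sentences of ten number tokens each (which the generated sentences later in this notebook suggest) and that each sentence gets one start and one end token. It uses plain slicing in place of `train_test_split(..., shuffle=False)` so that it runs without sklearn.

```python
# Assumed shape of the test corpus: sentences "0..9", "10..19", ..., "90..99".
sentences = [[str(10 * s + w) for w in range(10)] for s in range(10)]

# An unshuffled 80/20 split keeps the first 8 sentences for training.
split = int(len(sentences) * 0.8)
train, test = sentences[:split], sentences[split:]


def token_count(sents):
    """Number of tokens including one start and one end token per sentence."""
    return sum(len(s) + 2 for s in sents)


print(token_count(train), 2 * len(train))  # 96 16
print(token_count(test), 2 * len(test))    # 24 4
```

Under these assumptions the test split contains only the tokens 80 to 99, none of which occur in the training split.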

In [3]:
params =    {
                "n": [1,2,3],
                "model": ["vanilla", "laplace", "unk"]
            }

def fitPredictTrain():
    for n in tqdm(params["n"]):
        for model in params["model"]:
            # counts and probabilities are both needed for the train LM
            train_lm.GetNGramCounts(n=n, model=model)
            train_lm.GetNGramModel(n=n, model=model)
            test_lm.GetNGramCounts(n=n, model=model)
fitPredictTrain()

In this step I successfully generate the required data for the next step.

In [4]:
from collections import OrderedDict
from operator import itemgetter


unigram = OrderedDict()
bigram = OrderedDict()
trigram = OrderedDict()
interpolation = OrderedDict()

perplexity = {}

def predictTest():
    for n in tqdm(params["n"]):
        for model in params["model"]:
            # frequency counts from the test lm
            testgrams = test_lm.GetNGramCounts(n=n,model=model)
            # predict these ngrams using the trained model
            probabilities = {}
            for gram in testgrams:
                probabilities[gram] = train_lm.GetProbability(input=gram, n=n, model=model)
            # set the test lm model to these predictions
            test_lm.SetNGramModel(probabilities=probabilities, n=n, model=model)

            # Sort the probabilities, these will be used for visualization
            sorted_tuples = sorted(probabilities.items(), key=itemgetter(1))
            # fill the appropriate ordered dict
            if n == 1:
                unigram[model] = {}
                for k, v in sorted_tuples:
                    unigram[model][k] = v
            elif n == 2:
                bigram[model] = {}
                for k, v in sorted_tuples:
                    bigram[model][k] = v
            else:
                trigram[model] = {}
                for k, v in sorted_tuples:
                    trigram[model][k] = v

            # get the perplexity of the tested model
            perplexity[tuple([n, model])] = test_lm.Perplexity(n=n, model=model)

            if n == 3:
                interpolations = {}
                # predict the ngrams using the trained model
                for gram in testgrams:
                    interpolations[gram] = train_lm.LinearInterpolation(trigram=gram, model=model)
                 # Sort the probabilities, these will be used for visualization
                sorted_tuples = sorted(interpolations.items(), key=itemgetter(1))
                # fill the appropriate ordered dict
                interpolation[model] = {}
                for k, v in sorted_tuples:
                    interpolation[model][k] = v
                # get the perplexity of the linear interpolation tested model
                perplexity[tuple(['interpolation', model])] = test_lm.Perplexity(n=n, model=model, linearInterpolation=True)

predictTest()

Now that I have tested the corpus using my language model, I will show some n-gram probabilities and the model
perplexities.

In [13]:
heading =   "\t|\tUnigram\t\t|\tBigram\t\t\t|\tTrigram\t\t\t\t\t|\tLinear  Interpolation"
line =  "************************************************************************************************************"\
        "***************************************"
def visualizeWords():
    # This is just me having fun with strings and Python, nothing else
    data_template =     "Vanilla\t|\t{}:{:.5f}%\t|\t{}:{:.5f}%\t|\t{}:{:.5f}%\t\t|\t{}:{:.5f}%\n" \
                        "Laplace\t|\t{}:{:.5f}%\t|\t{}:{:.5f}%\t|\t{}:{:.5f}%\t\t|\t{}:{:.5f}%\n" \
                        "UNK\t|\t{}:{:.5f}%\t|\t{}:{:.5f}%\t|\t{}:{:.5f}%\t\t|\t{}:{:.5f}%"

    words = []
    for i in range(min(len(unigram["unk"]), 5)):
        i = -i
        words.append(list(unigram["unk"].keys())[i])
        print(heading)
        print(line)
        print(data_template.format(
                " ".join([f'{x[:5]:<5}' for x in list(unigram["vanilla"].keys())[i]]),          (unigram["vanilla"][list(unigram["vanilla"].keys())[i]]) * 100,
                " ".join([f'{x[:5]:<5}' for x in list(bigram["vanilla"].keys())[i]]),           (bigram["vanilla"][list(bigram["vanilla"].keys())[i]]) * 100,
                " ".join([f'{x[:5]:<5}' for x in list(trigram["vanilla"].keys())[i]]),          (trigram["vanilla"][list(trigram["vanilla"].keys())[i]]) * 100,
                " ".join([f'{x[:5]:<5}' for x in list(interpolation["vanilla"].keys())[i]]),    (interpolation["vanilla"][list(interpolation["vanilla"].keys())[i]]) * 100,
                " ".join([f'{x[:5]:<5}' for x in list(unigram["laplace"].keys())[i]]),          (unigram["laplace"][list(unigram["laplace"].keys())[i]]) * 100,
                " ".join([f'{x[:5]:<5}' for x in list(bigram["laplace"].keys())[i]]),           (bigram["laplace"][list(bigram["laplace"].keys())[i]]) * 100,
                " ".join([f'{x[:5]:<5}' for x in list(trigram["laplace"].keys())[i]]),          (trigram["laplace"][list(trigram["laplace"].keys())[i]]) * 100,
                " ".join([f'{x[:5]:<5}' for x in list(interpolation["laplace"].keys())[i]]),    (interpolation["laplace"][list(interpolation["laplace"].keys())[i]]) * 100,
                " ".join([f'{x[:5]:<5}' for x in list(unigram["unk"].keys())[i]]),              (unigram["unk"][list(unigram["unk"].keys())[i]]) * 100,
                " ".join([f'{x[:5]:<5}' for x in list(bigram["unk"].keys())[i]]),               (bigram["unk"][list(bigram["unk"].keys())[i]]) * 100,
                " ".join([f'{x[:5]:<5}' for x in list(trigram["unk"].keys())[i]]),              (trigram["unk"][list(trigram["unk"].keys())[i]]) * 100,
                " ".join([f'{x[:5]:<5}' for x in list(interpolation["unk"].keys())[i]]),        (interpolation["unk"][list(interpolation["unk"].keys())[i]]) * 100))
        print(line)
visualizeWords()

	|	Unigram		|	Bigram			|	Trigram					|	Linear  Interpolation
************************************************************************************************************************************************
Vanilla	|	80   :0.00000%	|	<s>   80   :0.00000%	|	<s>   80    81   :0.00000%		|	<s>   80    81   :0.00000%
Laplace	|	80   :0.01487%	|	<s>   80   :0.01487%	|	<s>   80    81   :0.01487%		|	<s>   80    81   :0.01487%
UNK	|	unk  :66.94215%	|	unk   unk  :75.52438%	|	unk   unk   unk  :73.14751%		|	unk   unk   unk  :73.24003%
************************************************************************************************************************************************


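The table above shows that, for the lowest-probability test n-gram of each model, the vanilla model assigns a
probability of zero, the Laplace model assigns a small non-zero probability (about 0.015%), and the UNK model maps the
words to the unk token, which receives a high probability (roughly 67% to 76%).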

In [6]:
def visualizePerplexity():
    # Somewhat cleaner than the one above
    perplexity_template =       "Vanilla\t|\t{:.2f}\t\t|\t{:.2f}\t\t\t|\t{:.2f}\t\t\t|\t{:.2f}\n" \
                                "Laplace\t|\t{:.2f}\t\t|\t{:.2f}\t\t\t|\t{:.2f}\t\t\t|\t{:.2f}\n" \
                                    "UNK\t|\t{:.2f}\t\t|\t{:.2f}\t\t\t|\t{:.2f}\t\t\t|\t{:.2f}"

    print(heading)
    print(line)
    print(perplexity_template.format(   perplexity[tuple([1, "vanilla"])], perplexity[tuple([2, "vanilla"])], perplexity[tuple([3, "vanilla"])], perplexity[tuple(["interpolation", "vanilla"])],
                                        perplexity[tuple([1, "laplace"])], perplexity[tuple([2, "laplace"])], perplexity[tuple([3, "laplace"])], perplexity[tuple(["interpolation", "laplace"])],
                                        perplexity[tuple([1, "unk"])],     perplexity[tuple([2, "unk"])],     perplexity[tuple([3, "unk"])],     perplexity[tuple(["interpolation", "unk"])]))
    print(line)
visualizePerplexity()

	|	Unigram		|	Bigram			|	Trigram			|	Linear  Interpolation
******************************************************************************************************************************
Vanilla	|	0.00		|	0.00			|	0.00			|	0.00
Laplace	|	6477966.12		|	10406806.00			|	2395408.07			|	188.58
UNK	|	1.03		|	1.02			|	1.03			|	1.03
******************************************************************************************************************************
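
For reference, perplexity under the standard definition is the exponential of the average negative log-probability of the test n-grams. The sketch below is a generic version of that formula, not the package's `Perplexity` method; the function name and the choice to return infinity for zero probabilities are assumptions.

```python
import math


def perplexity(probabilities):
    """Perplexity of a sequence of n-gram probabilities.

    PP = exp(-(1/N) * sum(log p_i)); computed in log space to avoid underflow.
    """
    if not probabilities:
        raise ValueError("need at least one probability")
    if any(p <= 0 for p in probabilities):
        # A single zero probability makes the product zero, so PP is infinite.
        return math.inf
    log_sum = sum(math.log(p) for p in probabilities)
    return math.exp(-log_sum / len(probabilities))


print(perplexity([0.25, 0.25, 0.25, 0.25]))  # approximately 4
print(perplexity([0.5, 0.0]))                # inf
```

Under this definition a model that assigns zero probability to any test n-gram has infinite perplexity, which may be the situation behind the vanilla row, and lower values indicate a better fit to the test data.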


Now that I have evaluated the models intrinsically via perplexity, I can do a small extrinsic evaluation by generating
two sentences from each model in the trained LanguageModel. One will be given no start, while the other will be given a
sequence to continue.
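
As a rough picture of what sentence generation involves, the sketch below samples the next word from a bigram distribution until an end token or a length limit is reached. It is a simplified stand-in and not the package's `GenerateSentence`: the nested-dictionary layout, the `</s>` end token and the `max_len` cap are assumptions.

```python
import random


def generate_sentence(bigram_probs, start="<s>", end="</s>", max_len=20, rng=None):
    """Sample words one at a time from P(next | current).

    `bigram_probs` maps a word to a dict of {next_word: probability}.
    """
    rng = rng or random.Random()
    sentence = []
    current = start
    for _ in range(max_len):
        candidates = bigram_probs.get(current)
        if not candidates:
            break  # no known continuation for this word
        words = list(candidates)
        weights = [candidates[w] for w in words]
        # Draw the next word in proportion to its conditional probability.
        current = rng.choices(words, weights=weights, k=1)[0]
        if current == end:
            break
        sentence.append(current)
    return sentence


toy_model = {"<s>": {"20": 1.0}, "20": {"21": 1.0}, "21": {"</s>": 1.0}}
print(generate_sentence(toy_model))  # ['20', '21']
```

Continuing a given sequence works the same way, with `start` set to the last word of that sequence.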

In [7]:
def generateFromEmpty():
    for n in tqdm(params["n"]):
        for model in params["model"]:
            print("n: {}\nmodel: {}\n".format(n,model))
            generated = train_lm.GenerateSentence(n=n, model=model, verbose=True)
            for w in generated:
                print(w, end=' ')
            print(".\n")
generateFromEmpty()

n: 1
model: vanilla

77 11 62 26 69 36 .

n: 1
model: laplace

57 13 44 .

n: 1
model: unk

.

n: 2
model: vanilla

60 61 62 63 64 65 66 67 68 69 .

n: 2
model: laplace

50 51 52 53 54 55 56 57 58 59 .

n: 2
model: unk

.

n: 3
model: vanilla

10 11 12 13 14 15 16 17 18 19 .

n: 3
model: laplace

40 41 42 43 44 45 46 47 48 49 .

n: 3
model: unk

.



In [8]:
def generateFrom(start):
    for n in tqdm(params["n"]):
        sentence = start[:-1]
        for model in params["model"]:
            generated = train_lm.GenerateSentence(start=start[-1], n=n, model=model, verbose=True)
            print("n: {}\nmodel: {}\n".format(n,model))
            given_and_generated = sentence + generated
            for w in given_and_generated:
                print(w, end=' ')
            print(".\n")
            
start = ['20', '21', '22']
generateFrom(start=start)

n: 1
model: vanilla

20 21 22 74 16 46 .

n: 1
model: laplace

20 21 22 9 8 35 62 .

n: 1
model: unk

20 21 22 .

n: 2
model: vanilla

20 21 22 23 24 25 26 27 28 29 .

n: 2
model: laplace

20 21 22 23 24 25 26 27 28 29 .

n: 2
model: unk

20 21 22 .

n: 3
model: vanilla

20 21 22 23 24 25 26 27 28 29 .

n: 3
model: laplace

20 21 22 23 24 25 26 27 28 29 .

n: 3
model: unk

20 21 22 .



Now I will repeat the above steps for the other corpora.

## Sports Corpus

This corpus is a subset of the larger complete Maltese corpus.

In [9]:
# train_lm, test_lm = getTrainTest(root='Sports/')
# fitPredictTrain()
# unigram = OrderedDict()
# bigram = OrderedDict()
# trigram = OrderedDict()
# interpolation = OrderedDict()
# perplexity = {}
# predictTest()
# visualizeWords()
# visualizePerplexity()
# generateFromEmpty()
# start = ['jien', 'irrid', 'lil']
# generateFrom(start=start)

## Maltese Corpus

The complete Maltese corpus.


In [10]:
# train_lm, test_lm = getTrainTest(root='Maltese/')
# fitPredictTrain()
# unigram = OrderedDict()
# bigram = OrderedDict()
# trigram = OrderedDict()
# interpolation = OrderedDict()
# perplexity = {}
# predictTest()
# visualizeWords()
# visualizePerplexity()
# generateFromEmpty()
# start = ['jien', 'irrid', 'lil']
# generateFrom(start=start)

# Conclusion

# References

[1] Gatt, A., & Čéplö, S., Digital corpora and other electronic resources for Maltese. In A. Hardie, & R. Love (Eds.), Corpus Linguistics, 2013, pp. 96-97

[2] G. Pibiri and R. Venturini, "Handling Massive N-Gram Datasets Efficiently", ACM Transactions on Information Systems, vol. 37, no. 2, pp. 1-41, 2019. Available: 10.1145/3302913 [Accessed 8 April 2021].

[3] https://towardsdatascience.com/perplexity-in-language-models-87a196019a94