# Building a Language Model
***
# Table of Contents
1.  [Imports](#Imports)
2.  [Methodology](#Methodology)
3.  [Corpus](#Corpus)
4.  [Model](#Model)
5.  [Evaluation](#Evaluation)

# Imports

Only five modules are needed for this project:
* lxml - To read broken xml files
* re - To split lines with regex
* tqdm.notebook - tqdm progress bars, but for ipynb files
* os - File traversal
* LanguageModel - Contains the Corpus and Model classes (same code), split for cleanliness

In [5]:
from lxml import etree
import re
from tqdm.notebook import tqdm
import os
from LanguageModel import Corpus, Model

5.515418962292417


# Methodology


# Corpus

## The Reasoning Behind the Code
*keywords*: Corpus, NGram, Model

### Initialization

The corpus object mainly represents the pre-processed data of a given *corpus* as well as its processed form as NGrams
and probability models.

On initialization the *corpus* data is read and stored as a list of sentences, where each sentence is a list of words,
where each word is a string.
>Previously I had a Word object that retained the **4 columns**; however, this caused
a lot of bloat and interfered with the generation process, so I dropped it in favour of a simple string.

A vanilla unigram and model are also created at this time.

The user also has the option of creating a corpus object from an existing list of lists of strings (another corpus).
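
As a small illustration of the list-of-sentences layout described above, the following sketch builds a two-sentence corpus by hand. The words are made-up placeholders, not entries taken from the dataset.

```python
# Hypothetical two-sentence corpus: a list of sentences, each a list of word strings.
toy_corpus = [
    ['il-', 'kelb', 'jiekol'],
    ['il-', 'qattus', 'jorqod'],
]

first_sentence = toy_corpus[0]   # a list of strings
first_word = first_sentence[0]   # a plain string, no extra columns attached
token_count = sum(len(sentence) for sentence in toy_corpus)
print(first_word, token_count)
```

A list shaped like this is the kind of value that could be handed to the corpus object in place of xml files.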

### Usage

The corpus object is intended to be used as a non-static object for one set of xml files. This gives a user the
ability to have multiple corpus objects that load different xml files.

Once created, the ```Model()```, ```NGram()``` and ```LinearInterpolation()``` methods can be used to efficiently produce
the desired output. The corpus list itself can be traversed through the object directly, since ```__len__()```,
```__iter__()``` and ```__getitem__()``` are implemented for this purpose.

All 3 attributes, ```_corpus```, ```_ngrams``` and ```_models```, are intended to be private.
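
A minimal sketch of how such traversal could work, assuming the class simply delegates to its private list. ```ToyCorpus``` is a hypothetical stand-in used only for illustration, not the actual ```Corpus``` class.

```python
class ToyCorpus:
    """Hypothetical stand-in that wraps a list of sentences."""

    def __init__(self, corpus):
        self._corpus = corpus  # private by convention (leading underscore)

    def __len__(self):
        # Number of sentences in the corpus.
        return len(self._corpus)

    def __iter__(self):
        # Iterate sentence by sentence.
        return iter(self._corpus)

    def __getitem__(self, index):
        # Index or slice into the list of sentences.
        return self._corpus[index]


toy = ToyCorpus([['<s>', 'a', '</s>'], ['<s>', 'b', 'c', '</s>']])
lengths = [len(sentence) for sentence in toy]
print(len(toy), toy[1], lengths)
```

With these three methods in place, a loop such as ```for s in self``` inside the class works without touching ```_corpus``` directly.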

## Code Explanation

### _ReadCorpus

This function checks that the root directory containing all the xml files is readable. If it encounters no issue accessing it,
it reads the contents of every file within it and returns them as a list of strings.

In [None]:
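def _ReadCorpus(root='Corpus/'):
    # Stop early if the directory cannot be read, instead of failing later in os.listdir.
    if not os.access(root, os.R_OK):
        raise IOError('Check root!! Cannot read: ' + root)

    xml_data = []

    for file in tqdm(os.listdir(root), desc='Reading Files'):
        # The context manager closes each file handle once it has been read.
        with open(os.path.join(root, file), 'r', encoding='utf8') as f:
            xml_data.append(f.read())
    return xml_data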

### _ParseAsXML

The parser is initialised with ```recover=True``` so that broken xml does not stop the parse, and the xml data from the files
is read into ```xml_data```. Each file is then parsed and appended to ```roots```. Each file is split into a number of texts,
so ```roots``` is a list of these parsed texts.

In [None]:
def _ParseAsXML(root='Corpus/'):
    parser = etree.XMLParser(recover=True)
    roots = []
    xml_data = Corpus._ReadCorpus(root)
    for xml in tqdm(xml_data, desc='Parsing XML'):
        roots.append(etree.fromstring(xml, parser=parser))
    return roots

### _CorpusAsListOfSentences

This function is the last step in creating a pre-processed corpus.

A Python list, ```sentences```, is initialised.

It is important to understand the xml structure of the Maltese dataset to follow this step.

Each text has a number of paragraphs, and each paragraph has a number of sentences, each of these containing a number of
words. Every word has 4 values (the **4 columns**).

The 3 nested loops move from text to paragraph to sentence. Each sentence is split with a regex into a list of words
(still including the extra values), and each word then has its extra values removed so that only the word itself remains.
I decided against removing punctuation or stop words, or doing any other pre-processing, since I found it unnecessary.
Once a sentence is processed, the start and end tags (```<s>``` and ```</s>```) are added at its beginning and end.
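
The word-level filtering can be sketched in isolation as follows. The sample text and its column values are invented for illustration; only the newline-separated rows and tab-separated columns are taken from the code in this section, and ```words_from_sentence_text``` is a hypothetical helper name.

```python
def words_from_sentence_text(text):
    """Keep the first tab-separated column of every non-empty line."""
    words = []
    for line in text.lstrip('\n').split('\n'):
        if line != '':
            words.append(line.split('\t')[0])
    # Wrap the sentence in start and end tags.
    return ['<s>'] + words + ['</s>']


# Hypothetical sentence body: one word per line, four tab-separated columns.
sample = '\nIl-\tDEF\til-\tnull\nkelb\tNOUN\tkelb\tk-l-b\n'
print(words_from_sentence_text(sample))
```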


#### Previous Version
Before settling on this I attempted to store the ```corpus``` as a Pandas dataframe. This would have been useful had I
continued developing a tensor oriented approach to the problem. However, once I scrapped that idea, the dataframe only
caused clutter, since all sentences would have had to be the same length.

In [None]:
def _CorpusAsListOfSentences(root='Corpus/'):
    roots = Corpus._ParseAsXML(root)
    sentences = []
    for xml_root in tqdm(roots, desc='XML File'):
        for p in tqdm(xml_root, desc='Paragraph'):
            for s in p:
                # One word per line; drop the leading newline before splitting.
                unfiltered_sentence = re.split(r'\n', s.text.lstrip('\n'))
                sentence = []
                for unfiltered_word in unfiltered_sentence:
                    # Compare by value (!=); `is not` would compare object identity.
                    if unfiltered_word != '':
                        # Keep only the first tab-separated column, the word itself.
                        filtered_word = unfiltered_word.split('\t')
                        sentence.append(filtered_word[0])

                # Skip sentences that contain no words.
                if sentence:
                    sentence.insert(0, '<s>')
                    sentence.append('</s>')
                    sentences.append(sentence)
    return sentences

### _Counts

The _Counts function counts the ```n``` sized sequences in ```self``` (the corpus).

This is done by looping over each sentence, gathering a tuple of each ```n``` sized sequence and counting its
occurrences.

The counts are kept in a dictionary of this form:

```Python
counts = {sequence: count}
```

Where sequence is a tuple of size ```n``` and count is an integer giving the number of times that sequence occurs.

In [None]:
def _Counts(self, n):
    counts = {}
    for s in tqdm(self, desc='Counting x counts'):
        for i in range(len(s) + 1):
            if i < n:
                continue
            count = []
            for x in range(n,0,-1):
                count.append(s[i - x])
            count = tuple(count)

            if count in counts:
                counts[count] += 1
            else:
                counts[count] = 1

    return counts
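
For comparison, the same counts can be produced more compactly with ```collections.Counter``` and slicing. This is an alternative sketch, not the code used in the project; ```count_ngrams``` is a hypothetical name.

```python
from collections import Counter


def count_ngrams(sentences, n):
    """Count every run of n consecutive words within each sentence."""
    counts = Counter()
    for sentence in sentences:
        # Sentences shorter than n produce an empty range and are skipped.
        for i in range(len(sentence) - n + 1):
            counts[tuple(sentence[i:i + n])] += 1
    return dict(counts)


toy = [['<s>', 'a', 'b', '</s>'], ['<s>', 'a', 'c', '</s>']]
bigrams = count_ngrams(toy, 2)
print(bigrams[('<s>', 'a')])
```

Sequences never cross sentence boundaries in either version, because each sentence is windowed on its own.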

### NGram

*keywords*: Vanilla, Laplace, UNK

The NGram function returns a dictionary of the NGram counts for a ```model``` of ```n```grams.

```model``` and ```n``` are used as flags for the ngram object.

Some error handling is done first. Then an **identifier** is built from the flags to check whether a matching ngram already
exists. If one exists then the function returns it.

If there is no NGram object that satisfies the given flags, a new ngram is created.


If the ```model``` is specified to be *laplace* or *vanilla*, the ```n``` sized sequences in the ```corpus```
are counted using ```_Counts()```.

* Then, if the ```model``` is specified to be *laplace*, each count is incremented by 1.

Otherwise, if the ```model``` is specified to be *unk*, a temporary corpus is created using the *vanilla* unigram counts. Any word
that occurs fewer than 3 times is replaced with the ```UNK``` token in the new corpus. The ```n``` sized sequences of this new corpus are then counted.

In every case a **counts** variable is created in the form described in the **_Counts()** section.

The ```model``` flag is stored alongside the **counts** variable to make up the final NGram dictionary. Using the earlier **identifier**,
this new NGram is added to the **corpus _ngrams** dictionary as well as returned.
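
The *unk* pre-processing step can be sketched on its own as below. The threshold of 3 matches the code in this section; the function name and the toy data are hypothetical.

```python
from collections import Counter


def replace_rare_words(sentences, threshold=3, unk='UNK'):
    """Replace every word seen fewer than `threshold` times with `unk`."""
    frequency = Counter(word for sentence in sentences for word in sentence)
    return [
        [word if frequency[word] >= threshold else unk for word in sentence]
        for sentence in sentences
    ]


toy = [['a', 'b'], ['a', 'c'], ['a', 'b']]
unk_corpus = replace_rare_words(toy)
print(unk_corpus)
```

The sentence lengths are unchanged by this step, so the total number of tokens stays the same as in the original corpus.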

In [None]:
def NGram(self, n=2, model='vanilla'):
    if n < 1:
        raise Exception('Unigrams and up are supported, otherwise no.')

    if model != 'vanilla' and \
            model != 'laplace' and \
            model != 'unk':
        raise Exception('Only "vanilla"/"laplace"/"unk" models are supported.')

    identifier = tuple([n, model])
    if identifier in self._ngrams:
        if self._ngrams[identifier]['model'] == model:
            return self._ngrams[identifier]

    if model == 'laplace' or model == 'vanilla':
        counts = self._Counts(n=n)

        if model == 'laplace':
            for x in counts:
                counts[x] += 1

    elif model == 'unk':
        _count = self._Counts(n=1)
        tc = []
        for s in self:
            ts = []
            for w in s:
                if _count[tuple([w])] < 3:
                    ts.append('UNK')
                else:
                    ts.append(w)
            tc.append(ts)

        temp = Corpus(corpus=tc)
        counts = temp._Counts(n=n)

    result = {
        'count': counts,
        'model': model
    }

    self._ngrams[identifier] = result
    return self._ngrams[identifier]

### Model

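Models are handled in much the same way as NGrams: the flags are validated, an **identifier** is built from ```n``` and ```model```,
and an existing model is returned if one is stored under that identifier. Otherwise a new ```Model``` is created, stored in
```_models``` and returned.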

In [None]:
def Model(self, n=2, model='vanilla'):
    if n < 1:
        raise Exception('Unigrams and up are supported, otherwise no.')

    if model != 'vanilla' and \
            model != 'laplace' and \
            model != 'unk':
        raise Exception('Only "vanilla"/"laplace"/"unk" models are supported.')

    identifier = tuple([n, model])
    if identifier in self._models:
        if  self._models[identifier].model == model:
            return self._models[identifier]

    self._models[identifier] = Model(corpus=self,n=n,model=model)
    return self._models[identifier]

### LinearInterpolation

In [None]:
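def LinearInterpolation(self, trigram:tuple, model='vanilla'):
    if len(trigram) != 3:
        raise Exception('Only trigrams are supported with this function.')

    # Interpolation weights for the unigram, bigram and trigram models; they sum to 1.
    l1 = 0.1
    l2 = 0.3
    l3 = 0.6

    models = [
                self.Model(n=1,model=model),
                self.Model(n=2,model=model),
                self.Model(n=3,model=model)
             ]

    # GetProbabilityMath(forX, givenY) expects the history givenY as a tuple,
    # so the bigram history is a 1-tuple and the unigram history is empty.
    return  l3 * models[2].GetProbabilityMath(trigram[2], tuple(trigram[:2])) + \
            l2 * models[1].GetProbabilityMath(trigram[2], tuple(trigram[1:2])) + \
            l1 * models[0].GetProbabilityMath(trigram[2], ())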
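
Written out, the weighted sum this function computes is the usual linear interpolation of trigram, bigram and unigram estimates, with the weights taken from the code above. The symbols $w_1, w_2, w_3$ for the three words of the trigram are notation introduced here for the summary.

$$
P_{LI}(w_3 \mid w_1, w_2) = \lambda_3 \, P(w_3 \mid w_1, w_2) + \lambda_2 \, P(w_3 \mid w_2) + \lambda_1 \, P(w_3),
\qquad \lambda_3 = 0.6,\ \lambda_2 = 0.3,\ \lambda_1 = 0.1
$$

Because the three weights sum to 1, the result always stays between 0 and 1.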

# Model


My model implementation stores the probability of each n-gram in a given corpus.
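
The estimates computed in the constructor can be summarised as follows. The notation is introduced here for the summary: $C(\cdot)$ is an n-gram count, $N$ is the total number of tokens in the corpus, $V$ is the number of distinct unigrams, and $k$ is 1 for the *laplace* model and 0 otherwise.

$$
P(w_n \mid w_1 \dots w_{n-1}) = \frac{C(w_1 \dots w_n) + k}{C(w_1 \dots w_{n-1}) + kV},
\qquad
P(w) = \frac{C(w) + k}{N + kV}
$$

The first form is used for $n > 1$ and the second for unigrams. For the *unk* model the counts are taken from the corpus in which rare words have been replaced.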



In [None]:
import math


class Model(object):
    def __init__(self, corpus, n=2, model='vanilla'):
        # V is the vocabulary size used for Laplace smoothing; it stays 0 otherwise.
        V = 0
        cmodel = model
        if model == 'laplace':
            # Start from the raw counts; the +1 is applied when the probabilities are computed.
            cmodel = 'vanilla'
            V = len(corpus.NGram(n=1)['count'])

        counts = corpus.NGram(n, model=cmodel)['count']

        probabilities = {}
        # Total number of tokens in the corpus.
        self.N = sum(len(s) for s in corpus)

        if n != 1:
            # Condition each n-gram on the count of its (n - 1)-word history.
            previous = corpus.NGram(n - 1, model=cmodel)['count']
            for x in counts:
                probabilities[x] = {
                    'probability': (counts[x] + int(model == 'laplace')) / (previous[x[:n - 1]] + V)}
        else:
            for x in counts:
                probabilities[x] = {'probability': (counts[x] + int(model == 'laplace')) / (self.N + V)}
        self.probabilities = probabilities
        self.model = model

    def GetProbability(self, givenY, forX):
        # Same lookup as GetProbabilityMath, with the history as the first argument.
        return self.GetProbabilityMath(forX, givenY)

    def GetProbabilityMath(self, forX, givenY):
        # Add input validation
        # givenY is the history as a tuple of words; forX is the word being predicted.
        sequence =  givenY + (forX,)

        if sequence in self.probabilities:
            return self.probabilities[sequence]['probability']
        else:
            return 0

    def Perplexity(self):
        # Sum log-probabilities instead of multiplying them, so a long product cannot underflow to 0.
        log_prob = 0.0
        for p in self.probabilities:
            log_prob += math.log(self.probabilities[p]['probability'])

        return math.exp(-log_prob / self.N)
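
Multiplying many small probabilities can underflow to zero in floating point, which is why perplexity is often computed in log space. The sketch below, using made-up probabilities, shows that the two forms agree on small inputs; ```perplexity_direct``` and ```perplexity_log``` are hypothetical helpers, not part of the ```Model``` class.

```python
import math


def perplexity_direct(probabilities, n_tokens):
    """Product of the probabilities raised to the power -1/N."""
    product = 1.0
    for p in probabilities:
        product *= p
    return product ** -(1 / n_tokens)


def perplexity_log(probabilities, n_tokens):
    """The same quantity computed as exp(-(1/N) * sum(log p))."""
    return math.exp(-sum(math.log(p) for p in probabilities) / n_tokens)


probs = [0.5, 0.25, 0.25]  # made-up values
print(perplexity_direct(probs, 3), perplexity_log(probs, 3))
```

On a long list of small probabilities the direct product becomes exactly 0.0, while the log-space version still returns a finite value.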

In [None]:
#http://pages.di.unipi.it/pibiri/papers/NGrams18.pdf

corpus = Corpus(directory='Test Corpus/')
corpus.Model(n=3)
corpus.Model(2)

print(corpus.Model(n=1, model='vanilla').Perplexity())
