The explanations below, and what I implement afterwards, come from reading ***A Neural Probabilistic Language Model*** *(Bengio, Y., et al, 2003)*

So we learned about n-gram language models, and how we can predict the next word given the $n-1$ previous words. And we implemented a **very simple** bigram model.  
Now, my next step towards understanding word embeddings is to see how n-grams and statistical language modeling translated into neural networks and deep learning.

So in *(Bengio, Y., et al, 2003)*, the motivation is that it's difficult to model the joint probability distribution of many consecutive words (random variables with **discrete** values). An example would be a sentence of 10 words where each one can take any value from a vocabulary of $100000$ words, so you have $100000^{10} - 1$ potential "free parameters", as they call them, for that sentence.
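
To get a feel for how big that number is, here is a quick back-of-the-envelope check. The variable names are mine, not the paper's, and the "minus one" is there because all the probabilities have to sum to 1, so the last one is not free.

```python
# Number of free parameters needed to tabulate the joint distribution of
# 10 consecutive words directly, with a vocabulary of 100,000 words.
vocab_size = 100_000
sequence_length = 10

# One probability per possible sequence, minus one because they must sum to 1.
free_parameters = vocab_size**sequence_length - 1

print(f"{free_parameters:.3e}")
```

That is roughly $10^{50}$ numbers to estimate, which is why tabulating the joint distribution directly is hopeless.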

They also explain that modeling continuous variables makes generalization easier because the function we want to learn is expected to **have** some smoothness properties (neural networks would be a good example). So a continuous space is smoother, unlike discrete spaces, which are more, idk, **discrete?** And they explain more about discrete spaces:  
***"For discrete spaces, the generalization structure is not as obvious: any change of these discrete variables may have a drastic impact on the value of the function to be estimated, and when the number of values that each discrete variable can take is large, most observed objects are almost maximally far from each other in hamming distance."***

**<u>Disclaimer:</u>**  
We are now going into the proposed approach, its mathematics, and its design. Since I myself am trying to learn here, I won't be following the paper exactly, meaning that I could:  
1.  Add a little bit more mathematics to give you and myself a better idea of what's happening.  
2.  Branch off briefly to explain some other concepts that will help us understand what we have here better. These might be trivial to you depending on your level and knowledge, so you can skip them. But it won't be long anyway, this isn't an article (yet!)  

So previously, we learned that you can predict the next word $w_t$ given all previous ones:  
    $$P(w_t | w_{1:t-1}) \tag{1}$$  
    
But in n-grams, we wanted to approximate $(1)$ by using just the $n$ words before $w_t$ as follows:  
    $$P(w_t | w_{t-n+1:t-1}) \approx P(w_t | w_{1:t-1}) \tag{2}$$

Remember the joint probability equation, assuming there's a dependence between variables $x$, $y$, and $z$ (which in our case are words):
    $$P(x \cap y \cap z) = P(x, y, z) = P(z) P(y | z) P(x | z, y) \tag{3}$$
Which can be generalized to:
    $$P(x_n, x_{n-1}, \dots, x_{1}) = \prod_{k = 1}^n{P(x_k | x_1, x_2, \dots, x_{k-1})} \tag{4}$$

And now that we can approximate the probability of $w_t$, we can approximate the joint probability of a sentence or a sequence of words $W$ of length $T$:  

$$P(w_T, w_{T-1}, \dots, w_1) \approx \prod_{k=1}^T{P(w_k | w_{k-n+1:k-1})} \tag{5}$$

This is the function we would want to maximize.
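
As a tiny sanity check of $(5)$, here is a sketch with $n = 2$ (a bigram model). The probabilities in the table are made up for illustration, and `"<s>"` is a hypothetical start-of-sentence marker I added so the first word has something to condition on.

```python
# Hypothetical bigram probabilities P(word | previous word); "<s>" marks the start.
bigram_probs = {
    ("<s>", "the"): 0.5,
    ("the", "cat"): 0.2,
    ("cat", "sat"): 0.1,
}


def sequence_probability(words, probs):
    """Multiply P(w_k | w_{k-1}) over the sequence, as in (5) with n = 2."""
    probability = 1.0
    previous = "<s>"
    for word in words:
        probability *= probs[(previous, word)]
        previous = word
    return probability


p = sequence_probability(["the", "cat", "sat"], bigram_probs)
print(p)
```

The result is just $0.5 \times 0.2 \times 0.1$, one factor per word, each conditioned only on the word before it.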

Side note for the joint probability in $(3)$: if we assume there's no dependence between $x$, $y$, and $z$, we have:
    $$P(x, y, z) = P(x) P(y) P(z) \tag{6}$$

We now want to use a neural network to find a good function $f$ that approximates this joint probability distribution for a specific corpus. $f$ would have a set of trainable parameters $\Theta$.  
Now from above and $(2)$:  
$$P(w_t | w_{t-n+1:t-1}) \approx P(w_t | w_{1:t-1}) \newline
\approx f(w_t, w_{t-1}...w_{t-n+1};\Theta) \tag{7}$$

You can notice that $f$ uses $w_t$ and the $n-1$ previous words $(w_{t-1}...w_{t-n+1})$, just like we do in n-gram modelling.  
Now, from $(5)$ & $(7)$, we want to get the joint probability of the whole word sequence $W$ of word count $T$:
$$P(w_T, w_{T-1}, \dots, w_1) \approx \prod_{k=1}^T{P(w_k | w_{k-n+1:k-1})} \newline
\approx \prod_{t = 1}^T{f(w_t, w_{t-1}...w_{t-n+1};\Theta)} \tag{8}$$  

<u>What are the parameters included in $\Theta$?</u>

The parameters depend on the model's architecture, which has 2 main parts (according to the architecture proposed in the paper):
1. Assuming we have a vocabulary $V$ of size $|V|$, we want to create a mapping (a lookup table) $C$ from each word $w_i$ to a vector $v_{w_i} \in \mathbb{R}^m$. In other words: $C(w_i) = v_{w_i} \mid v_{w_i} \in \mathbb{R}^m$.
2. The probability function $g$ that takes the vectors $C$ assigns to the input words and maps them to a probability distribution over the vocab $V$ (with $i$ being the index of $w_t$ in $V$):
    $$f(w_t, w_{t-1}...w_{t-n+1};\Theta) = g(i, C(w_{t-1}), \dots, C(w_{t-n+1}))$$

So $g$ outputs a vector of $|V|$ dimensions ($\in \mathbb{R}^{|V|}$), where the $i^{th}$ element corresponds to the probability of the $i^{th}$ word being the next token:
$$\hat{P}(w_t=i | w_{1:t-1})$$

Then $C$ would be a $|V| \times m$ matrix. In today's terms, you can also say that $C$ holds the word embeddings.

We now want to start defining the loss function, which will be based on $(8)$:
$$l = \prod_{t = 1}^T{f(w_t, w_{t-1}...w_{t-n+1};\Theta)} \tag{9}$$

We can transform the product into a summation by taking the log of both sides:
$$\log(l) = \log\left(\prod_{t = 1}^T{f(w_t, w_{t-1}...w_{t-n+1};\Theta)}\right) \newline
= \sum_{t = 1}^T{\log(f(w_t, w_{t-1}...w_{t-n+1};\Theta))} \tag{10}$$

**<u>Why do we take the log:</u>**  
* Multiplying a lot of probabilities makes the resulting value approach $0$, which leads to numerical underflow on a computer and can mess up the calculations and results.
* Likewise, multiplying a lot of large numbers gives an ever bigger result, eventually leading to an overflow.
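
The underflow point is easy to see for yourself. Here is a small sketch with 1000 made-up per-word probabilities of $0.01$ each (the numbers are only for illustration):

```python
import math

# 1000 hypothetical per-word probabilities, each equal to 0.01.
probabilities = [0.01] * 1000

# Multiplying them directly: the true value is 1e-2000, far below what a
# 64-bit float can represent.
product = 1.0
for p in probabilities:
    product *= p

# Summing the logs instead keeps the same information in a usable range.
log_sum = sum(math.log(p) for p in probabilities)

print(product)  # underflows to 0.0
print(log_sum)  # a finite negative number
```

Once the product hits `0.0` all the information is gone, while the sum of logs can still be compared and differentiated.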

What we reached in $(10)$ is what we call the log-likelihood. So, from $(10)$, averaging over the $T$ words, our loss function would be:
$$L = \frac{1}{T} \sum_{t = 1}^T{\log(f(w_t, w_{t-1}...w_{t-n+1};\Theta))} \tag{11}$$

We can also add a regularization function $R$ that regularizes the set of parameters $\Theta$, or a subset of $\Theta$ called $\Omega$:
$$L = \frac{1}{T} \sum_{t = 1}^T{\log(f(w_t, w_{t-1}...w_{t-n+1};\Theta))} + R(\Omega) \tag{12}$$

But I won't use the regularization term $R(\Omega)$ in this implementation, so we will use $(11)$ as our loss function for the implementation.

Now for the network's architecture:  
<p align="center">
    <img src="../images/model_architecture.jpeg" alt="Bengio, et al. 2003" width="600"/>
</p>  

The first layer is the embeddings layer, then we have just one hidden layer before the $Softmax$ function. They also describe that **we can optionally connect the embeddings layer directly to the softmax layer.**

**<u>What are the parameters we have?</u>**  

$$\Theta = (b, d, W, U, H, C)$$

| Parameter | Size | Optional | Description |
|:----------:|:----------:|:----------:|:----------:|
| $C$    | $\|V\| \times m$   | No   | The lookup table for the vector representation of each word in the vocabulary $V$.   |
| $H$    | $h \times (n-1) m$  | No   | The weights matrix for the hidden layer.   |
| $U$    | $\|V\| \times h$  | No   | The weights matrix for the hidden layer output to the softmax layer.   |
| $b$    | $\|V\|$  | No   | The output layer's biases.   |
| $d$    | $h$  | No   | The hidden layer's biases.   |
| $W$    | $\|V\| \times (n-1)m$  | Yes   | The weights matrix for optional connection between the embeddings layer and softmax layer.   |
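
To make the sizes in the table concrete, here is a small sketch that adds them up. The function name `count_parameters` and the example values are my own and only for illustration. With the optional $W$ included, the total simplifies algebraically to $|V|(1 + nm + h) + h(1 + (n-1)m)$.

```python
def count_parameters(v: int, m: int, n: int, h: int, direct: bool = False) -> int:
    """Count the entries of each parameter listed in the table above."""
    sizes = {
        "C": v * m,            # lookup table
        "H": h * (n - 1) * m,  # hidden layer weights
        "U": v * h,            # hidden-to-output weights
        "b": v,                # output biases
        "d": h,                # hidden biases
    }
    if direct:
        sizes["W"] = v * (n - 1) * m  # optional direct connections
    return sum(sizes.values())


# Example values chosen only for illustration.
total = count_parameters(v=10_000, m=60, n=5, h=50, direct=True)
print(total)
```

Notice that almost everything scales with $|V|$, so the vocabulary size dominates the parameter count.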

The logits $y$ of the output layer, which are passed to the softmax activation, are computed as follows, where $x$ is the concatenation of the $n-1$ context word embeddings:
$$y = b + Wx + U \tanh(d + Hx)$$

The softmax activation (predicted probability of word $w_t$ given the previous $n-1$ words):
$$\hat{P}(w_t | w_{t-n+1:t-1}) = \frac{e^{y_{w_t}}}{\sum_{i}{e^{y_i}}}$$
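
Before jumping to PyTorch, here is a plain-Python sketch of the two formulas above on tiny made-up sizes, just to see the shapes fit together. All the numbers and helper names (`matvec`, `softmax`, `forward`) are my own assumptions for illustration, and the softmax subtracts the max logit first, which is a common stability trick that doesn't change the result.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def softmax(logits):
    """Softmax with the max subtracted first for numerical stability."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def forward(x, H, d, U, b, W=None):
    """Compute softmax(b + Wx + U tanh(d + Hx)) for one concatenated input x."""
    hidden = [math.tanh(di + hi) for di, hi in zip(d, matvec(H, x))]
    logits = [bi + ui for bi, ui in zip(b, matvec(U, hidden))]
    if W is not None:  # optional direct connections
        logits = [yi + wi for yi, wi in zip(logits, matvec(W, x))]
    return softmax(logits)


# Tiny made-up sizes: |V| = 3, h = 2, (n - 1) * m = 4.
x = [0.1, -0.2, 0.3, 0.4]
H = [[0.5, -0.1, 0.0, 0.2], [0.3, 0.3, -0.4, 0.1]]
d = [0.0, 0.1]
U = [[0.2, -0.3], [0.0, 0.5], [-0.1, 0.1]]
b = [0.0, 0.0, 0.0]

probabilities = forward(x, H, d, U, b)
print(probabilities)
```

The output has one probability per vocabulary word and they sum to 1, which is all the softmax is there to guarantee.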


Perhaps that's enough reading, let's try to make a naive implementation of the architecture above with ***PyTorch***.  
And by ***naive*** I mean it's my first time building a Torch module that deals with sequences, so I probably won't do it the most efficient way.

# Neural Language Model

In [2]:
from torch import nn
import torch

# Our Neural Language Model class
class NLM(nn.Module):
    def __init__(self, v: int, m: int, n: int, h: int):
        super().__init__()
        self.v: int = v  #   Vocabulary size
        self.m: int = m  #   Embeddings dimensions
        self.n: int = n  #   Context size (the n-grams), we will actually input n-1 words to predict the n^th
        self.h: int = h  #   Hidden layer units (neurons)
        self.padding_token_idx: int = v  #   Index of the special token used to pad a sequence shorter than the context size

        #   I started with a randn tensor of shape (v, m), but it turns out that Torch has a class specifically for embeddings.
        #   This class stores the lookup table keyed by numerical indices, so we still have to map the words to indices
        #   (i.e. tokenize them) before passing them in
        self.c = nn.Embedding(
            num_embeddings=v + 1, embedding_dim=m, padding_idx=self.padding_token_idx
        )

        #   This way, the linear/activation layer takes the flattened embeddings and produces h outputs
        #   I did it this way as per the paper, which says that x (the output of the embeddings) is just the concatenation of all feature vectors
        #   But you could also keep each embedding vector separate, even with Linear layers,
        #   by making in_features equal to the embedding dimensions
        self.linear1 = nn.Linear(in_features=m * (n - 1), out_features=h)
        self.tanh = nn.Tanh()

        #   The output layer: a linear layer creates the logits, and a softmax activation normalizes them to probabilities
        self.linear2 = nn.Linear(
            in_features=h, out_features=v + 1
        )  #   + 1 for the padding token
        self.softmax = nn.LogSoftmax(dim=-1) #   Use log softmax because the NLLLoss function expects log probabilities, spent some time debugging this :)

    def forward(self, x):
        embeddings = self.c(x)

        if embeddings.dim() == 3:   #   It's a batch of inputs, not a single input: flatten each input's embeddings
            embeddings_flattened = embeddings.flatten(start_dim=1)
        else:   #   Single input, flatten the single input's embeddings
            embeddings_flattened = embeddings.flatten(start_dim=0)
            
        linear_activation = self.linear1(embeddings_flattened)
        non_linear_activation = self.tanh(linear_activation)

        logits = self.linear2(non_linear_activation)

        return logits

# Dataset 

I will try to stay as close to the paper as I can, so we will use the Brown corpus from nltk, which should be the same one they use in the paper, or at least **similar**.  
I don't want to drift off, so that we can compare the results we get to the ones in the paper. Any variation from the paper would probably mean different numbers, which means I could've either drifted too much or implemented something incorrectly.

## Understand the dataset

In [3]:
import nltk
from nltk.corpus import brown

nltk.download("brown")
brown

[nltk_data] Downloading package brown to /Users/shadyali/nltk_data...
[nltk_data]   Package brown is already up-to-date!


<CategorizedTaggedCorpusReader in '/Users/shadyali/nltk_data/corpora/brown'>

In [4]:
brown.sents()

[['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of', "Atlanta's", 'recent', 'primary', 'election', 'produced', '``', 'no', 'evidence', "''", 'that', 'any', 'irregularities', 'took', 'place', '.'], ['The', 'jury', 'further', 'said', 'in', 'term-end', 'presentments', 'that', 'the', 'City', 'Executive', 'Committee', ',', 'which', 'had', 'over-all', 'charge', 'of', 'the', 'election', ',', '``', 'deserves', 'the', 'praise', 'and', 'thanks', 'of', 'the', 'City', 'of', 'Atlanta', "''", 'for', 'the', 'manner', 'in', 'which', 'the', 'election', 'was', 'conducted', '.'], ...]

In [5]:
len(brown.words())

1161192

In [6]:
len(set(brown.words()))  # unique words

56057

In [7]:
non_words = [word for word in set(brown.words()) if not word.isalnum()]

In [8]:
dot_index = non_words.index(".")
non_words[dot_index : dot_index + 15]

['.',
 "U.S.'s",
 "Martin's",
 'Galveston-Port',
 'double-crossed',
 "sulky's",
 "cat's",
 'flag-stick',
 'seven-inch',
 "Pendleton's",
 'mud-caked',
 'base-stealing',
 'psycho-physiology',
 'Istiqlal-sponsored',
 'present-time']

So we have about $56k$ unique words, which includes punctuation, words mixed with punctuation, alphanumerics, etc.  
The paper merged all rare words with frequency $\leq 3$ into a single special symbol. I'm not sure I understand exactly how that was done, so I won't do that.

Let's do some regex for now to fix a few separate things, like *'herd-owner'*, etc., and then tokenize directly.

In [9]:
sentences = [
    " ".join(sent_words) for sent_words in brown.sents()
]  #   Join each sentence to be one string per sentence
sentences[:3]

["The Fulton County Grand Jury said Friday an investigation of Atlanta's recent primary election produced `` no evidence '' that any irregularities took place .",
 "The jury further said in term-end presentments that the City Executive Committee , which had over-all charge of the election , `` deserves the praise and thanks of the City of Atlanta '' for the manner in which the election was conducted .",
 "The September-October term jury had been charged by Fulton Superior Court Judge Durwood Pye to investigate reports of possible `` irregularities '' in the hard-fought primary which was won by Mayor-nominate Ivan Allen Jr. ."]

As you can notice, joining with " " puts a space in several places where we don't want one.  
So we'll use regex to solve this. **I ended up using ChatGPT to get almost all patterns :(**

In [10]:
import regex as re

#   Fix the space at the sentences end before the fullstop
#   And also the space before commas.

punctuation_pattern = re.compile(r"[ ](?:,)|[ ](?:\.$)")
sentences = [
    re.sub(
        punctuation_pattern,
        lambda match: match.group().replace(" ", ""),
        sentence,
    )
    for sentence in sentences
]
print(sentences[:3])

["The Fulton County Grand Jury said Friday an investigation of Atlanta's recent primary election produced `` no evidence '' that any irregularities took place.", "The jury further said in term-end presentments that the City Executive Committee, which had over-all charge of the election, `` deserves the praise and thanks of the City of Atlanta '' for the manner in which the election was conducted.", "The September-October term jury had been charged by Fulton Superior Court Judge Durwood Pye to investigate reports of possible `` irregularities '' in the hard-fought primary which was won by Mayor-nominate Ivan Allen Jr.."]


In [11]:
#   Lets fix quotes
quotes_pattern = re.compile(r"(?=``).*?(?<='')")

sentences = [
    re.sub(
        quotes_pattern,
        lambda match: match.group().replace("`` ", '"').replace(" ''", '"'),
        sentence,
    )
    for sentence in sentences
]
print(sentences[:3])

['The Fulton County Grand Jury said Friday an investigation of Atlanta\'s recent primary election produced "no evidence" that any irregularities took place.', 'The jury further said in term-end presentments that the City Executive Committee, which had over-all charge of the election, "deserves the praise and thanks of the City of Atlanta" for the manner in which the election was conducted.', 'The September-October term jury had been charged by Fulton Superior Court Judge Durwood Pye to investigate reports of possible "irregularities" in the hard-fought primary which was won by Mayor-nominate Ivan Allen Jr..']


In [12]:
full_text = " ".join(sentences)

In [13]:
full_text[5000:5500]

"torney. Hartsfield has been mayor of Atlanta, with exception of one brief interlude, since 1937. His political career goes back to his election to city council in 1923. The mayor's present term of office expires Jan. 1. He will be succeeded by Ivan Allen Jr., who became a candidate in the Sept. 13 primary after Mayor Hartsfield announced that he would not run for reelection. Georgia Republicans are getting strong encouragement to enter a candidate in the 1962 governor's race, a top official said"

## Tokenization

I will directly use one step of the regex used in [GPT-2's tokenizer](https://github.com/openai/gpt-2/blob/master/src/encoder.py), not the whole tokenizer.

In [14]:
import regex as re


#   Encoding function which splits a sequence (text) into a list
def _encode(sequence: str):
    pattern = re.compile(
        r"'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+",
        flags=re.IGNORECASE,
    )
    return re.findall(pattern=pattern, string=sequence)

You can understand what this regex pattern does from [here](https://regex101.com/r/L82kB5/1)

In [15]:
vocab = set(_encode(full_text))

In [16]:
len(vocab)

52908

In [17]:
words = [word for word in vocab]
words[20000:20020]

['body',
 ' scathingly',
 'Avesta',
 ' Saturn',
 ' Missail',
 ' Hasseltine',
 ' strengthens',
 'Flip',
 ' Bicycle',
 ' Danbury',
 ' emancipation',
 ' glomerular',
 ' Arcilla',
 ' bowels',
 ' NAEBM',
 ' 1637',
 ' Sons',
 ' radicals',
 ' brisker',
 ' alternation']

I feel like this would be good enough to just illustrate things, so I won't try to compress the vocabulary further.

We want to add a token for padding. We need padding when the sequence we have is shorter than the context window, or when a word is out of the vocab (one we haven't encountered before).
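
Here is a toy sketch of that idea before doing it on the real vocabulary. Everything in it (`toy_vocab`, `tokenize`, the left-padding choice) is made up for illustration. I sort the vocabulary here because iterating over a set has no guaranteed order, so sorting keeps the ids reproducible between runs.

```python
# Toy vocabulary; sorting makes the ids reproducible between runs,
# since iterating over a set has no guaranteed order.
toy_vocab = sorted({"the", " cat", " sat"})
toy_word_to_token = {word: token for token, word in enumerate(toy_vocab)}
padding_token = len(toy_vocab)  # one id past the last real word


def tokenize(words, context_size):
    """Map words to ids (unknown words get the padding id) and left-pad."""
    tokens = [toy_word_to_token.get(word, padding_token) for word in words]
    missing = max(context_size - len(tokens), 0)
    return [padding_token] * missing + tokens


print(tokenize(["the", " dog"], context_size=4))
```

Both the missing positions and the unknown word `" dog"` end up as the same id, which is the same shortcut I take below by reusing the padding token for out-of-vocab words.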

In [18]:
word_to_token = {word: token for token, word in enumerate(vocab)}

#   Add the special padding token
word_to_token[""] = len(vocab)

In [19]:
assert max(word_to_token.values()) == len(vocab)

In [20]:
list(word_to_token.items())[:5]

[(' presuming', 0),
 (' Concessionaires', 1),
 (' Hoosier', 2),
 (' vaporization', 3),
 (' Pyhrric', 4)]

In [21]:
#   Tokenization function
def _tokenize(words: list[str]) -> list[int]:
    tokens = [word_to_token.get(word, len(vocab)) for word in words]
    return tokens

Let's test encode and tokenize on the full text now

In [22]:
full_text_tokens = _tokenize(_encode(full_text))

In [23]:
_encode(full_text)[:15]

['The',
 ' Fulton',
 ' County',
 ' Grand',
 ' Jury',
 ' said',
 ' Friday',
 ' an',
 ' investigation',
 ' of',
 ' Atlanta',
 "'s",
 ' recent',
 ' primary',
 ' election']

In [24]:
full_text_tokens[:15]

[35297,
 32806,
 26855,
 23219,
 49016,
 18398,
 43290,
 33238,
 39018,
 35899,
 4323,
 24804,
 20212,
 49780,
 51177]

In [25]:
word_to_token["The"]

35297

In [26]:
_encode("I'am embedding things.")

['I', "'", 'am', ' embedding', ' things', '.']

In [27]:
_tokenize(_encode("I'am embedding things."))

[35232, 28713, 5228, 52908, 41222, 47699]

So there you have it! We can tokenize now.

Let's structure our data for training now.

## Structuring Dataset

We will do batch training, so we want to divide our corpus into batches of sequences.  
We will make a class for the dataset that creates a list of sliding windows of $n-1$ words, where the $n^{th}$ word is the label.  
The class will handle tokenization internally and provide functions to encode and decode words to tokens and vice versa.

<span style="color:red"> IMPORTANT NOTE: A MISTAKE I DID (I guess..)</span>

I tried the sliding window style above and I got some <span style="color:red">bad results</span>.  
I actually noticed this when I read the shape that the Perplexity class expects for the predictions and labels: it wants the ***sequence length!***  
I guess the problem is that, given a list of tokens like this:  
$$[0, 1, 2, 3, 4, 5, 6]$$
the sliding window builds the dataset below, considering $n=4$, where each row holds the input and the target variable/label (the correct token to be predicted):  
$$[0, 1, 2][3]$$
$$[1, 2, 3][4]$$
$$[2, 3, 4][5]$$
$$[3, 4, 5][6]$$

So now we lose the following (to the best of my understanding so far):  
* There are some words we never try to predict during training, losing rich information.
* We don't use the special padding token, so the model can't handle sequences shorter than the context size.
* The perplexity considers the sequence length to be 1 in that case (if I'm correct), so either this calculation is correct and I actually structured the data incorrectly, or I am calculating things wrong. 
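
For reference, the plain sliding window described above can be sketched like this (the function name `sliding_windows` is just for illustration, it is not part of the class below):

```python
def sliding_windows(tokens, n):
    """Return (context, label) pairs: n - 1 context tokens and the token after them."""
    pairs = []
    for start in range(len(tokens) - n + 1):
        context = tokens[start : start + n - 1]
        label = tokens[start + n - 1]
        pairs.append((context, label))
    return pairs


pairs = sliding_windows([0, 1, 2, 3, 4, 5, 6], n=4)
for context, label in pairs:
    print(context, label)
```

You can see the first bullet directly: tokens $0$, $1$, and $2$ never show up as labels, so the model is never asked to predict them.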


In [127]:
import regex as re
import torch
from torch.utils.data import Dataset
import torch.nn.functional as F

class Corpus(Dataset):
    def __init__(self, full_text: str, n: int):
        super().__init__()
        self.context_window: int = n

        self.vocab = set(self._encode(full_text))
        self.vocab_length: int = len(self.vocab)

        # Build both mappings from the instance's own vocabulary rather than the global `vocab`
        self.word_to_token: dict[str, int] = {
            word: token for token, word in enumerate(self.vocab)
        }
        self.word_to_token[""] = self.vocab_length

        self.token_to_word: dict[int, str] = {
            token: word for token, word in enumerate(self.vocab)
        }
        self.token_to_word[self.vocab_length] = ""

        self.full_text_tokenized: list[int] = self._tokenize(self._encode(full_text))
        self.define_windows()

    # I will reuse the functions and logic from previous cells
    def _encode(self, text: str) -> list[str]:
        pattern: re.Pattern[str] = re.compile(
            r"'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+",
            flags=re.IGNORECASE,
        )
        return re.findall(pattern=pattern, string=text)

    def _tokenize(self, words: list[str]) -> list[int]:
        tokens: list[int] = [
            self.word_to_token.get(word, self.vocab_length) for word in words
        ]
        return tokens

    def tokenize_text(self, text: str) -> torch.Tensor:
        return torch.tensor(self._tokenize(self._encode(text)))

    def decode(self, tokens: torch.Tensor) -> list[str]:
        tokens = (
            tokens.tolist() if tokens.shape else [tokens.tolist()]
        )  # A scalar tensor's tolist() returns a bare value rather than a list, which would break the loop below

        words = [self.token_to_word[token] for token in tokens]
        return words
    
    def decompose_window(self, window: list[int]) -> tuple[list[list[int]], list[int]]:
        padding_token = self.vocab_length
        context_window = self.context_window
        
        decomposed_windows = []
        decomposed_labels = []
        for index, nth_token in enumerate(window):
            decomposed_labels.append(nth_token)
            
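            # Everything before the current token is its context (built with the public torch.tensor constructor)
            context = torch.tensor(window[:index], dtype=torch.long)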
            if len(context) < context_window:
                empty_positions = ((context_window - 1) - len(context), 0)
                context = F.pad(
                    input=context,
                    pad=empty_positions,
                    mode='constant',
                    value=padding_token
                ).int().tolist()
            decomposed_windows.append(context)
        return decomposed_windows, decomposed_labels
    
    def define_windows(self) -> None:
        self.sequences = []
        self.next_token = []

        text_length = len(self.full_text_tokenized)
        for index in range(text_length):
            if index + self.context_window < text_length:
                window = self.full_text_tokenized[index : index + self.context_window]
                decomposed_windows, decomposed_labels= self.decompose_window(window)
                self.sequences.extend(decomposed_windows)
                self.next_token.extend(decomposed_labels)

    def __len__(self):
        return len(self.sequences)

    def __getitem__(self, window):
        return torch.tensor(self.sequences[window]), torch.tensor(
            self.next_token[window]
        )

In [126]:
import torch
import torch.nn.functional as F

def decompose_window(window: list[int]) -> tuple[list[list[int]], list[int]]:
    padding_token = 55
    context_window = 5

    decomposed_windows = []
    decomposed_labels = []
    for index, nth_token in enumerate(window):
        decomposed_labels.append(nth_token)

        # Everything before the current token is its context
        context = torch.tensor(window[:index], dtype=torch.long)
        if len(context) < context_window:
            # Left-pad the context up to context_window - 1 tokens
            empty_positions = ((context_window - 1) - len(context), 0)
            context = F.pad(
                input=context,
                pad=empty_positions,
                mode='constant',
                value=padding_token
            ).int().tolist()
        decomposed_windows.append(context)
    return decomposed_windows, decomposed_labels

decompose_window([1, 2, 3, 4, 5])

([[55, 55, 55, 55],
  [55, 55, 55, 1],
  [55, 55, 1, 2],
  [55, 1, 2, 3],
  [1, 2, 3, 4]],
 [1, 2, 3, 4, 5])

In [87]:
a = [0, 1, 2]
a[:0+1]

[0]

In [108]:
dataset = Corpus(full_text=full_text, n=4)

In [117]:
dataset.decode(dataset[61][0]), dataset.decode(dataset[62][1])

(['', '', ' produced'], ['no'])

In [None]:
dataset[1]

(tensor([52908., 52908., 32806.]), tensor(26855))

We can now use the DataLoader class to get an iterator to produce batches of training data for us

In [31]:
from torch.utils.data import DataLoader

data_iterator = iter(DataLoader(dataset=dataset, batch_size=50, shuffle=True))

In [32]:
input, label = next(data_iterator)
print(f"Feature batch shape: {input.shape}")
print(f"Labels batch shape: {label.shape}")

Feature batch shape: torch.Size([50, 3])
Labels batch shape: torch.Size([50])


The loader is working!!

I think we are ready now to split our data, ***AND TRAIN OUR MODEL***

# MODEL TRAINING

## Setup

Let's set up our model and move it to the GPU.

MPS is the GPU backend for Macs, so change it to CUDA if you're using an NVIDIA GPU.

In [33]:
device = (
    "mps"
    if torch.backends.mps.is_available()
    else "cuda"
    if torch.cuda.is_available()
    else "cpu"
)
device

'mps'

I will try to replicate the hyperparameters used in the paper's experiments, though I won't try every combination.  
For now let's do:  
* n = 5
* h = 50
* m = 60
* direct = No       <br>*There are no direct connections from the embeddings layer to the output*
* mix = No          <br>*Mixing the model with other models like n-grams (if I understood correctly)*

In [128]:
dataset = Corpus(full_text=full_text, n=5)

In [142]:
nlm = NLM(v=dataset.vocab_length, m=60, n=5, h=50)
nlm.to(device)

NLM(
  (c): Embedding(52909, 60, padding_idx=52908)
  (linear1): Linear(in_features=240, out_features=50, bias=True)
  (tanh): Tanh()
  (linear2): Linear(in_features=50, out_features=52909, bias=True)
  (softmax): LogSoftmax(dim=-1)
)

For our loss function, we will use the ***Negative Log Likelihood*** instead of the ***Log Likelihood***, so that our goal becomes minimizing instead of maximizing.  
So we will just add a negative sign to equation $(11)$:  
$$L = \frac{1}{T} \sum_{t = 1}^T{-\log(f(w_t, w_{t-1}...w_{t-n+1};\Theta))} \tag{13}$$

We will make our ***learning rate*** $\epsilon_0 = 10^{-3}$ and use a scheduler $\epsilon_t = \frac{\epsilon_0}{1 + rt}$ to decrease the rate, where $r = 10^{-8}$ is the decrease factor and $t$ is the number of parameter updates done.  
I'm not entirely sure what optimizer they use, but I believe it's ***Stochastic Gradient Descent***, as they mention it in section $3.1$ for each CPU update.  
They also mention a weight decay penalty of $10^{-4}$ in section $4.2$.
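
To see how slowly this schedule actually decays, here is a small plain-Python sketch of the formula. The function name `learning_rate` and the sample values of $t$ are mine, chosen only for illustration.

```python
def learning_rate(t: int, initial: float = 1e-3, r: float = 1e-8) -> float:
    """Learning rate after t parameter updates: initial / (1 + r * t)."""
    return initial / (1 + r * t)


for t in (0, 1_000_000, 100_000_000):
    print(t, learning_rate(t))
```

With $r = 10^{-8}$ the rate only halves after $10^8$ updates, so over a few epochs the decay is barely noticeable.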

In [143]:
learning_rate = 1e-3

loss_function = nn.NLLLoss()

optimizer = torch.optim.SGD(
    params=nlm.parameters(), lr=learning_rate, weight_decay=1e-4
)

Let's define the learning rate scheduler

In [144]:
# LambdaLR multiplies the initial learning rate by lr_lambda(step),
# where step counts how many times lr_scheduler.step() has been called
r = 1e-8
lr_scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer=optimizer, lr_lambda=lambda step: 1 / (1 + r * step)
)

Split the dataset

In [145]:
len(dataset)

5963720

In [146]:
from torch.utils.data import Subset
indices = list(range(len(dataset)))

train_size = 5500000
validation_size = 300000
test_size = 163720

train_set = Subset(
    dataset=dataset,
    indices=indices[:train_size]
)

validation_set = Subset(
    dataset=dataset,
    indices=indices[train_size : train_size + validation_size]
)

test_set = Subset(
    dataset=dataset,
    indices=indices[train_size + validation_size :]
)

## THE TRAINING LOOP

I will use some configurations from my last project for the loop

In [147]:
from tqdm import tqdm
from torch.utils.data import DataLoader

epochs = 15
batch_size = 160
nlm.train()
for epoch in range(epochs):
    data_iterator = iter(
        DataLoader(dataset=train_set, batch_size=batch_size)
    )

    batch_info = {}

    with tqdm(
        enumerate(data_iterator),
        total=len(data_iterator),
        desc="Training",
        unit="Batch",
    ) as batches:
        for batch_index, (inputs, labels) in batches:
            optimizer.zero_grad()
            
            inputs, labels = inputs.to(device), labels.to(device)
            predicted_log_probabilites = nlm.softmax(nlm(inputs))   # apply log_softmax here to the logits as the loss function expects log probabilities
            
            loss = loss_function(predicted_log_probabilites, labels)
            
            loss.backward()
            optimizer.step()
            
            batch_info["Loss"] = f'{loss.item(): .3f}'
            batches.set_postfix(batch_info)
            
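            # Clear memory for batch data
            del inputs, labels, predicted_log_probabilites
            if device == "mps":
                torch.mps.empty_cache()
            
        # Note: this steps the scheduler once per epoch, so the scheduler's counter counts epochs.
        # To have t count parameter updates as in the formula above, call it right after optimizer.step() instead.
        lr_scheduler.step()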

Training: 100%|██████████| 34375/34375 [14:53<00:00, 38.45Batch/s, Loss=8.176] 
Training: 100%|██████████| 34375/34375 [16:01<00:00, 35.76Batch/s, Loss=7.192] 
Training: 100%|██████████| 34375/34375 [15:58<00:00, 35.88Batch/s, Loss=6.828] 
Training: 100%|██████████| 34375/34375 [16:03<00:00, 35.69Batch/s, Loss=6.649] 
Training: 100%|██████████| 34375/34375 [15:51<00:00, 36.12Batch/s, Loss=6.551] 
Training: 100%|██████████| 34375/34375 [15:48<00:00, 36.23Batch/s, Loss=6.487] 
Training: 100%|██████████| 34375/34375 [16:02<00:00, 35.70Batch/s, Loss=6.441] 
Training: 100%|██████████| 34375/34375 [16:13<00:00, 35.31Batch/s, Loss=6.406] 
Training: 100%|██████████| 34375/34375 [15:59<00:00, 35.84Batch/s, Loss=6.380] 
Training: 100%|██████████| 34375/34375 [15:58<00:00, 35.87Batch/s, Loss=6.358] 
Training: 100%|██████████| 34375/34375 [16:00<00:00, 35.79Batch/s, Loss=6.339] 
Training: 100%|██████████| 34375/34375 [16:01<00:00, 35.77Batch/s, Loss=6.321] 
Training: 100%|██████████| 34375/34375 [

## Evaluation

Let's compute the perplexity as our metric since the paper uses it.  
* Perplexity measures the uncertainty the model has when predicting the next token.  
* So a smaller perplexity means a more certain model.
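
Concretely, perplexity is the exponential of the average negative log-probability the model gives to the correct tokens. Here is a plain-Python sketch with made-up probabilities (this is not the `torcheval` class used below, just the formula):

```python
import math


def perplexity(correct_token_probs):
    """exp of the mean negative log-probability given to the correct tokens."""
    nll = -sum(math.log(p) for p in correct_token_probs) / len(correct_token_probs)
    return math.exp(nll)


# A model that always spreads its probability evenly over 4 words
# is as uncertain as a fair 4-way guess, so its perplexity is 4.
uniform = perplexity([0.25, 0.25, 0.25, 0.25])
confident = perplexity([0.9, 0.8, 0.95, 0.85])
print(uniform, confident)
```

So you can read a perplexity of $k$ as "the model is about as unsure as if it were picking uniformly among $k$ words".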

In [175]:
t = torch.tensor([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16])
t.view((-1, 1)).shape

torch.Size([16, 1])

In [None]:
from nltk.lm import MLE
from nltk.lm.preprocessing import padded_everygram_pipeline
mle = MLE()

In [202]:
from torcheval.metrics.text import Perplexity
perplexity = Perplexity(ignore_index=dataset.vocab_length,)

data_iterator = iter(DataLoader(
    dataset=train_set,
    batch_size=nlm.n*250,
))

nlm.eval()
with torch.no_grad():
    for inputs, labels in data_iterator:
        # The model returns logits, which is what torcheval's Perplexity expects
        y_train_logits = nlm(inputs.to(device))

        perplexity.update(y_train_logits.cpu().unsqueeze(1), labels.unsqueeze(1))
        
        del inputs, labels, y_train_logits
        if device == "mps":
            torch.mps.empty_cache()
    score = perplexity.compute()
    
print(score)

tensor(1414.3811, dtype=torch.float64)


In [152]:
import gc

gc.collect()
torch.mps.empty_cache()

In [None]:
prompt = "dislike judges when the"  # must encode to n - 1 = 4 tokens

nlm.eval()
with torch.no_grad():
    next_token_logits = nlm(dataset.tokenize_text(prompt).to(device))
    best = next_token_logits.softmax(-1)

In [199]:
dataset.decode(best.argmax().cpu())

[' first']