# NGram Neural Language Model - Exercises

In this exercise, we will set up a neural ngram language model and prepare corresponding training and test data based on a short sample text.
We will train the model on our very small corpus and perform inference.

### Package Setup

If you have not done so already, install the following packages into your Python environment. (Remove the leading comment symbols `#` to execute the installation commands.)

In [None]:
#!pip install numpy==1.26.4
#!pip install torch==2.2.2
#!pip install torchtext==0.17.2
#!pip install matplotlib
#!pip install tensorboard

### Package Imports

Execute the following cell to load all required packages into your running interpreter.

In [None]:
import os
from typing import Dict, List

import torch
import torchtext
from matplotlib import pyplot as plt

## Preparation of training data

Before we can start to implement and train a language model, the training data needs to be considered and preprocessed adequately.

In this tutorial, we will be working with a small sample text taken from a newspaper article from October 2024.
The sample text can be found in the `data` subfolder of this notebook's directory.
We suggest manually inspecting the text file to get an idea of its contents.

The following code lines define the relative paths to the source file. 
Depending on your environment and IDE, you might need to adjust the paths according to your current working directory. 

In [None]:
from data_processing import data_dir
train_text_file_path = os.path.join(data_dir, "sample_text.txt")

assert os.path.isfile(train_text_file_path)

If the previous cell produced an error in the final assert statement, please check your current working directory with the subsequent code cell. 
Adjust the file paths above accordingly (so they contain the data file paths relative to your current directory). 

In [None]:
os.getcwd()

Next, we define a function that reads the contents of our data file into a single string.

In [None]:
def read_text_file(text_file_path: str) -> str:
    with open(text_file_path, "r", encoding="utf-8") as file:
        raw_text = file.read()
    return raw_text

raw_text = read_text_file(train_text_file_path)

To ensure the text was read properly and get an idea what the text looks like, let's print the first 500 characters of the read text: 

In [None]:
raw_text[:500]

In the next step, the text needs to be tokenized, i.e. split into small units like words (or subwords).

The most naive approach would be to split the text at every whitespace and use each word as one token.

The following code performs such a naive tokenization and assembles a sorted list of all extracted tokens.
Looking at a small excerpt of this list already reveals some problems of this naive approach.
What are they?

In [None]:
tokens_naive = raw_text.split()
tokens_naive_set = set().union(tokens_naive)
tokens_naive_sorted = sorted(tokens_naive_set)
tokens_naive_sorted[36:40]

The package torchtext provides out-of-the-box tokenizers for different languages and cases.
We will use the tokenizer for basic English to tokenize our text.
This tokenizer already handles important issues like letter casing, punctuation characters, or quotation marks.  
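
To build some intuition for what this kind of normalization does, here is a rough, hand-written approximation that uses only the standard library. It is not torchtext's actual implementation: the function name `approx_basic_english` and the regular expression are assumptions made purely for illustration.

```python
import re

def approx_basic_english(text):
    """Rough stand-in for a basic English tokenizer (illustration only)."""
    text = text.lower()
    # Keep runs of letters/digits as tokens and every other
    # non-whitespace character as a separate one-character token.
    return re.findall(r"[a-z0-9]+|[^\sa-z0-9]", text)

sample = 'Biden said: "Vote!" And biden, again, said it.'
naive = sample.split()
approx = approx_basic_english(sample)

print(naive)   # "Biden" and "biden," end up as two different tokens
print(approx)  # both occurrences are mapped to the same token "biden"
```

With whitespace splitting, casing and attached punctuation create spurious vocabulary entries, whereas the normalized version maps both occurrences to one token.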

In [None]:
tokenizer = torchtext.data.utils.get_tokenizer('basic_english')

The tokenizer can be applied to a text, thereby converting it to a list of tokens.

In [None]:
tokenized_text = tokenizer(raw_text)

If we compare the number of unique tokens produced by the naive tokenization with the number produced by the torchtext-provided tokenization, we see that it has indeed decreased.

In [None]:
print(len(tokens_naive_sorted))
print(len(sorted(set().union(tokenized_text))))

In this tutorial/demonstration case, we will split the data into a training and a test set in a very simple way:
The first 80% of the token sequence will be used for training, and the remaining 20% for test purposes.

(In real-world scenarios, a more sophisticated split, depending on the data structure, should be applied.
In most cases, a threefold split into train-validation-test is advisable.)  
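
As a minimal sketch of such a threefold split, the following helper cuts a token sequence into consecutive train, validation, and test parts. The function name `split_sequence`, the fractions, and the toy tokens are assumptions for illustration and are not used elsewhere in this notebook.

```python
def split_sequence(tokens, train_frac=0.8, val_frac=0.1):
    """Split a sequence into consecutive train / validation / test parts."""
    n = len(tokens)
    train_end = int(train_frac * n)
    val_end = train_end + int(val_frac * n)
    return tokens[:train_end], tokens[train_end:val_end], tokens[val_end:]

toy_tokens = [f"tok{i}" for i in range(20)]
toy_train, toy_val, toy_test = split_sequence(toy_tokens)
print(len(toy_train), len(toy_val), len(toy_test))
```

Keeping the parts consecutive preserves the word order within each part, which matters for n-gram data.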

In [None]:
train_amount = int(0.8 * len(tokenized_text))
tokenized_text_train = tokenized_text[:train_amount]
tokenized_text_test = tokenized_text[train_amount:]

print("Token amount: Train {} / Test {}".format(len(tokenized_text_train), len(tokenized_text_test)))

## Vocabulary Definition

In the following, we will use the training data to define the vocabulary and the procedure for transforming tokens into numeric representations.

First, we will manually build the vocabulary based on the sequence of tokens.

To do so, we first define a function which extracts all unique tokens and counts their number of appearances.
Thus, the function `get_vocabulary_and_frequencies(token_sequence)` receives a sequence of tokens and returns a dictionary.
Each of the dictionary's `str` keys corresponds to a unique token; the corresponding value counts how often that token appears within the given sequence.

Implement the function accordingly.

In [None]:
def get_vocabulary_and_frequencies(token_sequence: List[str]) -> Dict[str, int]:
    vocab = {}
    ### YOUR SOLUTION HERE
    ### END OF SOLUTION
    print(f"Created vocabulary of {len(vocab)} unique tokens.")
    return vocab

Execute and test the function.

In [None]:
vocab = get_vocabulary_and_frequencies(tokenized_text_train)

assert len(vocab) == 484
assert vocab["\'"] == 17
assert vocab["biden"] == 6
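
As a cross-check for your own implementation, the standard library's `collections.Counter` produces the same kind of token-to-count mapping. The toy token list below is made up for illustration.

```python
from collections import Counter

toy_tokens = ["the", "cat", "saw", "the", "dog", "."]
# Counter maps every distinct token to its number of occurrences.
toy_counts = dict(Counter(toy_tokens))
print(toy_counts)
```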

Next, we will use the token-frequency dictionary to create dictionaries for encoding tokens as integers and decoding them back.

To do so, we implement the function `get_encoding_decoding_dicts(vocabulary)`.
It receives the token-frequency-dictionary as input and produces two dictionaries:

* One dictionary for encoding: The keys are the unique tokens in the vocabulary, the values are unique integers assigned to each token.
* One dictionary for decoding: It reverses the key-value relation of the encoding dictionary.

When building the encoding dictionary for the `n` tokens within the vocabulary, choose the integers from `0` to `n-1` as values. 

Additionally, the token `"<unk>"` will be added to the vocabulary. 
It will be used as the default token for unknown tokens (e.g. words not contained in the training data).

In [None]:
def get_encoding_decoding_dicts(vocabulary: Dict[str, int]):
    vocab_encoding = {}
    vocab_decoding = {}
    ### YOUR SOLUTION HERE
    ### END OF SOLUTION
    unk_index = len(vocab_encoding)
    vocab_encoding["<unk>"] = unk_index
    vocab_decoding[unk_index] = "<unk>"
    return vocab_encoding, vocab_decoding

Now, we can call the function to build the encoding and decoding dictionaries.

In [None]:
encoding_vocab, decoding_vocab = get_encoding_decoding_dicts(vocab)

assert len(encoding_vocab) == len(decoding_vocab)
assert "biden" == decoding_vocab[encoding_vocab["biden"]]
assert "\'" == decoding_vocab[encoding_vocab["\'"]]
assert 123 == encoding_vocab[decoding_vocab[123]]

The encoding dictionary can be used to convert the tokenized training text to numerical values, which can be used as input to our language model.

Implement the function `encode_token_sequence(tokenized_sequence: List[str], encoding_dict:Dict[str, int], default_token = "<unk>")` which encodes a sequence of tokens based on the given dictionary.
Whenever the sequence contains a token which is not contained in the encoding dictionary, use the default token instead.

In [None]:
def encode_token_sequence(tokenized_sequence: List[str], encoding_dict:Dict[str, int], default_token = "<unk>"):
    # query default token encoding once
    default_token_encoded = encoding_dict[default_token]
    
    encoded_sequence = []
    # iterate through token sequence and encode tokens one by one
    ### YOUR SOLUTION HERE
    ### END OF SOLUTION
    return encoded_sequence

We apply the just implemented function `encode_token_sequence` and the encoding dictionary to encode our training and test token sequences to numerical arrays.

In [None]:
encoded_data_train = encode_token_sequence(tokenized_text_train, encoding_vocab, default_token = "<unk>")
encoded_data_test = encode_token_sequence(tokenized_text_test, encoding_vocab, default_token = "<unk>")

assert len(encoded_data_train) == len(tokenized_text_train)
assert len(encoded_data_test) == len(tokenized_text_test)
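
To illustrate the intended behavior on a toy scale, the following sketch builds an encoding and a decoding dictionary for a three-word vocabulary and encodes a sentence containing an unknown word. All names prefixed with `toy_` are made up for this example.

```python
toy_vocab = ["cat", "dog", "the"]

# Assign the integers 0..n-1 to the tokens, then append "<unk>".
toy_encoding = {token: index for index, token in enumerate(toy_vocab)}
toy_encoding["<unk>"] = len(toy_encoding)
toy_decoding = {index: token for token, index in toy_encoding.items()}

unk_id = toy_encoding["<unk>"]
toy_sentence = ["the", "bird", "cat"]  # "bird" is not in the vocabulary
toy_encoded = [toy_encoding.get(token, unk_id) for token in toy_sentence]
toy_decoded = [toy_decoding[index] for index in toy_encoded]

print(toy_encoded)
print(toy_decoded)
```

Note that decoding does not recover the unknown word: it comes back as `"<unk>"`.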

Next, we build our own PyTorch-compatible `Dataset`-class `NGramDataset` for n-gram text data.

In general, a `Dataset` class has access to a complete data set and makes its samples accessible to other PyTorch classes and functions. 

To use custom-organized data with PyTorch and its training functionality, one can implement a subclass of `torch.utils.data.Dataset`.
In doing so, it is mandatory to implement the class initializer `__init__` and the methods `__getitem__` and `__len__`.

In our case, the class initializer takes the tensor containing the complete encoded corpus and the n-gram size.
Both are stored within the class as member variables and are then easily accessible within other class methods.

The method `__getitem__` receives an integer index as input and returns the corresponding data sample, i.e. the corresponding `x`- and `y`-values.
In the case of n-gram data, the `x` (input) value consists of `n-1` consecutive entries in the data, and the corresponding `y` (target) value is the entry that immediately follows them.
For example, for `n=4`, calling `__getitem__()` with `index=0` returns a tuple whose first element contains the first three encoded tokens in the data and whose second element is the fourth encoded token.

The method `__len__` has no arguments and returns the number of samples within the dataset.
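
The windowing can be illustrated on a plain Python list, independently of PyTorch. In this sketch (the corpus values are made up), each sample pairs `n-1` consecutive entries with the entry that follows them, so a corpus of length `L` yields `L - n + 1` samples.

```python
toy_corpus = [10, 11, 12, 13, 14, 15]
n = 4

# Slide a window of length n over the corpus: the first n-1 entries
# form the input, the last entry of the window is the target.
toy_samples = [
    (toy_corpus[i : i + n - 1], toy_corpus[i + n - 1])
    for i in range(len(toy_corpus) - n + 1)
]

for x, y in toy_samples:
    print(x, "->", y)
```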

Implement the two methods below.


In [None]:
class NGramDataset(torch.utils.data.Dataset):

    def __init__(self, corpus, ngram_size):
        super().__init__()
        self.ngram_size = ngram_size
        self.complete_corpus = torch.tensor(corpus)


    def __getitem__(self, index):
        ### YOUR SOLUTION HERE
        ### END OF SOLUTION
    

    def __len__(self):
        ### YOUR SOLUTION HERE
        ### END OF SOLUTION

Now that the dataset class is defined, we wrap our training and test data in it.

Further, we choose `n=5`.

In [None]:
ngram_length = 5

dataset_train = NGramDataset(encoded_data_train, ngram_length)
dataset_test = NGramDataset(encoded_data_test, ngram_length)

assert len(dataset_train) == 1200
assert len(dataset_test) == 297

Next, we use PyTorch's `DataLoader` to provide batches of data for the training and evaluation process.
This class also takes care of reshuffling the training data in every epoch.

When training and evaluating our model, we will iterate over these data loaders.

In [None]:
dl_train = torch.utils.data.DataLoader(dataset_train, shuffle=True, batch_size=32)
dl_test = torch.utils.data.DataLoader(dataset_test, batch_size=32)

## Model Definition and Implementation

Now, we define our neural n-gram model as a subclass of `torch.nn.Module` (compare to previous exercises).

Our neural n-gram model consists of an embedding layer, which maps the integer-encoded input tokens to a multidimensional embedding space.
Next, a fully-connected layer with ReLU activation and another fully-connected layer follow.
The output dimension of the latter fully-connected layer corresponds to the number of unique tokens in the vocabulary, with each unit giving a score for the corresponding token.

In the following, the class initializer, which initializes all required neural network components, is already implemented.
Implement the class' `forward()`-method which computes the model's forward pass.
The input to the `forward()`-function can be assumed to be of shape `(batch_size, n-1)`.

Note that the embedding layer produces a 3-dimensional tensor of shape `(batch_size, n-1, embedding_dim)`.
This tensor needs to be flattened to 2 dimensions (`(batch_size, (n-1) * embedding_dim)`) before it can be input into the first linear layer.
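
The shape change can be checked in isolation. The following sketch uses small, made-up dimensions and a standalone embedding layer; `flatten(start_dim=1)` is one of several ways to merge the last two dimensions (`view` or `reshape` would work as well).

```python
import torch

batch_size, context_len, embedding_dim, vocab_size = 2, 4, 3, 10

toy_embedding = torch.nn.Embedding(vocab_size, embedding_dim)
toy_input = torch.zeros((batch_size, context_len), dtype=torch.long)

embedded = toy_embedding(toy_input)        # (batch_size, n-1, embedding_dim)
flattened = embedded.flatten(start_dim=1)  # (batch_size, (n-1) * embedding_dim)

print(embedded.shape, flattened.shape)
```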

In [None]:
class NeuralNgram(torch.nn.Module):

    def __init__(self, ngram_size: int, vocab_size: int, embedding_dim: int = 64, hidden_dim: int = 128):
        super(NeuralNgram, self).__init__()
        self.ngram_size = ngram_size
        self.embeddings = torch.nn.Embedding(vocab_size, embedding_dim)
        self.linear_1 = torch.nn.Linear((ngram_size - 1) * embedding_dim, hidden_dim)
        self.activation = torch.nn.ReLU()
        self.linear_2 = torch.nn.Linear(hidden_dim, vocab_size)

    def forward(self, x):
        ### YOUR SOLUTION HERE
        ### END OF SOLUTION

A simple test for your implementation follows:

In [None]:
model = NeuralNgram(ngram_length, len(encoding_vocab))

test_tensor = torch.zeros((32, ngram_length - 1)).long()
model_out = model(test_tensor)

assert model_out.shape[0] == 32
assert model_out.shape[1] == len(encoding_vocab)

We instantiate our model as an instance of the class `NeuralNgram`.
When initializing the class, `n=5` is set according to the choice above and the number of unique tokens in our vocabulary is passed as input parameter.

Further, we initialize the Cross Entropy Loss and a stochastic gradient descent optimizer for the model's parameters.

In [None]:
model = NeuralNgram(ngram_length, len(encoding_vocab))

loss = torch.nn.CrossEntropyLoss(reduction='sum')
optimizer = torch.optim.SGD(lr=0.01, params=model.parameters())
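
One plausible reason for summing the loss within each batch (`reduction='sum'`) and dividing by the dataset size afterwards is that the result is an exact per-sample average even when the last batch is smaller than the others. The following sketch uses made-up per-sample loss values to compare this with averaging the batch means.

```python
# Made-up per-sample losses for two batches; the last batch is smaller.
per_sample_losses = [[1.0, 2.0, 3.0], [6.0]]

n_samples = sum(len(batch) for batch in per_sample_losses)

# Sum per batch, divide once by the number of samples (as in this notebook).
mean_from_sums = sum(sum(batch) for batch in per_sample_losses) / n_samples

# Average per batch, then average the batch means (overweights the small batch).
mean_of_batch_means = sum(
    sum(batch) / len(batch) for batch in per_sample_losses
) / len(per_sample_losses)

print(mean_from_sums, mean_of_batch_means)
```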

In [None]:
n_epochs = 30

train_losses = []
test_losses = []

for i in range(n_epochs):
    # Perform training epoch
    model.train()
    epoch_loss = 0
    for data_x, data_y in dl_train:
        ### YOUR SOLUTION HERE
        # 1) run model on data
        # 2) compute loss (note: target values need to be converted to long before being passed to the cross-entropy-loss-function)
        ### END OF SOLUTION
        epoch_loss += batch_loss.cpu().detach().numpy()
        optimizer.zero_grad()
        batch_loss.backward()
        optimizer.step()

    epoch_loss = epoch_loss / len(dataset_train)
    train_losses.append(epoch_loss)

    # Evaluate on test data (no gradients are needed here)
    model.eval()
    test_loss = 0
    with torch.no_grad():
        for test_x, test_y in dl_test:
            predictions = model(test_x)
            batch_loss = loss(predictions, test_y.long())
            test_loss += batch_loss.cpu().detach().numpy()

    test_loss = test_loss / len(dataset_test)
    test_losses.append(test_loss)

    if i%5 == 0:
        print("Epoch {}".format(i+1))
        print("Train loss: {}".format(epoch_loss))
        print("Test Loss: {}".format(test_loss))

Looking at the evolution of training and test loss, it becomes obvious that overfitting is an issue in our training pipeline.

What do you think is the main reason for overfitting in this case?

In [None]:
plt.figure()
plt.plot(train_losses, label="Train Loss")
plt.plot(test_losses, label="Test Loss")
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.legend()
plt.show()

Next, we want to compute the model's perplexity on training and test data.

To do so, we exponentiate the average cross-entropy loss per sample:

In [None]:
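def compute_perplexity(model, data_loader):
    # Sum the per-sample losses so we can average over the whole dataset
    loss = torch.nn.CrossEntropyLoss(reduction="sum")
    loss_sum = 0
    model.eval()
    with torch.no_grad():  # no gradients are needed for evaluation
        for data_x, data_y in data_loader:
            scores = model(data_x)
            loss_sum += loss(scores, data_y.long())

    # Perplexity is the exponential of the mean cross-entropy
    perplexity = torch.exp(loss_sum / len(data_loader.dataset))
    return perplexity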

In [None]:
ppl_train = compute_perplexity(model, dl_train)
print(ppl_train)

In [None]:
ppl_test = compute_perplexity(model, dl_test)
print(ppl_test)
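
For intuition about the scale of these values, the relation between perplexity and the probabilities assigned to the correct tokens can be reproduced with plain Python. In this sketch (the helper name `perplexity_from_probs` and the numbers are made up), a model that spreads its probability uniformly over a vocabulary of size `V` has perplexity `V`, while a model that assigns probability 0.9 to every correct token has a perplexity close to 1.

```python
import math

def perplexity_from_probs(target_probs):
    """Perplexity from the probabilities assigned to the correct tokens."""
    nll_sum = -sum(math.log(p) for p in target_probs)
    return math.exp(nll_sum / len(target_probs))

toy_vocab_size = 8
ppl_uniform = perplexity_from_probs([1 / toy_vocab_size] * 5)
ppl_confident = perplexity_from_probs([0.9] * 5)

print(ppl_uniform, ppl_confident)
```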

Both the loss values printed during training and the perplexity values again indicate that our model has severely overfitted the training data.

Nevertheless, we want to implement inference on our neural n-gram model and have the model generate a few lines of text.

The function `inference(ngram_words, encoding_vocab, decoding_vocab, model)` takes as arguments:
* A list of n-1 tokens which correspond to the beginning of an arbitrary n-gram.
* Encoding and decoding vocabulary to convert tokens to numerical representations and vice versa
* The trained n-gram model

It queries the model's predictions for the n-gram and extracts the token with the highest score, which is returned by the function. 
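
The selection step can be illustrated without a model. In this sketch, the scores and the decoding dictionary are made up; with PyTorch tensors, `torch.argmax` plays the role of the `max(...)` call below.

```python
toy_decoding = {0: "the", 1: "election", 2: "was", 3: "<unk>"}
toy_scores = [0.1, 2.5, -0.3, 0.0]  # one unnormalized score per vocabulary entry

# Index of the highest score, then decode it back to a token.
best_index = max(range(len(toy_scores)), key=lambda i: toy_scores[i])
toy_next_word = toy_decoding[best_index]

print(toy_next_word)
```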

In [None]:
def inference(ngram_words, encoding_vocab, decoding_vocab, model):
    ngram_tokens = [encoding_vocab[token] for token in ngram_words]
    token_tensor = torch.tensor([ngram_tokens], dtype=torch.long)
    ### YOUR SOLUTION HERE
    # Execute model and find token (word) with the highest predicted score
    ### END OF SOLUTION
    return next_word

Next, we define a starting n-gram (four words from the vocabulary) and let our model predict the next words iteratively.

In [None]:
start_ngram = ["the", "presidential", "election", "was"]
current_ngram = start_ngram
complete_sequence = " ".join(start_ngram) + " "
for i in range(100):
    ### YOUR SOLUTION HERE
    # have the model make a word prediction
    # update current_ngram for the next iteration
    # append the predicted word to the complete sequence (type str, human-readable)
    ### END OF SOLUTION
print(complete_sequence)
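
The sliding-window update behind such iterative generation can be sketched with a stand-in predictor. The function `toy_predict` below is hypothetical and simply returns a fixed follow-up word for the last context word; it is not the trained model.

```python
def toy_predict(context):
    """Hypothetical stand-in for the model: fixed follow-up per last word."""
    follow_up = {"a": "b", "b": "c", "c": "a"}
    return follow_up[context[-1]]

context = ["a", "b"]
generated = list(context)
for _ in range(4):
    next_word = toy_predict(context)
    # Drop the oldest word and append the prediction to keep the length fixed.
    context = context[1:] + [next_word]
    generated.append(next_word)

toy_sequence = " ".join(generated)
print(toy_sequence)
```

Because each prediction depends only on the current window, a deterministic predictor that revisits an earlier context will repeat the same continuation from then on.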

The model predictions reflect the very small amount of training data, which the model has severely overfitted.
That is, after a few iterations, it starts repeating complete text passages from the training data.

Still, compared to statistical (count-based) n-gram models, it is able to process n-grams that were not seen in the training data ("the presidential election was" is not a sequence present in our sample text).

## Summary of Results and Observations

#### Model Performance:  
* The training loss decreased steadily over epochs, indicating that the model successfully learned patterns from the training data.
* However, the test loss and perplexity values suggest significant overfitting due to the small dataset size.

#### Overfitting:  
* The model memorized the training data, as evidenced by the repetitive text generation during inference.
* This overfitting is likely caused by the limited training data and the absence of regularization techniques.

#### Inference Results:  
* The model was able to generate coherent sequences for n-grams not seen during training, demonstrating its ability to generalize to some extent.
* However, the generated text quickly devolved into repetitive patterns, reflecting the overfitting issue.

#### Key Takeaways:
* A complete pipeline for implementing, training, and evaluating a simple neural n-gram model was demonstrated, showcasing the steps from data preprocessing to inference. 
* Increasing the dataset size or using a more diverse corpus would likely improve generalization.
* Regularization techniques such as dropout, weight decay, or early stopping could help mitigate overfitting.
* Experimenting with different model architectures or hyperparameters (e.g., embedding size, hidden layer size) could further enhance performance.
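
As a minimal sketch of early stopping, the following helper scans a list of per-epoch validation losses and stops once the loss has not improved for `patience` epochs. The function name, the patience value, and the loss values are assumptions for illustration; this notebook does not use a validation set.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return (best_epoch, stop_epoch) for a sequence of validation losses."""
    best_loss, best_epoch, stop_epoch = float("inf"), -1, -1
    for epoch, loss_value in enumerate(val_losses):
        stop_epoch = epoch
        if loss_value < best_loss:
            best_loss, best_epoch = loss_value, epoch
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs
    return best_epoch, stop_epoch

toy_val_losses = [5.0, 4.2, 3.9, 4.0, 4.1, 4.3, 4.6]
best_epoch, stop_epoch = early_stopping_epoch(toy_val_losses)
print(best_epoch, stop_epoch)
```

In a real training loop, one would typically also save the model parameters at the best epoch and restore them after stopping.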

#### Future Work:
* Implement a validation set to monitor overfitting during training.
* Explore alternative tokenization methods or pre-trained embeddings for better text representation.
* Test the model on larger and more realistic datasets to evaluate its scalability and robustness.