# LSTM Language Model - Inference

In this tutorial, we focus on an LSTM language model trained on the open-source dataset wikitext-103.
The dataset consists of verified Wikipedia articles and contains more than 100 million tokens.

Because the size of the dataset and of the model rules out quick training on commonly available hardware, we only cover the set-up of data and model here and work with a pretrained model.
The training was performed on a single GPU for 12 epochs.

We will see that the resulting model has some text generation abilities but is far from proficient in the language.

In [None]:
import os

import torch
import torchtext
from matplotlib import pyplot as plt
from sklearn.decomposition import PCA

## The Dataset

The attached file `data_processing.py` contains functions for loading and processing the data of the **wikitext-103** dataset, which consists of verified Wikipedia articles.

More information on the dataset [is available here](https://www.kaggle.com/datasets/vadimkurochkin/wikitext-103).

The data itself can be found in this directory's subfolder `data/wikitext-103`.
It consists of three data files: `wiki.train.tokens`, `wiki.valid.tokens` and `wiki.test.tokens`, containing train, validation and test data, respectively.

As in the previous exercise, the `basic_english` tokenizer provided by `torchtext` was used for text tokenization.

We load the vocabulary built from the dataset's training split as an instance of torchtext's class `Vocab`.

As the training split is very large, we do not load it at this point.
Instead, we use the function `load_dataset_from_file` to read the test data file, encode it with the tokenizer and the vocabulary, and load it into a PyTorch-compatible dataset class.

In [None]:
from data_processing import load_vocab, load_dataset_from_file, data_dir_wikitext_103 as data_dir
tokenizer = torchtext.data.utils.get_tokenizer('basic_english')
vocab = load_vocab(os.path.join(data_dir, "vocab.pkl"))

We can query the size of the vocabulary as

In [None]:
len(vocab)

A token (word) can be encoded as

In [None]:
vocab["something"]

and decoded as

In [None]:
vocab.get_itos()[1001]
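
Conceptually, the vocabulary is a pair of lookup tables: token to index and index to token. The following stand-in in plain Python is only a sketch of that idea. `ToyVocab` and its tiny token list are hypothetical and not part of torchtext or `data_processing.py`. It also assumes that unknown words fall back to the index of `<unk>`.

```python
class ToyVocab:
    """Hypothetical stand-in that mimics the lookups used in this notebook."""

    def __init__(self, tokens, unk_token="<unk>"):
        self.itos = list(tokens)                              # index -> token
        self.stoi = {t: i for i, t in enumerate(self.itos)}   # token -> index
        self.unk_index = self.stoi[unk_token]

    def __len__(self):
        return len(self.itos)

    def __getitem__(self, token):
        # unknown tokens map to the index of "<unk>" (assumed default index)
        return self.stoi.get(token, self.unk_index)

    def __call__(self, tokens):
        # encode a whole list of tokens at once
        return [self[t] for t in tokens]

    def get_itos(self):
        return self.itos


toy_vocab = ToyVocab(["<unk>", "<pad>", "<sos>", "<eos>", "the", "fox"])
encoded = toy_vocab(["<sos>", "the", "quick", "fox", "<eos>"])
decoded = [toy_vocab.get_itos()[i] for i in encoded]
print(encoded)  # [2, 4, 0, 5, 3]
print(decoded)  # ['<sos>', 'the', '<unk>', 'fox', '<eos>']
```

Note that the round trip is lossy for words outside the vocabulary: "quick" comes back as `<unk>` in this toy example.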

The data is organized in sentences, where each sentence starts with a dedicated start token (`<sos>`) and ends with a dedicated end token (`<eos>`).
Each sentence is padded to an identical length, so that arbitrary sequences can be batched in a single tensor.
For this purpose, a pad token (`<pad>`) is used.
It fills the positions after the end of the actual sequence and is explicitly ignored during model training and evaluation (when computing losses, gradients and metrics).
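
How the training code ignores the pad token is not shown in this notebook. One common mechanism in PyTorch is the `ignore_index` argument of `torch.nn.CrossEntropyLoss`; the sketch below assumes this mechanism and uses made-up scores and a hypothetical pad index.

```python
import torch

torch.manual_seed(0)
pad_index = 1      # hypothetical index of "<pad>"
vocab_size = 6

# made-up model scores for one sequence with 5 positions
logits = torch.randn(5, vocab_size)
# targets: three real tokens followed by two pad tokens
targets = torch.tensor([4, 5, 3, pad_index, pad_index])

# positions whose target equals pad_index do not contribute to the loss
loss_fn = torch.nn.CrossEntropyLoss(ignore_index=pad_index)
loss_ignoring_pad = loss_fn(logits, targets)

# reference: mean loss over the three non-pad positions only
loss_real_only = torch.nn.functional.cross_entropy(logits[:3], targets[:3])
print(loss_ignoring_pad.item(), loss_real_only.item())
```

Both values agree, which shows that the padded positions neither change the loss nor receive gradients through it.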

Each data sample contains the complete encoded sequence as x-value.
The corresponding y-value (target) is the same sequence shifted one position to the left.
The reason is that in every step, one token of the input is fed into the recurrent LSTM model, and the next token of the sequence is the model's expected output.
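
The construction of such an (x, y) pair can be sketched in plain Python. The helper `make_sample` and the small `stoi` table are hypothetical; the actual implementation in `data_processing.py` may differ, for example in how overly long sentences are handled.

```python
def make_sample(tokens, stoi, max_seq_length):
    """Hypothetical helper: encode one sentence and build the pair (x, y)."""
    ids = [stoi["<sos>"]] + [stoi[t] for t in tokens] + [stoi["<eos>"]]
    ids = ids[:max_seq_length]  # assumption: overly long sentences are truncated
    pad = stoi["<pad>"]
    x = ids + [pad] * (max_seq_length - len(ids))
    # y is x shifted one position to the left; the freed last slot is padded
    y = x[1:] + [pad]
    return x, y


stoi = {"<pad>": 0, "<sos>": 1, "<eos>": 2, "the": 3, "fox": 4, "jumps": 5}
x, y = make_sample(["the", "fox", "jumps"], stoi, max_seq_length=8)
print(x)  # [1, 3, 4, 5, 2, 0, 0, 0]
print(y)  # [3, 4, 5, 2, 0, 0, 0, 0]
```

At position `i`, the model sees `x[i]` and is expected to predict `y[i]`, which is `x[i + 1]`.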

In [None]:
# The following code only works if the wikitext-103 dataset is available in the specified data_dir.
# Otherwise, this cell can be ignored. The rest of the notebook works without loading the dataset.
dataset_test = load_dataset_from_file(os.path.join(data_dir, "wiki.test.tokens"), tokenizer, vocab, max_seq_length=500, pad_token_index=vocab["<pad>"])
sample_x, sample_y = dataset_test[0]
assert torch.all(sample_x[1:] == sample_y[0:-1])

## Restoring the Pretrained Model

To be able to load the pretrained model weights, the identical model architecture needs to be constructed first. 

The following class, `LM_LSTM_Model`, contains the model architecture of our pretrained model.

Inspect the architecture and notice the use of the `LSTM` class provided by PyTorch.
In particular, note that the `forward()` method takes an additional argument: `hidden`.
This argument contains the network's states (both the hidden state and the cell state) after processing the previous sequence element.

Note that these states have to be set to zero before processing a new sequence.
To initialize new states, the method `init_hidden` is provided.
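
To see why the states are passed along, the following sketch uses a small, untrained `torch.nn.LSTM` (the sizes are arbitrary and chosen for illustration only). Processing a sequence in one call gives the same result as processing it token by token, provided the state returned by each step is fed into the next one.

```python
import torch

torch.manual_seed(0)
lstm = torch.nn.LSTM(input_size=4, hidden_size=8, num_layers=2, batch_first=True)
lstm.eval()

x = torch.randn(1, 5, 4)  # (batch, seq_len, features)
# zero-initialized (hidden state, cell state), each (num_layers, batch, hidden_size)
zero_state = (torch.zeros(2, 1, 8), torch.zeros(2, 1, 8))

with torch.no_grad():
    # whole sequence in a single call
    out_full, state_full = lstm(x, zero_state)

    # same sequence one step at a time, passing the state along
    state = zero_state
    step_outputs = []
    for t in range(x.shape[1]):
        out_t, state = lstm(x[:, t:t + 1, :], state)
        step_outputs.append(out_t)
    out_steps = torch.cat(step_outputs, dim=1)

print(torch.allclose(out_full, out_steps, atol=1e-5))
```

This is the property that text generation relies on: only the newest token has to be fed to the model, because everything seen before is summarized in the state.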

In [None]:
class LM_LSTM_Model(torch.nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, num_layers, dropout_rate):
        super(LM_LSTM_Model, self).__init__()
        self.num_layers = num_layers
        self.hidden_dim = hidden_dim
        self.embedding_dim = embedding_dim
        embedding = torch.nn.Embedding(vocab_size, embedding_dim)
        self.embedding = embedding
        self.lstm = torch.nn.LSTM(embedding_dim, hidden_dim, num_layers=num_layers,
                                  dropout=dropout_rate, batch_first=True)
        self.dropout = torch.nn.Dropout(dropout_rate)
        self.fc = torch.nn.Linear(hidden_dim, vocab_size)

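    def forward(self, x, hidden):
        # x: (batch, seq_len) token indices; hidden: tuple (hidden state, cell state)
        embedding = self.dropout(self.embedding(x))
        output, hidden = self.lstm(embedding, hidden)
        output = self.dropout(output)
        # prediction: (batch, seq_len, vocab_size) unnormalized scores for the next token
        prediction = self.fc(output)
        return prediction, hidden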

    def init_hidden(self, batch_size, device="cpu"):
        hidden = torch.zeros(self.num_layers, batch_size, self.hidden_dim).to(device)
        cell = torch.zeros(self.num_layers, batch_size, self.hidden_dim).to(device)
        return hidden, cell

We initialize the model (using the same parameters as in training) and set it to evaluation mode.
(In evaluation mode, layers that behave differently during training are switched to inference behavior, e.g. dropout is not applied. Gradient tracking is not affected by `eval()`; it is disabled separately with `torch.no_grad()`.)

In [None]:
embedding_dim = 512
hidden_dim = 1024
num_layers = 3

model = LM_LSTM_Model(len(vocab), embedding_dim, hidden_dim, num_layers, 0)
model.eval()
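
What `eval()` does and does not change can be checked on a toy layer. The small `Sequential` model below is hypothetical and only serves as an illustration: in evaluation mode dropout becomes the identity, while gradient tracking stays on until it is switched off with `torch.no_grad()`.

```python
import torch

layer = torch.nn.Sequential(torch.nn.Linear(4, 4), torch.nn.Dropout(p=0.5))
x = torch.ones(1, 4)

layer.eval()  # dropout becomes the identity
out_a = layer(x)
out_b = layer(x)
same_in_eval = torch.equal(out_a, out_b)      # no randomness left
tracks_grad_in_eval = out_a.requires_grad     # eval() alone still tracks gradients

with torch.no_grad():  # switches gradient tracking off
    out_c = layer(x)
tracks_grad_in_no_grad = out_c.requires_grad

print(same_in_eval, tracks_grad_in_eval, tracks_grad_in_no_grad)  # True True False
```

For pure inference it is therefore common to combine both: `model.eval()` once, and `with torch.no_grad():` around the forward passes.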

Next, we load the model weights, thereby restoring the model trained on our dataset.

In [None]:
# map_location="cpu" lets the weights load on machines without a GPU
state_dict = torch.load("./pretrained_model/lstm_model_weights.pkl", map_location="cpu")
model.load_state_dict(state_dict)
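
The requirement of an identical architecture can be illustrated with a small round trip. The tiny model built by `build_model` is hypothetical, and an in-memory buffer stands in for the weights file; this is a sketch of the mechanism, not of how the pretrained weights were produced.

```python
import io

import torch


def build_model():
    # hypothetical tiny architecture; both copies must be built identically
    return torch.nn.Sequential(torch.nn.Embedding(10, 4), torch.nn.Linear(4, 10))


trained = build_model()

# save only the weights (here into an in-memory buffer instead of a file)
buffer = io.BytesIO()
torch.save(trained.state_dict(), buffer)
buffer.seek(0)

# rebuild the same architecture, then restore the weights
restored = build_model()
restored.load_state_dict(torch.load(buffer, map_location="cpu"))

# a model with different layer sizes cannot take these weights
mismatched = torch.nn.Sequential(torch.nn.Embedding(10, 8), torch.nn.Linear(8, 10))
try:
    mismatched.load_state_dict(trained.state_dict())
    mismatch_rejected = False
except RuntimeError:
    mismatch_rejected = True

print(mismatch_rejected)
```

A state dict only stores named tensors, so parameter names and shapes of the freshly constructed model have to match those of the saved one.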

The plots of the model's training and evaluation loss over the epochs show that this LSTM model does not suffer from the excessive overfitting seen previously for the neural n-gram model.

![image](images/lstm_model_loss_train.png) ![image](images/lstm_model_loss_eval.png)

We want to use the pretrained model to generate some text.

To this end, we implement the function `generate_text(model, vocab, start_tokens, max_iterations)`, which takes the model and its vocabulary as its first two parameters.
The start of the sequence, which the model will continue, is passed as a list of strings in `start_tokens`.
The parameter `max_iterations` specifies the maximum number of additional tokens to be generated by the model.

Complete the function according to the instructions in the line comments.

In [None]:
def generate_text(model, vocab, start_tokens, max_iterations):
    model.eval()
    
    # encode the start sequence using the vocabulary
    start_tokens_encoded = vocab(start_tokens)
    
    # initialize variable to hold the complete generated sequence as encoded tokens (numerical) (needs to be concatenated to as generation proceeds) 
    complete_sequence = torch.tensor(start_tokens_encoded)

    # Initialize the LSTM's hidden units for the new sequence
    hidden = model.init_hidden(1)
    ### YOUR SOLUTION HERE
    # Create a tensor containing the first tokens (start_tokens) and run them through the network to obtain the corresponding hidden state
    ### END OF SOLUTION
    
    # Iteratively create next tokens and append them to the complete sequence.
    # Stop if "<eos>" is predicted or when the maximum number of iterations is reached.
    for i in range(max_iterations):
        ### YOUR SOLUTION HERE
        # 1) extract next model input, i.e. last token of complete sequence. (in each iteration, only the last token, which was not yet passed through the network, needs to be fed to the model.) 
        # 2) run model on next token (update hidden state variable)
        # 3) mask the prediction for the default-token "<unk>" as we do not want our model to predict this token
        # 4) find the token with the highest predicted score
        ### END OF SOLUTION
        
        # add next predicted token to tensor containing the complete sequence
        complete_sequence = torch.concat([complete_sequence, next_word_code.unsqueeze(0)], -1)

        # stop, if "<eos>" is predicted
        if next_word_code.item() == vocab["<eos>"]:
            break

    # decode complete sequence to human-readable tokens
    result_tokens = vocab.lookup_tokens(complete_sequence.tolist())
    result_sequence = " ".join(result_tokens)
    return result_sequence

We can use the function above to test the pretrained language model.

In [None]:
sequence_start = tokenizer("<sos> the quick brown fox")
generate_text(model, vocab, sequence_start, 200)
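
If you want to compare your solution, the steps listed in the line comments of `generate_text` amount to greedy decoding. The following self-contained sketch shows one possible way to implement them. It works on token indices only and uses `TinyLM`, a hypothetical, untrained stand-in with the same interface as `LM_LSTM_Model`, so its output is meaningless as text; the index values for `<unk>` and `<eos>` are also made up.

```python
import torch


class TinyLM(torch.nn.Module):
    """Hypothetical, untrained stand-in with the interface of LM_LSTM_Model."""

    def __init__(self, vocab_size, dim=8, num_layers=1):
        super().__init__()
        self.num_layers, self.dim = num_layers, dim
        self.embedding = torch.nn.Embedding(vocab_size, dim)
        self.lstm = torch.nn.LSTM(dim, dim, num_layers=num_layers, batch_first=True)
        self.fc = torch.nn.Linear(dim, vocab_size)

    def forward(self, x, hidden):
        output, hidden = self.lstm(self.embedding(x), hidden)
        return self.fc(output), hidden

    def init_hidden(self, batch_size):
        shape = (self.num_layers, batch_size, self.dim)
        return torch.zeros(shape), torch.zeros(shape)


def greedy_generate(model, start_ids, unk_id, eos_id, max_iterations):
    model.eval()
    sequence = torch.tensor(start_ids)
    hidden = model.init_hidden(1)
    with torch.no_grad():
        # warm-up: feed all start tokens except the last one to build up the state
        if len(start_ids) > 1:
            _, hidden = model(sequence[:-1].unsqueeze(0), hidden)
        for _ in range(max_iterations):
            # 1) the last token of the sequence has not been seen by the model yet
            next_input = sequence[-1].view(1, 1)
            # 2) run the model on it and keep the updated state
            prediction, hidden = model(next_input, hidden)
            scores = prediction[0, -1].clone()
            # 3) mask "<unk>" so that it can never be selected
            scores[unk_id] = float("-inf")
            # 4) greedy choice: the token with the highest score
            next_id = torch.argmax(scores)
            sequence = torch.cat([sequence, next_id.unsqueeze(0)], dim=-1)
            if next_id.item() == eos_id:
                break
    return sequence.tolist()


torch.manual_seed(0)
toy_model = TinyLM(vocab_size=6)
generated = greedy_generate(toy_model, start_ids=[2, 4], unk_id=0, eos_id=3, max_iterations=10)
print(generated)
```

Always picking the highest-scoring token makes the output deterministic, which is one reason why greedy decoding tends to produce repetitive text; sampling from the predicted distribution is a common alternative.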

## Model Embeddings

As seen in the previous exercise, we can visualize the embeddings contained in the model (they were learned simultaneously with the language model as a whole).

Below, we perform principal component analysis (PCA) to reduce the dimensionality of the embeddings to two dimensions for easier plotting. 
We create a scatter plot to visualize the spatial relationships between selected words.

Considering the arrangement of word embeddings in space, what can you say about the learned embeddings? How do they compare to the sophisticated GloVe embeddings inspected in the previous exercise session?  

In [None]:
def get_token_embedding(token: str, vocab, model):
    encoded_token = vocab[token]
    model_input = torch.tensor([encoded_token])
    embedding = model.embedding(model_input)
    return embedding

In [None]:
def plot_word_embeddings(words, embeddings):
    # keep only words that have an embedding, so that row i of the PCA result belongs to known_words[i]
    known_words = [word for word in words if word in embeddings]
    word_vectors = [embeddings[word] for word in known_words]
    pca = PCA(n_components=2)
    word_vectors_2d = pca.fit_transform(word_vectors)

    plt.figure(figsize=(10, 10))
    for i, word in enumerate(known_words):
        plt.scatter(word_vectors_2d[i, 0], word_vectors_2d[i, 1])
        plt.text(word_vectors_2d[i, 0] + 0.01, word_vectors_2d[i, 1] + 0.01, word, fontsize=9)
    plt.show()

In [None]:
words = ["king", "queen", "mother", "father", "soda", "coke", "france", "spain"]
embeddings = {token: get_token_embedding(token, vocab, model).squeeze(0).detach().numpy() for token in words}

In [None]:
plot_word_embeddings(words, embeddings)
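
When interpreting the plot, it helps to keep in mind what the PCA projection does. The sketch below reproduces the three steps by hand with `torch.linalg.svd` on random vectors, which are a hypothetical stand-in for the eight 512-dimensional word embeddings. The result may differ from scikit-learn's `PCA` in the sign of the axes.

```python
import torch

torch.manual_seed(0)
# hypothetical stand-in for eight 512-dimensional word embeddings
vectors = torch.randn(8, 512)

# step 1: center the data
centered = vectors - vectors.mean(dim=0, keepdim=True)
# step 2: the right singular vectors are the principal directions
_, singular_values, directions = torch.linalg.svd(centered, full_matrices=False)
# step 3: project onto the two directions with the largest variance
vectors_2d = centered @ directions[:2].T

# share of the total variance that survives in the 2D plot
explained = (singular_values[:2] ** 2).sum() / (singular_values ** 2).sum()
print(vectors_2d.shape, float(explained))
```

The value of `explained` indicates how much of the spread of the points is visible in two dimensions; distances in the scatter plot should be read with that loss of information in mind.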