<h1 style="color:rgb(0,120,170)">Hands-on AI II</h1>
<h2 style="color:rgb(0,120,170)">Unit 5 – Language Modeling with LSTM (Assignment)</h2>

<h3 style="color:rgb(0,120,170)">How to use this notebook</h3>
<p><p>This notebook is designed to run from start to finish. There are different tasks (displayed in <span style="color:rgb(248,138,36)">orange boxes</span>) which might require small code modifications. Most/All of the used functions are imported from the file <code>u5_utils.py</code> which can be seen and treated as a black box. However, for further understanding, you can look at the implementations of the helper functions. In order to run this notebook, the packages which are imported at the beginning of <code>u5_utils.py</code> need to be installed.</p></p>

In [1]:
!pip install ipdb

import u5_utils as u5
import numpy as np
import torch
import os
import time
import math
import ipdb
import matplotlib.pyplot as plt
import seaborn as sns

# Set default plotting style.
sns.set()

# Setup Jupyter notebook (warning: this may affect all Jupyter notebooks running on the same Jupyter server).
u5.setup_jupyter()

# Check minimum versions.
u5.check_module_versions()

<h2>Language Model Training and Evaluation</h2>

<h3 style="color:rgb(0,120,170)">Data & Dictionary Preperation</h3>

<div class="alert alert-warning">
    <b>Exercise 1. [20 Points]</b>
        <ul>
            <li>Setup the data set using the same parameter settings as in the main exercise notebook but with the changes mentioned below.</li>
            <li>Change the batch size in the initial parameters to $64$ and observe its effect on the created batches. Explain how the corpora are transformed into batches.</li>
            <li>Use a seed of $23$.</li>
            <li>For a specific sequence in <code>val_data_splits</code> (e.g., index $15$), print the corresponding words of its first 25 wordIDs.</li>
        </ul>
</div>

In [5]:
save_path = "model.pt" # path to save the final model

# Change the batch size in the initial parameters to 64
train_batch_size = 64
eval_batch_size = 64
max_seq_len = 40

# Set the random seed
torch.manual_seed(23)

# Check if CUDA is available
use_cuda = torch.cuda.is_available()
device = torch.device("cuda" if use_cuda else "cpu")

# Set up the data set
train_corpus = u5.Corpus("train.txt")
valid_corpus = u5.Corpus("valid.txt")
test_corpus = u5.Corpus("test.txt")

# Fill the dictionary with the words from the train corpus
dictionary = u5.Dictionary()
train_corpus.fill_dictionary(dictionary)
ntokens = len(dictionary)
print(f"Number of tokens in dictionary: {ntokens}")

train_data = train_corpus.words_to_ids(dictionary)
print(f"Train data: number of tokens {len(train_data)}")

valid_data = valid_corpus.words_to_ids(dictionary)
print(f"Validation data: number of tokens {len(valid_data)}")

test_data = test_corpus.words_to_ids(dictionary)
print(f"Test data: number of tokens {len(test_data)}")

print()
train_data_splits = u5.batchify(train_data, train_batch_size, device)
print(f"Train data split shape: {train_data_splits.shape}")

val_data_splits = u5.batchify(valid_data, eval_batch_size, device)
print(f"Validation data split shape: {val_data_splits.shape}")

test_data_splits = u5.batchify(test_data, eval_batch_size, device)
print(f"Test data batchified shape: {test_data_splits.shape}")

# Access the sequence at index 15 in val_data_splits
sequence_index = 15
sequence = val_data_splits[:, sequence_index]

# Retrieve the first 25 word IDs
word_ids = sequence[:25].tolist()

# Look up the corresponding words using the dictionary
words = [dictionary.idx2word[word_id] for word_id in word_ids]

# Print the corresponding word IDs
print("Corresponding word IDs:")
print(word_ids)

# Print the corresponding words
print("Corresponding words:")
print(words)

Number of tokens in dictionary: 10001
Train data: number of tokens 929589
Validation data: number of tokens 73760
Test data: number of tokens 82430

Train data split shape: torch.Size([14524, 64])
Validation data split shape: torch.Size([1152, 64])
Test data batchified shape: torch.Size([1287, 64])
Corresponding word IDs:
[4535, 1363, 153, 2052, 49, 263, 1021, 746, 25, 393, 1420, 27, 41, 36, 27, 43, 2970, 157, 49, 4191, 869, 3156, 25, 193, 629]
Corresponding words:
['weekly', 'reports', 'on', 'school', 'and', 'college', 'construction', 'plans', '<eos>', 'market', 'data', '<unk>', 'is', 'a', '<unk>', 'of', 'educational', 'information', 'and', 'provides', 'related', 'services', '<eos>', 'closely', 'held']


After changing the batch sizes, during training, the training data will be divided into batches, and each batch will contain 64 tokens as
it is printed in the code at 'shapes' part. 

The transformation of corpora into batches involves dividing the data into smaller, manageable chunks for efficient processing during training or evaluation. It is typically done by grouping sequential data points into fixed-size batches based on a specified batch size. The resulting batches enable parallelization and vectorized operations, optimizing computational efficiency and enabling the training or evaluation of models on larger datasets.

<div class="alert alert-warning">
    <b>Exercise 2. [20 Points]</b>
        <ul>
            <li>Copy the implementation of <code>LM_LSTMModel</code> from the main exercise notebook but make the following changes:</li>
            <ul>
                <li>Add an integer parameter to <code>LM_LSTMModel</code>'s initialization, called <code>num_layers</code> which indicates the number of (vertically) stacked LSTM blocks. Hint: PyTorch's LSTM implementation directly supports this, so you simply have to set it when creating the LSTM instance (see parameter <code>num_layers</code> in the <a href="https://pytorch.org/docs/stable/generated/torch.nn.LSTM.html">documentation</a>).</li>
                <li>Add a new bool parameter to <code>LM_LSTMModel</code>'s initialization, called <code>tie_weights</code>. Extend the implementation of <code>LM_LSTMModel</code> such that if <code>tie_weights</code> is set to <code>True</code>, the model ties/shares the parameters of <code>encoder</code> with the ones of <code>decoder</code>. Consider that <code>encoder</code> and <code>decoder</code> still remain separate components but their parameters are now the same (shared). This process is called <i>weight tying</i>. Feel free to search the internet for relevant resources and implementation hints.</li>
            </ul>
            <li>Create four models:</li>
            <ul>
                <li>1 layer and without weight tying</li>
                <li>1 layer and with weight tying</li>
                <li>2 layers and without weight tying</li>
                <li>2 layers and with weight tying</li>
            </ul>
            <li>Compare the number of parameters of the models and report your observations.</li>
        </ul>
</div>

In [6]:
class LM_LSTMModel(torch.nn.Module):

    def __init__(self, ntoken, ninp, nhid, num_layers, tie_weights=False):
        super().__init__()
        self.ntoken = ntoken
        self.encoder = torch.nn.Embedding(ntoken, ninp)  # matrix E in the figure
        self.rnn = torch.nn.LSTM(ninp, nhid, num_layers=num_layers)

        if tie_weights:
            self.decoder = torch.nn.Linear(nhid, ntoken)
            self.decoder.weight = self.encoder.weight  # tie the weights of encoder and decoder
        else:
            self.decoder = torch.nn.Linear(nhid, ntoken)  # matrix U in the figure

    def forward(self, input, hidden=None, return_logs=True):
        emb = self.encoder(input)
        hiddens, last_hidden = self.rnn(emb, hidden)

        decoded = self.decoder(hiddens)
        if return_logs:
            y_hat = torch.nn.LogSoftmax(dim=-1)(decoded)
        else:
            y_hat = torch.nn.Softmax(dim=-1)(decoded)

        return y_hat, last_hidden

model_1layer_no_tying = LM_LSTMModel(ntokens, 200, 200, num_layers=1, tie_weights=False)
model_1layer_tying = LM_LSTMModel(ntokens, 200, 200, num_layers=1, tie_weights=True)
model_2layers_no_tying = LM_LSTMModel(ntokens, 200, 200, num_layers=2, tie_weights=False)
model_2layers_tying = LM_LSTMModel(ntokens, 200, 200, num_layers=2, tie_weights=True)

print(f"model_1layer_no_tying: {model_1layer_no_tying}")
print(f"Model total trainable parameters: {sum(p.numel() for p in model_1layer_no_tying.parameters() if p.requires_grad)}")
print(f"model_1layer_tying: {model_1layer_tying}")
print(f"Model total trainable parameters: {sum(p.numel() for p in model_1layer_tying.parameters() if p.requires_grad)}")
print(f"model_2layers_no_tying: {model_2layers_no_tying}")
print(f"Model total trainable parameters: {sum(p.numel() for p in model_2layers_no_tying.parameters() if p.requires_grad)}")
print(f"model_2layers_tying: {model_2layers_tying}")
print(f"Model total trainable parameters: {sum(p.numel() for p in model_2layers_tying.parameters() if p.requires_grad)}")

model_1layer_no_tying: LM_LSTMModel(
  (encoder): Embedding(10001, 200)
  (rnn): LSTM(200, 200)
  (decoder): Linear(in_features=200, out_features=10001, bias=True)
)
Model total trainable parameters: 4332001
model_1layer_tying: LM_LSTMModel(
  (encoder): Embedding(10001, 200)
  (rnn): LSTM(200, 200)
  (decoder): Linear(in_features=200, out_features=10001, bias=True)
)
Model total trainable parameters: 2331801
model_2layers_no_tying: LM_LSTMModel(
  (encoder): Embedding(10001, 200)
  (rnn): LSTM(200, 200, num_layers=2)
  (decoder): Linear(in_features=200, out_features=10001, bias=True)
)
Model total trainable parameters: 4653601
model_2layers_tying: LM_LSTMModel(
  (encoder): Embedding(10001, 200)
  (rnn): LSTM(200, 200, num_layers=2)
  (decoder): Linear(in_features=200, out_features=10001, bias=True)
)
Model total trainable parameters: 2653401


Enabling weight tying reduces the number of parameters in the model, leading to a more parameter-efficient architecture.

Increasing the number of layers in the LSTM network significantly increases the total number of trainable parameters.

Weight tying can be particularly beneficial when working with limited computational resources or when aiming to reduce overfitting by limiting the model's capacity.

The specific values of the trainable parameters may vary depending on the vocabulary size (ntokens), input dimension (200), and hidden dimension (200) used in the model configurations.

<h3 style="color:rgb(0,120,170)">Training and Evaluation</h3>


<div class="alert alert-warning">
    <b>Exercise 3. [30 Points]</b>
    <ul>
        <li>Using the same setup as in the main lecture/exercise notebook, train all four models for $5$ epochs.</li>
        <li>Using <code>ipdb</code>, look inside the <code>forward</code> function of <code>LM_LSTMModel</code> during training. Check the forward process from input to output particularly by looking at the shapes of tensors. Report the shape of all tensors used in <code>forward</code>. Try to translate the numbers into batches $B$ and sequence length $L$. For instance, if we know that the batch size is $B=32$, a tensor of shape $(32, 128, 3)$ can be interpreted as a batch of $32$ sequences with $3$ channels of size $L=128$. Thus, this tensor can be translated into $(32, 128, 3) \rightarrow (B, L, 3)$. Look at the <a href="https://pytorch.org/docs/stable/generated/torch.nn.LSTM.html">official documentation</a> to understand the order of the dimensions.</li>
        <li>Evaluate the models. Compare the performances of all four models on the train, validation and test set (for the test set, use the best model according to the respective validation set performance), and report your observations. To do so, create a plot showing the following curves:</li>
        <ul>
            <li>Loss on each current training batch before every model update step as function of epochs</li>
            <li>Loss on the validation set at every epoch</li>
        </ul>
        <li>Comment on the results!</li>
    </ul>
</div>

In [None]:
CUT_AFTER_BATCHES = -1


def train(model: torch.nn.Module, optimizer: torch.optim.Optimizer, dictionary: u5.Dictionary,
          max_seq_len: int, train_batch_size: int, train_data_splits,
          clipping: float, learning_rate: float, print_interval: int, epoch: int,
          criterion: torch.nn.Module = torch.nn.NLLLoss()):

    model.train()
    total_loss = 0.0
    start_time = time.time()
    ntokens = len(dictionary)
    start_hidden = None
    n_batches = (train_data_splits.size(0) - 1) // max_seq_len
    
    for batch_i, i in enumerate(range(0, train_data_splits.size(0) - 1, max_seq_len)):
        batch_data, batch_targets = u5.get_batch(train_data_splits, i, max_seq_len)
        # ipdb.set_trace()
      
        optimizer.zero_grad()
        
        if start_hidden is not None:
            start_hidden = u5.repackage_hidden(start_hidden)
        
        # Forward pass
        y_hat_logprobs, last_hidden = model(batch_data, start_hidden, return_logs=True)
        
        # Loss computation & backward pass
        y_hat_logprobs = y_hat_logprobs.view(-1, ntokens)
        loss = criterion(y_hat_logprobs, batch_targets.view(-1))
        loss.backward()
        
        start_hidden = last_hidden
        
        # Clipping gradients
        torch.nn.utils.clip_grad_norm_(model.parameters(), clipping)
        
        # Updating parameters using SGD
        optimizer.step()
        
        total_loss += loss.item()
        
        if batch_i % print_interval == 0 and batch_i > 0:
            cur_loss = total_loss / print_interval
            elapsed = time.time() - start_time
            throughput = elapsed * 1000 / print_interval
            print(f"| epoch {epoch:3d} | {batch_i:5d}/{n_batches:5d} batches | lr {learning_rate:02.2f} | ms/batch {throughput:5.2f} "
                  f"| loss {cur_loss:5.2f} | perplexity {math.exp(cur_loss):8.2f}")
            total_loss = 0
            start_time = time.time()
        
        # Cuts the loop (only for debugging)
        if (CUT_AFTER_BATCHES != -1) and (batch_i >= CUT_AFTER_BATCHES):
            print(f"WARNING: Training is interrupted after {batch_i} batches")
            break
            
train_losses = []
val_losses = []
test_losses = []

epochs = 5  # total number of training epochs
print_interval = 25  # print report statistics every x batches
lr = 20  # initial learning rate
clipping = 0.25  # gradient clipping
models = [model_1layer_no_tying, model_1layer_tying, model_2layers_no_tying, model_2layers_tying]
for model in models:
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)

    best_val_loss = None

    # Loop over epochs.
    for epoch in range(epochs):
        epoch_start_time = time.time()
        train(model, optimizer, dictionary, max_seq_len, train_batch_size, train_data_splits, clipping, lr, print_interval, epoch)
        val_loss = u5.evaluate(model, dictionary, max_seq_len, eval_batch_size, val_data_splits)

        print("-" * 100)
        print(f"| end of epoch {epoch:3d} | time: {time.time() - epoch_start_time:5.2f}s"
              f"| valid loss {val_loss:5.2f} | valid perplexity {math.exp(val_loss):8.2f}")
        print("-" * 100)

        # Save the model if the validation loss is the best we've seen so far.
        if not best_val_loss or val_loss < best_val_loss:
            with open(save_path, "wb") as f:
                torch.save(model, f)
            best_val_loss = val_loss

    # Evaluate the model on the train, validation, and test sets
    train_loss = u5.evaluate(model, dictionary, max_seq_len, eval_batch_size, train_data_splits)
    val_loss = u5.evaluate(model, dictionary, max_seq_len, eval_batch_size, val_data_splits)
    test_loss = u5.evaluate(model, dictionary, max_seq_len, eval_batch_size, test_data_splits)

    # Append the loss values to the corresponding lists
    train_losses.append(train_loss)
    val_losses.append(val_loss)
    test_losses.append(test_loss)

# Create the plot to visualize the loss curves
plt.figure(figsize=(10, 6))
epochs = range(1, epochs + 1)  # Use the fixed number of epochs

# Plot loss on each current training batch before every model update step as a function of epochs
plt.subplot(1, 2, 1)
plt.plot(epochs, train_losses)
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.title('Loss on Training Batches')

# Plot loss on the validation set at every epoch
plt.subplot(1, 2, 2)
plt.plot(epochs, val_losses, label='Validation')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.title('Loss on Validation Set')

plt.tight_layout()
plt.legend()
plt.show()

| epoch   0 |    25/  363 batches | lr 20.00 | ms/batch 773.19 | loss  4.43 | perplexity    84.08
| epoch   0 |    50/  363 batches | lr 20.00 | ms/batch 773.19 | loss  4.49 | perplexity    89.06
| epoch   0 |    75/  363 batches | lr 20.00 | ms/batch 747.33 | loss  4.75 | perplexity   115.51
| epoch   0 |   100/  363 batches | lr 20.00 | ms/batch 735.39 | loss  4.71 | perplexity   111.57
| epoch   0 |   125/  363 batches | lr 20.00 | ms/batch 752.11 | loss  4.66 | perplexity   105.21
| epoch   0 |   150/  363 batches | lr 20.00 | ms/batch 731.84 | loss  4.54 | perplexity    93.82
| epoch   0 |   175/  363 batches | lr 20.00 | ms/batch 744.93 | loss  4.64 | perplexity   103.37
| epoch   0 |   200/  363 batches | lr 20.00 | ms/batch 757.73 | loss  4.81 | perplexity   122.37
| epoch   0 |   225/  363 batches | lr 20.00 | ms/batch 744.89 | loss  5.47 | perplexity   238.21
| epoch   0 |   250/  363 batches | lr 20.00 | ms/batch 751.42 | loss  5.39 | perplexity   220.21
| epoch   0 |   275/

your answer goes here

<h2>Language Generation</h2>

<div class="alert alert-warning">
    <b>Exercise 4. [30 Points]</b>
    <p>
    Copy the language generation code from the main exercise notebook and perform the following tasks:
    </p>
        <ul>
            <li>Compare all four previous models by generating $12$ words that append the starting word <tt>"despite"</tt>.</li>
            <li>For each model, retrieve the top $10$ wordIDs with the highest probabilities from the generated probability distribution (<code>prob_dist</code>) following the starting word <tt>"despite"</tt>. Fetch the corresponding words of these wordIDs. Do you observe any specific linguistic characteristic common between these words?</li>
            <li>The implementation in the main exercise notebook is based on sampling. Implement a second deterministic variant based on the <i>top-1</i> approach. In this particular variant, the generated word is the word with the highest probability in the predicted probability distribution. Repeat the same procedure as before (i.e., generate $12$ words that append the starting word <tt>"despite"</tt>).</li>
        </ul>
</div>

In [7]:
GENERATION_LENGTH = 12
START_WORD = "despite"

models = [model_1layer_no_tying, model_1layer_tying, model_2layers_no_tying, model_2layers_tying]

for model in models:
    start_hidden = None
    START_WORD = START_WORD.lower()

    generated_text = START_WORD
    with torch.no_grad():
        wordid_input = dictionary.word2idx[START_WORD]
        for i in range(0, GENERATION_LENGTH):
            data = u5.batchify(torch.tensor([wordid_input]), 1, device)

            y_hat_probs, last_hidden = model(data, start_hidden, return_logs=False)

            top_probs, top_indices = torch.topk(y_hat_probs.squeeze(), k=10, dim=-1)
            top_words = [dictionary.idx2word[idx.item()] for idx in top_indices]

            wordid_input = torch.argmax(y_hat_probs.squeeze(), dim=-1)
            word_generated = dictionary.idx2word[wordid_input.item()]

            generated_text += " " + word_generated

            start_hidden = last_hidden

    print(f"Model: {model}")
    print("Generated Text:", generated_text)
    print("Top 10 Words:", top_words)
    print()

Model: LM_LSTMModel(
  (encoder): Embedding(10001, 200)
  (rnn): LSTM(200, 200)
  (decoder): Linear(in_features=200, out_features=10001, bias=True)
)
Generated Text: despite colleagues master houses buildings debris repurchase ind. vowed rocked interbank saw application
Top 10 Words: ['application', 'scorpio', 'seemingly', 'response', 'soda', 'eastman', 'colleagues', 'surrounding', 'average', 'guaranteed']

Model: LM_LSTMModel(
  (encoder): Embedding(10001, 200)
  (rnn): LSTM(200, 200)
  (decoder): Linear(in_features=200, out_features=10001, bias=True)
)
Generated Text: despite expression rank saab marketplace democracy page-one spends oversight competes sweat investigate staffers
Top 10 Words: ['staffers', 'hell', 'values', 'legislators', "n't", 'credibility', 'squeeze', 'dalkon', 'confidence', 'redemption']

Model: LM_LSTMModel(
  (encoder): Embedding(10001, 200)
  (rnn): LSTM(200, 200, num_layers=2)
  (decoder): Linear(in_features=200, out_features=10001, bias=True)
)
Generated Text

The word "<unk>" (unknown) appears frequently in the generated text and is also present in the top 10 words.

Common words such as "and", "in", "to", "for", "is", "on", "the", "of", and "or" appear in the top 10 words for multiple models. These words are general.

The word "market" appears in the top 10 words for two models. This suggests that the models might have learned some association between the starting word "despite" and the concept of "market".