# 🦜 NN-Based Language Model
In this exercise we will run a basic RNN-based language model and answer some questions about the code. Using a GPU to run the code is advised. First run the code, then answer the questions below that require modifying it.

In [3]:
#@title 🧮 Imports & Hyperparameter Setup
#@markdown Feel free to experiment with the following hyperparameters at your
#@markdown leisure. For the purpose of this assignment, leave the default values
#@markdown and run the code with these suggested values.
# Parts of the code were adapted from the sources below.
# https://github.com/pytorch/examples/tree/master/word_language_model 
# https://github.com/yunjey/pytorch-tutorial/tree/master/tutorials/02-intermediate/language_model

! git clone https://github.com/yunjey/pytorch-tutorial/
%cd pytorch-tutorial/tutorials/02-intermediate/language_model/

import torch
import torch.nn as nn
import numpy as np
from torch.nn.utils import clip_grad_norm_

# Device configuration
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Hyper-parameters
embed_size = 128 #@param {type:"number"}
hidden_size = 1024 #@param {type:"number"}
num_layers = 1 #@param {type:"number"}
num_epochs = 5 #@param {type:"slider", min:1, max:10, step:1}
batch_size = 20 #@param {type:"number"}
seq_length = 30 #@param {type:"number"}
learning_rate = 0.002 #@param {type:"number"}
#@markdown Number of words to be sampled ⬇️
num_samples = 50 #@param {type:"number"}  

print(f"--> Device selected: {device}")


fatal: destination path 'pytorch-tutorial' already exists and is not an empty directory.
/home/ziruiqiu/comp691_DL/assignment_2/pytorch-tutorial/tutorials/02-intermediate/language_model
--> Device selected: cuda


In [4]:
from data_utils import Dictionary, Corpus

# Load "Penn Treebank" dataset
corpus = Corpus()
ids = corpus.get_data('data/train.txt', batch_size)
vocab_size = len(corpus.dictionary)
num_batches = ids.size(1) // seq_length

print(f"Vocabulary size: {vocab_size}")
print(f"Number of batches: {num_batches}")

Vocabulary size: 10000
Number of batches: 1549


## 🤖 Model Definition
As you can see below, this model stacks `num_layers` [LSTM](https://pytorch.org/docs/stable/generated/torch.nn.LSTM.html) layers vertically to construct our basic RNN-based language model. The diagram below shows a pictorial representation of the model in its simplest form (i.e., `num_layers = 1`).
![Pictorial Representation of The Model](https://upload.wikimedia.org/wikipedia/commons/6/63/Long_Short-Term_Memory.svg)

In [5]:
# RNN based language model
class RNNLM(nn.Module):
    def __init__(self, vocab_size, embed_size, hidden_size, num_layers):
        super(RNNLM, self).__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, num_layers, batch_first=True)
        self.linear = nn.Linear(hidden_size, vocab_size)
        
    def forward(self, x, h):
        # Embed word ids to vectors
        x = self.embed(x)
        
        # Forward propagate LSTM
        out, (h, c) = self.lstm(x, h)
        
        # Reshape output to (batch_size*sequence_length, hidden_size)
        out = out.reshape(out.size(0)*out.size(1), out.size(2))
        
        # Decode hidden states of all time steps
        out = self.linear(out)
        return out, (h, c)
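
To make the reshape in `forward` concrete, here is a minimal shape check on a toy-sized copy of the model. The class name `TinyRNNLM` and all sizes are made up for this sketch. It shows that the logits come out flattened to `(batch_size * seq_length, vocab_size)`, which is why the training loop flattens the targets with `targets.reshape(-1)`.

```python
import torch
import torch.nn as nn

class TinyRNNLM(nn.Module):
    # Same structure as RNNLM above, repeated so this sketch is self-contained
    def __init__(self, vocab_size, embed_size, hidden_size, num_layers):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, num_layers, batch_first=True)
        self.linear = nn.Linear(hidden_size, vocab_size)

    def forward(self, x, h):
        x = self.embed(x)                      # (batch, seq, embed)
        out, (h, c) = self.lstm(x, h)          # (batch, seq, hidden)
        out = out.reshape(out.size(0) * out.size(1), out.size(2))
        return self.linear(out), (h, c)        # (batch*seq, vocab)

# Toy sizes (hypothetical), chosen only to keep the check fast
vocab, embed, hidden, layers, batch, seq = 50, 8, 16, 1, 4, 5
model_toy = TinyRNNLM(vocab, embed, hidden, layers)

x = torch.randint(0, vocab, (batch, seq))
h0 = (torch.zeros(layers, batch, hidden), torch.zeros(layers, batch, hidden))
logits, (h, c) = model_toy(x, h0)

print(logits.shape)  # torch.Size([20, 50])
print(h.shape)       # torch.Size([1, 4, 16])
```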

## 🏓 Training
In this section we will train our model. This should take a couple of minutes, so be patient 😊

In [6]:
model = RNNLM(vocab_size, embed_size, hidden_size, num_layers).to(device)

# Loss and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

# Truncated backpropagation
def detach(states):
    return [state.detach() for state in states] 


# Train the model
for epoch in range(num_epochs):
    # Set initial hidden and cell states
    states = (torch.zeros(num_layers, batch_size, hidden_size).to(device),
              torch.zeros(num_layers, batch_size, hidden_size).to(device))
    
    for i in range(0, ids.size(1) - seq_length, seq_length):
        # Get mini-batch inputs and targets
        inputs = ids[:, i:i+seq_length].to(device)
        targets = ids[:, (i+1):(i+1)+seq_length].to(device)
        
        # Forward pass
        states = detach(states)
        outputs, states = model(inputs, states)
        loss = criterion(outputs, targets.reshape(-1))
        
        # Backward and optimize
        optimizer.zero_grad()
        loss.backward()
        clip_grad_norm_(model.parameters(), 0.5)
        optimizer.step()

        step = (i+1) // seq_length
        if step % 100 == 0:
            print ('Epoch [{}/{}], Step[{}/{}], Loss: {:.4f}, Perplexity: {:5.2f}'
                   .format(epoch+1, num_epochs, step, num_batches, loss.item(), np.exp(loss.item())))

Epoch [1/5], Step[0/1549], Loss: 9.2110, Perplexity: 10006.35
Epoch [1/5], Step[100/1549], Loss: 6.0061, Perplexity: 405.91
Epoch [1/5], Step[200/1549], Loss: 5.9207, Perplexity: 372.66
Epoch [1/5], Step[300/1549], Loss: 5.7405, Perplexity: 311.21
Epoch [1/5], Step[400/1549], Loss: 5.6723, Perplexity: 290.70
Epoch [1/5], Step[500/1549], Loss: 5.1448, Perplexity: 171.54
Epoch [1/5], Step[600/1549], Loss: 5.1881, Perplexity: 179.12
Epoch [1/5], Step[700/1549], Loss: 5.3368, Perplexity: 207.84
Epoch [1/5], Step[800/1549], Loss: 5.1846, Perplexity: 178.50
Epoch [1/5], Step[900/1549], Loss: 5.0611, Perplexity: 157.77
Epoch [1/5], Step[1000/1549], Loss: 5.1044, Perplexity: 164.74
Epoch [1/5], Step[1100/1549], Loss: 5.3662, Perplexity: 214.05
Epoch [1/5], Step[1200/1549], Loss: 5.1858, Perplexity: 178.72
Epoch [1/5], Step[1300/1549], Loss: 5.0935, Perplexity: 162.95
Epoch [1/5], Step[1400/1549], Loss: 4.8619, Perplexity: 129.27
Epoch [1/5], Step[1500/1549], Loss: 5.1804, Perplexity: 177.75
Ep
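
The perplexity column in the log is the exponential of the cross-entropy loss. As a sanity check, assume a model that spreads probability uniformly over the 10,000-word vocabulary: its loss would be ln(10000), which is close to the 9.2110 logged at step 0, before any learning has happened.

```python
import math

vocab_size = 10000  # vocabulary size printed earlier in the notebook

# Cross-entropy of a uniform prediction over the vocabulary
uniform_loss = math.log(vocab_size)

# Perplexity is exp(loss); for the uniform case it equals the vocabulary size
uniform_perplexity = math.exp(uniform_loss)

print(f"uniform loss: {uniform_loss:.4f}")
print(f"uniform perplexity: {uniform_perplexity:.2f}")
```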

# 🤔 Questions

## 1️⃣ Q2.1 Detaching or not? (10 points)
The above code implements a version of truncated backpropagation through time. The implementation only requires the `detach()` function (lines 7-9 of the cell) defined above the loop and used once inside the training loop.
* Explain the implementation (compared to not using truncated backprop through time).
* What does the `detach()` call here achieve? Draw a computational graph. You may choose to answer this question outside the notebook.
* When using lines 7-9 we will typically observe less GPU memory being used during training. Explain why in your answer.


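1. Without truncated backpropagation through time, gradients would be propagated back through the entire history of the sequence, i.e. through every chunk processed since the initial states were set at the start of the epoch. This is computationally expensive and prone to vanishing or exploding gradients. With TBPTT, the hidden and cell states are still carried forward from one chunk to the next, but gradients are backpropagated through only a fixed number of steps (set by seq_length), which keeps each update cheap.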

2. Assume the model has one layer and a sequence length of 3. We can represent the graph as follows:
![Computational graph](cg2.png)
* In this graph, the LSTM layer receives inputs from the Embedding layer and is connected to the Linear layer. The outputs from the Linear layer are the predictions at each time step. The arrows represent the flow of information and gradients during forward and backward passes.

* When we use truncated backpropagation through time, we limit the number of time steps that gradients are backpropagated through. In this example, the gradients would only flow through the LSTM connections up to a certain number of time steps (e.g., 3 steps in this case).

* The detach() function returns tensors that hold the same values but are cut off from the computational graph, which stops gradients from flowing further back in time than the current chunk. In this example, because detach() is applied to the states at the start of each 3-step chunk, gradients are not backpropagated into earlier chunks. This bounds the computation and memory needed for each update. The trade-off is that dependencies longer than the chunk cannot be learned through the gradient, although the forward pass still carries information in the hidden and cell states.

3. Detaching the hidden states breaks the computational graph at the chunk boundary, so the graph of earlier chunks (and the activations stored for it) is no longer referenced and its memory can be freed. Without detaching, autograd would have to retain the intermediate activations of all previous time steps in order to backpropagate through them, so memory use would grow with every chunk. With TBPTT, memory use is bounded by one chunk of seq_length steps.
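
The effect of detaching can be checked directly on a small example. The sketch below uses a tiny stand-in LSTM with arbitrary sizes, not the notebook's model, and a `detach` helper that returns a tuple rather than a list. It shows that states returned by the LSTM carry a `grad_fn` (they are part of the graph), while the detached copies do not, so a later backward pass stops at them.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Tiny stand-in for the notebook's LSTM (sizes are arbitrary)
lstm = nn.LSTM(input_size=4, hidden_size=8, num_layers=1, batch_first=True)

def detach(states):
    # Same idea as the helper in the training cell, returning a tuple
    return tuple(state.detach() for state in states)

# Initial states: (num_layers, batch, hidden)
states = (torch.zeros(1, 2, 8), torch.zeros(1, 2, 8))

# First chunk: the returned states are attached to the autograd graph
x1 = torch.randn(2, 3, 4)  # (batch, seq_length, input_size)
_, states = lstm(x1, states)
attached = states[0].grad_fn is not None

# Detaching yields tensors with the same values but no graph history
states = detach(states)
detached = states[0].grad_fn is None

# Second chunk: backward only traverses this chunk's graph
x2 = torch.randn(2, 3, 4)
out, states = lstm(x2, states)
out.sum().backward()

print(attached, detached)  # True True
```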

## 🔮 Model Prediction
Below we will use our model to generate a text sequence!

In [7]:
# Sample from the model
with torch.no_grad():
    with open('sample.txt', 'w') as f:
        # Set initial hidden and cell states
        state = (torch.zeros(num_layers, 1, hidden_size).to(device),
                 torch.zeros(num_layers, 1, hidden_size).to(device))

        # Select one word id randomly
        prob = torch.ones(vocab_size)
        input = torch.multinomial(prob, num_samples=1).unsqueeze(1).to(device)

        for i in range(num_samples):
            # Forward propagate RNN 
            output, state = model(input, state)

            # Sample a word id
            prob = output.exp()
            word_id = torch.multinomial(prob, num_samples=1).item()

            # Fill input with sampled word id for the next time step
            input.fill_(word_id)

            # File write
            word = corpus.dictionary.idx2word[word_id]
            word = '\n' if word == '<eos>' else word + ' '
            f.write(word)

            if (i+1) % 100 == 0:
                print('Sampled [{}/{}] words and save to {}'.format(i+1, num_samples, 'sample.txt'))
! cat sample.txt

whose <unk> may lead for car passed to remove a little more care said new wage and the supply of russian blood products 
while they have played a huge <unk> plan says sam <unk> <unk> the president for local regional division which recently has seemed strong for clients described 
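
The sampling cell passes `output.exp()` straight to `torch.multinomial`. That function treats each row of its input as non-negative weights that need not sum to one, so this draws from the same distribution as a softmax over the logits. A small check on made-up logits (not taken from the trained model):

```python
import torch

logits = torch.tensor([[2.0, 0.5, -1.0, 0.0]])  # made-up scores for 4 words

weights = logits.exp()  # what the sampling cell passes to torch.multinomial
normalized = weights / weights.sum(dim=1, keepdim=True)
softmax = torch.softmax(logits, dim=1)

# Normalizing the exponentiated logits gives exactly the softmax distribution
same_distribution = torch.allclose(normalized, softmax)
print(same_distribution)  # True
```

For very large logits `exp()` can overflow, so `torch.softmax` is the safer choice in general.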

## 2️⃣ Q2.2 Sampling strategy (7 points)
Consider the sampling procedure above. The current code samples a word:
```python
word_id = torch.multinomial(prob, num_samples=1).item()
```
and feeds it to the model as the input for the next time step. Copy the above cell below and modify this sampling strategy to use greedy sampling, which selects the highest-probability word at each time step to feed as the next input.

In [16]:
# Sample greedily from the model
with torch.no_grad():
    with open('sample_greedy.txt', 'w') as f:
        # Set initial hidden and cell states
        state = (torch.zeros(num_layers, 1, hidden_size).to(device),
                 torch.zeros(num_layers, 1, hidden_size).to(device))

        # Select one word id randomly
        prob = torch.ones(vocab_size)
        input = torch.multinomial(prob, num_samples=1).unsqueeze(1).to(device)

        for i in range(num_samples):
            # Forward propagate RNN 
            output, state = model(input, state)

            # Greedy choice: pick the word id with the highest score
            word_id = output.argmax(1).item()

            # Fill input with sampled word id for the next time step
            input.fill_(word_id)

            # File write
            word = corpus.dictionary.idx2word[word_id]
            word = '\n' if word == '<eos>' else word + ' '
            f.write(word)

            if (i+1) % 100 == 0:
                print('Sampled [{}/{}] words and save to {}'.format(i+1, num_samples, 'sample_greedy.txt'))

! cat sample_greedy.txt

for the N <unk> homeless airlines parent ual corp. announced plans to spin off about $ N billion in assets and pay for $ N billion in assets and pay for $ N billion in assets and pay for $ N billion in assets and pay for $ N billion 
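
The greedy cell takes `argmax` of the raw model outputs without exponentiating them first. This is safe because softmax is monotonic, so the highest logit is also the highest probability. A quick check on made-up logits:

```python
import torch

logits = torch.tensor([[2.0, 0.5, -1.0, 0.0]])  # made-up scores for 4 words

# Index of the largest logit
greedy_from_logits = logits.argmax(dim=1).item()

# Index of the largest probability after softmax
greedy_from_probs = torch.softmax(logits, dim=1).argmax(dim=1).item()

print(greedy_from_logits, greedy_from_probs)  # 0 0
```

Greedy decoding is deterministic, which is a likely reason the output above settles into a repeated phrase.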

## 3️⃣ Q2.3 Embedding Distance (8 points)
Our model has learned a specific set of word embeddings.
* Write a function that takes in 2 words and prints the cosine distance between their embeddings using the word embeddings from the above models.
* Use it to print the cosine distance of the word "army" and the word "taxpayer".

*Refer to the sampling code for how to output the words corresponding to each index. To get the index of a word you can use the dictionary `corpus.dictionary.word2idx`.*


In [9]:
# Embedding distance
import torch.nn.functional as F

def cosine_distance(word1, word2, model):
    # Get the word indices
    idx1 = corpus.dictionary.word2idx[word1]
    idx2 = corpus.dictionary.word2idx[word2]
    
    # Get the word embeddings from the model
    embed1 = model.embed(torch.tensor(idx1).to(device)).detach().cpu()
    embed2 = model.embed(torch.tensor(idx2).to(device)).detach().cpu()
    
    # Calculate the cosine similarity between the two embeddings
    cos_sim = F.cosine_similarity(embed1, embed2, dim=0)
    
    # Calculate the cosine distance as 1 - cosine similarity
    cos_dist = 1 - cos_sim
    
    return cos_dist

word1 = "army"
word2 = "taxpayer"
cos_dist = cosine_distance(word1, word2, model)
print("Cosine distance between '{}' and '{}': {:.4f}".format(word1, word2, cos_dist))


Cosine distance between 'army' and 'taxpayer': 1.1329
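
Cosine distance is 1 minus cosine similarity, so it ranges from 0 (same direction) through 1 (orthogonal) to 2 (opposite direction). A value above 1, such as the 1.1329 printed here, means the two embeddings have a negative cosine similarity. The sketch below checks the formula by hand on made-up 2-D vectors (not actual embeddings); `cosine_distance_vec` is a name chosen for this example.

```python
import torch
import torch.nn.functional as F

a = torch.tensor([1.0, 0.0])
b = torch.tensor([0.0, 1.0])   # orthogonal to a
c = torch.tensor([-1.0, 0.0])  # opposite to a

def cosine_distance_vec(u, v):
    # 1 - (u . v) / (||u|| * ||v||)
    return 1 - torch.dot(u, v) / (u.norm() * v.norm())

d_same = cosine_distance_vec(a, a).item()
d_orth = cosine_distance_vec(a, b).item()
d_opp = cosine_distance_vec(a, c).item()

# The hand-written formula agrees with the library call used in the cell above
lib_opp = (1 - F.cosine_similarity(a, c, dim=0)).item()

print(d_same, d_orth, d_opp)  # 0.0 1.0 2.0
```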


## 4️⃣ Q2.4 Teacher Forcing (Extra Credit 2 points)
What is teacher forcing?
> Teacher forcing works by using the actual or expected output from the training dataset at the current time step $y(t)$ as input in the next time step $X(t+1)$, rather than the output generated by the network.

In the `🏓 Training` code this is achieved, implicitly, when we pass the entire input sequence (`inputs = ids[:, i:i+seq_length].to(device)`) to the model at once.
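
To make the implicit teacher forcing concrete, here is a toy illustration with made-up word ids. The targets are the inputs shifted by one position, so at every time step the model receives the ground-truth previous word, regardless of what it predicted.

```python
import torch

ids = torch.arange(10).unsqueeze(0)  # toy "corpus": one row of word ids 0..9
seq_length = 4
i = 0

# Same slicing as in the training loop
inputs = ids[:, i:i+seq_length]            # tensor([[0, 1, 2, 3]])
targets = ids[:, (i+1):(i+1)+seq_length]   # tensor([[1, 2, 3, 4]])

# With teacher forcing, the input at step t+1 is the true target of step t
shift_holds = torch.equal(inputs[:, 1:], targets[:, :-1])
print(shift_holds)  # True
```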

Copy the `🏓 Training` code below and modify it to disable teacher forcing during training. Compare the performance of this model to the original model. What can you conclude? (Compare perplexity and convergence rate.)

In [10]:
# Training code without teacher forcing
# Modified RNNLM model for step-by-step training without teacher forcing
class RNNLM_no_teacher_forcing(nn.Module):
    def __init__(self, vocab_size, embed_size, hidden_size, num_layers):
        super(RNNLM_no_teacher_forcing, self).__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, num_layers, batch_first=True)
        self.linear = nn.Linear(hidden_size, vocab_size)
        
    def forward(self, x, h):
        # Embed word ids to vectors
        x = self.embed(x)
        
        # Forward propagate LSTM
        out, (h, c) = self.lstm(x, h)
        
        # Decode the hidden state of the single time step
        out = self.linear(out.squeeze(1))
        return out, (h, c)

model_no_tf = RNNLM_no_teacher_forcing(vocab_size, embed_size, hidden_size, num_layers).to(device)

# Loss and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model_no_tf.parameters(), lr=learning_rate)

# Train the model without teacher forcing
for epoch in range(num_epochs):
    # Set initial hidden and cell states
    states = (torch.zeros(num_layers, batch_size, hidden_size).to(device),
              torch.zeros(num_layers, batch_size, hidden_size).to(device))

    for i in range(0, ids.size(1) - seq_length, seq_length):
        # Get mini-batch inputs and targets
        inputs = ids[:, i:i+1].to(device)  # Only take the first input of the sequence
        targets = ids[:, (i+1):(i+1)+seq_length].to(device)

        # Initialize the cumulative loss for this sequence
        cumulative_loss = 0

        # Loop through the sequence
        for j in range(seq_length):
            # Forward pass
            states = detach(states)
            outputs, states = model_no_tf(inputs, states)

            # Calculate the loss
            loss = criterion(outputs, targets[:, j])
            cumulative_loss += loss.item()

            # Backward and optimize
            optimizer.zero_grad()
            loss.backward()
            clip_grad_norm_(model_no_tf.parameters(), 0.5)
            optimizer.step()

            # Use the predicted output as input for the next time step
            inputs = outputs.argmax(dim=1).unsqueeze(1).to(device)

        step = (i+1) // seq_length
        if step % 100 == 0:
            avg_loss = cumulative_loss / seq_length
            print('Epoch [{}/{}], Step[{}/{}], Loss: {:.4f}, Perplexity: {:5.2f}'
                .format(epoch+1, num_epochs, step, num_batches, avg_loss, np.exp(avg_loss)))



Epoch [1/5], Step[0/1549], Loss: 8.0842, Perplexity: 3242.89
Epoch [1/5], Step[100/1549], Loss: 6.4676, Perplexity: 643.94
Epoch [1/5], Step[200/1549], Loss: 6.7590, Perplexity: 861.79
Epoch [1/5], Step[300/1549], Loss: 6.7512, Perplexity: 855.04
Epoch [1/5], Step[400/1549], Loss: 6.7746, Perplexity: 875.33
Epoch [1/5], Step[500/1549], Loss: 6.5994, Perplexity: 734.64
Epoch [1/5], Step[600/1549], Loss: 6.5636, Perplexity: 708.82
Epoch [1/5], Step[700/1549], Loss: 6.8555, Perplexity: 949.06
Epoch [1/5], Step[800/1549], Loss: 6.5771, Perplexity: 718.45
Epoch [1/5], Step[900/1549], Loss: 6.9208, Perplexity: 1013.09
Epoch [1/5], Step[1000/1549], Loss: 6.8166, Perplexity: 912.85
Epoch [1/5], Step[1100/1549], Loss: 6.8877, Perplexity: 980.14
Epoch [1/5], Step[1200/1549], Loss: 6.6952, Perplexity: 808.48
Epoch [1/5], Step[1300/1549], Loss: 7.0503, Perplexity: 1153.23
Epoch [1/5], Step[1400/1549], Loss: 6.8114, Perplexity: 908.12
Epoch [1/5], Step[1500/1549], Loss: 6.8431, Perplexity: 937.38
E

## 5️⃣ Q2.5 Distance Comparison (+1 point)
Repeat the work you did for `3️⃣ Q2.3 Embedding Distance` for the model in `4️⃣ Q2.4 Teacher Forcing` and compare the distances produced by these two models (i.e. with and without teacher forcing). What can you conclude?

In [12]:
word1 = "army"
word2 = "taxpayer"

# Calculate cosine distances for both models
cos_dist_tf = cosine_distance(word1, word2, model)  # With teacher forcing
cos_dist_no_tf = cosine_distance(word1, word2, model_no_tf)  # Without teacher forcing

# Print the results
print("Cosine distance with teacher forcing: {:.4f}".format(cos_dist_tf))
print("Cosine distance without teacher forcing: {:.4f}".format(cos_dist_no_tf))


Cosine distance with teacher forcing: 1.1329
Cosine distance without teacher forcing: 1.0003


## Discussion:
The model with teacher forcing has a larger cosine distance (1.1329) than the model without teacher forcing (1.0003).

This suggests that the learned representations of words in the embedding space are influenced by the training method. In this case, the model trained without teacher forcing places "army" and "taxpayer" closer together, in cosine terms, than the model trained with teacher forcing. A distance of about 1.0 corresponds to a cosine similarity near 0, i.e. nearly orthogonal embeddings, while 1.1329 corresponds to a slightly negative similarity of about -0.13.

It's important to note that this observation is based on a single pair of words, and further analysis would be needed to draw more general conclusions about the effect of teacher forcing on word embeddings. However, this example demonstrates that the choice of training method can have an impact on the learned representations of words.