# PA4 Part 2. Transformers for Language Modeling [80 Marks]

<center>
    <img src="./assets/transformers.png">
</center>

### Introduction

In this notebook, you will be implementing a Transformer, `SastaGPT`, from scratch. This will be taken from the paper [Attention Is All You Need](https://arxiv.org/abs/1706.03762) by Vaswani et al. (2017). You will then train this Transformer on a language modeling task on a dataset of your choosing.

After this notebook, you should be able to:

- Understand how Transformers, and all its components, are implemented in code.

- Train a Transformer on a language modeling task.

- Brag on Twitter about how you created a Transformer from scratch.

### Instructions

- Follow along with the notebook, filling out the necessary code where instructed.

- <span style="color: red;">Read the Submission Instructions and Plagiarism Policy in the attached PDF.</span>

- <span style="color: red;">Make sure to run all cells for credit.</span>

- <span style="color: red;">Do not remove any pre-written code.</span> We will be using the `print` statements to grade your assignment.

- <span style="color: red;">You must attempt all parts.</span> Do not assume that because something is for 0 marks, you can leave it - it will definitely be used in later parts.

In [1]:
pip install tiktoken




In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import torch
import torch.nn as nn
import torch.nn.functional as F

import tiktoken

## Creating a Dataset [20 Marks]

In this section, you will be creating a dataset for training your Transformer. You are allowed to use any dataset you want, as long as it is a text dataset.

Some fun options include:

- A collection of movie scripts from your favorite director. Look through [IMSDB](https://www.imsdb.com/) for some options.

- A novel from your favorite author. Look through [Project Gutenberg](https://www.gutenberg.org/) for some options.

- A collection of poems from your favorite poet. Look through [Poetry Foundation](https://www.poetryfoundation.org/) for some options.

- Anything else from [Kaggle](https://www.kaggle.com/datasets) or [HuggingFace](https://huggingface.co/datasets).

You are not allowed to be boring and use the same dataset as something previously seen in the course.

This section involves doing the following:

1. **Finding/Downloading/Creating your dataset**. Add in a few comments about what it is and why you chose it. (10 Marks)

2. **Preprocessing your dataset**. This involves tokenizing your dataset, and creating a vocabulary. Previously you used character-level tokenization, now you will use **subword-level tokenization** (you can read about different tokenization strategies [here](https://huggingface.co/learn/nlp-course/chapter2/4?fw=pt)). This can be very easily handled with the `tiktoken` library (read up [here](https://github.com/openai/tiktoken)), which is also what the GPT family uses :p (10 Marks)

**Note:** Do not use character-level tokenization (what you used in the previous assignment).

In [3]:
import numpy as np
import torch

BATCH_SIZE = 32
BLOCK_SIZE = 8

encoder = tiktoken.get_encoding("gpt2")

# Read in your dataset using np.loadtxt
# i used the script of the movie, kungfu panda as my datset as it is lengthy enough to properly train my models and
# parameters, morover since its an animated movie it won't always have the most logically sounding sentence which
# i wanted to see if my model would be able to grasp that. 
file_path = r'C:\Users\Areeba Shahzad\Desktop\ML\PA4\kungfu_panda.txt'
with open(file_path, 'r', encoding='utf-8') as file:
    text = file.read()

text_array = np.array(list(text))  # Assuming each character is a separate element in the array

# Complete the get_batch function
def get_batch():
    '''
    Returns a batch (x, y) from the dataset
    '''
    batch_size = BATCH_SIZE
    block_size = BLOCK_SIZE
    dataset = text_array
    if len(dataset) <= block_size:
        raise ValueError("Dataset has insufficient data for the specified block size.")
    
    # Convert to string
    text_str = ''.join(dataset)
    
    # Get random integers for indexing into the dataset to create a batch
    indices = np.random.randint(0, len(dataset) - block_size - 1, batch_size)
    
    # Get the x and y (input and target) batches after encoding and indexing
    x = []
    y = []
    
    for i in indices:
        x_block = [ord(char) for char in text_str[i:i + block_size]]
        y_block = [ord(char) for char in text_str[i + 1:i + 1 + block_size]]
        x.append(x_block)
        y.append(y_block)
    
    # Padding
    max_length = max(max(len(seq) for seq in x), max(len(seq) for seq in y))
    x = [seq + [0] * (max_length - len(seq)) for seq in x]
    y = [seq + [0] * (max_length - len(seq)) for seq in y]
    
    # Make sure these are int64 (long) tensors
    x = torch.tensor(x, dtype=torch.int64)
    y = torch.tensor(y, dtype=torch.int64)
    
    return x, y

xb, yb = get_batch()
print(xb.shape, yb.shape)


torch.Size([32, 8]) torch.Size([32, 8])


## Creating the Transformer

### Defining our Hyperparameters [0 Marks]

We will define our hyperparameters here. You can play around with these later to see how they affect your model.

Since there are lots of hyperparameters to keep track of, it is easier for us to wrap all of them inside a `dataclass`. This can be passed around very easily to different functions, and makes it easier to keep track of all relevant values.

To quickly describe some of the hyperparameters:

1. `block_size`: This is the length of the sequence that we will be feeding into our model. This has also been called the *context length* and *window size* in other places.

2. `emb_dim`: This is the dimensionality of the embeddings we will have inside the model. This has implications on the outputs of tensors throughout the entire model.

3. `head_size`: When dealing with Multi-Head Attention, we will be *splitting* our embeddings into multiple heads. This is the size of each of those heads. For example, if we had an embedding of size 512, and we wanted 8 heads, then each head would have a size of 64. Down the line, we'd be concatenating these heads together, so the final output would be of size 512 again, as you should recall from the lectures.

4. `num_heads`: This is the number of attention heads we will have in our Multi-Head Attention layer.

5. `num_layers`: This is the number of layers we will have in our Transformer (this includes the MHSA, the Feedforward module, and the Layer Normalizations).

6. `vocab_size`: This is the size of our vocabulary. This is the number of unique tokens we have in our vocabulary. If you were to use character-level tokenization, this would be the number of unique characters in your dataset. Since you should be using subword-level tokenization, this will be the number of unique subwords in your dataset.

In [4]:
from dataclasses import dataclass


@dataclass
class Config:
    block_size: int = BLOCK_SIZE
    emb_dim: int = 256
    head_size: int = 32
    num_heads: int = 8
    num_layers: int = 2
    vocab_size: int = encoder.n_vocab # vocab size of the tokenizer

# We like to have emb_dim == head_size * num_heads
config = Config()
assert config.emb_dim == config.head_size * config.num_heads, "Embedding dimension must be divisible by number of heads"

### Multi-Head Self-Attention [20 Marks]

This component is the core of the Transformer. This is where the model learns to attend to different parts of the input sequence, and is the reason why Transformers are so powerful.

$$\text{Attention}(Q, K, V) = \text{softmax}(\frac{QK^T}{\sqrt{d_k}})V$$

$$\text{head}_k = \text{Attention}(QW^Q_k, KW^K_k, VW^V_k)$$

$$\text{MHSA}(Q, K, V) = \text{Concat}(\text{head}_1, ..., \text{head}_h)W^O$$

For simplicity, assume we have a single Head:

1. The input has three parts extracted from it: the query $Q$, the key $K$, and the value $V$ (via projections or `Linear` layers). 

2. The query and key are multiplied together to get a score. This score is then scaled by the square root of the embedding dimension, $\sqrt{d_k}$, then passed through a softmax to get the attention weights (*after* a masking operation is applied).

3. The attention weights are then multiplied with the value to get the final output.

When we extend this to *multiple heads*, we simply repeat this process for each head in parallel, and then concatenate the outputs of each head together.

Create a class for this `MHSA` module (Multi-Head Self-Attention), adding a comment next to each line in the `forward` method for the shape of the tensor at that point. This will help you debug later on.

In [5]:
class MHSA(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.emb_dim = cfg.emb_dim
        self.head_size = cfg.head_size
        self.block_size = cfg.block_size
        self.num_heads = cfg.num_heads

        # TODO: Define the single projection layer for QKV
        self.projection_qkv = nn.Linear(self.emb_dim, 3 * self.emb_dim)
        
        # TODO: Define the output projection layer
        self.projection_out = nn.Linear(self.emb_dim, self.emb_dim)

        # Create a buffer for the mask (buffers are tensors that are not updated during backpropagation)
        self.register_buffer("mask", torch.tril(torch.ones(self.block_size, self.block_size)))

    def forward(self, x):

        B, T, C = x.shape # batch size, block size, emb dim
        H = self.num_heads

        # TODO: Carry out the projections to get the query, key and value - you may want to reshape them to (B, H, T, head_size) to make things easier
        qkv = self.projection_qkv(x)
        q, k, v = qkv.chunk(3, dim=-1)
        q = q.view(B, T, H, -1).permute(0, 2, 1, 3)
        k = k.view(B, T, H, -1).permute(0, 2, 1, 3)
        v = v.view(B, T, H, -1).permute(0, 2, 1, 3)
        
        # TODO: Compute the attention scores, perform masking, and apply the softmax
        attn_scores = torch.matmul(q, k.transpose(-1, -2)) / torch.sqrt(torch.tensor(self.emb_dim).float())
        attn_scores.masked_fill_(self.mask[:T, :T] == 0, float("-inf"))
        attn_weights = F.softmax(attn_scores, dim=-1)

        # TODO: Compute the weighted sum of the value vectors - you can perform the concatenation via a simple reshaping before applying the final projection
        out = torch.matmul(attn_weights, v)
        out = out.permute(0, 2, 1, 3).contiguous().view(B, T, -1)
        out = self.projection_out(out)
        
        
        return out

# !DO NOT REMOVE THESE LINES!
x = torch.randn(32, config.block_size, config.emb_dim)
print(x.shape)
csa = MHSA(config)
out = csa(x)
print(out.shape)

torch.Size([32, 8, 256])
torch.Size([32, 8, 256])


### Feedforward Network [10 Marks]

The Feedforward Network module usually consists of two linear layers, with a ReLU activation in between. The first linear layer projects the input's Embeddings to a higher dimensionality (by a factor of 4 specifically), and the second linear layer projects it back down to the original dimensionality. This is a very simple module, but is very effective in learning complex functions.

Create a class for this `Feedforward` module, with the following tweaks:

1. Use the [GELU Activation Function](https://pytorch.org/docs/stable/generated/torch.nn.GELU.html) instead of ReLU between the layers. It is functionally very similar, but has gained a lot of popularity recently.

2. Instead of using two `Linear` layers, and an upsampling factor of `4`, use **three** `Linear` layers: the first one upsamples by a factor of `2`, the second one does not change the dimensionality, and the third one downsamples by a factor of `2`. This creates a slightly deeper network, and can be shown to have the same number of parameters as the network in the original Transformer.

3. Add a [Dropout layer](https://pytorch.org/docs/stable/generated/torch.nn.Dropout.html) after the last `Linear` layer, with $p=0.1$. This helps mitigate overfitting.

In [6]:
class Feedforward(nn.Module):
    def __init__(self, config):
        super().__init__()
        emb_dim = config.emb_dim

        # TODO: Define the feedforward network
        self.feedforward = nn.Sequential(
            nn.Linear(emb_dim, emb_dim * 2),
            nn.GELU(),
            nn.Linear(emb_dim * 2, emb_dim),  # Change the input and output dimensions
            nn.GELU(),
            nn.Linear(emb_dim, emb_dim),  # Change the output dimension
            nn.Dropout(0.1)
        )



    def forward(self, x):
        # TODO: Apply the feedforward network

        return self.feedforward(x)
        

### Blocks with Skip Connections [10 Marks]

The Encoder consists of multiple *Blocks*, that each contain a Multi-Head Self-Attention module, a Feedforward module, and a Layer Normalization module. We use Skip Connections to help with the flow of gradients during the training process.

Create a class for this `Block` module, with the following tweaks:

- Instead of creating your own Layer Normalization module, use [the one provided by PyTorch](https://pytorch.org/docs/stable/generated/torch.nn.LayerNorm.html), `nn.LayerNorm`.

- Perform the Layer Normalizations **before** the Skip Connections, instead of after. The original paper used the "post-LN" approach, but [recent research](https://arxiv.org/abs/2002.04745) has shown that the "pre-LN" approach is better.

In [7]:
class Block(nn.Module):
    def __init__(self, config):
        super().__init__()

        # TODO: Define the MHSA, Feedforward, and two LayerNorm modules
        self.mhsa = MHSA(config)
        self.feedforward = Feedforward(config)
        self.layer_norm1 = nn.LayerNorm(config.emb_dim)
        self.layer_norm2 = nn.LayerNorm(config.emb_dim)
    
    def forward(self, x):

        # TODO: Apply the modules with residual connections (use the Pre-LN design)
        mhsa_out = self.mhsa(x)
        mhsa_out = self.layer_norm1(x + mhsa_out)
        feedforward_out = self.feedforward(mhsa_out)
        out = self.layer_norm2(mhsa_out + feedforward_out)
        
        return out


# !DO NOT REMOVE THESE LINES!
x = torch.randn(8, config.block_size, config.emb_dim)
print(x.shape)
block = Block(config)
out = block(x)
print(out.shape)

torch.Size([8, 8, 256])
torch.Size([8, 8, 256])


### Putting it all together [20 Marks]

Now that we have all the components, we can put them together to create the Transformer. The Transformer consists of

1. Creating Embeddings for the input sequence. This implies turning the input sequence into a sequence of indices, and then passing it through an `nn.Embedding` layer. Alongside this, we will be using another Embedding table for the positional encodings. For simplicity, we can keep the embedding dimensionality for both of these the same.

2. Passing the embeddings through a series of Blocks.

3. Passing the output of the Encoder through a Layer Normalization layer, and then a Linear layer to get the final logits.

Create a class for this `SastaGPT` module. This will be the final model that we will be training. Note that the `generate()` function has been provided for you.

In [8]:
class SastaGPT(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.emb_dim = config.emb_dim
        self.block_size = config.block_size
        self.num_layers = config.num_layers
        self.vocab_size = config.vocab_size 

        # TODO: Define the word and position embeddings
        # Hint: They will both have the same embedding dimension, only difference is the number of embeddings (vocab_size vs. block_size)
        self.word_embeddings = nn.Embedding(config.vocab_size, config.emb_dim)
        self.position_embeddings = nn.Embedding(config.block_size, config.emb_dim)
        
        # TODO: Define the sequence of Blocks
        self.blocks = nn.ModuleList([Block(config) for _ in range(config.num_layers)])
        
        # TODO: Define the final LayerNorm
        self.layer_norm = nn.LayerNorm(config.emb_dim)
        
        # TODO: Define the final linear layer (to get logits)
        self.final_linear = nn.Linear(config.emb_dim, config.vocab_size)

        # Initialize the weights
        self.apply(self._init_weights)

    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
            if module.bias is not None:
                torch.nn.init.zeros_(module.bias)
        
        elif isinstance(module, nn.Embedding):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)

    def forward(self, idxs):
        # idxs: (B, T)
        batch_size, seq_len = idxs.shape

        assert seq_len <= self.block_size, f"Sequence length exceeds block size of {self.block_size}"

        # TODO: Get the word embeddings (B, T, C) and position embeddings (T, C)
        # Hint: For the position embeddings, you can use torch.arange(seq_len) to mimic the indices
        # Note: position embeddings are encodings of the position indices (NOT the actual tokens)
        position_ids = torch.arange(seq_len, device=idxs.device).unsqueeze(0)
        word_embeddings = self.word_embeddings(idxs)
        position_embeddings = self.position_embeddings(position_ids)
        
        # TODO: Add the word and position embeddings (broadcasting will take care of the shapes)
        embeddings = word_embeddings + position_embeddings

        # TODO: Pass the embeddings through the blocks (B, T, C)
        for block in self.blocks:
            embeddings = block(embeddings)

        # TODO: Apply the final LayerNorm
        embeddings = self.layer_norm(embeddings)

        # TODO: Apply the final linear layer to get the logits (B, T, V)
        logits = self.final_linear(embeddings)

        return logits
    
    @torch.no_grad()
    def generate(self, idxs, max_new_tokens=20):
        '''
        Takes in a sequence of indices (the tokenized sentence) and generates new tokens
        Note that the input indices should not be longer than the block size
        Returns the input sequence with the generated tokens appended (these should be decoded using the Tokenizer)

        Params
        ------
        idxs: torch.Tensor
            (B, T) tensor of token indices
        max_new_tokens: int
            Maximum number of new tokens to generate
        '''

        # idxs: (B, T)
        for _ in range(max_new_tokens):
            idxs_trimmed = idxs[:, -self.block_size:] # trim to block size

            logits = self(idxs_trimmed) # (B, T, V)

            logits = logits[:, -1, :] # (B, V)
            
            probs = F.softmax(logits, dim=-1) # (B, V)

            next_idx = torch.multinomial(probs, num_samples=1) # (B, 1)

            idxs = torch.cat((idxs, next_idx), dim=1) # (B, T+1)
            
        return idxs

cfg = Config()
model = SastaGPT(cfg)
print(f"Number of parameters: {sum(p.numel() for p in model.parameters()) / 1e6 :.3f}M")

Number of parameters: 26.970M


In [9]:
# !DO NOT REMOVE THESE LINES!

# Check out the forward pass
xb, yb = get_batch()
print(xb.shape)

logits = model(xb)
print(logits.shape) # (B, T, V)

torch.Size([32, 8])
torch.Size([32, 8, 50257])


Now train this model on your own dataset!

Note that since the model is quite large, it will take a while to train. This would be a good opportunity to use [Google Colab](https://colab.research.google.com/) or [Kaggle Notebooks](https://www.kaggle.com/notebooks) for free GPU compute. This requires very few changes to your original training code, only having to cast your model and data to the GPU.

```python
# Define the device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Get the batch
xb, yb = get_batch()

# Cast to the device
xb = xb.to(device)
yb = yb.to(device)

# Cast the model to the device
model = model.to(device)

# Forward pass
preds = model(xb)

...

```

The only rule is that both your model and your tensors should be on the same device (CPU or GPU).

In [14]:
# TODO: Train the model

# Define the device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Instantiate the SastaGPT model and move it to the device
model = SastaGPT(config).to(device)

# Define your optimizer
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

# Define the loss function
criterion = nn.CrossEntropyLoss()

# Number of training steps
num_steps = 610

# Training loop
for step in range(num_steps):
    # Get the batch
    xb, yb = get_batch()

    # Cast to the device
    xb = xb.to(device)
    yb = yb.to(device)

    # Zero the gradients
    optimizer.zero_grad()

    # Forward pass
    preds = model(xb)

    # Compute the loss
    loss = criterion(preds.view(-1, config.vocab_size), yb.view(-1))

    # Backward pass
    loss.backward()

    # Update the weights
    optimizer.step()

    # Print loss every 100 steps
    if step % 100 == 0:
        print(f"Step: {step}, Loss: {loss.item()}")

# Save the trained model
torch.save(model.state_dict(), 'sastagpt_model.pth')

Step: 0, Loss: 10.87582778930664
Step: 100, Loss: 1.5367194414138794
Step: 200, Loss: 1.2725788354873657
Step: 300, Loss: 1.2424126863479614
Step: 400, Loss: 1.4793559312820435
Step: 500, Loss: 1.466122031211853
Step: 600, Loss: 1.5521754026412964


With your trained model, generate some text. You can use the `generate()` function provided to you.

Note that the quality of your generations depends on:

- How large the training dataset was. If you used a small dataset, your model will not have enough samples to learn the syntax, semantics, the grammar and other intricacies of language, much less the specifics of the dataset.

- How long you trained for. Provided your dataset was large enough, you could try training for longer epochs until the loss values stabilize.

- How large your model was. If you used a small model, it will not have enough parameters to learn the complexities of language. You can try increasing the number of layers, the embedding dimensionality, the number of heads, etc.

Most importantly, share your generations with the rest of the class! Take some screenshots of your best generations, and share them on the Slack channel.

In [None]:
sentence = "You're just a big, fat panda!"
idxs = torch.tensor(encoder.encode(sentence)).unsqueeze(0)

model.eval()
generated = model.generate(idxs, max_new_tokens=100)
res = encoder.decode(generated[0].numpy())
print(res)

## Effects of Hyperparameters on the Number of Parameters [10 Marks]

In this section, you will be exploring the effects of the hyperparameters on the number of parameters in your model. 

With so much room for tweaking (with the embedding dimensionality, the size of the vocab, the number of heads, the number of layers, etc.), it is important to visualize how fast the total number of parameters can grow if you start toying with them.

In [None]:
def get_params(model):
    '''
    Returns the number of parameters in the model (in millions)
    '''
    return sum(p.numel() for p in model.parameters()) / 1e6

@dataclass
class Config:
    block_size: int = BLOCK_SIZE
    emb_dim: int = 256
    head_size: int = 32
    num_heads: int = 8
    num_layers: int = 2
    vocab_size: int = 100 # smaller vocab size for quick prototyping

In [None]:
# TODO: Plot the number of parameters vs. block size (in multiples of 8, capped at 512)


In [None]:
# TODO: Plot the number of parameters vs. embedding dimension (in multiples of 256, capped at 1792)


In [None]:
# TODO: Plot the number of parameters vs. number of heads (in multiples of 4, capped at 32)


In [None]:
# TODO: Plot the number of parameters vs. number of layers


In [None]:
# TODO: Plot the number of parameters vs. vocabulary size


After looking at these plots, can you answer which hyperparameters have the most effect on the number of parameters? Which ones have the least effect? Note that even if two plots indicate linearity, it is not necessary their gradients are the same :p

<span style="color: green">
    Answer here:
        
</span>

# Fin.