![BridgingAI Logo](../bridgingai_logo.png)

# Deep Learning - Exercise 5: Transformers

---
1. [Multi-Head Attention](#multi-head-attention)
2. [Building GPT](#gpt)
3. [Experiment: Language Modeling](#experiment-lm)
4. [Encoder-Decoder Transformer](#transformer)
<br/> &#9; 4.1 [Transformer Encoder Block](#transformer-encoder-block)
<br/> &#9; 4.2 [Transformer Decoder Block](#transformer-decoder-block)

5. [Experiment: Neural Machine Translation](#experiment-nmt)
6. [Questions](#questions)
7. [References](#references)
---

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F

from configs.nmt_config import NMTExperimentConfig
from trainers.nmt_trainer import NMTTrainer
from configs.lm_config import LMExperimentConfig
from trainers.lm_trainer import LMTrainer
from tests.lm_sanity_checks import TestGPTBlock
from tests.nmt_sanity_checks import (
    TestMultiheadAttention,
    TestTransformerEncoderBlock,
    TestTransformerDecoderBlock,
)
from IPython.display import Markdown, display

Transformers have revolutionized the field of Natural Language Processing (NLP) by enabling models to capture long-range dependencies in sequential data through self-attention mechanisms. While most large language models (LLMs) today are built using decoder-only architectures, the original transformer model proposed in [Attention Is All You Need](https://arxiv.org/abs/1706.03762) consists of both an encoder and a decoder, a setup which is particularly useful for tasks such as Neural Machine Translation (NMT).

In this assignment, you will implement both a GPT-like model for language modeling and the full encoder-decoder Transformer for Neural Machine Translation (NMT).

As in the previous assignments, we will use Shakespeare's plays as our dataset for the language modeling task. For the NMT task, we will use the [Multi30k](https://arxiv.org/abs/1605.00459) dataset, which consists of English and German sentence pairs, to train a transformer model that translates English sentences into German.

After completing this assignment, you will be able to:
- Implement Multi-Head Attention from scratch.
- Build a GPT-like model for language modeling.
- Build a full transformer model for Neural Machine Translation.

This assignment will mainly focus on the model architecture. If you need a refresher on the data processing and training pipeline, you can refer to the previous RNN assignment.

<a id='multi-head-attention'></a>
# 1. Multi-Head Attention from scratch

In this section, you will implement a `MultiheadAttention` module from scratch using PyTorch. This is one of the core components of Transformer models, where the multi-head attention mechanism allows the model to jointly attend to information from different representation subspaces.

The following figure illustrates the attention process you'll be implementing:

<img src="./assets/attention.svg" alt="Transformer Model" style="width: auto; height: 300px;">



**TODOs**:
Complete the `MultiheadAttention` class by filling in specific sections marked as **TODO** (You can skip `TODO 7` for now). Follow the instructions for each step carefully.

- **TODO 1: Input Projections**
   - Project the `query`, `key`, and `value` inputs using linear layers.
   - These projections prepare the data for the attention mechanism.

- **TODO 2: Reshape for Multi-Head**
   - Split the projected inputs into multiple heads by reshaping.
   - Use `self.num_heads` and `self.head_dim` to reshape, then transpose for efficient computation.

- **TODO 3: Compute Attention Weights (Part 1)**
   - Compute the scaled scores $\frac{QK^T}{\sqrt{d}}$, where $d$ is the per-head dimension (`head_dim`).

- **TODO 4: Compute Attention Weights (Part 2)**
   - Compute $\text{softmax}(\frac{QK^T}{\sqrt{d}})$

- **TODO 5: Compute Weighted Sum**
   - Use the normalized attention weights to compute the weighted sum of values (`v`).
   - $\text{Attention}(Q, K, V) = \text{softmax}(\frac{QK^T}{\sqrt{d}}) V$
   - This step forms the core output of the attention mechanism.

- **TODO 6: Output Projection**
   - Concatenate the heads' outputs and apply a final linear transformation.
   - This projects the combined output back to the original embedding dimension.
   - If you encounter an error related to `contiguous`, call `.contiguous()` on the tensor before reshaping it with `view`, so that it is laid out contiguously in memory.
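
The shape bookkeeping behind TODO 2 and TODO 6 can be tried out in isolation. The following is a minimal sketch on a random tensor, not the reference solution; the sizes and the variable names `heads`, `out`, `swapped`, and `merged` are chosen only for illustration.

```python
import torch

B, T, C, num_heads = 2, 5, 8, 4
head_dim = C // num_heads

x = torch.randn(B, T, C)

# Split the channel dimension into heads: (B, T, C) -> (B, num_heads, T, head_dim)
heads = x.view(B, T, num_heads, head_dim).transpose(1, 2)
print(heads.shape)  # torch.Size([2, 4, 5, 2])

# Stand-in for an attention output: a freshly allocated (B, num_heads, T, head_dim) tensor.
out = heads.contiguous()

# Moving the head axis back next to head_dim gives a non-contiguous view,
# so a direct .view(B, T, C) on it would raise an error.
swapped = out.transpose(1, 2)  # (B, T, num_heads, head_dim)
print(swapped.is_contiguous())  # False

# .contiguous() copies the data into a standard layout, after which .view works.
merged = swapped.contiguous().view(B, T, C)
print(torch.equal(merged, x))  # True
```

Splitting and merging are exact inverses here, which is why `merged` equals `x`. Calling `.reshape(B, T, C)` instead of `.contiguous().view(B, T, C)` should behave the same way, since `reshape` copies only when it has to.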


The goal here is to make sure you understand the process of multi-head attention thoroughly. If you get stuck, you can refer to the `CausalSelfAttention` class in [minGPT/model.py](https://github.com/karpathy/minGPT/blob/master/mingpt/model.py) from [Andrej Karpathy’s minGPT repository](https://github.com/karpathy/minGPT/tree/master) for inspiration.
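
To make the formula concrete, here is a small single-head sketch of scaled dot-product attention on random tensors. It omits the batch and head dimensions, the learned projections, masking, and dropout, so it illustrates the math rather than the `MultiheadAttention` class itself; all sizes are arbitrary.

```python
import math

import torch
import torch.nn.functional as F

torch.manual_seed(0)

len_q, len_k, d = 3, 4, 8
q = torch.randn(len_q, d)
k = torch.randn(len_k, d)
v = torch.randn(len_k, d)

# Scaled scores: one row per query, one column per key.
scores = q @ k.T / math.sqrt(d)      # (len_q, len_k)

# Softmax over the key dimension turns each row into a probability distribution.
weights = F.softmax(scores, dim=-1)  # (len_q, len_k)

# Each output row is a weighted average of the value vectors.
output = weights @ v                 # (len_q, d)

print(weights.sum(dim=-1))  # every entry is (numerically) 1
print(output.shape)         # torch.Size([3, 8])
```

Dividing by $\sqrt{d}$ keeps the magnitude of the scores roughly independent of the head dimension, which prevents the softmax from saturating as $d$ grows.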

In [None]:
class MultiheadAttention(nn.Module):
    def __init__(self, embed_dim, num_heads, dropout, attn_dropout):
        super().__init__()
        assert embed_dim % num_heads == 0, "embed_dim must be divisible by num_heads"
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads

        self.attn_dropout = nn.Dropout(attn_dropout)
        self.q_proj = nn.Linear(embed_dim, embed_dim)
        self.k_proj = nn.Linear(embed_dim, embed_dim)
        self.v_proj = nn.Linear(embed_dim, embed_dim)
        self.out_proj = nn.Linear(embed_dim, embed_dim)
        self.out_dropout = nn.Dropout(dropout)

    def forward(self, query, key, value, attn_mask=None, key_padding_mask=None):
        """
        Args:
            query: (B, len_q, C)
            key: (B, len_k, C)
            value: (B, len_k, C)
            key_padding_mask: (B, len_k)
                Bool tensor where False values are positions that should be masked with -inf.
            attn_mask: (len_q, len_k)
                Bool tensor where False values are positions that should be masked with -inf.

        Returns:
            attn_output: (B, len_q, C)
            attn_weights: (B, num_heads, len_q, len_k)

        B: batch_size.
        C: embed_dim.
        len_q: length of query (target) sequence
        len_k: length of key (source) sequence
        """
        B, len_q, C = query.shape
        len_k = key.shape[1]
        head_dim = C // self.num_heads

        # TODO 1: Input projection
        # YOUR CODE HERE
        raise NotImplementedError()
        assert q.shape == (B, len_q, C)
        assert k.shape == (B, len_k, C)
        assert v.shape == (B, len_k, C)

        # TODO 2: Reshape to multi-head
        # YOUR CODE HERE
        raise NotImplementedError()
        assert q.shape == (B, self.num_heads, len_q, head_dim)
        assert k.shape == (B, self.num_heads, len_k, head_dim)
        assert v.shape == (B, self.num_heads, len_k, head_dim)

        # TODO 3: Compute attention weights
        # YOUR CODE HERE
        raise NotImplementedError()
        assert attn_weights.shape == (B, self.num_heads, len_q, len_k)

        # Apply attention mask
        if attn_mask is not None:
            attn_weights = attn_weights.masked_fill(attn_mask == False, -float("inf"))

        if key_padding_mask is not None:
            # TODO 7: Apply key padding mask
            # YOUR CODE HERE
            raise NotImplementedError()

        # TODO 4: Apply softmax
        # YOUR CODE HERE
        raise NotImplementedError()
        assert attn_weights.shape == (B, self.num_heads, len_q, len_k)

        # Apply dropout on attention weights
        attn_weights = self.attn_dropout(attn_weights)

        # TODO 5: Compute weighted sum of values as output
        # YOUR CODE HERE
        raise NotImplementedError()
        assert attn_output.shape == (B, self.num_heads, len_q, head_dim)

        # TODO 6: Output projection
        # YOUR CODE HERE
        raise NotImplementedError()
        assert attn_output.shape == (B, len_q, C)

        # Apply dropout on output
        attn_output = self.out_dropout(attn_output)

        return attn_output, attn_weights


# Sanity check
TestMultiheadAttention.test_basic(MultiheadAttention, use_mask=False)

There are two types of masks that will be used by `MultiheadAttention`:
1. `attn_mask`: This mask is used to control which query can attend to which key, e.g., the causal mask in the self-attention module in the decoder that prevents tokens from attending to future tokens.
2. `key_padding_mask`: This mask is used to ignore padding tokens during the attention mechanism.

**TODO 7**: 

Complete the `key_padding_mask` part of the `forward` method.

**Hints**:
- You can use the `masked_fill` function to apply the mask to the attention weights (similar to how we handle the `attn_mask`).
- You can refer to the docstring to understand the shape and expected behavior of the `key_padding_mask`.
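
The main subtlety is broadcasting: the padding mask has shape `(B, len_k)`, while the attention scores have shape `(B, num_heads, len_q, len_k)`. A minimal sketch on toy tensors, assuming the docstring's convention that `False` marks positions to be masked (the variable names are illustrative only):

```python
import torch
import torch.nn.functional as F

B, H, len_q, len_k = 2, 2, 3, 4
scores = torch.zeros(B, H, len_q, len_k)  # toy scores, all equal

# True = real token, False = padding.
key_padding_mask = torch.tensor([
    [True, True, True, True],    # sequence 0: no padding
    [True, True, False, False],  # sequence 1: last two keys are padding
])

# (B, len_k) -> (B, 1, 1, len_k), so the mask broadcasts over heads and queries.
mask = key_padding_mask[:, None, None, :]
masked = scores.masked_fill(~mask, float("-inf"))
weights = F.softmax(masked, dim=-1)

print(weights[0, 0, 0].tolist())  # [0.25, 0.25, 0.25, 0.25]
print(weights[1, 0, 0].tolist())  # [0.5, 0.5, 0.0, 0.0]
```

Padded keys receive exactly zero weight after the softmax. Note that a row in which every key is masked would turn into NaN, so each sequence needs at least one unmasked key.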

In [None]:
TestMultiheadAttention.test_basic(MultiheadAttention, use_mask=True)

In [None]:
TestMultiheadAttention.test_specific_mask(MultiheadAttention)

<a id='gpt'></a>
# 2. Building GPT

In this section, you will implement the `GPTBlock` class, the main building block of GPT-like models, which use a decoder-only Transformer architecture for language modeling. These models predict the next token in a sequence based on the previous tokens, using self-attention and MLPs.
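
As a reminder of what "predicting the next token" means for the training data, here is a toy sketch. It assumes that inputs and targets are the same sequence shifted by one position, which is the usual setup for causal language modeling; the token ids are made up.

```python
import torch

tokens = torch.tensor([10, 11, 12, 13, 14])  # hypothetical token ids

inputs = tokens[:-1]   # [10, 11, 12, 13]
targets = tokens[1:]   # [11, 12, 13, 14]

# At position t the model sees inputs[: t + 1] and should predict targets[t].
for t in range(len(inputs)):
    context = inputs[: t + 1].tolist()
    print(f"context {context} -> target {targets[t].item()}")
```

A single sequence of length `T + 1` therefore yields `T` training examples, all of which are processed in one forward pass thanks to the causal mask.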

#### Key Features of the `GPTBlock`:
1. **Self-Attention Only**: Unlike full Transformer models, GPT models do not use cross-attention. The `GPTBlock` only utilizes (masked) self-attention.
2. **Causal Masking**: To ensure that each token can only attend to previous tokens and not future ones, the self-attention mechanism applies a causal mask.
3. **Pre-Norm Architecture**: This block follows the "pre-norm" configuration, where layer normalization is applied before the self-attention and feed-forward layers. This approach is based on the findings in [On Layer Normalization in the Transformer Architecture](https://arxiv.org/abs/2002.04745) (Figure 1(b)).
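
The difference between the two normalization placements fits in a few lines. The wrappers below are a hypothetical illustration (the class names are not part of the exercise code), with a plain `nn.Linear` standing in for the attention or feed-forward sublayer.

```python
import torch
import torch.nn as nn


class PreNormResidual(nn.Module):
    """Pre-norm: x + sublayer(norm(x))."""

    def __init__(self, dim, sublayer):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.sublayer = sublayer

    def forward(self, x):
        return x + self.sublayer(self.norm(x))


class PostNormResidual(nn.Module):
    """Post-norm, as in the original Transformer: norm(x + sublayer(x))."""

    def __init__(self, dim, sublayer):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.sublayer = sublayer

    def forward(self, x):
        return self.norm(x + self.sublayer(x))


x = torch.randn(2, 5, 16)
pre = PreNormResidual(16, nn.Linear(16, 16))
post = PostNormResidual(16, nn.Linear(16, 16))
print(pre(x).shape, post(x).shape)
```

In the pre-norm variant the residual path carries `x` through unchanged, whereas in the post-norm variant every block output passes through a `LayerNorm`.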

#### GPTBlock Architecture:
Below is an illustration of the architecture of the `GPTBlock`:

<img src="./assets/gpt.svg" alt="GPT Model Architecture" style="width: auto; height: 300px;">

**TODOs**: Complete the `GPTBlock` class implementation using the architecture described above.

- **TODO 1: Self-Attention**
   - Remember to use a causal mask.
   - The `causal_mask` buffer can be used as the causal mask.

- **TODO 2: Feed-Forward**
   - Implement the feed-forward part of the block using the class members.

**Hints**:
- The `causal_mask` buffer has shape `(context_length, context_length)`. Since the input sequence length `T` may vary, slice the buffer to `(T, T)` before passing it to the attention module.
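
A quick standalone check of what the buffer contains and how slicing behaves (a sketch with a small, arbitrary `context_length`):

```python
import torch

context_length = 6
causal_mask = torch.tril(torch.ones(context_length, context_length)).to(torch.bool)

# For a shorter input of length T, keep only the top-left (T, T) corner.
T = 3
mask_T = causal_mask[:T, :T]

print(mask_T.tolist())
# [[True, False, False], [True, True, False], [True, True, True]]
```

Row `i` is `True` exactly for columns `0..i`, so position `i` may attend to itself and to earlier positions only. The top-left corner of a lower-triangular matrix is itself lower-triangular, which is why plain slicing is enough.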

In [None]:
from models.modules import FeedForwardBlock


class GPTBlock(nn.Module):
    def __init__(self, transformer_config):
        super().__init__()
        self.norm1 = nn.LayerNorm(transformer_config.n_embd)
        self.norm2 = nn.LayerNorm(transformer_config.n_embd)

        self.attn = MultiheadAttention(
            transformer_config.n_embd,
            transformer_config.n_head,
            transformer_config.dropout,
            transformer_config.attn_dropout,
        )
        # This is a simple two-layer MLP
        self.ff = FeedForwardBlock(
            transformer_config.n_embd, transformer_config.dropout
        )

        self.register_buffer(
            "causal_mask",
            torch.tril(
                torch.ones(
                    transformer_config.context_length, transformer_config.context_length
                )
            ).to(torch.bool),
        )

    def forward(self, x):
        """
        Args:
            x: (B, T, C)
                B: batch size
                T: sequence length
                C: number of channels

        Returns:
            output tensor of shape (B, T, C)
        """
        _, T, _ = x.shape

        # TODO 1: Self-attention
        # YOUR CODE HERE
        raise NotImplementedError()

        # TODO 2: Feed-forward
        # YOUR CODE HERE
        raise NotImplementedError()


TestGPTBlock.test_basic(GPTBlock)

With the GPTBlock, we can now build a complete Transformer Decoder for causal language modeling. The overall model architecture is quite simple - most of the functionality is bundled in the GPTBlock!

During inference, the Transformer Decoder works similarly to an RNN: it is applied iteratively to the sequence generated so far in order to predict the next token (take a look at the `generate` method). Training, however, is parallelized over the entire sequence length, making it much easier to scale to large model sizes.

**TODO**: Complete the `compute_loss` function. You will need to apply the model to the input data and then compute the cross-entropy loss between the logits and the targets.
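
One detail that is easy to get wrong is the shape expected by `F.cross_entropy`: for class-index targets it takes logits of shape `(N, num_classes)` and targets of shape `(N,)`. The sketch below uses toy tensors rather than the model, and flattens the batch and time dimensions first.

```python
import math

import torch
import torch.nn.functional as F

B, T, V = 2, 4, 10
logits = torch.zeros(B, T, V)          # uniform prediction over the vocabulary
targets = torch.randint(0, V, (B, T))  # random target token ids

# Flatten (B, T, V) -> (B*T, V) and (B, T) -> (B*T,).
loss = F.cross_entropy(logits.view(B * T, V), targets.view(B * T))

print(loss.item(), math.log(V))  # both close to log(10), about 2.3026
```

This also gives a handy sanity check: a model that predicts a uniform distribution has a loss of `log(vocab_size)`, so the loss of a freshly initialized model should be roughly in that range.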

In [None]:
from models.modules import Embedding


class GPT(nn.Module):
    def __init__(self, config):
        super().__init__()
        config = config.transformer_config
        self.config = config
        self.embedding = Embedding(
            config.vocab_size, config.context_length, config.n_embd, config.dropout
        )

        self.blocks = nn.ModuleList([GPTBlock(config) for _ in range(config.n_layer)])
        self.norm = nn.LayerNorm(config.n_embd)
        self.out_proj = nn.Linear(config.n_embd, config.vocab_size)

    def forward(self, x):
        B, T = x.shape
        x = self.embedding(x)
        for block in self.blocks:
            x = block(x)
        x = self.norm(x)
        x = self.out_proj(x)
        return x

    def compute_loss(self, input_data, target_data):
        # TODO: apply the model to the inputs and compute the cross-entropy loss
        # YOUR CODE HERE
        raise NotImplementedError()
        return loss

    @torch.no_grad()
    def generate(model, context_ids, max_new_tokens: int = 500, temperature=1.0):
        """
        Generate text using the model with proper temperature scaling

        Args:
            context_ids: token indices of shape (T, )
            max_new_tokens: maximum number of tokens to generate
            temperature: controls randomness (higher = more random, lower = more deterministic)
        """
        was_training = model.training
        model.eval()
        device = next(model.parameters()).device

        T = context_ids.shape[0]
        context_ids = context_ids.view(1, -1).clone().to(device)

        for _ in range(max_new_tokens):
            # Get the last context_length tokens
            x = context_ids[:, -model.config.context_length :]

            # Get logits and apply temperature
            logits = model(x)[:, -1, :]
            logits = logits / temperature

            # Apply softmax and sample
            probs = F.softmax(logits, dim=-1)
            next_token = torch.multinomial(probs, num_samples=1)

            context_ids = torch.cat([context_ids, next_token], dim=-1)

        if was_training:
            model.train(was_training)
        return context_ids.squeeze()[T:]

<a id='experiment-lm'></a>
# 3. Experiment: Language Modeling

You have implemented the `MultiheadAttention` and `GPTBlock` classes. Now, you will use these components to build a GPT-like model for language modeling.

**TODO**: 
1. **Start TensorBoard**: Monitor the training progress and metrics. We also log text samples periodically (under 'text').
2. **Run the Training Script**: Execute the cell below to begin training.

**Notes**: 
- You can review the hyperparameters used for training in `configs/lm_config.py`.
- To see alternative Transformer configurations, check `configs/transformer_config.py`.

The default training configuration (`gpt-nano`, `max_steps=10000`) takes around 15 minutes to train on a GPU. If implemented correctly, you should see the validation loss under 1.7 and train loss under 1.6 by the end of training. What do you observe when you train a larger model (`gpt-micro`, `gpt-mini`)? Do you see differences in the training/validation curves or the generated text?

**Hint:** When trying out larger model configs, you can save some time by running validation less often since it quickly becomes quite computationally expensive (remember that generating 1000 tokens requires 1000 forward passes!). To do this, adapt the `eval_every_n_steps` parameter.

In [None]:
lm_config = LMExperimentConfig("gpt-nano")
lm_model = GPT(lm_config)
lm_trainer = LMTrainer(lm_model, lm_config)
lm_trainer.run_experiment()

Once training has finished, run the cell below to inspect the text that the model generates. Keep in mind that this is an extremely small model trained on a tiny dataset.
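
The `generate` method of the `GPT` class divides the logits by `temperature` before sampling. The following toy sketch shows the effect on a made-up logit vector; whether `generate_text` exposes this parameter is not shown here, so treat this purely as an illustration of the scaling step.

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([2.0, 1.0, 0.0])  # made-up logits for a 3-token vocabulary

probs_by_temp = {}
for temperature in (0.5, 1.0, 2.0):
    probs_by_temp[temperature] = F.softmax(logits / temperature, dim=-1)
    print(temperature, probs_by_temp[temperature].tolist())
```

Lower temperatures concentrate probability mass on the most likely token, while higher temperatures flatten the distribution and make sampling more random.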

In [None]:
from exercise_utils.nlp.lm.utils import format_generation_logging, generate_text

prompt = "Now are our brows bound with victorious wreaths;"

# Generate text from the trained model, conditioned on the prompt
text_gen = generate_text(lm_trainer.model, lm_trainer.tokenizer, 1024, prompt)
text_gen = format_generation_logging(text_gen, prompt)
display(Markdown(text_gen))

<a id='transformer'></a>
# 4. Encoder-Decoder Transformer

Now you will implement the **encoder-decoder** variant of the Transformer model, similar to the model described in [Attention Is All You Need](https://arxiv.org/abs/1706.03762).

#### Key Components:
- **Encoder Block**:
  - Multi-Head Self-Attention
  - Feed-Forward Network
- **Decoder Block**:
  - Multi-Head Self-Attention
  - Cross-Attention (attends to encoder output)
  - Feed-Forward Network

We use the **pre-norm architecture**, applying layer normalization before attention and feed-forward layers.

#### Transformer Architecture:

<img src="./assets/prenorm_transformer.svg" alt="Transformer Model" style="width: auto; height: 400px;">

(Note that the figure omits the embeddings, the output projection, and the softmax for clarity.)

<a id='transformer-encoder-block'></a>
## 4.1 Transformer Encoder Block

In this section, you will implement the `TransformerEncoderBlock`, the part of the transformer architecture that processes the source (input) tokens. 

**TODOs**: Complete the `TransformerEncoderBlock` class implementation using the architecture described above.

- **TODO 1: Self-Attention**
   - Remember to use `src_key_padding_mask` to prevent the encoder from attending to padded tokens.

- **TODO 2: Feed-Forward**
   - Implement the feed-forward part of the block using the class members.

**Hints**:
- Unlike the decoder block, the encoder block doesn't require an `attn_mask`, since every source token may attend to every other (non-padded) source token.
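
The padding mask itself comes from the imported helper `create_key_padding_mask`, whose implementation is not shown in this notebook. As an assumption about what such a helper might do, the hypothetical function `lengths_to_mask` below turns sentence lengths into a boolean mask that follows the `MultiheadAttention` docstring (`True` for real tokens, `False` for padding).

```python
import torch


def lengths_to_mask(lengths, max_len):
    """Hypothetical helper: True for real tokens, False for padding."""
    positions = torch.arange(max_len, device=lengths.device)  # (max_len,)
    # Compare every position with every length via broadcasting: (B, max_len).
    return positions[None, :] < lengths[:, None]


src_len = torch.tensor([4, 2, 3])
mask = lengths_to_mask(src_len, max_len=4)

print(mask.tolist())
# [[True, True, True, True], [True, True, False, False], [True, True, True, False]]
```

Each row has as many leading `True` entries as the corresponding sentence has tokens.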

In [None]:
from models.modules import FeedForwardBlock
from models.transformer import create_key_padding_mask


class TransformerEncoderBlock(nn.Module):
    def __init__(self, transformer_config):
        super().__init__()
        n_head = transformer_config.n_head
        n_embd = transformer_config.n_embd
        dropout = transformer_config.dropout
        attn_dropout = transformer_config.attn_dropout

        self.norm1 = nn.LayerNorm(n_embd)
        self.norm2 = nn.LayerNorm(n_embd)

        self.self_attn = MultiheadAttention(n_embd, n_head, dropout, attn_dropout)
        self.ff = FeedForwardBlock(n_embd, dropout)

    def forward(self, src, src_len):
        """
        Args:
            src: source sentence embeddings of shape (B, T, C)
            src_len: int tensor stating the length of each source sentence in the batch. shape (B,)

        Returns:
            output of the transformer encoder block. shape (B, T, C)
        """
        # Source key padding mask
        src_key_padding_mask = create_key_padding_mask(src_len, src)

        # TODO 1: Self-attention
        # YOUR CODE HERE
        raise NotImplementedError()

        # TODO 2: Feed-forward
        # YOUR CODE HERE
        raise NotImplementedError()


TestTransformerEncoderBlock.test_basic(TransformerEncoderBlock)

<a id='transformer-decoder-block'></a>
## 4.2 Transformer Decoder Block

In this section, you will implement the `TransformerDecoderBlock`, which processes the output of the encoder (usually referred to as `memory`) and the tokens of the target sequence. This block combines self-attention, cross-attention, and feed-forward layers using the architecture described above.

**TODOs**: Complete the `TransformerDecoderBlock` class implementation using the pre-norm architecture.

- **TODO 1: Self-Attention**
   - `key_padding_mask` in the `MultiheadAttention` module should be used to prevent the decoder from attending to padded target tokens (here, `tgt_key_padding_mask`).
   - `attn_mask` should also be specified. This should be a causal mask to prevent tokens from attending to future tokens.

- **TODO 2: Cross-Attention**
   - This will perform cross-attention between the target tokens and the memory (output of the encoder), where queries are the target tokens and keys and values are the encoder outputs.
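   - `key_padding_mask` should be used to ignore padded source positions (here, `memory_key_padding_mask`, since the keys come from the encoder output), while `attn_mask` is not required.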

- **TODO 3: Feed-Forward**
   - Implement the feed-forward part of the block using the class members. 

**Hints**:
- The `causal_mask` buffer can be used as the `attn_mask` for the self-attention mechanism in the decoder block. Note that the target sentence lengths may vary between batches, so you need to adjust the size of the `causal_mask` accordingly.
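
The shapes in cross-attention differ from those in self-attention because queries and keys come from sequences of different lengths. The following single-head sketch uses random tensors and omits the learned projections, the head split, layer normalization, and dropout, so it only illustrates the data flow; the mask is built inline under the assumption that `True` marks real tokens.

```python
import math

import torch
import torch.nn.functional as F

B, T, S, C = 2, 3, 5, 8
tgt = torch.randn(B, T, C)          # decoder-side states, used as queries
memory = torch.randn(B, S, C)       # encoder output, used as keys and values
memory_len = torch.tensor([5, 3])   # second source sentence has 2 padded positions

# True for real source tokens, False for padding: (B, S).
memory_mask = torch.arange(S)[None, :] < memory_len[:, None]

# Every target position scores every source position: (B, T, S).
scores = tgt @ memory.transpose(1, 2) / math.sqrt(C)
scores = scores.masked_fill(~memory_mask[:, None, :], float("-inf"))
weights = F.softmax(scores, dim=-1)

# The output has one vector per target position: (B, T, C).
out = weights @ memory
print(weights.shape, out.shape)
```

No causal mask appears here: the complete source sentence is available before decoding starts, so each target position may attend to all non-padded source positions.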

In [None]:
class TransformerDecoderBlock(nn.Module):
    def __init__(self, transformer_config):
        super().__init__()
        n_head = transformer_config.n_head
        n_embd = transformer_config.n_embd
        dropout = transformer_config.dropout
        attn_dropout = transformer_config.attn_dropout
        context_length = transformer_config.context_length

        self.norm1 = nn.LayerNorm(n_embd)
        self.norm2 = nn.LayerNorm(n_embd)
        self.norm3 = nn.LayerNorm(n_embd)

        self.self_attn = MultiheadAttention(n_embd, n_head, dropout, attn_dropout)
        self.cross_attn = MultiheadAttention(n_embd, n_head, dropout, attn_dropout)

        self.ff = FeedForwardBlock(n_embd, dropout)

        self.register_buffer(
            "causal_mask",
            torch.tril(torch.ones(context_length, context_length)).to(torch.bool),
        )

    def forward(self, tgt, tgt_len, memory, memory_len):
        """
        Args:
            tgt: target sentence embeddings of shape (B, T, C)
            tgt_len: int tensor stating the length of each target sentence in the batch. shape (B,)
            memory: encoder output for the source sentences, shape (B, S, C)
            memory_len: int tensor stating the length of each source sentence in the batch. shape (B,)

        Returns:
            output of the transformer decoder block. shape (B, T, C)

        B: batch size
        T: target sequence length
        S: source sequence length
        C: embedding dimension
        """
        _, T, _ = tgt.shape

        # create key padding masks
        tgt_key_padding_mask = create_key_padding_mask(tgt_len, tgt)
        memory_key_padding_mask = create_key_padding_mask(memory_len, memory)

        # TODO 1: Self-attention
        # YOUR CODE HERE
        raise NotImplementedError()

        # TODO 2: Cross-attention
        # YOUR CODE HERE
        raise NotImplementedError()

        # TODO 3: Feed-forward
        # YOUR CODE HERE
        raise NotImplementedError()


TestTransformerDecoderBlock.test_basic(TransformerDecoderBlock)

<a id='experiment-nmt'></a>
# 5. Experiment: Neural Machine Translation

With the key components of the Transformer model implemented, you can now proceed to train it on the **Multi30k dataset** and evaluate its performance.

**TODO**: 
1. **Start TensorBoard**: Monitor the training progress and metrics.
2. **Run the Training Script**: Execute the cell below to begin training.

**Notes**: 
- You can review the hyperparameters used for training in `configs/nmt_config.py`.
- To see alternative Transformer configurations, check `configs/transformer_config.py`.

The default training configuration (`gpt-micro`, `max_steps=40000`) takes under 30 minutes to train on a GPU. If you implement the model correctly, you should get a BLEU score between 33 and 35 and a validation loss under 1.8.


In [None]:
from models.transformer import Transformer

nmt_config = NMTExperimentConfig("gpt-micro")
nmt_model = Transformer(TransformerEncoderBlock, TransformerDecoderBlock, nmt_config)
nmt_trainer = NMTTrainer(nmt_model, nmt_config)
nmt_trainer.run_experiment()

Run the cell below to see some example translations generated by the model.

In [None]:
def display_translations(trainer):
    start_tag = '<div style="font-size: 14px; line-height: 1.5;">\n'
    body = trainer.get_random_examples()
    end_tag = "\n</div>"
    display(Markdown(start_tag + body + end_tag))


display_translations(nmt_trainer)

<a id='questions'></a>
# 6. Questions

While the model is being trained, feel free to think about these questions to check your understanding of the transformer architecture.

1. How does our positional embedding implementation differ from the [original transformer paper](https://arxiv.org/abs/1706.03762)?
2. How does the `key_padding_mask` differ from the `attn_mask` in the transformer model?
3. For our machine translation task, would increasing the `context_length` to 2048 or 4096 likely improve the model's performance? Why or why not?
4. If the `context_length` is set to 20, how should we process the input data of length 10 before feeding it to the transformer model?

<a id='references'></a>
# 7. References

- [Attention Is All You Need](https://arxiv.org/abs/1706.03762)
- [On Layer Normalization in the Transformer Architecture](https://arxiv.org/abs/2002.04745)
- [Multi30K: Multilingual English-German Image Descriptions](https://arxiv.org/abs/1605.00459)