# Chapter 3 - Exercises

> Author : Badr TAJINI - Large Language Models (LLMs) - ESIEE 2024-2025

> Response by Paul CASCARINO E5-DSIA

# Exercise 3.1

### Can you transfer the weights from `SelfAttention_v2` to `SelfAttention_v1` such that both implementations produce identical output tensors?

#### 0. Setup

In [1]:
import torch
import torch.nn as nn


torch.manual_seed(123)


inputs = torch.tensor(
  [[0.43, 0.15, 0.89], # Your     (x^1)
   [0.55, 0.87, 0.66], # journey  (x^2)
   [0.57, 0.85, 0.64], # starts   (x^3)
   [0.22, 0.58, 0.33], # with     (x^4)
   [0.77, 0.25, 0.10], # one      (x^5)
   [0.05, 0.80, 0.55]] # step     (x^6)
)

x_2 = inputs[1] # second input element
d_in = inputs.shape[1] # the input embedding size, d=3
d_out = 2 # the output embedding size, d=2

class SelfAttention_v1(nn.Module):

    def __init__(self, d_in, d_out):
        super().__init__()
        self.W_query = nn.Parameter(torch.rand(d_in, d_out))
        self.W_key   = nn.Parameter(torch.rand(d_in, d_out))
        self.W_value = nn.Parameter(torch.rand(d_in, d_out))

    def forward(self, x):
        keys = x @ self.W_key
        queries = x @ self.W_query
        values = x @ self.W_value
        
        attn_scores = queries @ keys.T # omega
        attn_weights = torch.softmax(
            attn_scores / keys.shape[-1]**0.5, dim=-1
        )

        context_vec = attn_weights @ values
        return context_vec
    


class SelfAttention_v2(nn.Module):

    def __init__(self, d_in, d_out, qkv_bias=False):
        super().__init__()
        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key   = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)

    def forward(self, x):
        keys = self.W_key(x)
        queries = self.W_query(x)
        values = self.W_value(x)
        
        attn_scores = queries @ keys.T
        attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1)

        context_vec = attn_weights @ values
        return context_vec


sa_v1 = SelfAttention_v1(d_in, d_out)
print(sa_v1(inputs))


sa_v2 = SelfAttention_v2(d_in, d_out)
print(sa_v2(inputs))



tensor([[0.2996, 0.8053],
        [0.3061, 0.8210],
        [0.3058, 0.8203],
        [0.2948, 0.7939],
        [0.2927, 0.7891],
        [0.2990, 0.8040]], grad_fn=<MmBackward0>)
tensor([[0.5085, 0.3508],
        [0.5084, 0.3508],
        [0.5084, 0.3506],
        [0.5074, 0.3471],
        [0.5076, 0.3446],
        [0.5077, 0.3493]], grad_fn=<MmBackward0>)


#### 1. My responses

- `SelfAttention_v2` uses PyTorch's `nn.Linear` layers, "which are equivalent to a matrix multiplication if we disable the bias units" (cf. lab 3). `nn.Linear` stores each of its trainable weight matrices $W_q$, $W_k$, and $W_v$ with shape $[d_{out}, d_{in}]$, which is the transposed layout.

- In `SelfAttention_v1`, the trainable weight matrices $W_q$, $W_k$, and $W_v$ are stored directly with shape $[d_{in}, d_{out}]$.

- Therefore, to transfer the weights from `SelfAttention_v2` to `SelfAttention_v1`, we have to **transpose** them.

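The layout difference described above can be checked directly. The following is a minimal sketch, assuming a fresh bias-free `nn.Linear(3, 2)` layer; the names `linear`, `x`, `manual`, and `same` are illustrative and not part of the exercise code. It confirms that the stored weight has shape `[d_out, d_in]` and that multiplying by its transpose reproduces the layer's output.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # arbitrary seed, used only to make this sketch reproducible

d_in, d_out = 3, 2
linear = nn.Linear(d_in, d_out, bias=False)  # same layer type as in SelfAttention_v2

# nn.Linear stores its weight as [d_out, d_in]
weight_shape = tuple(linear.weight.shape)
print(weight_shape)  # (2, 3)

x = torch.rand(6, d_in)  # six illustrative token embeddings

# SelfAttention_v1 computes x @ W with W of shape [d_in, d_out],
# so the nn.Linear weight has to be transposed before it can be used that way
manual = x @ linear.weight.T
same = torch.allclose(manual, linear(x), atol=1e-6)
print(same)
```

If `same` is `True`, copying `weight.T` into an `nn.Parameter` of shape `[d_in, d_out]` gives the same projection as the linear layer.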

#### 2. The implementation

In [3]:
# Transfer weights from sa_v2 to sa_v1
with torch.no_grad():
    sa_v1.W_query.copy_(sa_v2.W_query.weight.T)
    sa_v1.W_key.copy_(sa_v2.W_key.weight.T)
    sa_v1.W_value.copy_(sa_v2.W_value.weight.T)

# Verify outputs
output_v2 = sa_v2(inputs)
output_v1 = sa_v1(inputs)

print("Output from SelfAttention_v2:", output_v2)
print("Output from SelfAttention_v1:", output_v1)

# Assert equivalence
assert torch.allclose(output_v1, output_v2, atol=1e-6), "Outputs do not match!"
print("Success: Outputs from both implementations match.")

Output from SelfAttention_v2: tensor([[0.5085, 0.3508],
        [0.5084, 0.3508],
        [0.5084, 0.3506],
        [0.5074, 0.3471],
        [0.5076, 0.3446],
        [0.5077, 0.3493]], grad_fn=<MmBackward0>)
Output from SelfAttention_v1: tensor([[0.5085, 0.3508],
        [0.5084, 0.3508],
        [0.5084, 0.3506],
        [0.5074, 0.3471],
        [0.5076, 0.3446],
        [0.5077, 0.3493]], grad_fn=<MmBackward0>)
Success: Outputs from both implementations match.


# Exercise 3.2

### How can you modify the input arguments to the `MultiHeadAttentionWrapper(num_heads=2)` to transform the output context vectors from four-dimensional to two-dimensional while maintaining the `num_heads=2` configuration?

#### 0. Setup

In [5]:
torch.manual_seed(123)

class CausalAttention(nn.Module):

    def __init__(self, d_in, d_out, context_length,
                 dropout, qkv_bias=False):
        super().__init__()
        self.d_out = d_out
        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key   = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.dropout = nn.Dropout(dropout) # New
        self.register_buffer('mask', torch.triu(torch.ones(context_length, context_length), diagonal=1)) # New

    def forward(self, x):
        b, num_tokens, d_in = x.shape # New batch dimension b
        keys = self.W_key(x)
        queries = self.W_query(x)
        values = self.W_value(x)

        attn_scores = queries @ keys.transpose(1, 2) # Changed transpose
        attn_scores.masked_fill_(  # New, _ ops are in-place
            self.mask.bool()[:num_tokens, :num_tokens], -torch.inf)  # `:num_tokens` to account for cases where the number of tokens in the batch is smaller than the supported context_size
        attn_weights = torch.softmax(
            attn_scores / keys.shape[-1]**0.5, dim=-1
        )
        attn_weights = self.dropout(attn_weights) # New

        context_vec = attn_weights @ values
        return context_vec
    
    
class MultiHeadAttentionWrapper(nn.Module):

    def __init__(self, d_in, d_out, context_length, dropout, num_heads, qkv_bias=False):
        super().__init__()
        self.heads = nn.ModuleList(
            [CausalAttention(d_in, d_out, context_length, dropout, qkv_bias) 
             for _ in range(num_heads)]
        )

    def forward(self, x):
        return torch.cat([head(x) for head in self.heads], dim=-1)

batch = torch.stack((inputs, inputs), dim=0)
print(batch.shape) # 2 inputs with 6 tokens each, and each token has embedding dimension 3


context_length = batch.shape[1] # This is the number of tokens
d_in, d_out = 3, 2
mha = MultiHeadAttentionWrapper(
    d_in, d_out, context_length, 0.0, num_heads=2
)

context_vecs = mha(batch)

print(context_vecs)
print("context_vecs.shape:", context_vecs.shape)

torch.Size([2, 6, 3])
tensor([[[-0.4519,  0.2216,  0.4772,  0.1063],
         [-0.5874,  0.0058,  0.5891,  0.3257],
         [-0.6300, -0.0632,  0.6202,  0.3860],
         [-0.5675, -0.0843,  0.5478,  0.3589],
         [-0.5526, -0.0981,  0.5321,  0.3428],
         [-0.5299, -0.1081,  0.5077,  0.3493]],

        [[-0.4519,  0.2216,  0.4772,  0.1063],
         [-0.5874,  0.0058,  0.5891,  0.3257],
         [-0.6300, -0.0632,  0.6202,  0.3860],
         [-0.5675, -0.0843,  0.5478,  0.3589],
         [-0.5526, -0.0981,  0.5321,  0.3428],
         [-0.5299, -0.1081,  0.5077,  0.3493]]], grad_fn=<CatBackward0>)
context_vecs.shape: torch.Size([2, 6, 4])


#### 1. My response

- The input argument that controls the dimensionality of the output context vectors in `MultiHeadAttentionWrapper` is $d_{out}$, which is passed to each `CausalAttention` head. $d_{out}$ is the "embedding dimension" of the key, query, and value vectors of each head.

- Because the wrapper concatenates the head outputs, the final embedding dimension is $d_{out} \times num_{heads}$. If we keep $num_{heads} = 2$, the final embedding dimension is $2 \times d_{out}$.

- Wanting two-dimensional output context vectors therefore implies $d_{out} = 1$.

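- Each attention head then outputs a single scalar value per token.

- Concatenating the two scalar head outputs along the last dimension, as the original wrapper does, already gives a two-dimensional context vector per token, so setting $d_{out} = 1$ is sufficient on its own. The variant implemented below instead **stacks the head outputs and averages them along the head dimension**, which collapses the result to one value per token (shape `[2, 6, 1]`).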
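The effect of the two aggregation choices on the output shape can be checked without building the attention heads. The sketch below uses placeholder tensors (`head_a` and `head_b` are hypothetical stand-ins for the outputs of two heads with `d_out = 1`, not real attention outputs) and compares concatenation with stacking followed by averaging.

```python
import torch

batch_size, num_tokens, d_out = 2, 6, 1

# Hypothetical stand-ins for the outputs of two attention heads with d_out = 1
head_a = torch.rand(batch_size, num_tokens, d_out)
head_b = torch.rand(batch_size, num_tokens, d_out)

# Concatenation along the last dimension: d_out * num_heads = 2 features per token
concatenated = torch.cat([head_a, head_b], dim=-1)
print(concatenated.shape)  # torch.Size([2, 6, 2])

# Stacking adds a head dimension; averaging over it removes it again
averaged = torch.stack([head_a, head_b], dim=-1).mean(dim=-1)
print(averaged.shape)  # torch.Size([2, 6, 1])
```

Only the concatenated tensor has a last dimension of 2; averaging keeps the per-head size of 1.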

#### 2. Implementation

    def forward(self, x):

        head_output = [head(x) for head in self.heads]

        return torch.mean(torch.stack(head_output, dim=-1), dim=-1)

In [11]:
class MultiHeadAttentionWrapper(nn.Module):

    def __init__(self, d_in, d_out, context_length, dropout, num_heads, qkv_bias=False):
        super().__init__()
        self.heads = nn.ModuleList(
            [CausalAttention(d_in, d_out, context_length, dropout, qkv_bias) 
             for _ in range(num_heads)]
        )

    def forward(self, x):

        head_output = [head(x) for head in self.heads]

        return torch.mean(torch.stack(head_output, dim=-1), dim=-1)
    

d_in, d_out = 3, 1  
context_length = batch.shape[1]   # This is the number of tokens
mha = MultiHeadAttentionWrapper(
    d_in, d_out, context_length, dropout=0.0, num_heads=2
)


context_vecs = mha(batch)

print("context_vecs.shape:", context_vecs.shape)

context_vecs.shape: torch.Size([2, 6, 1])


# Exercise 3.3

### Can you configure a `MultiHeadAttention` module that precisely replicates the architectural specifications of the smallest GPT-2 model?

#### 0. Setup

In [12]:
# Model configurations
d_in = 768  # Input embedding dimension (also d_model for GPT-2)
d_out = 768  # Output embedding dimension (same as input for GPT-2)
num_heads = 12  # Number of attention heads in GPT-2 smallest model
context_length = 1024  # The context length (number of tokens GPT-2 can process)
dropout = 0.1  # Typical dropout rate for transformer models

# Create an instance of the MultiHeadAttentionWrapper
mha = MultiHeadAttentionWrapper(
    d_in=d_in, 
    d_out=d_out, 
    context_length=context_length, 
    dropout=dropout, 
    num_heads=num_heads
)

# Example input (batch size = 2, sequence length = 1024, embedding size = 768)
inputs = torch.rand(2, context_length, d_in)

# Forward pass
context_vecs = mha(inputs)

# Output shape should be [2, 1024, 768] as per GPT-2's architecture
print("context_vecs.shape:", context_vecs.shape)

context_vecs.shape: torch.Size([2, 1024, 768])


**Key implementation details:**

- **12 parallel attention heads:** the model is configured with `num_heads=12` in the `MultiHeadAttentionWrapper`, so 12 attention heads are used. This allows the attention mechanism to capture multiple perspectives of the input sequence and to focus on different subspaces of the input representations.

- **768-dimensional embedding space:** both the input (`d_in`) and output (`d_out`) embedding dimensions are set to 768, which is consistent with GPT-2's architecture. Each token is therefore represented in a 768-dimensional space, which facilitates the representation of rich contextual information.

- **1,024-token context length:** the `context_length=1024` setting allows the model to process sequences of up to 1,024 tokens. This matches the smallest GPT-2 model, which can handle 1,024 tokens in a single forward pass and can therefore capture long-range dependencies in text.
**Practical recommendation:**

The `MultiHeadAttentionWrapper` is initialized with the hyperparameters of GPT-2's smallest model:

- `d_in` and `d_out` set to 768 give the attention mechanism the appropriate dimensionality for both input and output.
- `num_heads=12` gives the correct number of attention heads, which allows for the multi-perspective learning characteristic of GPT-2.
- `context_length=1024` lets the model process long sequences, in line with the smallest GPT-2 model's context length.

The configuration therefore reproduces the input and output shapes of the smallest GPT-2 model's attention mechanism.

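**Outcome:**

Constructing the `MultiHeadAttentionWrapper` in this way reproduces the main hyperparameters of GPT-2's smallest model:

- 12 attention heads for multi-perspective processing
- 768-dimensional embedding space for token representations
- 1,024-token context length for sequence processing

One caveat applies: in GPT-2, each of the 12 heads works in a $768 / 12 = 64$-dimensional subspace and the head outputs are concatenated, whereas the wrapper used here gives every head its own 768-dimensional projection and averages the results. The module therefore matches GPT-2's attention hyperparameters and output shape, but not its per-head dimension or its parameter count.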
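As a rough check of the head arithmetic (the smallest GPT-2 model splits its 768 dimensions evenly over 12 heads), the sketch below is plain Python with no model involved. It counts only the query, key, and value weights, assumes bias-free projections (the wrapper's default `qkv_bias=False`), and ignores any output projection; the names `head_dim`, `split_params`, and `full_params` are illustrative.

```python
d_model = 768    # embedding dimension of the smallest GPT-2 model
num_heads = 12   # number of attention heads

# With an even split, each head works in d_model / num_heads dimensions
head_dim = d_model // num_heads
print(head_dim)  # 64

# Query, key, and value weights when the 12 heads share the 768 dimensions
split_params = 3 * d_model * (num_heads * head_dim)
print(split_params)  # 1769472

# Query, key, and value weights when each of the 12 heads projects to all 768 dimensions
full_params = num_heads * 3 * d_model * d_model
print(full_params)  # 21233664

# Ratio between the two layouts
print(full_params // split_params)  # 12
```

Under these assumptions, giving every head `d_out = 768` uses 12 times as many query, key, and value weights as the split layout.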