# Chapter 3 - Exercises

> Author: Badr TAJINI - Large Language Models (LLMs) - ESIEE 2024-2025

---

# Exercise 3.1

The `nn.Linear` layers in `SelfAttention_v2` use a different weight initialization scheme than the `nn.Parameter(torch.rand(d_in, d_out))` used in `SelfAttention_v1`, so the two classes produce different outputs for the same input. To check that the two implementations are otherwise equivalent, we transfer the weights from one to the other and compare the outputs of `SelfAttention_v1` and `SelfAttention_v2`.

**Key Exercise Question: Can you transfer the weights from `SelfAttention_v2` to `SelfAttention_v1` such that both implementations produce identical output tensors?**

*Specific Challenges:*
- Recognize that `nn.Linear` stores its weight matrix in a transposed configuration
- Carefully map and transfer weights between the two self-attention implementations
- Verify that the transferred weights result in mathematically equivalent computational results

The goal is to copy the weight matrices from an instantiated `SelfAttention_v2` object into a `SelfAttention_v1` instance, transposing each matrix to account for how `nn.Linear` stores its weights.
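The transposed storage mentioned in the challenges can be checked directly. The following is a minimal sketch, with illustrative sizes (`d_in=3`, `d_out=2`) that are not part of the exercise: it inspects the shape of an `nn.Linear` weight and confirms that multiplying by its transpose reproduces the layer's output.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_in, d_out = 3, 2  # illustrative sizes, not taken from the exercise
linear = nn.Linear(d_in, d_out, bias=False)
x = torch.rand(5, d_in)

# nn.Linear keeps its weight as (out_features, in_features)
weight_shape = tuple(linear.weight.shape)

# The explicit-matrix form x @ W therefore needs W = weight.T, shape (d_in, d_out)
manual = x @ linear.weight.T
same = torch.allclose(manual, linear(x))

print(weight_shape, same)
```

This is why a weight matrix taken from `SelfAttention_v2` has to be transposed before it can be used in the `x @ W` form of `SelfAttention_v1`.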

The material that follows extends the self-attention mechanism with two architectural enhancements:

1. **Causal Masking**: This modification prevents the attention mechanism from accessing future sequence elements. The constraint is essential in generative language modeling, where each token's prediction must be conditioned only on the tokens that precede it.

2. **Multi-Head Attention**: This approach splits the attention mechanism into parallel "heads." Each head has its own learnable weights, so different heads can attend to different representation subspaces and positions, which lets the model capture several kinds of relationships in the same sequence at once.

Together, these refinements make the attention mechanism better suited to sequence modeling.
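As a rough illustration of causal masking, the sketch below masks the upper triangle of a score matrix before the softmax. The scores are random stand-ins rather than real query-key products, and the scaling step is omitted for brevity; both are assumptions made only for this example.

```python
import torch

torch.manual_seed(0)
seq_len = 4
attn_scores = torch.rand(seq_len, seq_len)  # stand-in for queries @ keys.T

# True above the diagonal marks "future" positions for each row
future = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()
masked_scores = attn_scores.masked_fill(future, float("-inf"))

# softmax maps -inf to a weight of exactly 0, so each row still sums to 1
attn_weights = torch.softmax(masked_scores, dim=-1)
print(attn_weights)
```

Each row of `attn_weights` is nonzero only up to and including its own position, so token `i` never attends to tokens after `i`.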

In [1]:
import torch
import torch.nn as nn

class SelfAttention_v1(nn.Module):
    def __init__(self, d_in, d_out):
        super().__init__()
        self.W_query = nn.Parameter(torch.rand(d_in, d_out))
        self.W_key = nn.Parameter(torch.rand(d_in, d_out))
        self.W_value = nn.Parameter(torch.rand(d_in, d_out))

    def forward(self, x):
        keys = x @ self.W_key
        queries = x @ self.W_query
        values = x @ self.W_value
        attn_scores = queries @ keys.T
        attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1)
        context_vec = attn_weights @ values
        return context_vec


class SelfAttention_v2(nn.Module):
    def __init__(self, d_in, d_out, qkv_bias=False):
        super().__init__()
        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)

    def forward(self, x):
        keys = self.W_key(x)
        queries = self.W_query(x)
        values = self.W_value(x)
        attn_scores = queries @ keys.T
        attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1)
        context_vec = attn_weights @ values
        return context_vec


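# Copy the weights from v2 into v1
def transfer_weights(sa_v2, sa_v1):
    # nn.Linear stores weights as (d_out, d_in); transpose to (d_in, d_out).
    # copy_ writes into v1's own storage instead of aliasing v2's tensors.
    with torch.no_grad():
        sa_v1.W_query.copy_(sa_v2.W_query.weight.T)
        sa_v1.W_key.copy_(sa_v2.W_key.weight.T)
        sa_v1.W_value.copy_(sa_v2.W_value.weight.T)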


# Test with an example
d_in, d_out = 4, 4
inputs = torch.rand(2, d_in)

sa_v1 = SelfAttention_v1(d_in, d_out)
sa_v2 = SelfAttention_v2(d_in, d_out)

# Before the transfer
output_v1_before = sa_v1(inputs)
output_v2_before = sa_v2(inputs)

# Transfer the weights
transfer_weights(sa_v2, sa_v1)

# After the transfer
output_v1_after = sa_v1(inputs)
output_v2_after = sa_v2(inputs)

# Results
print("Before transfer:", output_v1_before)
print("After transfer:", output_v1_after)
print("v2 outputs:", output_v2_after)


Before transfer: tensor([[0.5932, 1.2044, 0.6683, 0.6424],
        [0.5930, 1.2040, 0.6685, 0.6418]], grad_fn=<MmBackward0>)
After transfer: tensor([[-0.0086, -0.2510,  0.1119, -0.2838],
        [-0.0088, -0.2511,  0.1122, -0.2836]], grad_fn=<MmBackward0>)
v2 outputs: tensor([[-0.0086, -0.2510,  0.1119, -0.2838],
        [-0.0088, -0.2511,  0.1122, -0.2836]], grad_fn=<MmBackward0>)


# Exercise 3.2

**Key Exercise Question: How can you modify the input arguments to the `MultiHeadAttentionWrapper(num_heads=2)` to transform the output context vectors from four-dimensional to two-dimensional while maintaining the `num_heads=2` configuration?**

*Specific Challenges:*
- Identify the input parameter that controls the dimensionality of output context vectors
- Understand the relationship between input arguments and tensor shape
- Achieve dimensionality reduction without modifying the core `MultiHeadAttentionWrapper` class implementation

*Architectural Context:*
Up to this point, we have developed a `MultiHeadAttentionWrapper` that integrates multiple single-head attention modules through sequential processing, implemented via the comprehension `[head(x) for head in self.heads]` in the forward method. This current implementation represents a foundational approach to multi-head attention mechanisms.

*Potential Optimization Strategies:*
1. **Sequential Processing Limitation**: The current implementation loops over the attention heads one at a time, which is inefficient because each head launches its own small matrix multiplications.

2. **Parallel Processing Approach**: A more efficient alternative computes the outputs of all heads at once with a single, larger matrix multiplication and then splits the result into heads, which removes the per-head loop.
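The equivalence between the two strategies can be sketched for the projection step alone. In the example below, the sizes and the variable names (`heads`, `W`, `split`) are assumptions for illustration: the per-head weight matrices are stacked into one matrix, a single matmul replaces the loop, and a `view` recovers the per-head outputs.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_in, head_dim, num_heads, seq_len = 4, 2, 2, 3  # illustrative sizes
x = torch.rand(seq_len, d_in)

# Sequential: one projection per head, concatenated afterwards
heads = [nn.Linear(d_in, head_dim, bias=False) for _ in range(num_heads)]
per_head = torch.cat([h(x) for h in heads], dim=-1)  # (seq_len, num_heads * head_dim)

# Parallel: stack the head weights into one matrix and do a single matmul
W = torch.cat([h.weight for h in heads], dim=0)      # (num_heads * head_dim, d_in)
combined = x @ W.T                                   # (seq_len, num_heads * head_dim)

# Splitting the last dimension recovers each head's projection
split = combined.view(seq_len, num_heads, head_dim)

print(torch.allclose(per_head, combined))
```

The same idea extends to queries, keys, and values: one large projection followed by a reshape into `(num_heads, head_dim)` gives every head its own slice without a Python loop.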

*Theoretical Implications:*
The ability to dynamically adjust output dimensionality while maintaining the multi-head attention structure highlights the flexibility of modern neural network architectural designs. Such manipulations are crucial in adapting attention mechanisms to diverse computational requirements across different machine learning domains.

*Practical Recommendation:*
Carefully examine the input arguments of the `MultiHeadAttentionWrapper` and consider how specific parameters might influence the output tensor's dimensionality. The solution likely involves a subtle adjustment that does not require restructuring the core implementation.

In [2]:
class MultiHeadAttentionWrapper(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        self.num_heads = num_heads
        self.head_dim = d_model // num_heads
        self.W_query = nn.Linear(d_model, d_model)
        self.W_key = nn.Linear(d_model, d_model)
        self.W_value = nn.Linear(d_model, d_model)
        self.fc_out = nn.Linear(d_model, d_model)

    def forward(self, x):
        batch_size, seq_length, d_model = x.shape

        # Linear transformations and split into heads
        queries = self.W_query(x).view(batch_size, seq_length, self.num_heads, self.head_dim).transpose(1, 2)
        keys = self.W_key(x).view(batch_size, seq_length, self.num_heads, self.head_dim).transpose(1, 2)
        values = self.W_value(x).view(batch_size, seq_length, self.num_heads, self.head_dim).transpose(1, 2)

        # Attention scores
        attn_scores = torch.matmul(queries, keys.transpose(-1, -2)) / (self.head_dim ** 0.5)
        attn_weights = torch.softmax(attn_scores, dim=-1)

        # Context vectors
        context = torch.matmul(attn_weights, values).transpose(1, 2).contiguous()
        context = context.view(batch_size, seq_length, -1)  # Combine heads

        # Output layer
        out = self.fc_out(context)
        return out

# Dimensions
d_model = 4
num_heads = 2

# Modification: reduce the context vectors to a lower dimension
modified_d_model = 2

# Modified class
class MultiHeadAttentionWrapperModified(nn.Module):
    def __init__(self, d_model, modified_d_model, num_heads):
        super().__init__()
        self.multi_head_attention = MultiHeadAttentionWrapper(d_model, num_heads)
        self.fc_reduce = nn.Linear(d_model, modified_d_model)

    def forward(self, x):
        out = self.multi_head_attention(x)
        out = self.fc_reduce(out)
        return out

# Example input
inputs = torch.rand(2, 5, d_model)

# Usage
attention = MultiHeadAttentionWrapperModified(d_model=d_model, modified_d_model=modified_d_model, num_heads=num_heads)
output = attention(inputs)

print("Output shape:", output.shape)

Output shape: torch.Size([2, 5, 2])
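
The cell above reaches two dimensions by adding a projection layer on top of the attention module. For a wrapper that concatenates its heads, as in the `[head(x) for head in self.heads]` comprehension described in the exercise, the output width is `num_heads * d_out`, so an argument-only alternative is to set the per-head `d_out` to 1. The sketch below assumes hypothetical `SingleHead` and `ConcatHeadsWrapper` classes written only for this illustration (no causal mask, no dropout); they are not the chapter's exact classes.

```python
import torch
import torch.nn as nn

class SingleHead(nn.Module):
    """Hypothetical minimal single-head self-attention (no mask, no dropout)."""
    def __init__(self, d_in, d_out):
        super().__init__()
        self.W_query = nn.Linear(d_in, d_out, bias=False)
        self.W_key = nn.Linear(d_in, d_out, bias=False)
        self.W_value = nn.Linear(d_in, d_out, bias=False)

    def forward(self, x):
        queries, keys, values = self.W_query(x), self.W_key(x), self.W_value(x)
        scores = queries @ keys.transpose(-2, -1)
        weights = torch.softmax(scores / keys.shape[-1] ** 0.5, dim=-1)
        return weights @ values

class ConcatHeadsWrapper(nn.Module):
    """Hypothetical wrapper that runs heads in a loop and concatenates them."""
    def __init__(self, d_in, d_out, num_heads):
        super().__init__()
        self.heads = nn.ModuleList(
            [SingleHead(d_in, d_out) for _ in range(num_heads)]
        )

    def forward(self, x):
        # Output width is num_heads * d_out
        return torch.cat([head(x) for head in self.heads], dim=-1)

torch.manual_seed(0)
x = torch.rand(2, 5, 3)  # (batch, tokens, d_in)

four_dim = ConcatHeadsWrapper(d_in=3, d_out=2, num_heads=2)(x)
two_dim = ConcatHeadsWrapper(d_in=3, d_out=1, num_heads=2)(x)
print(four_dim.shape, two_dim.shape)
```

With `num_heads=2` held fixed, changing `d_out` from 2 to 1 halves the width of the concatenated context vectors without touching the class itself.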


# Exercise 3.3

**Key Exercise Question: Can you configure a `MultiHeadAttention` module that precisely replicates the architectural specifications of the smallest GPT-2 model?**

*Specific Model Specifications:*
- Number of Attention Heads: 12
- Input/Output Embedding Dimensions: 768
- Context Length: 1,024 tokens

*Architectural Parameters:*
- `num_heads`: 12
- `d_model`: 768
- `context_length`: 1,024

*Theoretical Considerations:*
The proposed configuration mirrors the smallest variant of the GPT-2 model, which represents a fundamental architecture in transformer-based language models. By precisely replicating these specifications, we can explore the intricate design choices that contribute to the model's effectiveness in natural language processing tasks.

*Key Implementation Details:*
- Ensuring 12 parallel attention heads allows for multi-perspective feature representation
- The 768-dimensional embedding space provides a rich, high-dimensional representation of linguistic context
- The 1,024 token context length enables comprehensive sequence processing

*Practical Recommendation:*
Carefully construct the `MultiHeadAttention` initialization to match these exact specifications, paying close attention to the dimensionality and number of heads to accurately reproduce the smallest GPT-2 model's architectural characteristics.
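
The numbers implied by these specifications can be worked out with plain arithmetic. The sketch below assumes an attention module made of four `nn.Linear(d_model, d_model)` layers with bias (query, key, value, and output projections); the parameter count applies only under that assumption.

```python
d_model, num_heads, context_length = 768, 12, 1024

# Each head works on an equal slice of the embedding
head_dim = d_model // num_heads

# One d_model x d_model linear layer with bias
params_per_linear = d_model * d_model + d_model

# Query, key, value, and output projections
attn_params = 4 * params_per_linear

# Attention scores per full-length sequence: one (context x context) matrix per head
scores_per_sequence = num_heads * context_length * context_length

print(head_dim, attn_params, scores_per_sequence)
```

Under these assumptions each head operates on 64 dimensions, the module holds 2,362,368 parameters, and a full 1,024-token sequence produces 12,582,912 attention scores.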

In [3]:
class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        self.num_heads = num_heads
        self.head_dim = d_model // num_heads

        # Linear layers for query, key, and value
        self.W_query = nn.Linear(d_model, d_model)
        self.W_key = nn.Linear(d_model, d_model)
        self.W_value = nn.Linear(d_model, d_model)

        # Output linear layer
        self.fc_out = nn.Linear(d_model, d_model)

    def forward(self, x):
        batch_size, seq_length, d_model = x.shape

        # Linear transformations and split into heads
        queries = self.W_query(x).view(batch_size, seq_length, self.num_heads, self.head_dim).transpose(1, 2)
        keys = self.W_key(x).view(batch_size, seq_length, self.num_heads, self.head_dim).transpose(1, 2)
        values = self.W_value(x).view(batch_size, seq_length, self.num_heads, self.head_dim).transpose(1, 2)

        # Scaled dot-product attention
        attn_scores = torch.matmul(queries, keys.transpose(-1, -2)) / (self.head_dim ** 0.5)
        attn_weights = torch.softmax(attn_scores, dim=-1)
        context = torch.matmul(attn_weights, values)

        # Concatenate heads and pass through final linear layer
        context = context.transpose(1, 2).contiguous().view(batch_size, seq_length, d_model)
        out = self.fc_out(context)
        return out


# GPT-2 configuration (smallest variant)
d_model = 768  # Embedding dimension
num_heads = 12  # Number of attention heads
context_length = 1024  # Context length (tokens)

# Example input: batch of sequences with GPT-2 dimensions
batch_size = 2
inputs = torch.rand(batch_size, context_length, d_model)  # Batch size = 2, seq length = 1024, embedding dim = 768

# Initialization and test
multi_head_attention = MultiHeadAttention(d_model=d_model, num_heads=num_heads)
output = multi_head_attention(inputs)

# Check the output dimensions
print("Input shape:", inputs.shape)
print("Output shape:", output.shape)

Input shape: torch.Size([2, 1024, 768])
Output shape: torch.Size([2, 1024, 768])
