# Chapter 3 - Exercises

> Author: Badr TAJINI - Large Language Models (LLMs) - ESIEE 2024-2025

---

# Exercise 3.1

The `nn.Linear` layers in `SelfAttention_v2` use a different weight initialization scheme than the `nn.Parameter(torch.rand(d_in, d_out))` used in `SelfAttention_v1`, so the two classes produce different outputs for the same input. To confirm that the two implementations are otherwise equivalent, transfer the weights from one to the other and check that the outputs then match.

**Key Exercise Question: Can you transfer the weights from `SelfAttention_v2` to `SelfAttention_v1` such that both implementations produce identical output tensors?**

*Specific Challenges:*
- Recognize that `nn.Linear` stores its weight matrix in transposed form, with shape `(d_out, d_in)` rather than the `(d_in, d_out)` used in `SelfAttention_v1`
- Carefully map and transfer weights between the two self-attention implementations
- Verify that the transferred weights result in mathematically equivalent computational results

The goal is to copy the weight matrices from an instantiated `SelfAttention_v2` object into a `SelfAttention_v1` instance, which requires accounting for how each class lays out its weights.
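The layout difference behind this exercise can be sketched without PyTorch. The snippet below is a minimal, dependency-free illustration, not the course's reference solution: `linear_forward` and `v1_forward` are hypothetical stand-ins for a bias-free `nn.Linear` projection and for the `x @ W` projection in `SelfAttention_v1`, and the numbers are arbitrary.

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    """Swap the rows and columns of a nested-list matrix."""
    return [list(col) for col in zip(*m)]

def linear_forward(x, weight):
    # A bias-free nn.Linear stores weight as (d_out, d_in) and computes x @ weight.T.
    return matmul(x, transpose(weight))

def v1_forward(x, weight):
    # The v1-style projection stores weight as (d_in, d_out) and computes x @ weight.
    return matmul(x, weight)

# Toy input: 2 tokens with d_in = 3 (arbitrary values).
x = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]

# nn.Linear-style weight with shape (d_out, d_in) = (2, 3).
linear_weight = [[0.1, 0.2, 0.3],
                 [0.4, 0.5, 0.6]]

# Transposing yields the (d_in, d_out) = (3, 2) layout that the v1-style projection expects.
v1_weight = transpose(linear_weight)

out_linear = linear_forward(x, linear_weight)
out_v1 = v1_forward(x, v1_weight)
print(out_linear == out_v1)  # True
```

In PyTorch, the same idea would amount to wrapping the transposed `weight` of each `nn.Linear` layer in a new `torch.nn.Parameter` and assigning it to the matching attribute of the `SelfAttention_v1` instance.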

The exercises that follow build on two extensions of the self-attention mechanism:

1. **Causal Masking**: Prevents the attention mechanism from accessing future sequence elements. This constraint matters in generative language modeling, where each token's prediction must depend only on the tokens that precede it.

2. **Multi-Head Attention**: Splits the attention mechanism into several parallel "heads." Each head has its own learnable weights, so different heads can attend to different representation subspaces and positions, which increases the model's capacity to capture varied relationships in the input.

Together, these two refinements turn the basic self-attention module into the causal, multi-head form used in sequence modeling.
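As a rough illustration of causal masking, the sketch below applies a row-wise softmax in which position `i` can only attend to positions `j <= i`. It is a plain-Python approximation under stated assumptions: `causal_softmax` is a hypothetical helper, the scores are made up, and the usual scaling by the square root of the key dimension is left out for brevity.

```python
import math

def causal_softmax(scores):
    """Row-wise softmax where position i may only attend to positions j <= i."""
    weights = []
    for i, row in enumerate(scores):
        # Replace scores of future positions (j > i) with -inf so exp() maps them to 0.
        masked = [s if j <= i else float("-inf") for j, s in enumerate(row)]
        row_max = max(masked)  # subtract the row maximum for numerical stability
        exps = [math.exp(s - row_max) for s in masked]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Made-up attention scores for a 3-token sequence.
scores = [[0.5, 1.0, 0.2],
          [0.3, 0.8, 0.6],
          [0.9, 0.1, 0.4]]

attn_weights = causal_softmax(scores)
for row in attn_weights:
    print([round(w, 3) for w in row])
```

Every entry above the diagonal comes out as zero, and each row still sums to one, so the first token attends only to itself while the last token attends to the whole sequence.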

# Exercise 3.2

**Key Exercise Question: How can you modify the input arguments to the `MultiHeadAttentionWrapper(num_heads=2)` to transform the output context vectors from four-dimensional to two-dimensional while maintaining the `num_heads=2` configuration?**

*Specific Challenges:*
- Identify the input parameter that controls the dimensionality of output context vectors
- Understand the relationship between input arguments and tensor shape
- Achieve dimensionality reduction without modifying the core `MultiHeadAttentionWrapper` class implementation

*Architectural Context:*
So far, we have built a `MultiHeadAttentionWrapper` that combines multiple single-head attention modules by running them one after another, via the comprehension `[head(x) for head in self.heads]` in the forward method. This is the simplest way to implement multi-head attention.
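The effect of combining per-head outputs on the tensor shape can be sketched in plain Python. This is a shape-only illustration that assumes the wrapper concatenates the head outputs along the last dimension and that the original setup uses a per-head width of 2; `concat_heads`, `dummy_head_output`, and `wrapper_output_width` are hypothetical helpers, not part of the class.

```python
def concat_heads(head_outputs):
    """Join per-head context vectors token by token (concatenation on the last dim)."""
    num_tokens = len(head_outputs[0])
    return [
        [value for head in head_outputs for value in head[t]]
        for t in range(num_tokens)
    ]

def dummy_head_output(num_tokens, d_out, fill):
    """Stand-in for one attention head: num_tokens vectors of width d_out."""
    return [[fill] * d_out for _ in range(num_tokens)]

def wrapper_output_width(d_out, num_heads, num_tokens=6):
    """Width of each context vector after concatenating all head outputs."""
    heads = [dummy_head_output(num_tokens, d_out, float(h)) for h in range(num_heads)]
    context = concat_heads(heads)
    return len(context[0])

# Assumed original setup: two heads, each producing vectors of width 2.
width = wrapper_output_width(d_out=2, num_heads=2)
print(width)  # 4
```

Under this assumption, the width of each context vector is the per-head width multiplied by the number of heads, which is the relationship the exercise asks you to exploit.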

*Potential Optimization Strategies:*
1. **Sequential Processing Limitation**: The current implementation evaluates the attention heads one at a time in a Python loop, which is inefficient.

2. **Parallel Processing Approach**: The outputs of all heads can instead be computed at once with a single batched matrix multiplication, which avoids the per-head loop.
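The equivalence between the two strategies can be checked on the projection step alone. The sketch below is a simplified, dependency-free illustration: it covers only one linear projection (not the full attention computation), and the weights `W_head0` and `W_head1` are hypothetical values chosen for readability.

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

# Toy input: 2 tokens with d_in = 3.
x = [[1.0, 0.0, 2.0],
     [0.5, 1.0, 1.5]]

# Hypothetical per-head projection weights, each of shape (d_in, head_dim) = (3, 2).
W_head0 = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
W_head1 = [[0.7, 0.8], [0.9, 1.0], [1.1, 1.2]]

# Sequential approach: one matmul per head, then concatenate along the last dim.
per_head = [matmul(x, W) for W in (W_head0, W_head1)]
sequential = [per_head[0][t] + per_head[1][t] for t in range(len(x))]

# Parallel approach: stack the head weights column-wise into one (3, 4) matrix,
# project once, then slice the result back into per-head chunks.
W_stacked = [r0 + r1 for r0, r1 in zip(W_head0, W_head1)]
combined = matmul(x, W_stacked)
head_dim = len(W_head0[0])
split = [[row[h * head_dim:(h + 1) * head_dim] for row in combined] for h in range(2)]

print(combined == sequential)  # True
```

A single larger multiplication yields the same numbers as the per-head loop, which is why a combined implementation can replace the list of separate heads without changing the result.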

*Theoretical Implications:*
The output dimensionality can be changed while keeping the multi-head structure intact, because the size of each head's output and the number of heads are set by separate arguments. This kind of adjustment is useful when adapting an attention module to different model sizes.

*Practical Recommendation:*
Examine the constructor arguments of `MultiHeadAttentionWrapper` and consider how each one affects the shape of the output tensor. The solution likely requires changing only an argument value, not the class implementation.

# Exercise 3.3

**Key Exercise Question: Can you configure a `MultiHeadAttention` module that precisely replicates the architectural specifications of the smallest GPT-2 model?**

*Specific Model Specifications:*
- Number of Attention Heads: 12
- Input/Output Embedding Dimensions: 768
- Context Length: 1,024 tokens

*Architectural Parameters:*
- `num_heads`: 12
- `d_model`: 768
- `context_length`: 1024

*Theoretical Considerations:*
This configuration matches the smallest variant of the GPT-2 model, a common reference architecture among transformer-based language models. Reproducing its specifications makes the design choices behind the attention module concrete.

*Key Implementation Details:*
- The 12 parallel attention heads let the module attend to several aspects of the input at once
- The 768-dimensional embedding is the width of both the input and the output vectors
- The context length of 1,024 tokens is the maximum sequence length the module handles
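The values above can be sanity-checked before building the module. The sketch below uses a hypothetical `AttentionConfig` container (not part of the course code) and assumes that the embedding width is split evenly across the heads and that a square causal mask covers the full context length.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AttentionConfig:
    """Hypothetical container for the hyperparameters listed above."""
    d_model: int = 768
    num_heads: int = 12
    context_length: int = 1024

    def __post_init__(self):
        # Assumption: the embedding width is divided evenly among the heads.
        if self.d_model % self.num_heads != 0:
            raise ValueError("d_model must be divisible by num_heads")

    @property
    def head_dim(self) -> int:
        """Width of the slice of the embedding handled by each head."""
        return self.d_model // self.num_heads

    @property
    def causal_mask_shape(self) -> tuple:
        """Shape of a square causal mask covering the whole context window."""
        return (self.context_length, self.context_length)

gpt2_small = AttentionConfig()
print(gpt2_small.head_dim)           # 64
print(gpt2_small.causal_mask_shape)  # (1024, 1024)
```

The divisibility check catches the most common configuration mistake early: a head count that does not divide the embedding width.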

*Practical Recommendation:*
Construct the `MultiHeadAttention` instance with exactly these values, and check that the embedding dimension and the number of heads match those of the smallest GPT-2 model.