# Chapter 4 - Exercises

> Author: Badr TAJINI - Large Language Models (LLMs) - ESIEE 2024-2025

---

# Exercise 4.1: Parameters in the feed forward versus attention module

**Key Exercise Question: How do the parameter counts differ between the `feed-forward` neural network module and `multi-head attention` mechanism in our transformer architecture?**

*Methodological Approach:*
The investigation focuses on a systematic computational analysis of parameter allocation across two critical transformer neural network components:

1. **Feed-Forward Neural Network Module**
   - Characterization: Nonlinear transformation module
   - Primary computational function: Adding nonlinearity and representational capacity to each transformer block
   - Parametric considerations: Weights and biases of the linear layers (the activation function itself has no trainable parameters)

2. **Multi-Head Attention Mechanism**
   - Characterization: Contextual feature interaction module
   - Primary computational function: Capturing inter-token relational dynamics
   - Parametric considerations: Query, key, value, and output projection matrices (the attention computation itself adds no trainable parameters)

*Analytical Objectives:*
- Quantify the exact number of trainable parameters in each architectural component
- Compare the parameter counts of the two components
- Understand how a transformer block's parameter budget is split between them

*Theoretical Implications:*
- Insights into architectural parameter efficiency
- Empirical understanding of transformer module design
- Potential implications for model optimization and architectural design

*Computational Methodology:*
1. Enumerate parameters in `feed-forward` module
2. Enumerate parameters in `multi-head attention` module
3. Compare the two totals (difference and ratio)
4. Interpret how the parameters are distributed across the two modules

*Recommended Investigative Approach:*
- Count parameters programmatically (for example, by summing the element counts of all trainable tensors) rather than estimating by hand
- Consider layer-specific parameter counting
- Account for bias terms and weight matrices
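
The counting steps above can be sketched in plain Python. This is a minimal illustration, not the reference solution: it assumes a feed-forward hidden layer four times the embedding size, no bias on the query/key/value projections, and an output projection with bias. Only `emb_dim = 768` comes from the GPT-2 small configuration; the helper names are hypothetical.

```python
# Assumed values; only emb_dim = 768 comes from the GPT-2 small configuration.
EMB_DIM = 768
FF_EXPANSION = 4   # assumption: hidden layer is 4x the embedding size
QKV_BIAS = False   # assumption: no bias on query/key/value projections


def linear_params(in_features: int, out_features: int, bias: bool = True) -> int:
    """Trainable parameters of one fully connected layer."""
    return in_features * out_features + (out_features if bias else 0)


def feed_forward_params(emb_dim: int, expansion: int = FF_EXPANSION) -> int:
    """Two linear layers; the activation between them has no parameters."""
    hidden = expansion * emb_dim
    return linear_params(emb_dim, hidden) + linear_params(hidden, emb_dim)


def attention_params(emb_dim: int, qkv_bias: bool = QKV_BIAS) -> int:
    """Query, key, and value projections plus one output projection."""
    qkv = 3 * linear_params(emb_dim, emb_dim, bias=qkv_bias)
    out_proj = linear_params(emb_dim, emb_dim, bias=True)
    return qkv + out_proj


ff_count = feed_forward_params(EMB_DIM)
attn_count = attention_params(EMB_DIM)
print(f"feed-forward: {ff_count:,}")
print(f"attention:    {attn_count:,}")
print(f"ratio:        {ff_count / attn_count:.2f}")
```

Under these assumptions the number of attention heads does not change the count, because the heads split the same projection matrices. On an actual PyTorch module, the same totals can be cross-checked with `sum(p.numel() for p in module.parameters())`.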

# Exercise 4.2: Initialize larger GPT models

- **GPT2-small** (the 124M configuration we already implemented):
    - "emb_dim" = 768
    - "n_layers" = 12
    - "n_heads" = 12

- **GPT2-medium:**
    - "emb_dim" = 1024
    - "n_layers" = 24
    - "n_heads" = 16

- **GPT2-large:**
    - "emb_dim" = 1280
    - "n_layers" = 36
    - "n_heads" = 20

- **GPT2-XL:**
    - "emb_dim" = 1600
    - "n_layers" = 48
    - "n_heads" = 25

**Key Exercise Question: Can you systematically scale the GPT-2 model architecture from the small configuration to medium, large, and XL variants by exclusively modifying the configuration parameters?**

*Architectural Scaling Challenge:*
This exercise scales the GPT-2 model to larger sizes, showing that its capacity can be increased step by step by changing only three configuration values: the embedding dimension, the number of transformer blocks, and the number of attention heads.

*Model Variants to Implement:*
1. **GPT-2 Small (Current Implementation)**
   - Embedding Dimensions ("emb_dim"): 768
   - Transformer Blocks ("n_layers"): 12
   - Multi-Head Attention Heads ("n_heads"): 12

2. **GPT-2 Medium**
   - Embedding Dimensions ("emb_dim"): 1,024
   - Transformer Blocks ("n_layers"): 24
   - Multi-Head Attention Heads ("n_heads"): 16

3. **GPT-2 Large**
   - Embedding Dimensions ("emb_dim"): 1,280
   - Transformer Blocks ("n_layers"): 36
   - Multi-Head Attention Heads ("n_heads"): 20

4. **GPT-2 XL**
   - Embedding Dimensions ("emb_dim"): 1,600
   - Transformer Blocks ("n_layers"): 48
   - Multi-Head Attention Heads ("n_heads"): 25

*Methodological Constraints:*
- Modify only the configuration values (for example, the entries of `GPT_CONFIG_124M`)
- Use the existing `GPTModel` class without code alterations
- Instantiate each variant to confirm that the same code scales to all four sizes
- Calculate total parameters for each model variant

**Bonus Challenge:**
**Compute the total number of trainable parameters for each model variant, highlighting how quickly the count grows: roughly linearly with the number of layers and quadratically with the embedding dimension.**
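
One way to approach the scaling and the bonus count is sketched below in plain Python. The three values per variant come from the exercise; everything else is an assumption made for illustration: the vocabulary size, context length, `drop_rate`, and `qkv_bias` entries, a 4x feed-forward expansion, two layer norms per block plus a final one, and a bias-free output head. `make_config` and `count_params` are hypothetical helpers, not part of `GPTModel`.

```python
# Shared settings. Values other than emb_dim/n_heads/n_layers are assumptions.
BASE_CONFIG = {
    "vocab_size": 50257,     # assumption: GPT-2 BPE vocabulary size
    "context_length": 1024,  # assumption
    "emb_dim": 768,
    "n_heads": 12,
    "n_layers": 12,
    "drop_rate": 0.1,        # assumption
    "qkv_bias": False,       # assumption
}

# Only these three keys change between variants (values from the exercise).
VARIANTS = {
    "gpt2-small": {"emb_dim": 768, "n_layers": 12, "n_heads": 12},
    "gpt2-medium": {"emb_dim": 1024, "n_layers": 24, "n_heads": 16},
    "gpt2-large": {"emb_dim": 1280, "n_layers": 36, "n_heads": 20},
    "gpt2-xl": {"emb_dim": 1600, "n_layers": 48, "n_heads": 25},
}


def make_config(name: str) -> dict:
    """Return a copy of the base config with one variant's overrides applied."""
    cfg = {**BASE_CONFIG, **VARIANTS[name]}
    # Each head needs an equal slice of the embedding dimension.
    assert cfg["emb_dim"] % cfg["n_heads"] == 0
    return cfg


def count_params(cfg: dict, tied_output: bool = False) -> int:
    """Closed-form parameter count under the assumptions stated above."""
    d, vocab = cfg["emb_dim"], cfg["vocab_size"]
    qkv = 3 * (d * d + (d if cfg["qkv_bias"] else 0))
    attn = qkv + (d * d + d)                   # plus output projection
    ff = (d * 4 * d + 4 * d) + (4 * d * d + d)  # two linear layers
    norms = 2 * (2 * d)                         # scale and shift, twice per block
    block = attn + ff + norms
    embeddings = vocab * d + cfg["context_length"] * d
    out_head = 0 if tied_output else d * vocab  # tied head reuses token embeddings
    return embeddings + cfg["n_layers"] * block + 2 * d + out_head


for name in VARIANTS:
    print(f"{name}: {count_params(make_config(name)):,}")
```

With the real model, each dictionary returned by `make_config` would be passed to `GPTModel` unchanged, and the closed-form numbers can be compared against `sum(p.numel() for p in model.parameters())`. Setting `tied_output=True` shows the count when the output head shares its weights with the token embedding.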



# Exercise 4.3: Using separate dropout parameters

**Key Exercise Question: How can we enhance the dropout configuration of the GPT model by implementing layer-specific dropout rates?**

*Architectural Dropout Refinement:*
The current implementation applies a single dropout rate to every component that uses dropout. This exercise asks you to give each component its own rate so that the regularization strength can be tuned per component.

*Dropout Localization:*
Three critical architectural components require distinct dropout configurations:
1. Embedding Layer
2. Shortcut (Residual) Connections
3. Multi-Head Attention Module

*Methodological Approach:*
You must modify the existing `GPT_CONFIG_124M` configuration to:
- Replace the single `drop_rate` parameter
- Introduce a separate dropout rate for each of the three components listed above
- Keep the rest of the model architecture unchanged, apart from reading the new rates where the dropout layers are created

*Conceptual Challenge:*
The exercise requires a deep understanding of:
- Regularization techniques in neural network design
- The functional role of dropout in different architectural components
- Systematic configuration of model hyperparameters
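
A possible shape for the new configuration is sketched below. The key names `drop_rate_emb`, `drop_rate_attn`, and `drop_rate_shortcut` and the helper `split_drop_rate` are hypothetical, and the configuration values other than `emb_dim`, `n_heads`, and `n_layers` are assumptions about what `GPT_CONFIG_124M` contains.

```python
from typing import Optional

# Assumed contents of the existing configuration.
GPT_CONFIG_124M = {
    "vocab_size": 50257,     # assumption
    "context_length": 1024,  # assumption
    "emb_dim": 768,
    "n_heads": 12,
    "n_layers": 12,
    "drop_rate": 0.1,        # assumption: the single shared dropout rate
    "qkv_bias": False,       # assumption
}


def split_drop_rate(
    cfg: dict,
    emb: Optional[float] = None,
    attn: Optional[float] = None,
    shortcut: Optional[float] = None,
) -> dict:
    """Return a copy of cfg with `drop_rate` replaced by three separate keys.

    Any rate left as None falls back to the original shared value.
    """
    new_cfg = dict(cfg)                  # leave the original untouched
    shared = new_cfg.pop("drop_rate")
    new_cfg["drop_rate_emb"] = shared if emb is None else emb
    new_cfg["drop_rate_attn"] = shared if attn is None else attn
    new_cfg["drop_rate_shortcut"] = shared if shortcut is None else shortcut
    return new_cfg


# Same behaviour as before: all three rates equal the old shared rate.
new_config = split_drop_rate(GPT_CONFIG_124M)

# Illustrative values only: a different rate for the attention module.
tuned_config = split_drop_rate(GPT_CONFIG_124M, attn=0.2)
print(tuned_config)
```

The model code would then read the matching key wherever it currently passes `cfg["drop_rate"]` to a dropout layer: the embedding dropout in the model class, the shortcut dropout in the transformer block, and the dropout inside the multi-head attention module.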