# Implementing Transformer Models
## Practical VII
Carel van Niekerk & Hsien-Chin Lin

1-5.12.2025

---

In this practical we will combine the word embedding layer, positional encoding layer, and encoder and decoder layers from previous practicals to implement a transformer model.

# Exercises

1. Study the model in the paper [Attention Is All You Need](https://arxiv.org/abs/1706.03762). Write down the structure of the proposed model.
2. Study the section on word embeddings and pay close attention to the parameter sharing. Explain the benefits of parameter sharing in the transformer model.
3. Using the components you implemented in the previous practicals, implement a transformer model as a PyTorch `nn.Module`. Your model should be configurable with the following parameters:
    - `vocab_size`: The size of the vocabulary
    - `d_model`: The dimensionality of the embeddings and of every sublayer output
    - `n_heads`: The number of heads in the multi-head attention layers
    - `num_encoder_layers`: The number of encoder layers
    - `num_decoder_layers`: The number of decoder layers
    - `dim_feedforward`: The inner (hidden) dimensionality of the position-wise feedforward network
    - `dropout`: The dropout probability
    - `max_len`: The maximum length of the input sequence
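
One way the pieces could fit together is sketched below. It is a minimal sketch rather than the reference solution: PyTorch's built-in `nn.Transformer` stands in for the encoder and decoder layers from the previous practicals, and the class name `TransformerSketch`, the helper `embed`, and the decision to tie the output projection to the embedding are assumptions made for illustration. It also assumes an even `d_model` and a single vocabulary shared by source and target.

```python
import math

import torch
from torch import nn


class TransformerSketch(nn.Module):
    """Illustrative wrapper; nn.Transformer replaces the practicals' own layers."""

    def __init__(self, vocab_size, d_model, n_heads, num_encoder_layers,
                 num_decoder_layers, dim_feedforward, dropout, max_len):
        super().__init__()
        self.d_model = d_model
        self.embedding = nn.Embedding(vocab_size, d_model)

        # Fixed sinusoidal positional encoding, stored as a non-trainable buffer.
        position = torch.arange(max_len).unsqueeze(1)
        div_term = torch.exp(
            torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model)
        )
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        self.register_buffer("pe", pe)

        self.dropout = nn.Dropout(dropout)
        self.transformer = nn.Transformer(
            d_model=d_model,
            nhead=n_heads,
            num_encoder_layers=num_encoder_layers,
            num_decoder_layers=num_decoder_layers,
            dim_feedforward=dim_feedforward,
            dropout=dropout,
            batch_first=True,
        )
        # Assumption: the output projection shares its weights with the embedding.
        self.out_proj = nn.Linear(d_model, vocab_size, bias=False)
        self.out_proj.weight = self.embedding.weight

    def embed(self, tokens):
        # Scale embeddings by sqrt(d_model), then add the positional encoding.
        x = self.embedding(tokens) * math.sqrt(self.d_model)
        return self.dropout(x + self.pe[: tokens.size(1)])

    def forward(self, src, tgt):
        tgt_len = tgt.size(1)
        # True marks positions that must NOT be attended to (future tokens).
        causal = torch.ones(tgt_len, tgt_len, device=tgt.device).triu(1).bool()
        hidden = self.transformer(self.embed(src), self.embed(tgt), tgt_mask=causal)
        return self.out_proj(hidden)  # (batch, tgt_len, vocab_size) logits
```

Given token-id tensors of shape `(batch, src_len)` and `(batch, tgt_len)`, the forward pass returns logits of shape `(batch, tgt_len, vocab_size)`. Padding masks are left out to keep the sketch short.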

## Exercise 1: Transformer Model Structure

The transformer model from "Attention Is All You Need" consists of:

### Encoder (left side)
1. **Input Embedding**: Converts input tokens to $d_{model}$-dimensional vectors
2. **Positional Encoding**: Adds position information using sinusoidal functions
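3. **N Encoder Layers** (N=6 in the paper), each containing two sublayers, where "with residual + LayerNorm" means the sublayer output is $\mathrm{LayerNorm}(x + \mathrm{Sublayer}(x))$: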
   - Multi-Head Self-Attention (with residual + LayerNorm)
   - Position-wise Feed Forward Network (with residual + LayerNorm)
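
The residual-plus-LayerNorm wrapping of the two encoder sublayers can be sketched as follows. This is an illustrative sketch, not the layer from the earlier practicals: the class name `PostNormEncoderLayer` is hypothetical, `nn.MultiheadAttention` stands in for a hand-written attention module, and padding masks are omitted.

```python
import torch
from torch import nn


class PostNormEncoderLayer(nn.Module):
    """One encoder layer with LayerNorm applied after each residual connection."""

    def __init__(self, d_model=512, n_heads=8, dim_feedforward=2048, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(
            d_model, n_heads, dropout=dropout, batch_first=True
        )
        self.feed_forward = nn.Sequential(
            nn.Linear(d_model, dim_feedforward),
            nn.ReLU(),
            nn.Linear(dim_feedforward, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # Sublayer 1: self-attention, then residual connection and LayerNorm.
        attn_out, _ = self.self_attn(x, x, x, need_weights=False)
        x = self.norm1(x + self.dropout(attn_out))
        # Sublayer 2: position-wise feedforward, then residual and LayerNorm.
        x = self.norm2(x + self.dropout(self.feed_forward(x)))
        return x
```

The layer maps a `(batch, seq_len, d_model)` tensor to a tensor of the same shape, which is what allows N such layers to be stacked.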

### Decoder (right side)
1. **Output Embedding**: Converts the target tokens, shifted right by one position, to $d_{model}$-dimensional vectors
2. **Positional Encoding**: Same sinusoidal encoding as encoder
3. **N Decoder Layers** (N=6 in the paper), each containing:
   - Masked Multi-Head Self-Attention (with residual + LayerNorm), where the mask prevents each position from attending to later positions
   - Multi-Head Cross-Attention over encoder output (with residual + LayerNorm)
   - Position-wise Feed Forward Network (with residual + LayerNorm)
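
The effect of the mask in the decoder's self-attention can be illustrated with a small example. The helper name `causal_mask` is hypothetical, and the attention scores are set to zero purely so that the resulting weights are easy to read.

```python
import torch


def causal_mask(size):
    """Boolean mask that is True wherever a query would look at a later position."""
    return torch.ones(size, size).triu(diagonal=1).bool()


# Uniform (all-zero) attention scores for a sequence of 4 positions.
scores = torch.zeros(4, 4)

# Masked entries become -inf, so softmax assigns them zero weight.
masked_scores = scores.masked_fill(causal_mask(4), float("-inf"))
weights = torch.softmax(masked_scores, dim=-1)

# Row i spreads its attention evenly over positions 0..i only.
print(weights)
```

Row 0 places all of its weight on position 0, row 1 splits it between positions 0 and 1, and so on; no row assigns weight to a later position.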

### Output Layer
1. **Linear**: Projects each decoder output vector from $d_{model}$ dimensions to one logit per vocabulary entry
2. **Softmax**: Converts the logits to a probability distribution over the vocabulary
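
A small sketch of the output layer follows, with toy sizes chosen only for illustration. One practical detail it shows: PyTorch's `nn.CrossEntropyLoss` expects raw logits, so during training the softmax is usually not applied inside the model itself.

```python
import torch
from torch import nn

torch.manual_seed(0)
vocab_size, d_model = 10, 16                         # toy sizes (assumed)

decoder_output = torch.randn(2, 5, d_model)          # (batch, tgt_len, d_model)
projection = nn.Linear(d_model, vocab_size, bias=False)

logits = projection(decoder_output)                  # (batch, tgt_len, vocab_size)
probs = torch.softmax(logits, dim=-1)                # each position sums to 1

# For training, feed the logits (not probs) to the loss.
targets = torch.randint(0, vocab_size, (2, 5))
loss = nn.CrossEntropyLoss()(
    logits.reshape(-1, vocab_size), targets.reshape(-1)
)
```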

### Model Dimensions (base model)
- $d_{model} = 512$
- $d_{ff} = 2048$
- $h = 8$ (number of attention heads)
- $d_k = d_v = d_{model}/h = 64$
- $N = 6$ (encoder/decoder layers)
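
As a rough check of these dimensions, the weight-matrix sizes per layer can be worked out directly. The count below is a back-of-the-envelope sketch that ignores biases, LayerNorm parameters, and embeddings; the variable names are chosen for illustration.

```python
d_model, d_ff, h, N = 512, 2048, 8, 6

d_k = d_model // h                          # 64 dimensions per head

# Q, K, V and output projections: four d_model x d_model matrices.
attention_weights = 4 * d_model * d_model   # 1_048_576

# Two linear maps: d_model -> d_ff and d_ff -> d_model.
feed_forward_weights = 2 * d_model * d_ff   # 2_097_152

# Encoder layer: one attention block plus the feedforward network.
encoder_layer = attention_weights + feed_forward_weights        # 3_145_728
# Decoder layer: self-attention, cross-attention, and feedforward.
decoder_layer = 2 * attention_weights + feed_forward_weights    # 4_194_304

stack_total = N * (encoder_layer + decoder_layer)               # 44_040_192
print(d_k, encoder_layer, decoder_layer, stack_total)
```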

## Exercise 2: Parameter Sharing Benefits

In the transformer, three components share one weight matrix (Section 3.4 of the paper):

1. **Input embedding** (encoder)
2. **Output embedding** (decoder)
3. **Pre-softmax linear transformation** (output layer)

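All three use the **same weight matrix** $E \in \mathbb{R}^{V \times d_{model}}$. In the two embedding layers the paper multiplies these weights by $\sqrt{d_{model}}$. Sharing between the encoder and decoder embeddings also requires a single vocabulary covering both source and target, such as the shared byte-pair encoding vocabulary used in the paper.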
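
In PyTorch, tying can be sketched by assigning the embedding's parameter to the linear layer. This works without a transpose because `nn.Linear(d_model, vocab_size)` stores its weight with shape `(vocab_size, d_model)`, the same shape as the embedding table. The sizes below are toy values chosen for illustration, and a shared source/target vocabulary is assumed.

```python
import torch
from torch import nn

vocab_size, d_model = 1000, 64                      # toy sizes (assumed)

embedding = nn.Embedding(vocab_size, d_model)       # used by encoder and decoder
output_proj = nn.Linear(d_model, vocab_size, bias=False)
output_proj.weight = embedding.weight               # both now hold one Parameter

tokens = torch.randint(0, vocab_size, (2, 5))
hidden = embedding(tokens) * d_model ** 0.5         # sqrt(d_model) scaling
logits = output_proj(hidden)                        # (2, 5, vocab_size)

# parameters() skips duplicates, so the tied matrix is counted once.
n_params = sum(
    p.numel() for p in nn.ModuleList([embedding, output_proj]).parameters()
)

# One backward pass accumulates gradients from both uses into the same tensor.
logits.sum().backward()
grad_shape = embedding.weight.grad.shape
```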

### Benefits of Parameter Sharing:

1. **Reduced Model Size**: Instead of three separate matrices ($3 \times V \times d_{model}$ parameters), only one is needed. For $V=50000$ and $d_{model}=512$, each matrix has 25.6M parameters, so sharing saves $2 \times 25.6\text{M} \approx 51\text{M}$ parameters.

2. **Improved Generalization**: Shared embeddings encourage the model to learn a unified semantic space in which input and output tokens have consistent representations, and the smaller parameter count acts as a mild regularizer.

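3. **Better Learning Signal**: Gradients from the output layer update the same matrix that serves as the input embedding. Because the softmax produces a gradient for every vocabulary row at every position, rare words receive updates even when they do not appear in the input.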

4. **Semantic Consistency**: With a shared vocabulary, a token refers to the same word or subword whether it appears in the input or the output, and sharing embeddings enforces this consistency.

5. **Transfer Learning**: A single shared embedding space can make it easier to use pre-trained embeddings or to transfer the model to new tasks.
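
The parameter saving quoted in the first benefit can be checked with a few lines of arithmetic, using the same $V$ and $d_{model}$ as above.

```python
vocab_size, d_model = 50_000, 512

per_matrix = vocab_size * d_model   # 25_600_000 parameters in one matrix
untied = 3 * per_matrix             # 76_800_000 with three separate matrices
tied = per_matrix                   # 25_600_000 with one shared matrix
saved = untied - tied               # 51_200_000, i.e. about 51M

print(f"saved {saved / 1e6:.1f}M parameters")
```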
