# Implementing Transformer Models
## Practical V
Carel van Niekerk & Hsien-Chin Lin

10-14.11.2025

---

In this practical we will implement multi-head attention and one layer of the transformer encoder. This layer consists of a multi-head self-attention sublayer and a position-wise feed-forward sublayer, each followed by a residual connection and a layer normalisation layer.

### 1. The Multi-Head Attention Layer

The multi-head attention layer (as defined [here](https://arxiv.org/abs/1706.03762)) takes a query, key, and value as input and returns an output. It projects each of the inputs to lower-dimensional features, one set per attention head. Each head computes attention weights between its query and key projections and uses them to compute a weighted sum of its value projections. The head outputs are then concatenated and projected to the output dimension. The multi-head attention layer is defined as follows:

$$ MultiHead(Q,K,V) = Concat(head_1, ..., head_h)W^O $$

where

$$ head_i = Attention(QW_i^Q, KW_i^K, VW_i^V) $$

and $W_i^Q \in \mathbb{R}^{d_{model} \times d_k}$, $W_i^K \in \mathbb{R}^{d_{model} \times d_k}$, $W_i^V \in \mathbb{R}^{d_{model} \times d_v}$, and $W^O \in \mathbb{R}^{hd_v \times d_{model}}$ are learned linear projections. $d_k$ and $d_v$ are the dimensions of the key and value vectors respectively, and $h$ is the number of attention heads.
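
The $Attention$ function used by each head is not spelled out above. In the cited paper it is scaled dot-product attention, $softmax(QK^T / \sqrt{d_k})V$. The following is a minimal single-head sketch of that formula in PyTorch. The function name and tensor shapes are illustrative assumptions, and masking and dropout are left out:

```python
import math
import torch

def scaled_dot_product_attention(q, k, v):
    """Single-head attention (illustrative helper, not part of the provided code).

    q: (..., len_q, d_k), k: (..., len_k, d_k), v: (..., len_k, d_v)
    """
    d_k = q.size(-1)
    # Similarity of every query position with every key position, scaled by sqrt(d_k).
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)
    # Normalise over the key dimension so each query's weights sum to 1.
    weights = torch.softmax(scores, dim=-1)
    # Weighted sum of the value vectors.
    return weights @ v, weights

torch.manual_seed(0)
q = torch.randn(2, 5, 8)    # batch of 2, 5 query positions, d_k = 8
k = torch.randn(2, 7, 8)    # 7 key positions
v = torch.randn(2, 7, 16)   # d_v = 16
out, weights = scaled_dot_product_attention(q, k, v)
print(out.shape, weights.shape)
```

The output has one $d_v$-dimensional vector per query position, which is what gets concatenated across the $h$ heads before the $W^O$ projection.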

### 2. Residual Connections and Layer Normalisation

In the transformer model, residual connections help prevent the loss of information, mitigate the vanishing gradient problem, and enable the training of deeper, more efficient networks. Layer normalization helps avoid internal covariate shift by normalizing the inputs within a layer so that their distribution stays consistent during training. This stabilizes the learning process and improves training speed and model performance.

The residual connection and layer normalization layer is defined as follows:

$$ LayerNorm(x + Sublayer(x)) $$

where $Sublayer(x)$ is the function implemented by the sublayer (e.g. multi-head attention or feed forward layer).
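
One way to picture this wrapper in PyTorch is sketched below. The class name `ResidualLayerNorm` is a hypothetical helper introduced only for illustration, and the dropout on the sublayer output is an assumption taken from the hint in Exercise 5 rather than from the equation above:

```python
import torch
from torch import nn

class ResidualLayerNorm(nn.Module):
    """Hypothetical helper computing LayerNorm(x + Dropout(Sublayer(x)))."""

    def __init__(self, d_model, dropout=0.1):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer):
        # The residual path adds the unchanged input to the sublayer output.
        return self.norm(x + self.dropout(sublayer(x)))

torch.manual_seed(0)
d_model = 16
block = ResidualLayerNorm(d_model).eval()   # eval() disables dropout
sublayer = nn.Linear(d_model, d_model)      # stand-in for attention or the FFN
x = torch.randn(2, 5, d_model)
y = block(x, sublayer)
print(y.shape)
```

Because the sublayer output is added to `x`, the sublayer must preserve the feature dimension $d_{model}$.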

# Exercises

1. Study the position-wise feed-forward layer proposed in the paper [Attention Is All You Need](https://arxiv.org/abs/1706.03762). Write down the equation for this layer and explain the function of this layer in a transformer model.
2. Implement the position-wise feed-forward layer using the [Sequential](https://pytorch.org/docs/stable/generated/torch.nn.Sequential.html) module in PyTorch.
3. Implement the multi-head attention layer using the [Linear](https://pytorch.org/docs/stable/generated/torch.nn.Linear.html) module in PyTorch. (Hint: the projections do not have a bias component)
4. Write down the equation for the layer normalization layer and provide an explanation of the function of this layer in a transformer model.
5. Implement a transformer encoder layer. The layer should consist of a multi-head self-attention layer, a residual connection, and a layer normalisation layer, followed by a position-wise feed-forward layer with a second residual connection and layer normalisation layer. (Hint: it is important to use two independent layer normalisation layers, one following the multi-head self-attention layer and one following the position-wise feed-forward layer. Further, as in the Attention Is All You Need model, our layer should include dropout after the multi-head attention and position-wise feed-forward layers.)
6. Using the tests provided, verify that your implementations are correct.

## Exercise 1: Position-Wise Feed Forward Layer

### Equation

$$FFN(x) = \max(0, xW_1 + b_1)W_2 + b_2$$

Or equivalently:

$$FFN(x) = \text{ReLU}(xW_1 + b_1)W_2 + b_2$$

Where:
- $W_1 \in \mathbb{R}^{d_{model} \times d_{ff}}$ and $b_1 \in \mathbb{R}^{d_{ff}}$
- $W_2 \in \mathbb{R}^{d_{ff} \times d_{model}}$ and $b_2 \in \mathbb{R}^{d_{model}}$
- Typically $d_{ff} = 4 \times d_{model}$ (e.g., $d_{model}=512$, $d_{ff}=2048$)

### Function in Transformer

The position-wise feed forward layer serves several purposes:

1. **Non-linear transformation**: Adds non-linearity through ReLU activation, enabling the model to learn complex patterns that attention alone cannot capture.

2. **Feature transformation**: Projects the representation to a higher-dimensional space ($d_{ff}$), allowing richer feature interactions, then projects back to $d_{model}$.

3. **Position-wise**: Applied identically and independently to each position in the sequence. This means each token's representation is transformed separately, without mixing information between positions (that's the attention layer's job).

4. **Complements attention**: While attention captures relationships between positions, FFN processes the information at each position to extract and transform features.
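
The position-wise property in point 3 can be checked numerically. The sketch below builds the FFN with `nn.Sequential` and compares applying it to a whole sequence against applying it to each position separately. The sizes are illustrative assumptions, and dropout is omitted:

```python
import torch
from torch import nn

d_model, d_ff = 8, 32  # illustrative sizes with d_ff = 4 * d_model

# FFN(x) = ReLU(x W_1 + b_1) W_2 + b_2
ffn = nn.Sequential(
    nn.Linear(d_model, d_ff),
    nn.ReLU(),
    nn.Linear(d_ff, d_model),
)

torch.manual_seed(0)
x = torch.randn(2, 5, d_model)  # (batch, sequence length, d_model)

# Apply the FFN to the full sequence in one call.
full = ffn(x)
# Apply the same FFN to each position on its own, then restack along the sequence axis.
per_position = torch.stack([ffn(x[:, t]) for t in range(x.size(1))], dim=1)

print(torch.allclose(full, per_position, atol=1e-5))
```

Both routes give the same result because `nn.Linear` acts only on the last dimension, so no information moves between positions.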

## Exercise 4: Layer Normalization

### Equation

$$\text{LayerNorm}(x) = \gamma \odot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta$$

Where:
- $\mu = \frac{1}{d}\sum_{i=1}^{d} x_i$ (mean across features)
- $\sigma^2 = \frac{1}{d}\sum_{i=1}^{d} (x_i - \mu)^2$ (variance across features)
- $\gamma, \beta \in \mathbb{R}^{d}$ are learnable scale and shift parameters
- $\epsilon$ is a small constant for numerical stability (e.g., $10^{-6}$)
- $\odot$ denotes element-wise multiplication
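
As a sanity check, the equation can be written out directly and compared with PyTorch's built-in `torch.nn.functional.layer_norm`. This is a minimal sketch: the function name `layer_norm` and the tensor sizes are illustrative assumptions, and $\epsilon = 10^{-6}$ is passed explicitly to both versions so that they match:

```python
import torch
import torch.nn.functional as F

def layer_norm(x, gamma, beta, eps=1e-6):
    """Layer normalization over the last (feature) dimension, following the equation above."""
    mu = x.mean(dim=-1, keepdim=True)
    # Biased variance (divide by d), matching the definition of sigma^2 above.
    var = x.var(dim=-1, unbiased=False, keepdim=True)
    return gamma * (x - mu) / torch.sqrt(var + eps) + beta

torch.manual_seed(0)
d = 16
x = torch.randn(2, 5, d)
gamma = torch.rand(d) + 0.5   # arbitrary scale values for the check
beta = torch.randn(d)         # arbitrary shift values for the check

manual = layer_norm(x, gamma, beta)
reference = F.layer_norm(x, (d,), weight=gamma, bias=beta, eps=1e-6)
print(torch.allclose(manual, reference, atol=1e-5))
```

Note that the variance divides by $d$ rather than $d - 1$, which is why `unbiased=False` is needed in the manual version.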

### Function in Transformer

Layer normalization plays a critical role in transformer training:

1. **Stabilizes training**: Normalizes activations within each layer, preventing internal covariate shift where the distribution of inputs to layers changes during training.

2. **Enables deeper networks**: By keeping activations in a stable range, it helps mitigate vanishing and exploding gradients, allowing transformer layers to be stacked deeply.

3. **Applied per-sample**: Unlike batch normalization, layer norm computes statistics across features for each sample independently, making it suitable for variable-length sequences and small batch sizes.

4. **Post-residual placement**: In the original transformer, it is applied after the residual connection: $\text{LayerNorm}(x + \text{Sublayer}(x))$, so the sum of the sublayer input and output is normalized before it is passed to the next sublayer.

5. **Learnable parameters**: $\gamma$ and $\beta$ allow the model to learn the optimal scale and shift for each feature dimension.