# **Transformers – Improving Natural Language Processing with Attention Mechanisms (Part 1/3)**

Transformers are a type of deep learning model that has revolutionized the field of natural language processing (NLP) by building the entire architecture around attention mechanisms. Unlike recurrent neural networks (RNNs), which process a sequence one element at a time, transformers process all positions of a sequence in parallel, which allows for greater parallelization and efficiency.

This section covers the following topics:
- Adding an attention mechanism to RNNs
- Introducing the self-attention mechanism
- Understanding the original transformer architecture
- Comparing transformer-based large-scale language models
- Fine-tuning BERT for sentiment analysis


## **Adding an attention mechanism to RNNs**

Attention mechanisms allow models to focus on specific parts of the input sequence when making predictions, rather than treating all parts equally. This is particularly useful in NLP tasks where certain words or phrases may carry more significance than others.

### **Attention helps RNNs with accessing information**

To understand the development of an attention mechanism, let's first consider the limitations of RNNs. RNNs process sequences of data one element at a time, maintaining a hidden state that captures information about previous elements. However, as the sequence length increases, it becomes challenging for the RNN to retain relevant information from earlier in the sequence due to issues like vanishing gradients.

![A traditional RNN encoder-decoder architecture for a seq2seq modeling task](./figures/16_01.png)


Why is the RNN parsing the entire input sequence into a single fixed-length vector? This design choice can lead to information bottlenecks, especially for long sequences, as the model may struggle to retain all relevant information in a single vector.


![Word-by-word translation can lead to grammatical errors](./figures/16_02.png)


RNN encoder-decoder architectures can struggle with long sequences, as they must compress all input information into a single fixed-length vector. This can lead to loss of important context, resulting in errors such as incorrect grammar in translations.

In contrast to a regular RNN encoder-decoder architecture, an attention mechanism allows the decoder to access all hidden states of the encoder directly. This means that at each step of the decoding process, the model can "attend" to different parts of the input sequence, effectively allowing it to focus on the most relevant information for generating the next output token.

### **The original attention mechanism for RNNs**

The attention mechanism introduced by Bahdanau et al. (2015) computes a context vector for each output time step by taking a weighted sum of all encoder hidden states. The weights, known as attention weights, are obtained by normalizing scores that measure the relevance of each encoder hidden state to the current decoder hidden state.

Given an input sequence $x = (x_1, x_2, \ldots, x_T)$, the encoder processes this sequence and produces a set of hidden states $h = (h_1, h_2, \ldots, h_T)$. The decoder then generates the output sequence $y = (y_1, y_2, \ldots, y_{T'})$. The attention mechanism assigns a weight to each element of the input sequence and helps the model identify which parts of the input are most relevant for generating each part of the output. For example, if the input is a sentence, a word with a larger weight contributes more to the context vector used for generating the current output token.

![RNN with attention mechanism](./figures/16_03.png)

#### Processing the inputs using a bidirectional RNN

A bidirectional RNN processes the input sequence in both forward and backward directions, allowing the model to capture context from both past and future tokens. This is particularly useful for NLP tasks where understanding the full context of a word is important. 
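As a minimal sketch of this idea (the layer sizes and the random input are arbitrary assumptions, not values from the text), PyTorch's `torch.nn.RNN` with `bidirectional=True` returns, for every time step, the forward and backward hidden states concatenated along the last dimension:

```python
import torch

torch.manual_seed(123)

T, d_in, d_hidden = 8, 16, 32   # assumed sequence length and layer sizes

# A single-layer bidirectional RNN encoder
encoder = torch.nn.RNN(input_size=d_in, hidden_size=d_hidden,
                       bidirectional=True, batch_first=True)

x = torch.randn(1, T, d_in)     # one random "sentence" of T embeddings
hidden_states, h_n = encoder(x)

# One hidden state per input token; forward and backward halves concatenated
print(hidden_states.shape)      # torch.Size([1, 8, 64])
# Final hidden state of each direction
print(h_n.shape)                # torch.Size([2, 1, 32])
```

The per-token `hidden_states` are what an attention mechanism would later weight and sum.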

#### Generating outputs from context vectors

At each decoding step, the decoder computes a context vector as a weighted sum of the encoder's hidden states. The weights are determined by the attention scores, which indicate the relevance of each encoder hidden state to the current decoder state. The context vector is then combined with the decoder's hidden state to generate the output token.

#### Computing the attention weights

The attention weights are computed using a scoring function that measures the similarity between the decoder's current hidden state and each of the encoder's hidden states. Common scoring functions include dot product, scaled dot product, and additive attention. The scores are then normalized using a softmax function to produce the attention weights, which sum to one.
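The three steps described above (score, normalize, sum) can be sketched as follows. This is only an illustration of additive scoring in the spirit of Bahdanau et al.: the parameter names `W_h`, `W_s`, and `v`, the tensor sizes, and the random values are assumptions made for this example, and in a real model the parameters would be learned.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(123)

T, d_enc, d_dec, d_att = 8, 64, 32, 16   # assumed sizes

enc_states = torch.randn(T, d_enc)   # encoder hidden states h_1 ... h_T
dec_state = torch.randn(d_dec)       # current decoder hidden state

# Parameters of the additive scoring function (random here, learned in practice)
W_h = torch.randn(d_enc, d_att)
W_s = torch.randn(d_dec, d_att)
v = torch.randn(d_att)

# score_j = v^T tanh(h_j W_h + s W_s): one score per encoder position
scores = torch.tanh(enc_states @ W_h + dec_state @ W_s) @ v   # shape (T,)

# Softmax turns the scores into attention weights that sum to one
weights = F.softmax(scores, dim=0)

# Context vector: weighted sum of the encoder hidden states
context = weights @ enc_states       # shape (d_enc,)
print(weights.sum(), context.shape)
```

Swapping the scoring line for a plain dot product (which requires `d_enc == d_dec`) would give the dot-product variant mentioned above.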

---

## **Introducing the self-attention mechanism**

`Self-attention`, also known as `intra-attention`, is a mechanism that allows a model to weigh the importance of different parts of a single sequence when encoding it. Unlike traditional attention mechanisms that operate between two different sequences (e.g., encoder and decoder), self-attention focuses on the relationships within the same sequence.

`Self-attention` works by computing a set of attention scores for each token in the sequence with respect to all other tokens. This allows the model to capture dependencies and relationships between words, regardless of their distance from each other in the sequence. It focuses only on the input sequence itself, enabling the model to understand the context and relationships between words more effectively.

In its general, parameterized form, which the following subsections build up step by step, self-attention takes an input sequence represented as a matrix $X$ of shape $(T, d_{model})$, where $T$ is the sequence length and $d_{model}$ is the dimensionality of the input embeddings. It computes three matrices, queries $Q$, keys $K$, and values $V$, by multiplying the input matrix $X$ with learned weight matrices $W_Q$, $W_K$, and $W_V$:

$$Q = X W_Q, \quad K = X W_K, \quad V = X W_V$$

The attention scores are computed by taking the dot product of the queries and keys, followed by scaling and applying a softmax function to obtain the attention weights:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{Q K^T}{\sqrt{d_k}}\right) V$$

where $d_k$ is the dimensionality of the keys.


#### Starting with a basic form of self-attention


To introduce self-attention, let's assume we have an input sequence of length $T$, $x = (x^{(1)}, x^{(2)}, \ldots, x^{(T)})$, where each $x^{(i)}$ is a word embedding vector of dimension $d$. The self-attention mechanism computes a new representation for each input element by attending to all elements in the sequence, including the element itself. The output of the self-attention mechanism is a set of context vectors $z = (z^{(1)}, z^{(2)}, \ldots, z^{(T)})$, where each context vector $z^{(i)}$ is computed as a weighted sum of all input elements:

$$z^{(i)} = \sum_{j=1}^{T} \alpha_{ij} x^{(j)}$$

Here, $\alpha_{ij}$ represents the attention weight from the $i$th input element to the $j$th input element, indicating how much attention the model should pay to $x^{(j)}$ when computing $z^{(i)}$.

For a seq2seq modeling task, the self-attention mechanism allows each word in the input sequence to attend to all other words, enabling the model to capture dependencies and relationships between words, regardless of their distance from each other in the sequence.

The self-attention mechanism can be broken down into the following steps:

1. **Derive the similarity scores $w_{ij}$ for each pair of input elements.**

To compute the attention weights $\alpha_{ij}$, we first calculate similarity scores $w_{ij}$ between each pair of input elements using dot products:

$$w_{ij} = (x^{(i)})^T x^{(j)}$$

2. **Normalize the scores using the softmax function.**

These scores are then normalized using the softmax function to obtain the attention weights:

$$\alpha_{ij} = \frac{\exp(w_{ij})}{\sum_{k=1}^{T} \exp(w_{ik})}$$

Collected over all pairs, the attention weights form a matrix of shape $(T, T)$.

3. **Compute the context vectors $z^{(i)}$ as weighted sums of the input elements.**

Finally, we compute the context vector $z^{(i)}$ for each input element as a weighted sum of all input elements using the attention weights:

$$z^{(i)} = \sum_{j=1}^{T} \alpha_{ij} x^{(j)}$$


In [1]:
import torch

# input sequence / sentence:
#  "Can you help me to translate this sentence"

sentence = torch.tensor(
    [0, # can
     7, # you     
     1, # help
     2, # me
     5, # to
     6, # translate
     4, # this
     3] # sentence
)

sentence

tensor([0, 7, 1, 2, 5, 6, 4, 3])

- Next, assume we have an embedding of the words, i.e., the words are represented as real vectors.

- Here, our embedding size is `16`, and we assume that the dictionary size is `10`.

- Since we have `8` words, there will be `8` vectors. Each vector is 16-dimensional:

In [2]:
torch.manual_seed(123)
embed = torch.nn.Embedding(10, 16)
embedded_sentence = embed(sentence).detach()
embedded_sentence.shape

torch.Size([8, 16])

In [4]:
embedded_sentence[0]

tensor([ 0.3374, -0.1778, -0.3035, -0.5880,  0.3486,  0.6603, -0.2196, -0.3792,
         0.7671, -1.1925,  0.6984, -1.4097,  0.1794,  1.8951,  0.4954,  0.2692])

In [6]:
embedded_sentence.dtype

torch.float32

- The goal is to compute the context vectors $`z^{(i)} = \sum_{j = 1}^{T} \alpha_{ij}x^{(j)}`$ , which involve attention weights $`\alpha_{ij}`$.

- In turn, the attention weights $`\alpha_{ij}`$ involve the $`w_{ij}`$ values.

- Let's start with the $`w_{ij}`$'s first, which are computed as dot-products:

$$w_{ij} = (x^{(i)})^T x^{(j)}$$

In [10]:
for i, x_i in enumerate(embedded_sentence):
    print(f"Word {i} embedding: {x_i}")
    for j, x_j in enumerate(embedded_sentence):
        print(f"  Dot product with word {j}: {torch.dot(x_i, x_j)}")

Word 0 embedding: tensor([ 0.3374, -0.1778, -0.3035, -0.5880,  0.3486,  0.6603, -0.2196, -0.3792,
         0.7671, -1.1925,  0.6984, -1.4097,  0.1794,  1.8951,  0.4954,  0.2692])
  Dot product with word 0: 9.760122299194336
  Dot product with word 1: 1.7326233386993408
  Dot product with word 2: 4.75434684753418
  Dot product with word 3: -1.3586798906326294
  Dot product with word 4: 0.47519540786743164
  Dot product with word 5: -1.6716841459274292
  Dot product with word 6: 1.0226718187332153
  Dot product with word 7: -0.12858974933624268
Word 1 embedding: tensor([-9.4053e-01, -4.6806e-01,  1.0322e+00, -2.8300e-01,  4.9275e-01,
        -1.4078e-02, -2.7466e-01, -7.6409e-01,  1.3966e+00, -9.9491e-01,
        -1.5822e-03,  1.2471e+00, -7.7105e-02,  1.2774e+00, -1.4596e+00,
        -2.1595e+00])
  Dot product with word 0: 1.7326233386993408
  Dot product with word 1: 16.07872772216797
  Dot product with word 2: 9.064151763916016
  Dot product with word 3: -0.3370445966720581
  Dot pro

In [11]:
omega = torch.empty(8, 8)

for i, x_i in enumerate(embedded_sentence):
    for j, x_j in enumerate(embedded_sentence):
        omega[i, j] = torch.dot(x_i, x_j)

In [12]:
omega

tensor([[ 9.7601,  1.7326,  4.7543, -1.3587,  0.4752, -1.6717,  1.0227, -0.1286],
        [ 1.7326, 16.0787,  9.0642, -0.3370,  1.1368,  1.1972,  1.6485, -1.2789],
        [ 4.7543,  9.0642, 22.6615, -0.8519,  7.7799,  2.7483, -0.6832,  1.6236],
        [-1.3587, -0.3370, -0.8519, 13.9473, -1.4198, 10.9659, -0.5887,  2.3869],
        [ 0.4752,  1.1368,  7.7799, -1.4198, 13.7511, -6.8568, -2.5114, -3.3468],
        [-1.6717,  1.1972,  2.7483, 10.9659, -6.8568, 24.6738, -3.8294,  4.9581],
        [ 1.0227,  1.6485, -0.6832, -0.5887, -2.5114, -3.8294, 15.8691,  2.0269],
        [-0.1286, -1.2789,  1.6236,  2.3869, -3.3468,  4.9581,  2.0269, 18.7382]])

In [13]:
omega.shape

torch.Size([8, 8])

- Actually, let's compute this more efficiently by replacing the nested for-loops with a matrix multiplication:

In [14]:
omega_mat = torch.matmul(embedded_sentence, embedded_sentence.T)
omega_mat

tensor([[ 9.7601,  1.7326,  4.7543, -1.3587,  0.4752, -1.6717,  1.0227, -0.1286],
        [ 1.7326, 16.0787,  9.0642, -0.3370,  1.1368,  1.1972,  1.6485, -1.2789],
        [ 4.7543,  9.0642, 22.6615, -0.8519,  7.7799,  2.7483, -0.6832,  1.6236],
        [-1.3587, -0.3370, -0.8519, 13.9473, -1.4198, 10.9659, -0.5887,  2.3869],
        [ 0.4752,  1.1368,  7.7799, -1.4198, 13.7511, -6.8568, -2.5114, -3.3468],
        [-1.6717,  1.1972,  2.7483, 10.9659, -6.8568, 24.6738, -3.8294,  4.9581],
        [ 1.0227,  1.6485, -0.6832, -0.5887, -2.5114, -3.8294, 15.8691,  2.0269],
        [-0.1286, -1.2789,  1.6236,  2.3869, -3.3468,  4.9581,  2.0269, 18.7382]])

In [15]:
omega_mat.shape

torch.Size([8, 8])

In [16]:
torch.allclose(omega, omega_mat)

True

- Next, let's compute the attention weights by normalizing the "omega" values so they sum to 1:

$$\alpha_{ij} = \frac{\exp(w_{ij})}{\sum_{k=1}^{T} \exp(w_{ik})} = \text{softmax}\left([w_{ij}]_{j=1 \ldots T}\right)$$


- Hence, due to applying this `softmax` function, the weights will sum to `1` after this normalization, that is,

$$\sum_{j=1}^{T} \alpha_{ij} = 1$$

- We can compute the attention weights using PyTorch’s softmax function as follows:

In [18]:
import torch.nn.functional as F

attention_weights = F.softmax(omega, dim=1)
attention_weights.shape

torch.Size([8, 8])

In [33]:
attention_weights

tensor([[9.9270e-01, 3.2398e-04, 6.6502e-03, 1.4723e-05, 9.2135e-05, 1.0766e-05,
         1.5929e-04, 5.0374e-05],
        [5.8773e-07, 9.9910e-01, 8.9788e-04, 7.4187e-08, 3.2391e-07, 3.4407e-07,
         5.4033e-07, 2.8926e-08],
        [1.6712e-08, 1.2438e-06, 1.0000e+00, 6.1412e-11, 3.4437e-07, 2.2482e-09,
         7.2703e-11, 7.3008e-10],
        [2.1438e-07, 5.9550e-07, 3.5585e-07, 9.5172e-01, 2.0167e-07, 4.8272e-02,
         4.6299e-07, 9.0760e-06],
        [1.7110e-06, 3.3158e-06, 2.5448e-03, 2.5720e-07, 9.9745e-01, 1.1195e-09,
         8.6338e-08, 3.7443e-08],
        [3.6165e-12, 6.3713e-11, 3.0053e-10, 1.1136e-06, 2.0250e-14, 1.0000e+00,
         4.1805e-13, 2.7390e-09],
        [3.5667e-07, 6.6694e-07, 6.4779e-08, 7.1194e-08, 1.0410e-08, 2.7865e-09,
         1.0000e+00, 9.7366e-07],
        [6.4013e-09, 2.0263e-09, 3.6918e-08, 7.9205e-08, 2.5622e-10, 1.0361e-06,
         5.5258e-08, 1.0000e+00]])

- The `attention_weights` matrix has shape `(T, T)`.

- The `attention_weights` is an `8 x 8` matrix, where each entry $\alpha_{ij}$ represents the attention weight from the `i`th word to the `j`th word. 

- These attention weights indicate how relevant each word is to the `i`th word. 

- Hence, each row of this attention matrix should sum to 1, which we can confirm via the following code:

In [19]:
attention_weights.sum(dim=1)  # each row sums to 1

tensor([1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000])

- Let us recap and summarize the three main steps behind the self-attention operation:

1. For a given input element $`x^{(i)}`$, compute the similarity scores $`w_{ij} = (x^{(i)})^T x^{(j)}`$ with all input elements $`x^{(j)}`$ using dot products.
2. Normalize these scores using the softmax function to obtain attention weights $`\alpha_{ij}`$.
3. Compute the context vector $`z^{(i)}`$ as a weighted sum of the input elements using the attention weights: $`z^{(i)} = \sum_{j=1}^{T} \alpha_{ij} x^{(j)}`$.


![A basic Self-attention mechanism](./figures/16_04.png)

- Now that we have the attention weights $`\alpha_{ij}`$, we can compute the context vectors $`z^{(i)} = \sum_{j=1}^{T} \alpha_{ij} x^{(j)}`$.

- For instance, to compute the context-vector of the 2nd input element (the element at index 1), we can perform the following computation:

In [20]:
x_2 = embedded_sentence[1, :]
context_vec_2 = torch.zeros(x_2.shape)

for j in range(8):
    x_j = embedded_sentence[j, :]
    context_vec_2 += attention_weights[1, j] * x_j
    
context_vec_2

tensor([-9.3975e-01, -4.6856e-01,  1.0311e+00, -2.8192e-01,  4.9373e-01,
        -1.2896e-02, -2.7327e-01, -7.6358e-01,  1.3958e+00, -9.9543e-01,
        -7.1287e-04,  1.2449e+00, -7.8077e-02,  1.2765e+00, -1.4589e+00,
        -2.1601e+00])

In [30]:
x_2.shape, context_vec_2.shape

(torch.Size([16]), torch.Size([16]))

- Again, we can achieve this more efficiently by using matrix multiplication. Using the following code, we are computing the context vectors for all eight input words:

In [31]:
context_vectors = torch.matmul(
        attention_weights, embedded_sentence)

In [32]:
torch.allclose(context_vec_2, context_vectors[1])

True

In [36]:
context_vectors, context_vectors.shape

(tensor([[ 3.3420e-01, -1.8324e-01, -3.0218e-01, -5.7772e-01,  3.5662e-01,
           6.6452e-01, -2.0998e-01, -3.7798e-01,  7.6537e-01, -1.1946e+00,
           6.9960e-01, -1.4067e+00,  1.7021e-01,  1.8838e+00,  4.8729e-01,
           2.4730e-01],
         [-9.3975e-01, -4.6856e-01,  1.0311e+00, -2.8192e-01,  4.9373e-01,
          -1.2896e-02, -2.7327e-01, -7.6358e-01,  1.3958e+00, -9.9543e-01,
          -7.1287e-04,  1.2449e+00, -7.8077e-02,  1.2765e+00, -1.4589e+00,
          -2.1601e+00],
         [-7.7021e-02, -1.0205e+00, -1.6895e-01,  9.1776e-01,  1.5810e+00,
           1.3010e+00,  1.2753e+00, -2.0095e-01,  4.9647e-01, -1.5723e+00,
           9.6657e-01, -1.1481e+00, -1.1589e+00,  3.2547e-01, -6.3151e-01,
          -2.8400e+00],
         [-1.3679e+00,  1.0614e-01, -2.1317e+00,  1.0480e+00, -3.7127e-01,
          -9.1234e-01, -4.3802e-01, -1.0329e+00,  9.3425e-01,  1.5453e+00,
           5.7218e-01, -1.8049e-01, -6.0455e-03, -8.8691e-02,  2.0559e-01,
          -5.2292e-01],
    

### **Parameterizing the self-attention mechanism: scaled dot-product attention**

- More advanced self-attention mechanisms, such as scaled dot-product attention, introduce learnable parameters to enhance the model's ability to capture complex relationships within the input sequence.

- In scaled dot-product attention, we introduce three learnable weight matrices: $W_Q$, $W_K$, and $W_V$. These matrices are used to project each input embedding $x^{(i)}$ into three different spaces: queries, keys, and values.

- The three weight matrices are learned during training, allowing the model to adaptively focus on different aspects of the input data.
  - Query sequence: A set of vectors representing the elements for which we want to compute attention scores.
    - $q^{(i)} = x^{(i)} W_Q$
  - Key sequence: A set of vectors representing the elements against which we want to compute attention scores.
    - $k^{(i)} = x^{(i)} W_K$
  - Value sequence: A set of vectors representing the elements that will be combined to produce the final output.
    - $v^{(i)} = x^{(i)} W_V$
  

![Context-aware self-attention mechanism with learnable parameters](./figures/16_05.png)


- Here, the dot products between queries and keys are divided by $\sqrt{d_k}$, the square root of the dimensionality of the keys. This prevents excessively large dot-product values, which would push the softmax into saturated regions where gradients become very small during training.

- We can initialize these projection matrices as follows:

In [37]:
torch.manual_seed(123)

d = embedded_sentence.shape[1]
W_Q = torch.randn(d, d)
W_K = torch.randn(d, d)
W_V = torch.randn(d, d)

In [39]:
W_Q.shape

torch.Size([16, 16])

- We can compute the query, key, and value vectors for the 2nd input element (the element at index 1) using matrix multiplication:

In [40]:
x_2 = embedded_sentence[1, :]
q_2 = torch.matmul(x_2, W_Q)
k_2 = torch.matmul(x_2, W_K)
v_2 = torch.matmul(x_2, W_V)

In [41]:
q_2, k_2, v_2

(tensor([-4.0813, -1.6130, -2.5060, -3.3268, -4.1174, -2.3729, -2.6083,  2.3683,
         -4.1584,  9.9378,  3.5163, -2.2705,  4.6320, -4.2101, -0.5922,  4.6235]),
 tensor([-4.8231, -4.6870, -0.2487, 13.0042, -2.5805, -1.9165,  0.7472,  2.6754,
          4.7989,  0.1297, -2.5521,  3.9984,  4.0280,  3.7667,  0.2393, -3.2154]),
 tensor([ 4.1296, -0.5803,  1.4880,  1.8756, -0.3079, -6.6038, -5.6030,  4.7419,
          3.5117,  2.3469, -3.6096, -2.3465, -4.7263,  4.6613, -2.4629, -1.5542]))

- We also need the key and value sequences for all other input elements, which we can compute as follows:

In [45]:
keys = torch.matmul(embedded_sentence, W_K)
torch.allclose(k_2, keys[1])

True

In [46]:
values = torch.matmul(embedded_sentence, W_V)
torch.allclose(v_2, values[1])

True

- Next, we compute the unnormalized attention weights, $w_{ij} = q^{(i)} \cdot k^{(j)}$, as the dot product between the query and key vectors. The scaling by $\sqrt{d_k}$ is applied later, inside the softmax.

- The following code computes the unnormalized attention weight $w_{23}$, that is, the weight from the 2nd input element to the 3rd input element:

In [47]:
omega_23 = torch.dot(q_2, keys[2])
omega_23

tensor(16.0975)

- We can scale up this computation to all keys at once using matrix multiplication:

In [49]:
omega_2 = torch.matmul(q_2, keys.T)
omega_2

tensor([-35.2682, -44.7632,  16.0975,  21.0103,  27.2076,  33.4731,  11.9149,
         55.0160])

- Going from unnormalized attention weights to the normalized attention weights involves applying the softmax function to the unnormalized weights. This ensures that the attention weights sum to one, allowing them to be interpreted as probabilities.

Formula:

$$\alpha_{ij} = \frac{\exp(w_{ij} / \sqrt{d_k})}{\sum_{k=1}^{T} \exp(w_{ik} / \sqrt{d_k})} = \text{softmax}\left(\frac{[w_{ij}]_{j=1 \ldots T}}{\sqrt{d_k}}\right)$$



In [50]:
attention_weights_2 = F.softmax(omega_2 / torch.sqrt(torch.tensor(d, dtype=torch.float32)), dim=0)
attention_weights_2

tensor([1.5667e-10, 1.4591e-11, 5.9150e-05, 2.0200e-04, 9.5108e-04, 4.5550e-03,
        2.0789e-05, 9.9421e-01])
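To illustrate why the division by $\sqrt{d_k}$ matters, the following sketch compares the spread of raw and scaled dot products as the dimension grows. The standard-normal random vectors, the sample size, and the `results` dictionary are assumptions made for this illustration; they are not part of the running example.

```python
import torch

torch.manual_seed(123)

results = {}
for d_k in (16, 256, 4096):
    # 1,000 random query/key pairs with unit-variance entries
    q = torch.randn(1000, d_k)
    k = torch.randn(1000, d_k)
    dots = (q * k).sum(dim=1)            # one dot product per pair

    raw_std = dots.std().item()
    scaled_std = (dots / d_k**0.5).std().item()
    results[d_k] = (raw_std, scaled_std)
    print(f"d_k={d_k:5d}  raw std={raw_std:7.2f}  scaled std={scaled_std:.2f}")
```

Under these assumptions, the standard deviation of the raw dot products grows roughly like $\sqrt{d_k}$, whereas the scaled values stay close to 1, which keeps the softmax inputs in a range where it does not saturate.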

- Finally, the output is a weighted average of the value vectors, where the weights are given by the normalized attention weights:

$$z^{(i)} = \sum_{j=1}^{T} \alpha_{ij} v^{(j)}$$

In [51]:
context_vector_2 = attention_weights_2 @ values
context_vector_2

tensor([-4.7645,  6.1684, -8.1683, -6.4059,  3.0102,  5.7119, -1.4577,  1.6116,
         1.6057, -4.7039,  4.0043,  0.5080,  3.5367,  2.7837,  2.8228, -7.7864])

In [52]:
context_vector_2.shape

torch.Size([16])
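The computation above handles one input element at a time. As a sketch, the same steps can be written in matrix form to obtain the context vectors for all positions at once. The function name `scaled_dot_product_self_attention` and the random stand-in for `embedded_sentence` are assumptions made so that the block runs on its own.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_self_attention(X, W_Q, W_K, W_V):
    """Return context vectors (T, d_v) and attention weights (T, T)."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    d_k = K.shape[-1]
    scores = Q @ K.T / d_k**0.5          # (T, T) scaled similarity scores
    weights = F.softmax(scores, dim=-1)  # each row sums to one
    return weights @ V, weights

torch.manual_seed(123)
T, d = 8, 16                      # same sizes as the running example
X = torch.randn(T, d)             # stand-in for embedded_sentence
W_Q, W_K, W_V = (torch.randn(d, d) for _ in range(3))

Z, A = scaled_dot_product_self_attention(X, W_Q, W_K, W_V)
print(Z.shape, A.shape)   # torch.Size([8, 16]) torch.Size([8, 8])
```

Called with `embedded_sentence` and the `W_Q`, `W_K`, `W_V` defined earlier, row `Z[1]` should match `context_vector_2` up to floating-point error, since it applies the same operations to the same inputs.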

- This section has introduced the self-attention mechanism, a powerful tool for capturing relationships within sequences. By computing attention weights and context vectors, self-attention allows models to focus on relevant parts of the input, enhancing their ability to understand and generate complex data. 

---

## **Attention is all we need: introducing the original transformer architecture**

The transformer architecture, introduced by Vaswani et al. in the seminal paper "Attention is All You Need," revolutionized natural language processing by relying entirely on self-attention mechanisms, eliminating the need for recurrent or convolutional layers. This architecture has since become the foundation for many state-of-the-art models in NLP, including BERT and GPT. 


![The original transformer architecture](./figures/16_06.png)


We will explore the transformer architecture in detail in the following subsections, by decomposing it into its key components: the encoder and decoder, multi-head self-attention, positional encoding, and feed-forward neural networks. Each of these components plays a crucial role in enabling the transformer to effectively process and generate natural language text.

### **Encoding context embeddings via multi-head attention**

The overall goal of the encoder is to transform the input sequence into a set of context-aware embeddings that capture the relationships between words in the sequence. This is achieved through multiple layers of multi-head self-attention and feed-forward neural networks.

The `encoder` consists of several identical layers, each containing two main sub-layers:
1. Multi-head self-attention mechanism
2. Position-wise feed-forward neural network

A residual connection is applied around each sub-layer, followed by layer normalization, to facilitate training and improve convergence.
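The layer structure described above can be sketched with standard PyTorch building blocks. The class name `EncoderLayerSketch` and the sizes `d_model=16`, `num_heads=4`, and `d_ff=64` are hypothetical choices for illustration, and dropout is left out for brevity, so this is a simplified sketch rather than a faithful reimplementation of the original encoder layer.

```python
import torch
import torch.nn as nn

class EncoderLayerSketch(nn.Module):
    """Hypothetical minimal encoder layer (layer norm after each residual, no dropout)."""

    def __init__(self, d_model=16, num_heads=4, d_ff=64):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # Sub-layer 1: multi-head self-attention (query = key = value = x)
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + attn_out)     # residual connection, then layer norm
        # Sub-layer 2: position-wise feed-forward network
        x = self.norm2(x + self.ff(x))   # residual connection, then layer norm
        return x

torch.manual_seed(123)
layer = EncoderLayerSketch()
x = torch.randn(1, 8, 16)      # (batch, sequence length, d_model)
out = layer(x)
print(out.shape)               # torch.Size([1, 8, 16])
```

Because the output has the same shape as the input, several such layers can be stacked directly on top of each other.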


**Multi-head self-attention mechanism** 

This allows the model to attend to different parts of the input sequence simultaneously. Instead of computing a single set of queries, keys, and values, the multi-head attention mechanism computes multiple sets, or "heads," each with its own learned projection matrices. This enables the model to capture different types of relationships and dependencies within the input sequence.


- As indicated by its name, the multi-head self-attention mechanism consists of multiple attention heads, each of which computes self-attention independently. The outputs of these attention heads are then concatenated and linearly transformed to produce the final output of the multi-head attention layer.


To explain the concept of multi-head self-attention, let's consider an input sequence represented as a matrix $X$ of shape $(T, d_{model})$, where $T$ is the sequence length and $d_{model}$ is the dimensionality of the input embeddings. The multi-head self-attention mechanism computes $h$ different sets of queries, keys, and values by multiplying the input matrix $X$ with learned weight matrices for each head:

$$Q_i = X W_{Q_i}, \quad K_i = X W_{K_i}, \quad V_i = X W_{V_i} \quad \text{for } i = 1, 2, \ldots, h$$

where $W_{Q_i}$, $W_{K_i}$, and $W_{V_i}$ are the learned weight matrices for the $i$th head.

#### **Multi-Head Attention in Transformers**



**1. Context and Purpose**

In Transformer architectures, **Multi-Head Attention (MHA)** is the core mechanism replacing recurrence (RNNs) and convolution (CNNs). It enables a model to **compare all positions in a sequence simultaneously** and determine how much each token should focus on every other token.

MHA allows the model to learn **different types of relational patterns** at different representation subspaces.



**2. Input Representation**

Consider an input sequence of length $`T`$:

$$X = \begin{bmatrix}
x_1 \\
x_2 \\
\vdots \\
x_T
\end{bmatrix} \in \mathbb{R}^{T \times d_{\mathrm{model}}}$$

Each token embedding is $`d_{\mathrm{model}}`$ dimensional (e.g., 512 or 768).



**3. Projection to Query, Key, and Value**

For attention, we learn three linear projections:

$$Q = XW_Q,\quad K = XW_K,\quad V = XW_V$$

Where:

* $`W_Q, W_K, W_V \in \mathbb{R}^{d_{\mathrm{model}} \times d_k}`$
* $`d_k`$ is the dimensionality of each head (commonly $`d_k = d_{\mathrm{model}}/h`$)



**4. Scaled Dot-Product Attention**

For one attention head:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left( \frac{QK^\top}{\sqrt{d_k}} \right) V$$

Interpretation:

* $`QK^\top`$ measures **similarity** between tokens.
* $`\sqrt{d_k}`$ scaling stabilizes gradients.
* Softmax normalizes to produce weights.
* Weighted sum with $`V`$ produces contextualized representations.



**5. Multi-Head Extension**

Instead of one attention computation, we perform **h** parallel attentions:

$$\mathrm{head}_i = \mathrm{Attention}(Q_i, K_i, V_i),\quad i = 1,\ldots,h$$

Each head learns **different** patterns (e.g., syntax, long-range dependency, sentiment cues).

Outputs are concatenated:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h) W_O$$

Where:

* $`W_O \in \mathbb{R}^{(h d_k) \times d_{\mathrm{model}}}`$
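The per-head attention, concatenation, and output projection can be sketched in a few lines. The sizes (`T = 8`, `d_model = 16`, `h = 4`) and the random weight tensors are assumptions for illustration; in a trained model all projection matrices are learned.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(123)

T, d_model, h = 8, 16, 4     # assumed sizes
d_k = d_model // h           # per-head dimension

X = torch.randn(T, d_model)

# One projection matrix per head, stacked along the first dimension
W_Q = torch.randn(h, d_model, d_k)
W_K = torch.randn(h, d_model, d_k)
W_V = torch.randn(h, d_model, d_k)
W_O = torch.randn(h * d_k, d_model)

# X (T, d_model) broadcasts against (h, d_model, d_k) -> (h, T, d_k)
Q, K, V = X @ W_Q, X @ W_K, X @ W_V

scores = Q @ K.transpose(1, 2) / d_k**0.5   # (h, T, T)
weights = F.softmax(scores, dim=-1)         # softmax over the key positions
heads = weights @ V                         # (h, T, d_k)

# Concatenate the heads along the feature dimension, then project with W_O
concat = heads.transpose(0, 1).reshape(T, h * d_k)   # (T, h * d_k)
output = concat @ W_O                                # (T, d_model)
print(heads.shape, concat.shape, output.shape)
```

Each row of `concat` holds the outputs of all heads for one token, laid out side by side, so the final projection mixes information across heads.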



**6. Dimensional Relationships**

| Component                 | Notation                                                    | Shape                         | Notes                        |
| ------------------------- | ----------------------------------------------------------- | ----------------------------- | ---------------------------- |
| Input sequence            | $X$                                                         | $T \times d_{\mathrm{model}}$ | T tokens, model dimension    |
| Query matrix              | $Q$                                                         | $T \times d_k$ (per head)     | Same for $K, V$              |
| Attention scores          | $QK^\top$                                                   | $T \times T$                  | Pairwise token similarity    |
| Attention output (1 head) | $\mathrm{head}_i$                                           | $T \times d_k$                | Contextualized token vectors |
| Concatenated heads        | $\mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)$ | $T \times (h d_k)$            | Combine perspectives         |
| Final projection          | $\mathrm{MultiHead}(Q, K, V)$                               | $T \times d_{\mathrm{model}}$ | Back to model dimension      |



**7. Why Multi-Head Instead of Single Attention?**

A single attention head provides **one similarity interpretation**.
However, language and token interactions involve multiple simultaneous relationships:

* syntactic structure
* semantic role
* long-range dependency
* local dependency
* phrase grouping

Splitting representation space:

$$d_{\mathrm{model}} = h \cdot d_k$$

means **each head learns a specialized relational subspace**.

This **diversifies representation** while keeping the total computational cost similar to that of a single attention head with full dimensionality.



**8. Effectively, What Happens Conceptually**

For each token:

1. Compute how much it should **focus** on each other token (via $`QK^\top`$).
2. Use this focus to compute a **weighted blend** of other token representations.
3. Do this multiple independent times (heads) with different learned projections.
4. Combine the resulting representations.

The result:

* Every token becomes a **context-aware** embedding conditioned on the full sequence.



**9. Where Multi-Head Attention is Used**

| Transformer Block Stage       | Role of MHA                                 |
| ----------------------------- | ------------------------------------------- |
| **Encoder Self-Attention**    | Captures relationships among input tokens   |
| **Decoder Self-Attention**    | Models relationships among generated tokens |
| **Encoder–Decoder Attention** | Allows decoder to attend to encoded source  |



**10. Key Insights**

1. Attention is **pairwise token comparison**.
2. Multi-head splits computation into **parallel representation spaces**.
3. Each head extracts **different relational cues**.
4. Concatenation + projection produces a unified contextual representation.
5. MHA is what enables **parallelism** and **long-context reasoning**, unlike RNNs which propagate sequentially.


---

- Next, let's implement multi-head attention in code. We start with a single projection matrix, as in the scaled dot-product attention section above:

In [None]:
torch.manual_seed(123)
d = embedded_sentence.shape[1]
one_U_query = torch.rand(d, d)

In [54]:
one_U_query.shape

torch.Size([16, 16])

- Assume we have eight attention heads similar to the original transformer, that is, `h = 8`.

In [55]:
h = 8  # number of heads
multihead_U_query = torch.rand(h, d, d)
multihead_U_key = torch.rand(h, d, d)
multihead_U_value = torch.rand(h, d, d)

In [56]:
multihead_U_query.shape 

torch.Size([8, 16, 16])

- As seen above, multiple attention heads can be added by simply adding an additional dimension to the projection matrices.

- After initializing the projection matrices for all attention heads, we can compute the query, key, and value sequences for all attention heads as follows: 

$$Q_i = X W_{Q_i}, \quad K_i = X W_{K_i}, \quad V_i = X W_{V_i} \quad \text{for } i = 1, 2, \ldots, h$$

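- We can compute the query vectors of the 2nd input element for all attention heads at once. Note that, in this code, each projection matrix multiplies the input vector from the left, and `matmul` broadcasts over the head dimension: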

In [57]:
multihead_query_2 = multihead_U_query.matmul(x_2)
multihead_query_2.shape

torch.Size([8, 16])

- The `multihead_query_2` matrix has eight rows, each corresponding to the query vector for one attention head for the 2nd input element.

- We can compute the key and value vectors of the 2nd input element for all attention heads in a similar manner.

In [58]:
multihead_key_2 = multihead_U_key.matmul(x_2)
multihead_value_2 = multihead_U_value.matmul(x_2)

In [60]:
multihead_key_2[2]

tensor([-1.9619, -0.7701, -0.7280, -1.6840, -1.0801, -1.6778,  0.6763,  0.6547,
         1.4445, -2.7016, -1.1364, -1.1204, -2.4430, -0.5982, -0.8292, -1.4401])

In [61]:
multihead_value_2

tensor([[-8.4229e-01, -1.4590e+00, -8.2361e-01, -1.4182e+00, -1.7702e+00,
         -1.7670e+00, -1.3254e+00,  1.5882e+00, -2.9187e+00, -1.7060e+00,
         -2.7460e+00,  6.5088e-01, -1.3654e+00, -5.6964e-01, -5.1584e-01,
         -1.2448e+00],
        [ 2.0341e-03, -1.5079e+00,  1.0892e-02, -9.2818e-01,  4.2887e-01,
         -3.9543e+00,  3.1292e-02,  1.1121e-01, -3.7466e-01, -3.7266e-01,
         -1.2568e+00, -2.1261e+00, -1.3521e+00, -7.6900e-01, -1.5659e+00,
         -3.6651e+00],
        [-1.2141e-02, -1.6249e-01, -9.6153e-01, -8.4408e-01, -3.1676e-01,
          1.1681e+00, -2.3165e+00, -6.4110e-01, -1.1253e+00, -1.4065e+00,
         -4.3523e-01, -1.4211e+00,  2.2433e+00, -2.4570e+00, -2.5382e+00,
         -1.0644e+00],
        [-4.8746e-01, -2.2939e+00,  1.8204e+00, -1.5698e+00, -5.0320e-01,
         -1.5094e+00,  3.9411e-01,  1.6684e+00, -2.5816e+00, -2.1744e+00,
         -3.2841e+00,  2.4258e-01,  6.1703e-01, -1.9446e-01,  2.1254e-01,
         -2.5788e+00],
        [-5.8621e-01

- So far, we have computed the keys and values only for the second input element, `x_2`. To compute self-attention, we need them for all input elements.
- A simple way to do this is to stack `h` copies of the transposed input sequence along a new first dimension, one copy per attention head.

In [63]:
stacked_inputs = embedded_sentence.T.repeat(h, 1, 1)
stacked_inputs.shape

torch.Size([8, 16, 8])

- Then, we can perform a batch matrix multiplication (`torch.bmm`) between the projection matrices of all attention heads and the stacked inputs.

In [64]:
multihead_keys = torch.bmm(multihead_U_key, stacked_inputs)
multihead_keys.shape

torch.Size([8, 16, 8])

In [65]:
multihead_keys[0]

tensor([[ 0.3806, -1.6947, -1.8699, -0.4471, -2.1738, -1.1601, -1.5491,  0.3820],
        [ 2.0499, -0.6587,  1.2486, -1.8467, -1.3698, -1.3811, -0.4635,  2.7698],
        [ 1.4424, -1.1834,  0.2757, -1.7952, -3.4140, -4.3602,  0.5074,  3.2851],
        [ 1.8887, -1.2219, -0.9714, -0.7650, -2.6386, -3.1871,  1.5554,  2.6618],
        [ 1.4270, -2.6479, -1.4276, -1.8207, -2.2600, -3.0823, -2.2974,  2.3776],
        [ 2.5535, -2.6108, -1.4909, -2.5400, -3.5193, -2.2251, -0.0805,  0.6264],
        [ 1.2314, -1.0226, -3.8808, -1.1661, -4.4679, -3.0439, -0.3201,  2.3182],
        [ 0.4981, -0.2372, -1.1328, -1.3592, -3.6253, -1.3326,  1.0347,  2.7223],
        [ 0.1742,  1.2760, -1.9311, -2.3503, -4.1406, -3.1241,  1.1994,  2.4895],
        [-0.7740, -1.0980, -0.4115, -3.7343, -2.7884, -2.6045,  0.1928,  2.5534],
        [ 1.5517,  0.9142,  1.2744, -3.3934, -1.1208, -4.1307, -0.0142,  0.8712],
        [ 1.7910, -0.8720, -0.2987,  1.0605, -1.5535, -0.7161,  0.5001,  0.4319],
        [ 1.8690

- We now have a tensor that refers to the eight attention heads in its first dimension. The second and third dimensions refer to the embedding size and the number of input elements, respectively.

- Swap the second and third dimensions to facilitate further computations.

In [66]:
multihead_keys = multihead_keys.permute(0, 2, 1)
multihead_keys.shape

torch.Size([8, 8, 16])

- We can access the second key vector in the third attention head (index 2, since indexing starts at 0) as follows:

In [68]:
multihead_keys[2, 1]  # index: [3rd attention head, 2nd key vector]

tensor([-1.9619, -0.7701, -0.7280, -1.6840, -1.0801, -1.6778,  0.6763,  0.6547,
         1.4445, -2.7016, -1.1364, -1.1204, -2.4430, -0.5982, -0.8292, -1.4401])

- This is the same key vector that we computed earlier for the 2nd input element via `multihead_key_2[2]`.

- Let's repeat the same process to compute the multi-head values (the queries can be computed in the same way).

In [69]:
multihead_values = torch.bmm(multihead_U_value, stacked_inputs)
multihead_values = multihead_values.permute(0, 2, 1)
multihead_values.shape

torch.Size([8, 8, 16])

In [70]:
torch.allclose(multihead_values[2, 1], multihead_value_2[2])

True

- For brevity, we skip the intermediate steps of computing the context vectors. Assume we have already computed the context vectors for the second input element as the query across the eight attention heads; here, a random tensor of the right shape stands in for them:

In [73]:
multihead_z_2 = torch.rand(h, d)
multihead_z_2.shape

torch.Size([8, 16])
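
The skipped intermediate steps can be sketched as follows. This is a minimal, self-contained illustration rather than the notebook's own code: `embedded_sentence` is replaced by a random stand-in of the same shape (8 tokens, 16 dimensions), and the names `queries`, `keys`, `values`, `scores`, `weights`, and `context` are introduced here for illustration only.

```python
import torch

torch.manual_seed(123)
T, d, h = 8, 16, 8  # sequence length, embedding size, number of heads

# Assumption: random stand-in for `embedded_sentence` with shape [T, d].
embedded_sentence = torch.randn(T, d)

# One projection matrix per head for queries, keys, and values.
U_query = torch.rand(h, d, d)
U_key = torch.rand(h, d, d)
U_value = torch.rand(h, d, d)

# Stack h copies of the transposed inputs: [h, d, T].
stacked_inputs = embedded_sentence.T.repeat(h, 1, 1)

# Project and move the token dimension to the middle: [h, T, d].
queries = torch.bmm(U_query, stacked_inputs).permute(0, 2, 1)
keys = torch.bmm(U_key, stacked_inputs).permute(0, 2, 1)
values = torch.bmm(U_value, stacked_inputs).permute(0, 2, 1)

# Scaled dot-product attention, computed independently in each head.
scores = torch.bmm(queries, keys.transpose(1, 2)) / d**0.5  # [h, T, T]
weights = torch.softmax(scores, dim=-1)                      # rows sum to 1
context = torch.bmm(weights, values)                         # [h, T, d]

# Context vectors for the second input element (index 1) in all heads.
multihead_z_2 = context[:, 1, :]  # [h, d]
```

The resulting `multihead_z_2` has the same shape as the random placeholder, so it could be passed to the concatenation and linear projection step unchanged.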

- We concatenate these context vectors and linearly transform them to obtain the final output for the 2nd input element. 

![Concatenating multi-head context vectors and linearly transforming them to obtain the final output](./figures/16_07.png)

In [74]:
linear = torch.nn.Linear(h * d, d)
context_vector_2 = linear(multihead_z_2.flatten())
context_vector_2.shape

torch.Size([16])

- Multi-head self-attention allows the model to capture diverse relationships within the input sequence by attending to different aspects of the data simultaneously. This enhances the model's ability to understand complex patterns and dependencies, ultimately leading to improved performance on various NLP tasks.
  

- In essence, multi-head attention repeats the scaled dot-product attention computation multiple times in parallel and combines the results to form a richer representation of the input data.
  

- It works very well in practice because the multiple heads help the model to focus on different parts of the input sequence, capturing a wider range of relationships and dependencies.


- The multi-head attention mechanism is computationally expensive due to the multiple projections and attention computations.

---

#### **Learning a language model: decoder and masked multi-head attention**

- The `decoder` generates the output sequence based on the encoded input representations. It consists of several identical layers, each containing three main sub-layers:

1. Masked multi-head self-attention mechanism
2. Multi-head attention mechanism over the encoder's output
3. Position-wise feed-forward neural network (fully connected layer)

- Each sub-layer is followed by a residual connection and layer normalization, similar to the encoder.

- Masked attention is used in the decoder to prevent the model from attending to future tokens during training. This is crucial for autoregressive tasks, where the model generates one token at a time and should not have access to future tokens that it has not yet generated.


- Masked attention is a variation of the standard attention mechanism in which certain positions are hidden from the model. This is typically done by adding a large negative value (e.g., negative infinity) to the attention scores of the masked positions before applying the softmax function. After the softmax, the attention weights for these positions are effectively zero, so the model cannot use information from future tokens during training.
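
The masking step can be sketched for a single attention head as follows. This is a toy example under stated assumptions: the queries, keys, and values are random tensors, and the names `future_mask` and `masked_scores` are introduced here for illustration.

```python
import torch

torch.manual_seed(0)
T, d = 5, 8  # toy sequence length and embedding size
queries = torch.randn(T, d)
keys = torch.randn(T, d)
values = torch.randn(T, d)

scores = queries @ keys.T / d**0.5  # [T, T] unnormalized attention scores

# True above the diagonal marks "future" positions that must be hidden.
future_mask = torch.triu(torch.ones(T, T), diagonal=1).bool()
masked_scores = scores.masked_fill(future_mask, float("-inf"))

# exp(-inf) = 0, so masked positions receive zero attention weight.
weights = torch.softmax(masked_scores, dim=-1)
context = weights @ values  # [T, d]
```

Each row of `weights` still sums to 1, but row `t` only distributes weight over positions `0..t`.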


![Layer arrangement in the decoder of the original transformer architecture](./figures/16_08.png)


- First, the previous output tokens (output embeddings) are processed through a masked multi-head self-attention layer. This allows the decoder to attend to all previous tokens while preventing access to future tokens.

- Then, the second multi-head attention layer allows the decoder to attend to the encoder's output representations. This enables the decoder to incorporate information from the input sequence when generating each output token.
 
- Finally, the output of the attention layers is passed through a position-wise feed-forward neural network to produce the final output for each token in the sequence.


Comparing the `decoder` to the `encoder` block, the main differences are:
1. The decoder includes a masked multi-head self-attention layer to prevent access to future tokens.
2. The decoder has an additional multi-head attention layer that attends to the encoder's output representations.
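
A decoder layer with these three sub-layers might be sketched as below. This is an illustrative sketch, not the original implementation: the class name `DecoderLayerSketch` and the sizes `d_model=16`, `n_heads=4`, `d_ff=64` are assumptions, and it relies on PyTorch's `nn.MultiheadAttention`, where `True` entries in a boolean `attn_mask` mark positions that may not be attended to.

```python
import torch
import torch.nn as nn


class DecoderLayerSketch(nn.Module):
    """Post-LN decoder layer: masked self-attention, cross-attention, FFN."""

    def __init__(self, d_model=16, n_heads=4, d_ff=64):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, tgt, memory):
        T = tgt.shape[1]
        # True entries are positions a query is NOT allowed to attend to.
        causal_mask = torch.triu(
            torch.ones(T, T, device=tgt.device), diagonal=1
        ).bool()
        # 1. Masked multi-head self-attention + residual + LayerNorm
        attn_out, _ = self.self_attn(tgt, tgt, tgt, attn_mask=causal_mask)
        x = self.norm1(tgt + attn_out)
        # 2. Attention over the encoder output (queries come from the decoder)
        attn_out, _ = self.cross_attn(x, memory, memory)
        x = self.norm2(x + attn_out)
        # 3. Position-wise feed-forward network + residual + LayerNorm
        return self.norm3(x + self.ffn(x))


torch.manual_seed(0)
layer = DecoderLayerSketch().eval()
tgt = torch.randn(2, 5, 16)     # [batch, target length, d_model]
memory = torch.randn(2, 7, 16)  # encoder output: [batch, source length, d_model]
out = layer(tgt, memory)        # same shape as tgt
```

Because of the causal mask, changing the last target token leaves the outputs at all earlier positions unchanged, which is the property an autoregressive decoder needs.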

---

#### **Implementation details: positional encodings and layer normalization**

**Positional encoding.** Positional encodings are added to the input embeddings to give the model information about the position of each token in the sequence. Since transformers have no built-in notion of order (unlike RNNs), positional encodings are crucial for capturing the sequential nature of language.

- Without positional encodings, the transformer would treat the input tokens as a bag of words, losing the order information that is essential for understanding the meaning of sentences.

- Transformers use positional encodings to inject information about the position of each token in the sequence. This is typically done by adding a positional encoding vector to each input embedding before feeding it into the transformer layers.

- Transformers enable the same words at different positions to have slightly different representations, by adding a vector of small values to the input embeddings at the beginning of the encoder and decoder blocks.

- Positional encodings can be implemented using `sinusoidal functions` or `learned embeddings`. The sinusoidal approach uses `sine` and `cosine` functions of different frequencies to encode the position of each token, allowing the model to generalize to longer sequences than those seen during training.


`Sinusoidal Positional Encoding` Formula:

$$PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$

$$PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$

- Here, $pos$ is the position of the token in the sequence, $i$ is the dimension index, and $d_{model}$ is the dimensionality of the model embeddings.


- The sinusoidal positional encoding allows the model to learn relative positions between tokens, as the `sine` and `cosine` functions provide a continuous representation of position that can be easily interpreted by the model.


- These positional encodings are added to the input embeddings before they are fed into the transformer layers, allowing the model to incorporate both the content of the tokens and their positions in the sequence.
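
The sinusoidal formula above can be sketched in PyTorch as follows. The function name `sinusoidal_positional_encoding` is introduced here for illustration, and the sketch assumes an even `d_model`.

```python
import math

import torch


def sinusoidal_positional_encoding(seq_len, d_model):
    """Return a [seq_len, d_model] tensor of sinusoidal encodings (d_model even)."""
    position = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)  # [T, 1]
    two_i = torch.arange(0, d_model, 2, dtype=torch.float32)            # values of 2i
    # 1 / 10000^(2i / d_model), computed in log space for numerical stability.
    inv_freq = torch.exp(-math.log(10000.0) * two_i / d_model)          # [d_model/2]
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(position * inv_freq)  # even dimensions: sine
    pe[:, 1::2] = torch.cos(position * inv_freq)  # odd dimensions: cosine
    return pe


pe = sinusoidal_positional_encoding(seq_len=8, d_model=16)

# The encodings are simply added to the (here random) token embeddings.
embeddings = torch.randn(8, 16)
inputs_with_position = embeddings + pe
```

Since the encoding is a fixed function of the position, it can be computed once and reused for every batch.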

#### **Deep Dive into Positional Encoding**

Transformers **do not have built-in sequence order**.
Unlike RNNs (which read tokens sequentially) and CNNs (which use local receptive fields), the Transformer processes all tokens **in parallel**.
Therefore, the model needs **an explicit way to encode token position** so it can reason about order, word proximity, phrase structure, and directionality.

Positional Encoding (PE) provides this structure.



**1. Why Positional Information is Required**

Given an input sequence:

$$X = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_T \end{bmatrix} \in \mathbb{R}^{T \times d_{\mathrm{model}}}$$

Self-Attention computes:

$$\mathrm{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

This operation depends only on **similarity** between tokens — **not their order**.

Without positional encoding:

* “dog bites man” and “man bites dog” yield the same set of context vectors, merely reordered, because permuting the input tokens only permutes the outputs.

So we introduce:

$$X' = X + PE$$

Where $PE$ injects **position-dependent structure**.
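
The order-insensitivity of plain self-attention can be checked numerically. The sketch below uses random embeddings and random projection matrices (all values are illustrative assumptions) and compares the output for a reordered input with the reordered output of the original input.

```python
import torch

torch.manual_seed(0)
T, d = 4, 8
X = torch.randn(T, d)  # token embeddings without positional encoding
W_q, W_k, W_v = (torch.randn(d, d) for _ in range(3))


def self_attention(X):
    """Single-head scaled dot-product self-attention."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    weights = torch.softmax(Q @ K.T / d**0.5, dim=-1)
    return weights @ V


perm = torch.tensor([2, 0, 3, 1])  # an arbitrary reordering of the tokens
out_original = self_attention(X)
out_permuted = self_attention(X[perm])

# Reordering the inputs only reorders the outputs.
same_up_to_order = torch.allclose(out_permuted, out_original[perm], atol=1e-4)
```

Here `same_up_to_order` evaluates to `True`: without positional information, each token receives the same context vector regardless of where it appears in the sequence.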



**2. Sinusoidal Positional Encoding (Original Transformer)**

The key design principle:
Positions should be represented **continuously and relationally**, not discretely.

For a token at position $pos$ and pair index $i$:


$$PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{\frac{2i}{d_{\mathrm{model}}}}}\right)$$


$$PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{\frac{2i}{d_{\mathrm{model}}}}}\right)$$

Where:

* $pos \in [0, T)$ is the position index.
* $i \in [0, d_{\mathrm{model}}/2)$ indexes the sine/cosine pairs and controls the frequency scaling.

##### **Interpretation**

* Lower dimensions (small $i$) vary **rapidly** with position, encoding **local** detail.
* Higher dimensions (large $i$) vary **slowly**, encoding **global** order; the wavelengths range from $2\pi$ to $10000 \cdot 2\pi$.
* The encoding creates a **smooth geometric space** of positions.


#### **Critical Mathematical Property**

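The encoding is **translation-consistent**: for a fixed offset $k$, $PE_{pos+k}$ is a linear function of $PE_{pos}$:


$$PE_{pos + k} = M_k \, PE_{pos}$$

Here, the matrix $M_k$ depends only on the offset $k$, not on $pos$, meaning the model can learn to reason about **distance** between tokens.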
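
To see why this holds, consider a single sine/cosine pair with frequency $\omega_i = 1 / 10000^{2i/d_{\mathrm{model}}}$ (the symbol $\omega_i$ is introduced here for brevity). The angle-addition identities give:

$$\begin{bmatrix} \sin(\omega_i (pos + k)) \\ \cos(\omega_i (pos + k)) \end{bmatrix} = \begin{bmatrix} \cos(\omega_i k) & \sin(\omega_i k) \\ -\sin(\omega_i k) & \cos(\omega_i k) \end{bmatrix} \begin{bmatrix} \sin(\omega_i \, pos) \\ \cos(\omega_i \, pos) \end{bmatrix}$$

The $2 \times 2$ matrix is a rotation that depends only on the offset $k$. Stacking one such block per pair shows that shifting by $k$ positions amounts to applying a fixed block-diagonal rotation to the whole encoding vector.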



**3. Why Sinusoidal Instead of Trainable Vectors?**

| Property                                  | Sinusoidal           | Learned Position Embeddings                |
| ----------------------------------------- | -------------------- | ------------------------------------------ |
| Can be computed for unseen longer lengths | **Yes**              | No (limited to the trained maximum length) |
| Provides explicit distance structure      | **Yes**              | No inherent structure                      |
| Easy to compute                           | Yes                  | Yes                                        |
| Used in                                   | Original Transformer | BERT, GPT                                  |

Transformers like GPT and BERT use **learned absolute** PE for flexibility.
Many recent models (GPT-NeoX, PaLM, LLaMA) use **Rotary Positional Encoding (RoPE)**, which needs no learned position table yet makes attention scores depend on relative offsets.



**4. Rotary Positional Encoding (RoPE)** *(Modern LLM Standard)*

RoPE rotates the query and key vectors (denoted $x$ below) by a **position-dependent rotation matrix**:


$$\mathrm{RoPE}(x, pos) = R(pos) \cdot x$$

Where $R(pos)$ is a block-diagonal rotation operator on embedding subspaces.

##### Benefits:

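* Makes attention scores depend on the **relative** offset between tokens rather than on absolute indices.
* Handles **long context** efficiently.
* Can extrapolate somewhat beyond the training length, although in practice this often relies on additional scaling techniques.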

RoPE is used in:

* LLaMA
* GPT-J/NeoX
* Mistral
* Qwen
* Phi-3
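
A minimal sketch of the rotation is shown below. It pairs consecutive feature dimensions and reuses the sinusoidal frequency schedule; the function name `apply_rope` and this interleaved pairing are assumptions for illustration, and real implementations may arrange the pairs differently.

```python
import math

import torch


def apply_rope(x, positions, base=10000.0):
    """Rotate consecutive feature pairs of x ([T, d], d even) by position-dependent angles."""
    d = x.shape[-1]
    # One rotation frequency per feature pair.
    freqs = torch.exp(
        -math.log(base) * torch.arange(0, d, 2, dtype=torch.float32) / d
    )
    angles = positions.to(torch.float32).unsqueeze(1) * freqs  # [T, d/2]
    cos, sin = torch.cos(angles), torch.sin(angles)
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    rotated = torch.empty_like(x)
    rotated[:, 0::2] = x_even * cos - x_odd * sin
    rotated[:, 1::2] = x_even * sin + x_odd * cos
    return rotated


torch.manual_seed(0)
q = torch.randn(1, 16)  # a single query vector
k = torch.randn(1, 16)  # a single key vector

# The same offset (two positions apart) at two different absolute locations.
score_near = apply_rope(q, torch.tensor([3])) @ apply_rope(k, torch.tensor([1])).T
score_far = apply_rope(q, torch.tensor([10])) @ apply_rope(k, torch.tensor([8])).T
```

Up to floating-point error, `score_near` and `score_far` agree, which illustrates that the query-key dot product depends only on the relative offset between the two positions.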



**5. Absolute vs Relative Position Encoding**

| Approach                                 | Core Idea                             | Used In         | Strength                             |
| ---------------------------------------- | ------------------------------------- | --------------- | ------------------------------------ |
| **Absolute (Sinusoidal or Learned)**     | Tokens know their numerical position  | BERT, GPT-2     | Simple, works for moderate sequences |
| **Relative Position Bias (Shaw et al.)** | Model learns distances between tokens | T5, DeBERTa     | Better for syntax and structure      |
| **RoPE**                                 | Rotates embeddings to encode offsets  | GPT-NeoX, LLaMA | Handles long contexts well           |

Relative and rotary encodings allow the model to understand:

* Token A is **5 positions before** token B
* Meaningful for language structure (e.g., grammar, subject-verb agreement)



**6. Why Positional Encoding Works**

Self-attention computes similarity:

$$QK^\top$$

With PE added:

$$\left((X + PE)W_Q\right)\left((X + PE)W_K\right)^\top$$

Positions influence the **attention score landscape**, meaning:

* The model no longer matches tokens solely by meaning
* But also considers **where** they occur in the sequence

Thus:

* **Order** emerges from geometry
* **Structure** emerges from learned weighting
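
Expanding the product makes the contribution of position explicit. Writing $A = W_Q W_K^\top$ (a shorthand introduced here), the pre-softmax scores decompose into four terms:

$$\left((X + PE)W_Q\right)\left((X + PE)W_K\right)^\top = \underbrace{X A X^\top}_{\text{content–content}} + \underbrace{X A \, PE^\top}_{\text{content–position}} + \underbrace{PE \, A X^\top}_{\text{position–content}} + \underbrace{PE \, A \, PE^\top}_{\text{position–position}}$$

Only the first term is present without positional encoding; the other three let the learned weights $W_Q$ and $W_K$ trade off what a token is against where it occurs.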



**7. Key Takeaways**

1. Transformers require positional encoding because they **lack inherent sequential bias**.
2. Sinusoidal PE creates a **continuous spatial encoding** of token positions.
3. Learned PE offers flexibility but limited generalization.
4. Relative and RoPE encodings are the modern default because they encode **distance**, not absolute index.
5. Position encodings shape the **attention map**, influencing how tokens contextualize each other.

---


#### **Deep Dive into Layer Normalization**

Layer Normalization (LayerNorm) is a normalization technique used heavily in **Transformers** and **LLMs**.
Its purpose is to **stabilize training**, preserve representation scale, and improve gradient flow — especially in architectures where **batch dimension varies or parallelism is essential**.


![Batch and Layer Normalization comparison](./figures/16_09.png)


**1. The Problem LayerNorm Solves**

Neural networks, especially deep ones, can suffer from:

* **Internal Covariate Shift**: distributions of activations change during training.
* **Exploding/Vanishing Gradients**: gradients become unstable across layers.
* **Sensitivity to learning rates** and initialization.

Batch Normalization (BatchNorm) solves part of this problem but depends on **batch statistics**, making it:

* Unstable when batch sizes are small
* Poorly suited to autoregressive inference, where tokens are generated one at a time
* Sensitive to how batches are split across devices in distributed training

Transformers require *position-invariant*, *batch-independent* normalization — which leads to **LayerNorm**.



**2. The Layer Normalization Operation**

Given an input activation vector for a single sample:


$$x = (x_1, x_2, \dots, x_H)$$


where $H$ is the hidden dimension.

LayerNorm normalizes **over features**, not over batch:


$$\mu = \frac{1}{H} \sum_{i=1}^{H} x_i$$


$$\sigma^2 = \frac{1}{H} \sum_{i=1}^{H} (x_i - \mu)^2$$

Normalization:


$$\hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}}$$

Then apply learned scale and shift:


$$y_i = \gamma \hat{x}_i + \beta$$

Where:

* $\gamma$ (scale) and $\beta$ (bias) are **trainable parameters** with the same shape as the feature dimension.
* $\epsilon$ is a small constant that prevents division by zero.
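
The equations above can be checked against PyTorch's built-in `torch.nn.LayerNorm`. The sketch below uses a random input and the default initialization ($\gamma = 1$, $\beta = 0$); the tensor shape and the variable names are illustrative assumptions.

```python
import torch

torch.manual_seed(0)
H = 8
x = torch.randn(2, 3, H)  # [batch, tokens, hidden]
eps = 1e-5

# Statistics are computed over the feature dimension of each token.
mu = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, unbiased=False, keepdim=True)  # 1/H normalization
x_hat = (x - mu) / torch.sqrt(var + eps)

# Learned scale and shift, shown here at their initial values.
gamma = torch.ones(H)
beta = torch.zeros(H)
y = gamma * x_hat + beta

# Reference: the built-in module with the same epsilon.
layer_norm = torch.nn.LayerNorm(H, eps=eps)
matches_builtin = torch.allclose(y, layer_norm(x), atol=1e-5)
```

Note that the variance uses the biased estimator (division by $H$), matching the formula for $\sigma^2$ above.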




**3. Key Design Difference vs BatchNorm**

| Aspect                               | **LayerNorm**                                  | **BatchNorm**                |
| ------------------------------------ | ---------------------------------------------- | ---------------------------- |
| Normalizes across                    | Feature dimension                              | Batch dimension              |
| Works well on                        | NLP, Transformers, RNNs, autoregressive models | CNNs, computer vision        |
| Requires large batch size?           | **No**                                         | Yes (unstable otherwise)     |
| Stable for variable sequence length? | **Yes**                                        | No                           |
| Works at inference same as training? | **Yes**                                        | No (uses running averages)   |

**In practice, Transformers train poorly with BatchNorm**, which is why LayerNorm is the standard choice.
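
The batch dependence can be made concrete with a small experiment. In the sketch below (random data, illustrative names), the same sample is placed in two different batches: LayerNorm returns the same result for it both times, while BatchNorm in training mode does not.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
H = 8
layer_norm = nn.LayerNorm(H)
batch_norm = nn.BatchNorm1d(H).train()  # training mode: uses current batch statistics

sample = torch.randn(1, H)
# The same sample with two different sets of batch companions.
batch_a = torch.cat([sample, torch.randn(3, H)])
batch_b = torch.cat([sample, 5 * torch.randn(3, H)])

ln_same = torch.allclose(layer_norm(batch_a)[0], layer_norm(batch_b)[0], atol=1e-6)
bn_same = torch.allclose(batch_norm(batch_a)[0], batch_norm(batch_b)[0], atol=1e-6)
```

Here `ln_same` is `True` and `bn_same` is `False`, since BatchNorm's output for one sample depends on the other samples in the batch.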




**4. Where LayerNorm Appears in Transformers**

There are two canonical placements:

##### **(A) Post-Layer Norm (Original Transformer - Vaswani et al.)**


$$x = \mathrm{LayerNorm}( x + \mathrm{Attention}(x) )$$

$$x = \mathrm{LayerNorm}( x + \mathrm{FFN}(x) )$$


##### **(B) Pre-Layer Norm (Modern Models: GPT-2, LLaMA, Mistral, etc.)**


$$x = x + \mathrm{Attention}( \mathrm{LayerNorm}(x) )$$


$$x = x + \mathrm{FFN}( \mathrm{LayerNorm}(x) )$$

**Pre-LN** models train more stably, avoid divergence, and support deeper stacks.
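
The two placements can be compared side by side in code. The following is a minimal sketch, not a reference implementation: the class name `EncoderBlockSketch`, the `pre_ln` flag, and the layer sizes are assumptions chosen for illustration.

```python
import torch
import torch.nn as nn


class EncoderBlockSketch(nn.Module):
    """Self-attention + FFN block with switchable LayerNorm placement."""

    def __init__(self, d_model=16, n_heads=4, d_ff=64, pre_ln=True):
        super().__init__()
        self.pre_ln = pre_ln
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        if self.pre_ln:
            # Pre-LN: normalize the sublayer input; the skip path stays untouched.
            h = self.norm1(x)
            x = x + self.attn(h, h, h, need_weights=False)[0]
            x = x + self.ffn(self.norm2(x))
        else:
            # Post-LN: add the residual first, then normalize the sum.
            x = self.norm1(x + self.attn(x, x, x, need_weights=False)[0])
            x = self.norm2(x + self.ffn(x))
        return x


torch.manual_seed(0)
x = torch.randn(2, 5, 16)  # [batch, tokens, d_model]
pre_out = EncoderBlockSketch(pre_ln=True)(x)
post_out = EncoderBlockSketch(pre_ln=False)(x)
```

In the Post-LN variant, the block output is always freshly normalized, whereas in the Pre-LN variant the input `x` reaches the output through an unmodified identity path.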



**5. Why LayerNorm Improves Gradient Flow**

Consider the gradient of normalized activations:


$$\frac{\partial \hat{x}_i}{\partial x_j}$$

Normalization ensures:

* No single neuron can grow arbitrarily
* Gradients are **distributed evenly**
* Backprop remains stable across many layers

This is especially critical in **attention**, where activations can vary widely depending on context.

LayerNorm ensures:

* Stable scaling before softmax attention
* Stable residual paths
* Controlled value magnitudes → smoother optimization



**6. Intuition: What LayerNorm Actually Enforces**

LayerNorm forces the hidden representation of each token:


$$(x_1, x_2, ..., x_H)$$

to have:

* **Mean = 0** → removes bias shift
* **Variance = 1** → fixes scale

This encourages the network to encode *the meaningful differences between components* of the feature vector, not absolute magnitude.

Put differently:

> LayerNorm does not change **what** information is encoded — only the **coordinate system** it lives in.



**7. Why LayerNorm is Crucial for Transformers**

Self-attention amplifies certain features and suppresses others.
Without normalization, these transformations can rapidly destabilize:

* Queries and keys explode → softmax saturates
* Gradients collapse → no learning signal

LayerNorm maintains controlled scale throughout the computation graph — keeping attention *differentiable*, *expressive*, and *trainable*.



**8. Key Takeaways**

1. **LayerNorm normalizes across hidden features**, making it **independent of batch size**.
2. It is essential for **Transformer stability**, especially with:

   * Deep residual stacks
   * Multi-head attention
   * Autoregressive decoding
3. **Pre-LN** transformers (modern) outperform **Post-LN** (original) in stability and depth scaling.
4. LayerNorm ensures **smooth gradient flow** and prevents activation explosion/collapse.

---

#### **Deep Dive into Residual Connections and Their Role in Transformers**

Residual connections (also called *skip connections*) are one of the **core stability mechanisms** in Transformers.
Without them, transformers **do not train well** — gradients vanish, activations saturate, and deep architectures collapse.

We will break this down from **mathematical intuition**, **gradient flow**, and **practical architecture design**.



##### **1. What is a Residual Connection?**

Given a function (a layer or block):


$$F(x)$$

A **residual connection** outputs:


$$y = x + F(x)$$

Instead of:


$$y = F(x)$$


So the network learns **a residual mapping**:


$$F(x) = y - x$$

This means the layer *only needs to learn what changes relative to the input*.



##### **2. Why Residuals Matter in Deep Networks**

As networks get deeper:

* Gradients become unstable (vanish or explode)
* Layers struggle to learn identity transformations
* Optimization becomes harder

Residual paths allow gradients to flow directly from deeper layers back to earlier layers.

This means:

* Even if a deep layer learns **nothing**, the model still preserves the input via `x + ...`.
* Early layers receive strong gradient updates, preventing stagnation.



##### **3. Residuals in Transformers: Where They Appear**

Each Transformer **block** has two residual pathways:

**(1) Self-Attention Residual**


$$x_1 = x + \mathrm{MultiHeadAttention}(x)$$

**(2) Feed-Forward Network (FFN) Residual**


$$x_2 = x_1 + \mathrm{FFN}(x_1)$$


Every transformer layer can be diagrammed as:

```
         +----------------------+
x --+--->| Multi-Head Attention |---> (+) ---> LayerNorm ---> x1
    |    +----------------------+      ^
    |                                  |
    +----------- skip path ------------+

         +----------------------+
x1 -+--->|  FFN (2-layer MLP)   |---> (+) ---> LayerNorm ---> x2
    |    +----------------------+      ^
    |                                  |
    +----------- skip path ------------+
```



##### **4. Pre-LN vs Post-LN Residual Placement**

There are two normalization designs:

**Original Transformer (Post-LN)**


$$x = \mathrm{LayerNorm}( x + \mathrm{Attention}(x) )$$


$$x = \mathrm{LayerNorm}( x + \mathrm{FFN}(x) )$$


**Modern Transformers (Pre-LN; GPT-2, LLaMA, PaLM, Mistral, etc.)**


$$x = x + \mathrm{Attention}( \mathrm{LayerNorm}(x) )$$


$$x = x + \mathrm{FFN}( \mathrm{LayerNorm}(x) )$$


The two designs differ in where LayerNorm sits relative to the residual addition. In Post-LN, normalization is applied after the addition, so the skip path itself passes through LayerNorm. In Pre-LN, normalization is applied only to the sublayer input, which leaves the skip path as a pure identity.

**Pre-LN (modern)** gives *stable gradients* and supports *very deep models*.

**Post-LN (original)** is harder to train without tricks like warm-up or initialization tuning.



##### **5. Gradient Flow Intuition**

With a residual connection:


$$y = x + F(x)$$

Gradient with respect to the input:


$$\frac{\partial y}{\partial x} = 1 + \frac{\partial F(x)}{\partial x}$$

The `1` is the skip path.

This guarantees:

* The gradient always has a direct identity path, which makes it far less likely to vanish
* Gradient doesn't depend *only* on deep layers
* Learning remains stable even when deeper layers are poorly initialized

Without residuals:


$$\frac{\partial y}{\partial x} = \frac{\partial F(x)}{\partial x}$$

Gradients become susceptible to:

* Vanishing (gradient → 0)
* Exploding (gradient → ∞)
* Training collapse

This is **why very deep transformers (24–200+ layers) are trainable at all.**
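
The effect of the skip path on gradient magnitude can be illustrated with a toy experiment. The sketch below stacks 50 small `Linear` + `Tanh` layers (a stand-in for real transformer sublayers, chosen here only for simplicity) and measures the gradient norm at the input with and without residual connections; the helper name `input_grad_norm` is hypothetical.

```python
import torch
import torch.nn as nn


def input_grad_norm(use_residual, depth=50, d=32):
    """Gradient norm at the input of a deep stack, with or without skip paths."""
    torch.manual_seed(0)  # identical weights and inputs for both variants
    layers = [nn.Sequential(nn.Linear(d, d), nn.Tanh()) for _ in range(depth)]
    x = torch.randn(4, d, requires_grad=True)
    h = x
    for layer in layers:
        # y = x + F(x) with a skip path, y = F(x) without one.
        h = h + layer(h) if use_residual else layer(h)
    h.sum().backward()
    return x.grad.norm().item()


grad_with_residual = input_grad_norm(use_residual=True)
grad_without_residual = input_grad_norm(use_residual=False)
```

With default PyTorch initialization, the gradient reaching the input is many orders of magnitude larger with residual connections than without, in line with the $1 + \partial F / \partial x$ term above.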



##### **6. Why Residuals Enable Strong Representation Learning**

Self-attention identifies **relationships** between tokens.

The FFN identifies **non-linear transformations** on token meaning.

Residuals:

* Preserve the **original meaning** of the token
* Allow attention/FFN to **modify, refine, or add** meaning, not rewrite it

So a transformer layer operates like:


$$\text{Meaning}_{\text{new}} = \text{Meaning}_{\text{old}} + \text{Refinement / Contextualization}$$

This is essential for **stable semantic accumulation across depths**.



##### **7. Practical Summary**

| Component                | Role                                                 |
| ------------------------ | ---------------------------------------------------- |
| Self-Attention           | Figures out which other tokens matter                |
| FFN                      | Learns transformations within token representation   |
| LayerNorm                | Keeps scale stable                                   |
| **Residual Connections** | Ensure stable gradient flow and meaning preservation |

Residuals are the **structural backbone** that allows transformers to scale.



##### **8. Why Transformers Without Residuals Fail**

If residuals are removed:

* Deep layers overwrite token representations instead of improving them
* Gradients tend to vanish as the number of layers grows
* Model collapses to near-random output
* Training becomes extremely unstable

Residuals make deep attention-based reasoning possible.

---
