## **Multi-Head Attention**

**Recap**

First we had `Word Embeddings`, which are static in nature, i.e. each word has a fixed representation regardless of its context. We need a mechanism that can dynamically adjust these representations based on the surrounding words in a sentence. This is where `Self-Attention` comes into play.

`Self-Attention` allows each word to attend to all other words in the sequence, enabling the model to capture contextual relationships effectively. With `Q` (`Query`), `K` (`Key`), and `V` (`Value`) vectors, we were able to compute attention scores and then generate context-aware representations for each word. So `Self-Attention` generates `Contextual Embeddings` from `Word Embeddings`.

**Problem with Single-Head Attention or Self-Attention**

The `Self-Attention` mechanism was able to capture contextual relationships effectively. For example, if we have two sentences like `"Money Bank Grows"` and `"River Bank Flows"`, the word `"Bank"` would attend more to `"Money"` in the first sentence and more to `"River"` in the second sentence. This dynamic adjustment of attention based on context is a key strength of the `Self-Attention` mechanism.

But what if we have a sentence like `"The man saw the astronomer with the telescope"`? Here, the phrase `"with the telescope"` can attach to either `"saw"` or `"astronomer"`, leading to ambiguity. A single attention head might struggle to capture both interpretations effectively.

This sentence can have two different interpretations:

1. The man used the telescope to see the astronomer.

2. The astronomer has the telescope.

If we use `Self-Attention` with a single head, it might focus on `one interpretation` and miss the other. This is where `Multi-Head Attention` comes into play.

To get the actual meaning of the sentence, we need to consider both interpretations simultaneously. `Multi-Head Attention` allows the model to do just that by having multiple attention heads, each focusing on different parts of the sentence.

<hr>

In `Multi-Head Attention`, instead of having a single set of `Q`, `K`, and `V` matrices, we have multiple sets, each corresponding to a different attention head. Each head can learn to focus on different aspects of the input sequence.

So if we have `h` attention heads, we have `h` different sets of `Q`, `K`, and `V` matrices. Each head computes its own attention scores and generates its own context-aware representations.
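
In symbols, this is commonly written as follows. Here `E` is assumed to be the matrix of input embeddings (one row per token) and `i` indexes the heads; the `Attention` function and the output matrix `W_O` are explained step by step in the example that follows.

$$
\text{head}_i = \text{Attention}(E \cdot W_Q^i, \; E \cdot W_K^i, \; E \cdot W_V^i), \quad i = 1, \dots, h
$$

$$
\text{MultiHead}(E) = \text{Concat}(\text{head}_1, \dots, \text{head}_h) \cdot W_O
$$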

### **Understanding `Multi-Head Attention` With Example**

Let's say we have an input sentence: `"Money Bank Grows"`.

Now for this sentence, we'll have multiple attention heads, say `h = 2` heads for simplicity.

1. **Word Embeddings**: First, we convert each word into its corresponding `Word Embedding` or `Token Embedding`.

$$
E_{Money} = [0.1, 0.2, 0.3]
$$

$$
E_{Bank} = [0.4, 0.5, 0.6]
$$

$$
E_{Grows} = [0.7, 0.8, 0.9]
$$

2. **Weight Matrices for Each Head**: For each attention head, we have separate weight matrices for `Query`, `Key`, and `Value`. Let's denote them as `W_Q^1`, `W_K^1`, `W_V^1` for the first head or `First Set` and `W_Q^2`, `W_K^2`, `W_V^2` for the second head or `Second Set`.

**First Head Weight Matrices**:

$$
W_Q^1 = \begin{bmatrix}
0.1 & 0.0 & 0.2 \\
0.0 & 0.1 & 0.1 \\
0.2 & 0.1 & 0.0
\end{bmatrix}
$$

$$
W_K^1 = \begin{bmatrix}
0.2 & 0.1 & 0.0 \\
0.1 & 0.2 & 0.1 \\
0.0 & 0.1 & 0.2
\end{bmatrix}
$$

$$
W_V^1 = \begin{bmatrix}
0.1 & 0.2 & 0.1 \\
0.2 & 0.1 & 0.2 \\
0.1 & 0.1 & 0.1
\end{bmatrix}
$$

**Second Head Weight Matrices**:

$$
W_Q^2 = \begin{bmatrix}
0.0 & 0.2 & 0.1 \\
0.1 & 0.0 & 0.2 \\
0.2 & 0.1 & 0.0
\end{bmatrix}
$$

$$
W_K^2 = \begin{bmatrix}
0.1 & 0.0 & 0.2 \\
0.2 & 0.1 & 0.0 \\
0.0 & 0.2 & 0.1
\end{bmatrix}
$$

$$
W_V^2 = \begin{bmatrix}
0.2 & 0.1 & 0.2 \\
0.1 & 0.2 & 0.1 \\
0.1 & 0.1 & 0.1
\end{bmatrix}
$$

3. **Computing Q, K, V for Each Head**: For each head, we compute the `Query`, `Key`, and `Value` vectors for each word by multiplying the `Word Embeddings` with the respective `Weight Matrices`.

**First Head Computations**:

For the word `"Money"` with embedding `E_{Money} = [0.1, 0.2, 0.3]`:

$$
Q_{Money}^1 = E_{Money} \cdot W_Q^1 = [0.1, 0.2, 0.3] \cdot \begin{bmatrix}
0.1 & 0.0 & 0.2 \\
0.0 & 0.1 & 0.1 \\
0.2 & 0.1 & 0.0
\end{bmatrix} = [0.07, 0.05, 0.04]
$$

$$
K_{Money}^1 = E_{Money} \cdot W_K^1 = [0.1, 0.2, 0.3] \cdot \begin{bmatrix}
0.2 & 0.1 & 0.0 \\
0.1 & 0.2 & 0.1 \\
0.0 & 0.1 & 0.2
\end{bmatrix} = [0.04, 0.08, 0.08]
$$

$$
V_{Money}^1 = E_{Money} \cdot W_V^1 = [0.1, 0.2, 0.3] \cdot \begin{bmatrix}
0.1 & 0.2 & 0.1 \\
0.2 & 0.1 & 0.2 \\
0.1 & 0.1 & 0.1
\end{bmatrix} = [0.08, 0.07, 0.08]
$$

Similarly, we compute for `"Bank"` and `"Grows"` for the first head.

Then, we repeat the same process for the second head using `W_Q^2`, `W_K^2`, and `W_V^2`.
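
The per-word products in this step can be computed for the whole sentence at once by stacking the embeddings into a matrix with one row per word. Below is a minimal NumPy sketch, assuming the embedding and first-head weight values from this example; the variable names (`E`, `W_Q1`, `Q1`, and so on) are illustrative only.

```python
import numpy as np

# Embeddings from the example, one row per word: Money, Bank, Grows.
E = np.array([
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
])

# First-head weight matrices from the example.
W_Q1 = np.array([[0.1, 0.0, 0.2], [0.0, 0.1, 0.1], [0.2, 0.1, 0.0]])
W_K1 = np.array([[0.2, 0.1, 0.0], [0.1, 0.2, 0.1], [0.0, 0.1, 0.2]])
W_V1 = np.array([[0.1, 0.2, 0.1], [0.2, 0.1, 0.2], [0.1, 0.1, 0.1]])

# One matrix product per projection gives Q, K, V for every word.
Q1 = E @ W_Q1
K1 = E @ W_K1
V1 = E @ W_V1

# Row 0 of each result belongs to "Money", row 1 to "Bank", row 2 to "Grows".
for name, M in (("Q1", Q1), ("K1", K1), ("V1", V1)):
    print(name, "row for Money:", np.round(M[0], 2))
```

The second head would follow the same pattern with its own three matrices.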

4. **Attention Calculation for Each Head**: Each head computes its own attention scores and generates context-aware representations using the `Scaled Dot-Product Attention` mechanism.

**First Head Attention**:

$$
\text{Attention}^1(Q^1, K^1, V^1) = \text{Softmax}\left(\frac{Q^1 {K^1}^T}{\sqrt{d_k}}\right) V^1
$$

Where `d_k` is the dimension of the `Key` vectors.

Here, we compute the dot product of the `Query` and `Key` vectors, scale it by dividing by the square root of `d_k`, apply the `Softmax` function to get the attention weights, and then multiply by the `Value` vectors to get the output for the first head.

This will be the `First Contextual Embedding` for the sentence `"Money Bank Grows"` from the `First Head`.

Which will look something like this:

$$
C^1 = \begin{bmatrix}
c_{Money}^1 \\
c_{Bank}^1 \\
c_{Grows}^1
\end{bmatrix}
$$
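
The formula for the first head can be sketched in a few lines of NumPy. This is a minimal illustration under the toy values of this example, not a reference implementation; the helper names `softmax` and `scaled_dot_product_attention` are chosen here for readability.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum before exponentiating for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]                    # dimension of the key vectors
    scores = Q @ K.T / np.sqrt(d_k)      # (n, n) scaled similarity scores
    weights = softmax(scores, axis=-1)   # attention weights, each row sums to 1
    return weights @ V, weights          # (n, d_v) output and (n, n) weights

# Embeddings and first-head weights from the example.
E = np.array([[0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [0.7, 0.8, 0.9]])
W_Q1 = np.array([[0.1, 0.0, 0.2], [0.0, 0.1, 0.1], [0.2, 0.1, 0.0]])
W_K1 = np.array([[0.2, 0.1, 0.0], [0.1, 0.2, 0.1], [0.0, 0.1, 0.2]])
W_V1 = np.array([[0.1, 0.2, 0.1], [0.2, 0.1, 0.2], [0.1, 0.1, 0.1]])

Q1, K1, V1 = E @ W_Q1, E @ W_K1, E @ W_V1
C1, A1 = scaled_dot_product_attention(Q1, K1, V1)

print("attention weights (rows: Money, Bank, Grows):")
print(np.round(A1, 3))
print("contextual embeddings C^1:")
print(np.round(C1, 3))
```

Each row of `C1` is a weighted average of the rows of `V1`, with the weights taken from the matching row of `A1`.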

**Second Head Attention**:

$$
\text{Attention}^2(Q^2, K^2, V^2) = \text{Softmax}\left(\frac{Q^2 {K^2}^T}{\sqrt{d_k}}\right) V^2
$$

Where `d_k` is the dimension of the `Key` vectors.

Here, we compute the dot product of the `Query` and `Key` vectors, scale it by dividing by the square root of `d_k`, apply the `Softmax` function to get the attention weights, and then multiply by the `Value` vectors to get the output for the second head.

This will be the `Second Contextual Embedding` for the sentence `"Money Bank Grows"` from the `Second Head`.

Which will look something like this:

$$
C^2 = \begin{bmatrix}
c_{Money}^2 \\
c_{Bank}^2 \\
c_{Grows}^2
\end{bmatrix}
$$

Now we have two different context-aware representations for the same input sentence, each of which can capture different aspects of the relationships between the words. Previously, with single-head attention, we had only one such representation.

5. **Concatenation and Final Linear Transformation**: Finally, we concatenate the outputs from all attention heads and pass them through a final linear transformation to get the final output of the `Multi-Head Attention` mechanism.

$$
C = \text{Concat}(C^1, C^2)
$$

We then multiply this concatenated matrix by another weight matrix `W_O`, i.e. we apply a linear transformation, to get the final output embeddings.

$$
O = C \cdot W_O
$$

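Where `W_O` is the weight matrix for the output linear transformation. Since `C` has shape `(3, 6)` in this example (3 words, 2 heads with 3 values each), `W_O` needs shape `(6, 3)` so that the product is defined and the output returns to the embedding size of 3.

`W_O` could look something like this:

$$
W_O = \begin{bmatrix}
0.1 & 0.2 & 0.1 \\
0.2 & 0.1 & 0.0 \\
0.1 & 0.0 & 0.2 \\
0.0 & 0.1 & 0.2 \\
0.1 & 0.2 & 0.1 \\
0.2 & 0.1 & 0.0
\end{bmatrix}
$$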

During `Backpropagation`, all these weight matrices (`W_Q^i`, `W_K^i`, `W_V^i`, and `W_O`) are updated to minimize the loss function, allowing the model to learn optimal attention patterns for different contexts.
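
The five steps above can be sketched end to end for the two-head example. This is a forward pass only, assuming the toy weights from this section and a `W_O` of shape `(6, 3)` so that `C · W_O` is defined and maps back to the embedding size; the function name `attention_head` and the `heads` list are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_head(E, W_Q, W_K, W_V):
    # Project the embeddings, then apply scaled dot-product attention.
    Q, K, V = E @ W_Q, E @ W_K, E @ W_V
    weights = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
    return weights @ V

# Rows: Money, Bank, Grows.
E = np.array([[0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [0.7, 0.8, 0.9]])

# One set of weight matrices per head (values from the example).
heads = [
    dict(
        W_Q=np.array([[0.1, 0.0, 0.2], [0.0, 0.1, 0.1], [0.2, 0.1, 0.0]]),
        W_K=np.array([[0.2, 0.1, 0.0], [0.1, 0.2, 0.1], [0.0, 0.1, 0.2]]),
        W_V=np.array([[0.1, 0.2, 0.1], [0.2, 0.1, 0.2], [0.1, 0.1, 0.1]]),
    ),
    dict(
        W_Q=np.array([[0.0, 0.2, 0.1], [0.1, 0.0, 0.2], [0.2, 0.1, 0.0]]),
        W_K=np.array([[0.1, 0.0, 0.2], [0.2, 0.1, 0.0], [0.0, 0.2, 0.1]]),
        W_V=np.array([[0.2, 0.1, 0.2], [0.1, 0.2, 0.1], [0.1, 0.1, 0.1]]),
    ),
]

# Output projection, assumed here to have shape (6, 3).
W_O = np.array([
    [0.1, 0.2, 0.1], [0.2, 0.1, 0.0], [0.1, 0.0, 0.2],
    [0.0, 0.1, 0.2], [0.1, 0.2, 0.1], [0.2, 0.1, 0.0],
])

# Concatenate the per-head outputs along the feature axis, then project.
C = np.concatenate([attention_head(E, **h) for h in heads], axis=-1)
O = C @ W_O

print(C.shape, O.shape)  # (3, 6) (3, 3)
```

In a real model the weights would be learned rather than written by hand, and the update step mentioned above would be handled by an automatic differentiation framework.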


<hr>


## **How Does `Multi-Head Attention` Actually Work?**

The image below explains the entire process of `Multi-Head Attention` step by step.

<img src="../../../Notes_Images/Multi_Head_Attn.png" alt="Multi Head Attention" style="width:1200px;"/>

Then,

<img src="../../../Notes_Images/Multi_Head_Attn2.png" alt="Multi Head Attention Steps" style="width:1200px;"/>

The output `Z` obtained after multiplying `Concat(head_1, head_2, ..., head_h)` by `W_O` is the final output of the `Multi-Head Attention` mechanism.

This `Z` captures information from different representation subspaces at different positions, allowing the model to attend to various aspects of the input sequence simultaneously.

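**Why is the `Linear Transformation`, i.e. `W_O`, required after Concatenation?**

- These weights learn how to combine the different heads: after concatenation each head's output sits in its own slice of the vector, and `W_O` mixes those slices so that every output feature can draw on all heads.

- It maps the concatenated vector back to the input embedding size, so that the output has the same size as the input.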
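
One way to see the mixing role of `W_O`: multiplying the concatenation by `W_O` gives the same result as handing each head its own block of rows of `W_O` and summing the per-head products. The sketch below checks this numerically with random stand-in values; the sizes and names (`head_outputs`, `d_v`, `Z_sum`) are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, h, d_v, d_model = 3, 2, 3, 3   # toy sizes, chosen only for illustration

# Stand-ins for the per-head outputs, each of shape (n, d_v).
head_outputs = [rng.normal(size=(n, d_v)) for _ in range(h)]
W_O = rng.normal(size=(h * d_v, d_model))

# Standard form: concatenate along the feature axis, then project.
Z = np.concatenate(head_outputs, axis=-1) @ W_O

# Equivalent form: head i only multiplies rows [i*d_v, (i+1)*d_v) of W_O,
# and the per-head contributions are summed.
Z_sum = sum(
    C_i @ W_O[i * d_v:(i + 1) * d_v]
    for i, C_i in enumerate(head_outputs)
)

print(Z.shape, np.allclose(Z, Z_sum))
```

Seen this way, every output feature is a learned weighted sum of contributions from all heads.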

### **Attention is All You Need**

In the seminal paper "Attention is All You Need" by Vaswani et al., the following sizes and dimensions were used for `Multi-Head Attention`:

**Embedding Dimension (d_model)**: 512

**Number of Heads (h)**: 8

**Dimension of Each Head (d_k = d_v)**: 64 (i.e., d_model/h = 512/8). So each head's weight matrices (`W_Q`, `W_K`, `W_V`) have shape (512, 64).

- The input embedding matrix has size `(n, 512)`, where `n` is the number of tokens in the input sequence.

- Each weight matrix has size `(512, 64)`, so the matrix multiplication is valid, and the resulting `Q`, `K`, `V` have size `(n, 64)` for each head.

- After computing attention for each head, we have `8` different contextual embedding matrices, each of size `(n, 64)`, i.e. `8` vectors of size `64` for each token.

- Then, we concatenate these `8` embeddings along the last dimension to get a combined embedding of size `(n, 512)`.

- Then, this concatenated embedding is multiplied with the output weight matrix `W_O` of size `(512, 512)` to get the final output of the `Multi-Head Attention` mechanism, which also has size `(n, 512)`.

Thus the `Input Embedding Dimension` and `Output Embedding Dimension` remain the same, i.e. `512`, while allowing the model to attend to information from different representation subspaces through multiple heads.
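
The shape bookkeeping above can be checked with a short NumPy sketch. It uses random stand-ins for the learned weights and an arbitrary sequence length of `n = 4`, so only the shapes are meaningful, not the values; bias terms are left out of the parameter count.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n, d_model, h = 4, 512, 8            # n = 4 tokens is an arbitrary choice
d_k = d_model // h                   # 64

X = rng.normal(size=(n, d_model))    # input embeddings, (n, 512)

head_outputs = []
for _ in range(h):
    # Random stand-ins for one head's (512, 64) projection matrices.
    W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) * 0.02 for _ in range(3))
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V          # each (n, 64)
    weights = softmax(Q @ K.T / np.sqrt(d_k))    # (n, n)
    head_outputs.append(weights @ V)             # (n, 64)

concat = np.concatenate(head_outputs, axis=-1)   # (n, 512)
W_O = rng.normal(size=(d_model, d_model)) * 0.02 # (512, 512)
Z = concat @ W_O                                 # (n, 512)

# Weights only: three (512, 64) matrices per head plus one (512, 512) W_O.
n_params = 3 * h * d_model * d_k + d_model * d_model

print(head_outputs[0].shape, concat.shape, Z.shape, n_params)
```

Because `h * d_k = d_model`, the eight small projections together hold as many weights as three full `(512, 512)` matrices, so splitting into heads does not increase the parameter count compared with a single head of full width.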
