
## **Multi-Head Attention in Transformers**

### **Introduction**

Multi-Head Attention (MHA) is a core component of Transformer architectures. It lets the model attend to different positions of a sequence through several learned projections (“heads”) at once. Instead of performing a single attention operation, MHA runs several in parallel on lower-dimensional projections, concatenates the results, and mixes them with a final linear layer, which yields a richer context representation.

---

### **Steps in Multi-Head Attention**

1. **Input Projection**

   * The input embedding (or hidden state) is projected into three distinct spaces:

     * **Query (Q)**
     * **Key (K)**
     * **Value (V)**
   * These projections are done *separately for each head*.

2. **Scaled Dot-Product Attention (per head)**

   * For each head, compute attention weights using:

     $$
     \text{Attention}(Q, K, V) = \text{Softmax} \left( \frac{QK^T}{\sqrt{d_k}} \right) V
     $$

     * $d_k$ is the dimension of each key vector.
     * Scaling keeps the dot products from growing with $d_k$; large values saturate the softmax, where gradients become very small.

3. **Parallel Attention**

   * All heads run in parallel, each with its own projections, so each head can learn to focus on different relationships (e.g., syntax, semantics).

4. **Concatenation**

   * The outputs of all heads are concatenated along the feature dimension.

5. **Final Linear Projection**

   * A final linear layer merges the concatenated heads into a single tensor.
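
Step 2 above can be sketched in NumPy for a single head. This is a minimal illustration rather than a reference implementation; the function names, the random inputs, and the sizes (3 queries, 5 keys, $d_k = 4$) are assumptions chosen for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the maximum along the axis for numerical stability.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n_q, n_k) similarity scores
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights          # weighted sum of value vectors

# Hypothetical inputs: 3 query positions, 5 key/value positions, d_k = d_v = 4.
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))
K = rng.normal(size=(5, 4))
V = rng.normal(size=(5, 4))

out, weights = scaled_dot_product_attention(Q, K, V)
print(out.shape, weights.shape)  # (3, 4) (3, 5)
```

Each row of `weights` is a probability distribution over the key positions, so every output row is a convex combination of the rows of `V`.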

---

### **Mathematical Representation**

If there are $h$ heads:

$$
\text{MHA}(Q, K, V) = \text{Concat}(\text{head}_1, \dots, \text{head}_h)W^O
$$

Where:

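$$
\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)
$$

Here $W_i^Q, W_i^K \in \mathbb{R}^{d_{\text{model}} \times d_k}$ and $W_i^V \in \mathbb{R}^{d_{\text{model}} \times d_v}$ are the learned projection matrices of head $i$, and $W^O \in \mathbb{R}^{h d_v \times d_{\text{model}}}$ is the output projection.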
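
The two formulas can be translated almost literally into NumPy. The sketch below keeps one projection matrix per head to mirror the notation; the helper names, the random weights, and the sizes (sequence length 5, model dimension 8, $h = 2$) are assumptions for illustration, and biases, masking, and batching are left out.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k), axis=-1) @ V

def multi_head_attention(Q, K, V, W_q, W_k, W_v, W_o):
    """W_q, W_k, W_v: lists of h per-head matrices; W_o: (h * d_v, d_model)."""
    # head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)
    heads = [attention(Q @ wq, K @ wk, V @ wv)
             for wq, wk, wv in zip(W_q, W_k, W_v)]
    # MHA = Concat(head_1, ..., head_h) W^O
    return np.concatenate(heads, axis=-1) @ W_o

# Hypothetical sizes and randomly initialized (untrained) weights.
rng = np.random.default_rng(0)
n, d_model, h = 5, 8, 2
d_k = d_v = d_model // h
W_q = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
W_k = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
W_v = [rng.normal(size=(d_model, d_v)) for _ in range(h)]
W_o = rng.normal(size=(h * d_v, d_model))

X = rng.normal(size=(n, d_model))
Y = multi_head_attention(X, X, X, W_q, W_k, W_v, W_o)  # self-attention: Q = K = V = X
print(Y.shape)  # (5, 8)
```

Passing the same `X` as `Q`, `K`, and `V` gives self-attention; passing a different tensor for `K` and `V` would give cross-attention with the same function.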

---

### **Example**

Suppose:

* Input embedding size = **8**
* Number of heads = **2**
* Each head dimension = **4**

**Step-by-step:**

1. Project each 8-dimensional embedding into Q, K, V vectors for head 1 and head 2 (each of size 4).
2. Compute scaled dot-product attention separately for each head.
3. Concatenate the two attention outputs (4 + 4 = 8 values per token).
4. Apply a linear transformation to mix head outputs.
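
The shapes in this example can be traced with a short NumPy sketch. It assumes a sequence of 3 tokens and random weights, and it uses a common implementation shortcut: the per-head projection matrices are stacked side by side into one 8 × 8 matrix, and the result is reshaped into heads. That is equivalent to projecting each head separately, but it is an implementation choice assumed here, not something the steps above require.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 3, 8, 2   # seq_len = 3 is an assumed value
d_head = d_model // num_heads           # 4

x = rng.normal(size=(seq_len, d_model))
# Fused projections: columns 0..3 belong to head 1, columns 4..7 to head 2.
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))

def split_heads(t):
    # (seq_len, d_model) -> (num_heads, seq_len, d_head)
    return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

# Step 1: project and split into heads.
q, k, v = split_heads(x @ W_q), split_heads(x @ W_k), split_heads(x @ W_v)

# Step 2: scaled dot-product attention, batched over the head axis.
scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)   # (2, 3, 3)
scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
weights = np.exp(scores)
weights = weights / weights.sum(axis=-1, keepdims=True)
per_head = weights @ v                                # (2, 3, 4)

# Step 3: concatenate the heads back to size 8 per token.
concat = per_head.transpose(1, 0, 2).reshape(seq_len, d_model)  # (3, 8)

# Step 4: final linear transformation mixes information across heads.
output = concat @ W_o                                 # (3, 8)

print(q.shape, per_head.shape, concat.shape, output.shape)
```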

---

### **Advantages**

* Captures **multiple types of relationships** in parallel.
* More expressive than single-head attention.
* Helps model **long-range dependencies**, since every head can attend directly to any position in the sequence.

### **Disadvantages**

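* Increased computational cost when heads are added at a fixed head size; with the common choice $d_k = d_{\text{model}}/h$, the total cost stays close to that of single-head attention at full dimensionality.
* More memory usage, since each head stores its own attention-weight matrix.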
* Potential redundancy if heads learn similar patterns.
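
How much the cost grows depends on how the head size is chosen. The rough count below assumes the common setup where each head has dimension `d_model // num_heads`, ignores biases, and uses hypothetical helper names and sizes (`d_model = 512`, sequence length 128). Under that assumption the number of projection weights does not depend on the number of heads, while the number of stored attention weights grows linearly with it.

```python
def mha_projection_params(d_model, num_heads):
    """Weights in W^Q, W^K, W^V (all heads) plus W^O, ignoring biases."""
    d_head = d_model // num_heads
    per_head_qkv = 3 * d_model * d_head         # one Q, K, V matrix per head
    output_proj = num_heads * d_head * d_model  # W^O acts on the concatenation
    return num_heads * per_head_qkv + output_proj

def attention_weight_entries(seq_len, num_heads):
    """One (seq_len x seq_len) attention-weight matrix per head."""
    return num_heads * seq_len * seq_len

for h in (1, 2, 4, 8):
    print(h, mha_projection_params(512, h), attention_weight_entries(128, h))
```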

---

### **Interview Q&A**

| **Question**                                       | **Answer**                                                                                                                                  |
| -------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------- |
| **Why do we use multiple heads instead of one?**   | Multiple heads allow the model to attend to different types of relationships in the data simultaneously, leading to richer representations. |
| **Why is the dot product scaled by $\sqrt{d_k}$?** | To prevent very large dot product values which can push the softmax into regions with extremely small gradients.                            |
| **Do all heads have the same parameters?**         | No, each head has its own learnable projection matrices $W_i^Q, W_i^K, W_i^V$.                                                              |
| **What happens after concatenating the heads?**    | The concatenated tensor is passed through a final linear transformation $W^O$ to mix the information from all heads.                        |
| **Can we have different head sizes in MHA?**       | Typically, all heads have the same dimension for implementation efficiency, but in theory, variable sizes are possible.                     |
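
The scaling answer can be checked numerically. The sketch below assumes query and key components drawn independently from a standard normal distribution, which is an idealization rather than a property of trained models. Under that assumption the variance of a raw dot product is about $d_k$, and dividing by $\sqrt{d_k}$ brings it back to about 1.

```python
import numpy as np

rng = np.random.default_rng(0)
num_samples = 10_000
variances = {}

for d_k in (4, 64, 256):
    q = rng.normal(size=(num_samples, d_k))
    k = rng.normal(size=(num_samples, d_k))
    dots = np.sum(q * k, axis=-1)           # one dot product per sample
    scaled = dots / np.sqrt(d_k)
    variances[d_k] = (dots.var(), scaled.var())
    print(f"d_k={d_k:4d}  raw variance={dots.var():8.2f}  "
          f"scaled variance={scaled.var():.2f}")
```

Without the scaling, larger heads would feed the softmax inputs with a larger spread, which pushes its output toward a near one-hot distribution.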

