# 🚀 Self-Attention explained with an example

This file explains **how the attention mechanism works** in Transformers (the heart of LLMs like GPT, LLaMA, etc.), using the sentence:

> **"The dog ran quickly"**

---

## 1. Context: What is attention in an LLM?

In a **Large Language Model (LLM)**, a token (roughly a word or word piece) is not processed in isolation: its representation is built **by taking into account the context** provided by the other tokens.

The mechanism that makes this possible is **Self-Attention**:

- Each token generates **three vectors**:
  - **Query (Q)** → what it looks for.
  - **Key (K)** → what it offers.
  - **Value (V)** → the information it provides.
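As a rough sketch of where these three vectors come from: in a standard Transformer, each token's embedding is multiplied by three learned weight matrices. The embedding `x` and the matrices `W_q`, `W_k`, `W_v` below are made-up values chosen only for illustration; they are not taken from any trained model and do not produce the vectors used in the example that follows.

```python
def matvec(matrix, vector):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

# Hypothetical 3-dimensional embedding of a single token (illustrative values).
x = [1.0, 0.5, -0.5]

# Hypothetical projection matrices; in a real model these are learned in training.
W_q = [[0.2, 0.0, 0.1], [0.0, 0.3, 0.0], [0.1, 0.0, 0.2]]
W_k = [[0.1, 0.2, 0.0], [0.0, 0.1, 0.2], [0.2, 0.0, 0.1]]
W_v = [[0.3, 0.0, 0.0], [0.0, 0.3, 0.0], [0.0, 0.0, 0.3]]

q = matvec(W_q, x)  # Query: what the token looks for
k = matvec(W_k, x)  # Key: what the token offers
v = matvec(W_v, x)  # Value: the information the token provides

print(q, k, v)
```

The same embedding yields three different vectors because each matrix projects it for a different role.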

---

## 2. The example: "The dog ran quickly"

Let’s suppose we are processing the word **"quickly"**.  
We want to know: **which tokens influence the meaning of "quickly"?**

### Initial vectors (simplified example in 3 dimensions):

- **Query (quickly)**:  
  \[
  Q_{quickly} = [0.22, 0.64, 0.73]
  \]

- **Key (dog)**:  
  \[
  K_{dog} = [0.54, 0.36, 0.74]
  \]

- **Value (dog)**:  
  \[
  V_{dog} = [0.12, 0.84, 0.51]
  \]

- **Value (quickly)**:  
  \[
  V_{quickly} = [0.62, 0.24, 0.33]
  \]

---

## 3. Step 1: Query–Key similarity

We calculate the **dot product** between the Query of "quickly" and the Key of each token. For "dog":

\[
score(dog, quickly) = Q_{quickly} \cdot K_{dog}
\]

\[
= (0.22)(0.54) + (0.64)(0.36) + (0.73)(0.74)
\]

\[
= 0.1188 + 0.2304 + 0.5402 \approx 0.889
\]

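> This raw score measures **how strongly "quickly" matches "dog"**. The softmax in the next step turns it into an attention weight.

In practice this score is computed against the Key of every token in the sentence ("The", "dog", "ran", "quickly"), including "quickly" itself. The standard Transformer also divides each score by the square root of the Key dimension before the softmax; this example omits that scaling to keep the arithmetic simple.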
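The dot product above can be checked with a few lines of plain Python. This is a minimal sketch; the helper name `dot` is just a convenience introduced here.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

q_quickly = [0.22, 0.64, 0.73]  # Query of "quickly" from the example
k_dog = [0.54, 0.36, 0.74]      # Key of "dog" from the example

score = dot(q_quickly, k_dog)
print(round(score, 3))  # 0.889
```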

---

## 4. Step 2: Softmax → Attention Weights

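We convert these scores into **normalized weights** (positive values that sum to 1) using the softmax function.

Example with only *dog* and *quickly* (the score of *quickly* with itself is assumed to be ≈ 1.2, since its Key is not listed above):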

\[
\alpha_{dog} = \frac{e^{0.889}}{e^{0.889} + e^{1.2}} \approx 0.42
\]

\[
\alpha_{quickly} = \frac{e^{1.2}}{e^{0.889} + e^{1.2}} \approx 0.58
\]
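The softmax step can be reproduced with the sketch below. The score of 1.2 for "quickly" is the assumed value from the text, and subtracting the maximum before exponentiating is a common numerical-stability habit that does not change the result.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    top = max(scores)  # subtracting the max avoids overflow; the result is unchanged
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

scores = [0.889, 1.2]  # [dog, quickly]; 1.2 is the assumed score from the text
alpha_dog, alpha_quickly = softmax(scores)
print(round(alpha_dog, 2), round(alpha_quickly, 2))  # 0.42 0.58
```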

---

## 5. Step 3: Combining Values

The **output of "quickly"** is the **weighted sum of the Values**:

\[
output(quickly) = \alpha_{dog} \cdot V_{dog} + \alpha_{quickly} \cdot V_{quickly}
\]

\[
= 0.42 \cdot [0.12, 0.84, 0.51] + 0.58 \cdot [0.62, 0.24, 0.33]
\]

\[
= [0.41, 0.49, 0.41] \ (\text{approx})
\]
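The weighted sum can be verified the same way. This sketch uses the rounded weights 0.42 and 0.58 from the previous step; `weighted_sum` is a helper name introduced here for illustration.

```python
def weighted_sum(weights, vectors):
    """Combine vectors component by component, scaled by their weights."""
    dim = len(vectors[0])
    return [sum(w * vec[i] for w, vec in zip(weights, vectors)) for i in range(dim)]

weights = [0.42, 0.58]       # [alpha_dog, alpha_quickly]
values = [
    [0.12, 0.84, 0.51],      # V_dog
    [0.62, 0.24, 0.33],      # V_quickly
]

output = weighted_sum(weights, values)
print([round(x, 2) for x in output])  # [0.41, 0.49, 0.41]
```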

---

## 6. Conceptual interpretation

- **Each token has a single Value**, not a different one for each other token.  
- What changes is the **weight with which that Value is combined**, according to the Query–Key similarity.  
- The result is a **new semantic vector** for "quickly", which is no longer just its original embedding, but a **contextualized mixture** of information from the whole sentence.  
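
To illustrate the first two points, the sketch below runs two different Queries over the same Keys and Values. The Values never change; only the weights do, so the outputs differ. `K_quickly` and `q_other` are hypothetical vectors invented for this illustration (the example above does not define them), so the numbers will not match the worked example exactly.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend(query, keys, values):
    """Return (weights, output) for one Query over all Keys and Values."""
    scores = [dot(query, k) for k in keys]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, output

keys = [
    [0.54, 0.36, 0.74],  # K_dog (from the example)
    [0.30, 0.80, 0.90],  # K_quickly (hypothetical, not given in the example)
]
values = [
    [0.12, 0.84, 0.51],  # V_dog
    [0.62, 0.24, 0.33],  # V_quickly
]

q_quickly = [0.22, 0.64, 0.73]  # Query from the example
q_other = [0.90, 0.10, 0.10]    # hypothetical Query of some other token

w_a, out_a = attend(q_quickly, keys, values)
w_b, out_b = attend(q_other, keys, values)
print(w_a, out_a)
print(w_b, out_b)
```

With these made-up numbers, the first Query weights the Value of "quickly" more heavily and the second favors the Value of "dog", even though both read from the same two Values.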

---

## 7. Conclusion

- **Query · Key (dot product) = how well another token matches what I am looking for.**  
- **Weights (softmax) = how strongly I listen to each token.**  
- **Values = what each token contributes to the mix.**  
- **Output = new contextualized vector**, used in later layers to finally **predict the next token**.

---
