# Why Transformers?

### **Background**
Traditional models like **RNNs** and **LSTMs** read text *sequentially*: each word can only be processed after the previous one.  
This causes two main issues:

1. **Slow computation** – the tokens of a sequence cannot be processed in parallel  
2. **Weak long-term dependencies** – information from distant words tends to fade  

---

### **The Transformer Solution**
Introduced by **Vaswani et al. (2017)** in the paper *“Attention Is All You Need”*,  
the **Transformer** addresses both issues by allowing **each word to directly attend to every other word**.

This enables:
- **Parallel computation**
- **Global context understanding**

---

### **Key Concepts**

| Concept | Description |
|----------|--------------|
| **Self-Attention** | Allows each token to gather information from all other tokens in a sequence. |
| **Multi-Head Attention** | Uses multiple attention “heads” to learn different types of relationships (e.g., syntax, semantics). |
| **Positional Encoding** | Injects information about word order, since self-attention itself has no sense of sequence. |
| **Feed-Forward Networks** | Non-linear layers applied to each position to refine learned features. |
| **Residual Connections + LayerNorm** | Help stabilize training and ensure better gradient flow through deep networks. |

---

### **Encoder vs Decoder**

| Component | Function | Typical Use |
|------------|-----------|--------------|
| **Encoder** | Reads and encodes the input sequence into contextual representations. | Classification, embeddings, BERT-like models |
| **Decoder** | Generates the output sequence step by step; in encoder-decoder models it also attends to encoder outputs. | Translation, text generation, GPT-like (decoder-only) models |

---



# Step 1: Tokenization
The sentence is split into tokens: `["I", "love", "artificial", "intelligence", "."]`

Each token is then mapped to an integer ID from the vocabulary.

Example (illustrative IDs): `[101, 2000, 3562, 8007, 102]`
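
The two steps above can be sketched in a few lines. This is a minimal illustration, assuming a toy regex tokenizer and a hypothetical vocabulary (`VOCAB`, `UNK_ID`) that simply reuses the example IDs; real tokenizers usually split text into subword units and use a much larger learned vocabulary.

```python
import re

# Hypothetical toy vocabulary; the IDs mirror the example above and are illustrative only.
VOCAB = {"I": 101, "love": 2000, "artificial": 3562, "intelligence": 8007, ".": 102}
UNK_ID = 0  # assumed ID for tokens that are not in the vocabulary


def tokenize(sentence: str) -> list[str]:
    """Split a sentence into words and standalone punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence)


def encode(tokens: list[str]) -> list[int]:
    """Map each token to its vocabulary ID, falling back to UNK_ID."""
    return [VOCAB.get(tok, UNK_ID) for tok in tokens]


tokens = tokenize("I love artificial intelligence.")
ids = encode(tokens)
print(tokens)  # ['I', 'love', 'artificial', 'intelligence', '.']
print(ids)     # [101, 2000, 3562, 8007, 102]
```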


# Step 2: Embedding
Each token is turned into a vector of numbers (say 4 dimensions for simplicity):

| Token | Embedding (simplified) |
|--------|------------------------|
| I | [0.1, 0.3, 0.5, 0.7] |
| love | [0.8, 0.2, 0.4, 0.6] |
| artificial | [0.3, 0.9, 0.1, 0.5] |
| intelligence | [0.4, 0.8, 0.9, 0.3] |
| . | [0.2, 0.1, 0.2, 0.1] |
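
An embedding layer is essentially a lookup table with one row per vocabulary entry. The sketch below assumes NumPy and a hypothetical `row_of_id` mapping from token ID to table row; the numbers are copied from the table above, whereas a trained model would learn them.

```python
import numpy as np

# One row per vocabulary entry; values copied from the table above.
# In a trained model these rows are learned parameters, not hand-picked numbers.
embedding_table = np.array([
    [0.1, 0.3, 0.5, 0.7],  # I
    [0.8, 0.2, 0.4, 0.6],  # love
    [0.3, 0.9, 0.1, 0.5],  # artificial
    [0.4, 0.8, 0.9, 0.3],  # intelligence
    [0.2, 0.1, 0.2, 0.1],  # .
])

# Hypothetical mapping from token ID to row index in the table.
row_of_id = {101: 0, 2000: 1, 3562: 2, 8007: 3, 102: 4}

token_ids = [101, 2000, 3562, 8007, 102]
rows = [row_of_id[i] for i in token_ids]
embeddings = embedding_table[rows]  # integer indexing selects one row per token

print(embeddings.shape)  # (5, 4)
```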


# Step 3: Positional Encoding
Since attention has no order awareness, we add a position signal to each embedding.
Example (added element-wise):

| Position | Encoding | Resulting vector (Embedding + Position) |
|-----------|-----------|---------------------------------------|
| 1 | [0.0, 0.1, 0.2, 0.3] | [0.1, 0.4, 0.7, 1.0] |
| 2 | [0.2, 0.3, 0.4, 0.5] | [1.0, 0.5, 0.8, 1.1] |
| … | … | … |
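
The addition in the table can be reproduced directly. The sketch below assumes NumPy and uses the toy position signals from the table. It also includes a hypothetical helper, `sinusoidal_encoding`, showing the sine/cosine scheme from the original Transformer paper as one common way to generate such signals; other schemes, such as learned position embeddings, exist as well.

```python
import numpy as np

# Embeddings for "I" and "love" (first two rows of the Step 2 table).
embeddings = np.array([
    [0.1, 0.3, 0.5, 0.7],
    [0.8, 0.2, 0.4, 0.6],
])
# Toy position signals for positions 1 and 2, copied from the table above.
position_signal = np.array([
    [0.0, 0.1, 0.2, 0.3],
    [0.2, 0.3, 0.4, 0.5],
])

model_input = embeddings + position_signal  # element-wise addition
print(model_input)


def sinusoidal_encoding(num_positions: int, d_model: int) -> np.ndarray:
    """Sine on even dimensions, cosine on odd ones (d_model is assumed to be even)."""
    positions = np.arange(num_positions)[:, None]   # shape (num_positions, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]   # shape (1, d_model / 2)
    angles = positions / np.power(10000.0, even_dims / d_model)
    encoding = np.zeros((num_positions, d_model))
    encoding[:, 0::2] = np.sin(angles)
    encoding[:, 1::2] = np.cos(angles)
    return encoding
```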


# Step 4: Self-Attention (Intuition)
Each token looks at **all others** to understand context.
- “I” attends strongly to “love” → who is doing the action.
- “love” attends to “artificial” and “intelligence” → what is being loved.
- “artificial” attends to “intelligence” → adjective–noun relation.


This is captured mathematically by the **attention weights** matrix:


| Query ↓ / Key → | I | love | artificial | intelligence | . |
|--------------|---|------|-------------|---------------|---|
| **I** | 0.10 | **0.70** | 0.15 | 0.05 | 0.00 |
| **love** | 0.30 | 0.10 | **0.30** | **0.25** | 0.05 |
| **artificial** | 0.05 | 0.10 | 0.20 | **0.60** | 0.05 |
| **intelligence** | 0.05 | 0.05 | **0.30** | **0.55** | 0.05 |
| **.** | 0.00 | 0.05 | 0.10 | 0.10 | **0.75** |


Each row (query) shows how that token distributes its attention over all tokens, itself included; the weights in a row sum to 1.
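
A matrix like this is typically produced by scaled dot-product attention, `softmax(Q Kᵀ / sqrt(d_k))`. The sketch below assumes NumPy and uses random stand-ins for the token vectors and for the learned projection matrices (`w_q`, `w_k`), so its numbers will not match the hand-written table; it only shows the mechanics and that each row sums to 1.

```python
import numpy as np


def softmax(scores: np.ndarray) -> np.ndarray:
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def attention_weights(x: np.ndarray, w_q: np.ndarray, w_k: np.ndarray) -> np.ndarray:
    """Return softmax(Q K^T / sqrt(d_k)) for token vectors x of shape (seq_len, d_model)."""
    queries = x @ w_q
    keys = x @ w_k
    d_k = keys.shape[-1]
    return softmax(queries @ keys.T / np.sqrt(d_k))


rng = np.random.default_rng(0)
x = rng.normal(size=(5, 4))    # stand-in for the 5 position-encoded token vectors
w_q = rng.normal(size=(4, 4))  # random stand-ins for learned projection matrices
w_k = rng.normal(size=(4, 4))

weights = attention_weights(x, w_q, w_k)
print(weights.shape)        # (5, 5)
print(weights.sum(axis=1))  # each row sums to 1, up to floating-point rounding
```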


# Step 5: Weighted Sum (Context Vectors)
After attention, each token gets a **contextual embedding**: a weighted sum of all tokens' value vectors (its own included), with the attention weights from Step 4 as coefficients.
Example (conceptually):
- “I” → now knows about “love”.
- “love” → enriched by “artificial” + “intelligence”.
- “intelligence” → retains self + input from “artificial”.
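
Numerically, this blending is a single matrix product. The sketch below assumes NumPy, takes the attention weights from the Step 4 table, and, as a simplification, uses the Step 2 embeddings directly as the vectors being blended (a real model would first apply a learned value projection).

```python
import numpy as np

# Attention weights from the Step 4 table (rows = queries, columns = keys).
attention = np.array([
    [0.10, 0.70, 0.15, 0.05, 0.00],  # I
    [0.30, 0.10, 0.30, 0.25, 0.05],  # love
    [0.05, 0.10, 0.20, 0.60, 0.05],  # artificial
    [0.05, 0.05, 0.30, 0.55, 0.05],  # intelligence
    [0.00, 0.05, 0.10, 0.10, 0.75],  # .
])
# Simplification: the Step 2 embeddings stand in for the value vectors.
values = np.array([
    [0.1, 0.3, 0.5, 0.7],  # I
    [0.8, 0.2, 0.4, 0.6],  # love
    [0.3, 0.9, 0.1, 0.5],  # artificial
    [0.4, 0.8, 0.9, 0.3],  # intelligence
    [0.2, 0.1, 0.2, 0.1],  # .
])

context = attention @ values  # each row is a weighted average of the value vectors
print(context[0])  # approximately [0.635, 0.345, 0.39, 0.58]
```

Because “I” puts 0.70 of its weight on “love”, its context vector ends up much closer to the embedding of “love” than to its own.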


# Step 6: Feed-Forward & Residual
Each contextual embedding passes through a small feed-forward network. The result is added back to the network's input (a residual connection) and normalized with LayerNorm → stabilizes training and refines meaning.
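
A minimal sketch of this step, assuming NumPy, random untrained weights, toy sizes, and the post-norm ordering (normalize after the residual add); many modern implementations normalize before the sub-layer instead, and real LayerNorm also has a learned scale and shift, which are omitted here.

```python
import numpy as np


def layer_norm(x: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """Normalize each position's vector to zero mean and unit variance (no learned scale/shift)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def feed_forward(x, w1, b1, w2, b2):
    """Position-wise network: linear -> ReLU -> linear, applied to every row independently."""
    hidden = np.maximum(0.0, x @ w1 + b1)
    return hidden @ w2 + b2


rng = np.random.default_rng(0)
d_model, d_ff = 4, 16               # toy sizes; d_ff is an assumed hidden width
x = rng.normal(size=(5, d_model))   # stand-in for the 5 context vectors
w1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
w2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

# Residual add (x + sub-layer output), then LayerNorm.
out = layer_norm(x + feed_forward(x, w1, b1, w2, b2))
print(out.shape)  # (5, 4)
```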


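# Step 7: Multi-Head Attention
In practice, the attention of Steps 4 and 5 runs as several heads in parallel, each with its own learned projections, before the feed-forward network of Step 6. Different heads can pick up different relationships, for example:
- Head 1: Grammar (who–does–what)
- Head 2: Semantics (what concepts connect)
- Head 3: Position (which comes first)

The head outputs are concatenated and projected → rich, multi-dimensional context.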
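
The multi-head idea can be sketched as follows, assuming NumPy, two heads, and random stand-ins for the learned projections (`w_q`, `w_k`, `w_v`, `w_o`). With untrained weights the heads do not specialize in grammar or semantics; the sketch only shows how the model dimension is split across heads and recombined.

```python
import numpy as np


def softmax(scores: np.ndarray) -> np.ndarray:
    """Softmax over the last axis, shifted for numerical stability."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """Self-attention with num_heads heads; x has shape (seq_len, d_model)."""
    seq_len, d_model = x.shape
    d_head = d_model // num_heads  # d_model is assumed divisible by num_heads

    def split_heads(t):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = split_heads(x @ w_q), split_heads(x @ w_k), split_heads(x @ w_v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (num_heads, seq_len, seq_len)
    weights = softmax(scores)
    heads = weights @ v                                   # (num_heads, seq_len, d_head)
    # Concatenate the heads back into (seq_len, d_model), then mix them with w_o.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ w_o, weights


rng = np.random.default_rng(0)
x = rng.normal(size=(5, 4))
w_q, w_k, w_v, w_o = (rng.normal(size=(4, 4)) for _ in range(4))

output, weights = multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads=2)
print(output.shape)   # (5, 4)
print(weights.shape)  # (2, 5, 5): one attention matrix per head
```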


# Step 8: Output Interpretation
The model can now:
- Encode “I love artificial intelligence.” into a **meaning-rich vector** (for classification).
- Or pass the per-token vectors to a **decoder** (for translation/generation).
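
For the classification case, one simple way to get a single sentence vector is to average the token vectors (mean pooling) and feed the result to a linear classifier. The sketch below assumes NumPy, random stand-ins for the encoder outputs, and a hypothetical untrained 2-class classifier (`w_cls`, `b_cls`); BERT-like models often use the vector of a special first token instead.

```python
import numpy as np

rng = np.random.default_rng(0)
token_vectors = rng.normal(size=(5, 4))  # stand-in for the encoder's 5 output vectors

sentence_vector = token_vectors.mean(axis=0)  # mean pooling: average over tokens

# Hypothetical 2-class linear classifier with random (untrained) weights.
w_cls = rng.normal(size=(4, 2))
b_cls = np.zeros(2)
logits = sentence_vector @ w_cls + b_cls

# Softmax turns the two logits into class probabilities.
probs = np.exp(logits - logits.max())
probs /= probs.sum()

print(sentence_vector.shape)  # (4,)
print(probs.shape)            # (2,)
```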


**Key takeaway:**
Transformers capture context by **attending to every word at once**:
rather than reading strictly left to right, they build a global map of relationships.