# 🎯 Transformers & Attention: Complete Guide

## The Architecture That Changed Everything

**"Attention Is All You Need"** - Vaswani et al., 2017

A walkthrough of the Transformer architecture and the self-attention mechanism at its core.

---


In [None]:
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(42)
print('✅ Transformers ready!')


## Self-Attention Mathematics

### The Core Innovation

**Input**: Sequence of vectors $X = [x_1, x_2, ..., x_n]$

**Three learned projections**:
- **Query**: $Q = XW_Q$
- **Key**: $K = XW_K$  
- **Value**: $V = XW_V$

**Attention formula**:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

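**Why $\sqrt{d_k}$?** If the components of a query and a key are independent with zero mean and unit variance, their dot product has variance $d_k$. Large scores push the softmax into regions where its gradients are extremely small, so dividing by $\sqrt{d_k}$ keeps the scores at roughly unit variance.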
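
The effect of the scaling can be checked numerically. The sketch below assumes queries and keys with i.i.d. standard-normal entries (an idealization, not a property of trained models); `dot_product_std` is a hypothetical helper introduced only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def dot_product_std(d_k, n_samples=10_000):
    """Empirical std of q.k, raw and scaled, for standard-normal q and k."""
    q = rng.standard_normal((n_samples, d_k))
    k = rng.standard_normal((n_samples, d_k))
    dots = np.sum(q * k, axis=-1)          # one dot product per sample
    return dots.std(), (dots / np.sqrt(d_k)).std()

for d_k in (4, 64, 512):
    raw, scaled = dot_product_std(d_k)
    print(f"d_k={d_k:4d}  raw std={raw:6.2f}  scaled std={scaled:5.2f}")
```

The raw standard deviation should come out close to $\sqrt{d_k}$, while the scaled one stays near 1 regardless of dimension.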

### Multi-Head Attention

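Run $h$ attention "heads" in parallel, each with its own learned projections, then concatenate the results and project them once more: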

$$\text{MultiHead}(Q,K,V) = \text{Concat}(\text{head}_1, ..., \text{head}_h)W^O$$

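where each $\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)$. Here $Q$, $K$, and $V$ denote the unprojected inputs (all equal to $X$ in self-attention), and each head applies its own projections $W_i^Q$, $W_i^K$, $W_i^V$.

**Benefit**: Each head can attend to a different representation subspace, so the model can track several kinds of relationships between positions at once.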
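
A minimal NumPy sketch of the multi-head formula, under a few assumptions: the per-head matrices $W_i^Q$, $W_i^K$, $W_i^V$ are stored as column slices of single `(d_model, d_model)` matrices, `d_model` is divisible by `num_heads`, and the weights are random rather than learned. The names `multi_head_attention` and `split_heads` are illustrative only.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / np.sum(e, axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads):
    seq_len, d_model = X.shape
    d_head = d_model // num_heads  # assumes d_model % num_heads == 0

    def split_heads(M):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return M.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split_heads(X @ W_q), split_heads(X @ W_k), split_heads(X @ W_v)
    scores = Q @ np.swapaxes(K, -2, -1) / np.sqrt(d_head)  # (h, n, n)
    weights = softmax(scores, axis=-1)
    heads = weights @ V                                    # (h, n, d_head)
    # Concatenate the heads back into (seq_len, d_model), then apply W^O
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o, weights

rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 4, 8, 2
X = rng.standard_normal((seq_len, d_model))
W_q, W_k, W_v, W_o = (0.1 * rng.standard_normal((d_model, d_model)) for _ in range(4))

out, w = multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads)
print(out.shape, w.shape)  # (4, 8) (2, 4, 4)
```

Each head works in a `d_head = d_model / num_heads` dimensional subspace, so the total cost stays comparable to one full-width attention.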


In [None]:
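# Simplified self-attention
def self_attention(Q, K, V):
    """Scaled dot-product attention for arrays of shape (..., seq_len, d_k)."""
    d_k = Q.shape[-1]
    # np.swapaxes transposes the last two axes; ndarray.transpose(-2, -1)
    # would leave a 2-D array unchanged and break the matmul.
    scores = np.matmul(Q, np.swapaxes(K, -2, -1)) / np.sqrt(d_k)
    # Subtract the row-wise max before exponentiating for numerical stability
    scores = scores - np.max(scores, axis=-1, keepdims=True)
    exp_scores = np.exp(scores)
    attention_weights = exp_scores / np.sum(exp_scores, axis=-1, keepdims=True)
    output = np.matmul(attention_weights, V)
    return output, attention_weights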

# Example
seq_len, d_model = 4, 8
X = np.random.randn(seq_len, d_model)
Q = K = V = X  # Self-attention with identity projections (no learned W_Q, W_K, W_V)

output, weights = self_attention(Q, K, V)
print(f'Input shape: {X.shape}')
print(f'Output shape: {output.shape}')
print(f'Attention weights shape: {weights.shape}')
print('✅ Self-attention computed!')


## Transformer Architecture

### Encoder
1. Input Embedding + Positional Encoding
2. **N layers** of:
   - Multi-Head Self-Attention
   - Add & Norm
   - Feed-Forward Network
   - Add & Norm
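
The encoder steps above can be sketched in NumPy. Several simplifications are assumed here and are not taken from the text: the positional encoding is the fixed sinusoidal variant (learned embeddings are another option), attention uses a single head for brevity, layer normalization has no learned gain or bias, weights are random, and `d_model` is even. All function and parameter names are illustrative.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal encoding: sin on even dimensions, cos on odd ones."""
    pos = np.arange(seq_len)[:, None]            # (seq_len, 1)
    two_i = np.arange(0, d_model, 2)[None, :]    # even dimension indices
    angles = pos / np.power(10000.0, two_i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

def layer_norm(x, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def encoder_layer(x, p):
    # Self-attention sub-layer (single head, standing in for multi-head)
    Q, K, V = x @ p["W_q"], x @ p["W_k"], x @ p["W_v"]
    attn = softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V
    x = layer_norm(x + attn @ p["W_o"])          # Add & Norm
    # Position-wise feed-forward sub-layer with ReLU
    ffn = np.maximum(0.0, x @ p["W_1"] + p["b_1"]) @ p["W_2"] + p["b_2"]
    return layer_norm(x + ffn)                   # Add & Norm

rng = np.random.default_rng(0)
seq_len, d_model, d_ff, n_layers = 4, 8, 32, 2

def init_params():
    shapes = {
        "W_q": (d_model, d_model), "W_k": (d_model, d_model),
        "W_v": (d_model, d_model), "W_o": (d_model, d_model),
        "W_1": (d_model, d_ff), "b_1": (d_ff,),
        "W_2": (d_ff, d_model), "b_2": (d_model,),
    }
    return {name: 0.1 * rng.standard_normal(shape) for name, shape in shapes.items()}

# Random vectors stand in for token embeddings
x = rng.standard_normal((seq_len, d_model)) + positional_encoding(seq_len, d_model)
for _ in range(n_layers):
    x = encoder_layer(x, init_params())
print(x.shape)  # (4, 8)
```

Because every sub-layer maps `(seq_len, d_model)` to the same shape, layers can be stacked N times without any reshaping.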

### Decoder  
1. Output Embedding + Positional Encoding
2. **N layers** of:
   - Masked Multi-Head Self-Attention
   - Add & Norm
   - Multi-Head Cross-Attention (queries from the decoder, keys and values from the encoder output)
   - Add & Norm
   - Feed-Forward Network
   - Add & Norm
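
The "masked" step in the decoder prevents each position from attending to later positions, so that a prediction depends only on tokens already generated. A minimal single-head sketch follows, assuming identity projections and random inputs; `masked_self_attention` is an illustrative name.

```python
import numpy as np

def masked_self_attention(Q, K, V):
    """Scaled dot-product attention with a causal (look-ahead) mask."""
    seq_len, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)
    # True above the diagonal: position i must not see positions j > i
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    # The diagonal is never masked, so each row max is finite
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)                     # exp(-inf) == 0 for masked entries
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))
out, weights = masked_self_attention(X, X, X)
print(np.round(weights, 2))  # lower-triangular; each row sums to 1
```

The first row puts all of its weight on position 0, since that is the only position it is allowed to see.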

### Why Transformers Won

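- ✅ **Parallelizable**: all positions in a sequence are processed at once during training, whereas an RNN must step through them one at a time
- ✅ **Long-range dependencies**: any two positions are connected by a path of length O(1), versus O(n) for RNNs
- ✅ **Scalable** to billions of parameters
- ✅ **Transfer learning**: pretrained models such as BERT and GPT can be fine-tuned for downstream tasks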
