# Transformers: The Architecture Behind Modern AI

<figure>
    <center> <img src="./images/transformer_architecture.png"  style="width:800px;height:300px;" ></center>
</figure>

## From Neural Networks to Language Models

**This notebook is currently an OUTLINE** - to be completed by Daniel Huber

### The Journey So Far:

1. **Linear Regression**: $f(x) = wx + b$ - Simple but limited
2. **Cost Functions**: Measuring prediction errors with MSE
3. **Gradient Descent**: Optimizing parameters iteratively
4. **Neural Networks**: Non-linear transformations with activation functions

### The Next Leap: Transformers

**Transformers** (2017) revolutionized AI by introducing a new architecture that powers:
- **ChatGPT** (OpenAI)
- **Claude** (Anthropic) 
- **BERT** (Google)
- **GPT-4**, **Gemini**, and virtually all modern language models

### What You'll Learn

1. Why traditional neural networks struggle with sequences
2. The attention mechanism - "looking at the right place"
3. Self-attention: how transformers understand context
4. Positional encoding: teaching position to the model
5. The complete transformer architecture
6. How transformers scale to billions of parameters

## SECTION 1: The Problem with Sequences

### TODO: Add content

Topics to cover:
- Why position matters in language
- Limitations of feed-forward networks for text
- RNNs and their sequential bottleneck
- The need for parallel processing

### Example: Word Order Matters
```
"The cat sat on the mat" ≠ "The mat sat on the cat"
```

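The feed-forward networks from Notebook 4 expect a fixed-size input and have no built-in mechanism for relating one word to another. If a sentence is fed in as an unordered collection of words, the two sentences above look exactly the same to the model, so it cannot capture the **relationship** between words!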
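
To make the word-order problem concrete, here is a minimal sketch using a bag-of-words representation, which simply counts words and discards their order. The helper name `bag_of_words` is an assumption made for this illustration; it is not part of the notebook series.

```python
from collections import Counter

def bag_of_words(sentence):
    """Count words, ignoring their order (a deliberately order-blind representation)."""
    return Counter(sentence.lower().split())

a = bag_of_words("The cat sat on the mat")
b = bag_of_words("The mat sat on the cat")

# The two sentences are indistinguishable once word order is discarded
print(a == b)
```

Any model that only sees these counts has no way to tell who sat on what, which is why sequence models need some way to represent order.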

In [None]:
# Setup
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import FancyBboxPatch, FancyArrowPatch

plt.style.use('./leonteq.mplstyle')
%matplotlib inline

# TODO: Add transformer imports when implementing
# import torch
# import torch.nn as nn
# from transformers import GPT2Tokenizer, GPT2Model

## SECTION 2: Attention Mechanism - The Key Innovation

### TODO: Implement attention visualization

**The Core Idea:**
> "When processing each word, look at ALL other words and decide which ones are important"

### Mathematical Formulation

$$
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
$$

Where:
- **Q** (Query): "What am I looking for?"
- **K** (Key): "What do I contain?"
- **V** (Value): "What information do I have?"
- **d_k**: Dimension of keys (scaling factor)
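
The role of the $\sqrt{d_k}$ scaling factor can be checked numerically. The sketch below assumes query and key entries are independent with zero mean and unit variance, in which case the dot product $q \cdot k$ has a standard deviation of about $\sqrt{d_k}$. Without scaling, large `d_k` would feed very large scores into the softmax and make it nearly one-hot. The function name `dot_product_std` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def dot_product_std(d_k, n_samples=10_000):
    """Empirical standard deviation of q·k for random q, k with unit-variance entries."""
    q = rng.standard_normal((n_samples, d_k))
    k = rng.standard_normal((n_samples, d_k))
    return np.sum(q * k, axis=-1).std()

for d_k in (4, 64, 512):
    raw = dot_product_std(d_k)
    scaled = raw / np.sqrt(d_k)
    print(f"d_k={d_k:4d}  std(q·k)={raw:6.2f}  std(q·k / sqrt(d_k))={scaled:.2f}")
```

Dividing by $\sqrt{d_k}$ keeps the spread of the scores roughly constant regardless of the key dimension.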

### Example: Attention in "The animal didn't cross the street because it was too tired"

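When processing "it", a trained model might assign attention weights like these (illustrative values, not measured from a real model):
- High attention to "animal" (92%)
- Low attention to "street" (5%)
- Low attention to "cross" (3%)

Weights like these let the model resolve "it" to "animal", not "street"!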
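
To see how raw attention scores become percentages like the ones above, the sketch below applies a softmax to three made-up scores. The score values are hypothetical and chosen only for illustration; a real model would compute them from learned query and key vectors.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    shifted = x - np.max(x)      # subtracting the max avoids overflow in exp
    exp = np.exp(shifted)
    return exp / exp.sum()

# Hypothetical raw scores for the query "it" against three other words
words = ["animal", "street", "cross"]
scores = np.array([4.0, 1.0, 0.5])

weights = softmax(scores)
for word, w in zip(words, weights):
    print(f"{word:>7}: {w:.1%}")
```

The weights are positive and sum to 1, and the word with the highest score receives most of the attention.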

In [None]:
# TODO: Implement simple attention mechanism

def simple_attention(Q, K, V):
    """
    Compute scaled dot-product attention
    
    Args:
        Q: Query matrix (seq_len, d_k)
        K: Key matrix (seq_len, d_k)
        V: Value matrix (seq_len, d_v)
    
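    Returns:
        output: Weighted sum of values (seq_len, d_v)
        attention_weights: Softmax of Q·K^T / sqrt(d_k), shape (seq_len, seq_len)
    """
    d_k = K.shape[-1]
    
    # Compute attention scores
    scores = np.dot(Q, K.T) / np.sqrt(d_k)
    
    # Apply softmax row-wise; subtracting the row max avoids overflow in exp
    exp_scores = np.exp(scores - np.max(scores, axis=-1, keepdims=True))
    attention_weights = exp_scores / np.sum(exp_scores, axis=-1, keepdims=True)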
    
    # Weighted sum of values
    output = np.dot(attention_weights, V)
    
    return output, attention_weights

# TODO: Add visualization of attention weights as heatmap

## SECTION 3: Self-Attention - Looking Within

### TODO: Explain self-attention vs regular attention

**Self-Attention**: Each word attends to every word in the same sentence, including itself. Q, K, and V are all computed from the same input sequence.

### Multi-Head Attention

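Instead of one attention mechanism, use **multiple** in parallel. Each head has its own learned projections and can specialize. The roles below are illustrative: specialization emerges during training and is not assigned by hand.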
- Head 1: Learns syntactic relationships (subject-verb)
- Head 2: Learns semantic relationships (noun-adjective)
- Head 3: Learns long-range dependencies
- ...
- Head 8: Learns positional patterns

$$
\text{MultiHead}(Q,K,V) = \text{Concat}(\text{head}_1, ..., \text{head}_h)W^O
$$

where each $\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)$
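
The formula above can be sketched in NumPy as follows. This is a minimal illustration rather than a reference implementation: it uses random weights, no masking, and no dropout. It also stores all per-head projections $W_i^Q$ side by side in one `(d_model, d_model)` matrix `W_q` (likewise for `W_k` and `W_v`), which is a common layout but an assumption here. All function and variable names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads):
    """Self-attention over X of shape (seq_len, d_model) using n_heads heads."""
    seq_len, d_model = X.shape
    d_head = d_model // n_heads

    def split_heads(M):
        # (seq_len, d_model) -> (n_heads, seq_len, d_head)
        return M.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split_heads(X @ W_q), split_heads(X @ W_k), split_heads(X @ W_v)

    # Scaled dot-product attention, computed for all heads at once
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)   # (n_heads, seq_len, seq_len)
    weights = softmax(scores, axis=-1)
    heads = weights @ V                                    # (n_heads, seq_len, d_head)

    # Concat(head_1, ..., head_h) followed by the output projection W^O
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o, weights

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 5, 16, 4
X = rng.standard_normal((seq_len, d_model))
W_q, W_k, W_v, W_o = (rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(4))

out, weights = multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads)
print(out.shape, weights.shape)
```

Each head works in a smaller subspace of size `d_model // n_heads`, so the total cost stays close to that of a single full-width head.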

In [None]:
# TODO: Implement multi-head attention
# TODO: Visualize what different heads learn

## SECTION 4: Positional Encoding

### TODO: Explain why we need position information

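**Problem**: Attention has no inherent notion of order!
- Without position information, "Dog bites man" and "Man bites dog" contain the same set of tokens, so attention computes the same output for each word, just arranged in a different order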
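
This order-blindness can be verified with a small experiment. The sketch below uses random vectors as stand-in word embeddings and a simplified self-attention with Q = K = V = X (no learned projections); both are assumptions made to keep the example short.

```python
import numpy as np

def self_attention(X):
    """Self-attention with Q = K = V = X (no learned projections, for brevity)."""
    scores = X @ X.T / np.sqrt(X.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ X

rng = np.random.default_rng(0)
# Hypothetical 4-dimensional embeddings for three words
emb = {w: rng.standard_normal(4) for w in ("dog", "bites", "man")}

s1 = np.stack([emb[w] for w in ("dog", "bites", "man")])
s2 = np.stack([emb[w] for w in ("man", "bites", "dog")])

out1, out2 = self_attention(s1), self_attention(s2)

# The output for "dog" is the same in both sentences, wherever it appears
print(np.allclose(out1[0], out2[2]))
```

Reordering the input only reorders the output rows, so the model cannot tell who bit whom unless position is injected some other way.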

**Solution**: Add positional information to embeddings

$$
\begin{align}
PE_{(pos, 2i)} &= \sin\left(\frac{pos}{10000^{2i/d}}\right) \\
PE_{(pos, 2i+1)} &= \cos\left(\frac{pos}{10000^{2i/d}}\right)
\end{align}
$$

Where:
- **pos**: Position in sequence (0, 1, 2, ...)
- **i**: Index of the sine/cosine dimension pair, $0 \le i < d/2$
- **d**: Total embedding dimension (`d_model` in the code)
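
As a worked example, the two formulas can be evaluated one entry at a time. The sketch below is a direct, unvectorized transcription meant for checking values by hand; the helper name `pe_value` and the small `d_model = 8` are assumptions for illustration.

```python
import math

def pe_value(pos, dim, d_model):
    """Positional encoding for one position and one embedding dimension."""
    i = dim // 2                                  # index of the sin/cos pair
    angle = pos / (10000 ** (2 * i / d_model))
    # Even dimensions use sine, odd dimensions use cosine
    return math.sin(angle) if dim % 2 == 0 else math.cos(angle)

d_model = 8
for pos in range(3):
    row = [pe_value(pos, dim, d_model) for dim in range(d_model)]
    print(pos, [f"{v:+.3f}" for v in row])
```

At `pos = 0` every sine entry is 0 and every cosine entry is 1. Low dimensions change quickly with position, while high dimensions change slowly, which gives each position a distinct pattern.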

In [None]:
# TODO: Implement positional encoding

def positional_encoding(max_len, d_model):
    """
    Generate sinusoidal positional encodings
    
    Args:
        max_len: Maximum sequence length
        d_model: Embedding dimension
    
    Returns:
        PE: Positional encoding matrix (max_len, d_model)
    """
    pe = np.zeros((max_len, d_model))
    position = np.arange(0, max_len).reshape(-1, 1)
    div_term = np.exp(np.arange(0, d_model, 2) * -(np.log(10000.0) / d_model))
    
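    pe[:, 0::2] = np.sin(position * div_term)
    # With an odd d_model there is one fewer cosine column than sine columns, so trim div_term
    pe[:, 1::2] = np.cos(position * div_term[: d_model // 2])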
    
    return pe

# TODO: Visualize positional encodings as heatmap

## SECTION 5: The Complete Transformer Architecture

### TODO: Build complete transformer block

```
┌─────────────────────────────────────┐
│         TRANSFORMER BLOCK           │
├─────────────────────────────────────┤
│                                     │
│  1. Input Embeddings                │
│     + Positional Encoding           │
│                                     │
│  2. Multi-Head Self-Attention       │
│     + Residual Connection           │
│     + Layer Normalization           │
│                                     │
│  3. Feed-Forward Network            │
│     (2-layer MLP with ReLU)         │
│     + Residual Connection           │
│     + Layer Normalization           │
│                                     │
│  [Repeat steps 2-3 N times]         │
│                                     │
│  4. Output Layer                    │
│     (Linear + Softmax)              │
│                                     │
└─────────────────────────────────────┘
```

### Key Components:

1. **Multi-Head Attention**: Process relationships between all tokens
2. **Feed-Forward Network**: Transform representations
3. **Layer Normalization**: Stabilize training
4. **Residual Connections**: Enable deep networks (100+ layers)
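
The four components can be combined into a single encoder block as sketched below. This is a simplified NumPy illustration under several assumptions: one attention head instead of several, random weights, no dropout, no attention mask, and layer normalization without learned scale and shift. The parameter names in the dictionary `p` are hypothetical.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-5):
    """Normalize each token vector to zero mean and unit variance (no learned scale/shift)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def encoder_block(x, p):
    """x: (seq_len, d_model); p: dict of weight matrices (hypothetical names)."""
    # Self-attention sub-layer (single head for brevity)
    Q, K, V = x @ p["W_q"], x @ p["W_k"], x @ p["W_v"]
    attn = softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V
    x = layer_norm(x + attn @ p["W_o"])          # residual connection + layer norm

    # Position-wise feed-forward sub-layer: 2-layer MLP with ReLU
    hidden = np.maximum(0.0, x @ p["W_1"] + p["b_1"])
    ffn = hidden @ p["W_2"] + p["b_2"]
    return layer_norm(x + ffn)                   # residual connection + layer norm

rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 6, 16, 64
shapes = {
    "W_q": (d_model, d_model), "W_k": (d_model, d_model),
    "W_v": (d_model, d_model), "W_o": (d_model, d_model),
    "W_1": (d_model, d_ff), "W_2": (d_ff, d_model),
}
p = {name: rng.standard_normal(shape) * 0.1 for name, shape in shapes.items()}
p["b_1"], p["b_2"] = np.zeros(d_ff), np.zeros(d_model)

x = rng.standard_normal((seq_len, d_model))
for _ in range(2):   # stack N = 2 blocks; weights are reused here only to keep the sketch short
    x = encoder_block(x, p)
print(x.shape)
```

Because every sub-layer maps `(seq_len, d_model)` to `(seq_len, d_model)`, blocks can be stacked freely. A real model would give each block its own weights.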

In [None]:
# TODO: Implement complete transformer block
# TODO: Use PyTorch or implement from scratch

class TransformerBlock:
    """Single transformer encoder block"""
    
    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
        """
        Args:
            d_model: Embedding dimension
            n_heads: Number of attention heads
            d_ff: Feed-forward hidden dimension
            dropout: Dropout probability
        """
        # TODO: Initialize layers
        pass
    
    def forward(self, x, mask=None):
        """
        Forward pass through transformer block
        
        Args:
            x: Input tensor (batch_size, seq_len, d_model)
            mask: Attention mask (optional)
        
        Returns:
            Output tensor (batch_size, seq_len, d_model)
        """
        # TODO: Implement forward pass
        pass

## SECTION 6: Training and Scaling

### TODO: Discuss training transformers

### From Small to Large

| Model | Parameters | Layers | Hidden Size | Heads |
|-------|-----------|---------|-------------|-------|
| BERT-Base | 110M | 12 | 768 | 12 |
| GPT-2 (XL) | 1.5B | 48 | 1600 | 25 |
| GPT-3 | 175B | 96 | 12288 | 96 |
| GPT-4 | Not disclosed (unofficial estimates of ~1.7T) | ? | ? | ? |
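
The parameter counts in the table can be roughly reproduced from the layer count and hidden size alone. The sketch below uses a common rule of thumb of about $12 \cdot d_{model}^2$ parameters per layer, assuming a feed-forward width of $4 \cdot d_{model}$ and ignoring embeddings, biases, and layer norms. The function name `approx_params` is hypothetical.

```python
def approx_params(n_layers, d_model):
    """Rough non-embedding parameter count: about 12 * d_model^2 per layer.

    Per layer: 4 * d_model^2 for the attention projections (Q, K, V, output)
    plus 8 * d_model^2 for a feed-forward network with d_ff = 4 * d_model.
    Biases, layer norms, and embeddings are ignored.
    """
    return 12 * n_layers * d_model ** 2

models = {
    "BERT-Base": (12, 768),
    "GPT-2": (48, 1600),
    "GPT-3": (96, 12288),
}
for name, (n_layers, d_model) in models.items():
    billions = approx_params(n_layers, d_model) / 1e9
    print(f"{name:10s} ~{billions:.2f}B non-embedding parameters")
```

The estimates land close to the published totals for GPT-2 and GPT-3. The estimate is noticeably lower for BERT-Base, mainly because its embedding tables are excluded here.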

### Key Training Techniques:
1. **Pre-training**: Learn from massive unlabeled text
2. **Fine-tuning**: Adapt to specific tasks
3. **RLHF** (Reinforcement Learning from Human Feedback)

### Computational Requirements:
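- **GPT-3 training**: ~$4.6 million in cloud compute (an unofficial third-party estimate)
- **Training time**: Weeks to months on thousands of GPUs
- **Dataset**: ~45TB of compressed Common Crawl text before filtering (about 570GB after filtering), plus WebText2, books, and Wikipedia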

## SECTION 7: Practical Example - Mini Language Model

### TODO: Implement tiny GPT-style model

Build a character-level language model that:
1. Learns from Shakespeare text
2. Generates new text in similar style
3. Uses 2-layer transformer
4. ~1M parameters (tiny!)

This will demonstrate:
- Text tokenization
- Training loop
- Text generation
- Attention visualization
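
As a starting point for the tokenization step, here is a minimal character-level tokenizer sketch. The short string stands in for the Shakespeare corpus, and the names `stoi`, `itos`, `encode`, `decode`, and `block_size` are assumptions chosen for this illustration.

```python
text = "To be, or not to be"   # stand-in for the Shakespeare corpus

# Vocabulary: every distinct character, in sorted order
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}   # character -> integer id
itos = {i: ch for ch, i in stoi.items()}       # integer id -> character

def encode(s):
    """Map a string to a list of integer token ids."""
    return [stoi[ch] for ch in s]

def decode(ids):
    """Map a list of integer token ids back to a string."""
    return "".join(itos[i] for i in ids)

ids = encode("to be")
print(ids)
print(decode(ids))

# Training pair for next-character prediction: the target is the input shifted by one
data = encode(text)
block_size = 8
x, y = data[:block_size], data[1:block_size + 1]
```

Character-level vocabularies are tiny, which keeps the embedding table small and suits a model of this size.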

In [None]:
# TODO: Implement mini GPT
# TODO: Train on small dataset
# TODO: Generate text
# TODO: Visualize attention patterns

## SECTION 8: Modern Applications

### TODO: Show real-world transformer applications

### Language Models
- **GPT-4**: Text generation, reasoning, coding
- **Claude**: Long-context understanding (100K+ tokens)
- **BERT**: Search, question answering

### Beyond Text
- **Vision Transformers (ViT)**: Image classification
- **DALL-E**: Text-to-image generation  
- **AlphaFold**: Protein structure prediction
- **Whisper**: Speech recognition

### Multimodal Models
- **GPT-4V**: Vision + language
- **Gemini**: Text + image + video + audio

### The Future
- Longer contexts (1M+ tokens)
- More efficient architectures
- Reasoning and planning capabilities
- Specialized domain models

## Key Takeaways

### The Transformer Revolution

1. **Attention Is All You Need** (2017 paper title)
   - Self-attention replaces recurrence
   - Parallel processing enables scaling

2. **Core Innovations**:
   - **Self-attention**: Model relationships between all positions
   - **Multi-head attention**: Learn different relationship types
   - **Positional encoding**: Preserve sequence order
   - **Residual connections**: Enable deep networks

3. **Why Transformers Dominate**:
   - Scale efficiently to billions of parameters
   - Transfer learning across tasks
   - Capture long-range dependencies
   - Generalize beyond training data

4. **From Foundations to GPT**:
   - **Notebooks 1-3**: Linear models, optimization basics
   - **Notebook 4**: Neural networks with non-linearity
   - **Notebook 5**: Transformers - attention + scale = modern AI

---

## What's Next?

### Explore Further:
- **Hugging Face**: Pre-trained transformer models
- **Andrej Karpathy's nanoGPT**: Minimal GPT implementation
- **The Illustrated Transformer**: Jay Alammar's visual guide
- **Attention Is All You Need**: Original 2017 paper

### Build Something:
- Fine-tune a model on your data
- Build a chatbot with GPT
- Create custom embeddings
- Experiment with vision transformers

---

**Congratulations!** You've journeyed from linear regression to the architecture powering ChatGPT. You now understand the foundations of modern AI! 🎉

## Additional Resources

### Papers
- [Attention Is All You Need](https://arxiv.org/abs/1706.03762) (Vaswani et al., 2017)
- [BERT: Pre-training of Deep Bidirectional Transformers](https://arxiv.org/abs/1810.04805)
- [Language Models are Few-Shot Learners](https://arxiv.org/abs/2005.14165) (GPT-3)

### Tutorials
- [The Illustrated Transformer](http://jalammar.github.io/illustrated-transformer/)
- [The Annotated Transformer](http://nlp.seas.harvard.edu/annotated-transformer/)
- [Hugging Face Course](https://huggingface.co/course)

### Code Repositories
- [nanoGPT](https://github.com/karpathy/nanoGPT) - Minimal GPT implementation
- [minGPT](https://github.com/karpathy/minGPT) - Educational GPT
- [Transformers Library](https://github.com/huggingface/transformers)

### Videos
- [Andrej Karpathy: Let's Build GPT](https://www.youtube.com/watch?v=kCc8FmEb1nY)
- [3Blue1Brown: Attention in Transformers](https://www.youtube.com/watch?v=eMlx5fFNoYc)