In [None]:
# 🔧 Setup: Run this cell first!
# Check GPU availability, print library versions, and set random seeds

import torch
import sys

# Check GPU
if torch.cuda.is_available():
    device = torch.device('cuda')
    print(f"✅ GPU available: {torch.cuda.get_device_name(0)}")
    print(f"   Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
else:
    device = torch.device('cpu')
    print("⚠️ No GPU detected. Some cells may run slowly.")
    print("   Go to Runtime → Change runtime type → GPU")

print(f"\n📦 Python {sys.version.split()[0]}")
print(f"🔥 PyTorch {torch.__version__}")

# Set random seeds for reproducibility
import random
import numpy as np

SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(SEED)

print(f"🎲 Random seed set to {SEED}")

%matplotlib inline

# Self-Attention from First Principles — Notebook Series

## Vizuara Learning Path

Welcome to the **Self-Attention from First Principles** notebook series. This is Part 3 of the **Build LLM from Scratch** course at Vizuara.

In these four notebooks, you will build every component of the Transformer's self-attention mechanism from scratch, verify each piece with numerical examples, and train models for next-word prediction and text classification.

## Series Overview

| # | Notebook | What You Build | Time |
|---|----------|---------------|------|
| 1 | **QKV Projections** | Query, Key, Value weight matrices; manual vs PyTorch computation | ~35 min |
| 2 | **Scaled Dot-Product Attention** | The core attention formula; causal masking; next-word prediction | ~45 min |
| 3 | **Multi-Head Attention & Positional Encoding** | Parallel heads; sinusoidal PE; visualizing head specialization | ~50 min |
| 4 | **Full Transformer Block** | Layer norm, FFN, residual connections; text classification | ~55 min |

**Total estimated time:** ~3 hours

## Prerequisites

- Basic Python and PyTorch (tensors, `nn.Module`, `nn.Linear`)
- Matrix multiplication and dot products
- Understanding of neural network training (loss, optimizer, backprop)
- The earlier Foundations of Language Modeling notebook from the course is helpful but not required

## How to Use These Notebooks

1. **Run in Google Colab** with a T4 GPU (free tier is fine)
2. **Read the explanations carefully** before running each code cell
3. **Complete the TODO sections** — they are designed to test your understanding
4. **Check the visualizations** at each checkpoint — they confirm your implementation is correct

## Quick Links

In [None]:
# Run this cell to see the notebook filenames
notebooks = [
    ("01_qkv_projections.ipynb", "QKV Projections from Scratch"),
    ("02_scaled_dot_product_attention.ipynb", "Scaled Dot-Product Attention"),
    ("03_multi_head_attention_and_positional_encoding.ipynb", "Multi-Head Attention & Positional Encoding"),
    ("04_full_transformer_block.ipynb", "Full Transformer Block"),
]

print("Self-Attention from First Principles — Notebook Series")
print("=" * 55)
for filename, title in notebooks:
    print(f"  {filename}")
    print(f"    -> {title}")
    print()

## What You Will Build

By the end of this series, you will have implemented:

- **QKV Projection:** $Q = XW^Q, \; K = XW^K, \; V = XW^V$
- **Scaled Dot-Product Attention:** $\text{Attention}(Q,K,V) = \text{softmax}(QK^T / \sqrt{d_k}) V$
- **Multi-Head Attention:** Parallel heads with concatenation and output projection
- **Sinusoidal Positional Encoding:** $PE_{(pos, 2i)} = \sin(pos / 10000^{2i/d}), \; PE_{(pos, 2i+1)} = \cos(pos / 10000^{2i/d})$
- **A Complete Transformer Block:** Attention + Add & Norm + FFN + Add & Norm
- **A Transformer Classifier:** A text classifier trained on real data
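
To make the first two formulas concrete before opening Notebook 1, here is a minimal sketch of QKV projection followed by causal scaled dot-product attention for a single sequence. The toy sizes (`seq_len`, `d_model`, `d_k`) and the variable names are assumptions chosen for illustration; they are not taken from the notebooks, which may structure the code differently.

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy sizes, chosen only for illustration (assumption, not from the notebooks).
seq_len, d_model, d_k = 4, 8, 8

# One sequence of token embeddings: (seq_len, d_model)
X = torch.randn(seq_len, d_model)

# Bias-free linear layers play the role of W^Q, W^K, W^V.
# nn.Linear computes X @ weight.T, so each .weight stores the transposed matrix.
W_q = nn.Linear(d_model, d_k, bias=False)
W_k = nn.Linear(d_model, d_k, bias=False)
W_v = nn.Linear(d_model, d_k, bias=False)

Q, K, V = W_q(X), W_k(X), W_v(X)   # each: (seq_len, d_k)

# Scaled dot-product scores: QK^T / sqrt(d_k) -> (seq_len, seq_len)
scores = Q @ K.T / math.sqrt(d_k)

# Causal mask: True above the diagonal, so position i cannot attend to j > i.
causal_mask = torch.triu(
    torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1
)
scores = scores.masked_fill(causal_mask, float("-inf"))

# Softmax over the key dimension turns each row into attention weights.
weights = torch.softmax(scores, dim=-1)

# Weighted sum of the value vectors: (seq_len, d_k)
output = weights @ V

print(weights.shape, output.shape)
```

Each row of `weights` sums to 1, and every entry above the diagonal is exactly 0 because the masked scores are set to negative infinity before the softmax.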

Every equation is verified with code. Every component is visualized. Every concept is grounded in intuition.
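
As one example of what such a check can look like, the sinusoidal positional encoding can be sketched in a few lines. This assumes the standard pairing in which even dimensions use the sine and odd dimensions use the cosine of the same angle, and it assumes `d_model` is even. The function name and arguments are hypothetical and may differ from the implementation in Notebook 3.

```python
import math
import torch


def sinusoidal_positional_encoding(max_len: int, d_model: int) -> torch.Tensor:
    """Return a (max_len, d_model) encoding table; d_model is assumed even."""
    # Column vector of positions 0 .. max_len-1: (max_len, 1)
    position = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)

    # The even dimension indices 2i = 0, 2, 4, ...
    two_i = torch.arange(0, d_model, 2, dtype=torch.float32)

    # 1 / 10000^(2i/d), computed through exp/log for numerical stability
    inv_freq = torch.exp(-math.log(10000.0) * two_i / d_model)

    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position * inv_freq)  # even dimensions
    pe[:, 1::2] = torch.cos(position * inv_freq)  # odd dimensions
    return pe


pe = sinusoidal_positional_encoding(max_len=50, d_model=16)
print(pe.shape)
print(pe[0, :4])
```

At position 0 every angle is zero, so the first row alternates between 0 (sine) and 1 (cosine), which makes for a quick sanity check.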

Let us get started!