In [None]:
# 🔧 Setup: Run this cell first!
# Check GPU availability and set random seeds (this cell installs nothing)

import torch
import sys

# Check GPU
if torch.cuda.is_available():
    device = torch.device('cuda')
    print(f"✅ GPU available: {torch.cuda.get_device_name(0)}")
    print(f"   Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
else:
    device = torch.device('cpu')
    print("⚠️ No GPU detected. Some cells may run slowly.")
    print("   Go to Runtime → Change runtime type → GPU")

print(f"\n📦 Python {sys.version.split()[0]}")
print(f"🔥 PyTorch {torch.__version__}")

# Set random seeds for reproducibility
import random
import numpy as np

SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(SEED)

print(f"🎲 Random seed set to {SEED}")

%matplotlib inline

# Inference and Scaling -- Notebook Index

*Build LLM from Scratch, Pod 5*

---

Welcome to the hands-on notebooks for **Inference and Scaling**. This pod covers everything that happens after training: how to generate text efficiently, how to control output quality, how to build a production inference engine, and how to adapt pretrained models to new tasks.

These four notebooks follow a natural progression. Each one builds on the previous, and by the end you will have a complete, working inference system with LoRA fine-tuning.

## Notebook 1: Autoregressive Generation and the KV Cache

**File:** `01_autoregressive_generation_kv_cache.ipynb`
**Estimated time:** 55 minutes

You start with the most fundamental question: how does a language model actually generate text? You will build naive autoregressive generation from scratch, measure how much computation it repeats at every step, derive from the attention equations why caching keys and values is valid, and implement a KV cache that removes the repeated work. By the end, you will understand why token-by-token decoding is typically limited by memory bandwidth rather than by compute.

**Key concepts:**
- Autoregressive token-by-token generation
- Redundant computation in naive generation
- KV cache derivation from causal attention
- Memory-compute tradeoff analysis
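
The redundancy, and how a cache removes it, can be seen in a toy single-head attention layer. The sketch below is a minimal pure-Python illustration, not the notebook's implementation: the helper names (`matvec`, `attend`, `naive_step`, `cached_step`), the random weights, and the tiny dimensions are all assumptions chosen for readability.

```python
import math
import random

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(q, keys, values):
    """Scaled dot-product attention for one query over all given positions."""
    scale = 1.0 / math.sqrt(len(q))
    weights = softmax([scale * sum(a * b for a, b in zip(q, k)) for k in keys])
    return [sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))]

def naive_step(prefix, Wq, Wk, Wv):
    """Recompute K and V for every token in the prefix, then attend from the last one."""
    keys = [matvec(Wk, x) for x in prefix]
    values = [matvec(Wv, x) for x in prefix]
    return attend(matvec(Wq, prefix[-1]), keys, values)

def cached_step(x, cache, Wq, Wk, Wv):
    """Project only the newest token and append its K and V to the cache."""
    cache["k"].append(matvec(Wk, x))
    cache["v"].append(matvec(Wv, x))
    return attend(matvec(Wq, x), cache["k"], cache["v"])

rng = random.Random(0)
d, n_tokens = 4, 6  # hypothetical toy sizes
Wq, Wk, Wv = ([[rng.gauss(0, 1) for _ in range(d)] for _ in range(d)]
              for _ in range(3))
tokens = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n_tokens)]

cache = {"k": [], "v": []}
naive_outputs, cached_outputs = [], []
naive_projections = cached_projections = 0
for t in range(1, n_tokens + 1):
    naive_outputs.append(naive_step(tokens[:t], Wq, Wk, Wv))
    naive_projections += t   # K/V recomputed for all t prefix tokens
    cached_outputs.append(cached_step(tokens[t - 1], cache, Wq, Wk, Wv))
    cached_projections += 1  # K/V computed once, for the new token only

print(naive_projections, cached_projections)  # 21 6
```

Both paths produce the same attention outputs because, under causal attention, the keys and values of earlier tokens never depend on later ones. The naive path projects 1 + 2 + ... + n positions over n steps, while the cached path projects n, at the cost of keeping every key and value in memory.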

## Notebook 2: Sampling Strategies

**File:** `02_sampling_strategies.ipynb`
**Estimated time:** 50 minutes

With efficient generation in hand, you tackle the next question: how do you pick the right token? You will implement greedy decoding and see it degenerate into repetition. Then you will build temperature scaling, top-k sampling, and top-p (nucleus) sampling from scratch, visualize how each reshapes the probability distribution, and develop intuition for when to use each strategy.

**Key concepts:**
- Greedy decoding and its failure modes
- Temperature scaling and its effect on distributions
- Top-k sampling with fixed vocabulary cutoff
- Top-p (nucleus) sampling with adaptive cutoff
- Comparing sampling strategies on real generation tasks
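
As a rough preview of how these strategies reshape a distribution, here is a minimal pure-Python sketch. The function names (`softmax`, `top_k_filter`, `top_p_filter`, `sample`) and the four-token logits are hypothetical and chosen for illustration; the notebook's own code may be organized differently.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; temperature < 1 sharpens, > 1 flattens."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def renormalize(probs, keep):
    """Zero out every index not in `keep` and rescale the rest to sum to 1."""
    keep = set(keep)
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]

def top_k_filter(probs, k):
    """Keep the k most probable tokens (fixed cutoff)."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return renormalize(probs, order[:k])

def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose total mass reaches p (adaptive cutoff)."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return renormalize(probs, keep)

def sample(probs, rng):
    """Draw one token index according to probs."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical logits over a 4-token vocabulary
probs = softmax(logits)
greedy_token = max(range(len(probs)), key=lambda i: probs[i])  # always index 0 here
sharp = softmax(logits, temperature=0.5)  # more mass on the top token
flat = softmax(logits, temperature=2.0)   # closer to uniform
top2 = top_k_filter(probs, k=2)           # only tokens 0 and 1 survive
nucleus = top_p_filter(probs, p=0.9)      # tokens 0, 1, 2 survive
sampled_token = sample(nucleus, random.Random(42))
```

Note the difference in cutoffs: top-k always keeps exactly k tokens, whereas top-p keeps fewer tokens when the distribution is peaked and more when it is flat.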

## Notebook 3: Complete Inference Engine

**File:** `03_complete_inference_engine.ipynb`
**Estimated time:** 60 minutes

Now you combine everything into a production-grade inference pipeline. You will build a complete generate function with KV cache, configurable sampling, and stopping criteria. You will benchmark throughput, measure latency at different sequence lengths, and understand the system-level considerations that matter when serving LLMs in production.

**Key concepts:**
- End-to-end generation loop with KV cache and sampling
- Batched inference for throughput
- Latency profiling and bottleneck analysis
- Stopping criteria and special token handling
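
The overall shape of such a loop can be sketched without a real model. In the example below, `toy_logits` is a hypothetical stand-in for a model forward pass (it simply favors the next integer token), and `generate`, `eos_token_id`, and `vocab_size` are illustrative names rather than the notebook's API. A real engine would replace `toy_logits` with a cached forward pass and the greedy `max` with a configurable sampler.

```python
import time

def toy_logits(token_ids, vocab_size):
    """Hypothetical stand-in for a model: favors (last_token + 1) % vocab_size."""
    target = (token_ids[-1] + 1) % vocab_size
    return [5.0 if i == target else 0.0 for i in range(vocab_size)]

def generate(prompt_ids, max_new_tokens, eos_token_id, vocab_size=8):
    """Greedy decoding loop with two stopping criteria: EOS and a token budget."""
    token_ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = toy_logits(token_ids, vocab_size)
        next_id = max(range(vocab_size), key=lambda i: logits[i])
        token_ids.append(next_id)
        if next_id == eos_token_id:  # stop as soon as the end-of-sequence token appears
            break
    return token_ids

start = time.perf_counter()
output = generate([0], max_new_tokens=10, eos_token_id=4)
elapsed = max(time.perf_counter() - start, 1e-9)  # guard against a zero interval

print(output)  # [0, 1, 2, 3, 4]
print(f"{(len(output) - 1) / elapsed:.0f} new tokens/s")

truncated = generate([0], max_new_tokens=2, eos_token_id=4)
print(truncated)  # [0, 1, 2]
```

The first call stops because the EOS token is produced; the second stops because the token budget runs out. Timing the loop with `time.perf_counter` gives a crude throughput figure, which is the starting point for the latency profiling listed above.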

## Notebook 4: LoRA Fine-Tuning

**File:** `04_lora_finetuning.ipynb`
**Estimated time:** 60 minutes

Finally, you learn how to adapt a pretrained model to new tasks without retraining all the parameters. You will implement LoRA (Low-Rank Adaptation) from scratch, examine why fine-tuning weight updates are often well approximated by low-rank matrices, apply LoRA adapters to a transformer, and fine-tune on a downstream task. You will see how training a small fraction of the parameters (on the order of 0.1%) can come close to full fine-tuning performance.

**Key concepts:**
- The low-rank structure of weight updates
- LoRA decomposition: freezing W, training B and A
- Applying LoRA to attention projections
- Parameter-efficient fine-tuning in practice
- Comparing LoRA to full fine-tuning
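
The decomposition can be illustrated in a few lines of pure Python. This is a sketch under stated assumptions, not the notebook's implementation: the names `lora_forward` and `lora_param_count`, the scaling factor `alpha / r`, the toy dimensions, and the 4096 x 4096 example matrix are all chosen for illustration. The initialization follows the common convention of a small random `A` and an all-zero `B`, so the adapted layer starts out identical to the frozen one.

```python
import random

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def lora_forward(W, A, B, x, alpha):
    """h = W x + (alpha / r) * B (A x), with W frozen and only A and B trainable."""
    r = len(A)  # rank = number of rows of A
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + (alpha / r) * u for b, u in zip(base, update)]

def lora_param_count(d_out, d_in, r):
    """Trainable parameters in A (r x d_in) plus B (d_out x r)."""
    return r * (d_in + d_out)

rng = random.Random(0)
d_out, d_in, r = 6, 5, 2  # hypothetical toy sizes
W = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]  # frozen weight
A = [[rng.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]   # small random init
B = [[0.0] * r for _ in range(d_out)]                               # zero init
x = [rng.gauss(0, 1) for _ in range(d_in)]

h_initial = lora_forward(W, A, B, x, alpha=4.0)  # equals W x because B is zero

full = 4096 * 4096                       # one hypothetical projection matrix
lora = lora_param_count(4096, 4096, r=8)
print(lora, full, f"{100 * lora / full:.2f}%")  # 65536 16777216 0.39%
```

The fraction printed here is for a single adapted matrix. The fraction of the whole model is smaller when only some matrices (for example, the attention projections) receive adapters, and it shrinks further at lower rank.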

## Suggested Approach

1. Work through the notebooks in order. Each one assumes familiarity with the previous material.
2. Run every code cell. The computations are designed to run on a free Colab T4 GPU.
3. Complete the TODO exercises before looking at the solutions. The exercises are where the real learning happens.
4. Pay attention to the numerical examples. They build intuition that the theory alone cannot provide.

## Prerequisites

- Familiarity with the Transformer architecture (self-attention, multi-head attention, layer normalization)
- Basic PyTorch fluency (tensors, modules, autograd)
- Completion of Pods 1-4 in this course (or equivalent knowledge)

In [None]:
# Quick environment check
import torch
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
print("\nYou are ready to begin. Open Notebook 01 to start.")