# PyTorch Handbook

A complete guide to PyTorch, from fundamentals to LLM training.


In [None]:
# ─────────────────────────────────────────────────────────────────────────────
# INSTALL DEPENDENCIES
# ─────────────────────────────────────────────────────────────────────────────

%pip install torch numpy transformers --quiet


---

## Table of Contents

### [1. Core PyTorch Fundamentals](#1-core-pytorch-fundamentals)
- [1.1 Tensors](#11-tensors)
- [1.2 Autograd and Computation Graphs](#12-autograd-and-computation-graphs)
- [1.3 Modules and Parameters](#13-modules-and-parameters)
- [1.4 Loss Functions and Optimizers](#14-loss-functions-and-optimizers)

### [2. Training Mechanics](#2-training-mechanics)
- [2.1 Training Loop Anatomy](#21-training-loop-anatomy)
- [2.2 Datasets and DataLoader](#22-datasets-and-dataloader)
- [2.3 Device Management](#23-device-management)

### [3. PyTorch Building Blocks for Deep Learning](#3-pytorch-building-blocks-for-deep-learning)
- [3.1 Layers and Activations](#31-layers-and-activations)
- [3.2 Normalization](#32-normalization)
- [3.3 Initialization](#33-initialization)

### [4. PyTorch for NLP and Sequence Models](#4-pytorch-for-nlp-and-sequence-models)
- [4.1 Embeddings and Vocab](#41-embeddings-and-vocab)
- [4.2 Attention from Scratch](#42-attention-from-scratch)
- [4.3 Transformer Blocks](#43-transformer-blocks)
- [4.4 Tokenization Ecosystem](#44-tokenization-ecosystem)

### [5. Training Language Models](#5-training-language-models)
- [5.1 Autoregressive Training](#51-autoregressive-training)
- [5.2 Mixed Precision and Performance](#52-mixed-precision-and-performance)
- [5.3 Gradient Control](#53-gradient-control)

### [6. Generation and Inference](#6-generation-and-inference)
- [6.1 Decoding Mechanics](#61-decoding-mechanics)
- [6.2 KV Caching](#62-kv-caching)

### [7. Ecosystem and Real-world PyTorch](#7-ecosystem-and-real-world-pytorch)
- [7.1 Hugging Face Transformers](#71-hugging-face-transformers)
- [7.2 Model Saving and Loading](#72-model-saving-and-loading)
- [7.3 Distributed Training Concepts](#73-distributed-training-concepts)

### [8. Final Notes](#8-final-notes)
- [8.1 Common Questions & Answers](#81-common-questions--answers)
- [8.2 Quick Reference Card](#82-quick-reference-card)

### [9. PyTorch Examples](#9-pytorch-examples)
- [9.1 Basic Neural Network Training](#91-basic-neural-network-training)
- [9.2 Transformer Language Model on MPS](#92-transformer-language-model-on-mps)

---

*This section covers the PyTorch equivalents of the TensorFlow basics, at a faster pace.*

## 1. Core PyTorch Fundamentals

### 1.1 Tensors

Tensors are the fundamental data structure in PyTorch—multi-dimensional arrays that can run on GPUs and track gradients for automatic differentiation.

**Key Concepts:**

| Concept | Description |
|---------|-------------|
| `torch.tensor()` | Creates tensor from data, infers dtype |
| `torch.Tensor()` | Class constructor, uses the default dtype (float32 unless changed) |
| `dtype` | Data type (float32, int64, bool, etc.) |
| `device` | Where tensor lives (cpu, cuda, mps) |
| `requires_grad` | Enable gradient tracking |

**Why this matters:** Nearly everything in PyTorch operates on tensors. Gaps in understanding here resurface in every later topic.

In [1]:
import torch
import numpy as np

# ─────────────────────────────────────────────────────────────────────────────
# TENSOR CREATION
# ─────────────────────────────────────────────────────────────────────────────

# torch.tensor() - Creates tensor from data, infers dtype
a = torch.tensor([1, 2, 3])           # int64 inferred
b = torch.tensor([1.0, 2.0, 3.0])     # float32 inferred
c = torch.tensor([[1, 2, 3], [3, 4, 5]])    # 2D tensor

print(f"a: {a}, dtype: {a.dtype}")
print(f"b: {b}, dtype: {b.dtype}")
print(f"c shape: {c.shape}")

# torch.Tensor() - Class constructor, uses the default dtype (float32 unless changed via torch.set_default_dtype)
d = torch.Tensor([1, 2, 3])           # float32 here, even though the data is integers
print(f"d: {d}, dtype: {d.dtype}")

# Explicit dtype specification
e = torch.tensor([1, 2, 3], dtype=torch.float64)
print(f"e dtype: {e.dtype}")

# What does the '.' in '1.' do? It makes the literal a Python float.
a = torch.tensor([1])   # integer data -> int64
b = torch.tensor([1.])  # float data   -> float32
print(a.dtype)  # torch.int64
print(b.dtype)  # torch.float32
# Note: only floating-point (and complex) tensors can require gradients,
# so an integer tensor built from [1] cannot track gradients, while [1.] can.

a: tensor([1, 2, 3]), dtype: torch.int64
b: tensor([1., 2., 3.]), dtype: torch.float32
c shape: torch.Size([2, 3])
d: tensor([1., 2., 3.]), dtype: torch.float32
e dtype: torch.float64
torch.int64
torch.float32
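
The note about integer tensors and gradient tracking can be checked directly. This is a minimal sketch, assuming a recent PyTorch release; the variable names are illustrative and not part of the original notebook.

```python
import torch

# Integer tensors cannot require gradients: autograd only supports
# floating-point and complex dtypes, so this raises a RuntimeError.
try:
    torch.tensor([1], requires_grad=True)
    int_allowed = True
except RuntimeError as err:
    int_allowed = False
    print(f"int64 rejected: {err}")

# The float literal [1.] gives float32, which autograd accepts.
x = torch.tensor([1.], requires_grad=True)
print(x.dtype, x.requires_grad)
```

If integer data needs gradients, converting it first with `.float()` is the usual route.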


In [2]:
# ─────────────────────────────────────────────────────────────────────────────
# DEVICE PLACEMENT
# ─────────────────────────────────────────────────────────────────────────────

# Check available devices
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"MPS available: {torch.backends.mps.is_available()}")  # Apple Silicon

# Create tensor on specific device
cpu_tensor = torch.tensor([1, 2, 3], device='cpu')

# Move tensor to device (use this pattern for portability)
device = torch.device('cuda' if torch.cuda.is_available() 
                      else 'mps' if torch.backends.mps.is_available() 
                      else 'cpu')
print(f"Using device: {device}")

gpu_tensor = cpu_tensor.to(device)
print(f"Tensor device: {gpu_tensor.device}")

CUDA available: False
MPS available: True
Using device: mps
Tensor device: mps:0


In [None]:
# ─────────────────────────────────────────────────────────────────────────────
# REQUIRES_GRAD - Enable gradient tracking
# ─────────────────────────────────────────────────────────────────────────────

""" 
Gradient tracking is PyTorch's automatic differentiation: PyTorch records tensor operations to build a computation graph, then computes gradients via backpropagation when you call backward().
Gradients tell the optimizer how to change each parameter to reduce the loss, which is what makes learning possible.
Tracking is enabled per tensor by setting requires_grad=True. It is needed during training and is usually switched off during inference to save memory and computation.
If no tensor in the computation requires gradients, backward() raises an error because there is no graph to traverse.
"""

# By default, tensors don't track gradients
x = torch.tensor([1.0, 2.0, 3.0])
print(f"requires_grad: {x.requires_grad}")  # False

# Enable gradient tracking
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
print(f"requires_grad: {x.requires_grad}")  # True

# Or set it after creation
y = torch.tensor([4.0, 5.0, 6.0])
y.requires_grad_(True)  # In-place operation (note the underscore)
print(f"y requires_grad: {y.requires_grad}")

requires_grad: False
requires_grad: True
y requires_grad: True


In [4]:
# ─────────────────────────────────────────────────────────────────────────────
# BASIC OPERATIONS AND BROADCASTING
# ─────────────────────────────────────────────────────────────────────────────

a = torch.tensor([1.0, 2.0, 3.0])
b = torch.tensor([4.0, 5.0, 6.0])

# Element-wise operations
print(f"Add: {a + b}")
print(f"Multiply: {a * b}")
print(f"Power: {a ** 2}")

# Matrix operations
A = torch.tensor([[1, 2], [3, 4]], dtype=torch.float32)
B = torch.tensor([[5, 6], [7, 8]], dtype=torch.float32)

print(f"Matrix multiply (@ operator): \n{A @ B}")
print(f"Matrix multiply (torch.matmul): \n{torch.matmul(A, B)}")

# Broadcasting - smaller tensor expands to match larger
scalar = torch.tensor(10)
vector = torch.tensor([1, 2, 3])
print(f"Broadcast scalar: {scalar + vector}")

matrix = torch.tensor([[1, 2, 3], [4, 5, 6]])
print(f"Broadcast vector to matrix:\n{matrix + vector}")

Add: tensor([5., 7., 9.])
Multiply: tensor([ 4., 10., 18.])
Power: tensor([1., 4., 9.])
Matrix multiply (@ operator): 
tensor([[19., 22.],
        [43., 50.]])
Matrix multiply (torch.matmul): 
tensor([[19., 22.],
        [43., 50.]])
Broadcast scalar: tensor([11, 12, 13])
Broadcast vector to matrix:
tensor([[2, 4, 6],
        [5, 7, 9]])
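
The broadcasting rule used above can be made explicit: shapes are compared from the trailing dimension, and each pair of sizes must match or one of them must be 1. The sketch below uses `torch.broadcast_shapes` to show the resulting shape without doing any arithmetic; the tensors `col`, `row`, and `grid` are illustrative names, not from the original notebook.

```python
import torch

# Shapes are aligned from the trailing dimension; each pair of sizes must
# be equal, or one of them must be 1.
print(torch.broadcast_shapes((2, 3), (3,)))    # matrix + vector
print(torch.broadcast_shapes((3, 1), (1, 4)))  # column + row

col = torch.tensor([[1], [2], [3]])       # shape (3, 1)
row = torch.tensor([[10, 20, 30, 40]])    # shape (1, 4)
grid = col + row                          # every column value meets every row value
print(grid.shape)

# Incompatible trailing sizes (3 vs 2) raise an error instead of broadcasting.
try:
    torch.ones(2, 3) + torch.ones(2)
    mismatch_ok = True
except RuntimeError:
    mismatch_ok = False
print(f"mismatched shapes allowed: {mismatch_ok}")
```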


In [5]:
# ─────────────────────────────────────────────────────────────────────────────
# INDEXING AND SLICING
# ─────────────────────────────────────────────────────────────────────────────

t = torch.tensor([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# Basic indexing (same as NumPy)
print(f"Element [1,2]: {t[1, 2]}")
print(f"Row 0: {t[0]}")
print(f"Column 1: {t[:, 1]}")

# Slicing
print(f"First 2 rows:\n{t[:2]}")
print(f"Last 2 columns:\n{t[:, -2:]}")

# Boolean indexing
mask = t > 5
print(f"Elements > 5: {t[mask]}")

# Fancy indexing
indices = torch.tensor([0, 2])
print(f"Rows 0 and 2:\n{t[indices]}")

Element [1,2]: 6
Row 0: tensor([1, 2, 3])
Column 1: tensor([2, 5, 8])
First 2 rows:
tensor([[1, 2, 3],
        [4, 5, 6]])
Last 2 columns:
tensor([[2, 3],
        [5, 6],
        [8, 9]])
Elements > 5: tensor([6, 7, 8, 9])
Rows 0 and 2:
tensor([[1, 2, 3],
        [7, 8, 9]])


In [6]:
# ─────────────────────────────────────────────────────────────────────────────
# PYTORCH TENSORS VS NUMPY ARRAYS
# ─────────────────────────────────────────────────────────────────────────────

# NumPy to PyTorch (shares memory by default!)
np_array = np.array([1.0, 2.0, 3.0])
tensor_from_np = torch.from_numpy(np_array)
print(f"From NumPy: {tensor_from_np}")

# Modify NumPy array - tensor changes too!
np_array[0] = 999
print(f"After modifying NumPy: {tensor_from_np}")  # Also 999!

# PyTorch to NumPy (also shares memory)
tensor = torch.tensor([1.0, 2.0, 3.0])
np_from_tensor = tensor.numpy()
print(f"To NumPy: {np_from_tensor}")

# Use .clone() to avoid shared memory
tensor_copy = torch.from_numpy(np_array).clone()
np_array[0] = 1  # Original back
print(f"Cloned (independent): {tensor_copy}")  # Still 999

# Key differences:
# - PyTorch tensors can live on GPU
# - PyTorch tensors track gradients
# - PyTorch has autograd integration

From NumPy: tensor([1., 2., 3.], dtype=torch.float64)
After modifying NumPy: tensor([999.,   2.,   3.], dtype=torch.float64)
To NumPy: [1. 2. 3.]
Cloned (independent): tensor([999.,   2.,   3.], dtype=torch.float64)
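
A related point worth checking: `torch.from_numpy` shares the NumPy buffer, while `torch.tensor` copies the data it is given. The sketch below contrasts the two; the names `shared` and `copied` are illustrative.

```python
import numpy as np
import torch

arr = np.array([1.0, 2.0, 3.0])
shared = torch.from_numpy(arr)   # view onto the NumPy buffer
copied = torch.tensor(arr)       # torch.tensor() copies its input

arr[0] = 999.0
print(shared)  # reflects the change made through NumPy
print(copied)  # unaffected by the change
```

Both results keep NumPy's float64 dtype, so a `.float()` conversion is often needed before feeding such data to a float32 model.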


### 1.2 Autograd and Computation Graphs

**Critical mindset:** PyTorch builds the computation graph *as Python executes*, not ahead of time. This is the single biggest difference from TensorFlow's static-graph style (TF 1.x sessions, or `tf.function` tracing in TF 2.x).

**Key Concepts:**

| Concept | Description |
|---------|-------------|
| `requires_grad=True` | Tells PyTorch to track operations on this tensor |
| Computation graph | DAG of operations built dynamically during forward pass |
| `.backward()` | Computes gradients via reverse-mode autodiff |
| `.grad` | Stores computed gradients |
| `torch.no_grad()` | Context manager to disable gradient tracking |

**How it works:**
1. Every operation on tensors with `requires_grad=True` is recorded
2. The graph is built on-the-fly (dynamic graph)
3. Calling `.backward()` on a scalar traverses the graph backwards
4. Gradients accumulate in `.grad` attributes
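
Because the graph is built while Python runs, ordinary control flow decides which operations get recorded. The sketch below is an illustration under that assumption; `piecewise` is a hypothetical function, not part of the original notebook.

```python
import torch

def piecewise(x):
    # A plain Python `if` chooses the branch; only the branch that
    # actually runs is recorded in the graph for this call.
    if x.sum() > 0:
        return (x ** 2).sum()
    return (-x).sum()

x_pos = torch.tensor([1.0, 2.0], requires_grad=True)
piecewise(x_pos).backward()
print(x_pos.grad)   # gradient of the x**2 branch: 2 * x

x_neg = torch.tensor([-1.0, -2.0], requires_grad=True)
piecewise(x_neg).backward()
print(x_neg.grad)   # gradient of the -x branch: -1 for every element
```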

In [3]:
# ─────────────────────────────────────────────────────────────────────────────
# AUTOGRAD BASICS - Building computation graphs dynamically
# ─────────────────────────────────────────────────────────────────────────────

# Create tensors with gradient tracking
x = torch.tensor([2.0, 3.0], requires_grad=True)
y = torch.tensor([4.0, 5.0], requires_grad=True)

# Every operation creates a node in the graph
z = x * y          # Element-wise multiply
out = z.sum()      # Sum to scalar (required for .backward())

print(f"z: {z}")
print(f"out: {out}")
print(f"z.grad_fn: {z.grad_fn}")      # MulBackward0 (grad_fn names the operation that created this tensor: multiply, add, etc.)
print(f"out.grad_fn: {out.grad_fn}")  # SumBackward

# Compute gradients
out.backward()

# Gradients are stored in .grad
print(f"x.grad: {x.grad}")  # d(out)/dx = y = [4, 5]
print(f"y.grad: {y.grad}")  # d(out)/dy = x = [2, 3]
print(f"z.grad: {z.grad}")  # None, because z is not a leaf node

# Example 2: scalar case (no need to sum)
x = torch.tensor(2.0, requires_grad=True)
y = x * x
y.backward()
print(x.grad)  # tensor(4.): dy/dx = 2x = 4 at x = 2

""" 
═══════════════════════════════════════════════════════════════════════════════
                        COMPUTATION GRAPH EXPLANATION
═══════════════════════════════════════════════════════════════════════════════

FORWARD PASS (Building the Graph):
──────────────────────────────────
    x = [2., 3.]  (LEAF NODE - user input)
    y = [4., 5.]  (LEAF NODE - user input)
         │           │
         └─────┬─────┘
               ↓
         z = x * y = [2*4, 3*5] = [8., 15.]  (INTERMEDIATE NODE)
         (grad_fn = <MulBackward0>)
               │
               ↓
       out = z.sum() = 8 + 15 = 23.0  (OUTPUT NODE)
       (grad_fn = <SumBackward0>)


BACKWARD PASS (Computing Gradients):
─────────────────────────────────────
When we call out.backward(), PyTorch computes gradients using the chain rule:

Step 1: Start at output node
   out = 23.0
   d(out)/d(out) = 1.0  (gradient at the starting point)

Step 2: Backprop through SumBackward0 (out = z.sum())
   d(out)/d(z) = [1., 1.]  (sum distributes gradient equally to all inputs)
   
   Why? Because:
   out = z[0] + z[1]
   ∂(out)/∂(z[0]) = 1
   ∂(out)/∂(z[1]) = 1

Step 3: Backprop through MulBackward0 (z = x * y)
   Using chain rule: d(out)/d(x) = d(out)/d(z) * d(z)/d(x)
   
   For x[0]: d(z[0])/d(x[0]) = y[0] = 4.0
             d(out)/d(x[0]) = 1.0 * 4.0 = 4.0
   
   For x[1]: d(z[1])/d(x[1]) = y[1] = 5.0
             d(out)/d(x[1]) = 1.0 * 5.0 = 5.0
   
   Therefore: x.grad = [4., 5.]  ← This is why x.grad = y!
   
   Similarly for y:
   For y[0]: d(z[0])/d(y[0]) = x[0] = 2.0
             d(out)/d(y[0]) = 1.0 * 2.0 = 2.0
   
   For y[1]: d(z[1])/d(y[1]) = x[1] = 3.0
             d(out)/d(y[1]) = 1.0 * 3.0 = 3.0
   
   Therefore: y.grad = [2., 3.]  ← This is why y.grad = x!


WHY z.grad IS None:
───────────────────
• z is an INTERMEDIATE (non-leaf) node in the computation graph
• By default, PyTorch only stores gradients for LEAF nodes (x, y)
• This saves memory for large networks
• To get z.grad, you must call z.retain_grad() before backward()
"""

z: tensor([ 8., 15.], grad_fn=<MulBackward0>)
out: 23.0
z.grad_fn: <MulBackward0 object at 0x1076e0f40>
out.grad_fn: <SumBackward0 object at 0x1079fec80>
x.grad: tensor([4., 5.])
y.grad: tensor([2., 3.])
z.grad: None
tensor(4.)
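
The explanation above notes that `z.grad` is `None` unless `z.retain_grad()` is called before `backward()`. A minimal sketch of that variant, reusing the same values:

```python
import torch

x = torch.tensor([2.0, 3.0], requires_grad=True)
y = torch.tensor([4.0, 5.0], requires_grad=True)

z = x * y
z.retain_grad()   # ask autograd to keep the gradient of this non-leaf tensor
out = z.sum()
out.backward()

print(z.is_leaf)  # False: z was produced by an operation
print(z.grad)     # d(out)/dz, now populated
print(x.grad)     # leaf gradients are stored as before
```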



In [8]:
# ─────────────────────────────────────────────────────────────────────────────
# GRADIENT ACCUMULATION - Why zeroing matters
# ─────────────────────────────────────────────────────────────────────────────

x = torch.tensor([1.0, 2.0], requires_grad=True)

# First backward pass
y1 = (x * 2).sum()
y1.backward()
print(f"After first backward: {x.grad}")  # [2, 2]

# Second backward pass - gradients ACCUMULATE!
y2 = (x * 3).sum()
y2.backward()
print(f"After second backward: {x.grad}")  # [5, 5] = [2+3, 2+3]

# This is why you MUST zero gradients in training loops
x.grad.zero_()  # In-place zero
print(f"After zeroing: {x.grad}")  # [0, 0]

# Now fresh gradient
y3 = (x * 4).sum()
y3.backward()
print(f"After third backward: {x.grad}")  # [4, 4]

After first backward: tensor([2., 2.])
After second backward: tensor([5., 5.])
After zeroing: tensor([0., 0.])
After third backward: tensor([4., 4.])
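
Accumulation is not only a pitfall; it can also be used deliberately. The sketch below, with made-up data and a hypothetical weight vector `w`, shows that summing the gradients of two equally sized micro-batches (each loss scaled by 1/2) reproduces the full-batch gradient.

```python
import torch

w = torch.tensor([1.0, -1.0], requires_grad=True)
data = torch.tensor([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]])

# Full-batch gradient of a mean-squared output.
full_loss = (data @ w).pow(2).mean()
full_loss.backward()
full_grad = w.grad.clone()

# Same gradient, accumulated over two micro-batches of equal size.
w.grad.zero_()
for chunk in data.chunk(2):
    micro_loss = (chunk @ w).pow(2).mean() / 2   # scale so the sum matches the full mean
    micro_loss.backward()                        # adds into w.grad

print(full_grad)
print(w.grad)
```

This is the idea behind gradient accumulation for simulating batches that do not fit in memory.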


In [9]:
# ─────────────────────────────────────────────────────────────────────────────
# TORCH.NO_GRAD() - Disable gradient tracking for inference
# ─────────────────────────────────────────────────────────────────────────────

x = torch.tensor([1.0, 2.0], requires_grad=True)

# With gradient tracking (default)
y = x * 2
print(f"y.requires_grad: {y.requires_grad}")  # True
print(f"y.grad_fn: {y.grad_fn}")  # MulBackward

# Without gradient tracking - faster, less memory
with torch.no_grad():
    z = x * 2
    print(f"z.requires_grad: {z.requires_grad}")  # False
    print(f"z.grad_fn: {z.grad_fn}")  # None

# Alternative: detach() creates a new tensor without grad
w = x.detach() * 2
print(f"w.requires_grad: {w.requires_grad}")  # False

# Use torch.inference_mode() for even more optimization (PyTorch 1.9+)
with torch.inference_mode():
    v = x * 2
    print(f"v.requires_grad: {v.requires_grad}")  # False

y.requires_grad: True
y.grad_fn: <MulBackward0 object at 0x115b41f60>
z.requires_grad: False
z.grad_fn: None
w.requires_grad: False
v.requires_grad: False
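
`torch.no_grad()` also works as a decorator, which is convenient for evaluation helpers. The sketch below assumes a small stand-in model; `predict` is a hypothetical helper name.

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 2)

@torch.no_grad()
def predict(module, inputs):
    """Hypothetical helper: run a forward pass without recording a graph."""
    return module(inputs)

untracked = predict(model, torch.randn(8, 4))
tracked = model(torch.randn(8, 4))

print(untracked.requires_grad, untracked.grad_fn)  # no graph was recorded
print(tracked.requires_grad)                        # normal call still tracks
```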


### 1.3 Modules and Parameters

Every model, block, transformer, and attention head in PyTorch is an `nn.Module`. This maps directly to TensorFlow's subclassed models (`tf.keras.Model`).

**Key Concepts:**

| Concept | Description |
|---------|-------------|
| `nn.Module` | Base class for all neural network modules |
| `nn.Parameter` | Tensor that is registered as a learnable parameter when assigned as a module attribute |
| `forward()` | Defines the computation performed at every call |
| `model.parameters()` | Iterator over all learnable parameters |
| `state_dict` | Dictionary mapping parameter and buffer names to tensors |

**The Module Pattern:**
```python
class MyModel(nn.Module):
    def __init__(self):
        super().__init__()
        # Define layers here
        
    def forward(self, x):
        # Define forward pass here
        return x
```

In [3]:
import torch.nn as nn

# ─────────────────────────────────────────────────────────────────────────────
# BASIC MODULE DEFINITION
# ─────────────────────────────────────────────────────────────────────────────

class SimpleNet(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()  # Always call parent __init__
        
        # Layers are registered automatically when assigned as attributes
        # nn.Linear is a fully connected layer: every input feature connects to every
        # output feature through its own weight. It computes y = x @ W.T + b, where
        # x is the input, b the bias, and W the weights stored as (out_features, in_features);
        # the transpose aligns W with the input dimension.
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(hidden_size, output_size)
    
    def forward(self, x):
        # Define the forward pass
        x = self.fc1(x)
        x = self.relu(x)
        x = self.fc2(x)
        return x

# Create model
model = SimpleNet(10, 20, 5)
print(model)

# Test forward pass
x = torch.randn(32, 10)  # Batch of 32, 10 features each
output = model(x)        # Calls forward() automatically
print(f"Output shape: {output.shape}")

""" 
- nn.Linear is batch aware: forward() does not depend on the batch size.
- It operates on the last dimension only, so the expected input shape is (*, input_size), where * is any number of leading dimensions.
- Here x has shape (32, 10): batch size 32, input size 10. Each of the 32 samples passes through the layers independently.
- Example: (8, 32, 10) could be 8 time steps with 32 samples per step and 10 features per sample.
- The 10 input features are mapped to 20 hidden features, which are then mapped to 5 output values (one logit per class).
"""

SimpleNet(
  (fc1): Linear(in_features=10, out_features=20, bias=True)
  (relu): ReLU()
  (fc2): Linear(in_features=20, out_features=5, bias=True)
)
Output shape: torch.Size([32, 5])
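
The claim that `nn.Linear` only looks at the last dimension can be verified with a 3D input. This is a small sketch using a standalone layer rather than `SimpleNet`; the tensor names are illustrative.

```python
import torch
import torch.nn as nn

layer = nn.Linear(10, 5)

x2d = torch.randn(32, 10)      # (batch, features)
x3d = torch.randn(8, 32, 10)   # (time steps, batch, features)

print(layer(x2d).shape)  # the leading dimension is preserved
print(layer(x3d).shape)  # both leading dimensions are preserved

# The layer computes x @ W.T + b with W stored as (out_features, in_features).
manual = x3d @ layer.weight.T + layer.bias
print(torch.allclose(manual, layer(x3d), atol=1e-5))
```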


In [11]:
# ─────────────────────────────────────────────────────────────────────────────
# NN.PARAMETER - Custom learnable parameters
# ─────────────────────────────────────────────────────────────────────────────

class CustomLayer(nn.Module):
    def __init__(self, size):
        super().__init__()
        # nn.Parameter wraps a tensor and registers it as a parameter
        self.weights = nn.Parameter(torch.randn(size))
        self.bias = nn.Parameter(torch.zeros(size))
        
        # Plain tensor attributes are NOT parameters (not learned).
        # register_buffer stores a tensor that is saved in state_dict but not trained.
        self.register_buffer('constant', torch.ones(size))
    
    def forward(self, x):
        return x * self.weights + self.bias + self.constant

layer = CustomLayer(5)

# Check what's registered
print("Parameters:")
for name, param in layer.named_parameters():
    print(f"  {name}: {param.shape}, requires_grad={param.requires_grad}")

print("\nBuffers:")
for name, buf in layer.named_buffers():
    print(f"  {name}: {buf.shape}")

Parameters:
  weights: torch.Size([5]), requires_grad=True
  bias: torch.Size([5]), requires_grad=True

Buffers:
  constant: torch.Size([5])
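
To make the distinction between parameters, buffers, and plain attributes concrete, the sketch below registers one of each and inspects what the module reports. `Scaler` is a hypothetical module written for this illustration; `persistent=False` is the `register_buffer` option that keeps a buffer out of `state_dict`.

```python
import torch
import torch.nn as nn

class Scaler(nn.Module):  # hypothetical module, for illustration only
    def __init__(self, size):
        super().__init__()
        self.scale = nn.Parameter(torch.ones(size))        # trained and saved
        self.register_buffer("offset", torch.zeros(size))  # saved, not trained
        self.register_buffer("scratch", torch.zeros(size), persistent=False)  # neither
        self.plain = torch.zeros(size)                     # not tracked by the module at all

    def forward(self, x):
        return x * self.scale + self.offset

m = Scaler(3)
param_names = [name for name, _ in m.named_parameters()]
buffer_names = [name for name, _ in m.named_buffers()]
saved_keys = list(m.state_dict().keys())

print("parameters:", param_names)
print("buffers:   ", buffer_names)
print("state_dict:", saved_keys)
```

Registered buffers also follow the module on `.to(device)`, whereas a plain tensor attribute such as `plain` stays where it was created.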


In [12]:
# ─────────────────────────────────────────────────────────────────────────────
# MODEL.PARAMETERS() AND STATE_DICT
# ─────────────────────────────────────────────────────────────────────────────

model = SimpleNet(10, 20, 5)

# Iterate over all parameters
print("All parameters:")
for name, param in model.named_parameters():
    print(f"  {name}: {param.shape}")

# Total parameter count
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"\nTotal parameters: {total_params}")
print(f"Trainable parameters: {trainable_params}")

# State dict - serializable dictionary of all parameters (and persistent buffers)
state = model.state_dict()
print(f"\nState dict keys: {state.keys()}")

# Load state dict (for loading saved models)
model.load_state_dict(state)

All parameters:
  fc1.weight: torch.Size([20, 10])
  fc1.bias: torch.Size([20])
  fc2.weight: torch.Size([5, 20])
  fc2.bias: torch.Size([5])

Total parameters: 325
Trainable parameters: 325

State dict keys: odict_keys(['fc1.weight', 'fc1.bias', 'fc2.weight', 'fc2.bias'])


<All keys matched successfully>
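
The total of 325 parameters follows from the layer shapes: each linear layer holds `in_features * out_features` weights plus `out_features` biases. The sketch below checks the arithmetic against an equivalent `nn.Sequential` stack; `linear_param_count` is a hypothetical helper.

```python
import torch.nn as nn

def linear_param_count(in_features, out_features):
    """Weights (out x in) plus one bias per output feature."""
    return in_features * out_features + out_features

# fc1: 10 -> 20, fc2: 20 -> 5
expected = linear_param_count(10, 20) + linear_param_count(20, 5)

net = nn.Sequential(nn.Linear(10, 20), nn.ReLU(), nn.Linear(20, 5))
actual = sum(p.numel() for p in net.parameters())

print(expected, actual)
```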

### 1.4 Loss Functions and Optimizers

**Important:** PyTorch's classification losses (`nn.CrossEntropyLoss`, `nn.BCEWithLogitsLoss`) expect raw logits, not softmaxed or sigmoided outputs.

**Key Loss Functions:**

| Loss | Use Case | Input |
|------|----------|-------|
| `nn.CrossEntropyLoss` | Multi-class classification | Raw logits |
| `nn.BCEWithLogitsLoss` | Binary/multi-label classification | Raw logits |
| `nn.MSELoss` | Regression | Any |
| `nn.NLLLoss` | Multi-class (after log_softmax) | Log probabilities |

**Reduction Modes:**
- `reduction='mean'` (default): Average loss over batch
- `reduction='sum'`: Sum loss over batch
- `reduction='none'`: Return loss per sample

**Key Optimizers:**

| Optimizer | Description |
|-----------|-------------|
| `SGD` | Basic stochastic gradient descent |
| `Adam` | Adaptive learning rates, most popular |
| `AdamW` | Adam with proper weight decay (use for transformers) |

In [None]:
# ─────────────────────────────────────────────────────────────────────────────
# CROSS ENTROPY LOSS - The most important for classification
# ─────────────────────────────────────────────────────────────────────────────

# NOTE: the loss function measures how far the predicted class distribution (computed
# from the logits) is from the true class labels. The loss then tells the model how to
# adjust its weights during training. Cross entropy is the standard choice for
# multi-class problems where each input belongs to exactly one of several classes.

# CrossEntropyLoss combines LogSoftmax + NLLLoss, where LogSoftmax(x) = log(softmax(x))
# and NLLLoss (negative log likelihood) = -log(p), with p the predicted probability of the true class.
# Input: raw logits (batch_size, num_classes)
# Target: class indices (batch_size,) NOT one-hot!

logits = torch.tensor([[2.0, 1.0, 0.1],   # Sample 1: likely class 0
                       [0.1, 2.0, 0.5]])  # Sample 2: likely class 1
targets = torch.tensor([0, 1])  # True classes

criterion = nn.CrossEntropyLoss()
loss = criterion(logits, targets)
print(f"CrossEntropyLoss: {loss.item():.4f}")

# Understanding what happens internally:
log_softmax = nn.LogSoftmax(dim=1)
log_probs = log_softmax(logits)
print(f"Log probabilities:\n{log_probs}")

nll_loss = nn.NLLLoss()
loss_manual = nll_loss(log_probs, targets)
print(f"Manual NLLLoss: {loss_manual.item():.4f}")  # Same result!

CrossEntropyLoss: 0.3669
Log probabilities:
tensor([[-0.4170, -1.4170, -2.3170],
        [-2.2168, -0.3168, -1.8168]])
Manual NLLLoss: 0.3669
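
The same value can be derived without any loss module. For one sample, the loss is `-log softmax(logits)[target]`, which simplifies to `logsumexp(logits) - logits[target]`. The sketch below applies that identity to the logits used above; the intermediate names are illustrative.

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, 1.0, 0.1],
                       [0.1, 2.0, 0.5]])
targets = torch.tensor([0, 1])

# Pick out the logit of the true class for each sample.
true_class_logit = logits.gather(1, targets.unsqueeze(1)).squeeze(1)

# Per-sample loss: logsumexp(logits) - logits[target], then average over the batch.
per_sample = torch.logsumexp(logits, dim=1) - true_class_logit
manual = per_sample.mean()

print(f"manual:          {manual.item():.4f}")
print(f"F.cross_entropy: {F.cross_entropy(logits, targets).item():.4f}")
```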


In [14]:
# ─────────────────────────────────────────────────────────────────────────────
# OTHER LOSS FUNCTIONS
# ─────────────────────────────────────────────────────────────────────────────

# Binary Cross Entropy with Logits (for binary/multi-label)
logits_binary = torch.tensor([0.5, -0.5, 1.0])
targets_binary = torch.tensor([1.0, 0.0, 1.0])
bce = nn.BCEWithLogitsLoss()
print(f"BCE with Logits: {bce(logits_binary, targets_binary).item():.4f}")

# Mean Squared Error (regression)
predictions = torch.tensor([1.0, 2.0, 3.0])
targets_reg = torch.tensor([1.5, 2.0, 2.5])
mse = nn.MSELoss()
print(f"MSE Loss: {mse(predictions, targets_reg).item():.4f}")

# Reduction modes
mse_none = nn.MSELoss(reduction='none')
mse_sum = nn.MSELoss(reduction='sum')
print(f"Per-sample loss: {mse_none(predictions, targets_reg)}")
print(f"Sum loss: {mse_sum(predictions, targets_reg).item():.4f}")

BCE with Logits: 0.4205
MSE Loss: 0.1667
Per-sample loss: tensor([0.2500, 0.0000, 0.2500])
Sum loss: 0.5000
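
`BCEWithLogitsLoss` is mathematically the same as applying a sigmoid and then binary cross entropy, but the fused form is numerically safer. The sketch below compares the two on the logits used above and then on one extreme logit chosen for illustration.

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([0.5, -0.5, 1.0])
targets = torch.tensor([1.0, 0.0, 1.0])

fused = F.binary_cross_entropy_with_logits(logits, targets)
two_step = F.binary_cross_entropy(torch.sigmoid(logits), targets)
print(f"fused: {fused.item():.4f}  sigmoid + BCE: {two_step.item():.4f}")

# With an extreme logit, sigmoid underflows to 0.0 in float32, so the
# two-step version has lost the information needed to compute the loss.
big_logit = torch.tensor([-200.0])
big_target = torch.tensor([1.0])
print(torch.sigmoid(big_logit))
fused_big = F.binary_cross_entropy_with_logits(big_logit, big_target)
print(fused_big)   # the fused loss still reflects the size of the logit
```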


In [None]:
# ─────────────────────────────────────────────────────────────────────────────
# OPTIMIZERS - SGD, Adam, AdamW
# ─────────────────────────────────────────────────────────────────────────────

# NOTE: optimizers are the algorithms that update a network's weights during training to minimize the loss.
# They adjust the parameters based on the computed gradients; different optimizers use different
# update strategies, which can affect the speed and quality of learning.
# - The learning rate (lr) is usually the most important hyperparameter to tune. It controls the size
#   of each update step: too high can cause divergence, too low can make training very slow.

model = SimpleNet(10, 20, 5)

# SGD - Basic gradient descent
optimizer_sgd = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# Adam - Adaptive learning rates (most popular)
optimizer_adam = torch.optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.999))

# AdamW - Adam with decoupled weight decay (use for transformers!)
optimizer_adamw = torch.optim.AdamW(model.parameters(), lr=0.001, weight_decay=0.01)

# Weight decay comparison:
# Adam:  weight_decay acts as L2 regularization (weight_decay * param is added to the gradient)
# AdamW: weight_decay is decoupled (weights are shrunk directly, outside the adaptive update)
# For transformers and large models, AdamW is the usual default

print("Optimizer state keys:", optimizer_adam.state_dict().keys())
print("Param groups:", len(optimizer_adam.param_groups))

Optimizer state keys: dict_keys(['state', 'param_groups'])
Param groups: 1
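
The optimizers above all use a single parameter group. Passing a list of dictionaries creates several groups with their own settings. The sketch below uses this to exempt biases from weight decay, a common convention that is assumed here rather than prescribed by the original notebook; splitting on `p.ndim` is one simple way to do it.

```python
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(10, 20), nn.ReLU(), nn.Linear(20, 5))

# Assumption for illustration: weight matrices get decay, 1-D tensors (biases) do not.
decay = [p for p in net.parameters() if p.ndim >= 2]
no_decay = [p for p in net.parameters() if p.ndim < 2]

optimizer = torch.optim.AdamW(
    [
        {"params": decay, "weight_decay": 0.01},
        {"params": no_decay, "weight_decay": 0.0},
    ],
    lr=1e-3,  # shared default for every group
)

for i, group in enumerate(optimizer.param_groups):
    print(i, len(group["params"]), group["lr"], group["weight_decay"])
```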


In [8]:
# ─────────────────────────────────────────────────────────────────────────────
# Learning Rate Schedulers
# ─────────────────────────────────────────────────────────────────────────────

""" 
Learning rate schedulers control how the learning rate changes during training.

What they do.
- Adjust the learning rate over time.
- Usually decrease it as training progresses.
- Improve convergence and stability.

Why you need them.
- High learning rate helps early learning.
- Lower learning rate helps fine tuning.
- Fixed learning rates often stall or overshoot.

How they work in PyTorch.
- They wrap an optimizer.
- You call scheduler.step().
- They update optimizer.param_groups learning rates.

Common schedulers.
- StepLR. Drops LR every N epochs.
- MultiStepLR. Drops LR at specific epochs.
- ExponentialLR. Decays LR continuously.
- CosineAnnealingLR. Smooth cosine decay.
- ReduceLROnPlateau. Lowers LR when loss stops improving.

Key tradeoff.
- Decaying too fast stalls learning before the model has converged.
- Decaying too slowly keeps the step size large, so the loss can keep oscillating.

Rule of thumb.
- Start without a scheduler.
- Add one when training plateaus or oscillates.
"""

# EX:
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
scheduler = torch.optim.lr_scheduler.StepLR(
    optimizer,
    step_size=10,
    gamma=0.1
)

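epochs = 30
# Usage: step the scheduler once per epoch, after that epoch's optimizer updates
for epoch in range(epochs):
    # train()  # placeholder: forward, backward, and optimizer.step() for every batch
    scheduler.step()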

""" 
In this example:
- Initial learning rate is 0.001.
- StepLR uses step_size=10.
- Gamma is 0.1.

What actually happens.
- Epochs 1 to 10: learning rate stays 0.001.
- After the 10th scheduler.step(): learning rate becomes 0.0001.
- After the 20th scheduler.step(): learning rate becomes 0.00001.

Rule.
- LR_new = LR_old x gamma.
- This happens every step_size calls to scheduler.step(), here once per epoch.
- Not every batch or train step.

If you wanted per step changes.
- You would call scheduler.step() per batch.
- Or use schedulers designed for that.

"""





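The StepLR schedule described above can be traced by recording the learning rate in use at each epoch. This is a sketch with a single dummy parameter standing in for a model, and a bare `optimizer.step()` standing in for a full epoch of training.

```python
import torch
from torch.optim.lr_scheduler import StepLR

param = torch.nn.Parameter(torch.zeros(1))   # stand-in for model parameters
optimizer = torch.optim.SGD([param], lr=0.001)
scheduler = StepLR(optimizer, step_size=10, gamma=0.1)

lrs = []
for epoch in range(30):
    lrs.append(optimizer.param_groups[0]["lr"])  # learning rate used during this epoch
    optimizer.step()     # stand-in for the epoch's training updates
    scheduler.step()     # advance the schedule once per epoch

# Epoch indices here are 0-based, so index 10 is the 11th epoch.
for epoch in (0, 9, 10, 19, 20, 29):
    print(f"epoch index {epoch:2d}: lr = {lrs[epoch]:.0e}")
```

Calling `optimizer.step()` before `scheduler.step()` matches the order PyTorch expects.
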
---

## 2. Training Mechanics

*This is where PyTorch starts to feel powerful. This replaces TF's `model.fit()` entirely.*

### 2.1 Training Loop Anatomy

The canonical PyTorch training loop follows this pattern:

```
for epoch in range(num_epochs):
    for inputs, targets in dataloader:
        # 1. Forward pass
        outputs = model(inputs)
        
        # 2. Compute loss
        loss = criterion(outputs, targets)
        
        # 3. Backward pass
        loss.backward()
        
        # 4. Update weights
        optimizer.step()
        
        # 5. Zero gradients (CRITICAL!)
        optimizer.zero_grad()
```

**Why this order matters:**
- `backward()` computes gradients based on the current loss
- `step()` updates weights using those gradients
- `zero_grad()` clears gradients for next iteration (they accumulate by default!)
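
The five steps can be exercised end to end on a tiny problem. The sketch below is a self-contained illustration with synthetic regression data generated from a made-up linear rule; it uses the whole dataset as a single batch to keep the loop short.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Synthetic data from a known linear rule (illustrative values only).
inputs = torch.randn(64, 3)
targets = inputs @ torch.tensor([1.0, -2.0, 0.5]) + 0.3

model = nn.Linear(3, 1)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

losses = []
for step in range(50):
    outputs = model(inputs).squeeze(1)   # 1. forward pass
    loss = criterion(outputs, targets)   # 2. compute loss
    loss.backward()                      # 3. backward pass
    optimizer.step()                     # 4. update weights
    optimizer.zero_grad()                # 5. clear gradients for the next step
    losses.append(loss.item())

print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.4f}")
```

Many codebases call `optimizer.zero_grad()` at the top of the loop instead; both orders work as long as gradients are cleared between one `backward()` and the next.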

In [9]:
# ─────────────────────────────────────────────────────────────────────────────
# COMPLETE TRAINING LOOP EXAMPLE
# ─────────────────────────────────────────────────────────────────────────────

# Create synthetic data
X = torch.randn(1000, 10)         # features: 1000 samples, 10 features each
y = torch.randint(0, 5, (1000,))  # labels: 1000 random class indices in 0..4 (5 classes) to simulate true labels
# Create dataset and dataloader (covered in next section)
from torch.utils.data import TensorDataset, DataLoader
dataset = TensorDataset(X, y)  # pairs each feature row with its label: dataset[i] is (features, label)
print(dataset[0])  # first sample as a (features, label) tuple; both are tensors
# DataLoader groups samples into batches of 32 and reshuffles them every epoch.
# Each batch is (batch_X, batch_y): batch_X has shape (32, 10) and batch_y has shape (32,),
# one label per sample. The final batch is smaller (1000 = 31 * 32 + 8).
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)
print(next(iter(dataloader))[0].shape)  # batch_X: 32 samples with 10 features each
print(next(iter(dataloader))[1].shape)  # batch_y: 32 labels, one per sample

# iter(dataloader) returns an iterator over batches (an object implementing __iter__ and __next__);
# next(...) pulls the first batch from it. With shuffle=True, each iter() call draws a fresh shuffle.

# Model, loss, optimizer
model = SimpleNet(10, 20, 5)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

# Training loop
num_epochs = 3
for epoch in range(num_epochs):
    total_loss = 0
    for batch_X, batch_y in dataloader:  # batch_X is the data, batch_y the true labels
        # 1. Forward pass (feed batch_X through the model to get predictions)
        outputs = model(batch_X)
        
        # 2. Compute loss (compare outputs with the true labels batch_y to see how far off we were)
        loss = criterion(outputs, batch_y)
        
        # 3. Backward pass (computes gradients for each parameter with respect to loss)
        loss.backward()
        
        # 4. Update weights (apply gradients to the model parameters)
        optimizer.step()
        
        # 5. Zero gradients for the next iteration (prevents accumulation, so each training step starts fresh)
        optimizer.zero_grad()
        
        total_loss += loss.item()
    
    print(f"Epoch {epoch+1}/{num_epochs}, Loss: {total_loss/len(dataloader):.4f}")

(tensor([ 0.4061,  1.8123,  0.6576,  0.4971,  0.4414, -1.0050,  1.2208,  0.6445,
         1.5288, -0.7583]), tensor(1))
torch.Size([32, 10])
torch.Size([32])
Epoch 1/3, Loss: 1.6625
Epoch 2/3, Loss: 1.6398
Epoch 3/3, Loss: 1.6246


### 2.2 Datasets and DataLoader

PyTorch's data loading is more explicit and Pythonic than TensorFlow's `tf.data`.

**Key Components:**

| Component | Description |
|-----------|-------------|
| `Dataset` | Abstract class defining `__getitem__` and `__len__` |
| `DataLoader` | Wraps dataset for batching, shuffling, parallel loading |
| `collate_fn` | Custom function to combine samples into batches |
| `num_workers` | Number of parallel data loading processes |

In [17]:
# ─────────────────────────────────────────────────────────────────────────────
# CUSTOM DATASET CLASS
# ─────────────────────────────────────────────────────────────────────────────

from torch.utils.data import Dataset, DataLoader

class CustomDataset(Dataset):
    """
    Custom dataset must implement:
    - __init__: Setup (load data, store paths, etc.)
    - __len__: Return total number of samples
    - __getitem__: Return one sample by index
    """
    def __init__(self, data, labels, transform=None):
        self.data = data
        self.labels = labels
        self.transform = transform
    
    def __len__(self):
        return len(self.data)
    
    def __getitem__(self, idx):
        sample = self.data[idx]
        label = self.labels[idx]
        
        if self.transform:
            sample = self.transform(sample)
        
        return sample, label

# Create synthetic data
X = torch.randn(1000, 10)  # 1000 samples, 10 features
y = torch.randint(0, 5, (1000,))  # 5 classes

# Create dataset
dataset = CustomDataset(X, y)

print(f"Dataset size: {len(dataset)}")
sample, label = dataset[0]
print(f"Sample shape: {sample.shape}, Label: {label}")

Dataset size: 1000
Sample shape: torch.Size([10]), Label: 1


In [19]:
# ─────────────────────────────────────────────────────────────────────────────
# DATALOADER - Batching, shuffling, parallel loading
# ─────────────────────────────────────────────────────────────────────────────

# Basic DataLoader
loader = DataLoader(
    dataset,
    batch_size=16,
    shuffle=True,           # Shuffle at each epoch
    num_workers=0,          # Number of parallel loading processes (0 = main process)
    drop_last=False,        # Drop incomplete last batch?
    pin_memory=torch.cuda.is_available(),  # Faster host-to-GPU copies (only helps with CUDA)
)

# Iterate through batches
for batch_idx, (batch_x, batch_y) in enumerate(loader):
    print(f"Batch {batch_idx}: X shape = {batch_x.shape}, y shape = {batch_y.shape}")
    if batch_idx >= 2:
        break

# TensorDataset - Quick dataset from tensors
from torch.utils.data import TensorDataset
quick_dataset = TensorDataset(X, y)
quick_loader = DataLoader(quick_dataset, batch_size=32)

Batch 0: X shape = torch.Size([16, 10]), y shape = torch.Size([16])
Batch 1: X shape = torch.Size([16, 10]), y shape = torch.Size([16])
Batch 2: X shape = torch.Size([16, 10]), y shape = torch.Size([16])




In [20]:
# ─────────────────────────────────────────────────────────────────────────────
# COLLATE_FN - Custom batching logic
# ─────────────────────────────────────────────────────────────────────────────

# collate_fn is called to combine samples into a batch
# Useful for variable-length sequences, padding, etc.

def custom_collate(batch):
    """Custom collate function: stacks equal-sized samples into batch tensors."""
    # batch is a list of (data, label) tuples
    data = [item[0] for item in batch]
    labels = [item[1] for item in batch]
    
    # Stack into tensors
    data = torch.stack(data)
    labels = torch.stack(labels)
    
    # You could add padding here for variable-length sequences
    return data, labels

loader_with_collate = DataLoader(
    dataset,
    batch_size=8,
    collate_fn=custom_collate
)

batch_x, batch_y = next(iter(loader_with_collate))
print(f"Custom collate batch: {batch_x.shape}")

Custom collate batch: torch.Size([8, 10])
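
For variable-length sequences, stacking fails because the samples have different shapes. A padding collate function might look like the sketch below. The toy sequences, the `pad_collate` name, and the choice of `0` as pad value are assumptions made for illustration; `pad_sequence` is the real utility from `torch.nn.utils.rnn`.

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

# Hypothetical toy data: token-ID sequences of different lengths, one label each
sequences = [torch.tensor([5, 23, 456]), torch.tensor([12, 34]), torch.tensor([7, 8, 9, 10, 11])]
labels = [torch.tensor(0), torch.tensor(1), torch.tensor(0)]
toy_dataset = list(zip(sequences, labels))  # a plain list works as a map-style dataset

def pad_collate(batch, pad_value=0):
    """Pad every sequence to the longest one in this batch."""
    seqs, labs = zip(*batch)
    lengths = torch.tensor([len(s) for s in seqs])  # true lengths before padding
    padded = pad_sequence(list(seqs), batch_first=True, padding_value=pad_value)
    return padded, torch.stack(labs), lengths

padded_loader = DataLoader(toy_dataset, batch_size=3, collate_fn=pad_collate)
padded, labs, lengths = next(iter(padded_loader))
print(padded.shape)  # torch.Size([3, 5])
print(padded[1])     # tensor([12, 34,  0,  0,  0])
```

Returning the original lengths alongside the padded tensor lets later code build an attention mask or ignore padded positions in the loss.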


### 2.3 Device Management

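A forward pass fails immediately if the model and its inputs live on different devices, and with LLMs this is one of the first errors you will hit. All tensors and model parameters involved in an operation must be on the same device.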

**Common Device Mismatch Errors:**
```
RuntimeError: Expected all tensors to be on the same device, 
but found at least two devices, cuda:0 and cpu!
```

**The Pattern:**
```python
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)
data = data.to(device)
```

In [21]:
# ─────────────────────────────────────────────────────────────────────────────
# DEVICE MANAGEMENT - The essential pattern
# ─────────────────────────────────────────────────────────────────────────────

# Set device once at the start
device = torch.device(
    'cuda' if torch.cuda.is_available() 
    else 'mps' if torch.backends.mps.is_available()  # Apple Silicon
    else 'cpu'
)
print(f"Using device: {device}")

# Move model to device
model = SimpleNet(10, 20, 5)
model = model.to(device)  # Returns model, also modifies in-place
print(f"Model device: {next(model.parameters()).device}")

# Move data to device (must do this for every batch!)
X = torch.randn(32, 10)
y = torch.randint(0, 5, (32,))

X = X.to(device)
y = y.to(device)

# Now forward pass works
output = model(X)
print(f"Output device: {output.device}")

Using device: mps
Model device: mps:0
Output device: mps:0


In [22]:
# ─────────────────────────────────────────────────────────────────────────────
# COMPLETE TRAINING LOOP WITH DEVICE MANAGEMENT
# ─────────────────────────────────────────────────────────────────────────────

def train_epoch(model, dataloader, criterion, optimizer, device):
    model.train()  # Set to training mode
    total_loss = 0
    
    for batch_x, batch_y in dataloader:
        # Move batch to device
        batch_x = batch_x.to(device)
        batch_y = batch_y.to(device)
        
        # Forward
        outputs = model(batch_x)
        loss = criterion(outputs, batch_y)
        
        # Backward
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        
        total_loss += loss.item()
    
    return total_loss / len(dataloader)

def evaluate(model, dataloader, criterion, device):
    model.eval()  # Set to evaluation mode
    total_loss = 0
    correct = 0
    total = 0
    
    with torch.no_grad():  # No gradient tracking for evaluation
        for batch_x, batch_y in dataloader:
            batch_x = batch_x.to(device)
            batch_y = batch_y.to(device)
            
            outputs = model(batch_x)
            loss = criterion(outputs, batch_y)
            total_loss += loss.item()
            
            _, predicted = outputs.max(1)
            total += batch_y.size(0)
            correct += predicted.eq(batch_y).sum().item()
    
    return total_loss / len(dataloader), correct / total

# Example usage
print("Training and evaluation functions defined.")

Training and evaluation functions defined.


---

## 3. PyTorch Building Blocks for Deep Learning

*This maps closely to what you already know from TensorFlow, but in PyTorch style.*

### 3.1 Layers and Activations

**Important:** You will manually control `train()` and `eval()` modes. This affects layers like Dropout and BatchNorm.

**Common Layers:**

| Layer | Description |
|-------|-------------|
| `nn.Linear` | Fully connected layer (y = xW^T + b) |
| `nn.Conv1d/2d` | Convolutional layers |
| `nn.Embedding` | Lookup table for embeddings |
| `nn.Dropout` | Randomly zeros elements (only in train mode!) |

**Activations:**

| Activation | Formula | Use Case |
|------------|---------|----------|
| ReLU | max(0, x) | Default choice |
| GELU | x · Φ(x) | Transformers (BERT, GPT) |
| SiLU/Swish | x · σ(x) | Vision, newer models |
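
The formulas in the table can be checked with plain Python. This is a small sketch using only the standard library; the function names are illustrative, and GELU is written in its exact form, where Φ is the standard normal CDF expressed through `math.erf`.

```python
import math

def relu(x):
    return max(0.0, x)

def gelu(x):
    # x * Phi(x), with Phi(x) = 0.5 * (1 + erf(x / sqrt(2)))
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def silu(x):
    # x * sigmoid(x)
    return x / (1.0 + math.exp(-x))

for value in (-1.0, 0.0, 1.9683):
    print(f"x={value:+.4f}  relu={relu(value):.4f}  gelu={gelu(value):.4f}  silu={silu(value):.4f}")
```

Unlike ReLU, both GELU and SiLU return small negative values for negative inputs instead of a hard zero.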

In [23]:
# ─────────────────────────────────────────────────────────────────────────────
# COMMON LAYERS
# ─────────────────────────────────────────────────────────────────────────────

# Linear (Dense) layer
linear = nn.Linear(in_features=10, out_features=5)
x = torch.randn(32, 10)  # batch of 32
out = linear(x)
print(f"Linear: {x.shape} -> {out.shape}")
print(f"  Weight: {linear.weight.shape}, Bias: {linear.bias.shape}")

# Conv2d layer
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
img = torch.randn(32, 3, 224, 224)  # batch of 32 RGB images
out = conv(img)
print(f"Conv2d: {img.shape} -> {out.shape}")

# Embedding layer (for NLP)
embedding = nn.Embedding(num_embeddings=10000, embedding_dim=256)
tokens = torch.randint(0, 10000, (32, 50))  # batch of 32, sequence length 50
out = embedding(tokens)
print(f"Embedding: {tokens.shape} -> {out.shape}")

Linear: torch.Size([32, 10]) -> torch.Size([32, 5])
  Weight: torch.Size([5, 10]), Bias: torch.Size([5])
Conv2d: torch.Size([32, 3, 224, 224]) -> torch.Size([32, 16, 224, 224])
Embedding: torch.Size([32, 50]) -> torch.Size([32, 50, 256])


In [24]:
# ─────────────────────────────────────────────────────────────────────────────
# ACTIVATIONS AND DROPOUT (train vs eval mode)
# ─────────────────────────────────────────────────────────────────────────────

x = torch.randn(5)

# Activations
print("Activations:")
print(f"  ReLU:   {torch.relu(x)}")
print(f"  GELU:   {torch.nn.functional.gelu(x)}")
print(f"  SiLU:   {torch.nn.functional.silu(x)}")
print(f"  Sigmoid: {torch.sigmoid(x)}")

# Dropout - CRITICAL: behavior changes between train/eval
class DropoutDemo(nn.Module):
    def __init__(self):
        super().__init__()
        self.dropout = nn.Dropout(p=0.5)
    
    def forward(self, x):
        return self.dropout(x)

model = DropoutDemo()
x = torch.ones(10)

model.train()  # Training mode - dropout active
print(f"\nTrain mode: {model(x)}")  # Some zeros; survivors are scaled by 1/(1-p) = 2

model.eval()   # Eval mode - dropout disabled
print(f"Eval mode:  {model(x)}")  # All ones (input passes through unchanged)

Activations:
  ReLU:   tensor([1.9683, 0.0000, 0.0000, 0.0000, 0.9355])
  GELU:   tensor([ 1.9200, -0.1698, -0.1447, -0.0323,  0.7720])
  SiLU:   tensor([ 1.7270, -0.2362, -0.2763, -0.2220,  0.6719])
  Sigmoid: tensor([0.8774, 0.3268, 0.2420, 0.1021, 0.7182])

Train mode: tensor([0., 0., 0., 2., 2., 2., 2., 0., 0., 0.])
Eval mode:  tensor([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])


### 3.2 Normalization

**Why LayerNorm dominates in transformers:**
- BatchNorm normalizes across the batch dimension → requires large batches
- LayerNorm normalizes across the feature dimension → works with any batch size
- For sequence models where batch size varies, LayerNorm is essential

| Normalization | Normalizes Over | Use Case |
|---------------|-----------------|----------|
| BatchNorm | Batch dimension | CNNs, fixed batch sizes |
| LayerNorm | Feature dimension | Transformers, RNNs |
| RMSNorm | Feature (no mean) | LLaMA, efficient transformers |
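
The difference in the "Normalizes Over" column comes down to which axis the mean and variance are taken over. The following sketch reproduces both layers by hand on a small random tensor; it assumes freshly constructed modules (weight 1, bias 0, default `eps=1e-5`) and BatchNorm in training mode.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(4, 6)  # [batch=4, features=6]
eps = 1e-5

# LayerNorm: one mean/variance per sample, computed across features (dim=-1)
ln_mean = x.mean(dim=-1, keepdim=True)
ln_var = x.var(dim=-1, keepdim=True, unbiased=False)
manual_ln = (x - ln_mean) / torch.sqrt(ln_var + eps)

# BatchNorm (training mode): one mean/variance per feature, computed across the batch (dim=0)
bn_mean = x.mean(dim=0, keepdim=True)
bn_var = x.var(dim=0, keepdim=True, unbiased=False)
manual_bn = (x - bn_mean) / torch.sqrt(bn_var + eps)

ln_matches = torch.allclose(manual_ln, nn.LayerNorm(6)(x), atol=1e-5)
bn_matches = torch.allclose(manual_bn, nn.BatchNorm1d(6)(x), atol=1e-5)
print(ln_matches, bn_matches)
```

Because LayerNorm's statistics never mix samples, its output for one sequence does not depend on what else is in the batch.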

In [25]:
# ─────────────────────────────────────────────────────────────────────────────
# NORMALIZATION LAYERS
# ─────────────────────────────────────────────────────────────────────────────

# BatchNorm - normalizes over batch dimension
# Input: (N, C, *) where N=batch, C=channels
batch_norm = nn.BatchNorm1d(num_features=10)
x = torch.randn(32, 10)  # batch of 32, 10 features
out = batch_norm(x)
print(f"BatchNorm1d: {x.shape} -> {out.shape}")
print(f"  Mean ≈ 0: {out.mean(dim=0)[:3]}")  # Mean across batch

# LayerNorm - normalizes over feature dimensions
# Input: (*, normalized_shape)
layer_norm = nn.LayerNorm(normalized_shape=10)
x = torch.randn(32, 50, 10)  # batch=32, seq_len=50, features=10
out = layer_norm(x)
print(f"\nLayerNorm: {x.shape} -> {out.shape}")
print(f"  Mean ≈ 0: {out[0, 0].mean():.6f}")  # Mean across features

# RMSNorm (common in LLaMA; PyTorch 2.4+ also ships nn.RMSNorm, implemented by hand here for clarity)
class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))
    
    def forward(self, x):
        # RMS = sqrt(mean(x^2) + eps)
        rms = torch.sqrt(torch.mean(x ** 2, dim=-1, keepdim=True) + self.eps)
        return x / rms * self.weight

rms_norm = RMSNorm(10)
out = rms_norm(torch.randn(32, 50, 10))
print(f"\nRMSNorm: {out.shape}")

BatchNorm1d: torch.Size([32, 10]) -> torch.Size([32, 10])
  Mean ≈ 0: tensor([ 0.0000e+00,  7.4506e-09, -7.4506e-09], grad_fn=<SliceBackward0>)

LayerNorm: torch.Size([32, 50, 10]) -> torch.Size([32, 50, 10])
  Mean ≈ 0: -0.000000

RMSNorm: torch.Size([32, 50, 10])


### 3.3 Initialization

**This matters more than beginners realize.** Transformers rely on careful initialization for stable training.

**Default PyTorch Initialization:**
- `nn.Linear`: Kaiming uniform with `a=sqrt(5)`, which works out to weights drawn from U(-1/√fan_in, 1/√fan_in)
- `nn.Embedding`: Normal(0, 1)

**Why careful init matters for transformers:**
- Too large: exploding activations/gradients
- Too small: vanishing signals
- Common practice: scale by $\frac{1}{\sqrt{d_{model}}}$ or $\frac{1}{\sqrt{n_{layers}}}$

In [26]:
# ─────────────────────────────────────────────────────────────────────────────
# WEIGHT INITIALIZATION
# ─────────────────────────────────────────────────────────────────────────────

# Check default initialization
linear = nn.Linear(512, 512)
print(f"Default Linear init:")
print(f"  Weight std: {linear.weight.std():.4f}")
print(f"  Bias mean:  {linear.bias.mean():.4f}")

# Common initialization methods
layer = nn.Linear(512, 512)

# Xavier/Glorot (good for tanh/sigmoid)
nn.init.xavier_uniform_(layer.weight)
print(f"\nXavier uniform std: {layer.weight.std():.4f}")

# Kaiming/He (good for ReLU)
nn.init.kaiming_uniform_(layer.weight, nonlinearity='relu')
print(f"Kaiming uniform std: {layer.weight.std():.4f}")

# Normal initialization with specific std
nn.init.normal_(layer.weight, mean=0, std=0.02)  # GPT-2 style
print(f"Normal(0, 0.02) std: {layer.weight.std():.4f}")

# Zero initialization (for biases, residual connections)
nn.init.zeros_(layer.bias)
print(f"Zeros bias: {layer.bias.sum():.4f}")

Default Linear init:
  Weight std: 0.0255
  Bias mean:  -0.0007

Xavier uniform std: 0.0442
Kaiming uniform std: 0.0625
Normal(0, 0.02) std: 0.0200
Zeros bias: 0.0000
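
The standard deviations printed above follow directly from the bounds of each uniform distribution. As a worked check (a sketch using only the standard library; the helper name `uniform_std` is illustrative), a uniform distribution on (-b, b) has standard deviation b/√3:

```python
import math

fan_in = fan_out = 512

def uniform_std(bound):
    # Standard deviation of U(-bound, bound)
    return bound / math.sqrt(3)

# Default nn.Linear init: kaiming_uniform_ with a=sqrt(5) -> bound = 1/sqrt(fan_in)
default_linear = uniform_std(1 / math.sqrt(fan_in))

# xavier_uniform_ with gain 1 -> bound = sqrt(6 / (fan_in + fan_out))
xavier = uniform_std(math.sqrt(6 / (fan_in + fan_out)))

# kaiming_uniform_ with nonlinearity='relu' -> bound = sqrt(6 / fan_in)
kaiming_relu = uniform_std(math.sqrt(6 / fan_in))

print(f"{default_linear:.4f} {xavier:.4f} {kaiming_relu:.4f}")  # 0.0255 0.0442 0.0625
```
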


In [27]:
# ─────────────────────────────────────────────────────────────────────────────
# CUSTOM INITIALIZATION FOR A MODEL
# ─────────────────────────────────────────────────────────────────────────────

def init_weights(module):
    """Custom weight initialization function."""
    if isinstance(module, nn.Linear):
        nn.init.normal_(module.weight, mean=0, std=0.02)
        if module.bias is not None:
            nn.init.zeros_(module.bias)
    elif isinstance(module, nn.Embedding):
        nn.init.normal_(module.weight, mean=0, std=0.02)
    elif isinstance(module, nn.LayerNorm):
        nn.init.ones_(module.weight)
        nn.init.zeros_(module.bias)

# Apply to model
model = SimpleNet(10, 20, 5)
model.apply(init_weights)  # Recursively applies to all modules

print("After custom initialization:")
for name, param in model.named_parameters():
    if 'weight' in name:
        print(f"  {name}: std = {param.std():.4f}")

After custom initialization:
  fc1.weight: std = 0.0209
  fc2.weight: std = 0.0189


---

## 4. PyTorch for NLP and Sequence Models

*This is where PyTorch becomes your main LLM tool.*

### 4.1 Embeddings and Vocab

`nn.Embedding` is a lookup table that maps integer indices to dense vectors.

**Key Parameters:**

| Parameter | Description |
|-----------|-------------|
| `num_embeddings` | Vocabulary size |
| `embedding_dim` | Dimension of embedding vectors |
| `padding_idx` | Index to zero out (for padding tokens) |

**Vocab size effects:**
- Larger vocab = more parameters = more memory
- GPT-2: 50,257 tokens × 768 dims = ~39M parameters just for embeddings!
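
Both points can be verified in a few lines. The sketch below uses a deliberately tiny embedding table (the sizes 10 and 4 are arbitrary) to show that the layer is just row indexing into its weight matrix, then repeats the GPT-2 arithmetic.

```python
import torch
import torch.nn as nn

emb = nn.Embedding(num_embeddings=10, embedding_dim=4)
ids = torch.tensor([[1, 3], [7, 1]])  # [batch=2, seq_len=2]

# The forward pass is a table lookup: each ID selects one row of emb.weight
lookup_is_indexing = torch.equal(emb(ids), emb.weight[ids])
print(lookup_is_indexing)  # True

# Parameter count is simply vocab_size * embedding_dim
gpt2_embedding_params = 50_257 * 768
print(f"{gpt2_embedding_params:,}")  # 38,597,376
```
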

In [28]:
# ─────────────────────────────────────────────────────────────────────────────
# EMBEDDING LAYER INTERNALS
# ─────────────────────────────────────────────────────────────────────────────

vocab_size = 10000
embed_dim = 256
pad_idx = 0  # Token 0 is padding

# Create embedding layer
embedding = nn.Embedding(
    num_embeddings=vocab_size,
    embedding_dim=embed_dim,
    padding_idx=pad_idx  # This index will always output zeros
)

# Input: token indices [batch_size, seq_len]
tokens = torch.tensor([
    [5, 23, 456, 0, 0],    # Sentence 1 (padded)
    [12, 34, 56, 78, 0]    # Sentence 2 (padded)
])

# Output: embeddings [batch_size, seq_len, embed_dim]
embedded = embedding(tokens)
print(f"Input shape: {tokens.shape}")
print(f"Output shape: {embedded.shape}")

# Verify padding is zeroed
print(f"\nPadding embedding (should be zeros): {embedded[0, 3].sum().item()}")
print(f"Non-padding embedding: {embedded[0, 0].sum().item():.4f}")

Input shape: torch.Size([2, 5])
Output shape: torch.Size([2, 5, 256])

Padding embedding (should be zeros): 0.0
Non-padding embedding: -3.7477


In [29]:
# ─────────────────────────────────────────────────────────────────────────────
# WEIGHT TYING (Input/Output embeddings)
# ─────────────────────────────────────────────────────────────────────────────

# In language models, we often tie input embeddings with output projection
# This reduces the parameter count and often improves quality

class TiedEmbeddingLM(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.hidden = nn.Linear(embed_dim, hidden_dim)
        self.output = nn.Linear(hidden_dim, embed_dim)
        # Output projection to vocab (tied with embedding)
        self.lm_head = nn.Linear(embed_dim, vocab_size, bias=False)
        
        # TIE THE WEIGHTS
        self.lm_head.weight = self.embedding.weight
    
    def forward(self, x):
        x = self.embedding(x)           # [B, S] -> [B, S, E]
        x = self.hidden(x)              # [B, S, E] -> [B, S, H]
        x = torch.relu(x)
        x = self.output(x)              # [B, S, H] -> [B, S, E]
        logits = self.lm_head(x)        # [B, S, E] -> [B, S, V]
        return logits

model = TiedEmbeddingLM(10000, 256, 512)

# Verify weights are the same object
print(f"Weights tied: {model.lm_head.weight is model.embedding.weight}")

# Parameter count reduction
params_with_tie = sum(p.numel() for p in model.parameters())
print(f"Parameters with tying: {params_with_tie:,}")

Weights tied: True
Parameters with tying: 2,822,912


### 4.2 Attention from Scratch

**Important:** Even though PyTorch has `nn.MultiheadAttention`, you should implement it yourself once to understand LLMs.

**Attention Formula:**
$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

**Key Concepts:**
- **Q (Query):** What we're looking for
- **K (Key):** What we match against
- **V (Value):** What we retrieve
- **Scaling by $\sqrt{d_k}$:** Prevents softmax saturation with large dot products
- **Causal mask:** Prevents attending to future tokens (for autoregressive models)
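
The need for the $\sqrt{d_k}$ factor can be seen empirically. In this sketch, query and key entries are assumed to be independent with unit variance, so the raw dot product has a standard deviation of about $\sqrt{d_k}$; dividing by $\sqrt{d_k}$ brings it back to roughly 1 regardless of head size.

```python
import math
import torch

torch.manual_seed(0)
stats = {}
for d_k in (16, 256):
    q = torch.randn(10_000, d_k)
    k = torch.randn(10_000, d_k)
    raw = (q * k).sum(dim=-1)        # 10,000 independent dot products
    scaled = raw / math.sqrt(d_k)
    stats[d_k] = (raw.std().item(), scaled.std().item())
    print(f"d_k={d_k:3d}  raw std={stats[d_k][0]:.2f}  scaled std={stats[d_k][1]:.2f}")
```

Without scaling, larger heads would feed larger logits into the softmax, pushing it towards one-hot outputs with very small gradients.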

In [30]:
# ─────────────────────────────────────────────────────────────────────────────
# SCALED DOT-PRODUCT ATTENTION FROM SCRATCH
# ─────────────────────────────────────────────────────────────────────────────

import torch.nn.functional as F
import math

def scaled_dot_product_attention(query, key, value, mask=None):
    """
    Args:
        query: [batch, seq_len, d_k]
        key:   [batch, seq_len, d_k]
        value: [batch, seq_len, d_v]
        mask:  [seq_len, seq_len] or None
    Returns:
        output: [batch, seq_len, d_v]
        attention_weights: [batch, seq_len, seq_len]
    """
    d_k = query.size(-1)
    
    # Compute attention scores: Q @ K^T / sqrt(d_k)
    scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)
    # scores: [batch, seq_len, seq_len]
    
    # Apply mask (for causal attention)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float('-inf'))
    
    # Softmax to get attention weights
    attention_weights = F.softmax(scores, dim=-1)
    
    # Apply attention to values
    output = torch.matmul(attention_weights, value)
    
    return output, attention_weights

# Test it
batch_size, seq_len, d_model = 2, 4, 8
Q = torch.randn(batch_size, seq_len, d_model)
K = torch.randn(batch_size, seq_len, d_model)
V = torch.randn(batch_size, seq_len, d_model)

output, weights = scaled_dot_product_attention(Q, K, V)
print(f"Output shape: {output.shape}")
print(f"Attention weights shape: {weights.shape}")
print(f"Attention weights sum (should be 1): {weights[0, 0].sum():.4f}")

Output shape: torch.Size([2, 4, 8])
Attention weights shape: torch.Size([2, 4, 4])
Attention weights sum (should be 1): 1.0000


In [31]:
# ─────────────────────────────────────────────────────────────────────────────
# CAUSAL MASK FOR AUTOREGRESSIVE MODELS
# ─────────────────────────────────────────────────────────────────────────────

def create_causal_mask(seq_len):
    """Create a causal (lower triangular) mask."""
    # 1s in lower triangle (including diagonal), 0s above
    mask = torch.tril(torch.ones(seq_len, seq_len))
    return mask

# Visualize the mask
seq_len = 5
mask = create_causal_mask(seq_len)
print("Causal mask (1=attend, 0=block):")
print(mask)

# Apply causal attention
Q = torch.randn(1, seq_len, 8)
K = torch.randn(1, seq_len, 8)
V = torch.randn(1, seq_len, 8)

output, weights = scaled_dot_product_attention(Q, K, V, mask=mask)
print(f"\nCausal attention weights for position 2:")
print(f"  {weights[0, 2]}")  # Can only attend to positions 0, 1, 2

Causal mask (1=attend, 0=block):
tensor([[1., 0., 0., 0., 0.],
        [1., 1., 0., 0., 0.],
        [1., 1., 1., 0., 0.],
        [1., 1., 1., 1., 0.],
        [1., 1., 1., 1., 1.]])

Causal attention weights for position 2:
  tensor([0.1933, 0.1789, 0.6278, 0.0000, 0.0000])
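
Once the manual version is understood, it can be compared with PyTorch's fused kernel. This sketch assumes PyTorch 2.0 or newer, where `F.scaled_dot_product_attention` is available; the manual computation is repeated inline so the block runs on its own.

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)
q = torch.randn(1, 5, 8)
k = torch.randn(1, 5, 8)
v = torch.randn(1, 5, 8)

# Manual causal attention, following the steps shown earlier
mask = torch.tril(torch.ones(5, 5))
scores = (q @ k.transpose(-2, -1)) / math.sqrt(q.size(-1))
scores = scores.masked_fill(mask == 0, float("-inf"))
manual = F.softmax(scores, dim=-1) @ v

# Built-in version: is_causal=True applies the same lower-triangular mask
fused = F.scaled_dot_product_attention(q, k, v, is_causal=True)

outputs_match = torch.allclose(manual, fused, atol=1e-4)
print(outputs_match)
```

In real models the built-in call is usually preferable, since it can dispatch to memory-efficient kernels, but it returns only the output and not the attention weights.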


In [32]:
# ─────────────────────────────────────────────────────────────────────────────
# MULTI-HEAD ATTENTION FROM SCRATCH
# ─────────────────────────────────────────────────────────────────────────────

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        
        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads  # Dimension per head
        
        # Linear projections for Q, K, V
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        
        # Output projection
        self.W_o = nn.Linear(d_model, d_model)
    
    def forward(self, x, mask=None):
        batch_size, seq_len, _ = x.shape
        
        # Project to Q, K, V
        Q = self.W_q(x)  # [B, S, D]
        K = self.W_k(x)
        V = self.W_v(x)
        
        # Reshape for multi-head: [B, S, D] -> [B, S, H, D/H] -> [B, H, S, D/H]
        Q = Q.view(batch_size, seq_len, self.num_heads, self.d_k).transpose(1, 2)
        K = K.view(batch_size, seq_len, self.num_heads, self.d_k).transpose(1, 2)
        V = V.view(batch_size, seq_len, self.num_heads, self.d_k).transpose(1, 2)
        
        # Scaled dot-product attention for each head
        scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)
        
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))
        
        attention = F.softmax(scores, dim=-1)
        context = torch.matmul(attention, V)  # [B, H, S, D/H]
        
        # Concatenate heads: [B, H, S, D/H] -> [B, S, H, D/H] -> [B, S, D]
        context = context.transpose(1, 2).contiguous().view(batch_size, seq_len, self.d_model)
        
        # Final projection
        output = self.W_o(context)
        
        return output

# Test it
mha = MultiHeadAttention(d_model=64, num_heads=8)
x = torch.randn(2, 10, 64)  # [batch=2, seq_len=10, d_model=64]
output = mha(x)
print(f"Multi-head attention output: {output.shape}")

Multi-head attention output: torch.Size([2, 10, 64])


### 4.3 Transformer Blocks

This is the core of GPT, LLaMA, Mistral, and all modern LLMs.

**Transformer Block Structure (post-norm layout from the original Transformer):**
```
Input
  │
  ├── Multi-Head Attention ──┐
  │                          │ (residual connection)
  └──────────────────────────┼──> Add & Norm
                             │
  ├── Feed-Forward Network ──┐
  │                          │ (residual connection)
  └──────────────────────────┼──> Add & Norm
                             │
                           Output
```

**Pre-norm vs Post-norm:**
- **Post-norm (original):** `LayerNorm(x + Sublayer(x))` - harder to train deep models
- **Pre-norm (modern):** `x + Sublayer(LayerNorm(x))` - more stable, used in GPT-2+
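
In code, the two orderings differ only in where the LayerNorm sits: post-norm normalizes the sum `x + Sublayer(x)`, while pre-norm normalizes the sublayer input and leaves the residual path untouched. The wrapper classes below are an illustrative sketch (their names are not from any library), with a zeroed linear layer standing in for the sublayer to make the difference visible.

```python
import torch
import torch.nn as nn

class PostNormResidual(nn.Module):
    """LayerNorm(x + Sublayer(x))"""
    def __init__(self, d_model, sublayer):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        return self.norm(x + self.sublayer(x))

class PreNormResidual(nn.Module):
    """x + Sublayer(LayerNorm(x))"""
    def __init__(self, d_model, sublayer):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        return x + self.sublayer(self.norm(x))

# Stand-in sublayer that outputs zeros, to isolate the residual path
sublayer = nn.Linear(8, 8)
nn.init.zeros_(sublayer.weight)
nn.init.zeros_(sublayer.bias)

x = torch.randn(2, 3, 8)
pre_out = PreNormResidual(8, sublayer)(x)
post_out = PostNormResidual(8, sublayer)(x)

print(torch.equal(pre_out, x))   # True: the input passes through the residual unchanged
print(torch.equal(post_out, x))  # False: the residual stream itself is renormalized
```

With pre-norm, stacking many blocks keeps a direct identity path from input to output, which is one common explanation for its better stability in deep models.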

In [33]:
# ─────────────────────────────────────────────────────────────────────────────
# TRANSFORMER BLOCK (Pre-norm, GPT-style)
# ─────────────────────────────────────────────────────────────────────────────

class TransformerBlock(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        
        # Multi-head attention
        self.attention = MultiHeadAttention(d_model, num_heads)
        self.norm1 = nn.LayerNorm(d_model)
        
        # Feed-forward network
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
        )
        self.norm2 = nn.LayerNorm(d_model)
        
        self.dropout = nn.Dropout(dropout)
    
    def forward(self, x, mask=None):
        # Pre-norm attention with residual
        attn_out = self.attention(self.norm1(x), mask)
        x = x + self.dropout(attn_out)
        
        # Pre-norm FFN with residual
        ffn_out = self.ffn(self.norm2(x))
        x = x + self.dropout(ffn_out)
        
        return x

# Test transformer block
block = TransformerBlock(d_model=64, num_heads=8, d_ff=256)
x = torch.randn(2, 10, 64)
output = block(x)
print(f"Transformer block: {x.shape} -> {output.shape}")

Transformer block: torch.Size([2, 10, 64]) -> torch.Size([2, 10, 64])


In [34]:
# ─────────────────────────────────────────────────────────────────────────────
# STACKING TRANSFORMER BLOCKS (Mini GPT)
# ─────────────────────────────────────────────────────────────────────────────

class MiniGPT(nn.Module):
    def __init__(self, vocab_size, d_model, num_heads, d_ff, num_layers, max_seq_len):
        super().__init__()
        
        # Token + positional embeddings
        self.token_embedding = nn.Embedding(vocab_size, d_model)
        self.position_embedding = nn.Embedding(max_seq_len, d_model)
        
        # Stack of transformer blocks
        self.blocks = nn.ModuleList([
            TransformerBlock(d_model, num_heads, d_ff)
            for _ in range(num_layers)
        ])
        
        # Final layer norm (pre-norm style)
        self.ln_f = nn.LayerNorm(d_model)
        
        # Output head
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)
        
        # Weight tying
        self.lm_head.weight = self.token_embedding.weight
    
    def forward(self, idx):
        B, T = idx.shape
        
        # Embeddings
        tok_emb = self.token_embedding(idx)
        pos_emb = self.position_embedding(torch.arange(T, device=idx.device))
        x = tok_emb + pos_emb
        
        # Causal mask
        mask = torch.tril(torch.ones(T, T, device=idx.device))
        
        # Transformer blocks
        for block in self.blocks:
            x = block(x, mask)
        
        # Output
        x = self.ln_f(x)
        logits = self.lm_head(x)
        
        return logits

# Create a small GPT model
model = MiniGPT(
    vocab_size=1000,
    d_model=64,
    num_heads=4,
    d_ff=256,
    num_layers=4,
    max_seq_len=128
)

# Test forward pass
tokens = torch.randint(0, 1000, (2, 32))
logits = model(tokens)
print(f"MiniGPT: {tokens.shape} -> {logits.shape}")
print(f"Total parameters: {sum(p.numel() for p in model.parameters()):,}")

MiniGPT: torch.Size([2, 32]) -> torch.Size([2, 32, 1000])
Total parameters: 272,256


### 4.4 Tokenization Ecosystem

The PyTorch LLM ecosystem relies heavily on Hugging Face tokenizers. Understanding them is essential for working with pretrained models.

**Key Concepts:**

| Concept | Description |
|---------|-------------|
| BPE (Byte-Pair Encoding) | Subword tokenization (GPT-2, GPT-3) |
| SentencePiece | Unigram or BPE, handles whitespace (LLaMA, T5) |
| `padding` | Add tokens to reach fixed length |
| `truncation` | Cut sequences to max length |
| `attention_mask` | 1 for real tokens, 0 for padding |

**Common Special Tokens:**
- `[PAD]` or `<pad>`: Padding token
- `[CLS]` or `<s>`: Start of sequence (`[CLS]` is BERT style)
- `[SEP]` or `</s>`: Separator / end of sequence
- `[UNK]` or `<unk>`: Unknown token
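
The relationship between `padding`, `truncation`, and `attention_mask` can be sketched without any library. The helper `pad_and_mask` below is hypothetical and only mimics what a tokenizer does for a single sequence with right-side padding; the token IDs are the GPT-2 IDs used in the example further down, with 50256 assumed as the pad ID.

```python
def pad_and_mask(token_ids, max_length, pad_id):
    """Truncate to max_length, right-pad with pad_id, and build the attention mask."""
    ids = token_ids[:max_length]                      # truncation
    n_pad = max_length - len(ids)
    input_ids = ids + [pad_id] * n_pad                # padding
    attention_mask = [1] * len(ids) + [0] * n_pad     # 1 = real token, 0 = padding
    return input_ids, attention_mask

ids = [15496, 11, 703, 389, 345, 30]

padded_ids, padded_mask = pad_and_mask(ids, max_length=10, pad_id=50256)
print(padded_ids)   # [15496, 11, 703, 389, 345, 30, 50256, 50256, 50256, 50256]
print(padded_mask)  # [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]

short_ids, short_mask = pad_and_mask(ids, max_length=4, pad_id=50256)
print(short_ids)    # [15496, 11, 703, 389]
print(short_mask)   # [1, 1, 1, 1]
```
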

In [35]:
# ─────────────────────────────────────────────────────────────────────────────
# HUGGING FACE TOKENIZERS (Example with GPT-2)
# ─────────────────────────────────────────────────────────────────────────────

# Note: Requires `pip install transformers`
# Uncomment to run:

# from transformers import AutoTokenizer

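# # Load tokenizer
# tokenizer = AutoTokenizer.from_pretrained("gpt2")
# tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token; reuse EOS (id 50256) so padding works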

# # Basic tokenization
# text = "Hello, how are you?"
# tokens = tokenizer.tokenize(text)
# print(f"Tokens: {tokens}")

# # Get token IDs
# token_ids = tokenizer.encode(text)
# print(f"Token IDs: {token_ids}")

# # Full encoding with attention mask
# encoded = tokenizer(
#     text,
#     padding="max_length",      # Pad to max_length
#     max_length=10,             # Maximum sequence length
#     truncation=True,           # Truncate if longer
#     return_tensors="pt"        # Return PyTorch tensors
# )
# print(f"Input IDs: {encoded['input_ids']}")
# print(f"Attention mask: {encoded['attention_mask']}")

# Simulated example for demonstration:
print("Tokenization example (simulated):")
print("Text: 'Hello, how are you?'")
print("Tokens: ['Hello', ',', ' how', ' are', ' you', '?']")
print("Token IDs: [15496, 11, 703, 389, 345, 30]")
print("With padding (max_length=10): [15496, 11, 703, 389, 345, 30, 50256, 50256, 50256, 50256]")
print("Attention mask: [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]")

Tokenization example (simulated):
Text: 'Hello, how are you?'
Tokens: ['Hello', ',', ' how', ' are', ' you', '?']
Token IDs: [15496, 11, 703, 389, 345, 30]
With padding (max_length=10): [15496, 11, 703, 389, 345, 30, 50256, 50256, 50256, 50256]
Attention mask: [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]


---

## 5. Training Language Models

*Now everything converges. This is identical in theory to TensorFlow, but loop-driven.*

### 5.1 Autoregressive Training

**Next-token prediction** is the core of language model training.

**Key Concepts:**
- **Shifting labels:** Input `[A, B, C, D]` → Targets `[B, C, D, E]`
- **Teacher forcing:** Use ground truth as input during training
- **Masking padding:** Don't compute loss on padding tokens

```
Input:   [START, The, cat, sat]
Target:  [The,   cat, sat, END]
```

In [36]:
# ─────────────────────────────────────────────────────────────────────────────
# AUTOREGRESSIVE TRAINING SETUP
# ─────────────────────────────────────────────────────────────────────────────

def prepare_lm_batch(token_ids):
    """
    Prepare input and target for language model training.
    
    Padding is not handled here: pass the pad token ID as `ignore_index`
    to CrossEntropyLoss so padded target positions are excluded from the loss.
    
    Args:
        token_ids: [batch, seq_len] - full sequences including BOS/EOS
    
    Returns:
        inputs: [batch, seq_len-1] - everything except last token
        targets: [batch, seq_len-1] - everything except first token
    """
    inputs = token_ids[:, :-1]   # All except last
    targets = token_ids[:, 1:]   # All except first (shifted by 1)
    return inputs, targets

# Example
# Full sequence: [BOS, The, cat, sat, on, EOS, PAD, PAD]
sequence = torch.tensor([
    [1, 45, 23, 67, 89, 2, 0, 0],  # Sentence 1
    [1, 12, 34, 56, 2, 0, 0, 0],   # Sentence 2 (shorter)
])

inputs, targets = prepare_lm_batch(sequence)
print(f"Full sequence: {sequence[0]}")
print(f"Input:         {inputs[0]}")
print(f"Target:        {targets[0]}")

Full sequence: tensor([ 1, 45, 23, 67, 89,  2,  0,  0])
Input:         tensor([ 1, 45, 23, 67, 89,  2,  0])
Target:        tensor([45, 23, 67, 89,  2,  0,  0])
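
Note that the target above still contains the padding ID 0. A minimal sketch, assuming `PAD_ID = 0` as in the example sequences and random logits as a stand-in for model output, of how `ignore_index` keeps those positions out of the loss:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
PAD_ID = 0                                       # assumed padding ID, as in the example above
logits = torch.randn(1, 4, 10)                   # [batch, seq, vocab], stand-in for model output
targets = torch.tensor([[5, 7, PAD_ID, PAD_ID]])

# Loss over all positions, with padded targets ignored
criterion = nn.CrossEntropyLoss(ignore_index=PAD_ID)
masked_loss = criterion(logits.reshape(-1, 10), targets.reshape(-1))

# The same loss computed by hand over the two real tokens only
real_only = nn.CrossEntropyLoss()(logits[0, :2], targets[0, :2])

print(torch.allclose(masked_loss, real_only))
```

With the default `reduction='mean'`, the average is taken over the non-ignored positions only, so padding does not dilute the loss.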


In [37]:
# ─────────────────────────────────────────────────────────────────────────────
# COMPLETE LM TRAINING LOOP
# ─────────────────────────────────────────────────────────────────────────────

def train_language_model(model, dataloader, optimizer, device, num_epochs=3, pad_id=0):
    """
    Complete training loop for a language model.
    """
    model.train()
    criterion = nn.CrossEntropyLoss(ignore_index=pad_id)  # Ignore padding in loss
    
    for epoch in range(num_epochs):
        total_loss = 0
        num_batches = 0
        
        for batch in dataloader:
            # Prepare batch
            token_ids = batch[0].to(device)
            inputs, targets = prepare_lm_batch(token_ids)
            
            # Forward pass
            logits = model(inputs)  # [B, S, V]
            
            # Reshape for loss: [B*S, V] and [B*S]
            loss = criterion(
                logits.view(-1, logits.size(-1)),
                targets.reshape(-1)
            )
            
            # Backward pass
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
            
            total_loss += loss.item()
            num_batches += 1
        
        avg_loss = total_loss / num_batches
        perplexity = math.exp(avg_loss)
        print(f"Epoch {epoch+1}/{num_epochs}, Loss: {avg_loss:.4f}, Perplexity: {perplexity:.2f}")

print("Training function defined - would train on real data")

Training function defined - would train on real data
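
As a sanity check on the perplexity reported by the loop: a model that predicts a uniform distribution over a vocabulary of size $V$ has loss $\log V$ and therefore perplexity $V$. A small sketch, where the vocabulary size of 1000 and the batch of 8 targets are illustrative values:

```python
import math
import torch
import torch.nn.functional as F

vocab_size = 1000
logits = torch.zeros(8, vocab_size)              # all-equal logits = uniform prediction
targets = torch.randint(0, vocab_size, (8,))

loss = F.cross_entropy(logits, targets)          # equals log(vocab_size)
perplexity = math.exp(loss.item())

print(f"Loss: {loss.item():.4f}, log(V): {math.log(vocab_size):.4f}")
print(f"Perplexity: {perplexity:.1f}")
```

An untrained model should start near this value; a perplexity well above the vocabulary size usually points to a bug in the loss setup.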


### 5.2 Mixed Precision and Performance

**This is standard practice for LLM training.** Mixed precision runs most operations in a 16-bit format (FP16 or BF16) while keeping numerically sensitive operations and the model weights in FP32 for stability.

**Key Components:**

| Component | Purpose |
|-----------|---------|
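| `autocast` | Runs eligible operations (e.g. matmuls) in 16-bit and keeps precision-sensitive ones in FP32 |
| `GradScaler` | Scales the loss so that small FP16 gradients do not underflow |
| FP16 | Half precision (16-bit), half the memory per value of FP32 |
| FP32 | Full precision (32-bit), wider range and more stable |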

**When FP16 breaks:**
- Very small gradients underflow to zero
- Solution: GradScaler multiplies loss, then unscales gradients
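
The underflow problem and the scaling fix can be reproduced on a single number. This is a sketch of the idea only, not of `GradScaler`'s internals; the gradient value `1e-8` is chosen for illustration, and the factor 65536 matches `GradScaler`'s default initial scale:

```python
import torch

grad = torch.tensor(1e-8)                  # a small gradient value, held in FP32
print(grad.half())                         # too small for FP16: underflows to zero

scale = 65536.0                            # 2**16
scaled = (grad * scale).half()             # after scaling, the value fits in FP16
recovered = scaled.float() / scale         # unscale in FP32 before the optimizer step

print(scaled, recovered)
```

Because the loss is multiplied by the scale before `backward()`, every gradient in the network is shifted into the representable range by the same factor.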

In [38]:
# ─────────────────────────────────────────────────────────────────────────────
# MIXED PRECISION TRAINING
# ─────────────────────────────────────────────────────────────────────────────

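# Recent PyTorch releases deprecate torch.cuda.amp in favor of
# torch.amp.autocast("cuda") and torch.amp.GradScaler("cuda")
from torch.cuda.amp import autocast, GradScaler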

def train_with_mixed_precision(model, dataloader, optimizer, device, num_epochs=1):
    """
    Training loop with automatic mixed precision (AMP).
    """
    scaler = GradScaler()  # Gradient scaler for stability
    criterion = nn.CrossEntropyLoss()
    
    for epoch in range(num_epochs):
        for batch_x, batch_y in dataloader:
            batch_x = batch_x.to(device)
            batch_y = batch_y.to(device)
            
            # Forward pass with autocast (FP16 where safe)
            with autocast():
                outputs = model(batch_x)
                loss = criterion(outputs, batch_y)
            
            # Backward pass with scaled gradients
            scaler.scale(loss).backward()
            
            # Optimizer step: scaler.step() unscales the gradients first and
            # skips the update if any of them are inf/NaN
            scaler.step(optimizer)
            scaler.update()
            
            optimizer.zero_grad()
    
    return model

# Note: torch.cuda.amp targets CUDA devices; without a GPU, autocast and
# GradScaler are disabled (with a warning) and everything runs in FP32
print("Mixed precision training pattern defined")
print("Memory savings: up to ~50% on activations (weights stay FP32)")
print("Speed improvement: often 2-3x on GPUs with Tensor Cores, workload dependent")

Mixed precision training pattern defined
Memory savings: up to ~50% on activations (weights stay FP32)
Speed improvement: often 2-3x on GPUs with Tensor Cores, workload dependent


### 5.3 Gradient Control

**Critical for stability.** Without gradient control, transformer training is prone to loss spikes and divergence caused by exploding gradients.

**Key Techniques:**

| Technique | When to Use |
|-----------|-------------|
| Gradient Clipping | Always for transformers |
| Gradient Accumulation | When batch size limited by memory |
| Learning Rate Warmup | First few thousand steps |

**Common Clip Values:**
- `max_norm=1.0` - standard for transformers
- `max_norm=0.5` - more conservative
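
Learning rate warmup from the table above can be sketched with `torch.optim.lr_scheduler.LambdaLR`. The helper name `linear_warmup`, the 4-step warmup, and the base rate of 0.1 are illustrative assumptions; real runs typically warm up over hundreds or thousands of steps:

```python
import torch

def linear_warmup(step, warmup_steps):
    """Multiplier that ramps linearly up to 1.0 and then stays there."""
    return min(1.0, (step + 1) / warmup_steps)

param = torch.nn.Parameter(torch.zeros(1))       # dummy parameter for the demo
optimizer = torch.optim.SGD([param], lr=0.1)
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda step: linear_warmup(step, warmup_steps=4)
)

lrs = []
for _ in range(6):
    lrs.append(optimizer.param_groups[0]["lr"])  # learning rate used for this step
    optimizer.step()                             # no gradients here, so nothing is updated
    scheduler.step()                             # advance the schedule once per optimizer step

print([round(lr, 3) for lr in lrs])
```

In a real loop `scheduler.step()` goes right after `optimizer.step()`, and the warmup is usually followed by a decay schedule, which this sketch leaves out.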

In [39]:
# ─────────────────────────────────────────────────────────────────────────────
# GRADIENT CLIPPING
# ─────────────────────────────────────────────────────────────────────────────

model = SimpleNet(10, 20, 5)
optimizer = torch.optim.Adam(model.parameters())
criterion = nn.CrossEntropyLoss()

# Simulate a forward-backward pass
x = torch.randn(32, 10)
y = torch.randint(0, 5, (32,))
loss = criterion(model(x), y)
loss.backward()

# Check gradient norms before clipping
total_norm_before = torch.sqrt(
    sum(p.grad.norm()**2 for p in model.parameters() if p.grad is not None)
)
print(f"Gradient norm before clipping: {total_norm_before:.4f}")

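# Gradient clipping (MUST be called after backward, before step).
# Gradients are rescaled only if their total norm exceeds max_norm;
# the call returns the total norm measured before clipping.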
max_norm = 1.0
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=max_norm)

# Check after clipping
total_norm_after = torch.sqrt(
    sum(p.grad.norm()**2 for p in model.parameters() if p.grad is not None)
)
print(f"Gradient norm after clipping:  {total_norm_after:.4f}")
print(f"Max norm allowed: {max_norm}")

Gradient norm before clipping: 0.4729
Gradient norm after clipping:  0.4729
Max norm allowed: 1.0
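
In the run above the norm (0.4729) was already below `max_norm`, so clipping changed nothing. A small sketch with a hand-set gradient (values chosen for illustration) shows the rescaling when the norm is too large:

```python
import torch

w = torch.nn.Parameter(torch.zeros(2))
w.grad = torch.tensor([3.0, 4.0])                # gradient norm = sqrt(9 + 16) = 5

# Returns the total norm measured before clipping
norm_before = torch.nn.utils.clip_grad_norm_([w], max_norm=1.0)
norm_after = w.grad.norm()

print(f"Before: {norm_before.item():.4f}, after: {norm_after.item():.4f}")
print(f"Clipped gradient: {w.grad}")
```

The direction of the gradient is preserved; only its length is scaled down to `max_norm`.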


In [40]:
# ─────────────────────────────────────────────────────────────────────────────
# GRADIENT ACCUMULATION (for limited GPU memory)
# ─────────────────────────────────────────────────────────────────────────────

def train_with_gradient_accumulation(
    model, dataloader, optimizer, device,
    accumulation_steps=4  # Effective batch = real_batch * accumulation_steps
):
    """
    Gradient accumulation allows larger effective batch sizes.
    
    If batch_size=8 and accumulation_steps=4, effective batch = 32
    """
    model.train()
    criterion = nn.CrossEntropyLoss()
    
    optimizer.zero_grad()  # Zero once at start
    
    for i, (batch_x, batch_y) in enumerate(dataloader):
        batch_x = batch_x.to(device)
        batch_y = batch_y.to(device)
        
        # Forward
        outputs = model(batch_x)
        loss = criterion(outputs, batch_y)
        
        # Scale loss by accumulation steps
        loss = loss / accumulation_steps
        
        # Backward (gradients accumulate automatically!)
        loss.backward()
        
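        # Only step every accumulation_steps micro-batches.
        # Note: if the number of batches is not a multiple of accumulation_steps,
        # the trailing partial group is never applied in this sketch.
        if (i + 1) % accumulation_steps == 0:
            # Optional: gradient clipping
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
            
            optimizer.step()
            optimizer.zero_grad()
            
            # Reports the unscaled loss of the last micro-batch only
            print(f"Step {(i+1) // accumulation_steps}, Loss: {loss.item() * accumulation_steps:.4f}")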

print("Gradient accumulation pattern defined")
print("Use when: GPU memory limits batch size, but you need larger effective batches")

Gradient accumulation pattern defined
Use when: GPU memory limits batch size, but you need larger effective batches
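
Why divide the loss by `accumulation_steps`? With a mean-reduced loss and equally sized micro-batches, the accumulated gradient then matches the gradient of one large batch. A minimal check, assuming a toy `nn.Linear` model and random data:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(4, 1)
criterion = nn.MSELoss()
x, y = torch.randn(8, 4), torch.randn(8, 1)

# Reference: one backward pass over the full batch of 8
model.zero_grad()
criterion(model(x), y).backward()
full_grad = model.weight.grad.clone()

# Accumulated: 4 micro-batches of 2, each loss divided by 4
accumulation_steps = 4
model.zero_grad()
for xb, yb in zip(x.chunk(accumulation_steps), y.chunk(accumulation_steps)):
    (criterion(model(xb), yb) / accumulation_steps).backward()
accum_grad = model.weight.grad.clone()

print(torch.allclose(full_grad, accum_grad, atol=1e-5))
```

The equivalence is exact only up to floating-point error, and it does not hold for layers that depend on batch statistics, such as BatchNorm.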


---

## 6. Generation and Inference

*This is where models become useful.*

### 6.1 Decoding Mechanics

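All decoding methods start from the logits for the next token. Greedy decoding takes the argmax directly; sampling methods first convert the (possibly filtered) logits into probabilities with softmax.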

**Decoding Strategies:**

| Strategy | Description | Temperature |
|----------|-------------|-------------|
| Greedy | Always pick highest probability | N/A |
| Temperature | Scale logits before softmax | 0.1-2.0 |
| Top-k | Sample from top k tokens | Usually with temp |
| Top-p (nucleus) | Sample from smallest set with cumulative prob ≥ p | Usually with temp |
| Repetition penalty | Reduce probability of already-generated tokens | N/A |

**Key Formula:**
$$P(x_i) = \frac{e^{z_i / T}}{\sum_j e^{z_j / T}}$$

Where $T$ is temperature:
- $T < 1$: More confident (sharper distribution)
- $T > 1$: More random (flatter distribution)
- $T = 1$: Original distribution
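
To make the effect of $T$ concrete, the formula can be evaluated on a small logit vector. The values below are illustrative:

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([2.0, 1.5, 0.5, 0.1, -1.0])

top_prob = {}
for T in (0.5, 1.0, 2.0):
    probs = F.softmax(logits / T, dim=-1)        # P(x_i) = exp(z_i / T) / sum_j exp(z_j / T)
    top_prob[T] = probs.max().item()
    print(f"T={T}: {[round(p, 3) for p in probs.tolist()]}")
```

Lower temperatures concentrate probability mass on the highest-logit token, while higher temperatures spread it across the alternatives.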

**Stopping Conditions:**
- `max_length`: Hard limit on total sequence length
- `max_new_tokens`: Limit on generated tokens only
- `eos_token_id`: Stop when EOS token is generated
- `min_length`: Don't stop before minimum length

In [41]:
# ─────────────────────────────────────────────────────────────────────────────
# DECODING STRATEGIES
# ─────────────────────────────────────────────────────────────────────────────

def greedy_decode(logits):
    """Always pick the highest probability token."""
    return logits.argmax(dim=-1)

def temperature_sample(logits, temperature=1.0):
    """Sample with temperature scaling."""
    scaled_logits = logits / temperature
    probs = F.softmax(scaled_logits, dim=-1)
    return torch.multinomial(probs, num_samples=1).squeeze(-1)

def top_k_sample(logits, k=50, temperature=1.0):
    """Sample from top-k tokens only."""
    scaled_logits = logits / temperature
    
    # Keep only the k largest logits (k is capped at the vocabulary size)
    k = min(k, scaled_logits.size(-1))
    top_k_logits, top_k_indices = torch.topk(scaled_logits, k)
    
    # Sample from top-k
    probs = F.softmax(top_k_logits, dim=-1)
    sampled_idx = torch.multinomial(probs, num_samples=1)
    
    # Map back to vocabulary indices
    return top_k_indices.gather(-1, sampled_idx).squeeze(-1)

def top_p_sample(logits, p=0.9, temperature=1.0):
    """Sample from smallest set with cumulative probability >= p."""
    scaled_logits = logits / temperature
    sorted_logits, sorted_indices = torch.sort(scaled_logits, descending=True)
    cumulative_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)
    
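    # Mark tokens past the point where cumulative probability exceeds p.
    # Shifting the mask right by one keeps the token that crosses the
    # threshold, and the top token is always kept.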
    sorted_indices_to_remove = cumulative_probs > p
    sorted_indices_to_remove[..., 1:] = sorted_indices_to_remove[..., :-1].clone()
    sorted_indices_to_remove[..., 0] = False
    
    # Set removed tokens to -inf
    sorted_logits[sorted_indices_to_remove] = float('-inf')
    
    # Sample
    probs = F.softmax(sorted_logits, dim=-1)
    sampled_idx = torch.multinomial(probs, num_samples=1)
    
    return sorted_indices.gather(-1, sampled_idx).squeeze(-1)

def apply_repetition_penalty(logits, generated_ids, penalty=1.2):
    """
    Apply repetition penalty to already-generated tokens.
    Assumes batch size 1 (logits: [1, vocab]) and modifies logits in place.
    penalty > 1.0: discourage repetition
    penalty < 1.0: encourage repetition (rarely used)
    """
    for token_id in set(generated_ids.tolist()):
        # If logit is positive, divide by penalty (reduce)
        # If logit is negative, multiply by penalty (make more negative)
        if logits[0, token_id] > 0:
            logits[0, token_id] /= penalty
        else:
            logits[0, token_id] *= penalty
    return logits

# Demonstrate with example logits
logits = torch.tensor([[2.0, 1.5, 0.5, 0.1, -1.0]])  # 5 tokens

print("Logits:", logits)
print(f"Greedy:      {greedy_decode(logits).item()}")
print(f"Temp=0.5:    {temperature_sample(logits, 0.5).item()}")
print(f"Temp=2.0:    {temperature_sample(logits, 2.0).item()}")
print(f"Top-k (k=2): {top_k_sample(logits, k=2).item()}")
print(f"Top-p (p=0.9): {top_p_sample(logits, p=0.9).item()}")

# Demonstrate repetition penalty
generated = torch.tensor([0, 0, 1])  # Token 0 appeared twice
logits_copy = logits.clone()
penalized = apply_repetition_penalty(logits_copy, generated, penalty=1.5)
print(f"\nOriginal logits: {logits}")
print(f"After rep penalty (tokens 0,1 seen): {penalized}")

Logits: tensor([[ 2.0000,  1.5000,  0.5000,  0.1000, -1.0000]])
Greedy:      0
Temp=0.5:    1
Temp=2.0:    1
Top-k (k=2): 1
Top-p (p=0.9): 0

Original logits: tensor([[ 2.0000,  1.5000,  0.5000,  0.1000, -1.0000]])
After rep penalty (tokens 0,1 seen): tensor([[ 1.3333,  1.0000,  0.5000,  0.1000, -1.0000]])


In [42]:
# ─────────────────────────────────────────────────────────────────────────────
# COMPLETE GENERATION FUNCTION
# ─────────────────────────────────────────────────────────────────────────────

@torch.no_grad()
def generate(
    model,
    prompt_ids,          # [batch, prompt_len]
    max_new_tokens=50,
    temperature=1.0,
    top_k=None,
    top_p=None,
    eos_token_id=None
):
    """
    Autoregressive generation with various sampling strategies.
    """
    model.eval()
    generated = prompt_ids.clone()
    
    for _ in range(max_new_tokens):
        # Get logits for last position only
        logits = model(generated)[:, -1, :]  # [batch, vocab]
        
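        # Apply sampling strategy. top_k takes precedence over top_p (they are
        # not combined); with temperature == 1.0 and neither set, decoding is greedy.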
        if top_k is not None:
            next_token = top_k_sample(logits, k=top_k, temperature=temperature)
        elif top_p is not None:
            next_token = top_p_sample(logits, p=top_p, temperature=temperature)
        elif temperature != 1.0:
            next_token = temperature_sample(logits, temperature=temperature)
        else:
            next_token = greedy_decode(logits)
        
        # Append to sequence
        generated = torch.cat([generated, next_token.unsqueeze(-1)], dim=-1)
        
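        # Stop only when every sequence in the batch emits EOS at this step
        # (sequences that finished earlier keep generating until then)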
        if eos_token_id is not None and (next_token == eos_token_id).all():
            break
    
    return generated

print("Generation function defined")
print("Usage: generate(model, prompt_ids, max_new_tokens=100, temperature=0.8, top_p=0.9)")

Generation function defined
Usage: generate(model, prompt_ids, max_new_tokens=100, temperature=0.8, top_p=0.9)
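
The `generate` function above stops only when all sequences emit EOS in the same step. A common refinement tracks a per-sequence `finished` flag and pads sequences that have already ended. The helper `append_token` and the `EOS`/`PAD` IDs below are hypothetical, a sketch of the idea rather than how any particular library implements it:

```python
import torch

def append_token(generated, next_token, finished, eos_token_id, pad_token_id):
    """Append one token per sequence, forcing PAD for sequences that already ended."""
    next_token = torch.where(finished, torch.full_like(next_token, pad_token_id), next_token)
    generated = torch.cat([generated, next_token.unsqueeze(-1)], dim=-1)
    finished = finished | (next_token == eos_token_id)
    return generated, finished

EOS, PAD = 2, 0                                   # illustrative token IDs
generated = torch.tensor([[5, 6], [7, 8]])        # two prompts of length 2
finished = torch.zeros(2, dtype=torch.bool)

# Step 1: sequence 0 emits EOS, sequence 1 keeps going
generated, finished = append_token(generated, torch.tensor([2, 9]), finished, EOS, PAD)
# Step 2: sequence 0 is finished, so its sampled token (4) is replaced by PAD
generated, finished = append_token(generated, torch.tensor([4, 2]), finished, EOS, PAD)

print(generated)
print(finished.all())                             # loop can stop once this is True
```

In a generation loop, `finished.all()` would replace the `(next_token == eos_token_id).all()` check.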


### 6.2 KV Caching

**This separates toy LMs from real ones.** Without KV caching, every generation step reprocesses the whole sequence, so the attention cost of a single step is O(n²) in the current length n.

**The Problem:**
- Without cache: At each step, recompute attention for ALL previous tokens
- With cache: Store K and V from previous steps, only compute for new token

**How it works:**
```
Step 1: Compute K₁, V₁ for token 1 → Store in cache
Step 2: Compute K₂, V₂ for token 2 → Append to cache, attend to [K₁,K₂], [V₁,V₂]
Step 3: Compute K₃, V₃ for token 3 → Append to cache, attend to [K₁,K₂,K₃], [V₁,V₂,V₃]
```

**Speed improvement:** Attention cost per generated token drops from O(n²) to O(n); over a full generation of n tokens, from O(n³) to O(n²).

In [43]:
# ─────────────────────────────────────────────────────────────────────────────
# KV CACHE IMPLEMENTATION
# ─────────────────────────────────────────────────────────────────────────────

class CachedMultiHeadAttention(nn.Module):
    """Multi-head attention with KV cache for efficient generation."""
    
    def __init__(self, d_model, num_heads):
        super().__init__()
        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads
        
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)
    
    def forward(self, x, kv_cache=None, use_cache=False):
        """
        Args:
            x: [B, S, D] - if cached, S=1 (only new token)
            kv_cache: tuple of (cached_k, cached_v) or None
            use_cache: whether to return updated cache
        """
        B, S, _ = x.shape
        
        # Compute Q, K, V for new tokens
        Q = self.W_q(x).view(B, S, self.num_heads, self.d_k).transpose(1, 2)
        K = self.W_k(x).view(B, S, self.num_heads, self.d_k).transpose(1, 2)
        V = self.W_v(x).view(B, S, self.num_heads, self.d_k).transpose(1, 2)
        
        # Concatenate with cache if exists
        if kv_cache is not None:
            cached_k, cached_v = kv_cache
            K = torch.cat([cached_k, K], dim=2)  # [B, H, S_cached+S, d_k]
            V = torch.cat([cached_v, V], dim=2)
        
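        # Attention scores: [B, H, S, T], where T = cached + new positions
        scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)
        
        # Causal mask, needed when several tokens are processed at once (the prompt).
        # Query i sits at absolute position T - S + i and must not see later keys.
        T = K.size(2)
        if S > 1:
            causal = torch.ones(S, T, dtype=torch.bool, device=x.device).tril(diagonal=T - S)
            scores = scores.masked_fill(~causal, float('-inf'))
        
        attn = F.softmax(scores, dim=-1)
        out = torch.matmul(attn, V)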
        
        # Reshape and project
        out = out.transpose(1, 2).contiguous().view(B, S, self.d_model)
        out = self.W_o(out)
        
        if use_cache:
            return out, (K, V)  # Return updated cache
        return out

# Demonstrate cache shapes
attn = CachedMultiHeadAttention(64, 4)
x = torch.randn(1, 10, 64)  # Initial prompt

# First forward (no cache)
out, cache = attn(x, kv_cache=None, use_cache=True)
print(f"Initial: input {x.shape}, cache K shape: {cache[0].shape}")

# Second forward (with cache, only 1 new token)
new_token = torch.randn(1, 1, 64)
out, cache = attn(new_token, kv_cache=cache, use_cache=True)
print(f"After 1 token: cache K shape: {cache[0].shape}")

Initial: input torch.Size([1, 10, 64]), cache K shape: torch.Size([1, 4, 10, 16])
After 1 token: cache K shape: torch.Size([1, 4, 11, 16])
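
Caching is only valid if it reproduces what full causal attention would compute. The following self-contained check uses a single head with no batch dimension and random projection matrices (all illustrative assumptions, independent of the class above) and compares the two paths:

```python
import math
import torch

torch.manual_seed(0)
n, d = 5, 8
W_q, W_k, W_v = (torch.randn(d, d) for _ in range(3))
x = torch.randn(n, d)                             # n tokens, single head, no batch

# Reference: full causal attention over the whole sequence at once
Q, K, V = x @ W_q, x @ W_k, x @ W_v
scores = (Q @ K.T) / math.sqrt(d)
causal = torch.tril(torch.ones(n, n, dtype=torch.bool))
full_out = torch.softmax(scores.masked_fill(~causal, float("-inf")), dim=-1) @ V

# Incremental: one token at a time, reusing cached keys and values
k_cache, v_cache = torch.empty(0, d), torch.empty(0, d)
step_outs = []
for t in range(n):
    x_t = x[t : t + 1]                                # [1, d], the new token only
    k_cache = torch.cat([k_cache, x_t @ W_k], dim=0)  # [t+1, d]
    v_cache = torch.cat([v_cache, x_t @ W_v], dim=0)
    s = (x_t @ W_q) @ k_cache.T / math.sqrt(d)        # [1, t+1], no mask needed
    step_outs.append(torch.softmax(s, dim=-1) @ v_cache)
cached_out = torch.cat(step_outs, dim=0)

print(torch.allclose(full_out, cached_out, atol=1e-4))
```

The single-token steps need no mask because the cache contains only past positions; the mask matters only when several tokens are processed in one call.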


---

## 7. Ecosystem and Real-world PyTorch

*This is why PyTorch dominates LLMs.*

### 7.1 Hugging Face Transformers

**Do not treat this as magic. Read the source.**

Hugging Face provides pretrained models, tokenizers, and training utilities that are the standard in NLP.

**Key Components:**

| Component | Description |
|-----------|-------------|
| `AutoModel` | Loads the base architecture for a checkpoint, without a task head (use `AutoModelForCausalLM` for text generation) |
| `AutoTokenizer` | Automatically loads the right tokenizer |
| `AutoConfig` | Model configuration (hidden size, layers, etc.) |
| `Trainer` | High-level training API |

**Model Loading Pattern:**
```python
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2")  # base model: returns hidden states, not logits
```

In [44]:
# ─────────────────────────────────────────────────────────────────────────────
# HUGGING FACE TRANSFORMERS PATTERNS
# ─────────────────────────────────────────────────────────────────────────────

# Note: Requires `pip install transformers`
# This is pseudocode/template - uncomment to run with transformers installed

"""
from transformers import AutoModel, AutoTokenizer, AutoModelForCausalLM

# ─── BASIC LOADING ───
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# ─── TOKENIZATION ───
text = "Hello, how are you?"
inputs = tokenizer(text, return_tensors="pt")
# Returns: {'input_ids': tensor, 'attention_mask': tensor}

# ─── INFERENCE ───
model.eval()
with torch.no_grad():
    outputs = model(**inputs)
    # outputs.logits: [batch, seq_len, vocab_size]
    # outputs.past_key_values: KV cache for generation

# ─── GENERATION ───
generated = model.generate(
    **inputs,                               # passes input_ids and attention_mask
    max_new_tokens=50,
    temperature=0.8,
    top_p=0.9,
    do_sample=True,
    pad_token_id=tokenizer.eos_token_id     # GPT-2 defines no pad token
)
print(tokenizer.decode(generated[0]))

# ─── FINE-TUNING ───
# Just use normal PyTorch training loop!
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
# ... training loop ...

# ─── ACCESS CONFIG ───
print(f"Hidden size: {model.config.hidden_size}")
print(f"Num layers: {model.config.num_hidden_layers}")
print(f"Num heads: {model.config.num_attention_heads}")
"""

print("Hugging Face patterns defined as template")
print("Install with: pip install transformers")

Hugging Face patterns defined as template
Install with: pip install transformers


### 7.2 Model Saving and Loading

This is PyTorch's counterpart to TensorFlow's checkpoints and SavedModel.

**Key Concepts:**

| Method | What it saves | Use Case |
|--------|---------------|----------|
| `state_dict` | Only weights | Recommended for most cases |
| `torch.save(model, path)` | Entire pickled model object | Quick prototyping (breaks if the class definition moves or changes) |
| Checkpoints | Weights + optimizer + epoch | Resume training |

**Device-agnostic Loading:**
Always use `map_location` when loading to avoid device mismatches.

In [45]:
# ─────────────────────────────────────────────────────────────────────────────
# MODEL SAVING AND LOADING
# ─────────────────────────────────────────────────────────────────────────────

model = SimpleNet(10, 20, 5)

# ─── METHOD 1: Save state_dict (RECOMMENDED) ───
# Save
torch.save(model.state_dict(), 'model_weights.pth')

# Load
model_new = SimpleNet(10, 20, 5)  # Must create model first
model_new.load_state_dict(torch.load('model_weights.pth'))
print("State dict loaded successfully")

# ─── METHOD 2: Device-agnostic loading ───
# Save on GPU, load on CPU (or vice versa)
torch.save(model.state_dict(), 'model_weights.pth')

# Load to specific device
device = torch.device('cpu')
model_new.load_state_dict(
    torch.load('model_weights.pth', map_location=device)
)
print("Device-agnostic loading successful")

# Clean up
import os
os.remove('model_weights.pth')

State dict loaded successfully
Device-agnostic loading successful


In [46]:
# ─────────────────────────────────────────────────────────────────────────────
# FULL CHECKPOINT (for resuming training)
# ─────────────────────────────────────────────────────────────────────────────

def save_checkpoint(model, optimizer, epoch, loss, path):
    """Save complete training state."""
    torch.save({
        'epoch': epoch,
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
        'loss': loss,
    }, path)
    print(f"Checkpoint saved: epoch {epoch}, loss {loss:.4f}")

def load_checkpoint(model, optimizer, path, device='cpu'):
    """Load complete training state."""
    checkpoint = torch.load(path, map_location=device)
    
    model.load_state_dict(checkpoint['model_state_dict'])
    optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
    
    return checkpoint['epoch'], checkpoint['loss']

# Example usage
model = SimpleNet(10, 20, 5)
optimizer = torch.optim.Adam(model.parameters())

# Save
save_checkpoint(model, optimizer, epoch=5, loss=0.234, path='checkpoint.pth')

# Load
loaded_epoch, loaded_loss = load_checkpoint(model, optimizer, 'checkpoint.pth')
print(f"Resumed from epoch {loaded_epoch}, loss {loaded_loss:.4f}")

# Clean up
os.remove('checkpoint.pth')

Checkpoint saved: epoch 5, loss 0.2340
Resumed from epoch 5, loss 0.2340


### 7.3 Distributed Training Concepts

**You don't need to implement this yet, just understand it.**

**Key Concepts:**

| Approach | Description | Use Case |
|----------|-------------|----------|
| `DataParallel` (DP) | Split batch across GPUs, single process | Quick prototyping |
| `DistributedDataParallel` (DDP) | One process per GPU, synchronized | Production training |
| FSDP | Shard model + gradients + optimizer | Very large models |

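**Why DDP is preferred over DP:**
- DP drives all GPUs from a single Python process, so the GIL (Global Interpreter Lock) and per-step model replication become bottlenecks
- DDP has better GPU utilization
- DDP scales to multiple nodes, while DP is limited to a single machine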

**Gradient Synchronization:**
- After backward pass, gradients are averaged across all processes
- All processes have identical gradients before optimizer step

In [47]:
# ─────────────────────────────────────────────────────────────────────────────
# DISTRIBUTED TRAINING PATTERNS (Conceptual)
# ─────────────────────────────────────────────────────────────────────────────

# These are templates - actual distributed training requires proper setup

"""
# ─── DATA PARALLEL (Simple, but limited) ───
model = nn.DataParallel(model)
# Automatically splits batch across available GPUs
# But: single process, GIL bottleneck

# ─── DISTRIBUTED DATA PARALLEL (Production) ───
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Initialize process group
dist.init_process_group(backend='nccl')

# Wrap model (requires `import os`)
local_rank = int(os.environ['LOCAL_RANK'])
torch.cuda.set_device(local_rank)  # Bind this process to its own GPU
model = model.to(local_rank)
model = DDP(model, device_ids=[local_rank])

# Use DistributedSampler for data
from torch.utils.data.distributed import DistributedSampler
sampler = DistributedSampler(dataset)
dataloader = DataLoader(dataset, sampler=sampler)

# Training loop is the same, except: call sampler.set_epoch(epoch) at the
# start of each epoch so the shuffling order differs between epochs
# DDP automatically synchronizes gradients

# Launch with:
# torchrun --nproc_per_node=4 train.py

# ─── FSDP (For very large models) ───
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

model = FSDP(model)
# Shards model parameters, gradients, and optimizer states
# Enables training models that don't fit on single GPU
"""

print("Distributed training concepts:")
print("• DataParallel: Easy but limited (GIL bottleneck)")
print("• DDP: Production standard (one process per GPU)")
print("• FSDP: For models larger than single GPU memory")
print("\nLaunch DDP training with: torchrun --nproc_per_node=N train.py")

Distributed training concepts:
• DataParallel: Easy but limited (GIL bottleneck)
• DDP: Production standard (one process per GPU)
• FSDP: For models larger than single GPU memory

Launch DDP training with: torchrun --nproc_per_node=N train.py


---

## 8. Final Notes

### 8.1 Common Questions & Answers

**Q: How does PyTorch build graphs compared to TensorFlow?**

PyTorch uses **dynamic graphs (define-by-run)**: the computation graph is built on the fly during execution. Each forward pass creates a new graph, which is used for backprop and then discarded. TensorFlow 1.x used **static graphs (define-then-run)**: you build the entire graph first, then execute it in a session. TF 2.x made eager mode the default (like PyTorch), but `@tf.function` still compiles to static graphs for performance. PyTorch's approach is more Pythonic and easier to debug, while static graphs enable more whole-program optimization; PyTorch 2.x narrows that gap with `torch.compile`.

---

**Q: Why do LLMs use decoder-only transformers?**

Because language modeling is **autoregressive** — predicting the next token given all previous tokens. Decoder-only architectures use **causal masking** to prevent attending to future tokens, which is exactly what you need for generation. Encoder-decoder (like T5) is for sequence-to-sequence tasks (translation, summarization) where you have a complete input. Encoder-only (like BERT) is for understanding tasks (classification, NER) where you need bidirectional context. For pure generation (GPT, LLaMA, Claude), decoder-only is simpler and scales better.

---

**Q: Why are logits passed to loss functions (not softmax)?**

**Numerical stability.** `CrossEntropyLoss` internally computes `log_softmax(logits)`, which uses the log-sum-exp trick to avoid overflow/underflow:

$$\log(\text{softmax}(x_i)) = x_i - \log\sum_j e^{x_j}$$

If you pass probabilities (after softmax), taking `log` of very small values causes numerical issues. Also, computing softmax then log is redundant work — `log_softmax` fuses these operations efficiently.
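
The failure mode is easy to trigger with an extreme logit. In this sketch (values chosen to force the issue), taking `log` after `softmax` yields an infinite loss, while `cross_entropy` on raw logits returns the finite value $x_j - x_i$-style difference directly:

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[1000.0, 0.0]])
target = torch.tensor([1])                        # the "unlikely" class

# Naive: softmax first, then log. softmax gives exactly 0 for class 1, so log -> -inf
naive = -torch.log(torch.softmax(logits, dim=-1))[0, 1]

# Fused: log_softmax computed as x_i - logsumexp(x), which stays finite
stable = F.cross_entropy(logits, target)

print(f"naive:  {naive.item()}")
print(f"stable: {stable.item()}")
```

An infinite loss produces NaN gradients on the next backward pass, which is why loss functions take logits rather than probabilities.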

---

**Q: Why is LayerNorm used instead of BatchNorm in transformers?**

**BatchNorm** normalizes across the batch dimension — it computes mean/variance over all samples for each feature. This breaks with:
- **Variable sequence lengths** (different positions shouldn't share statistics)
- **Small batches** (unstable statistics)
- **Inference** (needs running statistics from training)

**LayerNorm** normalizes across the feature dimension for each sample independently. Each token gets normalized by its own statistics, making it batch-size agnostic and suitable for autoregressive generation where batch size is often 1.
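
The difference can be observed directly: changing the other samples in a batch alters BatchNorm's output for the first sample but not LayerNorm's. A minimal sketch with random data and freshly initialized layers:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
ln = nn.LayerNorm(4)
bn = nn.BatchNorm1d(4)      # training mode by default: normalizes with batch statistics

x = torch.randn(3, 4)
x_other = x.clone()
x_other[1:] += 5.0          # change every sample except the first

ln_unchanged = torch.allclose(ln(x)[0], ln(x_other)[0])
bn_unchanged = torch.allclose(bn(x)[0], bn(x_other)[0])

print(f"LayerNorm output for sample 0 unchanged: {ln_unchanged}")
print(f"BatchNorm output for sample 0 unchanged: {bn_unchanged}")
```

In `eval()` mode BatchNorm switches to its running statistics, which removes the batch dependence at inference but not during training.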

---

**Q: Why does mixed precision require scaling?**

**FP16 has limited range:** max value ~65,504, smallest positive value ~6e-8. During backprop, small gradients can underflow to zero and large ones can overflow to inf. The **GradScaler** multiplies the loss by a large factor (65,536 by default in PyTorch) before `.backward()`, which scales up all gradients proportionally. Inside `scaler.step(optimizer)` the gradients are unscaled before the update. If any gradient is inf/NaN, the step is skipped and `scaler.update()` reduces the scale factor. This keeps gradients in FP16's representable range while maintaining numerical accuracy. BF16 has the same exponent range as FP32, so it usually does not need loss scaling.

---

**Q: Why is attention O(n²)?**

The attention formula computes **every query against every key**:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

For sequence length $n$:
- $QK^T$ is $[n \times d] \cdot [d \times n] = [n \times n]$ — that's $n^2$ dot products
- Softmax over $n \times n$ matrix
- Multiply by $V$: $[n \times n] \cdot [n \times d]$

Both compute and memory are $O(n^2)$. This is why long-context models need optimizations like FlashAttention (memory-efficient), sliding window attention (sparse), or linear attention approximations.
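
The quadratic growth shows up in the size of the score matrix alone. A small sketch, where the helper `score_matrix_numel` and the head dimension of 64 are illustrative assumptions:

```python
import torch

def score_matrix_numel(n, d=64):
    """Number of entries in QK^T for a single head and sequence length n."""
    Q, K = torch.randn(n, d), torch.randn(n, d)
    return (Q @ K.T).numel()

for n in (256, 512, 1024):
    print(f"n={n}: {score_matrix_numel(n):,} attention scores")
```

Doubling the sequence length quadruples the number of scores, and this count applies per head and per layer.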

---

### 8.2 Quick Reference Card

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                        PYTORCH ESSENTIALS CHEATSHEET                        │
├─────────────────────────────────────────────────────────────────────────────┤
│ TENSORS                                                                     │
│   torch.tensor([1,2,3])           # Create tensor                          │
│   x.to(device)                    # Move to device                         │
│   x.requires_grad_(True)          # Enable gradients                       │
│   x.detach()                      # Remove from graph                      │
├─────────────────────────────────────────────────────────────────────────────┤
│ TRAINING LOOP                                                               │
│   outputs = model(inputs)         # Forward                                │
│   loss = criterion(outputs, y)    # Loss                                   │
│   loss.backward()                 # Backward                               │
│   optimizer.step()                # Update                                 │
│   optimizer.zero_grad()           # Zero grads                             │
├─────────────────────────────────────────────────────────────────────────────┤
│ MODEL MODES                                                                 │
│   model.train()                   # Training (dropout ON)                  │
│   model.eval()                    # Evaluation (dropout OFF)               │
│   with torch.no_grad():           # No gradient tracking                   │
├─────────────────────────────────────────────────────────────────────────────┤
│ SAVING/LOADING                                                              │
│   torch.save(model.state_dict(), 'model.pth')                              │
│   model.load_state_dict(torch.load('model.pth', map_location=device))      │
├─────────────────────────────────────────────────────────────────────────────┤
│ MIXED PRECISION                                                             │
│   scaler = GradScaler()                                                    │
│   with autocast():                                                         │
│       loss = model(x)                                                      │
│   scaler.scale(loss).backward()                                            │
│   scaler.step(optimizer)                                                   │
│   scaler.update()                                                          │
├─────────────────────────────────────────────────────────────────────────────┤
│ GRADIENT CONTROL                                                            │
│   torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)         │
│   loss = loss / accumulation_steps  # For gradient accumulation            │
└─────────────────────────────────────────────────────────────────────────────┘
```
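
The two GRADIENT CONTROL lines are easiest to read in context. The following is a minimal sketch that combines gradient accumulation with clipping; the toy `nn.Linear` model, the random micro-batches, and the value `accumulation_steps = 4` are assumptions made only so the snippet runs on its own.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(10, 2)                      # toy model (assumption)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

accumulation_steps = 4                        # micro-batches per optimizer step
micro_batches = [(torch.randn(8, 10), torch.randint(0, 2, (8,))) for _ in range(8)]

optimizer_steps = 0
optimizer.zero_grad()
for step, (x, y) in enumerate(micro_batches, start=1):
    # Divide so the accumulated gradient is an average over micro-batches
    loss = criterion(model(x), y) / accumulation_steps
    loss.backward()                           # gradients add up in .grad
    if step % accumulation_steps == 0:
        # Returns the total gradient norm measured before clipping
        total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
        optimizer.zero_grad()
        optimizer_steps += 1

print(f"optimizer steps: {optimizer_steps}")  # 8 micro-batches / 4 = 2 steps
```

Clipping is applied once per optimizer step, after all micro-batch gradients have been accumulated, so the threshold acts on the full effective batch.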

---


## 9. PyTorch Examples

*Practical examples putting everything together.*

### 9.1 Basic Neural Network Training

A complete example building and training a simple fully-connected neural network from scratch.

In [48]:
import torch

""" 
Neural Networks with torch.nn
The torch.nn module provides the building blocks for defining and training neural networks.

Building a Model (Class-based Approach)

You define a neural network by subclassing torch.nn.Module.

What is a FULLY CONNECTED LAYER? A layer in which each neuron is connected to every neuron in the previous layer.
nn.Linear is PyTorch's fully connected layer; it is the counterpart of the Dense layer in TensorFlow.

Forward pass: passing the input data through the layers of the network to compute the output.
- NOTE: the backward pass is a separate step that computes the gradients of the loss function with respect to the model parameters.

ReLU (Rectified Linear Unit) is a non-linear activation function. It sets all negative values to zero and keeps positive values unchanged.
Mathematically:
ReLU(x) = max(0, x)
✅ Keeps positives → same
✅ Turns negatives → 0
Without a non-linearity, stacked linear layers collapse into a single linear map, so the model could only fit straight lines.
ReLU lets the network learn more complex patterns. Sigmoid is another common activation (see the TF notes for details).
Intuitively, ReLU passes positive activations through and silences negative ones.

EX:
This example has 2 layers: the first has 784 inputs and 128 outputs, and the second has 128 inputs and 10 outputs.
In the forward pass the input goes through the first layer, then through ReLU, and then through the second layer,
whose output (one raw score, or logit, per class) is returned as the output of the network.
"""
import torch.nn as nn # for neural networks
import torch.optim as optim # for optimizers

class SimpleNN(nn.Module): # inherit from nn.Module
    def __init__(self): 
        super(SimpleNN, self).__init__() # call the parent class constructor to initialize the module
        self.fc1 = nn.Linear(784, 128)  # Fully connected layer (input 784, output 128)
        self.fc2 = nn.Linear(128, 10)  # Fully connected layer (input 128, output 10)

    def forward(self, x): # forward pass
        x = torch.relu(self.fc1(x))  # ReLU activation function 
        x = self.fc2(x) # output layer takes input from previous layer
        return x # return the output of the network

# Model instantiation
model = SimpleNN()

""" 
Training a Model

Training a model involves the following steps:

1. Define the loss function (e.g., nn.CrossEntropyLoss). It is used for multi-class classification, such as predicting a digit from 0 to 9, and measures how far the predicted scores are from the target labels. It expects raw logits, so the model does not apply softmax itself.
2. Define an optimizer (e.g., optim.SGD or optim.Adam). It updates the weights and biases in the direction that reduces the loss, using the gradients computed in the backward pass.
3. Perform the forward pass and compute the loss.
4. Backpropagate to compute the gradients of the loss with respect to the model parameters.
5. Update the weights using the optimizer.

Backward pass vs. backpropagation:
Backpropagation = the method (the algorithm for calculating gradients).
Backward pass = the action (running that algorithm during a training step).

Simple Timeline in Training:
Forward pass → compute output.
Compute loss.
Backward pass → run backpropagation → compute gradients.
Optimizer step → update weights.

NOTE: an epoch is one complete pass through the entire training dataset, so with 10 epochs we pass through the whole dataset (X, y) 10 times.
"""
# NOTE: the random data only has to match the model's shapes (784 inputs, 10 classes); in a real project you would use real data.
# Create 1000 samples with 784 features each. A feature is one column of the input that describes a characteristic of the sample (e.g. height, weight, age).
X = torch.randn(1000, 784)  # inputs (features)
y = torch.randint(0, 10, (1000,))  # outputs (labels): integer class ids 0-9 for 10-class classification
# Wrap into a dataset
dataset = torch.utils.data.TensorDataset(X, y)
# Create a trainloader
trainloader = torch.utils.data.DataLoader(dataset, batch_size=32, shuffle=True)

# Loss function and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Training loop
epochs = 10
for epoch in range(epochs): # for each epoch
    for data, target in trainloader:
        optimizer.zero_grad()  # Zero the gradients
        output = model(data)  # Forward pass using the model above
        loss = criterion(output, target)  # Compute loss
        loss.backward()  # Backpropagate
        optimizer.step()  # Update weights
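    print(f"Epoch {epoch+1}/{epochs}, Loss: {loss.item()}")  # Print the loss of the last batch in this epoch
# The loss should decrease as training proceeds. The labels are random, so the model is memorizing the training set rather than learning a general pattern.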

""" 
Datasets and DataLoader
PyTorch provides utilities for loading and batching data: torch.utils.data.Dataset and DataLoader.

Creating a Custom Dataset
- In PyTorch a dataset represents a collection of data samples. It is where loading and preprocessing for training and testing happen.
- A custom dataset is a class that inherits from torch.utils.data.Dataset and implements the __len__ and __getitem__ methods.

What we achieve here: the DataLoader serves the data in batches of 64 samples and reshuffles it every epoch, so the model does not see the samples in the same order each time.

- You can also use predefined datasets (e.g., MNIST, CIFAR).
"""
from torch.utils.data import Dataset, DataLoader

train_data = torch.randn(1000, 784)  # Random training data with 1000 samples and 784 features
train_labels = torch.randint(0, 10, (1000,))  # Random training labels numbers from 0 to 9
class CustomDataset(Dataset):
    def __init__(self, data, labels):
        self.data = data
        self.labels = labels

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx], self.labels[idx]

# Example usage
train_dataset = CustomDataset(train_data, train_labels) # this is the dataset we created above
train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True) # this is the dataloader that loads the dataset in batches of 64 samples




Epoch 1/10, Loss: 2.409788131713867
Epoch 2/10, Loss: 1.7556540966033936
Epoch 3/10, Loss: 1.3171563148498535
Epoch 4/10, Loss: 0.7755150198936462
Epoch 5/10, Loss: 0.4362756907939911
Epoch 6/10, Loss: 0.1316525787115097
Epoch 7/10, Loss: 0.07957600057125092
Epoch 8/10, Loss: 0.05105067417025566
Epoch 9/10, Loss: 0.05781383067369461
Epoch 10/10, Loss: 0.03630366548895836
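
The loop above prints the loss of the last batch only, which is a noisy number. A minimal sketch of two steadier measurements, the sample-weighted average loss per epoch and the training accuracy, follows. It rebuilds a small stand-in model with `nn.Sequential` and uses 256 random samples so that it runs on its own; `train_one_epoch` and `accuracy` are hypothetical helper names, not PyTorch functions.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 10))
dataset = TensorDataset(torch.randn(256, 784), torch.randint(0, 10, (256,)))
loader = DataLoader(dataset, batch_size=32, shuffle=True)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

def train_one_epoch(model, loader):
    """Run one epoch and return the average loss over all samples."""
    model.train()
    total_loss, n_samples = 0.0, 0
    for data, target in loader:
        optimizer.zero_grad()
        loss = criterion(model(data), target)
        loss.backward()
        optimizer.step()
        total_loss += loss.item() * data.size(0)  # undo the per-batch mean
        n_samples += data.size(0)
    return total_loss / n_samples

@torch.no_grad()
def accuracy(model, loader):
    """Fraction of samples whose highest logit matches the label."""
    model.eval()
    correct, total = 0, 0
    for data, target in loader:
        pred = model(data).argmax(dim=1)
        correct += (pred == target).sum().item()
        total += target.size(0)
    return correct / total

losses = [train_one_epoch(model, loader) for _ in range(3)]
acc = accuracy(model, loader)
print([round(l, 3) for l in losses], round(acc, 3))
```

Because the labels are random, a high accuracy here only shows memorization of the training set; a held-out split would be needed to measure generalization.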

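Going back to the activation used in `SimpleNN`, the ReLU definition can be checked directly. The sketch below also illustrates why an activation is needed at all: without one, two stacked `nn.Linear` layers are equivalent to a single linear map. The layer sizes are arbitrary small values chosen for illustration.

```python
import torch
import torch.nn as nn

x = torch.tensor([-2.0, -0.5, 0.0, 1.5, 3.0])
y = torch.relu(x)
print(y.tolist())  # [0.0, 0.0, 0.0, 1.5, 3.0]

# ReLU(x) = max(0, x), so clamping at zero gives the same result
same_as_clamp = torch.equal(y, torch.clamp(x, min=0))

# Two linear layers with no activation in between collapse into one:
# fc2(fc1(v)) = (W2 W1) v + (W2 b1 + b2)
torch.manual_seed(0)
fc1, fc2 = nn.Linear(4, 3), nn.Linear(3, 2)
inp = torch.randn(5, 4)
with torch.no_grad():
    stacked = fc2(fc1(inp))
    W = fc2.weight @ fc1.weight               # [2, 4]
    b = fc2.weight @ fc1.bias + fc2.bias      # [2]
    merged = inp @ W.T + b
collapses = torch.allclose(stacked, merged, atol=1e-5)
print(same_as_clamp, collapses)  # True True
```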

### 9.2 Transformer Language Model on MPS

Training a transformer-based language model on Apple Silicon using MPS acceleration.
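
The cell below chooses between MPS and CPU. If the same notebook also has to run on machines with an NVIDIA GPU, the selection can be generalized. This is a sketch only: `pick_device` is a hypothetical helper, not a PyTorch function, and the preference order (CUDA, then MPS, then CPU) is an assumption.

```python
import torch

def pick_device() -> torch.device:
    """Prefer CUDA, then MPS, then CPU (hypothetical helper)."""
    if torch.cuda.is_available():
        return torch.device("cuda")
    # torch.backends.mps may be missing on older PyTorch builds, so guard the lookup
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")

device = pick_device()
x = torch.ones(2, 3, device=device)  # create the tensor directly on the device
print(f"Using device: {device}, tensor lives on: {x.device.type}")
```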

In [49]:
import torch
import torch.nn as nn
import torch.optim as optim
import time

# Check if MPS (Metal Performance Shaders) is available
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")  # to force CPU, use torch.device("cpu")
print(f"Using device: {device}")

# Define a simple transformer-based language model
class SimpleTransformerLM(nn.Module):
    def __init__(self, vocab_size=10000, embedding_dim=512, nhead=8, 
                 num_layers=6, dim_feedforward=2048):
        super(SimpleTransformerLM, self).__init__()
        
        # Word embeddings
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        
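        # Stand-in for positional encoding (simplified for this example).
        # NOTE: this per-token Linear + GELU adds no position information; a real
        # model would add positional embeddings or sinusoidal encodings here.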
        self.pos_encoder = nn.Sequential(
            nn.Linear(embedding_dim, embedding_dim),
            nn.GELU()
        )
        
        # Transformer encoder layers
        encoder_layers = nn.TransformerEncoderLayer(
            d_model=embedding_dim, 
            nhead=nhead,
            dim_feedforward=dim_feedforward,
            batch_first=True
        )
        self.transformer_encoder = nn.TransformerEncoder(encoder_layers, num_layers)
        
        # Output layer
        self.output = nn.Linear(embedding_dim, vocab_size)
        
    def forward(self, x):
        # x shape: [batch_size, seq_len]
        x = self.embedding(x)  # [batch_size, seq_len, embedding_dim]
        x = self.pos_encoder(x)
        x = self.transformer_encoder(x)  # NOTE: no causal mask, so every position attends to the whole sequence
        x = self.output(x)  # [batch_size, seq_len, vocab_size]
        return x

# Create the model and move it to the selected device
vocab_size = 10000  # Vocabulary size
seq_length = 128    # Sequence length
batch_size = 16     # Batch size

model = SimpleTransformerLM(vocab_size=vocab_size)
model.to(device)  # Move model to the selected device (MPS if available, otherwise CPU)
print(f"Model moved to {device}")

# Generate some dummy data
input_data = torch.randint(0, vocab_size, (batch_size, seq_length)).to(device)
target_data = torch.randint(0, vocab_size, (batch_size, seq_length)).to(device)

# Define loss function and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Train the model for a few steps to demonstrate training on the selected device
def train_step(model, inputs, targets):
    model.train()
    optimizer.zero_grad()
    
    # Forward pass
    outputs = model(inputs)
    # Reshape for cross entropy
    outputs = outputs.view(-1, vocab_size)
    targets = targets.view(-1)
    
    # Calculate loss
    loss = criterion(outputs, targets)
    
    # Backward pass
    loss.backward()
    optimizer.step()
    
    return loss.item()

# Test training performance
print("\nTraining for 20 iterations to test performance:")
start_time = time.time()
for i in range(20):
    loss = train_step(model, input_data, target_data)
    print(f"Iteration {i+1}, Loss: {loss:.4f}")
end_time = time.time()
print(f"Training time: {end_time - start_time:.2f} seconds") # author's earlier runs: 2.80 sec on GPU (MPS), 13.35 sec on CPU (MacBook Pro M2 16GB); timings vary between runs

# Generate predictions with the model
def generate_text(model, start_tokens, max_length=20):
    model.eval()
    current_tokens = start_tokens.clone()
    
    for _ in range(max_length):
        with torch.no_grad():
            # Get model predictions for the next token
            logits = model(current_tokens)
            next_token_logits = logits[:, -1, :]
            
            # Sample from the distribution
            probs = torch.softmax(next_token_logits, dim=-1)
            next_token = torch.multinomial(probs, 1)
            
            # Append new token to sequence
            current_tokens = torch.cat([current_tokens, next_token], dim=1)
    
    return current_tokens

# Try generating some "text" (just token IDs in this example)
start_seq = torch.randint(0, vocab_size, (1, 5)).to(device)
generated = generate_text(model, start_seq)
print("\nGenerated token sequence:")
print(generated.cpu().numpy())

Using device: mps
Model moved to mps

Training for 20 iterations to test performance:
Iteration 1, Loss: 9.3868
Iteration 2, Loss: 9.0019
Iteration 3, Loss: 8.4839
Iteration 4, Loss: 8.0962
Iteration 5, Loss: 7.6600
Iteration 6, Loss: 9.2323
Iteration 7, Loss: 7.6815
Iteration 8, Loss: 7.6208
Iteration 9, Loss: 7.5674
Iteration 10, Loss: 7.5357
Iteration 11, Loss: 7.5157
Iteration 12, Loss: 7.4886
Iteration 13, Loss: 7.4116
Iteration 14, Loss: 7.1883
Iteration 15, Loss: 7.3648
Iteration 16, Loss: 7.2120
Iteration 17, Loss: 6.7272
Iteration 18, Loss: 7.0580
Iteration 19, Loss: 6.4870
Iteration 20, Loss: 6.4689
Training time: 6.48 seconds

Generated token sequence:
[[8382 2019 8302 9113 4589 3610 6330 9380 4788 5181 8065 4707 8917 6371
   516 5624  178 7638 1814 8643 7851 1157 6889 1783  581]]
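
The model above has neither real position information nor a causal mask, so during training each position can see the tokens that come after it. A minimal sketch of one way to address both is given here. `CausalTransformerLM` is a hypothetical variant, not the model trained above; the learned positional embedding, the `max_len` parameter, and the small layer sizes are assumptions chosen to keep the example fast.

```python
import torch
import torch.nn as nn

class CausalTransformerLM(nn.Module):
    """Hypothetical variant with learned positions and a causal mask."""
    def __init__(self, vocab_size=100, embedding_dim=32, nhead=4,
                 num_layers=2, dim_feedforward=64, max_len=64):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.pos_embedding = nn.Embedding(max_len, embedding_dim)  # one vector per position
        layer = nn.TransformerEncoderLayer(
            d_model=embedding_dim, nhead=nhead,
            dim_feedforward=dim_feedforward, dropout=0.0, batch_first=True,
        )
        self.transformer_encoder = nn.TransformerEncoder(layer, num_layers)
        self.output = nn.Linear(embedding_dim, vocab_size)

    def forward(self, x):
        seq_len = x.size(1)
        positions = torch.arange(seq_len, device=x.device)
        # [batch, seq, dim] + [seq, dim] broadcasts over the batch
        h = self.embedding(x) + self.pos_embedding(positions)
        # -inf above the diagonal: position i attends only to positions <= i
        mask = torch.triu(
            torch.full((seq_len, seq_len), float("-inf"), device=x.device),
            diagonal=1,
        )
        h = self.transformer_encoder(h, mask=mask)
        return self.output(h)  # [batch, seq, vocab_size]

torch.manual_seed(0)
model = CausalTransformerLM().eval()
tokens = torch.randint(0, 100, (1, 10))
changed = tokens.clone()
changed[0, -1] = (changed[0, -1] + 1) % 100  # alter only the last token

with torch.no_grad():
    a, b = model(tokens), model(changed)

# With the causal mask, earlier positions do not depend on the last token
earlier_unchanged = torch.allclose(a[:, :-1], b[:, :-1], atol=1e-5)
print(earlier_unchanged)
```

For next-token training the targets would also be the inputs shifted by one position, rather than independent random tokens as in the cell above.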
