# Chapter 6 Companion Notebook
**Build Your First LLM — Chapter 6: NumPy & PyTorch Survival Guide**

Run these cells top-to-bottom to see the tensor basics, masking, broadcasting, and a tiny training loop.

In [None]:
# ===== IMPORTS =====
# torch: The PyTorch library for tensor operations (think: multi-dimensional arrays)
# numpy: The classic numerical computing library (PyTorch is inspired by it)
# torch.nn: Neural network building blocks (layers, etc.)
import torch, numpy as np
import torch.nn as nn

print('Torch version:', torch.__version__)
print('CUDA available:', torch.cuda.is_available())
# CUDA is NVIDIA's GPU computing platform. If True, tensors and models can run on the GPU,
# which is often much faster for deep learning (the speedup depends on hardware and workload)

## What is a Tensor?

A **tensor** is a multi-dimensional array of numbers:
- **1D tensor** = a list `[1, 2, 3]` (like a row in a spreadsheet)
- **2D tensor** = a table/matrix (rows and columns)
- **3D tensor** = a stack of tables (like multiple sheets in Excel)
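
As a quick check of the three cases above, the sketch below builds one tensor of each rank and prints its `ndim` and `shape`. The variable names (`row`, `table`, `sheets`) are illustrative only and do not appear elsewhere in the notebook.

```python
import torch

row = torch.tensor([1, 2, 3])                    # 1D: a single list
table = torch.tensor([[1, 2, 3], [4, 5, 6]])     # 2D: rows and columns
sheets = torch.zeros(2, 2, 3)                    # 3D: two 2x3 tables stacked

for name, t in [("row", row), ("table", table), ("sheets", sheets)]:
    # ndim counts the dimensions; shape gives the size along each one
    print(f"{name}: ndim={t.ndim}, shape={tuple(t.shape)}")
```

The number of entries in `shape` always equals `ndim`, which is a handy sanity check when a tensor is not the rank you expected.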

**Why PyTorch instead of NumPy?**
Both handle multi-dimensional arrays, but PyTorch adds:
1. **GPU support**: tensors can move to a GPU, which is often far faster for large matrix operations (the exact speedup depends on hardware and workload)
2. **Automatic gradients**: computes derivatives for training neural networks
3. **Neural network layers**: pre-built building blocks

## Creating tensors
From Python/NumPy data, and with common fill rules.

In [None]:
# ===== Creating Tensors =====

# From Python data
data = torch.tensor([[1, 2, 3], [4, 5, 6]])  # 2×3 tensor

# Fill rules (create tensors filled with specific values)
zeros = torch.zeros(3, 4)     # 3×4 tensor of zeros
ones = torch.ones(2, 3, 4)    # 2×3×4 tensor of ones
uniform = torch.rand(3, 4)    # Random values in [0, 1)
normal = torch.randn(3, 4)    # Random values from normal distribution (mean=0, std=1)
integers = torch.randint(0, 10, (3, 4))  # Random integers in [0, 10)

# Ranged sequences (like Python's range)
sequence = torch.arange(0, 10, 2)  # [0, 2, 4, 6, 8]
linspace = torch.linspace(0, 1, 5)  # 5 evenly-spaced points from 0 to 1

# Data types (dtype) — control precision
float_tensor = torch.tensor([1.0, 2.0], dtype=torch.float32)  # 32-bit floats (default)
int_tensor = torch.tensor([1, 2], dtype=torch.long)           # 64-bit integers (for indices)

# NumPy ↔ PyTorch (a CPU tensor shares memory with its array: changes to one affect the other!)
np_data = np.array([1, 2, 3], dtype=np.float32)
torch_from_np = torch.from_numpy(np_data)  # shares memory with np_data
back_to_np = torch_from_np.numpy()         # back to NumPy

# Device placement — CPU or GPU
device = 'cuda' if torch.cuda.is_available() else 'cpu'
gpu_tensor = torch.randn(3, 4, device=device, dtype=torch.float32)

print('zeros shape:', zeros.shape)
print('gpu_tensor device:', gpu_tensor.device)
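
The memory-sharing comment in the cell above is easy to verify. The following is a minimal sketch, assuming CPU tensors: it contrasts `torch.from_numpy` (shares the buffer) with `torch.tensor` (copies the data). The names `shared` and `copied` are illustrative.

```python
import numpy as np
import torch

np_data = np.array([1.0, 2.0, 3.0], dtype=np.float32)
shared = torch.from_numpy(np_data)   # shares the NumPy buffer
copied = torch.tensor(np_data)       # makes an independent copy

np_data[0] = 99.0                    # mutate the NumPy array in place

print("shared:", shared)   # reflects the change made through NumPy
print("copied:", copied)   # keeps the original value
```

If you want isolation from later NumPy edits, copying with `torch.tensor(...)` is the safer choice; sharing avoids the copy when the data is large.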

## Reproducibility: Setting Seeds
Fixed seeds are critical for debugging and for comparing experiments fairly.

In [None]:
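# Fix random state for reproducibility
torch.manual_seed(42)  # seeds PyTorch's default generators (CPU and CUDA)
if torch.cuda.is_available():
    torch.cuda.manual_seed(42)  # explicit GPU seed; harmless if already seeded above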

# These will now be identical on every run
a = torch.randn(3, 4)
b = torch.randn(3, 4)

# Try without seed - results change each time
# But with seed, they're reproducible
torch.manual_seed(42)
x1 = torch.randn(2, 3)
torch.manual_seed(42)
x2 = torch.randn(2, 3)
print('With seed, tensors match:', torch.allclose(x1, x2))

# Why this matters: Reproducible experiments for debugging and research
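
Global seeds affect every random call that follows. As a complementary sketch (not something the chapter requires), a local `torch.Generator` keeps one random stream reproducible without touching global state. The seed value and variable names are arbitrary.

```python
import torch

# Two independent generators started from the same seed
g1 = torch.Generator().manual_seed(123)
g2 = torch.Generator().manual_seed(123)

a = torch.randn(2, 3, generator=g1)
b = torch.randn(2, 3, generator=g2)
c = torch.randn(2, 3, generator=g1)   # continues g1's stream, so it differs from a

print("same seed, first draw matches:", torch.equal(a, b))
print("second draw from g1 matches first:", torch.equal(a, c))
```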

## From Lists to Tensors: Building Up Dimensions
Connect Chapter 5's simple lists to multi-dimensional tensors.

In [None]:
# Step 1: 1D tensors (Chapter 5 callback)
# From Chapter 5: token IDs from tokenizer
token_ids = [2, 3, 4, 6]
tokens_1d = torch.tensor(token_ids)
print(f"1D shape: {tokens_1d.shape}")  # torch.Size([4])
print(f"Data: {tokens_1d}")

In [None]:
# Step 2: Batching (1D → 2D)
# Process multiple sentences at once
batch = torch.tensor([
    [2, 3, 4, 6],    # sentence 1
    [5, 7, 8, 9]     # sentence 2
])
print(f"2D batch shape: {batch.shape}")  # torch.Size([2, 4])
print("First sentence:", batch[0])
print("Second token of first sentence:", batch[0, 1])

In [None]:
# Step 3: Embeddings (2D → 3D)
# Add 768 numbers per token (GPT-2 style)
embeddings_3d = torch.randn(2, 4, 768)
print(f"3D embeddings shape: {embeddings_3d.shape}")  # torch.Size([2, 4, 768])

# Navigate dimensions: batch → token → features
print("First sentence embeddings:", embeddings_3d[0].shape)      # (4, 768)
print("First token of first sentence:", embeddings_3d[0, 0].shape)  # (768,)

In [None]:
# Step 4: Attention heads preview (3D → 4D)
# Multi-head attention adds another dimension (don't worry about details yet)
attention_4d = torch.randn(2, 8, 4, 4)  # (batch, heads, seq, seq)
print(f"4D attention shape: {attention_4d.shape}")
print("Shape interpretation: (batch, heads, seq_len, seq_len)")

## Reshaping, squeezing, permuting
Shape gymnastics you’ll use constantly.

In [None]:
# ===== Reshaping Operations =====
# Reshaping = reorganizing data without changing values (like rearranging boxes in a warehouse)

x = torch.arange(12)               # Create 1D tensor [0,1,2,...,11], shape (12,)
print(f'Start: {x.shape}')

# view() — reshape but requires "contiguous" memory (data laid out sequentially)
x = x.view(3, 4)                   # Reshape to (3, 4): 3 rows × 4 columns
print(f'After view(3,4): {x.shape}')

# reshape() — like view() but handles non-contiguous tensors (safer, use this when unsure)
x = x.reshape(2, 2, 3)             # Reshape to (2, 2, 3)
print(f'After reshape(2,2,3): {x.shape}')

# -1 means "figure this dimension out for me"
x = x.view(-1, 3)                  # -1 becomes 4 (12 total elements ÷ 3 = 4)
print(f'After view(-1,3): {x.shape}')

# unsqueeze/squeeze — add or remove dimensions of size 1
y = torch.randn(3, 4)
y = y.unsqueeze(0)                 # (1, 3, 4) — add batch dimension at position 0
y = y.unsqueeze(-1)                # (1, 3, 4, 1) — add dimension at end
y = y.squeeze()                    # Remove ALL size-1 dims → (3, 4)
print(f'After squeeze: {y.shape}')

# permute — reorder dimensions (like transposing but for any number of dims)
z = torch.randn(2, 3, 4)           # (batch, seq, features)
z = z.permute(0, 2, 1)             # (batch, features, seq) — swap last two dims
print(f'After permute: {z.shape}')

# flatten — collapse dimensions
t = torch.randn(2, 3, 4)
t_flat = t.flatten()               # (24,) — all dims collapsed
t_flat_features = t.flatten(1)     # (2, 12) — flatten starting at dim 1
print(f'Fully flattened: {t_flat.shape}')
print(f'Flatten features: {t_flat_features.shape}')
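
The difference between `view()` and `reshape()` only shows up on non-contiguous tensors. The sketch below, using a small transposed tensor as an assumed example, shows `view()` failing where `reshape()` succeeds.

```python
import torch

base = torch.arange(6).view(2, 3)      # [[0, 1, 2], [3, 4, 5]]
transposed = base.t()                  # shape (3, 2), same storage, non-contiguous

print("contiguous:", transposed.is_contiguous())

try:
    transposed.view(6)                 # view cannot re-stride this layout
    view_failed = False
except RuntimeError:
    view_failed = True

flat = transposed.reshape(6)                  # copies when it has to
flat_alt = transposed.contiguous().view(6)    # explicit copy, then view

print("view failed:", view_failed)
print("reshape result:", flat)
```

`reshape()` silently copies when needed, so it is the safer default; `view()` is useful when you want an error instead of a hidden copy.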

## Indexing and slicing
Pick out batches, tokens, and use boolean masks.

In [None]:
# ===== Indexing and Slicing =====
# Navigate multi-dimensional data like you'd navigate folders: batch → token → features

# Create a 4D tensor simulating attention: (batch, heads, seq, seq)
x = torch.randn(2, 4, 6, 6)  # batch of 2, 4 heads, 6 query tokens × 6 key tokens

# Basic indexing
first_batch = x[0]              # Shape: (4, 6, 6) - first batch, all heads
first_head = x[0, 0]            # Shape: (6, 6) - first batch, first head
single_value = x[0, 0, 0, 0]    # Shape: () - a single number

# Slicing with colons
first_two_batches = x[:2]       # Shape: (2, 4, 6, 6) - first 2 batches
all_but_last_token = x[:, :, :-1, :]  # Shape: (2, 4, 5, 6) - remove last token
every_other_head = x[:, ::2]    # Shape: (2, 2, 6, 6) - heads 0 and 2

# Negative indices count from the end
last_token = x[:, :, -1, :]     # Shape: (2, 4, 6) - last token in each sequence

# Boolean masking (filter by condition)
mask = torch.tensor([True, False, True, False])
filtered_heads = x[0, mask]     # Shape: (2, 6, 6) - only heads 0 and 2

print(f'Original shape: {x.shape}')
print(f'First batch shape: {first_batch.shape}')
print(f'All but last token: {all_but_last_token.shape}')
print(f'Filtered heads: {filtered_heads.shape}')
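
One detail worth knowing about the indexing styles above: basic indexing and slicing return views into the original data, while boolean masking returns a copy. A minimal sketch with a small tensor (names are illustrative):

```python
import torch

x = torch.zeros(2, 3)

row_view = x[0]          # basic indexing returns a view
row_view[0] = 1.0        # writes through to x

mask = torch.tensor([True, False])
picked = x[mask]         # boolean mask indexing returns a copy, shape (1, 3)
picked[0, 1] = 5.0       # does not touch x

print(x)
```

This matters when you modify a slice in place: edits to a view change the source tensor, edits to a masked copy do not.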

## Attention: The Heart of Transformers

**The big picture:** Attention computes a weighted average of all words, where weights come from relevance scores. Like reading "The cat sat on the mat"—when processing "sat", you look back at "cat" (who sat?) and "mat" (sat where?).

We'll build this in 3 steps:
1. Basic math (4 operations)
2. Add causal masking (prevent future peeking)
3. Production shortcut (PyTorch does it all)

In [None]:
import torch
import torch.nn.functional as F

# ===== The Basic Math of Attention =====
# Attention: Query (what am I looking for?), Key (what do I have?), Value (what to return?)

# Simulate embeddings for 5 tokens in batch of 2
batch, seq_len, d_head = 2, 5, 64
Q = torch.randn(batch, seq_len, d_head)  # (2, 5, 64)
K = torch.randn(batch, seq_len, d_head)
V = torch.randn(batch, seq_len, d_head)

# Step 1: Compute scores (how much does each token match every other?)
# The @ operator is matrix multiplication (same as torch.matmul)
# K.transpose(-2, -1) swaps the last two dimensions: (2, 5, 64) → (2, 64, 5)
# Result: (2, 5, 64) @ (2, 64, 5) → (2, 5, 5) — a 5×5 grid of scores per batch
scores = Q @ K.transpose(-2, -1)
print(f'Scores shape: {scores.shape}')  # (2, 5, 5)

# Step 2: Scale (prevent softmax saturation)
# Without scaling, large dot products → softmax gives ~1 for max, ~0 for others
scores = scores / (d_head ** 0.5)  # divide by sqrt(64) = 8

# Step 3: Softmax (turn scores into probabilities)
# dim=-1 means "along the last dimension" (across keys)
attn_weights = torch.softmax(scores, dim=-1)  # each row sums to 1

# Step 4: Weighted sum (blend values using probabilities)
output = attn_weights @ V  # (2, 5, 64)

print(f'Output shape: {output.shape}')  # same as input: (2, 5, 64)
print(f'Attention weights sum to 1: {attn_weights[0, 0].sum():.4f}')
print(f'\nToken 2\'s attention weights: {attn_weights[0, 2]}')
print('(shows how much token 2 attends to each of the 5 tokens)')
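
To see why Step 2 divides by `sqrt(d_head)`, the sketch below measures the spread of raw dot products between random unit-variance vectors. It assumes independent standard-normal entries, as in the cell above; the sample count of 10,000 is arbitrary.

```python
import torch

torch.manual_seed(0)
d_head = 64
q = torch.randn(10_000, d_head)
k = torch.randn(10_000, d_head)

raw = (q * k).sum(dim=-1)        # 10,000 independent dot products
scaled = raw / d_head ** 0.5     # the same scaling used in attention

# The raw spread grows like sqrt(d_head); scaling brings it back near 1
print(f"std of raw scores:    {raw.std():.2f}")
print(f"std of scaled scores: {scaled.std():.2f}")
```

Scores with a standard deviation near 1 keep softmax in a range where several keys can receive meaningful weight.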

## Token and Position Embeddings

Every LLM starts by converting token IDs to dense vectors.

**The problem:** Neural networks can't process raw text like "cat". Token IDs (like `5` for "cat") are arbitrary—ID 5 isn't "closer" to 6 than to 500.

**The solution:** Map each token ID to a learned vector (768 numbers for GPT-2). Similar words learn similar vectors through training.
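
To make the "lookup table" idea concrete, here is a small sketch with a made-up vocabulary of 10 tokens and 4-dimensional vectors (both sizes are assumptions chosen for readability). It shows that `nn.Embedding` is row indexing into a weight matrix, and is equivalent to multiplying a one-hot vector by that matrix.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
tiny_vocab, tiny_dim = 10, 4
emb = nn.Embedding(tiny_vocab, tiny_dim)

ids = torch.tensor([[5, 6, 5]])                # token 5 appears twice

looked_up = emb(ids)                           # (1, 3, 4)
indexed = emb.weight[ids]                      # plain row indexing, same result
one_hot = F.one_hot(ids, num_classes=tiny_vocab).float()   # (1, 3, 10)
via_matmul = one_hot @ emb.weight              # (1, 3, 4)

print("lookup == indexing:", torch.equal(looked_up, indexed))
print("lookup == one-hot matmul:", torch.allclose(looked_up, via_matmul))
print("same ID gives same vector:", torch.equal(looked_up[0, 0], looked_up[0, 2]))
```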

In [None]:
# ===== Token and Position Embeddings =====
# GPT-2 dimensions
vocab_size, d_model, max_seq_len = 50257, 768, 1024

# Token embeddings: a lookup table (like a dictionary: token_id → vector)
# nn.Embedding creates a table with vocab_size rows and d_model columns
# Each row is a 768-dimensional vector representing one token
token_embedding = nn.Embedding(vocab_size, d_model)

# Create some random token IDs (pretend these came from a tokenizer)
token_ids = torch.randint(0, vocab_size, (2, 5))  # 2 sentences, 5 tokens each

# Look up embeddings (just indexing into the table!)
token_vectors = token_embedding(token_ids)
print(f'Token embeddings: {token_vectors.shape}')  # (2, 5, 768)

# Position embeddings: where in the sequence (token 0, token 1, etc.)
# Same idea: a lookup table where row i represents "position i"
pos_embedding = nn.Embedding(max_seq_len, d_model)
position_ids = torch.arange(5).unsqueeze(0).expand(2, -1)  # [[0,1,2,3,4], [0,1,2,3,4]]
pos_vectors = pos_embedding(position_ids)
print(f'Position embeddings: {pos_vectors.shape}')  # (2, 5, 768)

# Combine: element-wise addition (same position, same shape!)
# Why add instead of concatenate? Addition keeps dimension at 768 (not 1536)
# The model learns to encode BOTH meaning AND position in the same vector
input_embeddings = token_vectors + pos_vectors
print(f'Combined: {input_embeddings.shape}')  # (2, 5, 768)

# Parameter counting (how many numbers to learn?)
token_params = vocab_size * d_model   # 50,257 tokens × 768 dims
pos_params = max_seq_len * d_model    # 1,024 positions × 768 dims
print(f'Token params: {token_params:,}')      # 38,597,376
print(f'Position params: {pos_params:,}')     # 786,432
print(f'Total: {token_params + pos_params:,}')  # 39,383,808

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim

# Simple 2-layer MLP for toy classification (self-contained)
class SimpleMLP(nn.Module):
    def __init__(self, input_dim, hidden_dim, num_classes):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, x):
        return self.net(x)

# The Training Recipe (each step is labeled in the loop below):
# 1. Zero gradients -> clear what the previous step left behind
# 2. Forward pass -> get predictions
# 3. Compute loss -> measure error
# 4. Backward pass -> calculate gradients
# 5. Clip gradients -> prevent explosions
# 6. Update weights -> adjust parameters

# Setup: model, optimizer, loss function, fake data
model = SimpleMLP(16, 32, 2)  # input 16-dim, hidden 32-dim, output 2 classes
optimizer = optim.Adam(model.parameters(), lr=1e-3)  # adaptive learning rate
loss_fn = nn.CrossEntropyLoss()  # for classification

# Fake training data (batch_size=64)
inputs = torch.randn(64, 16)          # 64 examples, 16 features each
labels = torch.randint(0, 2, (64,))   # 64 labels (class 0 or 1)

# Training loop (3 epochs)
for epoch in range(3):
    # ===== Training Phase =====
    model.train()  # Training mode: dropout active, batch norm uses batch statistics (no effect on this MLP)

    # Step 1: Zero out old gradients (they accumulate by default!)
    optimizer.zero_grad(set_to_none=True)  # set_to_none saves memory

    # Step 2: Forward pass
    logits = model(inputs)  # Get predictions (raw scores)

    # Step 3: Compute loss
    loss = loss_fn(logits, labels)  # How wrong are we?

    # Step 4: Backward pass (compute gradients)
    loss.backward()  # Fill .grad for every parameter

    # Step 5: Gradient clipping (prevents exploding gradients)
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

    # Step 6: Update weights
    optimizer.step()  # Adjust parameters using gradients

    print(f"Epoch {epoch}: loss={loss.item():.4f}")

# ===== Evaluation Phase =====
model.eval()  # Eval mode: dropout off, batch norm uses its running statistics
with torch.no_grad():  # Don't track gradients (saves memory)
    preds = model(inputs).argmax(dim=-1)  # Get class predictions
    accuracy = (preds == labels).float().mean()  # Fraction correct
    print(f"Accuracy: {accuracy:.2f}")

print()
print("✅ Key points:")
print("  - .backward() fills every parameter's .grad")
print("  - Optimizer uses gradients to adjust weights")
print("  - Gradient clipping limits update size and helps avoid loss spikes (important for LLMs!)")
print("  - .eval() switches layers like dropout to inference behavior; no_grad() saves memory")
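
The gradient clipping step can be inspected in isolation. The sketch below is an assumed toy setup (a single `nn.Linear` and a deliberately oversized loss) showing that `clip_grad_norm_` returns the total gradient norm measured before clipping and rescales the gradients so their combined norm is at most `max_norm`.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(4, 2)

# Deliberately large loss so the gradients are large too
loss = (layer(torch.randn(8, 4)) * 100).pow(2).sum()
loss.backward()

# Returns the total norm measured BEFORE clipping
norm_before = torch.nn.utils.clip_grad_norm_(layer.parameters(), max_norm=1.0)

# Recompute the combined norm of all gradients after clipping
norm_after = torch.sqrt(sum(p.grad.pow(2).sum() for p in layer.parameters()))

print(f"norm before clipping: {norm_before.item():.2f}")
print(f"norm after clipping:  {norm_after.item():.4f}")
```

Logging the returned norm during training is a cheap way to spot instability before it shows up in the loss.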


In [None]:
# Full embedding module
class GPT2Embeddings(nn.Module):
    def __init__(self, vocab_size, max_seq_len, d_model, dropout=0.1):
        super().__init__()
        self.token_embedding = nn.Embedding(vocab_size, d_model)
        self.pos_embedding = nn.Embedding(max_seq_len, d_model)
        self.dropout = nn.Dropout(dropout)
    
    def forward(self, token_ids):
        batch_size, seq_len = token_ids.shape
        position_ids = torch.arange(seq_len, device=token_ids.device)
        position_ids = position_ids.unsqueeze(0).expand(batch_size, -1)
        
        token_emb = self.token_embedding(token_ids)
        pos_emb = self.pos_embedding(position_ids)
        
        embeddings = token_emb + pos_emb
        embeddings = self.dropout(embeddings)
        return embeddings

# Test it
embed_layer = GPT2Embeddings(50257, 1024, 768)
token_ids = torch.randint(0, 50257, (2, 10))
output = embed_layer(token_ids)
print(f"Embedding output: {output.shape}")  # (2, 10, 768)

# Total parameters
total_params = sum(p.numel() for p in embed_layer.parameters())
print(f"Total embedding parameters: {total_params:,}")  # 39,383,808
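
The `nn.Dropout` inside the embedding module behaves differently in training and evaluation mode. A minimal sketch, using `p=0.5` and a vector of ones purely for illustration:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)
x = torch.ones(1000)

drop.train()
train_out = drop(x)     # some entries zeroed, survivors scaled by 1 / (1 - p) = 2.0

drop.eval()
eval_out = drop(x)      # identity: input passes through unchanged

print("values seen in train mode:", train_out.unique().tolist())
print("fraction zeroed:", (train_out == 0).float().mean().item())
print("eval output unchanged:", torch.equal(eval_out, x))
```

This is why calling `.eval()` before inference matters: with dropout left on, repeated calls on the same input give different outputs.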

## The Training Loop

**Key concepts:**
- `nn.Module`: Base class for neural network layers. Your model inherits from it.
- `super().__init__()`: Calls the parent class's initialization (required boilerplate)
- `nn.Linear(in, out)`: A learned affine layer that computes `x @ W.T + b`; the weight is stored with shape (out, in) and a bias is included by default
- `nn.ReLU()`: Activation function — keeps positive values, zeros out negatives
- `CrossEntropyLoss`: Measures how wrong classification predictions are
- `optimizer.zero_grad()`: Clears old gradients (they accumulate by default!)
- `.backward()`: Computes gradients via automatic differentiation
- `.step()`: Updates weights using the computed gradients
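
The note that gradients "accumulate by default" can be seen directly. In this sketch a single parameter tensor `w` (an illustrative stand-in for model weights) goes through `backward()` twice without clearing, then once more after a reset.

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)

(w * 3).sum().backward()
first = w.grad.clone()           # d/dw of sum(3w) is 3 for each element

(w * 3).sum().backward()         # no zeroing in between
accumulated = w.grad.clone()     # new gradient was ADDED to the old one

w.grad = None                    # per-parameter effect of zero_grad(set_to_none=True)
(w * 3).sum().backward()
fresh = w.grad.clone()

print("first:", first)
print("accumulated:", accumulated)
print("after reset:", fresh)
```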

In [None]:
# Step 2: Add Causal Masking (Prevent Cheating)

# Problem: If token 2 can see tokens 3 and 4, it can cheat during training!
# Solution: Block future positions by setting their scores to -inf

# What the mask looks like (False = allow, True = block):
# Token 0 can see: [0]           ← only itself
# Token 1 can see: [0, 1]        ← past + itself
# Token 2 can see: [0, 1, 2]     ← past + itself
# Token 3 can see: [0, 1, 2, 3]
# Token 4 can see: [0, 1, 2, 3, 4]

seq_len = 5
causal_mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()

# Apply mask before softmax
scores = Q @ K.transpose(-2, -1) / (d_head ** 0.5)
scores = scores.masked_fill(causal_mask, float('-inf'))  # -inf becomes 0 after softmax
attn_weights = torch.softmax(scores, dim=-1)

print("Causal attention weights (token 2 can only see tokens 0,1,2):")
print(attn_weights[0, 2])  # positions 3 and 4 are zero
print("\nNow the model learns to predict 'mat' without seeing 'mat' first!")

In [None]:
# Step 3: Production Shortcut (One Line)

# You just learned the 4-step manual process to understand what's happening.
# In practice, PyTorch does it all for you:

output = F.scaled_dot_product_attention(
    Q, K, V,
    is_causal=True  # automatically applies causal masking
)

print(f"Output shape: {output.shape}")  # (2, 5, 64)

# Why use this instead of the manual version?
# - Often faster: it can dispatch to fused kernels such as FlashAttention when
#   the hardware, dtype, and PyTorch build (2.0+) support them
# - Less memory: fused kernels avoid materializing the full attention matrix
# - Handles details for you (scaling, dropout, mask broadcasting)

print("\n✅ When to use manual vs. production:")
print("   Learning: Write it manually to understand the math")
print("   Production: Prefer F.scaled_dot_product_attention")
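
A useful sanity check is to confirm that the manual causal attention and the built-in function agree. The sketch below is self-contained with its own random `Q`, `K`, `V`, and assumes PyTorch 2.0 or newer, where `F.scaled_dot_product_attention` is available.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
Q = torch.randn(2, 5, 64)
K = torch.randn(2, 5, 64)
V = torch.randn(2, 5, 64)

# Manual version: scale, mask future positions, softmax, weighted sum
scores = Q @ K.transpose(-2, -1) / (Q.size(-1) ** 0.5)
causal_mask = torch.triu(torch.ones(5, 5), diagonal=1).bool()
scores = scores.masked_fill(causal_mask, float("-inf"))
manual = torch.softmax(scores, dim=-1) @ V

# Built-in version
fused = F.scaled_dot_product_attention(Q, K, V, is_causal=True)

max_diff = (manual - fused).abs().max().item()
print(f"max abs difference: {max_diff:.2e}")
```

Any difference should be on the order of floating-point rounding; a larger gap usually points to a mask or scaling mistake in the manual code.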

In [None]:
# Visualize the causal mask: 1 marks a blocked (future) position, 0 an allowed one
seq_len = 5
mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1)
print(mask)


## Broadcasting examples
Bias add and masking broadcast.

In [None]:
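# Scalar broadcast: 5 is added to every element
x = torch.randn(3, 4)
y = x + 5

# Bias add: (768,) broadcasts across the batch and sequence dims of (32, 10, 768)
batch = torch.randn(32, 10, 768)
bias = torch.randn(768)
result = batch + bias  # bias broadcasts

# Masking: a (1, 1, 10, 10) mask broadcasts over batch and heads of (4, 8, 10, 10)
scores = torch.randn(4, 8, 10, 10)
mask = torch.triu(torch.ones(1, 1, 10, 10), 1)  # 1 above the diagonal (future positions)
masked = scores + mask * -1e9                   # large negative value acts like -inf before softmax
print('result shape:', result.shape)
print('masked shape:', masked.shape)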
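
Broadcasting aligns shapes from the right, and each pair of dimensions must either match or contain a 1. The sketch below shows one shape pair that works, one that fails, and the usual `unsqueeze` fix. The tensors are small made-up examples.

```python
import torch

# Ask PyTorch what shape two inputs would broadcast to
print(torch.broadcast_shapes((32, 10, 768), (768,)))

a = torch.zeros(3, 4)
per_row = torch.tensor([1.0, 2.0, 3.0])   # shape (3,): one value per row

try:
    a + per_row                # trailing dims are 4 and 3, which do not match
    failed = False
except RuntimeError:
    failed = True

# Adding a size-1 trailing dim lets (3, 1) broadcast across the 4 columns
fixed = a + per_row.unsqueeze(1)

print("direct add failed:", failed)
print(fixed)
```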


## Essential Operations: Element-wise, Matmul, Reductions

**Element-wise operations** work position-by-position—like adding two spreadsheets cell-by-cell. If `a` and `b` are both 3×4 tensors, then `a + b` adds `a[0,0]` to `b[0,0]`, `a[0,1]` to `b[0,1]`, and so on. Same shape in, same shape out.

In [None]:
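# Element-wise: both inputs are (3, 4), so every result is (3, 4)
a = torch.randn(3, 4)
b = torch.randn(3, 4)
add = a + b
mul = a * b
square = a ** 2
exp = torch.exp(a)

# Matmul: (32, 10, 64) @ (64, 128) -> (32, 10, 128); W is applied to every token
x = torch.randn(32, 10, 64)
W = torch.randn(64, 128)
y = x @ W

# Reductions: collapse one or more dimensions
total = a.sum()                        # scalar
row_sums = a.sum(dim=1, keepdim=True)  # (3, 1)
col_means = a.mean(dim=0)              # (4,)
max_vals, max_idx = a.max(dim=1)       # values and indices, each (3,)

# Joining: cat extends an existing dim, stack creates a new one
c = torch.cat([a, b], dim=0)           # (6, 4)
d = torch.stack([a, b], dim=0)         # (2, 3, 4)

# Softmax: turns each row of logits into probabilities that sum to 1
logits = torch.randn(3, 5)
probs = torch.softmax(logits, dim=-1)
print('y shape:', y.shape)
print('probs row sums:', probs.sum(dim=-1))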
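
Random tensors hide what each operation does to the values, so here is a small sketch with hand-picked 2×2 matrices (the numbers are arbitrary) that contrasts `*` with `@`, `cat` with `stack`, and a reduction with and without `keepdim`.

```python
import torch

a = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
b = torch.tensor([[10.0, 20.0], [30.0, 40.0]])

elementwise = a * b      # multiplies matching positions
matmul = a @ b           # rows of a dotted with columns of b

joined = torch.cat([a, b], dim=0)      # extends dim 0: (4, 2)
stacked = torch.stack([a, b], dim=0)   # adds a new dim 0: (2, 2, 2)

row_sums_keep = a.sum(dim=1, keepdim=True)   # (2, 1), still 2D
row_sums_drop = a.sum(dim=1)                 # (2,), dimension removed

print("a * b =", elementwise.tolist())
print("a @ b =", matmul.tolist())
print("cat:", tuple(joined.shape), "stack:", tuple(stacked.shape))
print("keepdim:", tuple(row_sums_keep.shape), "no keepdim:", tuple(row_sums_drop.shape))
```

Keeping the reduced dimension with `keepdim=True` makes the result broadcast cleanly back against the original tensor, for example when normalizing rows.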


## Autograd: Automatic Gradients

**What's a gradient?** The vector of partial derivatives: how much the output changes when each input changes by a tiny amount.

**Why care?** Training a neural network means adjusting millions of parameters. Gradients tell us which direction to adjust. PyTorch's autograd does this automatically.

In [None]:
# Simple computation graph: x → y → z
x = torch.tensor([2.0, 3.0], requires_grad=True)  # track operations on x
y = x ** 2        # y = [4.0, 9.0]
z = y.sum()       # z = 13.0

# Compute gradients automatically
z.backward()      # "how does z change if I change x?"
print('Gradients:', x.grad)  # tensor([4., 6.])

# What just happened?
# z = (x²).sum() → dz/dx = 2x
# At x=[2, 3], gradients are 2*[2, 3] = [4, 6]
# backward() computed this by walking the graph backward!

print('\n✅ Manual check: dz/dx = 2x')
print(f'   At x=[2, 3]: 2*x = {2 * x.detach()}')

# When to stop tracking (saves memory during inference):
print('\n--- Stopping gradient tracking ---')

# Option 1: Context manager (for a block of code)
with torch.no_grad():
    y_no_grad = x * 2  # no gradient tracking
    print(f'No grad computed: {y_no_grad}')

# Option 2: Detach (for a single tensor)
detached = x.detach()  # new tensor, no grad history
print(f'Detached tensor: {detached}')
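
One way to build trust in autograd is to compare it with a numerical estimate. The sketch below uses an assumed test function `f(x) = sum(x³)` and a central finite difference; `float64` is used so the numerical estimate is accurate enough to compare.

```python
import torch

def f(x):
    # Illustrative function with a known derivative: df/dx = 3 * x**2
    return (x ** 3).sum()

x = torch.tensor([1.0, 2.0], dtype=torch.float64, requires_grad=True)
f(x).backward()
autograd_grad = x.grad.clone()

# Central difference: nudge one element at a time by +/- eps
eps = 1e-6
numeric = torch.zeros_like(autograd_grad)
with torch.no_grad():
    for i in range(x.numel()):
        step = torch.zeros_like(x)
        step[i] = eps
        numeric[i] = (f(x + step) - f(x - step)) / (2 * eps)

print("autograd:", autograd_grad)
print("numeric: ", numeric)
```

The numerical version needs two function evaluations per input element, which is why it is only practical as a check; autograd gets all gradients from a single backward pass.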

## Summary

You've now seen the core PyTorch operations used throughout LLM development:

✅ **Tensor basics** - creation, dtypes, devices
✅ **Reproducibility** - seeds for debugging
✅ **Dimension building** - 1D → 2D → 3D → 4D progression
✅ **Reshaping & indexing** - navigating multi-dimensional data
✅ **Attention mechanism** - manual + production patterns
✅ **Causal masking** - preventing future token peeking
✅ **Embeddings** - token + position representations
✅ **Broadcasting** - automatic shape expansion
✅ **Math ops** - element-wise, matmul, reductions
✅ **Autograd** - automatic differentiation
✅ **Training loops** - forward, backward, optimize with gradient clipping

**Next steps:** Chapter 7 will show you how to prepare real text data for training!