# PyTorch from the Ground Up

A complete, practical guide -- from tensors to training loops.

**Contents:**
1. What is PyTorch?
2. Tensors -- The Foundation
3. Tensor Operations
4. Autograd -- How PyTorch Learns
5. Building a Neural Network (Manual Way)
6. `torch.nn` -- The High-Level API
7. The Standard Training Loop
8. Data -- Dataset and DataLoader
9. GPU Usage
10. Saving & Loading Models
11. The Batch Dimension
12. The Full Mental Model

---

## Setup
PyTorch comes pre-installed on Google Colab. If running locally:
```bash
pip install torch torchvision
```

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
import matplotlib.pyplot as plt

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"MPS available (Apple Silicon): {torch.backends.mps.is_available()}")

---
## 1. What is PyTorch?

PyTorch is a deep learning framework built on two core ideas:

- **Tensors** -- n-dimensional arrays (like NumPy) that can run on GPU
- **Autograd** -- automatic differentiation engine that computes gradients for you

Everything else (neural networks, optimizers, etc.) is built on top of these.

### Why not just use NumPy?

NumPy arrays live only on **CPU**. PyTorch tensors can live on **GPU**, which has thousands of cores doing parallel math. A matrix multiply that takes 500ms on CPU might take 2ms on GPU -- the difference between training in weeks vs hours.

**Key insight:** Tensors are not just data containers -- they are the *computation substrate*. Everything flows through them.

---
## 2. Tensors -- The Foundation

A tensor is just a number, vector, matrix, or any n-dimensional array.

**Why tensors over plain lists?**
- GPU support -- run on thousands of parallel cores
- Built-in math operations (matmul, sum, etc.)
- Autograd integration -- gradients flow through them
- Efficient memory layout (contiguous C arrays under the hood)

In [None]:
# Scalar (0D tensor)
a = torch.tensor(3.14)
print(f"Scalar: {a}")
print(f"Shape: {a.shape}")     # torch.Size([])
print(f"Dimensions: {a.ndim}") # 0

In [None]:
# Vector (1D)
b = torch.tensor([1.0, 2.0, 3.0])
print(f"Vector: {b}")
print(f"Shape: {b.shape}")  # torch.Size([3])

In [None]:
# Matrix (2D)
c = torch.tensor([[1, 2], [3, 4]])
print(f"Matrix:\n{c}")
print(f"Shape: {c.shape}")  # torch.Size([2, 2])

In [None]:
# 3D tensor -- e.g., batch of grayscale images: [batch, height, width]
d = torch.zeros(8, 28, 28)
print(f"3D tensor shape: {d.shape}")  # torch.Size([8, 28, 28])

### Creating Tensors -- Common Patterns

In [None]:
print("zeros:    ", torch.zeros(3, 4))
print("ones:     ", torch.ones(2, 3))
print("rand:     ", torch.rand(2, 3))        # uniform [0, 1)
print("randn:    ", torch.randn(2, 3))       # standard normal
print("arange:   ", torch.arange(0, 10, 2)) # [0, 2, 4, 6, 8]
print("linspace: ", torch.linspace(0, 1, 5))
print("eye:    \n", torch.eye(3))           # identity matrix

### Data Types (dtypes)

| dtype | Use case |
|-------|----------|
| `float32` | Default for ML -- best balance of speed and precision |
| `float16` / `bfloat16` | Large model training -- saves memory |
| `int64` | Class labels, indices |
| `bool` | Masks |

**Why dtype matters:** float16 uses half the memory of float32. Wrong dtype can cause silent incorrect results or runtime errors.

In [None]:
x = torch.tensor([1.0, 2.0])    # float32 by default
print(f"dtype: {x.dtype}")

y = torch.tensor([1, 2])        # int64 by default
print(f"dtype: {y.dtype}")

# Casting
print("to float64:", x.to(torch.float64))
print("to int:    ", x.int())
print("to bool:   ", x.bool())
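
The "silent incorrect results or runtime errors" warning above can be made concrete. The following is a small sketch, not an exhaustive list of dtype pitfalls; the variable names are chosen for illustration only.

```python
import torch

# Memory per element: float16 is half the size of float32.
f32 = torch.zeros(1000, dtype=torch.float32)
f16 = torch.zeros(1000, dtype=torch.float16)
bytes_f32 = f32.element_size() * f32.numel()   # 4 bytes per element
bytes_f16 = f16.element_size() * f16.numel()   # 2 bytes per element
print(f"float32: {bytes_f32} bytes, float16: {bytes_f16} bytes")

# Silent precision loss: float16 cannot represent 2049 exactly.
rounded = torch.tensor(2049.0, dtype=torch.float16)
print(f"2049 stored as float16: {rounded.item()}")   # 2048.0

# Silent truncation: casting to an integer dtype drops the fractional part.
truncated = torch.tensor([1.9, -1.9]).int()
print(f"[1.9, -1.9] cast to int: {truncated.tolist()}")

# Runtime error: matmul does not accept mixed float32 / int64 operands.
try:
    torch.ones(2, 2) @ torch.ones(2, 2, dtype=torch.int64)
    mixed_ok = True
except RuntimeError as err:
    mixed_ok = False
    print(f"RuntimeError: {err}")
```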

---
## 3. Tensor Operations

In [None]:
a = torch.tensor([1.0, 2.0, 3.0])
b = torch.tensor([4.0, 5.0, 6.0])

print(f"a + b  = {a + b}")
print(f"a * b  = {a * b}")
print(f"a ** 2 = {a ** 2}")

In [None]:
print(f"sum:  {a.sum()}")
print(f"mean: {a.mean()}")
print(f"max:  {a.max()}")
print(f"std:  {a.std()}")

In [None]:
# Matrix multiplication -- the most important operation in deep learning
# Every linear layer is just a matmul
A = torch.randn(3, 4)
B = torch.randn(4, 5)
C = A @ B           # same as torch.matmul(A, B)
print(f"A: {A.shape}, B: {B.shape}, A@B: {C.shape}")
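
To back up the comment that every linear layer is just a matmul, here is a minimal check that `nn.Linear` gives the same result as writing out `x @ W.T + b` by hand. The layer sizes are arbitrary.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(4, 5)          # weight: [5, 4], bias: [5]
x = torch.randn(3, 4)            # batch of 3 samples, 4 features each

out_layer = layer(x)
out_manual = x @ layer.weight.T + layer.bias   # the same computation written out

print(f"Output shape: {out_layer.shape}")       # torch.Size([3, 5])
print(f"Same result?  {torch.allclose(out_layer, out_manual, atol=1e-6)}")
```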

In [None]:
# Reshaping -- you will use this constantly
x = torch.arange(12, dtype=torch.float32)
print(f"Original: {x.shape}")
print(f"reshape(3, 4):  {x.reshape(3, 4).shape}")
print(f"reshape(3, -1): {x.reshape(3, -1).shape}")   # -1 = infer
print(f"unsqueeze(0):   {x.unsqueeze(0).shape}")      # [12] -> [1, 12]
print(f"unsqueeze(1):   {x.unsqueeze(1).shape}")      # [12] -> [12, 1]
y = torch.zeros(1, 12)
print(f"squeeze:        {y.squeeze().shape}")          # [1, 12] -> [12]

In [None]:
# Indexing & slicing -- same as NumPy
x = torch.randn(4, 3)
print(f"x shape: {x.shape}")
print(f"First row:      {x[0]}")
print(f"Second column:  {x[:, 1]}")
print(f"Rows 1-2 shape: {x[1:3, :].shape}")
print(f"Boolean mask:   {x[x > 0][:4]}")

---
## 4. Autograd -- How PyTorch Learns

### Why Automatic Differentiation?

Training requires computing gradients -- how much does each parameter contribute to the error?

For a network with millions of parameters, doing this by hand is impossible. Before autograd, researchers had to **manually derive and code gradient equations** for every architecture.

**What autograd does:** You write the forward computation. PyTorch records every operation as a computational graph. `backward()` walks the graph backwards and applies the chain rule automatically.

PyTorch is **define-by-run** -- the graph is built dynamically as you execute. You can use normal Python debugging, conditionals, loops -- no restrictions.

When you mark a tensor with `requires_grad=True`, PyTorch tracks every operation on it.

In [None]:
x = torch.tensor(3.0, requires_grad=True)

# y = x^2 + 2x + 1
y = x ** 2 + 2*x + 1

y.backward()   # compute dy/dx

# dy/dx = 2x + 2 = 2(3) + 2 = 8
print(f"x = {x.item()}")
print(f"y = {y.item()}")
print(f"dy/dx = {x.grad.item()}")   # 8.0

In [None]:
# Multiple variables -- chain rule applied automatically
a = torch.tensor(2.0, requires_grad=True)
b = torch.tensor(3.0, requires_grad=True)

c = a * b        # c = ab
d = c + a        # d = ab + a

d.backward()

# d(d)/d(a) = b + 1 = 4
# d(d)/d(b) = a = 2
print(f"d(d)/d(a) = {a.grad.item()}")   # 4.0
print(f"d(d)/d(b) = {b.grad.item()}")   # 2.0
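
The define-by-run idea described earlier means ordinary Python control flow works inside a differentiable function. A small sketch, with `f` as a made-up function for illustration: the graph contains only the operations that actually ran for this input.

```python
import torch

def f(x):
    # Plain Python loop and conditional; autograd records whichever branch executes.
    y = x
    for _ in range(3):
        if y.item() < 10:
            y = y * 2      # taken while y is small
        else:
            y = y + 1      # taken once y reaches 10 or more
    return y

x = torch.tensor(3.0, requires_grad=True)
y = f(x)          # 3 -> 6 -> 12 -> 13
y.backward()

# Path taken: ((x * 2) * 2) + 1, so dy/dx = 4
print(f"y = {y.item()}")            # 13.0
print(f"dy/dx = {x.grad.item()}")   # 4.0
```

A different starting value can take a different path through the loop, and the gradient follows that path.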

In [None]:
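# torch.no_grad() -- stop tracking (for inference)
# Benefit: no graph is recorded, so intermediate results are not kept
#          for backward -- less memory and less overhead
# When to use: validation and inference, where backward() is never called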

x = torch.randn(3, requires_grad=True)

y = x * 2
print(f"requires_grad: {y.requires_grad}")           # True

with torch.no_grad():
    y = x * 2
    print(f"requires_grad (no_grad): {y.requires_grad}")  # False

### Why `zero_grad()` is Critical

PyTorch **accumulates** gradients by default. Each `backward()` *adds* to existing gradients rather than replacing them.

Forgetting `zero_grad()` is one of the most common PyTorch bugs -- the model still runs, just trains incorrectly.

In [None]:
w = torch.tensor(1.0, requires_grad=True)

print("--- Without zero_grad (WRONG) ---")
for i in range(3):
    loss = w * 2
    loss.backward()
    print(f"  Step {i+1} grad: {w.grad.item()}")  # 2, 4, 6 -- accumulates!

print()

w2 = torch.tensor(1.0, requires_grad=True)
print("--- With zero_grad (CORRECT) ---")
for i in range(3):
    loss = w2 * 2
    loss.backward()
    print(f"  Step {i+1} grad: {w2.grad.item()}")  # 2, 2, 2 -- correct!
    w2.grad.zero_()

---
## 5. Building a Neural Network -- The Manual Way

Before using high-level APIs, let us see what happens underneath.

**Goal:** Learn `y = 3x + 2` from noisy data using manual gradient descent.

**The training cycle:**
1. Forward pass -- compute predictions
2. Compute loss -- measure error
3. Backward pass -- compute gradients
4. Update parameters -- move to reduce loss
5. Zero gradients -- reset for next step

In [None]:
torch.manual_seed(42)

# Generate fake data: y = 3x + 2 + noise
X = torch.randn(100, 1)
y = 3 * X + 2 + 0.1 * torch.randn(100, 1)

print(f"X shape: {X.shape}, y shape: {y.shape}")

In [None]:
# Parameters to learn
W = torch.randn(1, 1, requires_grad=True)
b = torch.zeros(1,    requires_grad=True)

print(f"Initial W: {W.item():.4f}  (target: 3.0)")
print(f"Initial b: {b.item():.4f}  (target: 2.0)")

lr = 0.1
losses = []

for epoch in range(100):
    # 1. Forward
    y_pred = X @ W + b

    # 2. Loss (MSE)
    loss = ((y_pred - y) ** 2).mean()
    losses.append(loss.item())

    # 3. Backward
    loss.backward()

    # 4. Update
    with torch.no_grad():
        W -= lr * W.grad
        b -= lr * b.grad

    # 5. Zero gradients
    W.grad.zero_()
    b.grad.zero_()

print(f"\nLearned W: {W.item():.4f}  (true: 3.0)")
print(f"Learned b: {b.item():.4f}  (true: 2.0)")
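
It can be reassuring to confirm that `loss.backward()` produces the same numbers as the calculus. For MSE on a linear model, the hand-derived gradients are `dL/dW = 2 * mean((y_pred - y) * x)` and `dL/db = 2 * mean(y_pred - y)`. The sketch below regenerates similar data (with a different seed) and compares both; it is a sanity check, not part of the training loop.

```python
import torch

torch.manual_seed(0)
X = torch.randn(100, 1)
y = 3 * X + 2 + 0.1 * torch.randn(100, 1)

W = torch.randn(1, 1, requires_grad=True)
b = torch.zeros(1, requires_grad=True)

# Autograd gradients
y_pred = X @ W + b
loss = ((y_pred - y) ** 2).mean()
loss.backward()

# Hand-derived gradients of the same MSE loss
with torch.no_grad():
    err = X @ W + b - y                      # [100, 1]
    grad_W = 2 * (X.T @ err) / X.shape[0]    # [1, 1]
    grad_b = 2 * err.mean(dim=0)             # [1]

print(f"W grad matches: {torch.allclose(W.grad, grad_W, atol=1e-5)}")
print(f"b grad matches: {torch.allclose(b.grad, grad_b, atol=1e-5)}")
```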

In [None]:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

ax1.plot(losses)
ax1.set_title('Training Loss'); ax1.set_xlabel('Epoch')
ax1.set_ylabel('MSE'); ax1.set_yscale('log'); ax1.grid(True)

with torch.no_grad():
    x_plot = torch.linspace(-3, 3, 100).unsqueeze(1)
    y_plot = x_plot @ W + b

ax2.scatter(X.numpy(), y.numpy(), alpha=0.4, label='Data')
ax2.plot(x_plot.numpy(), y_plot.numpy(), 'r-', linewidth=2,
         label=f'y={W.item():.2f}x+{b.item():.2f}')
ax2.set_title('Result'); ax2.legend(); ax2.grid(True)
plt.tight_layout(); plt.show()

---
## 6. `torch.nn` -- The High-Level API

### Why `nn.Module`?

As networks grow, you need:
- Automatic parameter tracking (for the optimizer)
- Train/eval mode switching (dropout, batchnorm behave differently)
- Clean save/load
- Modular composition of sub-networks

`nn.Module` provides all of this. `model.parameters()` recursively finds every parameter. `model.to(device)` moves everything. `model.train()` / `model.eval()` sets mode for every layer.

**Always call `model(x)`, NOT `model.forward(x)` directly.**
The `__call__` method wraps `forward()` with hooks and other important machinery.
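
One way to see the difference is with a forward hook. In this sketch the hook just records output shapes in a list (`calls` is a name chosen for illustration); it fires for `layer(x)` but is skipped when `forward` is called directly.

```python
import torch
import torch.nn as nn

calls = []
layer = nn.Linear(2, 1)

# Forward hooks receive (module, inputs, output) after each call through __call__.
handle = layer.register_forward_hook(
    lambda module, inputs, output: calls.append(output.shape)
)

x = torch.randn(4, 2)
layer(x)            # goes through __call__ -> hook fires
layer.forward(x)    # bypasses __call__   -> hook is skipped

print(f"Hook fired {len(calls)} time(s)")   # 1
handle.remove()     # detach the hook when it is no longer needed
```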

In [None]:
class LinearModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(1, 1)   # in_features=1, out_features=1

    def forward(self, x):
        return self.linear(x)

model = LinearModel()
print(model)
print()
for name, param in model.named_parameters():
    print(f"  {name}: shape={param.shape}, value={param.data}")

In [None]:
# Same linear regression using nn -- much cleaner
model = LinearModel()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()

losses = []
for epoch in range(100):
    y_pred = model(X)               # forward
    loss = loss_fn(y_pred, y)       # loss

    optimizer.zero_grad()           # zero BEFORE backward
    loss.backward()                 # gradients
    optimizer.step()                # update

    losses.append(loss.item())

print(f"W: {model.linear.weight.item():.4f}  (true: 3.0)")
print(f"b: {model.linear.bias.item():.4f}  (true: 2.0)")

### Common Layers

| Layer | Purpose | When to Use |
|-------|---------|-------------|
| `nn.Linear(in, out)` | Fully connected | Tabular data, classifier heads |
| `nn.Conv2d(in, out, k)` | 2D convolution | Images -- captures local spatial patterns |
| `nn.ReLU()` | Non-linearity max(0,x) | Default activation after most layers |
| `nn.Sigmoid()` | Squash to (0, 1) | Binary output probability |
| `nn.Dropout(p)` | Random zero activations | Prevent overfitting |
| `nn.BatchNorm1d/2d` | Normalize activations | Stabilize training, allow higher LR |
| `nn.Embedding(V, D)` | Learnable lookup table | NLP word embeddings |
| `nn.LSTM(in, hidden)` | Recurrent layer | Sequences -- text, time series |
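
A quick way to get a feel for these layers is to push a dummy batch through each and watch the shapes. The sizes below are arbitrary, and `batch_first=True` is set on the LSTM so its input follows the `[batch, seq, features]` layout.

```python
import torch
import torch.nn as nn

# Image-style input: [batch, channels, height, width]
images = torch.randn(8, 3, 32, 32)
conv = nn.Conv2d(3, 16, kernel_size=3, padding=1)   # padding=1 keeps H and W unchanged
conv_out = conv(images)
bn_out = nn.BatchNorm2d(16)(conv_out)               # normalizes per channel, same shape
print(f"Conv2d:      {images.shape} -> {conv_out.shape}")
print(f"BatchNorm2d: {conv_out.shape} -> {bn_out.shape}")

# Token-style input: [batch, sequence_length] of integer ids
tokens = torch.randint(0, 1000, (8, 20))
embed = nn.Embedding(1000, 64)                      # vocabulary of 1000, 64-dim vectors
emb_out = embed(tokens)
print(f"Embedding:   {tokens.shape} -> {emb_out.shape}")

# Recurrent layer over the embedded sequence
lstm = nn.LSTM(64, 128, batch_first=True)
lstm_out, (h_n, c_n) = lstm(emb_out)
print(f"LSTM output: {lstm_out.shape}, final hidden: {h_n.shape}")
```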

In [None]:
# nn.Sequential -- clean syntax for simple stacks
mlp = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Dropout(0.3),
    nn.Linear(256, 128),
    nn.ReLU(),
    nn.Linear(128, 10)
)
print(mlp)

total = sum(p.numel() for p in mlp.parameters())
print(f"\nTotal parameters: {total:,}")

### Loss Functions -- Why the Choice Matters

The loss function defines what "wrong" means. Wrong choice = model optimizes the wrong objective.

| Loss | Use When | Why |
|------|----------|-----|
| `MSELoss` | Regression | Penalizes large errors heavily (squares them) |
| `L1Loss` | Regression with outliers | Error grows linearly, not quadratically, so outliers dominate less |
| `CrossEntropyLoss` | Multi-class classification | Most common for classification. Combines LogSoftmax + NLL. |
| `BCEWithLogitsLoss` | Binary classification | Numerically stable. Use instead of Sigmoid + BCE. |

**Common mistake:** Using MSELoss for classification treats class labels as ordered numbers -- meaningless.

In [None]:
# MSELoss
mse = nn.MSELoss()
pred   = torch.tensor([2.5, 3.0, 4.0])
target = torch.tensor([2.0, 3.0, 5.0])
print(f"MSE Loss: {mse(pred, target).item():.4f}")

# CrossEntropyLoss -- input is raw LOGITS (not softmax!), target is class indices
ce = nn.CrossEntropyLoss()
logits = torch.tensor([[2.0, 1.0, 0.5],
                        [0.5, 2.5, 0.3]])
labels = torch.tensor([0, 1])
print(f"CrossEntropy Loss: {ce(logits, labels).item():.4f}")

# BCEWithLogitsLoss
bce = nn.BCEWithLogitsLoss()
b_logits = torch.tensor([1.5, -0.5, 2.0])
b_labels = torch.tensor([1.0,  0.0, 1.0])
print(f"BCE Loss: {bce(b_logits, b_labels).item():.4f}")
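
The table's claims about what these losses combine can be checked directly. This sketch uses the functional forms from `torch.nn.functional` and the same example values as above; it shows why `CrossEntropyLoss` and `BCEWithLogitsLoss` expect raw logits.

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, 1.0, 0.5],
                       [0.5, 2.5, 0.3]])
labels = torch.tensor([0, 1])

# CrossEntropy == LogSoftmax followed by negative log-likelihood
ce_direct = F.cross_entropy(logits, labels)
ce_manual = F.nll_loss(F.log_softmax(logits, dim=1), labels)
print(f"CrossEntropy matches LogSoftmax + NLL: "
      f"{torch.allclose(ce_direct, ce_manual, atol=1e-6)}")

b_logits = torch.tensor([1.5, -0.5, 2.0])
b_labels = torch.tensor([1.0, 0.0, 1.0])

# BCEWithLogits == Sigmoid followed by BCE (fused for numerical stability)
bce_direct = F.binary_cross_entropy_with_logits(b_logits, b_labels)
bce_manual = F.binary_cross_entropy(torch.sigmoid(b_logits), b_labels)
print(f"BCEWithLogits matches Sigmoid + BCE:   "
      f"{torch.allclose(bce_direct, bce_manual, atol=1e-6)}")
```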

### Optimizers -- Why Not Just Gradient Descent?

**Vanilla SGD problem:** Same LR for all parameters. Gets stuck in saddle points. Noisy mini-batch updates.

**Adam** tracks a running average of gradient AND squared gradient per parameter -- giving each its own adaptive learning rate. Parameters with consistently large gradients get a smaller effective LR; rarely-updated ones get larger.

**AdamW** changes how weight decay is applied: instead of adding an L2 penalty to the gradient (where Adam's per-parameter scaling distorts it), it shrinks the weights directly in a separate step. Use AdamW for most modern work.

**When SGD wins:** For image CNNs, SGD + momentum + LR schedule often achieves better final accuracy than Adam. Adam usually makes faster early progress, but in some settings the solutions it finds have been reported to generalize worse.

**Good defaults:** Adam/AdamW: `3e-4`, SGD: `0.01-0.1`
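
To make Adam's per-parameter scaling concrete, the sketch below writes out the update by hand (running averages `m` and `v`, bias correction, then the step) and compares it against `torch.optim.Adam` on a toy loss. The `scale` tensor is an illustrative device that makes the two gradients differ by a factor of 10,000; the toy loss and variable names are assumptions of this example.

```python
import torch

lr, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8   # betas and eps are Adam's defaults
scale = torch.tensor([100.0, 0.01])             # gradients will differ by 10^4

w = torch.tensor([1.0, 1.0], requires_grad=True)
opt = torch.optim.Adam([w], lr=lr, betas=(beta1, beta2), eps=eps)

w_manual = w.detach().clone()
m = torch.zeros(2)   # running average of the gradient
v = torch.zeros(2)   # running average of the squared gradient

for t in range(1, 4):
    # Reference: PyTorch's Adam
    loss = (scale * w ** 2).sum()
    opt.zero_grad()
    loss.backward()
    opt.step()

    # Hand-written Adam update on the copy
    g = 2 * scale * w_manual                 # analytic gradient of the toy loss
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)             # bias correction
    v_hat = v / (1 - beta2 ** t)
    step = lr * m_hat / (v_hat.sqrt() + eps)
    w_manual = w_manual - step
    if t == 1:
        first_step = step.clone()

print(f"First step per parameter: {first_step}")   # both close to lr
print(f"Matches torch.optim.Adam: {torch.allclose(w.detach(), w_manual, atol=1e-5)}")
```

Despite the huge difference in raw gradient size, both parameters move by roughly `lr` on the first step, which is the adaptive behavior described above.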

In [None]:
# Constructor syntax for the three optimizers (shown for reference only;
# in a real run you create ONE optimizer per model)
model = LinearModel()
sgd_opt   = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
adam_opt  = torch.optim.Adam(model.parameters(), lr=3e-4)
adamw_opt = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)

# Compare convergence on the same problem. Each optimizer uses its own learning
# rate, so the curves compare these particular settings, not the algorithms alone.
results = {}
for name, opt_cls, lr in [("SGD",   torch.optim.SGD,   0.1),
                            ("Adam",  torch.optim.Adam,  3e-4),
                            ("AdamW", torch.optim.AdamW, 3e-4)]:
    m = LinearModel()                      # fresh model for each optimizer
    opt = opt_cls(m.parameters(), lr=lr)
    lf = nn.MSELoss()
    ep_losses = []
    for _ in range(50):
        pred = m(X)
        loss = lf(pred, y)
        opt.zero_grad()
        loss.backward()
        opt.step()
        ep_losses.append(loss.item())
    results[name] = ep_losses

plt.figure(figsize=(8, 4))
for name, ls in results.items():
    plt.plot(ls, label=name)
plt.yscale('log')
plt.title('SGD vs Adam vs AdamW -- Convergence')
plt.xlabel('Epoch'); plt.ylabel('Loss')
plt.legend(); plt.grid(True); plt.show()

---
## 7. The Standard Training Loop

### Why `model.train()` vs `model.eval()`?

Some layers behave **differently** during training vs inference:

- **Dropout:** Train = randomly zeros neurons. Eval = uses all neurons (deterministic).
- **BatchNorm:** Train = normalizes using current batch stats. Eval = uses accumulated running stats.

Forgetting `model.eval()` is a **silent bug** -- model still runs but gives inconsistent, wrong predictions.

In [None]:
# Demonstrate train vs eval mode
dropout_demo = nn.Sequential(nn.Linear(10, 10), nn.Dropout(0.5))
x_test = torch.ones(1, 10)

dropout_demo.train()
out1 = dropout_demo(x_test)
out2 = dropout_demo(x_test)
print("Train mode (random, non-deterministic):")
print(f"  Same results? {torch.allclose(out1, out2)}")

dropout_demo.eval()
out3 = dropout_demo(x_test)
out4 = dropout_demo(x_test)
print("Eval mode (deterministic):")
print(f"  Same results? {torch.allclose(out3, out4)}")
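
BatchNorm shows the same train/eval split in a different way. In this sketch (batch size and feature count are arbitrary), train mode normalizes with the current batch statistics, while eval mode uses the running estimates, which after a single batch are still far from the true data statistics.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
bn = nn.BatchNorm1d(3)                 # running_mean starts at 0, momentum defaults to 0.1
x = torch.randn(16, 3) * 5 + 10        # batch with mean ~10, std ~5

bn.train()
out_train = bn(x)                      # normalized with THIS batch's mean and variance
print(f"Train-mode output mean: {out_train.mean(dim=0)}")   # ~0 for every feature

bn.eval()
out_eval = bn(x)                       # normalized with the running estimates instead
print(f"Eval-mode output mean:  {out_eval.mean(dim=0)}")    # not centered at 0

# After one training batch: running_mean = 0.9 * 0 + 0.1 * batch_mean
print(f"Running mean: {bn.running_mean}")
print(f"Batch mean:   {x.mean(dim=0)}")
```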

In [None]:
# THE STANDARD TRAINING LOOP -- copy this as your base template

def train_model(model, train_loader, val_loader, optimizer, loss_fn, num_epochs=10):
    train_losses, val_losses = [], []

    for epoch in range(num_epochs):
        # ---- TRAINING ----
        model.train()
        train_loss = 0.0
        for X_batch, y_batch in train_loader:
            # move to device if using GPU: X_batch = X_batch.to(device)
            y_pred = model(X_batch)
            loss   = loss_fn(y_pred, y_batch)

            optimizer.zero_grad()   # zero FIRST
            loss.backward()         # compute gradients
            optimizer.step()        # update parameters

            train_loss += loss.item()

        # ---- VALIDATION ----
        model.eval()
        val_loss = 0.0
        with torch.no_grad():       # no graph = memory savings
            for X_val, y_val in val_loader:
                val_pred = model(X_val)
                val_loss += loss_fn(val_pred, y_val).item()

        avg_train = train_loss / len(train_loader)
        avg_val   = val_loss   / len(val_loader)
        train_losses.append(avg_train)
        val_losses.append(avg_val)

        if (epoch + 1) % 5 == 0:
            print(f"Epoch {epoch+1:3d}/{num_epochs} | "
                  f"Train: {avg_train:.4f} | Val: {avg_val:.4f}")

    return train_losses, val_losses

print("Training loop template defined.")

---
## 8. Data -- Dataset and DataLoader

### Why Not Just Index Your Data Manually?

You could do `X[i:i+32]`. But DataLoader provides:

- **Shuffling** -- breaks spurious patterns the model could overfit to
- **Batching** -- GPUs process many samples in parallel
- **Parallel loading** (`num_workers`) -- load data in background while GPU computes
- **Pin memory** -- speeds up CPU-to-GPU data transfer

### Batch size guidance:
- Too small (1-8): Noisy gradients, poor GPU utilization  
- Too large (>512): May generalize worse, requires more memory  
- Sweet spot: **32 to 256** for most tasks

In [None]:
class RegressionDataset(Dataset):
    def __init__(self, X, y):
        self.X = X
        self.y = y

    def __len__(self):
        # Required: total number of samples
        return len(self.X)

    def __getitem__(self, idx):
        # Required: return one sample by index
        return self.X[idx], self.y[idx]


torch.manual_seed(42)
X_all = torch.randn(200, 1)
y_all = 3 * X_all + 2 + 0.1 * torch.randn(200, 1)

train_dataset = RegressionDataset(X_all[:160], y_all[:160])
val_dataset   = RegressionDataset(X_all[160:], y_all[160:])

print(f"Train: {len(train_dataset)} samples")
print(f"Val:   {len(val_dataset)} samples")

In [None]:
train_loader = DataLoader(
    train_dataset,
    batch_size=32,
    shuffle=True,       # shuffle every epoch -- important for training
    num_workers=0,      # 0 = main process; set to 4 for large image datasets
    # pin_memory=True   # uncomment when training on GPU
)

val_loader = DataLoader(
    val_dataset,
    batch_size=32,
    shuffle=False       # no need to shuffle validation
)

X_batch, y_batch = next(iter(train_loader))
print(f"Batch X: {X_batch.shape}")   # [32, 1]
print(f"Batch y: {y_batch.shape}")   # [32, 1]
print(f"Number of training batches: {len(train_loader)}")
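
When the data is already a pair of tensors, as it is here, a custom `Dataset` class is optional. As an alternative sketch, `TensorDataset` wraps the tensors and `random_split` produces a randomized train/val split; the variable names and the generator seed below are choices made for this example.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, random_split

torch.manual_seed(42)
X_data = torch.randn(200, 1)
y_data = 3 * X_data + 2 + 0.1 * torch.randn(200, 1)

# TensorDataset indexes all tensors together: dataset[i] -> (X_data[i], y_data[i])
dataset = TensorDataset(X_data, y_data)

# Random 160/40 split; the seeded generator makes the split reproducible
ds_train, ds_val = random_split(
    dataset, [160, 40], generator=torch.Generator().manual_seed(0)
)

loader = DataLoader(ds_train, batch_size=32, shuffle=True)
xb, yb = next(iter(loader))
print(f"Train: {len(ds_train)}, Val: {len(ds_val)}")
print(f"Batch X: {xb.shape}, Batch y: {yb.shape}, batches per epoch: {len(loader)}")
```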

In [None]:
# Full training run combining everything
torch.manual_seed(42)
model_full = LinearModel()
optimizer = torch.optim.AdamW(model_full.parameters(), lr=3e-2)
loss_fn = nn.MSELoss()

train_losses, val_losses = train_model(
    model_full, train_loader, val_loader,
    optimizer, loss_fn, num_epochs=20
)

print(f"\nW: {model_full.linear.weight.item():.4f}  (true: 3.0)")
print(f"b: {model_full.linear.bias.item():.4f}  (true: 2.0)")

In [None]:
plt.figure(figsize=(8, 4))
plt.plot(train_losses, label='Train', marker='o', markersize=4)
plt.plot(val_losses,   label='Val',   marker='s', markersize=4)
plt.xlabel('Epoch'); plt.ylabel('Loss')
plt.title('Train vs Val Loss')
plt.legend(); plt.grid(True); plt.show()

---
## 9. GPU Usage

### Why Manual Device Management?

PyTorch is explicit about where each tensor lives -- precise control for multi-GPU, limited VRAM, or mixed pipelines.

**The Golden Rule:** Model and data must be on the **same device**.

Most common error:
```
RuntimeError: Expected all tensors to be on the same device
```
Fix: `.to(device)` on both model and data.

In [None]:
# Device selection -- works on Colab GPU, Apple Silicon, and CPU
if torch.cuda.is_available():
    device = torch.device("cuda")
    print(f"GPU: {torch.cuda.get_device_name(0)}")
elif torch.backends.mps.is_available():
    device = torch.device("mps")   # Apple Silicon
    print("Apple Silicon MPS")
else:
    device = torch.device("cpu")
    print("CPU only")

print(f"Using: {device}")

In [None]:
# Move model to device
model_gpu = LinearModel().to(device)
print(f"Model device: {next(model_gpu.parameters()).device}")

# In training loop, move each batch to device
for X_batch, y_batch in train_loader:
    X_batch = X_batch.to(device)
    y_batch = y_batch.to(device)

    y_pred = model_gpu(X_batch)   # both on same device
    break

print(f"Batch device: {X_batch.device}")
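
Putting the golden rule into a full training step might look like the sketch below. `train_one_epoch` is a hypothetical helper written for this example (not a PyTorch function); the only change from a CPU-only loop is the two `.to(device)` calls on each batch.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def train_one_epoch(model, loader, optimizer, loss_fn, device):
    """Hypothetical helper: one pass over `loader`, moving every batch to `device`."""
    model.train()
    total = 0.0
    for X_batch, y_batch in loader:
        X_batch = X_batch.to(device)   # data must join the model on its device
        y_batch = y_batch.to(device)
        loss = loss_fn(model(X_batch), y_batch)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        total += loss.item()
    return total / len(loader)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

torch.manual_seed(0)
X_demo = torch.randn(64, 1)
y_demo = 3 * X_demo + 2
demo_loader = DataLoader(TensorDataset(X_demo, y_demo), batch_size=16)

demo_model = nn.Linear(1, 1).to(device)
demo_opt = torch.optim.SGD(demo_model.parameters(), lr=0.1)
demo_loss_fn = nn.MSELoss()

first = train_one_epoch(demo_model, demo_loader, demo_opt, demo_loss_fn, device)
for _ in range(20):
    last = train_one_epoch(demo_model, demo_loader, demo_opt, demo_loss_fn, device)
print(f"Device: {device} | first epoch loss: {first:.4f} | last: {last:.6f}")
```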

---
## 10. Saving & Loading Models

### Why `state_dict()` Instead of the Whole Model?

`torch.save(model, path)` pickles the entire Python object -- tied to your class definition and file structure. Rename or move the class, or upgrade PyTorch, and loading may break.

`state_dict()` saves just the **parameter tensors (plus buffers such as BatchNorm running stats) as a plain dictionary** -- portable and far less sensitive to code changes. This is the professional standard.

**Pattern:** Save state_dict. Load into freshly-constructed model.

In [None]:
# Save
torch.save(model_full.state_dict(), "linear_model.pth")
print("Saved.")

sd = model_full.state_dict()
print(f"Keys: {list(sd.keys())}")
for k, v in sd.items():
    print(f"  {k}: {v}")

In [None]:
# Load
loaded_model = LinearModel()
loaded_model.load_state_dict(
    torch.load("linear_model.pth", map_location=device)
)
loaded_model.eval()   # ALWAYS set eval for inference!

with torch.no_grad():
    test_x = torch.tensor([[1.0]])
    pred = loaded_model(test_x)
    print(f"Prediction for x=1.0: {pred.item():.4f}  (expected ~5.0)")
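
To resume training rather than just run inference, the optimizer state and epoch counter usually need saving too. A common pattern is to bundle several state dicts into one dictionary. In this sketch the dictionary keys, the file name, and the epoch number are arbitrary choices for illustration.

```python
import os
import tempfile

import torch
import torch.nn as nn

model = nn.Linear(1, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

# Bundle everything needed to resume; the key names are this example's own convention.
checkpoint = {
    "epoch": 5,
    "model_state": model.state_dict(),
    "optimizer_state": optimizer.state_dict(),
}
path = os.path.join(tempfile.gettempdir(), "checkpoint_demo.pth")
torch.save(checkpoint, path)

# Resume: rebuild the objects first, then load the saved state into them.
restored_model = nn.Linear(1, 1)
restored_opt = torch.optim.AdamW(restored_model.parameters(), lr=3e-4)

loaded = torch.load(path, map_location="cpu")
restored_model.load_state_dict(loaded["model_state"])
restored_opt.load_state_dict(loaded["optimizer_state"])
start_epoch = loaded["epoch"] + 1

print(f"Resuming from epoch {start_epoch}")
print(f"Weights restored: {torch.equal(restored_model.weight, model.weight)}")
```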

---
## 11. The Batch Dimension -- Why Data Is Always `[B, ...]`

Nearly every tensor in PyTorch has a **batch dimension first**:

| Data type | Shape |
|-----------|-------|
| Images | `[batch, channels, height, width]` |
| Text | `[batch, sequence_length]` |
| Tabular | `[batch, features]` |

**Why?** GPUs parallelize over the batch dimension. Processing 32 samples takes 32x the arithmetic of processing 1, but because the work runs in parallel, the wall-clock time is often nearly the same until the GPU is saturated. Model weights are shared across all samples.

**Single-sample inference:** You still need to add a batch dimension with `unsqueeze(0)`.

In [None]:
# Inference on a single sample
single = X_all[0]                          # shape [1] -- just features
print(f"Single sample: {single.shape}")

single_batched = single.unsqueeze(0)       # shape [1, 1] -- with batch dim
print(f"With batch:    {single_batched.shape}")

loaded_model.eval()
with torch.no_grad():
    pred = loaded_model(single_batched)
    print(f"Prediction: {pred.item():.4f}")
    print(f"Actual:     {y_all[0].item():.4f}")

---
## 12. The Full Mental Model

```
Raw Data
   |
   v  Dataset + transforms
Tensors in Batches
   |
   v  DataLoader (shuffled, parallel)
Forward Pass
   |
   v  nn.Module layers
Predictions
   |
   v  Loss Function
Scalar Loss
   |
   v  loss.backward()
Gradients
   |
   v  optimizer.step()
Updated Parameters
   |
   v  repeat for N epochs
Trained Model
```

### Every Component Has a Clear Role

| Component | Role | Without it... |
|-----------|------|---------------|
| **Tensor** | Holds data + parameters | No computation |
| **Autograd** | Computes gradients automatically | Manual calculus for every model |
| **nn.Module** | Organizes parameters + layers | Messy manual tracking |
| **Loss Function** | Defines what "wrong" means | Model optimizes the wrong thing |
| **Optimizer** | Updates parameters intelligently | Slow / unstable convergence |
| **DataLoader** | Feeds data efficiently | No parallelism, no shuffling |
| **train()/eval()** | Controls layer behavior | Silent wrong predictions |
| **no_grad()** | Saves memory at inference | OOM errors, slow inference |

---
## Quick Reference Cheatsheet

```python
# TENSOR CREATION
torch.zeros(3, 4)
torch.randn(3, 4)
torch.tensor([1.0, 2.0])
torch.arange(0, 10, 2)

# SHAPES
x.shape, x.ndim, x.dtype
x.reshape(3, -1)          # -1 = infer
x.unsqueeze(0)            # add batch dim
x.squeeze()               # remove size-1 dims
x.to(device)              # move to GPU/CPU

# AUTOGRAD
x = torch.tensor(1.0, requires_grad=True)
loss.backward()           # compute gradients
optimizer.zero_grad()     # ALWAYS before backward()
with torch.no_grad(): ... # inference / validation

# TRAINING LOOP
# model.train()
# pred = model(X); loss = fn(pred, y)
# optimizer.zero_grad(); loss.backward(); optimizer.step()

# VALIDATION
# model.eval()
# with torch.no_grad(): pred = model(X)

# DEVICE
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
tensor.to(device)

# SAVE / LOAD
torch.save(model.state_dict(), "model.pth")
model.load_state_dict(torch.load("model.pth"))
model.eval()  # after loading!
```