
# PyTorch in 75 Minutes 

**Prerequisites:** basic Python/NumPy  
**Goal:** Leave ready to read PyTorch code, build/train simple models, and debug common issues.

**What you'll learn**
1. Tensors & vectorization (device, dtype, shapes, broadcasting)
2. Autograd: building and differentiating computation graphs
3. `nn.Module`, losses, and optimizers
4. Input pipelines with `Dataset` / `DataLoader`
5. Canonical training & evaluation loops (+ checkpoints)
6. Mini project (MNIST)
7. Performance tips: `torch.compile`, mixed precision
8. Common gotchas and debugging patterns


## 0–5 min — Framing & Setup

In [None]:

# Environment & reproducibility
import sys, math, time, random
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader

print("Python:", sys.version.split()[0])
print("PyTorch:", torch.__version__)

# Use the GPU when one is available; otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Device in use:", device)

# For reproducibility, fix the seeds of PyTorch and Python's random module
torch.manual_seed(123)
random.seed(123)


### <mark>Why PyTorch instead of NumPy?</mark>

<mark>It can run on a GPU as well as a CPU.<br>
It computes gradients automatically (autograd).<br>
It ships with building blocks for deep learning (e.g. layers, loss functions, optimizers).</mark>


## 5–15 min — Tensors & Vectorization
Key ideas: **device**, **dtype**, **shapes**, **broadcasting**, and avoiding Python loops.<br>
<mark>See the broadcasting notebook for more detail on broadcasting.</mark>


In [None]:

# Create tensors directly on the chosen device
x = torch.arange(12, dtype=torch.float32, device=device).reshape(3, 4)
w = torch.randn(4, 2, device=device)
b = torch.zeros(2, device=device)

y = x @ w + b  # broadcast bias
print(f"shapes: x.shape={x.shape}, w.shape={w.shape}, y.shape={y.shape}, b.shape={b.shape}")
print(y[:2])


In [None]:

# Broadcasting demo
a = torch.randn(8, 1, 6, device=device)
c = torch.randn(1, 5, 6, device=device)
out = a + c  # -> [8, 5, 6]
print("Broadcasted shape:", out.shape)
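The rule behind the shape above: shapes are aligned from the trailing dimension, and two sizes are compatible when they are equal or one of them is 1. The sketch below checks result shapes with `torch.broadcast_shapes` without allocating any data; the specific shapes are illustrative choices, not part of the original demo.

```python
import torch

# Shapes are aligned from the trailing dimension; a size of 1 stretches to match.
shape_a = (8, 1, 6)
shape_c = (1, 5, 6)
result = torch.broadcast_shapes(shape_a, shape_c)
print(result)  # torch.Size([8, 5, 6])

# A missing leading dimension is treated as size 1 (the bias case from earlier).
bias_shape = torch.broadcast_shapes((3, 2), (2,))
print(bias_shape)  # torch.Size([3, 2])

# Incompatible sizes (4 vs 5 in the last dimension) raise a RuntimeError.
try:
    torch.broadcast_shapes((3, 4), (3, 5))
    compatible = True
except RuntimeError:
    compatible = False
print("compatible:", compatible)  # compatible: False
```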



**Notes**
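- Prefer constructing tensors directly on the target device (`device=device`); otherwise move them with `.to(device)`.
- `view` vs `reshape`: `view` never copies and requires a compatible (typically contiguous) memory layout; `reshape` returns a view when it can and copies otherwise.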
- Use vectorized operations; avoid explicit Python loops for math on tensors.
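A minimal sketch of the last two notes, assuming a small transposed tensor as the non-contiguous case. It shows `view` failing where `reshape` succeeds, and a vectorized reduction giving the same value as an explicit loop.

```python
import torch

t = torch.arange(6).reshape(2, 3)
tt = t.t()                      # transpose: same storage, non-contiguous strides
print(tt.is_contiguous())       # False

# view cannot reinterpret this layout without copying, so it raises.
try:
    tt.view(6)
    view_ok = True
except RuntimeError:
    view_ok = False

flat = tt.reshape(6)            # reshape falls back to a copy when needed
print(view_ok, flat.tolist())   # False [0, 3, 1, 4, 2, 5]

# Vectorized sum of squares vs. an explicit Python loop: same value,
# but the vectorized form avoids one Python-level iteration per element.
v = torch.arange(1000, dtype=torch.float64)
loop_total = 0.0
for item in v:
    loop_total += float(item) ** 2
vec_total = (v ** 2).sum().item()
print(loop_total == vec_total)  # True
```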


## 15–25 min — Autograd (automatic differentiation)

In [None]:
# A scalar function and its gradient
x = torch.tensor([2.0], requires_grad=True, device=device)
y = x**2 + 3*x + 1            # dy/dx = 2x + 3 = 7 when x=2
y.backward()
print("x.grad:", x.grad)
print("y.requires_grad:", y.requires_grad)


# Grads accumulate: zero them if reusing tensors
x.grad.zero_()
print("x.grad:", x.grad)
y2 = (x * 5 + 1)
y2.backward()
print("x.grad after second backward:", x.grad)

# Detach and no_grad
x2 = (x.detach() * 10)        # breaks graph
print("x2.requires_grad:", x2.requires_grad)
y2 = (x2 * 5 + 1)
# y2.backward()            # throws exception


with torch.no_grad():
    z = x * 7 + 1               # won't track gradients
    print("z.requires_grad:", z.requires_grad)
    # z.backward()                # throws exception
print("no_grad result:", z.item())


**Pitfalls**
- Gradients **accumulate**; be sure to zero them between steps.
- Wrap evaluation in `torch.no_grad()` to save memory/compute.
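Both pitfalls can be seen in a few lines. This is a small sketch on a CPU tensor; the function `3 * w` is an arbitrary choice for illustration.

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)

# First backward: d/dw of sum(3 * w) is 3 for every element.
(3 * w).sum().backward()
first = w.grad.clone()

# Second backward without zeroing: the new gradient is added to the old one.
(3 * w).sum().backward()
accumulated = w.grad.clone()

# Reset, then backward again: only the fresh gradient remains.
w.grad = None
(3 * w).sum().backward()
fresh = w.grad.clone()

print(first.tolist(), accumulated.tolist(), fresh.tolist())
# [3.0, 3.0] [6.0, 6.0] [3.0, 3.0]

# Under no_grad, results carry no graph, so nothing is stored for backward.
with torch.no_grad():
    y = (w * 2).sum()
print(y.requires_grad)  # False
```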


## 25–40 min — `nn.Module`, Losses, Optimizers

In [None]:

class MLP(nn.Module):   # PyTorch equivalent of our NumPy MLP class
    def __init__(self, in_dim=784, hid=256, out_dim=10):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hid), 
            nn.ReLU(),
            nn.Linear(hid, out_dim)
        )
    def forward(self, x):
        return self.net(x)

model = MLP().to(device)    # move the parameters to the chosen device (GPU or CPU)
# criterion = nn.MSELoss()    # PyTorch equivalent of the mean squared error loss we calculated using NumPy and the Value class
# optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)    # PyTorch equivalent of the gradient descent algorithm we implemented using NumPy and the Value class
criterion = nn.CrossEntropyLoss()   # expects raw logits and integer class labels
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)    # Adam: gradient descent with adaptive per-parameter step sizes


n_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Total trainable model params = {n_params/1e6:.3f} million")  # layer 1: 784*256 + 256 biases; layer 2: 256*10 + 10 biases; total 203,530 (about 0.2M)



**Notes**
- For classification with integer labels, use `nn.CrossEntropyLoss` with **logits** (no softmax).
- For regression, use `nn.MSELoss` and make sure prediction and target shapes match (`[N, 1]` vs `[N]` silently broadcasts).
- Common optimizers: `SGD` and `Adam`.
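The first two notes can be checked directly. The sketch below uses made-up logits and targets: it shows that `nn.CrossEntropyLoss` on logits matches log-softmax followed by negative log-likelihood, and what a `[N, 1]` vs `[N]` mismatch would broadcast to.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 3)             # raw scores for 4 samples, 3 classes
labels = torch.tensor([0, 2, 1, 2])    # integer class indices

# CrossEntropyLoss applies log-softmax internally, so it takes logits directly.
ce = nn.CrossEntropyLoss()(logits, labels)
manual = F.nll_loss(F.log_softmax(logits, dim=1), labels)
print(torch.allclose(ce, manual))      # True

# Regression shape pitfall: [N, 1] minus [N] broadcasts to [N, N].
pred = torch.zeros(4, 1)
target = torch.tensor([0.0, 1.0, 2.0, 3.0])
mismatched_shape = (pred - target).shape   # the difference a mismatched loss would average over
aligned = nn.MSELoss()(pred.squeeze(1), target)
print(mismatched_shape, aligned.item())    # torch.Size([4, 4]) 3.5
```

Applying softmax before `nn.CrossEntropyLoss` would normalize twice and flatten the gradients, which is why the loss is fed logits.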


## 40–55 min — Input Pipelines: `Dataset` / `DataLoader`

In [None]:

# A tiny synthetic classification dataset
class ToyDataset(Dataset):
    def __init__(self, n=1024, d=20, seed=0):
        # Ground-truth weights use a fixed seed so every split shares the same labeling rule
        true_w = torch.randn(d, 1, generator=torch.Generator().manual_seed(0))
        # Samples and label noise use a per-split seed so train and test data differ
        g = torch.Generator().manual_seed(seed)
        self.x = torch.randn(n, d, generator=g)
        logits = self.x @ true_w + 0.25 * torch.randn(n, 1, generator=g)
        self.y = (logits.squeeze(1) > 0).long()     # {0,1}
    def __len__(self): return len(self.x)
    def __getitem__(self, idx): return self.x[idx], self.y[idx]

train_ds = ToyDataset(2048, d=20, seed=1)
test_ds  = ToyDataset(512, d=20, seed=2)
pin = torch.cuda.is_available()   # pinned memory only helps when copying to a GPU
train_loader = DataLoader(train_ds, batch_size=64, shuffle=True, num_workers=0, pin_memory=pin)
test_loader  = DataLoader(test_ds,  batch_size=256, shuffle=False, num_workers=0, pin_memory=pin)

xb, yb = next(iter(train_loader))
print("Batch shapes:", xb.shape, yb.shape, "| dtypes:", xb.dtype, yb.dtype)
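When the data already lives in tensors, a custom `Dataset` class is optional. The sketch below uses `TensorDataset` with made-up random data and shows how `batch_size` and `drop_last` determine the number and size of batches.

```python
import math
import torch
from torch.utils.data import TensorDataset, DataLoader

features = torch.randn(100, 20)
targets = torch.randint(0, 2, (100,))
ds = TensorDataset(features, targets)      # indexes both tensors in lockstep

loader = DataLoader(ds, batch_size=32, shuffle=False)
batch_sizes = [xb.shape[0] for xb, _ in loader]
print(batch_sizes)                         # [32, 32, 32, 4]
print(len(loader) == math.ceil(100 / 32))  # True

# drop_last=True discards the short final batch (useful when the batch size must be fixed).
loader_fixed = DataLoader(ds, batch_size=32, shuffle=False, drop_last=True)
print(len(loader_fixed))                   # 3
```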


## 55–65 min — Canonical Training & Evaluation Loops

In [None]:

def train_one_epoch(model, loader, optimizer, criterion, device=device):
    model.train()
    total_loss, correct, total = 0.0, 0, 0
    for xb, yb in loader:
        xb, yb = xb.to(device), yb.to(device)
        if xb.dim() > 2:
            xb = xb.view(xb.size(0), -1)
        optimizer.zero_grad(set_to_none=True)
        logits = model(xb)
        loss = criterion(logits, yb)
        loss.backward()
        optimizer.step()
        total_loss += loss.item() * xb.size(0)
        pred = logits.argmax(dim=1)
        correct += (pred == yb).sum().item()
        total += xb.size(0)
    return total_loss/total, correct/total

@torch.no_grad()
def evaluate(model, loader, criterion, device=device):
    model.eval()
    total_loss, correct, total = 0.0, 0, 0
    for xb, yb in loader:
        xb, yb = xb.to(device), yb.to(device)
        if xb.dim() > 2:
            xb = xb.view(xb.size(0), -1)
        logits = model(xb)
        loss = criterion(logits, yb)
        total_loss += loss.item() * xb.size(0)
        pred = logits.argmax(dim=1)
        correct += (pred == yb).sum().item()
        total += xb.size(0)
    return total_loss/total, correct/total

# Quick demo training on synthetic data
demo_model = MLP(in_dim=20, hid=64, out_dim=2).to(device)
demo_criterion = nn.CrossEntropyLoss()
demo_optimizer = torch.optim.Adam(demo_model.parameters(), lr=1e-3)

for epoch in range(2):
    tr_loss, tr_acc = train_one_epoch(demo_model, train_loader, demo_optimizer, demo_criterion, device)
    te_loss, te_acc = evaluate(demo_model, test_loader, demo_criterion, device)
    print(f"epoch {epoch+1}: train_loss={tr_loss:.3f} acc={tr_acc:.3f} | "
          f"test_loss={te_loss:.3f} acc={te_acc:.3f}")


### Checkpointing (save & load)

In [None]:

# Save
torch.save({
    "model": demo_model.state_dict(),
    "opt": demo_optimizer.state_dict()
}, "ckpt_demo.pt")
print("Saved to ckpt_demo.pt")

# Load
ckpt = torch.load("ckpt_demo.pt", map_location=device)
demo_model.load_state_dict(ckpt["model"])
demo_optimizer.load_state_dict(ckpt["opt"])
print("Reloaded checkpoint.")
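To resume training rather than just reload weights, it usually helps to store the epoch counter as well. A minimal sketch, assuming a stand-in `nn.Linear` model, a temporary directory, and a hypothetical file name `ckpt_sketch.pt`:

```python
import os
import tempfile
import torch
import torch.nn as nn

model = nn.Linear(4, 2)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Take one step so the optimizer has state worth saving.
model(torch.randn(8, 4)).sum().backward()
opt.step()

path = os.path.join(tempfile.mkdtemp(), "ckpt_sketch.pt")   # hypothetical file name
torch.save({"model": model.state_dict(), "opt": opt.state_dict(), "epoch": 3}, path)

# Restore into freshly constructed objects, as a new process would.
model2 = nn.Linear(4, 2)
opt2 = torch.optim.Adam(model2.parameters(), lr=1e-3)
ckpt = torch.load(path, map_location="cpu")
model2.load_state_dict(ckpt["model"])
opt2.load_state_dict(ckpt["opt"])
start_epoch = ckpt["epoch"] + 1            # continue from the next epoch

same = all(torch.equal(a, b)
           for a, b in zip(model.state_dict().values(), model2.state_dict().values()))
print(same, start_epoch)                   # True 4
```

Saving `state_dict()` objects instead of the whole model keeps the checkpoint independent of the class definition's import path.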


## 65–70 min — Mini Project: MNIST in ~30 Lines 

In [None]:

# Try torchvision/MNIST
import torchvision
from torchvision import transforms


tfm = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,)),   # MNIST training-set mean and std
])
try:
    train_set = torchvision.datasets.MNIST(root="./data", train=True, download=True, transform=tfm)
    test_set  = torchvision.datasets.MNIST(root="./data", train=False, download=True, transform=tfm)
    use_mnist = True
except Exception as e:
    print("MNIST download failed; skipping the MNIST run:", e)
    use_mnist = False

if use_mnist:   # train only when the data is actually available
    train_loader_m = DataLoader(train_set, batch_size=128, shuffle=True, num_workers=2, pin_memory=torch.cuda.is_available())
    test_loader_m  = DataLoader(test_set,  batch_size=256, shuffle=False, num_workers=2, pin_memory=torch.cuda.is_available())
    model_m = MLP(in_dim=28*28, hid=256, out_dim=10).to(device)
    opt_m = torch.optim.Adam(model_m.parameters(), lr=1e-3)
    ce = nn.CrossEntropyLoss()
    for epoch in range(1):
        tr_l, tr_a = train_one_epoch(model_m, train_loader_m, opt_m, ce, device)
        te_l, te_a = evaluate(model_m, test_loader_m, ce, device)
        print(f"[MNIST] epoch {epoch+1}: train_acc={tr_a:.3f} test_acc={te_a:.3f}")


## 70–73 min — Performance & Modern Tips

In [None]:

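# torch.compile (PyTorch 2.x). May not speed up every model; test on your hardware.
# Compilation is lazy: it runs on the first forward pass, not when torch.compile is called.
try:
    compiled_demo = torch.compile(demo_model)  # reuses earlier demo model
    print("torch.compile wrapper created; compilation happens on the first call.")
except Exception as e:
    print("torch.compile not available or failed:", e)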


In [None]:

# Mixed precision: speedups on GPUs with tensor cores
scaler = None
if torch.cuda.is_available():
    scaler = torch.cuda.amp.GradScaler()
    print("Using GradScaler for mixed precision on CUDA.")
else:
    print("CUDA not available; skipping mixed precision demo.")
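The cell above only creates the scaler. A sketch of how it typically fits into a training step follows, assuming a stand-in `nn.Linear` model and random data. On a machine without CUDA, both autocast and the scaler are disabled and the loop runs in plain FP32. Recent PyTorch releases prefer the spelling `torch.amp.GradScaler("cuda")`; the older `torch.cuda.amp.GradScaler` is used here to match the cell above.

```python
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
use_amp = device.type == "cuda"            # on CPU this sketch degrades to plain FP32

model = nn.Linear(16, 4).to(device)        # stand-in model for illustration
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss()
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)

xb = torch.randn(32, 16, device=device)
yb = torch.randint(0, 4, (32,), device=device)

losses = []
for _ in range(3):
    optimizer.zero_grad(set_to_none=True)
    # Forward pass and loss run in reduced precision where it is safe to do so.
    with torch.autocast(device_type=device.type, enabled=use_amp):
        loss = criterion(model(xb), yb)
    scaler.scale(loss).backward()          # scale the loss to avoid FP16 gradient underflow
    scaler.step(optimizer)                 # unscales gradients, skips the step on inf/NaN
    scaler.update()                        # adjusts the scale factor for the next step
    losses.append(loss.item())
print(losses)
```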



**Tips**
- Raise `num_workers` in `DataLoader` to load batches in parallel, and set `pin_memory=True` for faster host→device transfer.
- Try `torch.backends.cudnn.benchmark = True` for convnets with static shapes.
- Profile before optimizing; verify correctness after any performance tweaks.
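For the "profile before optimizing" tip, a rough wall-clock timer is often enough for a first look. The helper `time_fn` below is hypothetical, written for this sketch. It synchronizes CUDA before reading the clock because GPU kernels launch asynchronously.

```python
import time
import torch

def time_fn(fn, *args, warmup=2, iters=10):
    """Average wall-clock seconds per call (hypothetical helper for this sketch)."""
    for _ in range(warmup):                # warm-up runs absorb one-time setup costs
        fn(*args)
    if torch.cuda.is_available():
        torch.cuda.synchronize()           # wait for queued GPU work before timing
    start = time.perf_counter()
    for _ in range(iters):
        fn(*args)
    if torch.cuda.is_available():
        torch.cuda.synchronize()           # make sure the timed work has finished
    return (time.perf_counter() - start) / iters

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
a = torch.randn(256, 256, device=device)
b = torch.randn(256, 256, device=device)
elapsed = time_fn(torch.matmul, a, b)
print(f"matmul: {elapsed * 1e6:.1f} microseconds per call")
```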


## 73–75 min — Gotchas, Debugging, Next Steps

In [None]:

# Common checks
try:
    print("xb dtype/device:", xb.dtype, xb.device)
except NameError:
    print("Run the DataLoader cell first to define a sample batch (xb).")

print("First parameter device:", next(demo_model.parameters()).device)

# NaN/Inf scan (example)
bad = []
for name, p in demo_model.named_parameters():
    if torch.isnan(p).any() or torch.isinf(p).any():
        bad.append(name)
print("Params with NaN/Inf:", bad or "None")
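The same kind of scan can be applied to gradients after `backward()`, which tends to catch problems one step earlier than a parameter scan. `grad_report` is a hypothetical helper for this sketch, and the small network with one frozen bias is an illustrative setup.

```python
import torch
import torch.nn as nn

def grad_report(model):
    """Return (names with no grad, names with non-finite grad, total grad norm)."""
    missing, non_finite, sq_sum = [], [], 0.0
    for name, p in model.named_parameters():
        if p.grad is None:
            missing.append(name)           # frozen or unused in the forward pass
        elif not torch.isfinite(p.grad).all():
            non_finite.append(name)        # contains NaN or Inf
        else:
            sq_sum += p.grad.pow(2).sum().item()
    return missing, non_finite, sq_sum ** 0.5

net = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
net[2].bias.requires_grad_(False)          # frozen on purpose so it shows up as "missing"
net(torch.randn(5, 4)).sum().backward()

missing, non_finite, norm = grad_report(net)
print("no grad:", missing, "| non-finite:", non_finite, "| grad norm:", norm)
```

A parameter listed under "no grad" that was meant to train usually points to a detached tensor or a branch the forward pass never reached.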



### Where to go next
- **Vision:** `nn.Conv2d`, padding/stride/dilation, `torchvision.models`
- **Sequence/Language:** `nn.Transformer`, `torchtext`, Hugging Face integration
- **Extras:** Schedulers, custom losses/metrics, model quantization 