
# PyTorch Fundamentals (Torch Tutorial Style): **Tensors** + **Autograd** + MLP on Nonlinear Data

This notebook mirrors the structure of the **official PyTorch “Basics” tutorials** and then applies the concepts in a small end‑to‑end example:
1. **Tensors**
2. **Automatic Differentiation** with `torch.autograd`
3. MLP classifier on **non-linearly separable** synthetic data (two moons)

> Goal: make you fluent in the *core* PyTorch mechanics you’ll reuse later for larger models.


## 0) Setup

In [None]:
import math
import torch
import torch.nn as nn
import torch.nn.functional as F
import matplotlib  # needed for matplotlib.__version__ below
import matplotlib.pyplot as plt

device = "cuda" if torch.cuda.is_available() else "cpu"

print("Torch:", torch.__version__)
print("Matplotlib:", matplotlib.__version__)
print("CUDA available:", torch.cuda.is_available())
print("device:", device)

# The notebook is small enough to run on CPU, so warn instead of stopping.
if device == "cpu":
    print("⚠️ No GPU found; running on CPU. In Colab: Runtime -> Change runtime type")

torch.manual_seed(0)



## 1) Tensors

### Informal idea
A **tensor** is PyTorch’s core data structure: a typed, shape-aware, device-aware multi-dimensional array.
Most deep learning is just tensor ops + automatic differentiation.

### What you should walk away with
- how to **create** tensors
- how to inspect **shape / dtype / device**
- basic **indexing / slicing**
- **operations** and in-place semantics
- moving data across **CPU/GPU**
- interoperability with **NumPy**


### 1.1 Creating tensors

In [None]:

# From Python data
a = [[1, 2], [3, 4]]
t1 = torch.tensor(a)
print("t1:", t1, "| dtype:", t1.dtype)

In [None]:
# From another tensor: *_like keeps shape, dtype, and device unless overridden
t2 = torch.ones_like(t1)
t3 = torch.rand_like(t1, dtype=torch.float32)  # t1 is int64, so override with a float dtype
print("t2:", t2, "| t3:", t3, "| t3 dtype:", t3.dtype)

In [None]:
# Random / constant tensors
t4 = torch.randn(3, 4)          # standard normal
t5 = torch.zeros(2, 3, 5)       # zeros
t6 = torch.arange(0, 10)        # 0..9
print("t4 shape:", t4.shape, "t5 shape:", t5.shape, "t6:", t6)
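
The dtype that each factory function picks is easy to trip over. The sketch below (variable names are illustrative, not part of the tutorial) shows what PyTorch infers from Python data and why `rand_like` needs a dtype override on an integer tensor.

```python
import torch

ints = torch.tensor([[1, 2], [3, 4]])   # Python ints -> torch.int64
floats = torch.tensor([[1.0, 2.0]])     # Python floats -> default float dtype (float32 unless changed)
steps = torch.arange(0, 10)             # integer arguments -> torch.int64
noise = torch.rand_like(ints, dtype=torch.float32)  # explicit float dtype

print(ints.dtype, floats.dtype, steps.dtype, noise.dtype)

# Without the override, rand_like on an integer tensor is expected to fail,
# because uniform sampling in [0, 1) is not defined for integer dtypes.
try:
    torch.rand_like(ints)
    rand_like_failed = False
except RuntimeError as err:
    rand_like_failed = True
    print("rand_like on int64 failed:", type(err).__name__)
```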

### 1.2 Tensor attributes: shape, dtype, device

In [None]:

x = torch.randn(2, 3, device=device, dtype=torch.float32)
print("shape:", x.shape)
print("dtype:", x.dtype)
print("device:", x.device)


### 1.3 Indexing and slicing

In [None]:

x = torch.arange(0, 12).reshape(3, 4)
print(x)

print("x[0]:", x[0])
print("x[:, 1:3]:", x[:, 1:3])
print("last column:", x[:, -1])

# boolean masking
mask = (x % 2 == 0)
print("mask:", mask)
print("even elements:", x[mask])
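
One detail the cell above does not show: basic slicing returns a view on the same storage, while boolean masking returns a copy. A minimal sketch, using illustrative variable names:

```python
import torch

x = torch.arange(0, 12).reshape(3, 4)

block = x[:, 1:3]        # basic slicing: a view that shares storage with x
block[0, 0] = -1         # writes through to x[0, 1]
print("x[0, 1] after writing to the slice:", x[0, 1].item())

evens = x[x % 2 == 0]    # boolean masking: a new 1-D tensor (a copy)
evens[0] = 999           # does not touch x
print("x[0, 0] after writing to the masked copy:", x[0, 0].item())
```

If you need an independent copy of a slice, call `.clone()` on it.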


### 1.4 Basic operations (and broadcasting)

In [None]:

a = torch.randn(2, 3, device=device)
b = torch.randn(2, 3, device=device)

print("add:", (a + b).shape)
print("mul:", (a * b).shape)

# matrix multiply
m1 = torch.randn(2, 4, device=device)
m2 = torch.randn(4, 3, device=device)
mm = m1 @ m2
print("matmul shape:", mm.shape)

# broadcasting: (2,3) + (3,) -> (2,3)
v = torch.randn(3, device=device)
c = a + v
print("broadcasted add shape:", c.shape)
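
The broadcasting rule itself is short: align the shapes from the trailing dimension, and treat a dimension as compatible when the sizes match or one of them is 1. The pure-Python sketch below spells that out; `broadcast_shape` is a hypothetical helper written for illustration, not a PyTorch function.

```python
from itertools import zip_longest

def broadcast_shape(shape_a, shape_b):
    """Return the broadcast result shape, or raise ValueError if incompatible."""
    result = []
    # Walk both shapes from the last dimension; missing leading dims count as 1.
    for da, db in zip_longest(reversed(shape_a), reversed(shape_b), fillvalue=1):
        if da == db or db == 1:
            result.append(da)
        elif da == 1:
            result.append(db)
        else:
            raise ValueError(f"cannot broadcast {shape_a} with {shape_b}")
    return tuple(reversed(result))

print(broadcast_shape((2, 3), (3,)))        # the case from the cell above
print(broadcast_shape((4, 1, 3), (2, 1)))   # size-1 dims stretch on both sides
```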


### 1.5 In-place operations (use with care)

In [None]:

x = torch.ones(3, device=device)
print("before:", x)

# in-place add
x.add_(2)
print("after add_:", x)

# In-place ops can break autograd in some situations; avoid them unless you know why you need them.
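
To make the warning concrete, here is one situation where an in-place op breaks autograd. `exp` saves its output for the backward pass, so overwriting that output in place invalidates the saved value. This is a sketch of a single failure mode, not an exhaustive list.

```python
import torch

x = torch.ones(3, requires_grad=True)
y = x.exp()        # exp keeps its output around to compute the gradient
y.add_(1)          # in-place edit overwrites that saved output

try:
    y.sum().backward()
    failed = False
except RuntimeError as err:
    failed = True
    print("backward failed:", str(err)[:80])

# Out-of-place version: `+ 1` allocates a new tensor, so the saved output is intact.
x2 = torch.ones(3, requires_grad=True)
(x2.exp() + 1).sum().backward()
print("grad:", x2.grad)   # d/dx exp(x) = exp(x)
```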


### 1.6 Moving tensors across devices

In [None]:

x_cpu = torch.randn(2, 2, device="cpu")
x_dev = x_cpu.to(device)
print("cpu device:", x_cpu.device, "| moved device:", x_dev.device)


### 1.7 NumPy bridge (CPU only)

In [None]:

import numpy as np

x = torch.randn(3, 3)          # on CPU
x_np = x.numpy()               # shares memory
x_np[0, 0] = 123.0
print("torch sees numpy change:", x[0,0].item())

y_np = np.ones((2,2), dtype=np.float32)
y = torch.from_numpy(y_np)     # shares memory
y[0,0] = 7.0
print("numpy sees torch change:", y_np[0,0])
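
Two caveats go with the bridge. `.numpy()` refuses tensors that track gradients, and `torch.tensor(...)` copies its input where `torch.from_numpy(...)` shares memory. A short sketch with illustrative names:

```python
import numpy as np
import torch

w = torch.ones(2, requires_grad=True)

# .numpy() raises on a tensor that requires grad; detach first.
try:
    w.numpy()
    raised = False
except RuntimeError:
    raised = True
w_np = w.detach().numpy()          # shares memory with w

# from_numpy shares memory with the array; torch.tensor makes a copy.
src = np.zeros(2, dtype=np.float32)
shared = torch.from_numpy(src)
copied = torch.tensor(src)
src[0] = 5.0
print("shared:", shared[0].item(), "| copied:", copied[0].item())

# For a tensor on a GPU, the usual pattern is t.detach().cpu().numpy().
```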



## 2) Automatic Differentiation with `torch.autograd`

### Informal idea
PyTorch can automatically compute gradients for tensor expressions by building a **dynamic computation graph** during the forward pass.

### Formal idea (very compact)
For a scalar objective \(L(\theta)\), PyTorch computes \(\nabla_\theta L\) via reverse‑mode AD (backprop).

### What you should walk away with
- how `requires_grad` controls tracking
- what `.backward()` does
- how gradients accumulate in `.grad`
- when to use `detach()` and `torch.no_grad()`
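
To show what "building a graph and running reverse-mode AD" means, here is a toy scalar version in pure Python. The `Value` class is a hypothetical teaching aid; it is not how PyTorch is implemented, but it follows the same idea: record each op's inputs and local derivatives in the forward pass, then apply the chain rule in reverse order.

```python
class Value:
    """A scalar that remembers how it was computed (toy example)."""

    def __init__(self, data, parents=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self.parents = parents          # inputs of the op that produced this value
        self.local_grads = local_grads  # d(self)/d(parent), one entry per parent

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    __radd__ = __add__   # addition and multiplication are commutative
    __rmul__ = __mul__

    def backward(self):
        # Post-order DFS; reversing it visits every node before its inputs.
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node.parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0                              # dL/dL = 1
        for node in reversed(order):
            for parent, local in zip(node.parents, node.local_grads):
                parent.grad += local * node.grad     # chain rule, accumulated


x = Value(2.0)
y = x * x * x + 2 * x + 1    # y = x^3 + 2x + 1
y.backward()
print(y.data, x.grad)        # value and dy/dx = 3x^2 + 2 at x = 2
```

Note that gradients are added into `.grad` rather than assigned, which is the same reason PyTorch gradients accumulate across `.backward()` calls.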


### 2.1 A tiny scalar example

In [None]:

x = torch.tensor(2.0, device=device, requires_grad=True)
y = x**3 + 2*x + 1  # y = x^3 + 2x + 1
y.backward()        # dy/dx

print("x:", x.item())
print("y:", y.item())
print("dy/dx should be 3x^2 + 2 =", 3*(2.0**2) + 2)
print("autograd dy/dx:", x.grad.item())
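
A quick way to sanity-check any gradient is a finite-difference approximation. The sketch below uses plain Python and a central difference for the same function; `central_difference` is an illustrative helper, and the step size `h` is an assumption chosen for this example.

```python
def f(x):
    return x**3 + 2*x + 1

def central_difference(func, x, h=1e-5):
    # (f(x+h) - f(x-h)) / (2h) approximates f'(x) with O(h^2) error
    return (func(x + h) - func(x - h)) / (2 * h)

approx = central_difference(f, 2.0)
exact = 3 * 2.0**2 + 2
print(f"finite difference: {approx:.6f} | analytic: {exact:.6f}")
```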


### 2.2 Vector gradients: scalar output required for `.backward()`

In [None]:

x = torch.randn(3, device=device, requires_grad=True)
y = (x * x).sum()  # scalar
y.backward()
print("x:", x)
print("grad (should be 2x):", x.grad)
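
The "scalar output required" rule can be seen directly. Calling `.backward()` on a vector raises an error unless you pass a `gradient` argument, in which case autograd computes a vector-Jacobian product. A minimal sketch:

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = x * x                      # shape (3,), not a scalar

try:
    y.backward()               # no implicit seed gradient for non-scalar outputs
    raised = False
except RuntimeError:
    raised = True
print("backward() on a vector raised:", raised)

# Passing gradient=v computes v^T J. With v = ones this matches y.sum().backward().
y = x * x
y.backward(gradient=torch.ones_like(y))
print("grad:", x.grad)         # 2x
```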


### 2.3 Gradient accumulation (why you zero grads)

In [None]:

x = torch.tensor(1.0, device=device, requires_grad=True)

y1 = x * 3
y1.backward()
print("after first backward:", x.grad.item())  # 3

y2 = x * 4
y2.backward()
print("after second backward (accumulated):", x.grad.item())  # 3 + 4

# reset
x.grad.zero_()
print("after zero_:", x.grad.item())
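
Accumulation is also useful on purpose: the gradient of a large batch can be built from several smaller backward passes. The sketch below assumes a made-up loss (mean squared projection) and scales each micro-batch loss by its share of the full batch.

```python
import torch

torch.manual_seed(0)
w = torch.randn(3, requires_grad=True)
data = torch.randn(8, 3)

# Full-batch gradient.
loss_full = (data @ w).pow(2).mean()
loss_full.backward()
full_grad = w.grad.clone()

# Same gradient from two micro-batches; .grad accumulates across backward calls.
w.grad = None                  # reset; w.grad.zero_() would also work here
for chunk in data.chunk(2):
    loss = (chunk @ w).pow(2).mean() * (chunk.size(0) / data.size(0))
    loss.backward()

print("matches full-batch gradient:", torch.allclose(w.grad, full_grad, atol=1e-6))
```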


### 2.4 Detaching and `no_grad` (stop tracking)

In [None]:

x = torch.randn(2, 2, device=device, requires_grad=True)
y = (x @ x).sum()

# detach: new tensor shares storage but has no grad history
x_det = x.detach()
print("x_det requires_grad?", x_det.requires_grad)

with torch.no_grad():
    z = x * 10
print("z requires_grad?", z.requires_grad)

# You typically use no_grad during evaluation/inference.
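
`detach()` also changes the gradient you get when the detached tensor is used inside a tracked expression, because autograd treats it as a constant. A small sketch:

```python
import torch

x = torch.tensor(3.0, requires_grad=True)

y = x * x                 # both factors tracked: dy/dx = 2x
y.backward()
tracked_grad = x.grad.item()
print("tracked:", tracked_grad)

x.grad = None
y = x * x.detach()        # second factor is a constant: dy/dx = x
y.backward()
print("with detach:", x.grad.item())

# detach() returns a tensor that shares storage with the original.
print("same storage:", x.detach().data_ptr() == x.data_ptr())
```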


### 2.5 Inspecting the computation graph (light touch)

In [None]:

x = torch.randn(3, device=device, requires_grad=True)
y = (x.sin() * x).sum()
print("y.grad_fn:", y.grad_fn)  # shows last operation node
y.backward()
print("x.grad:", x.grad)
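
`grad_fn` is only the last node. Each node links to its inputs through `next_functions`, so the whole backward graph can be walked. The helper `walk` below is illustrative; the exact node names it prints (for example `SumBackward0`) can differ between PyTorch versions.

```python
import torch

x = torch.randn(3, requires_grad=True)
y = (x.sin() * x).sum()

def walk(fn, depth=0):
    """Print the backward graph reachable from a grad_fn node."""
    if fn is None:          # inputs that do not require grad show up as None
        return
    print("  " * depth + type(fn).__name__)
    for next_fn, _ in fn.next_functions:
        walk(next_fn, depth + 1)

walk(y.grad_fn)
```

Leaf tensors such as `x` appear as `AccumulateGrad` nodes, which is where `.grad` gets written.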



## 3) Apply the basics: MLP classification on non-linearly separable data (two moons)

We now use the mechanics from sections (1) and (2) to:
- generate a **two-moons** dataset (not linearly separable)
- train a small MLP with `CrossEntropyLoss`
- visualize the decision boundary


### 3.1 Generate two-moons data (no sklearn)

In [None]:

def make_moons(n_samples=2000, noise=0.18, distance=0.30, device="cpu"):
    n0 = n_samples // 2
    n1 = n_samples - n0

    theta0 = torch.rand(n0, device=device) * math.pi
    theta1 = torch.rand(n1, device=device) * math.pi

    x0 = torch.stack([torch.cos(theta0), torch.sin(theta0)], dim=1)
    x1 = torch.stack([1.0 - torch.cos(theta1), -torch.sin(theta1) - distance], dim=1)

    X = torch.cat([x0, x1], dim=0).float()
    y = torch.cat([torch.zeros(n0, device=device, dtype=torch.long),
                   torch.ones(n1, device=device, dtype=torch.long)], dim=0)

    X = X + noise * torch.randn_like(X)
    perm = torch.randperm(X.size(0), device=device)
    return X[perm], y[perm]

X, y = make_moons(device=device)
print("X:", X.shape, "y:", y.shape, "classes:", y.min().item(), "to", y.max().item())

def plot_points(X, y, title="data"):
    Xc = X.detach().cpu()
    yc = y.detach().cpu()
    plt.figure(figsize=(6, 5))
    plt.scatter(Xc[:, 0], Xc[:, 1], c=yc, s=10)
    plt.title(title)
    plt.xlabel("x1"); plt.ylabel("x2")
    plt.show()

plot_points(X, y, "Two moons (non-linearly separable)")


### 3.2 Train/val split + DataLoader

In [None]:

from torch.utils.data import TensorDataset, DataLoader

def train_val_split(X, y, val_frac=0.2):
    N = X.size(0)
    n_val = int(val_frac * N)
    X_val, y_val = X[:n_val], y[:n_val]
    X_train, y_train = X[n_val:], y[n_val:]
    return X_train, y_train, X_val, y_val

X_train, y_train, X_val, y_val = train_val_split(X, y, val_frac=0.2)

train_loader = DataLoader(TensorDataset(X_train, y_train), batch_size=128, shuffle=True)
val_loader = DataLoader(TensorDataset(X_val, y_val), batch_size=256, shuffle=False)

print("train:", X_train.shape, "val:", X_val.shape)
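
It helps to see what a `DataLoader` yields. When the dataset size is not a multiple of `batch_size`, the last batch is shorter, so a metric averaged per batch without weights gives that short batch extra influence. A sketch on a tiny made-up dataset:

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

features = torch.randn(10, 2)
labels = torch.arange(10) % 2
loader = DataLoader(TensorDataset(features, labels), batch_size=4, shuffle=False)

batch_sizes = []
for xb, yb in loader:
    print(xb.shape, yb.shape)
    batch_sizes.append(xb.size(0))

print("batches:", len(loader))            # ceil(10 / 4)
print("samples:", len(loader.dataset))
```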


### 3.3 Optional baseline: linear classifier

In [None]:

class LinearBaseline(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(2, 2)
    def forward(self, x):
        return self.fc(x)

def accuracy_from_logits(logits, y):
    return (logits.argmax(dim=-1) == y).float().mean()

baseline = LinearBaseline().to(device)
opt = torch.optim.SGD(baseline.parameters(), lr=0.1)
crit = nn.CrossEntropyLoss()

for epoch in range(60):
    baseline.train()
    for xb, yb in train_loader:
        logits = baseline(xb)
        loss = crit(logits, yb)
        opt.zero_grad()
        loss.backward()
        opt.step()

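baseline.eval()
correct = 0
with torch.no_grad():
    for xb, yb in val_loader:
        # count correct predictions so the short final batch is not over-weighted
        correct += (baseline(xb).argmax(dim=-1) == yb).sum().item()
print("Linear baseline val acc:", correct / len(val_loader.dataset))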


### 3.4 MLP classifier (`nn.Module`)

In [None]:

class MLP(nn.Module):
    def __init__(self, in_dim=2, hidden=(64, 64), num_classes=2, dropout=0.0):
        super().__init__()
        layers = []
        prev = in_dim
        for h in hidden:
            layers.append(nn.Linear(prev, h))
            layers.append(nn.ReLU())
            if dropout > 0:
                layers.append(nn.Dropout(dropout))
            prev = h
        layers.append(nn.Linear(prev, num_classes))
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)  # logits

model = MLP(hidden=(64, 64), dropout=0.2).to(device)
print("num params:", sum(p.numel() for p in model.parameters()))
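
The model returns raw logits rather than probabilities because `nn.CrossEntropyLoss` applies log-softmax internally. The sketch below checks that equivalence on random stand-in logits (not outputs of the model above).

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 2)             # stand-in for model outputs: (batch, classes)
targets = torch.tensor([0, 1, 1, 0])   # integer class indices

loss_builtin = F.cross_entropy(logits, targets)
loss_manual = F.nll_loss(F.log_softmax(logits, dim=1), targets)
# Written out: mean over the batch of -log p(correct class)
loss_by_hand = -logits.log_softmax(dim=1)[torch.arange(4), targets].mean()

print(loss_builtin.item(), loss_manual.item(), loss_by_hand.item())
```

Adding a `softmax` layer before `CrossEntropyLoss` would apply the normalization twice, which is a common mistake.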


### 3.5 Train the MLP (end-to-end autograd)

In [None]:

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-3, weight_decay=1e-4)

train_losses, val_accs = [], []

for epoch in range(80):
    model.train()
    running = 0.0
    for xb, yb in train_loader:
        logits = model(xb)
        loss = criterion(logits, yb)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        running += loss.item() * xb.size(0)

    train_loss = running / len(train_loader.dataset)
    train_losses.append(train_loss)

    model.eval()
    correct = 0
    with torch.no_grad():
        for xb, yb in val_loader:
            correct += (model(xb).argmax(dim=-1) == yb).sum().item()
    val_acc = correct / len(val_loader.dataset)  # every sample weighted equally
    val_accs.append(val_acc)

    if (epoch + 1) % 20 == 0 or epoch == 0:
        print(f"epoch {epoch+1:02d} | train loss {train_loss:.4f} | val acc {val_acc:.4f}")
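
The `model.train()` / `model.eval()` calls matter here because the MLP contains dropout, which behaves differently in the two modes. A standalone sketch with a single `nn.Dropout` layer:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)
x = torch.ones(1000)

drop.train()
out_train = drop(x)    # elements zeroed with prob. 0.5, survivors scaled by 1/(1-p) = 2
drop.eval()
out_eval = drop(x)     # identity in eval mode

print("train-mode values:", out_train.unique().tolist())
print("eval equals input:", torch.equal(out_eval, x))
```

Forgetting `model.eval()` before validation would leave dropout active and make the reported accuracy noisy.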


### 3.6 Plot learning curves

In [None]:

plt.figure(figsize=(6, 4))
plt.plot(train_losses)
plt.title("Training loss"); plt.xlabel("epoch"); plt.ylabel("loss")
plt.show()

plt.figure(figsize=(6, 4))
plt.plot(val_accs)
plt.title("Validation accuracy"); plt.xlabel("epoch"); plt.ylabel("accuracy")
plt.ylim(0, 1.05)
plt.show()


### 3.7 Decision boundary visualization

In [None]:

def plot_decision_boundary(model, X, y, grid_steps=300, pad=0.6, title="decision boundary"):
    model.eval()
    Xc = X.detach().cpu()
    yc = y.detach().cpu()

    # .item() converts 0-dim tensors to Python floats for torch.linspace
    x_min, x_max = Xc[:, 0].min().item() - pad, Xc[:, 0].max().item() + pad
    y_min, y_max = Xc[:, 1].min().item() - pad, Xc[:, 1].max().item() + pad

    xs = torch.linspace(x_min, x_max, grid_steps)
    ys = torch.linspace(y_min, y_max, grid_steps)
    xx, yy = torch.meshgrid(xs, ys, indexing="xy")
    grid = torch.stack([xx.reshape(-1), yy.reshape(-1)], dim=-1).to(next(model.parameters()).device)

    with torch.no_grad():
        logits = model(grid)
        preds = logits.argmax(dim=-1).reshape(grid_steps, grid_steps).cpu()

    plt.figure(figsize=(6, 5))
    plt.contourf(xx, yy, preds, alpha=0.35)
    plt.scatter(Xc[:, 0], Xc[:, 1], c=yc, s=10)
    plt.title(title)
    plt.xlabel("x1"); plt.ylabel("x2")
    plt.show()

plot_decision_boundary(model, X_val, y_val, title="MLP decision boundary (val set)")
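
The grid bookkeeping in the function above relies on `indexing="xy"`, where rows index y and columns index x, and on `reshape` undoing the flattening exactly. A small sketch with an uneven grid makes the layout visible; the `values` expression is a stand-in for model predictions.

```python
import torch

xs = torch.linspace(0.0, 1.0, 3)   # 3 points along x
ys = torch.linspace(0.0, 1.0, 2)   # 2 points along y

xx, yy = torch.meshgrid(xs, ys, indexing="xy")
print(xx.shape)                    # (len(ys), len(xs))

grid = torch.stack([xx.reshape(-1), yy.reshape(-1)], dim=-1)   # list of (x, y) points
values = grid[:, 0] + 10 * grid[:, 1]                          # one value per grid point
image = values.reshape(xx.shape)                               # back to the grid layout
print(torch.equal(image, xx + 10 * yy))
```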



## 4) Homework / Extensions

1. Increase noise in `make_moons(noise=0.30)` and observe accuracy + boundary.
2. Try smaller MLPs `(8,8)` vs larger `(128,128,128)` and compare under/overfitting.
3. Change the dropout rate (the model above already uses `dropout=0.2`; try `0.0` and `0.5`) and see how it affects boundary smoothness.
4. Replace ReLU with GELU.
