# Topic 2: Automatic Differentiation & Backpropagation

## Learning Objectives

By the end of this notebook, you will:
- Understand WHY automatic differentiation is the "magic" behind neural networks
- Compute gradients using PyTorch's autograd system
- Understand the computational graph and how backpropagation works
- Know when to use `requires_grad`, `.backward()`, and `.detach()`
- Avoid common gradient-related bugs
- Implement gradient descent from scratch

---

## 1. The Big Picture: Why Automatic Differentiation?

### The Neural Network Training Problem

Imagine you're training a neural network to recognize cats. The network makes predictions, but they're wrong at first. How do you improve it?

**The solution**: Calculate how much each parameter (weight) contributes to the error, then adjust it to reduce the error. This requires computing **gradients** (derivatives).

### The Challenge

A modern neural network might have **billions of parameters**. Computing gradients manually is:
- **Tedious**: You'd need to derive formulas for every layer type
- **Error-prone**: One mistake breaks everything
- **Inflexible**: Changing your model means re-deriving everything

### The Solution: Automatic Differentiation (Autograd)

**Autograd does calculus for you automatically!** You write forward pass code, and PyTorch:
1. Builds a computational graph tracking operations
2. Computes gradients automatically using the chain rule
3. Stores gradients in `.grad` attributes

**This is the key innovation that made modern deep learning practical.**

### Why PyTorch's Approach?

PyTorch uses **dynamic computational graphs** (define-by-run):
- Graph is built during execution
- Easy to debug (just print values!)
- Flexible for dynamic models (different computation per input)
- Pythonic and intuitive

By contrast, TensorFlow 1.x used static graphs: you had to define the entire graph before running it, which made debugging harder.

In [None]:
# Setup
import torch
import numpy as np
import matplotlib.pyplot as plt

print(f"PyTorch version: {torch.__version__}")

# Set random seed for reproducibility
torch.manual_seed(42)
np.random.seed(42)

---

## 2. Gradients: The Math Behind Learning

### What Is a Gradient?

A **gradient** tells you how much a function's output changes when you change its input.

**Mathematical definition**: For function $f(x)$, the gradient (derivative) is:
$$\frac{df}{dx} = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h}$$

**Intuitive meaning**: 
- Positive gradient: Increasing $x$ increases $f(x)$
- Negative gradient: Increasing $x$ decreases $f(x)$
- Large gradient: $f(x)$ is very sensitive to $x$
- Zero gradient: $f(x)$ is at a minimum, maximum, or saddle point
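
The limit definition can be checked numerically. The snippet below is a minimal sketch, assuming the example function $f(x) = x^2$ and a hypothetical helper named `finite_difference`; it approximates the derivative with a small step `h` and compares it to the exact value $2x$.

```python
def f(x):
    """Example function: f(x) = x^2, whose exact derivative is 2x."""
    return x ** 2

def finite_difference(func, x, h=1e-5):
    """Approximate df/dx with the forward difference (f(x + h) - f(x)) / h."""
    return (func(x + h) - func(x)) / h

for x0 in [-3.0, 0.0, 3.0]:
    approx = finite_difference(f, x0)
    exact = 2 * x0
    print(f"x = {x0:5.1f}: approx = {approx:9.5f}, exact = {exact:5.1f}")
```

The approximation is close but not exact, and it needs two function evaluations per input. Autograd avoids both problems by applying exact derivative rules to each operation.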

### Why Gradients for Learning?

To minimize loss $L(\theta)$ where $\theta$ are parameters:
1. Compute gradient: $\frac{\partial L}{\partial \theta}$
2. Update parameters: $\theta_{\text{new}} = \theta_{\text{old}} - \eta \frac{\partial L}{\partial \theta}$

This is **gradient descent**: with learning rate $\eta$, we move the parameters against the gradient, which is the direction that reduces the loss!
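
As a worked instance of the update rule, assume the illustrative loss $L(\theta) = \theta^2$ with $\theta_{\text{old}} = 3$ and $\eta = 0.1$ (values chosen only for this example):

$$\frac{\partial L}{\partial \theta}\Big|_{\theta = 3} = 2 \cdot 3 = 6, \qquad \theta_{\text{new}} = 3 - 0.1 \cdot 6 = 2.4$$

The loss drops from $L(3) = 9$ to $L(2.4) = 5.76$, so a single step already moves $\theta$ toward the minimum at $\theta = 0$.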

In [None]:
# Visualize a function and its gradient
# Function: f(x) = x^2
# Gradient: df/dx = 2x

x = np.linspace(-5, 5, 100)
y = x**2
gradient = 2*x

plt.figure(figsize=(14, 5))

# Plot function
plt.subplot(1, 2, 1)
plt.plot(x, y, 'b-', linewidth=2, label='f(x) = x²')
plt.xlabel('x', fontsize=12)
plt.ylabel('f(x)', fontsize=12)
plt.title('Function: f(x) = x²', fontsize=14)
plt.grid(True, alpha=0.3)
plt.legend(fontsize=12)

# Mark some points
test_points = [-3, 0, 3]
for xp in test_points:
    yp = xp**2
    plt.plot(xp, yp, 'ro', markersize=8)
    plt.annotate(f'({xp}, {yp})', xy=(xp, yp), xytext=(xp+0.5, yp+2),
                fontsize=10, color='red')

# Plot gradient
plt.subplot(1, 2, 2)
plt.plot(x, gradient, 'r-', linewidth=2, label='df/dx = 2x')
plt.axhline(y=0, color='k', linestyle='--', alpha=0.3)
plt.axvline(x=0, color='k', linestyle='--', alpha=0.3)
plt.xlabel('x', fontsize=12)
plt.ylabel('df/dx', fontsize=12)
plt.title('Gradient: df/dx = 2x', fontsize=14)
plt.grid(True, alpha=0.3)
plt.legend(fontsize=12)

# Mark gradients at test points
for xp in test_points:
    grad = 2*xp
    plt.plot(xp, grad, 'ro', markersize=8)
    plt.annotate(f'grad={grad}', xy=(xp, grad), xytext=(xp+0.5, grad+0.5),
                fontsize=10, color='red')

plt.tight_layout()
plt.show()

print("Key observations:")
print("1. At x=0: gradient=0 (minimum point)")
print("2. At x=-3: gradient=-6 (negative, so f(x) is decreasing)")
print("3. At x=3: gradient=6 (positive, so f(x) is increasing)")
print("\nFor gradient descent: move opposite to gradient direction!")

---

## 3. PyTorch's Autograd: Basic Usage

### The Core Concept: `requires_grad=True`

Tell PyTorch which tensors you want gradients for by setting `requires_grad=True`.

**When to use**:
- Model parameters (weights, biases): `requires_grad=True`
- Input data: Usually `requires_grad=False` (we don't update inputs)
- Intermediate computations: Automatically get `requires_grad=True` when any input requires gradients
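
The rules above can be sketched with one tracked parameter and one untracked input. The names `w` and `x` and their values are assumptions chosen for illustration.

```python
import torch

w = torch.tensor(2.0, requires_grad=True)   # parameter: tracked
x = torch.tensor(5.0)                       # input data: not tracked (the default)

y = w * x                                   # intermediate result

print(f"w.requires_grad: {w.requires_grad}")  # True
print(f"x.requires_grad: {x.requires_grad}")  # False
print(f"y.requires_grad: {y.requires_grad}")  # True, because w is tracked

y.backward()
print(f"w.grad: {w.grad}")  # dy/dw = x = 5
print(f"x.grad: {x.grad}")  # None: no gradient is stored for untracked tensors
```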

In [None]:
# Example: Compute gradient of f(x) = x^2 at x = 3.0

# Create a tensor that requires gradient
x = torch.tensor(3.0, requires_grad=True)
print(f"x: {x}")
print(f"x.requires_grad: {x.requires_grad}")
print()

# Forward pass: compute function
y = x ** 2
print(f"y = x^2: {y}")
print(f"y.requires_grad: {y.requires_grad}")  # Inherited from x!
print()

# Backward pass: compute gradient
y.backward()  # Computes dy/dx

# Access gradient
print(f"Gradient dy/dx: {x.grad}")
print(f"Expected (2*x): {2*3.0}")
print(f"Match: {x.grad.item() == 6.0}")

### What Just Happened?

1. **Forward pass**: PyTorch built a computational graph:
   ```
   x (leaf) → [PowBackward0] → y
   ```

2. **Backward pass**: Starting from `y`, PyTorch:
   - Applied chain rule backwards through graph
   - Computed $\frac{dy}{dx} = 2x = 6$
   - Stored result in `x.grad`

3. **Key point**: You wrote `x**2`, PyTorch figured out the derivative automatically!
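
The graph described above can be inspected directly. This is a small sketch using the tensor attributes `is_leaf` and `grad_fn`; the exact text printed for `grad_fn` includes a memory address and will differ between runs.

```python
import torch

x = torch.tensor(3.0, requires_grad=True)
y = x ** 2

print(f"x.is_leaf: {x.is_leaf}")   # True: created by the user, not by an operation
print(f"x.grad_fn: {x.grad_fn}")   # None: leaves have no backward function
print(f"y.is_leaf: {y.is_leaf}")   # False: y is the result of an operation
print(f"y.grad_fn: {y.grad_fn}")   # Backward function that differentiates x ** 2
```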

In [None]:
# More complex example: f(x, y) = x^2 + 3xy + y^2

x = torch.tensor(2.0, requires_grad=True)
y = torch.tensor(3.0, requires_grad=True)

# Forward pass
z = x**2 + 3*x*y + y**2
print(f"z = x^2 + 3xy + y^2 = {z}")
print()

# Backward pass
z.backward()

# Gradients
print(f"dz/dx: {x.grad}")
print(f"Expected (2x + 3y): {2*2.0 + 3*3.0}")
print()

print(f"dz/dy: {y.grad}")
print(f"Expected (3x + 2y): {3*2.0 + 2*3.0}")

---

## 4. The Computational Graph

### What Is the Computational Graph?

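PyTorch builds a **directed acyclic graph (DAG)** where:
- **Nodes**: Operations (each result tensor stores the operation that produced it as `grad_fn`); the input tensors are the leaves
- **Edges**: Tensors flowing from one operation to the next
- **Direction**: Forward (data flow) and backward (gradient flow)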

### Example Graph

For: `z = (x + y) * w`

```
Forward:
x ──┐
    ├─→ [Add] ─→ a ─→ [Mul] ─→ z
y ──┘              ↑
                   │
w ─────────────────┘

Backward:
∂z/∂z = 1 ─→ [MulBackward] ─┬─→ ∂z/∂a ─→ [AddBackward] ─┬─→ ∂z/∂x
                            │                           └─→ ∂z/∂y
                            └─→ ∂z/∂w
```
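
The backward graph for this example can also be walked in code. The following is a sketch that relies on the `next_functions` attribute of each `grad_fn` node; `print_graph` is a hypothetical helper written for this illustration, not a PyTorch function.

```python
import torch

def print_graph(fn, depth=0):
    """Recursively print the backward graph starting from a grad_fn node."""
    if fn is None:          # inputs that do not require gradients appear as None
        return
    print("  " * depth + type(fn).__name__)
    for next_fn, _ in fn.next_functions:
        print_graph(next_fn, depth + 1)

x = torch.tensor(2.0, requires_grad=True)
y = torch.tensor(3.0, requires_grad=True)
w = torch.tensor(4.0, requires_grad=True)

z = (x + y) * w
print_graph(z.grad_fn)
```

The root is the multiplication node, the addition node sits below it, and each leaf tensor shows up as an `AccumulateGrad` node, which is the step that writes into `.grad`.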

In [None]:
# Build a computational graph
x = torch.tensor(2.0, requires_grad=True)
y = torch.tensor(3.0, requires_grad=True)
w = torch.tensor(4.0, requires_grad=True)

# Forward: z = (x + y) * w
a = x + y      # Intermediate node
z = a * w      # Output

print("Computational Graph:")
print(f"a = x + y = {a}")
print(f"z = a * w = {z}")
print()

# Inspect the graph
print("Graph structure:")
print(f"z.grad_fn: {z.grad_fn}")
print(f"a.grad_fn: {a.grad_fn}")
print()

# Backward pass
z.backward()

# Gradients via chain rule:
# dz/dx = dz/da * da/dx = w * 1 = 4
# dz/dy = dz/da * da/dy = w * 1 = 4  
# dz/dw = a * 1 = 5

print("Gradients:")
print(f"dz/dx: {x.grad} (expected: {w.item()})")
print(f"dz/dy: {y.grad} (expected: {w.item()})")
print(f"dz/dw: {w.grad} (expected: {a.item()})")

### Dynamic Graphs: The PyTorch Advantage

The graph is built **during execution**, so you can use normal Python control flow!
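
Loops work the same way as branches. In this sketch the hypothetical helper `repeated_multiply` runs an ordinary Python loop, so the graph gets one multiplication node per pass and its size depends on `n_steps`.

```python
import torch

def repeated_multiply(x, n_steps):
    """Multiply by x n_steps times, so the result is x ** (n_steps + 1)."""
    out = x
    for _ in range(n_steps):   # each pass adds one Mul node to the graph
        out = out * x
    return out

x = torch.tensor(2.0, requires_grad=True)

for n_steps in [1, 2, 3]:
    x.grad = None                      # clear the gradient from the previous run
    y = repeated_multiply(x, n_steps)
    y.backward()
    # d/dx x^(n+1) = (n+1) * x^n
    expected = (n_steps + 1) * 2.0 ** n_steps
    print(f"n_steps={n_steps}: grad={x.grad.item()}, expected={expected}")
```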

In [None]:
# Dynamic graph example: different computation based on input
def dynamic_function(x, condition):
    if condition:
        return x ** 2  # Graph: x -> [Pow] -> output
    else:
        return x ** 3  # Graph: x -> [Pow] -> output

# Test both paths
x = torch.tensor(2.0, requires_grad=True)

# Path 1: x^2
y1 = dynamic_function(x, condition=True)
y1.backward()
print(f"Path 1 (x^2): gradient = {x.grad} (expected: 2x = 4)")

# Reset gradient
x.grad.zero_()

# Path 2: x^3
y2 = dynamic_function(x, condition=False)
y2.backward()
print(f"Path 2 (x^3): gradient = {x.grad} (expected: 3x^2 = 12)")

print("\nThis flexibility is why PyTorch is great for research!")

---

## 5. The Chain Rule: How Backpropagation Works

### The Chain Rule in Calculus

For composite functions: $f(g(x))$

$$\frac{df}{dx} = \frac{df}{dg} \cdot \frac{dg}{dx}$$

### In Neural Networks

For a 3-layer network: input → layer1 → layer2 → layer3 → loss

$$\frac{\partial L}{\partial w_1} = \frac{\partial L}{\partial \text{layer3}} \cdot \frac{\partial \text{layer3}}{\partial \text{layer2}} \cdot \frac{\partial \text{layer2}}{\partial \text{layer1}} \cdot \frac{\partial \text{layer1}}{\partial w_1}$$

**Backpropagation = Efficient application of chain rule**

### Why "Backpropagation"?

We compute gradients starting from the output and propagating **backwards** through the network:
1. Start at loss: $\frac{\partial L}{\partial L} = 1$
2. Compute gradients layer by layer going backwards
3. Each layer uses the gradient from the next layer (chain rule!)
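
These steps can be observed on a tiny chain of operations standing in for layers. The sketch below assumes scalar "layers" `h1` and `h2` (hypothetical names) and uses `retain_grad()` so that the gradients of intermediate tensors, which PyTorch normally discards, stay available for printing.

```python
import torch

x = torch.tensor(2.0, requires_grad=True)

# A small chain standing in for layers: x -> h1 -> h2 -> loss
h1 = 3 * x          # h1 = 6
h2 = h1 ** 2        # h2 = 36
loss = h2 + 1       # loss = 37

# Keep gradients of intermediate (non-leaf) tensors
h1.retain_grad()
h2.retain_grad()

loss.backward()

print(f"dloss/dh2: {h2.grad.item()}")  # 1
print(f"dloss/dh1: {h1.grad.item()}")  # dloss/dh2 * dh2/dh1 = 1 * 2*h1 = 12
print(f"dloss/dx:  {x.grad.item()}")   # dloss/dh1 * dh1/dx = 12 * 3 = 36
```

Each printed value is the previous one multiplied by a single local derivative, which is the chain rule applied one step at a time.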

In [None]:
# Manual chain rule example: f(g(x)) = (x^2 + 1)^3
# Let g(x) = x^2 + 1, f(g) = g^3
# df/dx = df/dg * dg/dx = 3g^2 * 2x = 3(x^2 + 1)^2 * 2x

x = torch.tensor(2.0, requires_grad=True)

# Forward pass
g = x**2 + 1        # g = 5
f = g**3            # f = 125

print(f"x: {x.item()}")
print(f"g = x^2 + 1: {g.item()}")
print(f"f = g^3: {f.item()}")
print()

# Automatic differentiation
f.backward()

# Manual calculation
df_dg = 3 * g**2     # = 3 * 25 = 75
dg_dx = 2 * x        # = 4
df_dx_manual = df_dg * dg_dx  # = 75 * 4 = 300

print("Chain rule breakdown:")
print(f"df/dg = 3g^2: {df_dg.item()}")
print(f"dg/dx = 2x: {dg_dx.item()}")
print(f"df/dx = df/dg * dg/dx: {df_dx_manual.item()}")
print()
print(f"PyTorch computed: {x.grad.item()}")
print(f"Match: {torch.isclose(x.grad, df_dx_manual.detach()).item()}")

---

## 6. Vector and Matrix Gradients

### Jacobian Matrix

For vector function $\mathbf{y} = f(\mathbf{x})$ where $\mathbf{x} \in \mathbb{R}^n$, $\mathbf{y} \in \mathbb{R}^m$:

The Jacobian is:
$$J = \begin{bmatrix}
\frac{\partial y_1}{\partial x_1} & \cdots & \frac{\partial y_1}{\partial x_n} \\
\vdots & \ddots & \vdots \\
\frac{\partial y_m}{\partial x_1} & \cdots & \frac{\partial y_m}{\partial x_n}
\end{bmatrix}$$

### Why This Matters

Neural networks work with vectors/matrices, not scalars. Understanding vector gradients helps you:
- Debug shape mismatches
- Understand backpropagation in matrix form
- Implement custom layers
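
For small functions, the full Jacobian can be computed with `torch.autograd.functional.jacobian`. This is a sketch with a made-up function from $\mathbb{R}^3$ to $\mathbb{R}^2$, chosen only so that each entry is easy to verify by hand.

```python
import torch
from torch.autograd.functional import jacobian

def f(x):
    """Map R^3 -> R^2: y1 = x1 * x2, y2 = x2 + x3^2."""
    return torch.stack([x[0] * x[1], x[1] + x[2] ** 2])

x = torch.tensor([1.0, 2.0, 3.0])
J = jacobian(f, x)

print(f"Jacobian shape: {tuple(J.shape)}")  # (2, 3): m outputs by n inputs
print(J)
# Row 1: [x2, x1, 0]    = [2, 1, 0]
# Row 2: [0,  1,  2*x3] = [0, 1, 6]
```

Training code rarely builds the full Jacobian, because `backward()` only needs its product with a vector.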

In [None]:
# Vector-to-scalar function: most common in ML (loss functions)
# Example: L(x) = sum(x^2) where x is a vector

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
print(f"x: {x}")

# Compute loss (scalar output)
L = (x ** 2).sum()  # L = 1 + 4 + 9 = 14
print(f"L = sum(x^2): {L}")
print()

# Backward pass
L.backward()

# Gradient is a vector: dL/dx = [2x1, 2x2, 2x3]
print(f"Gradient dL/dx: {x.grad}")
print(f"Expected [2x1, 2x2, 2x3]: {2 * x.detach()}")

In [None]:
# Matrix gradient example: Linear layer y = Wx
# W: (2, 3) weight matrix
# x: (3,) input vector  
# y: (2,) output vector

W = torch.tensor([[1.0, 2.0, 3.0],
                  [4.0, 5.0, 6.0]], requires_grad=True)
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

# Forward pass
y = W @ x  # Matrix-vector multiplication
print(f"y = Wx: {y}")

# To compute gradients, we need a scalar loss
loss = y.sum()  # Simple loss: sum of outputs
print(f"loss = sum(y): {loss}")
print()

# Backward pass
loss.backward()

print(f"Gradient dL/dx: {x.grad}")
print(f"Shape: {x.grad.shape}")
print()

print(f"Gradient dL/dW:\n{W.grad}")
print(f"Shape: {W.grad.shape}")

# Key insight: Gradients have the same shape as the original tensors!

### Important: Non-Scalar Outputs

`backward()` only works on scalars by default. For vector outputs, you need to provide a gradient vector.
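
One way to read the gradient argument: passing a vector `v` gives the same result as calling `backward()` on the scalar `(v * y).sum()`. The sketch below checks this, with weights in `v` chosen arbitrarily for illustration.

```python
import torch

v = torch.tensor([1.0, 0.1, 0.01])   # weight for each output (hypothetical values)

# Option A: pass v as the gradient argument
x_a = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y_a = x_a ** 2
y_a.backward(gradient=v)

# Option B: build the weighted scalar explicitly and call backward() on it
x_b = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y_b = x_b ** 2
(v * y_b).sum().backward()

print(f"backward(gradient=v):     {x_a.grad}")
print(f"(v * y).sum().backward(): {x_b.grad}")  # both equal v * 2x
```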

In [None]:
# Vector output requires gradient argument
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = x ** 2  # Vector output: [1, 4, 9]

print(f"y: {y}")
print(f"y.shape: {y.shape}")
print()

# This would fail:
# y.backward()  # RuntimeError: grad can be implicitly created only for scalar outputs

# Need to provide a gradient vector v; backward() then computes the
# vector-Jacobian product v^T J (how much each output affects a final scalar)
# For equal weighting: use ones
gradient_vector = torch.ones_like(y)
y.backward(gradient=gradient_vector)

print(f"Gradient: {x.grad}")
print(f"Expected [2x1, 2x2, 2x3]: {2 * x.detach()}")

# In practice, you usually have a scalar loss, so this is automatic

---

## 7. Common Autograd Operations

### Gradient Accumulation

**Important**: PyTorch **accumulates** gradients by default (adds to existing `.grad`).
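
Accumulation is also useful on purpose. The sketch below, which assumes a toy dataset and a single hypothetical parameter `w`, shows that summing gradients over two micro-batches reproduces the full-batch gradient, as long as each partial loss is scaled by the total number of samples.

```python
import torch

torch.manual_seed(0)
data = torch.randn(8)                       # hypothetical toy dataset
w = torch.tensor(1.5, requires_grad=True)

# Full-batch gradient of a mean-squared loss
loss_full = ((w * data) ** 2).mean()
loss_full.backward()
grad_full = w.grad.clone()

# Same gradient, accumulated over two micro-batches of 4 samples each
w.grad = None
for chunk in data.split(4):
    # Divide by the total number of samples so the pieces sum to the full mean
    loss_part = ((w * chunk) ** 2).sum() / len(data)
    loss_part.backward()                    # adds into w.grad

print(f"Full batch:  {grad_full.item():.6f}")
print(f"Accumulated: {w.grad.item():.6f}")
```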

In [None]:
# Demonstrate gradient accumulation
x = torch.tensor(2.0, requires_grad=True)

# First computation
y1 = x ** 2
y1.backward()
print(f"After first backward: x.grad = {x.grad}")

# Second computation (without zeroing gradients!)
y2 = x ** 3
y2.backward()
print(f"After second backward: x.grad = {x.grad}")  # 4 + 12 = 16
print()

# This is why we need to zero gradients in training loops!
x.grad.zero_()  # Reset to zero
print(f"After zero_(): x.grad = {x.grad}")

# Or set to None (more memory efficient)
x.grad = None
print(f"After setting None: x.grad = {x.grad}")

### Detaching from the Graph

Sometimes you want to use a tensor's value without tracking gradients.
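
A detached tensor behaves like a constant in later computations. This sketch compares the gradient of `x * y` with and without detaching `y`; the variable names are chosen for illustration.

```python
import torch

x = torch.tensor(2.0, requires_grad=True)
y = x ** 2                      # y = 4, tracked

# Tracked: z = x * y = x^3, so dz/dx = 3x^2 = 12
z_tracked = x * y
z_tracked.backward()
grad_tracked = x.grad.item()
print(f"With tracking: {grad_tracked}")

x.grad = None

# Detached: y.detach() is treated as the constant 4, so dz/dx = 4
z_stopped = x * y.detach()
z_stopped.backward()
grad_stopped = x.grad.item()
print(f"With detach(): {grad_stopped}")
```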

In [None]:
# Detach example: Using a computed value without gradients
x = torch.tensor(2.0, requires_grad=True)
y = x ** 2

# Detach y from the computational graph
y_detached = y.detach()

print(f"y: {y}, requires_grad: {y.requires_grad}")
print(f"y_detached: {y_detached}, requires_grad: {y_detached.requires_grad}")
print()

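# Use detached value in new computation
z = y_detached ** 2  # Same value as x^4, but with no graph back to x
print(f"z.requires_grad: {z.requires_grad}")  # False

# z.backward() would raise a RuntimeError here because z has no grad_fn
print(f"x.grad: {x.grad}")  # None! Gradient flow was stopped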

# Common use cases:
# 1. Computing metrics during training (accuracy, etc.)
# 2. Implementing stop-gradient techniques
# 3. Target networks in RL (frozen copies)

### No-Grad Context

Disable gradient tracking for a block of code. No graph is built inside the block, which saves memory and time.
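
A common bug is updating a parameter in place outside of `torch.no_grad()`. The sketch below shows PyTorch rejecting such an update on a leaf tensor and then accepting the same update inside the context manager; `update_rejected` is a flag added only for this illustration.

```python
import torch

x = torch.tensor(2.0, requires_grad=True)

# Without no_grad, an in-place update of a leaf tensor is rejected
update_rejected = False
try:
    x -= 0.1
except RuntimeError as err:
    update_rejected = True
    print(f"RuntimeError: {err}")

# Inside no_grad, the same update is allowed and is not recorded in the graph
with torch.no_grad():
    x -= 0.1

print(f"x: {x.item():.1f}")                   # 1.9
print(f"x.requires_grad: {x.requires_grad}")  # still True
```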

In [None]:
# torch.no_grad() context manager
x = torch.tensor(2.0, requires_grad=True)

# Normal computation (tracks gradients)
y1 = x ** 2
print(f"y1.requires_grad: {y1.requires_grad}")

# Inside no_grad block
with torch.no_grad():
    y2 = x ** 2
    print(f"y2.requires_grad: {y2.requires_grad}")
    
# Use cases:
# 1. Evaluation/inference (don't need gradients)
# 2. Updating parameters (optimizer.step() uses this internally)
# 3. Computing metrics

print("\nBenefit: Saves memory and speeds up computation!")

---

## 8. Gradient Descent from Scratch

Now let's implement gradient descent manually to see how it all fits together!

In [None]:
# Problem: Find minimum of f(x) = (x - 3)^2 + 5
# We know analytically: minimum at x = 3, value = 5
# Let's use gradient descent to find it!

def f(x):
    """Function to minimize"""
    return (x - 3)**2 + 5

# Initialize parameter
x = torch.tensor(0.0, requires_grad=True)  # Start at x=0

# Hyperparameters
learning_rate = 0.1
n_iterations = 50

# Track history for visualization
history_x = []
history_f = []

print("Iteration | x      | f(x)   | gradient")
print("-" * 50)

for i in range(n_iterations):
    # Forward pass
    loss = f(x)
    
    # Backward pass
    loss.backward()
    
    # Save history
    history_x.append(x.item())
    history_f.append(loss.item())
    
    # Print progress
    if i % 10 == 0:
        print(f"{i:9d} | {x.item():6.3f} | {loss.item():6.3f} | {x.grad.item():8.3f}")
    
    # Update parameters (gradient descent step)
    with torch.no_grad():
        x -= learning_rate * x.grad  # x_new = x_old - lr * gradient
    
    # Zero gradients for next iteration
    x.grad.zero_()

print("-" * 50)
print(f"\nFinal result:")
print(f"x = {x.item():.6f} (expected: 3.0)")
print(f"f(x) = {f(x).item():.6f} (expected: 5.0)")

In [None]:
# Visualize gradient descent
x_plot = np.linspace(-1, 6, 100)
y_plot = (x_plot - 3)**2 + 5

plt.figure(figsize=(12, 5))

# Plot 1: Function and optimization path
plt.subplot(1, 2, 1)
plt.plot(x_plot, y_plot, 'b-', linewidth=2, label='f(x) = (x-3)² + 5')
plt.plot(history_x, history_f, 'ro-', markersize=4, linewidth=1, label='Gradient descent path')
plt.plot(history_x[0], history_f[0], 'go', markersize=10, label='Start')
plt.plot(history_x[-1], history_f[-1], 'r*', markersize=15, label='End')
plt.plot(3, 5, 'bs', markersize=10, label='True minimum')
plt.xlabel('x', fontsize=12)
plt.ylabel('f(x)', fontsize=12)
plt.title('Gradient Descent Optimization', fontsize=14)
plt.legend(fontsize=10)
plt.grid(True, alpha=0.3)

# Plot 2: Loss over iterations
plt.subplot(1, 2, 2)
plt.plot(history_f, 'b-', linewidth=2)
plt.axhline(y=5, color='r', linestyle='--', label='Minimum value')
plt.xlabel('Iteration', fontsize=12)
plt.ylabel('f(x)', fontsize=12)
plt.title('Loss vs Iterations', fontsize=14)
plt.legend(fontsize=10)
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("Observations:")
print("1. Started far from minimum (x=0)")
print("2. Gradually converged to x=3")
print("3. Larger steps at start (large gradient), smaller near minimum")
print("4. This is how neural networks learn!")

---

## Mini Exercises

Test your understanding with these exercises!

### Exercise 1: Compute Gradients

Given $f(x, y) = 2x^2 + xy + 3y^2$, compute:
1. $\frac{\partial f}{\partial x}$ at $(x=1, y=2)$
2. $\frac{\partial f}{\partial y}$ at $(x=1, y=2)$

Use PyTorch's autograd and verify against manual calculation.

In [None]:
# Your code here


In [None]:
# Solution
x = torch.tensor(1.0, requires_grad=True)
y = torch.tensor(2.0, requires_grad=True)

# Forward pass
f = 2*x**2 + x*y + 3*y**2
print(f"f(1, 2) = {f.item()}")
print()

# Backward pass
f.backward()

# Results
print(f"df/dx: {x.grad.item()}")
print(f"df/dy: {y.grad.item()}")
print()

# Manual calculation:
# df/dx = 4x + y = 4(1) + 2 = 6
# df/dy = x + 6y = 1 + 6(2) = 13
print("Expected:")
print(f"df/dx = 4x + y = {4*1 + 2}")
print(f"df/dy = x + 6y = {1 + 6*2}")

### Exercise 2: Gradient Accumulation

Create a tensor `x = 3.0` and:
1. Compute `y = x**2` and get gradient
2. **Without resetting gradient**, compute `z = x**3` and get gradient
3. What is `x.grad` now? Why?
4. Reset gradient and verify

In [None]:
# Your code here


In [None]:
# Solution
x = torch.tensor(3.0, requires_grad=True)

# 1. First computation
y = x**2
y.backward()
print(f"After y=x^2: x.grad = {x.grad.item()}")
print(f"Expected (2x): {2*3}")
print()

# 2. Second computation WITHOUT resetting
z = x**3
z.backward()
print(f"After z=x^3: x.grad = {x.grad.item()}")
print(f"Expected (2x + 3x^2): {2*3 + 3*3**2}")
print()

# 3. Explanation
print("Why? Gradients accumulate!")
print("grad from y = 6")
print("grad from z = 27") 
print("total = 6 + 27 = 33")
print()

# 4. Reset and verify
x.grad.zero_()
z = x**3
z.backward()
print(f"After reset and z=x^3: x.grad = {x.grad.item()}")
print(f"Expected (3x^2): {3*3**2}")

### Exercise 3: Matrix Gradients

Implement a simple linear transformation:
- Weight matrix `W` of shape `(3, 4)` (random)
- Input vector `x` of shape `(4,)` (random)
- Compute `y = Wx`
- Loss: `L = (y**2).sum()`
- Compute gradients of `L` with respect to `W` and `x`

In [None]:
# Your code here


In [None]:
# Solution
torch.manual_seed(42)

# Create tensors
W = torch.randn(3, 4, requires_grad=True)
x = torch.randn(4, requires_grad=True)

print(f"W shape: {W.shape}")
print(f"x shape: {x.shape}")
print()

# Forward pass
y = W @ x
print(f"y shape: {y.shape}")
print(f"y: {y}")
print()

# Loss
L = (y**2).sum()
print(f"Loss: {L.item()}")
print()

# Backward pass
L.backward()

# Gradients
print(f"dL/dW shape: {W.grad.shape}")  # Same as W: (3, 4)
print(f"dL/dW:\n{W.grad}")
print()

print(f"dL/dx shape: {x.grad.shape}")  # Same as x: (4,)
print(f"dL/dx: {x.grad}")

print("\nKey insight: Gradient shapes match parameter shapes!")

---

## Comprehensive Exercise: Linear Regression with Gradient Descent

Implement linear regression from scratch using autograd!

**Problem**: Given data points $(x, y)$, find the best line $y = wx + b$

**Steps**:
1. Generate synthetic data: $y = 3x + 2 + \text{noise}$
2. Initialize parameters `w` and `b` randomly
3. For each iteration:
   - Compute predictions: `y_pred = w * x + b`
   - Compute loss: MSE = $\frac{1}{n}\sum(y_{\text{pred}} - y)^2$
   - Compute gradients using `.backward()`
   - Update parameters using gradient descent
4. Visualize results

**Goal**: Recover `w ≈ 3` and `b ≈ 2`

In [None]:
# Your code here


In [None]:
# Solution
torch.manual_seed(42)

# 1. Generate data
n_samples = 100
true_w = 3.0
true_b = 2.0

x = torch.randn(n_samples)
y = true_w * x + true_b + torch.randn(n_samples) * 0.5  # Add noise

print(f"Generated {n_samples} data points")
print(f"True parameters: w={true_w}, b={true_b}")
print()

# 2. Initialize parameters
w = torch.randn(1, requires_grad=True)
b = torch.randn(1, requires_grad=True)

print(f"Initial w: {w.item():.3f}")
print(f"Initial b: {b.item():.3f}")
print()

# 3. Training loop
learning_rate = 0.01
n_iterations = 200
losses = []

for i in range(n_iterations):
    # Forward pass
    y_pred = w * x + b
    
    # Compute loss (MSE)
    loss = ((y_pred - y)**2).mean()
    losses.append(loss.item())
    
    # Backward pass
    loss.backward()
    
    # Update parameters
    with torch.no_grad():
        w -= learning_rate * w.grad
        b -= learning_rate * b.grad
    
    # Zero gradients
    w.grad.zero_()
    b.grad.zero_()
    
    # Print progress
    if (i+1) % 40 == 0:
        print(f"Iter {i+1}: loss={loss.item():.4f}, w={w.item():.3f}, b={b.item():.3f}")

print()
print(f"Final w: {w.item():.3f} (true: {true_w})")
print(f"Final b: {b.item():.3f} (true: {true_b})")

In [None]:
# 4. Visualize results
plt.figure(figsize=(14, 5))

# Plot 1: Data and fitted line
plt.subplot(1, 2, 1)
plt.scatter(x.numpy(), y.numpy(), alpha=0.5, label='Data')

# True line
x_line = torch.linspace(x.min().item(), x.max().item(), 100)
y_true_line = true_w * x_line + true_b
plt.plot(x_line.numpy(), y_true_line.numpy(), 'g--', linewidth=2, label='True line')

# Learned line
with torch.no_grad():
    y_learned_line = w * x_line + b
plt.plot(x_line.numpy(), y_learned_line.numpy(), 'r-', linewidth=2, label='Learned line')

plt.xlabel('x', fontsize=12)
plt.ylabel('y', fontsize=12)
plt.title('Linear Regression: Data and Fit', fontsize=14)
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)

# Plot 2: Loss curve
plt.subplot(1, 2, 2)
plt.plot(losses, 'b-', linewidth=2)
plt.xlabel('Iteration', fontsize=12)
plt.ylabel('Loss (MSE)', fontsize=12)
plt.title('Training Loss', fontsize=14)
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("Success! We've implemented linear regression using only autograd.")
print("This is the foundation of how all neural networks learn!")

---

## Key Takeaways

1. **Autograd is the magic**: Automatically computes gradients using chain rule
2. **Computational graph**: PyTorch tracks operations to enable backpropagation
3. **Dynamic graphs**: Built during execution, making PyTorch flexible and Pythonic
4. **requires_grad=True**: Mark tensors you want gradients for
5. **.backward()**: Computes all gradients in one call
6. **Gradients accumulate**: Zero them between iterations unless you are accumulating on purpose!
7. **torch.no_grad()**: Disable gradient tracking for efficiency
8. **Gradient descent**: The fundamental optimization algorithm for neural networks

### Connection to Neural Networks

What we learned:
- Parameters have `requires_grad=True`
- Forward pass computes loss
- `.backward()` computes gradients
- Update parameters inside `torch.no_grad()`: `param -= lr * param.grad`
- Zero gradients for next iteration

This is **exactly** how neural networks train! The only difference is scale and abstraction.
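
As a sketch of that abstraction, the same five steps can be written with `torch.optim.SGD`, which handles the update and the gradient reset. The function minimized here, $(x - 3)^2 + 5$, and the hyperparameters are the ones used in the gradient descent section of this notebook.

```python
import torch

x = torch.tensor(0.0, requires_grad=True)   # parameter with requires_grad=True
optimizer = torch.optim.SGD([x], lr=0.1)

for _ in range(50):
    optimizer.zero_grad()          # zero gradients from the previous iteration
    loss = (x - 3) ** 2 + 5        # forward pass computes the loss
    loss.backward()                # backward pass computes gradients
    optimizer.step()               # plain SGD update: param -= lr * param.grad

print(f"x = {x.item():.4f}")       # close to 3
```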

---

## Next Steps

You now understand the engine behind neural network training! Next, we'll build actual neural networks using PyTorch's `nn.Module`.

Continue to: [Topic 3: Building Neural Networks with nn.Module](03_neural_networks.ipynb)

---

## Further Reading

- [PyTorch Autograd Documentation](https://pytorch.org/docs/stable/autograd.html)
- [Automatic Differentiation Explained](https://pytorch.org/tutorials/beginner/blitz/autograd_tutorial.html)
- [Calculus on Computational Graphs](http://colah.github.io/posts/2015-08-Backprop/)
- [CS231n: Backpropagation](http://cs231n.github.io/optimization-2/)