# Task 5.3: Loss Functions and Optimizers

**Module:** 5 - Phase 1 Capstone: MicroGrad+  
**Time:** 1.5 hours  
**Difficulty:** ⭐⭐⭐

---

## 🎯 Learning Objectives

By the end of this notebook, you will:
- [ ] Understand how loss functions guide learning
- [ ] Implement MSE Loss for regression and Cross-Entropy Loss for classification
- [ ] Build an SGD optimizer with momentum
- [ ] Implement the Adam optimizer
- [ ] See how optimizers update parameters to minimize loss

---

## 📚 Prerequisites

- Completed: Task 5.1 and 5.2 (Tensor and Layers)
- Knowledge of: Basic calculus, gradient descent concept

---

## 🌍 Real-World Context

Loss functions and optimizers are the "brain" of neural network training:
- **Loss functions** tell the network "how wrong" its predictions are
- **Optimizers** use gradients to update parameters and improve predictions

Choosing the right loss function and optimizer can make or break your model's performance!

---

## 🧒 ELI5: Loss Functions and Optimizers

> **Imagine you're learning to throw darts at a bullseye.**
>
> **Loss Function = Your Coach's Feedback**
> - MSE Loss: "You missed by 5 centimeters" (measures the distance squared)
> - Cross-Entropy Loss: "You were 80% confident it would hit, but it missed completely!" (penalizes confident wrong predictions)
>
> **Optimizer = How You Adjust Your Aim**
> - SGD: "Move your arm 1cm to the left" (simple correction)
> - SGD + Momentum: "You've been moving left for the last 5 throws, keep that trend going!" (builds up speed)
> - Adam: "Based on how you've been doing and how consistent your throws are, try this specific adjustment" (smart, adaptive)
>
> **In AI terms:** The loss function computes a number that represents how bad the predictions are. The optimizer uses the gradients of this loss to figure out how to adjust each weight to make the predictions better.

> **📝 Learning Note:**
>
> In this notebook, we implement loss functions and optimizers from scratch for educational purposes.
> The complete, tested implementations are already available in the `micrograd_plus` package.
> After understanding how they work, you can import them directly:
>
> ```python
> from micrograd_plus import MSELoss, CrossEntropyLoss, SGD, Adam
> ```

In [None]:
# Setup
import numpy as np
import matplotlib.pyplot as plt
import sys
from pathlib import Path

# Robust path resolution - works regardless of working directory
def _find_module_root():
    """Find the module root directory containing micrograd_plus."""
    current = Path.cwd()
    for parent in [current] + list(current.parents):
        if (parent / 'micrograd_plus' / '__init__.py').exists():
            return str(parent)
    return str(Path.cwd().parent)

sys.path.insert(0, _find_module_root())

from micrograd_plus import Tensor
from micrograd_plus.utils import set_seed

set_seed(42)

---

## Part 1: Mean Squared Error (MSE) Loss

**MSE Loss** is used for regression problems. It measures the average squared difference between predictions and targets:

$$\text{MSE} = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2$$

### Why Squared?
- Always positive (can't have negative loss)
- Penalizes large errors more than small ones
- Smooth gradient (the derivative with respect to the prediction is `2(ŷ - y)`)
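
To make the gradient concrete, here is a minimal sketch in plain Python (no `micrograd_plus` required). It compares the analytic gradient of MSE with respect to each prediction, `2(ŷ_i - y_i) / N`, against a central finite-difference estimate. The helper names `mse`, `mse_grad`, and `numeric_grad` are illustrative only and are not part of the package.

```python
def mse(pred, target):
    """Mean of squared differences between two equal-length lists."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def mse_grad(pred, target):
    """Analytic gradient of MSE with respect to each prediction."""
    n = len(pred)
    return [2.0 * (p - t) / n for p, t in zip(pred, target)]


def numeric_grad(pred, target, eps=1e-6):
    """Central finite-difference estimate of the same gradient."""
    grads = []
    for i in range(len(pred)):
        up = list(pred)
        down = list(pred)
        up[i] += eps
        down[i] -= eps
        grads.append((mse(up, target) - mse(down, target)) / (2 * eps))
    return grads


pred = [1.0, 2.0, 3.0]
target = [1.5, 2.0, 2.5]

print("loss:    ", mse(pred, target))
print("analytic:", mse_grad(pred, target))
print("numeric: ", numeric_grad(pred, target))
```

The two gradient lists should agree to several decimal places. The same finite-difference trick is a handy sanity check for any loss you implement by hand.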

In [None]:
class MSELoss:
    """
    Mean Squared Error Loss.
    
    MSE = mean((pred - target)^2)
    
    Used for regression problems where you predict continuous values.
    
    Args:
        reduction: How to combine losses ('mean', 'sum', or 'none')
    """
    
    def __init__(self, reduction: str = 'mean'):
        if reduction not in ('mean', 'sum', 'none'):
            raise ValueError(f"reduction must be 'mean', 'sum', or 'none', got {reduction!r}")
        self.reduction = reduction
    
    def __call__(self, pred: Tensor, target: Tensor) -> Tensor:
        """
        Compute MSE loss.
        
        Args:
            pred: Predictions from the model
            target: Ground truth values
        
        Returns:
            Scalar loss value (or per-element losses if reduction='none')
        """
        if not isinstance(target, Tensor):
            target = Tensor(target)
        
        diff = pred - target
        squared = diff ** 2
        
        if self.reduction == 'mean':
            return squared.mean()
        elif self.reduction == 'sum':
            return squared.sum()
        else:
            return squared

In [None]:
# Test MSE Loss
mse_loss = MSELoss()

# Simple example
pred = Tensor([1.0, 2.0, 3.0], requires_grad=True)
target = Tensor([1.5, 2.0, 2.5])

loss = mse_loss(pred, target)
print(f"Predictions: {pred.data}")
print(f"Targets: {target.data}")
print(f"Differences: {(pred.data - target.data)}")
print(f"Squared: {(pred.data - target.data)**2}")
print(f"MSE Loss: {loss.item():.4f}")
print(f"Expected: {np.mean((pred.data - target.data)**2):.4f}")

# Backward pass
loss.backward()
print(f"\nGradient: {pred.grad}")
print(f"Expected (2*(pred-target)/N): {2 * (pred.data - target.data) / 3}")

---

## Part 2: Cross-Entropy Loss

**Cross-Entropy Loss** is used for classification. It measures how different the predicted probability distribution is from the true distribution.

$$\text{CE} = -\sum_{c=1}^{C} y_c \log(\hat{y}_c)$$

For single-label classification with integer targets:
$$\text{CE} = -\log(\hat{y}_{\text{true class}})$$

### Why Cross-Entropy?
- Penalizes confident wrong predictions heavily
- Works well with softmax output
- Natural fit for probability distributions
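
The single-label formula can be sketched in plain Python for one sample. This is an illustrative stand-alone version, not the package's implementation; `log_softmax` and `cross_entropy` here are hypothetical helpers operating on ordinary lists. Subtracting the maximum logit before exponentiating is the usual trick for numerical stability.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax for one sample (list of floats)."""
    m = max(logits)  # subtracting the max avoids overflow in exp
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_sum for z in logits]


def cross_entropy(logits, target_class):
    """Negative log-probability assigned to the true class."""
    return -log_softmax(logits)[target_class]


logits = [2.0, 1.0, 0.1]  # the model favors class 0

loss_if_right = cross_entropy(logits, 0)  # true class is 0
loss_if_wrong = cross_entropy(logits, 2)  # true class is 2

print(f"true class 0: loss = {loss_if_right:.3f}")
print(f"true class 2: loss = {loss_if_wrong:.3f}")

# A naive exp(1000.0) would overflow; the shifted version stays finite
print(f"confidently wrong: loss = {cross_entropy([1000.0, 0.0], 1)}")
```

The same logits give a small loss when the favored class is the true one and a much larger loss when it is not, which is the "confident wrong prediction" penalty described above.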

In [None]:
class CrossEntropyLoss:
    """
    Cross-Entropy Loss for classification.
    
    Expects raw logits (not softmax) as predictions.
    Targets should be class indices (integers).
    
    Internally computes: -log(softmax(logits)[target_class])
    
    Args:
        reduction: How to combine losses ('mean', 'sum', or 'none')
    """
    
    def __init__(self, reduction: str = 'mean'):
        if reduction not in ('mean', 'sum', 'none'):
            raise ValueError(f"reduction must be 'mean', 'sum', or 'none', got {reduction!r}")
        self.reduction = reduction
    
    def __call__(self, logits: Tensor, targets: Tensor) -> Tensor:
        """
        Compute cross-entropy loss.
        
        Args:
            logits: Raw predictions of shape (batch_size, num_classes)
            targets: Class indices of shape (batch_size,)
        
        Returns:
            Scalar loss value (or per-sample losses if reduction='none')
        """
        if not isinstance(targets, Tensor):
            targets = Tensor(targets)
        
        batch_size = logits.shape[0]
        num_classes = logits.shape[1]
        
        # Compute log-softmax (numerically stable)
        log_probs = logits.log_softmax(axis=1)
        
        # Create one-hot encoding of targets
        target_indices = targets.data.astype(np.int32)
        one_hot = np.zeros((batch_size, num_classes), dtype=np.float32)
        one_hot[np.arange(batch_size), target_indices] = 1.0
        
        # Negative log likelihood
        nll = -(log_probs * Tensor(one_hot)).sum(axis=1)
        
        if self.reduction == 'mean':
            return nll.mean()
        elif self.reduction == 'sum':
            return nll.sum()
        else:
            return nll

In [None]:
# Test Cross-Entropy Loss
ce_loss = CrossEntropyLoss()

# Example: 2 samples, 3 classes
# Sample 0: Strong prediction for class 0 (correct)
# Sample 1: Wrong prediction (predicts class 0, true is class 2)
logits = Tensor([
    [2.0, 1.0, 0.1],  # Predicts class 0
    [2.0, 1.0, 0.1],  # Predicts class 0 (wrong!)
], requires_grad=True)

targets = Tensor([0, 2])  # True classes

loss = ce_loss(logits, targets)

# Show what's happening
softmax_probs = np.exp(logits.data - logits.data.max(axis=1, keepdims=True))
softmax_probs = softmax_probs / softmax_probs.sum(axis=1, keepdims=True)

print(f"Logits:\n{logits.data}")
print(f"\nSoftmax probabilities:\n{softmax_probs.round(3)}")
print(f"\nTrue classes: {targets.data}")
print(f"\nLoss breakdown:")
print(f"  Sample 0: -log({softmax_probs[0, 0]:.3f}) = {-np.log(softmax_probs[0, 0]):.3f} (correct prediction)")
print(f"  Sample 1: -log({softmax_probs[1, 2]:.3f}) = {-np.log(softmax_probs[1, 2]):.3f} (wrong prediction)")
print(f"\nCross-Entropy Loss: {loss.item():.4f}")

# Backward
loss.backward()
print(f"\nGradient on logits:\n{logits.grad}")
print("(Gradient = (softmax - one_hot) / batch_size, because reduction='mean')")

In [None]:
# Visualize how cross-entropy penalizes confident wrong predictions
probs = np.linspace(0.01, 0.99, 100)
ce_values = -np.log(probs)

plt.figure(figsize=(10, 4))

plt.subplot(1, 2, 1)
plt.plot(probs, ce_values, 'b-', linewidth=2)
plt.xlabel('Predicted Probability for Correct Class')
plt.ylabel('Cross-Entropy Loss')
plt.title('Cross-Entropy Loss vs Confidence')
plt.axvline(x=0.5, color='r', linestyle='--', alpha=0.5, label='50% confident')
plt.legend()
plt.grid(True, alpha=0.3)

plt.subplot(1, 2, 2)
# Compare MSE vs CE for classification
mse_values = (1 - probs) ** 2
plt.plot(probs, ce_values, 'b-', linewidth=2, label='Cross-Entropy')
plt.plot(probs, mse_values, 'r-', linewidth=2, label='MSE')
plt.xlabel('Predicted Probability for Correct Class')
plt.ylabel('Loss')
plt.title('Cross-Entropy vs MSE')
plt.legend()
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("Notice: as the probability of the correct class approaches 0, cross-entropy grows without bound while MSE stays below 1.")
print("This is why CE is preferred for classification: confidently wrong predictions get a much larger loss (and gradient) than under MSE.")

---

## Part 3: SGD Optimizer

**Stochastic Gradient Descent** updates parameters in the direction that reduces loss:

$$\theta_{t+1} = \theta_t - \eta \cdot \nabla L(\theta_t)$$

Where:
- $\theta$ are the parameters
- $\eta$ is the learning rate
- $\nabla L$ is the gradient of the loss

### With Momentum

Momentum helps SGD "build up speed" in consistent directions:

$$v_t = \mu \cdot v_{t-1} + \nabla L(\theta_t)$$
$$\theta_{t+1} = \theta_t - \eta \cdot v_t$$
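
The two update rules can be traced by hand on a scalar problem. The sketch below is plain Python, with `grad` and `run_sgd` as hypothetical helper names; it applies them to $f(x) = (x - 3)^2$ and records every iterate. Setting `momentum=0.0` reduces the rule to plain SGD.

```python
def grad(x):
    """Gradient of f(x) = (x - 3)^2."""
    return 2.0 * (x - 3.0)


def run_sgd(lr, momentum, steps, x0=0.0):
    """Apply v = mu * v + g, then x = x - lr * v, and return every iterate."""
    x, v = x0, 0.0
    xs = [x]
    for _ in range(steps):
        v = momentum * v + grad(x)  # momentum = 0 gives plain SGD
        x = x - lr * v
        xs.append(x)
    return xs


plain = run_sgd(lr=0.1, momentum=0.0, steps=3)
heavy = run_sgd(lr=0.1, momentum=0.9, steps=3)

print("plain SGD:   ", [round(x, 4) for x in plain])
print("momentum 0.9:", [round(x, 4) for x in heavy])
```

The first step is identical in both runs because the velocity starts at zero. From the second step on, the momentum run takes larger steps since the velocity still carries the earlier gradients. The same mechanism can cause overshooting once the gradient changes sign.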

In [None]:
from typing import List, Dict

class SGD:
    """
    Stochastic Gradient Descent optimizer with optional momentum.
    
    Args:
        params: List of parameters to optimize
        lr: Learning rate (step size)
        momentum: Momentum factor (default: 0, meaning no momentum)
        weight_decay: L2 regularization factor (default: 0)
    """
    
    def __init__(self, params: List[Tensor], lr: float = 0.01, 
                 momentum: float = 0.0, weight_decay: float = 0.0):
        self.params = list(params)
        self.lr = lr
        self.momentum = momentum
        self.weight_decay = weight_decay
        
        # Velocity buffers for momentum
        self.velocities: Dict[int, np.ndarray] = {}
    
    def zero_grad(self) -> None:
        """Reset all parameter gradients to zero."""
        for p in self.params:
            if p.grad is not None:
                p.grad = np.zeros_like(p.data)
    
    def step(self) -> None:
        """Update parameters using their gradients."""
        for i, p in enumerate(self.params):
            if p.grad is None:
                continue
            
            grad = p.grad.copy()
            
            # Apply weight decay (L2 regularization)
            if self.weight_decay != 0:
                grad = grad + self.weight_decay * p.data
            
            # Apply momentum
            if self.momentum != 0:
                if i not in self.velocities:
                    self.velocities[i] = np.zeros_like(p.data)
                
                v = self.velocities[i]
                v = self.momentum * v + grad
                self.velocities[i] = v
                grad = v
            
            # Update parameter
            p.data = p.data - self.lr * grad

In [None]:
# Demonstrate SGD optimization
set_seed(42)

# Simple optimization problem: minimize (x - 3)^2
# The minimum is at x = 3
x = Tensor([0.0], requires_grad=True)
target = 3.0

optimizer = SGD([x], lr=0.1)

print("Optimizing (x - 3)^2 with SGD:")
print(f"Initial x = {x.item():.4f}")

history = [x.item()]
for step in range(20):
    # Forward: compute loss
    loss = (x - target) ** 2
    
    # Backward: compute gradients
    optimizer.zero_grad()
    loss.backward()
    
    # Update
    optimizer.step()
    
    history.append(x.item())
    
    if step < 5 or step >= 18:
        print(f"Step {step+1}: x = {x.item():.4f}, loss = {loss.item():.6f}, grad = {x.grad[0]:.4f}")
    elif step == 5:
        print("...")

print(f"\nFinal x = {x.item():.4f} (target: {target})")

In [None]:
# Compare SGD with and without momentum
def optimize_with_sgd(momentum, lr=0.1, steps=50):
    x = Tensor([0.0], requires_grad=True)
    optimizer = SGD([x], lr=lr, momentum=momentum)
    history = [x.item()]
    
    for _ in range(steps):
        loss = (x - 3.0) ** 2
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        history.append(x.item())
    
    return history

# Run both
history_no_momentum = optimize_with_sgd(momentum=0.0)
history_momentum = optimize_with_sgd(momentum=0.9)

# Plot
plt.figure(figsize=(10, 4))
plt.plot(history_no_momentum, 'b-', label='SGD (no momentum)', linewidth=2)
plt.plot(history_momentum, 'r-', label='SGD + momentum (0.9)', linewidth=2)
plt.axhline(y=3.0, color='k', linestyle='--', alpha=0.3, label='Target')
plt.xlabel('Step')
plt.ylabel('x value')
plt.title('SGD Optimization: Effect of Momentum')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

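print("Notice: momentum builds up velocity, so x overshoots 3 and oscillates before settling.")
print("On this simple 1-D quadratic plain SGD settles sooner; momentum helps most on ill-conditioned or noisy losses.")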

---

## Part 4: Adam Optimizer

**Adam** (Adaptive Moment Estimation) is one of the most widely used optimizers. It combines:
- **Momentum**: Uses exponential moving average of gradients
- **RMSprop**: Adapts learning rate per-parameter based on gradient history

$$m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t$$
$$v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2$$
$$\hat{m}_t = \frac{m_t}{1-\beta_1^t}$$
$$\hat{v}_t = \frac{v_t}{1-\beta_2^t}$$
$$\theta_{t+1} = \theta_t - \eta \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$
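
A useful consequence of these equations shows up on the very first step: with $m_0 = v_0 = 0$, bias correction gives $\hat{m}_1 = g_1$ and $\hat{v}_1 = g_1^2$, so the step size is roughly $\eta$ regardless of the gradient's magnitude. The scalar sketch below (plain Python, with `adam_step` as an illustrative helper rather than the class defined in this notebook) checks that numerically.

```python
import math


def adam_step(theta, g, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; returns (theta, m, v)."""
    m = beta1 * m + (1 - beta1) * g      # first moment: running mean of gradients
    v = beta2 * v + (1 - beta2) * g * g  # second moment: running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)         # bias correction for the zero initialization
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v


# Two very different gradient scales, same learning rate
for g in (0.01, 100.0):
    theta, m, v = adam_step(theta=0.0, g=g, m=0.0, v=0.0, t=1)
    print(f"g = {g:>6}: first step moves theta by {theta:.6f}")
```

Both gradients produce nearly the same parameter change, which is what "adaptive per-parameter learning rate" means in practice. Plain SGD with the same `lr` would move the second parameter 10,000 times further than the first.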

In [None]:
from typing import Tuple

class Adam:
    """
    Adam optimizer.
    
    Combines momentum and adaptive learning rates for fast, stable convergence.
    
    Args:
        params: List of parameters to optimize
        lr: Learning rate (default: 0.001)
        betas: Coefficients for computing running averages (default: (0.9, 0.999))
        eps: Small constant for numerical stability (default: 1e-8)
        weight_decay: Weight decay factor, applied directly to the weights (decoupled, AdamW-style) (default: 0)
    """
    
    def __init__(self, params: List[Tensor], lr: float = 0.001,
                 betas: Tuple[float, float] = (0.9, 0.999),
                 eps: float = 1e-8, weight_decay: float = 0.0):
        self.params = list(params)
        self.lr = lr
        self.betas = betas
        self.eps = eps
        self.weight_decay = weight_decay
        
        # State
        self.m: Dict[int, np.ndarray] = {}  # First moment
        self.v: Dict[int, np.ndarray] = {}  # Second moment
        self.t = 0  # Timestep
    
    def zero_grad(self) -> None:
        """Reset all parameter gradients to zero."""
        for p in self.params:
            if p.grad is not None:
                p.grad = np.zeros_like(p.data)
    
    def step(self) -> None:
        """Update parameters using Adam algorithm."""
        self.t += 1
        beta1, beta2 = self.betas
        
        for i, p in enumerate(self.params):
            if p.grad is None:
                continue
            
            grad = p.grad.copy()
            
            # Apply decoupled weight decay (AdamW-style): shrink the weights directly
            if self.weight_decay != 0:
                p.data = p.data - self.lr * self.weight_decay * p.data
            
            # Initialize moment buffers
            if i not in self.m:
                self.m[i] = np.zeros_like(p.data)
                self.v[i] = np.zeros_like(p.data)
            
            m = self.m[i]
            v = self.v[i]
            
            # Update biased first moment estimate
            m = beta1 * m + (1 - beta1) * grad
            self.m[i] = m
            
            # Update biased second moment estimate
            v = beta2 * v + (1 - beta2) * (grad ** 2)
            self.v[i] = v
            
            # Bias correction
            m_hat = m / (1 - beta1 ** self.t)
            v_hat = v / (1 - beta2 ** self.t)
            
            # Update parameter
            p.data = p.data - self.lr * m_hat / (np.sqrt(v_hat) + self.eps)

In [None]:
# Compare SGD vs Adam on a harder optimization problem
def rosenbrock(x: Tensor, y: Tensor) -> Tensor:
    """
    Rosenbrock function - a classic optimization test function.
    
    The Rosenbrock function is a non-convex function used as a performance test
    for optimization algorithms. It's also known as the "banana function" due
    to its curved valley shape.
    
    Formula: f(x, y) = (1 - x)^2 + 100 * (y - x^2)^2
    
    Properties:
        - Global minimum at (1, 1) where f(1, 1) = 0
        - The global minimum lies inside a long, narrow, parabolic-shaped valley
        - Finding the valley is trivial, but converging to the minimum is difficult
    
    Args:
        x: First coordinate (Tensor)
        y: Second coordinate (Tensor)
    
    Returns:
        Tensor: Rosenbrock function value at (x, y)
    """
    return (1 - x) ** 2 + 100 * (y - x ** 2) ** 2

def optimize_rosenbrock(optimizer_class, **kwargs) -> Tuple[List[Tuple[float, float]], float]:
    """
    Optimize the Rosenbrock function using a given optimizer.
    
    Args:
        optimizer_class: The optimizer class to use (SGD or Adam)
        **kwargs: Arguments to pass to the optimizer
        
    Returns:
        Tuple of (optimization history, final loss value)
    """
    x = Tensor([-1.0], requires_grad=True)
    y = Tensor([1.0], requires_grad=True)
    
    optimizer = optimizer_class([x, y], **kwargs)
    
    history = [(x.item(), y.item())]
    
    for _ in range(1000):
        # Compute loss
        loss = rosenbrock(x, y)
        
        # Backward
        optimizer.zero_grad()
        loss.backward()
        
        # Update
        optimizer.step()
        
        history.append((x.item(), y.item()))
    
    return history, loss.item()

# Run both optimizers
sgd_history, sgd_final_loss = optimize_rosenbrock(SGD, lr=0.0001)
adam_history, adam_final_loss = optimize_rosenbrock(Adam, lr=0.01)

print(f"Final loss - SGD: {sgd_final_loss:.6f}, Adam: {adam_final_loss:.6f}")
print(f"Minimum is at (1, 1) with loss 0")

In [None]:
# Visualize optimization paths
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Create contour plot
x_range = np.linspace(-2, 2, 100)
y_range = np.linspace(-1, 3, 100)
X, Y = np.meshgrid(x_range, y_range)
Z = (1 - X) ** 2 + 100 * (Y - X ** 2) ** 2

for ax, history, name in [(axes[0], sgd_history, 'SGD'),
                           (axes[1], adam_history, 'Adam')]:
    ax.contour(X, Y, Z, levels=np.logspace(-1, 3, 20), alpha=0.7)
    
    # Plot optimization path
    xs, ys = zip(*history[::10])  # Every 10th point
    ax.plot(xs, ys, 'r.-', markersize=2, linewidth=0.5, alpha=0.7)
    ax.plot(history[0][0], history[0][1], 'go', markersize=10, label='Start')
    ax.plot(history[-1][0], history[-1][1], 'r*', markersize=15, label='End')
    ax.plot(1, 1, 'bx', markersize=15, label='Global min')
    
    ax.set_xlabel('x')
    ax.set_ylabel('y')
    ax.set_title(f'{name} Optimization Path')
    ax.legend()
    ax.set_xlim(-2, 2)
    ax.set_ylim(-1, 3)

plt.tight_layout()
plt.show()

---

## Part 5: Putting It All Together

Let's train a simple model using our loss functions and optimizers!

In [None]:
# Import our layers
from micrograd_plus import Linear, ReLU, Sequential

# Generate simple classification data
set_seed(42)

def generate_spiral_data(n_points=100, n_classes=3):
    """Generate spiral dataset for classification"""
    X = np.zeros((n_points * n_classes, 2), dtype=np.float32)
    y = np.zeros(n_points * n_classes, dtype=np.int32)
    
    for class_idx in range(n_classes):
        ix = range(n_points * class_idx, n_points * (class_idx + 1))
        r = np.linspace(0.0, 1, n_points)
        t = np.linspace(class_idx * 4, (class_idx + 1) * 4, n_points) + np.random.randn(n_points) * 0.2
        X[ix] = np.c_[r * np.sin(t), r * np.cos(t)]
        y[ix] = class_idx
    
    return X, y

# Generate data
X, y = generate_spiral_data()

# Visualize
plt.figure(figsize=(6, 6))
plt.scatter(X[:, 0], X[:, 1], c=y, cmap='viridis', s=20)
plt.title('Spiral Dataset')
plt.xlabel('x1')
plt.ylabel('x2')
plt.colorbar(label='Class')
plt.show()

print(f"Data shape: {X.shape}")
print(f"Labels shape: {y.shape}")

In [None]:
# Build and train a model
set_seed(42)

# Model
model = Sequential(
    Linear(2, 64),
    ReLU(),
    Linear(64, 64),
    ReLU(),
    Linear(64, 3)
)

# Loss and optimizer
loss_fn = CrossEntropyLoss()
optimizer = Adam(model.parameters(), lr=0.01)

# Convert data to tensors
X_tensor = Tensor(X, requires_grad=True)
y_tensor = Tensor(y)

# Training loop
losses = []
accuracies = []

print("Training...")
for epoch in range(200):
    # Forward pass
    logits = model(X_tensor)
    loss = loss_fn(logits, y_tensor)
    
    # Backward pass
    optimizer.zero_grad()
    loss.backward()
    
    # Update
    optimizer.step()
    
    # Track metrics
    losses.append(loss.item())
    predictions = np.argmax(logits.data, axis=1)
    accuracy = np.mean(predictions == y)
    accuracies.append(accuracy)
    
    if epoch % 20 == 0:
        print(f"Epoch {epoch:3d}: Loss = {loss.item():.4f}, Accuracy = {accuracy:.2%}")

print(f"\nFinal: Loss = {losses[-1]:.4f}, Accuracy = {accuracies[-1]:.2%}")

In [None]:
# Visualize training and decision boundary
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Loss curve
axes[0].plot(losses)
axes[0].set_xlabel('Epoch')
axes[0].set_ylabel('Loss')
axes[0].set_title('Training Loss')
axes[0].grid(True, alpha=0.3)

# Accuracy curve
axes[1].plot(accuracies)
axes[1].set_xlabel('Epoch')
axes[1].set_ylabel('Accuracy')
axes[1].set_title('Training Accuracy')
axes[1].grid(True, alpha=0.3)

# Decision boundary
h = 0.02
x_min, x_max = X[:, 0].min() - 0.5, X[:, 0].max() + 0.5
y_min, y_max = X[:, 1].min() - 0.5, X[:, 1].max() + 0.5
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

# Predict on grid
model.eval()
grid_tensor = Tensor(np.c_[xx.ravel(), yy.ravel()].astype(np.float32))
Z = model(grid_tensor)
Z = np.argmax(Z.data, axis=1).reshape(xx.shape)

axes[2].contourf(xx, yy, Z, alpha=0.3, cmap='viridis')
axes[2].scatter(X[:, 0], X[:, 1], c=y, cmap='viridis', s=20, edgecolors='k', linewidth=0.5)
axes[2].set_xlabel('x1')
axes[2].set_ylabel('x2')
axes[2].set_title('Decision Boundary')

plt.tight_layout()
plt.show()

---

## ⚠️ Common Mistakes

### Mistake 1: Learning Rate Too High or Too Low

```python
# ❌ Too high: loss explodes or oscillates
optimizer = SGD(params, lr=10.0)

# ❌ Too low: training takes forever
optimizer = SGD(params, lr=0.00001)

# ✅ Start with reasonable defaults
optimizer = SGD(params, lr=0.01)  # For SGD
optimizer = Adam(params, lr=0.001)  # For Adam
```
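
To see why, consider plain gradient descent on $(x - 3)^2$: each step multiplies the error $x - 3$ by $(1 - 2\eta)$, so the error shrinks only when $|1 - 2\eta| < 1$. The sketch below is plain Python with a hypothetical helper `final_error`; it runs 20 steps at three learning rates. The exact thresholds are specific to this toy problem, since in a real model they depend on the curvature of the loss.

```python
def final_error(lr, steps=20, x0=0.0, target=3.0):
    """Run plain gradient descent on (x - target)^2 and return |x - target|."""
    x = x0
    for _ in range(steps):
        x = x - lr * 2.0 * (x - target)  # gradient of (x - target)^2 is 2(x - target)
    return abs(x - target)


for lr in (10.0, 0.1, 0.00001):
    print(f"lr = {lr:<8}: |x - 3| after 20 steps = {final_error(lr):.4g}")
```

The largest rate blows up, the smallest barely moves from the starting error of 3, and the middle one gets close to the minimum.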

### Mistake 2: Not Zeroing Gradients

```python
# ❌ Wrong: gradients accumulate!
for epoch in range(epochs):
    loss = loss_fn(model(x), y)
    loss.backward()  # Gradients keep growing!
    optimizer.step()

# ✅ Right: zero gradients each iteration
for epoch in range(epochs):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()
```
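
The effect is easy to reproduce in isolation. The sketch below assumes that `backward()` adds to the stored gradient rather than overwriting it, as in PyTorch and as the warning above implies. `Param` is a hypothetical stand-in written for this illustration, not a `micrograd_plus` class, and the parameter value is deliberately left unchanged so that only the accumulation is visible.

```python
class Param:
    """Hypothetical stand-in for a parameter whose backward pass accumulates."""

    def __init__(self, value):
        self.value = value
        self.grad = 0.0

    def backward(self, target=3.0):
        # Accumulate (+=) the gradient of (value - target)^2
        self.grad += 2.0 * (self.value - target)

    def zero_grad(self):
        self.grad = 0.0


stale = Param(0.0)
for _ in range(3):
    stale.backward()  # no zero_grad: each call adds another -6.0

fresh = Param(0.0)
for _ in range(3):
    fresh.zero_grad()
    fresh.backward()  # gradient reflects only the current pass

print("without zero_grad:", stale.grad)
print("with zero_grad:   ", fresh.grad)
```

Without the reset, the stored gradient is three times too large after three passes, so the optimizer would take a step three times bigger than intended.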

---

## 🎉 Checkpoint

You've learned:
- ✅ How MSE Loss measures regression error
- ✅ How Cross-Entropy Loss works for classification
- ✅ How SGD updates parameters with optional momentum
- ✅ How Adam adaptively adjusts learning rates
- ✅ How to combine everything to train a neural network!

---

## 🧹 Cleanup

In [None]:
# Cleanup - release memory
from micrograd_plus.utils import cleanup_notebook
cleanup_notebook(globals())