## Loss Functions and Optimization in AI

Welcome! This beginner-friendly notebook takes you from intuition to math to code for loss functions and optimization. You'll read short stories and analogies, learn the formulas with every symbol explained, and then see visualizations and simulations.

What you will learn:
- Why we need loss functions and optimization
- The most common losses: MSE, MAE, Cross-Entropy, Hinge, KL-Divergence, plus Huber and Focal
- Optimizers: Gradient Descent, SGD, Mini-batch, Momentum, Nesterov, AdaGrad, RMSProp, Adam
- How learning rate, batch size, and epochs affect training
- Visual intuition: loss surfaces and optimization paths
- Real-world examples and step-by-step gradient updates
- Practice problems and experiments you can try


In [None]:
# Setup: imports and plotting defaults
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib import cm

np.random.seed(42)
sns.set(style="whitegrid", context="notebook")

# Utility: remove the top and right spines from the given (or current) axes
def despine(axis=None):
    ax = axis if axis is not None else plt.gca()
    sns.despine(ax=ax)

# Utility: meshgrid for contour plots

def make_mesh(xmin, xmax, ymin, ymax, steps=200):
    x = np.linspace(xmin, xmax, steps)
    y = np.linspace(ymin, ymax, steps)
    X, Y = np.meshgrid(x, y)
    return X, Y


### Why do we need loss functions? (Story)
Imagine teaching a robot to throw a ball into a basket. After each throw, you need a way to tell it "how bad" the throw was. Was it 2 meters short? 10 cm to the left? That number—the "badness" of the attempt—is the loss. The robot uses this number to learn to throw better next time.

- Without a loss, the robot gets no feedback. No feedback means no learning.
- The loss converts "performance" into a single number we can minimize.
- Smaller loss → better predictions.

In machine learning, we predict something (price, class, probability), compare it to the true answer, compute the loss, and adjust parameters to make future predictions better.


### Intuition: Real-world analogies
- **GPS navigation**: The loss is like your distance to the destination. Optimization is choosing turns to get closer each minute.
- **Darts**: Each dart lands somewhere. The loss is the distance from the bullseye. The strategy you use to correct your aim is optimization.
- **Cooking to taste**: Taste (loss) tells you how far your soup is from "just right." Adjusting salt/heat (optimization) moves you toward better taste.


### Why do we need optimization? (Story + Analogy)
You have a map of hills and valleys (the loss surface). Your model's parameters are your location on this map. You want to get to the lowest valley (minimum loss). Optimization is your hiking strategy:
- Look at the slope under your feet (gradient)
- Step downhill (update)
- Keep stepping until you reach a low place

Different strategies (optimizers) choose step sizes and directions differently to reach the valley faster and more reliably.


## Loss Functions: Definitions, intuition, and when to use

#### Mean Squared Error (MSE)
Formula: \( \text{MSE}(\hat{y}, y) = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2 \)
- \(\hat{y}_i\): predicted value for example i
- \(y_i\): true value for example i
- \(n\): number of examples
- Squaring heavily penalizes large errors → sensitive to outliers
- **Use when**: regression with Gaussian-like noise; smooth, convex, differentiable

#### Mean Absolute Error (MAE)
Formula: \( \text{MAE}(\hat{y}, y) = \frac{1}{n} \sum_{i=1}^{n} |\hat{y}_i - y_i| \)
- Absolute value penalizes linearly → robust to outliers
- Non-differentiable at 0, but subgradients exist
- **Use when**: you want median-like behavior, robustness to outliers
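
To make the outlier sensitivity concrete, here is a minimal sketch comparing the two losses on a tiny hand-made example. The arrays and the helper names `mse_loss` and `mae_loss` are illustrative assumptions, not part of the notebook's later cells.

```python
import numpy as np

def mse_loss(y_hat, y):
    """Mean squared error: average of squared residuals."""
    return np.mean((y_hat - y) ** 2)

def mae_loss(y_hat, y):
    """Mean absolute error: average of absolute residuals."""
    return np.mean(np.abs(y_hat - y))

# Made-up values for illustration only.
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred_clean = np.array([1.5, 2.5, 3.5, 4.5])     # every error is 0.5
y_pred_outlier = np.array([1.5, 2.5, 3.5, 14.0])  # last error is 10

print("clean:  ", mse_loss(y_pred_clean, y_true), mae_loss(y_pred_clean, y_true))
print("outlier:", mse_loss(y_pred_outlier, y_true), mae_loss(y_pred_outlier, y_true))
```

With these numbers, one bad prediction moves MSE from 0.25 to 25.1875 (about 100 times larger) while MAE moves from 0.5 to 2.875 (under 6 times larger).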

#### Binary Cross-Entropy (Log Loss)
Formula: \( \text{BCE}(\hat{p}, y) = -\frac{1}{n}\sum_{i=1}^{n} \left[ y_i \log(\hat{p}_i) + (1-y_i)\log(1-\hat{p}_i) \right] \)
- \(\hat{p}_i\): predicted probability of class 1
- \(y_i \in \{0,1\}\): true label
- Derived from maximum likelihood under Bernoulli model
- **Use when**: binary classification
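
A minimal sketch of the formula in NumPy, assuming predictions arrive as probabilities. The clipping constant `eps` and the function name `bce_loss` are assumptions added here to avoid `log(0)`; they are not part of the formula itself.

```python
import numpy as np

def bce_loss(p_hat, y, eps=1e-12):
    """Binary cross-entropy averaged over examples.

    p_hat: predicted probabilities of class 1
    y:     true labels in {0, 1}
    eps:   clipping constant (an assumption) that keeps log() finite
    """
    p = np.clip(p_hat, eps, 1.0 - eps)
    return -np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

y = np.array([1.0, 0.0, 1.0])
confident_right = np.array([0.9, 0.1, 0.9])  # correct and confident
confident_wrong = np.array([0.1, 0.9, 0.1])  # wrong and confident

print(bce_loss(confident_right, y))  # equals -log(0.9), about 0.105
print(bce_loss(confident_wrong, y))  # equals -log(0.1), about 2.303
```

Confidently wrong predictions are punished far more than confidently right ones are rewarded, which is what pushes the model toward honest probabilities.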

#### Multiclass Cross-Entropy
Softmax: \( \hat{p}_{i,k} = \frac{e^{z_{i,k}}}{\sum_j e^{z_{i,j}}} \), Loss: \( -\frac{1}{n} \sum_{i=1}^n \sum_{k} y_{i,k}\log(\hat{p}_{i,k}) \)
- \(z_{i,k}\): logit for class k; \(y_{i,k}\): one-hot label
- **Use when**: multiclass classification
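
The softmax and the loss can be sketched together as below. This version assumes integer class ids instead of one-hot vectors (with a one-hot \(y_{i,k}\), the inner sum just picks out the log-probability of the true class), and it subtracts the row maximum before exponentiating, a common stability trick that leaves the softmax unchanged. The name `softmax_cross_entropy` is hypothetical.

```python
import numpy as np

def softmax_cross_entropy(logits, labels):
    """Mean cross-entropy from raw logits.

    logits: array of shape (n, K)
    labels: integer class ids of shape (n,), an assumed encoding
    """
    # Subtracting the per-row max does not change softmax but avoids overflow.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Pick the log-probability of the true class for each row.
    return -np.mean(log_probs[np.arange(len(labels)), labels])

logits = np.array([[2.0, 1.0, 0.1],
                   [1000.0, 0.0, 0.0]])  # large logit would overflow a naive exp()
labels = np.array([0, 0])
print(softmax_cross_entropy(logits, labels))

# Sanity check: uniform logits over K classes give a loss of log(K).
print(softmax_cross_entropy(np.zeros((4, 3)), np.array([0, 1, 2, 0])), np.log(3))
```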

#### Hinge Loss (for SVM)
Binary (labels \(y\in\{-1, +1\}\)): \( \max(0, 1 - y\cdot f(x)) \)
- Encourages a margin: predictions not just correct, but confidently correct
- **Use when**: margin-based classifiers (SVM)
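
A short sketch of the margin idea, using made-up scores. A correct prediction still pays a penalty if its margin \(y\cdot f(x)\) is below 1. The helper name `hinge_loss` is an assumption.

```python
import numpy as np

def hinge_loss(scores, y):
    """Mean hinge loss for labels in {-1, +1} and raw scores f(x)."""
    return np.mean(np.maximum(0.0, 1.0 - y * scores))

y = np.array([1.0, 1.0, -1.0, -1.0])
scores = np.array([2.0, 0.3, -0.5, 1.0])

margins = y * scores                          # [2.0, 0.3, 0.5, -1.0]
per_example = np.maximum(0.0, 1.0 - margins)  # [0.0, 0.7, 0.5, 2.0]
print(per_example, hinge_loss(scores, y))
```

Only the first example is "confidently correct" and costs nothing. The second and third are on the right side of the boundary but inside the margin, and the fourth is misclassified.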

#### KL-Divergence (information distance)
\( D_{\mathrm{KL}}(P\,\Vert\,Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)} \)
- Measures how much an approximating distribution \(Q\) diverges from the reference (true) distribution \(P\); it is zero only when \(P = Q\)
- Asymmetric; not a metric
- **Use when**: measuring distribution mismatch (e.g., VAEs, distillation)
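
The asymmetry is easy to see numerically. The sketch below assumes discrete distributions given as arrays that sum to 1, with \(Q(x) > 0\) wherever \(P(x) > 0\); the function name `kl_divergence` and the example distributions are illustrative.

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(P || Q) for discrete distributions.

    Assumes q > 0 wherever p > 0. Terms with p == 0 contribute 0 by convention.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

p = [0.5, 0.5]
q = [0.9, 0.1]

print(kl_divergence(p, q))  # about 0.511
print(kl_divergence(q, p))  # about 0.368, a different value
print(kl_divergence(p, p))  # 0.0
```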

### Properties and connections
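- **Convexity**: MSE, MAE, and hinge are convex in the prediction, and therefore convex in the parameters of a linear model; cross-entropy is convex for logistic regression; KL is convex in Q. With a deep network, these losses are generally non-convex in the parameters.
- **Bias-variance link**: expected squared error decomposes into bias^2 + variance + irreducible noise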
- **Likelihood**: MSE ↔ Gaussian, MAE ↔ Laplace, BCE ↔ Bernoulli, CE ↔ categorical
- **Choosing a loss**:
  - Noisy Gaussian-ish regression → MSE
  - Robust regression → MAE or Huber
  - Class probabilities → Cross-Entropy
  - Large-margin classification (violations penalized linearly, not quadratically) → Hinge
  - Distribution matching → KL
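
One way to feel the MSE ↔ mean and MAE ↔ median connection: ask which single constant prediction minimizes each loss. The sketch below does a brute-force search over a grid of candidates; the data values and grid are made up for illustration.

```python
import numpy as np

# Made-up data with one large outlier.
data = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# Candidate constant predictions c, spaced 0.01 apart.
candidates = np.linspace(0.0, 100.0, 10001)

# Loss of predicting the same constant c for every data point.
diffs = candidates[:, None] - data[None, :]
mse_per_c = np.mean(diffs ** 2, axis=1)
mae_per_c = np.mean(np.abs(diffs), axis=1)

best_mse = candidates[np.argmin(mse_per_c)]
best_mae = candidates[np.argmin(mae_per_c)]

print("MSE minimizer:", best_mse, "mean:  ", data.mean())
print("MAE minimizer:", best_mae, "median:", np.median(data))
```

The MSE minimizer lands on the mean (22.0), dragged far from most of the data by the outlier, while the MAE minimizer lands on the median (3.0).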


### Custom/Composite Losses

#### Huber Loss (smooth L1)
\[ \mathcal{L}_\delta(r) = \begin{cases}
 \tfrac{1}{2} r^2 & \text{if } |r| \le \delta \\
 \delta(|r| - \tfrac{1}{2}\delta) & \text{otherwise}
\end{cases} \]
- \(r = \hat{y} - y\) is the residual; \(\delta\) is a tuning threshold
- Quadratic near 0 (like MSE), linear in tails (like MAE)
- **Use when**: you want MSE smoothness but MAE robustness

#### Focal Loss (for class imbalance)
Binary: \( \mathrm{FL}(\hat{p}, y) = -\alpha (1-\hat{p})^{\gamma} y\log(\hat{p}) - (1-\alpha) \hat{p}^{\gamma}(1-y)\log(1-\hat{p}) \)
- \(\gamma\): focusing parameter, down-weights easy examples
- \(\alpha\): class-balancing factor
- **Use when**: heavy class imbalance (e.g., fraud detection, detection tasks)
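
A per-example sketch of the binary formula above. The defaults `alpha=0.25` and `gamma=2.0` are values often quoted for focal loss and should be treated as assumptions to tune; `eps` and the function name `focal_loss` are also additions for this illustration.

```python
import numpy as np

def focal_loss(p_hat, y, alpha=0.25, gamma=2.0, eps=1e-12):
    """Per-example binary focal loss (not averaged).

    alpha, gamma: assumed defaults; tune for your data.
    eps: clipping constant (an assumption) to keep log() finite.
    """
    p = np.clip(p_hat, eps, 1.0 - eps)
    pos_term = -alpha * (1.0 - p) ** gamma * y * np.log(p)
    neg_term = -(1.0 - alpha) * p ** gamma * (1.0 - y) * np.log(1.0 - p)
    return pos_term + neg_term

# Two positive examples: one easy (p=0.9), one hard (p=0.1).
p = np.array([0.9, 0.1])
y = np.array([1.0, 1.0])

fl = focal_loss(p, y)
plain = -np.log(p)  # per-example BCE for y = 1
print("focal:", fl)
print("ratio focal / BCE:", fl / plain)  # easy example is scaled down much more
```

The modulating factor \((1-\hat{p})^\gamma\) is 0.01 for the easy example and 0.81 for the hard one, so the hard example dominates the total. Setting `gamma=0` and `alpha=0.5` recovers half of ordinary BCE.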

Notes:
- Composite losses often add regularization terms: \(\mathcal{L}_\text{total} = \mathcal{L}_\text{data} + \lambda \|\theta\|^2\)
- Scaling and normalization matter for training stability


## Optimization Algorithms: from gradients to updates

### Gradient Descent (batch)
- Update: \( \theta_{t+1} = \theta_t - \eta \, \nabla_\theta \mathcal{L}(\theta_t) \)
- \(\eta\): learning rate; \(\nabla_\theta \mathcal{L}\): gradient over full dataset

### Stochastic Gradient Descent (SGD)
- Use one sample at a time: \( \theta_{t+1} = \theta_t - \eta \, \nabla_\theta \ell(\theta_t; x_i, y_i) \)
- Noisy but fast; enables online learning

### Mini-batch SGD
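- Use a small batch \(B_t\) of examples per step: \( \theta_{t+1} = \theta_t - \eta \, \frac{1}{|B_t|} \sum_{i \in B_t} \nabla_\theta \ell(\theta_t; x_i, y_i) \)
- Averaging over the batch lowers gradient noise relative to single-sample SGD, while each step stays far cheaper than a full pass over the data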

### Practical notes
- Too large \(\eta\) → divergence; too small → slow
- Shuffle data every epoch for SGD/mini-batch
- Normalize inputs to improve conditioning
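
The learning-rate note can be checked on the simplest possible loss, \(f(\theta) = \theta^2\), whose gradient is \(2\theta\). This is a toy sketch; the function name `gradient_descent_1d` and the three step sizes are chosen only for illustration.

```python
def gradient_descent_1d(eta, theta0=1.0, steps=20):
    """Run plain gradient descent on f(theta) = theta**2 and return the final theta."""
    theta = theta0
    for _ in range(steps):
        grad = 2.0 * theta          # derivative of theta**2
        theta = theta - eta * grad  # the GD update rule
    return theta

too_small = gradient_descent_1d(eta=0.001)  # barely moves from 1.0
reasonable = gradient_descent_1d(eta=0.1)   # shrinks by a factor 0.8 per step
too_large = gradient_descent_1d(eta=1.1)    # overshoots more each step

print(too_small, reasonable, too_large)
```

Each step multiplies \(\theta\) by \(1 - 2\eta\), so on this loss the iterates shrink only when \(0 < \eta < 1\). With \(\eta = 1.1\) the factor is \(-1.2\), and the iterate flips sign and grows every step.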


### Momentum, Nesterov, and Adaptive Methods

#### Momentum
\[ v_{t+1} = \beta v_t + (1-\beta)\, \nabla_\theta \mathcal{L}(\theta_t), \quad \theta_{t+1} = \theta_t - \eta \, v_{t+1} \]
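- \(\beta\): momentum coefficient (e.g., 0.9). \(v\) is an exponential moving average of past gradients, which damps the zig-zag across steep directions.
- Many libraries use the form \(v_{t+1} = \beta v_t + \nabla_\theta \mathcal{L}(\theta_t)\) without the \((1-\beta)\) factor; for the same \(\eta\), that version takes steps roughly \(1/(1-\beta)\) times larger.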

#### Nesterov Accelerated Gradient (NAG)
Look ahead by momentum before computing gradient:
\[ v_{t+1} = \beta v_t + (1-\beta)\, \nabla_\theta \mathcal{L}(\theta_t - \eta \beta v_t) \]
\[ \theta_{t+1} = \theta_t - \eta v_{t+1} \]
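
A minimal sketch of these two equations on a quadratic bowl \(f(w) = \tfrac{1}{2}(a w_1^2 + b w_2^2)\). The function name `run_nesterov`, the curvatures, the start point, and the hyperparameters are assumptions picked for illustration, and the block is self-contained so it can run on its own.

```python
import numpy as np

a, b = 1.0, 10.0  # assumed curvatures of the bowl

def grad_f(w):
    """Gradient of 0.5 * (a * w1**2 + b * w2**2)."""
    return np.array([a * w[0], b * w[1]])

def run_nesterov(w0, eta=0.3, beta=0.9, steps=60):
    """Nesterov momentum in the (1 - beta) averaging form used above."""
    w = np.array(w0, dtype=float)
    v = np.zeros_like(w)
    traj = [w.copy()]
    for _ in range(steps):
        lookahead = w - eta * beta * v      # peek where momentum would carry us
        g = grad_f(lookahead)               # gradient at the look-ahead point
        v = beta * v + (1.0 - beta) * g     # update the velocity
        w = w - eta * v                     # take the step
        traj.append(w.copy())
    return np.array(traj)

traj_nag = run_nesterov([3.5, 3.5])
print("start:", traj_nag[0], "end:", traj_nag[-1])
```

The only difference from plain momentum is where the gradient is evaluated: at the look-ahead point instead of the current parameters.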

#### AdaGrad
\[ G_{t+1} = G_t + g_t \odot g_t, \quad \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{G_{t+1}} + \epsilon} \odot g_t \]
- Per-parameter learning rates shrink over time; good for sparse features
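
A sketch of the AdaGrad update on an axis-aligned quadratic bowl. The name `run_adagrad`, the curvatures, and `eta=1.0` are illustrative assumptions.

```python
import numpy as np

a, b = 1.0, 10.0  # assumed curvatures: the w2 direction is 10x steeper

def grad_f(w):
    """Gradient of 0.5 * (a * w1**2 + b * w2**2)."""
    return np.array([a * w[0], b * w[1]])

def run_adagrad(w0, eta=1.0, eps=1e-8, steps=60):
    """AdaGrad: divide each coordinate's step by the root of its summed squared gradients."""
    w = np.array(w0, dtype=float)
    G = np.zeros_like(w)  # running sum of squared gradients, per parameter
    traj = [w.copy()]
    for _ in range(steps):
        g = grad_f(w)
        G = G + g * g
        w = w - eta / (np.sqrt(G) + eps) * g
        traj.append(w.copy())
    return np.array(traj)

traj_ada = run_adagrad([3.5, 3.5])
print("after 1 step:", traj_ada[1])
print("final:       ", traj_ada[-1])
```

Because each coordinate is divided by its own gradient history, the two coordinates shrink at essentially the same rate here despite the 10x gap in curvature. On a bowl that is not aligned with the axes, this per-coordinate rescaling helps less.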

#### RMSProp
Exponentially decaying average of squared gradients:
\[ s_{t+1} = \rho s_t + (1-\rho) \, g_t^2, \quad \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{s_{t+1}} + \epsilon} g_t \]

#### Adam
Combines momentum (first moment) + RMSProp (second moment):
\[ m_{t+1} = \beta_1 m_t + (1-\beta_1) g_t, \quad v_{t+1} = \beta_2 v_t + (1-\beta_2) g_t^2 \]
Bias-corrected:
\[ \hat{m}_{t+1} = \frac{m_{t+1}}{1-\beta_1^{t+1}}, \; \hat{v}_{t+1} = \frac{v_{t+1}}{1-\beta_2^{t+1}}, \; \theta_{t+1} = \theta_t - \eta \frac{\hat{m}_{t+1}}{\sqrt{\hat{v}_{t+1}} + \epsilon} \]

Notes:
- Adam often works out-of-the-box (\(\eta \approx 10^{-3}\))
- Tune \(\beta_1, \beta_2\) for stability (common: 0.9, 0.999)
- Consider decoupled weight decay for regularization (AdamW)
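
One consequence of the bias correction is worth seeing in numbers: on the very first step, starting from \(m_0 = v_0 = 0\), the update is about \(\eta\) in magnitude for every parameter, regardless of how large or small its gradient is. The helper `adam_first_step` below is a hypothetical name that just evaluates the equations above at the first step.

```python
import numpy as np

def adam_first_step(g, eta=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Size of the first Adam update, starting from m = v = 0."""
    m = (1.0 - beta1) * g          # first moment after one gradient
    v = (1.0 - beta2) * g * g      # second moment after one gradient
    m_hat = m / (1.0 - beta1)      # bias correction recovers g
    v_hat = v / (1.0 - beta2)      # bias correction recovers g**2
    return eta * m_hat / (np.sqrt(v_hat) + eps)

# Gradients spanning six orders of magnitude.
g = np.array([1e-3, 1.0, -1e3])
step = adam_first_step(g)
print(step)  # each entry is close to eta * sign(g)
```

This scale invariance is a large part of why a single default learning rate often works across very different problems.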


## Training Dynamics: LR, batch size, epochs, and pitfalls
- **Learning rate (\(\eta\))**: too high → oscillation/divergence; too low → slow
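- **Batch size**: small → noisy gradient estimates, which can help escape saddle points and sharp minima; large → smoother estimates and better hardware utilization, but the learning rate usually needs retuning and generalization can suffer
- **Epochs**: number of full passes over the training data; watch validation loss for overfitting
- **Challenges**: saddle points and plateaus (in large deep nets these are generally considered a bigger obstacle than poor local minima), poor conditioning (zig-zag), exploding/vanishing gradients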
- **Fixes**: normalization, good initialization, momentum/Adam, LR schedules, gradient clipping
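
Of the fixes listed, gradient clipping is the quickest to show. The sketch below rescales a gradient whose norm exceeds a threshold, which is one common variant (clipping by norm). The name `clip_by_norm` and the threshold are assumptions for illustration.

```python
import numpy as np

def clip_by_norm(grad, max_norm):
    """Rescale grad so its L2 norm is at most max_norm; direction is preserved."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        return grad * (max_norm / norm)
    return grad

exploding = np.array([30.0, 40.0])   # norm 50
modest = np.array([0.3, 0.4])        # norm 0.5

print(clip_by_norm(exploding, max_norm=5.0))  # rescaled to norm 5: [3. 4.]
print(clip_by_norm(modest, max_norm=5.0))     # unchanged
```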


In [None]:
# Synthetic dataset for regression (for MSE surface)
# y = 2.0 * x + 1.0 + noise
rng = np.random.default_rng(0)
x_reg = rng.uniform(-3, 3, size=60)
noise = rng.normal(0, 0.6, size=x_reg.shape)
y_reg = 2.0 * x_reg + 1.0 + noise

# Grid over parameters (w, b)
w_grid = np.linspace(-1.5, 4.0, 120)
b_grid = np.linspace(-2.0, 4.0, 120)
W, B = np.meshgrid(w_grid, b_grid)

# Compute MSE surface
Y_hat = W[None, :, :] * x_reg[:, None, None] + B[None, :, :]
residuals = Y_hat - y_reg[:, None, None]
MSE_surface = np.mean(residuals**2, axis=0)

# Plot contours
plt.figure(figsize=(7, 5))
contours = plt.contour(W, B, MSE_surface, levels=30, cmap="viridis")
plt.clabel(contours, inline=True, fontsize=8, fmt="%.2f")
plt.title("MSE Loss Surface over (w, b)")
plt.xlabel("w")
plt.ylabel("b")
despine()
plt.show()


In [None]:
# Binary logistic regression toy example for Cross-Entropy visualization
rng = np.random.default_rng(1)
N = 100
x_pos = rng.normal(1.5, 0.6, size=(N//2, 2))
x_neg = rng.normal(-1.5, 0.6, size=(N//2, 2))
X_cls = np.vstack([x_pos, x_neg])
y_cls = np.hstack([np.ones(N//2), np.zeros(N//2)])

# Logistic regression model: p = sigmoid(w^T x + b)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Fix b and visualize CE over a 2D grid of w = (w1, w2)
w1 = np.linspace(-4, 4, 121)
w2 = np.linspace(-4, 4, 121)
W1, W2 = np.meshgrid(w1, w2)
bias = 0.0
Z = W1[None, :, :] * X_cls[:, 0][:, None, None] + W2[None, :, :] * X_cls[:, 1][:, None, None] + bias
P = sigmoid(Z)
# Binary cross-entropy for each point, averaged across samples
CE_surface = -np.mean(y_cls[:, None, None] * np.log(P + 1e-10) + (1 - y_cls)[:, None, None] * np.log(1 - P + 1e-10), axis=0)

plt.figure(figsize=(7, 5))
contours = plt.contour(W1, W2, CE_surface, levels=30, cmap="magma")
plt.clabel(contours, inline=True, fontsize=8, fmt="%.2f")
plt.title("Cross-Entropy Loss over (w1, w2) with b=0")
plt.xlabel("w1")
plt.ylabel("w2")
despine()
plt.show()


In [None]:
# Optimizer simulations on a 2D quadratic bowl to visualize paths
# Loss: f(w) = 0.5 * [a*w1^2 + b*w2^2]; condition number is max(a, b) / min(a, b), so larger gaps mean worse conditioning
a, b = 1.0, 10.0

def f(w):
    return 0.5 * (a * w[0]**2 + b * w[1]**2)

def grad_f(w):
    return np.array([a * w[0], b * w[1]])

w0 = np.array([3.5, 3.5])
T = 60

# Trajectories for different optimizers

def run_gd(eta):
    w = w0.copy()
    traj = [w.copy()]
    for t in range(T):
        g = grad_f(w)
        w = w - eta * g
        traj.append(w.copy())
    return np.array(traj)


def run_momentum(eta, beta=0.9):
    w = w0.copy()
    v = np.zeros_like(w)
    traj = [w.copy()]
    for t in range(T):
        g = grad_f(w)
        v = beta * v + (1 - beta) * g
        w = w - eta * v
        traj.append(w.copy())
    return np.array(traj)


def run_adam(eta=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    w = w0.copy()
    m = np.zeros_like(w)
    v = np.zeros_like(w)
    traj = [w.copy()]
    for t in range(1, T + 1):
        g = grad_f(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * (g * g)
        m_hat = m / (1 - beta1**t)
        v_hat = v / (1 - beta2**t)
        w = w - eta * m_hat / (np.sqrt(v_hat) + eps)
        traj.append(w.copy())
    return np.array(traj)

traj_gd = run_gd(eta=0.15)
traj_mom = run_momentum(eta=0.3, beta=0.9)
traj_adam = run_adam(eta=0.25)

# Plot the bowl and the paths
X, Y = make_mesh(-4, 4, -4, 4, steps=200)
Z = 0.5 * (a * X**2 + b * Y**2)
plt.figure(figsize=(7, 6))
cs = plt.contour(X, Y, Z, levels=30, cmap="Greys")
plt.clabel(cs, inline=True, fontsize=8, fmt="%.1f")
plt.plot(traj_gd[:, 0], traj_gd[:, 1], "-o", ms=3, label="GD (eta=0.15)")
plt.plot(traj_mom[:, 0], traj_mom[:, 1], "-o", ms=3, label="Momentum (eta=0.3)")
plt.plot(traj_adam[:, 0], traj_adam[:, 1], "-o", ms=3, label="Adam (eta=0.25)")
plt.legend()
plt.title("Optimization Paths on an Ill-Conditioned Quadratic")
plt.xlabel("w1")
plt.ylabel("w2")
despine()
plt.show()


In [None]:
# SGD vs batch GD on linear regression with adjustable batch size and LR
rng = np.random.default_rng(2)
x = rng.uniform(-2, 2, size=200)
y = 1.2 * x - 0.7 + rng.normal(0, 0.3, size=x.shape)
X = np.c_[x, np.ones_like(x)]  # [x, 1]

true_w = np.array([1.2, -0.7])


def mse(yhat, y):
    return np.mean((yhat - y) ** 2)


def grad_mse_wrt_w(Xb, yb, w):
    # gradient of MSE w.r.t. w for batch Xb
    yhat = Xb @ w
    return (2.0 / len(Xb)) * (Xb.T @ (yhat - yb))


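def train_sgd(X, y, lr=0.1, batch_size=1, epochs=20):
    # Mini-batch SGD on MSE: batch_size=1 is plain SGD, batch_size=len(X) is batch GD.
    # The start point is drawn from the shared `rng`, so each call begins at a different w.
    w = rng.normal(0, 1, size=2)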
    hist = {"w": [w.copy()], "loss": []}
    for ep in range(epochs):
        idx = rng.permutation(len(X))
        Xs, ys = X[idx], y[idx]
        for i in range(0, len(Xs), batch_size):
            Xb = Xs[i : i + batch_size]
            yb = ys[i : i + batch_size]
            g = grad_mse_wrt_w(Xb, yb, w)
            w = w - lr * g
            hist["w"].append(w.copy())
            hist["loss"].append(mse(X @ w, y))
    hist["w"] = np.array(hist["w"])  # (steps + 1, 2): includes the initial point
    hist["loss"] = np.array(hist["loss"])  # (steps,)
    return w, hist

# Run experiments
w_sgd, h_sgd = train_sgd(X, y, lr=0.2, batch_size=1, epochs=10)
w_mb, h_mb = train_sgd(X, y, lr=0.1, batch_size=32, epochs=10)
w_bg, h_bg = train_sgd(X, y, lr=0.05, batch_size=len(X), epochs=10)

# Plot loss curves
plt.figure(figsize=(7, 4))
plt.plot(h_sgd["loss"], label="SGD (bs=1, lr=0.2)")
plt.plot(h_mb["loss"], label="Mini-batch (bs=32, lr=0.1)")
plt.plot(h_bg["loss"], label="Batch GD (full, lr=0.05)")
plt.yscale("log")
plt.xlabel("Step")
plt.ylabel("MSE (log scale)")
plt.title("Effect of Batch Size and Learning Rate on Convergence")
plt.legend()
despine()
plt.show()

# Plot parameter paths in (w0, w1)
plt.figure(figsize=(6, 5))
plt.plot(h_sgd["w"][:, 0], h_sgd["w"][:, 1], label="SGD")
plt.plot(h_mb["w"][:, 0], h_mb["w"][:, 1], label="Mini-batch")
plt.plot(h_bg["w"][:, 0], h_bg["w"][:, 1], label="Batch GD")
plt.scatter([true_w[0]], [true_w[1]], c="red", marker="*", s=120, label="True")
plt.xlabel("w_slope")
plt.ylabel("w_bias")
plt.title("Parameter Trajectories")
plt.legend()
despine()
plt.show()


## Real-world scenarios: Choosing losses and optimizers

1. **Linear regression for housing prices**
   - Loss: MSE (noise roughly Gaussian, continuous target)
   - Optimizer: Mini-batch SGD or Adam
   - Tip: Standardize features; start with LR ~ 1e-2 to 1e-3 for Adam

2. **Fraud detection (high class imbalance)**
   - Loss: Binary Cross-Entropy with class weights, or a Focal Loss variant
   - Optimizer: Adam (handles sparse informative features well)
   - Tip: Monitor precision/recall; prefer PR AUC to AUROC when positives are rare, since AUROC can look optimistic under heavy imbalance; consider undersampling/oversampling

3. **Image classification (multiclass)**
   - Loss: Cross-Entropy with softmax
   - Optimizer: SGD with Momentum or Adam; cosine LR schedule
   - Tip: Use data augmentation; weight decay to regularize

4. **SVM-style margin classification**
   - Loss: Hinge
   - Optimizer: specialized solvers or SGD on hinge loss
   - Tip: Scaling matters; hinge encourages larger margins


## Worked examples: step-by-step gradients and updates

### Example A: Linear regression (one step)
Model: \(\hat{y} = wx + b\), Loss: \(\ell = (\hat{y} - y)^2\)
- Given: \(x=2\), \(y=5\), current \(w=1\), \(b=0\), \(\eta=0.1\)
- Forward: \(\hat{y}=1\cdot 2 + 0 = 2\)
- Residual: \(r = \hat{y}-y = -3\)
- Gradients: \(\partial \ell/\partial w = 2 r x = -12\), \(\partial \ell/\partial b = 2 r = -6\)
- Update: \(w' = 1 - 0.1(-12) = 2.2\), \(b' = 0 - 0.1(-6) = 0.6\)
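
The arithmetic in Example A can be replayed in a few lines of plain Python. Variable names mirror the symbols above.

```python
# Values from Example A.
x, y = 2.0, 5.0
w, b, eta = 1.0, 0.0, 0.1

y_hat = w * x + b          # forward pass: 2.0
r = y_hat - y              # residual: -3.0
grad_w = 2.0 * r * x       # d(loss)/dw = -12.0
grad_b = 2.0 * r           # d(loss)/db = -6.0

w_new = w - eta * grad_w   # 2.2
b_new = b - eta * grad_b   # 0.6

loss_before = r ** 2
loss_after = (w_new * x + b_new - y) ** 2
print(w_new, b_new, loss_before, loss_after)
```

The loss drops from 9 to 0 (up to floating-point rounding) because the new prediction is \(2.2 \cdot 2 + 0.6 = 5\). That exact hit is a coincidence of this step size on a single sample, not something to expect in general.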

### Example B: Logistic regression (binary) with BCE
Model: \(p = \sigma(wx+b)\), Loss: \(\ell = -[y\log p + (1-y)\log(1-p)]\)
- Given: \(x=1.0\), \(y=1\), current \(w=0\), \(b=0\), \(\eta=0.5\)
- Forward: \(z=0\Rightarrow p=0.5\)
- Gradient: \(\partial \ell/\partial w = (p-y)x = -0.5\), \(\partial \ell/\partial b = (p-y) = -0.5\)
- Update: \(w' = 0 - 0.5(-0.5) = 0.25\), \(b' = 0 - 0.5(-0.5) = 0.25\)
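
The same check for Example B, again in plain Python. It also evaluates the loss before and after the step to confirm the update helps.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce(p, y):
    """Binary cross-entropy for a single example."""
    return -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))

# Values from Example B.
x, y = 1.0, 1.0
w, b, eta = 0.0, 0.0, 0.5

p = sigmoid(w * x + b)     # 0.5
grad_w = (p - y) * x       # -0.5
grad_b = p - y             # -0.5

w_new = w - eta * grad_w   # 0.25
b_new = b - eta * grad_b   # 0.25
p_new = sigmoid(w_new * x + b_new)

print(w_new, b_new)
print("loss before:", bce(p, y), "loss after:", bce(p_new, y))
```

The predicted probability of the true class rises from 0.5 to about 0.62, and the loss falls from about 0.693 to about 0.474.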

### Example C: Huber loss derivative
- Residual \(r=\hat{y}-y\)
- If \(|r| \le \delta\): gradient like MSE, \(\partial \ell/\partial \hat{y} = r\)
- Else: like MAE, \(\partial \ell/\partial \hat{y} = \delta\,\mathrm{sign}(r)\)
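
Both cases collapse into a single expression: the derivative is the residual clipped to \([-\delta, \delta]\). The sketch below compares that closed form against a central finite difference. The helper names `huber` and `huber_grad` are assumptions, and the test residuals are chosen away from the kink at \(|r| = \delta\).

```python
import numpy as np

def huber(r, delta=1.0):
    """Per-residual Huber loss."""
    abs_r = np.abs(r)
    return np.where(abs_r <= delta, 0.5 * r ** 2, delta * (abs_r - 0.5 * delta))

def huber_grad(r, delta=1.0):
    """Derivative of the Huber loss w.r.t. the prediction: r clipped to [-delta, delta]."""
    return np.clip(r, -delta, delta)

r = np.array([-3.0, -0.4, 0.2, 2.5])
h = 1e-6
numeric = (huber(r + h) - huber(r - h)) / (2.0 * h)  # central difference

print("analytic:", huber_grad(r))
print("numeric: ", numeric)
```

Large residuals contribute a gradient of at most \(\delta\) in magnitude, which is the source of Huber's robustness to outliers.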


## Practice problems

1. Compute one GD step for linear regression
   - Data: \((x, y) = (3, 10)\)
   - Current: \(w=2, b=1\), \(\eta=0.05\)
   - Loss: MSE on this single sample. What are \(w'\) and \(b'\)?

2. Logistic regression gradient by hand
   - \(x=2.0\), \(y=0\), \(w=1.0\), \(b=-1.0\), \(\eta=0.2\)
   - Compute \(p=\sigma(wx+b)\), gradients, and updated \(w', b'\)

3. Implement Huber loss
   - Modify the SGD cell (the `mse` and `grad_mse_wrt_w` functions) to use Huber, add a few outliers to `y`, and compare the fitted line with the MSE fit.

4. Experiment: learning rate and batch size
   - Change `lr` and `batch_size` in the SGD cell. Observe loss curves.

5. Optimizer swap
   - Reuse the quadratic bowl cell and add RMSProp. Compare its path to Adam.

6. Class imbalance exercise
   - Modify the BCE surface cell to add class weights or focal term. Observe differences.


## Summary / Key Takeaways
- **Loss is feedback**: it turns prediction quality into a single number to minimize.
- **Pick the right loss**: MSE/MAE for regression, Cross-Entropy for probabilities, Hinge for margins, KL for distributions.
- **Optimization is the route down the hill**: GD, SGD, Momentum, Adam each balance speed and stability.
- **Hyperparameters matter**: learning rate, batch size, epochs can radically change training.
- **Practical heuristics**: normalize inputs, start with Adam (1e-3), later try SGD+Momentum for final polish.

### Cheat-sheet
| Loss Function | Best Use Case | Optimizer | Why |
|---|---|---|---|
| MSE | Regression with Gaussian noise | Mini-batch SGD/Adam | Smooth, convex; well-understood |
| MAE | Robust regression (outliers) | Adam | Linear penalty; robust to outliers |
| BCE | Binary classification | Adam | Probabilistic; outputs read as class-1 probabilities |
| CE (softmax) | Multiclass classification | SGD+Momentum/Adam | Stable training, good generalization |
| Hinge | Margin-based classification | SGD | Encourages large margins |
| KL Divergence | Distribution matching | Adam | Works with probabilistic models |
| Huber | Mix of MSE/MAE | Adam | Smooth near 0, robust in tails |
| Focal | Class imbalance | Adam | Focuses on hard examples |


## Bonus: Modify and explore
- Try different initial points in the optimizer path cell. Do you still see zig-zag with GD?
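- Change the conditioning of the bowl (set `a=1, b=100`). How do optimizers behave? Plain GD on this bowl is only stable for `eta < 2 / b`, so lower its `eta` from 0.15 to below 0.02 first.
- In the logistic CE surface, move `bias` away from 0 and observe changes.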
- Add Nesterov or RMSProp implementations and compare paths visually.
