# 01: Gradients and Derivatives

**Module 1.1: Calculus & Optimization**

## Learning Objectives

By the end of this notebook, you will:
1. Understand partial derivatives and their geometric meaning
2. Compute gradients by hand and with PyTorch
3. Use directional derivatives to analyze functions
4. Connect these concepts to genomics loss functions

## Resources
- Solomon, *Numerical Algorithms*, §1.4.1-1.4.2
- Cohen, *Practical Linear Algebra*, Chapter 2
- Ananthaswamy, *Why Machines Learn*, Chapter 5

In [None]:
import torch
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

# Set random seed for reproducibility
torch.manual_seed(42)
np.random.seed(42)

# Plot settings
plt.rcParams['figure.figsize'] = (10, 6)
plt.rcParams['font.size'] = 12

---
## 1. From Single Variable to Multiple Variables

### Single Variable Review

For $f(x): \mathbb{R} \to \mathbb{R}$, the derivative is:

$$f'(x) = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}$$

**Interpretation:** Rate of change of $f$ as $x$ changes.
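
The limit can be approximated numerically by choosing a small but finite $h$. The following is a minimal plain-Python sketch, not part of the original notebook code: the helper name `forward_difference` and the test function $x^3$ are assumptions chosen for illustration. It shows the difference quotient approaching the exact derivative $3x^2$ as $h$ shrinks.

```python
def forward_difference(f, x, h):
    """Difference quotient (f(x + h) - f(x)) / h from the limit definition."""
    return (f(x + h) - f(x)) / h


def cube(x):
    return x**3


x0 = 2.0
exact = 3 * x0**2  # d/dx x^3 = 3x^2, which is 12 at x = 2

errors = []
for h in [1e-1, 1e-2, 1e-3, 1e-4]:
    approx = forward_difference(cube, x0, h)
    errors.append(abs(approx - exact))
    print(f"h = {h:.0e}: approx = {approx:.6f}, error = {errors[-1]:.2e}")
```

The error shrinks roughly in proportion to $h$, which is what the limit definition predicts for a smooth function.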

### Multiple Variables: Partial Derivatives

For $f: \mathbb{R}^n \to \mathbb{R}$ with inputs $x_1, x_2, \ldots, x_n$, we have **partial derivatives**:

$$\frac{\partial f}{\partial x_k} = \lim_{h \to 0} \frac{f(x_1, \ldots, x_k + h, \ldots, x_n) - f(x_1, \ldots, x_k, \ldots, x_n)}{h}$$

**Key idea:** Hold all other variables constant, differentiate with respect to one.

### Example: Computing Partial Derivatives

Let $f(x, y) = x^2 + 3xy + 2y^2$

**Partial with respect to $x$** (treat $y$ as constant):
$$\frac{\partial f}{\partial x} = 2x + 3y$$

**Partial with respect to $y$** (treat $x$ as constant):
$$\frac{\partial f}{\partial y} = 3x + 4y$$
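
The hand-computed partials can be sanity-checked without any autodiff library by nudging one variable at a time. This is a sketch under assumptions: the helper names `partial_x` and `partial_y` and the step size `h=1e-5` are illustrative choices, and a central difference is used instead of the one-sided limit because it is more accurate for the same $h$.

```python
def f(x, y):
    return x**2 + 3*x*y + 2*y**2


def partial_x(f, x, y, h=1e-5):
    """Central difference in x, with y held constant."""
    return (f(x + h, y) - f(x - h, y)) / (2 * h)


def partial_y(f, x, y, h=1e-5):
    """Central difference in y, with x held constant."""
    return (f(x, y + h) - f(x, y - h)) / (2 * h)


df_dx = partial_x(f, 2.0, 3.0)
df_dy = partial_y(f, 2.0, 3.0)

print(f"Numerical df/dx at (2, 3): {df_dx:.6f}  (formula: 2*2 + 3*3 = 13)")
print(f"Numerical df/dy at (2, 3): {df_dy:.6f}  (formula: 3*2 + 4*3 = 18)")
```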

In [None]:
# Verify with PyTorch
def f(x, y):
    return x**2 + 3*x*y + 2*y**2

# Create tensors with gradient tracking
x = torch.tensor(2.0, requires_grad=True)
y = torch.tensor(3.0, requires_grad=True)

# Compute function value
z = f(x, y)
print(f"f(2, 3) = {z.item()}")

# Compute gradients
z.backward()

print(f"∂f/∂x at (2,3) = {x.grad.item()}  (Expected: 2*2 + 3*3 = 13)")
print(f"∂f/∂y at (2,3) = {y.grad.item()}  (Expected: 3*2 + 4*3 = 18)")

---
## 2. The Gradient Vector

The **gradient** collects all partial derivatives into a vector:

$$\nabla f(\vec{x}) = \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{bmatrix}$$

### Key Properties

1. **Direction:** $\nabla f$ points in the direction of steepest ascent
2. **Magnitude:** $\|\nabla f\|$ gives the rate of steepest ascent
3. **Perpendicular to level sets:** $\nabla f$ is perpendicular to contour lines $f(\vec{x}) = c$
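
These properties can be checked numerically. The sketch below uses the elliptical bowl $f(x, y) = x^2 + 2y^2$ at the point $(1, 2)$; the point, the step size, and the brute-force search over 3600 angles are all assumptions made for illustration. It confirms that a step along the level-curve tangent barely changes $f$, and that the best of the sampled directions lines up with the gradient.

```python
import math


def f(x, y):
    return x**2 + 2*y**2


def grad_f(x, y):
    return (2*x, 4*y)


px, py = 1.0, 2.0
gx, gy = grad_f(px, py)          # gradient at (1, 2) is (2, 8)
grad_norm = math.hypot(gx, gy)

# Rotating the gradient by 90 degrees gives a tangent to the level curve.
tx, ty = -gy, gx
dot = gx*tx + gy*ty              # gradient . tangent

# Take a small step along each direction and compare the change in f.
step = 1e-4
change_along_tangent = f(px + step*tx/grad_norm, py + step*ty/grad_norm) - f(px, py)
change_along_gradient = f(px + step*gx/grad_norm, py + step*gy/grad_norm) - f(px, py)

# Brute-force search: which unit direction increases f the most?
angles = [2 * math.pi * k / 3600 for k in range(3600)]
best_angle = max(angles, key=lambda a: f(px + step*math.cos(a), py + step*math.sin(a)))
gradient_angle = math.atan2(gy, gx)

print(f"gradient . tangent     = {dot}")
print(f"change along tangent   = {change_along_tangent:.2e}")
print(f"change along gradient  = {change_along_gradient:.2e}")
print(f"best angle = {best_angle:.4f} rad, gradient angle = {gradient_angle:.4f} rad")
```

The change along the gradient is close to `step * grad_norm`, which matches the magnitude property: $\|\nabla f\|$ is the rate of steepest ascent.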

In [None]:
# Visualize gradient as direction of steepest ascent
def f_2d(xy):
    x, y = xy[0], xy[1]
    return x**2 + 2*y**2  # Elliptical bowl

# Create grid for contour plot
x_range = np.linspace(-3, 3, 100)
y_range = np.linspace(-3, 3, 100)
X, Y = np.meshgrid(x_range, y_range)
Z = f_2d((X, Y))  # Evaluate the bowl on the grid (same as X**2 + 2*Y**2)

# Compute gradient at several points
points = [(-2, 1), (1, 2), (2, -1), (-1, -1.5)]

fig, ax = plt.subplots(figsize=(10, 8))

# Contour plot
contours = ax.contour(X, Y, Z, levels=15, cmap='viridis')
ax.clabel(contours, inline=True, fontsize=8)

# Plot gradients at each point
for px, py in points:
    # Gradient: (2x, 4y)
    grad_x, grad_y = 2*px, 4*py
    
    # Scale arrows down for readability (direction and relative length are preserved)
    scale = 0.3
    ax.arrow(px, py, scale*grad_x, scale*grad_y, 
             head_width=0.15, head_length=0.1, fc='red', ec='red')
    ax.plot(px, py, 'ko', markersize=8)

ax.set_xlabel('x')
ax.set_ylabel('y')
ax.set_title('Gradient vectors (red) point perpendicular to contour lines\n(direction of steepest ascent)')
ax.set_aspect('equal')
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

---
## 3. Directional Derivative

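The **directional derivative** measures the rate of change of $f$ at $\vec{x}$ along a direction $\vec{v}$:

$$D_{\vec{v}}f(\vec{x}) = \nabla f(\vec{x}) \cdot \vec{v} = \|\nabla f\| \|\vec{v}\| \cos\theta$$

where $\theta$ is the angle between $\nabla f$ and $\vec{v}$. When $\vec{v}$ is a unit vector ($\|\vec{v}\| = 1$), this is the slope of $f$ per unit distance traveled along $\vec{v}$.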

### Key Insight

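Among unit vectors $\vec{v}$, the directional derivative is:

- **Maximum** ($+\|\nabla f\|$) when $\vec{v}$ aligns with $\nabla f$ ($\theta = 0^\circ$)
- **Zero** when $\vec{v}$ is perpendicular to $\nabla f$ ($\theta = 90^\circ$)
- **Minimum** ($-\|\nabla f\|$) when $\vec{v}$ is opposite to $\nabla f$ ($\theta = 180^\circ$)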
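
As a worked example, take the bowl function $f(x, y) = x^2 + 2y^2$ (the same function the next cell evaluates) at the point $(1, 1)$, where $\nabla f = (2, 4)$ and $\|\nabla f\| = \sqrt{20} \approx 4.472$. For four unit vectors $\vec{v}$, the dot product gives:

$$
\begin{aligned}
\vec{v} = \tfrac{1}{\sqrt{20}}(2, 4): &\quad D_{\vec{v}}f = \frac{2 \cdot 2 + 4 \cdot 4}{\sqrt{20}} = \sqrt{20} \approx 4.472 \\
\vec{v} = -\tfrac{1}{\sqrt{20}}(2, 4): &\quad D_{\vec{v}}f = -\sqrt{20} \approx -4.472 \\
\vec{v} = \tfrac{1}{\sqrt{20}}(4, -2): &\quad D_{\vec{v}}f = \frac{2 \cdot 4 + 4 \cdot (-2)}{\sqrt{20}} = 0 \\
\vec{v} = (1, 0): &\quad D_{\vec{v}}f = 2 \cdot 1 + 4 \cdot 0 = 2
\end{aligned}
$$

The first three cases are the maximum, minimum, and zero cases from the list above; the $x$-direction falls in between.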

In [None]:
# Compute directional derivative
def directional_derivative(f, x, v):
    """
    Compute directional derivative of f at x in direction v.
    
    Args:
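        f: Function that takes a tensor and returns a scalar
        x: Point (1-D tensor; a detached copy is made, so the caller's tensor is untouched)
        v: Direction vector (1-D tensor; normalize it to get a rate per unit distance)
    
    Returns:
        Directional derivative (Python float)
    """
    # Use a detached leaf copy so x.grad is populated even if the caller's x already tracks gradients
    x = x.detach().clone().requires_grad_(True)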
    y = f(x)
    y.backward()
    grad = x.grad
    
    # D_v f = grad · v
    return torch.dot(grad, v).item()

# Example: f(x,y) = x^2 + 2y^2
def f_bowl(xy):
    return xy[0]**2 + 2*xy[1]**2

x = torch.tensor([1.0, 1.0])

# Gradient at (1,1) is (2, 4)
# Test different directions
directions = {
    'Along gradient (2,4) normalized': torch.tensor([2.0, 4.0]) / torch.norm(torch.tensor([2.0, 4.0])),
    'Opposite to gradient': -torch.tensor([2.0, 4.0]) / torch.norm(torch.tensor([2.0, 4.0])),
    'Perpendicular (4,-2) normalized': torch.tensor([4.0, -2.0]) / torch.norm(torch.tensor([4.0, -2.0])),
    'x-direction (1,0)': torch.tensor([1.0, 0.0]),
}

print("Directional derivatives at (1,1):")
print("Gradient = (2, 4), |∇f| = ", torch.norm(torch.tensor([2.0, 4.0])).item())
print("-" * 50)

for name, v in directions.items():
    dd = directional_derivative(f_bowl, x, v)
    print(f"{name}: {dd:.4f}")

---
## 4. Vector Norms: Measuring Magnitude

Different norms measure vector "size" differently:

| Norm | Formula | Interpretation |
|------|---------|----------------|
| $L_1$ | $\|\vec{x}\|_1 = \sum_i \lvert x_i \rvert$ | Manhattan distance |
| $L_2$ | $\|\vec{x}\|_2 = \sqrt{\sum_i x_i^2}$ | Euclidean distance |
| $L_\infty$ | $\|\vec{x}\|_\infty = \max_i \lvert x_i \rvert$ | Largest absolute entry |
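
Each formula in the table translates almost directly into code. The sketch below is a plain-Python illustration with hypothetical helper names (`l1_norm`, `l2_norm`, `linf_norm`), applied to the vector $(3, -4)$ because its norms are easy to check by hand.

```python
import math


def l1_norm(x):
    """Sum of absolute values."""
    return sum(abs(xi) for xi in x)


def l2_norm(x):
    """Square root of the sum of squares."""
    return math.sqrt(sum(xi**2 for xi in x))


def linf_norm(x):
    """Largest absolute entry."""
    return max(abs(xi) for xi in x)


x = [3.0, -4.0]
print(f"L1:   {l1_norm(x)}")    # |3| + |-4| = 7
print(f"L2:   {l2_norm(x)}")    # sqrt(9 + 16) = 5
print(f"Linf: {linf_norm(x)}")  # max(|3|, |-4|) = 4
```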

### Why This Matters for Genomics

- **L2 norm:** Used in ridge regression, PCA
- **L1 norm:** Used in LASSO for sparse gene selection
- **L∞ norm:** Reports the single largest deviation, which makes it useful for worst-case error and numerical stability checks

In [None]:
def compute_norms(x):
    """Compute L1, L2, and Linf norms of a vector."""
    return {
        'L1': torch.norm(x, p=1).item(),
        'L2': torch.norm(x, p=2).item(),
        'Linf': torch.norm(x, p=float('inf')).item()
    }

# Example: Gene expression difference vector
# Imagine comparing expression of 5 genes between two conditions
gene_diff = torch.tensor([0.5, -2.0, 0.1, 3.0, -0.3])

print("Gene expression differences:", gene_diff.tolist())
print("\nNorms:")
for name, value in compute_norms(gene_diff).items():
    print(f"  {name}: {value:.4f}")

print("\nInterpretation:")
print("  L1: Total absolute change across all genes")
print("  L2: Overall magnitude (Euclidean)")
print("  Linf: Largest single gene change (gene 4 = 3.0)")

---
## 5. Gradient of Common Functions

### Important Formulas (memorize these!)

| Function | Gradient |
|----------|----------|
| $f(\vec{x}) = \vec{a}^\top \vec{x}$ | $\nabla f = \vec{a}$ |
| $f(\vec{x}) = \vec{x}^\top A \vec{x}$ | $\nabla f = (A + A^\top)\vec{x}$ |
| $f(\vec{x}) = \vec{x}^\top A \vec{x}$ (symmetric $A$) | $\nabla f = 2A\vec{x}$ |
| $f(\vec{x}) = \|\vec{x}\|_2^2$ | $\nabla f = 2\vec{x}$ |
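
The quadratic-form rows are easy to mix up, so it helps to test them on a matrix that is deliberately not symmetric. The following sketch is an illustration only: the matrix `A`, the point `x`, and the helper names are assumptions. It compares $(A + A^\top)\vec{x}$ with a central-difference gradient and shows that the shortcut $2A\vec{x}$ gives a different answer when $A \neq A^\top$.

```python
def quad_form(A, x):
    """x^T A x for a square matrix A stored as nested lists."""
    n = len(x)
    return sum(x[i] * A[i][j] * x[j] for i in range(n) for j in range(n))


def numerical_gradient(f, x, h=1e-5):
    """Central-difference gradient, one coordinate at a time."""
    grad = []
    for k in range(len(x)):
        x_plus = list(x)
        x_minus = list(x)
        x_plus[k] += h
        x_minus[k] -= h
        grad.append((f(x_plus) - f(x_minus)) / (2 * h))
    return grad


A = [[1.0, 2.0],
     [0.0, 3.0]]   # deliberately non-symmetric
x = [1.0, -2.0]
n = len(x)

formula = [sum((A[i][j] + A[j][i]) * x[j] for j in range(n)) for i in range(n)]
two_Ax = [2 * sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
numeric = numerical_gradient(lambda v: quad_form(A, v), x)

print(f"(A + A^T) x : {formula}")
print(f"numerical   : {[round(g, 6) for g in numeric]}")
print(f"2 A x       : {two_Ax}  (only valid when A is symmetric)")
```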

### MSE Loss Gradient (Critical for ML!)

For predictions $\hat{y} = X\vec{w}$ and targets $\vec{y}$:

$$L(\vec{w}) = \frac{1}{n}\|X\vec{w} - \vec{y}\|_2^2$$

$$\nabla_{\vec{w}} L = \frac{2}{n}X^\top(X\vec{w} - \vec{y})$$

In [None]:
# Verify MSE gradient formula
torch.manual_seed(42)

# Simulate gene expression data
n_samples = 100
n_features = 5  # e.g., 5 covariates

X = torch.randn(n_samples, n_features)  # Design matrix
y = torch.randn(n_samples)               # Target (e.g., gene expression)
w = torch.randn(n_features, requires_grad=True)  # Coefficients

# MSE Loss
def mse_loss(w, X, y):
    residuals = X @ w - y
    return torch.mean(residuals**2)

# Compute gradient with autograd
loss = mse_loss(w, X, y)
loss.backward()
autograd_gradient = w.grad.clone()

# Compute gradient with formula
with torch.no_grad():
    residuals = X @ w - y
    formula_gradient = (2/n_samples) * X.T @ residuals

print("Gradient comparison (should match):")
print(f"Autograd: {autograd_gradient.numpy().round(4)}")
print(f"Formula:  {formula_gradient.numpy().round(4)}")
print(f"\nMax difference: {(autograd_gradient - formula_gradient).abs().max().item():.2e}")

---
## 6. Genomics Application: Gene Expression Loss Function

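In differential expression analysis, we often fit coefficients $\beta$ by minimizing the negative log-likelihood:

$$L(\beta) = -\log P(\text{data} \mid \beta)$$

For a simplified Gaussian model with fixed noise variance, this reduces (up to additive and multiplicative constants) to the sum of squared residuals, where $X_i^\top$ is the $i$-th row of the design matrix:

$$L(\beta) = \sum_{i=1}^{n} (y_i - X_i^\top \beta)^2$$

Let's compute gradients for a toy gene expression example. The code below uses the mean of the squared residuals rather than the sum; dividing by $n$ rescales the gradient but leaves the minimizer unchanged.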

In [None]:
# Simulated differential expression analysis
torch.manual_seed(123)

# Simulate: 50 samples, 2 conditions (control vs treatment)
n_samples = 50

# Design matrix: intercept + treatment indicator
X = torch.zeros(n_samples, 2)
X[:, 0] = 1  # Intercept
X[:25, 1] = 0  # Control
X[25:, 1] = 1  # Treatment

# True coefficients: baseline=5, treatment effect=2
beta_true = torch.tensor([5.0, 2.0])

# Simulated gene expression with noise
y = X @ beta_true + 0.5 * torch.randn(n_samples)

# Starting guess
beta = torch.tensor([0.0, 0.0], requires_grad=True)

# Compute loss and gradient
loss = torch.mean((X @ beta - y)**2)
loss.backward()

print("Simulated Gene Expression Analysis")
print("=" * 40)
print(f"True coefficients: {beta_true.tolist()}")
print(f"Initial guess: {beta.detach().tolist()}")
print(f"Initial loss: {loss.item():.4f}")
print(f"Gradient at initial guess: {beta.grad.tolist()}")
print(f"\nGradient interpretation:")
print(f"  ∂L/∂β₀ = {beta.grad[0].item():.2f} → Intercept should {'increase' if beta.grad[0] < 0 else 'decrease'}")
print(f"  ∂L/∂β₁ = {beta.grad[1].item():.2f} → Treatment effect should {'increase' if beta.grad[1] < 0 else 'decrease'}")
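
A negative partial derivative means the loss falls when that coefficient increases, so repeatedly stepping against the gradient should move $\beta$ toward the true values. The sketch below illustrates this in plain Python under stated assumptions: the data are noise-free (so the result is deterministic), and the step size `learning_rate = 0.1`, the 500 iterations, and all variable names are hypothetical choices rather than part of the analysis above.

```python
# Noise-free toy data (an assumption): 25 control samples with expression 5,
# 25 treated samples with expression 7, i.e. baseline = 5 and effect = 2.
X_toy = [(1.0, 0.0)] * 25 + [(1.0, 1.0)] * 25   # columns: intercept, treatment
y_toy = [5.0] * 25 + [7.0] * 25
n = len(y_toy)


def mse_and_gradient(beta):
    """Mean squared error and its gradient (2/n) X^T (X beta - y)."""
    residuals = [row[0]*beta[0] + row[1]*beta[1] - yi
                 for row, yi in zip(X_toy, y_toy)]
    loss = sum(r**2 for r in residuals) / n
    grad = [2.0 / n * sum(r * row[k] for r, row in zip(residuals, X_toy))
            for k in range(2)]
    return loss, grad


beta_gd = [0.0, 0.0]
learning_rate = 0.1   # hypothetical step size
initial_loss, initial_grad = mse_and_gradient(beta_gd)

losses = []
for _ in range(500):
    loss, grad = mse_and_gradient(beta_gd)
    losses.append(loss)
    # Step against the gradient: negative partials push the coefficient up.
    beta_gd = [b - learning_rate * g for b, g in zip(beta_gd, grad)]

print(f"Initial gradient: {initial_grad}")
print(f"Loss: {initial_loss:.4f} -> {losses[-1]:.2e}")
print(f"Estimated beta: {[round(b, 4) for b in beta_gd]}")
```

Both initial partials are negative, matching the "should increase" interpretation, and the estimate approaches baseline 5 and treatment effect 2.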

---
## Exercises

### Exercise 1: Manual Gradient Computation

Compute the gradient of $f(x, y, z) = x^2y + yz^2 + xz$ by hand, then verify with PyTorch.

### Exercise 2: Directional Derivative

For $f(x, y) = x^2 - xy + y^2$ at point $(1, 2)$:
1. Compute $\nabla f$
2. Find the directional derivative in direction $(3, 4)$ (normalize first!)
3. What direction gives the maximum rate of increase?

### Exercise 3: Norm Comparison

Generate a random vector with 100 entries. Compare its L1, L2, and L∞ norms. Which is largest? Which is smallest? Why?

### Exercise 4: MSE Gradient

Derive the gradient of the MSE loss $L(\vec{w}) = \frac{1}{n}\|X\vec{w} - \vec{y}\|_2^2$ step by step using the chain rule.

In [None]:
# Exercise 1: Your solution here
# f(x, y, z) = x²y + yz² + xz
# ∂f/∂x = ?
# ∂f/∂y = ?
# ∂f/∂z = ?



In [None]:
# Exercise 2: Your solution here



In [None]:
# Exercise 3: Your solution here



---
## Summary

| Concept | Key Formula | Interpretation |
|---------|-------------|----------------|
| Partial derivative | $\frac{\partial f}{\partial x_k}$ | Rate of change in one variable |
| Gradient | $\nabla f = (\frac{\partial f}{\partial x_1}, ..., \frac{\partial f}{\partial x_n})$ | Direction of steepest ascent |
| Directional derivative | $D_{\vec{v}}f = \nabla f \cdot \vec{v}$ | Rate of change along unit direction $\vec{v}$ |
| MSE gradient | $\frac{2}{n}X^\top(X\vec{w} - \vec{y})$ | Used in linear regression |

## Next Notebook

**02_hessian_newton.ipynb:** Second derivatives, the Hessian matrix, and Newton's optimization method.