# 03: Jacobian Matrix and Backpropagation

**Module 1.1: Calculus & Optimization**

## Learning Objectives

By the end of this notebook, you will:
1. Compute the Jacobian matrix for vector-valued functions
2. Understand the chain rule for Jacobians
3. Connect Jacobians to neural network backpropagation
4. Compute Jacobians for common layer types (linear, ReLU)

## Resources
- Solomon, *Numerical Algorithms*, §1.4.2
- Ananthaswamy, *Why Machines Learn*, Chapter 5
- ISLR, §10.7

In [None]:
import torch
import torch.nn as nn
import numpy as np
import matplotlib.pyplot as plt
from torch.autograd.functional import jacobian

torch.manual_seed(42)
plt.rcParams['figure.figsize'] = (10, 6)

---
## 1. Gradient vs Jacobian vs Hessian

| | Function Type | Derivative Shape | Use Case |
|---|--------------|--------------|----------|
| **Gradient** | $f: \mathbb{R}^n \to \mathbb{R}$ | Vector $(n,)$ | Loss functions |
| **Jacobian** | $f: \mathbb{R}^n \to \mathbb{R}^m$ | Matrix $(m \times n)$ | NN layers, transformations |
| **Hessian** | $f: \mathbb{R}^n \to \mathbb{R}$ | Matrix $(n \times n)$ | Curvature, Newton's method |
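
The shapes in the table can be checked directly. The sketch below uses two made-up functions (`scalar_fn` and `vector_fn` are illustrative names, not part of this notebook) with $n = 3$ inputs, and relies on `torch.autograd.functional.jacobian` and `hessian`.

```python
import torch
from torch.autograd.functional import jacobian, hessian

def scalar_fn(x):
    """Illustrative R^3 -> R function (plays the role of a loss)."""
    return (x ** 2).sum()

def vector_fn(x):
    """Illustrative R^3 -> R^2 function (plays the role of a layer)."""
    return torch.stack([x[0] * x[1], x[1] + x[2]])

x_demo = torch.tensor([1.0, 2.0, 3.0])

grad = jacobian(scalar_fn, x_demo)   # scalar output, so the result has shape (n,)
jac = jacobian(vector_fn, x_demo)    # shape (m, n)
hess = hessian(scalar_fn, x_demo)    # shape (n, n)

print(f"Gradient shape: {tuple(grad.shape)}")
print(f"Jacobian shape: {tuple(jac.shape)}")
print(f"Hessian shape:  {tuple(hess.shape)}")
```

For a scalar-valued function the Jacobian has a single row, which is why it collapses to the gradient vector here.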

### The Jacobian Matrix

For $f: \mathbb{R}^n \to \mathbb{R}^m$, the Jacobian is:

$$Df = \begin{bmatrix}
\frac{\partial f_1}{\partial x_1} & \frac{\partial f_1}{\partial x_2} & \cdots & \frac{\partial f_1}{\partial x_n} \\
\frac{\partial f_2}{\partial x_1} & \frac{\partial f_2}{\partial x_2} & \cdots & \frac{\partial f_2}{\partial x_n} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial f_m}{\partial x_1} & \frac{\partial f_m}{\partial x_2} & \cdots & \frac{\partial f_m}{\partial x_n}
\end{bmatrix} \in \mathbb{R}^{m \times n}$$

**Shape:** (number of outputs) × (number of inputs)

In [None]:
# Example: Simple vector-valued function
def f(x):
    """f: R² → R³"""
    return torch.stack([
        x[0]**2 + x[1],      # f1 = x² + y
        x[0] * x[1],          # f2 = xy
        torch.sin(x[0])       # f3 = sin(x)
    ])

# Compute Jacobian at a point
x = torch.tensor([1.0, 2.0])
J = jacobian(f, x)

print("Function: f(x,y) = [x² + y, xy, sin(x)]")
print(f"\nInput shape: {x.shape} (n=2)")
print(f"Output shape: {f(x).shape} (m=3)")
print(f"Jacobian shape: {J.shape} (m×n = 3×2)")

print("\nJacobian at (1, 2):")
print(J)

print("\nVerification (by hand):")
print("∂f1/∂x = 2x = 2,  ∂f1/∂y = 1")
print("∂f2/∂x = y = 2,   ∂f2/∂y = x = 1")
print(f"∂f3/∂x = cos(x) = {np.cos(1):.4f}, ∂f3/∂y = 0")
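
A Jacobian can also be sanity-checked numerically. The sketch below rebuilds the same function under a separate name (`f_example`) and compares autograd against central finite differences. The helper `numerical_jacobian` and the step size `eps=1e-4` are assumptions chosen for illustration, and float64 is used to keep rounding error small.

```python
import torch
from torch.autograd.functional import jacobian

def f_example(x):
    """Same function as above: f(x, y) = [x^2 + y, xy, sin(x)]."""
    return torch.stack([x[0] ** 2 + x[1], x[0] * x[1], torch.sin(x[0])])

def numerical_jacobian(func, x, eps=1e-4):
    """Central differences: column j is (f(x + eps*e_j) - f(x - eps*e_j)) / (2*eps)."""
    cols = []
    for j in range(x.numel()):
        step = torch.zeros_like(x)
        step[j] = eps
        cols.append((func(x + step) - func(x - step)) / (2 * eps))
    return torch.stack(cols, dim=1)  # shape (m, n)

x0 = torch.tensor([1.0, 2.0], dtype=torch.float64)
J_auto = jacobian(f_example, x0)
J_num = numerical_jacobian(f_example, x0)
max_err = (J_auto - J_num).abs().max().item()

print(f"Autograd Jacobian:\n{J_auto}")
print(f"\nFinite-difference Jacobian:\n{J_num}")
print(f"\nMax abs difference: {max_err:.2e}")
```

Each column of the Jacobian needs two function evaluations, so this check is practical only for small input dimensions.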

---
## 2. Jacobian of a Linear Layer

A neural network linear layer:

$$f(\vec{x}) = W\vec{x} + \vec{b}$$

### Key Result

**The Jacobian of a linear layer is just the weight matrix!**

$$\frac{\partial f}{\partial \vec{x}} = W$$
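
The result can be checked one component at a time. A short derivation, using only the definition above:

$$f_i(\vec{x}) = \sum_{k=1}^{n} W_{ik} x_k + b_i \quad\Longrightarrow\quad \frac{\partial f_i}{\partial x_j} = W_{ij}$$

The bias $\vec{b}$ does not depend on $\vec{x}$, so it drops out, and the Jacobian is the same at every input point.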

In [None]:
# Demonstrate: Jacobian of linear layer = W
W = torch.tensor([[1.0, 2.0],
                  [3.0, 4.0],
                  [5.0, 6.0]])
b = torch.tensor([0.1, 0.2, 0.3])

def linear_layer(x):
    return W @ x + b

x = torch.tensor([1.0, 1.0])
J = jacobian(linear_layer, x)

print("Weight matrix W:")
print(W)
print("\nJacobian of linear layer:")
print(J)
print("\n✓ Jacobian = W")

---
## 3. Chain Rule for Jacobians

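For a composition $f = h \circ g$, i.e. $f(\vec{x}) = h(g(\vec{x}))$:

$$J_f(\vec{x}) = J_h(g(\vec{x})) \cdot J_g(\vec{x})$$

**Matrix multiplication!** Note that $J_h$ is evaluated at $g(\vec{x})$, not at $\vec{x}$. The shapes compose: if $g: \mathbb{R}^n \to \mathbb{R}^k$ and $h: \mathbb{R}^k \to \mathbb{R}^m$, then $(m \times k)(k \times n) = (m \times n)$.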

In [None]:
# Demonstrate chain rule
W1 = torch.randn(3, 2)  # Layer 1: R² → R³
W2 = torch.randn(4, 3)  # Layer 2: R³ → R⁴

def g(x): return W1 @ x
def h(z): return W2 @ z
def f(x): return h(g(x))

x = torch.randn(2)

J_f_direct = jacobian(f, x)
J_f_chain = jacobian(h, g(x)) @ jacobian(g, x)

print(f"Direct Jacobian:\n{J_f_direct}")
print(f"\nChain rule J_h @ J_g:\n{J_f_chain}")
print(f"\n✓ Match! Diff: {(J_f_direct - J_f_chain).abs().max():.2e}")
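
In practice, backpropagation rarely needs the full Jacobian product. Given an upstream gradient $\vec{v}$, it is enough to compute $\vec{v}^\top J_h J_g$ from left to right, one vector-Jacobian product at a time. The sketch below illustrates this with `torch.autograd.functional.vjp`. The weights `A1`, `A2`, the functions `g_fn`, `h_fn`, and the vector `v` are made up for the example.

```python
import torch
from torch.autograd.functional import jacobian, vjp

torch.manual_seed(0)
A1 = torch.randn(3, 2)  # illustrative layer 1: R^2 -> R^3
A2 = torch.randn(4, 3)  # illustrative layer 2: R^3 -> R^4

def g_fn(x): return torch.tanh(A1 @ x)
def h_fn(z): return A2 @ z
def composed(x): return h_fn(g_fn(x))

x0 = torch.randn(2)
v = torch.randn(4)  # stands in for an upstream gradient dL/dy

# Route 1: build the full (4 x 2) Jacobian, then multiply by v
full_route = v @ jacobian(composed, x0)

# Route 2: two vector-Jacobian products, no full Jacobian is formed
z0 = g_fn(x0)
_, v_Jh = vjp(h_fn, z0, v)         # v^T J_h(g(x)), shape (3,)
_, vjp_route = vjp(g_fn, x0, v_Jh)  # (v^T J_h) J_g(x), shape (2,)

print(f"Full Jacobian route: {full_route.tolist()}")
print(f"VJP route:           {vjp_route.tolist()}")
```

Both routes should give the same vector. The second only ever stores vectors of the layer widths, which is the reason reverse-mode differentiation scales to large networks.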

---
## 4. Jacobian of Element-wise Functions

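**Element-wise functions have diagonal Jacobians!** Output $i$ depends only on input $i$, so every off-diagonal entry $\partial f_i / \partial z_j$ with $i \neq j$ is zero.

For ReLU:
$$J_{\text{ReLU}} = \text{diag}(\mathbb{1}_{z_1 > 0}, \mathbb{1}_{z_2 > 0}, \ldots)$$

ReLU is not differentiable at $z_i = 0$; PyTorch uses a derivative of 0 there.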

In [None]:
# Jacobian of ReLU
z = torch.tensor([2.0, -1.0, 3.0, -0.5])
J_relu = jacobian(torch.relu, z)

print(f"Input z: {z.tolist()}")
print(f"ReLU(z): {torch.relu(z).tolist()}")
print(f"\nJacobian (diagonal matrix):")
print(J_relu)
print("\n→ Dead neurons (z ≤ 0) block gradient flow!")
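
The same diagonal structure holds for any element-wise function, with the derivative values on the diagonal. As a sketch, the example below uses `tanh` (chosen here only for illustration), whose derivative is $1 - \tanh^2(z)$, and compares autograd against that formula.

```python
import torch
from torch.autograd.functional import jacobian

z_demo = torch.tensor([2.0, -1.0, 0.0, 0.5])

J_tanh = jacobian(torch.tanh, z_demo)
# Closed form: diag(1 - tanh(z)^2)
J_expected = torch.diag(1 - torch.tanh(z_demo) ** 2)
# Everything off the diagonal should be exactly zero
off_diag = J_tanh - torch.diag(torch.diagonal(J_tanh))

print(f"Jacobian of tanh:\n{J_tanh}")
print(f"\nMax off-diagonal entry: {off_diag.abs().max().item()}")
```

Unlike ReLU, the diagonal entries here are not just 0 or 1, so the gradient is scaled rather than gated.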

---
## 5. Full Neural Network Layer

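For $f(\vec{x}) = \sigma(W\vec{x} + \vec{b})$ with pre-activation $\vec{z} = W\vec{x} + \vec{b}$, the chain rule gives:

$$J_f(\vec{x}) = J_\sigma(\vec{z}) \cdot W$$

Since $J_\sigma$ is diagonal, it simply scales the rows of $W$.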

In [None]:
# Full layer with activation
W = torch.tensor([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
b = torch.tensor([0.1, -1.0, 0.3])  # b[1] chosen so that z[1] < 0 at the x below

def layer_with_relu(x):
    return torch.relu(W @ x + b)

x = torch.tensor([0.1, 0.1])
z = W @ x + b  # Pre-activation, approximately [0.4, -0.3, 1.4]

J = jacobian(layer_with_relu, x)
J_manual = torch.diag((z > 0).float()) @ W  # J_relu(z) @ W

print(f"Pre-activation z: {z.tolist()}")
print(f"\nJacobian:\n{J}")
print(f"\nManual diag(z > 0) @ W:\n{J_manual}")
print(f"\n✓ Row 2 zeroed (z[1]={z[1]:.2f} < 0)")

---
## 6. Backpropagation Through a Network

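Gradients flow backward through Jacobians:

$$\frac{\partial L}{\partial \vec{x}} = \frac{\partial L}{\partial \vec{y}} \cdot J_{\text{layer}_n} \cdot \ldots \cdot J_{\text{layer}_1}$$

Here $\partial L / \partial \vec{y}$ is a row vector, so evaluating the product from left to right only ever multiplies a vector by a matrix.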

In [None]:
# Simple network
class SimpleNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(2, 3), nn.ReLU(),
            nn.Linear(3, 2), nn.ReLU(),
            nn.Linear(2, 1)
        )
    def forward(self, x): return self.layers(x)

net = SimpleNet()
x = torch.tensor([1.0, 2.0], requires_grad=True)

y = net(x)
loss = y**2
loss.backward()

print(f"∂L/∂x = {x.grad.tolist()}")

# Verify with full Jacobian
J_net = jacobian(lambda x: net(x), torch.tensor([1.0, 2.0]))
grad_manual = 2 * net(torch.tensor([1.0, 2.0])) * J_net
print(f"Manual:  {grad_manual.squeeze().tolist()}")
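
To make the backward flow explicit, the sketch below redoes backpropagation by hand for a smaller, hypothetical network $y = M_2\,\text{ReLU}(M_1\vec{x} + \vec{c}_1)$ with loss $L = y^2$. The weights `M1`, `c1`, `M2` are random and made up for the example. Each line of the manual pass multiplies the running gradient by one layer's Jacobian, and the result is compared with `x_in.grad` from autograd.

```python
import torch

torch.manual_seed(0)
M1 = torch.randn(3, 2)   # illustrative first linear layer: R^2 -> R^3
c1 = torch.randn(3)
M2 = torch.randn(1, 3)   # illustrative second linear layer: R^3 -> R^1

x_in = torch.tensor([1.0, 2.0], requires_grad=True)

# Forward pass, keeping the intermediate values needed for the backward pass
z1 = M1 @ x_in + c1
a1 = torch.relu(z1)
y_out = M2 @ a1
loss_val = (y_out ** 2).sum()
loss_val.backward()      # autograd reference, stored in x_in.grad

# Manual backward pass: start from dL/dy and apply one Jacobian per line
with torch.no_grad():
    g = 2 * y_out               # dL/dy, shape (1,)
    g = g @ M2                  # linear layer: Jacobian is M2, shape (3,)
    g = g * (z1 > 0).float()    # ReLU: diagonal 0/1 Jacobian, applied element-wise
    g = g @ M1                  # linear layer: Jacobian is M1, shape (2,)

print(f"Autograd: {x_in.grad.tolist()}")
print(f"Manual:   {g.tolist()}")
```

Multiplying by the diagonal ReLU Jacobian reduces to an element-wise product with the 0/1 mask, so the diagonal matrix never has to be built.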

---
## Exercises

### Exercise 1: Manual Jacobian
Compute the Jacobian of $f(x,y) = (x^2 + y, xy, e^x)$ by hand at $(1, 2)$.

### Exercise 2: Chain Rule
For a 4-layer network with shapes 3→5→4→2→1, what is the shape of the full Jacobian $\partial y / \partial x$?

### Exercise 3: Dead Neurons
Create an input where ALL neurons in a ReLU layer are dead. What is the Jacobian?

### Exercise 4: Sigmoid Saturation
Compute the Jacobian of sigmoid at $z = 10$. Why is this problematic for training?

In [None]:
# Exercise 1: Your solution


In [None]:
# Exercise 2: Your solution


In [None]:
# Exercise 3: Your solution


In [None]:
# Exercise 4: Your solution


---
## Summary

| Concept | Key Point |
|---------|----------|
| **Jacobian shape** | (outputs × inputs) = $(m \times n)$ |
| **Linear layer** | Jacobian = $W$ |
| **Chain rule** | $J_f = J_h \cdot J_g$ (matrix multiply) |
| **Element-wise** | Diagonal Jacobian |
| **ReLU** | 0/1 diagonal (gates gradients) |
| **Backprop** | Chain Jacobians backward |

## Next: 04_autodiff.ipynb