# Day 2 — The Backward Pass & Chain Rule

**Building LLMs from Scratch · Following Andrej Karpathy's micrograd**

Today we implement **backpropagation**: computing gradients via the chain rule. The forward pass builds a computation graph; the backward pass propagates derivatives from output to every input. Each gradient answers: *"How much of the final result is my fault?"*

**What you'll build:**
- Extend the `Value` class with `_backward` closures for `+` and `*`
- Manually run the backward pass step-by-step
- Understand how addition *distributes* and multiplication *swaps* gradients
- See the `+=` vs `=` trap when the same variable appears twice
- Verify all gradients against PyTorch

## 1. Value Class with `_backward`

We extend the Value class from Day 1 with `_backward` closures. Each operation stores a function that knows how to propagate the upstream gradient to its children.

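- **Addition:** $\frac{\partial(a+b)}{\partial a} = \frac{\partial(a+b)}{\partial b} = 1$ → both inputs receive the upstream gradient unchanged (distributes)
- **Multiplication:** $\frac{\partial(a \times b)}{\partial a} = b$ and $\frac{\partial(a \times b)}{\partial b} = a$ → each input receives the *other's* value × the upstream gradient (swaps)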
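
These local derivatives can be sanity-checked numerically before any `Value` code exists. The sketch below uses a central finite difference on plain Python floats; the helper name `numeric_partial` and the step size `h` are illustrative assumptions, not part of micrograd.

```python
def numeric_partial(f, args, i, h=1e-6):
    """Central-difference estimate of the partial derivative of f w.r.t. args[i]."""
    up = list(args)
    down = list(args)
    up[i] += h
    down[i] -= h
    return (f(*up) - f(*down)) / (2 * h)

def add(x, y):
    return x + y

def mul(x, y):
    return x * y

a0, b0 = 2.0, -3.0  # sample inputs, chosen to match the values used later

d_add_da = numeric_partial(add, (a0, b0), 0)  # should be close to 1
d_add_db = numeric_partial(add, (a0, b0), 1)  # should be close to 1
d_mul_da = numeric_partial(mul, (a0, b0), 0)  # should be close to b0
d_mul_db = numeric_partial(mul, (a0, b0), 1)  # should be close to a0

print(f"d(a+b)/da ≈ {d_add_da:.4f}, d(a+b)/db ≈ {d_add_db:.4f}")
print(f"d(a*b)/da ≈ {d_mul_da:.4f}, d(a*b)/db ≈ {d_mul_db:.4f}")
```

The estimates agree with the analytic rules only up to floating-point error, so any comparison should use a tolerance rather than `==`.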

In [None]:
class Value:
    def __init__(self, data, _children=(), _op=''):
        self.data = data
        self.grad = 0.0
        self._prev = set(_children)
        self._op = _op
        self._backward = lambda: None  # default: no-op for leaf nodes

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other), '+')
        def _backward():
            self.grad += out.grad   # ∂(a+b)/∂a = 1 → pass through
            other.grad += out.grad  # ∂(a+b)/∂b = 1 → pass through
        out._backward = _backward
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other), '*')
        def _backward():
            self.grad += other.data * out.grad   # ∂(a*b)/∂a = b
            other.grad += self.data * out.grad   # ∂(a*b)/∂b = a
        out._backward = _backward
        return out

    def __repr__(self):
        return f"Value(data={self.data}, grad={self.grad})"

print("Value class with _backward for + and * defined.")

## 2. Manual Backward Pass

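Build the graph $L = (a \times b) + c$, then manually call `_backward()` in **reverse topological order** (output → inputs). `L._backward()` must run before `d._backward()`: the closure stored on `d` reads `d.grad`, which stays at 0 until `L` has propagated its gradient into it.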

In [None]:
a = Value(2.0)
b = Value(-3.0)
c = Value(10.0)
d = a * b   # d = -6.0
L = d + c   # L = 4.0

print("Forward pass:")
print(f"  a={a.data}, b={b.data}, c={c.data}")
print(f"  d = a*b = {d.data}")
print(f"  L = d+c = {L.data}")
print()

# Manual backward: seed L.grad = 1.0, then call _backward in reverse order
L.grad = 1.0
L._backward()   # propagates to d and c
d._backward()   # propagates to a and b

print("Backward pass (manual _backward calls):")
print(f"  L.grad = {L.grad}")
print(f"  d.grad = {d.grad}")
print(f"  c.grad = {c.grad}")
print(f"  a.grad = {a.grad}")
print(f"  b.grad = {b.grad}")
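
To see why the order matters, the sketch below rebuilds the same graph but calls the two closures in the wrong order. It uses a trimmed stand-in class (`MiniValue`, a hypothetical name) and different variable names so the snippet runs on its own and leaves the variables above untouched.

```python
class MiniValue:
    """Trimmed stand-in for Value: data, grad, and + / * with _backward closures."""
    def __init__(self, data):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None

    def __add__(self, other):
        out = MiniValue(self.data + other.data)
        def _backward():
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        out = MiniValue(self.data * other.data)
        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

x, y, z = MiniValue(2.0), MiniValue(-3.0), MiniValue(10.0)
p = x * y
loss = p + z

loss.grad = 1.0
p._backward()     # wrong: p.grad is still 0.0, so x and y receive nothing
loss._backward()  # p.grad becomes 1.0, but x and y have already been processed

print(f"wrong order: x.grad={x.grad}, y.grad={y.grad}, p.grad={p.grad}")
```

Here `x.grad` and `y.grad` end up at zero instead of -3 and 2, even though no individual closure is incorrect.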

## 3. The Chain Rule in Action

**Addition distributes:** Both children receive the full upstream gradient. If $L = d + c$, then $\frac{\partial L}{\partial d} = \frac{\partial L}{\partial c} = 1$.

**Multiplication swaps:** Each child gets the *other's* value times the upstream gradient. If $d = a \times b$, then $\frac{\partial d}{\partial a} = b$ and $\frac{\partial d}{\partial b} = a$.
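
Composing the two rules for $L = d + c$ with $d = a \times b$, evaluated at $a = 2$, $b = -3$, $c = 10$, the chain rule gives:

$$
\frac{\partial L}{\partial a} = \frac{\partial L}{\partial d} \cdot \frac{\partial d}{\partial a} = 1 \cdot b = -3, \qquad
\frac{\partial L}{\partial b} = \frac{\partial L}{\partial d} \cdot \frac{\partial d}{\partial b} = 1 \cdot a = 2, \qquad
\frac{\partial L}{\partial c} = 1
$$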

In [None]:
print("Chain rule intuition:")
print("  Addition:  ∂(a+b)/∂a = 1  →  grad passes through unchanged to both inputs")
print("  Addition:  ∂(a+b)/∂b = 1  →  same")
print("  Multiplication: ∂(a*b)/∂a = b  →  a gets (b's value) × upstream_grad")
print("  Multiplication: ∂(a*b)/∂b = a  →  b gets (a's value) × upstream_grad")
print()
print("For L = (a*b) + c with a=2, b=-3, c=10:")
print("  L._backward() → d.grad += 1, c.grad += 1  (addition distributes)")
print("  d._backward() → a.grad += (-3)*1 = -3, b.grad += 2*1 = 2  (multiplication swaps)")

## 4. Gradient Accumulation — The `+=` vs `=` Trap

When the same variable appears **twice** (e.g. $L = a + a$), its gradient is the **sum** of contributions from each use. Each path contributes +1, so `a.grad` is 2.0.

Using `=` instead of `+=` would **overwrite** the first contribution, producing wrong gradients. Always use `+=` in `_backward`.

In [None]:
a = Value(3.0)
L = a + a   # L = 6.0, both inputs are the same node

L.grad = 1.0
L._backward()  # addition: both children get += 1.0 → a gets 1 + 1 = 2.0

print("L = a + a  (same variable used twice)")
print(f"  L.data = {L.data}")
print(f"  a.grad = {a.grad}  ← should be 2.0 (both paths contribute +1)")
print()
print("If we had used '=' instead of '+=' in _backward, a.grad would be 1.0 (WRONG).")
print("The += ensures gradients from multiple paths accumulate correctly.")
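
The failure mode is easy to reproduce. The sketch below defines a deliberately broken stand-in class (`OverwritingValue`, a hypothetical name used only here) whose addition assigns gradients with `=` instead of accumulating them.

```python
class OverwritingValue:
    """Deliberately buggy: __add__ assigns gradients instead of accumulating."""
    def __init__(self, data):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None

    def __add__(self, other):
        out = OverwritingValue(self.data + other.data)
        def _backward():
            self.grad = out.grad   # BUG: should be +=
            other.grad = out.grad  # BUG: when other is self, this replaces the first write
        out._backward = _backward
        return out

x = OverwritingValue(3.0)
total = x + x

total.grad = 1.0
total._backward()

print(f"x.grad with '=': {x.grad}  (the correct value is 2.0)")
```

The second assignment replaces the first, so one of the two paths through `x` is silently dropped.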

## 5. Verification with PyTorch

Rebuild the same graph $L = (a \times b) + c$ in PyTorch with `requires_grad=True`, call `.backward()`, and compare gradients.

In [None]:
import torch

a_t = torch.tensor(2.0, requires_grad=True)
b_t = torch.tensor(-3.0, requires_grad=True)
c_t = torch.tensor(10.0, requires_grad=True)
d_t = a_t * b_t
L_t = d_t + c_t

L_t.backward()

print("PyTorch gradients:")
print(f"  a.grad = {a_t.grad.item()}")
print(f"  b.grad = {b_t.grad.item()}")
print(f"  c.grad = {c_t.grad.item()}")
print()
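# Section 4 rebound `a`, so the Section 2 results are hard-coded here rather than read from the Value objects.
print("Our micrograd (from Section 2): a.grad=-3, b.grad=2, c.grad=1")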
print("Match:", a_t.grad.item() == -3.0 and b_t.grad.item() == 2.0 and c_t.grad.item() == 1.0)

## 6. Complex Expression — $L = (a \times b + c) \times (a + b)$

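Compute gradients for a multi-branch expression in which `a` and `b` each feed two different nodes, so their gradients accumulate across paths. Because `_backward` uses `+=`, every backward pass must start from zero gradients: either create fresh `Value` objects (as done below) or zero the `.grad` of every node in an existing graph, which is what the `reset_grads` helper is for. Then verify against PyTorch.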
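
The expected gradients can be worked out by hand first, which gives something to compare the code against. Writing $u = a \times b + c$ and $v = a + b$ (names introduced here for brevity) so that $L = u \times v$, and evaluating at $a = 2$, $b = -3$, $c = 10$ (so $u = 4$, $v = -1$):

$$
\frac{\partial L}{\partial a} = b\,v + u = (-3)(-1) + 4 = 7, \qquad
\frac{\partial L}{\partial b} = a\,v + u = (2)(-1) + 4 = 2, \qquad
\frac{\partial L}{\partial c} = v = -1
$$

The two terms in $\frac{\partial L}{\partial a}$ and $\frac{\partial L}{\partial b}$ correspond to the two paths from each variable to $L$, one through $u$ and one through $v$.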

In [None]:
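# Helper for re-running backward on an existing graph; not called below because we build fresh Values.
def reset_grads(*values):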
    for v in values:
        v.grad = 0.0

a = Value(2.0)
b = Value(-3.0)
c = Value(10.0)

# L = (a*b + c) * (a + b)
ab = a * b
ab_plus_c = ab + c
a_plus_b = a + b
L = ab_plus_c * a_plus_b

print("Forward: L = (a*b + c) * (a + b)")
print(f"  ab = {ab.data}, ab+c = {ab_plus_c.data}, a+b = {a_plus_b.data}")
print(f"  L = {L.data}")
print()

# Manual backward in reverse topological order: a node's _backward runs only
# after every node that consumes it has run.
# L first, then its inputs ab_plus_c and a_plus_b, then ab (which feeds ab_plus_c).
L.grad = 1.0
L._backward()       # → ab_plus_c, a_plus_b
ab_plus_c._backward()  # → ab, c
a_plus_b._backward()   # → a, b
ab._backward()         # → a, b

print("Our gradients:")
print(f"  a.grad = {a.grad}")
print(f"  b.grad = {b.grad}")
print(f"  c.grad = {c.grad}")
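
The `reset_grads` helper matters as soon as a graph is reused. The sketch below runs the manual backward pass twice on one graph, first without and then with a reset. It redefines a trimmed stand-in class (`MiniValue`, a hypothetical name) and uses separate variable names so it runs on its own; the `manual_backward` helper is also an assumption made for this example.

```python
class MiniValue:
    """Trimmed stand-in for Value: data, grad, and + / * with _backward closures."""
    def __init__(self, data):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None

    def __add__(self, other):
        out = MiniValue(self.data + other.data)
        def _backward():
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        out = MiniValue(self.data * other.data)
        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

def reset_grads(*values):
    for v in values:
        v.grad = 0.0

x, y, z = MiniValue(2.0), MiniValue(-3.0), MiniValue(10.0)
xy = x * y
u = xy + z
v = x + y
loss = u * v

def manual_backward():
    """Seed the output and call each _backward in reverse topological order."""
    loss.grad = 1.0
    for node in (loss, u, v, xy):
        node._backward()
    return (x.grad, y.grad, z.grad)

first = manual_backward()
stale = manual_backward()             # no reset: old gradients are still in place
reset_grads(x, y, z, xy, u, v, loss)  # zero every node, intermediates included
clean = manual_backward()

print("first pass:   ", first)
print("without reset:", stale)
print("after reset:  ", clean)
```

Note that the reset covers the intermediate nodes as well as the leaves. Their `.grad` values also accumulate, which is why the second pass yields `(24.0, 4.0, -3.0)` rather than a simple doubling of `(7.0, 2.0, -1.0)`.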

In [None]:
# PyTorch verification for L = (a*b + c) * (a + b)
a_t = torch.tensor(2.0, requires_grad=True)
b_t = torch.tensor(-3.0, requires_grad=True)
c_t = torch.tensor(10.0, requires_grad=True)

ab_t = a_t * b_t
ab_plus_c_t = ab_t + c_t
a_plus_b_t = a_t + b_t
L_t = ab_plus_c_t * a_plus_b_t

L_t.backward()

print("PyTorch gradients:")
print(f"  a.grad = {a_t.grad.item()}")
print(f"  b.grad = {b_t.grad.item()}")
print(f"  c.grad = {c_t.grad.item()}")
print()
print("Match:", abs(a.grad - a_t.grad.item()) < 1e-6 and abs(b.grad - b_t.grad.item()) < 1e-6 and abs(c.grad - c_t.grad.item()) < 1e-6)
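
When PyTorch is not installed, a finite-difference estimate offers a dependency-free cross-check of the same gradients. This is a minimal sketch on plain floats; `numeric_grad`, the function `f`, and the step size `h` are illustrative assumptions rather than part of micrograd.

```python
def f(a, b, c):
    """The Section 6 expression, evaluated on plain floats."""
    return (a * b + c) * (a + b)

def numeric_grad(func, args, h=1e-6):
    """Central-difference estimate of the gradient of func at args."""
    grads = []
    for i in range(len(args)):
        up = list(args)
        down = list(args)
        up[i] += h
        down[i] -= h
        grads.append((func(*up) - func(*down)) / (2 * h))
    return grads

da, db, dc = numeric_grad(f, (2.0, -3.0, 10.0))

print(f"numeric: dL/da ≈ {da:.4f}, dL/db ≈ {db:.4f}, dL/dc ≈ {dc:.4f}")
```

The estimates should land close to the backpropagated values, within floating-point error.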

---

**Building LLMs from Scratch** · Day 2 of 80

| [← Day 1: Forward Pass](llm_day01_forward_pass.ipynb) | [Day 2 article](https://omkarray.com/llm-day2.html) | [Day 3: Topological Sort →](llm_day03_topological_sort.ipynb) |