# Backpropagation & Computation Graphs

## Derivatives: Intuition and Calculation

Imagine a simple cost function:  

$$
J(w) = w^2.
$$

At $w=3$, we have $J(3)=9$. If we increase $w$ by a small amount $\epsilon$ (for example, $\epsilon=0.001$), then $w$ becomes $3.001$, and  

$$
J(3.001) = 9.006001.
$$

The change in $J$ is $0.006001$, which is about $6\epsilon$. In other words, near this point the cost changes about six times as fast as $w$ does, so

$$
\frac{dJ}{dw} \approx 6,
$$

when $w=3$. More generally, by calculus we know:  

$$
\frac{dJ}{dw} = 2w.
$$

Thus, when $w=2$, $\frac{dJ}{dw}=4$, and when $w=-3$, $\frac{dJ}{dw}=-6$. The derivative tells us how sensitive $J$ is to changes in $w$, much like a hill's slope indicates how steep it is.
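The nudge-and-measure argument above can be reproduced in a few lines. The following is a minimal sketch; the helper name `numerical_derivative` is an illustrative choice, not something defined in the text.

```python
def J(w):
    """Cost function from the text: J(w) = w^2."""
    return w ** 2


def numerical_derivative(f, w, eps=1e-3):
    """Forward-difference estimate of df/dw: (f(w + eps) - f(w)) / eps."""
    return (f(w + eps) - f(w)) / eps


for w in (3.0, 2.0, -3.0):
    estimate = numerical_derivative(J, w)
    exact = 2 * w  # analytic derivative dJ/dw = 2w
    print(f"w = {w:+.1f}: estimate = {estimate:.4f}, exact = {exact:.4f}")
```

The estimates differ from $2w$ by roughly $\epsilon$, and the gap shrinks as $\epsilon$ gets smaller.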

---

## The Computation Graph Concept

A computation graph represents the sequence of operations in a neural network. Instead of writing one long equation, the network’s operations are broken into smaller, manageable steps. Consider a simple network defined by:  

$$
a = wx + b,
$$

with a cost function:  

$$
J = \frac{1}{2}(a - y)^2.
$$

We can break this computation into these steps:
- **Step 1:** Compute an intermediate value $c = w \times x$.
- **Step 2:** Compute the output $a = c + b$.
- **Step 3:** Compute the error $d = a - y$.
- **Step 4:** Compute the cost $J = \frac{1}{2}d^2$.

During **forward propagation**, we calculate these steps from input to output. In **backpropagation**, we reverse the process, starting from the cost $J$ and working backwards. At each node, we use the chain rule to determine how a small change in that node affects $J$. For instance, a change in $d$ affects $J$ by  

$$
\frac{\partial J}{\partial d} = d \quad \text{(since $J = \tfrac{1}{2}d^2$)}.
$$

This information is then used to compute the gradient with respect to earlier variables, eventually yielding the gradients with respect to $w$ and $b$.
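As a rough illustration of the four forward steps and the first backward step, here is a small sketch. The numeric values for $w$, $x$, $b$, and $y$ are assumed for demonstration only; the text does not specify them.

```python
def forward(w, x, b, y):
    """Forward pass through the four nodes; returns every intermediate value."""
    c = w * x          # Step 1: intermediate value
    a = c + b          # Step 2: output
    d = a - y          # Step 3: error
    J = 0.5 * d ** 2   # Step 4: cost
    return c, a, d, J


# Illustrative values (assumed, not taken from the text).
c, a, d, J = forward(w=2.0, x=3.0, b=1.0, y=5.0)

# First backward step: J = (1/2) d^2, so dJ/dd = d.
dJ_dd = d

print(f"c = {c}, a = {a}, d = {d}, J = {J}, dJ/dd = {dJ_dd}")
```

Keeping the intermediate values around matters: the backward pass reads them instead of recomputing them.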

---

## A Larger Neural Network Example

Consider a neural network with one hidden layer. Suppose we have:

Input $x = 1$ and target $y = 5$.

Hidden layer computation:  

$$z_1 = w_1 \cdot x + b_1,\quad a_1 = g(z_1),$$  
  
where $g(z)$ is the ReLU activation, $g(z)=\max(0,z)$. If $w_1=2$ and $b_1=0$, then $z_1=2$ and $a_1=2$.

Output layer computation:  

$$z_2 = w_2 \cdot a_1 + b_2,\quad a_2 = g(z_2).$$  
  
If $w_2=3$ and $b_2=1$, then $z_2=7$ and $a_2=7$.

Cost calculation:  

$$J = \frac{1}{2}(a_2 - y)^2 = \frac{1}{2}(7-5)^2 = 2.$$

The computation graph now includes nodes for:
- Multiplying $w_1$ and $x$,
- Adding $b_1$ to get $z_1$,
- Applying ReLU to obtain $a_1$,
- Multiplying $w_2$ and $a_1$,
- Adding $b_2$ to get $z_2$,
- Applying ReLU to obtain $a_2$, and
- Computing the final cost $J$.

Backpropagation begins at the cost $J$ and computes gradients at each node using the chain rule. For example, a small increase in $w_1$ will affect $z_1$, then $a_1$, and so on until it changes $J$. Combining these effects gives $\frac{\partial J}{\partial w_1}$. Here both ReLU units are active ($z_1 > 0$ and $z_2 > 0$), so each contributes a factor of $1$, and $\frac{\partial J}{\partial w_1} = (a_2 - y) \cdot w_2 \cdot x = 2 \cdot 3 \cdot 1 = 6$. A numerical check confirms this: increasing $w_1$ by $\epsilon = 0.001$ raises $J$ from $2$ to about $2.0060$, a change of roughly $6\epsilon$.
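The sketch below walks this example network forward and then backward, and compares $\frac{\partial J}{\partial w_1}$ against a finite-difference estimate. Function names such as `relu_grad` and `backward` are illustrative, and treating the ReLU derivative at exactly $z=0$ as $0$ is an assumed convention.

```python
def relu(z):
    return max(0.0, z)


def relu_grad(z):
    # Derivative of ReLU; the value used at z == 0 is a convention (0 here).
    return 1.0 if z > 0 else 0.0


def forward(w1, b1, w2, b2, x, y):
    z1 = w1 * x + b1
    a1 = relu(z1)
    z2 = w2 * a1 + b2
    a2 = relu(z2)
    J = 0.5 * (a2 - y) ** 2
    return z1, a1, z2, a2, J


def backward(w1, b1, w2, b2, x, y):
    """Backward pass: each gradient reuses the one computed just before it."""
    z1, a1, z2, a2, _ = forward(w1, b1, w2, b2, x, y)
    dJ_da2 = a2 - y
    dJ_dz2 = dJ_da2 * relu_grad(z2)
    dJ_dw2 = dJ_dz2 * a1
    dJ_db2 = dJ_dz2
    dJ_da1 = dJ_dz2 * w2
    dJ_dz1 = dJ_da1 * relu_grad(z1)
    dJ_dw1 = dJ_dz1 * x
    dJ_db1 = dJ_dz1
    return {"w1": dJ_dw1, "b1": dJ_db1, "w2": dJ_dw2, "b2": dJ_db2}


# Values from the example: x = 1, y = 5, w1 = 2, b1 = 0, w2 = 3, b2 = 1.
params = dict(w1=2.0, b1=0.0, w2=3.0, b2=1.0, x=1.0, y=5.0)
grads = backward(**params)

# Numerical check: nudge w1 by eps and measure the change in J.
eps = 1e-3
J_base = forward(**params)[-1]
J_nudged = forward(**{**params, "w1": params["w1"] + eps})[-1]
numeric_dw1 = (J_nudged - J_base) / eps

print(grads)
print(f"numeric dJ/dw1 = {numeric_dw1:.4f}")
```

Note how `dJ_dz2` is computed once and then reused for `dJ_dw2`, `dJ_db2`, and `dJ_da1`.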

---

## Efficiency and Automatic Differentiation

Backpropagation is efficient because it reuses intermediate gradient calculations rather than computing them independently for every parameter. In a network with $n$ nodes and $p$ parameters, backpropagation takes roughly $O(n+p)$ steps rather than $O(n \times p)$, which is essential for large models.

Modern frameworks like TensorFlow and PyTorch implement **automatic differentiation (autodiff)**. Autodiff automatically constructs the computation graph and applies backpropagation to compute all the necessary gradients. This automation spares researchers from manually deriving gradients and helps in building complex models more reliably.
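To make the idea concrete, here is a toy reverse-mode autodiff sketch in plain Python. It is not how TensorFlow or PyTorch are implemented; the `Value` class and its attributes are hypothetical names used only to show a graph being recorded during the forward pass and traversed in reverse.

```python
class Value:
    """Scalar that remembers how it was computed so gradients can flow back."""

    def __init__(self, data, parents=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents          # nodes this value was computed from
        self._local_grads = local_grads  # d(self)/d(parent), one per parent

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __sub__(self, other):
        return Value(self.data - other.data, (self, other), (1.0, -1.0))

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other),
                     (other.data, self.data))

    def backward(self):
        # Order the graph so every node comes after the nodes it depends on.
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node._parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0
        # Walk from the output back to the inputs, applying the chain rule.
        for node in reversed(order):
            for parent, local in zip(node._parents, node._local_grads):
                parent.grad += node.grad * local


# Illustrative values (assumed): J = (1/2) * (w*x + b - y)^2.
w, x, b, y = Value(2.0), Value(3.0), Value(1.0), Value(5.0)
d = w * x + b - y
J = Value(0.5) * d * d
J.backward()

print(f"J = {J.data}, dJ/dw = {w.grad}, dJ/db = {b.grad}")
```

The user only writes the forward expression; the gradients for every input fall out of a single call to `backward()`.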


# Detailed Backpropagation Example with the Chain Rule



## The Example Network

Consider a network with one input, one weight, and one bias. The network computes an output $a$ from an input $x$ as follows:

$$
a = wx + b.
$$

The cost function (or loss) is given by:

$$
J = \frac{1}{2}(a - y)^2,
$$

where $y$ is the true target value.

---

## Breaking Down the Computation: The Computation Graph

We can break the computation into four sequential steps (nodes):

1. **Compute the weighted input:**  

$$ c = w \times x $$

2. **Add the bias:**  

$$ a = c + b $$

3. **Compute the error (difference):**  

$$ d = a - y $$

4. **Compute the cost:**  

$$ J = \frac{1}{2}d^2 $$

A simplified diagram of the computation graph is:

```
w --+
    +--> [*] --> c --+
x --+                +--> [+] --> a --+
                b ---+                +--> [-] --> d --> [(1/2) d^2] --> J
                                 y ---+
```

---

## Applying the Chain Rule in Backpropagation

The **chain rule** lets us compute the derivative of a composite function. If a variable $z$ depends on $y$, and $y$ depends on $x$, then:

$$
\frac{dz}{dx} = \frac{dz}{dy} \cdot \frac{dy}{dx}.
$$
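A quick numerical sanity check of the chain rule is sketched below. The specific functions $y = 3x + 1$ and $z = y^2$ are assumed purely for illustration.

```python
def y_of_x(x):
    return 3 * x + 1


def z_of_y(y):
    return y ** 2


x0 = 2.0
y0 = y_of_x(x0)

dz_dy = 2 * y0   # derivative of y^2 with respect to y
dy_dx = 3.0      # derivative of 3x + 1 with respect to x
chain_rule = dz_dy * dy_dx

# Finite-difference estimate of dz/dx through the whole composition.
eps = 1e-6
numeric = (z_of_y(y_of_x(x0 + eps)) - z_of_y(y_of_x(x0))) / eps

print(f"chain rule: {chain_rule}, numeric: {numeric:.4f}")
```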

Our goal is to compute the gradient $\frac{\partial J}{\partial w}$, i.e., how a small change in $w$ affects the cost $J$.

### Step-by-Step Backpropagation

**Step 1: From $J$ to $d$**

At Node 4, the cost is:

$$
J = \frac{1}{2}d^2.
$$

Differentiate with respect to $d$:

$$
\frac{\partial J}{\partial d} = d.
$$

- *Interpretation:* A small change $\delta d$ in $d$ causes a change in $J$ of approximately $d \cdot \delta d$.

---

**Step 2: From $d$ to $a$**

At Node 3, the error is:

$$
d = a - y.
$$

Differentiate with respect to $a$:

$$
\frac{\partial d}{\partial a} = 1.
$$

By the chain rule:

$$
\frac{\partial J}{\partial a} = \frac{\partial J}{\partial d} \cdot \frac{\partial d}{\partial a} = d \cdot 1 = d.
$$

- *Interpretation:* A change in $a$ changes $d$ by exactly the same amount, so $J$ is just as sensitive to $a$ as it is to $d$.

---

**Step 3: From $a$ to $c$**

At Node 2, we have:

$$
a = c + b.
$$

Differentiate with respect to $c$:

$$
\frac{\partial a}{\partial c} = 1.
$$

So:

$$
\frac{\partial J}{\partial c} = \frac{\partial J}{\partial a} \cdot \frac{\partial a}{\partial c} = d \cdot 1 = d.
$$

---

**Step 4: From $c$ to $w$**

At Node 1, $c$ is computed as:

$$
c = w \times x.
$$

Differentiate with respect to $w$:

$$
\frac{\partial c}{\partial w} = x.
$$

Finally, by the chain rule:

$$
\frac{\partial J}{\partial w} = \frac{\partial J}{\partial c} \cdot \frac{\partial c}{\partial w} = d \cdot x.
$$

- *Interpretation:* The sensitivity of $J$ with respect to $w$ is the product of the propagated error $d$ and the input $x$.

---

**Bonus: Gradient with Respect to $b$**

Since $a = c + b$, differentiating with respect to $b$ gives:

$$
\frac{\partial a}{\partial b} = 1.
$$

Thus:

$$
\frac{\partial J}{\partial b} = \frac{\partial J}{\partial a} \cdot \frac{\partial a}{\partial b} = d \cdot 1 = d.
$$
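Putting the steps together, the following sketch computes both gradients in one backward pass and compares them with central finite differences. The input values are assumed for illustration.

```python
def cost(w, b, x, y):
    return 0.5 * (w * x + b - y) ** 2


def gradients(w, b, x, y):
    # Forward pass (Nodes 1 to 3); the cost itself is not needed for gradients.
    c = w * x
    a = c + b
    d = a - y
    # Backward pass, one node at a time.
    dJ_dd = d               # Step 1: J = (1/2) d^2
    dJ_da = dJ_dd * 1.0     # Step 2: d = a - y
    dJ_dc = dJ_da * 1.0     # Step 3: a = c + b
    dJ_dw = dJ_dc * x       # Step 4: c = w * x
    dJ_db = dJ_da * 1.0     # Bonus:  a = c + b
    return dJ_dw, dJ_db


# Illustrative values (assumed, not taken from the text).
w, b, x, y = 2.0, 1.0, 3.0, 5.0
dJ_dw, dJ_db = gradients(w, b, x, y)

eps = 1e-6
num_dw = (cost(w + eps, b, x, y) - cost(w - eps, b, x, y)) / (2 * eps)
num_db = (cost(w, b + eps, x, y) - cost(w, b - eps, x, y)) / (2 * eps)

print(f"dJ/dw = {dJ_dw} (numeric {num_dw:.6f})")
print(f"dJ/db = {dJ_db} (numeric {num_db:.6f})")
```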

---

## Updating Parameters Using Gradients

Once we have the gradients, we update the parameters using a method such as gradient descent. For a learning rate $\alpha$, the update rules are:

$$
w \leftarrow w - \alpha \frac{\partial J}{\partial w} \quad \text{and} \quad b \leftarrow b - \alpha \frac{\partial J}{\partial b}.
$$
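A minimal training loop using these update rules might look like the sketch below. The single training example, the starting values, the learning rate, and the step count are all assumed for illustration.

```python
def train(x, y, w=0.0, b=0.0, alpha=0.05, steps=20):
    """Plain gradient descent on J = (1/2) * (w*x + b - y)^2."""
    for _ in range(steps):
        d = (w * x + b) - y   # forward pass down to the error
        dJ_dw = d * x         # gradients from backpropagation
        dJ_db = d
        # Both gradients are computed before either parameter is changed.
        w -= alpha * dJ_dw
        b -= alpha * dJ_db
    return w, b


x, y = 3.0, 5.0
w, b = train(x, y)
final_cost = 0.5 * (w * x + b - y) ** 2

print(f"w = {w:.4f}, b = {b:.4f}, J = {final_cost:.2e}")
```

With these particular values each update multiplies the error $d$ by $1 - \alpha(x^2 + 1) = 0.5$, so the cost shrinks quickly. A learning rate above $0.2$ would make that factor larger than $1$ in magnitude, and the updates would diverge.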

---

## Recap: The Role of the Chain Rule

The chain rule allows us to break a complex derivative into simpler parts. In our example:
- We first computed how $J$ changes with $d$.
- Then we traced how $d$ depends on $a$, how $a$ depends on $c$, and how $c$ depends on $w$.
- Multiplying these derivatives together gave us the gradient with respect to $w$:

$$
\frac{\partial J}{\partial w} = \frac{\partial J}{\partial d} \cdot \frac{\partial d}{\partial a} \cdot \frac{\partial a}{\partial c} \cdot \frac{\partial c}{\partial w} = d \cdot 1 \cdot 1 \cdot x = d \cdot x.
$$

This systematic backtracking through the computation graph is the essence of backpropagation. It enables neural networks to efficiently compute gradients and update parameters during training.
