## Backpropagation



### Gradients in a Two-Layer Neural Network

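For the network
$
f(x; \mathbf{w}) = \text{softmax}(\mathbf{W}_2 \, \text{ReLU}(\mathbf{W}_1 x + \mathbf{b}_1) + \mathbf{b}_2)
$
the chain-rule factors below use intermediate quantities $h_1$ and $h_2$. They are read here as $h_1 = \text{ReLU}(\mathbf{W}_1 x + \mathbf{b}_1)$ (hidden activation) and $h_2 = \mathbf{W}_2 h_1 + \mathbf{b}_2$ (logits), so that $f = \text{softmax}(h_2)$; this reading is consistent with the code cell further down. CE denotes the cross-entropy loss against the target $y$, and $\mathbf{w}$ collects all parameters $\mathbf{W}_1, \mathbf{b}_1, \mathbf{W}_2, \mathbf{b}_2$.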
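A minimal NumPy sketch of this forward pass may help make the composition concrete. The layer sizes (4 inputs, 3 hidden units, 2 classes), the random parameter values, and the helper names `relu`, `softmax`, and `forward` are illustrative assumptions, not values from the exercise.

```python
import numpy as np

def relu(z):
    # Elementwise max(z, 0)
    return np.maximum(z, 0.0)

def softmax(logits):
    # Subtract the max logit for numerical stability
    e = np.exp(logits - np.max(logits))
    return e / e.sum()

def forward(x, W1, b1, W2, b2):
    # f(x; w) = softmax(W2 ReLU(W1 x + b1) + b2)
    hidden = relu(W1 @ x + b1)
    logits = W2 @ hidden + b2
    return softmax(logits)

# Hypothetical parameters: 4 inputs, 3 hidden units, 2 classes
rng = np.random.default_rng(0)
x = rng.normal(size=4)
W1, b1 = rng.normal(size=(3, 4)), np.zeros(3)
W2, b2 = rng.normal(size=(2, 3)), np.zeros(2)

probs = forward(x, W1, b1, W2, b2)
print("class probabilities:", probs, "sum:", probs.sum())
```

The output is a probability vector over the classes, which is what the cross-entropy loss consumes.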

#### Gradient with respect to $\mathbf{b}_1$:

$
\frac{\partial \text{CE}(y, f(x; \mathbf{w}))}{\partial \mathbf{b}_1} =
\frac{\partial \text{CE}}{\partial f} \cdot
\frac{\partial f}{\partial h_2} \cdot
\frac{\partial h_2}{\partial h_1} \cdot
\frac{\partial h_1}{\partial \mathbf{b}_1}
$
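As a sketch, each factor in this product can be written out explicitly. This assumes $h_1 = \text{ReLU}(z_1)$ with pre-activation $z_1 = \mathbf{W}_1 x + \mathbf{b}_1$, logits $h_2 = \mathbf{W}_2 h_1 + \mathbf{b}_2$, and a one-hot target vector $e_y$ (the symbols $z_1$ and $e_y$ are introduced here for illustration). Jacobians are in numerator layout, so the result is a row vector:

$
\frac{\partial \text{CE}}{\partial f} \cdot \frac{\partial f}{\partial h_2} = (f - e_y)^\top, \qquad
\frac{\partial h_2}{\partial h_1} = \mathbf{W}_2, \qquad
\frac{\partial h_1}{\partial \mathbf{b}_1} = \text{diag}(\mathbb{1}[z_1 > 0])
$

$
\frac{\partial \text{CE}}{\partial \mathbf{b}_1} = (f - e_y)^\top \, \mathbf{W}_2 \, \text{diag}(\mathbb{1}[z_1 > 0])
$

The first two factors are usually combined because the softmax Jacobian and the cross-entropy derivative collapse to $f - e_y$. The last factor zeroes the gradient for every hidden unit whose pre-activation is not positive.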

#### Gradient with respect to $\mathbf{W}_2$:

$
\frac{\partial \text{CE}(y, f(x; \mathbf{w}))}{\partial \mathbf{W}_2} =
\frac{\partial \text{CE}}{\partial f} \cdot
\frac{\partial f}{\partial h_2} \cdot
\frac{\partial h_2}{\partial \mathbf{W}_2}
$
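For $\mathbf{W}_2$ the product reduces to an outer product of the logit error $f - e_y$ with the hidden activation. The sketch below computes that analytic gradient and compares it with a central finite-difference estimate. The layer sizes, random parameters, target index, and the helper `loss` are hypothetical choices made only for the check.

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - np.max(logits))
    return e / e.sum()

def loss(x, y, W1, b1, W2, b2):
    # Cross-entropy of softmax(W2 ReLU(W1 x + b1) + b2) against class index y
    hidden = np.maximum(W1 @ x + b1, 0.0)
    probs = softmax(W2 @ hidden + b2)
    return -np.log(probs[y])

# Hypothetical setup: 4 inputs, 3 hidden units, 2 classes
rng = np.random.default_rng(0)
x, y = rng.normal(size=4), 1
W1, b1 = rng.normal(size=(3, 4)), rng.normal(size=3)
W2, b2 = rng.normal(size=(2, 3)), rng.normal(size=2)

# Analytic gradient: (f - onehot(y)) outer hidden activation
hidden = np.maximum(W1 @ x + b1, 0.0)
probs = softmax(W2 @ hidden + b2)
delta = probs.copy()
delta[y] -= 1.0
grad_W2 = np.outer(delta, hidden)

# Central finite differences, one entry of W2 at a time
eps = 1e-6
num_W2 = np.zeros_like(W2)
for i in range(W2.shape[0]):
    for j in range(W2.shape[1]):
        Wp, Wm = W2.copy(), W2.copy()
        Wp[i, j] += eps
        Wm[i, j] -= eps
        num_W2[i, j] = (loss(x, y, W1, b1, Wp, b2)
                        - loss(x, y, W1, b1, Wm, b2)) / (2 * eps)

print("max abs difference:", np.max(np.abs(grad_W2 - num_W2)))
```

The gradient has the same shape as $\mathbf{W}_2$, and columns that correspond to inactive hidden units (activation 0) are exactly zero.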

In [1]:
import numpy as np

# --- 18b ---
# Given values
f = np.array([0.1, 0.6, 0.3])  # softmax output
y = 1  # target class index (Python 0-based, so y=2 in question means index 1)

# 1. Gradient of the cross-entropy wrt the logits (pre-softmax inputs):
#    for softmax followed by CE this simplifies to f - onehot(y)
dCE_dz = f.copy()
dCE_dz[y] -= 1
print("∂CE/∂z =", dCE_dz)

# 2. Jacobian of ReLU wrt b1, evaluated at the ReLU input z = W1 x + b1.
#    Since dz/db1 is the identity, this is diag(1[z > 0]).
z = np.array([-0.5, 0.3, 2])
dReLU_db1 = np.diag((z > 0).astype(int))
print("∂ReLU/∂b1 = diag", np.diag(dReLU_db1))

# --- 18c ---
# Training samples: (logits, true label)
samples = [
    (np.array([0.1, 0.5]), 0),   # question label y=1 -> index 0
    (np.array([-1, 1]), 1),      # question label y=2 -> index 1
    (np.array([0, -1]), 0)       # question label y=1 -> index 0
]

def softmax(logits):
    # Subtract the max logit for numerical stability; the result is unchanged
    e = np.exp(logits - np.max(logits))
    return e / e.sum()

cross_entropies = []
for logits, label in samples:
    probs = softmax(logits)
    ce = -np.log(probs[label])
    cross_entropies.append(ce)

mean_ce = np.mean(cross_entropies)
print("Mean cross entropy =", mean_ce)

∂CE/∂z = [ 0.1 -0.4  0.3]
∂ReLU/∂b1 = diag [0 1 1]
Mean cross entropy = 0.45106831698704936
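
The mean cross-entropy can also be computed without a Python loop. The sketch below stacks the same three samples into arrays and works with log-probabilities directly (a log-sum-exp formulation), which avoids taking the log of a probability that could underflow. The array names `logits`, `labels`, and `log_probs` are illustrative.

```python
import numpy as np

# Same three samples as in the cell above, stacked row-wise
logits = np.array([[0.1, 0.5],
                   [-1.0, 1.0],
                   [0.0, -1.0]])
labels = np.array([0, 1, 0])  # 0-based class indices

# log softmax = shifted logits - log(sum(exp(shifted logits))), per row
shifted = logits - logits.max(axis=1, keepdims=True)
log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

# Pick the log-probability of the true class in each row
ce = -log_probs[np.arange(len(labels)), labels]
mean_ce = ce.mean()
print("Mean cross entropy =", mean_ce)
```

Up to floating-point rounding, this should agree with the value printed by the loop version above (about 0.4511).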
