### 13. What Is Backpropagation?

Once we compute the **loss**, we need to understand how to change each **parameter** to reduce it. This is the goal of **backpropagation**.

**Backpropagation** is just a fancy name for using the **chain rule** from calculus to compute **gradients** of the loss with respect to weights and biases.

Let’s say the total loss is $\mathcal{L}$.

We want to compute:

- $\frac{\partial \mathcal{L}}{\partial W}$: How changing the weights affects loss
- $\frac{\partial \mathcal{L}}{\partial b}$: How changing biases affects loss

This allows us to **update** these parameters to make the network better.

We go **layer-by-layer from last to first**, applying the chain rule:

$$
\frac{\partial \mathcal{L}}{\partial W^{(2)}} = \frac{\partial \mathcal{L}}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z^{(2)}} \cdot \frac{\partial z^{(2)}}{\partial W^{(2)}}
$$

Where:
- $\hat{y}$: network's prediction
- $z^{(2)}$: raw score before sigmoid
- $W^{(2)}$: weights of the second (output) layer
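To make the chain rule concrete, here is a minimal sketch for a single sample and a single sigmoid output neuron. All numeric values, and the helper names `sigmoid` and `bce`, are made up for illustration. It multiplies the three factors from the formula above and compares the result with a finite-difference estimate of the same gradient.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce(y_hat, y):
    # Binary cross-entropy for one sample
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# Hypothetical values: one sample, three inputs to the output neuron
a1 = np.array([0.5, -1.2, 0.3])   # activations from the previous layer
W2 = np.array([0.1, 0.4, -0.7])   # output-layer weights
b2 = 0.05                         # output-layer bias
y = 1.0                           # true label

# Forward pass
z2 = a1 @ W2 + b2
y_hat = sigmoid(z2)

# The three chain-rule factors
dL_dyhat = -y / y_hat + (1 - y) / (1 - y_hat)
dyhat_dz = y_hat * (1 - y_hat)
dz_dW = a1
grad_chain = dL_dyhat * dyhat_dz * dz_dW

# Finite-difference estimate of the same gradient (central differences)
eps = 1e-6
grad_numeric = np.zeros_like(W2)
for i in range(W2.size):
    W_plus, W_minus = W2.copy(), W2.copy()
    W_plus[i] += eps
    W_minus[i] -= eps
    loss_plus = bce(sigmoid(a1 @ W_plus + b2), y)
    loss_minus = bce(sigmoid(a1 @ W_minus + b2), y)
    grad_numeric[i] = (loss_plus - loss_minus) / (2 * eps)

print("chain rule:       ", grad_chain)
print("finite difference:", grad_numeric)
```

The two printed vectors should agree to several decimal places. For this loss and activation pair, the product of the first two factors also simplifies to $\hat{y} - y$.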


### 14. Gradients at Each Layer

Let’s go step-by-step:

1. **Loss Layer (Binary Crossentropy)**
   - We already derived:
     $$
     \frac{\partial \mathcal{L}}{\partial \hat{y}} = -\frac{y}{\hat{y}} + \frac{1 - y}{1 - \hat{y}}
     $$

2. **Sigmoid Layer**
   - Sigmoid's derivative:
     $$
     \frac{\partial \hat{y}}{\partial z} = \hat{y}(1 - \hat{y})
     $$

3. **Dense Layer**
   - For weights:
     $$
     \frac{\partial z}{\partial W} = a^{(1)} \quad (\text{input from previous layer})
     $$
   - For biases:
     $$
     \frac{\partial z}{\partial b} = 1
     $$

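All these are computed in code as:
- `dvalues`: gradient of the loss with respect to this layer's output, passed back from the next layer
- `dweights = inputs.T @ dvalues`
- `dbiases = np.sum(dvalues, axis=0, keepdims=True)` (summed over samples, so each neuron keeps its own bias gradient)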
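The shapes involved are easier to see with a small batch. The sketch below assumes weights are stored with shape `(n_inputs, n_neurons)`, which is the layout that `inputs.T @ dvalues` implies. The array sizes and random values are arbitrary. It also computes `dinputs`, the gradient handed to the previous layer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical batch: 4 samples, 3 input features, 2 neurons
inputs = rng.normal(size=(4, 3))
weights = rng.normal(size=(3, 2))   # assumed layout: (n_inputs, n_neurons)
dvalues = rng.normal(size=(4, 2))   # gradient arriving from the next layer

# Gradient w.r.t. weights: same shape as weights
dweights = inputs.T @ dvalues

# Gradient w.r.t. biases: sum over the sample axis, one value per neuron
dbiases = np.sum(dvalues, axis=0, keepdims=True)

# Gradient w.r.t. inputs: passed back to the previous layer
dinputs = dvalues @ weights.T

print(dweights.shape, dbiases.shape, dinputs.shape)
```

Each gradient has the same shape as the quantity it belongs to, which is a quick sanity check when writing a `backward` method by hand.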


In [None]:
# Backward pass
loss_function.backward(activation2.output, y)
activation2.backward(loss_function.dinputs)
dense2.backward(activation2.dinputs)
activation1.backward(dense2.dinputs)
dense1.backward(activation1.dinputs)


### 15. Why Are Gradients Important?

Gradients tell us:

- Which direction (increase or decrease) to change weights
- By how much

If $\frac{\partial \mathcal{L}}{\partial w}$ is:
- Positive → reduce the weight
- Negative → increase the weight

This is why we use:

$$
w \leftarrow w - \eta \cdot \frac{\partial \mathcal{L}}{\partial w}
$$

Where $\eta$ is the learning rate.

This update step is done by an **optimizer**.
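As a small illustration of the update rule, the sketch below runs plain gradient descent on a toy one-parameter loss, $(w - 3)^2$, chosen only because its minimum is known. The learning rate, step count, and starting points are arbitrary choices. Starting to the left of the minimum gives a negative gradient, so the weight increases. Starting to the right gives a positive gradient, so the weight decreases.

```python
def loss(w):
    # Toy loss with its minimum at w = 3
    return (w - 3.0) ** 2

def grad(w):
    # Derivative of the toy loss
    return 2.0 * (w - 3.0)

eta = 0.1      # learning rate (arbitrary choice)
finals = []

for start in (0.0, 6.0):
    w = start
    for step in range(50):
        # Move against the gradient
        w = w - eta * grad(w)
    finals.append(w)
    print(f"start={start}, final w={w:.4f}, loss={loss(w):.6f}")
```

Both runs should end close to $w = 3$, approached from opposite sides.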


### 16. Optimizer: Adam

**Adam** is short for **Ada**ptive **M**oment Estimation.

It combines:
- **Momentum**: smooths gradient using moving average
- **RMSprop**: adapts learning rate for each parameter

Adam keeps:
- **m**: moving average of gradients (1st moment)
- **v**: moving average of squared gradients (2nd moment)

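At each step $t$, with gradient $g_t$ and decay rates $\beta_1$ and $\beta_2$, both averages are updated:

$$
m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t, \quad v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2
$$

Because $m$ and $v$ start at zero, they are biased toward zero in early steps. Adam corrects for this: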

$$
\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}
$$

Update rule:

$$
\theta \leftarrow \theta - \eta \cdot \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
$$

Where:
- $\theta$: parameter (weight or bias)
- $\eta$: learning rate
- $\epsilon$: small constant to avoid division by zero
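The update can be sketched in a few lines of NumPy. This is an illustrative version, not the optimizer class used in this notebook: the function name `adam_step` is hypothetical, and the default values for `beta1`, `beta2`, and `eps` are commonly used settings assumed here. The toy objective is the sum of squared parameters, whose gradient is $2\theta$.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-7):
    """One Adam update (hypothetical helper; t starts at 1)."""
    # Moving averages of the gradient and the squared gradient
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    # Bias correction for the zero initialization of m and v
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Parameter update, scaled separately for each parameter
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Toy problem: minimize sum(theta**2)
theta = np.array([1.0, -2.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)

for t in range(1, 2001):
    grad = 2 * theta
    theta, m, v = adam_step(theta, grad, m, v, t, eta=0.01)

print(theta)
```

On the very first step the bias-corrected ratio is close to the sign of the gradient, so every parameter moves by roughly $\eta$ regardless of how large its gradient is. This is the per-parameter scaling inherited from RMSprop.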


### 17. Training Loop (Putting Everything Together)

Each epoch:
1. Forward pass through all layers
2. Compute loss and accuracy
3. Backward pass through all layers
4. Optimizer updates weights

These steps repeat for many epochs (full passes over the dataset), gradually improving the model.


In [None]:
for epoch in range(10001):
    # Forward pass through both dense layers and their activations
    dense1.forward(X)
    activation1.forward(dense1.output)
    dense2.forward(activation1.output)
    activation2.forward(dense2.output)

    # Total loss = data loss + regularization penalties from both dense layers
    data_loss = loss_function.calculate(activation2.output, y)
    regularization_loss = loss_function.regularization_loss(dense1) + loss_function.regularization_loss(dense2)
    loss = data_loss + regularization_loss

    # Threshold the sigmoid output at 0.5 to get class predictions (0 or 1)
    predictions = (activation2.output > 0.5) * 1
    accuracy = np.mean(predictions == y)

    # Log progress every 100 epochs
    if epoch % 100 == 0:
        print(f'epoch: {epoch}, acc: {accuracy:.3f}, loss: {loss:.3f}, lr: {optimizer.current_learning_rate}')

    # Backward
    loss_function.backward(activation2.output, y)
    activation2.backward(loss_function.dinputs)
    dense2.backward(activation2.dinputs)
    activation1.backward(dense2.dinputs)
    dense1.backward(activation1.dinputs)

    # Update
    optimizer.pre_update_params()
    optimizer.update_params(dense1)
    optimizer.update_params(dense2)
    optimizer.post_update_params()


### 18. Regularization

Regularization discourages weights from growing too large.

We add penalties to the loss function:

#### L2 (Ridge) Regularization:

$$
\mathcal{L}_{reg} = \lambda \sum w^2
$$

#### L1 (Lasso) Regularization:

$$
\mathcal{L}_{reg} = \lambda \sum |w|
$$

Here $\lambda$ sets the strength of the penalty. Penalizing large weights helps the model **generalize better** rather than memorize the training data.

The total loss becomes:

$$
\mathcal{L}_{total} = \mathcal{L}_{data} + \mathcal{L}_{reg}
$$
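The penalties and their gradients are short to compute. The sketch below uses a made-up weight matrix, a made-up $\lambda$ (`lam`), and a placeholder data loss. It applies both penalties at once only to show the arithmetic, and it shows the gradient term each penalty adds to `dweights` during the backward pass.

```python
import numpy as np

# Hypothetical weights and regularization strength
weights = np.array([[0.5, -1.0],
                    [2.0,  0.0]])
lam = 0.01

# Penalty terms added to the loss
l2_penalty = lam * np.sum(weights ** 2)       # 0.01 * (0.25 + 1 + 4 + 0)
l1_penalty = lam * np.sum(np.abs(weights))    # 0.01 * (0.5 + 1 + 2 + 0)

# Gradient contributions added to dweights in the backward pass
dl2 = 2 * lam * weights          # grows with the size of the weight
dl1 = lam * np.sign(weights)     # constant size, sign of the weight (0 at w = 0)

data_loss = 0.4                  # placeholder value for illustration
total_loss = data_loss + l2_penalty + l1_penalty

print(f"L2: {l2_penalty:.4f}, L1: {l1_penalty:.4f}, total: {total_loss:.4f}")
```

The L2 gradient shrinks large weights fastest, while the L1 gradient pushes every non-zero weight toward zero by the same amount. This is why L1 tends to produce sparse weights.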


### 19. Validation Step

We evaluate model performance on **unseen test data** to check how well it generalizes. No backward pass or parameter update happens here.

Steps:
1. Forward pass on test data
2. Compute test loss and accuracy


In [None]:
X_test, y_test = spiral_data(samples=100, classes=2)
y_test = y_test.reshape(-1, 1)

dense1.forward(X_test)
activation1.forward(dense1.output)
dense2.forward(activation1.output)
activation2.forward(dense2.output)

loss = loss_function.calculate(activation2.output, y_test)
predictions = (activation2.output > 0.5) * 1
accuracy = np.mean(predictions == y_test)
print(f'validation, acc: {accuracy:.3f}, loss: {loss:.3f}')
