<a href="https://colab.research.google.com/github/Redcoder815/Deep_Learning_Python/blob/main/BackPropagation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import numpy as np

# 1. Activation function & its derivative
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

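def sigmoid_derivative(x):
    # x must already be a sigmoid output a = sigmoid(z);
    # a * (1 - a) is then the derivative sigmoid'(z).
    return x * (1 - x)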

# 2. Dataset (XOR problem)
X = np.array([[0,0], [0,1], [1,0], [1,1]])
y = np.array([[0], [1], [1], [0]])

# 3. Initialization
input_layer_neurons = 2
hidden_layer_neurons = 2
output_neurons = 1

# Random weights and biases
weights_hidden = np.random.uniform(size=(input_layer_neurons, hidden_layer_neurons))
weights_output = np.random.uniform(size=(hidden_layer_neurons, output_neurons))
bias_hidden = np.random.uniform(size=(1, hidden_layer_neurons))
bias_output = np.random.uniform(size=(1, output_neurons))

learning_rate = 0.1

# 4. Training loop
for epoch in range(10000):
    # --- Forward Pass ---
    hidden_layer_input = np.dot(X, weights_hidden) + bias_hidden
    hidden_layer_activation = sigmoid(hidden_layer_input)

    output_layer_input = np.dot(hidden_layer_activation, weights_output) + bias_output
    predicted_output = sigmoid(output_layer_input)

    # --- Backward Pass ---
    # Calculate error at output
    error = y - predicted_output
    d_predicted_output = error * sigmoid_derivative(predicted_output)

    # Calculate error at hidden layer (backpropagate the error)
    error_hidden_layer = d_predicted_output.dot(weights_output.T)
    d_hidden_layer = error_hidden_layer * sigmoid_derivative(hidden_layer_activation)

    # --- Update Weights & Biases ---
    weights_output += hidden_layer_activation.T.dot(d_predicted_output) * learning_rate
    weights_hidden += X.T.dot(d_hidden_layer) * learning_rate
    bias_output += np.sum(d_predicted_output, axis=0, keepdims=True) * learning_rate
    bias_hidden += np.sum(d_hidden_layer, axis=0, keepdims=True) * learning_rate

# Task
Explain the mathematical derivations for updating the weights and biases of a neural network (both hidden and output layers) using gradient descent and backpropagation, showing how these derivations correspond to the Python code provided in the notebook.

## Introduction to Gradient Descent

### Subtask:
Explain the general principle of gradient descent, which is used to iteratively adjust weights and biases to minimize the error (loss function).


### What is Gradient Descent?

Gradient Descent is an optimization algorithm used to find the minimum of a function. In the context of neural networks, this function is typically the **loss function** (or cost function), which quantifies how far off the network's predictions are from the actual target values. The goal of training a neural network is to minimize this loss.

#### How it Works: The Iterative Process

Gradient Descent operates in an iterative manner. It repeatedly adjusts the network's parameters (weights and biases) to gradually move towards the minimum of the loss function. Think of it like a hiker trying to get down a mountain in a dense fog: they can't see the bottom, but they can feel the slope. To descend, they take a step in the steepest downward direction. Similarly, Gradient Descent calculates the 'slope' of the loss function with respect to each parameter (this 'slope' is called the **gradient**).

For each iteration:
1.  **Calculate the Gradient**: The gradient of the loss function is computed for all weights and biases. This gradient indicates the direction of the steepest ascent of the loss function.
2.  **Update Parameters**: To minimize the loss, the parameters are adjusted in the **opposite direction** of the gradient. This means if the gradient is positive, the parameter is decreased, and if it's negative, the parameter is increased.

#### The Role of the Learning Rate

The **learning rate** is a crucial hyperparameter that controls the step size during each iteration of Gradient Descent. When updating the parameters, the gradient is multiplied by the learning rate:

`new_parameter = old_parameter - (learning_rate * gradient)`

*   A **large learning rate** can cause the algorithm to overshoot the minimum, potentially leading to oscillations or even divergence.
*   A **small learning rate** ensures slow, steady convergence but can make the training process very time-consuming.

Finding an appropriate learning rate is essential for efficient and effective training.
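As a minimal sketch of the update rule above, the snippet below runs gradient descent on a toy one-parameter loss, $f(w) = (w - 3)^2$. This loss is not part of the notebook; it is chosen only because its minimum ($w = 3$) is known. The function name `gradient_descent` and the three learning rates are illustrative assumptions.

```python
def gradient_descent(learning_rate, steps=50, w=0.0):
    """Minimise the toy loss f(w) = (w - 3)**2, starting from w."""
    for _ in range(steps):
        gradient = 2 * (w - 3)            # df/dw
        w = w - learning_rate * gradient  # step against the gradient
    return w

w_small = gradient_descent(learning_rate=0.01)  # small steps: still far from 3
w_good = gradient_descent(learning_rate=0.1)    # ends very close to 3
w_large = gradient_descent(learning_rate=1.1)   # overshoots more each step

print(w_small, w_good, w_large)
```

Each step multiplies the distance to the minimum by `1 - 2 * learning_rate`, which is why a rate of 1.1 (factor -1.2) diverges while 0.01 (factor 0.98) shrinks the distance only slowly.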

#### Gradient Descent in Neural Networks

In a neural network, Gradient Descent is the core mechanism by which the network 'learns'. After a forward pass computes an output and the loss is calculated, the backward pass (backpropagation) computes the gradients of the loss with respect to each weight and bias in the network. Gradient Descent then uses these gradients, scaled by the learning rate, to update the weights and biases. This process is repeated over many epochs (passes through the entire dataset), either for a fixed number of epochs (10000 in the code above) or until the loss stops improving. Gradient Descent is not guaranteed to reach the global minimum, so the final predictions are only as close to the targets as the minimum it finds allows.

## Output Layer Weight Update (weights_output)

### Subtask:
Detail the mathematical derivation for updating the weights connecting the hidden layer to the output layer (`weights_output`). Show how the gradient of the loss with respect to these weights is calculated using the chain rule, and then map this to the Python code: `weights_output += hidden_layer_activation.T.dot(d_predicted_output) * learning_rate`.


### Derivation Steps:

#### 1. Define Loss Function and Forward Pass for the Output Layer

The code never defines a loss explicitly; it only computes `error = y - predicted_output`. The updates it performs are, however, exactly gradient descent on the half squared error, summed over the samples in the batch. For a single sample, this loss `L` is:

$$L = \frac{1}{2}(y - \text{predicted_output})^2$$

The factor $\frac{1}{2}$ cancels the 2 produced by differentiating the square, so the derivative of $L$ with respect to the output is just `predicted_output - y`, the negative of the code's `error`. Let's proceed with the equations for the output layer:

The input to the output layer neurons (`output_layer_input`, often denoted as $Z_{\text{output}}$) is calculated as:

$$Z_{\text{output}} = \text{hidden_layer_activation} \cdot \text{weights_output} + \text{bias_output}$$

The activation of the output layer neurons (`predicted_output`, often denoted as $A_{\text{output}}$) is obtained by applying the sigmoid activation function to $Z_{\text{output}}$:

$$A_{\text{output}} = \text{sigmoid}(Z_{\text{output}})$$

$$A_{\text{output}} = \text{predicted_output}$$


#### 2. Apply Chain Rule for $\frac{\partial L}{\partial \text{weights_output}}$

To update `weights_output`, we need to calculate the gradient of the loss function ($L$) with respect to these weights. We use the chain rule for this:

$$\frac{\partial L}{\partial \text{weights_output}} = \frac{\partial L}{\partial \text{predicted_output}} \cdot \frac{\partial \text{predicted_output}}{\partial \text{output_layer_input}} \cdot \frac{\partial \text{output_layer_input}}{\partial \text{weights_output}}$$

Let's break down each component:

**Component 1: $\frac{\partial L}{\partial \text{predicted_output}}$**

Given $L = \frac{1}{2}(y - \text{predicted_output})^2$, the derivative with respect to `predicted_output` is:

$$\frac{\partial L}{\partial \text{predicted_output}} = \frac{\partial}{\partial \text{predicted_output}} \left( \frac{1}{2}(y - \text{predicted_output})^2 \right) = (y - \text{predicted_output}) \cdot (-1) = - (y - \text{predicted_output})$$

Which simplifies to:

$$\frac{\partial L}{\partial \text{predicted_output}} = \text{predicted_output} - y$$

However, in the provided code, `error = y - predicted_output`. Therefore, $\frac{\partial L}{\partial \text{predicted_output}} = -\text{error}$.

**Component 2: $\frac{\partial \text{predicted_output}}{\partial \text{output_layer_input}}$**

`predicted_output` is the result of applying the sigmoid function to `output_layer_input`:

$$\text{predicted_output} = \text{sigmoid}(\text{output_layer_input})$$

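The derivative of the sigmoid function is $\text{sigmoid}'(x) = \text{sigmoid}(x) (1 - \text{sigmoid}(x))$. The code's `sigmoid_derivative(x)` computes `x * (1 - x)`, so it returns this derivative only when it is passed the sigmoid *output* rather than the raw input, which is why the code calls it on `predicted_output`. So, let $A_{\text{output}} = \text{predicted_output}$, then: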

$$\frac{\partial \text{predicted_output}}{\partial \text{output_layer_input}} = \text{sigmoid}'(\text{output_layer_input}) = \text{predicted_output} \cdot (1 - \text{predicted_output})$$
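A quick numerical check of this identity, as a sketch: the snippet compares $a(1 - a)$ with a central-difference estimate of the sigmoid's slope. The test point `z = 0.7` and the step `eps` are arbitrary choices, and the scalar `math`-based `sigmoid` is a stand-in for the notebook's NumPy version.

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def sigmoid_derivative(a):
    # Same formula as the notebook: `a` must already be sigmoid(z).
    return a * (1 - a)

z = 0.7
a = sigmoid(z)
eps = 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)  # central difference

analytic = sigmoid_derivative(a)  # correct: pass the activation
wrong = sigmoid_derivative(z)     # incorrect: pass the raw input z

print(numeric, analytic, wrong)
```

Passing the raw input `z` instead of the activation gives a visibly different number, which is the mistake the function's name makes easy to commit.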

**Component 3: $\frac{\partial \text{output_layer_input}}{\partial \text{weights_output}}$**

From the forward pass equation:

$$Z_{\text{output}} = \text{output_layer_input} = \text{hidden_layer_activation} \cdot \text{weights_output} + \text{bias_output}$$

Taking the derivative with respect to `weights_output`:

$$\frac{\partial \text{output_layer_input}}{\partial \text{weights_output}} = \text{hidden_layer_activation}$$

This is because `hidden_layer_activation` are the inputs to these weights.

#### 3. Connecting `d_predicted_output` to the Chain Rule Components

In the Python code, `d_predicted_output` is calculated as:

```python
d_predicted_output = error * sigmoid_derivative(predicted_output)
```

Let's map this to our derived components:

*   `error = y - predicted_output`. As we saw earlier, $\frac{\partial L}{\partial \text{predicted_output}} = - (y - \text{predicted_output}) = -\text{error}$. Therefore, `error` here corresponds to $- \frac{\partial L}{\partial \text{predicted_output}}$.
*   `sigmoid_derivative(predicted_output)` corresponds to $\frac{\partial \text{predicted_output}}{\partial \text{output_layer_input}}$.

So, if we consider `d_predicted_output` as a term representing the error signal propagated back through the output activation function, it is effectively:

$$\text{d_predicted_output} = (y - \text{predicted_output}) \cdot \text{predicted_output} \cdot (1 - \text{predicted_output})$$

This is equivalent to: $-\frac{\partial L}{\partial \text{predicted_output}} \cdot \frac{\partial \text{predicted_output}}{\partial \text{output_layer_input}}$ (when the loss is implicitly taken as half squared error, and the negative sign is absorbed into the error definition).

More precisely, backpropagation's error signal for the output layer is $\frac{\partial L}{\partial \text{output_layer_input}}$, the product of the first two components of our chain rule derivation:

$$\frac{\partial L}{\partial \text{output_layer_input}} = \frac{\partial L}{\partial \text{predicted_output}} \cdot \frac{\partial \text{predicted_output}}{\partial \text{output_layer_input}}$$

Given `error = y - predicted_output`, we have $\frac{\partial L}{\partial \text{predicted_output}} = -\text{error}$. Therefore,

$$\frac{\partial L}{\partial \text{output_layer_input}} = -\text{error} \cdot \text{sigmoid_derivative(predicted_output)}$$

The code's `d_predicted_output` is `error * sigmoid_derivative(predicted_output)`, so it equals $-\frac{\partial L}{\partial \text{output_layer_input}}$: the *negative* of the gradient with respect to the output layer's input. The update rule `weights_output += ...` compensates for this sign, because adding a multiple of the negative gradient is the same as subtracting a multiple of the gradient, which is the direction that reduces the loss.

#### 4. Full Update Rule for `weights_output`

Combining all the components of the chain rule, the gradient of the loss with respect to `weights_output` is:

$$\frac{\partial L}{\partial \text{weights_output}} = \left(-\text{error} \cdot \text{sigmoid_derivative(predicted_output)}\right) \cdot \text{hidden_layer_activation}$$

Substituting `d_predicted_output = error * sigmoid_derivative(predicted_output)`, which equals $-\frac{\partial L}{\partial \text{output_layer_input}}$, we get:

$$\frac{\partial L}{\partial \text{weights_output}} = - \left(\text{d_predicted_output}\right) \cdot \text{hidden_layer_activation}$$

In matrix form, considering the dimensions and performing the dot product for all samples and neurons, this becomes:

$$\frac{\partial L}{\partial \text{weights_output}} = - \text{hidden_layer_activation}^T \cdot \text{d_predicted_output}$$

This is because `hidden_layer_activation` (shape `[num_samples, hidden_layer_neurons]`) needs to be transposed to `[hidden_layer_neurons, num_samples]` to correctly multiply with `d_predicted_output` (shape `[num_samples, output_neurons]`), resulting in a gradient of shape `[hidden_layer_neurons, output_neurons]`, which matches `weights_output`'s shape.
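The shape bookkeeping above can be checked with placeholder arrays. This is only a sketch: the arrays are filled with ones so that the summation over samples is easy to see, and the sizes (4 samples, 2 hidden neurons, 1 output neuron) are taken from the notebook's XOR setup.

```python
import numpy as np

num_samples, hidden_layer_neurons, output_neurons = 4, 2, 1

# Placeholder arrays with the notebook's shapes; the values are arbitrary.
hidden_layer_activation = np.ones((num_samples, hidden_layer_neurons))
d_predicted_output = np.ones((num_samples, output_neurons))
weights_output = np.zeros((hidden_layer_neurons, output_neurons))

update = hidden_layer_activation.T.dot(d_predicted_output)

print(update.shape)  # same shape as weights_output
print(update)        # each entry is a sum over the 4 samples
```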

Now, let's look at the Python update rule:

```python
weights_output += hidden_layer_activation.T.dot(d_predicted_output) * learning_rate
```

This is the gradient descent update rule for the weights:

$$W_{\text{new}} = W_{\text{old}} - \eta \frac{\partial L}{\partial W}$$

Where:
*   $W_{\text{old}}$ is `weights_output`
*   $\eta$ is the `learning_rate`
*   $\frac{\partial L}{\partial W}$ is the gradient $\frac{\partial L}{\partial \text{weights_output}}$

Comparing the mathematical derivation to the code:

*   `hidden_layer_activation.T` corresponds to the $\frac{\partial \text{output_layer_input}}{\partial \text{weights_output}}$ term, as shown in Component 3, and it is transposed to handle the matrix multiplication correctly for all samples.
*   `d_predicted_output` effectively represents $-\frac{\partial L}{\partial \text{output_layer_input}}$. (The negative sign is absorbed because we are *adding* the term in the update, which means we are moving in the direction of the *negative* gradient to minimize loss).
*   The `.dot()` operation performs the matrix multiplication, which sums each weight's per-sample gradient contributions over all samples in the batch.
*   `learning_rate` ($\eta$) controls the step size of the update.

Therefore, the line `weights_output += hidden_layer_activation.T.dot(d_predicted_output) * learning_rate` directly implements the gradient descent update for `weights_output`.
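One way to confirm the sign convention is a finite-difference gradient check. The sketch below assumes the loss is the half squared error summed over the batch (the notebook never states its loss), uses random stand-in values for `hidden_layer_activation`, and introduces a hypothetical helper `loss` for the check.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def loss(weights_output, hidden_layer_activation, bias_output, y):
    # Assumed loss: half squared error summed over all samples.
    predicted = sigmoid(hidden_layer_activation.dot(weights_output) + bias_output)
    return 0.5 * np.sum((y - predicted) ** 2)

rng = np.random.default_rng(0)
hidden_layer_activation = rng.uniform(size=(4, 2))
weights_output = rng.uniform(size=(2, 1))
bias_output = rng.uniform(size=(1, 1))
y = np.array([[0.0], [1.0], [1.0], [0.0]])

# The notebook's quantities
predicted_output = sigmoid(hidden_layer_activation.dot(weights_output) + bias_output)
d_predicted_output = (y - predicted_output) * predicted_output * (1 - predicted_output)
code_update = hidden_layer_activation.T.dot(d_predicted_output)

# Central-difference estimate of dL/d(weights_output)
eps = 1e-6
numeric_grad = np.zeros_like(weights_output)
for i in range(weights_output.shape[0]):
    for j in range(weights_output.shape[1]):
        shift = np.zeros_like(weights_output)
        shift[i, j] = eps
        numeric_grad[i, j] = (
            loss(weights_output + shift, hidden_layer_activation, bias_output, y)
            - loss(weights_output - shift, hidden_layer_activation, bias_output, y)
        ) / (2 * eps)

print(code_update.ravel(), numeric_grad.ravel())
```

The two printed vectors should agree up to sign: the code's update term is the negative of the numerical gradient, which is why the notebook adds it.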

## Output Layer Bias Update (bias_output)

### Subtask:
Explain the mathematical derivation for updating the biases of the output layer (`bias_output`). Show how the gradient of the loss with respect to these biases is calculated, and then map this to the Python code: `bias_output += np.sum(d_predicted_output, axis=0, keepdims=True) * learning_rate`.


## Output Layer Bias Update (bias_output) Derivation

To understand how `bias_output` is updated, we first need to recall the forward pass calculation for the output layer's input. The `output_layer_input` is calculated as follows:

1.  **Output Layer Input Equation**

    The input to the output layer's activation function, often denoted as $Z_o$, is given by:

    $$ Z_o = A_h \cdot W_o + B_o $$

    Where:
    *   $A_h$ is the activation from the hidden layer (`hidden_layer_activation`).
    *   $W_o$ is the weights connecting the hidden layer to the output layer (`weights_output`).
    *   $B_o$ is the bias of the output layer (`bias_output`).

    In our Python code, this corresponds to:

    ```python
    output_layer_input = np.dot(hidden_layer_activation, weights_output) + bias_output
    ```

2.  **Applying the Chain Rule for $\frac{\partial L}{\partial B_o}$**

    We want to find the gradient of the loss function ($L$) with respect to the output layer biases ($B_o$). Using the chain rule, we can express this as:

    $$ \frac{\partial L}{\partial B_o} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial Z_o} \cdot \frac{\partial Z_o}{\partial B_o} $$

    Let's break down each term:

    *   **$\frac{\partial L}{\partial \hat{y}}$ (Derivative of Loss with respect to Predicted Output):**
        With the half squared error $L = \frac{1}{2}(y - \hat{y})^2$, this derivative is $\hat{y} - y$. In our code, `error` is `y - predicted_output`, so $\frac{\partial L}{\partial \hat{y}} = -\text{error}$.

    *   **$\frac{\partial \hat{y}}{\partial Z_o}$ (Derivative of Predicted Output with respect to Output Layer Input):**
        The `predicted_output` ($\hat{y}$) is the result of applying the sigmoid activation function to `output_layer_input` ($Z_o$).

        $$ \hat{y} = \text{sigmoid}(Z_o) $$

        The derivative of the sigmoid function, $\text{sigmoid}'(Z_o)$, is $\hat{y}(1 - \hat{y})$.

        So, $$ \frac{\partial \hat{y}}{\partial Z_o} = \hat{y}(1 - \hat{y}) $$

        In our code, this is `sigmoid_derivative(predicted_output)`. The product of this term with `error` is captured in `d_predicted_output`:

        ```python
        d_predicted_output = error * sigmoid_derivative(predicted_output)
        ```
        Because `error` is $-\frac{\partial L}{\partial \hat{y}}$, `d_predicted_output` represents $-\frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial Z_o} = -\frac{\partial L}{\partial Z_o}$, the negative of the gradient with respect to $Z_o$.

    *   **$\frac{\partial Z_o}{\partial B_o}$ (Derivative of Output Layer Input with respect to Output Bias):**
        From the output layer input equation: $Z_o = A_h \cdot W_o + B_o$.

        Taking the partial derivative with respect to $B_o$:

        $$ \frac{\partial Z_o}{\partial B_o} = \frac{\partial}{\partial B_o} (A_h \cdot W_o + B_o) = 0 + 1 = 1 $$

        This holds for each element of the bias vector. Effectively, this means the gradient flows directly to the bias without modification for each sample.

3.  **Combining Components and Mapping to Python Code**

    Now, let's combine the terms from the chain rule:

    $$ \frac{\partial L}{\partial B_o} = \frac{\partial L}{\partial Z_o} \cdot \frac{\partial Z_o}{\partial B_o} $$

    Because `error` is defined as `y - predicted_output`, `d_predicted_output` in the code equals $-\frac{\partial L}{\partial Z_o}$. And we found that $\frac{\partial Z_o}{\partial B_o} = 1$.

    So, the gradient of the loss with respect to the output layer bias for each sample is simply `-d_predicted_output`.

    However, `bias_output` is a single row vector that applies to all samples. Therefore, to get the total gradient for `bias_output`, we need to sum the gradients for each sample. This is why `np.sum` is used:

    $$ \frac{\partial L}{\partial B_o} = -\sum_{i=1}^{m} (\text{d_predicted_output})_i $$

    Where $m$ is the number of samples (rows in `X`).

    In Python, this summation across all samples (rows) is achieved by:

    ```python
    np.sum(d_predicted_output, axis=0, keepdims=True)
    ```

    *   `axis=0`: Sums down the rows, that is, over the samples, giving one total per bias term.
    *   `keepdims=True`: Keeps the result two-dimensional, with shape `(1, output_neurons)`, the same shape as `bias_output`.

    Finally, the update rule for gradient descent is to subtract the learning rate times the gradient from the current bias value. Since the summed `d_predicted_output` is the negative gradient, subtracting the gradient is the same as adding this sum:

    $$ B_o^{new} = B_o^{old} - \text{learning_rate} \cdot \frac{\partial L}{\partial B_o} = B_o^{old} + \text{learning_rate} \cdot \sum_{i=1}^{m} (\text{d_predicted_output})_i $$
    This directly translates to the Python code:

    ```python
    bias_output += np.sum(d_predicted_output, axis=0, keepdims=True) * learning_rate
    ```

    This completes the derivation and explanation for the update of `bias_output`.
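A compact way to see why the bias term is a plain sum: since $\frac{\partial Z_o}{\partial B_o} = 1$ for every sample, the bias behaves like a weight attached to a constant input of 1. The sketch below, with made-up values for `d_predicted_output`, shows that `np.sum(..., axis=0, keepdims=True)` matches multiplying by a row of ones.

```python
import numpy as np

# Hypothetical per-sample error signals for 4 samples and 1 output neuron.
d_predicted_output = np.array([[0.1], [-0.2], [0.3], [0.05]])

summed = np.sum(d_predicted_output, axis=0, keepdims=True)

# Same result written like the weight update: (inputs).T.dot(delta),
# where the "input" feeding the bias is a column of ones.
ones_input = np.ones((d_predicted_output.shape[0], 1))
via_ones = ones_input.T.dot(d_predicted_output)

print(summed, via_ones)
```

Seen this way, the bias update and the weight update are the same rule applied to different inputs.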

## Hidden Layer Weight Update (weights_hidden)

### Subtask:
Detail the mathematical derivation for updating the weights connecting the input layer to the hidden layer (`weights_hidden`). This will involve backpropagating the error from the output layer to the hidden layer, calculating the gradient, and then mapping this to the Python code: `weights_hidden += X.T.dot(d_hidden_layer) * learning_rate`.


## Hidden Layer Weight Update (weights_hidden) Mathematical Derivation

To update the weights connecting the input layer to the hidden layer (`weights_hidden`), we need to apply the backpropagation algorithm. This involves calculating the gradient of the loss function with respect to `weights_hidden` and then adjusting the weights in the opposite direction of the gradient.

Let's denote:
*   `X` as the input (features)
*   `W_h` as `weights_hidden`
*   `b_h` as `bias_hidden`
*   `A_h` as `hidden_layer_activation`
*   `W_o` as `weights_output`
*   `b_o` as `bias_output`
*   `A_o` as `predicted_output`
*   `y` as the true labels
*   `L` as the Loss function (e.g., Mean Squared Error)
*   `f(z)` as the sigmoid activation function
*   `f'(z)` as the derivative of the sigmoid activation function

### 1. Forward Pass for the Hidden Layer

First, let's establish the forward pass calculations for the hidden layer:

**a. Hidden Layer Input (`hidden_layer_input`)**
The input to the hidden layer is the weighted sum of the inputs from the input layer plus the hidden layer biases:

$$ Z_h = X \cdot W_h + b_h $$

In the provided Python code, this is:
`hidden_layer_input = np.dot(X, weights_hidden) + bias_hidden`

**b. Hidden Layer Activation (`hidden_layer_activation`)**
The activated output of the hidden layer is obtained by applying the sigmoid activation function to `hidden_layer_input`:

$$ A_h = f(Z_h) $$

In the provided Python code, this is:
`hidden_layer_activation = sigmoid(hidden_layer_input)`

### 2. Backpropagation of Error to the Hidden Layer (`error_hidden_layer`)

The error needs to be propagated from the output layer back to the hidden layer. We already have `d_predicted_output`, which is the negative of the derivative of the loss with respect to the output layer's input: $ -\frac{\partial L}{\partial A_o} \cdot \frac{\partial A_o}{\partial Z_o} = -\frac{\partial L}{\partial Z_o} $ (the minus sign comes from defining `error` as `y - predicted_output`). For simplicity, let's denote this as $ \delta_o $ (delta output).

To find the error signal for the hidden layer, we need to consider how the output error influences the hidden layer. This is done by multiplying the output error signal by the transpose of the output weights:

$$ \text{Error Signal at Hidden Layer} = \delta_o \cdot W_o^T $$

In the Python code, this corresponds to `error_hidden_layer`:

`error_hidden_layer = d_predicted_output.dot(weights_output.T)`

This step effectively distributes the error from each output neuron back to the hidden neurons it's connected to, weighted by the strength of those connections.
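A small worked example of this distribution step, using made-up numbers (2 samples, 1 output neuron, 2 hidden neurons) rather than values from a real training run:

```python
import numpy as np

# Hypothetical values: 2 samples, 1 output neuron, 2 hidden neurons.
d_predicted_output = np.array([[0.1], [-0.2]])  # shape (2, 1)
weights_output = np.array([[0.5], [-1.0]])      # shape (2, 1)

# Each sample's output error is split between the hidden neurons
# in proportion to the weight connecting them to the output.
error_hidden_layer = d_predicted_output.dot(weights_output.T)  # shape (2, 2)

print(error_hidden_layer)
# [[ 0.05 -0.1 ]
#  [-0.1   0.2 ]]
```

The second hidden neuron has the larger weight in magnitude, so it receives the larger share of each sample's error, with the sign flipped because its weight is negative.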

### 3. Derivation for `d_hidden_layer`

The `d_hidden_layer` variable represents the error signal at the hidden layer's pre-activation input ($Z_h$). It's calculated by multiplying `error_hidden_layer` (the backpropagated error) element-wise by the derivative of the hidden layer's activation function. Because `error` is defined as `y - predicted_output`, $ \delta_o $ is the negative of $ \frac{\partial L}{\partial Z_o} $, and that sign carries through: `d_hidden_layer` is $ -\frac{\partial L}{\partial Z_h} $.

Let $ \delta_h $ be `d_hidden_layer`. We calculate it as:

$$ \delta_h = \text{Error Signal at Hidden Layer} \odot f'(Z_h) $$

Where $ f'(Z_h) $ is the derivative of the sigmoid function at the hidden layer's input. For the sigmoid, $ f'(Z_h) = A_h(1 - A_h) $. The code's `sigmoid_derivative(x)` takes the *activated* value as `x` and calculates $ x(1-x) $, which is exactly $ A_h(1-A_h) $.

So, mathematically:

$$ \delta_h = (\delta_o \cdot W_o^T) \odot (A_h \odot (1 - A_h)) $$

(where $ \odot $ denotes element-wise multiplication).

In the provided Python code, this is:

`d_hidden_layer = error_hidden_layer * sigmoid_derivative(hidden_layer_activation)`
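The relationship between `d_hidden_layer` and $\frac{\partial L}{\partial Z_h}$ can be checked numerically. This sketch assumes the half squared error summed over the batch, treats `hidden_layer_input` as the free variable, and uses random stand-in values; the helper `loss` is hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(2)
hidden_layer_input = rng.normal(size=(4, 2))
weights_output = rng.uniform(size=(2, 1))
bias_output = rng.uniform(size=(1, 1))
y = np.array([[0.0], [1.0], [1.0], [0.0]])

def loss(z_h):
    # Assumed loss: half squared error summed over all samples.
    predicted = sigmoid(sigmoid(z_h).dot(weights_output) + bias_output)
    return 0.5 * np.sum((y - predicted) ** 2)

# The notebook's backward pass
hidden_layer_activation = sigmoid(hidden_layer_input)
predicted_output = sigmoid(hidden_layer_activation.dot(weights_output) + bias_output)
d_predicted_output = (y - predicted_output) * predicted_output * (1 - predicted_output)
error_hidden_layer = d_predicted_output.dot(weights_output.T)
d_hidden_layer = error_hidden_layer * hidden_layer_activation * (1 - hidden_layer_activation)

# Central-difference estimate of dL/d(hidden_layer_input)
eps = 1e-6
numeric_grad = np.zeros_like(hidden_layer_input)
for idx in np.ndindex(*hidden_layer_input.shape):
    shift = np.zeros_like(hidden_layer_input)
    shift[idx] = eps
    numeric_grad[idx] = (
        loss(hidden_layer_input + shift) - loss(hidden_layer_input - shift)
    ) / (2 * eps)

print(np.max(np.abs(d_hidden_layer + numeric_grad)))  # should be close to 0
```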

### 4. Gradient of Loss Function with Respect to `weights_hidden`

Now we need to find $ \frac{\partial L}{\partial W_h} $. Using the chain rule, this can be expressed as:

$$ \frac{\partial L}{\partial W_h} = \frac{\partial L}{\partial Z_h} \cdot \frac{\partial Z_h}{\partial W_h} $$

We know that $ Z_h = X \cdot W_h + b_h $. The partial derivative of the $j$-th hidden input with respect to an individual weight $W_{ij}$ is the input $X_i$, so in matrix form the $ \frac{\partial Z_h}{\partial W_h} $ factor appears as a left-multiplication by $ X^T $.

And $ \frac{\partial L}{\partial Z_h} $ is the negative of what `d_hidden_layer` represents: $ \frac{\partial L}{\partial Z_h} = -\delta_h $ (the sign comes from defining `error` as `y - predicted_output`).

Therefore, the gradient of the loss with respect to `weights_hidden` is:

$$ \frac{\partial L}{\partial W_h} = -X^T \cdot \delta_h $$

### 5. Full Gradient of Loss with Respect to `weights_hidden` in Matrix Form

Combining the components, the full gradient in matrix form for `weights_hidden` is:

$$ \frac{\partial L}{\partial W_h} = -X^T \cdot \left[ \left( \left( (y - A_o) \odot A_o \odot (1 - A_o) \right) \cdot W_o^T \right) \odot A_h \odot (1 - A_h) \right] $$

This expression shows how the input `X` (transposed) contributes to the gradient, scaled by the error signal backpropagated to the hidden layer (`d_hidden_layer`, the bracketed term). The leading minus sign comes from defining the error as $ y - A_o $.

### 6. Mapping to Python Update Rule

The Python update rule for `weights_hidden` is:

`weights_hidden += X.T.dot(d_hidden_layer) * learning_rate`

Let's break down each part:

*   `X.T.dot(d_hidden_layer)`: This implements the matrix multiplication $ X^T \cdot \delta_h $. Because `error` is defined as `y - predicted_output`, this product is the *negative* of the gradient $ \frac{\partial L}{\partial W_h} $. The `X.T` (transpose of input `X`) is needed because `X` has dimensions (number of samples, input neurons) and `d_hidden_layer` has dimensions (number of samples, hidden neurons). For the dot product to produce a result with dimensions (input neurons, hidden neurons) matching `weights_hidden`, `X` must be transposed.

*   `* learning_rate`: This scales the result by the `learning_rate`, which determines the step size of each weight update.

*   `weights_hidden += ...`: Gradient descent subtracts `learning_rate` times the gradient. Since `X.T.dot(d_hidden_layer)` is already the negative gradient, adding it performs the same operation: `weights_hidden = weights_hidden - learning_rate * gradient`. Intuitively, if `y - predicted_output` is positive, `predicted_output` is too low, and the `+=` moves the weights in the direction that raises the output and so reduces the error.

This update rule effectively adjusts the `weights_hidden` based on how much each input contributed to the error propagated back through the hidden layer, scaled by the learning rate to manage the update magnitude.
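To illustrate why the update uses `+=`, the sketch below takes one step on `weights_hidden` in each direction and compares the loss. It assumes the half squared error summed over the batch and a fixed random seed (the notebook is unseeded); the helper `loss` is hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

rng = np.random.default_rng(0)
weights_hidden = rng.uniform(size=(2, 2))
bias_hidden = rng.uniform(size=(1, 2))
weights_output = rng.uniform(size=(2, 1))
bias_output = rng.uniform(size=(1, 1))
learning_rate = 0.1

def loss(w_h):
    # Assumed loss: half squared error summed over all samples.
    a_h = sigmoid(X.dot(w_h) + bias_hidden)
    predicted = sigmoid(a_h.dot(weights_output) + bias_output)
    return 0.5 * np.sum((y - predicted) ** 2)

# The notebook's backward pass for the hidden layer
hidden_layer_activation = sigmoid(X.dot(weights_hidden) + bias_hidden)
predicted_output = sigmoid(hidden_layer_activation.dot(weights_output) + bias_output)
d_predicted_output = (y - predicted_output) * predicted_output * (1 - predicted_output)
d_hidden_layer = (
    d_predicted_output.dot(weights_output.T)
    * hidden_layer_activation * (1 - hidden_layer_activation)
)
step = learning_rate * X.T.dot(d_hidden_layer)

loss_before = loss(weights_hidden)
loss_plus = loss(weights_hidden + step)   # what "+=" does
loss_minus = loss(weights_hidden - step)  # the opposite sign, for comparison

print(loss_minus, loss_before, loss_plus)
```

For a step this small, adding the term lowers the loss and subtracting it raises the loss, which matches the reading of `X.T.dot(d_hidden_layer)` as the negative gradient.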

## Hidden Layer Bias Update (bias_hidden)

### Subtask:
Explain the mathematical derivation for updating the biases of the hidden layer (`bias_hidden`). Show how the gradient of the loss with respect to these biases is calculated after backpropagating the error, and then map this to the Python code: `bias_hidden += np.sum(d_hidden_layer, axis=0, keepdims=True) * learning_rate`.


### Hidden Layer Bias Update (bias_hidden) Derivation

Let's break down the mathematical derivation for updating the biases of the hidden layer (`bias_hidden`) and connect it to the provided Python code.

#### 1. Forward Pass Equation for Hidden Layer Input ($Z_h$)

First, recall the forward pass calculation for the input to the hidden layer, often denoted as $Z_h$ (or `hidden_layer_input` in the code):

$Z_h = X \cdot W_h + b_h$

Where:
*   $X$ is the input data.
*   $W_h$ represents the weights connecting the input layer to the hidden layer (`weights_hidden`).
*   $b_h$ represents the biases of the hidden layer (`bias_hidden`).

This equation shows that `bias_hidden` directly contributes to the `hidden_layer_input` through an element-wise addition.

#### 2. Applying the Chain Rule to Derive the Gradient $\frac{\partial L}{\partial b_h}$

To update the hidden layer biases, we need to calculate the gradient of the loss function ($L$) with respect to $b_h$. We use the chain rule, starting from the error signal at the hidden layer's activation output, which is already calculated as `d_hidden_layer`. In mathematical terms, `d_hidden_layer` represents $\frac{\partial L}{\partial A_h} \cdot f'(Z_h) = \frac{\partial L}{\partial Z_h}$, where $A_h$ is the activated hidden layer output and $f'$ is the derivative of the activation function.

The chain rule for $\frac{\partial L}{\partial b_h}$ can be written as:

$\frac{\partial L}{\partial b_h} = \frac{\partial L}{\partial Z_h} \cdot \frac{\partial Z_h}{\partial b_h}$

#### 3. Detailing Each Component of the Chain Rule

*   **$\frac{\partial L}{\partial Z_h}$**: This term is directly given by `d_hidden_layer` from the backpropagation step. As mentioned in the notebook's code:
    `error_hidden_layer = d_predicted_output.dot(weights_output.T)`
    `d_hidden_layer = error_hidden_layer * sigmoid_derivative(hidden_layer_activation)`
    Thus, `d_hidden_layer` effectively represents the gradient of the loss with respect to the pre-activation input of the hidden layer ($Z_h$).

*   **$\frac{\partial Z_h}{\partial b_h}$**: From our forward pass equation $Z_h = X \cdot W_h + b_h$, if we take the partial derivative of $Z_h$ with respect to $b_h$:
    
    $\frac{\partial}{\partial b_h} (X \cdot W_h + b_h) = \frac{\partial}{\partial b_h} (X \cdot W_h) + \frac{\partial}{\partial b_h} (b_h)$

    Since $X \cdot W_h$ does not depend on $b_h$, its derivative with respect to $b_h$ is 0. The derivative of $b_h$ with respect to itself is 1 (for each element). Therefore, $\frac{\partial Z_h}{\partial b_h} = 1$.

Combining these, the per-sample gradient for each bias element in $b_h$ is $\frac{\partial L}{\partial b_h} = \frac{\partial L}{\partial Z_h} \cdot 1 = \frac{\partial L}{\partial Z_h}$, which the code stores with flipped sign as `d_hidden_layer`.
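The same result can be written per element, which also shows where the contributions of the individual training examples come from. The indices $i$ (training example), $j$ (hidden neuron), and $k$ (input feature), and the sample count $N$, are notation introduced here for illustration and do not appear in the notebook:

$Z_{h,ij} = \sum_{k} X_{ik} W_{h,kj} + b_{h,j}$

$\frac{\partial Z_{h,ij}}{\partial b_{h,j}} = 1$

$\frac{\partial L}{\partial b_{h,j}} = \sum_{i=1}^{N} \frac{\partial L}{\partial Z_{h,ij}} \cdot \frac{\partial Z_{h,ij}}{\partial b_{h,j}} = \sum_{i=1}^{N} \frac{\partial L}{\partial Z_{h,ij}}$

Because $b_{h,j}$ enters the pre-activation of neuron $j$ for every example, each of the $N$ rows contributes one term to the gradient of that bias.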

#### 4. Summing the Gradients for `bias_hidden`

The `d_hidden_layer` term is a matrix (or 2D array) where each row corresponds to a training example and each column corresponds to a hidden layer neuron. To update a single bias for a specific hidden neuron, we need the sum of the gradients across all training samples for that neuron. This is because a single bias term is applied across all input samples in a batch.

The Python code `np.sum(d_hidden_layer, axis=0, keepdims=True)` achieves this:
*   `axis=0`: This argument specifies that the sum should be performed along the first axis (rows), effectively summing the gradients for each hidden neuron across all training examples.
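*   `keepdims=True`: This keeps the result two-dimensional with shape `(1, hidden_layer_neurons)`, which matches the shape of `bias_hidden` exactly. Without it the result would have shape `(hidden_layer_neurons,)`. NumPy broadcasting would still accept that shape in this particular update, but keeping the dimension makes the shapes explicit and avoids surprises in other expressions.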

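So, `np.sum(d_hidden_layer, axis=0, keepdims=True)` calculates, for each hidden layer bias term, the error signal summed over all training examples. Under the code's sign convention this is the negative of the total gradient of the loss with respect to that bias.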
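The effect of `axis=0` and `keepdims=True` is easiest to see on a small array. The values below are illustrative assumptions, not output from the notebook.

```python
import numpy as np

# Illustrative values: 4 training examples (rows), 2 hidden neurons (columns).
d_hidden_layer = np.array([[ 0.01, -0.02],
                           [ 0.03,  0.04],
                           [-0.05,  0.06],
                           [ 0.02, -0.01]])

summed_keep = np.sum(d_hidden_layer, axis=0, keepdims=True)  # shape (1, 2)
summed_flat = np.sum(d_hidden_layer, axis=0)                 # shape (2,)

print(summed_keep.shape, summed_flat.shape)
print(summed_keep)

# Both shapes can be added in place to a (1, 2) bias array via broadcasting.
bias_a = np.zeros((1, 2))
bias_b = np.zeros((1, 2))
bias_a += summed_keep * 0.1
bias_b += summed_flat * 0.1
print(np.allclose(bias_a, bias_b))
```

Each column collapses to one number per hidden neuron; the only difference between the two calls is whether the leading axis of length 1 is kept.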

#### 5. Mapping to the Python Code

The full update rule in the Python code is:

`bias_hidden += np.sum(d_hidden_layer, axis=0, keepdims=True) * learning_rate`

*   `np.sum(d_hidden_layer, axis=0, keepdims=True)`: As explained above, this part sums the error signal over all training examples. Under the code's sign convention it equals $-\frac{\partial L}{\partial b_h}$, the negative gradient.
*   `* learning_rate`: The calculated gradient is multiplied by the `learning_rate` to control the step size during the optimization process. A larger learning rate means larger steps are taken in the direction of the negative gradient, while a smaller learning rate means smaller steps.
*   `bias_hidden += ...`: This is shorthand for `bias_hidden = bias_hidden + (...)`. Gradient descent normally *subtracts* the gradient, because we want to move in the direction of decreasing loss. Here the sign is already folded into the error term: `d_predicted_output` is defined as `error * sigmoid_derivative(...)` with `error = y - predicted_output`, which is the negative of the loss gradient at the output layer's pre-activation input for a squared-error loss. `d_hidden_layer` is derived from it by a matrix product and an element-wise multiplication, so it inherits the same flipped sign. Adding `np.sum(d_hidden_layer, ...) * learning_rate` is therefore equivalent to subtracting `learning_rate` times $\frac{\partial L}{\partial b_h}$, and the `+=` moves `bias_hidden` in the direction that reduces the loss.
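The sign argument can be checked numerically. The sketch below assumes a loss of half the sum of squared errors (the notebook's code never computes a loss value explicitly), uses a fixed random seed, and introduces the helper names `forward` and `loss` purely for illustration. It compares the notebook-style `+` update with a textbook-style `-` update built from $\frac{\partial L}{\partial Z_h}$.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def forward(X, W_h, b_h, W_o, b_o):
    # Hypothetical helper: returns hidden activations and predictions.
    A_h = sigmoid(X @ W_h + b_h)
    y_hat = sigmoid(A_h @ W_o + b_o)
    return A_h, y_hat

def loss(y, y_hat):
    # Assumed loss: half the sum of squared errors.
    return 0.5 * np.sum((y - y_hat) ** 2)

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)
W_h, b_h = rng.uniform(size=(2, 2)), rng.uniform(size=(1, 2))
W_o, b_o = rng.uniform(size=(2, 1)), rng.uniform(size=(1, 1))
learning_rate = 0.1

A_h, y_hat = forward(X, W_h, b_h, W_o, b_o)
loss_before = loss(y, y_hat)

# Notebook convention: error = y - prediction, update with "+".
d_predicted_output = (y - y_hat) * y_hat * (1 - y_hat)
d_hidden_layer = d_predicted_output.dot(W_o.T) * A_h * (1 - A_h)
step_notebook = np.sum(d_hidden_layer, axis=0, keepdims=True)
b_h_plus = b_h + learning_rate * step_notebook

# Textbook convention: true gradient dL/dZ, update with "-".
dL_dZo = (y_hat - y) * y_hat * (1 - y_hat)
dL_dZh = dL_dZo.dot(W_o.T) * A_h * (1 - A_h)
grad_b_h = np.sum(dL_dZh, axis=0, keepdims=True)
b_h_minus = b_h - learning_rate * grad_b_h

# Loss after updating only the hidden biases.
_, y_hat_after = forward(X, W_h, b_h_plus, W_o, b_o)
loss_after = loss(y, y_hat_after)

print(np.allclose(b_h_plus, b_h_minus))
print(loss_before, loss_after)
```

Both conventions produce the same new bias values, and the loss after the step should be slightly lower than before it for a small learning rate like this one.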

## Final Task

### Subtask:
Summarize the entire process of backpropagation for updating weights and biases, reinforcing the connection between the mathematical theory and its implementation.


## Summary:

This analysis explains the mathematical derivations behind updating weights and biases in a neural network with gradient descent and backpropagation, and links each derived component to its corresponding line of Python code.

### Data Analysis Key Findings

*   **Gradient Descent Principle**: Gradient Descent is an iterative optimization algorithm that minimizes a loss function by adjusting parameters (weights and biases) in the direction opposite to their gradient, with the `learning_rate` controlling the step size.
*   **Output Layer Weight Update (`weights_output`)**:
    *   The gradient of the loss with respect to `weights_output` is derived using the chain rule, involving the derivative of the loss, the sigmoid activation derivative, and the `hidden_layer_activation`.
    *   The Python term `d_predicted_output` is the backpropagated error signal at the output layer's pre-activation input. Because `error = y - predicted_output`, it equals $-\frac{\partial L}{\partial \text{output_layer_input}}$, the negative gradient.
    *   The update rule `weights_output += hidden_layer_activation.T.dot(d_predicted_output) * learning_rate` directly implements the gradient descent step, where `hidden_layer_activation.T.dot(d_predicted_output)` calculates the effective negative gradient for `weights_output`.
*   **Output Layer Bias Update (`bias_output`)**:
    *   The negative gradient for `bias_output` is the sum of `d_predicted_output` across all training samples, as the derivative of the output layer input with respect to its bias is 1.
    *   The Python code `bias_output += np.sum(d_predicted_output, axis=0, keepdims=True) * learning_rate` aggregates these terms over all samples and applies the update.
*   **Hidden Layer Weight Update (`weights_hidden`)**:
    *   The error signal for the hidden layer (`error_hidden_layer`) is calculated by backpropagating `d_predicted_output` through the output layer's `weights_output` (transposed).
    *   `d_hidden_layer` represents the negative gradient of the loss with respect to the hidden layer's pre-activation input ($Z_h$), obtained by multiplying `error_hidden_layer` by the `sigmoid_derivative` of the hidden layer activation.
    *   The update `weights_hidden += X.T.dot(d_hidden_layer) * learning_rate` corresponds to the gradient descent formula, where `X.T.dot(d_hidden_layer)` computes the negative gradient of the loss with respect to `weights_hidden`.
*   **Hidden Layer Bias Update (`bias_hidden`)**:
    *   Similar to the output layer, the negative gradient for `bias_hidden` is the sum of `d_hidden_layer` across all training samples.
    *   The code `bias_hidden += np.sum(d_hidden_layer, axis=0, keepdims=True) * learning_rate` sums these terms and scales them by the learning rate for the update.
*   **Sign Conventions**: The `+=` operation in the update rules performs gradient descent (minimizing loss) because the error terms (`d_predicted_output`, `d_hidden_layer`) are built from `error = y - predicted_output` and therefore already equal the *negative* gradient of the loss with respect to each layer's pre-activation input. Adding them is equivalent to subtracting the gradient.
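The four update terms summarized above can be checked against finite differences. This is a sketch under stated assumptions: the loss is taken to be half the sum of squared errors, and the helper names `loss`, `notebook_steps`, and `numerical_gradient` are hypothetical, introduced only for this check. If each term is the negative gradient, then adding the numerical gradient to it should give values close to zero.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def loss(params, X, y):
    # Assumed loss: half the sum of squared errors.
    W_h, b_h, W_o, b_o = params
    A_h = sigmoid(X @ W_h + b_h)
    y_hat = sigmoid(A_h @ W_o + b_o)
    return 0.5 * np.sum((y - y_hat) ** 2)

def notebook_steps(params, X, y):
    # Update terms exactly as the notebook builds them (before learning_rate).
    W_h, b_h, W_o, b_o = params
    A_h = sigmoid(X @ W_h + b_h)
    y_hat = sigmoid(A_h @ W_o + b_o)
    d_out = (y - y_hat) * y_hat * (1 - y_hat)
    d_hid = d_out.dot(W_o.T) * A_h * (1 - A_h)
    return [X.T.dot(d_hid),
            np.sum(d_hid, axis=0, keepdims=True),
            A_h.T.dot(d_out),
            np.sum(d_out, axis=0, keepdims=True)]

def numerical_gradient(params, index, X, y, eps=1e-6):
    # Central finite differences for one parameter array.
    grad = np.zeros_like(params[index])
    for pos in np.ndindex(grad.shape):
        original = params[index][pos]
        params[index][pos] = original + eps
        up = loss(params, X, y)
        params[index][pos] = original - eps
        down = loss(params, X, y)
        params[index][pos] = original
        grad[pos] = (up - down) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)
params = [rng.uniform(size=(2, 2)), rng.uniform(size=(1, 2)),
          rng.uniform(size=(2, 1)), rng.uniform(size=(1, 1))]
names = ["weights_hidden", "bias_hidden", "weights_output", "bias_output"]

steps = notebook_steps(params, X, y)
max_diffs = {}
for i, name in enumerate(names):
    numeric = numerical_gradient(params, i, X, y)
    # step + gradient should vanish if step is the negative gradient.
    max_diffs[name] = float(np.max(np.abs(steps[i] + numeric)))
    print(name, max_diffs[name])
```

The printed differences should be tiny compared with the gradients themselves, which supports reading every `+=` update in the notebook as a gradient descent step on this loss.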

### Insights or Next Steps

*   Breaking down the chain rule for each parameter shows how backpropagation computes gradients systematically, layer by layer, with each layer reusing the error signal already computed for the layer after it.
*   Understanding the specific roles of matrix transpositions (`.T`) and summation (`np.sum` with `axis` and `keepdims`) in the Python code is crucial for correctly aggregating gradients over batches and aligning matrix dimensions for updates.
