<a href="https://colab.research.google.com/github/Redcoder815/Deep_Learning_Python/blob/main/01BackPropagation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

import numpy as np


class NeuralNetwork:
    def __init__(self, input_size, hidden_size, output_size):
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.output_size = output_size

        self.weights_input_hidden = np.random.randn(
            self.input_size, self.hidden_size)
        self.weights_hidden_output = np.random.randn(
            self.hidden_size, self.output_size)

        self.bias_hidden = np.zeros((1, self.hidden_size))
        self.bias_output = np.zeros((1, self.output_size))

    def sigmoid(self, x):
        return 1 / (1 + np.exp(-x))

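    def sigmoid_derivative(self, x):
        # x is expected to be a sigmoid *output* a = sigmoid(z), so this returns
        # a * (1 - a), which equals the derivative of sigmoid with respect to z.
        return x * (1 - x)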

    def feedforward(self, X):
        # Hidden layer: pre-activation, then sigmoid
        self.hidden_activation = np.dot(
            X, self.weights_input_hidden) + self.bias_hidden
        self.hidden_output = self.sigmoid(self.hidden_activation)

        # Output layer: pre-activation, then sigmoid
        self.output_activation = np.dot(
            self.hidden_output, self.weights_hidden_output) + self.bias_output
        self.predicted_output = self.sigmoid(self.output_activation)

        return self.predicted_output

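    def backward(self, X, y, learning_rate):
        # y - prediction is the negative gradient of the squared error
        # (up to a constant factor), so the updates below use += rather than -=.
        output_error = y - self.predicted_output
        output_delta = output_error * self.sigmoid_derivative(
            self.predicted_output)

        # Propagate the output delta back through the hidden-to-output weights
        hidden_error = np.dot(output_delta, self.weights_hidden_output.T)
        hidden_delta = hidden_error * self.sigmoid_derivative(self.hidden_output)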

        self.weights_hidden_output += np.dot(
            self.hidden_output.T, output_delta) * learning_rate
        self.bias_output += np.sum(
            output_delta, axis=0, keepdims=True) * learning_rate
        self.weights_input_hidden += np.dot(X.T, hidden_delta) * learning_rate
        self.bias_hidden += np.sum(
            hidden_delta, axis=0, keepdims=True) * learning_rate

    def train(self, X, y, epochs, learning_rate):
        for epoch in range(epochs):
            output = self.feedforward(X)
            self.backward(X, y, learning_rate)
            if epoch % 4000 == 0:
                loss = np.mean(np.square(y - output))
                print(f"Epoch {epoch}, Loss:{loss}")

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([[0], [1], [1], [0]])

nn = NeuralNetwork(input_size=2, hidden_size=4, output_size=1)
nn.train(X, y, epochs=10000, learning_rate=0.1)

output = nn.feedforward(X)
print("Predictions after training:")
print(output)

Epoch 0, Loss:0.2639659691915888
Epoch 4000, Loss:0.006595200421080553
Epoch 8000, Loss:0.0022121611755944754
Predictions after training:
[[0.02510323]
 [0.95959851]
 [0.95893597]
 [0.05081071]]


### The Mathematical Basis: The Chain Rule

In a neural network, our goal is to minimize a **Loss Function (L)** by adjusting the network's weights and biases. To do this, we need to calculate the *gradient* of the loss with respect to each weight and bias, that is, how much the loss changes when a specific weight or bias changes. This is where derivatives come in.

Consider a simple path in your network:

1.  **Input `X`** goes into a layer.
2.  It's multiplied by **Weights `W`** and added to **Bias `b`** to get the **Pre-activation `Z`** (`Z = XW + b`; the code calls this `hidden_activation` or `output_activation`).
3.  `Z` is then passed through an **Activation Function `f`** (like sigmoid) to get the **Output `A`** (`A = f(Z)`).
4.  This `A` then contributes to the **Loss `L`**.

To find out how a weight `W` affects the Loss `L` (i.e., `dL/dW`), we use the chain rule:

`dL/dW = (dL/dA) * (dA/dZ) * (dZ/dW)`

Let's look at each term:

*   **`dL/dA`**: This is the gradient of the Loss with respect to the output of the activation function. It tells us how sensitive the loss is to changes in the output of this particular layer.

*   **`dA/dZ`**: This is the **derivative of the activation function** with respect to its input. If `A = f(Z)`, then `dA/dZ = f'(Z)`. **This is precisely where `self.sigmoid_derivative(self.hidden_output)` and `self.sigmoid_derivative(self.predicted_output)` come from in your code.** For the sigmoid, `f'(Z) = A * (1 - A)`, so the code passes the layer's output `A` rather than `Z`.

*   **`dZ/dW`**: This is the derivative of `Z` with respect to the weight `W`. Since `Z = XW + b`, then `dZ/dW = X` (the input to the layer). In the matrix code this shows up as the transposed input, `X.T` or `self.hidden_output.T`, in the `np.dot` calls.
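
The three terms above can be checked numerically on a single neuron. The following is a minimal sketch, not part of the notebook: the scalar values `x`, `w`, `b`, `y` and the loss `0.5 * (y - a)**2` are assumptions chosen only for illustration. It multiplies the three chain-rule factors and compares the product with a central finite-difference estimate of `dL/dW`.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


# Hypothetical scalar values, chosen only for illustration.
x, w, b, y = 0.5, 0.8, 0.1, 1.0


def loss(w_value):
    # Assumed loss: half the squared error of a single sigmoid neuron.
    a = sigmoid(x * w_value + b)
    return 0.5 * (y - a) ** 2


# Chain rule, one factor at a time.
z = x * w + b
a = sigmoid(z)
dL_dA = a - y          # derivative of 0.5 * (y - a)^2 with respect to a
dA_dZ = a * (1 - a)    # sigmoid derivative, written in terms of the output a
dZ_dW = x              # Z = x*w + b, so dZ/dW is the input
analytic = dL_dA * dA_dZ * dZ_dW

# Central finite difference as an independent estimate of dL/dW.
eps = 1e-6
numeric = (loss(w + eps) - loss(w - eps)) / (2 * eps)

print("chain rule:       ", analytic)
print("finite difference:", numeric)
```

The two printed values should agree to many decimal places, which is a quick way to convince yourself that the product of the three terms really is `dL/dW`.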

### Connecting to Your `backward` Function

Let's map this to the `output_delta` calculation in your code:

```python
output_error = y - self.predicted_output  # Proportional to -dL/dA_output (the negative gradient)
output_delta = output_error * self.sigmoid_derivative(self.predicted_output)
```

For a squared-error loss such as `L = 0.5 * (y - A_output)^2`, `dL/dA_output = A_output - y`, so `output_error = y - self.predicted_output` is the *negative* of `dL/dA_output` (up to a constant factor that depends on how the loss is scaled). Then, `self.sigmoid_derivative(self.predicted_output)` calculates `dA_output/dZ_output`. Strictly, the derivative is taken with respect to the pre-activation `Z_output`, but since `f'(Z) = f(Z)(1 - f(Z)) = A(1 - A)`, passing `self.predicted_output` (which is `A_output`) directly is correct.

So, `output_delta` is proportional to `-(dL/dA_output) * (dA_output/dZ_output)`, that is, `-dL/dZ_output` for the output layer. The minus sign is why `backward` adds its updates with `+=` instead of subtracting them.

Similarly, for the hidden layer:

```python
hidden_error = np.dot(output_delta, self.weights_hidden_output.T)  # Propagates -dL/dZ_output back to give -dL/dA_hidden
hidden_delta = hidden_error * self.sigmoid_derivative(self.hidden_output)
```

Here, `hidden_error` represents `-dL/dA_hidden` (how much the loss is affected by the hidden layer's output, with the sign flipped because `output_delta` already carries the negative gradient). Then, `self.sigmoid_derivative(self.hidden_output)` is `dA_hidden/dZ_hidden`. Multiplying them gives `-dL/dZ_hidden`.
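
One way to sanity-check the two delta formulas is to compare `hidden_delta` with a finite-difference estimate of the loss gradient at the hidden pre-activations. The sketch below is standalone rather than a use of the `NeuralNetwork` class: the seed, the local names `W1`, `W2`, `Z1`, and the loss `0.5 * sum((y - A2)**2)` are assumptions made for illustration. If the formulas are right, `hidden_delta` should equal the *negative* of the numerical `dL/dZ_hidden`.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


# Same XOR data as the notebook; weights drawn with a fixed, arbitrary seed.
rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)
W1 = rng.standard_normal((2, 4))
b1 = np.zeros((1, 4))
W2 = rng.standard_normal((4, 1))
b2 = np.zeros((1, 1))


def loss_from_hidden_z(Z1):
    # Assumed loss: half the summed squared error, as a function of Z_hidden.
    A1 = sigmoid(Z1)
    A2 = sigmoid(A1 @ W2 + b2)
    return 0.5 * np.sum((y - A2) ** 2)


# Forward pass, then the same delta formulas as in backward().
Z1 = X @ W1 + b1
A1 = sigmoid(Z1)
A2 = sigmoid(A1 @ W2 + b2)
output_delta = (y - A2) * A2 * (1 - A2)
hidden_error = output_delta @ W2.T
hidden_delta = hidden_error * A1 * (1 - A1)

# Finite-difference estimate of dL/dZ_hidden, one entry at a time.
eps = 1e-6
numeric = np.zeros_like(Z1)
for i in range(Z1.shape[0]):
    for j in range(Z1.shape[1]):
        Z_plus = Z1.copy()
        Z_plus[i, j] += eps
        Z_minus = Z1.copy()
        Z_minus[i, j] -= eps
        numeric[i, j] = (loss_from_hidden_z(Z_plus)
                         - loss_from_hidden_z(Z_minus)) / (2 * eps)

# hidden_delta carries the negative gradient, so the sum should be near zero.
max_gap = np.max(np.abs(hidden_delta + numeric))
print("largest mismatch:", max_gap)
```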

### Summary

The `backward` function *implements* the chain rule. The `sigmoid_derivative` function provides the `dA/dZ` part of that chain. By multiplying the error propagated from subsequent layers (`dL/dA`) by the derivative of the activation function (`dA/dZ`), we effectively compute `dL/dZ` for a given layer (in this code, its negative, since the error is defined as `y - self.predicted_output`). This `dL/dZ` is then used to find the gradients with respect to weights (`dL/dW = (dL/dZ) * (dZ/dW)`) and biases (`dL/db = (dL/dZ) * (dZ/db)`).
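
The last step, going from `dL/dZ` to `dL/dW` and `dL/db`, can be sketched for a single sigmoid layer as follows. This is an illustrative check under assumptions, not the notebook's code: `A1` is a random stand-in for the hidden-layer outputs, and the loss is taken to be `0.5 * sum((y - A2)**2)`. With a batch of inputs, `dZ/dW` contributes the transposed layer input and `dZ/db = 1` turns into a sum over the batch.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


rng = np.random.default_rng(1)
A1 = rng.random((4, 3))                    # stand-in hidden-layer outputs
y = np.array([[0.0], [1.0], [1.0], [0.0]])
W2 = rng.standard_normal((3, 1))
b2 = np.zeros((1, 1))


def loss(W, b):
    # Assumed loss: half the summed squared error.
    return 0.5 * np.sum((y - sigmoid(A1 @ W + b)) ** 2)


A2 = sigmoid(A1 @ W2 + b2)
dL_dZ = (A2 - y) * A2 * (1 - A2)                # (dL/dA) * (dA/dZ)
dL_dW = A1.T @ dL_dZ                            # (dL/dZ) * (dZ/dW), summed over the batch
dL_db = np.sum(dL_dZ, axis=0, keepdims=True)    # dZ/db = 1, summed over the batch

# Finite-difference estimates for comparison.
eps = 1e-6
num_dW = np.zeros_like(W2)
for i in range(W2.shape[0]):
    W_plus = W2.copy()
    W_plus[i, 0] += eps
    W_minus = W2.copy()
    W_minus[i, 0] -= eps
    num_dW[i, 0] = (loss(W_plus, b2) - loss(W_minus, b2)) / (2 * eps)
num_db = (loss(W2, b2 + eps) - loss(W2, b2 - eps)) / (2 * eps)

w_gap = np.max(np.abs(dL_dW - num_dW))
b_gap = abs(dL_db[0, 0] - num_db)
print("weight gradient mismatch:", w_gap)
print("bias gradient mismatch:  ", b_gap)
```

These are the same shapes that appear in `backward`: `np.dot(self.hidden_output.T, output_delta)` for the weights and `np.sum(output_delta, axis=0, keepdims=True)` for the bias.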

The derivative plays a crucial role in the `backward` function, which implements the backpropagation algorithm for training the neural network. Specifically, the derivative of the activation function (here, the sigmoid) scales the error at each layer, which determines how much each weight and bias is adjusted.

### The Sigmoid Function and its Derivative

Your `NeuralNetwork` class uses the sigmoid activation function, defined as:

```python
def sigmoid(self, x):
    return 1 / (1 + np.exp(-x))
```

The derivative of the sigmoid with respect to its input `z` is `f'(z) = f(z) * (1 - f(z))`. It can therefore be computed from the sigmoid's output alone, and your implementation does exactly that: the argument `x` is expected to be `f(z)`, not `z`:

```python
def sigmoid_derivative(self, x):
    # Here, 'x' is assumed to be the output of the sigmoid function
    return x * (1 - x)
```
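
Because the function expects the sigmoid's *output*, passing a pre-activation by mistake silently gives the wrong numbers. The sketch below is illustrative only (the standalone helper name `sigmoid_derivative_from_output` is hypothetical): it compares `a * (1 - a)` with a finite-difference derivative of the sigmoid, and shows that feeding `z` instead of `a` does not match.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def sigmoid_derivative_from_output(a):
    # Hypothetical standalone version of the class method: expects a = sigmoid(z).
    return a * (1 - a)


z = np.linspace(-4, 4, 9)
a = sigmoid(z)

# Central finite difference of sigmoid with respect to z.
eps = 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)

from_output = sigmoid_derivative_from_output(a)   # correct usage
wrong_if_given_z = sigmoid_derivative_from_output(z)  # misuse: z is not a sigmoid output

print(np.allclose(from_output, numeric, atol=1e-8))
print(np.allclose(wrong_if_given_z, numeric, atol=1e-8))
```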

### How the Derivative is Used in the `backward` Function

The `backward` function calculates the error at the output layer and then propagates it back through the network to update the weights and biases. The `sigmoid_derivative` is essential for this process:

1.  **Output Layer Delta Calculation:**
    ```python
    output_error = y - self.predicted_output
    output_delta = output_error * self.sigmoid_derivative(self.predicted_output)
    ```
    *   `output_error`: This is the difference between the actual target `y` and the `predicted_output` of the network.
    *   `self.sigmoid_derivative(self.predicted_output)`: This evaluates the derivative of the sigmoid with respect to its pre-activation input, computed from `predicted_output`. Multiplying the `output_error` by this derivative gives `output_delta`, the error signal at the output layer's pre-activation. It determines the direction and size of the update to the output weights and bias.

2.  **Hidden Layer Delta Calculation:**
    ```python
    hidden_error = np.dot(output_delta, self.weights_hidden_output.T)
    hidden_delta = hidden_error * self.sigmoid_derivative(self.hidden_output)
    ```
    *   `hidden_error`: This error is propagated back from the output layer to the hidden layer. It's calculated by taking the dot product of `output_delta` with the transpose of the `weights_hidden_output`.
    *   `self.sigmoid_derivative(self.hidden_output)`: Similar to the output layer, this evaluates the sigmoid derivative for the hidden layer from `hidden_output`. Multiplying `hidden_error` by this derivative gives `hidden_delta`, the error signal at the hidden layer's pre-activation, which drives the updates to the input-to-hidden weights and the hidden bias.

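These `delta` values are then combined with each layer's input (`self.hidden_output.T` for the output layer, `X.T` for the hidden layer), scaled by the `learning_rate`, and added to the weights and biases to reduce the overall loss of the neural network.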
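
A single update step can be written out on its own to see the effect. The following sketch makes several assumptions that are not in the notebook: a fixed seed, a deliberately small `learning_rate` of 0.01, and local names such as `W1_new`. It applies the same `+=`-style update once, shows that it matches the more familiar "subtract the gradient" form, and compares the mean squared error before and after.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)
W1 = rng.standard_normal((2, 4))
b1 = np.zeros((1, 4))
W2 = rng.standard_normal((4, 1))
b2 = np.zeros((1, 1))


def forward(W1, b1, W2, b2):
    A1 = sigmoid(X @ W1 + b1)
    A2 = sigmoid(A1 @ W2 + b2)
    return A1, A2


def mse(A2):
    return np.mean(np.square(y - A2))


learning_rate = 0.01  # assumed small step, smaller than the notebook's 0.1

A1, A2 = forward(W1, b1, W2, b2)
loss_before = mse(A2)

# Deltas as in backward(): both carry the negative gradient.
output_delta = (y - A2) * A2 * (1 - A2)
hidden_delta = (output_delta @ W2.T) * A1 * (1 - A1)

# Update in the notebook's form: add delta-based terms.
W2_new = W2 + (A1.T @ output_delta) * learning_rate
b2_new = b2 + np.sum(output_delta, axis=0, keepdims=True) * learning_rate
W1_new = W1 + (X.T @ hidden_delta) * learning_rate
b1_new = b1 + np.sum(hidden_delta, axis=0, keepdims=True) * learning_rate

# Equivalent textbook form for one of the matrices: subtract the gradient.
grad_W2 = -(A1.T @ output_delta)
W2_alt = W2 - learning_rate * grad_W2

_, A2_new = forward(W1_new, b1_new, W2_new, b2_new)
loss_after = mse(A2_new)

print("loss before:", loss_before)
print("loss after: ", loss_after)
```

Repeating this step many times is what `train` does; the loss printed every 4000 epochs is the same mean squared error computed here.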