## Introduction

In this notebook, we'll look at backpropagation in neural networks. Backpropagation is the key algorithm for training neural networks: it lets them learn from data by adjusting their weights and biases. We'll walk through its implementation in the provided neural network code, see how it computes gradients, and see how those gradients are used to update the network parameters to minimize the loss function.

Before we proceed, ensure that you have the NumPy library installed. If not, you can install it using the following command:

In [7]:
pip install numpy





To recap, here's the full Network class from our previous notebook:

In [8]:
import random
import numpy as np

class Network(object):

    def __init__(self, sizes):
        self.num_layers = len(sizes)
        self.sizes = sizes
        self.biases = [np.random.randn(y, 1) for y in sizes[1:]]
        self.weights = [np.random.randn(y, x)
                        for x, y in zip(sizes[:-1], sizes[1:])]

    def feedforward(self, a):
        for b, w in zip(self.biases, self.weights):
            a = sigmoid(np.dot(w, a)+b)
        return a

    def SGD(self, training_data, epochs, mini_batch_size, eta,
            test_data):

        training_data = list(training_data)
        n = len(training_data)

        if test_data:
            test_data = list(test_data)
            n_test = len(test_data)

        for j in range(epochs):
            random.shuffle(training_data)
            mini_batches = [
                training_data[k:k+mini_batch_size]
                for k in range(0, n, mini_batch_size)]
            for mini_batch in mini_batches:
                self.update_mini_batch(mini_batch, eta)
            if test_data:
                print("Epoch {} : {} / {}".format(j,self.evaluate(test_data),n_test))
            
            else:
                print("Epoch {} complete".format(j))

    def update_mini_batch(self, mini_batch, eta):
        nabla_b = [np.zeros(b.shape) for b in self.biases]
        nabla_w = [np.zeros(w.shape) for w in self.weights]
        for x, y in mini_batch:
            delta_nabla_b, delta_nabla_w = self.backprop(x, y)
            nabla_b = [nb+dnb for nb, dnb in zip(nabla_b, delta_nabla_b)]
            nabla_w = [nw+dnw for nw, dnw in zip(nabla_w, delta_nabla_w)]
        self.weights = [w-(eta/len(mini_batch))*nw
                        for w, nw in zip(self.weights, nabla_w)]
        self.biases = [b-(eta/len(mini_batch))*nb
                       for b, nb in zip(self.biases, nabla_b)]

    def backprop(self, x, y):
        nabla_b = [np.zeros(b.shape) for b in self.biases]
        nabla_w = [np.zeros(w.shape) for w in self.weights]
        # feedforward
        activation = x
        activations = [x] # list to store all the activations, layer by layer
        zs = [] # list to store all the z vectors, layer by layer
        for b, w in zip(self.biases, self.weights):
            z = np.dot(w, activation)+b
            zs.append(z)
            activation = sigmoid(z)
            activations.append(activation)
        # backward pass
        delta = self.cost_derivative(activations[-1], y) * \
            sigmoid_prime(zs[-1])
        nabla_b[-1] = delta
        nabla_w[-1] = np.dot(delta, activations[-2].transpose())
        for l in range(2, self.num_layers):
            z = zs[-l]
            sp = sigmoid_prime(z)
            delta = np.dot(self.weights[-l+1].transpose(), delta) * sp
            nabla_b[-l] = delta
            nabla_w[-l] = np.dot(delta, activations[-l-1].transpose())
        return (nabla_b, nabla_w)

    def evaluate(self, test_data):
        test_results = [(np.argmax(self.feedforward(x)), y)
                        for (x, y) in test_data]
        return sum(int(x == y) for (x, y) in test_results)

    def cost_derivative(self, output_activations, y):
        return (output_activations-y)

#### Miscellaneous functions
def sigmoid(z):
    return 1.0/(1.0+np.exp(-z))

def sigmoid_prime(z):
    return sigmoid(z)*(1-sigmoid(z))

Now, let's break down the backpropagation part of this class.

# 1. Initialization
Let's start by initializing the neural network. The constructor takes sizes, a list giving the number of neurons in each layer, and draws every weight and bias from a standard normal distribution using np.random.randn.
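As a quick illustration of what this initialization produces, the sketch below builds the biases and weights lists the same way the constructor does, but outside the class. The architecture [3, 2, 1] is an arbitrary choice made here for illustration.

```python
import numpy as np

sizes = [3, 2, 1]  # example architecture: 3 inputs, 2 hidden neurons, 1 output

# Same construction as in Network.__init__
biases = [np.random.randn(y, 1) for y in sizes[1:]]
weights = [np.random.randn(y, x) for x, y in zip(sizes[:-1], sizes[1:])]

bias_shapes = [b.shape for b in biases]
weight_shapes = [w.shape for w in weights]
print(bias_shapes)    # [(2, 1), (1, 1)]
print(weight_shapes)  # [(2, 3), (1, 2)]
```

Note that the input layer gets no biases or weights of its own: each weight matrix has one row per neuron in the layer it feeds and one column per neuron in the layer before it.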

```python
import numpy as np
import random

class Network(object):

    def __init__(self, sizes):
        self.num_layers = len(sizes)
        self.sizes = sizes
        self.biases = [np.random.randn(y, 1) for y in sizes[1:]]
        self.weights = [np.random.randn(y, x)
                        for x, y in zip(sizes[:-1], sizes[1:])]
```

# 2. Backpropagation Algorithm
Now, we'll explore the backpropagation algorithm, which computes the gradients of the cost function with respect to the weights and biases of the network. We'll understand how errors are propagated backward through the network to update the parameters.

The first thing to understand is that the main purpose of backpropagation is to find the rate of change of the error function with respect to the weights and biases. For this example, we'll talk about the weights only, but keep in mind that the biases behave the same way, only a little simpler.

We'll take the example of a neural network with 3 neurons in the input layer, 2 neurons in the hidden layer, and 1 neuron in the output layer. Let's say we want to change the weight between the first neuron of the hidden layer and the neuron of the output layer. Here's the mathematical explanation (note that the superscript denotes the layer, while the subscript denotes the position of the neuron from the top):

![Backpropagation algorithm for output neuron](./images/2.1.jpg)
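
In case the figure does not render, the same chain-rule idea can be sketched in text. The notation is an assumption chosen here to match the description (superscript for the layer, counting the input layer as 1; subscript for the neuron position), and may differ from the figure: $E$ is the error, $z$ a weighted input, $a = \sigma(z)$ an activation, and $w^{(3)}_{11}$ the weight from the first hidden neuron to the output neuron. The second equality assumes $E = \tfrac{1}{2}(a^{(3)}_{1} - y)^2$.

$$
\frac{\partial E}{\partial w^{(3)}_{11}}
= \frac{\partial E}{\partial a^{(3)}_{1}} \cdot \frac{\partial a^{(3)}_{1}}{\partial z^{(3)}_{1}} \cdot \frac{\partial z^{(3)}_{1}}{\partial w^{(3)}_{11}}
= \left(a^{(3)}_{1} - y\right)\,\sigma'\!\left(z^{(3)}_{1}\right)\,a^{(2)}_{1}
$$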

Now, that's just for the output layer. To change the weights of the other layer(s), the process is similar, but a hidden neuron affects the total error through every neuron it feeds in the next layer, so the error reaching it is the sum of the errors passed back from those neurons. Hence, the derivative of the error function is broken down like this:

 ![Backpropagation algorithm for other neuron](./images/2.2.jpg)
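
As a text sketch of the same idea, under assumed notation that may differ from the figure (superscript for the layer with the input layer as 1, $w^{(2)}_{jk}$ for the weight from input neuron $k$ to hidden neuron $j$, $z$ for a weighted input, and $a = \sigma(z)$ for an activation), the gradient for a hidden-layer weight can be written as:

$$
\frac{\partial E}{\partial w^{(2)}_{jk}}
= \left(\sum_{i} \frac{\partial E}{\partial z^{(3)}_{i}}\, w^{(3)}_{ij}\right)\sigma'\!\left(z^{(2)}_{j}\right)\,a^{(1)}_{k}
$$

In the 3-2-1 example there is only one output neuron, so the sum has a single term; with more output neurons, each one contributes a term.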

If you're still confused, I found [this step-by-step backpropagation example](https://mattmazur.com/2015/03/17/a-step-by-step-backpropagation-example/) to be really helpful!

Here's its implementation in code:

```python
def backprop(self, x, y):
        nabla_b = [np.zeros(b.shape) for b in self.biases]
        nabla_w = [np.zeros(w.shape) for w in self.weights]
        # forward propagation
        activation = x
        activations = [x] # list to store all the activations, layer by layer
        zs = [] # list to store all the z vectors, layer by layer
        for b, w in zip(self.biases, self.weights):
            z = np.dot(w, activation)+b
            zs.append(z)
            activation = sigmoid(z)
            activations.append(activation)
        # backward propagation
        delta = self.cost_derivative(activations[-1], y) * sigmoid_prime(zs[-1])
        nabla_b[-1] = delta
        nabla_w[-1] = np.dot(delta, activations[-2].transpose())
        for l in range(2, self.num_layers):
            z = zs[-l]
            sp = sigmoid_prime(z)
            delta = np.dot(self.weights[-l+1].transpose(), delta) * sp
            nabla_b[-l] = delta
            nabla_w[-l] = np.dot(delta, activations[-l-1].transpose())
        return (nabla_b, nabla_w)
```

This method takes a single training input and its desired output, labeled x and y respectively.

First, it initializes nabla_b and nabla_w, which are lists of NumPy arrays, one per layer, with the same shapes as the biases and weights. They will hold the gradient of the cost with respect to each bias and weight. Initially, all of their elements are zero:

```python
nabla_b = [np.zeros(b.shape) for b in self.biases]
nabla_w = [np.zeros(w.shape) for w in self.weights]
```

Next, forward propagation runs to compute the output that the network gives before any adjustment of the weights and biases. Along the way, it stores every weighted input z and every activation, layer by layer, because the backward pass needs them. This should be self-explanatory by now. If not, I suggest rereading the Forward Propagation notebook.

```python
# forward propagation
activation = x
activations = [x]  # list to store all the activations, layer by layer
zs = []  # list to store all the z vectors, layer by layer
for b, w in zip(self.biases, self.weights):
    z = np.dot(w, activation) + b
    zs.append(z)
    activation = sigmoid(z)
    activations.append(activation)
```
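
To see what the forward pass leaves behind, here is a small standalone sketch of the same loop. The 2-2-1 network and its hand-picked weights, biases, and input are hypothetical values chosen for illustration, not taken from the notebook.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical 2-2-1 network with hand-picked parameters
weights = [np.array([[0.5, -0.5], [1.0, 1.0]]), np.array([[1.0, -1.0]])]
biases = [np.zeros((2, 1)), np.zeros((1, 1))]
x = np.array([[1.0], [1.0]])  # input as a column vector

activation = x
activations = [x]  # activations of every layer, input included
zs = []            # weighted inputs of every non-input layer
for b, w in zip(biases, weights):
    z = np.dot(w, activation) + b
    zs.append(z)
    activation = sigmoid(z)
    activations.append(activation)

print(len(zs), len(activations))       # 2 3
print([a.shape for a in activations])  # [(2, 1), (2, 1), (1, 1)]
```

There is always one more entry in activations than in zs, because the input x counts as the activation of the first layer.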

Now, the backpropagation algorithm starts.

```python
        delta = self.cost_derivative(activations[-1], y) * sigmoid_prime(zs[-1])
        nabla_b[-1] = delta
        nabla_w[-1] = np.dot(delta, activations[-2].transpose())
        for l in range(2, self.num_layers):
            z = zs[-l]
            sp = sigmoid_prime(z)
            delta = np.dot(self.weights[-l+1].transpose(), delta) * sp
            nabla_b[-l] = delta
            nabla_w[-l] = np.dot(delta, activations[-l-1].transpose())
        return (nabla_b, nabla_w)
```
First, we find the derivative of the cost function using the method cost_derivative, which is given as follows:
```python
def cost_derivative(self, output_activations, y):
        return (output_activations-y)
```

This means that delta, the error at the output layer, is equal to the cost derivative multiplied elementwise by the derivative of the sigmoid function:
```python
def sigmoid_prime(z):
    return sigmoid(z)*(1-sigmoid(z))
```

Note that both functions (cost_derivative and sigmoid_prime) are derived from their formulas. Since we use the mean squared error (MSE) cost function, here taken with a factor of 1/2, the derivative of the cost with respect to the output is just the difference between the output we get and the output we want. Likewise, sigmoid_prime comes from differentiating the sigmoid function, which gives sigmoid(z) * (1 - sigmoid(z)).
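
One way to convince yourself that sigmoid_prime really is the derivative of sigmoid is to compare it with a finite-difference estimate. The sketch below does that; the test points and the step size eps are arbitrary choices made here.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    return sigmoid(z) * (1 - sigmoid(z))

z = np.linspace(-5.0, 5.0, 11)  # a few sample points
eps = 1e-6                      # small step for the central difference

# Central difference: (f(z + eps) - f(z - eps)) / (2 * eps)
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
max_diff = np.max(np.abs(numeric - sigmoid_prime(z)))
print(max_diff)  # tiny: the two agree up to numerical error
```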

Now, we can directly set the last element of nabla_b to delta, because delta is the derivative of the total error with respect to the weighted input of the output layer, and a bias adds directly to that input. For nabla_w, we take the dot product of delta and the activations of the layer just before the output layer, activations[-2]. We need to transpose activations[-2] first so that its shape aligns with delta; otherwise the dot product cannot be calculated. The result has the same shape as the last weight matrix.

```python
nabla_b[-1] = delta
nabla_w[-1] = np.dot(delta, activations[-2].transpose())
```
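
The shape argument is easier to see with concrete arrays. In the sketch below, delta and a_prev are hypothetical values standing in for the output error and activations[-2], sized for a layer of 3 neurons feeding a layer of 2 neurons.

```python
import numpy as np

delta = np.array([[0.1], [-0.2]])         # hypothetical error, shape (2, 1)
a_prev = np.array([[1.0], [2.0], [3.0]])  # hypothetical activations, shape (3, 1)

# (2, 1) dot (1, 3) gives (2, 3): one gradient entry per weight
nabla_w_last = np.dot(delta, a_prev.transpose())
print(nabla_w_last.shape)  # (2, 3)

# Without the transpose the inner dimensions (1 and 3) do not match
try:
    np.dot(delta, a_prev)
except ValueError as err:
    print("without transpose:", err)
```

Each entry of the result is the error of one receiving neuron multiplied by the activation of one sending neuron, which is exactly the weight gradient from the chain rule.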

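Now that we have nabla_b and nabla_w for the last layer, let's proceed backward through the rest of the layers. The loop variable l counts layers from the end (l = 2 is the second-to-last layer), which is why the code uses negative indices. At each step, delta is pulled back through the weights of the layer after it and multiplied by sigmoid_prime of the current layer's z: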

```python
for l in range(2, self.num_layers):
    z = zs[-l]
    sp = sigmoid_prime(z)
    delta = np.dot(self.weights[-l+1].transpose(), delta) * sp
    nabla_b[-l] = delta
    nabla_w[-l] = np.dot(delta, activations[-l-1].transpose())
return (nabla_b, nabla_w)
```

Finally, the backprop method returns nabla_b and nabla_w, the gradients for a single training example.
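
A common sanity check for a backpropagation implementation is to compare one of its gradients with a numerical estimate. The sketch below is one way to do that under some assumptions: it rewrites the backprop steps as free functions (forward, cost, and backprop here are helper names made up for this example, not part of the Network class), assumes the quadratic cost with a factor of 1/2, and uses an arbitrary random 3-2-1 network.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    return sigmoid(z) * (1 - sigmoid(z))

def forward(weights, biases, x):
    a = x
    for b, w in zip(biases, weights):
        a = sigmoid(np.dot(w, a) + b)
    return a

def cost(weights, biases, x, y):
    # Quadratic cost with a factor of 1/2, so its derivative is (a - y)
    return 0.5 * np.sum((forward(weights, biases, x) - y) ** 2)

def backprop(weights, biases, x, y):
    # Same steps as Network.backprop, written as a free function
    nabla_b = [np.zeros(b.shape) for b in biases]
    nabla_w = [np.zeros(w.shape) for w in weights]
    activation, activations, zs = x, [x], []
    for b, w in zip(biases, weights):
        z = np.dot(w, activation) + b
        zs.append(z)
        activation = sigmoid(z)
        activations.append(activation)
    delta = (activations[-1] - y) * sigmoid_prime(zs[-1])
    nabla_b[-1] = delta
    nabla_w[-1] = np.dot(delta, activations[-2].transpose())
    for l in range(2, len(weights) + 1):  # len(weights) + 1 == num_layers
        delta = np.dot(weights[-l + 1].transpose(), delta) * sigmoid_prime(zs[-l])
        nabla_b[-l] = delta
        nabla_w[-l] = np.dot(delta, activations[-l - 1].transpose())
    return nabla_b, nabla_w

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable
sizes = [3, 2, 1]
biases = [rng.standard_normal((n, 1)) for n in sizes[1:]]
weights = [rng.standard_normal((n, m)) for m, n in zip(sizes[:-1], sizes[1:])]
x = rng.standard_normal((3, 1))
y = np.array([[1.0]])

nabla_b, nabla_w = backprop(weights, biases, x, y)

# Numerically estimate dC/dw for one weight in the first layer
eps = 1e-6
i, j = 0, 1
original = weights[0][i, j]
weights[0][i, j] = original + eps
cost_plus = cost(weights, biases, x, y)
weights[0][i, j] = original - eps
cost_minus = cost(weights, biases, x, y)
weights[0][i, j] = original  # restore the weight
numeric_grad = (cost_plus - cost_minus) / (2 * eps)

print(numeric_grad, nabla_w[0][i, j])  # the two values should nearly match
```

If the two printed numbers disagree noticeably, the backward pass most likely has an indexing or shape mistake.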


# 3. Update Rule
Finally, we'll look at the update rule, which adjusts the weights and biases of the network using the computed gradients and a learning rate. This is done in the update_mini_batch method:

```python
def update_mini_batch(self, mini_batch, eta):
        nabla_b = [np.zeros(b.shape) for b in self.biases]
        nabla_w = [np.zeros(w.shape) for w in self.weights]
        for x, y in mini_batch:
            delta_nabla_b, delta_nabla_w = self.backprop(x, y)
            nabla_b = [nb+dnb for nb, dnb in zip(nabla_b, delta_nabla_b)]
            nabla_w = [nw+dnw for nw, dnw in zip(nabla_w, delta_nabla_w)]
        self.weights = [w-(eta/len(mini_batch))*nw
                        for w, nw in zip(self.weights, nabla_w)]
        self.biases = [b-(eta/len(mini_batch))*nb
                       for b, nb in zip(self.biases, nabla_b)]
```

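As you can see, the nabla_b and nabla_w returned by backprop for each example are summed over the mini-batch. The weights and biases are then updated by subtracting the summed gradients multiplied by eta/len(mini_batch), that is, the learning rate eta times the average gradient over the mini-batch.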
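
To make the arithmetic of the update concrete, here is a toy sketch with a single weight matrix. The learning rate, the starting weights, and the two per-example gradients are all hypothetical numbers chosen for illustration; in the real method the gradients come from backprop.

```python
import numpy as np

eta = 0.5                   # hypothetical learning rate
w = np.array([[1.0, 2.0]])  # hypothetical weight matrix

# Hypothetical gradients for a mini-batch of 2 examples
per_example_grads = [np.array([[0.2, 0.0]]), np.array([[0.4, 0.4]])]

nabla_w = np.zeros(w.shape)
for g in per_example_grads:
    nabla_w = nabla_w + g  # sum the gradients over the mini-batch

# Step against the average gradient, scaled by the learning rate
w_new = w - (eta / len(per_example_grads)) * nabla_w
print(w_new)  # approximately [[0.85, 1.9]]
```

Dividing by the mini-batch size keeps the step size comparable when the batch size changes, since the sum of gradients grows with the number of examples.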

# Conclusion
Backpropagation is a fundamental algorithm for training neural networks, enabling them to learn from data and make accurate predictions. In this notebook, we explored the implementation of backpropagation in the provided neural network code, understanding its role in optimizing network parameters and improving performance.