# neuralthreads
[medium](https://neuralthreads.medium.com/i-was-not-satisfied-by-any-deep-learning-tutorials-online-37c5e9f4bea1)

## Chapter 5 — Diving Deep in the Neural Networks

*Backpropagation — Made super easy for you, Part 1*

This is the post you have all been waiting for. In this post, we will go through Backpropagation, which is often seen as the most complex topic in Deep Learning but is actually very simple when done in an organized manner. After this post, you will never look at Backpropagation the same way again. I guarantee it.

By the end of this post, we will learn to calculate gradients via a game called ‘Jumping Back’, which has a few rules. This game will help you visualize how Backpropagation works and how simple it actually is.

> Note — This post uses many things from the previous chapters. It is recommended that you have a look at the previous posts.

### 5.2.1 Backpropagation in ANNs — Part 1

In this post, we will learn how to use Backpropagation to calculate gradients which we will use to update weights and biases to reduce the loss via some Optimizer.

> Note — The architecture of the Neural Network is the same as it was in the previous post, i.e., 4 layers with 5, 3, 5, and 4 nodes.

Before going forward, a few things first:  
**First**, the activation function for the first hidden layer is ReLU with leak = 0.1  
**Second**, the activation function for the second hidden layer and the output layer is the Sigmoid function.  
**Third**, the loss function used is Mean Square Error, MSE  
**Fourth**, we will use the SGD Optimizer with a learning rate = 0.01

![our nn](image-32.png)

Now, let us look at the steps we will follow here:

> Step 1 - A forward feed like we did in the previous post  
> Step 2 - Initializing SGD Optimizer  
> Step 3 - Entering the training loop  
>   Step 3.1 - A forward feed to see loss before training  
>   Step 3.2 - Using Backpropagation to calculate gradients  
>   Step 3.3 - Using SGD Optimizer to update weights and biases  
> Step 4 - A forward feed to verify that the loss has been reduced and to see how close predicted values are to true values  

Let us do it in Python.

### Step 1 — A forward feed like we did in the previous post

In [42]:
import numpy as np                          # importing NumPy
np.random.seed(42)

input_nodes = 5                             # nodes in each layer
hidden_1_nodes = 3
hidden_2_nodes = 5
output_nodes = 4

Inputs and true outputs

In [43]:
x = np.random.randint(1, 100, size = (input_nodes, 1)) / 100
x                                     # Inputs

array([[0.52],
       [0.93],
       [0.15],
       [0.72],
       [0.61]])

In [44]:
y = np.random.randint(1, 100, size = (output_nodes, 1)) / 100
y                                     # Outputs

array([[0.21],
       [0.83],
       [0.87],
       [0.75]])

This time along with the activation functions and the loss function, we will also define their derivatives.

In [45]:
def relu(x, leak = 0):                      # ReLU
    return np.where(x <= 0, leak * x, x)

def relu_dash(x, leak = 0):                 # ReLU derivative
    return np.where(x <= 0, leak, 1)

def sig(x):                                 # Sigmoid
    return 1/(1 + np.exp(-x))

def sig_dash(x):                            # Sigmoid derivative
    return sig(x) * (1 - sig(x))

def mse(y_true, y_pred):                    # MSE
    return np.mean((y_true - y_pred)**2)

def mse_grad(y_true, y_pred):               # MSE derivative    
    N = y_true.shape[0]    
    return -2*(y_true - y_pred)/N
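
Before relying on these derivatives, it can be reassuring to compare them with a numerical estimate. The block below is a minimal sketch of such a check, not part of the original notebook: `numeric_grad`, the step size `eps`, and the random test values are assumptions introduced only for illustration. It compares `sig_dash` and `mse_grad` with central finite differences.

```python
import numpy as np

def sig(x):                                 # Sigmoid, as defined above
    return 1 / (1 + np.exp(-x))

def sig_dash(x):                            # Sigmoid derivative
    return sig(x) * (1 - sig(x))

def mse(y_true, y_pred):                    # MSE
    return np.mean((y_true - y_pred)**2)

def mse_grad(y_true, y_pred):               # MSE derivative
    N = y_true.shape[0]
    return -2 * (y_true - y_pred) / N

def numeric_grad(f, v, eps=1e-6):
    """Central-difference estimate of d f / d v for a scalar-valued f.
    Hypothetical helper, used only for this check."""
    g = np.zeros_like(v)
    for i in range(v.size):
        step = np.zeros_like(v)
        step.flat[i] = eps
        g.flat[i] = (f(v + step) - f(v - step)) / (2 * eps)
    return g

rng = np.random.default_rng(0)              # arbitrary test values
z = rng.normal(size=(4, 1))
y_true = rng.random(size=(4, 1))
y_pred = rng.random(size=(4, 1))

# Sigmoid acts element-wise, so its derivative can be checked element-wise
sig_numeric = (sig(z + 1e-6) - sig(z - 1e-6)) / 2e-6
sig_ok = np.allclose(sig_numeric, sig_dash(z), atol=1e-8)

# MSE is a scalar, so we perturb one prediction at a time
mse_numeric = numeric_grad(lambda p: mse(y_true, p), y_pred)
mse_ok = np.allclose(mse_numeric, mse_grad(y_true, y_pred), atol=1e-8)

print(sig_ok, mse_ok)
```

If both values print as `True`, the analytical derivatives agree with the numerical estimates for these test values.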

Random initialization of weights and biases

In [46]:
w1 = np.random.random(size = (hidden_1_nodes, input_nodes))    # w1
b1 = np.zeros(shape = (hidden_1_nodes, 1))                     # b1
w2 = np.random.random(size = (hidden_2_nodes, hidden_1_nodes)) # w2
b2 = np.zeros(shape = (hidden_2_nodes, 1))                     # b2
w3 = np.random.random(size = (output_nodes, hidden_2_nodes))   # w3
b3 = np.zeros(shape = (output_nodes, 1))                       # b3

Forward feed before training

In [47]:
in_hidden_1 = w1.dot(x) + b1                      # forward feed
out_hidden_1 = relu(in_hidden_1, leak = 0.1)
in_hidden_2 = w2.dot(out_hidden_1) + b2
out_hidden_2 = sig(in_hidden_2)
in_output_layer = w3.dot(out_hidden_2) + b3
y_hat = sig(in_output_layer)
print(y_hat)                                             # y_hat
print(y)                                                 # y
mse(y, y_hat)                                     # MSE loss

[[0.83237553]
 [0.89655717]
 [0.87337397]
 [0.92904704]]
[[0.21]
 [0.83]
 [0.87]
 [0.75]]


0.1059625955371147

### Step 2 — Initializing SGD Optimizer

In [48]:
learning_rate = 0.01
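
For reference, here is a minimal sketch of what one plain SGD step does with this learning rate. It uses the update rule `param += -lr * grad` that this post relies on; the helper name `sgd_update` and the toy loss `sum(w**2)` are assumptions made only for this illustration.

```python
import numpy as np

def sgd_update(param, grad, learning_rate=0.01):
    """Return the parameter after one plain SGD step (hypothetical helper)."""
    return param - learning_rate * grad

# Toy problem: loss(w) = sum(w**2), whose gradient is 2 * w
w = np.array([[1.0], [-2.0]])
loss_before = np.sum(w**2)

w = sgd_update(w, 2 * w, learning_rate=0.01)
loss_after = np.sum(w**2)

print(loss_before, loss_after)              # the loss should go down slightly
```

With a small learning rate, each step moves the parameters only a little in the direction opposite to the gradient, which is why many epochs are needed.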

### Step 3 — Entering training loop

We will call the training loops ‘epochs’. We will have 10,000 training loops.

Total number of epochs

In [49]:
epochs = 10000

### Step 3.1 — A forward feed to see loss before training

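We will print the loss at the start of every epoch so that, once the update step is in place, we can see it decrease. At this stage the loop contains only the forward feed, so the printed loss is the same in every epoch.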

In [50]:
for epoch in range(epochs):
#----------------------Forward Propagation--------------------------
    
    in_hidden_1 = w1.dot(x) + b1
    out_hidden_1 = relu(in_hidden_1, leak = 0.1)
    in_hidden_2 = w2.dot(out_hidden_1) + b2
    out_hidden_2 = sig(in_hidden_2)
    in_output_layer = w3.dot(out_hidden_2) + b3
    y_hat = sig(in_output_layer)
    
    loss = mse(y, y_hat)
    print(f'loss before training is {loss} -- epoch number {epoch + 1}')
    print('\n')

loss before training is 0.1059625955371147 -- epoch number 1


loss before training is 0.1059625955371147 -- epoch number 2


loss before training is 0.1059625955371147 -- epoch number 3


loss before training is 0.1059625955371147 -- epoch number 4


loss before training is 0.1059625955371147 -- epoch number 5


loss before training is 0.1059625955371147 -- epoch number 6


loss before training is 0.1059625955371147 -- epoch number 7


loss before training is 0.1059625955371147 -- epoch number 8


loss before training is 0.1059625955371147 -- epoch number 9


loss before training is 0.1059625955371147 -- epoch number 10


loss before training is 0.1059625955371147 -- epoch number 11


loss before training is 0.1059625955371147 -- epoch number 12


loss before training is 0.1059625955371147 -- epoch number 13


loss before training is 0.1059625955371147 -- epoch number 14


loss before training is 0.1059625955371147 -- epoch number 15


loss before training is 0.1059625955371147 -- epo

### Step 3.2 — Calculating gradients via Backpropagation

Now, the question is how to update weights and biases.

For example, take weight w3₁₁
If we can calculate

In [51]:
%%latex

\begin{gather*}
    \frac{\partial loss}{\partial W3_{11}} \\
    \\
    \begin{align*}
        \text{Then we can use SGD Optimizer to update it like this.} \\
    \end{align*}
    \\
    W3_{11} \, += -lr \ast \frac{\partial loss}{\partial W3_{11}} \\
    \\
    \begin{align*}
        \text{Or we can do even better.} \\
        \text{For every weight in w3, we can do this} \\

    \end{align*} \\
    \\
    \newcommand{\arraystretch}{2}
    \begin{bmatrix*}
        W3_{11} & W3_{21} & W3_{31} & W3_{41} & W3_{51}\\
        W3_{12} & W3_{22} & W3_{32} & W3_{42} & W3_{52} \\
        W3_{13} & W3_{23} & W3_{33} & W3_{43} & W3_{53} \\
        W3_{14} & W3_{24} & W3_{34} & W3_{44} & W3_{54} \\ 
    \end{bmatrix*} \mathrel{+}= -lr \ast
    \begin{bmatrix*}
        \frac{\partial loss}{\partial W3_{11}} & \frac{\partial loss}{\partial W3_{21}} & \frac{\partial loss}{\partial W3_{31}} & \frac{\partial loss}{\partial W3_{41}} & \frac{\partial loss}{\partial W3_{51}}\\
        \frac{\partial loss}{\partial W3_{12}} & \frac{\partial loss}{\partial W3_{22}} & \frac{\partial loss}{\partial W3_{32}} & \frac{\partial loss}{\partial W3_{42}} & \frac{\partial loss}{\partial W3_{52}} \\
        \frac{\partial loss}{\partial W3_{13}} & \frac{\partial loss}{\partial W3_{23}} & \frac{\partial loss}{\partial W3_{33}} & \frac{\partial loss}{\partial W3_{43}} & \frac{\partial loss}{\partial W3_{53}} \\
        \frac{\partial loss}{\partial W3_{14}} & \frac{\partial loss}{\partial W3_{24}} & \frac{\partial loss}{\partial W3_{34}} & \frac{\partial loss}{\partial W3_{44}} & \frac{\partial loss}{\partial W3_{54}} \\ 
    \end{bmatrix*}
    \\
    \begin{align*}
        \text{It can be rearranged as} \\
    \end{align*}
    \\
    W3 \, \mathrel{+}= update\_W3 \\
    update\_W3 = -lr \ast grad\_W3
    \\
    \begin{align*}
        \text{where} \\
    \end{align*}
    \\
    \newcommand{\arraystretch}{2}
    grad\_W3 = 
    \begin{bmatrix*}
        \frac{\partial loss}{\partial W3_{11}} & \frac{\partial loss}{\partial W3_{21}} & \frac{\partial loss}{\partial W3_{31}} & \frac{\partial loss}{\partial W3_{41}} & \frac{\partial loss}{\partial W3_{51}}\\
        \frac{\partial loss}{\partial W3_{12}} & \frac{\partial loss}{\partial W3_{22}} & \frac{\partial loss}{\partial W3_{32}} & \frac{\partial loss}{\partial W3_{42}} & \frac{\partial loss}{\partial W3_{52}} \\
        \frac{\partial loss}{\partial W3_{13}} & \frac{\partial loss}{\partial W3_{23}} & \frac{\partial loss}{\partial W3_{33}} & \frac{\partial loss}{\partial W3_{43}} & \frac{\partial loss}{\partial W3_{53}} \\
        \frac{\partial loss}{\partial W3_{14}} & \frac{\partial loss}{\partial W3_{24}} & \frac{\partial loss}{\partial W3_{34}} & \frac{\partial loss}{\partial W3_{44}} & \frac{\partial loss}{\partial W3_{54}} \\ 
    \end{bmatrix*}
    \\
    \begin{align*}
        \text{Let us start by finding the first term, i.e.,} \\
    \end{align*}
    \\
    \frac{\partial loss}{\partial W3_{11}} \\
    \\
    \begin{align*}
        \text{We know that,} \\
    \end{align*}
    \\
    loss = mse \\
    \\
    \begin{align*}
        \text{and} \\
    \end{align*}
\end{gather*}

<IPython.core.display.Latex object>

In [52]:
%%latex
\begin{gather*}
    mse = f(\hat{y_1}, \hat{y_2}, \hat{y_3}, \hat{y_4}) \\
        \\
    \begin{align*}
        \text{So we can write} \\
    \end{align*}
    \\
    \\
    \frac{\partial loss}{\partial W3_{11}} = \frac{\partial mse}{\partial W3_{11}}\\
    \\
    \frac{\partial loss}{\partial W3_{11}} = \frac{\partial mse}{\partial \hat{y_1}} \cdot \frac{\partial \hat{y_1}}{\partial W3_{11}} + 
                                             \frac{\partial mse}{\partial \hat{y_2}} \cdot \frac{\partial \hat{y_2}}{\partial W3_{11}} +  
                                             \frac{\partial mse}{\partial \hat{y_3}} \cdot \frac{\partial \hat{y_3}}{\partial W3_{11}} +  
                                             \frac{\partial mse}{\partial \hat{y_4}} \cdot \frac{\partial \hat{y_4}}{\partial W3_{11}} \\

    \\
    \begin{align*}
        \text{We also know that,} \\
    \end{align*}
    \\
    \hat{y_1} = f(I\_OL_1) \\
    \hat{y_2} = f(I\_OL_2) \\
    \hat{y_3} = f(I\_OL_3) \\
    \hat{y_4} = f(I\_OL_4) \\
    \\
    \begin{align*}
        \text{So, we can write,} \\
    \end{align*}
    \\
    \frac{\partial loss}{\partial W3_{11}} = \frac{\partial mse}{\partial \hat{y_1}} \cdot \frac{\partial \hat{y_1}}{\partial I\_OL_1} \cdot \frac{\partial I\_OL_1}{\partial W3_{11}} + 
                                             \frac{\partial mse}{\partial \hat{y_2}} \cdot \frac{\partial \hat{y_2}}{\partial I\_OL_2} \cdot \frac{\partial I\_OL_2}{\partial W3_{11}} + 
                                             \frac{\partial mse}{\partial \hat{y_3}} \cdot \frac{\partial \hat{y_3}}{\partial I\_OL_3} \cdot \frac{\partial I\_OL_3}{\partial W3_{11}} + 
                                             \frac{\partial mse}{\partial \hat{y_4}} \cdot \frac{\partial \hat{y_4}}{\partial I\_OL_4} \cdot \frac{\partial I\_OL_4}{\partial W3_{11}} \\

    \\
    \begin{align*}
        \text{We also know that,} \\
    \end{align*}
    \\
    I\_OL_1 = O\_H2_1 \cdot W3_{11} + O\_H2_2 \cdot W3_{21} + O\_H2_3 \cdot W3_{31} + O\_H2_4 \cdot W3_{41} + O\_H2_5 \cdot W3_{51} + B3_1  \\
        \\
    I\_OL_2 = O\_H2_1 \cdot W3_{12} + O\_H2_2 \cdot W3_{22} + O\_H2_3 \cdot W3_{32} + O\_H2_4 \cdot W3_{42} + O\_H2_5 \cdot W3_{52} + B3_2  \\
        \\
    I\_OL_3 = O\_H2_1 \cdot W3_{13} + O\_H2_2 \cdot W3_{23} + O\_H2_3 \cdot W3_{33} + O\_H2_4 \cdot W3_{43} + O\_H2_5 \cdot W3_{53} + B3_3  \\
        \\
    I\_OL_4 = O\_H2_1 \cdot W3_{14} + O\_H2_2 \cdot W3_{24} + O\_H2_3 \cdot W3_{34} + O\_H2_4 \cdot W3_{44} + O\_H2_5 \cdot W3_{54} + B3_4   \\
    \\
    \\
    \begin{align*}
        \text{So these terms are 0 (Zero)} \\
    \end{align*}
    \\
    \frac{\partial loss}{\partial W3_{11}} = \frac{\partial mse}{\partial \hat{y_1}} \cdot \frac{\partial \hat{y_1}}{\partial I\_OL_1} \cdot \frac{\partial I\_OL_1}{\partial W3_{11}} + 
                                             \frac{\partial mse}{\partial \hat{y_2}} \cdot \frac{\partial \hat{y_2}}{\partial I\_OL_2} \cdot \color{red} \frac{\partial I\_OL_2}{\partial W3_{11}} \color{white} + 
                                             \frac{\partial mse}{\partial \hat{y_3}} \cdot \frac{\partial \hat{y_3}}{\partial I\_OL_3} \cdot \color{red} \frac{\partial I\_OL_3}{\partial W3_{11}} \color{white} + 
                                             \frac{\partial mse}{\partial \hat{y_4}} \cdot \frac{\partial \hat{y_4}}{\partial I\_OL_4} \cdot \color{red} \frac{\partial I\_OL_4}{\partial W3_{11}} \color{white}\\
    \\
    \begin{align*}
        \text{We have} \\
    \end{align*}
    \\
    \frac{\partial loss}{\partial W3_{11}} = \frac{\partial mse}{\partial \hat{y_1}} \cdot \frac{\partial \hat{y_1}}{\partial I\_OL_1} \cdot \frac{\partial I\_OL_1}{\partial W3_{11}} \\
    \\
    \begin{align*}
        \text{or} \\
    \end{align*}
    \\
    \frac{\partial loss}{\partial W3_{11}} = \frac{\partial mse}{\partial \hat{y_1}} \cdot \frac{\partial \hat{y_1}}{\partial I\_OL_1} \cdot O\_H2_1 \\
    \\
    \begin{align*}
        \text{Like this, we can find every term in grad\_w3} \\
    \end{align*}
    \\
    \newcommand{\arraystretch}{2}
    grad\_W3 = 
    \begin{bmatrix*}
        \frac{\partial mse}{\partial \hat{y_1}} \cdot \frac{\partial \hat{y_1}}{\partial I\_OL_1} \cdot O\_H2_1 
            & \frac{\partial mse}{\partial \hat{y_1}} \cdot \frac{\partial \hat{y_1}}{\partial I\_OL_1} \cdot O\_H2_2 
            & \frac{\partial mse}{\partial \hat{y_1}} \cdot \frac{\partial \hat{y_1}}{\partial I\_OL_1} \cdot O\_H2_3 
            & \frac{\partial mse}{\partial \hat{y_1}} \cdot \frac{\partial \hat{y_1}}{\partial I\_OL_1} \cdot O\_H2_4 
            & \frac{\partial mse}{\partial \hat{y_1}} \cdot \frac{\partial \hat{y_1}}{\partial I\_OL_1} \cdot O\_H2_5 \\
        \frac{\partial mse}{\partial \hat{y_2}} \cdot \frac{\partial \hat{y_2}}{\partial I\_OL_2} \cdot O\_H2_1 
            & \frac{\partial mse}{\partial \hat{y_2}} \cdot \frac{\partial \hat{y_2}}{\partial I\_OL_2} \cdot O\_H2_2 
            & \frac{\partial mse}{\partial \hat{y_2}} \cdot \frac{\partial \hat{y_2}}{\partial I\_OL_2} \cdot O\_H2_3 
            & \frac{\partial mse}{\partial \hat{y_2}} \cdot \frac{\partial \hat{y_2}}{\partial I\_OL_2} \cdot O\_H2_4 
            & \frac{\partial mse}{\partial \hat{y_2}} \cdot \frac{\partial \hat{y_2}}{\partial I\_OL_2} \cdot O\_H2_5 \\
        \frac{\partial mse}{\partial \hat{y_3}} \cdot \frac{\partial \hat{y_3}}{\partial I\_OL_3} \cdot O\_H2_1 
            & \frac{\partial mse}{\partial \hat{y_3}} \cdot \frac{\partial \hat{y_3}}{\partial I\_OL_3} \cdot O\_H2_2 
            & \frac{\partial mse}{\partial \hat{y_3}} \cdot \frac{\partial \hat{y_3}}{\partial I\_OL_3} \cdot O\_H2_3 
            & \frac{\partial mse}{\partial \hat{y_3}} \cdot \frac{\partial \hat{y_3}}{\partial I\_OL_3} \cdot O\_H2_4 
            & \frac{\partial mse}{\partial \hat{y_3}} \cdot \frac{\partial \hat{y_3}}{\partial I\_OL_3} \cdot O\_H2_5 \\
        \frac{\partial mse}{\partial \hat{y_4}} \cdot \frac{\partial \hat{y_4}}{\partial I\_OL_4} \cdot O\_H2_1 
            & \frac{\partial mse}{\partial \hat{y_4}} \cdot \frac{\partial \hat{y_4}}{\partial I\_OL_4} \cdot O\_H2_2 
            & \frac{\partial mse}{\partial \hat{y_4}} \cdot \frac{\partial \hat{y_4}}{\partial I\_OL_4} \cdot O\_H2_3 
            & \frac{\partial mse}{\partial \hat{y_4}} \cdot \frac{\partial \hat{y_4}}{\partial I\_OL_4} \cdot O\_H2_4
            & \frac{\partial mse}{\partial \hat{y_4}} \cdot \frac{\partial \hat{y_4}}{\partial I\_OL_4} \cdot O\_H2_5 \\ 
    \end{bmatrix*} \\
    \\
    \begin{align*}
        \text{We can reduce it like this} \\
    \end{align*}
    \\
    \newcommand{\arraystretch}{2}
    grad\_W3 = 
    \begin{bmatrix*}
        \frac{\partial mse}{\partial \hat{y_1}} \cdot \frac{\partial \hat{y_1}}{\partial I\_OL_1} \\
        \frac{\partial mse}{\partial \hat{y_2}} \cdot \frac{\partial \hat{y_2}}{\partial I\_OL_2} \\
        \frac{\partial mse}{\partial \hat{y_3}} \cdot \frac{\partial \hat{y_3}}{\partial I\_OL_3} \\
        \frac{\partial mse}{\partial \hat{y_4}} \cdot \frac{\partial \hat{y_4}}{\partial I\_OL_4} \\ 
    \end{bmatrix*} \cdot 
    \begin{bmatrix*}
        O\_H2_1 &
        O\_H2_2 &
        O\_H2_3 &
        O\_H2_4 &
        O\_H2_5
    \end{bmatrix*} \\
    \\
    \newcommand{\arraystretch}{2}
    grad\_W3 = 
    \begin{bmatrix*}
        \frac{\partial mse}{\partial \hat{y_1}} \\
        \frac{\partial mse}{\partial \hat{y_2}} \\
        \frac{\partial mse}{\partial \hat{y_3}} \\
        \frac{\partial mse}{\partial \hat{y_4}} \\ 
    \end{bmatrix*} *
    \begin{bmatrix*}
        \frac{\partial \hat{y_1}}{\partial I\_OL_1} \\
        \frac{\partial \hat{y_2}}{\partial I\_OL_2} \\
        \frac{\partial \hat{y_3}}{\partial I\_OL_3} \\
        \frac{\partial \hat{y_4}}{\partial I\_OL_4} \\ 
    \end{bmatrix*} \cdot 
    \begin{bmatrix*}
        O\_H2_1 &
        O\_H2_2 &
        O\_H2_3 &
        O\_H2_4 &
        O\_H2_5
    \end{bmatrix*} \\
    \\
    \begin{align*}
        \text{And, finally, we have} \\
    \end{align*}
    \\
    grad\_W3 = (mse\_grad(y, \hat{y}) * sig\_dash(I\_OL)) \cdot O\_H2^{T}


\end{gather*}

<IPython.core.display.Latex object>

In a single line, we have calculated all the gradients for weights in **w3**. Backpropagation is very simple when done in an organized fashion.
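
As a sanity check on this result, the sketch below rebuilds the same network and compares the one-line formula for `grad_w3` with a finite-difference estimate obtained by nudging each weight in `w3` separately. This check is not part of the original notebook; the `forward` helper and the step size `eps` are assumptions introduced for illustration.

```python
import numpy as np

np.random.seed(42)

def relu(x, leak=0):
    return np.where(x <= 0, leak * x, x)

def sig(x):
    return 1 / (1 + np.exp(-x))

def sig_dash(x):
    return sig(x) * (1 - sig(x))

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred)**2)

def mse_grad(y_true, y_pred):
    return -2 * (y_true - y_pred) / y_true.shape[0]

# Same layer sizes as in the post: 5, 3, 5 and 4 nodes
x = np.random.randint(1, 100, size=(5, 1)) / 100
y = np.random.randint(1, 100, size=(4, 1)) / 100
w1 = np.random.random(size=(3, 5)); b1 = np.zeros((3, 1))
w2 = np.random.random(size=(5, 3)); b2 = np.zeros((5, 1))
w3 = np.random.random(size=(4, 5)); b3 = np.zeros((4, 1))

def forward(w3_, b3_):
    """Forward feed with a given w3 and b3 (hypothetical helper)."""
    out_hidden_1 = relu(w1.dot(x) + b1, leak=0.1)
    out_hidden_2 = sig(w2.dot(out_hidden_1) + b2)
    in_output_layer = w3_.dot(out_hidden_2) + b3_
    return out_hidden_2, in_output_layer, sig(in_output_layer)

# Analytical gradient from the one-line formula
out_hidden_2, in_output_layer, y_hat = forward(w3, b3)
grad_w3 = (mse_grad(y, y_hat) * sig_dash(in_output_layer)).dot(out_hidden_2.T)

# Numerical gradient: nudge one weight at a time and watch the loss
eps = 1e-6
grad_w3_numeric = np.zeros_like(w3)
for i in range(w3.shape[0]):
    for j in range(w3.shape[1]):
        bump = np.zeros_like(w3)
        bump[i, j] = eps
        loss_plus = mse(y, forward(w3 + bump, b3)[2])
        loss_minus = mse(y, forward(w3 - bump, b3)[2])
        grad_w3_numeric[i, j] = (loss_plus - loss_minus) / (2 * eps)

print(np.allclose(grad_w3, grad_w3_numeric, atol=1e-8))
```

The numerical version needs two forward feeds per weight, while the Backpropagation formula gives the whole matrix at once.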

Similarly, we can calculate the gradients for **b3**.

In [53]:
%%latex
\begin{gather*}
    B3_1 \mathrel{+}= - lr * \frac{\partial mse}{\partial B3_{1}} \\
    \begin{align*}
        \\
        \text{Or, for every bias in b3, we can do this}
        \\
    \end{align*} \\
    \newcommand{\arraystretch}{2}
    \begin{bmatrix*}
        B3_1 \\
        B3_2 \\
        B3_3 \\
        B3_4 
    \end{bmatrix*} \mathrel{+}= - lr * 
    \begin{bmatrix*}
        \frac{\partial loss}{\partial B3_1} \\
        \frac{\partial loss}{\partial B3_2} \\
        \frac{\partial loss}{\partial B3_3} \\
        \frac{\partial loss}{\partial B3_4}
    \end{bmatrix*} \\
    \begin{align*}
        \\
        \text{It can be rearranged as,} 
        \\
    \end{align*} \\
    B3 \mathrel{+}= update\_B3 \\
    update\_B3 = - lr * grad\_B3 \\
    \begin{align*}
        \\
        \text{where,} 
        \\
    \end{align*} \\
    grad\_B3 = 
    \newcommand{\arraystretch}{2}
    \begin{bmatrix*}
        \frac{\partial loss}{\partial B3_1} \\
        \frac{\partial loss}{\partial B3_2} \\
        \frac{\partial loss}{\partial B3_3} \\
        \frac{\partial loss}{\partial B3_4}
    \end{bmatrix*} \\
    \begin{align*}
        \\
        \text{Let us start by finding the first term, i.e.,} 
        \\
    \end{align*} \\
    \frac{\partial loss}{\partial B3_1}  \\
    \begin{align*}
        \\
        \text{We know that,} 
        \\
    \end{align*} \\
    loss = mse \\
    \\
    \begin{align*}
        \text{and}
    \end{align*} \\
    \\
    mse = f(\hat{y_1}, \hat{y_2}, \hat{y_3}, \hat{y_4}) \\
        \\
    \begin{align*}
        \text{So we can write}
    \end{align*}
    \\
    \frac{\partial loss}{\partial B3_1} = \frac{\partial mse}{\partial B3_1} \\
    \\
    \frac{\partial loss}{\partial B3_1} = \frac{\partial mse}{\partial \hat{y_1}} \cdot \frac{\partial \hat{y_1}}{\partial B3_1} + 
                                             \frac{\partial mse}{\partial \hat{y_2}} \cdot \frac{\partial \hat{y_2}}{\partial B3_1} +  
                                             \frac{\partial mse}{\partial \hat{y_3}} \cdot \frac{\partial \hat{y_3}}{\partial B3_1} +  
                                             \frac{\partial mse}{\partial \hat{y_4}} \cdot \frac{\partial \hat{y_4}}{\partial B3_1} \\

    \\
    \begin{align*}
        \text{We also know that,} \\
    \end{align*}
    \\
    \hat{y_1} = f(I\_OL_1) \\
    \hat{y_2} = f(I\_OL_2) \\
    \hat{y_3} = f(I\_OL_3) \\
    \hat{y_4} = f(I\_OL_4) \\
    \\
    \begin{align*}
        \text{So, we can write,} \\
    \end{align*}
    \\
    \frac{\partial loss}{\partial B3_1} = \frac{\partial mse}{\partial \hat{y_1}} \cdot \frac{\partial \hat{y_1}}{\partial I\_OL_1} \cdot \frac{\partial I\_OL_1}{\partial B3_1} + 
                                             \frac{\partial mse}{\partial \hat{y_2}} \cdot \frac{\partial \hat{y_2}}{\partial I\_OL_2} \cdot \frac{\partial I\_OL_2}{\partial B3_1} + 
                                             \frac{\partial mse}{\partial \hat{y_3}} \cdot \frac{\partial \hat{y_3}}{\partial I\_OL_3} \cdot \frac{\partial I\_OL_3}{\partial B3_1} + 
                                             \frac{\partial mse}{\partial \hat{y_4}} \cdot \frac{\partial \hat{y_4}}{\partial I\_OL_4} \cdot \frac{\partial I\_OL_4}{\partial B3_1} \\

    \\
    \begin{align*}
        \text{We also know that,} \\
    \end{align*}
    \\
    \\
    I\_OL_1 = O\_H2_1 \cdot W3_{11} + O\_H2_2 \cdot W3_{21} + O\_H2_3 \cdot W3_{31} + O\_H2_4 \cdot W3_{41} + O\_H2_5 \cdot W3_{51} + B3_1  \\
        \\
    I\_OL_2 = O\_H2_1 \cdot W3_{12} + O\_H2_2 \cdot W3_{22} + O\_H2_3 \cdot W3_{32} + O\_H2_4 \cdot W3_{42} + O\_H2_5 \cdot W3_{52} + B3_2  \\
        \\
    I\_OL_3 = O\_H2_1 \cdot W3_{13} + O\_H2_2 \cdot W3_{23} + O\_H2_3 \cdot W3_{33} + O\_H2_4 \cdot W3_{43} + O\_H2_5 \cdot W3_{53} + B3_3  \\
        \\
    I\_OL_4 = O\_H2_1 \cdot W3_{14} + O\_H2_2 \cdot W3_{24} + O\_H2_3 \cdot W3_{34} + O\_H2_4 \cdot W3_{44} + O\_H2_5 \cdot W3_{54} + B3_4   \\
       \\
    \begin{align*}
        \text{So these terms are 0 (Zero)} \\
    \end{align*}
    \\
    \frac{\partial loss}{\partial B3_1} = \frac{\partial mse}{\partial \hat{y_1}} \cdot \frac{\partial \hat{y_1}}{\partial I\_OL_1} \cdot \frac{\partial I\_OL_1}{\partial B3_1} + 
                                             \frac{\partial mse}{\partial \hat{y_2}} \cdot \frac{\partial \hat{y_2}}{\partial I\_OL_2} \cdot \color{red} \frac{\partial I\_OL_2}{\partial B3_1} \color{white} + 
                                             \frac{\partial mse}{\partial \hat{y_3}} \cdot \frac{\partial \hat{y_3}}{\partial I\_OL_3} \cdot \color{red} \frac{\partial I\_OL_3}{\partial B3_1} \color{white} + 
                                             \frac{\partial mse}{\partial \hat{y_4}} \cdot \frac{\partial \hat{y_4}}{\partial I\_OL_4} \cdot \color{red} \frac{\partial I\_OL_4}{\partial B3_1} \color{white} \\
    \\
    \begin{align*}
        \text{We have} \\
    \end{align*}
    \\
    \frac{\partial loss}{\partial B3_1} = \frac{\partial mse}{\partial \hat{y_1}} \cdot \frac{\partial \hat{y_1}}{\partial I\_OL_1} \cdot \frac{\partial I\_OL_1}{\partial B3_1}
    \\
    \begin{align*}
        \text{or} \\
    \end{align*}
    \\
    \frac{\partial loss}{\partial B3_1} = \frac{\partial mse}{\partial \hat{y_1}} \cdot \frac{\partial \hat{y_1}}{\partial I\_OL_1}  \color{red} Why ? \color{white} \\
    \\
    \begin{align*}
        \text{Like this, we can find every term in grad\_b3.} \\
    \end{align*}
    \\
    \newcommand{\arraystretch}{2}
    grad\_B3 = 
    \begin{bmatrix*}
        \frac{\partial mse}{\partial \hat{y_1}} \cdot \frac{\partial \hat{y_1}}{\partial I\_OL_1} \\
        \frac{\partial mse}{\partial \hat{y_2}} \cdot \frac{\partial \hat{y_2}}{\partial I\_OL_2} \\
        \frac{\partial mse}{\partial \hat{y_3}} \cdot \frac{\partial \hat{y_3}}{\partial I\_OL_3} \\
        \frac{\partial mse}{\partial \hat{y_4}} \cdot \frac{\partial \hat{y_4}}{\partial I\_OL_4} \\ 
    \end{bmatrix*}
    \\
    \begin{align*}
        \text{We can reduce it like this} \\
    \end{align*}
    \\
    \newcommand{\arraystretch}{2}
    grad\_B3 = 
    \begin{bmatrix*}
        \frac{\partial mse}{\partial \hat{y_1}} \\
        \frac{\partial mse}{\partial \hat{y_2}} \\
        \frac{\partial mse}{\partial \hat{y_3}} \\
        \frac{\partial mse}{\partial \hat{y_4}} \\ 
    \end{bmatrix*} *
    \begin{bmatrix*}
        \frac{\partial \hat{y_1}}{\partial I\_OL_1} \\
        \frac{\partial \hat{y_2}}{\partial I\_OL_2} \\
        \frac{\partial \hat{y_3}}{\partial I\_OL_3} \\
        \frac{\partial \hat{y_4}}{\partial I\_OL_4}
    \end{bmatrix*} \\
    \\
    \begin{align*}
        \text{And, finally, we have} \\
    \end{align*} 
    \\
    grad\_B3 = mse\_grad(y, \hat{y}) * sig\_dash(I\_OL) \\
\end{gather*}

<IPython.core.display.Latex object>

Again, in a single line, we have calculated the gradients for b3.
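
The ‘Why ?’ in the derivation above has a short answer, written out here as a worked step in the same notation. The bias B3_1 appears only in I_OL_1, and it appears there as a plain additive term, so its partial derivative is 1 and the other three are 0.

```latex
\begin{gather*}
    I\_OL_1 = O\_H2_1 \cdot W3_{11} + O\_H2_2 \cdot W3_{21} + O\_H2_3 \cdot W3_{31} + O\_H2_4 \cdot W3_{41} + O\_H2_5 \cdot W3_{51} + B3_1 \\
    \\
    \frac{\partial I\_OL_1}{\partial B3_1} = 1, \qquad
    \frac{\partial I\_OL_2}{\partial B3_1} = \frac{\partial I\_OL_3}{\partial B3_1} = \frac{\partial I\_OL_4}{\partial B3_1} = 0 \\
    \\
    \frac{\partial loss}{\partial B3_1} = \frac{\partial mse}{\partial \hat{y_1}} \cdot \frac{\partial \hat{y_1}}{\partial I\_OL_1} \cdot 1
\end{gather*}
```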

In [56]:
# -------------------------- Gradient Calculation via Backpropagation ------------------------------ #

grad_w3 = (mse_grad(y, y_hat) * sig_dash(in_output_layer)).dot(out_hidden_2.T)  # grad_w3
               
grad_b3 = mse_grad(y, y_hat) * sig_dash(in_output_layer) # grad_b3
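
A note on operator precedence: in Python, a method call such as `.dot(...)` binds tighter than `*`, so `a * b.dot(c.T)` is evaluated as `a * (b.dot(c.T))` and not as `(a * b).dot(c.T)`. For column vectors the two groupings happen to give the same matrix because of NumPy broadcasting. The sketch below checks this with stand-in arrays; the names `a`, `b` and `c` are placeholders for illustration, not variables from the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.random(size=(4, 1))     # stands in for mse_grad(y, y_hat)
b = rng.random(size=(4, 1))     # stands in for sig_dash(in_output_layer)
c = rng.random(size=(5, 1))     # stands in for out_hidden_2

# Grouping used in the formula: element-wise product first, then the dot product
grouped_as_formula = (a * b).dot(c.T)

# Grouping Python uses without parentheses: dot product first,
# then 'a' is broadcast across the columns of the (4, 5) result
grouped_by_python = a * b.dot(c.T)

print(grouped_as_formula.shape, grouped_by_python.shape)
print(np.allclose(grouped_as_formula, grouped_by_python))
```

Writing the parentheses explicitly keeps the code in line with the formula and avoids relying on broadcasting to get the right answer.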

**We can develop a trick via a game we will call ‘Jumping Back’.**

Suppose, we start from true value ‘y’

![jump back game](image-33.png)

Now we jump back and we notice that we have crossed the loss line, so, now we have the loss gradient in the gradient variables.

![jump back game, step 2](image-35.png)

In [59]:
%%latex
\begin{gather*}
    grad = mse\_grad(y, \hat{y})
\end{gather*}

<IPython.core.display.Latex object>

As we jump back again, we now cross the activation function line, so, in the gradient variables, we will have the activation function derivative.

![2 step](image-37.png)

In [61]:
%%latex
\begin{gather*}
    grad = mse\_grad(y, \hat{y}) * sig\_dash(I\_OL)
\end{gather*}

<IPython.core.display.Latex object>

Now, we have reached the w3 weights and b3 biases, so, gradients till now will be used to update b3

In [62]:
%%latex
\begin{gather*}
    grad\_B3 = mse\_grad(y, \hat{y}) * sig\_dash(I\_OL)
\end{gather*}

<IPython.core.display.Latex object>

And, for weights w3, we will have a dot product with the transpose of whatever we have on the other end of the weights.

In [63]:
%%latex
\begin{gather*}
    grad\_W3 = (mse\_grad(y, \hat{y}) * sig\_dash(I\_OL)) \cdot O\_H2^{T}
\end{gather*}

<IPython.core.display.Latex object>

Once again, we can see the gradients in Python

In [64]:
grad_w3 = (mse_grad(y, y_hat) * sig_dash(in_output_layer)).dot(out_hidden_2.T) # grad_w3 
               
grad_b3 = mse_grad(y, y_hat) * sig_dash(in_output_layer) # grad_b3
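
The rules of the game so far can be sketched as one small function. This is only an illustration under stated assumptions: the helper name `jump_back` and its arguments are hypothetical, and the input values are random stand-ins rather than the values from the notebook.

```python
import numpy as np

def sig(x):
    return 1 / (1 + np.exp(-x))

def sig_dash(x):
    return sig(x) * (1 - sig(x))

def mse_grad(y_true, y_pred):
    return -2 * (y_true - y_pred) / y_true.shape[0]

def jump_back(grad, layer_input, activation_dash, previous_output):
    """One jump back over an activation line and a weight layer (hypothetical helper)."""
    grad = grad * activation_dash(layer_input)   # crossing the activation function line
    grad_b = grad                                # the gradient so far belongs to the biases
    grad_w = grad.dot(previous_output.T)         # dot with the transpose of the other end
    return grad_w, grad_b

rng = np.random.default_rng(1)                   # random stand-in values
out_hidden_2 = rng.random(size=(5, 1))
in_output_layer = rng.normal(size=(4, 1))
y = rng.random(size=(4, 1))
y_hat = sig(in_output_layer)

loss_grad = mse_grad(y, y_hat)                   # crossing the loss line
grad_w3, grad_b3 = jump_back(loss_grad, in_output_layer, sig_dash, out_hidden_2)

print(grad_w3.shape, grad_b3.shape)
```

The same function could be reused for earlier layers once the gradient arriving at that layer's output is known.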

Now, let us talk about updating weights and biases in w2 and b2.

Suppose we can sum up all the gradients that reach the output of the second hidden layer (that is, everything up to w3 and b3) into an array of shape (-1, 1). Then we are in exactly the same situation as we were right after jumping over the loss line.

We will call those gradients ‘error_grad_upto_H2’.

![error_grad_upto_H2](image-38.png)

In [70]:
%%latex
\begin{gather*}
    grad = error\_grad\_upto\_H2
\end{gather*}

<IPython.core.display.Latex object>

As we jump back, we cross the activation function line, so, in the gradient variables, we will have the activation function derivative.

![jump again](image-40.png)

In [69]:
%%latex
\begin{gather*}
    grad = error\_grad\_upto\_H2 * sig\_dash(I\_H2)
\end{gather*}

<IPython.core.display.Latex object>

As you can see, we have reached w2 and b2, so the gradients till now will be used to update b2.

In [71]:
%%latex
\begin{gather*}
    grad\_B2 = error\_grad\_upto\_H2 * sig\_dash(I\_H2)
\end{gather*}

<IPython.core.display.Latex object>

And, for weights w2, we will have a dot product with the transpose of whatever we have on the other end of weights w2.

In [72]:
%%latex
\begin{gather*}
    grad\_W2 = (error\_grad\_upto\_H2 * sig\_dash(I\_H2)) \cdot O\_H1 ^ {T}
\end{gather*}

<IPython.core.display.Latex object>
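
The text up to this point does not yet show how `error_grad_upto_H2` is computed. The sketch below assumes the standard chain-rule expression, `w3.T.dot(mse_grad(y, y_hat) * sig_dash(in_output_layer))`, and then checks the resulting `grad_w2` against a finite-difference estimate. The `loss_for` helper and the step size `eps` are assumptions introduced for this check only.

```python
import numpy as np

np.random.seed(42)

def relu(x, leak=0):
    return np.where(x <= 0, leak * x, x)

def sig(x):
    return 1 / (1 + np.exp(-x))

def sig_dash(x):
    return sig(x) * (1 - sig(x))

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred)**2)

def mse_grad(y_true, y_pred):
    return -2 * (y_true - y_pred) / y_true.shape[0]

x = np.random.randint(1, 100, size=(5, 1)) / 100
y = np.random.randint(1, 100, size=(4, 1)) / 100
w1 = np.random.random(size=(3, 5)); b1 = np.zeros((3, 1))
w2 = np.random.random(size=(5, 3)); b2 = np.zeros((5, 1))
w3 = np.random.random(size=(4, 5)); b3 = np.zeros((4, 1))

# Forward feed
in_hidden_1 = w1.dot(x) + b1
out_hidden_1 = relu(in_hidden_1, leak=0.1)
in_hidden_2 = w2.dot(out_hidden_1) + b2
out_hidden_2 = sig(in_hidden_2)
in_output_layer = w3.dot(out_hidden_2) + b3
y_hat = sig(in_output_layer)

# ASSUMPTION: gradient summed back to the output of hidden layer 2 (chain rule)
grad_output = mse_grad(y, y_hat) * sig_dash(in_output_layer)
error_grad_upto_H2 = w3.T.dot(grad_output)               # shape (5, 1)

# Jumping back over the sigmoid line of hidden layer 2
grad_b2 = error_grad_upto_H2 * sig_dash(in_hidden_2)
grad_w2 = grad_b2.dot(out_hidden_1.T)                    # shape (5, 3)

def loss_for(w2_):
    """Loss as a function of w2 only (hypothetical helper)."""
    out_h2 = sig(w2_.dot(out_hidden_1) + b2)
    return mse(y, sig(w3.dot(out_h2) + b3))

# Numerical gradient: nudge one weight of w2 at a time
eps = 1e-6
grad_w2_numeric = np.zeros_like(w2)
for i in range(w2.shape[0]):
    for j in range(w2.shape[1]):
        bump = np.zeros_like(w2)
        bump[i, j] = eps
        grad_w2_numeric[i, j] = (loss_for(w2 + bump) - loss_for(w2 - bump)) / (2 * eps)

print(np.allclose(grad_w2, grad_w2_numeric, atol=1e-8))
```

The multiplication by `w3.T` is what performs the sum over the four output nodes, which gives an array with one entry per node of the second hidden layer.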

Now let us try to do this in an analytical way.

The trick is the same. If we can calculate

In [74]:
%%latex
\begin{gather*}
    \frac{\partial loss}{\partial W2_{11}}
\end{gather*}

<IPython.core.display.Latex object>

Then, we can update it with SGD like this

In [76]:
%%latex
\begin{gather*}
    W2_{11} \mathrel{+}= - lr * \frac{\partial loss}{\partial W2_{11}}
\end{gather*}

<IPython.core.display.Latex object>

In [None]:
%%latex
\begin{gather*}
    grad\_B2 = error\_grad\_upto\_H2 * sig\_dash(I\_H2)
\end{gather*}