## About the network

We will use the following network to derive the equations of the feed-forward step and the backpropagation step

![alt text](images/network.png "Title")


1. It is a 3-layer network with 1 input layer, 1 hidden layer and 1 output layer.
2. The activation function at each node is the sigmoid $\sigma$.
3. The total error $E_t$ is the sum of the individual errors $E_1$ and $E_2$ from the output nodes $o_1$ and $o_2$.
4. $W_1,W_2,W_3,W_4,W_5,W_6,W_7,W_8$ are the parameters that need to be trained (we assume there is no bias).
5. The error function used is the L2 error.


## Forward propagation equations

Forward propagation is how neural networks make predictions. Input data is __forward propagated__ through the network layer by layer to the final layer, which outputs a prediction.


We need to find out the equations for the given network

* There are two input nodes, $i_1$ and $i_2$.
* We need to calculate the values of the network at nodes $h_1,a\_h_1,h_2,a\_h_2,o_1,a\_o_1,o_2,a\_o_2,E_1,E_2,E_t$.

$\Rightarrow$ The value of any node is calculated as the weighted sum of the connections it receives, e.g.

$h_1=i_1*W_1 + i_2*W_2 $


$\Rightarrow$ The node value is then passed through an activation function, which brings non-linearity into the network. The main idea is to decide when the neuron fires. In the given network we are working with sigmoid neurons, so the value of $h_1$ is passed through the sigmoid function to get the activated value

$a\_h_1=\sigma(h_1)$

Now, $\sigma(x)=\frac{1}{1+e^{-x}} $

so $a\_h_1= \sigma(h_1)= \frac{1}{1+e^{-h_1}} $

similarly we can say,

$h_2=i_1*W_3 + i_2*W_4 $

$a\_h_2= \sigma(h_2)= \frac{1}{1+e^{-h_2}} $
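As a quick illustration, the hidden-layer equations can be sketched in Python. The input and weight values below are arbitrary numbers chosen for the example (they are assumptions, not part of the network description above).

```python
import math

def sigmoid(x: float) -> float:
    """Logistic sigmoid: 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

# Illustrative values only (assumptions, not taken from the text).
i1, i2 = 0.05, 0.10
W1, W2, W3, W4 = 0.15, 0.20, 0.25, 0.30

# Weighted sums arriving at the two hidden nodes.
h1 = i1 * W1 + i2 * W2
h2 = i1 * W3 + i2 * W4

# Activated values of the hidden nodes.
a_h1 = sigmoid(h1)
a_h2 = sigmoid(h2)

print(f"h1={h1:.4f}  a_h1={a_h1:.4f}")
print(f"h2={h2:.4f}  a_h2={a_h2:.4f}")
```

Since the sigmoid maps any real number into the interval (0, 1), both activated values are guaranteed to lie in that range.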

Now the node $o_1$ receives the activated value of $h_1$, which is $a\_h_1$, and the activated value of $h_2$, which is $a\_h_2$

so 
$o_1=a\_h_1*W_5 + a\_h_2*W_6 $

$a\_o_1= \sigma(o_1)= \frac{1}{1+e^{-o_1}} $


similarly

$o_2=a\_h_1*W_7 + a\_h_2*W_8 $

$a\_o_2= \sigma(o_2)= \frac{1}{1+e^{-o_2}} $

$\Rightarrow$ Now we need to calculate the error terms $E_1$ and $E_2$

In this case we are using the L2 error, which is defined as 
$ 1/2*(actual\_value - predicted\_value)^2 $

Here $t_1$ and $t_2$ are the actual (target) values that should be obtained from the two output nodes $o_1$ and $o_2$.

The predicted value is the activated output of the $o_1$ and $o_2$ nodes, i.e. $a\_o_1$ and $a\_o_2$. Because the difference is squared, the order of the two terms does not matter

so

$E_1=1/2*(a\_o_1-t_1)^2 $

$E_2=1/2*(a\_o_2-t_2)^2 $

The total error is the sum of $E_1$ and $E_2$

$E_t=E_1+E_2$



Hence the final set of equations for the feed forward step are:
1. $h_1=i_1*W_1 + i_2*W_2 $
2. $a\_h_1= \sigma(h_1)= \frac{1}{1+e^{-h_1}} $
3. $h_2=i_1*W_3 + i_2*W_4 $
4. $a\_h_2= \sigma(h_2)= \frac{1}{1+e^{-h_2}} $
5. $o_1=a\_h_1*W_5 + a\_h_2*W_6 $
6. $a\_o_1= \sigma(o_1)= \frac{1}{1+e^{-o_1}} $
7. $o_2=a\_h_1*W_7 + a\_h_2*W_8 $
8. $a\_o_2= \sigma(o_2)= \frac{1}{1+e^{-o_2}} $
9. $E_1=1/2*(a\_o_1-t_1)^2 $
10. $E_2=1/2*(a\_o_2-t_2)^2 $
11. $E_t=E_1+E_2$
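
The feed-forward equations listed above translate almost line by line into code. The sketch below is one possible way to write them; the function name `forward`, the dictionary it returns, and the sample inputs, weights and targets are all assumptions made for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(i1, i2, W, t1, t2):
    """Feed-forward step; W is the list [W1, ..., W8], no bias terms."""
    W1, W2, W3, W4, W5, W6, W7, W8 = W
    h1 = i1 * W1 + i2 * W2
    a_h1 = sigmoid(h1)
    h2 = i1 * W3 + i2 * W4
    a_h2 = sigmoid(h2)
    o1 = a_h1 * W5 + a_h2 * W6
    a_o1 = sigmoid(o1)
    o2 = a_h1 * W7 + a_h2 * W8
    a_o2 = sigmoid(o2)
    E1 = 0.5 * (a_o1 - t1) ** 2
    E2 = 0.5 * (a_o2 - t2) ** 2
    return {
        "h1": h1, "a_h1": a_h1, "h2": h2, "a_h2": a_h2,
        "o1": o1, "a_o1": a_o1, "o2": o2, "a_o2": a_o2,
        "E1": E1, "E2": E2, "E_t": E1 + E2,
    }

# Illustrative values only (assumptions, not taken from the text).
W = [0.15, 0.20, 0.25, 0.30, 0.40, 0.45, 0.50, 0.55]
result = forward(0.05, 0.10, W, t1=0.01, t2=0.99)
for name, value in result.items():
    print(f"{name:5s} = {value:.6f}")
```

Returning every intermediate value is a deliberate choice here: the backprop step needs $a\_h_1, a\_h_2, a\_o_1, a\_o_2$ again, so it is convenient to keep them around.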





## Backpropagation Formula

Backprop is the step where we find out how the value of each parameter should change to reduce the error. The change is found using the partial derivative of the loss with respect to each parameter, i.e. $\frac{\partial(Loss)}{\partial(parameter)}$


For the backprop we need to calculate the following:
* $ \frac{\partial(E_t)}{\partial(W_1) }$

* $ \frac{\partial(E_t)}{\partial(W_2) }$

* $ \frac{\partial(E_t)}{\partial(W_3)} $

* $ \frac{\partial(E_t)}{\partial(W_4)} $

* $ \frac{\partial(E_t)}{\partial(W_5)} $

* $ \frac{\partial(E_t)}{\partial(W_6)} $

* $ \frac{\partial(E_t)}{\partial(W_7)} $

* $ \frac{\partial(E_t)}{\partial(W_8)} $



$ \blacktriangleright $ Now there are 2 types of parameters:
1. Those that are connected to the output nodes $a\_o_1$ and $a\_o_2$ => $W_5,W_6,W_7,W_8$
2. Those that connect the input layer to the hidden units => $W_1,W_2,W_3,W_4$

We will see that the generic formula is a little different for the two


$ \blacktriangleright $ First let's see the formula for the parameters between the hidden layer and the output layer, which are 
$\frac{\partial(E_t)}{\partial(W_5) },\frac{\partial(E_t)}{\partial(W_6) },\frac{\partial(E_t)}{\partial(W_7)},\frac{\partial(E_t)}{\partial(W_8) }$

I will take up $\frac{\partial(E_t)}{\partial(W_5) }$ in detail; the others follow a similar pattern


$ \frac{\partial(E_t)}{\partial(W_5)} = \frac{\partial(E_1+E_2) }{ \partial(W_5)} = \frac{\partial(E_1)}{\partial(W_5)} + \frac{\partial(E_2)}{\partial(W_5)} $

  Now $ \frac{\partial(E_2)}{\partial(W_5)} =0 $ as $W_5$ is not in the path of $E_2$
  so 

  $ \frac{\partial(E_1)}{\partial(W_5)} + \frac{\partial(E_2)}{\partial(W_5)} = \frac{\partial(E_1)}{\partial(W_5)} $

  In the backprop we trace the path that leads from the output back to the parameter. In the case of $W_5$ we follow this chain to reach $W_5$:
  
   **$  E_t \rightarrow  E_1 \rightarrow a\_o_1 \rightarrow o_1 \leadsto a\_h_1$**  
   
   so by the chain rule ,

  $ \frac{\partial(E_1)}{\partial(W_5)} = \frac{\partial(E_1)}{\partial(a\_o_1)} * \frac{\partial(a\_o_1)}{\partial(o_1)}* \frac{\partial(o_1)}{\partial(W_5)}$

  Now let's look at the individual derivatives

*   $ \frac{\partial(E_1)}{\partial(a\_o_1)} = \frac{\partial (1/2(t_1-a\_o_1)^2)}{\partial (a\_o_1)} =(t_1-a\_o_1)*(-1)=a\_o_1-t_1  $ $\Rightarrow $ **this reduces to (predicted - actual)**



* $ \frac{\partial(a\_o_1)}{\partial(o_1)} = \frac{\partial \sigma(o_1)}{\partial(o_1)} = \sigma^\prime(o_1) $ $\Rightarrow$ **basically find out the derivative of the sigmoid function**

  Also the derivative of $\sigma(x) $ is quite simple to calculate
  $\sigma^\prime(o_1)= \sigma(o_1)*(1-\sigma(o_1)) = a\_o_1 *(1-a\_o_1)$ $\Rightarrow$ **ActivatedValue(1-ActivatedValue)**

* $ \frac{\partial(o_1)}{\partial(W_5)} = \frac{\partial(a\_h_1 * W_5 + a\_h_2* W_6)}{\partial(W_5)} = a\_h_1$ $\Rightarrow$ **the activated output that is connected to the weight w.r.t. which we are taking the derivative**

so the $\frac{\partial(E_t)}{\partial(W_5)}$ becomes

  $ \boxed{\frac{\partial(E_t)}{\partial(W_5)} = (a\_o_1-t_1) * a\_o_1*(1-a\_o_1) * a\_h_1  } $
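
One way to convince ourselves that the boxed formula is right is to compare it with a numerical (central finite-difference) estimate of the same derivative. The sketch below does this for $W_5$; the activated hidden values, weights, targets and the step size `eps` are arbitrary assumptions chosen for the example.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Illustrative values only (assumptions, not taken from the text).
# a_h1 and a_h2 do not depend on W5, so they are treated as constants here.
a_h1, a_h2 = 0.51, 0.52
W5, W6, W7, W8 = 0.40, 0.45, 0.50, 0.55
t1, t2 = 0.01, 0.99

def total_error(w5):
    """E_t as a function of W5 only, all other quantities held fixed."""
    a_o1 = sigmoid(a_h1 * w5 + a_h2 * W6)
    a_o2 = sigmoid(a_h1 * W7 + a_h2 * W8)
    return 0.5 * (a_o1 - t1) ** 2 + 0.5 * (a_o2 - t2) ** 2

# Analytic gradient from the chain rule: (a_o1 - t1) * a_o1 * (1 - a_o1) * a_h1
a_o1 = sigmoid(a_h1 * W5 + a_h2 * W6)
analytic = (a_o1 - t1) * a_o1 * (1 - a_o1) * a_h1

# Numerical estimate using a central difference.
eps = 1e-6
numeric = (total_error(W5 + eps) - total_error(W5 - eps)) / (2 * eps)

print(f"analytic = {analytic:.10f}")
print(f"numeric  = {numeric:.10f}")
```

The two numbers should agree to many decimal places; a large gap would indicate a mistake in one of the three chain-rule factors.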


In a similar way, the path for $W_8$ is **$ E_t \rightarrow  E_2 \rightarrow a\_o_2 \rightarrow o_2 \leadsto a\_h_2$** 


$ \frac{\partial(E_t)}{\partial(W_8)}= \frac{\partial(E_2)}{\partial(a\_o_2)} * \frac{\partial(a\_o_2)}{\partial(o_2)}* \frac{\partial(o_2)}{\partial(W_8)}= (a\_o_2 - t_2)* a\_o_2*(1-a\_o_2) * a\_h_2  $



The path for $W_7$ is $  E_t \rightarrow E_2 \rightarrow a\_o_2 \rightarrow o_2  \leadsto a\_h_1 $

$ \frac{\partial(E_t)}{\partial(W_7)} = \frac{\partial(E_2)}{\partial(a\_o_2)} * \frac{\partial(a\_o_2)}{\partial(o_2)}* \frac{\partial(o_2)}{\partial(W_7)}=(a\_o_2 - t_2)* a\_o_2*(1-a\_o_2) * a\_h_1  $


The path for $W_6$ is $ E_t \rightarrow  E_1 \rightarrow a\_o_1 \rightarrow o_1 \leadsto a\_h_2 $

$ \frac{\partial(E_t)}{\partial(W_6)} = \frac{\partial(E_1)}{\partial(a\_o_1)} * \frac{\partial(a\_o_1)}{\partial(o_1)}* \frac{\partial(o_1)}{\partial(W_6)}=(a\_o_1 - t_1)* a\_o_1*(1-a\_o_1) * a\_h_2 $




The above formulation shows that the equations for $W_5$ and $W_6$ are almost the same and differ only in the last term, which is $\frac{\partial(o_1)}{\partial(W_5)} $ and $\frac{\partial(o_1)}{\partial(W_6)} $ respectively. This is because the backprop path is the same from $E_t \leadsto o_1 $. Only the last node is different: $a\_h_1$ and $a\_h_2$ respectively


Similarly the equations for $W_7$ and $W_8$ are almost the same. Only the last terms $\frac{\partial(o_2)}{\partial(W_7)} $ and $\frac{\partial(o_2)}{\partial(W_8)} $ are different

Hence, writing $\frac{\partial(E_t)}{\partial(o_1)} = \frac{\partial(E_1)}{\partial(a\_o_1)} * \frac{\partial(a\_o_1)}{\partial(o_1)}$ and $\frac{\partial(E_t)}{\partial(o_2)} = \frac{\partial(E_2)}{\partial(a\_o_2)} * \frac{\partial(a\_o_2)}{\partial(o_2)}$, the equations for $W_5,W_6,W_7,W_8$ can be rewritten as,


$\frac{\partial(E_t)}{\partial(W_5)}= \frac{\partial(E_t)}{\partial(o_1)} * \frac{\partial(o_1)}{\partial(W_5)} = \frac{\partial(E_t)}{\partial(o_1)} * a\_h_1$

$\frac{\partial(E_t)}{\partial(W_6)}= \frac{\partial(E_t)}{\partial(o_1)} * \frac{\partial(o_1)}{\partial(W_6)}= \frac{\partial(E_t)}{\partial(o_1)} * a\_h_2$

$\frac{\partial(E_t)}{\partial(W_7)}= \frac{\partial(E_t)}{\partial(o_2)} * \frac{\partial(o_2)}{\partial(W_7)}= \frac{\partial(E_t)}{\partial(o_2)} * a\_h_1$

$\frac{\partial(E_t)}{\partial(W_8)}= \frac{\partial(E_t)}{\partial(o_2)} * \frac{\partial(o_2)}{\partial(W_8)} = \frac{\partial(E_t)}{\partial(o_2)} * a\_h_2$
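
The practical benefit of this rewriting is that the shared factors $\frac{\partial(E_t)}{\partial(o_1)}$ and $\frac{\partial(E_t)}{\partial(o_2)}$ need to be computed only once. A minimal sketch follows; the names `delta_o1` and `delta_o2` and all numeric values are assumptions introduced for the example.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Illustrative values only (assumptions, not taken from the text).
a_h1, a_h2 = 0.51, 0.52
W5, W6, W7, W8 = 0.40, 0.45, 0.50, 0.55
t1, t2 = 0.01, 0.99

# Forward pass through the output layer.
a_o1 = sigmoid(a_h1 * W5 + a_h2 * W6)
a_o2 = sigmoid(a_h1 * W7 + a_h2 * W8)

# Shared factors dE_t/do_1 and dE_t/do_2, computed once.
delta_o1 = (a_o1 - t1) * a_o1 * (1 - a_o1)
delta_o2 = (a_o2 - t2) * a_o2 * (1 - a_o2)

# Each weight gradient is the shared factor times the activation feeding that weight.
grad_W5 = delta_o1 * a_h1
grad_W6 = delta_o1 * a_h2
grad_W7 = delta_o2 * a_h1
grad_W8 = delta_o2 * a_h2

print(grad_W5, grad_W6, grad_W7, grad_W8)
```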


$\blacktriangleright $ Let's now see how to calculate the backprop for the weights that are 2 levels from the output layer, i.e. $W_1,W_2,W_3,W_4$


I will calculate 
$ \frac{\partial(E_t)}{\partial(W_1)} $ in detail; the others are calculated similarly.

Now to compute the backprop for $W_1$ the back path is more complex, because $h_1$ feeds both output nodes and so there are two paths

$ E_t \rightarrow E_1  \rightarrow a\_o_1 \rightarrow o_1 \rightarrow a\_h_1 \rightarrow h_1 \leadsto i_1 $

$ E_t \rightarrow E_2  \rightarrow a\_o_2 \rightarrow o_2 \rightarrow a\_h_1 \rightarrow h_1 \leadsto i_1 $

In terms of the partial derivatives, we sum the contributions of the two paths:

$ \frac{\partial(E_t)}{\partial(W_1)}= ( \frac{\partial(E_1)}{\partial(a\_o_1)} * \frac{\partial(a\_o_1)}{\partial(o_1)} * \frac{\partial(o_1)}{\partial(a\_h_1)}*\frac{\partial(a\_h_1)}{\partial(h_1)}*\frac{\partial(h_1)}{\partial(W_1)} ) + (\frac{\partial(E_2)}{\partial(a\_o_2)} * \frac{\partial(a\_o_2)}{\partial(o_2)} * \frac{\partial(o_2)}{\partial(a\_h_1)}*\frac{\partial(a\_h_1)}{\partial(h_1)}*\frac{\partial(h_1)}{\partial(W_1)})$

= $(\frac{\partial(E_1)}{\partial(a\_o_1)} * \frac{\partial(a\_o_1)}{\partial(o_1)} * \frac{\partial(o_1)}{\partial(a\_h_1)} + \frac{\partial(E_2)}{\partial(a\_o_2)} * \frac{\partial(a\_o_2)}{\partial(o_2)} * \frac{\partial(o_2)}{\partial(a\_h_1)}) * \frac{\partial(a\_h_1)}{\partial(h_1)}*\frac{\partial(h_1)}{\partial(W_1)}$

Basically the equation can be written in terms of what we already calculated for $\frac{\partial(E_t)}{\partial(W_5)}$ and $\frac{\partial(E_t)}{\partial(W_7)} $

From the above we already know the terms
* $\frac{\partial(E_1)}{\partial(a\_o_1)}=a\_o_1-t_1$
* $\frac{\partial(a\_o_1)}{\partial(o_1)}=(a\_o_1)*(1-a\_o_1)$
* $\frac{\partial(E_2)}{\partial(a\_o_2)}=a\_o_2-t_2$
* $\frac{\partial(a\_o_2)}{\partial(o_2)}=(a\_o_2)*(1-a\_o_2)$

  Also the terms 

  $ \frac{\partial(o_1)}{\partial(a\_h_1)} = \frac{\partial(a\_h_1 * W_5 + a\_h_2* W_6)}{\partial(a\_h_1)} =W_5 $

  $ \frac{\partial(o_2)}{\partial(a\_h_1)} = \frac{\partial(a\_h_1 * W_7 + a\_h_2* W_8)}{\partial(a\_h_1)}= W_7 $

  are easy to calculate and reduce to the weight of the edge connecting the numerator node and the denominator node

  Now the terms
* $ \frac{\partial(a\_h_1)}{\partial(h_1)} $ is similar to $\frac{\partial(a\_o_1)}{\partial(o_1)} $, which is of the form $ \frac{\partial(\sigma(x))}{\partial(x)} = \sigma(x)(1-\sigma(x))$ where $x$ is a node value like $h_1, h_2, o_1,o_2$

  so $ \frac{\partial(a\_h_1)}{\partial(h_1)} = a\_h_1 * (1- a\_h_1)$

* $ \frac{\partial(h_1)}{\partial(W_1)} $ is similar to $ \frac{\partial(o_1)}{\partial(W_5)} $. In this case we get the value that feeds the weight, i.e. the node previous to the numerator, so
$ \frac{\partial(h_1)}{\partial(W_1)} = i_1 $


Now taking all the terms together

$ \boxed{\frac{\partial(E_t)}{\partial(W_1)}= \{(a\_o_1-t_1) * a\_o_1*(1-a\_o_1) * W_5+(a\_o_2-t_2) * a\_o_2*(1-a\_o_2) * W_7 \}*a\_h_1 * (1- a\_h_1)*i_1 }$ 
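
The boxed expression for $W_1$ can be checked in the same spirit as before, by comparing it with a central finite-difference estimate through the whole network. This is only a sketch: the helper `forward`, the sample inputs, weights, targets and the step size `eps` are assumptions made for the example.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Illustrative values only (assumptions, not taken from the text).
i1, i2 = 0.05, 0.10
t1, t2 = 0.01, 0.99
W = [0.15, 0.20, 0.25, 0.30, 0.40, 0.45, 0.50, 0.55]  # [W1, ..., W8]

def forward(W):
    """Return (a_h1, a_h2, a_o1, a_o2, E_t) for the weight list W."""
    W1, W2, W3, W4, W5, W6, W7, W8 = W
    a_h1 = sigmoid(i1 * W1 + i2 * W2)
    a_h2 = sigmoid(i1 * W3 + i2 * W4)
    a_o1 = sigmoid(a_h1 * W5 + a_h2 * W6)
    a_o2 = sigmoid(a_h1 * W7 + a_h2 * W8)
    E_t = 0.5 * (a_o1 - t1) ** 2 + 0.5 * (a_o2 - t2) ** 2
    return a_h1, a_h2, a_o1, a_o2, E_t

a_h1, a_h2, a_o1, a_o2, E_t = forward(W)
W5, W7 = W[4], W[6]

# Analytic gradient: sum over both paths, then the common tail through h1.
analytic = (
    (a_o1 - t1) * a_o1 * (1 - a_o1) * W5
    + (a_o2 - t2) * a_o2 * (1 - a_o2) * W7
) * a_h1 * (1 - a_h1) * i1

# Numerical estimate: nudge only W1 up and down.
eps = 1e-6
W_plus = [W[0] + eps] + W[1:]
W_minus = [W[0] - eps] + W[1:]
numeric = (forward(W_plus)[4] - forward(W_minus)[4]) / (2 * eps)

print(f"analytic = {analytic:.10f}")
print(f"numeric  = {numeric:.10f}")
```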


Along similar lines we can calculate

$ \frac{\partial(E_t)}{\partial(W_2)} $ which will be the combination of the paths

$ E_t \rightarrow E_1  \rightarrow a\_o_1 \rightarrow o_1 \rightarrow a\_h_1 \rightarrow h_1 \leadsto i_2 $

$ E_t \rightarrow E_2  \rightarrow a\_o_2 \rightarrow o_2 \rightarrow a\_h_1 \rightarrow h_1 \leadsto i_2 $

so the derivative will be written as,


$\frac{\partial(E_t)}{\partial(W_2)} = (\frac{\partial(E_1)}{\partial(a\_o_1)} *\frac{ \partial(a\_o_1)}{\partial(o_1)}*\frac{\partial(o_1)}{\partial(a\_h_1)}+\frac{\partial(E_2)}{\partial(a\_o_2)} * \frac{\partial(a\_o_2)}{\partial(o_2)}*\frac{\partial(o_2)}{\partial(a\_h_1)} )*\frac{\partial(a\_h_1)}{\partial(h_1)} * \frac{\partial(h_1)}{\partial(W_2)} $



on solving the above equations we get, 

$ \frac{\partial(E_t)}{\partial(W_2)} = \{(a\_o_1-t_1) * a\_o_1*(1-a\_o_1) * W_5+(a\_o_2-t_2) * a\_o_2*(1-a\_o_2) * W_7 \}*a\_h_1 * (1- a\_h_1)*i_2$ 





$ \frac{\partial(E_t)}{\partial(W_3)} $ which will be the combination of the paths

$ E_t \rightarrow E_1  \rightarrow a\_o_1 \rightarrow o_1 \rightarrow a\_h_2 \rightarrow h_2 \leadsto i_1 $

$ E_t \rightarrow E_2  \rightarrow a\_o_2 \rightarrow o_2 \rightarrow a\_h_2 \rightarrow h_2 \leadsto i_1 $

so  the path will be written as ,


$\frac{\partial(E_t)}{\partial(W_3)} = (\frac{\partial(E_1)}{\partial(a\_o_1)} *\frac{ \partial(a\_o_1)}{\partial(o_1)}*\frac{\partial(o_1)}{\partial(a\_h_2)}+\frac{\partial(E_2)}{\partial(a\_o_2)} * \frac{\partial(a\_o_2)}{\partial(o_2)}*\frac{\partial(o_2)}{\partial(a\_h_2)} )*\frac{\partial(a\_h_2)}{\partial(h_2)} * \frac{\partial(h_2)}{\partial(W_3)} $

on solving it becomes, 

$ \frac{\partial(E_t)}{\partial(W_3)} = \{(a\_o_1-t_1) * a\_o_1*(1-a\_o_1) * W_6+(a\_o_2-t_2) * a\_o_2*(1-a\_o_2) * W_8 \}*a\_h_2 * (1- a\_h_2)*i_1 $


Finally 

$ \frac{\partial(E_t)}{\partial(W_4)} $ which will be the combination of the paths

$ E_t \rightarrow E_1  \rightarrow a\_o_1 \rightarrow o_1 \rightarrow a\_h_2 \rightarrow h_2 \leadsto i_2 $

$ E_t \rightarrow E_2  \rightarrow a\_o_2 \rightarrow o_2 \rightarrow a\_h_2 \rightarrow h_2 \leadsto i_2 $

so  the path will be written as ,



$\frac{\partial(E_t)}{\partial(W_4)} = (\frac{\partial(E_1)}{\partial(a\_o_1)} *\frac{ \partial(a\_o_1)}{\partial(o_1)}*\frac{\partial(o_1)}{\partial(a\_h_2)}+\frac{\partial(E_2)}{\partial(a\_o_2)} * \frac{\partial(a\_o_2)}{\partial(o_2)}*\frac{\partial(o_2)}{\partial(a\_h_2)} )*\frac{\partial(a\_h_2)}{\partial(h_2)} * \frac{\partial(h_2)}{\partial(W_4)} $

on solving it becomes, 

$\frac{\partial(E_t)}{\partial(W_4)} = \{(a\_o_1-t_1) * a\_o_1*(1-a\_o_1) * W_6+(a\_o_2-t_2) * a\_o_2*(1-a\_o_2) * W_8 \}*a\_h_2 * (1- a\_h_2)*i_2 $




The above formulas show that $ \frac{\partial(E_t)}{\partial(W_3)} $ and $ \frac{\partial(E_t)}{\partial(W_4)} $ are almost the same and differ only in the last term, which is $i_1$ and $i_2$ respectively

Similarly $ \frac{\partial(E_t)}{\partial(W_1)} $ and $ \frac{\partial(E_t)}{\partial(W_2)} $ are the same except for the last term, which is $i_1$ or $i_2$

The reason is that in the backprop graph the path traced up to the hidden layer, $ E_t \leadsto h_1 $, is the same for $W_1$ and $W_2 $, and only the last connection from $h_1 \rightarrow inputlayer$ is different, which ends up as $i_1$ or $i_2$

Similarly the path for 
$W_3$ and  $W_4 $ is the same from $ E_t \leadsto h_2 $. Only the last connection from $h_2 \rightarrow inputlayer$ is different, which ends up as $i_1$ or $i_2$




Hence, the above equations for $W_1,W_2,W_3,W_4$ can be written as 

$\frac{\partial(E_t)}{\partial(W_1)} = \frac{\partial(E_t)}{\partial(h_1)}*i_1 $

$\frac{\partial(E_t)}{\partial(W_2)} = \frac{\partial(E_t)}{\partial(h_1)}*i_2 $

$\frac{\partial(E_t)}{\partial(W_3)} = \frac{\partial(E_t)}{\partial(h_2)}*i_1 $

$\frac{\partial(E_t)}{\partial(W_4)} = \frac{\partial(E_t)}{\partial(h_2)}*i_2 $


where 
$ \frac{\partial(E_t)}{\partial(h_1)} =  \frac{\partial(E_1)}{\partial(a\_o_1)} * \frac{\partial(a\_o_1)}{\partial(o_1)} * \frac{\partial(o_1)}{\partial(a\_h_1)}*\frac{\partial(a\_h_1)}{\partial(h_1)} + \frac{\partial(E_2)}{\partial(a\_o_2)} * \frac{\partial(a\_o_2)}{\partial(o_2)} * \frac{\partial(o_2)}{\partial(a\_h_1)}*\frac{\partial(a\_h_1)}{\partial(h_1)}$


and 

$\frac{\partial(E_t)}{\partial(h_2)}=\frac{\partial(E_1)}{\partial(a\_o_1)} * \frac{\partial(a\_o_1)}{\partial(o_1)}*\frac{\partial(o_1)}{\partial(a\_h_2)}*\frac{\partial(a\_h_2)}{\partial(h_2)}+\frac{\partial(E_2)}{\partial(a\_o_2)} * \frac{\partial(a\_o_2)}{\partial(o_2)}*\frac{\partial(o_2)}{\partial(a\_h_2)}*\frac{\partial(a\_h_2)}{\partial(h_2)} $
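
As with the output layer, the shared factors $\frac{\partial(E_t)}{\partial(h_1)}$ and $\frac{\partial(E_t)}{\partial(h_2)}$ can be computed once and reused for the two weights that feed each hidden node. The sketch below shows this; the names `delta_o1`, `delta_o2`, `delta_h1`, `delta_h2` and all numeric values are assumptions introduced for the example.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Illustrative values only (assumptions, not taken from the text).
i1, i2 = 0.05, 0.10
t1, t2 = 0.01, 0.99
W1, W2, W3, W4 = 0.15, 0.20, 0.25, 0.30
W5, W6, W7, W8 = 0.40, 0.45, 0.50, 0.55

# Forward pass.
a_h1 = sigmoid(i1 * W1 + i2 * W2)
a_h2 = sigmoid(i1 * W3 + i2 * W4)
a_o1 = sigmoid(a_h1 * W5 + a_h2 * W6)
a_o2 = sigmoid(a_h1 * W7 + a_h2 * W8)

# dE_t/do_1 and dE_t/do_2.
delta_o1 = (a_o1 - t1) * a_o1 * (1 - a_o1)
delta_o2 = (a_o2 - t2) * a_o2 * (1 - a_o2)

# dE_t/dh_1 and dE_t/dh_2: sum over both output paths, times sigmoid'(h).
delta_h1 = (delta_o1 * W5 + delta_o2 * W7) * a_h1 * (1 - a_h1)
delta_h2 = (delta_o1 * W6 + delta_o2 * W8) * a_h2 * (1 - a_h2)

# Each weight gradient is the shared factor times the input feeding that weight.
grad_W1 = delta_h1 * i1
grad_W2 = delta_h1 * i2
grad_W3 = delta_h2 * i1
grad_W4 = delta_h2 * i2

print(grad_W1, grad_W2, grad_W3, grad_W4)
```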



### Final Equations
1. $ \frac{\partial(E_t)}{\partial(W_1)}= \{(a\_o_1-t_1) * a\_o_1*(1-a\_o_1) * W_5+(a\_o_2-t_2) * a\_o_2*(1-a\_o_2) * W_7 \}*a\_h_1 * (1- a\_h_1)*i_1$

2. $ \frac{\partial(E_t)}{\partial(W_2)} = \{(a\_o_1-t_1) * a\_o_1*(1-a\_o_1) * W_5+(a\_o_2-t_2) * a\_o_2*(1-a\_o_2) * W_7 \}*a\_h_1 * (1- a\_h_1)*i_2$ 

3. $ \frac{\partial(E_t)}{\partial(W_3)} = \{(a\_o_1-t_1) * a\_o_1*(1-a\_o_1) * W_6+(a\_o_2-t_2) * a\_o_2*(1-a\_o_2) * W_8 \}*a\_h_2 * (1- a\_h_2)*i_1 $

4. $\frac{\partial(E_t)}{\partial(W_4)} = \{(a\_o_1-t_1) * a\_o_1*(1-a\_o_1) * W_6+(a\_o_2-t_2) * a\_o_2*(1-a\_o_2) * W_8 \}*a\_h_2 * (1- a\_h_2)*i_2 $

5. $\frac{\partial(E_t)}{\partial(W_5)} = (a\_o_1-t_1) * a\_o_1*(1-a\_o_1) * a\_h_1 $

6. $ \frac{\partial(E_t)}{\partial(W_6)} = (a\_o_1 - t_1)* a\_o_1*(1-a\_o_1) * a\_h_2 $

7.  $ \frac{\partial(E_t)}{\partial(W_7)} = (a\_o_2 - t_2)* a\_o_2*(1-a\_o_2) * a\_h_1 $

8.  $ \frac{\partial(E_t)}{\partial(W_8)} = (a\_o_2 - t_2)* a\_o_2*(1-a\_o_2) * a\_h_2 $
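
To tie everything together, the eight final equations can be collected into a single function and compared against finite-difference estimates for all weights at once. This is a minimal sketch under stated assumptions: the function names `forward`, `backward` and `numeric_gradient`, the sample values and the step size `eps` are all hypothetical choices for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(i1, i2, W, t1, t2):
    """Return (a_h1, a_h2, a_o1, a_o2, E_t); W is the list [W1, ..., W8]."""
    W1, W2, W3, W4, W5, W6, W7, W8 = W
    a_h1 = sigmoid(i1 * W1 + i2 * W2)
    a_h2 = sigmoid(i1 * W3 + i2 * W4)
    a_o1 = sigmoid(a_h1 * W5 + a_h2 * W6)
    a_o2 = sigmoid(a_h1 * W7 + a_h2 * W8)
    E_t = 0.5 * (a_o1 - t1) ** 2 + 0.5 * (a_o2 - t2) ** 2
    return a_h1, a_h2, a_o1, a_o2, E_t

def backward(i1, i2, W, t1, t2):
    """Return [dE_t/dW1, ..., dE_t/dW8] using the eight final equations."""
    W1, W2, W3, W4, W5, W6, W7, W8 = W
    a_h1, a_h2, a_o1, a_o2, _ = forward(i1, i2, W, t1, t2)
    d_o1 = (a_o1 - t1) * a_o1 * (1 - a_o1)            # dE_t/do_1
    d_o2 = (a_o2 - t2) * a_o2 * (1 - a_o2)            # dE_t/do_2
    d_h1 = (d_o1 * W5 + d_o2 * W7) * a_h1 * (1 - a_h1)  # dE_t/dh_1
    d_h2 = (d_o1 * W6 + d_o2 * W8) * a_h2 * (1 - a_h2)  # dE_t/dh_2
    return [
        d_h1 * i1, d_h1 * i2, d_h2 * i1, d_h2 * i2,
        d_o1 * a_h1, d_o1 * a_h2, d_o2 * a_h1, d_o2 * a_h2,
    ]

def numeric_gradient(i1, i2, W, t1, t2, eps=1e-6):
    """Central finite-difference estimate of every dE_t/dW_k."""
    grads = []
    for k in range(len(W)):
        W_plus = list(W)
        W_minus = list(W)
        W_plus[k] += eps
        W_minus[k] -= eps
        e_plus = forward(i1, i2, W_plus, t1, t2)[-1]
        e_minus = forward(i1, i2, W_minus, t1, t2)[-1]
        grads.append((e_plus - e_minus) / (2 * eps))
    return grads

# Illustrative values only (assumptions, not taken from the text).
i1, i2, t1, t2 = 0.05, 0.10, 0.01, 0.99
W = [0.15, 0.20, 0.25, 0.30, 0.40, 0.45, 0.50, 0.55]

analytic = backward(i1, i2, W, t1, t2)
numeric = numeric_gradient(i1, i2, W, t1, t2)
for k, (a, n) in enumerate(zip(analytic, numeric), start=1):
    print(f"dE_t/dW{k}: analytic={a:.8f}  numeric={n:.8f}")
```

If any of the eight equations had a wrong sign, weight or activation, the corresponding pair of numbers would visibly disagree, which makes this kind of check a handy habit when deriving gradients by hand.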

