### Forward propagation

Forward propagation is the process of passing the output of each layer forward through the network to obtain the final output. 

The output of each hidden layer becomes the input of the next layer. The output is also called the activation for a layer.

The output/activation for each layer is computed in two steps:
* The weighted sum of the inputs plus the bias is computed, say $z^{(i)}$
* The activation function is applied to the sum $z^{(i)}$ to produce the activation $a^{(i)}$
  
Equations:  

$$ z^{(1)} = x \  W^{(1)} + b^{(1)}$$
$$ a^{(1)} = g_1(z^{(1)})$$
$$ z^{(2)} = a^{(1)} W^{(2)}  + b^{(2)}$$
$$ a^{(2)} = g_2(z^{(2)})$$
$$ \vdots $$
$$ z^{(n)} = a^{(n-1)} W^{(n)}  + b^{(n)}$$
$$ a^{(n)} = g_n(z^{(n)})$$
The final output is $y_{pred} = a^{(n)}$.

Convention:  
$z^{(i)}$: weighted sum (plus bias) of the outputs of the $(i-1)^{th}$ layer  
$a^{(i)}$: activation/output of the $i^{th}$ layer  
$g_i$: activation function of the $i^{th}$ layer  
$W^{(i)}$: weight matrix connecting the $(i-1)^{th}$ layer to the $i^{th}$ layer   
$b^{(i)}$: bias vector of the $i^{th}$ layer  
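
The layer-by-layer recurrence above can be sketched in NumPy as a simple loop. This is only an illustration: the helper name `forward`, the 3-4-1 layer sizes, the random parameters, and the choice of `tanh` and sigmoid activations are assumptions, not part of the notes.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases, activations):
    """Return [a0, a1, ..., an] with a0 = x (hypothetical helper)."""
    a = x
    outputs = [a]
    for W, b, g in zip(weights, biases, activations):
        z = a @ W + b   # weighted sum: z^(i) = a^(i-1) W^(i) + b^(i)
        a = g(z)        # activation:   a^(i) = g_i(z^(i))
        outputs.append(a)
    return outputs

# Assumed 3-4-1 network with random weights and zero biases
rng = np.random.default_rng(0)
layer_sizes = [3, 4, 1]
weights = [rng.normal(size=(n_in, n_out))
           for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros((1, n_out)) for n_out in layer_sizes[1:]]
activations = [np.tanh, sigmoid]

x = rng.normal(size=(5, 3))  # 5 examples, one per row
y_pred = forward(x, weights, biases, activations)[-1]
print(y_pred.shape)  # (5, 1)
```

Storing each example as a row of `x` matches the $x \ W^{(1)}$ ordering used in the equations.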


<img align="center" src="https://drive.google.com/uc?id=1kcWsASHFLoEgRFNpi_cxgYUElzUvOYro" width=700 />


### Backward propagation

Backward propagation is the process of propagating the cost backward through the network to compute the gradients for each layer, which are then used to update the weights and biases. 

Equations:    
$$W^{(n)} := W^{(n)} - \frac{1}{m}\alpha \frac{\partial J}{\partial W^{(n)}}$$    
$$b^{(n)} := b^{(n)} - \frac{1}{m}\alpha \frac{\partial J}{\partial b^{(n)}}$$
$$ \vdots $$
$$ \vdots $$ 
$$W^{(1)} := W^{(1)} - \frac{1}{m}\alpha \frac{\partial J}{\partial W^{(1)}}$$   
$$b^{(1)} := b^{(1)} - \frac{1}{m}\alpha \frac{\partial J}{\partial b^{(1)}}$$  

Here, $\alpha$ is the learning rate, which scales the gradients to control the size of each weight/bias update.  
$m$ is the number of training examples; dividing by $m$ averages the gradient, assuming $J$ is the cost summed over all $m$ examples.

The gradients are computed using the chain rule for derivatives.
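
As a minimal sketch of the update rule above, the following applies one gradient-descent step to a weight matrix and a bias vector. The function name `gradient_descent_step` and all numeric values are made up for illustration; the gradients are assumed to be already computed and summed over the $m$ examples.

```python
import numpy as np

def gradient_descent_step(params, grads, alpha, m):
    """Apply p := p - (alpha / m) * dJ/dp to each parameter array."""
    return [p - (alpha / m) * g for p, g in zip(params, grads)]

W = np.array([[1.0, 2.0], [3.0, 4.0]])
b = np.array([[0.5, -0.5]])
dW = np.array([[10.0, 0.0], [0.0, -10.0]])  # made-up gradient values
db = np.array([[5.0, 5.0]])

W_new, b_new = gradient_descent_step([W, b], [dW, db], alpha=0.1, m=10)
print(W_new)
print(b_new)
```

With `alpha=0.1` and `m=10`, each gradient entry is scaled by 0.01 before being subtracted.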

<img align="center" src="https://drive.google.com/uc?id=1kcWsASHFLoEgRFNpi_cxgYUElzUvOYro" width=700 />

One forward pass followed by one backward pass is called an iteration. One complete pass over all the training examples is called an epoch.  

### Derivation of Backpropagation equations

We will derive the equation for the neural network with a single hidden layer shown below:

<img align="center" src="https://drive.google.com/uc?id=1-d1EFBF4nLH3_Sy-vvCJOAIlxHuwvMsl" >

Let us first write down the equations for **forward propagation** that we will use to derive the gradients for backpropagation:

\begin{equation}
\left(x_1, x_2 \right)  \to
\left( z_1^{(1)}, z_2^{(1)} \right)
\to
\left(a_1^{(1)}, a_2^{(1)} \right)
\to
z^{(2)}
\to
p = a^{(2)}
\end{equation}

\begin{equation}
\begin{split}
&z_1^{(1)}= w_{11}^{(1)}x_1 + w_{21}^{(1)}x_2 + b_1^{(1)} \\
&z_2^{(1)}= w_{12}^{(1)}x_1 + w_{22}^{(1)}x_2 + b_2^{(1)} \\
\end{split}
\quad \quad \quad
\begin{split}
&a_1^{(1)}= g(z_1^{(1)}) \\
&a_2^{(1)}= g(z_2^{(1)}) \\
\end{split}
\quad \quad \quad
z^{(2)} = w_{1}^{(2)}a_1^{(1)} + w_{2}^{(2)}a_2^{(1)} + b^{(2)}
\quad \quad \quad
a^{(2)} = g(z^{(2)})
\end{equation}

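We take $g$ to be the sigmoid activation function, whose derivative is

$$\frac{d}{dx}\text{sigmoid}(x) = \text{sigmoid}(x)\left(1-\text{sigmoid}(x)\right),$$

so that $\frac{\partial a}{\partial z} = a(1-a)$ whenever $a = g(z)$. The derivative of the log loss cost function with respect to the output is: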
 
$$
\begin{equation} 
\begin{aligned}
J &= -  \left(y \log\left(a^{(2)}\right) + (1-y) \log\left(1-a^{(2)}\right)\right)\\
\frac{dJ}{da^{(2)}} &= - \left( \frac{y}{a^{(2)}} - \frac{1-y}{1-a^{(2)}}\right) = \frac{a^{(2)}-y}{a^{(2)}\left(1-a^{(2)}\right)}
\end{aligned}
\end{equation}
$$
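
Both derivatives can be sanity-checked numerically with a central difference. In the sketch below the sigmoid derivative is written in terms of the output $a = \text{sigmoid}(x)$, that is $a(1-a)$. The helper `central_difference` and the test point `x = 0.3`, `y = 1.0` are arbitrary choices for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def log_loss(a, y):
    return -(y * np.log(a) + (1 - y) * np.log(1 - a))

def central_difference(f, x, eps=1e-6):
    """Numerical derivative of f at x (hypothetical helper)."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)

x, y = 0.3, 1.0
a = sigmoid(x)

# Sigmoid derivative: a * (1 - a) with a = sigmoid(x)
sigmoid_grad_formula = a * (1 - a)
sigmoid_grad_numeric = central_difference(sigmoid, x)

# Log loss derivative with respect to the output a
loss_grad_formula = (a - y) / (a * (1 - a))
loss_grad_numeric = central_difference(lambda t: log_loss(t, y), a)

print(abs(sigmoid_grad_formula - sigmoid_grad_numeric))
print(abs(loss_grad_formula - loss_grad_numeric))
```

Both printed differences should be close to zero if the formulas are right.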

We start calculating the gradients from the last node and propagate backwards using the chain rule for partial derivatives:

$$\begin{equation}
\begin{aligned}
\frac{\partial J}{\partial a^{(2)}} &= \frac{a^{(2)}-y}{a^{(2)}\left(1-a^{(2)}\right)} \\
\frac{\partial J}{\partial z^{(2)}} &= \frac{\partial J}{\partial a^{(2)}}\frac{\partial a^{(2)}}{\partial z^{(2)}}
=\frac{a^{(2)}-y}{a^{(2)}\left(1-a^{(2)}\right)}a^{(2)}\left(1-a^{(2)}\right) 
= a^{(2)}-y \\
\frac{\partial J}{\partial w_1^{(2)}} &= \frac{\partial J}{\partial z^{(2)}}\frac{\partial z^{(2)}}{\partial w_1^{(2)}}
= \left(a^{(2)}-y\right)\ a_1^{(1)}\\
\frac{\partial J}{\partial w_2^{(2)}} &= \frac{\partial J}{\partial z^{(2)}}\frac{\partial z^{(2)}}{\partial w_2^{(2)}}
= \left(a^{(2)}-y\right)\ a_2^{(1)}\\
\frac{\partial J}{\partial b^{(2)}} &= \frac{\partial J}{\partial z^{(2)}}\frac{\partial z^{(2)}}{\partial b^{(2)}}
= \left(a^{(2)}-y\right)\\
\frac{\partial J}{\partial a_1^{(1)}} &= \frac{\partial J}{\partial z^{(2)}}\frac{\partial z^{(2)}}{\partial a_1^{(1)}}
= \left(a^{(2)}-y\right)\ w_1^{(2)}\\
\frac{\partial J}{\partial a_2^{(1)}} &= \frac{\partial J}{\partial z^{(2)}}\frac{\partial z^{(2)}}{\partial a_2^{(1)}}
= \left(a^{(2)}-y\right)\ w_2^{(2)}\\
\frac{\partial J}{\partial z_1^{(1)}} &= \frac{\partial J}{\partial a_1^{(1)}}\frac{\partial a_1^{(1)}}{\partial z_1^{(1)}}
= \left(a^{(2)}-y\right)\ w_1^{(2)}\ a_1^{(1)}\left(1-a_1^{(1)}\right)\\
\frac{\partial J}{\partial z_2^{(1)}}
&= \frac{\partial J}{\partial a_2^{(1)}}\frac{\partial a_2^{(1)}}{\partial z_2^{(1)}}
= \left(a^{(2)}-y\right)\ w_2^{(2)}\ a_2^{(1)}\left(1-a_2^{(1)}\right)\\
\frac{\partial J}{\partial w_{11}^{(1)}}
&= \frac{\partial J}{\partial z_1^{(1)}}\frac{\partial z_1^{(1)}}{\partial w_{11}^{(1)}}
= \left(a^{(2)}-y\right)\ w_1^{(2)}\ a_1^{(1)}\left(1-a_1^{(1)}\right)\ x_1\\
\frac{\partial J}{\partial w_{21}^{(1)}}
&= \frac{\partial J}{\partial z_1^{(1)}}\frac{\partial z_1^{(1)}}{\partial w_{21}^{(1)}}
= \left(a^{(2)}-y\right)\ w_1^{(2)}\ a_1^{(1)}\left(1-a_1^{(1)}\right)\ x_2\\
\frac{\partial J}{\partial w_{12}^{(1)}}
&= \frac{\partial J}{\partial z_2^{(1)}}\frac{\partial z_2^{(1)}}{\partial w_{12}^{(1)}}
= \left(a^{(2)}-y\right)\ w_2^{(2)}\ a_2^{(1)}\left(1-a_2^{(1)}\right)\ x_1\\
\frac{\partial J}{\partial w_{22}^{(1)}}
&= \frac{\partial J}{\partial z_2^{(1)}}\frac{\partial z_2^{(1)}}{\partial w_{22}^{(1)}}
= \left(a^{(2)}-y\right)\ w_2^{(2)}\ a_2^{(1)}\left(1-a_2^{(1)}\right)\ x_2\\
\frac{\partial J}{\partial b_1^{(1)}}
&= \frac{\partial J}{\partial z_1^{(1)}}\frac{\partial z_1^{(1)}}{\partial b_1^{(1)}}
= \left(a^{(2)}-y\right)\ w_1^{(2)}\ a_1^{(1)}\left(1-a_1^{(1)}\right)\\
\frac{\partial J}{\partial b_2^{(1)}}
&= \frac{\partial J}{\partial z_2^{(1)}}\frac{\partial z_2^{(1)}}{\partial b_2^{(1)}}
= \left(a^{(2)}-y\right)\ w_2^{(2)}\ a_2^{(1)}\left(1-a_2^{(1)}\right)\\
\end{aligned}
\end{equation}$$
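
The scalar gradients derived above can be checked against finite differences. The sketch below assumes sigmoid activations and the log loss, as in the derivation. The dictionary keys `v1`, `v2` and `c` are hypothetical names standing for $w_1^{(2)}$, $w_2^{(2)}$ and $b^{(2)}$, and all parameter values are made up.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(p, x1, x2):
    z1 = p["w11"] * x1 + p["w21"] * x2 + p["b1"]
    z2 = p["w12"] * x1 + p["w22"] * x2 + p["b2"]
    a1, a2 = sigmoid(z1), sigmoid(z2)
    out = sigmoid(p["v1"] * a1 + p["v2"] * a2 + p["c"])  # a^(2)
    return a1, a2, out

def loss(p, x1, x2, y):
    out = forward(p, x1, x2)[2]
    return -(y * np.log(out) + (1 - y) * np.log(1 - out))

def gradients(p, x1, x2, y):
    """Analytic gradients following the chain-rule derivation."""
    a1, a2, out = forward(p, x1, x2)
    dz_out = out - y                         # dJ/dz^(2)
    dz1 = dz_out * p["v1"] * a1 * (1 - a1)   # dJ/dz_1^(1)
    dz2 = dz_out * p["v2"] * a2 * (1 - a2)   # dJ/dz_2^(1)
    return {
        "v1": dz_out * a1, "v2": dz_out * a2, "c": dz_out,
        "w11": dz1 * x1, "w21": dz1 * x2, "b1": dz1,
        "w12": dz2 * x1, "w22": dz2 * x2, "b2": dz2,
    }

p = {"w11": 0.1, "w21": -0.2, "w12": 0.4, "w22": 0.3, "b1": 0.0, "b2": 0.1,
     "v1": 0.5, "v2": -0.6, "c": 0.2}
x1, x2, y = 1.5, -0.5, 1.0

analytic = gradients(p, x1, x2, y)

# Central-difference estimate of each partial derivative
eps = 1e-6
numeric = {}
for name in p:
    plus = {**p, name: p[name] + eps}
    minus = {**p, name: p[name] - eps}
    numeric[name] = (loss(plus, x1, x2, y) - loss(minus, x1, x2, y)) / (2 * eps)

max_error = max(abs(analytic[k] - numeric[k]) for k in p)
print(max_error)
```

A `max_error` close to zero indicates that the nine analytic gradients agree with the numerical estimates at this point.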

### Vectorizing the Forward and Backward Propagation equations
Forward propagation:

$$
\begin{equation}
\left(x_1, x_2 \right)  \to
\left( z_1^{(1)}, z_2^{(1)} \right)
\to
\left(a_1^{(1)}, a_2^{(1)} \right)
\to
z^{(2)}
\to
p = a^{(2)}
\end{equation}
$$

with the equations:

$$
\begin{equation}
\begin{split}
&z_1^{(1)}= w_{11}^{(1)}x_1 + w_{21}^{(1)}x_2 + b_1^{(1)} \\
&z_2^{(1)}= w_{12}^{(1)}x_1 + w_{22}^{(1)}x_2 + b_2^{(1)} \\
\end{split}
\quad \quad \quad
\begin{split}
&a_1^{(1)}= g(z_1^{(1)}) \\
&a_2^{(1)}= g(z_2^{(1)}) \\
\end{split}
\quad \quad \quad
z^{(2)} = w_{1}^{(2)}a_1^{(1)} + w_{2}^{(2)}a_2^{(1)} + b^{(2)}
\quad \quad \quad
a^{(2)} = g(z^{(2)})
\end{equation}
$$

Converting the set of equations into matrix operations:

$$
\left( z_1^{(1)}, z_2^{(1)} \right)
= \left(x_1, x_2 \right) 
\begin{pmatrix}
w_{11}^{(1)} & w_{12}^{(1)} \\
w_{21}^{(1)} & w_{22}^{(1)} \\
\end{pmatrix}
+ \left( b_1^{(1)}, b_2^{(1)} \right)
\quad \quad \text{and} \quad \quad
\left( a_1^{(1)}, a_2^{(1)} \right) 
= \left( g(z_1^{(1)}), g(z_2^{(1)}) \right) 
= g\left( z_1^{(1)}, z_2^{(1)} \right)
\quad \quad \text{and} \quad \quad
z^{(2)} = w_{1}^{(2)}a_1^{(1)} + w_{2}^{(2)}a_2^{(1)} + b^{(2)}
$$

In the vectorized form, the equations are:

$$
\begin{equation}
z^{(1)} = x \  W^{(1)} + b^{(1)}, \quad \quad \quad
a^{(1)} = g(z^{(1)}), \quad \quad \quad
z^{(2)} = a^{(1)} W^{(2)}  + b^{(2)}, \quad \quad \quad
a^{(2)} = g(z^{(2)}),
\end{equation}
$$

where

$$
\begin{equation}
z^{(1)} = \left( z_1^{(1)}, z_2^{(1)} \right)
, \quad \quad x = \left(x_1, x_2 \right) 
, \quad \quad W^{(1)} =
\begin{pmatrix}
w_{11}^{(1)} & w_{12}^{(1)} \\
w_{21}^{(1)} & w_{22}^{(1)} \\
\end{pmatrix}
, \quad \quad b^{(1)} = \left( b_1^{(1)}, b_2^{(1)} \right)
, \quad \quad a^{(1)} = \left( a_1^{(1)}, a_2^{(1)} \right)
, \quad \quad W^{(2)} =
\begin{pmatrix}
w_{1}^{(2)} \\
w_{2}^{(2)} \\
\end{pmatrix}
\end{equation}
$$

In the vectorized form, the forward propagation is given by:

$$
\begin{equation}
x \longrightarrow z^{(1)} = x \ W^{(1)} + b^{(1)}
\longrightarrow
a^{(1)} = g(z^{(1)})
\longrightarrow
z^{(2)} = a^{(1)} W^{(2)} + b^{(2)}
\longrightarrow
a^{(2)} = g(z^{(2)})
\longrightarrow
p = a^{(2)}
\end{equation}
$$
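
To see that the matrix form reproduces the element-wise equations, here is a small NumPy sketch of the vectorized forward pass for the 2-2-1 network. The parameter values are made up, and $g$ is assumed to be the sigmoid.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([[1.5, -0.5]])        # shape (1, 2): one example as a row
W1 = np.array([[0.1, 0.4],
               [-0.2, 0.3]])       # W1[i, j] holds w_{ij}^{(1)}
b1 = np.array([[0.0, 0.1]])        # shape (1, 2)
W2 = np.array([[0.5], [-0.6]])     # shape (2, 1)
b2 = np.array([[0.2]])             # shape (1, 1)

# Vectorized forward propagation
z1 = x @ W1 + b1
a1 = sigmoid(z1)
z2 = a1 @ W2 + b2
a2 = sigmoid(z2)

# Element-wise equation for z_1^{(1)}, for comparison
z1_1 = W1[0, 0] * x[0, 0] + W1[1, 0] * x[0, 1] + b1[0, 0]
print(np.isclose(z1[0, 0], z1_1))  # True
print(a2.shape)                    # (1, 1)
```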

Can you write the vectorized equations for backpropagation based on the partial derivatives calculated above? (This exercise will be crucial for implementing the backpropagation code in the first assignment.)

$$
\begin{equation}
\begin{matrix} 
dW^{(1)} &=& ?\\
db^{(1)} &=& ?
\end{matrix}
\quad \longleftarrow \quad
dz^{(1)} = \quad ?
\quad \longleftarrow \quad
\begin{matrix} 
dW^{(2)} &=& ?\\
da^{(1)} &=& ?\\
db^{(2)} &=& ?\\
\end{matrix}
\quad \quad \longleftarrow \quad \quad
dz^{(2)} = \quad ?
\quad \longleftarrow \quad
da^{(2)} = \quad ?
\end{equation}
$$
where 
$$
\begin{equation}
\begin{aligned}
& dW^{(1)} =
\begin{pmatrix}
\frac{\partial J}{\partial w_{11}^{(1)}} & \frac{\partial J}{\partial w_{12}^{(1)}} \\
\frac{\partial J}{\partial w_{21}^{(1)}} & \frac{\partial J}{\partial w_{22}^{(1)}} \\
\end{pmatrix}
& db^{(1)} =
\begin{pmatrix}
\frac{\partial J}{\partial b_1^{(1)}},& 
\frac{\partial J}{\partial b_2^{(1)}}
\end{pmatrix}
& \quad \quad  \quad dz^{(1)} =
\begin{pmatrix}
\frac{\partial J}{\partial z_1^{(1)}},& 
\frac{\partial J}{\partial z_2^{(1)}}
\end{pmatrix}
& \quad   \quad da^{(1)} =
\begin{pmatrix}
\frac{\partial J}{\partial a_1^{(1)}},& 
\frac{\partial J}{\partial a_2^{(1)}}
\end{pmatrix}
\quad   \\
& dW^{(2)} =
\begin{pmatrix}
\frac{\partial J}{\partial w_1^{(2)}} \\
\frac{\partial J}{\partial w_2^{(2)}}
\end{pmatrix} 
& db^{(2)} = \frac{\partial J}{\partial b^{(2)}}
& \quad \quad   \quad dz^{(2)} = \frac{\partial J}{\partial z^{(2)}}
&  \ \quad \quad da^{(2)} = \frac{\partial J}{\partial a^{(2)}}
\quad \quad
\end{aligned}
\end{equation}
$$

Note: It is important to get the order of matrices and matrix operations right. You should check that your solution still works if you change the number of nodes in the input layer and the hidden layer.

Tip: You can express some of these gradients in terms of other gradients and use that to simplify the equations. For example, can you write $db^{(2)}$ in terms of  $dz^{(2)}$? This simplification will also be helpful while implementing the code.

In simpler form (where $@$ denotes matrix multiplication and $*$ denotes element-wise multiplication),
$$
\begin{multline*}
\begin{matrix} 
dW^{(1)} &=& x^T @ dz^{(1)} \\
db^{(1)} &=& dz^{(1)} 
\end{matrix}
 \quad \longleftarrow \quad
dz^{(1)} =  \left(dz^{(2)} @ \left(W^{(2)}\right)^T\right) * a^{(1)} * \left(1-a^{(1)}\right)
 \quad \longleftarrow \quad
\begin{matrix} 
dW^{(2)} &=& \left(a^{(1)}\right)^T @ dz^{(2)} \\
da^{(1)} &=& dz^{(2)} @ \left(W^{(2)}\right)^T\\
db^{(2)} &=& dz^{(2)} \\
\end{matrix} \\
 \quad \longleftarrow \quad  dz^{(2)} = a^{(2)}-y 
 \quad \longleftarrow \quad  da^{(2)} = \frac{a^{(2)}-y}{a^{(2)}\left(1-a^{(2)}\right)}
\end{multline*}
$$

For code implementation:

$dz_2 = a_2 - y$

$db_2 = dz_2$

$dW_2 = a_1^T @ dz_2$

$dz_1 = (dz_2@ W_2^T) * a_1 * (1-a_1)$

$db_1 = dz_1$

$dW_1 = x^T@dz_1$
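
The equations above translate almost line for line into NumPy. The sketch below assumes a single training example stored as a row vector, sigmoid activations, and the log loss; the function names and random parameters are illustrative only. The result for `dW1` is compared with a finite-difference estimate.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, b1, W2, b2):
    a1 = sigmoid(x @ W1 + b1)
    a2 = sigmoid(a1 @ W2 + b2)
    return a1, a2

def backward(x, y, W2, a1, a2):
    dz2 = a2 - y
    db2 = dz2
    dW2 = a1.T @ dz2
    dz1 = (dz2 @ W2.T) * a1 * (1 - a1)
    db1 = dz1
    dW1 = x.T @ dz1
    return dW1, db1, dW2, db2

def loss(x, y, W1, b1, W2, b2):
    a2 = forward(x, W1, b1, W2, b2)[1]
    return (-(y * np.log(a2) + (1 - y) * np.log(1 - a2))).item()

rng = np.random.default_rng(0)
x = rng.normal(size=(1, 2))   # one example, two input features
y = np.array([[1.0]])
W1, b1 = rng.normal(size=(2, 2)), np.zeros((1, 2))
W2, b2 = rng.normal(size=(2, 1)), np.zeros((1, 1))

a1, a2 = forward(x, W1, b1, W2, b2)
dW1, db1, dW2, db2 = backward(x, y, W2, a1, a2)

# Finite-difference check of dW1, one entry at a time
eps = 1e-6
dW1_numeric = np.zeros_like(W1)
for i in range(W1.shape[0]):
    for j in range(W1.shape[1]):
        shift = np.zeros_like(W1)
        shift[i, j] = eps
        dW1_numeric[i, j] = (loss(x, y, W1 + shift, b1, W2, b2)
                             - loss(x, y, W1 - shift, b1, W2, b2)) / (2 * eps)

print(np.allclose(dW1, dW1_numeric, atol=1e-5))
```

If several examples are stacked as rows of `x`, the `dW` products already sum over the examples, while `db1` and `db2` would additionally need to be summed over the rows.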


In longer form,
$$
\begin{multline*}
\begin{matrix} 
dW^{(1)} &=& x^T @ \left[\left(\left(a^{(2)}-y\right) @ \left(W^{(2)}\right)^T\right) * a^{(1)} * \left(1-a^{(1)}\right)\right]\\
db^{(1)} &=& \left(\left(a^{(2)}-y\right) @ \left(W^{(2)}\right)^T\right) * a^{(1)} * \left(1-a^{(1)}\right)
\end{matrix}
 \quad \longleftarrow \quad
dz^{(1)} =  \left(\left(a^{(2)}-y\right) @ \left(W^{(2)}\right)^T\right) * a^{(1)} * \left(1-a^{(1)}\right) \\
\quad \longleftarrow \quad 
\begin{matrix} 
dW^{(2)} &=& \left(a^{(1)}\right)^T @ \left(a^{(2)}-y\right)\\
da^{(1)} &=& \left(a^{(2)}-y\right) @ \left(W^{(2)}\right)^T\\
db^{(2)} &=& \left(a^{(2)}-y\right)\\
\end{matrix}
 \quad \longleftarrow \quad 
dz^{(2)} = a^{(2)}-y
 \quad \longleftarrow \quad
da^{(2)} = \frac{a^{(2)}-y}{a^{(2)}\left(1-a^{(2)}\right)}
\end{multline*}
$$
