# Backpropagation in Neural Networks

The task is to recognize handwritten digits (0 to 9).

In this notebook we will implement the complete Neural Network algorithm: **Feed Forward Propagation** (which defines the relation between predictors and output) plus **Back Propagation**.

In the previous notebook we implemented just the *Feed Forward Propagation* step using pre-trained weights.<br>
In this notebook we will instead see how to learn these weights from the training data.

We start with some theory.

Let's recall how the *Forward Propagation* step is computed in the Neural Network:
<img src="img/NN.png" alt="Neural Network representation" style="width: 400px;"/>


Let's define some variables first:
1. $L$ = number of layers
2. $s_l$ = number of neurons (nodes) in the $l$-th layer, not counting the bias unit
3. $K$ = number of classes, with $s_L = K$
4. $m$ = number of data points (rows) in the training set
5. $n$ = number of features (columns) in the training set

Second, let's recall the cost function for Logistic Regression:

$J(\theta) = \frac{1}{m}\sum_{i=1}^m\big[-y_i \log(h_i(\theta)) - (1-y_i)\log(1-h_i(\theta))\big] + \frac{\lambda}{2m}\sum_{j=1}^n\theta_j^2$
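
As a minimal sketch of the formula above, the regularized logistic regression cost can be written in NumPy as follows. The function name `logistic_cost` and the convention that `X` already carries a leading column of ones (so that `theta[0]` is the unregularized bias) are assumptions made for this illustration.

```python
import numpy as np

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def logistic_cost(theta, X, y, lam):
    """Regularized logistic regression cost J(theta).

    Assumed shapes (illustrative): X is (m, n + 1) with a leading column
    of ones, theta is (n + 1,), y is (m,) with values in {0, 1}.
    """
    m = X.shape[0]
    h = sigmoid(X @ theta)                               # h_i(theta) for every row
    data_term = np.mean(-y * np.log(h) - (1 - y) * np.log(1 - h))
    reg_term = lam / (2 * m) * np.sum(theta[1:] ** 2)    # sum over j = 1..n only
    return data_term + reg_term

# Tiny made-up data set: 4 examples, 2 features plus the bias column.
X = np.array([[1.0, 0.5, 1.5],
              [1.0, -1.0, 2.0],
              [1.0, 0.3, -0.7],
              [1.0, 2.0, 0.1]])
y = np.array([1, 0, 0, 1])

# With theta = 0 every prediction is 0.5, so the cost is ln(2) ~ 0.6931.
print(logistic_cost(np.zeros(3), X, y, lam=1.0))
```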

The cost for the NN is a generalization of the above: the data term is summed over all $K$ output units, and the regularization term is summed over all the weight matrices:

$J(\Theta) = \frac{1}{m}\sum_{i=1}^m \sum_{k=1}^K\big[-y_k^{(i)} \log((h_{\Theta}(x^{(i)}))_k) - (1-y_k^{(i)})\log(1-(h_{\Theta}(x^{(i)}))_k)\big] + \frac{\lambda}{2m}\sum_{l=1}^{L-1}\sum_{i=1}^{s_{l+1}}\sum_{j=1}^{s_l}(\Theta_{i,j}^{(l)})^2$

where $h_{\Theta}(x^{(i)})$ is computed as shown in Figure 2 above, and $K$ = 10 is the total number of possible labels. <br>
Note that $h_{\Theta}(x^{(i)})_k = a_k^{(3)}$ is the activation (output value) of the *k*-th output unit (in the output layer) for the *i*-th training example. <br>
The regularization sum starts at $j = 1$, so the bias weights $\Theta_{i,0}^{(l)}$ are not penalized.
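
The cost above can be sketched in NumPy for the 3-layer case. This is only an illustration under some assumptions: the name `nn_cost` is hypothetical, `Theta1` and `Theta2` store the bias weights in their first column, and the labels in `y` are integers in `0..K-1` (the encoding used in the actual data set may differ).

```python
import numpy as np

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def nn_cost(Theta1, Theta2, X, y, num_labels, lam):
    """Regularized cost J(Theta) for a 3-layer network (illustrative sketch).

    Assumed shapes: X is (m, n) without a bias column,
    Theta1 is (s2, n + 1), Theta2 is (K, s2 + 1), y is (m,) with labels 0..K-1.
    """
    m = X.shape[0]
    a1 = np.hstack([np.ones((m, 1)), X])                        # add bias unit
    a2 = np.hstack([np.ones((m, 1)), sigmoid(a1 @ Theta1.T)])   # hidden layer + bias
    h = sigmoid(a2 @ Theta2.T)                                  # (m, K) outputs a^(3)

    Y = np.eye(num_labels)[y]                                   # one-hot rows y^(i)
    data_term = np.sum(-Y * np.log(h) - (1 - Y) * np.log(1 - h)) / m

    # Skip column 0 of each matrix: bias weights are not regularized.
    reg_term = lam / (2 * m) * (np.sum(Theta1[:, 1:] ** 2) + np.sum(Theta2[:, 1:] ** 2))
    return data_term + reg_term

# With all-zero weights every output is 0.5, so J = K * ln(2) ~ 6.9315 for K = 10.
X_demo = np.random.default_rng(0).normal(size=(5, 400))
y_demo = np.array([0, 3, 9, 1, 7])
print(nn_cost(np.zeros((25, 401)), np.zeros((10, 26)), X_demo, y_demo, 10, lam=1.0))
```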

We want to find $\Theta$ to minimize $J(\Theta)$.

To do that, we need the gradient $\nabla J(\Theta)$ with respect to $\Theta$, i.e. the partial derivatives $\frac{\partial J(\Theta)}{\partial \Theta_{i,j}^{(l)}} = D_{i,j}^{(l)}$.

The **backpropagation algorithm** helps us to calculate $D_{i,j}^{(l)}$.

Let's see how it works: <br>
**NOTE**: the explanation below assumes that our Neural Network has **3 layers** (one input layer, one hidden layer and one output layer), where the input layer has **400** units (excluding the bias term), the hidden layer has **25** units (excluding the bias term) and the output layer has **10** units.
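
To make these sizes concrete, here is a small sketch of the two weight matrices for this architecture. Each $\Theta^{(l)}$ has one row per unit in layer $l + 1$ and one column per unit in layer $l$ plus one for the bias. The random values and the range `eps` are placeholders chosen for illustration, not the weights used in this notebook.

```python
import numpy as np

input_size, hidden_size, num_labels = 400, 25, 10

rng = np.random.default_rng(0)
eps = 0.12  # hypothetical range for the placeholder weights

# Theta^(1) maps layer 1 (400 units + bias) to layer 2 (25 units).
Theta1 = rng.uniform(-eps, eps, size=(hidden_size, input_size + 1))
# Theta^(2) maps layer 2 (25 units + bias) to layer 3 (10 units).
Theta2 = rng.uniform(-eps, eps, size=(num_labels, hidden_size + 1))

print(Theta1.shape)  # (25, 401)
print(Theta2.shape)  # (10, 26)
```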

Recall that the intuition behind the backpropagation algorithm is as follows. <br>
Given a training example $(x^{(i)}, y^{(i)})$, we first run a "forward pass" to compute all the activations throughout the network, including the output value of the hypothesis $h_\Theta(x)$. <br>
Then, for each node $j$ in layer $l$, we would like to compute an "error term" $\delta_j^{(l)}$ that measures *how much that node was "responsible" for any errors in our output*. <br>
For an **output node** (a node of the output layer), we can directly measure the difference between the network's activation $h_\Theta{(x^{(i)})}$ and the true value $y^{(i)}$ and use that to define $\delta_j^{(3)}$ (since layer 3 is the output layer). <br>
For the **hidden units**, we compute $\delta_j^{(l)}$ based on a weighted average of the error terms of the nodes in layer $l + 1$. <br>
For the input units we **do not** compute an error term.

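Finally, the error terms are combined with the activations to accumulate the gradient of the cost function over all training examples, as detailed in steps 2E and 3 below.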

In detail:

1. $\Delta^{(l)} := 0$. Define a gradient matrix $\Delta^{(l)}$ for each weight matrix $\Theta^{(l)}$, i.e. for $l = 1, \dots, L - 1$, and set all of its entries to zero. Concretely, in our case we define $\Delta^{(1)}$ and $\Delta^{(2)}$. The size of $\Delta^{(l)}$ is equal to that of $\Theta^{(l)}$:
$$ size(\Delta^{(l)}) = size(\Theta^{(l)})$$
These $\Delta^{(l)}$ matrices are used to accumulate the gradient of the cost function.


2. For $t = 1:m$ (looping through the training set): <br>
   A- Set the input layer's values ($a^{(1)}$) to the $t$-th training example $x^{(t)}$. <br>
   
   B- Perform a **feed forward propagation** pass (Figure 2), computing $z^{(2)}, a^{(2)}, z^{(3)}, a^{(3)}$ for layers 2 and 3. Note that you need to prepend a $+1$ entry to the activation vectors $a^{(1)}$ and $a^{(2)}$ so that they also include the bias unit. <br>
   
   C- for each output unit $k$ in layer 3 (the output layer), set:
   $$\delta_k^{(3)} = (a_k^{(3)} - y_k)$$ where $y_k \in \{0, 1\}$ indicates whether the current training example belongs to class $k$ ($y_k = 1$), or if it belongs to a different class ($y_k = 0$). <br>
   
   D- For the hidden layer $l = 2$, set:
   $$\delta^{(2)} = (\Theta^{(2)})^T \delta^{(3)} .* g'(z^{(2)}) = (\Theta^{(2)})^T \delta^{(3)} .* ((a^{(2)}).*(1-a^{(2)}))$$
   (Note that $".*"$ indicates element-wise multiplication and not matrix multiplication). <br>
   
   E- Accumulate the gradient using the following formula: 
   $$\Delta^{(l)} = \Delta^{(l)} + \delta^{(l + 1)}(a^{(l)})^T $$
   In our case, we have calculated $\delta^{(3)}$ and $\delta^{(2)}$ so far, and we have the matrices $\Delta^{(2)}$ and $\Delta^{(1)}$ that we initialized in step 1. The updates of $\Delta^{(2)}$ and $\Delta^{(1)}$ are then:
   $$\Delta^{(2)} = \Delta^{(2)} + \delta^{(3)}(a^{(2)})^T $$
   $$\Delta^{(1)} = \Delta^{(1)} + \delta^{(2)}(a^{(1)})^T $$
   Note that you should skip or remove $\delta_0^{(2)}$ (the entry for the bias unit) before updating $\Delta^{(1)}$, so that $\delta^{(2)}$ has 25 entries.


3. Obtain the regularized gradient for the Neural Network cost function by dividing the accumulated gradients by $m$ and adding the regularization term for the non-bias weights:

$\frac{\partial}{\partial \Theta_{i,j}^{(l)}} J(\Theta)= D_{i,j}^{(l)} = \frac{1}{m}\Delta_{i, j}^{(l)}$ for $j = 0$ <br>

$\frac{\partial}{\partial \Theta_{i,j}^{(l)}} J(\Theta)= D_{i,j}^{(l)} = \frac{1}{m}\Delta_{i, j}^{(l)} + \frac{\lambda}{m} \Theta_{i, j}^{(l)}$ for $j \geq 1$
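
Steps 1 to 3 can be put together in a short NumPy sketch for the 3-layer network. It follows the loop described above one training example at a time, which is easy to read but not the fastest way to do it. The function name `backprop_gradients`, the integer label encoding `0..K-1` and the random demo data are assumptions made for this illustration.

```python
import numpy as np

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def backprop_gradients(Theta1, Theta2, X, y, num_labels, lam):
    """Return (D1, D2), the regularized gradients of J w.r.t. Theta1 and Theta2.

    Assumed shapes: X is (m, n) without a bias column, Theta1 is (s2, n + 1),
    Theta2 is (K, s2 + 1) with float entries, y is (m,) with labels 0..K-1.
    """
    m = X.shape[0]

    # Step 1: one zero accumulator per weight matrix.
    Delta1 = np.zeros_like(Theta1)
    Delta2 = np.zeros_like(Theta2)

    # Step 2: loop over the training examples.
    for t in range(m):
        # A: input activations, with the bias unit prepended.
        a1 = np.concatenate(([1.0], X[t]))
        # B: forward pass.
        z2 = Theta1 @ a1
        a2 = np.concatenate(([1.0], sigmoid(z2)))
        z3 = Theta2 @ a2
        a3 = sigmoid(z3)
        # C: output error, using a one-hot vector for the true class.
        y_vec = np.zeros(num_labels)
        y_vec[y[t]] = 1.0
        delta3 = a3 - y_vec
        # D: hidden-layer error; g'(z2) is written as a2 * (1 - a2).
        delta2 = (Theta2.T @ delta3) * (a2 * (1 - a2))
        delta2 = delta2[1:]          # drop the entry of the bias unit
        # E: accumulate the outer products delta^(l+1) (a^(l))^T.
        Delta2 += np.outer(delta3, a2)
        Delta1 += np.outer(delta2, a1)

    # Step 3: average, then regularize every column except the bias column.
    D1 = Delta1 / m
    D2 = Delta2 / m
    D1[:, 1:] += lam / m * Theta1[:, 1:]
    D2[:, 1:] += lam / m * Theta2[:, 1:]
    return D1, D2

# Usage on made-up data with the 400-25-10 architecture.
rng = np.random.default_rng(0)
Theta1 = rng.uniform(-0.12, 0.12, size=(25, 401))
Theta2 = rng.uniform(-0.12, 0.12, size=(10, 26))
X_demo = rng.normal(size=(5, 400))
y_demo = np.array([0, 3, 9, 1, 7])

D1, D2 = backprop_gradients(Theta1, Theta2, X_demo, y_demo, 10, lam=1.0)
print(D1.shape, D2.shape)  # (25, 401) (10, 26)
```

A common sanity check for a routine like this is to compare a few entries of `D1` and `D2` against finite-difference estimates of the cost.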