# 5. Neural Networks: Learning

In this section, we will learn how to fit the parameters of the neural network given a training set. 

In our (classification) problems, we will be dealing with either:

1. Binary classification (1 output unit)
2. Multi-class classification ($K$ output units)

### Cost function

Our neural network cost function is a generalization of the logistic regression cost function:

$J(\Theta) = -\frac{1}{m} \left[ \sum_{i=1}^m \sum_{k=1}^K y_k^{(i)} \log\left((h_\Theta(x^{(i)}))_k\right) + (1-y_k^{(i)}) \log\left(1-(h_\Theta(x^{(i)}))_k\right) \right] + \frac{\lambda}{2m} \sum_{l=1}^{L-1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} (\Theta_{ji}^{(l)})^2$

**Notes**:

1. The double sum adds up the logistic regression costs calculated for each unit in the output layer.
2. The triple sum adds up the squares of all the individual $\Theta$s in the entire network, excluding the bias weights (the index $i$ starts at 1).
3. The index $i$ in the triple sum does **not** refer to training example $i$.
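
To make the two sums concrete, here is a minimal sketch in plain Python (not the course's Octave). The function name `nn_cost` and the data layout (nested lists, bias weight stored in column 0 of each matrix) are assumptions made for illustration only.

```python
import math

def nn_cost(h, y, thetas, lam):
    """Regularized cost J(Theta), under the assumed data layout.

    h:      m predictions, each a list of K outputs strictly between 0 and 1
    y:      m labels, each a list of K values in {0, 1}
    thetas: weight matrices as lists of rows; column 0 is assumed to be the bias
    lam:    regularization strength lambda
    """
    m = len(y)
    # Double sum: logistic cost over every example i and every output unit k.
    data_term = 0.0
    for h_i, y_i in zip(h, y):
        for h_k, y_k in zip(h_i, y_i):
            data_term += y_k * math.log(h_k) + (1 - y_k) * math.log(1 - h_k)
    # Triple sum: squared weights over all layers, skipping the bias column.
    reg_term = sum(w * w for theta in thetas for row in theta for w in row[1:])
    return -data_term / m + lam / (2 * m) * reg_term
```

With a single example whose two outputs are both 0.5 and `lam = 0`, the cost is $2 \log 2$, since each output unit contributes $-\log 0.5$.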

### Backpropagation algorithm

Just as a reminder, what we did in the previous section (Week 4) was **forward propagation**: computing the activations layer by layer, starting from the first layer all the way to the output layer.

**Intuition**: for each node $j$ in layer $l$ we will calculate the _error_ $\delta_j^{(l)}$

The backpropagation algorithm works as follows:

Given a training set $\{ (x^{(1)}, y^{(1)}), \cdots, (x^{(m)}, y^{(m)}) \}$:

1. Set $\Delta^{(l)}_{ij} = 0$ for all $(l,i,j)$

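2. For each training example $t = 1, \dots, m$, set $a^{(1)} = x^{(t)}$ and perform forward propagation to compute $a^{(l)}$ for $l = 2, \dots, L$ (the remaining steps that refer to example $t$ are repeated for each $t$ as well)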

3. Using $y^{(t)}$, compute $\delta^{(L)} = a^{(L)} - y^{(t)}$, where $L$ is the last layer

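4. Compute $\delta^{(L-1)}, \delta^{(L-2)}, \dots, \delta^{(2)}$ **backwards**, starting from layer $L-1$ (there is no error term for the input layer):

$\delta^{(l)} = ((\Theta^{(l)})^T \delta^{(l+1)}) .* a^{(l)} .* (1-a^{(l)})$  
where $a^{(l)} .* (1-a^{(l)}) = g'(z^{(l)})$ is the derivative of the sigmoid activation and $.*$ denotes element-wise multiplication.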

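5. Accumulate $\Delta^{(l)}_{ij} := \Delta^{(l)}_{ij} + a^{(l)}_j \delta^{(l+1)}_i$

6. After all $m$ examples have been processed, the partial derivatives are $\frac{\partial}{\partial \Theta^{(l)}_{ij}} J(\Theta) = \frac{1}{m} \Delta^{(l)}_{ij} + \frac{\lambda}{m} \Theta^{(l)}_{ij}$ for $j \neq 0$, and $\frac{1}{m} \Delta^{(l)}_{ij}$ for $j = 0$ (the bias weights are not regularized)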
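
The steps above can be sketched in plain Python (not the course's Octave) as follows. This is an illustrative sketch under stated assumptions, not the course's reference implementation: the names `sigmoid`, `matvec` and `backprop` are made up here, weight matrices are nested lists with the bias weight in column 0, and regularization is left out, so the function returns $\frac{1}{m}\Delta^{(l)}$ only.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in M]

def backprop(thetas, X, Y):
    """Unregularized gradient of the cost, one matrix per layer.

    thetas[l] maps layer l (plus a bias unit) to layer l + 1 (0-based here).
    """
    m = len(X)
    # Step 1: zero accumulators with the same shapes as the weight matrices.
    Delta = [[[0.0] * len(row) for row in theta] for theta in thetas]
    for x, y in zip(X, Y):
        # Step 2: forward propagation, keeping each layer's input (with bias).
        acts, a = [], list(x)
        for theta in thetas:
            a = [1.0] + a
            acts.append(a)
            a = [sigmoid(z) for z in matvec(theta, a)]
        # Step 3: error at the output layer.
        delta = [a_k - y_k for a_k, y_k in zip(a, y)]
        # Steps 4 and 5: walk backwards through the layers.
        for l in range(len(thetas) - 1, -1, -1):
            a_l = acts[l]
            for i, d_i in enumerate(delta):
                for j, a_j in enumerate(a_l):
                    Delta[l][i][j] += a_j * d_i
            if l > 0:
                # (Theta^T delta) .* a .* (1 - a), dropping the bias entry.
                back = [sum(thetas[l][i][j] * delta[i] for i in range(len(delta)))
                        for j in range(len(a_l))]
                delta = [b * a_j * (1 - a_j) for b, a_j in zip(back[1:], a_l[1:])]
    # Average over the m training examples.
    return [[[d / m for d in row] for row in D] for D in Delta]
```

Storing each layer's input with the bias unit already prepended keeps the accumulation in step 5 a plain outer product of `delta` and `a_l`.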

### Implementation Note: Unrolling Parameters

In order to use advanced optimization routines such as <code>fminunc</code>, which work on a single parameter vector, we unroll our initial parameter matrices $\Theta^{(1)}$, $\Theta^{(2)}$ and $\Theta^{(3)}$ into a **single vector** <code>initialTheta</code>.

At this point, we apply:

<code>fminunc(@costFunction, initialTheta, options)</code>
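
As a rough illustration of what unrolling and its inverse involve, here is a sketch in plain Python (not the course's Octave). The helper names `unroll` and `reshape` are hypothetical. Note that this sketch flattens row by row, whereas Octave's `(:)` operator flattens column by column; either works as long as both directions agree.

```python
def unroll(matrices):
    """Flatten a list of matrices (lists of rows) into one flat list, row by row."""
    return [w for M in matrices for row in M for w in row]

def reshape(vector, shapes):
    """Rebuild matrices with the given (rows, cols) shapes from a flat list."""
    matrices, start = [], 0
    for rows, cols in shapes:
        matrices.append([vector[start + r * cols: start + (r + 1) * cols]
                         for r in range(rows)])
        start += rows * cols
    return matrices

theta1 = [[1, 2], [3, 4]]
theta2 = [[5, 6]]
flat = unroll([theta1, theta2])              # [1, 2, 3, 4, 5, 6]
restored = reshape(flat, [(2, 2), (1, 2)])   # [theta1, theta2] again
```

Inside the cost function, the vector would be reshaped back into matrices before running forward propagation and backpropagation, and the resulting gradient matrices unrolled again before being returned.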

### Gradient Checking 

We can use gradient checking to make sure that our backpropagation works as intended. We can approximate the derivative of our cost function with:

$\frac{\partial}{\partial \Theta} J(\Theta) \approx \frac{J(\Theta + \epsilon) - J(\Theta - \epsilon)}{2\epsilon}$

**Note**: make sure to use an $\epsilon$ that is small enough (e.g. $10^{-4}$) but not too small, to avoid running into numerical problems as $\epsilon \to 0$.
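
For a parameter vector with several entries, the approximation is applied to one entry at a time. A minimal sketch in plain Python (not the course's Octave), where `numerical_gradient` and its arguments are names chosen for this example:

```python
def numerical_gradient(cost, theta, eps=1e-4):
    """Two-sided finite-difference estimate of the gradient of cost at theta.

    cost:  function taking a flat list of parameters and returning a number
    theta: flat list of parameters (for example, the unrolled weights)
    """
    grad = []
    for i in range(len(theta)):
        plus = list(theta)
        minus = list(theta)
        plus[i] += eps    # perturb only the i-th parameter upwards
        minus[i] -= eps   # and downwards
        grad.append((cost(plus) - cost(minus)) / (2 * eps))
    return grad

# Toy cost with a known gradient: J = t0^3 + 2*t0*t1, so dJ/dt0 = 3*t0^2 + 2*t1
# and dJ/dt1 = 2*t0.
approx = numerical_gradient(lambda t: t[0] ** 3 + 2 * t[0] * t[1], [1.0, 2.0])
```

The estimate is then compared entry by entry with the gradient from backpropagation; the two should agree to several decimal places. Each entry needs two evaluations of the cost, which is why the check is slow and is switched off once backpropagation has been verified.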

### Random Initialization

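When we backpropagate, initializing all weights to zero is not effective: every hidden unit in a layer then computes the same value and receives the same update, so the units never become different from one another. Instead, it is helpful to initialize each weight randomly in the interval $[-\epsilon, \epsilon]$.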

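In Octave, this could look as follows, where <code>n</code> and <code>m</code> are placeholders for the dimensions of each weight matrix (here <code>m</code> is not the number of training examples):  

<code>Theta1 = rand(n,m) * (2 * INIT_EPSILON) - INIT_EPSILON;  
Theta2 = rand(n,m) * (2 * INIT_EPSILON) - INIT_EPSILON;  
Theta3 = rand(1,m) * (2 * INIT_EPSILON) - INIT_EPSILON;  
</code>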

**Note**: this $\epsilon$ is NOT related to the one in Gradient Checking.  
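
The same idea can be sketched in plain Python (not the course's Octave). The function name `rand_initialize`, the default value `0.12` for `init_epsilon`, and the `seed` parameter are all assumptions made for this example, not values prescribed by these notes.

```python
import random

def rand_initialize(rows, cols, init_epsilon=0.12, seed=None):
    """Return a rows x cols matrix with entries drawn uniformly from
    [-init_epsilon, init_epsilon)."""
    rng = random.Random(seed)  # a fixed seed makes the draw reproducible
    return [[rng.random() * 2 * init_epsilon - init_epsilon for _ in range(cols)]
            for _ in range(rows)]

# Example: weights from a layer with 3 units (plus bias) to a layer with 5 units.
theta = rand_initialize(5, 4, seed=1)
```

Because `rng.random()` returns values in $[0, 1)$, scaling by $2\epsilon$ and shifting by $-\epsilon$ maps them onto the desired interval, mirroring the `rand(...) * (2 * INIT_EPSILON) - INIT_EPSILON` expression.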

### Summary

Let's put it all together. 

First of all, we have to decide our network architecture, including:

* Number of input units = dimension of features $x^{(i)}$
* Number of output units = number of classes
* Number of hidden units per layer = usually the more the better (must balance with cost of computation)

Then we are ready to train our NN.

1. Randomly initialize the weights
2. Implement forward propagation to get $h_\Theta(x^{(i)})$ for any $x^{(i)}$
3. Implement the cost function
4. Implement backpropagation to compute partial derivatives
5. Use gradient checking to confirm that your backpropagation works. Then disable gradient checking.
6. Use gradient descent or a built-in optimization function to minimize the cost function with respect to the weights $\Theta$.

**Note**: $J(\Theta)$ is not convex and thus we can end up in a local minimum. 
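
To show how the training steps fit together, here is a compact end-to-end sketch in plain Python (not the course's Octave). It is only an illustration under several assumptions: all function names are made up for this example, plain batch gradient descent stands in for a built-in optimizer, regularization and the gradient-checking step are omitted for brevity, and the learning rate, iteration count and `init_epsilon` are arbitrary choices.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(thetas, x):
    """Activations of every layer; all but the output carry a leading bias unit."""
    acts, a = [], list(x)
    for theta in thetas:
        a = [1.0] + a
        acts.append(a)
        a = [sigmoid(sum(w * v for w, v in zip(row, a))) for row in theta]
    acts.append(a)
    return acts

def cost(thetas, X, Y):
    """Unregularized cross-entropy cost averaged over the training set."""
    total = 0.0
    for x, y in zip(X, Y):
        h = forward(thetas, x)[-1]
        total -= sum(yk * math.log(hk) + (1 - yk) * math.log(1 - hk)
                     for hk, yk in zip(h, y))
    return total / len(X)

def gradients(thetas, X, Y):
    """Backpropagation: partial derivatives of the cost for every weight."""
    grads = [[[0.0] * len(row) for row in theta] for theta in thetas]
    for x, y in zip(X, Y):
        acts = forward(thetas, x)
        delta = [hk - yk for hk, yk in zip(acts[-1], y)]
        for l in range(len(thetas) - 1, -1, -1):
            a = acts[l]
            for i, d in enumerate(delta):
                for j, aj in enumerate(a):
                    grads[l][i][j] += aj * d / len(X)
            if l > 0:
                # Error of the previous layer, skipping its bias unit (j = 0).
                delta = [sum(thetas[l][i][j] * delta[i] for i in range(len(delta)))
                         * a[j] * (1 - a[j]) for j in range(1, len(a))]
    return grads

def train(X, Y, layer_sizes, alpha=0.5, iterations=500, init_epsilon=0.12, seed=0):
    rng = random.Random(seed)
    # Random initialization in [-init_epsilon, init_epsilon].
    thetas = [[[rng.uniform(-init_epsilon, init_epsilon) for _ in range(n_in + 1)]
               for _ in range(n_out)]
              for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]
    # Batch gradient descent on the cost.
    for _ in range(iterations):
        grads = gradients(thetas, X, Y)
        thetas = [[[w - alpha * g for w, g in zip(row, g_row)]
                   for row, g_row in zip(theta, G)]
                  for theta, G in zip(thetas, grads)]
    return thetas

# Toy problem: learn the logical AND of two inputs with a 2-2-1 network.
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
Y = [[0], [0], [0], [1]]
trained = train(X, Y, layer_sizes=[2, 2, 1])
```

Comparing `cost(trained, X, Y)` with the cost of the untrained weights (for instance via `train(..., iterations=0)`) is a quick sanity check that the loop is reducing $J(\Theta)$.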

### Applications 

An interesting application is autonomous driving. In the example presented in the course video, a neural network trained with backpropagation learns to "steer" from the steering input provided by a human driver and pictures of the road acquired at regular intervals.