# Neural Networks and Deep Learning

These notes are taken from the [Neural Networks and Deep Learning online book](http://neuralnetworksanddeeplearning.com/) by [Michael Nielsen](http://michaelnielsen.org/).


## Perceptrons
The perceptron is one of the most basic artificial neurons. It is a function that takes binary inputs $ x_1, x_2, ..., x_n $, each in $ \{0, 1\} $, where each input has a corresponding weight $ w_1, w_2, ..., w_n $. 
The output is calculated as follows: 
$$ \begin{eqnarray}
  \mbox{output} & = & \left\{ \begin{array}{ll}
      0 & \mbox{if } \sum_j w_j x_j \leq \mbox{ threshold} \\
      1 & \mbox{if } \sum_j w_j x_j > \mbox{ threshold}
      \end{array} \right.
\end{eqnarray} $$

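Another way to write $ \sum_j w_j x_j $ is as the dot product $ w \cdot x $, where $ w $ and $ x $ are the vectors of weights and inputs. 
The threshold can also be moved to the other side of the inequality and replaced with a bias, $ b \equiv -\mbox{threshold} $.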

So the output function becomes: 
$$ \begin{eqnarray}
  \mbox{output} = \left\{ 
    \begin{array}{ll} 
      0 & \mbox{if } w\cdot x + b \leq 0 \\
      1 & \mbox{if } w\cdot x + b > 0
    \end{array}
  \right.
\end{eqnarray} $$

One can think of the bias as a measure of how easy it is to get the perceptron to output a `1`: the larger the bias, the easier it is. 
One can think of a weight as a measure of how important its input is to the output; the higher the weight, the more that input matters.
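
The bias form of the perceptron rule can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `perceptron` and the particular weights and bias (chosen here so that the perceptron behaves like a NAND gate) are assumptions made for the example.

```python
def perceptron(weights, inputs, bias):
    """Return 1 if w . x + b > 0, otherwise 0."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    return 1 if weighted_sum + bias > 0 else 0


# Hypothetical parameters: two inputs with weight -2 each and a bias of 3.
nand_weights = [-2, -2]
nand_bias = 3

# Evaluate the perceptron on every combination of binary inputs.
for x1 in (0, 1):
    for x2 in (0, 1):
        print((x1, x2), perceptron(nand_weights, [x1, x2], nand_bias))
```

With these values the weighted sum plus bias is positive for every input pair except `(1, 1)`, so only that pair produces a `0`. Raising the bias would make a `1` easier to obtain, which matches the intuition above.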

## Sigmoid Neurons

An advantage of sigmoid neurons over perceptrons is that a small change in a weight or bias of a sigmoid neuron results in only a small change in its output. In a perceptron, by contrast, a small change can flip the output from $ 0 $ to $ 1 $. 
This is because the sigmoid function is smooth, whereas the perceptron's step function jumps abruptly.

The first major difference between perceptrons and sigmoid neurons is that sigmoid neurons accept and produce real numbers between $ 0 $ and $ 1 $, whereas perceptrons only use $ 0 $ and $ 1 $ for both input and output.

Like the perceptron, the sigmoid neuron has inputs $ x_1, x_2, ..., x_n $, weights for each input $ w_1, w_2, ..., w_n $, and an overall bias $ b $.
Unlike the perceptron, the function to calculate the output is $ \sigma(w \cdot x+b) $ such that:
$$ \begin{eqnarray} 
  \sigma(z) \equiv \frac{1}{1+e^{-z}}.
\tag{3}\end{eqnarray} $$

$ \sigma $ is called the sigmoid function or the logistic function.

If $ z $ is a large positive number, then $ e^{-z} \approx 0 $ and $ \sigma(z) $ is very close to $ 1 $.
If $ z $ is a large negative number, then $ e^{-z} $ is very large and $ \sigma(z) $ is very close to $ 0 $.
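
The saturating behavior of $ \sigma $ is easy to check numerically. The snippet below is a small sketch using only Python's standard `math` module; the sample values of $ z $ are arbitrary choices for illustration.

```python
import math


def sigmoid(z):
    """Logistic function: 1 / (1 + e^(-z))."""
    # Note: for very large negative z, math.exp(-z) can overflow.
    # This sketch ignores that case and uses moderate values only.
    return 1.0 / (1.0 + math.exp(-z))


for z in (-10.0, -1.0, 0.0, 1.0, 10.0):
    print(z, sigmoid(z))
```

At $ z = 0 $ the output is exactly $ 0.5 $, and it moves toward $ 0 $ or $ 1 $ as $ z $ becomes strongly negative or positive.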

As noted at the start of this section, small changes $ \Delta w_j $ in the weights and $ \Delta b $ in the bias of a sigmoid neuron produce only a small change $ \Delta \mbox{output} $ in its output. 
Because $ \sigma $ is smooth, that change is well approximated by:
$$ \begin{eqnarray} 
  \Delta \mbox{output} \approx \sum_j \frac{\partial \, \mbox{output}}{\partial w_j}
  \Delta w_j + \frac{\partial \, \mbox{output}}{\partial b} \Delta b.
\end{eqnarray} $$
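
The approximation can be checked numerically for a single sigmoid neuron. The sketch below assumes the standard identity $ \sigma'(z) = \sigma(z)(1 - \sigma(z)) $, which gives $ \partial\,\mbox{output}/\partial w_j = \sigma'(z)\, x_j $ and $ \partial\,\mbox{output}/\partial b = \sigma'(z) $. The inputs, weights, and perturbations are made-up values for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def neuron_output(weights, inputs, bias):
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(z)


# Hypothetical neuron and small perturbations of its parameters.
inputs = [0.5, 0.8]
weights = [0.4, -0.6]
bias = 0.1
delta_w = [0.001, -0.002]
delta_b = 0.0005

out = neuron_output(weights, inputs, bias)
slope = out * (1 - out)  # sigma'(z), from the identity stated above

# Linear approximation: sum_j (d out / d w_j) * dw_j + (d out / d b) * db
predicted = sum(slope * x * dw for x, dw in zip(inputs, delta_w)) + slope * delta_b

# Actual change, obtained by re-evaluating the neuron.
new_weights = [w + dw for w, dw in zip(weights, delta_w)]
actual = neuron_output(new_weights, inputs, bias + delta_b) - out

print(predicted, actual)
```

For perturbations this small the two numbers should agree closely; the gap grows as the perturbations get larger.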

## Cost functions

The cost function is sometimes referred to as a loss or objective function.
The goal of the neural network is to minimize the value of the cost function.
If the value of the cost function is high, then the neural network is not doing a good job at classifying the inputs.

### Quadratic Cost Function
For the MNIST dataset we will use a _quadratic_ cost function, also known as _mean squared error (MSE)_. 
The cost function is:
$$ \begin{eqnarray}  C(w,b) \equiv
  \frac{1}{2n} \sum_x \| y(x) - a\|^2.
\end{eqnarray} $$
where $ w $ is the collection of all weights, $ b $ is the collection of all biases, $ n $ is the total number of training inputs, $ y(x) $ is the desired output for input $ x $, $ a $ is the vector of outputs from the network when $ x $ is the input, and the sum is over all training inputs $ x $.
$ \| v \| $ denotes the length (norm) of a vector $ v $.

$ C(w,b) $ is non-negative because it is a sum of non-negative terms.
The cost $ C(w,b) $ becomes small when the network output $ a $ is close to $ y(x) $ for all training inputs $ x $. 
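
As a rough sketch, the quadratic cost can be computed directly from its definition. The function name `quadratic_cost` and the tiny two-example dataset below are assumptions for illustration; they stand in for the desired outputs $ y(x) $ and the network outputs $ a $.

```python
def quadratic_cost(targets, outputs):
    """C = 1/(2n) * sum over examples of ||y - a||^2."""
    n = len(targets)
    total = 0.0
    for y, a in zip(targets, outputs):
        # Squared length of the difference vector y - a.
        total += sum((yi - ai) ** 2 for yi, ai in zip(y, a))
    return total / (2 * n)


# Hypothetical desired outputs for two training inputs.
targets = [[1.0, 0.0], [0.0, 1.0]]

perfect = [[1.0, 0.0], [0.0, 1.0]]   # network matches the targets exactly
all_zero = [[0.0, 0.0], [0.0, 0.0]]  # network outputs nothing useful

print(quadratic_cost(targets, perfect))   # 0.0
print(quadratic_cost(targets, all_zero))  # 0.5
```

The cost is zero only when every output matches its target, and it grows as the outputs drift away from the targets.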

## Minimization Techniques

The goal of training the neural network is to minimize the cost function, so we need an algorithm that finds weights and biases that make the cost as small as possible.

In the context of minimizing a function, we want to find the global minimum of some given function. 
A naive approach would be to use calculus: compute the derivatives and solve for the points where they are zero.
This approach does not work when there are a large number of variables, as in a neural network.

### _Gradient Descent_

Imagine that the function describes a valley, and that we place an imaginary ball somewhere in it.
Starting from some random point, the ball rolls downhill, and we hope that it comes to rest at the global minimum of the function.

When the ball moves a small amount $ \Delta v_1 $ in the $ v_1 $ direction and $ \Delta v_2 $ in the $ v_2 $ direction, the cost changes by approximately:
$$ \begin{eqnarray} 
  \Delta C \approx \frac{\partial C}{\partial v_1} \Delta v_1 +
  \frac{\partial C}{\partial v_2} \Delta v_2.
\tag{7}\end{eqnarray} $$
We want $ \Delta C $ to be negative, so that the ball rolls downhill.
To make $ \Delta C $ negative we must choose $ \Delta v_1 $ and $ \Delta v_2 $ appropriately.
Let $ \Delta v \equiv (\Delta v_1, \Delta v_2)^T $ where $ T $ is the transpose operation (transpose turns row vectors into column vectors).
Define the gradient of $ C $ as:
$$ \begin{eqnarray} 
  \nabla C \equiv \left( \frac{\partial C}{\partial v_1}, 
  \frac{\partial C}{\partial v_2} \right)^T.
\tag{8}\end{eqnarray} $$
Then the change in cost can be written as:
$$ \begin{eqnarray} 
  \Delta C \approx \nabla C \cdot \Delta v.
\tag{9}\end{eqnarray} $$

Now we can choose $ \Delta v = -\eta \nabla C $, where $ \eta $ is a small, positive parameter known as the _learning rate_.
Therefore, $ \Delta C \approx -\eta \nabla C \cdot \nabla C = -\eta \|\nabla C\|^2 $.
Because $ \| \nabla C \|^2 \geq 0 $, this guarantees $ \Delta C \leq 0 $ (to the accuracy of the approximation) when $ v $ is changed according to this rule.
The position of the imaginary ball is updated repeatedly as: $ v \rightarrow v' = v -\eta \nabla C $.

The gradient descent algorithm computes the gradient $ \nabla C $ and moves in the _opposite_ direction.
For this to work properly, $ \eta $ must not be too large, or the approximation for $ \Delta C $ breaks down and the imaginary ball may overshoot the minimum; $ \eta $ must also not be too small, or the algorithm will be too slow.
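
The update rule $ v \rightarrow v' = v - \eta \nabla C $ can be tried on a simple valley-shaped function. The sketch below assumes the toy cost $ C(v) = v_1^2 + v_2^2 $, whose gradient is $ (2 v_1, 2 v_2)^T $; the starting point, learning rate, and step count are arbitrary illustrative choices.

```python
def cost(v):
    """Toy cost function C(v) = v1^2 + v2^2 (an assumed example)."""
    return v[0] ** 2 + v[1] ** 2


def gradient(v):
    """Gradient of the toy cost: (2*v1, 2*v2)."""
    return [2 * v[0], 2 * v[1]]


def gradient_descent(v, eta, steps):
    """Repeatedly apply v -> v - eta * grad C and record the cost."""
    history = [cost(v)]
    for _ in range(steps):
        grad = gradient(v)
        v = [vi - eta * gi for vi, gi in zip(v, grad)]
        history.append(cost(v))
    return v, history


v_final, history = gradient_descent([3.0, 4.0], eta=0.1, steps=100)
print(v_final, history[0], history[-1])
```

With `eta=0.1` each step shrinks both coordinates by a factor of $ 0.8 $, so the cost falls steadily toward zero. For this particular function, a learning rate such as `eta=1.1` multiplies each coordinate by $ -1.2 $ per step, so the iterates overshoot and the cost grows instead.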

In the above equation, there were only two variables, but there can be more variables $ v_1, ..., v_m $ such that $ \Delta v = (\Delta v_1, ..., \Delta v_m)^T $ and $ \nabla C \equiv \left(\frac{\partial C}{\partial v_1}, \ldots, \frac{\partial C}{\partial v_m}\right)^T $.

When we apply the gradient descent update rule to neural networks, the variables are the weights $ w_k $ and biases $ b_l $, and the updates become: 
$ w_k \rightarrow w_k' = w_k-\eta \frac{\partial C}{\partial w_k} $ and $ b_l \rightarrow b_l' = b_l-\eta \frac{\partial C}{\partial b_l} $.
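
To see these updates in action, here is a minimal sketch for the smallest possible network: one sigmoid neuron with one weight and one bias, trained on a single example with the quadratic cost $ C = (y - a)^2 / 2 $. The training example, starting parameters, and learning rate are hypothetical values. The gradient expressions come from applying the chain rule to this one-neuron case, using $ \sigma'(z) = \sigma(z)(1 - \sigma(z)) $, and are not a general-purpose method for larger networks.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical training example and starting parameters.
x, y = 1.0, 0.0
w, b = 0.6, 0.9
eta = 0.15  # learning rate (assumed value)


def cost(w, b):
    a = sigmoid(w * x + b)
    return 0.5 * (y - a) ** 2


initial_cost = cost(w, b)

for step in range(300):
    a = sigmoid(w * x + b)
    # Chain rule for C = (y - a)^2 / 2 with a = sigmoid(w*x + b).
    delta = (a - y) * a * (1 - a)
    grad_w = delta * x  # dC/dw
    grad_b = delta      # dC/db
    # Gradient descent updates for the weight and the bias.
    w = w - eta * grad_w
    b = b - eta * grad_b

final_cost = cost(w, b)
print(initial_cost, final_cost)
```

The neuron starts with an output far from the desired value of $ 0 $, and each update nudges $ w $ and $ b $ in the direction that lowers the cost.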