# Neural networks

## Overview and terms

As we saw before, linear classifiers are often not the best choice for complicated problems. Neural networks (NNs) introduce nonlinearity. They were originally developed as mathematical models of the information processing capabilities of biological brains, and were popular in the 1980s and early 1990s. Recently, they have become popular again, especially as deep neural networks (DNNs), including convolutional NNs (CNNs), recurrent NNs (RNNs), etc.
Those more complicated variants are beyond the scope of this course, so let's start with the basics.

### Neurons in the brain
* high plasticity --- whenever you memorize a number, you physically alter your body (brain)
* nerve cells = "neurons", about 100 billion, lots of different types
* chemical reactions control electrical potential inside soma (body of neuron)
* membrane potential > threshold $\Rightarrow$ neuron "fires": pulses of fixed strength and duration are sent down the axon ([all-or-none law](https://en.wikipedia.org/wiki/All-or-none_law))
* axons provide connections (synapses) to many other neurons
* learning modifies the strength of synaptic connections

![Diagram of basic neuron and components](../figures/800px-Components_of_neuron.jpg)

[source](https://en.wikipedia.org/wiki/File:Components_of_neuron.jpg)

### Artificial neural networks
Simplest mathematical model: McCulloch-Pitts neuron, where the activation function is simply a comparison of the summed inputs to a threshold.

Correspondence with biological neuron:
* weighted inputs $\sim$ synapses (with different strengths)
* summation (= transfer function) $\sim$ signal collection in soma
* activation function $\sim$ neuron fires or not
* activation $\sim$ signal sent down the axon

![a single neuron of an artificial network](https://upload.wikimedia.org/wikipedia/commons/6/60/ArtificialNeuronModel_english.png)

[source](https://en.wikipedia.org/wiki/Backpropagation#/media/File:ArtificialNeuronModel_english.png)

As in the brain, large networks of neurons are needed to achieve the complexity required to solve ML problems.

**Perceptron** = collection of McCulloch-Pitts neurons = simplest NN. Can be proven to solve any classification problem that is linearly separable, but cannot, e.g., model a simple XOR of inputs $\Rightarrow$ need higher complexity.
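
The XOR limitation can be illustrated with a minimal sketch. The helper names (`mp_neuron`, `and_gate`, `or_gate`, `xor_gate`) and the particular weights and thresholds are assumptions chosen for illustration, and the neuron fires when the weighted sum is strictly above the threshold. A single neuron suffices for AND and OR, while XOR is built here from a hidden layer of two neurons feeding a third:

```python
def mp_neuron(inputs, weights, threshold):
    """McCulloch-Pitts neuron: output 1 if the weighted sum exceeds the threshold, else 0."""
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1 if total > threshold else 0


def and_gate(x1, x2):
    # Linearly separable: fires only if both inputs are 1 (sum 2 > 1.5).
    return mp_neuron((x1, x2), (1, 1), threshold=1.5)


def or_gate(x1, x2):
    # Linearly separable: fires if at least one input is 1 (sum >= 1 > 0.5).
    return mp_neuron((x1, x2), (1, 1), threshold=0.5)


def xor_gate(x1, x2):
    # XOR = (x1 OR x2) AND NOT (x1 AND x2): the two hidden neurons feed an
    # output neuron that weights the AND result negatively.
    h_or = or_gate(x1, x2)
    h_and = and_gate(x1, x2)
    return mp_neuron((h_or, h_and), (1, -1), threshold=0.5)


truth_table = [(a, b, xor_gate(a, b)) for a in (0, 1) for b in (0, 1)]
```

Here `truth_table` lists every input pair together with the XOR output of the two-layer construction.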

Below is a diagram of a **multilayer perceptron** (MLP) with a single hidden layer:
![NNFig](https://upload.wikimedia.org/wikipedia/commons/4/46/Colored_neural_network.svg)

circles = neurons, arrows = connections (directed acyclic graph)

Here, the network is **dense**: every node of the hidden layer is connected to all nodes of the previous and of the following layer.
The arrows indicate that information flows from left (the input) to right (the output). Every arrow has an assigned **weight** that is a free parameter in the training of the network.

Computing a series of weighted sums as is done by the above network is mathematically the same as computing just one weighted sum. An MLP with multiple linear hidden layers that all just sum up the inputs is thus equivalent to an MLP with a single linear hidden layer. To make this model more powerful than a linear model, a nonlinear function, the **activation function**, is applied to the weighted sum for each neuron in the hidden layer and used to determine the output that is propagated as input to the following neurons.
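
A small numerical check of this statement, as a sketch using NumPy. The matrices `W1` and `W2` and the input `x` are made-up values, and ReLU is picked as one possible nonlinearity:

```python
import numpy as np

x = np.array([1.0, -1.0])                  # input vector (hypothetical values)
W1 = np.array([[1.0, 0.0],
               [0.0, 1.0],
               [1.0, 1.0]])                # input -> hidden layer (3 neurons)
W2 = np.array([[1.0, 1.0, 1.0]])           # hidden -> output layer (1 neuron)

# Two purely linear layers ...
two_linear_layers = W2 @ (W1 @ x)
# ... are the same as one linear layer with the combined weight matrix.
collapsed = (W2 @ W1) @ x

# With a nonlinear activation (here ReLU) on the hidden layer, the network
# can no longer be collapsed into the single matrix W2 @ W1.
with_relu = W2 @ np.maximum(0.0, W1 @ x)
```

For these values the two linear variants agree, while the ReLU version gives a different output.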

The number of neurons in the output layer and the choice of **output activation function** depend on the task the network is intended for. For **binary classification** tasks, a typical setup has a single output neuron with a logistic sigmoid activation.
Large neural networks made up of many hidden layers of computation go under the term **deep learning**. Note that for dense layers, the number of parameters increases linearly in the number of hidden layers but quadratically in the number of neurons in each layer.
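
To make the scaling concrete, the following sketch counts the trainable parameters of a dense network. The function name `count_dense_parameters` and the layer sizes are hypothetical, and one bias per neuron is assumed in addition to the connection weights:

```python
def count_dense_parameters(layer_sizes, with_bias=True):
    """Count weights (and optionally biases) of a dense feed-forward network.

    layer_sizes lists the number of neurons per layer, from input to output.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out          # one weight per connection
        if with_bias:
            total += n_out             # one bias per receiving neuron
    return total


base = count_dense_parameters([10, 50, 50, 1])        # two hidden layers of 50
deeper = count_dense_parameters([10, 50, 50, 50, 1])  # one more hidden layer
wider = count_dense_parameters([10, 100, 100, 1])     # twice as many neurons
```

Adding a hidden layer of the same width adds a fixed number of parameters, whereas doubling the width roughly quadruples the dominant hidden-to-hidden term.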

## Basics of the mathematics behind NNs
* One can write the output from one neuron as $a_i = g\left(\theta_i^T x_i\right)$, where $i$ is the index of the neuron, $x_i$ its vector of inputs, and $\theta_i$ are the weights of the input connections of the neuron (also a vector). $g$ is the activation function. 
* The example NN in the sketch above has an input layer (layer 1), a single hidden layer (layer 2), and an output layer (layer 3). 
* Collecting the weighted sums of all neurons in layer $j$ into one vector $z^{(j)} = \{z^{(j)}_i\}_i$, we can write $z^{(j)} = \Theta^{(j-1)}a^{(j-1)}$, with an upper index labeling the layer and $a^{(1)} = x$ the input vector. The matrix $\Theta^{(j-1)}$ holds the weights of all connections from layer $j-1$ to layer $j$, with one row per neuron in layer $j$.
* Then $a^{(j)} = g(z^{(j)})$. Thus evaluating the NN is a series of matrix multiplications, each followed by an activation function. (This makes it obvious that without the nonlinearity introduced by the activation functions, the whole NN would still be a linear map.)
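
The evaluation described above can be sketched in a few lines of NumPy. The function names `sigmoid` and `forward`, the layer shapes, and the random weights are assumptions for illustration, and bias terms are omitted as in the formulas above:

```python
import numpy as np


def sigmoid(z):
    """Logistic activation function, applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))


def forward(x, thetas, g=sigmoid):
    """Evaluate a dense network layer by layer.

    thetas[k] holds the weights of the connections between two consecutive
    layers, with one row per neuron of the receiving layer.
    """
    a = np.asarray(x, dtype=float)     # a^(1) = x
    for theta in thetas:
        z = theta @ a                  # z^(j) = Theta^(j-1) a^(j-1)
        a = g(z)                       # a^(j) = g(z^(j))
    return a


# Hypothetical network: 3 inputs -> 4 hidden neurons -> 1 output neuron.
rng = np.random.default_rng(seed=0)
thetas = [rng.normal(size=(4, 3)), rng.normal(size=(1, 4))]
output = forward([0.5, -1.0, 2.0], thetas)
```

With a sigmoid on the output neuron, `output` can be read as a score between 0 and 1, as in the binary classification setup mentioned earlier.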

In order to train the NN, we have to determine the weight matrices $\Theta^{(i)}$ (which are basically just bookkeeping devices, each holding the weights of all input connections for all neurons of one layer) that minimize the **cost function**.
This is done using a method called **backpropagation**.
The cost function of a NN is similar to what we have for logistic regression, modified to take into account possible multiple outputs, and with more complicated regularization.

### Backpropagation
To train the network, we define a cost function $E$ that is minimized by propagating "errors" in the output values backwards, where the error specifies the difference between the current output of the network given some input and the correct output corresponding to this input.

In general we only know the desired output at the very end of the network, i.e. for the predictions made by the output layer, and can thus only compute the error directly for the last layer. However, the output depends on the weights of all layers in the network. The idea of backpropagation is to propagate the error of the last layer backwards and to compute the gradient of the cost function with respect to all weights, which then allows us to adjust the weights layer by layer.

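In the per-connection notation used in the following, $w_{ij}$ is the weight of the connection from neuron $i$ to neuron $j$, $\varphi$ plays the role of the activation function $g$, and $\theta_j$ now denotes the bias of neuron $j$ (interpreting the "threshold" as a bias) rather than a weight vector:
$$\text{net}_j = \sum_i w_{ij} x_i + \theta_j, \qquad o_j = \varphi(\text{net}_j),$$
and $E = E(\vec o)$.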

The partial derivatives of the cost function with respect to the weights can be obtained from the chain rule,
$$\frac{\partial E}{\partial w_{ij}} 
    = \frac{\partial E}{\partial o_{j}} \frac{\partial o_{j}}{\partial w_{ij}}
    = \frac{\partial E}{\partial o_j} \frac{\partial o_j}{\partial\text{net}_j} \frac{\partial \text{net}_j}{\partial w_{ij}}$$ 
and the weights are modified such that the cost function is reduced:
$$ \Delta w_{ij} = - \eta \frac{\partial E}{\partial w_{ij}}.$$ <!--= - \eta \delta_j o_i -->
$\eta$ is a hyperparameter of the training, the **learning rate**, which sets the size of each update step.

In the derivative of the cost function, the terms from right to left are
* the output of the neuron $i$ feeding into the neuron $j$ with weight $w_{ij}$, which we see from above is $\frac{\partial \text{net}_j}{\partial w_{ij}} = x_i$
  * for a neuron in the first hidden layer, $x_i$ is just the $i$th entry of the input vector, for the following layers it is the output of a neuron in the previous layer
* the derivative of the activation function, $\frac{\partial o_j}{\partial\text{net}_j} = \frac{\partial \varphi(\text{net}_j)}{\partial\text{net}_j} = \varphi'(\text{net}_j)$
* and for a neuron in the last layer, $\frac{\partial E}{\partial o_j}$ is by definition the derivative of the cost function with respect to the output of this neuron.

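Using the quadratic difference (squared error) $E = \sum_j (\hat y_j - y_j)^2$ as the cost function, where $\hat y_j = o_j$ is the prediction of output neuron $j$ and $y_j$ the corresponding target, we get $\frac{\partial E}{\partial o_j} = 2(\hat y_j - y_j)$, such that for a neuron in the output layer: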
$$\frac{\partial E}{\partial w_{ij}} 
    = 2(\hat y_j - y_j) \varphi'(\text{net}_j) x_i$$ 
$\Rightarrow$ we can compute the update $\Delta w_{ij}$ to apply to the weight $w_{ij}$ of a neuron in the output layer.
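
The output-layer gradient can be checked numerically. The sketch below assumes a single output neuron with a sigmoid activation $\varphi$ (so that $\varphi' = \varphi(1-\varphi)$) and made-up values for the weights, inputs, bias and target. All function names are hypothetical. It compares the chain-rule expression with a finite-difference estimate and then applies one update step $\Delta w_{ij} = -\eta\,\partial E/\partial w_{ij}$:

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def neuron_output(w, x, theta):
    """o = phi(net) with net = sum_i w_i x_i + theta."""
    net = sum(wi * xi for wi, xi in zip(w, x)) + theta
    return sigmoid(net)


def squared_error(w, x, theta, y):
    return (neuron_output(w, x, theta) - y) ** 2


def analytic_gradient(w, x, theta, y):
    """dE/dw_i = 2 (o - y) * phi'(net) * x_i, using phi' = o (1 - o) for the sigmoid."""
    o = neuron_output(w, x, theta)
    return [2.0 * (o - y) * o * (1.0 - o) * xi for xi in x]


def numeric_gradient(w, x, theta, y, eps=1e-6):
    """Central finite differences, used only to verify the formula."""
    grad = []
    for i in range(len(w)):
        w_plus, w_minus = list(w), list(w)
        w_plus[i] += eps
        w_minus[i] -= eps
        diff = squared_error(w_plus, x, theta, y) - squared_error(w_minus, x, theta, y)
        grad.append(diff / (2.0 * eps))
    return grad


# Hypothetical values for one output neuron with three inputs.
w, x, theta, y = [0.2, -0.4, 0.1], [1.0, 2.0, -1.0], 0.05, 1.0
eta = 0.1                                            # learning rate

grad = analytic_gradient(w, x, theta, y)
w_new = [wi - eta * gi for wi, gi in zip(w, grad)]   # one gradient step
```

For this example the analytic and numeric gradients agree closely, and the updated weights `w_new` give a smaller error than the original ones.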
   
For a neuron in any hidden layer, $\frac{\partial E}{\partial w_{ij}}$ depends on the error terms of all the following layers, which is why we need to compute backwards. <!--- Here, the notation becomes impossible if we don't use an additional index for the layer -->
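
One common way to organize this calculation, written here as a sketch in the notation above, is to introduce an error term $\delta_j = \frac{\partial E}{\partial \text{net}_j}$ for every neuron. The index $k$ is an addition for this sketch: it runs over the neurons of the following layer that receive the output $o_j$ through the weights $w_{jk}$. Applying the chain rule to $\frac{\partial E}{\partial o_j} = \sum_k \frac{\partial E}{\partial \text{net}_k} \frac{\partial \text{net}_k}{\partial o_j}$ gives
$$\delta_j = \varphi'(\text{net}_j) \sum_k w_{jk}\, \delta_k, \qquad \frac{\partial E}{\partial w_{ij}} = \delta_j\, x_i.$$
For a neuron in the output layer, the sum is replaced by the direct derivative, $\delta_j = \frac{\partial E}{\partial o_j} \varphi'(\text{net}_j)$, which reproduces the output-layer expression above. The $\delta_k$ of a layer must therefore be known before the $\delta_j$ of the preceding layer can be evaluated.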

<!--For more details on backpropagation, take a look, for example, at the [ML course][1] mentioned above.-->
The backpropagation formula is written down explicitly and cleanly, e.g., in [this wiki][2].
Another explanation of backpropagation with lots of nice animations is given by 3Blue1Brown in the [video "Backpropagation calculus"][3] as part of their series on deep learning.


[1]: https://www.coursera.org/learn/machine-learning
[2]: https://brilliant.org/wiki/backpropagation/
[3]: https://www.youtube.com/watch?v=tIeHLnjs5U8

### Minimization

Although the loss function of a neural network is typically highly non-convex with many local minima, the optimization is usually performed via gradient descent. That means the parameters are updated step by step in the direction of the negative gradient (steepest descent). The total cost function over the whole training dataset is the mean of the costs of the individual training examples. This makes it possible to perform **stochastic gradient descent** (SGD), in which each gradient update is calculated only on a random subset (**batch**) of the training data. The advantages of this method are:

* less computational effort for each gradient update since only a subset of examples has to be (back-)propagated through the network
* less memory consumption
* random fluctuations can help to escape local minima

On the other hand, more gradient steps are typically needed when the gradient is calculated from smaller batches, and batch sizes that are too small can lead to large fluctuations of the loss value during training. As a consequence, there is a trade-off between fast computation of each gradient step and the total number of gradient steps needed, which is tuned by choosing an appropriate batch size.
There are many improvements to plain gradient descent that try to adjust the step sizes (**learning rate**) dynamically, possibly on a per-parameter basis. One of the most popular optimization algorithms currently (2019) is [Adam](https://arxiv.org/abs/1412.6980v8).
A nice overview can be found at http://ruder.io/optimizing-gradient-descent/index.html or https://arxiv.org/abs/1609.04747.
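
The batch-wise update loop can be sketched as follows. To keep the example short, it assumes a plain linear model with a mean squared error cost instead of a neural network. For a network, the line computing `grad` would be replaced by backpropagation. The function name `sgd_linear_fit`, the synthetic data and all default values are hypothetical:

```python
import numpy as np


def sgd_linear_fit(X, y, batch_size=16, eta=0.1, epochs=200, seed=0):
    """Fit y ~ X @ w with mini-batch stochastic gradient descent."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    for _ in range(epochs):
        order = rng.permutation(n_samples)             # reshuffle every epoch
        for start in range(0, n_samples, batch_size):
            batch = order[start:start + batch_size]    # random subset of the data
            residual = X[batch] @ w - y[batch]
            # Gradient of the mean squared error over this batch only.
            grad = 2.0 * X[batch].T @ residual / len(batch)
            w -= eta * grad                            # step against the gradient
    return w


# Synthetic, noise-free data generated from known weights.
data_rng = np.random.default_rng(1)
X = data_rng.normal(size=(200, 2))
true_w = np.array([1.5, -0.5])
y = X @ true_w

w_fit = sgd_linear_fit(X, y)
```

Varying `batch_size` in this sketch is a simple way to explore the trade-off between the cost of one step and the number of steps needed.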

### Activation functions
The activation function of a neuron transforms the net input into the activation (output) of the neuron, i.e. whether the neuron is firing or not.

* Differentiable functions allow the network to be trained with gradient descent.
* Many activation functions (e.g. sigmoid and tanh) map the potentially unbounded range of the net input of a neuron to a finite output range and are therefore sometimes referred to as squashing functions. Others, such as ReLU, are unbounded from above.
* Vanishing gradient problem: e.g. for the sigmoid function, the gradient becomes very small for net inputs of large magnitude, leading to small learning effects (training is slowed down).
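
The vanishing gradient of the sigmoid can be made concrete with a short sketch. The function names and the sample net inputs are chosen for illustration only:

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_prime(z):
    """Derivative of the logistic sigmoid: s(z) * (1 - s(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)


# Gradient of the activation at a few net inputs (hypothetical values).
gradients = {z: sigmoid_prime(z) for z in (0.0, 2.0, 5.0, 10.0)}

# The derivative never exceeds its value at z = 0, which is 0.25. The chain
# rule multiplies one such factor per layer, so the activation derivatives
# alone contribute at most 0.25**5 across five sigmoid layers.
max_factor_five_layers = 0.25 ** 5
```

The derivative peaks at a net input of zero and falls off quickly on both sides, which is why saturated sigmoid neurons learn slowly.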

Popular choices include ([source][cs229-dl]):
![overview of commonly used activation functions][activation-fig]

* For example, with the logistic function (a special case of the [sigmoid function](https://en.wikipedia.org/wiki/Sigmoid_function)) as the activation function, we have $g\left(\theta^Tx\right) = \frac{1}{1+\mathrm{exp}\left(-\theta^Tx\right)}$,
* or, with a Rectified Linear Unit (ReLU), $g\left(\theta^Tx\right) = \mathrm{max}\left(0, \theta^Tx\right)$.

[activation-fig]: ../figures/activation_functions.png "overview of commonly used activation functions"
[cs229-dl]: https://stanford.edu/~shervine/teaching/cs-229/cheatsheet-deep-learning
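
The logistic and ReLU formulas above translate directly into NumPy, with `np.tanh` added for comparison. This is a sketch: the function names and the sample net inputs are chosen for illustration:

```python
import numpy as np


def logistic(z):
    """Logistic sigmoid: 1 / (1 + exp(-z)), output in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


def relu(z):
    """Rectified linear unit: max(0, z), unbounded from above."""
    return np.maximum(0.0, z)


net = np.array([-2.0, 0.0, 2.0])       # sample net inputs theta^T x
activations = {
    "logistic": logistic(net),
    "tanh": np.tanh(net),              # output in (-1, 1)
    "relu": relu(net),
}
```

All three functions act element-wise, so they can be applied to the whole vector of net inputs of a layer at once.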

* `sklearn` offers [these](https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html) activation functions: `identity`, `logistic`, `tanh`, `relu`
* Other frameworks like [Keras][4] provide a larger set of activation functions: 
  * shown above: `sigmoid`, `tanh`, `ReLU` (rectified linear unit)
  * and more: `linear`, [`ELU`][elu] (exponential linear unit), [`Softmax`][5], ...
  * in addition, more involved activations, e.g. learnable ones (which maintain a state), are available as advanced activation layers. These include `PReLU` (slope for negative inputs learned during training) and `LeakyReLU` (fixed slope for negative inputs).

[elu]: https://arxiv.org/pdf/1511.07289.pdf
[4]: https://keras.io/activations/
[5]: https://de.wikipedia.org/wiki/Softmax-Funktion