# Multilayer Perceptron and Backpropagation

An MLP is composed of one (passthrough) *input layer*, one or more layers of TLUs called *hidden layers*, and one final layer of TLUs called the *output layer*.

The layers close to the input layer are usually called the *lower layers*, and the ones close to the output layer are called the *upper layers*.

Every layer except the output layer includes a bias neuron and is fully connected to the next layer.

> The signal flows in only one direction (from the inputs to the outputs), so this architecture is called a *feedforward neural network* (FNN).

When an ANN contains a deep stack of hidden layers, it is called a *deep neural network* (DNN).

In 1986, David Rumelhart, Geoffrey Hinton, and Ronald Williams published a groundbreaking paper that introduced the *backpropagation* training algorithm, which is still in use today. In short, it is Gradient Descent combined with an efficient technique for computing the gradients automatically.

> Automatically computing gradients is called *automatic differentiation*, or *autodiff*. The autodiff technique used by backpropagation is called *reverse-mode autodiff*. It is fast and precise, and it is well suited to cases where the function to differentiate has many variables (e.g. connection weights) and few outputs (e.g. one loss).

Backpropagation Algorithm:

- It handles one mini-batch at a time, and it goes through the full training set multiple times. Each pass through the full training set is called an *epoch*.
- Each mini-batch is passed through the network from the input layer, through the hidden layers, to the output layer. This is called the *forward pass*. The results of every layer are preserved because they are needed to compute the gradients.
- Next, the algorithm measures the network's output error using a loss function.
- Then it computes analytically how much each output connection contributed to the error by applying the chain rule.
- Then it measures how much of these error contributions came from each connection in the layer below, again using the chain rule, and works backward until it reaches the input layer. This is called the *backward pass*.
- Finally, the algorithm performs a Gradient Descent step to tweak all the connection weights in the network, using the error gradients it just computed.
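
The steps above can be sketched in NumPy for a network with one sigmoid hidden layer and a linear output trained with the MSE loss. This is a minimal illustration under stated assumptions, not a reference implementation: the toy data, the layer sizes, the hyperparameter values, and all variable names (`W1`, `d_out`, `losses`, and so on) are hypothetical choices made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy regression data: 64 samples, 3 features, 1 target.
X = rng.normal(size=(64, 3))
y = np.sin(X[:, :1]) + 0.5 * X[:, 1:2]

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Random initialization: one hidden layer (4 units) and a linear output.
W1 = rng.normal(scale=0.5, size=(3, 4))
b1 = np.zeros(4)
W2 = rng.normal(scale=0.5, size=(4, 1))
b2 = np.zeros(1)

learning_rate, batch_size, n_epochs = 0.1, 16, 200
losses = []

for epoch in range(n_epochs):              # one epoch = one pass over the data
    order = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        xb, yb = X[idx], y[idx]

        # Forward pass: keep the hidden activations for the backward pass.
        h = sigmoid(xb @ W1 + b1)
        y_hat = h @ W2 + b2

        # Gradient of the MSE loss with respect to the network output.
        d_out = 2.0 * (y_hat - yb) / len(xb)

        # Backward pass: chain rule, from the output layer down to the input.
        dW2 = h.T @ d_out
        db2 = d_out.sum(axis=0)
        d_h = d_out @ W2.T
        d_z1 = d_h * h * (1.0 - h)          # sigmoid'(z) = h * (1 - h)
        dW1 = xb.T @ d_z1
        db1 = d_z1.sum(axis=0)

        # Gradient Descent step on every parameter.
        W1 -= learning_rate * dW1
        b1 -= learning_rate * db1
        W2 -= learning_rate * dW2
        b2 -= learning_rate * db2

    # Track the loss on the full training set once per epoch.
    full_pred = sigmoid(X @ W1 + b1) @ W2 + b2
    losses.append(float(np.mean((full_pred - y) ** 2)))
```

Comparing `losses[0]` with `losses[-1]` gives a quick check that the updates reduce the training error. In practice a framework's autodiff would compute these gradients instead of hand-written formulas.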

> It is important to initialize all the hidden layers' connection weights randomly, or else training will fail. For example, if you initialize all weights and biases to zero, then all neurons in a given layer will be perfectly identical, so backpropagation will affect them in exactly the same way and they will remain identical. The layer then behaves as if it had only one neuron. If you initialize the weights randomly, you *break the symmetry* and allow backpropagation to train a diverse team of neurons.
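
The symmetry problem can be observed directly in a small experiment. The sketch below trains the same tiny network twice, once from all-zero weights and once from random weights, and then checks whether the hidden neurons ended up with identical incoming weights. The helper `train_hidden_weights`, the toy data, and the hyperparameters are hypothetical, and biases are omitted to keep the example short.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_hidden_weights(W1, W2, X, y, learning_rate=0.1, n_steps=50):
    """Full-batch Gradient Descent on a 1-hidden-layer MLP (no biases).

    Returns the trained hidden-layer weights, one column per hidden neuron.
    """
    W1, W2 = W1.copy(), W2.copy()
    for _ in range(n_steps):
        h = sigmoid(X @ W1)                       # forward pass
        d_out = 2.0 * (h @ W2 - y) / len(X)       # MSE gradient at the output
        dW2 = h.T @ d_out                         # backward pass (chain rule)
        dW1 = X.T @ ((d_out @ W2.T) * h * (1.0 - h))
        W1 -= learning_rate * dW1
        W2 -= learning_rate * dW2
    return W1

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 2))
y = np.sin(X[:, :1]) + X[:, 1:2]

# Same network, two initializations: all zeros versus random.
W1_zero = train_hidden_weights(np.zeros((2, 3)), np.zeros((3, 1)), X, y)
W1_rand = train_hidden_weights(rng.normal(size=(2, 3)),
                               rng.normal(size=(3, 1)), X, y)

# Compare every hidden neuron's weights with those of the first neuron.
zero_init_identical = np.allclose(W1_zero, W1_zero[:, :1])
random_init_identical = np.allclose(W1_rand, W1_rand[:, :1])
```

With zero initialization every hidden neuron receives the same gradient at every step, so the columns of `W1_zero` never diverge from each other, while the randomly initialized neurons stay distinct.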

For this algorithm to work properly, the MLP architecture needed one key change: the step function, which is not differentiable at 0 and has a zero gradient everywhere else, was replaced with the logistic (sigmoid) function:
$$
\sigma(z) = 1 / (1 + \exp(-z))
$$
The sigmoid is differentiable everywhere, and its output ranges from 0 to 1. Other activation functions can be used as well:

- *Hyperbolic tangent function*: $\tanh(z) = 2\sigma(2z)-1$
  - It is S-shaped like the sigmoid, but its output ranges from -1 to 1. This tends to keep each layer's output more or less centered around 0 at the beginning of training, which often helps speed up convergence.
- *Rectified Linear Unit function*: $\mathrm{ReLU}(z) = \max(0, z)$
  - It is continuous but not differentiable at $z = 0$, and its derivative is 0 for $z < 0$. In practice it works well and is very fast to compute, which has made it a common default. Its output has no upper bound, so it does not saturate for large positive inputs.
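
As a quick illustration, the three activation functions can be written in a few lines of NumPy. The function names are chosen for this sketch, and `relu_grad` adopts the convention (an assumption here, though a common one) of returning 0 at $z = 0$, where the derivative is undefined.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    # Written via the identity tanh(z) = 2 * sigmoid(2z) - 1.
    return 2.0 * sigmoid(2.0 * z) - 1.0

def relu(z):
    return np.maximum(0.0, z)

def relu_grad(z):
    # 0 for z < 0, 1 for z > 0; the value at z == 0 is a convention (0 here).
    return (z > 0).astype(float)

z = np.linspace(-3.0, 3.0, 7)   # [-3, -2, -1, 0, 1, 2, 3]
sigmoid_out = sigmoid(z)        # values strictly between 0 and 1
tanh_out = tanh(z)              # values strictly between -1 and 1
relu_out = relu(z)              # negative inputs are clipped to 0
```

Comparing `tanh(z)` with NumPy's built-in `np.tanh(z)` is an easy way to confirm the identity used above.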

We need activation functions because chaining several linear transformations yields just another linear transformation. Without a nonlinearity between layers, even a deep stack of layers is equivalent to a single layer.
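
To make this concrete, two stacked linear layers satisfy $(XW_1 + b_1)W_2 + b_2 = X(W_1 W_2) + (b_1 W_2 + b_2)$, which is a single linear layer with weights $W_1 W_2$ and bias $b_1 W_2 + b_2$. The sketch below checks this numerically on random matrices; the shapes and variable names are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
W1, b1 = rng.normal(size=(3, 4)), rng.normal(size=4)
W2, b2 = rng.normal(size=(4, 2)), rng.normal(size=2)

# Two linear layers applied one after the other, with no activation between.
two_layers = (X @ W1 + b1) @ W2 + b2

# The equivalent single linear layer.
W, b = W1 @ W2, b1 @ W2 + b2
one_layer = X @ W + b
layers_collapse = np.allclose(two_layers, one_layer)

# Inserting a ReLU between the layers breaks the equivalence.
with_relu = np.maximum(0.0, X @ W1 + b1) @ W2 + b2
relu_differs = not np.allclose(with_relu, one_layer)
```

The same argument extends to any number of stacked linear layers, which is why the nonlinearity is what gives depth its expressive power.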

## Regression MLPs

We can use MLPs for regression tasks. To predict a single value, you need just one output neuron; for multivariate regression, you need one output neuron per output dimension.

In general, you don't need any activation function for the output layer, so the outputs are free to take any value. To guarantee that the output is never negative, you can use ReLU, or the *softplus function*, a smooth variant of ReLU whose output is strictly positive: $\mathrm{softplus}(z) = \log(1 + \exp(z))$.

Finally, if you want the output to fall within a specific range, you can use the sigmoid or tanh function and rescale its output to that range.
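
These output-layer choices can be sketched as follows. The helper `scaled_sigmoid` and its `low`/`high` parameters are hypothetical names for this example, and `np.logaddexp(0, z)` is used as a numerically stable way to evaluate $\log(1 + \exp(z))$.

```python
import numpy as np

def softplus(z):
    # log(1 + exp(z)), evaluated as log(exp(0) + exp(z)) to avoid overflow.
    return np.logaddexp(0.0, z)

def scaled_sigmoid(z, low, high):
    # Sigmoid output in (0, 1), rescaled to the interval (low, high).
    return low + (high - low) / (1.0 + np.exp(-z))

z = np.array([-5.0, 0.0, 5.0])      # raw output-layer values
positive_out = softplus(z)          # always greater than 0
bounded_out = scaled_sigmoid(z, low=10.0, high=20.0)
```

For large positive `z`, `softplus(z)` is close to `z` itself, which is why it behaves like a smooth ReLU.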

The loss function is generally the mean squared error. If the training set has a lot of outliers, you may prefer the mean absolute error. Alternatively, you can use the *Huber loss*, which is a combination of both.

> The Huber loss is quadratic when the error is smaller than a threshold $\delta$ (typically 1) and linear when the error is larger than $\delta$.
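
A minimal NumPy sketch of the Huber loss, using the usual definition: $\frac{1}{2}e^2$ when $|e| \le \delta$ and $\delta(|e| - \frac{1}{2}\delta)$ otherwise, where $e$ is the error. The function name and argument names are illustrative, and the result is averaged over all samples.

```python
import numpy as np

def huber_loss(y_true, y_pred, delta=1.0):
    error = y_true - y_pred
    abs_error = np.abs(error)
    quadratic = 0.5 * error ** 2                  # used when |error| <= delta
    linear = delta * (abs_error - 0.5 * delta)    # used when |error| > delta
    return float(np.mean(np.where(abs_error <= delta, quadratic, linear)))

small_error = huber_loss(np.array([0.0]), np.array([0.5]))   # quadratic regime
large_error = huber_loss(np.array([0.0]), np.array([3.0]))   # linear regime
squared_large = 0.5 * 3.0 ** 2   # what the quadratic branch would have given
```

Because large errors grow only linearly, a single outlier contributes far less than it would under the squared error (compare `large_error` with `squared_large`).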

## Classification MLPs

MLPs are also commonly used for classification tasks. For binary classification, you need a single output neuron with the logistic activation function: the output is a number between 0 and 1, which you can interpret as the estimated probability of the positive class.

MLPs can also easily handle multilabel binary classification. For example, with 2 binary labels you can use two output neurons, each with the logistic activation function. The two output probabilities do not need to add up to 1, because the labels are independent of each other.

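If each instance can belong to only one class out of three or more (e.g. digits 0 through 9 in MNIST), then you need one output neuron per class, and you apply the *softmax function* to the whole output layer. Softmax ensures that all the estimated probabilities are between 0 and 1 and that they add up to 1, which is required when the classes are exclusive. This is called multiclass classification.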
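
The difference between the two output layers shows up when the same raw scores are passed through each of them. In the sketch below the `logits` values are made up for illustration, and `softmax` subtracts the row maximum before exponentiating, a standard trick for numerical stability that does not change the result.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # Subtracting the row maximum avoids overflow in exp().
    shifted = z - np.max(z, axis=-1, keepdims=True)
    exp_z = np.exp(shifted)
    return exp_z / np.sum(exp_z, axis=-1, keepdims=True)

logits = np.array([[2.0, 1.0, 0.1]])   # raw outputs for one instance

# Multilabel: each output is an independent probability.
multilabel_probs = sigmoid(logits)

# Multiclass: the outputs form one probability distribution.
multiclass_probs = softmax(logits)
```

Each row of `multiclass_probs` sums to 1, whereas the entries of `multilabel_probs` are free to sum to any value between 0 and the number of labels.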

For the loss function, since we are predicting probability distributions, the *cross-entropy loss* (also called the log loss) is generally a good choice.
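
For integer class labels, the cross-entropy is the mean negative log of the probability the model assigned to the true class. The sketch below assumes `probs` already holds softmax outputs; the function name, the example probabilities, and the small `eps` clip (added to avoid taking the log of 0) are choices made for this illustration.

```python
import numpy as np

def cross_entropy(probs, labels, eps=1e-12):
    """Mean negative log-probability of the true class.

    probs:  array of shape (n_samples, n_classes), each row summing to 1.
    labels: integer class ids, shape (n_samples,).
    """
    picked = probs[np.arange(len(labels)), labels]
    return float(-np.mean(np.log(np.clip(picked, eps, 1.0))))

probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])

# True classes match the most probable classes: low loss.
loss_when_right = cross_entropy(probs, np.array([0, 1]))

# True classes were given only 0.1 probability each: high loss.
loss_when_wrong = cross_entropy(probs, np.array([2, 0]))
```

The loss penalizes the model heavily when it assigns a low probability to the correct class, which is the behavior we want from a probabilistic classifier.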

