# Week 3
## Key Concepts from This Week

- Multilayer perceptron
- Layer (input, hidden, output)
- Activation function
- Classification
- Random initialization
- Train and test set
---

## Multilayer Perceptron

- This classical model is still widely used for processing vector data. 
- In the literature it is also called a _feed-forward_ model or a _dense_ model. The iconic illustration of neural networks shows this model:

![Multilayer Perceptron](images/mlp.svg)
<center><small>License: Glosser.ca <a href="https://creativecommons.org/licenses/by-sa/3.0">CC BY-SA 3.0</a>, via Wikimedia Commons</small></center>

- Models like these consist of several layers. 
- First, there is an input layer; the model above has an input with _three_ features. Then comes a series of neuron layers $\mathbf{h}_1, ..., \mathbf{h}_K$. 
- Each layer consists of several artificial neurons, where the $i$-th neuron of the $j$-th layer is defined as:
\begin{equation}
h_j^{i} = \sigma_j(\mathbf{w}_j^i \cdot \mathbf{h}_{j-1} + b_j^i)
\end{equation}
- $\mathbf{w}_j^i$ and $b_j^i$ are the parameters (weights and bias) of this particular neuron. 
- $\mathbf{h}_{j-1}$ is the vector of values calculated by the neurons of the previous layer (with $\mathbf{h}_0$ being the input layer) and $\sigma_j$ is the activation function of the $j$-th layer. 
- Note that each neuron "sees" _all_ the neurons from the previous layer.
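
The neuron definition above can be sketched directly in code. This is a minimal illustration, assuming NumPy and ReLU as the activation; the helper name `layer_neuron_by_neuron` and all the numbers are made up for this example.

```python
import numpy as np

def relu(z):
    # Example activation; any element-wise function could be used instead.
    return np.maximum(z, 0.0)

def layer_neuron_by_neuron(weights, biases, h_prev, activation):
    """Compute one layer by evaluating each neuron separately.

    weights[i] is the weight vector and biases[i] the bias of the i-th
    neuron; h_prev holds the values of the previous layer.
    """
    outputs = []
    for w_i, b_i in zip(weights, biases):
        # Each neuron sees all the values from the previous layer.
        outputs.append(activation(np.dot(w_i, h_prev) + b_i))
    return np.array(outputs)

# Made-up numbers: an input with 3 features and a layer with 2 neurons.
x = np.array([1.0, 2.0, 3.0])
weights = [np.array([0.5, -1.0, 0.25]), np.array([-0.5, 0.5, 0.5])]
biases = [1.0, -3.0]

h1 = layer_neuron_by_neuron(weights, biases, x, relu)
print(h1)  # first neuron gives 0.25, second is clipped to 0.0 by ReLU
```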


### Output
- The final layer $\mathbf{h}_K$ serves as the output of the model $\mathbf{\hat{y}}$. 
- Note that we can have multiple output neurons, i.e. we can predict vectors of values, not only scalars. All the layers between the input and output layers are considered _hidden_ layers.

- The prediction $\mathbf{\hat{y}}$ from the output layer is compared with the expected value $\mathbf{y}$ via a loss function, in the very same way we used these two terms in the previous lab, e.g. by using _mean squared error_. 
- This loss function can be minimized by the _stochastic gradient descent_ (SGD) algorithm. 
- The SGD algorithm for training neural networks is identical to the one we used to train linear regression.


### Defining MLP with Vectors and Matrices

Simplify the calculations with matrix operations:

\begin{equation}
\mathbf{z}_j = \mathbf{W}_j \mathbf{h}_{j-1} + \mathbf{b}_j
\end{equation}

\begin{equation}
\mathbf{h}_j = \sigma_j(\mathbf{z}_j)
\end{equation}

- Mathematically, this is the same as calculating the neurons one by one. 
- Note that $i$-th row of $\mathbf{W}_j$ and $i$-th member of $\mathbf{b}_j$ are in fact the parameters for $i$-th neuron:

\begin{equation}
\mathbf{W}_j = \begin{bmatrix}\mathbf{w}_j^1 \\ \mathbf{w}_j^2 \\ \vdots \\ \mathbf{w}_j^M \end{bmatrix}
\end{equation}

- We introduced the $\mathbf{z}$ quantity because it will come in handy later. 
- It is sometimes called neuron pre-activation value.
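
To see that the matrix form gives the same result as evaluating the neurons one by one, here is a small sketch. It assumes NumPy, a logistic activation, and randomly generated parameters; `dense_forward` is a hypothetical helper name, not something defined elsewhere in these notes.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dense_forward(W, b, h_prev, activation):
    """Return the pre-activation z and the activation h of one layer."""
    z = W @ h_prev + b
    return z, activation(z)

# Random parameters for a layer with 4 neurons and 3 inputs (illustration only).
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))   # row i holds the weights of neuron i
b = rng.normal(size=4)        # element i holds the bias of neuron i
h_prev = rng.normal(size=3)

z, h = dense_forward(W, b, h_prev, sigmoid)

# The same layer computed one neuron at a time.
h_loop = np.array([sigmoid(np.dot(W[i], h_prev) + b[i]) for i in range(4)])

print(np.allclose(h, h_loop))  # True: both computations agree
```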


### Architecture Decisions

__1. How many layers should the network have?__  
For smaller datasets, one or two hidden layers are often enough. For bigger datasets or more difficult tasks the number of layers grows, and models can use tens of layers. Huge state-of-the-art image recognition systems have ~100 convolutional hidden layers.

__2. How many neurons should be in the layers?__  
You can start with ~100 neurons for smaller datasets. The biggest current models can use several thousands of neurons per layer. The number of neurons is usually the same for all the hidden layers.

For both number of layers and number of neurons you should check relevant literature to see how big the models are for comparable datasets. We will talk about how to correctly set parameters like these in the future. 

__3. What activation functions should be used for individual layers?__  
All the hidden layers usually use the same activation function. A good starting point is _ReLU_ or one of its variants, such as _Leaky ReLU_.

For the output layer the activation function is usually different. Here we need to pick a function that fits the task, e.g. when we do regression and want the results to lie between 0 and 1, we can use the logistic function. When we do regression over all real numbers, we can use a linear (identity) function instead.
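
The activation functions mentioned above can be written in a few lines each. This is a sketch assuming NumPy; the negative slope of `0.01` for Leaky ReLU is an assumed, commonly used value, not one prescribed by these notes.

```python
import numpy as np

def relu(z):
    # Passes positive values through, clips negative values to 0.
    return np.maximum(z, 0.0)

def leaky_relu(z, slope=0.01):
    # Like ReLU, but negative values are scaled by a small slope (assumed 0.01).
    return np.where(z > 0, z, slope * z)

def logistic(z):
    # Maps any real number into the (0, 1) range.
    return 1.0 / (1.0 + np.exp(-z))

def identity(z):
    # Linear output, suitable for regression over all real numbers.
    return z

z = np.array([-2.0, 0.0, 3.0])
print(relu(z))        # negative input becomes 0
print(leaky_relu(z))  # negative input becomes -0.02
print(logistic(z))    # all values lie strictly between 0 and 1
print(identity(z))    # unchanged
```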

### Training MLP

MLP can be trained with stochastic gradient descent. The general outline of the SGD algorithm is exactly the same as with linear regression from last week's lab. The equations for the derivatives are:

\begin{equation}
\frac{dL}{d\mathbf{W}_n} = \frac{dL}{d\mathbf{z}_n} \mathbf{h}_{n-1}^T
\end{equation}

\begin{equation}
\frac{dL}{d\mathbf{b}_n} = \frac{dL}{d\mathbf{z}_n}
\end{equation}

Next we need to calculate $\frac{dL}{d\mathbf{z}_n}$. For all layers but the last, it is defined as:

\begin{equation}
\frac{dL}{d\mathbf{z}_n} = (\mathbf{W}_{n+1}^T\frac{dL}{d\mathbf{z}_{n+1}}) \odot \sigma_n'(\mathbf{z}_n)
\end{equation}

$\odot$ is the Hadamard (element-wise) product. Note that the last term uses not the activation function $\sigma$ but its derivative $\sigma'$, e.g. if the activation function were $x^2$, the derivative used here would be $2x$.

For the last $K$-th layer the equation is:

\begin{equation}
\frac{dL}{d\mathbf{z}_K} = \frac{dL}{d\mathbf{h}_K} \odot \sigma_K'(\mathbf{z}_K)
\end{equation}

The term $\frac{dL}{d\mathbf{h}_K}$ is calculated according to the definition of loss function.
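
The backward pass described by these equations can be sketched as a loop that runs from the last layer to the first. The code below is an illustration under several assumptions: NumPy, the logistic activation in every layer, and a mean squared error loss, so that $\frac{dL}{d\mathbf{h}_K} = \frac{2}{C}(\mathbf{h}_K - \mathbf{y})$ for $C$ outputs. The layer sizes and the function names are made up. A finite-difference check at the end compares one computed derivative with a numerical estimate.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def forward(params, x):
    """Return pre-activations z_1..z_K and activations h_0..h_K."""
    zs, hs = [], [x]
    for W, b in params:
        zs.append(W @ hs[-1] + b)
        hs.append(sigmoid(zs[-1]))
    return zs, hs

def loss(params, x, y):
    _, hs = forward(params, x)
    return np.mean((hs[-1] - y) ** 2)

def backward(params, x, y):
    """Return a list of (dL/dW_n, dL/db_n) pairs, one per layer."""
    zs, hs = forward(params, x)
    # Last layer: dL/dh_K from the MSE loss, then the Hadamard product.
    dz = 2.0 * (hs[-1] - y) / y.size * sigmoid_prime(zs[-1])
    grads = [None] * len(params)
    for n in reversed(range(len(params))):
        # hs[n] is the input of this layer, i.e. the previous layer's output.
        grads[n] = (np.outer(dz, hs[n]), dz)
        if n > 0:
            # Move to the previous layer: (W^T dz) multiplied by sigma'(z).
            dz = (params[n][0].T @ dz) * sigmoid_prime(zs[n - 1])
    return grads

# Made-up network: 3 inputs, hidden layers of 5 and 4 neurons, 2 outputs.
rng = np.random.default_rng(0)
sizes = [3, 5, 4, 2]
params = [(rng.normal(size=(m, n)), rng.normal(size=m))
          for n, m in zip(sizes[:-1], sizes[1:])]
x = rng.normal(size=3)
y = np.array([0.0, 1.0])

grads = backward(params, x, y)

# Finite-difference estimate for a single weight in the first layer.
eps = 1e-6
W_first = params[0][0]
W_first[1, 2] += eps
loss_plus = loss(params, x, y)
W_first[1, 2] -= 2 * eps
loss_minus = loss(params, x, y)
W_first[1, 2] += eps  # restore the original value
numeric = (loss_plus - loss_minus) / (2 * eps)

print(numeric, grads[0][0][1, 2])  # the two values should agree closely
```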

## Classification

With classification we aim to assign each sample to a class from a predefined finite set of $C$ classes. E.g. we might want to take measurements of _Iris_ flowers and classify them into one of three possible _Iris_ species.

The data look like this:

| Sepal length (cm) | Sepal width (cm) | Petal length (cm) | Petal width (cm) | Species code |
| --- | --- | --- | --- | --- |
| 5.0 | 3.3 | 1.4 | 0.2 | 0 |
| 7.0 | 3.2 | 4.7 | 1.4 | 1 |
| 5.7 | 2.8 | 4.1 | 1.3 | 1 |
| 6.3 | 3.3 | 6.0 | 2.5 | 2 |

The first four columns are measurements of the flowers:

- sepal length in cm
- sepal width in cm
- petal length in cm
- petal width in cm

The last column is a code for the _Iris_ species:

0. _Iris Setosa_
1. _Iris Versicolour_
2. _Iris Virginica_

<img src="images/iris.jpg" alt="Iris" width="500"/>
<center><small>License: Davefoc <a href="https://creativecommons.org/licenses/by-sa/4.0">CC BY-SA 4.0</a>, via Wikimedia Commons</small></center>

To solve a general classification problem we propose a simple MLP model with one hidden layer. The two layers are defined as follows:

\begin{equation}
\mathbf{z}_1 = \mathbf{W}_1\mathbf{x} + \mathbf{b}_1
\end{equation}

\begin{equation}
\mathbf{h} = \sigma_1(\mathbf{z}_1)
\end{equation}

\begin{equation}
\mathbf{z}_2 = \mathbf{W}_2\mathbf{h} + \mathbf{b}_2
\end{equation}

\begin{equation}
\mathbf{\hat{y}} = \sigma_2(\mathbf{z}_2)
\end{equation}

This just follows the general equation of MLP we showed above. The hidden layer $\mathbf{h}$ has a customizable size that can be set before the training. The output layer will have a size of 3, one output neuron for each _Iris_ class.

We will use the logistic function for both $\sigma_1$ and $\sigma_2$. In the hidden layer, it simply serves as a non-linear function. In the output layer, it squashes $\mathbf{z}_2$ into the $(0, 1)$ range, so we can interpret the outputs $\mathbf{\hat{y}}$ as probabilities. The $j$-th element of $\mathbf{\hat{y}}$ is the probability that the sample belongs to the $j$-th class. E.g. with $\mathbf{\hat{y}} = [0.1, 0.8, 0.3]$ the model gives 10% probability for the first class, 80% for the second class and 30% for the third class. Note that these do not add up to 100%, because this model calculates the probability for each class independently.
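
As a rough sketch of this model's forward pass, the code below runs one _Iris_ sample through a randomly initialized network. It assumes NumPy; the hidden size of 8 and the function name `predict` are hypothetical choices for illustration, and the weights are untrained, so the printed values carry no meaning beyond their range.

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(x, W1, b1, W2, b2):
    """Forward pass of the one-hidden-layer classifier."""
    h = logistic(W1 @ x + b1)      # hidden layer
    return logistic(W2 @ h + b2)   # output layer, one value per class

rng = np.random.default_rng(42)
hidden_size = 8  # hypothetical value; this is a setting chosen before training

W1 = rng.normal(size=(hidden_size, 4))  # 4 input features
b1 = np.zeros(hidden_size)
W2 = rng.normal(size=(3, hidden_size))  # 3 output neurons, one per species
b2 = np.zeros(3)

x = np.array([5.0, 3.3, 1.4, 0.2])  # first row of the table above
y_hat = predict(x, W1, b1, W2, b2)

print(y_hat)        # three values, each inside (0, 1)
print(y_hat.sum())  # in general not equal to 1
```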

The definition of logistic function $\sigma$ is as follows:

\begin{equation}
\sigma(x) = \frac{1}{1 + e^{-x}}
\end{equation}

We need to have a loss function that will compare the predicted probabilities with the true value - the true class of the sample. To do so we will apply _mean squared error_ loss function on the predicted values $\mathbf{\hat{y}}$ and the true value encoded with _one-hot encoding_. One-hot encoding of $j$-th class makes a vector of size $C$, where all components are $0$, except for the $j$-th component, which is $1$.

_Example:_ $\mathbf{\hat{y}}$ is a prediction that the model calculates. $\mathbf{y}$ is a true label we want to achieve. In this case the second element is $1$, so it encodes the second class -- _Iris Versicolour_.

\begin{equation}
\mathbf{\hat{y}} = \begin{bmatrix}0.1 \\ 0.8 \\ 0.3\end{bmatrix}\ \ \ \mathbf{y} = \begin{bmatrix}0 \\ 1 \\ 0\end{bmatrix}
\end{equation}

The loss function for $i$-th sample is then defined as:

\begin{equation}
L^{(i)} = \frac{\sum_{j=1}^C{(\hat{y}_j^{(i)} - y_j^{(i)})^2}}{C}
\end{equation}

The overall loss function is defined as the average of the loss functions of all the samples, as before.
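
The one-hot encoding and the per-sample loss from the example above can be checked with a few lines of code. This sketch assumes NumPy and uses the 0-based species codes from the data table, so code `1` stands for the second class; the helper names are made up.

```python
import numpy as np

def one_hot(label, num_classes):
    """Vector of zeros with a single 1 at the position of the class code."""
    y = np.zeros(num_classes)
    y[label] = 1.0
    return y

def mse_loss(y_hat, y):
    """Per-sample mean squared error over the C output components."""
    return np.sum((y_hat - y) ** 2) / y.size

y_hat = np.array([0.1, 0.8, 0.3])
y = one_hot(1, 3)  # species code 1 (Iris Versicolour) -> [0, 1, 0]

sample_loss = mse_loss(y_hat, y)
print(y)            # [0. 1. 0.]
print(sample_loss)  # (0.01 + 0.04 + 0.09) / 3, roughly 0.0467
```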

__Practical Note:__ Logistic function and MSE are not usually used as activation function and loss function for classification problems. We have better options, but we will use these two in this lab because of their simplicity. In practice, we would use _softmax_ activation and _cross-entropy_ loss function. More information about these two is in the _Further Reading_ section.

### Training classifier

We need to calculate the derivatives for SGD. Following the general equations from earlier we can define the derivatives of parameters $\theta = \{ \mathbf{W}_1, \mathbf{b}_1, \mathbf{W}_2, \mathbf{b}_2 \}$ as:

\begin{equation}
\frac{dL}{d\mathbf{z}_2} = \frac{2}{C}(\mathbf{\hat{y}} - \mathbf{y}) \odot \sigma'(\mathbf{z}_2)
\end{equation}

\begin{equation}
\frac{dL}{d\mathbf{W}_2} = \frac{dL}{d\mathbf{z}_2} \mathbf{h}^T
\end{equation}

\begin{equation}
\frac{dL}{d\mathbf{b}_2} = \frac{dL}{d\mathbf{z}_2}
\end{equation}

These are the derivatives for the second-layer parameters. The derivative of the logistic activation function is $\sigma'(x) = \sigma(x)(1 - \sigma(x))$. For the first layer, we just follow the general equations from before:

\begin{equation}
\frac{dL}{d\mathbf{z}_1} = (\mathbf{W}_2^T\frac{dL}{d\mathbf{z}_2}) \odot \sigma'(\mathbf{z}_1)
\end{equation}

\begin{equation}
\frac{dL}{d\mathbf{W}_1} = \frac{dL}{d\mathbf{z}_1} \mathbf{x}^T
\end{equation}

\begin{equation}
\frac{dL}{d\mathbf{b}_1} = \frac{dL}{d\mathbf{z}_1}
\end{equation}

For all these parameter matrices and vectors the SGD update rule is the same as before:

\begin{equation}
\theta_{i+1} = \theta_i - \alpha \frac{dL}{d\theta_i}
\end{equation}
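
Putting the derivatives and the update rule together, one SGD step for this classifier might look like the sketch below. It assumes NumPy and trains on only the four rows shown in the data table, so it is a toy illustration rather than a real experiment. The factor `2 / C` comes from differentiating the per-sample MSE defined above. The hidden size, the learning rate `alpha = 0.1`, the initialization scale and the number of epochs are all made-up values.

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_step(x, y, params, alpha):
    """One SGD update on a single sample; the arrays in params change in place."""
    W1, b1, W2, b2 = params
    # Forward pass.
    h = logistic(W1 @ x + b1)
    y_hat = logistic(W2 @ h + b2)
    # Backward pass, using sigma'(z) = sigma(z) * (1 - sigma(z)).
    dz2 = 2.0 * (y_hat - y) / y.size * y_hat * (1.0 - y_hat)
    dz1 = (W2.T @ dz2) * h * (1.0 - h)
    # Parameter updates: theta <- theta - alpha * dL/dtheta.
    W2 -= alpha * np.outer(dz2, h)
    b2 -= alpha * dz2
    W1 -= alpha * np.outer(dz1, x)
    b1 -= alpha * dz1
    # Loss of this sample, measured before the update.
    return np.mean((y_hat - y) ** 2)

# The four samples from the data table and their one-hot encoded labels.
X = np.array([[5.0, 3.3, 1.4, 0.2],
              [7.0, 3.2, 4.7, 1.4],
              [5.7, 2.8, 4.1, 1.3],
              [6.3, 3.3, 6.0, 2.5]])
labels = np.array([0, 1, 1, 2])
Y = np.eye(3)[labels]

# Random initialization with small weights (scale is an assumed value).
rng = np.random.default_rng(0)
hidden_size = 8
params = [rng.normal(scale=0.1, size=(hidden_size, 4)), np.zeros(hidden_size),
          rng.normal(scale=0.1, size=(3, hidden_size)), np.zeros(3)]

epoch_losses = []
for epoch in range(200):
    losses = [sgd_step(x, y, params, alpha=0.1) for x, y in zip(X, Y)]
    epoch_losses.append(np.mean(losses))

print(epoch_losses[0], epoch_losses[-1])  # the average loss should go down
```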

## Evaluating models

Minimizing the loss function is how we train neural networks. However, it has two drawbacks for other practical uses:

1. The loss function describes how well we fit the data we show the model. But we want to know how well the model works on data it has not seen before. E.g. a machine translation system that can only translate the sentences it has seen before is not very useful.

To solve this issue we split the data we have into two __non-overlapping__ sets: a _training_ set and a _testing_ set. The training set is used to train the model, i.e. to directly minimize the loss function. The testing set is then used to evaluate how well the model works on data it has _not seen_ before.
  

2. The loss function is a rather abstract quantity. It is hard to tell how well a model with loss $0.2$ performs, or whether a model with loss $0.19$ is significantly better.

To solve this issue, we should use a more straightforward metric for evaluation. For example, for classification we can use _accuracy_ instead. Accuracy tells us the percentage of samples we classify correctly, i.e. for how many samples the highest prediction value belongs to the correct class.
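
Both ideas, the non-overlapping split and the accuracy metric, can be sketched in a few lines. The code assumes NumPy; `split_train_test` and `accuracy` are hypothetical helper names, the test fraction of 25% is an arbitrary choice, and the prediction values are made up to show the calculation.

```python
import numpy as np

def split_train_test(X, y, test_fraction=0.25, seed=0):
    """Shuffle the samples and split them into two non-overlapping sets."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    n_test = int(round(len(X) * test_fraction))
    test_idx, train_idx = order[:n_test], order[n_test:]
    return X[train_idx], y[train_idx], X[test_idx], y[test_idx]

def accuracy(y_hat, labels):
    """Fraction of samples whose highest prediction is the correct class.

    y_hat has shape (n_samples, C); labels holds 0-based class codes.
    """
    predicted = np.argmax(y_hat, axis=1)
    return np.mean(predicted == labels)

# Toy data: 8 samples with 2 features each; y doubles as a sample id here.
X = np.arange(16.0).reshape(8, 2)
y = np.arange(8)
X_train, y_train, X_test, y_test = split_train_test(X, y)
print(len(X_train), len(X_test))  # 6 2

# Made-up predictions for four samples with true classes 0, 1, 1, 2.
y_hat = np.array([[0.9, 0.2, 0.1],
                  [0.1, 0.8, 0.3],
                  [0.3, 0.4, 0.6],   # highest value at class 2, true class is 1
                  [0.2, 0.1, 0.7]])
labels = np.array([0, 1, 1, 2])
acc = accuracy(y_hat, labels)
print(acc)  # 0.75, three of four samples are classified correctly
```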