<a href="https://colab.research.google.com/github/mdainur/kbtu-ml-book/blob/mlp-layers/MLP_team.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Layers of MLP

From a mathematical point of view, an MLP is a smooth function $F$ constructed as a composition of other functions:

$$
F(\boldsymbol x) = (f_{L} \circ f_{L-1} \circ\ldots \circ f_2 \circ f_1)(\boldsymbol x),\quad
\boldsymbol x \in \mathbb R^{n_0}
$$

Each function

$$
    f_\ell \colon \mathbb R^{n_{\ell - 1}} \to \mathbb R^{n_\ell}
$$

is called a **layer**; it converts the representation of the $(\ell-1)$-th layer

$$
    \boldsymbol x_{\ell -1} \in \mathbb R^{n_{\ell - 1}}
$$

to the representation of the $\ell$-th layer

$$
   \boldsymbol x_{\ell} \in \mathbb R^{n_{\ell}}.
$$

Thus, the **input layer** $\boldsymbol x_0 \in \mathbb R^{n_0}$ is converted to the **output layer** $\boldsymbol x_L \in \mathbb R^{n_L}$. All other layers $\boldsymbol x_\ell$, $1\leqslant \ell < L$, are called **hidden layers**. If an MLP has two or more hidden layers, it is called a deep neural network.
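
The composition of layers can be sketched in plain Python. The helper `compose` and the two toy layers `f1` and `f2` are hypothetical names chosen for illustration only; real layers would carry parameters and activations.

```python
from functools import reduce

def compose(*layers):
    """Return F = f_L ∘ ... ∘ f_1 for layers given in the order f_1, ..., f_L."""
    return lambda x: reduce(lambda value, layer: layer(value), layers, x)

# Two toy layers (hypothetical): f1 maps R^2 -> R^3, f2 maps R^3 -> R^1
f1 = lambda x: [x[0] + x[1], x[0] - x[1], 2.0 * x[0]]
f2 = lambda x: [sum(x)]

F = compose(f1, f2)

x0 = [1.0, 2.0]  # input layer, n_0 = 2
x1 = f1(x0)      # hidden layer, n_1 = 3
x2 = F(x0)       # output layer, n_2 = 1

print(x1, x2)    # [3.0, -1.0, 2.0] [4.0]
```

Here $L = 2$, so the network has exactly one hidden layer, $\boldsymbol x_1$.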


```{figure} https://www.researchgate.net/publication/354817375/figure/fig2/AS:1071622807097344@1632506195651/Multi-layer-perceptron-MLP-NN-basic-Architecture.jpg
:align: center
```

```{warning}
The terminology about layers is a bit ambiguous. Both the functions $f_\ell$ and their outputs $\boldsymbol x_\ell = f_\ell(\boldsymbol x_{\ell - 1})$ are called the $\ell$-th layer in different sources.
```

## Parameters of MLP

However, one important element is missing from this description of an MLP: parameters! Each layer $f_\ell$ has a vector of parameters $\boldsymbol \theta_\ell\in\mathbb R^{m_\ell}$ (possibly empty). Hence, a layer should be defined as

$$
    f_\ell \colon \mathbb R^{n_{\ell - 1}} \times \mathbb R^{m_\ell} \to \mathbb R^{n_\ell}.
$$

The representation $\boldsymbol x_\ell$ is calculated from $\boldsymbol x_{\ell -1}$ by the formula

$$
\boldsymbol x_\ell = f_\ell(\boldsymbol x_{\ell - 1},\boldsymbol \theta_\ell)
$$

with some fixed $\boldsymbol \theta_\ell\in\mathbb R^{m_\ell}$. The whole MLP $F$ depends on the parameters of all its layers:

$$
    F(\boldsymbol x, \boldsymbol \theta), \quad \boldsymbol \theta = (\boldsymbol \theta_1, \ldots, \boldsymbol \theta_L).
$$

All these parameters are trained simultaneously by the {ref}`backpropagation method <backprop>`.
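
As a rough illustration, the per-layer parameter vectors $\boldsymbol\theta_\ell$ can be inspected in PyTorch, using modules that appear later on this page. The layer sizes 4, 8 and 3 are arbitrary assumptions, and treating the activation as a separate layer with an empty parameter vector is one possible convention.

```python
import torch

# A small MLP; the sizes n_0 = 4, 8 and 3 are chosen arbitrarily
mlp = torch.nn.Sequential(
    torch.nn.Linear(4, 8),  # dense layer: has trainable parameters
    torch.nn.Sigmoid(),     # activation layer: empty parameter vector
    torch.nn.Linear(8, 3),  # dense layer: has trainable parameters
)

# Number of scalar parameters held by each layer
sizes = [sum(p.numel() for p in layer.parameters()) for layer in mlp]

# Size of the full parameter vector theta of the whole MLP
total = sum(p.numel() for p in mlp.parameters())

print(sizes, total)
```

The total is simply the sum of the per-layer counts, and the activation layer contributes zero.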


## Dense layer

In the figure above, the edges between two consecutive layers denote a **linear** (or **dense**) layer:

$$
    \boldsymbol x_\ell = f(\boldsymbol x_{\ell - 1}; \boldsymbol W, \boldsymbol b) = \boldsymbol {Wx}_{\ell - 1} + \boldsymbol b.
$$

The matrix $\boldsymbol W \in \mathbb R^{n_\ell\times n_{\ell - 1}}$ and the vector $\boldsymbol b \in \mathbb R^{n_\ell}$ (bias) are the parameters of the linear layer, which defines the linear transformation from $\boldsymbol x_{\ell - 1}$ to $\boldsymbol x_{\ell}$.

**Q**. How many numeric parameters does such a linear layer have?

```{admonition} Exercise
:class: important

Suppose that we apply one more dense layer:

$$
    \boldsymbol x_{\ell + 1} = \boldsymbol {W'x}_{\ell} + \boldsymbol{b'}
$$

Express $\boldsymbol x_{\ell + 1}$ as a function of $\boldsymbol x_{\ell - 1}$.
```

### Linear layer in PyTorch



In [None]:
import torch

x = torch.randn(5)
x

tensor([-0.0762,  0.6222, -1.1178,  0.9935,  2.0089])

In [None]:
linear_layer = torch.nn.Linear(5, 6)
linear_layer.bias

Parameter containing:
tensor([-0.4328,  0.1590, -0.0809,  0.3756, -0.3382,  0.4353],
       requires_grad=True)
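
As a possible continuation of the cells above, the layer can be applied to an input vector and compared with the formula $\boldsymbol{Wx} + \boldsymbol b$ written out by hand. This is a minimal sketch; the seed is an arbitrary choice. Note that `torch.nn.Linear` stores its weight with shape `(out_features, in_features)`.

```python
import torch

torch.manual_seed(0)  # arbitrary seed, only for reproducibility

x = torch.randn(5)                    # x_{l-1}, with n_{l-1} = 5
linear_layer = torch.nn.Linear(5, 6)  # dense layer with n_l = 6

W, b = linear_layer.weight, linear_layer.bias

y = linear_layer(x)   # x_l computed by the layer
y_manual = W @ x + b  # the same formula written out by hand

print(W.shape, b.shape, y.shape)
print(torch.allclose(y, y_manual, atol=1e-6))
```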

## Activation layer

The perceptron computes a single output from multiple real-valued inputs by forming a linear combination according to its input weights and then possibly putting the output through some nonlinear **activation function**. Mathematically this can be written as


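$$
    y = \boldsymbol\psi \Big(\sum\limits_{i=1}^n w_i x_i + b \Big) = \boldsymbol\psi( \boldsymbol w^\top \boldsymbol x + b),
$$

where $\boldsymbol w$ is the vector of input weights, $n$ is the number of inputs, $b$ is the bias, and $\boldsymbol\psi$ is the activation function.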
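
The formula for a single perceptron can be sketched as follows. The function name `perceptron` and the numeric values of the inputs, weights and bias are hypothetical; the sigmoid is used here only as an example of an activation function.

```python
import torch

def perceptron(x, w, b, psi):
    """Compute y = psi(w^T x + b) for a single neuron."""
    return psi(torch.dot(w, x) + b)

# Hypothetical inputs, weights and bias
x = torch.tensor([1.0, -2.0, 0.5])
w = torch.tensor([0.5, 0.25, 2.0])
b = torch.tensor(-1.0)

# w^T x + b = 0.5 - 0.5 + 1.0 - 1.0 = 0, and sigmoid(0) = 0.5
y = perceptron(x, w, b, torch.sigmoid)
print(y)  # tensor(0.5000)
```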

```{warning}
Note that different layers may have different activation functions.
```

Rosenblatt's original perceptron used the Heaviside step function
$$
    \mathbb H(x) = \begin{cases}
        1,& \text{if }  x \geqslant 0, \\
        0,& \text{if }  x < 0.
    \end{cases}
$$
as the activation function $\boldsymbol\psi$. An output of 1 means that the neuron is activated, and an output of 0 means that it is not. If there is more than one layer, a value of 1 is passed on as input to the next layer, whereas a value of 0 is effectively ignored: it contributes nothing to the weighted sums computed by the next layer.
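
A minimal sketch of the Heaviside step function applied elementwise to a tensor, following the definition above (the value at $x = 0$ is 1). The helper name `heaviside` and the sample values are assumptions made for illustration.

```python
import torch

def heaviside(x):
    """Heaviside step: 1 where x >= 0, and 0 where x < 0."""
    return (x >= 0).to(x.dtype)

z = torch.tensor([-2.0, -0.1, 0.0, 0.3, 5.0])
h = heaviside(z)
print(h)  # tensor([0., 0., 1., 1., 1.])
```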

----- a diagram of the Heaviside step function will go here



Nowadays, and especially in multilayer networks, the activation function is often chosen to be the logistic sigmoid
$$
\sigma(x) = \frac{1}{1 + e^{-x}}
$$

or the hyperbolic tangent

$$
\tanh(x)
$$

These functions are used because they are mathematically convenient: they are close to linear near the origin and saturate rather quickly away from it. This allows MLP networks to model both strongly and mildly nonlinear mappings well.
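
Both properties can be checked numerically. The sketch below compares each function with its first-order Taylor approximation at the origin, $\tanh(x) \approx x$ and $\sigma(x) \approx \frac 12 + \frac x4$, and then evaluates both functions at larger arguments. The sample points are arbitrary.

```python
import torch

x = torch.tensor([0.01, 0.1, 3.0, 10.0])

tanh_x = torch.tanh(x)
sigmoid_x = torch.sigmoid(x)

# Near the origin: compare with the first-order Taylor approximations
near = x[:2]
tanh_err = (torch.tanh(near) - near).abs().max()
sigmoid_err = (torch.sigmoid(near) - (0.5 + near / 4)).abs().max()
print(tanh_err, sigmoid_err)  # both errors are small

# Away from the origin: both functions approach their limit 1
print(tanh_x[-2:], sigmoid_x[-2:])
```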



**Q**. How are the logistic sigmoid and the hyperbolic tangent related?

```{admonition} Exercise
:class: important

They are related by
$$
\frac{\tanh(x)+1}2 = \frac{1}{1 + e^{-2x}} = \sigma(2x).
$$

```
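
This identity can be verified numerically on a grid of points. The grid below is an arbitrary choice, and double precision is used to keep rounding errors negligible.

```python
import torch

x = torch.linspace(-5.0, 5.0, steps=101, dtype=torch.float64)

lhs = (torch.tanh(x) + 1) / 2
rhs = torch.sigmoid(2 * x)  # equals 1 / (1 + exp(-2x))

print(torch.allclose(lhs, rhs))  # True
```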

We will explore activation functions in more detail in {ref}`Activation functions <activations>`.

### Sigmoid function in PyTorch

In [None]:
import torch

torch.manual_seed(1)

x = torch.randn((3, 3, 3))
x

tensor([[[-1.5256, -0.7502, -0.6540],
         [-1.6095, -0.1002, -0.6092],
         [-0.9798, -1.6091, -0.7121]],

        [[ 0.3037, -0.7773, -0.0954],
         [ 0.1394, -1.5785, -0.3206],
         [-0.2993,  1.8793, -0.0721]],

        [[ 0.1578,  1.7163, -0.0561],
         [ 0.9107, -1.3924,  2.6891],
         [-0.1110,  0.2927, -0.1578]]])

In [None]:
y = torch.sigmoid(x)

y.min(), y.max()

(tensor(0.1667), tensor(0.9364))