# Layers of MLP

From a mathematical point of view, an MLP is a smooth function $F$ constructed as a composition of simpler functions:

$$
F(\boldsymbol x) = (f_{L} \circ f_{L-1} \circ\ldots \circ f_2 \circ f_1)(\boldsymbol x),\quad
\boldsymbol x \in \mathbb R^{n_0}
$$

Each function 

$$
    f_\ell \colon \mathbb R^{n_{\ell - 1}} \to \mathbb R^{n_\ell}
$$

is called a **layer**; it converts the representation of the $(\ell-1)$-th layer 

$$
    \boldsymbol x_{\ell -1} \in \mathbb R^{n_{\ell - 1}} 
$$

to the representation of the $\ell$-th layer 

$$
   \boldsymbol x_{\ell} \in \mathbb R^{n_{\ell}}.
$$

Thus, the **input layer** $\boldsymbol x_0 \in \mathbb R^{n_0}$ is converted to the **output layer** $\boldsymbol x_L \in \mathbb R^{n_L}$. All other layers $\boldsymbol x_\ell$, $1\leqslant \ell < L$, are called **hidden layers**.
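
The composition can be sketched in a few lines of NumPy. The functions `f1`, `f2`, `f3` and the dimensions below are arbitrary illustrative choices, not the layers of any particular MLP; the point is only that the output dimension of each layer must match the input dimension of the next one.

```python
import numpy as np

# Hypothetical layers with n_0 = 3, n_1 = 4, n_2 = 2, n_3 = 1.
def f1(x):  # R^3 -> R^4
    return np.tanh(np.concatenate([x, [x.sum()]]))

def f2(x):  # R^4 -> R^2
    return np.array([x[:2].sum(), x[2:].sum()])

def f3(x):  # R^2 -> R^1
    return np.array([x[0] * x[1]])

layers = [f1, f2, f3]

def F(x):
    """Apply f_1, then f_2, ..., then f_L, keeping every representation x_l."""
    representations = [x]  # x_0 is the input layer
    for f in layers:
        representations.append(f(representations[-1]))
    return representations

xs = F(np.array([1.0, 2.0, 3.0]))
shapes = [x.shape for x in xs]
print(shapes)  # shapes of x_0 (input), x_1 and x_2 (hidden), x_3 (output)
```

Here `xs[0]` plays the role of the input layer, `xs[-1]` of the output layer, and everything in between is hidden.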

```{figure} https://www.researchgate.net/publication/354817375/figure/fig2/AS:1071622807097344@1632506195651/Multi-layer-perceptron-MLP-NN-basic-Architecture.jpg
:align: center
```

```{warning}
The terminology about layers is a bit ambiguous. Both the functions $f_\ell$ and their outputs $\boldsymbol x_\ell = f_\ell(\boldsymbol x_{\ell - 1})$ are called the $\ell$-th layer in different sources.
```

## Parameters of MLP

However, one important element is missing from this description of an MLP: parameters! Each layer $f_\ell$ has a vector of parameters $\boldsymbol \theta_\ell\in\mathbb R^{m_\ell}$ (possibly empty, i.e. $m_\ell = 0$). Hence, a layer should be defined as

$$
    f_\ell \colon \mathbb R^{n_{\ell - 1}} \times \mathbb R^{m_\ell} \to \mathbb R^{n_\ell}.
$$

The representation $\boldsymbol x_\ell$ is calculated from $\boldsymbol x_{\ell -1}$ by the formula 

$$
\boldsymbol x_\ell = f_\ell(\boldsymbol x_{\ell - 1},\boldsymbol \theta_\ell)
$$

with some fixed $\boldsymbol \theta_\ell\in\mathbb R^{m_\ell}$. The whole MLP $F$ depends on the parameters of all layers:

$$
    F(\boldsymbol x, \boldsymbol \theta), \quad \boldsymbol \theta = (\boldsymbol \theta_1, \ldots, \boldsymbol \theta_L).
$$
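
A minimal sketch of this parametrized view, assuming two made-up layers: `scale_shift` with a parameter vector $\boldsymbol \theta_1 \in \mathbb R^2$ and `tanh_layer` whose parameter vector is empty. Both names and the numeric values are illustrative assumptions.

```python
import numpy as np

def scale_shift(x, theta):
    """f_1: theta = (a, c), computes a * x + c element-wise."""
    a, c = theta
    return a * x + c

def tanh_layer(x, theta):
    """f_2: a layer without parameters; theta is an empty vector."""
    return np.tanh(x)

layer_fns = [scale_shift, tanh_layer]

def F(x, thetas):
    """F(x, theta) with theta = (theta_1, ..., theta_L)."""
    for f, theta in zip(layer_fns, thetas):
        x = f(x, theta)
    return x

thetas = (np.array([2.0, -1.0]), np.array([]))
x0 = np.array([0.0, 0.5, 1.0])
out = F(x0, thetas)
num_params = sum(theta.size for theta in thetas)  # m_1 + m_2
print(out, num_params)
```

Changing any entry of `thetas` changes the function computed by `F`, which is exactly what training exploits.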

All these parameters are trained simultaneously by gradient-based optimization, with the gradients computed by the {ref}`backpropagation method <backprop>`.


## Dense layer

In the diagram above, the edges between two consecutive layers denote a **linear** (or **dense**) layer:

$$
    \boldsymbol x_\ell^{\mathsf T} = f(\boldsymbol x_{\ell - 1}; \boldsymbol W, \boldsymbol b) = \boldsymbol x_{\ell - 1}^{\mathsf T} \boldsymbol W + \boldsymbol b.
$$

The matrix $\boldsymbol W \in \mathbb R^{n_{\ell - 1}\times n_\ell}$ and the vector $\boldsymbol b \in \mathbb R^{n_\ell}$ (the bias) are the parameters of the linear layer, which defines an affine transformation from $\boldsymbol x_{\ell - 1}$ to $\boldsymbol x_{\ell}$ (linear in the strict sense when $\boldsymbol b = \boldsymbol 0$).
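
To make the shapes concrete, here is a NumPy sketch of the formula above. The sizes `n_prev = 3` and `n_next = 4` and the random values are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_prev, n_next = 3, 4                    # n_{l-1} and n_l (arbitrary)

W = rng.normal(size=(n_prev, n_next))    # weight matrix
b = rng.normal(size=n_next)              # bias vector
x_prev = rng.normal(size=n_prev)         # representation x_{l-1}

# Row-vector convention: x_l^T = x_{l-1}^T W + b
x_next = x_prev @ W + b
print(W.shape, b.shape, x_next.shape)

# The same layer applied to a batch of 5 inputs stacked as rows
X_prev = rng.normal(size=(5, n_prev))
X_next = X_prev @ W + b                  # b is broadcast over the rows
print(X_next.shape)
```

The row-vector convention is convenient precisely because a whole batch of inputs can be pushed through the layer with a single matrix product.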

**Q**. How many numeric parameters does such a linear layer have?

```{admonition} Exercise
:class: important

Suppose that we apply one more dense layer:

$$
    \boldsymbol x_{\ell + 1}^{\mathsf T} = \boldsymbol x_{\ell}^{\mathsf T} \boldsymbol W' + \boldsymbol b'
$$

Express $\boldsymbol x_{\ell + 1}$ as a function of $\boldsymbol x_{\ell - 1}$.
```

### Linear layer in PyTorch



```python
import torch

x = torch.ones(3)
x
```

```
tensor([1., 1., 1.])
```

Weights (they are initialized randomly, so the values differ between runs):

```python
linear_layer = torch.nn.Linear(3, 4, bias=False)
linear_layer.weight
```

```
Parameter containing:
tensor([[-0.4484,  0.5759, -0.3938],
        [-0.5506, -0.1603,  0.3134],
        [-0.2858, -0.0493, -0.0959],
        [-0.0627,  0.3831,  0.4740]], requires_grad=True)
```

Apply the linear transformation:

```python
linear_layer(x)
```

```
tensor([-0.2664, -0.3975, -0.4310,  0.7944], grad_fn=<SqueezeBackward4>)
```
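
Note that `torch.nn.Linear` stores its `weight` with shape `(out_features, in_features)`, that is, as the transpose of the matrix $\boldsymbol W$ from the dense layer formula. The following sketch checks this relation; it creates a fresh layer, so only the shape and the identity are checked, not specific numbers.

```python
import torch

linear_layer = torch.nn.Linear(3, 4, bias=False)
x = torch.ones(3)

W = linear_layer.weight.T            # shape (3, 4), i.e. n_{l-1} x n_l
manual = x @ W                       # x^T W from the formula
same = torch.allclose(linear_layer(x), manual, atol=1e-6)

print(tuple(linear_layer.weight.shape), tuple(W.shape))
print(same)
```

Since `x` consists of ones, each output component is simply the sum of the corresponding row of `linear_layer.weight`.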

## Activation layer

In this layer a nonlinear **activation function** $\psi$ is applied element-wise to its input:

$$
    \psi(\boldsymbol x^{\mathsf T}) = \psi\big((x_1, \ldots, x_n)\big) = \big(\psi(x_1), \ldots, \psi(x_n)\big)  = \boldsymbol z^{\mathsf T}
$$

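In the original work by Rosenblatt the activation function was the step function $\psi(t) = [t > 0]$. However, this function is discontinuous at $t = 0$ and has zero derivative everywhere else, which makes it unsuitable for gradient-based training. That is why modern neural networks use continuous, almost everywhere differentiable alternatives.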
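
The element-wise action of an activation function can be sketched as follows. The step function is the one mentioned above; sigmoid and ReLU are picked here as two common alternatives and are an assumption of this example, since the text does not single out specific functions.

```python
import numpy as np

def step(t):
    """Rosenblatt's activation [t > 0], applied element-wise."""
    return (t > 0).astype(float)

def sigmoid(t):
    """Smooth transition from 0 to 1."""
    return 1.0 / (1.0 + np.exp(-t))

def relu(t):
    """Continuous and piecewise linear."""
    return np.maximum(t, 0.0)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(step(x))
print(sigmoid(x))
print(relu(x))
```

All three functions map a vector to a vector of the same length, so an activation layer never changes the dimension of the representation.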

Sometimes linear and activation layers are combined into a single layer. Then each MLP layer looks like

$$
    \boldsymbol x_i^{\mathsf T} = \psi_i(\boldsymbol x_{i-1}^{\mathsf T} \boldsymbol W_{i} + \boldsymbol b_{i})
$$

where

* $\boldsymbol W_{i}$ is a matrix of shape $n_{i-1}\times n_i$
* $\boldsymbol x_i, \boldsymbol b_i \in \mathbb R^{n_i}$ and $\boldsymbol x_{i-1} \in \mathbb R^{n_{i-1}}$
* $\psi_i(t)$ is an activation function which acts element-wise
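
In PyTorch, a stack of such combined layers can be sketched with `torch.nn.Sequential`. The layer sizes (3, 4, 2) and the choice of `Tanh` as the activation are arbitrary assumptions for illustration. The manual loop recomputes the output directly from the formula, using the fact that `torch.nn.Linear` keeps $\boldsymbol W_i^{\mathsf T}$ in its `weight` attribute.

```python
import torch

torch.manual_seed(0)

mlp = torch.nn.Sequential(
    torch.nn.Linear(3, 4), torch.nn.Tanh(),   # x_1 = psi_1(x_0^T W_1 + b_1)
    torch.nn.Linear(4, 2), torch.nn.Tanh(),   # x_2 = psi_2(x_1^T W_2 + b_2)
)

x0 = torch.ones(3)
out = mlp(x0)

# Recompute the forward pass directly from the formula
x = x0
for module in mlp:
    if isinstance(module, torch.nn.Linear):
        x = x @ module.weight.T + module.bias
    else:
        x = module(x)

print(tuple(out.shape))
print(torch.allclose(out, x, atol=1e-6))
```

Keeping the linear and activation parts as separate modules, as `Sequential` does here, and treating each pair as one layer are two views of the same computation.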