## Multi-Layer Perceptrons

Multi-Layer Perceptrons (MLPs), a.k.a. Multi-Layer Feed-Forward Neural Networks, are obtained by stacking multiple hidden layers. Each hidden layer can contain a different number of hidden units. In general, an MLP has $L$ hidden layers with $H_{1}, H_{2}, \ldots, H_{L} $ hidden units, i.e. the $\ell$-th hidden layer contains $H_{\ell}$ hidden units. For convenience, we also treat the network inputs as a layer $ 0 $ with $ H_{0} = D $ units. Lastly, the $ L$-th layer can directly produce the network outputs, in which case $ H_{L} = O $. Alternatively, the $ H_{L} $ values at the last layer can be combined _somehow_ to produce the final network estimate $ \hat{\bf y} = f({\bf x}) $, in which case $ H_{L} $ is not necessarily equal to $ O $.

The figures below illustrate MLPs with different numbers of hidden layers. {numref}`mlp1_fig1` shows an MLP with a single hidden layer, {numref}`mlp1_fig2` an MLP with two hidden layers and {numref}`mlp1_fig3` an MLP with three hidden layers. Each layer $\ell$ can use a distinct activation function $\phi_{\ell}$. In all three figures, the values from the last hidden layer are combined by a single linear unit -- an affine transform represented by the symbol $ \sum $ -- to produce the final network estimate $ \hat{y} = f({\bf x}) $ ($ O = 1 $). However, as we will see later, other ways of combining them are also possible.

:::{figure} /images/neuralnets/multi_layer_MLP_01.png
---
height: 320px
name: mlp1_fig1
align: left
---
1-layer MLP ($L=1$).
:::

:::{figure} /images/neuralnets/multi_layer_MLP_05.png
---
height: 320px
name: mlp1_fig2
align: left
---
2-layer MLP ($L=2$).
:::

:::{figure} /images/neuralnets/multi_layer_MLP_07.png
---
height: 320px
name: mlp1_fig3
align: left
---
3-layer MLP ($L=3$).
:::

Let $ w^{\ell}_{i,j} $ denote the artificial synaptic weight from the $i$-th unit in the previous layer $ \ell-1 $ to the $j$-th unit in the current layer $ \ell$, and let the vector $ {\bf w}^{\ell}_{j} = \begin{bmatrix} w^{\ell}_{0,j} & w^{\ell}_{1,j} & \ldots & w^{\ell}_{H_{\ell-1},j} \end{bmatrix}^{T}$ collect all weights of the $j$-th unit at layer $\ell$. The weight $ w^{\ell}_{0,j} $ acts as the bias of the unit, since it multiplies a constant input equal to $1$. Additionally, for convenience, let $ {\bf h}^{0} = \begin{bmatrix} 1 & x_{1} & x_{2} & \ldots & x_{D} \end{bmatrix}^{T} $ with $ H_{0} = D $ collect the observed features and let $ {\bf h}^{\ell} = \begin{bmatrix} 1 & h^{\ell}_{1} & h^{\ell}_{2} & \ldots & h^{\ell}_{H_{\ell}} \end{bmatrix}^{T} $ collect the embedded features for all $ \ell \in \lbrace 1, \ldots, L \rbrace $. Thus, we can write the activation value of the $j$-th unit of the current layer $ \ell $ using vector notation as
\begin{eqnarray}
a^{\ell}_{j} &=& \sum_{i=0}^{H_{\ell-1}} w^{\ell}_{i,j} \, h^{\ell-1}_{i} \nonumber \\
&=& ({\bf w}_{j}^{\ell})^{T} {\bf h}^{\ell-1}. \nonumber
\end{eqnarray}
Finally, the output value of this neuron is computed as $$ h^{\ell}_{j} = \phi_{\ell}\left( a^{\ell}_{j} \right). $$
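
The computation of a single unit can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function name `unit_output`, the sigmoid activation, and all numerical values are assumptions chosen for the example.

```python
import numpy as np

def sigmoid(a):
    """Logistic sigmoid, used here as an example choice for phi."""
    return 1.0 / (1.0 + np.exp(-a))

def unit_output(w_j, h_prev, phi):
    """Return (a_j, h_j) for one unit: a_j = w_j^T h_prev, h_j = phi(a_j)."""
    a_j = np.dot(w_j, h_prev)
    return a_j, phi(a_j)

# Hypothetical numbers: D = 2 inputs, so h^0 has 3 entries (leading constant 1).
x = np.array([0.5, -1.0])
h0 = np.concatenate(([1.0], x))      # h^0 = [1, x_1, x_2]
w_j = np.array([0.1, 0.4, -0.2])     # [w_{0,j} (bias), w_{1,j}, w_{2,j}]

a_j, h_j = unit_output(w_j, h0, sigmoid)
print(a_j, h_j)
```

Here the first entry of `w_j` multiplies the constant $1$ in `h0`, so no separate bias term is needed.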

Now, let the matrix $ {\bf W}^{\ell} = \left[  w^{\ell}_{i,j} \right] $ collect the weights of all units at the $\ell$-th layer such that its $j$-th column stores the weights associated with the $j$-th unit at layer $\ell$. Equivalently, we can write $ {\bf W}^{\ell} = \begin{bmatrix} {\bf w}^{\ell}_{1} & {\bf w}^{\ell}_{2} & \ldots & {\bf w}^{\ell}_{H_{\ell}} \end{bmatrix} $. Furthermore, let the vector $ {\bf a}^{\ell} = \begin{bmatrix} a^{\ell}_{1} & a^{\ell}_{2} & \ldots & a^{\ell}_{H_{\ell}} \end{bmatrix}^{T} $ collect all activations $a^{\ell}_{j}$ of layer $\ell$. Then, we can conveniently write ${\bf a}^{\ell}$ and $ {\bf h}^{\ell} $ as matrix-vector products
\begin{eqnarray}
{\bf a}^{\ell} &=& ({\bf W}^{\ell})^{T} \, {\bf h}^{\ell-1} \nonumber \\
{\bf h}^{\ell} &=& \phi_{\ell}({\bf a}^{\ell}), \nonumber
\end{eqnarray}
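in which the non-linearity $ \phi_{\ell} $ is applied element-wise to the vector $ {\bf a}^{\ell} $. Strictly speaking, the constant entry $1$ is then prepended to the result, so that $ {\bf h}^{\ell} $ matches its definition above and the next layer can absorb its bias into $ {\bf W}^{\ell+1} $; this step is left implicit in the notation.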
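
As a sanity check, the matrix form of a layer can be compared against the unit-by-unit computation. The sketch below assumes a ReLU activation and random weights; the helper name `layer_forward` and the layer sizes are hypothetical.

```python
import numpy as np

def relu(a):
    """Element-wise ReLU, used here as an example choice for phi."""
    return np.maximum(a, 0.0)

def layer_forward(W, h_prev, phi):
    """Compute a = W^T h_prev and h = [1, phi(a)] for one layer.

    W has shape (H_prev + 1, H): column j holds the weights of unit j,
    and row 0 holds the biases. h_prev has a leading constant 1.
    """
    a = W.T @ h_prev
    h = np.concatenate(([1.0], phi(a)))  # prepend the constant 1
    return a, h

rng = np.random.default_rng(0)
H_prev, H = 3, 4                          # hypothetical layer sizes
W = rng.normal(size=(H_prev + 1, H))
h_prev = np.concatenate(([1.0], rng.normal(size=H_prev)))

a, h = layer_forward(W, h_prev, relu)

# Same activations computed one unit at a time, as in the scalar formula.
a_loop = np.array([W[:, j] @ h_prev for j in range(H)])
print(np.allclose(a, a_loop))
```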

Therefore, the $\ell$-th layer output can be computed from the output of the previous layer $ \ell -1 $ as
:::{math}
:label: mlp_recursion
\begin{eqnarray}
{\bf h}^{\ell} &=& \phi_{\ell} \left( ({\bf W}^{\ell})^{T} \, {\bf h}^{\ell-1} \right)
\end{eqnarray}
:::
and the last layer output can be computed by recursively applying {eq}`mlp_recursion`, starting with $ {\bf h}^{0} $ -- the vector storing the network inputs in $ {\bf x} $ -- until we obtain $ {\bf h}^{L} $ -- the vector collecting the values produced by the units at the last layer. Finally, the last layer output can be further _combined_ to produce the network output $ \hat{\bf y} = f({\bf x}) $.
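
The recursion translates directly into a loop over the layers. The following is a minimal sketch under stated assumptions: the names `add_bias` and `forward`, the layer sizes, the `tanh` activations, and the random weights are all illustrative choices.

```python
import numpy as np

def add_bias(v):
    """Prepend the constant 1 so the next weight matrix can absorb the bias."""
    return np.concatenate(([1.0], v))

def forward(x, weights, activations):
    """Apply h^l = phi_l((W^l)^T h^{l-1}) for l = 1..L, starting from h^0 = [1, x].

    weights[l-1] has shape (H_{l-1} + 1, H_l). Returns the H_L values
    produced by the units of the last layer (without the constant 1).
    """
    h = add_bias(x)
    for W, phi in zip(weights, activations):
        h = add_bias(phi(W.T @ h))
    return h[1:]

rng = np.random.default_rng(0)
sizes = [3, 5, 4, 2]                      # hypothetical [D, H_1, H_2, H_3]
weights = [rng.normal(size=(sizes[l - 1] + 1, sizes[l]))
           for l in range(1, len(sizes))]
activations = [np.tanh] * len(weights)    # one phi per layer; they may differ

x = rng.normal(size=sizes[0])
h_L = forward(x, weights, activations)
print(h_L.shape)
```

The returned vector still has to be combined into the network output, for instance by the linear output unit shown in the figures.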

The computation performed by the 3-layer network illustrated in {numref}`mlp1_fig3` can be written in a single line as $$ f({\bf x}) = \underbrace{({\bf W}^{4})^{T}
\phi_3 \Big(
\underbrace{({\bf W}^{3})^{T} \phi_2 \Big(
\underbrace{({\bf W}^{2})^{T} 
\phi_1 \Big(
\underbrace{({\bf W}^{1})^{T} {\bf h}^{0}}_{\mathbf{a}^{1}} 
\Big)}_{\mathbf{a}^{2}} \Big)}_{\mathbf{a}^{3}} \Big)}_{\hat{y}} $$ in which $ \mathbf{W}^{4} $ is an $(H_L + 1) \times O$ matrix, following the convention above that each $ {\bf h}^{\ell} $ carries a leading constant $1$ (left implicit in the formula), and $ {\bf h}^{0} \triangleq \begin{bmatrix} 1 & {\bf x}^{T} \end{bmatrix}^{T} $ with $ {\bf x} = \begin{bmatrix} x_1 & x_2 & \ldots & x_D \end{bmatrix}^{T} $. In this example, the single linear output unit -- the node with symbol $ \sum $ in {numref}`mlp1_fig3` -- uses an affine transform ($ \mathbf{W}^{4} $) to combine the $H_L$ values at the last hidden layer into a single output $ \hat{y} $, i.e. $O=1$. For multiple outputs, we can design $ \mathbf{W}^{4} $ such that $O > 1$. The MLP is thus a nested sequence of matrix-vector multiplications, each followed by an element-wise non-linearity.
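
The nested expression can be written almost verbatim in NumPy. In this sketch the helper `aug` makes explicit the prepending of the constant $1$ that the compact formula leaves implicit; the layer sizes, the activation choices, and the random weights are assumptions made only for illustration.

```python
import numpy as np

def aug(v):
    """Prepend the constant 1 (bias input) to a vector."""
    return np.concatenate(([1.0], v))

def relu(a):
    return np.maximum(a, 0.0)

rng = np.random.default_rng(1)
D, H1, H2, H3, O = 4, 6, 5, 3, 1          # hypothetical sizes, single output

W1 = rng.normal(size=(D + 1, H1))
W2 = rng.normal(size=(H1 + 1, H2))
W3 = rng.normal(size=(H2 + 1, H3))
W4 = rng.normal(size=(H3 + 1, O))         # linear output unit, no activation

phi1, phi2, phi3 = np.tanh, relu, np.tanh  # each layer may use its own phi

x = rng.normal(size=D)

# f(x) as one nested expression, mirroring the formula from the inside out.
y_hat = W4.T @ aug(phi3(W3.T @ aug(phi2(W2.T @ aug(phi1(W1.T @ aug(x)))))))
print(y_hat.shape)
```

Setting `O` to a value larger than 1 only changes the number of columns of `W4`; the rest of the expression stays the same.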

:::{prf:definition} Expressive Efficiency phenomenon
As stated by Pinkus {cite}`pinkus1999approximation` (see the {doc}`Function approximator <./neuralnets_func_approx>` page), neural networks with a single hidden layer can approximate arbitrarily well any continuous function $f^{\ast} \colon \mathbb{R}^{D} \rightarrow \mathbb{R}^{O}$ for a wide range of non-linearities $\phi$. So why do we need multiple layers?

The motivation to use multiple layers is related to a phenomenon called _Expressive Efficiency_:
* Despite being universal approximators, shallow networks with a single hidden layer might require a size -- number of hidden layer units -- that is exponential in the number of inputs to approximate some function classes.
* On the other hand, the number of neurons required by a deep network with several hidden layers to represent the same functions grows only polynomially in the number of network inputs.

Thus, deep networks can represent non-linear functions more efficiently than shallow ones. A classical example is the _parity function_: in the Boolean-circuit setting, shallow (constant-depth) circuits of AND/OR gates need exponentially many gates to compute it, whereas deeper circuits need only polynomially many.

The intuition behind the _Expressive Efficiency_ phenomenon is as follows. Typical real-world tasks, e.g. _predict apple/pear from an image_, are associated with highly non-linear functions of the type $f^{\ast} \colon \mathbb{R}^{D} \rightarrow \mathbb{R}^{O} $. However, the common activation functions $\phi \colon \mathbb{R} \rightarrow \mathbb{R} $ employed by the network units, e.g. sigmoid and ReLU, are simple non-linearities. A single-layer (shallow) network would therefore require a prohibitive number of units to express these real-world non-linearities, i.e. to approximate $f^{\ast} $. On the other hand, by cascading multiple non-linear layers, a deep network gains flexibility to express non-linearities and is therefore able to provide a reasonable approximation to $f^{\ast} $ using fewer neurons.
:::
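
The deep side of this trade-off can be illustrated with parity on $D$ binary inputs. The sketch below is an assumption-laden toy construction, not a result taken from the references: it chains $D-1$ small blocks, each computing the XOR of two bits with two ReLU units via $\mathrm{relu}(s) - 2\,\mathrm{relu}(s-1)$ with $s = a + b$. Each block also reads one raw input bit, so this is not a strictly layered MLP; carrying the remaining bits through the layers would add pass-through units but keep the size polynomial in $D$.

```python
import numpy as np
from itertools import product

def relu(a):
    return np.maximum(a, 0.0)

def xor_block(a, b):
    """XOR of two bits in {0, 1} using two ReLU hidden units.

    With s = a + b: s=0 -> 0, s=1 -> 1, s=2 -> 2 - 2*1 = 0.
    """
    s = a + b
    return relu(s) - 2.0 * relu(s - 1.0)

def deep_parity(bits):
    """Parity of a bit vector via a chain of D - 1 XOR blocks (depth grows with D)."""
    p = bits[0]
    for b in bits[1:]:
        p = xor_block(p, b)
    return p

D = 6                                  # hypothetical number of inputs
n_hidden_units = 2 * (D - 1)           # two ReLU units per block: linear in D

# Exhaustive check against the definition of parity over all 2^D inputs.
ok = all(deep_parity(np.array(bits, dtype=float)) == sum(bits) % 2
         for bits in product([0, 1], repeat=D))
print(n_hidden_units, ok)
```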

:::{observation}
The limitations on the expressive power of shallow networks with polynomial size have been known for a long time. However, deep networks have a traditional downside: they are much harder to train than shallow ones. The 1990s and early 2000s are known as the _neural networks winter_ due to the lack of progress in training neural networks with more than a few layers. Some researchers even claimed at that time that neural networks with more than two layers could not be trained at all. Fortunately, deep learning has gained huge attention since 2006, as some _tricks of the trade_ allowed researchers to train networks with several hidden layers. Nowadays, neural networks can be scaled to hundreds or even thousands of layers. Moreover, modern deep networks can express highly non-linear functions and perform complex real-world tasks, e.g. classifying objects in images with super-human performance, i.e. achieving better results than a trained human performing the same task.
:::
