# Multilayer Perceptrons

Notation: The superscript $[i]$ represents the $i^{th}$ observation, while $(i)$ represents the $i^{th}$ layer.

Assume a fully connected neural network, i.e., a multilayer perceptron (MLP), with one hidden layer.

Training dataset: $\mathcal{D}_{train}$

Size of data: $N$ 

No. of input features: $d$

No. of outputs: $q$

Input vector: $\mathbf{x} \in \mathbb{R}^{d \times 1}$

No. of hidden units: $h$

Hidden-layer weights: $\mathbf{W}^{(1)} \in \mathbb{R}^{h \times d}$

Hidden-layer biases: $\mathbf{b}^{(1)} \in \mathbb{R}^{h \times 1}$

Output of hidden units: $\mathbf{h} \in \mathbb{R}^{h \times 1}$

Output-layer weights: $\mathbf{W}^{(2)} \in \mathbb{R}^{q \times h}$

Output-layer biases: $\mathbf{b}^{(2)} \in \mathbb{R}^{q \times 1}$

Nonlinear activation function: $\sigma$

Output: $\mathbf{o} \in \mathbb{R}^{q \times 1}$

Thus the mathematical representation of our model is


$$\mathbf{h} = \sigma(\mathbf{W}^{(1)} \mathbf{x} + \mathbf{b}^{(1)})$$

$$\mathbf{o} = \mathbf{W}^{(2)} \mathbf{h} + \mathbf{b}^{(2)}$$

## Concrete Mathematical Representation


$N = 4898$, $d = 11$, $h = 22$, $q = 1$

`x` $\leftarrow$ $\mathbf{x} \in \mathbb{R}^{11 \times 1}$

`W1` $\leftarrow$ $\mathbf{W}^{(1)} \in \mathbb{R}^{22 \times 11}$

`b1` $\leftarrow$ $\mathbf{b}^{(1)} \in \mathbb{R}^{22 \times 1}$

`H` $\leftarrow$ $\mathbf{h} \in \mathbb{R}^{22 \times 1}$

`W2` $\leftarrow$ $\mathbf{w}^{(2)T} \in \mathbb{R}^{1 \times 22}$

`b2` $\leftarrow$ $b^{(2)} \in \mathbb{R}^{1}$

ReLU activation: $\sigma(x) = \max(x, 0)$

`y_pred` $\leftarrow$ $o \in \mathbb{R}^1$


### Model's Equations

$$\mathbf{h} = \sigma(\mathbf{W}^{(1)} \mathbf{x} + \mathbf{b}^{(1)})$$

$$o = \mathbf{w}^{(2)T} \mathbf{h} + b^{(2)}$$
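The two equations above can be sketched in NumPy as follows. This is a minimal illustration, not a reference implementation: the variable names `x`, `W1`, `b1`, `H`, `W2`, `b2` and `y_pred` follow the mapping given earlier, while the helper names `relu` and `forward`, the random seed, the initialisation scheme and the input values are assumptions made only for this example.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility

d, h = 11, 22  # number of input features and hidden units

# Randomly initialised parameters (the initialisation scheme is an assumption)
W1 = rng.normal(scale=0.1, size=(h, d))  # W^(1), shape (22, 11)
b1 = np.zeros((h, 1))                    # b^(1), shape (22, 1)
W2 = rng.normal(scale=0.1, size=(1, h))  # w^(2)T, shape (1, 22)
b2 = 0.0                                 # b^(2), scalar


def relu(z):
    """Elementwise ReLU: max(z, 0)."""
    return np.maximum(z, 0.0)


def forward(x, W1, b1, W2, b2):
    """Forward pass for one observation x of shape (d, 1)."""
    z = W1 @ x + b1                 # pre-activation, shape (h, 1)
    H = relu(z)                     # hidden-layer output, shape (h, 1)
    y_pred = (W2 @ H).item() + b2   # scalar output o
    return y_pred, H, z


x = rng.normal(size=(d, 1))  # a made-up observation
y_pred, H, z = forward(x, W1, b1, W2, b2)
print(H.shape, y_pred)
```

The pre-activation `z` is returned alongside `H` because the gradient of the hidden layer needs it.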


### Loss function and Empirical Risk Function

Let the loss for a single observation be $L = \ell(o, y)$.

`loss_fn` $\leftarrow$ $\ell(o, y) = \frac{1}{2} (y - o)^2$

The empirical risk is $J = \frac{1}{N} \sum_{i = 1}^{N} \ell(o^{[i]}, y^{[i]})$
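As a small sketch of these two definitions, the squared-error loss and its average over a data set could be written as below. The name `loss_fn` comes from the mapping above; `empirical_risk` and the three prediction/target pairs are made up for illustration.

```python
import numpy as np


def loss_fn(o, y):
    """Squared-error loss for one observation: 0.5 * (y - o)^2."""
    return 0.5 * (y - o) ** 2


def empirical_risk(o_all, y_all):
    """Mean of the per-observation losses over the data set."""
    o_all = np.asarray(o_all, dtype=float)
    y_all = np.asarray(y_all, dtype=float)
    return float(np.mean(loss_fn(o_all, y_all)))


# Three made-up predictions and targets
o_all = [5.0, 6.5, 4.0]
y_all = [6.0, 6.5, 7.0]

print(loss_fn(5.0, 6.0))             # 0.5
print(empirical_risk(o_all, y_all))  # (0.5 + 0.0 + 4.5) / 3
```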


### Gradients


#### Empirical Risk

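For a single observation the empirical risk reduces to the loss, $J = L$, so the gradients below are derived per observation (averaging over $N$ observations scales each per-observation gradient by $\frac{1}{N}$):

$$\frac{\partial J}{\partial L} = 1$$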

#### Loss

$$\frac{\partial L}{\partial o} = -(y - o)$$
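
Spelling out this step for the squared-error loss $\frac{1}{2} (y - o)^2$, using only the chain rule:

$$
\frac{\partial L}{\partial o}
= \frac{\partial}{\partial o} \left[ \frac{1}{2} (y - o)^2 \right]
= \frac{1}{2} \cdot 2 (y - o) \cdot \frac{\partial (y - o)}{\partial o}
= -(y - o)
$$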

#### Hidden to Output Layer

##### Weights $\frac{\partial J}{\partial \mathbf{w}^{(2)}}$

$$
\frac{\partial J}{\partial \mathbf{w}^{(2)}}
=
\frac{\partial J}{\partial L}
\times
\frac{\partial L}{\partial o}
\times
\frac{\partial o}{\partial \mathbf{w}^{(2)}}
$$

$$
\frac{\partial J}{\partial \mathbf{w}^{(2)}} 
= -(y-o)
\times
\frac{\partial (\mathbf{w}^{(2)T} \mathbf{h})}{\partial \mathbf{w}^{(2)}}
$$


$$
\frac{\partial J}{\partial \mathbf{w}^{(2)}} 
= -(y-o) 
\mathbf{h}
$$

##### Biases $\frac{\partial J}{\partial b^{(2)}}$

$$
\frac{\partial J}{\partial b^{(2)}}
=
\frac{\partial J}{\partial L}
\times
\frac{\partial L}{\partial o}
\times
\frac{\partial o}{\partial b^{(2)}}
$$

$$\frac{\partial J}{\partial b^{(2)}} = -(y-o)$$
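
One way to gain confidence in the output-layer gradients is to compare them with central finite differences. The sketch below assumes a single observation (so $J = L$), stores $\mathbf{w}^{(2)}$ as a column vector `w2` (the transpose of `W2`), and uses made-up values for the hidden output `H`, the target `y`, and the parameters. The helper name `risk` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)  # fixed seed, for reproducibility only
h = 22

H = np.maximum(rng.normal(size=(h, 1)), 0.0)  # stand-in hidden output (made up)
w2 = rng.normal(size=(h, 1))                  # w^(2) as a column vector
b2 = 0.3                                      # made-up bias
y = 6.0                                       # made-up target


def risk(w2, b2):
    """J = L for a single observation, as a function of the output layer."""
    o = (w2.T @ H).item() + b2
    return 0.5 * (y - o) ** 2


# Analytic gradients from the formulas above
o = (w2.T @ H).item() + b2
grad_w2 = -(y - o) * H  # shape (h, 1)
grad_b2 = -(y - o)      # scalar

# Central finite differences, one coordinate at a time
eps = 1e-6
num_w2 = np.zeros_like(w2)
for j in range(h):
    step = np.zeros_like(w2)
    step[j, 0] = eps
    num_w2[j, 0] = (risk(w2 + step, b2) - risk(w2 - step, b2)) / (2 * eps)
num_b2 = (risk(w2, b2 + eps) - risk(w2, b2 - eps)) / (2 * eps)

print(np.allclose(grad_w2, num_w2, atol=1e-6))
print(abs(grad_b2 - num_b2) < 1e-6)
```

Because the risk is quadratic in `w2` and `b2`, central differences agree with the analytic gradient up to floating-point rounding.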



#### Input to Hidden Layer


##### Weights $\frac{\partial J}{\partial \mathbf{W}^{(1)}}$

Let $\mathbf{z}$ be the intermediate variable to which the activation $\sigma$ is applied elementwise, i.e. $\mathbf{z} = \mathbf{W}^{(1)} \mathbf{x} + \mathbf{b}^{(1)}$, so that $\mathbf{h} = \sigma(\mathbf{z})$.

$$
\frac{\partial J}{\partial \mathbf{W}^{(1)}}
=
\frac{\partial J}{\partial L}
\times
\frac{\partial L}{\partial o}
\times
\frac{\partial o}{\partial \mathbf{h}}
\times
\frac{\partial \mathbf{h}}{\partial \mathbf{z}}
\times
\frac{\partial (\mathbf{W}^{(1)} \mathbf{x})}{\partial \mathbf{W}^{(1)}}
$$

$$
\frac{\partial J}{\partial \mathbf{W}^{(1)}}
=
\left[
\left[
- (y - o)
\mathbf{w}^{(2)}
\right]
\odot
\sigma^{\prime}(\mathbf{z})
\right]
\mathbf{x}^{T}
$$
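
The factors of the chain rule can be written out one at a time to see how the shapes combine. The symbol $\boldsymbol{\delta}$ is introduced here only as shorthand, and taking the ReLU derivative at $z_j = 0$ to be $0$ is a common convention rather than something fixed by the text above.

$$
\frac{\partial o}{\partial \mathbf{h}} = \mathbf{w}^{(2)} \in \mathbb{R}^{22 \times 1},
\qquad
\frac{\partial h_j}{\partial z_j} = \sigma^{\prime}(z_j) =
\begin{cases}
1 & z_j > 0 \\
0 & z_j \le 0
\end{cases},
\qquad
\frac{\partial z_j}{\partial W^{(1)}_{jk}} = x_k
$$

$$
\boldsymbol{\delta}
=
\left[ -(y - o) \mathbf{w}^{(2)} \right] \odot \sigma^{\prime}(\mathbf{z})
\in \mathbb{R}^{22 \times 1},
\qquad
\frac{\partial J}{\partial W^{(1)}_{jk}} = \delta_j x_k
\;\Longrightarrow\;
\frac{\partial J}{\partial \mathbf{W}^{(1)}} = \boldsymbol{\delta} \mathbf{x}^{T} \in \mathbb{R}^{22 \times 11}
$$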


##### Biases $\frac{\partial J}{\partial \mathbf{b}^{(1)}}$

$$
\frac{\partial J}{\partial \mathbf{b}^{(1)}}
=
\left[
- (y - o)
\mathbf{w}^{(2)}
\right]
\odot
\sigma^{\prime}(\mathbf{z})
$$
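
The hidden-layer gradients can be checked numerically in the same spirit. The sketch below assumes a single observation (so $J = L$), ReLU activation with derivative $0$ at $z = 0$, and made-up values for the data and parameters. The helper names `risk` and `numerical_grad` are hypothetical, and `w2` holds $\mathbf{w}^{(2)}$ as a column vector.

```python
import numpy as np

rng = np.random.default_rng(2)  # fixed seed, for reproducibility only
d, h = 11, 22

x = rng.normal(size=(d, 1))   # made-up observation
y = 6.0                       # made-up target
W1 = rng.normal(size=(h, d))  # W^(1)
b1 = rng.normal(size=(h, 1))  # b^(1)
w2 = rng.normal(size=(h, 1))  # w^(2) as a column vector
b2 = 0.3                      # b^(2)


def risk(W1, b1):
    """J = L for a single observation, as a function of the hidden layer."""
    z = W1 @ x + b1
    H = np.maximum(z, 0.0)
    o = (w2.T @ H).item() + b2
    return 0.5 * (y - o) ** 2


# Analytic gradients from the formulas above
z = W1 @ x + b1
H = np.maximum(z, 0.0)
o = (w2.T @ H).item() + b2
relu_grad = (z > 0).astype(float)      # sigma'(z) for ReLU, shape (h, 1)
grad_b1 = (-(y - o) * w2) * relu_grad  # shape (h, 1)
grad_W1 = grad_b1 @ x.T                # shape (h, d)


def numerical_grad(f, P, eps=1e-6):
    """Central finite-difference gradient of scalar f with respect to array P."""
    G = np.zeros_like(P)
    for idx in np.ndindex(P.shape):
        step = np.zeros_like(P)
        step[idx] = eps
        G[idx] = (f(P + step) - f(P - step)) / (2 * eps)
    return G


num_W1 = numerical_grad(lambda P: risk(P, b1), W1)
num_b1 = numerical_grad(lambda P: risk(W1, P), b1)

print(np.allclose(grad_W1, num_W1, atol=1e-5))
print(np.allclose(grad_b1, num_b1, atol=1e-5))
```

Note that `grad_W1` is just `grad_b1` multiplied by `x.T`, which mirrors the two formulas: the bias gradient is the weight gradient without the trailing $\mathbf{x}^{T}$. A finite-difference check of this kind can be unreliable if some entry of `z` lies within about `eps` of zero, because ReLU is not differentiable there.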