# Multilayer Perceptrons

Notation: The superscript $[i]$ represents the $i^{th}$ observation, while $(i)$ represents the $i^{th}$ layer.

Assume a fully connected neural network, i.e., a multilayer perceptron (MLP), with one hidden layer.

Training dataset: $\mathcal{D}_{train}$

Size of data: $N$ 

No. of input features: $d$

No. of outputs: $q$

Minibatch size: $n$  

Input matrix: $\mathbf{X} \in \mathbb{R}^{n \times d}$

No. of hidden units: $h$

Output of hidden units: $\mathbf{H} \in \mathbb{R}^{n \times h}$

Hidden-layer weights: $\mathbf{W}^{(1)} \in \mathbb{R}^{d \times h}$

Hidden-layer biases: $\mathbf{b}^{(1)} \in \mathbb{R}^{1 \times h}$

Output-layer weights: $\mathbf{W}^{(2)} \in \mathbb{R}^{h \times q}$

Output-layer biases: $\mathbf{b}^{(2)} \in \mathbb{R}^{1 \times q}$

Nonlinear activation function: $\sigma$

Output: $\mathbf{O} \in \mathbb{R}^{n \times q}$

Thus the mathematical representation of our model is


$$\mathbf{H} = \sigma(\mathbf{X W}^{(1)} + \mathbf{b}^{(1)})$$

$$\mathbf{O} = \mathbf{H} \mathbf{W}^{(2)} + \mathbf{b}^{(2)}$$

## Concrete Mathematical Representation


$N = 4898$, $n = 10$, $d = 11$, $h = 22$, $q = 1$

`X` $\leftarrow$ $\mathbf{X} \in \mathbb{R}^{10 \times 11}$

`H` $\leftarrow$ $\mathbf{H} \in \mathbb{R}^{10 \times 22}$

`W1` $\leftarrow$ $\mathbf{W}^{(1)} \in \mathbb{R}^{11 \times 22}$

`b1` $\leftarrow$ $\mathbf{b}^{(1)} \in \mathbb{R}^{1 \times 22}$

`W2` $\leftarrow$ $\mathbf{w}^{(2)} \in \mathbb{R}^{22 \times 1}$

`b2` $\leftarrow$ $b^{(2)} \in \mathbb{R}^{1}$

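ReLU activation: $\sigma(x) = \max(x, 0)$, applied elementwise

`y_pred` $\leftarrow$ $o \in \mathbb{R}^{10 \times 1}$, i.e. one scalar prediction $o^{[i]} \in \mathbb{R}^1$ per observation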


$$\mathbf{H} = \sigma(\mathbf{X W}^{(1)} + \mathbf{b}^{(1)})$$

$$o = \mathbf{H} \mathbf{w}^{(2)} + b^{(2)}$$
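The shapes listed above can be checked with a short NumPy sketch. The variable names `X`, `H`, `W1`, `b1`, `W2`, `b2` and `y_pred` come from the list above; the random data, the seed and the helper `relu` are assumptions made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # assumed seed, for reproducibility only

n, d, h = 10, 11, 22  # minibatch size, input features, hidden units

# Random stand-ins for a real minibatch and for initialised parameters.
X = rng.normal(size=(n, d))
W1 = rng.normal(size=(d, h))
b1 = np.zeros((1, h))
W2 = rng.normal(size=(h, 1))
b2 = np.zeros(1)


def relu(z):
    """Elementwise ReLU: max(z, 0)."""
    return np.maximum(z, 0)


H = relu(X @ W1 + b1)   # shape (10, 22); b1 is broadcast across the rows
y_pred = H @ W2 + b2    # shape (10, 1); one scalar prediction per observation

print(H.shape, y_pred.shape)
```

Note that the bias `b1` has a single row and is broadcast over all ten rows of the minibatch, which matches the $\mathbb{R}^{1 \times h}$ shape given in the notation.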

### Loss function and Empirical Risk Function

Let the loss for the $i^{th}$ observation be $L^{[i]} = l(o^{[i]}, y^{[i]})$.

`loss_fn` $\leftarrow$ $l(\hat{y}, y) = \frac{1}{2} (y - \hat{y})^2$

The empirical risk is the mean loss over the minibatch: $J = \frac{1}{n} \sum_{i = 1}^{n} l(o^{[i]}, y^{[i]})$
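As a small worked example, the squared-error loss and its minibatch average can be sketched in NumPy. The name `loss_fn` is taken from the text; `empirical_risk` and the four predictions and targets are hypothetical values chosen so the arithmetic is easy to follow.

```python
import numpy as np


def loss_fn(y_hat, y):
    """Per-observation squared-error loss: 0.5 * (y - y_hat)^2."""
    return 0.5 * (y - y_hat) ** 2


def empirical_risk(o, y):
    """Mean of the per-observation losses over the minibatch."""
    return np.mean(loss_fn(o, y))


# Hypothetical minibatch of n = 4 predictions and targets.
o = np.array([5.0, 6.0, 7.0, 4.0])
y = np.array([6.0, 6.0, 5.0, 4.0])

losses = loss_fn(o, y)     # 0.5, 0.0, 2.0, 0.0
J = empirical_risk(o, y)   # (0.5 + 0.0 + 2.0 + 0.0) / 4 = 0.625
```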



### Gradients


#### Empirical Risk

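The gradients below are derived for a single observation ($n = 1$), so that $J = L$. For a minibatch, the parameter gradients are the average of the per-observation gradients.

$$\frac{\partial J}{\partial L} = 1$$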

#### Loss

$$\frac{\partial L}{\partial o} = -(y - o)$$

#### Hidden to Output Layer

##### Weights $\frac{\partial J}{\partial \mathbf{w}^{(2)}}$

$$
\frac{\partial J}{\partial \mathbf{w}^{(2)}}
=
\frac{\partial J}{\partial L}
\times
\frac{\partial L}{\partial o}
\times
\frac{\partial o}{\partial \mathbf{w}^{(2)}}
$$

$$
\frac{\partial J}{\partial \mathbf{w}^{(2)}} 
= -(y-o)
\times
\frac{\partial (\mathbf{H} \mathbf{w}^{(2)} + b^{(2)})}{\partial \mathbf{w}^{(2)}}
$$


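For a single observation $\mathbf{H} \in \mathbb{R}^{1 \times h}$, so the transpose is taken to give the gradient the same shape as $\mathbf{w}^{(2)} \in \mathbb{R}^{h \times 1}$:

$$
\frac{\partial J}{\partial \mathbf{w}^{(2)}} 
= -(y-o)
\times
\mathbf{H}^{\top}
$$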

##### Biases $\frac{\partial J}{\partial b^{(2)}}$

$$
\frac{\partial J}{\partial b^{(2)}}
=
\frac{\partial J}{\partial L}
\times
\frac{\partial L}{\partial o}
\times
\frac{\partial o}{\partial b^{(2)}}
$$

Since $\frac{\partial o}{\partial b^{(2)}} = 1$,

$$\frac{\partial J}{\partial b^{(2)}} = -(y-o)$$
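The output-layer gradients can be sanity-checked against central finite differences. The sketch below assumes a single observation, random values for `H` and `w2`, and a hypothetical target `y`; the helper `risk` is introduced only for this check. `H` is transposed so that the gradient has the same shape as `w2`.

```python
import numpy as np

rng = np.random.default_rng(0)  # assumed seed

h = 22
H = rng.normal(size=(1, h))    # hidden activations for one observation
w2 = rng.normal(size=(h, 1))   # output-layer weights
b2 = 0.1                       # output-layer bias (scalar)
y = 6.0                        # hypothetical target


def risk(w2, b2):
    """J for a single observation: 0.5 * (y - o)^2."""
    o = (H @ w2).item() + b2
    return 0.5 * (y - o) ** 2


# Analytic gradients from the chain rule.
o = (H @ w2).item() + b2
grad_w2 = -(y - o) * H.T       # shape (h, 1), same as w2
grad_b2 = -(y - o)

# Central finite differences, one parameter at a time.
eps = 1e-6
num_w2 = np.zeros_like(w2)
for j in range(h):
    step = np.zeros_like(w2)
    step[j, 0] = eps
    num_w2[j, 0] = (risk(w2 + step, b2) - risk(w2 - step, b2)) / (2 * eps)
num_b2 = (risk(w2, b2 + eps) - risk(w2, b2 - eps)) / (2 * eps)

print(np.allclose(grad_w2, num_w2, atol=1e-5), abs(grad_b2 - num_b2) < 1e-5)
```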



#### Input to Hidden Layer


##### Weights $\frac{\partial J}{\partial \mathbf{W}^{(1)}}$

Let $\mathbf{z}$ be the intermediate variable to which the activation $\sigma$ is applied elementwise, i.e. $\mathbf{z} = \mathbf{X W}^{(1)} + \mathbf{b}^{(1)}$ and $\mathbf{H} = \sigma(\mathbf{z})$.

$$
\frac{\partial J}{\partial \mathbf{W}^{(1)}}
=
\frac{\partial J}{\partial L}
\times
\frac{\partial L}{\partial o}
\times
\frac{\partial o}{\partial \mathbf{H}}
\times
\frac{\partial \mathbf{H}}{\partial \mathbf{z}}
\times
\frac{\partial (\mathbf{X W}^{(1)} + \mathbf{b}^{(1)})}{\partial \mathbf{W}^{(1)}}
$$

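$$
\frac{\partial J}{\partial \mathbf{W}^{(1)}}
=
1
\times
- (y - o)
\times
\left(
\frac{\partial o}{\partial \mathbf{H}}
\odot
\sigma^{\prime}(\mathbf{z})
\right)
\times
\frac{\partial (\mathbf{X W}^{(1)})}{\partial \mathbf{W}^{(1)}}
$$

Since $\frac{\partial o}{\partial \mathbf{H}} = \mathbf{w}^{(2)\top}$, and differentiating $\mathbf{X W}^{(1)}$ with respect to $\mathbf{W}^{(1)}$ contributes a left multiplication by $\mathbf{X}^{\top}$, for a single observation this becomes

$$
\frac{\partial J}{\partial \mathbf{W}^{(1)}}
=
\mathbf{X}^{\top}
\left[
- (y - o)
\left(
\mathbf{w}^{(2)\top}
\odot
\sigma^{\prime}(\mathbf{z})
\right)
\right]
$$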


##### Biases $\frac{\partial J}{\partial \mathbf{b}^{(1)}}$

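$$
\frac{\partial J}{\partial \mathbf{b}^{(1)}}
=
1
\times
- (y - o)
\times
\left(
\frac{\partial o}{\partial \mathbf{H}}
\odot
\sigma^{\prime}(\mathbf{z})
\right)
\times
\frac{\partial (\mathbf{b}^{(1)})}{\partial \mathbf{b}^{(1)}}
$$

Since $\frac{\partial o}{\partial \mathbf{H}} = \mathbf{w}^{(2)\top}$ and the last factor is the identity, for a single observation this becomes

$$
\frac{\partial J}{\partial \mathbf{b}^{(1)}}
=
- (y - o)
\left(
\mathbf{w}^{(2)\top}
\odot
\sigma^{\prime}(\mathbf{z})
\right)
$$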
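The hidden-layer gradients can be checked the same way. The sketch below assumes a single observation `x`, the ReLU activation (whose derivative is 1 where $z > 0$ and 0 elsewhere), random parameter values and a hypothetical target `y`. The helpers `risk` and `numerical_grad` are not part of the original notes; they exist only to compare the chain-rule result with central finite differences.

```python
import numpy as np

rng = np.random.default_rng(0)  # assumed seed

d, h = 11, 22
x = rng.normal(size=(1, d))    # one observation
W1 = rng.normal(size=(d, h))
b1 = rng.normal(size=(1, h))
w2 = rng.normal(size=(h, 1))
b2 = 0.1
y = 6.0                        # hypothetical target


def risk(W1, b1):
    """J for a single observation as a function of the hidden-layer parameters."""
    z = x @ W1 + b1
    H = np.maximum(z, 0)
    o = (H @ w2).item() + b2
    return 0.5 * (y - o) ** 2


# Forward pass.
z = x @ W1 + b1
H = np.maximum(z, 0)
o = (H @ w2).item() + b2

# Backward pass: -(y - o) * (w2^T elementwise-times sigma'(z)).
delta = -(y - o) * w2.T * (z > 0)   # shape (1, h)
grad_W1 = x.T @ delta               # shape (d, h), same as W1
grad_b1 = delta                     # shape (1, h), same as b1


def numerical_grad(f, P, eps=1e-6):
    """Central finite-difference gradient of the scalar function f at P."""
    G = np.zeros_like(P)
    for idx in np.ndindex(*P.shape):
        step = np.zeros_like(P)
        step[idx] = eps
        G[idx] = (f(P + step) - f(P - step)) / (2 * eps)
    return G


num_W1 = numerical_grad(lambda P: risk(P, b1), W1)
num_b1 = numerical_grad(lambda P: risk(W1, P), b1)

print(np.allclose(grad_W1, num_W1, atol=1e-5),
      np.allclose(grad_b1, num_b1, atol=1e-5))
```

The finite-difference check is only meaningful away from $z = 0$, where ReLU is not differentiable; with random continuous inputs that case is not expected to occur.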