# Exercise 1: Manual Calculation of MLP Steps

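## 1) Forward pass

The network is a 2-2-1 MLP with tanh activations in both the hidden and the output layer. The input, parameters, and target are: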

$$
\mathbf{x}=\begin{bmatrix}0.5\\-0.2\end{bmatrix},\quad
\mathbf{W}^{(1)}=\begin{bmatrix}0.3&-0.1\\0.2&0.4\end{bmatrix},\quad
\mathbf{b}^{(1)}=\begin{bmatrix}0.1\\-0.2\end{bmatrix}
$$

$$
\mathbf{W}^{(2)}=\begin{bmatrix}0.5&-0.3\end{bmatrix},\quad
b^{(2)}=0.2,\quad y=1.0,\quad \hat y=\tanh(u^{(2)})
$$

### Hidden pre-activations

$$
\mathbf{z}^{(1)}=\mathbf{W}^{(1)}\mathbf{x}+\mathbf{b}^{(1)}
$$

Component-wise:

* $z^{(1)}_1 = 0.3(0.5)+(-0.1)(-0.2)+0.1 = 0.1500+0.0200+0.1000 = \boxed{0.270000}$
* $z^{(1)}_2 = 0.2(0.5)+0.4(-0.2)-0.2 = 0.1000-0.0800-0.2000 = \boxed{-0.180000}$

$$
\mathbf{z}^{(1)}=\begin{bmatrix}0.270000\\-0.180000\end{bmatrix}
$$
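
The affine map above can be reproduced with a few lines of plain Python. This is a minimal sketch: the helper name `affine` is hypothetical (it is not part of the exercise), and matrices are assumed to be stored as lists of rows.

```python
# Values taken from the exercise statement.
x = [0.5, -0.2]
W1 = [[0.3, -0.1],
      [0.2, 0.4]]
b1 = [0.1, -0.2]

def affine(W, v, b):
    """Return W @ v + b for a list-of-rows matrix W and list vectors v, b."""
    return [sum(w_ij * v_j for w_ij, v_j in zip(row, v)) + b_i
            for row, b_i in zip(W, b)]

z1 = affine(W1, x, b1)
print([round(z, 6) for z in z1])  # [0.27, -0.18]
```

Rounding is only used for display; the unrounded entries differ from the hand-computed values by floating-point noise.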

### Hidden activations (tanh)

$$
\mathbf{h}^{(1)}=\tanh(\mathbf{z}^{(1)})\Rightarrow
h^{(1)}_1=\tanh(0.27)=\boxed{0.2636248355},\;
h^{(1)}_2=\tanh(-0.18)=\boxed{-0.1780808681}
$$

$$
\mathbf{h}^{(1)}=\begin{bmatrix}0.2636248355\\-0.1780808681\end{bmatrix}
$$

### Output pre-activation

$$
u^{(2)}=\mathbf{W}^{(2)}\mathbf{h}^{(1)}+b^{(2)}
=0.5(0.2636248355)+(-0.3)(-0.1780808681)+0.2
$$

$$
=0.1318124178+0.0534242604+0.2000000000
=\boxed{0.3852366782}
$$

### Output

$$
\hat y=\tanh(u^{(2)})=\tanh(0.3852366782)=\boxed{0.3672465626}
$$
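
The whole forward pass can be checked numerically. The sketch below uses only the standard library; the function name `forward` and its return layout are illustrative choices, not something prescribed by the exercise.

```python
import math

def forward(x, W1, b1, W2, b2):
    """Forward pass of the 2-2-1 tanh MLP; returns (z1, h1, u2, y_hat)."""
    # Hidden pre-activations: one dot product per row of W1, plus bias.
    z1 = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(W1, b1)]
    # Hidden activations.
    h1 = [math.tanh(z) for z in z1]
    # Output pre-activation (W2 is a single row) and output.
    u2 = sum(w * h for w, h in zip(W2, h1)) + b2
    y_hat = math.tanh(u2)
    return z1, h1, u2, y_hat

z1, h1, u2, y_hat = forward(
    x=[0.5, -0.2],
    W1=[[0.3, -0.1], [0.2, 0.4]], b1=[0.1, -0.2],
    W2=[0.5, -0.3], b2=0.2,
)
print(f"h1    = [{h1[0]:.10f}, {h1[1]:.10f}]")
print(f"u2    = {u2:.10f}")
print(f"y_hat = {y_hat:.10f}")
```

The printed values should agree with the boxed results above to the digits shown, up to rounding in the last place.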

---

## 2) Loss (MSE, with $N=1$)

$$
L=\frac{1}{N}(y-\hat y)^2=(1-0.3672465626)^2
=\boxed{0.4003769125}
$$
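
As a quick check of the loss value, here is a small sketch of MSE for a general batch size $N$; the helper name `mse` is an assumption made for illustration. With one sample it reduces to the squared error used above.

```python
def mse(y_true, y_pred):
    """Mean squared error over paired lists of targets and predictions."""
    n = len(y_true)
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n

# Single-sample case from the exercise (N = 1), using the rounded y_hat.
loss = mse([1.0], [0.3672465626])
print(f"L = {loss:.10f}")
```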

---

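## 3) Backward pass

Both layers use tanh, so every chain-rule step below relies on its derivative, which can be evaluated from the activation's own output:
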
$$
\frac{d}{du}\tanh(u)=1-\tanh^2(u)
$$

### (a) $\displaystyle \frac{\partial L}{\partial \hat y}$

$$
L=(y-\hat y)^2 \Rightarrow \frac{\partial L}{\partial \hat y}=2(\hat y-y)
=2(0.3672465626-1.0)=\boxed{-1.2655068747}
$$

### (b) $\displaystyle \frac{\partial L}{\partial u^{(2)}}$

$$
\frac{\partial L}{\partial u^{(2)}}=\frac{\partial L}{\partial \hat y}\cdot\frac{\partial \hat y}{\partial u^{(2)}}=\frac{\partial L}{\partial \hat y}\cdot (1-\hat y^2)
$$

$$
1-\hat y^2=1-(0.3672465626)^2=\boxed{0.8651299622}
$$

$$
\Rightarrow \frac{\partial L}{\partial u^{(2)}}=(-1.2655068747)(0.8651299622)=\boxed{-1.0948279147}
$$
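
Steps (a) and (b) can be sketched in code as follows. The variable names (`dL_dyhat`, `dyhat_du2`, `delta2`) are illustrative; the point is that the tanh derivative reuses $\hat y$ from the forward pass rather than calling tanh again.

```python
# Forward-pass results from the sections above (rounded to 10 decimals).
y = 1.0
y_hat = 0.3672465626

# (a) Derivative of the squared error with respect to the prediction.
dL_dyhat = 2.0 * (y_hat - y)

# (b) tanh'(u2) expressed through the output itself: 1 - tanh(u2)^2 = 1 - y_hat^2.
dyhat_du2 = 1.0 - y_hat ** 2

# Chain rule: gradient with respect to the output pre-activation.
delta2 = dL_dyhat * dyhat_du2
print(f"dL/du2 = {delta2:.10f}")
```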

### (c) Output layer grads $\left(\mathbf{W}^{(2)}, b^{(2)}\right)$

$$
\frac{\partial L}{\partial \mathbf{W}^{(2)}}=\frac{\partial L}{\partial u^{(2)}}\;\mathbf{h}^{(1)\top}
\Rightarrow
\begin{bmatrix}
\frac{\partial L}{\partial W^{(2)}_1} & \frac{\partial L}{\partial W^{(2)}_2}
\end{bmatrix}
=\left(-1.0948279147\right)\begin{bmatrix}0.2636248355 & -0.1780808681\end{bmatrix}
$$

$$
\boxed{\frac{\partial L}{\partial \mathbf{W}^{(2)}}=
\begin{bmatrix}-0.2886238289 & 0.1949679055\end{bmatrix}}
$$

$$
\boxed{\frac{\partial L}{\partial b^{(2)}}=-1.0948279147}
$$
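
Because $u^{(2)}$ is a scalar, the output-layer gradients are just the scalar $\partial L/\partial u^{(2)}$ times each hidden activation (for the weights) and times 1 (for the bias). A minimal sketch, with illustrative variable names and the rounded values from above as inputs:

```python
delta2 = -1.0948279147                 # dL/du2 from step (b)
h1 = [0.2636248355, -0.1780808681]     # hidden activations from the forward pass

# dL/dW2 has the same shape as W2 (one row, two columns).
grad_W2 = [delta2 * h for h in h1]

# du2/db2 = 1, so the bias gradient equals delta2.
grad_b2 = delta2

print([f"{g:.10f}" for g in grad_W2], f"{grad_b2:.10f}")
```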

### (d) Backprop to hidden activations

$$
\frac{\partial L}{\partial \mathbf{h}^{(1)}}=\frac{\partial L}{\partial u^{(2)}}\;\mathbf{W}^{(2)\top}
=(-1.0948279147)\begin{bmatrix}0.5\\-0.3\end{bmatrix}
=\boxed{\begin{bmatrix}-0.5474139574\\0.3284483744\end{bmatrix}}
$$

### (e) Through hidden tanh

Let $\boldsymbol{\sigma}'(\mathbf{z}^{(1)})=1-\tanh^2(\mathbf{z}^{(1)})$, applied element-wise. Since $\tanh(\mathbf{z}^{(1)})=\mathbf{h}^{(1)}$, each entry equals $1-(h^{(1)}_i)^2$:

$$
1-\tanh^2(0.27)=\boxed{0.9305019461},\quad
1-\tanh^2(-0.18)=\boxed{0.9682872044}
$$

Then

$$
\frac{\partial L}{\partial \mathbf{z}^{(1)}}=
\frac{\partial L}{\partial \mathbf{h}^{(1)}}\odot \boldsymbol{\sigma}'(\mathbf{z}^{(1)})
=\begin{bmatrix}-0.5474139574\\0.3284483744\end{bmatrix}\odot
\begin{bmatrix}0.9305019461\\0.9682872044\end{bmatrix}
=\boxed{\begin{bmatrix}-0.5093697527\\0.3180323583\end{bmatrix}}
$$
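
Steps (d) and (e) together can be sketched as one short computation: distribute the output gradient over the output weights, then scale each entry by the local tanh slope. Variable names are illustrative, and the inputs are the rounded values from earlier steps.

```python
delta2 = -1.0948279147                 # dL/du2 from step (b)
W2 = [0.5, -0.3]                       # output weights before the update
h1 = [0.2636248355, -0.1780808681]     # hidden activations

# (d) Gradient with respect to the hidden activations: delta2 * W2^T.
dL_dh1 = [delta2 * w for w in W2]

# (e) Element-wise (Hadamard) product with tanh'(z1) = 1 - h1^2.
delta1 = [g * (1.0 - h ** 2) for g, h in zip(dL_dh1, h1)]

print([f"{d:.10f}" for d in delta1])
```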

### (f) Hidden layer grads $\left(\mathbf{W}^{(1)}, \mathbf{b}^{(1)}\right)$

With $\mathbf{x}=[x_1,x_2]^\top=[0.5,-0.2]^\top$:

$$
\frac{\partial L}{\partial \mathbf{W}^{(1)}}=
\frac{\partial L}{\partial \mathbf{z}^{(1)}}\;\mathbf{x}^\top
=
\begin{bmatrix}
-0.5093697527\\
\;\;0.3180323583
\end{bmatrix}
\begin{bmatrix}0.5&-0.2\end{bmatrix}
=
\boxed{
\begin{bmatrix}
-0.2546848763 & 0.1018739505\\
\;\;0.1590161791 & -0.0636064717
\end{bmatrix}}
$$

$$
\boxed{\frac{\partial L}{\partial \mathbf{b}^{(1)}}=
\begin{bmatrix}-0.5093697527\\ \;\;0.3180323583\end{bmatrix}}
$$
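
A common way to sanity-check hand-derived gradients is a finite-difference comparison. The sketch below is an addition to the exercise, not part of it: it forms the analytic $\partial L/\partial \mathbf{W}^{(1)}$ as an outer product and compares it with central differences. The step size `eps = 1e-6` and the helper name `loss` are assumptions chosen for illustration.

```python
import math

def loss(W1, b1, W2, b2, x=(0.5, -0.2), y=1.0):
    """Squared error of the 2-2-1 tanh MLP for the single training sample."""
    z1 = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(W1, b1)]
    h1 = [math.tanh(z) for z in z1]
    y_hat = math.tanh(sum(w * h for w, h in zip(W2, h1)) + b2)
    return (y - y_hat) ** 2

W1 = [[0.3, -0.1], [0.2, 0.4]]
b1 = [0.1, -0.2]
W2 = [0.5, -0.3]
b2 = 0.2

# Analytic gradient: outer product of dL/dz1 (step e) with the input x.
delta1 = [-0.5093697527, 0.3180323583]
x = [0.5, -0.2]
analytic = [[d * xj for xj in x] for d in delta1]

# Numerical gradient: perturb one weight at a time in both directions.
eps = 1e-6  # assumed step size
numeric = [[0.0, 0.0], [0.0, 0.0]]
for i in range(2):
    for j in range(2):
        W_plus = [row[:] for row in W1]
        W_minus = [row[:] for row in W1]
        W_plus[i][j] += eps
        W_minus[i][j] -= eps
        numeric[i][j] = (loss(W_plus, b1, W2, b2)
                         - loss(W_minus, b1, W2, b2)) / (2 * eps)

max_err = max(abs(analytic[i][j] - numeric[i][j])
              for i in range(2) for j in range(2))
print(f"largest analytic-vs-numeric gap: {max_err:.2e}")
```

A small gap suggests the backward pass is consistent with the forward pass; the same loop could be repeated for the biases and for $\mathbf{W}^{(2)}$.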

---

## 4) Parameter update (gradient descent with $\eta=0.1$)

$$
\theta \leftarrow \theta - \eta \nabla_\theta L
$$

### Output layer

$$
\mathbf{W}^{(2)}_{\text{new}}=
\begin{bmatrix}0.5&-0.3\end{bmatrix}
-0.1\begin{bmatrix}-0.2886238289&0.1949679055\end{bmatrix}
=
\boxed{\begin{bmatrix}0.5288623829&-0.3194967905\end{bmatrix}}
$$

$$
b^{(2)}_{\text{new}}=0.2-0.1(-1.0948279147)=\boxed{0.3094827915}
$$

### Hidden layer

$$
\mathbf{W}^{(1)}_{\text{new}}=
\begin{bmatrix}0.3&-0.1\\0.2&0.4\end{bmatrix}
-0.1\begin{bmatrix}
-0.2546848763&0.1018739505\\
\;\;0.1590161791&-0.0636064717
\end{bmatrix}
=
\boxed{
\begin{bmatrix}
0.3254684876 & -0.1101873951\\
0.1840983821 & \;\;0.4063606472
\end{bmatrix}}
$$

$$
\mathbf{b}^{(1)}_{\text{new}}=
\begin{bmatrix}0.1\\-0.2\end{bmatrix}
-0.1\begin{bmatrix}-0.5093697527\\ \;\;0.3180323583\end{bmatrix}
=
\boxed{\begin{bmatrix}0.1509369753\\-0.2318032358\end{bmatrix}}
$$
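
The update rule can be applied to all four parameter groups at once. The sketch below uses hypothetical helpers `gd_step` and `loss` and the rounded gradients from section 3. As an extra check that the exercise itself does not ask for, it re-evaluates the loss with the updated parameters to confirm that this step moves downhill.

```python
import math

def loss(params, x=(0.5, -0.2), y=1.0):
    """Squared error of the 2-2-1 tanh MLP for a single sample."""
    W1, b1, W2, b2 = params
    h1 = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
          for row, b in zip(W1, b1)]
    y_hat = math.tanh(sum(w * h for w, h in zip(W2, h1)) + b2)
    return (y - y_hat) ** 2

def gd_step(params, grads, eta):
    """Return theta - eta * grad for each parameter; inputs are not modified."""
    W1, b1, W2, b2 = params
    gW1, gb1, gW2, gb2 = grads
    return (
        [[w - eta * g for w, g in zip(w_row, g_row)]
         for w_row, g_row in zip(W1, gW1)],
        [b - eta * g for b, g in zip(b1, gb1)],
        [w - eta * g for w, g in zip(W2, gW2)],
        b2 - eta * gb2,
    )

old = ([[0.3, -0.1], [0.2, 0.4]], [0.1, -0.2], [0.5, -0.3], 0.2)
grads = (
    [[-0.2546848763, 0.1018739505], [0.1590161791, -0.0636064717]],
    [-0.5093697527, 0.3180323583],
    [-0.2886238289, 0.1949679055],
    -1.0948279147,
)

new = gd_step(old, grads, eta=0.1)
print("W1_new =", new[0])
print("b1_new =", new[1])
print("W2_new =", new[2])
print("b2_new =", new[3])
print("loss before:", loss(old), "loss after:", loss(new))
```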