# Machine Learning (Summer 2024)

## Practice Session 12: Introduction

June 22nd, 2024

Ulf Krumnack & Lukas Niehaus

Institute of Cognitive Science,
University of Osnabrück

## Today's Session

* Organization
* Neural Network Libraries

# Announcements

## Final exam

* You must have completed the practice sheets successfully to be admitted to the exam.

* Time: Thursday, July 4th, 10:00 to 12:00 (lecture timeslot). Please be there at 10:00 sharp.

* The exam will take 90 minutes.

* Bring your own pen! No additional material (like calculators,
  paper, cell phones, etc.) is allowed.

* The exam will cover lectures ML-01 up to ML-11.

### Redoing the exam

* There will be a retry exam for those who fail the written exam. This will most probably be an oral exam, taking place later during the semester break (people who succeed in the written exam are not admitted to the retry exam).

* If you cannot make it to the exam and can provide a medical certificate, you will be admitted to the retry exam.

* If you do not want to participate in the exam this year, you can participate next year without redoing the practice sheets. (However, if you participate in this year's exam and want to repeat the exam next year, you will have to redo the exercises!)

### Registering for the final exam

* If (and only if) you intend to participate in the exam, please register
   in EXA by **Monday, July 1st, 2024** at the latest.

# Recap / Q&A

* There will be recap sessions on Tuesday and Wednesday.
* Please post your questions in the forum in advance (there is a dedicated area).

## Neural network computation

Consider the following multilayer perceptron (notation from ML-7 slides 46ff), consisting of an input layer (layer $k=0$, with two neurons 1 & 2), a hidden layer ($k=1$ with two neurons 3 & 4) and an output layer ($k=2$ with two neurons 5 & 6).

![mlp-large.png](mlp-large.png)

The connection weights are given by the following connectivity matrix:

to\from|1  |2  |3  |4  |5  |6
-------|---|---|---|---|---|--
1      |-  |-  |-  |-  |-  |-
2      |-  |-  |-  |-  |-  |-
3      |-3 |2  |-  |-  |-  |-
4      |2  |1  |-  |-  |-  |-
5      |-  |-  |4  |-1 |-  |-
6      |-  |-  |-2 |0.5|-  |-

The hidden layer (neurons 3 & 4) applies the [rectifier](https://en.wikipedia.org/wiki/Rectifier_(neural_networks)) as activation function.
$$
    \varphi(x)=\max(0,x)
$$

The output layer (neurons 5 & 6) uses the sigmoid ([standard logistic function](https://en.wikipedia.org/wiki/Logistic_function), [Fermi function](https://en.wikipedia.org/wiki/Fermi%E2%80%93Dirac_statistics)) as activation function.
$$
    \varphi(x)={\frac {1}{1+e^{-x}}}
$$

To measure the error, following the lecture (ML-7 slide 48), the (halved) [Mean Squared Error](https://en.wikipedia.org/wiki/Mean_squared_error) is used, that is
$$E[\{w\}](\vec{t},\vec{y}) = 
\tfrac{1}{2}\left\|\vec{t}-\vec{y}\right\|_2^2 =
\frac{1}{2}\sum_{i=1}^{d}(t_i-y_i)^2$$
with $\vec{y}$ being the values predicted by the network, $\vec{t}$ the target value ("ground truth"), and $d=2$ the dimensionality of the output space.
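
The three definitions above can be written down directly as a minimal plain-Python sketch. The function names `relu`, `sigmoid` and `halved_squared_error` are chosen here for illustration and are not taken from the lecture slides.

```python
import math

def relu(x):
    """Rectifier: max(0, x)."""
    return max(0.0, x)

def sigmoid(x):
    """Standard logistic function 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def halved_squared_error(t, y):
    """E = 1/2 * sum_i (t_i - y_i)^2 for two equally long sequences."""
    return 0.5 * sum((ti - yi) ** 2 for ti, yi in zip(t, y))

print("relu(-2), relu(3):", relu(-2.0), relu(3.0))
print("sigmoid(0):", sigmoid(0.0))
print("E for t=(1,0), y=(0.5,0.5):", halved_squared_error([1.0, 0.0], [0.5, 0.5]))
```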

### Manual solution (lecture style)

**(a)** Assume the input $\vec{x} = (1.0, 2.0)$ is given to the network (notice that in contrast to the lecture slides, we only consider a single input vector here, instead of a full dataset). Compute the weighted input $s_i(k)$ as well as the output values $o_i(k)$ for all neurons in the network.

Computing the weighted input for layer 1:
\begin{align*}
  s_3(1) &= \sum_{j=1}^{2} w_{3j}(1,0)o_j(0) && = -3\cdot 1 + 2\cdot 2 = -3 + 4 && = 1 \\
  s_4(1) &= \sum_{j=1}^{2} w_{4j}(1,0)o_j(0) && = 2\cdot 1 + 1\cdot 2 = 2+2 && = 4
\end{align*}
The outputs of layer 1 are hence:
\begin{align*}
  o_3(1) &= \varphi_{R}(s_3(1)) &&= \max(0,1) && = 1 \\
  o_4(1) &= \varphi_{R}(s_4(1)) &&= \max(0,4) && = 4 
\end{align*}

For layer 2 we then get the following weighted input:
\begin{align*}
  s_5(2) &= \sum_{j=3}^{4} w_{5j}(2,1)o_j(1) &&= 4\cdot 1 - 1\cdot 4 &&= 0 \\
  s_6(2) &= \sum_{j=3}^{4} w_{6j}(2,1)o_j(1) &&= -2\cdot 1 + 0.5\cdot 4 && = 0 
\end{align*}
The outputs of layer 2 (which are also the network output) are:
\begin{align*}
  o_5(2) &= \varphi_{S}(s_5(2)) = \sigma(0) = 0.5 \\
  o_6(2) &= \varphi_{S}(s_6(2)) = \sigma(0) = 0.5
\end{align*}
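
The per-neuron sums above can be checked with a small plain-Python sketch. The dictionaries `weights`, `activation`, `s` and `o` are an illustrative encoding of the connectivity matrix (indexed as `weights[to][from]`) and are not notation from the lecture.

```python
import math

# Connectivity matrix from above, stored as weights[to][from].
weights = {
    3: {1: -3.0, 2: 2.0},
    4: {1: 2.0, 2: 1.0},
    5: {3: 4.0, 4: -1.0},
    6: {3: -2.0, 4: 0.5},
}

def relu(s):
    return max(0.0, s)

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

# Hidden neurons 3 and 4 use the rectifier, output neurons 5 and 6 the sigmoid.
activation = {3: relu, 4: relu, 5: sigmoid, 6: sigmoid}

o = {1: 1.0, 2: 2.0}  # outputs of the input layer are the input x itself
s = {}
for neuron in (3, 4, 5, 6):  # process the neurons layer by layer
    s[neuron] = sum(w * o[src] for src, w in weights[neuron].items())
    o[neuron] = activation[neuron](s[neuron])
    print(f"neuron {neuron}: s = {s[neuron]}, o = {o[neuron]}")
```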

**(b)** Compute the loss value for the predicted output, assuming that the target value is $\vec{t}=(1.0, 0.0)$.

The loss value is given by the (halved) mean squared error between the network output $\vec{y}=(0.5, 0.5)$ and the target value $\vec{t}=(1.0,0.0)$:
\begin{align*}
  E[\{w\}](\vec{t},\vec{y}) 
  & = \tfrac12\|\vec{t}-\vec{y}\|^2\\
  & = \tfrac12\sum_{i=1}^{2}(t_i-y_i)^2\\
  & = \tfrac12\left[(1-0.5)^2 + (0-0.5)^2\right]\\
  & = \tfrac12\left[0.25 + 0.25\right]\\
  & = 0.25
\end{align*}

**(c)** Now perform backpropagation: compute the error signals $\delta_i(k)$ and the partial derivatives $\partial E/\partial w_{ik}$ for the weights in layers $k=2$ and $k=1$ (for layer $k=1$ remember to use the ReLU function, which has a quite simple derivative).

For the output layer we get the following error signal:
\begin{align*}
  \delta_5(2) & = \varphi_{S}'(s_5)\cdot (t_1-y_1(\vec{x})) && = \sigma'(0)\cdot(1.0-0.5) && = \sigma(0)(1-\sigma(0))\cdot 0.5 && = 0.5 \cdot 0.5 \cdot 0.5 && = 0.125 \\
  \delta_6(2) & = \varphi_{S}'(s_6)\cdot (t_2-y_2(\vec{x})) && = \sigma'(0)\cdot(0.0-0.5) && = \sigma(0)(1-\sigma(0))\cdot (-0.5) && = 0.5 \cdot 0.5 \cdot (-0.5) && = -0.125
\end{align*}
From this we can obtain the second layer weight gradients:
\begin{align*}
  -\partial E/\partial w_{53}(2,1) & = \delta_5(2)o_3(1) &&= 0.125 \cdot 1 &&= 0.125 \\
  -\partial E/\partial w_{54}(2,1) & = \delta_5(2)o_4(1) &&= 0.125 \cdot 4 &&= 0.5 \\
  -\partial E/\partial w_{63}(2,1) & = \delta_6(2)o_3(1) && = -0.125 \cdot 1 &&= -0.125 \\
  -\partial E/\partial w_{64}(2,1) & = \delta_6(2)o_4(1) && = -0.125 \cdot 4 &&= -0.5 
\end{align*}

For layer 1 the error signal is:
\begin{align*}
  \delta_3(1) &= \varphi_{R}'(s_3(1))\cdot\sum_{j=5}^{6}w_{j3}(2,1)\delta_j(2) && = \varphi_{R}'(1)\cdot\left[4\cdot 0.125 + (-2)\cdot(-0.125)\right] &&= 1\cdot [0.5 + 0.25] &&= 0.75\\
  \delta_4(1) &= \varphi_{R}'(s_4(1))\cdot\sum_{j=5}^{6}w_{j4}(2,1)\delta_j(2) && = \varphi_{R}'(4)\cdot\left[(-1)\cdot 0.125 + 0.5\cdot(-0.125)\right] &&= 1\cdot [-0.125-0.0625] &&= -0.1875
\end{align*}
yielding the following gradients:
\begin{align*}
  -\partial E/\partial w_{31}(1,0) &= \delta_3(1)o_1(0) &&= 0.75 \cdot 1.0 && = 0.75 \\
  -\partial E/\partial w_{32}(1,0) &= \delta_3(1)o_2(0) &&= 0.75 \cdot 2.0 && = 1.5 \\
  -\partial E/\partial w_{41}(1,0) &= \delta_4(1)o_1(0) &&= -0.1875 \cdot 1.0 &&= -0.1875 \\
  -\partial E/\partial w_{42}(1,0) &= \delta_4(1)o_2(0) &&= -0.1875 \cdot 2.0 &&= -0.375
\end{align*}
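
One way to sanity-check hand-computed gradients is to compare them with central finite differences of the error function. The following is a minimal sketch under that assumption. The helper `loss`, the step size `h` and the `(to, from)` dictionary keys are illustrative choices, not part of the lecture.

```python
import math

def loss(w, x=(1.0, 2.0), t=(1.0, 0.0)):
    """Halved squared error of the 2-2-2 network; w maps (to, from) -> weight."""
    o3 = max(0.0, w[3, 1] * x[0] + w[3, 2] * x[1])
    o4 = max(0.0, w[4, 1] * x[0] + w[4, 2] * x[1])
    y1 = 1.0 / (1.0 + math.exp(-(w[5, 3] * o3 + w[5, 4] * o4)))
    y2 = 1.0 / (1.0 + math.exp(-(w[6, 3] * o3 + w[6, 4] * o4)))
    return 0.5 * ((t[0] - y1) ** 2 + (t[1] - y2) ** 2)

w = {(3, 1): -3.0, (3, 2): 2.0, (4, 1): 2.0, (4, 2): 1.0,
     (5, 3): 4.0, (5, 4): -1.0, (6, 3): -2.0, (6, 4): 0.5}

# Values of -dE/dw obtained by hand in the computation above.
manual = {(5, 3): 0.125, (5, 4): 0.5, (6, 3): -0.125, (6, 4): -0.5,
          (3, 1): 0.75, (3, 2): 1.5, (4, 1): -0.1875, (4, 2): -0.375}

h = 1e-6  # small perturbation; does not move s_3 or s_4 across the ReLU kink
numeric = {}
for key in w:
    w_plus = dict(w)
    w_plus[key] += h
    w_minus = dict(w)
    w_minus[key] -= h
    # Central difference approximates dE/dw; negate to match the sign convention.
    numeric[key] = -(loss(w_plus) - loss(w_minus)) / (2 * h)
    print(key, "manual:", manual[key], "numeric:", round(numeric[key], 6))
```

Finite differences only approximate the derivative, so the comparison should use a tolerance rather than exact equality.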

**(d)** Finish your training with an update step: apply the adaptation rule with a learning rate $\varepsilon=1$ to obtain the updated network.

The adaptation rule is

$$ w_{ji}(k+1,k) \mapsto w_{ji}(k+1,k) + \Delta w_{ji}(k+1,k)$$

with the update term

$$ \Delta w_{ji}(k+1,k) = - \varepsilon \partial E/\partial w_{ji}(k+1,k) = \varepsilon \delta_j(k+1)o_i(k)$$

For the first layer that is
\begin{align*}
  w_{31}(1,0): &&-3 \mapsto & -3 + 1\cdot 0.75   && = -2.25 \\
  w_{32}(1,0): && 2 \mapsto &  2 + 1\cdot1.5    && = 3.5 \\
  w_{41}(1,0): && 2 \mapsto &  2 + 1\cdot(-0.1875) && = 1.8125 \\
  w_{42}(1,0): && 1 \mapsto &  1 + 1\cdot(-0.375)  && = 0.625
\end{align*}

and for the second layer
\begin{align*}
  w_{53}(2,1): &&  4 \mapsto &    4 + 1\cdot 0.125    && = 4.125 \\
  w_{54}(2,1): && -1 \mapsto &   -1 + 1\cdot 0.5      && = -0.5 \\
  w_{63}(2,1): && -2 \mapsto &   -2 + 1\cdot (-0.125) && = -2.125 \\
  w_{64}(2,1): && 0.5 \mapsto & 0.5 + 1\cdot (-0.5)   && = 0.0
\end{align*}
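
As a hedged cross-check, the adaptation rule can be applied in plain Python and the error recomputed for the same input. The helper `loss` and the dictionaries `w`, `neg_grad` and `w_new` are illustrative names. That the error decreases here is a property of this particular example and not guaranteed for a step size as large as $\varepsilon=1$ in general.

```python
import math

def loss(w, x=(1.0, 2.0), t=(1.0, 0.0)):
    """Halved squared error of the 2-2-2 network; w maps (to, from) -> weight."""
    o3 = max(0.0, w[3, 1] * x[0] + w[3, 2] * x[1])
    o4 = max(0.0, w[4, 1] * x[0] + w[4, 2] * x[1])
    y1 = 1.0 / (1.0 + math.exp(-(w[5, 3] * o3 + w[5, 4] * o4)))
    y2 = 1.0 / (1.0 + math.exp(-(w[6, 3] * o3 + w[6, 4] * o4)))
    return 0.5 * ((t[0] - y1) ** 2 + (t[1] - y2) ** 2)

eps = 1.0  # learning rate from the task

w = {(3, 1): -3.0, (3, 2): 2.0, (4, 1): 2.0, (4, 2): 1.0,
     (5, 3): 4.0, (5, 4): -1.0, (6, 3): -2.0, (6, 4): 0.5}

# -dE/dw = delta_j * o_i from part (c)
neg_grad = {(3, 1): 0.75, (3, 2): 1.5, (4, 1): -0.1875, (4, 2): -0.375,
            (5, 3): 0.125, (5, 4): 0.5, (6, 3): -0.125, (6, 4): -0.5}

# Adaptation rule: w -> w + eps * (-dE/dw)
w_new = {key: w[key] + eps * neg_grad[key] for key in w}

for key in sorted(w_new):
    print(key, w[key], "->", w_new[key])
print("error before:", loss(w))
print("error after: ", loss(w_new))
```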

### Manual Solution (vector calculus)

**(a)** Assume the input $\vec{x} = (1.0, 2.0)$ is given to the network (notice that in contrast to the lecture slides, we only consider a single input vector here, instead of a full dataset). Compute the weighted input $s_i(k)$ as well as the output values $o_i(k)$ for all neurons in the network.

Computing the weighted input for layer 1:
\begin{align*}
  \vec{s}^{(1)} &= \mathbf{W}^{(1)}\cdot \vec{o}^{(0)} = 
  \begin{pmatrix} -3 & 2 \\ 2 & 1 \end{pmatrix}\cdot
  \begin{pmatrix} 1 \\ 2 \end{pmatrix}
  = \begin{pmatrix} 1 \\ 4 \end{pmatrix}
\end{align*}

\begin{align*}
  \vec{o}^{(1)} &= \mathbf{\varphi}_{R}(\vec{s}^{(1)}) =
  \begin{pmatrix} \varphi_R(1) \\ \varphi_R(4) \end{pmatrix} =
  \begin{pmatrix} \max(1,0) \\ \max(4,0) \end{pmatrix} =
  \begin{pmatrix} 1 \\ 4 \end{pmatrix}
\end{align*}

Computing the weighted input for layer 2:
\begin{align*}
  \vec{s}^{(2)} &= \mathbf{W}^{(2)}\cdot \vec{o}^{(1)} = 
  \begin{pmatrix} 4 & -1 \\ -2 & 0.5 \end{pmatrix}\cdot
  \begin{pmatrix} 1 \\ 4 \end{pmatrix}
  = \begin{pmatrix} 0 \\ 0 \end{pmatrix}
\end{align*}

\begin{align*}
  \vec{o}^{(2)} &= \mathbf{\varphi}_{S}(\vec{s}^{(2)}) =
  \begin{pmatrix} \varphi_S(0) \\ \varphi_S(0) \end{pmatrix} =
  \begin{pmatrix} \sigma(0) \\ \sigma(0) \end{pmatrix} =
  \begin{pmatrix} 0.5 \\ 0.5 \end{pmatrix}
\end{align*}

**(b)** Compute the loss value for the predicted output, assuming that the target value is $\vec{t}=(1.0, 0.0)$.

The loss value is given by the (halved) mean squared error between the network output $\vec{y}=(0.5, 0.5)$ and the target value $\vec{t}=(1.0,0.0)$:
\begin{align*}
  E[\{w\}](\vec{t},\vec{y}) 
  & = \tfrac12\|\vec{t}-\vec{y}\|^2\\
  & = \tfrac12\sum_{i=1}^{2}(t_i-y_i)^2\\
  & = \tfrac12\left[(1-0.5)^2 + (0-0.5)^2\right]\\
  & = \tfrac12\left[0.25 + 0.25\right]\\
  & = 0.25
\end{align*}

**(c)** Now perform backpropagation: compute the error signals $\delta_i(k)$ and the partial derivatives $\partial E/\partial w_{ik}$ for the weights in layers $k=2$ and $k=1$ (for layer $k=1$ remember to use the ReLU function, which has a quite simple derivative).

Start with the error signal for the second (i.e. output) layer:
\begin{align*}
  \vec{\delta}^{(2)} &= 
  \frac{-\partial E}{\partial \vec{s}^{(2)}} =
  \frac{\partial \vec{o}^{(2)}}{\partial \vec{s}^{(2)}} \cdot
  \frac{-\partial E}{\partial \vec{o}^{(2)}}
\end{align*}

The required gradients are:
\begin{align*}
  \frac{-\partial E}{\partial \vec{o}^{(2)}} &= 
  \begin{pmatrix}
    -\frac{\mathrm{d}E}{\mathrm{d}o^{(2)}_1} \\
    -\frac{\mathrm{d}E}{\mathrm{d}o^{(2)}_2}
  \end{pmatrix}
  =
  \begin{pmatrix}
    t_1-o^{(2)}_1 \\ t_2-o^{(2)}_2
  \end{pmatrix}
  =
  \begin{pmatrix}
    0.5 \\ -0.5
  \end{pmatrix}
  \\
  \frac{\partial \vec{o}^{(2)}}{\partial \vec{s}^{(2)}} &= 
  \begin{pmatrix}
    \frac{\mathrm{d}o^{(2)}_1}{\mathrm{d}s^{(2)}_1} &
    \frac{\mathrm{d}o^{(2)}_2}{\mathrm{d}s^{(2)}_1} \\
    \frac{\mathrm{d}o^{(2)}_1}{\mathrm{d}s^{(2)}_2} &
    \frac{\mathrm{d}o^{(2)}_2}{\mathrm{d}s^{(2)}_2}
  \end{pmatrix}
  =
  \begin{pmatrix}
    \sigma'(s^{(2)}_1) & 0 \\
    0 &  \sigma'(s^{(2)}_2)
  \end{pmatrix}
  \\
  &=
  \begin{pmatrix}
    \sigma(s^{(2)}_1)(1-\sigma(s^{(2)}_1)) & 0 \\
    0 &  \sigma(s^{(2)}_2)(1-\sigma(s^{(2)}_2))
  \end{pmatrix}
  =
  \begin{pmatrix}
    0.25 & 0 \\
    0 &  0.25
  \end{pmatrix}
\end{align*}

so the error signal is:
\begin{align*}
  \vec{\delta}^{(2)} &= 
  \frac{\partial \vec{o}^{(2)}}{\partial \vec{s}^{(2)}} \cdot
  \frac{-\partial E}{\partial \vec{o}^{(2)}}
  =
  \begin{pmatrix}
    0.25 & 0 \\
    0 &  0.25
  \end{pmatrix}
  \cdot
  \begin{pmatrix}
    0.5 \\ -0.5
  \end{pmatrix}
  =
  \begin{pmatrix}
    0.125 \\ -0.125
  \end{pmatrix}  
\end{align*}
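
Because the sigmoid acts on each component separately, its Jacobian is diagonal, and multiplying by it is the same as an elementwise product. The following NumPy sketch illustrates this equivalence. The variable names are chosen for illustration only.

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

s2 = np.array([0.0, 0.0])  # weighted input of the output layer
t = np.array([1.0, 0.0])   # target value
o2 = sigmoid(s2)           # network output

# Jacobian of the elementwise sigmoid: diagonal with sigma'(s) = sigma(s) * (1 - sigma(s))
jacobian = np.diag(o2 * (1.0 - o2))

delta2_matrix = jacobian @ (t - o2)              # matrix form as in the derivation
delta2_elementwise = o2 * (1.0 - o2) * (t - o2)  # equivalent elementwise shortcut

print("Jacobian:\n", jacobian)
print("delta2 (matrix form):     ", delta2_matrix)
print("delta2 (elementwise form):", delta2_elementwise)
```

The elementwise form avoids building the full Jacobian, which is why it is the one typically used in code.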

This allows us to compute the gradient with respect to the second layer weights $\mathbf{W}^{(2)}$:
\begin{align*}
  \frac{-\partial E}{\partial \mathbf{W}^{(2)}}
  &=
  \frac{-\partial E}{\partial \vec{s}^{(2)}}
  \frac{\partial \vec{s}^{(2)}}{\partial \mathbf{W}^{(2)}}
  \\
  &=
  \vec{\delta}^{(2)} \cdot (\vec{o}^{(1)})^{T}
  \\
  & =
  \begin{pmatrix}
    0.125 \\ -0.125
  \end{pmatrix}  
  \cdot
  \begin{pmatrix}
    1 & 4
  \end{pmatrix}  
  \\
  &= 
  \begin{pmatrix}
    0.125 & 0.5 \\
    -0.125 & -0.5
  \end{pmatrix}  
\end{align*}

For the first layer the error signal is:
\begin{align*}
  \vec{\delta}^{(1)} &= 
  \frac{-\partial E}{\partial \vec{s}^{(1)}} =
  \frac{\partial \vec{o}^{(1)}}{\partial \vec{s}^{(1)}} \cdot
  \frac{\partial \vec{s}^{(2)}}{\partial \vec{o}^{(1)}} \cdot
  \frac{-\partial E}{\partial \vec{s}^{(2)}}
\end{align*}

The gradients are:
\begin{align*}
  \frac{\partial \vec{s}^{(2)}}{\partial \vec{o}^{(1)}} &= 
  \begin{pmatrix}
    \frac{\mathrm{d}s^{(2)}_1}{\mathrm{d}o^{(1)}_1} &
    \frac{\mathrm{d}s^{(2)}_2}{\mathrm{d}o^{(1)}_1} \\
    \frac{\mathrm{d}s^{(2)}_1}{\mathrm{d}o^{(1)}_2} &
    \frac{\mathrm{d}s^{(2)}_2}{\mathrm{d}o^{(1)}_2} 
  \end{pmatrix}
  =
  \begin{pmatrix}
    w^{(2)}_{11} & w^{(2)}_{21} \\
    w^{(2)}_{12} & w^{(2)}_{22}
  \end{pmatrix}
  = (\mathbf{W}^{(2)})^T =
  \begin{pmatrix}
     4 &  -2\\
     -1 &  0.5
  \end{pmatrix}
  \\
  \frac{\partial \vec{o}^{(1)}}{\partial \vec{s}^{(1)}} &= 
  \begin{pmatrix}
    \frac{\mathrm{d}o^{(1)}_1}{\mathrm{d}s^{(1)}_1} &
    \frac{\mathrm{d}o^{(1)}_2}{\mathrm{d}s^{(1)}_1} \\
    \frac{\mathrm{d}o^{(1)}_1}{\mathrm{d}s^{(1)}_2} &
    \frac{\mathrm{d}o^{(1)}_2}{\mathrm{d}s^{(1)}_2}
  \end{pmatrix}
  =
  \begin{pmatrix}
    \varphi_R'(s^{(1)}_1) & 0 \\
    0 & \varphi_R'(s^{(1)}_2)
  \end{pmatrix}
  =
  \begin{pmatrix}
    1 & 0 \\
    0 & 1
  \end{pmatrix}
\end{align*}

so the error signal amounts to:
\begin{align*}
  \vec{\delta}^{(1)} &= 
  \begin{pmatrix}
    1 & 0 \\
    0 & 1
  \end{pmatrix}
  \cdot
  \begin{pmatrix}
     4 &  -2\\
     -1 &  0.5
  \end{pmatrix}
  \cdot
  \begin{pmatrix}
    0.125 \\ -0.125
  \end{pmatrix}  
  =   
  \begin{pmatrix}
    0.75 \\ -0.1875
  \end{pmatrix}  
\end{align*}

This yields the following gradient for $\mathbf{W}^{(1)}$, the weights of layer 1:
\begin{align*}
  \frac{-\partial E}{\partial \mathbf{W}^{(1)}}
  &=
  \frac{-\partial E}{\partial \vec{s}^{(1)}}
  \frac{\partial \vec{s}^{(1)}}{\partial \mathbf{W}^{(1)}}
  \\
  &=
  \vec{\delta}^{(1)} \cdot (\vec{o}^{(0)})^{T}
  \\
  & =
  \begin{pmatrix}
    0.75 \\ -0.1875
  \end{pmatrix}  
  \cdot
  \begin{pmatrix}
    1 & 2
  \end{pmatrix}  
  \\
  &= 
  \begin{pmatrix}
    0.75 & 1.5 \\
    -0.1875 & -0.375
  \end{pmatrix}
\end{align*}
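
The matrix products of this derivation map directly onto NumPy operations. The sketch below assumes the intermediate values computed above ($\vec{s}^{(1)}$, $\vec{\delta}^{(2)}$) and uses `np.outer` for the products $\vec{\delta}\cdot\vec{o}^{T}$. The variable names are illustrative.

```python
import numpy as np

W2 = np.array([[4.0, -1.0], [-2.0, 0.5]])
x = np.array([1.0, 2.0])             # o^(0), the network input
s1 = np.array([1.0, 4.0])            # weighted input of layer 1
o1 = np.maximum(0.0, s1)             # ReLU output of layer 1
delta2 = np.array([0.125, -0.125])   # error signal of the output layer

# Jacobian of the elementwise ReLU: diagonal with 1 where s > 0, else 0
relu_jacobian = np.diag((s1 > 0).astype(float))

# delta1 = (d o1 / d s1) . (W2)^T . delta2
delta1 = relu_jacobian @ W2.T @ delta2

# -dE/dW = delta . o^T, an outer product for a single input vector
neg_grad_W2 = np.outer(delta2, o1)
neg_grad_W1 = np.outer(delta1, x)

print("delta1:", delta1)
print("-dE/dW2:\n", neg_grad_W2)
print("-dE/dW1:\n", neg_grad_W1)
```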

**(d)** Finish your training with an update step: apply the adaptation rule with a learning rate $\varepsilon=1$ to obtain the updated network.

\begin{align*}
  \mathbf{W}^{(1)}
  \mapsto & \mathbf{W}^{(1)}
    + \varepsilon \frac{-\partial E}{\partial\mathbf{W}^{(1)}}
  = 
  \begin{pmatrix}
    -3 & 2 \\
    2 & 1
  \end{pmatrix}
  + 1\cdot
  \begin{pmatrix}
    0.75 & 1.5 \\
    -0.1875 & -0.375
  \end{pmatrix}
  \\ & =
  \begin{pmatrix}
    -2.25 & 3.5 \\
    1.8125 & 0.625
  \end{pmatrix}
  \\
  \mathbf{W}^{(2)}
  \mapsto & \mathbf{W}^{(2)}
    + \varepsilon \frac{-\partial E}{\partial\mathbf{W}^{(2)}}
  = 
  \begin{pmatrix}
    4 & -1 \\
    -2 & 0.5
  \end{pmatrix}
  + 1\cdot
  \begin{pmatrix}
    0.125 & 0.5 \\
    -0.125 & -0.5
  \end{pmatrix}
  \\ & =
  \begin{pmatrix}
    4.125 & -0.5 \\
    -2.125 & 0.0
  \end{pmatrix}
\end{align*}

### Numpy solution

In [None]:
import numpy as np

W1 = np.array([[-3., 2.], [ 2., 1.]])
W2 = np.array([[4., -1.], [-2., .5]])

relu = lambda x: np.maximum(0,x)
sigmoid = lambda x: 1/(1+np.exp(-x))

**(a)** Assume the input $\vec{x} = (1.0, 2.0)$ is given to the network (notice that in contrast to the lecture slides, we only consider a single input vector here, instead of a full dataset). Compute the weighted input $s_i(k)$ as well as the output values $o_i(k)$ for all neurons in the network.

In [None]:
x = np.array([[1., 2.]]).T

s1 = W1 @ x
o1 = relu(s1)

print("s1:", s1.T)
print("o1:", o1.T)

In [None]:
s2 = W2 @ o1
o2 = sigmoid(s2)

print("s2:", s2.T)
print("o2:", o2.T)

**(b)** Compute the loss value for the predicted output, assuming that the target value is $\vec{t}=(1.0, 0.0)$.

In [None]:
t = np.array([[1., 0.]]).T
# error_func = lambda x, t: 0.5 * np.linalg.norm(x-t)**2
error_func = lambda x, t: ((x-t)**2).sum() / 2

E = error_func(o2, t)
print("Error:", E)

**(c)** Now perform backpropagation: compute the error signals $\delta_i(k)$ and the partial derivatives $\partial E/\partial w_{ik}$ for the weights in layers $k=2$ and $k=1$ (for layer $k=1$ remember to use the ReLU function, which has a quite simple derivative).

In [None]:
sigmoid_derivative = lambda x: sigmoid(x) * (1-sigmoid(x))
relu_derivative = lambda x: x > 0

In [None]:
δ2 = sigmoid_derivative(s2) * (t - o2)
print("δ2:", δ2.T)

δ1 = relu_derivative(s1) * (W2.T @ δ2)
print("δ1:", δ1.T)

**(d)** Finish your training with an update step: apply the adaptation rule with a learning rate $\varepsilon=1$ to obtain the updated network.

In [None]:
ε = 1.0

In [None]:
ΔW1 = ε * δ1 @ x.T
print("ΔW1:\n", ΔW1)

ΔW2 = ε * δ2 @ o1.T
print("\nΔW2:\n", ΔW2)

In [None]:
W1 += ΔW1
print("Updated W1:\n", W1)

W2 += ΔW2
print("\nUpdated W2:\n", W2)

In [None]:
assert np.array_equal(W1, np.array([[-2.2500,  3.5000], [ 1.8125,  0.6250]]))
assert np.array_equal(W2, np.array([[ 4.1250, -0.5000], [-2.1250,  0.0000]]))

### Torch solution

In [None]:
import torch
from torch import nn

In [None]:
layer1 = nn.Linear(2, 2, bias=False)
layer2 = nn.Linear(2, 2, bias=False)

with torch.no_grad():
    layer1.weight.copy_(torch.tensor([[-3., 2.], [ 2., 1.]]))
    layer2.weight.copy_(torch.tensor([[4., -1.], [-2., .5]]))

**(a)** Assume the input $\vec{x} = (1.0, 2.0)$ is given to the network (notice that in contrast to the lecture slides, we only consider a single input vector here, instead of a full dataset). Compute the weighted input $s_i(k)$ as well as the output values $o_i(k)$ for all neurons in the network.

In [None]:
x = torch.tensor([[1.,2.]])

In [None]:
s1 = layer1(x)
o1 = nn.ReLU()(s1)

print("s1 =", s1)
print("o1 =", o1)
assert torch.equal(o1, torch.tensor([[1., 4.]]))

In [None]:
s2 = layer2(o1)
o2 = nn.Sigmoid()(s2)
print("s2 =", s2)
print("o2 =", o2)
assert torch.equal(o2, torch.tensor([[0.5, 0.5]]))

**(b)** Compute the loss value for the predicted output, assuming that the target value is $\vec{t}=(1.0, 0.0)$.

In [None]:
loss_func = nn.MSELoss(reduction="sum")
t = torch.tensor([[1., 0.]])
error = loss_func(o2, t) / 2

print("\ntarget output (ground truth):", t)
print("predicted output:", o2)
print("error:", error)

**(c)** Now perform backpropagation: compute the error signals $\delta_i(k)$ and the partial derivatives $\partial E/\partial w_{ik}$ for the weights in layers $k=2$ and $k=1$ (for layer $k=1$ remember to use the ReLU function, which has a quite simple derivative).

In [None]:
negative_error = -error
negative_error.backward()

print("\nGradients:")
print("-dE/dW2\n", layer2.weight.grad)
print("-dE/dW1\n", layer1.weight.grad)

assert torch.equal(layer2.weight.grad, torch.tensor([[0.1250, 0.5000], [ -0.1250,  -0.5000]]))
assert torch.equal(layer1.weight.grad, torch.tensor([[0.7500, 1.5000], [ -0.1875,  -0.3750]]))

**(d)** Finish your training with an update step: apply the adaptation rule with a learning rate $\varepsilon=1$ to obtain the updated network.

In [None]:
ε = 1.0

with torch.no_grad():
    layer1.weight.copy_(layer1.weight + ε * layer1.weight.grad)
    layer2.weight.copy_(layer2.weight + ε * layer2.weight.grad)

print("\nupdated weights:")
print("W1", layer1.weight)
print("W2", layer2.weight)

In [None]:
assert torch.equal(layer1.weight, torch.tensor([[-2.2500,  3.5000], [ 1.8125,  0.6250]]))
assert torch.equal(layer2.weight, torch.tensor([[ 4.1250, -0.5000], [-2.1250,  0.0000]]))

### More "torchish" formulation

In [None]:
model = nn.Sequential(
    nn.Linear(2, 2, bias=False),
    nn.ReLU(),
    nn.Linear(2, 2, bias=False),
    nn.Sigmoid()
)
print(model)

In [None]:
with torch.no_grad():
    model[0].weight.copy_(torch.tensor([[-3., 2.], [ 2., 1.]]))
    model[2].weight.copy_(torch.tensor([[4., -1.], [-2., .5]]))

In [None]:
loss_func = nn.MSELoss(reduction="sum")

x = torch.tensor([[1.,2.]])
y_true = torch.tensor([[1., 0.]])
y_pred = model(x)
error = loss_func(y_pred, y_true) / 2

print("target output (ground truth):", y_true)
print("predicted output:", y_pred)
print("error:", error)

In [None]:
optimizer = torch.optim.SGD(model.parameters(), lr=1.0)

error.backward()
optimizer.step()

print("\nupdated weights:")
print("W1", model[0].weight)
print("W2", model[2].weight)

In [None]:
assert torch.equal(model[0].weight, torch.tensor([[-2.2500,  3.5000], [ 1.8125,  0.6250]]))
assert torch.equal(model[2].weight, torch.tensor([[ 4.1250, -0.5000], [-2.1250,  0.0000]]))

### JAX (numpy like)

In [None]:
import jax.numpy as jnp

W1 = jnp.array([[-3., 2.], [ 2., 1.]])
W2 = jnp.array([[4., -1.], [-2., .5]])

relu = lambda x: jnp.maximum(0,x)
sigmoid = lambda x: 1/(1+jnp.exp(-x))

**(a)** Assume the input $\vec{x} = (1.0, 2.0)$ is given to the network (notice that in contrast to the lecture slides, we only consider a single input vector here, instead of a full dataset). Compute the weighted input $s_i(k)$ as well as the output values $o_i(k)$ for all neurons in the network.

In [None]:
x = jnp.array([[1., 2.]]).T

s1 = W1 @ x
o1 = relu(s1)

print("s1:", s1.T)
print("o1:", o1.T)

In [None]:
s2 = W2 @ o1
o2 = sigmoid(s2)

print("s2:", s2.T)
print("o2:", o2.T)

**(b)** Compute the loss value for the predicted output, assuming that the target value is $\vec{t}=(1.0, 0.0)$.

In [None]:
# Error computation:
t = jnp.array([[1., 0.]]).T
#error_func = lambda x, t: 0.5 * jnp.linalg.norm(x-t)**2
error_func = lambda x, t: ((x-t)**2).sum() / 2

E = error_func(o2, t)
print("Error:", E)

**(c)** Now perform backpropagation: compute the error signals $\delta_i(k)$ and the partial derivatives $\partial E/\partial w_{ik}$ for the weights in layers $k=2$ and $k=1$ (for layer $k=1$ remember to use the ReLU function, which has a quite simple derivative).

In [None]:
sigmoid_derivative = lambda x: sigmoid(x) * (1-sigmoid(x))
relu_derivative = lambda x: x > 0

In [None]:
δ2 = sigmoid_derivative(s2) * (t - o2)
print("δ2:", δ2.T)

δ1 = relu_derivative(s1) * (W2.T @ δ2)
print("δ1:", δ1.T)

**(d)** Finish your training with an update step: apply the adaptation rule with a learning rate $\varepsilon=1$ to obtain the updated network.

In [None]:
ε = 1.0

ΔW1 = ε * δ1 @ x.T
print("ΔW1:\n", ΔW1)

ΔW2 = ε * δ2 @ o1.T
print("\nΔW2:\n", ΔW2)

In [None]:
W1 += ΔW1
print("Updated W1:\n", W1)

W2 += ΔW2
print("\nUpdated W2:\n", W2)

In [None]:
assert jnp.array_equal(W1, jnp.array([[-2.2500,  3.5000], [ 1.8125,  0.6250]]))
assert jnp.array_equal(W2, jnp.array([[ 4.1250, -0.5000], [-2.1250,  0.0000]]))

### The "jaxish" version

In [None]:
import jax
import jax.numpy as jnp

In [None]:
relu = lambda x: jnp.maximum(0,x)
sigmoid = lambda x: 1/(1+jnp.exp(-x))

layer1 = lambda W, x: relu(W @ x)
layer2 = lambda W, x: sigmoid(W @ x)

**(a)** Assume the input $\vec{x} = (1.0, 2.0)$ is given to the network (notice that in contrast to the lecture slides, we only consider a single input vector here, instead of a full dataset). Compute the weighted input $s_i(k)$ as well as the output values $o_i(k)$ for all neurons in the network.

In [None]:
W1 = jnp.array([[-3., 2.], [ 2., 1.]])
W2 = jnp.array([[4., -1.], [-2., .5]])
x = jnp.array([[1., 2.]]).T

print("o1:", layer1(W1, x).T)
print("o2:", layer2(W2, layer1(W1, x)).T)

**(b)** Compute the loss value for the predicted output, assuming that the target value is $\vec{t}=(1.0, 0.0)$.

In [None]:
t = jnp.array([[1., 0.]]).T

model = lambda W, x: layer2(W[1], layer1(W[0], x))

error_func = lambda x, t: ((x-t)**2).sum() / 2
neg_error = lambda W, x, t: -error_func(model(W, x), t)

print("negative error:", neg_error((W1, W2), x, t))

**(c)** Now perform backpropagation: compute the error signals $\delta_i(k)$ and the partial derivatives $\partial E/\partial w_{ik}$ for the weights in layers $k=2$ and $k=1$ (for layer $k=1$ remember to use the ReLU function, which has a quite simple derivative).

In [None]:
grad_error = jax.grad(neg_error)
W1_grad, W2_grad = grad_error((W1, W2), x, t)

print("-dE/dW2:\n", W2_grad)
print("-dE/dW1:\n", W1_grad)

In [None]:
assert jnp.array_equal(W2_grad, [[0.1250, 0.5000], [ -0.1250,  -0.5000]])
assert jnp.array_equal(W1_grad, [[0.7500, 1.5000], [ -0.1875,  -0.3750]])

**(d)** Finish your training with an update step: apply the adaptation rule with a learning rate $\varepsilon=1$ to obtain the updated network.

In [None]:
ε = 1.0

W1 += ε * W1_grad
print("Updated W1:\n", W1)

W2 += ε * W2_grad
print("\nUpdated W2:\n", W2)

In [None]:
assert jnp.array_equal(W1, jnp.array([[-2.2500,  3.5000], [ 1.8125,  0.6250]]))
assert jnp.array_equal(W2, jnp.array([[ 4.1250, -0.5000], [-2.1250,  0.0000]]))

### A Flax implementation

In [None]:
import jax
import jax.numpy as jnp
from flax import linen as nn

In [None]:
W1 = jnp.array([[-3., 2.], [ 2., 1.]])
W2 = jnp.array([[4., -1.], [-2., .5]])

net = nn.Sequential([
    nn.Dense(features=2, use_bias=False,
             kernel_init=nn.initializers.constant(W1.T)),
    nn.relu,
    nn.Dense(features=2, use_bias=False,
             kernel_init=nn.initializers.constant(W2.T)),
    nn.sigmoid
])

In [None]:
print(net.tabulate(jax.random.PRNGKey(0), jnp.ones((2, ))))

**(a)** Assume the input $\vec{x} = (1.0, 2.0)$ is given to the network (notice that in contrast to the lecture slides, we only consider a single input vector here, instead of a full dataset). Compute the weighted input $s_i(k)$ as well as the output values $o_i(k)$ for all neurons in the network.

In [None]:
x = jnp.array([1., 2.])

params = net.init(jax.random.PRNGKey(0), jnp.ones((2,)))
net.apply(params, x)

**(b)** Compute the loss value for the predicted output, assuming that the target value is $\vec{t}=(1.0, 0.0)$.

In [None]:
t = jnp.array([1., 0.])

error_func = lambda x, t: ((x-t)**2).sum() / 2

def neg_error(params, x, t):
    return -error_func(net.apply(params, x), t)

loss_grad_fn = jax.value_and_grad(neg_error)

In [None]:
loss_val, grads = loss_grad_fn(params, x, t)
print(loss_val)

**(c)** Now perform backpropagation: compute the error signals $\delta_i(k)$ and the partial derivatives $\partial E/\partial w_{ik}$ for the weights in layers $k=2$ and $k=1$ (for layer $k=1$ remember to use the ReLU function, which has a quite simple derivative).

In [None]:
print(grads)

**(d)** Finish your training with an update step: apply the adaptation rule with a learning rate $\varepsilon=1$ to obtain the updated network.

In [None]:
ε = 1.0
params = jax.tree_util.tree_map(lambda p, g: p +  ε * g, params, grads)
print(params)

In [None]:
assert jnp.array_equal(params['params']['layers_0']['kernel'].T,
                       jnp.array([[-2.2500,  3.5000], [ 1.8125,  0.6250]]))
assert jnp.array_equal(params['params']['layers_2']['kernel'].T,
                       jnp.array([[ 4.1250, -0.5000], [-2.1250,  0.0000]]))