Osnabrück University - Machine Learning (Summer Term 2024) - Prof. Dr.-Ing. G. Heidemann, Ulf Krumnack, Lukas Niehaus

# Exercise Sheet 08

## Introduction

This week's sheet should be solved and handed in before the end of **Sunday, June 9, 2024**. If you need help (and Google and other resources were not enough), ask in the forum, or contact your group's designated tutor or whichever of us you run into first. Please upload your results to your group's Stud.IP folder.

## Assignment 0: Math recap (Conditional Probability) [0 Points]

This exercise is supposed to be easy and is voluntary. There will be a similar exercise on every sheet. It is intended to review some basic mathematical notions that are assumed throughout this class and to allow you to check whether you are comfortable with them. Usually you should have no problem answering these questions offhand, but if you feel unsure, this is a good time to look them up again. You are always welcome to discuss questions with the tutors or in the practice session. Also, if you have a (math) topic you would like to recap, please let us know.

**a)** Explain the idea of conditional probability. How is it defined?

Conditional probability is the probability that an event A happens, given that another event B has happened.
For example:
The probability of rain is $$P(weather="rain") = 0.3$$ But if you observe that the street is wet, you get the conditional probability $$P(weather= "rain" |~ street="wet") = 0.95$$
The definition (for $P(B) > 0$) is:
$$ P(A|B) = \frac{P(A,B)}{P(B)} $$

**b)** What is Bayes' theorem? What are its applications?

Bayes' theorem states:
$$ P(B|A) = \frac{P(A|B) \cdot P(B)}{P(A)} $$

The most important application is in reasoning backwards from event to cause (from data to parameters of your distribution):

$$ P(\Theta|Data) = \frac{P(Data|\Theta)P(\Theta)}{P(Data)}$$

**c)** What does the law of total probability state? 

The law of total probability states that the probability of an event is the sum of the probabilities of this event occurring jointly with each possible state of another event:
$$P(A) = \sum_b P(A,B=b) = \sum_b P(A|B=b) P(B=b)$$
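
As a numerical illustration of the three rules above, the following sketch uses a small joint distribution over weather and street state. The joint probabilities are hypothetical values, chosen only so that they reproduce the numbers $0.3$ and $0.95$ from the example in a); the variable names are likewise our own.

```python
import numpy as np

# Hypothetical joint distribution P(weather, street).
# Rows: weather = rain / no rain, columns: street = wet / dry.
joint = np.array([[0.285, 0.015],
                  [0.015, 0.685]])

# Law of total probability: marginalize over the other variable.
p_rain = joint[0, :].sum()   # P(rain) = sum over street states
p_wet = joint[:, 0].sum()    # P(wet)  = sum over weather states

# Definition of conditional probability: P(A|B) = P(A,B) / P(B).
p_rain_given_wet = joint[0, 0] / p_wet
p_wet_given_rain = joint[0, 0] / p_rain

# Bayes' theorem: P(rain|wet) = P(wet|rain) * P(rain) / P(wet).
p_rain_given_wet_bayes = p_wet_given_rain * p_rain / p_wet

print("P(rain)             =", round(p_rain, 3))
print("P(rain | wet)       =", round(p_rain_given_wet, 3))
print("P(rain | wet), Bayes =", round(p_rain_given_wet_bayes, 3))
```

Both routes to $P(\text{rain} \mid \text{wet})$ give the same value, as Bayes' theorem requires.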

## Assignment 1: The Logic Perceptron: XOR (3 points)

**a)** Explain in your own words, why the XOR, in contrast to AND, OR, and NAND, can not be implemented by a single perceptron. What other logical operators face the same problem?

The perceptron realizes a linear binary classifier, that is, it discriminates between positive and negative points using a hyperplane. Hence it can only implement logical functions whose positive and negative outputs are linearly separable. This is the case for AND, OR, and NAND, but not for XOR.

The equivalence (X1 <=> X2) is another example of a logical operation not implementable by a single perceptron.

**b)** Create two multi-layer perceptrons that encode the XOR function, applying the solutions sketched on (ML-7 slide 37): the first should distort the input space, while the second should add another axis. Explain the operation of your MLP on a geometric level. What is the minimal number of units to be placed in the hidden layer?


```
         +1    >0.5
    x1 ------ h1
       \    /   \
      -1\  /     \+1
         \/       \ out >1.5
         /\       /
      +1/  \     /+1
       /    \   /
    x2 ------ h2
        -1     >-1.5
```
This MLP computes h1 as (x1 OR x2) and h2 as (x1 NAND x2), that is h2 = NOT(x1 AND x2), and out is (h1 AND h2).
Hence out = (x1 OR x2) AND NOT (x1 AND x2) = x1 XOR x2.

Geometrically, the first layer of the MLP maps the two-dimensional input space (x1,x2) into another two-dimensional space (h1,h2) by applying a nonlinear transformation $f$. 

  | (x1,x2)  | (h1,h2) | out |
  |----------|---------|-----|
  | (0,0)    | (0,1)   |  0  |
  | (0,1)    | (1,1)   |  1  |
  | (1,0)    | (1,1)   |  1  |
  | (1,1)    | (1,0)   |  0  |

In the hidden space, the points (0,1) and (1,0) (the images of the inputs (0,0) and (1,1)) are linearly separable from the point (1,1) (the image of both (0,1) and (1,0)), e.g. by the line h1+h2=1.5.
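
The hidden activations and the output of this network can be computed mechanically. The sketch below evaluates the network above with threshold units; the helper `step` and the array names are our own choices for illustration and are not part of the lecture code.

```python
import numpy as np

def step(s, threshold):
    """Threshold unit: fires (1) if the weighted input exceeds the threshold."""
    return (s > threshold).astype(int)

# All four input combinations (x1, x2).
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

# Hidden layer: h1 = (x1 + x2 > 0.5) is OR, h2 = (-x1 - x2 > -1.5) is NAND.
W_hidden = np.array([[1, 1],
                     [-1, -1]])
theta_hidden = np.array([0.5, -1.5])
H = step(X @ W_hidden.T, theta_hidden)

# Output unit: out = (h1 + h2 > 1.5) is AND.
out = step(H @ np.array([1, 1]), 1.5)

for x, h, o in zip(X, H, out):
    print(tuple(x), "->", tuple(h), "->", o)
# (0, 0) -> (0, 1) -> 0
# (0, 1) -> (1, 1) -> 1
# (1, 0) -> (1, 1) -> 1
# (1, 1) -> (1, 0) -> 0
```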

```
       +1   >0.5
    x1 --- h1
       \       \+1
      +1\       \
         \    -2 \
           h3 --- out >.5
         /  >1.5 /
      +1/       /+1
       /       /
    x2 --- h2
       +1   >0.5
```

  | (x1,x2)  | (h1,h2,h3) | out |
  |----------|------------|-----|
  | (0,0)    | (0,0,0)    |  0  |
  | (0,1)    | (0,1,0)    |  1  |
  | (1,0)    | (1,0,0)    |  1  |
  | (1,1)    | (1,1,1)    |  0  |

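In the hidden space, the points (0,1,0) and (1,0,0) are linearly separable from the points (0,0,0) and (1,1,1) by the plane h1 + h2 - 2·h3 = 0.5. Here h1 and h2 merely copy the inputs, so the unit h3 = (x1 AND x2) is what adds the extra axis.

In a strictly layered network (no direct input-to-output connections), two hidden units are the minimum, as in the first solution: with a single hidden unit the output would depend on the input only through one linear threshold, and the resulting function would again be linearly separable. If direct connections from the inputs to the output are allowed, the single hidden unit h3 suffices.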
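
The second network can be evaluated in the same way. Again, this is only a sketch with threshold units; the helper `step` and the array names are assumptions made for illustration.

```python
import numpy as np

def step(s, threshold):
    """Threshold unit: fires (1) if the weighted input exceeds the threshold."""
    return (s > threshold).astype(int)

# All four input combinations (x1, x2).
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

# Hidden layer: h1 copies x1, h2 copies x2, h3 = (x1 + x2 > 1.5) is AND.
W_hidden = np.array([[1, 0],
                     [0, 1],
                     [1, 1]])
theta_hidden = np.array([0.5, 0.5, 1.5])
H = step(X @ W_hidden.T, theta_hidden)

# Output unit: out = (h1 + h2 - 2*h3 > 0.5).
out = step(H @ np.array([1, 1, -2]), 0.5)

for x, h, o in zip(X, H, out):
    print(tuple(x), "->", tuple(h), "->", o)
# (0, 0) -> (0, 0, 0) -> 0
# (0, 1) -> (0, 1, 0) -> 1
# (1, 0) -> (1, 0, 0) -> 1
# (1, 1) -> (1, 1, 1) -> 0
```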

## Assignment 2: Perceptron (6 points)

In this exercise you will implement a simple perceptron as described in the lecture [ML-07 Slide 31]. As in previous exercises, you do not have to use our premade code blocks; you can also write the single perceptron completely from scratch (an empty cell to do so can be found [below](#Own-Implementation)). 

Use the following output function:
$$y = \begin{cases}1 \quad \text{if} \ s \ge 0\\0 \quad \text{else}\end{cases}$$

The `TODO`'s in the following code segments guide you through what has to be done.

*Hint*: If you have problems with `np.arrays` (which usually have shapes like `(13,)`, thus with one degenerate dimension), either set the shapes manually (`my_np_array.shape = (13, 1)`) or use [np.atleast_2d](https://numpy.org/doc/stable/reference/generated/numpy.atleast_2d.html). Other useful functions might be
* [lambda functions](https://docs.python.org/3/reference/expressions.html#lambda)
* [np.hstack](https://numpy.org/doc/stable/reference/generated/numpy.hstack.html?)
* [np.append](https://numpy.org/doc/stable/reference/generated/numpy.append.html?highlight=append#numpy.append)
* [np.apply_along_axis](https://numpy.org/doc/stable/reference/generated/numpy.apply_along_axis.html)
* [try except](https://docs.python.org/3/tutorial/errors.html?highlight=try%20except#handling-exceptions)
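
The shape issue mentioned in the hint can be seen in a few lines. This is only a small illustration with made-up arrays (`v`, `row`, `col`, `padded` are hypothetical names), not part of the exercise code.

```python
import numpy as np

v = np.arange(13)             # 1-D array with shape (13,)
row = np.atleast_2d(v)        # promoted to a row vector with shape (1, 13)
col = v.reshape(-1, 1)        # explicit column vector with shape (13, 1)

print(v.shape, row.shape, col.shape)   # (13,) (1, 13) (13, 1)

# np.hstack joins arrays along the second axis, e.g. to prepend a bias column.
padded = np.hstack((np.ones((4, 1)), np.zeros((4, 2))))
print(padded.shape)                    # (4, 3)
```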


In [None]:
import numpy as np
import numpy.random as rnd

# TODO: Write the input activation (called net_input) and the output function (called out_fun).
### BEGIN SOLUTION
# The net input function (weighted input signals)
net_input = lambda d, w: d @ w.T

# The output function determines the output of the neuron (1 if x >= 0 else 0).
out_fun = lambda x: float(x >= 0)
### END SOLUTION


# TODO: Write a function generate_weights that generates N (= number of dimensions) + 1 (w_0) random weights.
### BEGIN SOLUTION
generate_weights = lambda dims: rnd.rand(dims + 1)
### END SOLUTION

In [None]:
####################################################
## Testing the perceptron with a concrete example ##
####################################################

# Dimensions for our test.
dims = 12

# Input is a 1-D vector with a leading 1 for the bias. (Shape is (13,).)
D = np.hstack((1, rnd.rand(dims) - 0.5))

# Weights are stored in a vector.
W = generate_weights(dims)

out = out_fun(net_input(D, W))

assert out == 1 or out == 0, "The output has to be either 1 or 0, but was {}".format(out)

The following `eval_network(t, D, W)` function is used to measure the performance of your perceptron for the upcoming task.

In [None]:
def eval_network(t, D, W):
    """
    This function takes the trained weights of a perceptron
    and the input data (D) as well as the correct target values (t)
    and computes the overall error rate of the perceptron.
    """
    error = 0.0
    size = D.shape[0]  # number of samples (one per row of D)
    for i in range(size):
        out = out_fun(net_input(D[i], W))
        error = error + abs(t[i] - out)
    # Normalize the error.
    try:
        return error.item(0) / size
    except AttributeError:
        return error / size

Now we will use the functions defined above to train the perceptron on one of the following logical functions: AND, OR, NAND or NOR. 

In [None]:
# Plotting functions
import matplotlib.pyplot as plt

def function_to_learn(selector, function):
    """
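    Plots the four points of the truth table of the logical function to
    learn (blue = True, red = False) as a backdrop for the decision boundary.
    :param selector: name of the logical function (only used for printing)
    :param function: the logical function, mapping (x1, x2) to a truth value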
    """
    plot_points = [[0,0],[0,1],[1,0],[1,1]]
    plot_colors = []

    for point in plot_points:
        plot_colors.append(function(point[0], point[1]))
    for color, point in enumerate(plot_points):
        plt.scatter(*point, s=50, c='b' if plot_colors[color] == 1 else 'r')
    print("Perceptron will now learn '{}'...\n\n".format(selector))

In [None]:
import matplotlib.pyplot as plt

# Change this line to choose other operators:
op = 'and'


###################################################
## Now we train our perceptron! [ML-07 Slide 33] ##
###################################################

# TODO: Write the update function (name it 'delta_fun')
#       for the weights dependent on epsilon, the target,
#       the output and the input vector.
### BEGIN SOLUTION
delta_fun = lambda ϵ, t, y, x: ϵ * (t - y) * x
### END SOLUTION

# TODO: Define suitable parameters for your problem.
# Use the following names:
#   ϵ: learning rate
#   dims: dimensions
#   training_size: the number of training samples
### BEGIN SOLUTION
ϵ = 0.1
dims = 2
training_size = 1000
### END SOLUTION

# TODO: Generate the weights (in a variable called W).
### BEGIN SOLUTION
W = generate_weights(dims)
### END SOLUTION

# TODO: Generate a matrix D of truthvalue pairs.
# The shape should be (training_size, dims).
### BEGIN SOLUTION
D = rnd.rand(training_size, dims) > 0.5
### END SOLUTION

# TODO: Pad the input D with ones for the bias. The bias should always be
# w_0, i. e. the first column of the data should be ones.
### BEGIN SOLUTION
D = np.hstack((np.ones((training_size, 1)), D))
### END SOLUTION

# Learn one of the logical functions AND, OR, NAND, NOR
# (XOR is not linearly separable, so a single perceptron cannot learn it;
# the lambda keyword is just a short way to define functions).
log_operators = {
    'and': lambda x1, x2: x1 and x2,
    'or': lambda x1, x2: x1 or x2,
    'nand': lambda x1, x2: not (x1 and x2),
    'nor': lambda x1, x2: not (x1 or x2),
    'xor': lambda x1, x2: (x1 and not x2) or (not x1 and x2)
}

log_operator = log_operators[op]
function_to_learn(op, log_operator)

row_operator = lambda row: log_operator(row[0], row[1])
labels = np.apply_along_axis(row_operator, 1, D[:, 1:])

epochs = 200    # Extra question: What effects do changes in the epochs 
samp_size = 5  #                 and sample sizes have on our training?

for i in range(epochs):
    # Sample random from the training data.
    for idx in rnd.choice(range(training_size), samp_size, replace=False):
        y = out_fun(net_input(D[idx], W))
        W += delta_fun(ϵ, labels[idx], y, D[idx])
    # Plotting code: the decision boundary W[0] + W[1]*x1 + W[2]*x2 = 0 is the
    # line x2 = slope * x1 + intercept (this assumes W[2] != 0).
    slope = -W[1] / W[2]
    intercept = -W[0] / W[2]
    x = np.linspace(-10, 10, 100)
    # Boundaries from later epochs are drawn more opaque.
    plt.plot(x, slope * x + intercept, 'g--', linewidth=3, alpha=min(i / epochs + .2, 1))
    
plt.ylim([-.2, 1.2])
plt.xlim([-.2, 1.2])
plt.title("Logic Perceptron (Blue=True)")
plt.xlabel("True(1) or False(0)")
plt.ylabel("True(1) or False(0)")
plt.show()

# Print the overall performance of the Perceptron.
print("Overall error of the Perceptron: {:.2%}".format(eval_network(labels, D, W)))

### Own Implementation

Skip this if you already implemented the perceptron above.

In [None]:
# Space for complete own implementation

### BEGIN SOLUTION
###################################################
## A more object oriented approach.              ##
###################################################
class Perceptron():
    
    def __init__(self, num_inputs, learning_rate=0.1):
        
        self._weights = rnd.normal(size=num_inputs + 1)
        self._learning_rate = learning_rate
    
    def train(self, train_data, train_labels, epochs=1, sample_size=5):
        
        for _ in range(epochs):
            # Sample random from the training data.
            for idx in rnd.choice(range(train_data.shape[0]), sample_size, replace=False):
                self._train_step(train_data[idx], train_labels[idx])
        
    def predict(self, data):
        return self._output(self._activation(data))
    
    def evaluate(self, test_data, test_labels):
        prediction = self.predict(test_data)
        error = (np.abs(prediction - test_labels)).mean()

        return error
        
    def _activation(self, inputs):
        return self._weights @ inputs.T
    
    def _output(self, activation):
        return (activation >= 0).astype('int32')
    
    def _train_step(self, data, label):
        self._weights += self._learning_rate * (label - self.predict(data)) * data
        
        
###################################################
## Create training data.                         ##
###################################################
dims = 2
training_size = 2000

data = rnd.randint(low=0, high=2, dtype='int32', size=(training_size, dims))
# Pad the input data with ones for the bias.
data = np.column_stack((np.ones(training_size, dtype='int32'), data))

logical_operator = lambda x1, x2: x1 and x2
# logical_operator = lambda x, y: not (x and y)
# logical_operator = lambda x, y: not (x or y)

row_operator = lambda row: logical_operator(row[0], row[1])
labels = np.apply_along_axis(row_operator, 1, data[:, 1:])

# Split into test and training sets.
train_data = data[:-5]
train_labels = labels[:-5]
test_data = data[-5:]
test_labels = labels[-5:]

###################################################
## Train and evaluate.                           ##
###################################################
perceptron = Perceptron(num_inputs=2)
perceptron.train(train_data, train_labels, epochs=20, sample_size=50)
perceptron.evaluate(test_data, test_labels)
### END SOLUTION

## Assignment 3: Sigmoid Activation & Backpropagation Delta Functions (5 points)

In this exercise we are first going to take the derivative of a famous activation function - the sigmoid function:

$$\sigma(t)=\frac{1}{1+e^{-t}}$$

This function is commonly used because of its nice analytical properties: Its image is the interval $(0,1)$, it is non-linear, strictly monotonic, continuous, differentiable and the derivative can be expressed in terms of the original function at the given point. This allows us to avoid redundant calculations. The sigmoid function is a special case of the more general *logistic function*, which can be found in many different fields: biology, chemistry, economics, demography and, recently most prominently, artificial neural networks.

**(a)** Computing the derivative of the sigmoid activation function:

Prove that

$$
\frac{\partial \sigma}{\partial t} = \frac{1}{\left({1 + e^{-t}}\right)} \cdot \frac{e^{-t}}{\left({1 + e^{-t}}\right)}
$$

and that it can be rewritten as an expression in terms of $\sigma(t)$ resulting in:

$$
\frac{\partial \sigma}{\partial t} = \sigma(t) \left(1 - \sigma(t) \right)
$$

$$\newcommand{\e}{\mathrm{e}}
\begin{align}
\frac{\partial \sigma}{\partial t} &= \frac{\partial}{\partial t} \frac{1}{1 + \e^{-t}}\\
&= \frac{\partial}{\partial t} \left({1 + \e^{-t}}\right)^{-1}\\
&= \left.- \left({1 + \e^{-t}}\right)^{-2} \cdot (- \e^{-t}) ~\right\vert~ \text{by chain rule}\\
&= \frac{\e^{-t}}{\left({1 + \e^{-t}}\right)^{2}}\\
&= \frac{1}{\left({1 + \e^{-t}}\right)} \cdot \frac{\e^{-t}}{\left({1 + \e^{-t}}\right)}\\
&= \sigma(t) \cdot \frac{\e^{-t}}{\left({1 + \e^{-t}}\right)}\\
&= \left.\sigma(t) \cdot \frac{(1+\e^{-t}) - 1}{\left({1 + \e^{-t}}\right)}~\right\vert~ 1-1=0\\
&= \sigma(t) \cdot \left( \frac{(1+\e^{-t})}{\left({1 + \e^{-t}}\right)} - \frac{1}{\left({1 + \e^{-t}}\right)} \right)\\
&= \sigma(t) \left(1 - \sigma(t) \right)\\
\end{align}$$
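
A quick numerical sanity check of this result compares the closed form $\sigma(t)(1-\sigma(t))$ with a central finite difference. The function names, the test points and the step size `h` are our own choices for this sketch.

```python
import numpy as np

def sigmoid(t):
    """Standard logistic function."""
    return 1.0 / (1.0 + np.exp(-t))

def sigmoid_prime(t):
    """Derivative expressed through the function value itself."""
    s = sigmoid(t)
    return s * (1.0 - s)

t = np.linspace(-5.0, 5.0, 11)   # a few test points
h = 1e-5                         # step size of the finite difference

# Central difference approximation of d(sigma)/dt.
numeric = (sigmoid(t + h) - sigmoid(t - h)) / (2 * h)
max_abs_diff = np.max(np.abs(numeric - sigmoid_prime(t)))

print("sigma'(0)    =", sigmoid_prime(0.0))   # 0.25
print("max abs diff =", max_abs_diff)
```

The difference is on the order of the approximation error of the finite difference, which supports the derivation. Note that `sigmoid_prime` needs only one evaluation of the exponential, which is the saving mentioned above.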

**(b)** Multilayer perceptrons (MLPs) can be regarded as a simple concatenation (and parallelization) of several perceptrons, each having a specified activation function $\sigma$ and a set of weights $\mathbf{w}_{ij}$. The idea that this can be done was discovered early after the invention of the perceptron, but people didn't really use it in practice because nobody really knew how to figure out the appropriate $\mathbf{w}_{ij}$. The solution to this problem was the discovery of the backpropagation algorithm which consists of two steps: first propagating the input forward through the layers of the MLP and storing the intermediate results and then propagating the error backwards and adjusting the weights of the units accordingly.

An updating rule for the output layer can be derived in a straightforward way. The rules for the intermediate layers can be derived very similarly and only require a slight shift in perspective. The mathematics for that is, however, not in the standard toolkit, so we are going to omit the calculations and refer you to the lecture slides.

We take the (halved) least-squares approach to derive the updating rule, i.e. we want to minimize the loss function
$$L = \frac{1}{2}(y-t)^2$$
where t is the given (true) label from the dataset and y is the (single) output produced by the MLP. To find the weights that minimize this expression we want to take the derivative of $L$ w.r.t. $\mathbf{w}_{i}$ where we are now going to assume that the $\mathbf{w}_{i}$ are the ones directly before the output layer:
$$y = \sigma\left(\sum_{k=1}^n \mathbf{w}_{k}o_k\right)$$
Calculate $\frac{\partial L}{\partial \mathbf{w}_{i}}$.

*Hint*: Start here if you don't know what to do: $\frac{\partial L}{\partial \mathbf{w}_{i}} = \frac{\partial L}{\partial y}\frac{\partial y}{\partial \mathbf{w}_{i}}$

*Hint*: Remember that the derivative of the sigmoid activation function in its general form (from part **a**) is defined as: $\frac{\partial \sigma}{\partial t} = \sigma(t) \left(1 - \sigma(t) \right)$ 

$$\begin{align}
\frac{\partial L}{\partial \mathbf{w}_{i}} &= \frac{\partial L}{\partial y}\frac{\partial y}{\partial \mathbf{w}_{i}}\\
&=\frac{\partial}{\partial y} \frac{1}{2} (y - t)^2 \cdot \frac{\partial}{\partial \mathbf{w}_{i}} \sigma\left(\sum_{k=1}^n \mathbf{w}_{k}o_k\right)\\
&=\left.\left((y-t) \cdot 1\right) \cdot \left(\sigma\left(\sum_{k=1}^n \mathbf{w}_{k}o_k\right)\left(1-\sigma\left(\sum_{k=1}^n \mathbf{w}_{k}o_k\right)\right) \cdot o_i \right) ~\right\vert~ \text{by chain rule}\\
&=\left.(y-t) \cdot y \cdot (1 -y) \cdot o_i ~\right\vert~ y = \sigma\left(\sum_{k=1}^n \mathbf{w}_{k}o_k\right)\\
\end{align}$$
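
The resulting expression $(y-t)\,y\,(1-y)\,o_i$ can be checked against a numerical gradient. The values of `o`, `w` and `t` below are hypothetical and serve only as a test case; the finite-difference step `h` is likewise an assumption.

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def loss(w, o, t):
    """Halved squared error of a single sigmoid output unit."""
    y = sigmoid(w @ o)
    return 0.5 * (y - t) ** 2

# Hypothetical test case: outputs of the previous layer, weights, target.
o = np.array([0.5, -1.0, 2.0])
w = np.array([0.1, 0.4, -0.3])
t = 1.0

# Analytic gradient from the derivation: (y - t) * y * (1 - y) * o_i.
y = sigmoid(w @ o)
analytic = (y - t) * y * (1.0 - y) * o

# Numerical gradient: central difference along each weight direction.
h = 1e-6
numeric = np.array([
    (loss(w + h * e, o, t) - loss(w - h * e, o, t)) / (2 * h)
    for e in np.eye(len(w))
])

print("analytic:", analytic)
print("numeric: ", numeric)
```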

## Assignment 4: Training a MLP by hand (6 points)

Consider the following multilayer perceptron (notation from ML-7 slides 46ff), consisting of an input layer (layer $k=0$, with two neurons 1 & 2), a hidden layer ($k=1$ with two neurons 3 & 4) and an output layer ($k=2$ with two neurons 5 & 6).  The connection weights are given by the following image and connectivity matrix:

![mlp-large.png](mlp-large.png)

to\from|1  |2  |3  |4  |5  |6
-------|---|---|---|---|---|---
1      |-  |-  |-  |-  |-  |-
2      |-  |-  |-  |-  |-  |-
3      |-3 |2  |-  |-  |-  |-
4      |2  |1  |-  |-  |-  |-
5      |-  |-  |4  |-1 |-  |-
6      |-  |-  |-2 |0.5|-  |-

The hidden layer (neurons 3 & 4) applies the [rectifier](https://en.wikipedia.org/wiki/Rectifier_(neural_networks)) as activation function.
$$
    \varphi_{R}(x)=\max(0,x)
$$

The output layer (neurons 5 & 6) uses the sigmoid ([standard logistic function](https://en.wikipedia.org/wiki/Logistic_function), [Fermi function](https://en.wikipedia.org/wiki/Fermi%E2%80%93Dirac_statistics)) as activation function.
$$
    \varphi_{S}(x)={\frac {1}{1+e^{-x}}}
$$

To measure the error, following the lecture (ML-7 slide 48), the (halved) [Mean Squared Error](https://en.wikipedia.org/wiki/Mean_squared_error) is used, that is
$$E[\{w\}](\vec{t},\vec{y}) = 
\tfrac{1}{2}\left\|\vec{t}-\vec{y}\right\|_2^2 =
\frac{1}{2}\sum_{i=1}^{d}(t_i-y_i)^2$$
with $\vec{y}$ being the values predicted by the network, $\vec{t}$ the target value ("ground truth"), and $d=2$ the dimensionality of the output space.

**(a)** Assume the input $\vec{x} = (1.0, 2.0)$ is given to the network (notice that in contrast to the lecture slides, we only consider a single input vector here, instead of a full dataset). Compute the weighted input $s_i(k)$ as well as the output values $o_i(k)$ for all neurons in the network.

Computing the weighted input for layer 1:
\begin{align*}
  s_3(1) &= \sum_{j=1}^{2} w_{3j}(1,0)o_j(0) && = -3\cdot 1 + 2\cdot 2 = -3 + 4 && = 1 \\
  s_4(1) &= \sum_{j=1}^{2} w_{4j}(1,0)o_j(0) && = 2\cdot 1 + 1\cdot 2 = 2+2 && = 4
\end{align*}
The outputs of layer 1 are hence:
\begin{align*}
  o_3(1) &= \varphi_{R}(s_3(1)) &&= \max(0,1) && = 1 \\
  o_4(1) &= \varphi_{R}(s_4(1)) &&= \max(0,4) && = 4 
\end{align*}

For layer 2 we then get the following weighted input:
\begin{align*}
  s_5(2) &= \sum_{j=3}^{4} w_{5j}(2,1)o_j(1) &&= 4\cdot 1 - 1\cdot 4 &&= 0 \\
  s_6(2) &= \sum_{j=3}^{4} w_{6j}(2,1)o_j(1) &&= -2\cdot 1 + 0.5\cdot 4 && = 0 
\end{align*}
The outputs of layer 2 (which are also the network outputs) are:
\begin{align*}
  o_5(2) &= \varphi_{S}(s_5(2)) = \sigma(0) = 0.5 \\
  o_6(2) &= \varphi_{S}(s_6(2)) = \sigma(0) = 0.5
\end{align*}
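
The forward pass can be reproduced with a few lines of NumPy. In this sketch the matrices `W1` and `W2` hold the rows of the connectivity matrix for neurons 3 & 4 and 5 & 6, respectively; these names, like `relu` and `sigmoid`, are our own and not notation from the lecture.

```python
import numpy as np

def relu(s):
    return np.maximum(0.0, s)

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

# Weights from the connectivity matrix (rows: target neuron, columns: source).
W1 = np.array([[-3.0, 2.0],    # neuron 3 <- neurons 1, 2
               [2.0, 1.0]])    # neuron 4 <- neurons 1, 2
W2 = np.array([[4.0, -1.0],    # neuron 5 <- neurons 3, 4
               [-2.0, 0.5]])   # neuron 6 <- neurons 3, 4

x = np.array([1.0, 2.0])       # input, i.e. o_1(0) and o_2(0)

s1 = W1 @ x        # weighted inputs of the hidden layer
o1 = relu(s1)      # hidden outputs
s2 = W2 @ o1       # weighted inputs of the output layer
o2 = sigmoid(s2)   # network outputs

print("s(1) =", s1)   # [1. 4.]
print("o(1) =", o1)   # [1. 4.]
print("s(2) =", s2)   # [0. 0.]
print("o(2) =", o2)   # [0.5 0.5]
```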

**(b)** Compute the loss value for the predicted output, assuming that the target value is $\vec{t}=(1.0, 0.0)$.

The loss value is given by the (halved) mean squared error between the network output $\vec{y}=(0.5, 0.5)$ and the target value $\vec{t}=(1.0,0.0)$:
\begin{align*}
  E[\{w\}](\vec{t},\vec{y}) 
  & = \tfrac12\|\vec{t}-\vec{y}\|^2\\
  & = \tfrac12\sum_{i=1}^{2}(t_i-y_i)^2\\
  & = \tfrac12\left[(1-0.5)^2 + (0-0.5)^2\right]\\
  & = \tfrac12\left[0.25 + 0.25\right]\\
  & = 0.25
\end{align*}

**(c)** Now perform backpropagation: compute the error signals $\delta_i(k)$ and the partial derivatives $\partial E/\partial w_{ik}$ for the weights in layer $k=2$ and $k=1$ (for layer $k=1$ remember to use the ReLU function, which has a quite simple derivative).

For the output layer we get the following error signal:
\begin{align*}
  \delta_5(2) & = \varphi_{S}'(s_5)\cdot (t_1-y_1(\vec{x})) && = \sigma'(0)\cdot(1.0-0.5) && = \sigma(0)(1-\sigma(0))\cdot 0.5 && = .5 \cdot .5 \cdot .5 && = 0.125 \\
  \delta_6(2) & = \varphi_{S}'(s_6)\cdot (t_2-y_2(\vec{x})) && = \sigma'(0)\cdot(0.0-0.5) && = \sigma(0)(1-\sigma(0))\cdot (-0.5) && = .5 \cdot .5 \cdot (-.5) && = -0.125
\end{align*}
From this we can obtain the second layer weight gradients:
\begin{align*}
  -\partial E/\partial w_{53}(2,1) & = \delta_5(2)o_3(1) &&= 0.125 \cdot 1 &&= 0.125 \\
  -\partial E/\partial w_{54}(2,1) & = \delta_5(2)o_4(1) &&= 0.125 \cdot 4 &&= 0.5 \\
  -\partial E/\partial w_{63}(2,1) & = \delta_6(2)o_3(1) && = -0.125 \cdot 1 &&= -0.125 \\
  -\partial E/\partial w_{64}(2,1) & = \delta_6(2)o_4(1) && = -0.125 \cdot 4 &&= -0.5 
\end{align*}

For layer 1 the error signal is:
\begin{align*}
  \delta_3(1) &= \varphi_{R}'(s_3(1))\cdot\sum_{j=5}^{6}w_{j3}(2,1)\delta_j(2) && = \varphi_{R}'(1)\cdot\left[4\cdot 0.125 + (-2)\cdot(-0.125)\right] &&= 1\cdot [0.5 + 0.25] &&= 0.75\\
  \delta_4(1) &= \varphi_{R}'(s_4(1))\cdot\sum_{j=5}^{6}w_{j4}(2,1)\delta_j(2) && = \varphi_{R}'(4)\cdot\left[(-1)\cdot 0.125 + 0.5\cdot(-0.125)\right] &&= 1\cdot [-0.125-0.0625] &&= -0.1875
\end{align*}
yielding the following gradients:
\begin{align*}
  -\partial E/\partial w_{31}(1,0) &= \delta_3(1)o_1(0) &&= 0.75 \cdot 1.0 && = 0.75 \\
  -\partial E/\partial w_{32}(1,0) &= \delta_3(1)o_2(0) &&= 0.75 \cdot 2.0 && = 1.5 \\
  -\partial E/\partial w_{41}(1,0) &= \delta_4(1)o_1(0) &&= -0.1875 \cdot 1.0 &&= -0.1875 \\
  -\partial E/\partial w_{42}(1,0) &= \delta_4(1)o_2(0) &&= -0.1875 \cdot 2.0 &&= -0.375
\end{align*}
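
The error signals and gradients can be cross-checked numerically. The sketch below repeats the forward pass and then applies the backpropagation formulas used above; the matrix names `W1`, `W2` and the variables `neg_grad_W1`, `neg_grad_W2` (holding $-\partial E/\partial w$) are our own choices.

```python
import numpy as np

def relu(s):
    return np.maximum(0.0, s)

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

W1 = np.array([[-3.0, 2.0], [2.0, 1.0]])    # neurons 3, 4 <- neurons 1, 2
W2 = np.array([[4.0, -1.0], [-2.0, 0.5]])   # neurons 5, 6 <- neurons 3, 4
x = np.array([1.0, 2.0])                    # input
t = np.array([1.0, 0.0])                    # target

# Forward pass, storing the intermediate results.
s1 = W1 @ x
o1 = relu(s1)
y = sigmoid(W2 @ o1)
E = 0.5 * np.sum((t - y) ** 2)

# Output layer: delta = sigma'(s) * (t - y), with sigma'(s) = y * (1 - y).
delta2 = y * (1.0 - y) * (t - y)
neg_grad_W2 = np.outer(delta2, o1)          # entry [j, i] = delta_j(2) * o_i(1)

# Hidden layer: ReLU'(s) is 1 for s > 0 and 0 otherwise.
delta1 = (s1 > 0).astype(float) * (W2.T @ delta2)
neg_grad_W1 = np.outer(delta1, x)           # entry [j, i] = delta_j(1) * o_i(0)

print("E      =", E)          # 0.25
print("delta2 =", delta2)     # [ 0.125 -0.125]
print("delta1 =", delta1)     # [ 0.75   -0.1875]
print(neg_grad_W2)            # [[ 0.125  0.5  ] [-0.125 -0.5  ]]
print(neg_grad_W1)            # [[ 0.75    1.5   ] [-0.1875 -0.375 ]]
```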

**(d)** Finish your training with an update step: apply the adaptation rule with a learning rate $\varepsilon=1$ to obtain the updated network.

The adaptation rule is

$$ w_{ji}(k+1,k) \mapsto w_{ji}(k+1,k) + \Delta w_{ji}(k+1,k)$$

with the update term

$$ \Delta w_{ji}(k+1,k) = - \varepsilon \partial E/\partial w_{ji}(k+1,k) = \varepsilon \delta_j(k+1)o_i(k)$$

For the first layer that is
\begin{align*}
  w_{31}(1,0): &&-3 \mapsto & -3 + 1\cdot 0.75   && = -2.25 \\
  w_{32}(1,0): && 2 \mapsto &  2 + 1\cdot1.5    && = 3.5 \\
  w_{41}(1,0): && 2 \mapsto &  2 + 1\cdot(-0.1875) && = 1.8125 \\
  w_{42}(1,0): && 1 \mapsto &  1 + 1\cdot(-0.375)  && = 0.625
\end{align*}
and for the second layer
\begin{align*}
  w_{53}(2,1): &&  4 \mapsto &    4 + 1\cdot 0.125    && = 4.125 \\
  w_{54}(2,1): && -1 \mapsto &   -1 + 1\cdot 0.5      && = -0.5 \\
  w_{63}(2,1): && -2 \mapsto &   -2 + 1\cdot (-0.125) && = -2.125 \\
  w_{64}(2,1): && 0.5 \mapsto & 0.5 + 1\cdot (-0.5)   && = 0.0
\end{align*}
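
The complete training step can be put together in one short script. This is a sketch under the same assumptions as the hand calculation (single input, $\varepsilon = 1$); the helper functions `forward` and `loss` and the matrix names are hypothetical and chosen only for this illustration.

```python
import numpy as np

def relu(s):
    return np.maximum(0.0, s)

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def forward(W1, W2, x):
    """Returns hidden weighted inputs, hidden outputs and network outputs."""
    s1 = W1 @ x
    o1 = relu(s1)
    return s1, o1, sigmoid(W2 @ o1)

def loss(t, y):
    """Halved squared error."""
    return 0.5 * np.sum((t - y) ** 2)

W1 = np.array([[-3.0, 2.0], [2.0, 1.0]])    # neurons 3, 4 <- neurons 1, 2
W2 = np.array([[4.0, -1.0], [-2.0, 0.5]])   # neurons 5, 6 <- neurons 3, 4
x = np.array([1.0, 2.0])
t = np.array([1.0, 0.0])
eps = 1.0                                   # learning rate

# Forward and backward pass.
s1, o1, y = forward(W1, W2, x)
delta2 = y * (1.0 - y) * (t - y)
delta1 = (s1 > 0) * (W2.T @ delta2)

# Update: w_ji <- w_ji + eps * delta_j * o_i.
W1_new = W1 + eps * np.outer(delta1, x)
W2_new = W2 + eps * np.outer(delta2, o1)

print(W1_new)   # [[-2.25    3.5   ] [ 1.8125  0.625 ]]
print(W2_new)   # [[ 4.125 -0.5  ] [-2.125  0.   ]]

# The loss on this input should have decreased after the step.
loss_before = loss(t, y)
loss_after = loss(t, forward(W1_new, W2_new, x)[2])
print(loss_before, loss_after)
```

For this single input the loss drops from 0.25 to a value close to zero, so the update moves the network output toward the target.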