Osnabrück University - Machine Learning (Summer Term 2016) - Prof. Dr.-Ing. G. Heidemann, Ulf Krumnack

# Exercise Sheet 07

## Introduction

This week's sheet should be solved and handed in before the end of **Sunday, June 05, 2016**. If you need help (and Google and other resources were not enough), feel free to contact your group's designated tutor or whichever of us you run into first. Please upload your results to your group's Stud.IP folder.

## Assignment 1: Perceptron Theory (x Points)

## Assignment 2: Perceptron (8 Points)

In this exercise you will implement a multilayer perceptron as described in the lecture. We start with the basic building block: the perceptron [ML-07 Slide 31]. As in a previous exercise, you may skip our premade code blocks and write the single perceptron completely from scratch (the section for this can be found at the bottom of the exercise). You are free to use lambdas for this exercise. These are handy constructs that let you create small anonymous functions. For example, a lambda function for addition looks like this:

In [None]:
addition = lambda x, y : x + y
print(addition(1,1))

If you feel uncomfortable with this, use regular functions; lambdas can shorten things, though, so at least give them a try. The $TODO$s in the following code segments guide you through what has to be done.

In [None]:
import numpy as np
import numpy.random as rnd

# TODO: Write the activation function (called actFun) and the output function (called outFun).

# Activation function: weighted sum of the input row vector d (bias input first) and the weights w.
actFun = lambda d, w: d @ np.transpose(w)

# The output function thresholds the activation: True (== 1) if > 0, else False (== 0).
outFun = lambda x: x > 0

# TODO: Write a function that generates weights.
def generateWeights(dims):
    '''
    Generates random weights in [0, 1) for the given number
    of input dimensions, plus one extra weight for the bias.
    '''
    W = rnd.rand(1, dims + 1)
    W.shape = (1, dims + 1)
    return W

In [None]:
####################################################
## Testing the perceptron with a concrete example ##
####################################################

# Dimensions for our test.
dims = 12 

# Input is a row vector.
D = np.append(1, rnd.rand(1, dims))
D.shape = (1, dims + 1)

# Weights are stored in a row vector.
W = generateWeights(dims)

out = outFun(actFun(D, W))
assert out == 1 or out == 0, "The output has to be either 1 or 0, but was %d" % out
assert actFun(D, W).shape == (1, 1), "The activation function's output should be one value"

The following $eval\_network(t, D, W)$ function is used to measure the performance of your perceptron for the upcoming task.

In [None]:
def eval_network(t, D, W):
    '''
    This function takes the trained weights of a perceptron
    and the input data (D) as well as the correct target values (t)
    and computes the overall error rate of the perceptron.
    '''
    error = 0.0
    size = np.max(D.shape)
    for i in range(0, size):
        out = outFun(actFun(D[i], W))
        error = error + np.abs(t[i] - out)
    # Normalize the error
    return error/size
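
For comparison, the same error rate can also be computed without an explicit loop. The following is only a sketch under a few assumptions: it uses plain NumPy arrays instead of `np.matrix`, a 1-D weight vector, and the same "greater than zero" thresholding that `outFun` applies above. The name `error_rate` and the hand-picked weights are hypothetical and not part of the exercise code.

```python
import numpy as np

def error_rate(t, D, w):
    """Fraction of rows of D whose thresholded output differs from the target t."""
    y = (D @ w > 0).astype(float)  # 0/1 output for every row at once
    return np.mean(np.abs(t - y))

# The four possible inputs of a two-input logical function, bias input first.
D_demo = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=float)
t_and = np.array([0.0, 0.0, 0.0, 1.0])  # targets for AND

w_and = np.array([-1.5, 1.0, 1.0])  # hand-picked weights that implement AND
w_or = np.array([-0.5, 1.0, 1.0])   # hand-picked weights that implement OR

print(error_rate(t_and, D_demo, w_and))  # 0.0
print(error_rate(t_and, D_demo, w_or))   # 0.5
```

The OR weights get two of the four AND targets wrong, which is why the second value is 0.5.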

Now we will use the functions defined above to train the perceptron on one of the following logical functions: AND, OR, NAND or NOR.

In [None]:
#############################################
# Now we train our perceptron! [ML-07, sl.33]
#############################################

# TODO write the update function for the weights dependent
#      on epsilon, the target, the output and the input vector
delta_fun = lambda eps,t,y,x: eps * (t-y) * x


# TODO define suitable parameters for your problem
eps = 0.1
dims = 2
training_size = 1250

# TODO generate the weights
W = generateWeights(dims)

# Input
# generate a list of truthvalue pairs
D_i = rnd.rand(training_size, dims) > 0.5

# pad the input with ones for the threshold/bias/w_0
D = np.ones((training_size, dims+1))
D[:,1:] = D_i 
D = np.matrix(D)

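# Example: learn one of the logical functions AND, OR, NAND, NOR.
op = lambda x1, x2: x1 and x2  # TODO: change for other functions

# Convert the np.matrix slice to a plain array so that each row is 1-D
# and can be indexed as row[0], row[1].
log_op = lambda row: op(row[0], row[1])
labels = np.apply_along_axis(log_op, 1, np.asarray(D[:, 1:]))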

epochs = 20
samp_size = 5

for i in range(0,epochs):
    #sample random from the training data
    for idx in rnd.choice(range(0, training_size), samp_size):
        y = outFun(actFun(D[idx], W))
        W = W + delta_fun(eps, labels[idx], y, D[idx])


#Print the overall performance of the Perceptron
print("Overall error of the Perceptron: {:.2%}".format(eval_network(labels, D, W)[0,0]))
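
If you want to compare all four logical functions at once, the training loop above can be condensed as in the following sketch. It is not the reference solution: it assumes plain NumPy arrays, 0/1 outputs, training on the four truth-table rows instead of random samples, and a fixed random seed. The names `GATES`, `train_gate` and `predict` are hypothetical; only the update rule is the one from `delta_fun`.

```python
import numpy as np

# The four logical functions from the task (hypothetical helper dictionary).
GATES = {
    "AND":  lambda a, b: a and b,
    "OR":   lambda a, b: a or b,
    "NAND": lambda a, b: not (a and b),
    "NOR":  lambda a, b: not (a or b),
}

# All four input combinations, with a leading 1 for the bias weight w_0.
X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=float)

def train_gate(gate, eps=0.1, max_epochs=1000, seed=0):
    """Train a single perceptron with 0/1 outputs on one truth table."""
    rng = np.random.default_rng(seed)
    t = np.array([float(gate(bool(a), bool(b))) for _, a, b in X])
    w = rng.random(3)
    for _ in range(max_epochs):
        mistakes = 0
        for x, target in zip(X, t):
            y = float(x @ w > 0)
            w += eps * (target - y) * x  # same rule as delta_fun
            mistakes += int(y != target)
        if mistakes == 0:  # a full pass without any update: done
            break
    return w, t

def predict(w):
    """Thresholded outputs for all four truth-table rows."""
    return (X @ w > 0).astype(float)

for name, gate in GATES.items():
    w, t = train_gate(gate)
    print(name, "weights:", np.round(w, 2), "all correct:", np.array_equal(predict(w), t))
```

All four functions are linearly separable, so the loop stops after a finite number of updates; XOR would not work with a single perceptron.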

In [None]:
# Space for complete own implementation


## Assignment : Sigmoid activation function & backpropagation delta function [ Points]

In this exercise we first take the derivative of a famous activation function, the sigmoid function
$$\sigma(t)=\frac{1}{1+e^{-t}}.$$
This function is commonly used because of its nice analytical properties: its values lie in $(0,1)$, and it is non-linear, strictly monotonic, continuous and differentiable. Moreover, its derivative can be expressed in terms of the original function at the given point, which allows us to avoid redundant calculations. The sigmoid function is a special case of the more general *logistic function*, which can be found in many different fields: biology, chemistry, economics, demography and, most prominently in recent years, artificial neural networks.

Take the derivative $\frac{d\sigma}{dt}$ and (if possible) write the resulting expression in terms of $\sigma(t)$:

$$\begin{align}
\frac{d\sigma}{dt}&=-e^{-t}\cdot(-1)\cdot\frac{1}{(1+e^{-t})^2}\\
&= \frac{1}{1+e^{-t}} \cdot \frac{- 1 + 1 + e^{-t}}{1+e^{-t}} \\
&= \frac{1}{1+e^{-t}} \cdot \left(1 - \frac{1}{1+e^{-t}}\right) \\
&= \sigma(t)(1-\sigma(t))
\end{align}$$
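
A quick way to convince yourself of this result is to compare it with a numerical derivative. The following sketch assumes NumPy and uses a central finite difference; the helper names `sigmoid` and `sigmoid_prime`, the step size `h` and the test points are our own choices for illustration.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def sigmoid_prime(t):
    """Derivative written in terms of the sigmoid itself."""
    s = sigmoid(t)
    return s * (1.0 - s)

t = np.linspace(-5.0, 5.0, 11)
h = 1e-5

# Central difference approximation of d(sigma)/dt.
numeric = (sigmoid(t + h) - sigmoid(t - h)) / (2.0 * h)
max_diff = np.max(np.abs(numeric - sigmoid_prime(t)))
print("largest deviation:", max_diff)
```

The deviation should be tiny, limited only by the finite step size and floating-point rounding. Note that `sigmoid_prime` evaluates the exponential only once, which is the saving mentioned above.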

MLPs can be regarded as a simple concatenation (and parallelization) of several perceptrons, each of which has a specified activation function $\sigma$ and a set of weights $\mathbf{w}_{ij}$. This idea came up rather early after the invention of the perceptron, but it was rarely used in practice because nobody really knew how to find appropriate $\mathbf{w}_{ij}$. The solution to this problem was the discovery of the backpropagation algorithm, which consists of two steps: first, propagate the input forward through the layers of the MLP and store the intermediate results; then propagate the error backwards and adjust the weights of the units accordingly.
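
To make the first of these two steps concrete, here is a minimal sketch of a forward pass that stores the intermediate results. It makes several simplifying assumptions that are not taken from the lecture: every layer is a weight matrix followed by the sigmoid, bias terms are left out, and the layer sizes, the random weights and the name `forward` are made up for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights):
    """Propagate x through all layers and keep every layer's output."""
    outputs = [x]
    for W in weights:
        outputs.append(sigmoid(W @ outputs[-1]))
    return outputs

rng = np.random.default_rng(0)

# Hypothetical network: 3 inputs, 4 hidden units, 1 output unit (no biases).
weights = [rng.normal(size=(4, 3)), rng.normal(size=(1, 4))]

outputs = forward(np.array([1.0, 0.0, 1.0]), weights)
print([o.shape for o in outputs])  # [(3,), (4,), (1,)]
```

The stored list is exactly what the backward step needs later: the last entry is the network output $y$, and the entries before it are the inputs to each weight layer.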

An updating rule for the output layer can be derived rather straightforwardly, so we are going to let you do that. The rules for the intermediate layers can be derived very similarly and only require a slight shift in perspective. The mathematics for that is, however, not in the standard toolkit, so we omit the calculations and refer you to the lecture slides.

We take the least-squares approach to derive the updating rule, i.e. we want to minimize the loss function
$$L = \frac{1}{2}(y-t)^2$$
where $t$ is the given (true) label from the dataset and $y$ is the (single) output produced by the MLP. To find the weights that minimize this expression, we take the derivative of $L$ w.r.t. $\mathbf{w}_{i}$, where we now assume that the $\mathbf{w}_{i}$ are the weights directly before the output layer and $o_k$ is the output of the $k$-th unit feeding into the output unit:
$$y = \sigma\left(\sum_{k=1}^n \mathbf{w}_{k}o_k\right)$$
Calculate $\frac{dL}{d\mathbf{w}_{i}}$.

*Hint (This might only be helpful for some): $\frac{dL}{d\mathbf{w}_{i}}=\frac{dL}{dy}\frac{dy}{d\mathbf{w}_{i}}$*

$$\begin{align}
\frac{dL}{d\mathbf{w}_{i}}&=\frac{dL}{dy}\frac{dy}{d\mathbf{w}_{i}}\\
&=(y-t)o_i\sigma\left(\sum_{k=1}^n \mathbf{w}_{k}o_k\right)\left(1-\sigma\left(\sum_{k=1}^n \mathbf{w}_{k}o_k\right)\right)\\
&= (y-t)o_iy\left(1-y\right)
\end{align}$$
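
As with the sigmoid derivative, this gradient can be checked numerically. The sketch below assumes NumPy, a single output unit without bias, and randomly chosen values for the weights and the incoming outputs $o_k$; the names `loss` and `grad`, the target value and the step size `h` are hypothetical choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w, o, t):
    """L = 1/2 (y - t)^2 with y = sigma(sum_k w_k o_k)."""
    y = sigmoid(w @ o)
    return 0.5 * (y - t) ** 2

def grad(w, o, t):
    """Analytic gradient: dL/dw_i = (y - t) o_i y (1 - y)."""
    y = sigmoid(w @ o)
    return (y - t) * y * (1.0 - y) * o

rng = np.random.default_rng(0)
w = rng.normal(size=4)  # weights before the output unit
o = rng.random(4)       # outputs of the previous layer
t = 1.0                 # target label

# Central differences, one weight at a time.
h = 1e-6
numeric = np.array([
    (loss(w + h * e, o, t) - loss(w - h * e, o, t)) / (2.0 * h)
    for e in np.eye(4)
])
grad_diff = np.max(np.abs(numeric - grad(w, o, t)))
print("largest deviation:", grad_diff)
```

A gradient-descent update for the output layer would then subtract a small multiple of `grad(w, o, t)` from `w`.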