Osnabrück University - Machine Learning (Summer Term 2016) - Prof. Dr.-Ing. G. Heidemann, Ulf Krumnack

# Exercise Sheet 07

## Introduction

This week's sheet should be solved and handed in before the end of **Sunday, June 05, 2016**. If you need help (and Google and other resources were not enough), feel free to contact your group's designated tutor or whichever of us you run into first. Please upload your results to your group's studip folder.

## Assignment 1: The Perceptron [6 Points]

### a) The Logic Perceptron

For the following two logical functions sketch a perceptron's weights after it was trained. To do so, figure out when the perceptron should fire. Then come up with ideas of how you can achieve this. Remember that $w_0$, the bias, is used as a threshold and that there is a constant $x_0 = 1$.

#### 1) $(A \wedge B) \vee (\neg A \wedge B)$

#### 2) $(A \wedge B) \vee (\neg A \wedge B) \vee (A \wedge \neg B)$
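To illustrate how the bias $w_0$ acts as a threshold in part a), here is a minimal sketch for a different function, plain AND, so that the two functions above remain an exercise. The helper name `perceptron_fires` and the weight values are assumptions chosen for illustration, not taken from the lecture.

```python
from itertools import product

def perceptron_fires(weights, inputs):
    """Return 1 if the weighted sum (including the bias term) is positive."""
    x = (1,) + tuple(inputs)  # prepend the constant input x_0 = 1
    activation = sum(w * xi for w, xi in zip(weights, x))
    return int(activation > 0)

# Hand-picked weights (w_0, w_1, w_2) for AND: the bias -1.5 acts as a
# threshold that only the sum of both inputs (1 + 1 = 2) can exceed.
and_weights = (-1.5, 1.0, 1.0)

for a, b in product((0, 1), repeat=2):
    print(a, b, "->", perceptron_fires(and_weights, (a, b)))
```

The same loop can be reused to check a candidate weight vector for either of the two functions above against its truth table.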

### b) The Tensorflow Perceptron

With the online tool [TensorFlow playground](https://playground.tensorflow.org/) it is possible to visualize simple neural networks and to share configurations and settings.

Follow [this link](https://playground.tensorflow.org/#activation=sigmoid&batchSize=1&dataset=gauss&regDataset=reg-plane&learningRate=0.1&regularizationRate=0&noise=0&networkShape=1&seed=0.56339&showTestData=true&discretize=false&percTrainData=50&x=true&y=true&xTimesY=false&xSquared=false&ySquared=false&cosX=false&sinX=false&cosY=false&sinY=false&collectStats=false&problem=classification&initZero=false&learningRate_hide=false&regularizationRate_hide=true&percTrainData_hide=true&batchSize_hide=true&dataset_hide=false&regularization_hide=true&discretize_hide=true&stepButton_hide=false&showTestData_hide=false&problem_hide=true&noise_hide=true&activation_hide=true) to the TensorFlow playground. If you click it, many features are disabled and set to useful defaults, since they were either not discussed yet in the lecture or are not important for this exercise.

You will see a simple configuration: Two activated inputs ($x_1$ and $x_2$), one hidden layer with one neuron (which can be understood as a simple perceptron) and the output shown as a nice picture. It initially shows a training loss of 0.527. Try and run it to see how the perceptron can learn to separate the two clusters. Note that for the rest of the exercise we assume at most about 1000 learning steps (usually many fewer will do it), so don't wait too long in front of your browser.

The dataset gets fully classified after very few iterations. Next try the XOR dataset, either by clicking on it on the left (the top right data pattern) or by following [this link](https://playground.tensorflow.org/#activation=sigmoid&batchSize=1&dataset=xor&regDataset=reg-plane&learningRate=0.1&regularizationRate=0&noise=0&networkShape=1&seed=0.56339&showTestData=true&discretize=false&percTrainData=50&x=true&y=true&xTimesY=false&xSquared=false&ySquared=false&cosX=false&sinX=false&cosY=false&sinY=false&collectStats=false&problem=classification&initZero=false&learningRate_hide=false&regularizationRate_hide=true&percTrainData_hide=true&batchSize_hide=true&dataset_hide=false&regularization_hide=true&discretize_hide=true&stepButton_hide=false&showTestData_hide=false&problem_hide=true&noise_hide=true&activation_hide=true). You will notice that you won't achieve much better results than a loss of 0.4, which is just above chance. Try to improve the result by adding neurons and/or layers (but don't change the inputs!) until you get a classification with a loss smaller than 0.01. You may also change the learning rate. Then copy the link from your browser address bar and paste it below:

How many neurons in hidden layers are already sufficient to get at least 99% classification (i.e. loss < 0.01) if they are a) in one hidden layer or b) in two hidden layers? You may consider configurations which just need above 1000 iterations to get there as well, but we don't expect you to run any configuration longer than 1000 iterations.
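The plateau on the XOR dataset can be reproduced outside the browser in a simplified setting. The sketch below assumes the four binary corner points instead of the playground's continuous data, a hard threshold instead of the sigmoid, and a hypothetical weight grid from -2 to 2. It brute-forces all weight triples of a single perceptron and reports the best accuracy it can reach.

```python
from itertools import product

POINTS = [(0, 0), (0, 1), (1, 0), (1, 1)]
OR_TARGETS = [0, 1, 1, 1]
XOR_TARGETS = [0, 1, 1, 0]

def best_accuracy(targets, grid):
    """Best fraction of correct outputs over all (w_0, w_1, w_2) in grid^3."""
    best = 0.0
    for w0, w1, w2 in product(grid, repeat=3):
        outputs = [int(w0 + w1 * x1 + w2 * x2 > 0) for x1, x2 in POINTS]
        correct = sum(o == t for o, t in zip(outputs, targets))
        best = max(best, correct / len(POINTS))
    return best

# Hypothetical search grid: weights from -2 to 2 in steps of 0.5.
GRID = [i / 2 for i in range(-4, 5)]

print("OR :", best_accuracy(OR_TARGETS, GRID))
print("XOR:", best_accuracy(XOR_TARGETS, GRID))
```

OR is solved perfectly, while XOR never gets more than three of the four points right, because no single line separates its two classes. This is why additional neurons or layers are needed.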

## Assignment 2: Perceptron [8 Points]

In this exercise you will implement a simple perceptron as described in the lecture [ML-07 Slide 31]. As in a previous exercise, you may skip our premade code blocks and write the single perceptron completely from scratch (the section for this can be found [below](#Own-Implementation)).

You are free to use lambdas for this assignment. Those are handy constructs that let you create small anonymous functions. The general syntax is: `name = lambda inputs : outputs`. For example, a little lambda function for addition would look like this:

In [1]:
addition = lambda x, y : x + y
print(addition(1, 1))

2


If you feel uncomfortable with this, use regular functions. Lambdas can shorten things, though, so at least give them a try. The `TODO`s in the following code segments guide you through what has to be done.

*Hint*: If you have problems with `np.arrays` (which usually have shapes like `(13,)`, thus with one degenerate dimension), either set the shapes manually (`my_np_array.shape = (13, 1)`) or try using `np.matrix` objects (`my_np_matrix = np.matrix(my_np_array)`). Other useful functions might be `np.append` or `np.hstack`.
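The shape issue from the hint can be seen in a few lines. This is only a sketch with an arbitrary example vector. It uses `reshape` as an alternative to assigning `.shape` directly, which is an assumption about what you may prefer, not a requirement of the exercise.

```python
import numpy as np

v = np.arange(13, dtype=float)      # 1-D array, shape (13,)
column = v.reshape(13, 1)           # explicit column vector, shape (13, 1)
row = v.reshape(1, -1)              # explicit row vector, shape (1, 13)

# Adding a constant 1 (e.g. the bias input x_0) to a 1-D array.
padded = np.hstack((1, v))          # 1 goes first, shape (14,)
appended = np.append(v, 1)          # 1 goes last, shape (14,)

print(v.shape, column.shape, row.shape, padded.shape, appended.shape)
print((row @ column).shape)         # row vector times column vector
```

Note that `np.hstack` puts the constant in front while `np.append` puts it at the end, which matters when $w_0$ is expected to be the first weight.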

In [2]:
import numpy as np
import numpy.random as rnd

# TODO: Write the activation function (called act_fun) and the output function (called out_fun).
def out_fun(X):
    return int(X>0)

def act_fun(X,W):
    return np.dot(X,W)

# TODO: Write a function generate_weights that generates N (= number of dimensions) + 1 (w_0) random weights.
def generate_weights(N):
    return np.random.random([N+1])*2-1


In [3]:
####################################################
## Testing the perceptron with a concrete example ##
####################################################

# Dimensions for our test.
dims = 12

# Input is a row vector. (Shape is (1, 13).)
D = np.matrix(np.hstack((1, rnd.rand(dims) - 0.5)))

# Weights are stored in a vector.
W = generate_weights(dims)

out = out_fun(act_fun(D, W))

assert out == 1 or out == 0, "The output has to be either 1 or 0, but was {}".format(out)
assert act_fun(D, W).shape == (1, 1), "The activation function's output should be one value"

The following `eval_network(t, D, W)` function is used to measure the performance of your perceptron for the upcoming task.

In [4]:
def eval_network(t, D, W):
    """
    This function takes the trained weights of a perceptron
    and the input data (D) as well as the correct target values (t)
    and computes the overall error rate of the perceptron.
    """
    error = 0.0
    size = np.max(D.shape)
    for i in range(0, size):
        out = out_fun(act_fun(D[i], W))
        error = error + np.abs(t[i] - out)
    # Normalize the error
    return error / size

Now we will use the functions defined above to train the perceptron on one of the following logical functions: OR, NAND or NOR.

In [5]:
###################################################
## Now we train our perceptron! [ML-07 Slide 33] ##
###################################################

# TODO: Write the update function delta_fun for the weights dependent
#       on epsilon, the target, the output and the input vector.

def delta_fun(eps, T, Y, D):
    return eps*(T - Y)*D
    
# TODO: Define suitable parameters for your problem.
eps = 0.1
dims = 2
training_size = 4

# TODO: Generate the weights.
W = generate_weights(dims) 

# TODO: Generate a matrix D of truth value pairs.
# The shape should be (training_size, dims).
D = [[0,0],[0,1],[1,0],[1,1]]

# TODO: Pad the input D with ones for the threshold/bias/w_0
for X in D:
    X[0:0] = [1]

print(D)
# Convert the nested list to an array so that slicing like D[:,1:] works below.
D = np.array(D)

# TODO: Learn one of the logical functions OR, NAND, NOR
# Change the lambda log_operator to achieve this.
log_operator = lambda x1, x2 : int(x1 or x2)

row_operator = lambda row: log_operator(row[0], row[1])
labels = np.apply_along_axis(row_operator, 1, D[:,1:])

epochs = 20    # Extra question: What effects do changes in the epochs 
samp_size = 4  #                 and sample sizes have on our training?
               # Note: samp_size must not exceed training_size while replace=False.

for i in range(0, epochs):
    print(i)
    # Sample randomly from the training data.
    for idx in rnd.choice(range(training_size), samp_size, replace=False):
        y = out_fun(act_fun(D[idx], W))
        W = W + delta_fun(eps, labels[idx], y, D[idx])

# Print the overall performance of the Perceptron.
# eval_network returns a plain number here, so no indexing is needed.
print("Overall error of the Perceptron: {:.2%}".format(eval_network(labels, D, W)))
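The update rule from the training cell above can also be tried in isolation. The following is a self-contained sketch under a few assumptions that differ from the cell: it learns NAND, starts from zero weights, visits the four samples in a fixed order instead of sampling them, and stops early once an epoch passes without a mistake. The names `add_bias`, `train_perceptron` and `predict` are hypothetical helpers.

```python
import numpy as np

def add_bias(inputs):
    """Prepend the constant bias input x_0 = 1 to every sample."""
    inputs = np.asarray(inputs, dtype=float)
    return np.hstack((np.ones((len(inputs), 1)), inputs))

def train_perceptron(inputs, targets, eps=0.1, max_epochs=1000):
    """Perceptron learning rule; stops once an epoch makes no mistakes."""
    X = add_bias(inputs)
    w = np.zeros(X.shape[1])  # assumption: start from zero weights
    for _ in range(max_epochs):
        mistakes = 0
        for x, t in zip(X, targets):
            y = int(x @ w > 0)
            if y != t:
                w = w + eps * (t - y) * x  # same rule as delta_fun
                mistakes += 1
        if mistakes == 0:
            break
    return w

def predict(w, inputs):
    """Thresholded output for every sample, computed as during training."""
    return [int(x @ w > 0) for x in add_bias(inputs)]

pairs = [[0, 0], [0, 1], [1, 0], [1, 1]]
nand_targets = [1, 1, 1, 0]

w = train_perceptron(pairs, nand_targets)
print("weights:", w)
print("predictions:", predict(w, pairs))
```

Because NAND is linearly separable, the rule is guaranteed to stop making mistakes after finitely many updates. Swapping in XOR targets makes the loop run until `max_epochs` without ever reaching a mistake-free epoch.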

### Own Implementation

Skip this if you already implemented the perceptron above.

In [2]:
import numpy as np

OR   = [([0,0],0),([0,1],1),([1,0],1),([1,1],1)]
NAND = [([0,0],1),([0,1],1),([1,0],1),([1,1],0)]
XOR  = [([0,0],0),([0,1],1),([1,0],1),([1,1],0)]

#define InputDataset
INPUT = XOR

def delta_fun(eps, T, Y, D):
    return np.multiply(eps*(T-Y),(D))

eps  = 0.25
dim  = 2
size = 4

#define weights
W = np.random.random([3])*2+1

#define inputs and prepend the constant bias input x_0 = 1
D = [[1] + X[0] for X in INPUT]

#define targets 
T = [X[1] for X in INPUT]

cycles  = 100
samples = 5

for c in range(cycles):
    indices = np.random.randint(0,size,samples) #get random indices for training
    error = 0
    for i in indices:
        Y = np.tanh(np.sum(W*D[i])) #get output of perceptron
        W = np.add(W,delta_fun(eps,T[i],Y,D[i])) #update weight vector
        error += abs(Y-T[i])**2
    if(c%10==0):
        print("ERROR: ",np.sqrt(error)/samples) #print root of the summed squared error divided by the sample count

#plot trained perceptron
print("\nPLOT OF PERCEPTRON OUTPUT:\n")

for i in range(size):
    Y = np.tanh(np.sum(W*D[i]))
    print(D[i][1:]," :: ",Y)


ERROR:  0.259145601289
ERROR:  0.155216745788
ERROR:  0.253921138992
ERROR:  0.270010539957
ERROR:  0.297869466152
ERROR:  0.260394273293
ERROR:  0.347174683097
ERROR:  0.319149587122
ERROR:  0.278963249374
ERROR:  0.180799853146

PLOT OF PERCEPTRON OUTPUT:

[0, 0]  ::  0.495359479002
[0, 1]  ::  0.63311476803
[1, 0]  ::  0.558485249211
[1, 1]  ::  0.682665478922


## Assignment 3: Sigmoid Activation & Backpropagation Delta Functions [6 Points]

In this exercise we first take the derivative of a famous activation function, the sigmoid function
$$\sigma(t)=\frac{1}{1+e^{-t}}.$$
This function is commonly used because of its nice analytical properties: its range is the open interval $(0,1)$, it is non-linear, strictly monotonic, continuous and differentiable, and its derivative can be expressed in terms of the original function at the given point. This allows us to avoid redundant calculations. The sigmoid function is a special case of the more general *logistic function*, which can be found in many different fields: biology, chemistry, economics, demography and, recently most prominently, artificial neural networks.

Take the derivative $\frac{\partial \sigma}{\partial t}$ and (if possible) write the resulting expression in terms of $\sigma(t)$:

$$\begin{align}
\frac{\partial \sigma}{\partial t} &=\, ?
\end{align}$$
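A candidate for the derivative can be checked numerically before handing it in. The sketch below compares a candidate against a central-difference estimate. The helper `max_derivative_error`, the step size `h` and the test grid are assumptions made for illustration, and the candidate shown is a deliberately wrong placeholder so that the solution is not given away.

```python
import numpy as np

def max_derivative_error(f, df, ts, h=1e-5):
    """Largest gap between df and a central-difference estimate of f'."""
    ts = np.asarray(ts, dtype=float)
    numeric = (f(ts + h) - f(ts - h)) / (2 * h)
    return float(np.max(np.abs(numeric - df(ts))))

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

ts = np.linspace(-5, 5, 101)

# Sanity check of the checker on a known pair: d/dt tanh(t) = 1 - tanh(t)^2.
print(max_derivative_error(np.tanh, lambda t: 1 - np.tanh(t) ** 2, ts))

# Replace the lambda with your own expression written in terms of sigmoid(t).
# The placeholder below is deliberately wrong, so the reported error is large.
my_candidate = lambda t: sigmoid(t)
print(max_derivative_error(sigmoid, my_candidate, ts))
```

A correct candidate should bring the second number down to roughly the same tiny magnitude as the first.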

Multilayer perceptrons (MLPs) can be regarded as a simple concatenation (and parallelization) of several perceptrons, each having a specified activation function $\sigma$ and a set of weights $\mathbf{w}_{ij}$. This idea arose soon after the invention of the perceptron, but it was hardly used in practice because nobody knew how to determine the appropriate $\mathbf{w}_{ij}$. The solution to this problem was the discovery of the backpropagation algorithm, which consists of two steps: first propagating the input forward through the layers of the MLP and storing the intermediate results, and then propagating the error backwards and adjusting the weights of the units accordingly.
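The first of the two steps, the forward pass with stored intermediate results, might look like the following sketch. The 2-3-1 layer sizes, the random weights and the omission of bias terms are assumptions made to keep the example short. The backward step is left out here.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def forward(x, weight_matrices):
    """Propagate x through the layers and keep every layer's output."""
    outputs = [np.asarray(x, dtype=float)]
    for W in weight_matrices:
        outputs.append(sigmoid(W @ outputs[-1]))
    return outputs  # outputs[0] is the input, outputs[-1] the network output

# Hypothetical 2-3-1 network with random weights (bias terms left out for brevity).
rng = np.random.default_rng(0)
weights = [rng.uniform(-1, 1, size=(3, 2)), rng.uniform(-1, 1, size=(1, 3))]

stored = forward([1.0, 0.0], weights)
print([o.shape for o in stored])
```

Keeping the whole list rather than only the final output is what later allows the backward step to reuse each layer's output when adjusting the weights.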

An update rule for the output layer can be derived in a straightforward way. The rules for the intermediate layers can be derived very similarly and only require a slight shift in perspective. The mathematics for that is, however, not in the standard toolkit, so we omit the calculations and refer you to the lecture slides.

We take the least-squares approach to derive the update rule, i.e. we want to minimize the loss function
$$L = \frac{1}{2}(y-t)^2$$
where $t$ is the given (true) label from the dataset and $y$ is the (single) output produced by the MLP. To find the weights that minimize this expression, we take the derivative of $L$ w.r.t. $\mathbf{w}_{i}$, where we now assume that the $\mathbf{w}_{i}$ are the weights directly before the output layer and the $o_k$ are the outputs of the $n$ units feeding into it:
$$y = \sigma\left(\sum_{k=1}^n \mathbf{w}_{k}o_k\right)$$
Calculate $\frac{\partial L}{\partial \mathbf{w}_{i}}$.

*Hint*: Start here if you don't know what to do: $\frac{\partial L}{\partial \mathbf{w}_{i}} = \frac{\partial L}{\partial y}\frac{\partial y}{\partial \mathbf{w}_{i}}$

$$\begin{align}
\frac{\partial L}{\partial \mathbf{w}_{i}} &= \, ?
\end{align}$$
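Once you have an expression for $\frac{\partial L}{\partial \mathbf{w}_{i}}$, it can be compared against a numerical gradient. The sketch below estimates the gradient of the loss defined above by central differences. The concrete values for the weights, the incoming outputs $o_k$ and the label are hypothetical, and the analytic result is intentionally not included.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def loss(w, o, t):
    """L = 1/2 (y - t)^2 with y = sigmoid(sum_k w_k o_k)."""
    y = sigmoid(np.dot(w, o))
    return 0.5 * (y - t) ** 2

def numerical_gradient(w, o, t, h=1e-6):
    """Central-difference estimate of dL/dw_i for every i."""
    w = np.asarray(w, dtype=float)
    grad = np.zeros_like(w)
    for i in range(len(w)):
        step = np.zeros_like(w)
        step[i] = h
        grad[i] = (loss(w + step, o, t) - loss(w - step, o, t)) / (2 * h)
    return grad

# Hypothetical example values for the weights, the incoming outputs o_k and the label.
w = np.array([0.5, -0.3, 0.8])
o = np.array([0.2, 0.7, 0.1])
t = 1.0

print(numerical_gradient(w, o, t))
```

Evaluating your derived formula at the same `w`, `o` and `t` should reproduce these numbers up to small numerical error.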