Osnabrück University - Machine Learning (Summer Term 2016) - Prof. Dr.-Ing. G. Heidemann, Ulf Krumnack

# Exercise Sheet 08

## Introduction

This week's sheet should be solved and handed in before the end of **Sunday, June 12, 2016**. If you need help (and Google and other resources were not enough), feel free to contact your group's designated tutor or whomever of us you run into first. Please upload your results to your group's Stud.IP folder.

## Assignment 1: Multilayer Perceptron (MLP) [10 Points]

Last week you implemented a simple perceptron. We discussed that one can use multiple perceptrons to build a network. This week you will build your own MLP. Again the following code cells are just a guideline. If you feel like it, just follow the algorithm steps below in the empty cell - otherwise feel free to use our guided approach again.

### Algorithm: Multilayer Perceptron with Backpropagation

We try to follow the definitions from the lecture (ML-07, Slide 46) closely:

* Layers are numbered from $0$ (input layer) to $L_H + 1$ (output layer), such that $1 \dots L_H$ are hidden layers.
* Each layer has $N(i)$ neurons, numbered from $1 \dots N(i)$.
* $o_i(k)$ is the output of neuron $i$ in layer $k$.
* $w_{ik}(m,n)$ is the weight between neuron $i$ in layer $m$ and neuron $k$ in layer $n$ (where for our case $m = n + 1$ holds).
* The input to the MLP is $x \in \mathbb{R}^{d_{in}} = o(0)$, the output is $y \in \mathbb{R}^{d_{out}} = o(L_H + 1)$.
* $\epsilon$ is the learning rate.

The algorithm you have to implement is now as follows:

1. **Initialize your MLP.** Use as many input neurons as there are dimensions in the data. Input neurons always expect 1D input. Then create neurons for each hidden and the output layer. Each neuron in the hidden and output layers expects as many inputs as there are neurons in the layer before them.
1. **Initialize the neurons' weights.** For each neuron in layers $1 \dots L_H + 1$ initialize the weights to small random values (values between $0$ and $1$ are fine, but you are allowed to tweak the numbers around).
1. **Implement the activation (feed-forward) step.**
    1. Decompose the input into its components and pass them to the correct input neuron.
    1. Each input neuron passes its unprocessed input to the next layer. That means each neuron in layer $1$ receives the outputs of all input neurons as its own input.
    $$o_i(0) = x_i$$
    1. For each neuron in the layers $1 \dots L_H + 1$, calculate the weighted sum of its inputs and apply the activation function $\sigma$. This is best done iteratively layer by layer, as each layer's input is the output of its preceding layer (Note: $w_{j0}(k,k)$ denotes the bias for neuron $j$ in layer $k$):
    $$\begin{align*}
      o_j(k) = \sigma\left(w_{j0}(k,k)+\sum\limits_{i=1}^{N(k-1)} 
              o_i(k-1) w_{ji}(k,k-1)\right)
    \end{align*}$$
    with 
    $$\sigma(x) = \frac{1}{1 + \exp{(-x)}}$$
    1. The resulting $o_i(L_H+1)$ are the outputs $y_i$ for each output neuron $i$.
1. **Implement the adaption (backpropagation) step.**
    1. Compute the error between the target and output components to calculate the error signals $\delta_i(L_H+1)$:
    $$\begin{align*}
      \delta_i(L_H + 1) &= o_i(L_H+1)\ (1 - o_i(L_H+1))\ (t_i - o_i(L_H + 1))
    \end{align*}$$
    1. Calculate the error signals $\delta_i(k)$ for each hidden layer $k$, starting with $k=L_H$ and going down to $k=1$.
    $$\begin{align*}
    \delta_i(k) &= o_i(k)\ (1 - o_i(k))\ \sum\limits_{j=1}^{N(k+1)} w_{ji}(k+1,k)\delta_j(k+1)
    \end{align*}$$
    1. Adapt the weights for each neuron in the hidden and output layers.
    $$\Delta w_{ji}(k+1, k) = \epsilon\, \delta_j(k+1)\, o_i(k)$$
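
The steps above can be sketched in vectorized NumPy form. This is only a minimal sketch under stated assumptions, not the reference solution: the function names `forward` and `backward`, the layer sizes and the learning rate are illustrative choices, and each layer's biases are stored in column 0 of its weight matrix to mirror $w_{j0}$.

```python
import numpy as np


def sigmoid(a):
    """Logistic function sigma(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + np.exp(-a))


def forward(weights, x):
    """Return the list of outputs o(0), ..., o(L_H + 1) for input x.

    Each entry of `weights` has shape (N(k), N(k-1) + 1);
    column 0 holds the biases w_j0.
    """
    outputs = [np.asarray(x, dtype=float)]
    for W in weights:
        outputs.append(sigmoid(W @ np.append(1.0, outputs[-1])))
    return outputs


def backward(weights, outputs, t, eps=0.5):
    """Apply one backpropagation step to `weights` in place."""
    # Error signal of the output layer.
    o = outputs[-1]
    delta = o * (1 - o) * (t - o)
    for k in reversed(range(len(weights))):
        o_prev = outputs[k]
        # The back-propagated sum uses the weights before they are
        # updated; the bias column is skipped.
        back = weights[k][:, 1:].T @ delta
        # Delta w_ji = eps * delta_j * o_i, with o_0 = 1 for the bias.
        weights[k] += eps * np.outer(delta, np.append(1.0, o_prev))
        # Error signal of the preceding layer (unused for the input layer).
        delta = o_prev * (1 - o_prev) * back


rng = np.random.default_rng(0)
weights = [rng.uniform(0, 1, size=(2, 3 + 1)),  # hidden layer: 3 -> 2
           rng.uniform(0, 1, size=(1, 2 + 1))]  # output layer: 2 -> 1
x, t = np.array([0.2, 0.9, 0.4]), np.array([0.0])

error_before = abs(t - forward(weights, x)[-1])[0]
for _ in range(100):
    backward(weights, forward(weights, x), t)
error_after = abs(t - forward(weights, x)[-1])[0]
print(error_before, error_after)
```

Repeatedly presenting the same sample should shrink the error on that sample, which is a quick sanity check for the sign conventions of $\delta$ and $\Delta w$.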

### Implementation
In the following you will be guided through implementing the above algorithm step by step. Instead of sticking to this guide, you are free to take a complete custom approach instead if you wish.

We will basically take a bottom-up approach: starting from an individual **perceptron** (or neuron), via a **layer of perceptrons**, to a **multilayer perceptron** (or neural network). Each of the three will be implemented as its own Python *class*. Such a class defines a type of element, such as a perceptron, which can be instantiated multiple times. You can think of an instance as an individual of a population (defined by a specific class). Each instance has specific methods which can be used to modify it: to stay with the population analogy, each puppy of the *puppy population* has the method *make_puppy_face()*.

To guide you along, all required classes and functions are outlined in valid Python code with extensive comments. Each comment describes the arguments the specific method accepts (`Args`) and the values it is expected to return (`Returns`).

#### Dependencies
Besides `numpy` for matrix multiplications, we only rely on the `expit` function from the `scipy` package. `expit` is the ordinary logistic sigmoid, but the `scipy` implementation takes care of numerical problems that the naive formula can run into for inputs of large magnitude, where `exp` overflows.
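
To see what `expit` buys us, the following sketch compares it with the textbook formula on inputs of large magnitude. The helper name `naive_sigmoid` and the test values are illustrative assumptions. With default NumPy settings the naive version typically emits an overflow `RuntimeWarning`, which is recorded here instead of printed.

```python
import warnings

import numpy as np
from scipy.special import expit


def naive_sigmoid(x):
    """Textbook formula; exp(-x) overflows for very negative x."""
    return 1 / (1 + np.exp(-x))


x = np.array([-1000.0, 0.0, 1000.0])

# Record warnings raised by the naive formula instead of printing them.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    naive = naive_sigmoid(x)

stable = expit(x)
print(naive, [str(w.message) for w in caught])
print(stable)
```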

In [None]:
import numpy as np

# sigmoid = lambda x: 1 / (1 + np.exp(-x))
from scipy.special import expit as sigmoid

#### Perceptron, PerceptronLayer and MultilayerPerceptron

In [None]:
class Perceptron:
    """Single neuron handling its own weights and bias."""

    def __init__(self, dim_in, act_func=sigmoid):
        """Initialize a new neuron with its weights and bias.

        Args:
            dim_in: Dimensionality of the data coming into
                this perceptron. In a network of perceptrons
                this basically represents the number of neurons
                in the layer before this neuron's layer.
            act_func: Function to apply on activation.
        """
        self.act_func = act_func
        self.weights = np.random.normal(size=dim_in + 1)

    def activate(self, x):
        """Activate this neuron with a specific input.

        Calculate the weighted sum of inputs and apply the
        activation function.

        Args:
            x: Vector of input values.

        Returns:
            A real number representing the perceptron's
            activation after calculating the weighted sum
            of inputs and applying the perceptron's
            activation function.
        """
        return self.act_func(self.weights @ np.append(1, x))

    def adapt(self, x, delta, rate=0.03):
        """Adapt this neuron's weights by a specific delta.

        Args:
            x: Vector of input values.
            delta: Weight adaptation delta value.
            rate: Learning rate.
        """
        self.weights += rate * delta * np.append(1, x)


# TODO (ahoereth): Add asserts testing the perceptron.

In [None]:
class PerceptronLayer:
    """Layer of multiple neurons."""

    def __init__(self, dim_in, dim_out, act_func=sigmoid):
        """Initialize the layer as a list of individual neurons.

        A layer contains as many neurons as it has outputs, each
        neuron has as many input weights (+ bias) as the layer has inputs.
        
        Args:
            dim_in: Dimensionality of the expected input values,
                also the size of the previous layer of a neural network.
            dim_out: Dimensionality of the output, also the requested
                amount of neurons in this layer and the input dimension
                of the next layer.
            act_func: Activation function to use in each perceptron of
                this layer.
        """
        self.perceptrons = [Perceptron(dim_in, act_func)
                            for _ in range(dim_out)]

    def activate(self, x):
        """Activate this layer by activating each individual neuron.
        
        Args:
            x: Vector of input values.
            
        Returns:
            Vector of output values which can be used as input to
            another PerceptronLayer instance.
        """
        return np.asarray([p.activate(x) for p in self.perceptrons])

    def adapt(self, x, deltas, rate=0.03):
        """Adapt this layer by adapting each individual neuron.
        
        Args:
            x: Vector of input values.
            deltas: Vector of delta values.
            rate: Learning rate.
        """
        for perceptron, delta in zip(self.perceptrons, deltas):
            perceptron.adapt(x, delta, rate)

    def get_weights_matrix(self):
        """Helper function for getting this layer's weight matrix.
        
        Returns:
            Numpy array of all the weights for this perceptron layer.
        """
        return np.asarray([p.weights for p in self.perceptrons]).T


# TODO (ahoereth): Add asserts testing the perceptron layer.

In [None]:
class MultilayerPerceptron:
    """Network of perceptrons, also a set of multiple perceptron layers."""

    def __init__(self, *layers):
        """Initialize a new network, made up of individual PerceptronLayers.

        Args:
            *layers: Arbitrarily many PerceptronLayer instances.
        """
        self.layers = layers

    def activate(self, x):
        """Activate network and return the last layer's output.

        Args:
            x: Vector of input values.

        Returns:
            Vector of output values from the last layer of the network
            after propagating forward through the network.
        """
        for layer in self.layers:
            x = layer.activate(x)
        return x

    def adapt(self, x, t, rate=0.03):
        """Adapt the whole network given an input and expected output.

        Args:
            x: Vector of input values.
            t: Vector of target values (expected outputs).
            rate: Learning rate.
        """
        # Activate each layer and collect intermediate outputs.
        outputs = [x]
        for layer in self.layers:
            outputs.append(layer.activate(outputs[-1]))

        # Calculate total error between t and network output.
        e = t - outputs[-1]

        # Backpropagate error through the network and adapt each layer.
        # TODO (ahoereth): Get rid of zip here because its 
        #     difficult to grasp not only for python beginners.
        layers = list(zip(self.layers, outputs[:-1], outputs[1:]))
        for layer, x, y in reversed(layers):
            delta = (y * (1 - y)) * e
            weights = layer.get_weights_matrix()
            # TODO (ahoereth): The following line for many is hard and
            #     therefore the need to remove the bias should be
            #     somehow explained.
            e = (weights @ delta)[1:]
            layer.adapt(x, delta, rate)


# TODO (ahoereth): Add asserts testing the MLP.
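
The slice `[1:]` in `adapt` can be puzzling at first. The following sketch, using made-up numbers for a hypothetical layer with 3 inputs and 2 neurons, illustrates the shapes involved: the matrix returned by `get_weights_matrix` has one row per input plus one row for the bias, so the product with `delta` contains one extra entry that has no counterpart in the previous layer.

```python
import numpy as np

# Hypothetical layer with 3 inputs and 2 neurons. Each row of
# `neuron_weights` is one neuron's weight vector [bias, w_1, w_2, w_3].
neuron_weights = np.array([[0.1, 0.2, 0.3, 0.4],
                           [0.5, 0.6, 0.7, 0.8]])
weights = neuron_weights.T     # shape (4, 2), as get_weights_matrix returns
delta = np.array([1.0, -1.0])  # one error signal per neuron of this layer

back = weights @ delta         # shape (4,): bias entry plus one per input
# Entry 0 belongs to the bias, i.e. to the constant input 1 and not to a
# neuron of the previous layer, so there is nothing to propagate it to.
e = back[1:]                   # shape (3,): one entry per previous neuron
print(back.shape, e.shape)
```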

### Classification

#### Problem Definition
Before we start, we need a problem to solve. In the following cell we first generate some three-dimensional data (= $\text{input_dim}$) between 0 and 1 and label all data according to a binary classification: If the sum of a data point's components is greater than $0.7 \cdot \text{input_dim}$, the data point belongs to the first class, otherwise to the second.

In the cell below we visualize the data set.

**TODO (ahoereth)**: 
- Add padding between data for better results?
- Describe what exactly we aim to do.
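
One property of this labeling rule is worth keeping in mind when reading accuracy values: the two classes are not balanced. For three independent uniform components the share of points with a sum above $0.7 \cdot 3 = 2.1$ is $0.9^3 / 6 \approx 0.12$. The sketch below estimates this share empirically; the sample size, the seed and the variable names are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, only for reproducibility
input_dim = 3
samples = rng.random((100_000, input_dim))
labels = (samples.sum(axis=1) > 0.7 * input_dim).astype(int)

positive_share = labels.mean()
# Accuracy of a classifier that always predicts the majority class.
baseline_accuracy = max(positive_share, 1 - positive_share)
print('share of class 1: {:.3f}'.format(positive_share))
print('majority-class accuracy: {:.3f}'.format(baseline_accuracy))
```

A network that always answers with the majority class therefore already reaches roughly 88 % accuracy, so only values clearly above that indicate that something was learned.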

In [None]:
n = 1000
input_dim = 3

inputs = np.random.rand(n, input_dim)
targets = (np.sum(inputs, 1) > 0.7 * input_dim) * 1

In [None]:
%matplotlib notebook
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

fig = plt.figure('Labeled Data')
ax = fig.add_subplot(111, projection='3d')
for label, color in [(0, 'cyan'), (1, 'orange')]:
    x, y, z = tuple(zip(*inputs[targets == label]))
    ax.scatter(x, y, z, c=color)

#### Model Design

In [None]:
MLP = MultilayerPerceptron(
    PerceptronLayer(input_dim, 2),
    PerceptronLayer(2, 2),
    PerceptronLayer(2, 1),
)

#### Training

In [None]:
EPOCHS = 200000

for epoch in range(0, EPOCHS):
    s = np.random.randint(0, len(targets))
    MLP.adapt(inputs[s], targets[s])

    if epoch % 2500 == 0 or (epoch + 1) == EPOCHS:
        outputs = np.squeeze([MLP.activate(x) for x in inputs])
        predictions = np.round(outputs)
        accuracy = np.sum(predictions == targets) / len(targets)
        print('Accuracy at epoch {}: {}'.format(epoch, accuracy))

#### Evaluation

In [None]:
error = 0
for d, t in zip(inputs, targets):
    error += np.abs(t - MLP.activate(d)) / len(targets)
print(error)

## Assignment 2: MLP and RBFN [10 Points]

This exercise is aimed at deepening the understanding of Radial Basis Function Networks and how they relate to Multilayer Perceptrons. Not all of the answers can be found directly in the slides - so when answering the (more algorithmic) questions, first take a minute and think about how you would go about solving them and if nothing comes to mind search the internet for a little bit. If you are interested in a real life application of both algorithms and how they compare take a look at this paper: [Comparison between Multi-Layer Perceptron and Radial Basis Function Networks for Sediment Load Estimation in a Tropical Watershed](http://file.scirp.org/pdf/JWARP20121000014_80441700.pdf)

![Schematic of a RBFN](RBFN.png)

We have prepared a little example that shows how radial basis function approximation works in Python. This is not an example implementation of an RBFN, but it illustrates the work of the hidden neurons.

In [None]:
%matplotlib notebook

import numpy as np
from numpy.random import uniform

from scipy.interpolate import Rbf

import matplotlib
import matplotlib.pyplot as plt
from matplotlib import cm


def func(x,y):
    '''
    This is the example function that should be fitted.
    Its shape could be described as two peaks close to
    each other - one going up, the other going down
    '''
    return (x + y) * np.exp(-4.0 * (x**2 + y**2))

# number of training points (you may try different values here)
training_size = 50

# sample 'training_size' data points from the input space [-1,1]x[-1,1] ...
x = uniform(-1.0, 1.0, size=training_size)
y = uniform(-1.0, 1.0, size=training_size)

# ... and compute function values for them.
fvals = func(x, y)

# get the approximation via RBF
new_func = Rbf(x, y, fvals)


# Plot both functions: 
# create a 100x100 grid of input values
x_grid, y_grid = np.mgrid[-1:1:100j, -1:1:100j]

plt.figure("Original Function")
# This plot represents the original function
f_orig = func(x_grid, y_grid)
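# np.mgrid varies x along axis 0, while imshow draws rows along the
# vertical axis, so transpose and put the origin at the bottom.
plt.imshow(f_orig.T, extent=[-1,1,-1,1], origin='lower', cmap=plt.cm.jet)

plt.figure("RBF Result")
# This plots the approximation of the original function by the RBF
# if the plot looks strange try to run it again, the sampling
# in the beginning is random
f_new = new_func(x_grid, y_grid)
plt.imshow(f_new.T, extent=[-1,1,-1,1], origin='lower', cmap=plt.cm.jet)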
plt.xlim(-1,1)
plt.ylim(-1,1)
# scatter the datapoints that have been used by the RBF
plt.scatter(x, y)

### Radial Basis Function Networks

#### What are radial basis functions?

Radial basis functions are all functions that fulfill the following criterion:

The value of the function at a certain point depends only on the distance of that point to the origin or to some other fixed center point. Formally:
$\phi (\mathbf {x} )=\phi (\|\mathbf {x} \|)$  or  $\phi (\mathbf {x} ,\mathbf {c} )=\phi (\|\mathbf {x} -\mathbf {c} \|)$. Note that the Euclidean norm is the most common measure of distance, but other distance measures can be used as well.
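
As an illustration, the Gaussian is one common choice of radial basis function. The sketch below is a minimal example; the function name `gaussian_rbf` and the `width` parameter are assumptions made for this illustration. It shows that two points at the same distance from the center get the same value, regardless of direction.

```python
import numpy as np


def gaussian_rbf(x, c, width=1.0):
    """Gaussian radial basis function phi(||x - c||) = exp(-(r / width)^2)."""
    r = np.linalg.norm(np.asarray(x) - np.asarray(c), axis=-1)
    return np.exp(-(r / width) ** 2)


center = np.array([0.0, 0.0])
# Two points at the same distance from the center, in different directions.
a = gaussian_rbf([1.0, 0.0], center)
b = gaussian_rbf([0.0, -1.0], center)
print(a, b)  # equal, since only the distance to the center matters
```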

#### What is the structure of an RBFN? You may also use the notation from the picture included above.

RBFNs are networks that contain only one hidden layer. The input is connected to all the hidden units. Each of the hidden units has a different radial basis function that is *sensitive* to a certain range of the input domain. The output is then a linear combination of the outputs of those functions.
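
This structure can be written down in a few lines. The following is a minimal sketch for a single output neuron, assuming Gaussian hidden units; the function name `rbfn_forward` and all numbers are made up for illustration.

```python
import numpy as np


def rbfn_forward(x, centers, widths, weights):
    """Output of a single-output RBFN for one input vector x.

    centers: (n_hidden, d) array, one center per hidden unit.
    widths:  (n_hidden,) array, one width per hidden unit.
    weights: (n_hidden,) array of linear output weights.
    """
    distances = np.linalg.norm(centers - x, axis=1)
    hidden = np.exp(-(distances / widths) ** 2)  # Gaussian hidden units
    return weights @ hidden                      # linear output neuron


centers = np.array([[0.0, 0.0], [1.0, 1.0]])
widths = np.array([0.5, 0.5])
weights = np.array([2.0, -1.0])

# The input sits on the first center, so the first hidden unit dominates.
out = rbfn_forward(np.array([0.0, 0.0]), centers, widths, weights)
print(out)
```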

#### How is a RBFN trained?

Note: all input data has to be normalized.

Training an RBFN is a two-step process. First the functions in the hidden layer are initialized. This can be done either by sampling from the input data or by first performing a k-means clustering, where k is the number of nodes that have to be initialized.

The second step fits a linear model with coefficients $w_{i}$ to the hidden layer's outputs with respect to some objective function. The objective function depends on the task: for a least-squares objective the weights can be computed directly, or they can be adapted iteratively by gradient descent.
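
The two steps can be sketched as follows, using the example function `func` from earlier in this sheet as the target. This is a minimal sketch under several assumptions: centers are sampled from the training data (instead of using k-means), all hidden units are Gaussians with the same hand-picked `width`, and the output weights are fitted with `np.linalg.lstsq`. The names `design_matrix` and `target` are illustrative.

```python
import numpy as np


def design_matrix(X, centers, width):
    """Gaussian hidden-layer outputs for all rows of X, shape (n, n_hidden)."""
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return np.exp(-(dists / width) ** 2)


def target(X):
    """Same function as `func` above, taking an (n, 2) array of points."""
    x, y = X[:, 0], X[:, 1]
    return (x + y) * np.exp(-4.0 * (x ** 2 + y ** 2))


rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(200, 2))
y = target(X)

# Step 1: initialize the hidden units by sampling centers from the data.
centers = X[rng.choice(len(X), size=25, replace=False)]
width = 0.5  # assumed, hand-picked width shared by all hidden units

# Step 2: fit the linear output weights by least squares.
H = design_matrix(X, centers, width)
w, *_ = np.linalg.lstsq(H, y, rcond=None)

rmse = np.sqrt(np.mean((H @ w - y) ** 2))
print('training RMSE: {:.4f}'.format(rmse))
```

Only the second step involves optimization, and for the least-squares objective it is a linear problem, so no backpropagation through the hidden layer is needed.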

### Comparison to the Multilayer Perceptron

#### What do both models have in common? Where do they differ?

|RBFN                 |MLP                  | 
|---------------------|---------------------|
| non-linear layered feedforward network|non-linear layered feedforward network| 
| hidden neurons use radial basis functions, output neurons use linear function| input, hidden and output-layer all use the same activation function| 
| universal approximator |   universal approximator |
| learning usually affects only one or a few RBFs | learning affects many weights throughout the network|

#### How can classification in both networks be visualized?

![Classification](Solution_Classification.png)

#### When would you use a RBFN instead of a Multilayer Perceptron?

RBFNs are more robust to noise and should therefore be used when the data contains false positives.