Osnabrück University - Machine Learning (Summer Term 2016) - Prof. Dr.-Ing. G. Heidemann, Ulf Krumnack

# Exercise Sheet 03

## Introduction

This week's sheet should be solved and handed in before the end of **Sunday, May 1, 2016**. If you need help (and Google and other resources were not enough), feel free to contact your group's designated tutor or whichever of us you run into first. Please upload your results to your group's studip folder.

This week's sheet again involves a lot of implementation, but don't worry: To be able to implement most of the code, you have to understand the theory.

## Assignment 1: Rosner test [5 Points]

The Rosner test is an iterative procedure that removes outliers from a data set via a z-test. In this exercise you will implement it and apply it to a sample data set.

### a)

First of all, think about why we use procedures like this and answer the following questions: 

What are causes for outliers? And what are our options to deal with them? 

Solution: 

There are different types of outliers, which can have different causes.
They can arise from measurement or technical errors during data collection. This may be connected to a sharp cut-off in the range of measurements, which can lead to a high concentration of values at the artificial boundaries of an experiment. However, they may also reveal a true underlying effect in the data that we did not expect or account for. This can happen when we treat the measurements as coming from one distribution when in reality there are two underlying distributions. Lastly, the distribution might naturally have a high variance, which makes outliers or extreme values a regular part of it.

To deal with outliers, we first have to detect them. To decide which data points to declare outliers, we need a definition of a regular data point. Most of the time we assume that a normal distribution underlies the data (or a mixture distribution in which each cluster is normally distributed).
One option is to calculate the z-value of each data point (its distance from the mean in units of the standard deviation); data points with a high z-value are regarded as outliers. This can be made more robust by using the median instead of the mean and applying a different threshold.
The Rosner test goes one step further: it iteratively calculates z-values and removes the outliers found, until no more are found. This can be done one outlier at a time, or k outliers at a time for more efficiency.
A different idea is not to remove the outliers completely, but to weight them according to their z-values.
Lastly, an alternative to complete removal is to fill the resulting gaps with values that 'fit' the distribution better. (There are different possibilities to define those.)
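The difference between the mean-based z-value and a median-based variant can be sketched as follows. This is only an illustration under assumptions: the function names are hypothetical, and the scaling constant 0.6745 together with a threshold of about 3.5 is a common convention for the median-based score (usually attributed to Iglewicz and Hoaglin), not something prescribed by this sheet.

```python
import numpy as np

def z_scores(x):
    """Classical z-values: distance from the mean in units of the standard deviation."""
    x = np.asarray(x, dtype=float)
    return np.abs(x - x.mean()) / x.std()

def modified_z_scores(x):
    """Median-based variant: distance from the median in units of the scaled MAD.

    The constant 0.6745 is a conventional choice (an assumption here); it makes
    the score comparable to a z-value for normally distributed data.
    """
    x = np.asarray(x, dtype=float)
    med = np.median(x)
    mad = np.median(np.abs(x - med))  # median absolute deviation
    return 0.6745 * np.abs(x - med) / mad

# Small made-up sample: eight regular values and one extreme value.
sample = np.array([9.0, 10.0, 10.0, 11.0, 10.0, 9.0, 11.0, 10.0, 100.0])
z = z_scores(sample)
mz = modified_z_scores(sample)

print("largest z-value:         ", z.max())
print("largest modified z-value:", mz.max())
```

In this small sample the extreme value inflates the mean and the standard deviation so much that its own z-value stays below 3, while the median-based score flags it clearly. This masking effect is one reason to prefer robust statistics or an iterative procedure.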

### b)

In the following you find a stub for the implementation. The dataset is already generated. Now it is your turn to write the Rosner test and detect the outliers.

```python
import numpy as np
import matplotlib.pyplot as plt

# generate dataset: 100 regular points plus 6 extreme points
data = np.random.normal(50, 20, 100)
xtr_points = np.random.normal(-50, 10, 6)
data = np.concatenate((xtr_points, data))

plt.title('The Dataset')
plt.plot(data, 'x')  # just to check if everything is pretty

# now find the outliers!
threshold = 3  # z-value above which a point counts as an outlier
keep = np.ones(len(data), dtype=bool)  # True for points not (yet) removed

while True:
    # mean and standard deviation are computed on the remaining points only
    m = np.mean(data[keep])
    stdev = np.std(data[keep])
    zs = np.abs(data - m) / stdev
    zs[~keep] = 0  # ignore points that were already removed

    z_index = np.argmax(zs)
    if zs[z_index] <= threshold:
        break  # no remaining point exceeds the threshold
    keep[z_index] = False  # remove the most extreme point and repeat

# indices refer to the original array, so both groups line up in the plot
indices = np.arange(len(data))

# plot results
plt.figure(2)
plt.title('Rosner Result')
plt.plot(indices[keep], data[keep], 'bx', label='cleared data')
plt.scatter(indices[~keep], data[~keep], c='red', marker='x', label='outliers')

plt.legend(loc='lower right')
plt.show()
```

## Assignment 2: p-norm [5 Points]

A very well known norm is the Euclidean norm, which induces the Euclidean distance. However, it is not the only norm: it is just the special case $p = 2$ of the family of p-norms. In this assignment you will take a look at other p-norms and see how they behave.
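For reference, the p-norm of a vector $x$ is $\|x\|_p = \left(\sum_i |x_i|^p\right)^{1/p}$ for $p \geq 1$, and the limit $p \to \infty$ gives the maximum norm $\max_i |x_i|$. The following minimal sketch evaluates this definition for a few values of $p$; the function name `p_norm` and the example vector are chosen only for illustration.

```python
import numpy as np

def p_norm(x, p):
    """p-norm of a vector x for p >= 1; p = np.inf gives the maximum norm."""
    x = np.abs(np.asarray(x, dtype=float))
    if np.isinf(p):
        return x.max()
    return (x ** p).sum() ** (1.0 / p)

v = [3.0, 4.0]
for p in (1, 2, 3, np.inf):
    # np.linalg.norm with the same order serves as a cross-check
    print(p, p_norm(v, p), np.linalg.norm(v, p))
```

For the vector `[3, 4]` the norm shrinks from 7 ($p = 1$) over 5 ($p = 2$) towards 4 ($p = \infty$): the larger $p$ becomes, the more the largest component dominates.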

## Assignment 3: Expectation Maximization [10 Points]

In this exercise you will implement the Expectation Maximization (EM) algorithm in a slightly simplified version compared to the one presented in the lecture.

### c)

Describe in your own words: How does the EM-algorithm deal with the missing value problem?

In the EM algorithm, all known values are considered via their probability under the distribution. In the same way, the hidden (i.e. missing) values are treated as depending on the probability distribution and, additionally, on the known values.
The complete distribution can thus be seen as the product of two probability distributions (one for the known values, and one for the missing values given the known ones).
The EM algorithm searches for the parameters that maximize the log-likelihood. Since the log-likelihood depends on the missing values, those are averaged out. In an iterative procedure, the missing values are averaged over using the current parameter estimate (E-step), and the parameter estimate is then improved by maximizing the resulting expected log-likelihood (M-step).
This makes the parameter estimate converge to a local maximum, which is hopefully close to the real parameter value.
The principle in handling missing values here is not to try to recover them, but to stand in for them with values derived from the model given by the probability distribution. In the best case this does not distort information, but normally it does. However, it at least makes the existing values technically usable.
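The alternation of the two steps can be sketched for a simple case. The example below is an assumption-laden illustration, not the variant from the lecture or the one asked for in this assignment: it fits a one-dimensional mixture of two Gaussians, where the "missing values" are the unknown component labels of the data points. The function name `em_gmm_1d`, the initialisation and the fixed number of iterations are all hypothetical choices.

```python
import numpy as np

def em_gmm_1d(x, n_iter=50):
    """EM for a two-component 1D Gaussian mixture (illustrative sketch)."""
    x = np.asarray(x, dtype=float)
    # Crude initialisation from the data (an assumption made for this sketch).
    mu = np.array([x.min(), x.max()])
    sigma = np.array([x.std(), x.std()])
    pi = np.array([0.5, 0.5])

    for _ in range(n_iter):
        # E-step: average over the missing labels by computing, for every point,
        # the posterior probability of each component under the current parameters.
        dens = np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
        weighted = pi * dens
        resp = weighted / weighted.sum(axis=1, keepdims=True)

        # M-step: re-estimate the parameters, weighting each point by its posterior.
        nk = resp.sum(axis=0)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
        pi = nk / len(x)

    return pi, mu, sigma

# Made-up data: two clusters whose labels are not given to the algorithm.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 1, 200), rng.normal(10, 1, 200)])
weights, means, stds = em_gmm_1d(x)
print("weights:", weights)
print("means:  ", means)
print("stds:   ", stds)
```

Note that the labels are never reconstructed as hard values: each point only contributes to both components in proportion to its posterior probability, which is the "averaging out" described above.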