Osnabrück University - Machine Learning (Summer Term 2016) - Prof. Dr.-Ing. G. Heidemann, Ulf Krumnack

# Exercise Sheet 05

## Introduction

This sheet is meant to cover two weeks, so it should be solved and handed in by the end of **Sunday, May 22, 2016**. If you need help (and Google and other resources were not enough), feel free to contact your group's designated tutor or whoever of us you run into first. Please upload your results to your group's Stud.IP folder.

### SciPy and scikit-learn and playsound

From now on you will sometimes need the Python package [scipy](https://pypi.python.org/pypi/scipy). To check whether you already have a working version installed, run the following cell. If the cell fails with `scipy not found`, follow the instructions below to install it. Otherwise just skip the following paragraphs and continue with the assignments.

In [None]:
import importlib.util  # the util submodule has to be imported explicitly
assert importlib.util.find_spec('scipy') is not None, 'scipy not found'

On Unix systems you can easily install it with `pip3 install scipy` from any terminal window. If that fails, try to figure out how to install a Fortran compiler for your OS or ask one of the tutors for help.

On Windows it is a bit more difficult to get a Fortran compiler (although [MinGW](http://www.mingw.org/) offers one, it is still very difficult to get everything to run), so we recommend using the [precompiled binaries](http://www.lfd.uci.edu/~gohlke/pythonlibs/#scipy) provided by Christoph Gohlke. If you previously installed a 32-bit version of Python, download `scipy-0.17.0-cp35-none-win32.whl`; if you have a 64-bit version, use `scipy-0.17.0-cp35-none-win_amd64.whl` instead. If you are unsure which version you have, run the following cell to find out:

In [None]:
import platform
print('You are running a {} ({}) version.'.format(*platform.architecture()))

To install the binaries, open your command line, navigate to the folder into which you downloaded the `*.whl` file (`cd FOLDER`) and run `pip install scipy-0.17.0-cp35-none-win32.whl` (or `pip install scipy-0.17.0-cp35-none-win_amd64.whl` if you downloaded the 64-bit version). If you run into trouble, get in touch with us!

## Assignment 1: Curse of Dimensionality [X Points]

**For the following exercise, be detailed in your answers, give examples and incorporate the following concepts/keywords:
random vectors in high dimensional space, Bertillonage, manifold**

What is the curse of dimensionality and what are its implications for pattern classification?

*The curse of dimensionality describes the phenomena that arise in high-dimensional vector spaces, for example that two randomly drawn vectors will almost always be close to orthogonal to each other. This is a real problem in data mining, where with a growing number of features the number of possible feature combinations, and therefore the volume of the resulting feature space, increases exponentially.
In such a high-dimensional space, data vectors from real data sets lie far away from each other (which means dense sampling becomes impossible, as there are not enough samples close to each other). This also means that pairs of data vectors are very likely to have similar distances and to be close to orthogonal to each other. As a result, clustering and classification become really difficult, as the vectors more or less stand on their own and distance measures lose their discriminative power.*
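
The near-orthogonality of random vectors can be checked numerically. The following is only a minimal sketch: the helper `mean_abs_cosine`, the choice of Gaussian random vectors and the number of pairs are assumptions made for illustration, not part of the assignment.

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_abs_cosine(dim, n_pairs=2000):
    """Average |cos(angle)| between pairs of random Gaussian vectors in `dim` dimensions."""
    a = rng.standard_normal((n_pairs, dim))
    b = rng.standard_normal((n_pairs, dim))
    cos = np.sum(a * b, axis=1) / (np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))
    return np.mean(np.abs(cos))

# A value near 0 means the vectors are almost orthogonal on average.
for dim in (2, 10, 100, 1000):
    print(dim, round(mean_abs_cosine(dim), 3))
```

The printed averages should shrink as the dimension grows (roughly like $1/\sqrt{d}$), which is the effect described above.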


Explain how the phenomenon above could be used to one's advantage.

*This is actually an advantage if you want to discriminate between a large number of individuals (see Bertillonage, where using only 11 features results in a feature space big enough to discriminate humans), but if you want to extract usable information from data, such a 'singling out' of samples is a great disadvantage.*
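
To get a feeling for how quickly such a feature space grows, one can count the cells that result when each of the 11 features is split into a few bins. This is a back-of-the-envelope sketch; the helper `n_cells` and the bin counts are illustrative assumptions, not historical figures.

```python
def n_cells(bins, n_features=11):
    """Number of distinct cells when each of `n_features` features is split into `bins` bins."""
    return bins ** n_features

# Assumed bin counts, chosen only for illustration
for bins in (2, 3, 5, 10):
    print(bins, n_cells(bins))
```

Even with only three bins per feature there are already $3^{11} = 177147$ distinct cells.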


What are types of dimensionality that can be used to describe data?

*Intrinsic dimensionality exists in contrast to the descriptive dimensionality of data, which is defined by the number of parameters used to produce or represent the raw data (e.g. the number of pixels in an unprocessed image).*

*In addition to this descriptive dimensionality, there is also a (usually smaller) number of independent parameters that is necessary to describe the data, always with regard to the specific problem we want to use the data for. This is the intrinsic dimensionality.
For example: a data set might consist of a number of portraits, all with a size of 1920x1080 pixels, which constitutes their descriptive dimensionality. To do facial recognition on these portraits, however, we do not need the complete descriptive space (which would be far too big anyway), but only a few independent parameters (which we can get by doing PCA and looking at the eigenfaces).
This is possible because the data never fill out the entire high-dimensional vector space but instead concentrate along a manifold of much lower dimensionality.*
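
The difference between the two kinds of dimensionality can be illustrated with synthetic data. The sketch below assumes the simplest possible case, a flat (linear) two-dimensional manifold embedded in 50 dimensions; all names and sizes are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical toy data: 500 samples described by 50 numbers each
# (descriptive dimensionality 50), but generated from only 2 free parameters.
latent = rng.standard_normal((500, 2))
embedding = rng.standard_normal((2, 50))
data = latent @ embedding

# Singular values of the centered data: only as many are non-negligible
# as there are independent directions of variation.
centered = data - data.mean(axis=0)
singular_values = np.linalg.svd(centered, compute_uv=False)
intrinsic_dim = int(np.sum(singular_values > 1e-8 * singular_values[0]))

print(data.shape[1], intrinsic_dim)  # descriptive vs. intrinsic dimensionality: 50 2
```

For curved manifolds (as with real portraits) a linear method like this only gives an upper bound on the intrinsic dimensionality.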

## Assignment 2: Implement and apply PCA [X Points]

In this assignment you will implement PCA from the ground up and apply it to the `cars` dataset. This dataset consists of measurements taken on 97 different cars. The eleven features measured are: Suggested retail price (USD), Price to dealer (USD), Engine size (liters), Number of engine cylinders, Engine horsepower, City gas mileage, Highway gas mileage, Weight (pounds), Wheelbase (inches), Length (inches) and Width (inches). 

We would like to visualize these high-dimensional features to get a feeling for how the cars relate to each other, so we need to find a subspace of dimension two or three into which we can project the data.

In [None]:
%matplotlib notebook

import numpy as np
import matplotlib.pyplot as plt

# TODO: Load the cars dataset in cars.csv .
cars = np.loadtxt('cars.csv', delimiter=',')

assert cars.shape == (97, 11), "Shape is not (97, 11), was {}".format(cars.shape)

As a first step we need to normalize the data (why is that so?). Use the standard score for this:
$$\frac{X - \mu}{\sigma}$$

In [None]:
# TODO: Normalize the data and store it in cars_norm.
cars_norm = (cars - np.mean(cars, axis=0)) / np.std(cars, axis=0)

assert cars_norm.shape == (97, 11), "Shape is not (97, 11), was {}".format(cars_norm.shape)
assert np.abs(np.sum(cars_norm)) < 1e-10, "Absolute sum was {} but should be close to 0".format(np.abs(np.sum(cars_norm)))
assert abs(np.sum(cars_norm ** 2) / cars_norm.size - 1) < 1e-10, "The data is not normalized, sum/N was {} not 1".format(np.sum(cars_norm ** 2) / cars_norm.size)
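
One way to see why the normalization matters is to compare the leading principal direction with and without it. This is a sketch on made-up data: the features `price` and `engine`, their scales and the helper `leading_direction` are assumptions for illustration and are not taken from `cars.csv`.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical, correlated toy features on very different scales
engine = rng.normal(3.0, 1.0, size=200)                                  # e.g. liters
price = 30000 + 8000 * (engine - 3.0) + rng.normal(0, 6000, size=200)    # e.g. USD
X = np.column_stack([price, engine])

def leading_direction(data):
    """Eigenvector belonging to the largest eigenvalue of the covariance of `data`."""
    centered = data - data.mean(axis=0)
    eigenval, eigenvec = np.linalg.eigh(centered.T @ centered / len(data))
    return eigenvec[:, np.argmax(eigenval)]

raw_dir = leading_direction(X)
z_dir = leading_direction((X - X.mean(axis=0)) / X.std(axis=0))

print(np.abs(raw_dir))  # dominated almost entirely by the large-scale feature (price)
print(np.abs(z_dir))    # both features contribute equally after z-scoring
```

Without the standard score, the feature with the largest numerical range determines the first principal component simply because of its unit.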

PCA finds the subspace that maximizes the variance of the projected data by determining the eigenvectors of the covariance matrix. So we need to calculate the autocovariance matrix and afterwards its eigenvalues and eigenvectors. When the data is normalized, the autocovariance is calculated (up to a constant factor of $\frac{1}{m}$, which does not change the eigenvectors) as
$$C = X\cdot X^T$$
with $X$ being an $n \times m$ matrix with $n$ features and $m$ samples. Note that `cars_norm` stores the samples in rows, so it corresponds to $X^T$.
The entry $c_{i,j}$ in $C$ tells you how much feature $i$ correlates with feature $j$. 
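
The relation between this matrix and the correlation of the features can be verified on synthetic data. The sketch below uses a random stand-in for the cars data (97 samples in rows, 4 features in columns, both assumptions) and compares the result with NumPy's `np.corrcoef`.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical stand-in for the cars data: 97 samples (rows), 4 correlated features (columns)
data = rng.standard_normal((97, 4)) @ rng.standard_normal((4, 4))
data_norm = (data - data.mean(axis=0)) / data.std(axis=0)

m = data_norm.shape[0]
scatter = data_norm.T @ data_norm        # unscaled matrix; samples are in rows here
corr = np.corrcoef(data, rowvar=False)   # NumPy's correlation matrix of the raw data

# Dividing by the number of samples yields exactly the correlation matrix.
matches_corr = np.allclose(scatter / m, corr)

# The factor 1/m only rescales the eigenvalues.
val_scatter = np.linalg.eigvalsh(scatter)
val_corr = np.linalg.eigvalsh(corr)
eigenvalues_scale = np.allclose(val_scatter / m, val_corr)

print(matches_corr, eigenvalues_scale)
```

Since the factor only rescales the eigenvalues, the principal directions are the same with or without it.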

In [None]:
# TODO: Compute the autocovariance matrix and store it into autocovar
autocovar = cars_norm.T @ cars_norm

assert autocovar.shape == (11, 11)

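# TODO: Compute the eigenvalues and eigenvectors and store them into eigenval and eigenvec
#       (Figure out a function to do this for you)
eigenval, eigenvec = np.linalg.eig(autocovar)
# eig does not guarantee any ordering, so sort by descending eigenvalue to make
# sure the first columns of eigenvec are the leading principal components.
order = np.argsort(eigenval)[::-1]
eigenval, eigenvec = eigenval[order], eigenvec[:, order]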

assert eigenval.shape == (11,)
assert eigenvec.shape == (11, 11)

Now you should have a matrix full of eigenvectors. We can now do two things: project the data down into the two-dimensional subspace to visualize it, and plot the first two principal component vectors as eleven two-dimensional points to get a feeling for how the features are projected into the subspace. Execute the cells below and describe what you see. Is PCA a good method for this problem? Was it justifiable that we only considered the first two principal components? What kinds of cars are in the four quadrants of the first plot?

In [None]:
# Project the data down into the two dimensional subspace
proj = cars_norm @ eigenvec[:,0:2]

# Plot projected data
fig = plt.figure('Data projected onto first two Principal Components')
fig.gca().set_xlim(-8, 8)
fig.gca().set_ylim(-4, 7)
plt.scatter(proj[:,0], proj[:,1])
# Divide plot into quadrants
plt.axhline(0, color='green')
plt.axvline(0, color='green')


# Plot eigenvectors
plt.figure('Eigenvector plot')
plt.scatter(eigenvec[:,0], eigenvec[:,1])

# add labels
labels = ['Suggested retail price (USD)', 'Price to dealer (USD)', 
          'Engine size (liters)', 'Number of engine cylinders', 
          'Engine horsepower', 'City gas mileage' , 
          'Highway gas mileage', 'Weight (pounds)', 
          'Wheelbase (inches)', 'Length (inches)', 'Width (inches)']
for label, x, y in zip(labels, eigenvec[:,0], eigenvec[:,1]):
    plt.annotate(
        label, xy = (x, y), xytext = (-20, 20),
        textcoords = 'offset points', ha = 'left', va = 'bottom',
        bbox = dict(boxstyle = 'round,pad=0.5', fc = 'blue', alpha = 0.5),
        arrowprops = dict(arrowstyle = '->', connectionstyle = 'arc3,rad=0'))
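
Whether two principal components are enough can be judged from the share of the total variance they explain. The following sketch uses synthetic stand-in data (97 samples, 11 features driven mainly by two latent factors, all assumptions); with the real data one would apply the last three steps to the `eigenval` array computed earlier.

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical stand-in data: 97 samples, 11 features driven mostly by 2 latent factors
latent = rng.standard_normal((97, 2)) @ rng.standard_normal((2, 11))
data = latent + 0.3 * rng.standard_normal((97, 11))
data_norm = (data - data.mean(axis=0)) / data.std(axis=0)

# Eigenvalues of the symmetric matrix, reversed into descending order
eigenval = np.linalg.eigvalsh(data_norm.T @ data_norm)[::-1]

explained = eigenval / eigenval.sum()   # share of the variance per component
cumulative = np.cumsum(explained)       # share captured by the first k components

print(np.round(cumulative[:3], 3))
```

If the first two entries of `cumulative` are already close to 1, projecting onto two components loses little of the variance.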

## Assignment 3: Whitening and Reconstruction of Data [X Points]

## Assignment 4: Eigenfaces [X Points]

## Assignment 5: Theory [X Points]