Osnabrück University - Machine Learning (Summer Term 2016) - Prof. Dr.-Ing. G. Heidemann, Ulf Krumnack

# Exercise Sheet 04: Clustering

## Introduction

This week's sheet should be solved and handed in before the end of **Sunday, May 8, 2016**. If you need help (and Google and other resources were not enough), feel free to contact your group's designated tutor or whichever of us you run into first. Please upload your results to your group's Stud.IP folder.

### SciPy

From now on you will sometimes need the Python package [scipy](https://pypi.python.org/pypi/scipy). To check whether you already have a working version installed, run the following cell. If it fails with the message `scipy not found`, follow the instructions below to install it. Otherwise skip the following paragraphs and continue with the assignments.

In [None]:
import importlib.util  # the util submodule is not guaranteed to be loaded by a bare `import importlib`
assert importlib.util.find_spec('scipy') is not None, 'scipy not found'

On Unix systems you can install it with `pip3 install scipy` from any terminal window. If that fails, try to figure out how to install a Fortran compiler for your OS or ask one of your tutors for help.

On Windows it is more difficult to get a Fortran compiler (although [MinGW](http://www.mingw.org/) offers one, it is still very difficult to get everything to run), so we recommend the [precompiled binaries](http://www.lfd.uci.edu/~gohlke/pythonlibs/#scipy) by Christoph Gohlke. If you installed a 32-bit version of Python, download `scipy-0.17.0-cp35-none-win32.whl`; if you have a 64-bit version, use `scipy-0.17.0-cp35-none-win_amd64.whl`. If you are unsure which version you have, run the following cell to find out:

In [None]:
import platform
print('You are running a {} ({}) version.'.format(*platform.architecture()))

To install the binaries, open your command line, navigate to the folder where you downloaded the `*.whl` file (`cd FOLDER`) and run `pip install scipy-0.17.0-cp35-none-win32.whl` (or `pip install scipy-0.17.0-cp35-none-win_amd64.whl` if you downloaded the 64-bit version). If you run into trouble, get in touch with us!

## Assignment 0.5: Distance Measures

Implement the four different kinds of distance measures for clusters. Each function takes two clusters (each an $n \times 2$ numpy array) and should return a single scalar. Note that the test cells below use three-dimensional points, so do not hard-code the dimension. Always use the Euclidean distance in the following!

In [None]:
import numpy as np

def d_min(X, Y):
    # Minimum Euclidean distance over all pairs (x, y) with x in X, y in Y.
    min_dist = float('inf')
    for x in np.asarray(X):
        for y in np.asarray(Y):
            dist_xy = np.sqrt(np.sum((x - y) ** 2))
            if dist_xy < min_dist:
                min_dist = dist_xy

    # Shorter alternative (requires `from scipy.spatial import distance`):
    # min_dist = np.min(distance.cdist(X, Y))
    return min_dist

In [None]:
X = np.array([[1,2,3],[4,5,6],[6,7,8]])
Y = np.array([[9,10,11],[12,13,14],[15,16,17]])

assert d_min(X,Y) == d_min(Y,X)
assert round(d_min(X,Y)) == 5.0

In [None]:
def d_max(X, Y):
    # Maximum Euclidean distance over all pairs (x, y) with x in X, y in Y.
    max_dist = 0
    for x in np.asarray(X):
        for y in np.asarray(Y):
            dist_xy = np.sqrt(np.sum((x - y) ** 2))
            if dist_xy > max_dist:
                max_dist = dist_xy

    # Shorter alternative (requires `from scipy.spatial import distance`):
    # max_dist = np.max(distance.cdist(X, Y))
    return max_dist

In [None]:
X = np.array([[1,2,3],[4,5,6],[6,7,8]])
Y = np.array([[9,10,11],[12,13,14],[15,16,17]])

assert d_max(X,Y) == d_max(Y,X)
assert round(d_max(X,[0]*len(X))) == 12.0
assert round(d_max(X,Y)) == 24.0

In [None]:
def d_mean(X, Y):
    # Mean Euclidean distance over all pairs (x, y) with x in X, y in Y.
    total_dist = 0
    for x in np.asarray(X):
        for y in np.asarray(Y):
            total_dist += np.sqrt(np.sum((x - y) ** 2))

    return total_dist / (len(X) * len(Y))

In [None]:
X = np.array([[1,2,3],[4,5,6],[6,7,8]])
Y = np.array([[9,10,11],[12,13,14],[15,16,17]])

# Symmetry check for d_mean itself; isclose tolerates the different summation order.
assert np.isclose(d_mean(X,Y), d_mean(Y,X))
assert round(d_mean(X,Y)) == 14.0

In [None]:
def d_centroid(X, Y):
    # Euclidean distance between the centroids (column-wise means) of X and Y.
    cent_X = np.mean(X, axis=0)
    cent_Y = np.mean(Y, axis=0)
    return np.sqrt(np.sum((cent_X - cent_Y) ** 2))

In [None]:
X = np.array([[1,2,3],[4,5,6],[6,7,8]])
Y = np.array([[9,10,11],[12,13,14],[15,16,17]])

assert d_centroid(X,Y) == d_centroid(Y,X)
assert round(d_centroid(X,Y)) == 14.0
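
The four measures above can also be computed without explicit loops. The following is a minimal sketch, not part of the original sheet, that builds the full matrix of pairwise distances with NumPy broadcasting. The helper `pairwise_distances` and the `*_vec` names are hypothetical and chosen only for illustration; `scipy.spatial.distance.cdist(X, Y)` should produce the same matrix.

```python
import numpy as np

def pairwise_distances(X, Y):
    """Euclidean distance between every row of X and every row of Y."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    # Broadcasting: (n, 1, d) - (1, m, d) -> (n, m, d); the norm is taken over d.
    return np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=2)

def d_min_vec(X, Y):
    return pairwise_distances(X, Y).min()

def d_max_vec(X, Y):
    return pairwise_distances(X, Y).max()

def d_mean_vec(X, Y):
    return pairwise_distances(X, Y).mean()

def d_centroid_vec(X, Y):
    # Distance between the column-wise means of the two clusters.
    return np.linalg.norm(np.mean(X, axis=0) - np.mean(Y, axis=0))

# Same test data as in the cells above.
X = np.array([[1, 2, 3], [4, 5, 6], [6, 7, 8]])
Y = np.array([[9, 10, 11], [12, 13, 14], [15, 16, 17]])
```

The distance matrix needs memory proportional to $n \cdot m \cdot d$, so for very large clusters the loop version or a chunked computation may be preferable.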

## Assignment 1: Hierarchical Clustering

Implement single-linkage and complete-linkage agglomerative clustering. Stop clustering when 5 clusters are found. Plot the results in a colorful scatter plot for each method.

In [None]:
def single_linkage(data):
    return

In [None]:
def complete_linkage(data):
    return

Experiment with different distance measures and describe how the results differ.
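
One way to make the distance measure exchangeable is to pass it in as a parameter. The sketch below is only one possible approach, not the reference solution: the names `agglomerative`, `pairwise` and the `linkage` parameter are assumptions made for illustration, and the toy data is synthetic. `linkage` receives the matrix of point-to-point distances between two clusters and reduces it to a scalar, so `np.min` gives single linkage and `np.max` gives complete linkage.

```python
import numpy as np

def pairwise(A, B):
    """Euclidean distances between all rows of A and all rows of B."""
    return np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)

def agglomerative(data, n_clusters, linkage):
    """Merge the two closest clusters until n_clusters remain.

    Returns one integer label per data point.
    """
    data = np.asarray(data, dtype=float)
    clusters = [[i] for i in range(len(data))]  # start with one cluster per point
    D = pairwise(data, data)                    # point distances, computed once
    while len(clusters) > n_clusters:
        best = (np.inf, None, None)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Distances between all members of cluster a and cluster b.
                d = linkage(D[np.ix_(clusters[a], clusters[b])])
                if d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] += clusters[b]  # merge b into a
        del clusters[b]             # b > a, so index a stays valid
    labels = np.empty(len(data), dtype=int)
    for label, members in enumerate(clusters):
        labels[members] = label
    return labels

# Synthetic toy data: five well-separated blobs of six points each.
rng = np.random.default_rng(0)
centers = np.array([[0, 0], [10, 0], [0, 10], [10, 10], [5, 20]])
points = np.vstack([c + 0.5 * rng.standard_normal((6, 2)) for c in centers])

single = agglomerative(points, 5, np.min)    # single linkage
complete = agglomerative(points, 5, np.max)  # complete linkage
```

The resulting labels can be passed directly as colors to a scatter plot, for example `plt.scatter(points[:, 0], points[:, 1], c=single)` with `matplotlib.pyplot` imported as `plt`. The search over all cluster pairs is deliberately naive and becomes slow for large data sets.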

## Assignment 2: k-means Clustering
Implement k-means clustering. Plot the results for $k = 5$ and $k = 3$ in colorful scatter plots.

In [None]:
def kmeans(data, k):
    """
    Applies k-means clustering to the data (numpy array of shape (n, 2)) using k initial clusters.
    """

How could one handle situations when one or more clusters end up containing 0 elements?
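
One possible answer, sketched here under stated assumptions, is to re-seed an empty cluster with the data point that currently lies farthest from its assigned center. Other options include dropping the cluster or restarting with a new initialization. The function name `kmeans_sketch` and the parameters `n_iter` and `seed` are hypothetical and not part of the sheet.

```python
import numpy as np

def kmeans_sketch(data, k, n_iter=100, seed=0):
    """Lloyd's algorithm with re-seeding of empty clusters.

    Returns (centers, labels); labels come from the last assignment step.
    """
    data = np.asarray(data, dtype=float)
    rng = np.random.default_rng(seed)
    # Initialize the centers with k distinct rows of the data.
    centers = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: index of the nearest center for every point.
        dists = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each center to the mean of its members.
        new_centers = centers.copy()
        for j in range(k):
            members = data[labels == j]
            if len(members) > 0:
                new_centers[j] = members.mean(axis=0)
            else:
                # Empty cluster: re-seed with the point farthest from its center.
                far = dists[np.arange(len(data)), labels].argmax()
                new_centers[j] = data[far]
        if np.allclose(new_centers, centers):
            break  # centers no longer move
        centers = new_centers
    return centers, labels

# Duplicate points can leave a cluster empty right after initialization.
toy = np.array([[0, 0], [0, 0], [10, 10], [10, 10]])
toy_centers, toy_labels = kmeans_sketch(toy, 2)
```

With the duplicated toy points, both initial centers may coincide, so one cluster starts out empty and is re-seeded; the two centers still end up at the two distinct locations.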

Apply k-means clustering to the following image and describe (and plot) the results for different values of $k$.
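
To cluster an image, each pixel can be treated as one data point in color space. The sketch below assumes the image is available as an `H x W x C` numpy array; the function name `quantize_colors`, its parameters, and the synthetic two-color test image are illustrative assumptions, and the compact clustering loop simply leaves empty clusters unchanged.

```python
import numpy as np

def quantize_colors(image, k, n_iter=20, seed=0):
    """Replace every pixel of an H x W x C image by its cluster center."""
    h, w, c = image.shape
    pixels = image.reshape(-1, c).astype(float)  # one row per pixel
    rng = np.random.default_rng(seed)
    # Start from k distinct colors (requires at least k distinct colors).
    colors = np.unique(pixels, axis=0)
    centers = colors[rng.choice(len(colors), size=k, replace=False)]
    for _ in range(n_iter):
        dists = np.linalg.norm(pixels[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):  # empty clusters keep their old center
                centers[j] = pixels[labels == j].mean(axis=0)
    # Map each pixel to its center and restore the image shape.
    return centers[labels].reshape(h, w, c), labels.reshape(h, w)

# Synthetic stand-in image: left half red, right half blue.
image = np.zeros((4, 6, 3), dtype=np.uint8)
image[:, :3] = (255, 0, 0)
image[:, 3:] = (0, 0, 255)
quantized, label_image = quantize_colors(image, k=2)
```

Smaller values of $k$ merge similar colors into larger flat regions, while larger values preserve more detail. The label image can be displayed with `plt.imshow(label_image)` if `matplotlib.pyplot` is imported as `plt`.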