Osnabrück University - Machine Learning (Summer Term 2016) - Prof. Dr.-Ing. G. Heidemann, Ulf Krumnack

# Exercise Sheet 04: Clustering

## Introduction

This week's sheet should be solved and handed in before the end of **Sunday, May 8, 2016**. If you need help (and Google and other resources were not enough), feel free to contact your group's designated tutor or whichever of us you run into first. Please upload your results to your group's Stud.IP folder.

## Assignment 0.5: Distance Measures

Implement the four distance measures for clusters. Each function takes two clusters (each an $n \times 2$ numpy array) and should return a single scalar. The test cases below use three-dimensional points, so the functions should not depend on the number of columns. Always use the Euclidean distance in the following.
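For reference, the four measures are commonly defined as follows, where $\lVert \cdot \rVert$ is the Euclidean norm and $|X|$ is the number of points in cluster $X$. The subscripts mirror the function names used in the code cells of this assignment.

$$
\begin{aligned}
d_{\min}(X, Y) &= \min_{x \in X,\, y \in Y} \lVert x - y \rVert \\
d_{\max}(X, Y) &= \max_{x \in X,\, y \in Y} \lVert x - y \rVert \\
d_{\text{mean}}(X, Y) &= \frac{1}{|X|\,|Y|} \sum_{x \in X} \sum_{y \in Y} \lVert x - y \rVert \\
d_{\text{centroid}}(X, Y) &= \left\lVert \frac{1}{|X|} \sum_{x \in X} x - \frac{1}{|Y|} \sum_{y \in Y} y \right\rVert
\end{aligned}
$$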

In [None]:
import numpy as np

def d_min(X, Y):
    """
    Minimal distance between points of two clusters.
    X and Y are expected to be numpy arrays.
    """
    min_dist = float('inf')
    for x in np.array(X):
        for y in np.array(Y):
            dist_yx = np.sqrt(sum((x-y)**2))
            if dist_yx < min_dist:
                min_dist = dist_yx
    
    # Shorter solution using scipy (for future reference):
    # min_dist = np.min(scipy.spatial.distance.cdist(X,Y))
    return min_dist


X = np.array([[1,2,3],[4,5,6],[6,7,8]])
Y = np.array([[9,10,11],[12,13,14],[15,16,17]])
assert d_min(X,Y) == d_min(Y,X)
assert round(d_min(X,Y)) == 5.0
del X, Y # don't pollute global namespace

In [None]:
def d_max(X, Y):
    """
    Maximal distance between points of two clusters.
    X and Y are expected to be numpy arrays.
    """
    max_dist = 0
    for x in np.array(X):
        for y in np.array(Y):      
            dist_yx = np.sqrt(sum((x-y)**2))
            if dist_yx > max_dist:
                max_dist = dist_yx
    
    # Shorter solution using scipy (for future reference):
    # max_dist = np.max(scipy.spatial.distance.cdist(X,Y))
    return max_dist


X = np.array([[1,2,3],[4,5,6],[6,7,8]])
Y = np.array([[9,10,11],[12,13,14],[15,16,17]])
assert d_max(X,Y) == d_max(Y,X)
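# Distance of the farthest point of X from the origin (a cluster holding one zero vector).
assert round(d_max(X, np.zeros((1, X.shape[1])))) == 12.0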
assert round(d_max(X,Y)) == 24.0
del X, Y

In [None]:
def d_mean(X, Y):
    """
    Mean distance between points of two clusters.
    X and Y are expected to be numpy arrays.
    """
    mean_dist = 0
    for x in np.array(X):
        for y in np.array(Y):
            #calculate distance from y to x        
            dist_yx = np.sqrt(sum((x-y)**2))
            mean_dist = mean_dist + dist_yx
            
    return mean_dist/(len(X)*len(Y))


X = np.array([[1,2,3],[4,5,6],[6,7,8]])
Y = np.array([[9,10,11],[12,13,14],[15,16,17]])
# The summation order differs between the two calls, so compare with a tolerance.
assert np.isclose(d_mean(X,Y), d_mean(Y,X))
assert round(d_mean(X,Y)) == 14.0
del X, Y

In [None]:
def d_centroid(X, Y):
    """
    Distance between the centroids of two clusters.
    X and Y are expected to be numpy arrays.
    """
    # The centroid is the per-coordinate mean of a cluster's points.
    cent_X = np.mean(X, axis=0)
    cent_Y = np.mean(Y, axis=0)
    return np.sqrt(np.sum((cent_X - cent_Y)**2))


X = np.array([[1,2,3],[4,5,6],[6,7,8]])
Y = np.array([[9,10,11],[12,13,14],[15,16,17]])
assert d_centroid(X,Y) == d_centroid(Y,X)
assert round(d_centroid(X,Y)) == 14.0
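# Mean and centroid distance coincide here only because all points lie on one
# line and the two clusters do not interleave; this is not true in general.
assert np.isclose(d_mean(X,Y), d_centroid(Y,X))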
del X, Y
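The loops above can also be written without explicit iteration. The following is a minimal sketch using NumPy broadcasting only; the helper name `pairwise_distances` and the `*_vec` variables are illustrative and not part of the sheet.

```python
import numpy as np

def pairwise_distances(X, Y):
    """Euclidean distance between every row of X and every row of Y."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    # Broadcasting: (n, 1, d) - (1, m, d) gives an (n, m, d) array of differences.
    diff = X[:, None, :] - Y[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

X = np.array([[1, 2, 3], [4, 5, 6], [6, 7, 8]])
Y = np.array([[9, 10, 11], [12, 13, 14], [15, 16, 17]])
D = pairwise_distances(X, Y)  # shape (3, 3)

d_min_vec = D.min()    # minimal distance
d_max_vec = D.max()    # maximal distance
d_mean_vec = D.mean()  # mean distance
# The centroid distance needs the cluster means rather than the matrix D.
d_centroid_vec = np.linalg.norm(X.mean(axis=0) - Y.mean(axis=0))
```

The full distance matrix needs memory proportional to the product of the cluster sizes, which is fine for small clusters but may be too much for very large ones.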

## Assignment 1: Hierarchical Clustering

Implement single and complete linkage agglomerative clustering. Stop clustering when 5 clusters are found. Plot the results in a colorful scatter plot for each method.

In [None]:
def single_linkage(data):
    return

In [None]:
def complete_linkage(data):
    return

Experiment with different distance measures and describe how the results differ.
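One way to make such experiments easy is to write the merge loop once and pass the cluster distance in as a function. The following is a naive sketch under that assumption, not a reference solution: the names `agglomerate`, `linkage_min` and `linkage_max` are hypothetical, and the toy data is generated only for illustration.

```python
import numpy as np

def linkage_min(A, B):
    """Single linkage: smallest pairwise Euclidean distance."""
    return np.sqrt(((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)).min()

def linkage_max(A, B):
    """Complete linkage: largest pairwise Euclidean distance."""
    return np.sqrt(((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)).max()

def agglomerate(data, linkage, n_clusters=5):
    """Merge the two closest clusters until n_clusters remain.

    Returns one integer label per data point.
    """
    data = np.asarray(data, dtype=float)
    clusters = [[i] for i in range(len(data))]  # start with singletons
    while len(clusters) > n_clusters:
        best = (np.inf, None, None)
        # Exhaustive search for the closest pair of clusters.
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = linkage(data[clusters[a]], data[clusters[b]])
                if d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a].extend(clusters[b])
        del clusters[b]  # b > a, so index a stays valid
    labels = np.empty(len(data), dtype=int)
    for label, members in enumerate(clusters):
        labels[members] = label
    return labels

# Toy data: five well-separated blobs of eight points each.
rng = np.random.default_rng(0)
blob_centers = np.array([[0, 0], [10, 0], [0, 10], [10, 10], [5, 20]])
toy = np.vstack([c + rng.normal(scale=0.3, size=(8, 2)) for c in blob_centers])

labels_single = agglomerate(toy, linkage_min)
labels_complete = agglomerate(toy, linkage_max)
```

The returned labels can be passed as the `c` argument of matplotlib's `plt.scatter` to color the points. The exhaustive pair search recomputes every cluster distance in each round, so this sketch is only practical for small data sets.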

## Assignment 2: k-means Clustering

Implement k-means clustering. Plot the results for $k = 5$ and $k = 3$ in colorful scatter plots.

In [None]:
def kmeans(data, k):
    """
    Applies k-means clustering to the data (numpy array of shape (n, 2)) using k clusters.
    """

How could one handle situations in which one or more clusters end up containing no elements?
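One possible strategy, among several, is to move the center of an empty cluster to a randomly chosen data point; alternatives include dropping the cluster or re-seeding it at the point farthest from its current center. The sketch below shows the usual alternation of assignment and update steps with the random re-seeding variant. The name `kmeans_sketch` and the parameters `n_iter` and `seed` are assumptions made for illustration.

```python
import numpy as np

def kmeans_sketch(data, k, n_iter=100, seed=0):
    """k-means with random re-seeding of empty clusters.

    Returns (labels, centers); the labels come from the last assignment step.
    """
    data = np.asarray(data, dtype=float)
    rng = np.random.default_rng(seed)
    # Initialize the centers with k distinct data points.
    centers = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: index of the nearest center for every point.
        dists = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        # Update step: move every center to the mean of its members.
        new_centers = centers.copy()
        for j in range(k):
            members = data[labels == j]
            if len(members) == 0:
                # Empty cluster: restart this center at a random data point.
                new_centers[j] = data[rng.integers(len(data))]
            else:
                new_centers[j] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break  # converged
        centers = new_centers
    return labels, centers

# Toy data: three blobs of twenty points each.
rng = np.random.default_rng(1)
blob_centers = np.array([[0, 0], [8, 0], [4, 7]])
toy = np.vstack([c + rng.normal(scale=0.5, size=(20, 2)) for c in blob_centers])
labels, centers = kmeans_sketch(toy, k=3)
```

Because the result depends on the random initialization, running the function with several seeds and comparing the outcomes is a reasonable sanity check.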

Apply k-means clustering to the following image and describe (and plot) the results when using different values for $k$.
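The image itself is not reproduced here, so the following sketch runs on a random synthetic RGB array instead. It assumes that the task is meant as clustering in color space: every pixel is treated as a point with one coordinate per channel, and after clustering each pixel is replaced by the center of its cluster. The function name `quantize_colors` and its parameters are hypothetical.

```python
import numpy as np

def quantize_colors(image, k, n_iter=20, seed=0):
    """Cluster the pixels of an (H, W, C) image in color space.

    Returns the quantized image and an (H, W) map of cluster labels.
    """
    h, w, c = image.shape
    pixels = image.reshape(-1, c).astype(float)  # one row per pixel
    rng = np.random.default_rng(seed)
    centers = pixels[rng.choice(len(pixels), size=k, replace=False)]

    def assign(current_centers):
        # Index of the nearest center for every pixel.
        diff = pixels[:, None, :] - current_centers[None, :, :]
        return np.linalg.norm(diff, axis=-1).argmin(axis=1)

    for _ in range(n_iter):
        labels = assign(centers)
        for j in range(k):
            if np.any(labels == j):  # empty clusters keep their center
                centers[j] = pixels[labels == j].mean(axis=0)
    labels = assign(centers)  # final assignment for the updated centers
    quantized = centers[labels].reshape(h, w, c)
    return quantized, labels.reshape(h, w)

# Synthetic stand-in for the real image (assumption: 8-bit RGB values).
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(16, 16, 3))
quantized, label_map = quantize_colors(image, k=4)
```

With a real image, larger values of $k$ preserve more of the original colors, while small values produce a poster-like result. The intermediate difference array has one entry per pixel, cluster and channel, so large images may need to be processed in chunks.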