Osnabrück University - Machine Learning (Summer Term 2016) - Prof. Dr.-Ing. G. Heidemann, Ulf Krumnack

# Exercise Sheet 04: Clustering

## Introduction

This week's sheet should be solved and handed in before the end of **Sunday, May 8, 2016**. If you need help (and Google and other resources were not enough), feel free to contact your group's designated tutor or whichever of us you run into first. Please upload your results to your group's studip folder.

In the following tasks we will be relying on NumPy. The import below puts it in global scope as `np`, so after executing the following cell we can use functions such as `np.array` and `np.sqrt`. Check out the [NumPy Reference](http://docs.scipy.org/doc/numpy/reference/index.html) and especially search it using e.g. [Google Site Search](https://www.google.de/search?q=array+site%3Adocs.scipy.org%2Fdoc%2Fnumpy)!

In [None]:
import numpy as np

## Assignment 0.5: Distance Measures [4 Points]

Implement the four different kinds of distance measurements for clusters. Each function takes two clusters (each an $n \times 2$ numpy array) and should return a single scalar. Always use the Euclidean distance in the following!

In [None]:
def pdist2(X, Y):
    """
    Pairwise distance between all points of two datasets.
    X and Y are expected to be numpy arrays of size n*o and m*o, respectively,
    n and m being the number of observations in the first dimension of each set.
    """
    # Treat a single point (1D array) as a dataset with one observation.
    X, Y = np.atleast_2d(X), np.atleast_2d(Y)
    dist = np.empty((X.shape[0], Y.shape[0]))
    for xi, xv in enumerate(X):
        for yi, yv in enumerate(Y):
            dist[xi, yi] = np.sqrt(np.sum((xv - yv) ** 2))
    return dist
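The double loop above can also be written with NumPy broadcasting. The following is a minimal sketch of that idea; the name `pdist2_vectorized` is hypothetical and not part of the sheet, and it is only meant to illustrate the shapes involved.

```python
import numpy as np

def pdist2_vectorized(X, Y):
    """Pairwise Euclidean distances via broadcasting (hypothetical helper)."""
    X = np.atleast_2d(X)
    Y = np.atleast_2d(Y)
    # diff has shape (n, m, o): every row of X minus every row of Y.
    diff = X[:, np.newaxis, :] - Y[np.newaxis, :, :]
    return np.sqrt(np.sum(diff ** 2, axis=2))

X = np.array([[0.0, 0.0], [3.0, 4.0]])
Y = np.array([[0.0, 0.0], [6.0, 8.0], [3.0, 0.0]])
D = pdist2_vectorized(X, Y)
print(D.shape)  # (2, 3): one row per point in X, one column per point in Y
print(D)
```

The intermediate array needs memory proportional to n * m * o, so for very large datasets the loop or a chunked variant may be preferable.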

In [None]:
def d_mean(X, Y):
    """
    Mean distance between points of two clusters.
    X and Y are expected to be numpy arrays.
    """
    mean_dist = 0
    for x in np.array(X):
        for y in np.array(Y):
            #calculate distance from y to x        
            dist_yx = np.sqrt(sum((x-y)**2))
            mean_dist = mean_dist + dist_yx
            
    return mean_dist/(len(X)*len(Y))


X = np.array([[1,2,3],[4,5,6],[6,7,8]])
Y = np.array([[9,10,11],[12,13,14],[15,16,17]])
assert np.isclose(d_mean(X,Y), d_mean(Y,X))
assert round(d_mean(X,Y)) == 14.0
del X, Y

In [None]:
def d_centroid(X, Y):
    """
    Distance between the centroids of two clusters.
    X and Y are expected to be numpy arrays.
    """
    cent_X = sum(X)/len(X)
    cent_Y = sum(Y)/len(Y)
    return np.sqrt(sum((cent_X-cent_Y)**2))


X = np.array([[1,2,3],[4,5,6],[6,7,8]])
Y = np.array([[9,10,11],[12,13,14],[15,16,17]])
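assert np.isclose(d_centroid(X,Y), d_centroid(Y,X))
assert round(d_centroid(X,Y)) == 14.0
# Mean and centroid distance coincide here only because all pairwise
# differences in this test data point in the same direction.
assert np.isclose(d_mean(X,Y), d_centroid(Y,X))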
del X, Y
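In the test data above the mean distance and the centroid distance happen to agree, but in general they are different measures. A small sketch with made-up points (not from the sheet) where the two clusters share a centroid illustrates the difference:

```python
import numpy as np

# Two clusters that share the same centroid (1, 0).
X = np.array([[0.0, 0.0], [2.0, 0.0]])
Y = np.array([[1.0, 1.0], [1.0, -1.0]])

# All pairwise differences, shape (len(X), len(Y), 2).
diff = X[:, None, :] - Y[None, :, :]
mean_distance = np.sqrt((diff ** 2).sum(axis=2)).mean()
centroid_distance = np.linalg.norm(X.mean(axis=0) - Y.mean(axis=0))

print(mean_distance)      # every pair is sqrt(2) apart
print(centroid_distance)  # the centroids coincide
```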

## Assignment 1: Hierarchical Clustering [6 Points]

In the following you find implementations for single- and complete-linkage clustering. Take a look at the code (this might also help if you get stuck on k-means) and answer the question posed below. You may of course change parameters and try it out on different datasets.

In [None]:
def linkage(data, k=5, complete=False):
    # Initially all points are their own cluster.
    labels = np.arange(len(data))

    # Calculate distance between all points.
    # Keep only the lower triangle because the
    # matrix is symmetric about the diagonal.
    dst = np.tril(pdist2(data, data))

    while len(set(labels)) > k:
        # Get the lowest distance of two points which
        # do not have the same label.
        r,c = np.where(dst==np.min(dst[dst>0]))
        
        # Ignore the case when there are multiple with
        # equally smallest distance.
        r = r[0]
        c = c[0]

        # The two points are now in the same cluster,
        # so they have a distance of 0 now.
        dst[r,c] = 0

        # Make the two clusters have the same label.
        labels[labels==labels[r]] = labels[c]

        # Check if we want to do complete linkage clustering.
        if complete:
            # Update the distances of the points which are not in the same cluster.
            for i in np.nonzero(dst[r,:]>0)[0]:
                dst[r,i] = np.max(pdist2(data[[i],:], data[labels==labels[r],:]))

            # The distances to c are now the same as to r, so we can just
            # set them to zero - would be duplicates otherwise.
            dst[c,c+1:] = 0

    return labels

In [None]:
%matplotlib notebook
import numpy as np
import matplotlib.pyplot as plt

# Read the data.
data = np.loadtxt('points.txt')

# Show unprocessed data set.
fig_cluster = plt.figure('Unprocessed Cluster Data')
plt.scatter(data[:,0], data[:,1])
fig_cluster.canvas.draw()

# Apply Single Linkage Clustering
labels = linkage(data, k=5, complete=False)
fig_single = plt.figure('Single-linkage Clustering with k=5')
plt.scatter(data[:,0], data[:,1], c=labels)
fig_single.canvas.draw()

# Apply Complete Linkage Clustering
labels = linkage(data, k=5, complete=True)
fig_complete = plt.figure('Complete-linkage Clustering with k=5')
plt.scatter(data[:,0], data[:,1], c=labels)
fig_complete.canvas.draw()

What is the difference between single- and complete-linkage clustering and which is the better solution given the dataset?
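As a starting point for the question, recall that both criteria are computed from the same pairwise distances between two clusters: single linkage uses the smallest one, complete linkage the largest one. The sketch below shows this on made-up points; `cluster_distances` is a hypothetical helper, not part of the provided `linkage` implementation.

```python
import numpy as np

def cluster_distances(A, B):
    """Return (single, complete) linkage distances between clusters A and B."""
    diff = A[:, None, :] - B[None, :, :]
    pairwise = np.sqrt((diff ** 2).sum(axis=2))
    # Single linkage: closest pair. Complete linkage: farthest pair.
    return pairwise.min(), pairwise.max()

A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[2.0, 0.0], [5.0, 0.0]])
d_single, d_complete = cluster_distances(A, B)
print(d_single, d_complete)
```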

## Assignment 2: k-means Clustering [10 Points]
Implement k-means clustering. Plot the results for $k = 5$ and $k = 3$ in colorful scatter plots.

In [None]:
def kmeans(data, k = 3):
    """
    Applies kmeans clustering to the data using k initial clusters.
    data is expected to be a numpy array of size n*2, 
    n being the amount of observations in the data.
    """
    # Initial centroids are k distinct random samples from the data.
    centroids = data[np.random.choice(data.shape[0], k, replace=False)]
    old_centroids = np.zeros(centroids.shape)
    
    # Initial labels are placeholders; they are overwritten in the first iteration.
    labels = np.zeros(data.shape[0], dtype=int)
    
    # Let's keep count of our iterations to avoid infinite loops.
    iterations = 0
    
    while np.any(np.abs(centroids - old_centroids) > np.finfo(float).eps) and iterations < 1000:
        # Keep count of iterations and remember current centroids for change calculation.
        iterations = iterations + 1
        old_centroids = np.copy(centroids)
        
        # Calculate new labels. Labels are the index of their minimal distance to any centroid.
        labels = np.argmin(pdist2(centroids, data), axis=0)
        
        # Update centroids using the new cluster labels.
        for label in range(k): 
            # Check for empty clusters.
            if np.any(labels == label):
                # Cluster is not empty, move its centroid to new mean.
                centroids[label, :] = np.mean(data[labels == label], axis=0)
            else:
                # Cluster is empty, set its centroid to the furthest outlier,
                # i.e. the point with the largest distance to its nearest centroid.
                blacksheep = np.argmax(np.min(pdist2(centroids, data), axis=0))
                centroids[label, :] = data[blacksheep, :]

    # Return labels and centroids for pretty plotting.
    return (labels, centroids)

In [None]:
data = np.loadtxt('clusterData.txt')
labels, centroids = kmeans(data, 5)

kmeans_fig = plt.figure('k-means with k=5')
plt.scatter(data[:,0], data[:,1], c=labels)
# Draw each centroid with an area proportional to its cluster size.
plt.scatter(centroids[:,0], centroids[:,1], 
            c=np.arange(len(centroids)), alpha=.1, marker='o',
            s=np.bincount(labels, minlength=len(centroids))*100)
kmeans_fig.canvas.draw()

How could one handle situations when one or more clusters end up containing 0 elements?

Apply kmeans clustering to the following image and describe (and plot) the results when using different values for $k$.
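To cluster an image, each pixel can be treated as one observation whose features are its colour channels. The sketch below only shows the reshaping steps on a tiny synthetic image; the image values and the two fixed centroids are made up and stand in for the real image and for the output of a clustering run.

```python
import numpy as np

# Synthetic 2x2 RGB "image" standing in for the real one (values in [0, 1]).
image = np.array([[[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]],
                  [[0.0, 0.0, 1.0], [0.1, 0.0, 0.9]]])

# Flatten to one row per pixel so a clustering routine sees an n x 3 dataset.
pixels = image.reshape(-1, 3)

# Placeholder for the result of a clustering run: two fixed colour centroids.
centroids = np.array([[0.95, 0.05, 0.0], [0.05, 0.0, 0.95]])
dists = np.linalg.norm(pixels[:, None, :] - centroids[None, :, :], axis=2)
labels = dists.argmin(axis=1)

# Replace each pixel by the colour of its centroid and restore the shape.
quantized = centroids[labels].reshape(image.shape)
print(labels)
print(quantized.shape)
```

With a real image, `quantized` could be shown with `plt.imshow` for different values of $k$ to compare the results.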

## Assignment  : Soft Clustering with Gaussian Mixture [? Points]

In this assignment you will calculate the update rules for a Gaussian Mixture model required for the M-step of the EM algorithm. The Gaussian Mixture model can be used for soft clustering since it allows us to express varying degrees of certainty about the membership of individual samples. It is one of the most widely used models since Gaussian distributions generally have the property of fitting all different kinds of data reasonably well.

A mixture model with $\mathbf{K}$ components is in general of the form:

$$ p(\mathbf{x}_i|\mathbf{\theta}) = \sum_{k=1}^K\pi_kp_k(\mathbf{x}_i|\mathbf{\theta})$$
where $\sum_{k=1}^K\pi_k = 1$.
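As an illustration of this formula, the following sketch evaluates a mixture density numerically. It assumes one-dimensional Gaussian components; the helper names and the parameter values are made up for the example.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma2):
    """Density of a 1D Gaussian with mean mu and variance sigma2."""
    return np.exp(-(x - mu) ** 2 / (2.0 * sigma2)) / np.sqrt(2.0 * np.pi * sigma2)

def mixture_pdf(x, weights, mus, sigma2s):
    """Weighted sum of the component densities, evaluated at every x."""
    x = np.asarray(x, dtype=float)[..., None]  # extra axis to broadcast over k
    return np.sum(weights * gaussian_pdf(x, mus, sigma2s), axis=-1)

weights = np.array([0.3, 0.7])   # pi_k, must sum to 1
mus = np.array([-2.0, 1.0])
sigma2s = np.array([1.0, 0.25])

grid = np.linspace(-10.0, 10.0, 20001)
density = mixture_pdf(grid, weights, mus, sigma2s)
# A Riemann sum over a wide grid should be close to 1.
integral = density.sum() * (grid[1] - grid[0])
print(integral)
```

Because the weights sum to one and each component integrates to one, the mixture is again a valid density.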

This means that the probability of observing a data point $\mathbf{x}_i$ given the parameter vector $\mathbf{\theta}$ can be expressed as the sum of $\mathbf{K}$ individual distributions $p_k$ which are weighted by respective class probabilities $\pi_k$. We can now choose distributions for $p_k$ and $\pi_k$ and we get a whole collection of different possible models, each of which has its own advantages and disadvantages (you can check [Wikipedia](https://en.wikipedia.org/wiki/Mixture_model) if you want an overview). The easiest case is where our mixing distributions are normal, $p_k = \mathcal{N}(\mu_k,\sigma_k^2)$, and our latent class probabilities form a discrete distribution with $\pi_k \in [0,1]$ and the constraint $\sum_k\pi_k=1$.

If we were to randomly pick values for the coefficient vector $\theta$ then we would now have a generative model that can produce naturally clustered data for us; we would just have to sample $x \sim p(\mathbf{x}_i|\mathbf{\theta})$. However, we want to go in the opposite direction and figure out what the distribution of the labels given the data is. This can be calculated easily with Bayes' theorem:

$$p(z_i=k|\mathbf{x}_i,\mathbf{\theta})=\frac{p(z_i=k|\mathbf{\theta})p(\mathbf{x}_i|z_i=k,\mathbf{\theta})}{\sum_{k'=1}^Kp(z_i=k'|\mathbf{\theta})p(\mathbf{x}_i|z_i=k',\mathbf{\theta})} = \frac{\pi_kp_k(\mathbf{x}_i|\mathbf{\theta})}{\sum_{k'=1}^K\pi_{k'}p_{k'}(\mathbf{x}_i|\mathbf{\theta})}$$
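The posterior above can be computed directly once the components are fixed. The sketch below assumes one-dimensional Gaussian components; the function name `responsibilities` and the parameter values are illustrative choices, not part of the sheet.

```python
import numpy as np

def responsibilities(x, weights, mus, sigma2s):
    """Posterior p(z_i = k | x_i, theta) for a 1D Gaussian mixture, shape (N, K)."""
    x = np.asarray(x, dtype=float)[:, None]
    dens = np.exp(-(x - mus) ** 2 / (2.0 * sigma2s)) / np.sqrt(2.0 * np.pi * sigma2s)
    joint = weights * dens  # numerator: pi_k * p_k(x_i)
    # Denominator: sum of the numerators over all components k'.
    return joint / joint.sum(axis=1, keepdims=True)

x = np.array([-2.0, 0.0, 2.0])
r = responsibilities(x, weights=np.array([0.5, 0.5]),
                     mus=np.array([-2.0, 2.0]), sigma2s=np.array([1.0, 1.0]))
print(r.sum(axis=1))  # each row sums to 1
print(r[1])           # the midpoint is equally likely under both components
```

Each row is a soft assignment of one sample to the $K$ components, which is what makes the model usable for soft clustering.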

That sounds good enough, but where do we actually start now? We have a mathematical framework pinned down, but it contains many variables and it is not *a priori* obvious how we can figure out the best values for them. We *have* some data which we want to use to determine the parameters, so the usual approach would be to simply calculate a Maximum Likelihood Estimator (MLE) or a Maximum A Posteriori Estimator (MAP) by optimizing the above formulas over the possible parameters with a method like gradient descent. It turns out however that this is very hard to do (optimal MLE for a GMM is NP-hard (Aloise et al. 2009; Drineas et al. 2004)) since the $\pi_k$ and the $\theta$ are strongly interdependent and neither is known. It *can* still be done with some workarounds, but there is also an alternative path that we can go down.

*(The following exposition is only for those who are interested in the mathematical background of the EM algorithm; those who only want to solve the exercise can skip ahead to the function that you have to maximize.)*

We want to maximize the log likelihood given as
$$\mathcal{l}(\mathbf{\theta})=\sum_{i=1}^N\log p(\mathbf{x}_i|\mathbf{\theta}) = \sum_{i=1}^N\log\big[\sum_{k=1}^K\pi_kp_k(\mathbf{x}_i|\mathbf{\theta})\big].$$
All the problems occur because we have a sum inside the logarithm: we cannot pull the logarithm further in towards the density, and that is what makes the problem so hard. If we just *ignore* the inner sum we get an expression
$$\mathcal{l}_c(\mathbf{\theta}) = \sum_{i=1}^N\log p_k(\mathbf{x}_i|\mathbf{\theta})$$
which would be much nicer to compute. But now we have a free floating $k$ in the subscript of our density! Which one of the mixing distributions are we talking about here? Kind of all of them at once. But we need one quantity to represent all the distributions. So to get rid of the $k$ we take the expected value with respect to the latent variable $k$ and obtain a function that only depends on $\mathbf{\theta}$:
$$Q(\mathbf{\theta},\mathbf{\theta}^{t-1}) = \mathbb{E}[\mathcal{l}_c(\mathbf{\theta})|\mathbf{\theta}^{t-1}]$$

Calculating this Q function can be difficult - but at least we only have to do it once instead of solving an NP-hard optimization problem every time we have a new dataset. I will only provide you the final formula; you will have to trust me on this one:

$$\begin{align}
Q(\mathbf{\theta},\mathbf{\theta}^{t-1}) &= \sum_i\sum_k p(z_i=k|\mathbf{x}_i,\mathbf{\theta}^{t-1})\log\pi_k + \sum_i\sum_k p(z_i=k|\mathbf{x}_i,\mathbf{\theta}^{t-1})\log p_k(\mathbf{x}_i|\mathbf{\theta})
\end{align}$$

This still looks nasty but it really isn't that bad! Since $\theta^{t-1}$ is known at time $t$ we can calculate $p(z_i=k|\mathbf{x}_i,\mathbf{\theta}^{t-1})$ with Bayes' theorem as stated above and replace these expressions with constants $r_{i,k}.$
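Once the $r_{i,k}$ are treated as constants, $Q$ is just two weighted sums and is cheap to evaluate. The following sketch does this for one-dimensional Gaussian components; `q_function`, `log_gaussian` and all numbers are hypothetical and only serve as an illustration.

```python
import numpy as np

def log_gaussian(x, mu, sigma2):
    """Log density of a 1D Gaussian with mean mu and variance sigma2."""
    return -0.5 * np.log(2.0 * np.pi * sigma2) - (x - mu) ** 2 / (2.0 * sigma2)

def q_function(x, r, weights, mus, sigma2s):
    """Q for fixed responsibilities r (shape N x K) and candidate parameters."""
    x = np.asarray(x, dtype=float)[:, None]
    return np.sum(r * np.log(weights)) + np.sum(r * log_gaussian(x, mus, sigma2s))

x = np.array([-1.0, 1.0])
r = np.array([[1.0, 0.0], [0.0, 1.0]])  # made-up responsibilities
weights = np.array([0.5, 0.5])
sigma2s = np.array([1.0, 1.0])

# Means that sit on the data give a larger Q than means far from the data.
q_good = q_function(x, r, weights, np.array([-1.0, 1.0]), sigma2s)
q_bad = q_function(x, r, weights, np.array([0.0, 0.0]), sigma2s)
print(q_good, q_bad)
```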

**This is where your work begins:**

In the lecture you saw a proof that if we choose
$$\mathbf{\theta}^t = \arg\max_{\mathbf{\theta}} Q(\mathbf{\theta},\mathbf{\theta}^{t-1})$$
then the likelihood is non-decreasing from one iteration to the next. So we want to maximize $Q(\mathbf{\theta},\mathbf{\theta}^{t-1})$ with respect to the parameters $(\pi_1,\dots,\pi_K)$ and $\theta = (\mu_1,\dots,\mu_K,\sigma_1,\dots,\sigma_K)$. Your job is to take the derivative of 
$$\begin{align}
Q(\mathbf{\theta},\mathbf{\theta}^{t-1}) &= \sum_i\sum_k r_{i,k}\log\pi_k + \sum_i\sum_k r_{i,k}\log p_k(\mathbf{x}_i|\mathbf{\theta})
\end{align}$$
with respect to each of these variables, set it equal to 0, and solve for the variable you are currently maximizing over. You only have to do this for the one-dimensional case, i.e. 
$$p_k(\mathbf{x}_i|\mathbf{\theta}) = \frac{1}{\sqrt{2\pi\sigma_k^2}}\exp\big({-\frac{(x_i-\mu_k)^2}{2\sigma_k^2}}\big)$$

**a) Calculate the maximizer for the $\pi_k$ (You need a Lagrange multiplier for this):**


$$\begin{align} 
\frac{\delta}{\delta \pi_k}\Big[Q(\mathbf{\theta},\mathbf{\theta}^{t-1}) - \lambda \big(\sum_{k'} \pi_{k'} - 1\big)\Big] &= \sum_i \frac{r_{i,k}}{\pi_k} - \lambda = 0 \leftrightarrow \pi_k = \frac{1}{\lambda}\sum_i r_{i,k} \\
\frac{\delta}{\delta \lambda}\Big[Q(\mathbf{\theta},\mathbf{\theta}^{t-1}) - \lambda \big(\sum_{k'} \pi_{k'} - 1\big)\Big] &= -\big(\sum_{k'} \pi_{k'} - 1\big) = 0 \leftrightarrow \sum_{k'} \pi_{k'} = 1 \\
\rightarrow \frac{1}{\lambda}\sum_k\sum_i r_{i,k} &= 1 \rightarrow \lambda = \sum_i\sum_k r_{i,k} = N \rightarrow \pi_k = \frac{1}{N}\sum_i r_{i,k}
\end{align}$$
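A quick numerical sanity check of this result, using made-up responsibilities: the update is the column mean of $r$, and it should score at least as well on the $\pi$-dependent part of $Q$ as any other valid weight vector.

```python
import numpy as np

# Made-up responsibilities for N = 4 samples and K = 2 components.
r = np.array([[0.9, 0.1],
              [0.8, 0.2],
              [0.3, 0.7],
              [0.0, 1.0]])
N = r.shape[0]

# Update rule: pi_k = (1/N) * sum_i r_ik
pi_new = r.sum(axis=0) / N

def pi_objective(pi):
    """The part of Q that depends on pi: sum_i sum_k r_ik * log(pi_k)."""
    return np.sum(r * np.log(pi))

print(pi_new, pi_new.sum())
print(pi_objective(pi_new) >= pi_objective(np.array([0.4, 0.6])))
```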

**b) Calculate the maximizer for the $\mu_k$:**


$$\begin{align} 
\frac{\delta}{\delta \mu_k}Q(\mathbf{\theta},\mathbf{\theta}^{t-1}) &= \frac{\delta}{\delta \mu_k}\sum_i\sum_{k'} r_{i,k'}\log p_{k'}(\mathbf{x}_i|\mathbf{\theta}) \\
&= \sum_i r_{i,k} \frac{\delta}{\delta \mu_k} \big(-\frac{1}{2}\log(\sigma_k^2) - \frac{1}{2\sigma_k^2}(x_i-\mu_k)^2\big)
\\
&= \frac{1}{\sigma_k^2}\sum_i r_{i,k}(x_i-\mu_k) = 0 \\
\leftrightarrow \sum_i r_{i,k}x_i &= \mu_k\sum_ir_{i,k} \leftrightarrow \mu_k = \frac{\sum_i r_{i,k}x_i}{\sum_ir_{i,k}} 
\end{align} $$ 
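The resulting update is a responsibility-weighted mean of the data. A small sketch with made-up numbers, cross-checked against `np.average`:

```python
import numpy as np

x = np.array([0.0, 1.0, 4.0, 5.0])
# Made-up responsibilities for N = 4 samples and K = 2 components.
r = np.array([[1.0, 0.0],
              [1.0, 0.0],
              [0.5, 0.5],
              [0.0, 1.0]])

# Update rule: mu_k = sum_i r_ik * x_i / sum_i r_ik, for all k at once.
mu_new = (r * x[:, None]).sum(axis=0) / r.sum(axis=0)

# The same quantity for the first component via np.average.
mu_0_check = np.average(x, weights=r[:, 0])
print(mu_new, mu_0_check)
```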

**c) Calculate the maximizer for the $\sigma_k^2$:**


$$\begin{align}
\frac{\delta}{\delta \sigma_k^2}Q(\mathbf{\theta},\mathbf{\theta}^{t-1}) &= \sum_i r_{i,k} \frac{\delta}{\delta \sigma_k^2} \big(-\frac{1}{2}\log(\sigma_k^2) - \frac{1}{2\sigma_k^2}(x_i-\mu_k)^2\big) \\
&= \frac{1}{2}\sum_i r_{i,k}\big(\frac{1}{\sigma_k^4}(x_i-\mu_k)^2 -\frac{1}{\sigma_k^2}\big) = 0\\
\leftrightarrow & \sum_i r_{i,k}(x_i-\mu_k)^2 = \sigma_k^2\sum_ir_{i,k} \\
\leftrightarrow & \sigma_k^2 = \frac{\sum_i r_{i,k}(x_i-\mu_k)^2}{\sum_ir_{i,k}}
\end{align}$$
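Putting the three update rules together gives one full EM iteration. The sketch below is one possible way to arrange it for the one-dimensional case, not a reference solution; `em_step`, the synthetic data and the starting values are assumptions made for the example. It records the log likelihood before each update so that the non-decreasing behaviour can be inspected.

```python
import numpy as np

def em_step(x, weights, mus, sigma2s):
    """One EM iteration; returns updated parameters and the log likelihood
    of the parameters that were passed in."""
    xc = x[:, None]
    dens = np.exp(-(xc - mus) ** 2 / (2.0 * sigma2s)) / np.sqrt(2.0 * np.pi * sigma2s)
    joint = weights * dens
    log_lik = np.sum(np.log(joint.sum(axis=1)))
    # E-step: responsibilities r_ik via Bayes' theorem.
    r = joint / joint.sum(axis=1, keepdims=True)
    # M-step: the three update rules derived above.
    n_k = r.sum(axis=0)
    weights = n_k / len(x)
    mus = (r * xc).sum(axis=0) / n_k
    sigma2s = (r * (xc - mus) ** 2).sum(axis=0) / n_k  # uses the updated means
    return weights, mus, sigma2s, log_lik

# Synthetic data from two Gaussians (seeded for reproducibility).
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-3.0, 1.0, 200), rng.normal(2.0, 0.5, 300)])

weights = np.array([0.5, 0.5])
mus = np.array([-1.0, 1.0])
sigma2s = np.array([1.0, 1.0])
history = []
for _ in range(50):
    weights, mus, sigma2s, log_lik = em_step(x, weights, mus, sigma2s)
    history.append(log_lik)

print(weights, mus, sigma2s)
print(np.all(np.diff(history) >= -1e-6))
```

A production implementation would typically work in log space to avoid underflow and would guard against components whose variance collapses towards zero; both are left out here to keep the update rules visible.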