Osnabrück University - Machine Learning (Summer Term 2024) - Prof. Dr.-Ing. G. Heidemann, Ulf Krumnack, Lukas Niehaus

# Exercise Sheet 04: Clustering

## Introduction

This week's sheet should be solved and handed in before the end of **Sunday, May 12, 2024**. If you need help (and Google and other resources were not enough), use the Stud.IP forum, contact your group's designated tutor, or ask whoever of us you run into first. Please upload your results to your group's Stud.IP folder.

# Assignment 1: K-means Clustering (5 points)

**a)** Perform k-means clustering by hand to divide the following dataset into 2 clusters.

| x1 | x2 |x3 |
|----|----|---|
|  1 | 1  | 1 |
|  2 | 1  | 1 |
|  3 | 1  | 2 |
|  2 | 3  | 4 |
|  5 | 5  | 5 |
|  3 | 2  | 1 |

Start with the following centroids for the two clusters: $(1,1,1)$ and $(5,5,5)$.

Step 1: Initialize the centroids for the two clusters. Normally they are chosen randomly; here they are given by the task:

Cluster 1 centroid = (1, 1, 1)
Cluster 2 centroid = (5, 5, 5)

Step 2: Assign each data point to the nearest centroid. We can calculate the distance between each data point and the centroids using the Euclidean distance formula.

The distances between each data point and the centroids are as follows:


| Data Point | Distance to Cluster 1 centroid | Distance to Cluster 2 centroid | Assigned Cluster |
|:----------:|:------------------------------:|:------------------------------:|:----------------:|
| (1, 1, 1)  | 0                              | 6.928                          | Cluster 1        |
| (2, 1, 1)  | 1                              | 6.403                          | Cluster 1        |
| (3, 1, 2)  | 2.236                          | 5.385                          | Cluster 1        |
| (2, 3, 4)  | 3.742                          | 3.742                          | Cluster 2 (tie)  |
| (5, 5, 5)  | 6.928                          | 0                              | Cluster 2        |
| (3, 2, 1)  | 2.236                          | 5.385                          | Cluster 1        |

Based on the distances, we assign each data point to the nearest centroid. The point (2, 3, 4) is equidistant from both centroids ($\sqrt{14} \approx 3.742$); we break the tie in favor of Cluster 2. Data points (1, 1, 1), (2, 1, 1), (3, 1, 2), and (3, 2, 1) are assigned to Cluster 1, and data points (2, 3, 4) and (5, 5, 5) are assigned to Cluster 2.
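As a cross-check on the hand calculation, the assignment step can be sketched in NumPy. The variable names (`points`, `centroids`, `dist`, `assigned`) are illustrative only. Note that (2, 3, 4) is exactly equidistant from both start centroids; the sketch assumes that ties go to Cluster 2, which is a choice made here and not a rule of the algorithm.

```python
import numpy as np

points = np.array([[1, 1, 1], [2, 1, 1], [3, 1, 2],
                   [2, 3, 4], [5, 5, 5], [3, 2, 1]], dtype=float)
centroids = np.array([[1, 1, 1], [5, 5, 5]], dtype=float)

# dist[i, j] = Euclidean distance from point i to centroid j.
dist = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)

# Assumed tie-break: a point equidistant from both centroids goes to Cluster 2.
assigned = np.where(dist[:, 1] <= dist[:, 0], 2, 1)

for p, d, a in zip(points, dist, assigned):
    print(p, np.round(d, 3), "-> Cluster", a)
```

With the opposite tie-break, (2, 3, 4) would start in Cluster 1, so the tie-breaking rule is part of what makes the hand calculation reproducible.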

Step 3: Recalculate the centroids for each cluster by taking the mean of all the data points in the cluster.

The new centroids for the two clusters are as follows:

Cluster 1 centroid = ((1+2+3+3)/4, (1+1+1+2)/4, (1+1+2+1)/4) = (2.25, 1.25, 1.25)
Cluster 2 centroid = ((2+5)/2, (3+5)/2, (4+5)/2) = (3.5, 4, 4.5)

Step 4: Repeat steps 2 and 3 until convergence is reached.

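Repeating step 2 with the new centroids (2.25, 1.25, 1.25) and (3.5, 4, 4.5) gives the following distances:

| Data Point | Distance to Cluster 1 centroid | Distance to Cluster 2 centroid | Assigned Cluster |
|:----------:|:------------------------------:|:------------------------------:|:----------------:|
| (1, 1, 1)  | 1.299                          | 5.244                          | Cluster 1        |
| (2, 1, 1)  | 0.433                          | 4.848                          | Cluster 1        |
| (3, 1, 2)  | 1.090                          | 3.937                          | Cluster 1        |
| (2, 3, 4)  | 3.269                          | 1.871                          | Cluster 2        |
| (5, 5, 5)  | 5.974                          | 1.871                          | Cluster 2        |
| (3, 2, 1)  | 1.090                          | 4.062                          | Cluster 1        |

The assignments are the same as after the first iteration: data points (1, 1, 1), (2, 1, 1), (3, 1, 2), and (3, 2, 1) stay in Cluster 1, and data points (2, 3, 4) and (5, 5, 5) stay in Cluster 2. Recalculating the centroids therefore yields the same values as before:

Cluster 1 centroid = (2.25, 1.25, 1.25)

Cluster 2 centroid = (3.5, 4, 4.5)

After the second iteration, the assigned clusters do not change, so the algorithm has converged.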
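The complete procedure (steps 2 to 4) can be sketched as a short loop to verify the result numerically. This is a minimal sketch for this specific dataset, not a general implementation: it assumes two clusters, assumes that ties go to the second cluster, and does not handle empty clusters. All names are illustrative.

```python
import numpy as np

points = np.array([[1, 1, 1], [2, 1, 1], [3, 1, 2],
                   [2, 3, 4], [5, 5, 5], [3, 2, 1]], dtype=float)
centroids = np.array([[1, 1, 1], [5, 5, 5]], dtype=float)
labels = None

for iteration in range(1, 11):
    # Step 2: distance of every point to every centroid, then assignment.
    dist = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    # Assumed tie-break: equidistant points go to the second cluster (index 1).
    new_labels = np.where(dist[:, 1] <= dist[:, 0], 1, 0)
    if labels is not None and np.array_equal(new_labels, labels):
        break  # assignments are stable, so the centroids would not move either
    labels = new_labels
    # Step 3: each centroid becomes the mean of its assigned points.
    centroids = np.array([points[labels == j].mean(axis=0) for j in range(2)])
    print(f"iteration {iteration}: labels={labels}")
    print(centroids)
```

In this sketch, index 0 corresponds to Cluster 1 and index 1 to Cluster 2.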

**b)** $k$-means clustering of a given dataset can result in different outcomes. Explain at which point the algorithm is nondeterministic and how to compare the quality of different outcomes.

The algorithm is nondeterministic only in the initial choice of cluster centers. All subsequent steps are fully deterministic (given a fixed rule for breaking ties).

Different outcomes for the same $k$ can be compared by the sum of squared distances of the points to their respective cluster centers, which is the quantity that k-means minimizes. The smaller that sum, the better the clustering.
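One common way to quantify this is the within-cluster sum of squared distances. The following sketch compares two possible outcomes on a tiny, made-up dataset; the helper name `within_cluster_ss` and the data are assumptions for illustration.

```python
import numpy as np

def within_cluster_ss(data, labels, centroids):
    """Sum of squared Euclidean distances of each point to its assigned centroid."""
    diffs = data - centroids[labels]
    return float(np.sum(diffs ** 2))

# Four corners of a 4 x 1 rectangle (hypothetical example data).
data = np.array([[0.0, 0.0], [0.0, 1.0], [4.0, 0.0], [4.0, 1.0]])

# Outcome A groups the left and the right pair of points.
labels_a = np.array([0, 0, 1, 1])
centroids_a = np.array([[0.0, 0.5], [4.0, 0.5]])

# Outcome B groups the bottom and the top pair of points.
labels_b = np.array([0, 1, 0, 1])
centroids_b = np.array([[2.0, 0.0], [2.0, 1.0]])

score_a = within_cluster_ss(data, labels_a, centroids_a)
score_b = within_cluster_ss(data, labels_b, centroids_b)
print(score_a, score_b)
```

Both outcomes are stable under further k-means iterations, but outcome A has the smaller score and is therefore the better clustering by this measure.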

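**c)** The pseudocode for the k-means algorithm (ML-05, slide 27) uses the termination condition ($\exists k \in[1,K]: \|\vec{w}_k(t)-\vec{w}_k(t-1)\|>\varepsilon$). Discuss this condition from a theoretical and practical perspective and name alternatives.

The condition states that the loop continues as long as at least one cluster center moved by more than a given threshold $\varepsilon$ in the last step; the algorithm terminates once no center moves by more than $\varepsilon$.

From a theoretical perspective, it is not clear that the step sizes $\|\vec{w}_k(t)-\vec{w}_k(t-1)\|$ decrease from one iteration to the next (is it possible to create a simple example demonstrating that idea?). What does hold is that neither the assignment step nor the update step increases the sum of squared distances. Since there are only finitely many ways to partition the data, the assignments stop changing after finitely many iterations (given a fixed tie-breaking rule), and from then on the centers do not move at all. From a practical perspective, $\varepsilon$ has to be chosen relative to the scale of the data, and a positive $\varepsilon$ may stop the algorithm before the assignments are stable. Alternatives are to stop when no data point changes its cluster, when the decrease of the sum of squared distances falls below a threshold, or after a fixed maximum number of iterations.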
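The termination condition from the slide and one alternative can be written down as small predicates. This is only a sketch; the function names `centers_still_moving` and `labels_changed` are made up for illustration, and the arrays stand in for $\vec{w}_k(t)$ and $\vec{w}_k(t-1)$.

```python
import numpy as np

def centers_still_moving(w_new, w_old, eps):
    """True if any center moved by more than eps, i.e. the loop should continue."""
    shifts = np.linalg.norm(w_new - w_old, axis=1)
    return bool(np.any(shifts > eps))

def labels_changed(labels_new, labels_old):
    """Alternative criterion: continue only while some assignment changed."""
    return not np.array_equal(labels_new, labels_old)

w_old = np.array([[0.0, 0.0], [5.0, 5.0]])
w_new = np.array([[0.0, 0.001], [5.0, 5.0]])

# The same movement is "converged" or "not converged" depending on eps.
print(centers_still_moving(w_new, w_old, eps=1e-2))
print(centers_still_moving(w_new, w_old, eps=1e-4))
print(labels_changed(np.array([0, 1, 1]), np.array([0, 1, 1])))
```

The example illustrates the practical issue: whether a movement of 0.001 counts as convergence depends entirely on the chosen threshold, whereas the label-based criterion needs no threshold.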

# Assignment 2: k-means Clustering (7 points)

## a) Implement k-means clustering. Plot the results for $k = 7$ and $k = 3$ in colorful scatter plots.

How could one handle situations in which one or more clusters end up containing 0 elements? Handle these situations in your code.

In [None]:
import numpy as np
import matplotlib.pyplot as plt

from scipy.spatial.distance import cdist

def kmeans(data, k=3):
    """
    Applies k-means clustering to given data.

    Args:
        data (ndarray): a numpy array of shape (n, 2), providing n
                        2-dimensional datapoints
        k (int): Number of clusters

    Returns:
        labels (ndarray): Numpy array containing numbers (=labels) with the same order as the data points
                          e.g. (1,1,3,5,5,5,...)
        centroids(ndarray): vector representation of cluster centers of shape (k, 2)
    """
    ### BEGIN SOLUTION
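    # Initial centroids are k distinct random samples from the data.
    # Sampling without replacement avoids starting with duplicate centroids.
    centroids = data[np.random.choice(data.shape[0], k, replace=False)].astype(float)
    old_centroids = np.zeros(centroids.shape)
    
    # Initial labels are all zero; they are overwritten in the first iteration.
    labels = np.zeros(data.shape[0], dtype=int)
    
    # Let's keep count of our iterations to avoid infinite loops.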
    iterations = 0
    
    while np.any(np.abs(centroids - old_centroids) > np.finfo(float).eps) and iterations < 1000:
        # Keep count of iterations and remember current centroids for change calculation.
        iterations += 1
        # Copy the centroids and keep them for break condition check.
        old_centroids = np.copy(centroids)
        
        # Calculate new labels. Labels are the index of their minimal distance to any centroid.
        labels = np.argmin(cdist(centroids, data), axis=0)
        
        # Update centroids using the new cluster labels.
        for label in range(k): 
            # Check for empty clusters.
            if np.any(labels == label):
                # Cluster is not empty, move its centroid to new mean.
                centroids[label, :] = np.mean(data[labels == label], axis=0)
            else:
                # Cluster is empty, set its centroid to the furthest outlier,
                # i.e. the data point that is farthest from its nearest centroid.
                blacksheep = np.argmax(np.min(cdist(centroids, data), axis=0))
                centroids[label, :] = data[blacksheep, :]
    
    ### END SOLUTION

    return labels, centroids

In [None]:
%matplotlib ipympl

data = np.loadtxt('clusterData.txt')

# Test experiments with a different number of clusters and a different
# number of runs here.
# You can define the number of clusters and how often k-means is called.
# This allows you to investigate the influence of the number of clusters
# and of the different random cluster initializations per run.


# Experiment 1: One run with k=2
experiment = ((2,1),)
# Experiment 2: One run with k=3, one run with k=7
#experiment = ((3,1),(7,1))
# Experiment 3: Five runs with k=7
#experiment = ((7,5),)


for params in  experiment:
    k = params[0]
    for i in range(1, params[1]+1):
        labels, centroids = kmeans(data, k)
        
        assert isinstance(labels, np.ndarray), "'labels' should be a numpy array!"       
        assert isinstance(centroids, np.ndarray), "'centroids' should be a numpy array!"    
        assert labels.shape==(data.shape[0],), "Each data point needs a label!"
        assert centroids.shape==(k,data.shape[1]), ("k centroids with the same dimensionality "
            "as the data are needed!")
        
        kmeans_fig = plt.figure('k-means with k={}, i={}'.format(k,i))
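        plt.scatter(data[:,0], data[:,1], c=labels, vmin=0, vmax=k-1)
        # One large, transparent marker per centroid, scaled by cluster size.
        # np.arange(k) and np.bincount keep colors and sizes aligned with the
        # centroid rows, even if a cluster ends up empty.
        plt.scatter(centroids[:,0], centroids[:,1], 
                    c=np.arange(k), vmin=0, vmax=k-1, alpha=.1, marker='o',
                    s=np.bincount(labels, minlength=k)*100)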
        plt.title('k = {}, i = {}'.format(k, i))
        kmeans_fig.canvas.draw()
        plt.show()   

## b) Why might the clustering for k=7 not look optimal? 
What happens if you run the algorithm several times?

K-means works best for datasets in which all clusters have a similar, roughly circular shape and a similar number of data points. In contrast to a Mixture of Gaussians, in which the clusters are weighted and the standard deviations may differ between clusters and between dimensions, k-means implicitly assumes the same fixed variance for all clusters and uses the same metric for all of them.

In this example, the inter- and intra-cluster variance as well as the number of data points per cluster vary.

Moreover, the outcome of k-means is a local minimum that depends on the random initialization of the cluster centers. This local minimum may not be the global optimum, so running the algorithm several times can produce different clusterings.
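The dependence on the initialization can be observed by running k-means several times with different seeds and comparing the resulting sum of squared distances. The sketch below uses scikit-learn's `KMeans` instead of the implementation above, and synthetic data from `make_blobs` as a stand-in for `clusterData.txt`; both are assumptions made to keep the example self-contained.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in data (an assumption; not the sheet's clusterData.txt).
X, _ = make_blobs(n_samples=300, centers=7, random_state=0)

inertias = []
for seed in range(5):
    # init="random" and n_init=1: each run uses one random initialization.
    model = KMeans(n_clusters=7, init="random", n_init=1, random_state=seed).fit(X)
    # inertia_ is the sum of squared distances of samples to their closest center.
    inertias.append(model.inertia_)

best_seed = int(np.argmin(inertias))
print("inertia per run:", np.round(inertias, 1))
print("best run:", best_seed)
```

Keeping the run with the smallest inertia is a simple way to reduce the risk of ending up in a poor local minimum.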

# Assignment 3: Scikit-Learn Clustering (3 points)

The documentation of [Scikit-Learn](https://scikit-learn.org/stable/auto_examples/cluster/plot_cluster_comparison.html) compares the characteristics of different clustering algorithms by showing their performance on various toy datasets. [Hierarchical](https://scikit-learn.org/stable/auto_examples/cluster/plot_linkage_comparison.html#sphx-glr-auto-examples-cluster-plot-linkage-comparison-py) clustering is shown in detail in a separate notebook.
The lecture focuses on agglomerative clustering with single, complete, average and Ward linkage (ML-5 Slide: 8-12), k-means (ML-5 Slide: 26) and Gaussian mixtures (ML-5 Slide: 36), all of which are used in the code below.
You can modify the random state for the dataset generation, clustering approach and number of samples to get a deeper understanding for the different clustering algorithms.
Below the code are questions about the datasets and clustering algorithms which you should answer.

In [None]:
import time
import warnings
import numpy as np
from itertools import cycle, islice
import matplotlib.pyplot as plt
from sklearn import cluster, datasets, mixture
from sklearn.preprocessing import StandardScaler

### MODIFY VALUES ###

random_state_dataset = 170
random_state_clustering = 170
n_samples = 1500

### CREATE DATASETS ###

noisy_circles = datasets.make_circles(
    n_samples=n_samples, factor=0.5, noise=0.05, random_state=random_state_dataset
)
noisy_moons = datasets.make_moons(n_samples=n_samples, noise=0.05, random_state=random_state_dataset)
blobs = datasets.make_blobs(n_samples=n_samples, random_state=random_state_dataset)
rng = np.random.RandomState(random_state_dataset)
no_structure = rng.rand(n_samples, 2), None

# Anisotropically distributed data
X, y = datasets.make_blobs(n_samples=n_samples, random_state=random_state_dataset)
transformation = [[0.6, -0.6], [-0.4, 0.8]]
X_aniso = np.dot(X, transformation)
aniso = (X_aniso, y)

# blobs with varied variances
varied = datasets.make_blobs(
    n_samples=n_samples, cluster_std=[1.0, 2.5, 0.5], random_state=random_state_dataset
)

### Start CLUSTERING ###

# Set up cluster parameters
plt.figure(figsize=(9 * 1.3 + 2, 14.5))
plt.subplots_adjust(
    left=0.02, right=0.98, bottom=0.001, top=0.96, wspace=0.05, hspace=0.01
)

plot_num = 1

default_base = {"n_neighbors": 10, "n_clusters": 3, "random_state": random_state_clustering}

datasets = [
    (noisy_circles, {"n_clusters": 2}),
    (noisy_moons, {"n_clusters": 2}),
    (varied, {"n_neighbors": 2}),
    (aniso, {"n_neighbors": 2}),
    (blobs, {}),
    #(no_structure, {}),
]

for i_dataset, (dataset, algo_params) in enumerate(datasets):
    # update parameters with dataset-specific values
    params = default_base.copy()
    params.update(algo_params)

    X, y = dataset

    # normalize dataset for easier parameter selection
    X = StandardScaler().fit_transform(X)

    # ============
    # Create cluster objects
    # ============
    single = cluster.AgglomerativeClustering(
        n_clusters=params["n_clusters"], 
        linkage="single"
    )
    average = cluster.AgglomerativeClustering(
        n_clusters=params["n_clusters"], 
        linkage="average"
    )
    complete = cluster.AgglomerativeClustering(
        n_clusters=params["n_clusters"], 
        linkage="complete"
    )
    ward = cluster.AgglomerativeClustering(
        n_clusters=params["n_clusters"], 
        linkage="ward"
    )
    gmm = mixture.GaussianMixture(
        n_components=params["n_clusters"],
        covariance_type="full",
        random_state=params["random_state"],
    )
    #kmeans = cluster.MiniBatchKMeans(
    kmeans = cluster.KMeans(
        n_clusters=params["n_clusters"],
        random_state=params["random_state"],
        n_init="auto"
    )
    
    clustering_algorithms = (
        ("Single Linkage", single),
        ("Average Linkage", average),
        ("Complete Linkage", complete),
        ("Ward Linkage", ward),
        ("Gaussian Mixture", gmm),
        ("KMeans", kmeans)
    )

    for name, algorithm in clustering_algorithms:
        t0 = time.time()

        # catch warnings related to kneighbors_graph
        with warnings.catch_warnings():
            warnings.filterwarnings(
                "ignore",
                message="the number of connected components of the "
                + "connectivity matrix is [0-9]{1,2}"
                + " > 1. Completing it to avoid stopping the tree early.",
                category=UserWarning,
            )
            algorithm.fit(X)

        t1 = time.time()
        if hasattr(algorithm, "labels_"):
            y_pred = algorithm.labels_.astype(int)
        else:
            y_pred = algorithm.predict(X)

        plt.subplot(len(datasets), len(clustering_algorithms), plot_num)
        if i_dataset == 0:
            plt.title(name, size=18)

        colors = np.array(
            list(
                islice(
                    cycle(
                        [
                            "#377eb8",
                            "#ff7f00",
                            "#4daf4a",
                            "#f781bf",
                            "#a65628",
                            "#984ea3",
                            "#999999",
                            "#e41a1c",
                            "#dede00",
                        ]
                    ),
                    int(max(y_pred) + 1),
                )
            )
        )
        
        plt.scatter(X[:, 0], X[:, 1], s=10, color=colors[y_pred])

        plt.xlim(-3.5, 3.5)
        plt.ylim(-3.5, 3.5)
        plt.xticks(())
        plt.yticks(())
        plt.text(
            0.22,
            #0.50,
            0.01,
            ("%.2fs" % (t1 - t0)).lstrip("0"),
            transform=plt.gca().transAxes,
            size=15,
            horizontalalignment="right",
            #verticalalignment="top",
        )
        plot_num += 1

plt.show()

## a) Which of the clusters obtained by the different algorithms would be seen as good for the different toy datasets and why?

| Dataset | Single linkage | Average Linkage | Complete Linkage | Ward Linkage | Gaussian Mixture | KMeans |
|---------|----------------|-----------------|------------------|--------------|------------------|--------|
| 1       | good           | bad             | bad              | bad          | bad              | bad    |
| 2       | good           | bad             | bad              | bad          | bad              | bad    |
| 3       | bad            | bad             | bad              | good         | good             | bad    |
| 4       | bad            | bad             | bad              | bad          | good             | bad    |
| 5       | bad            | good            | good             | good         | good             | good   |

A good clustering yields clusters with distinctly different characteristics, for example in density or in distance to other clusters. A well-chosen clustering algorithm exploits these characteristics by modelling the underlying process that created the data. For example, single linkage works well on datasets with a large minimum distance between clusters (like 1 and 2), while Gaussian mixture clustering works well on clusters that stem from Gaussian distributions (like 3, 4 and 5).

## b) Explain why single linkage clustering creates good results for datasets 1 and 2, but bad results for datasets 3, 4 and 5.

Single linkage tends to create clusters by chaining: points that are close to each other are successively combined into a cluster. Datasets 1 and 2 have a large minimum distance between the clusters while the density inside each cluster is high, which leads to good clustering results.

The results for datasets 3, 4 and 5 are bad because the minimum distance between clusters is small, while the density within a cluster can vary. For these datasets, single linkage tends to build clusters out of individual points or small groups that lie further away from the rest of the points, and chains everything else into one large cluster.
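This behaviour can be reproduced on a small constructed example. The sketch below assumes two dense groups of points separated by a gap of 0.6 plus one stray point that is about 1.04 away from either group; the data and names are made up for illustration.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Two dense groups on the x-axis (spacing 0.1), separated by a gap of 0.6.
group_a = np.column_stack([np.arange(10) * 0.1, np.zeros(10)])
group_b = np.column_stack([1.5 + np.arange(10) * 0.1, np.zeros(10)])
# One stray point whose nearest neighbour is farther away than that gap.
outlier = np.array([[1.2, 1.0]])
X = np.vstack([group_a, group_b, outlier])

labels = AgglomerativeClustering(n_clusters=2, linkage="single").fit_predict(X)
print(labels)
```

Because single linkage only looks at the smallest distance between clusters, the two groups are chained together across the gap before the stray point is attached, so the two resulting clusters are "everything" and "the stray point".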

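## c) Why does Ward work well on dataset 3 but badly on dataset 4?

Ward merges clusters such that the increase in total within-cluster variance is minimal.
In dataset 3, the total variance is smaller if one cluster contains the widely spread points (blue) while the other clusters contain the more concentrated points (orange and green).
In dataset 4, the true clusters are elongated, so compact clusters that cut across them have a lower variance than the true clusters, and Ward splits the data accordingly.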

## d) Explain the difference between Gaussian mixture and k-means clustering for datasets 3 and 4

Gaussian mixture models try to fit (bivariate) Gaussian distributions to the data as well as possible.
Datasets 3 and 4 stem from bivariate Gaussian distributions.
A Gaussian mixture model adapts the mean and the covariance matrix of each component, so that a component can be spread out further or be stretched in one direction (shaped by an affine transformation).
As a result, the Gaussian mixture model creates good clusters for datasets 3 and 4.

In contrast, k-means assigns each point to the nearest center, which partitions the space into a Voronoi tessellation and favors spherical clusters of similar size.
K-means cannot model a different underlying distribution for each cluster; it only takes the distances to the cluster centers into account.
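The difference in what the two models can represent is visible in their fitted parameters. The sketch below fits both scikit-learn estimators to anisotropic blobs generated as in the code above; the smaller sample size and the clustering seeds are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Anisotropic blobs, built like the `aniso` dataset above.
X, _ = make_blobs(n_samples=300, random_state=170)
X_aniso = X @ np.array([[0.6, -0.6], [-0.4, 0.8]])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_aniso)
gmm = GaussianMixture(n_components=3, covariance_type="full",
                      random_state=0).fit(X_aniso)

# k-means stores only one center per cluster.
print(kmeans.cluster_centers_.shape)
# The mixture stores a mean, a full covariance matrix and a weight per component.
print(gmm.means_.shape, gmm.covariances_.shape, gmm.weights_.shape)
```

The per-component covariance matrices are what allow the mixture model to follow stretched or differently sized clusters, while k-means has no parameter that could encode shape.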

## e) Why do the methods for Gaussian mixture and k-means need a random state, while single, average, complete and Ward linkage do not?

Single, average, complete and Ward linkage do not sample from random distributions.
They only work on the given data and do not rely on an initial state.

Gaussian mixture and k-means rely on an initial state in addition to the given data. They operate by iteratively updating this state until they reach a local optimum. This optimum may depend on the initial state, and thus different local optima can be reached.
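This distinction is also reflected in the scikit-learn estimators themselves. The following sketch, on arbitrary synthetic data, checks that two agglomerative fits give identical labels and lists which estimators expose a `random_state` parameter.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Arbitrary synthetic data for illustration.
X, _ = make_blobs(n_samples=200, random_state=0)

# Agglomerative clustering is deterministic: two fits give the same labels.
first = AgglomerativeClustering(n_clusters=3, linkage="ward").fit_predict(X)
second = AgglomerativeClustering(n_clusters=3, linkage="ward").fit_predict(X)
print("identical agglomerative labels:", np.array_equal(first, second))

# Only the iteratively optimized models expose a random_state parameter.
for estimator in (AgglomerativeClustering(), KMeans(), GaussianMixture()):
    has_seed = "random_state" in estimator.get_params()
    print(type(estimator).__name__, "has random_state:", has_seed)
```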