# Machine Learning (Summer 2024)

## Practice Session 4

May 7th, 2023

Lukas Niehaus & Ulf Krumnack

Institute of Cognitive Science,
University of Osnabrück

## Today's Session

* New Sheet04
* kMeans Introduction
* kMeans RGB
* kMeans Handwritten digits

## K-Means

$$ \text{Minimize } E(D, \{\mathbf{w}_k\}) = \frac{1}{|D|} \sum_{i=1}^{|D|} \|\mathbf{x}_i - \mathbf{w}_{m(\mathbf{x}_i)}\|^2 $$
with

Dataset: $D = \{\mathbf{x}_1, \mathbf{x}_2, ..., \mathbf{x}_{|D|}\}$

Clusters: $C_1, C_2, ..., C_K$

Cluster centers: $\mathbf{w}_1, \mathbf{w}_2, ..., \mathbf{w}_K$ with $\mathbf{w}_k = \frac{1}{|C_k|}\sum_{\mathbf{x}_i \in C_k} \mathbf{x}_i$

Best matching cluster center for an $\mathbf{x}_i$: $\mathbf{w}_{m(\mathbf{x}_i)}$ with $m(\mathbf{x}_i) = \operatorname{argmin}_j \|\mathbf{x}_i - \mathbf{w}_j\|$

![kmeans.png](kmeans.png)
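
As a minimal sketch of the definitions above (not the implementation used later in this sheet), the following NumPy snippet evaluates the best matching center, the error $E$ and the center update on a tiny hand-made data set. The names `assign`, `quantization_error` and `update_centers` and the toy values are assumptions made for illustration, and the update assumes that no cluster is empty.

```python
import numpy as np

# Toy data set D: four 2-D points forming two obvious groups (illustrative values).
X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
# Two cluster centers w_1, w_2.
W = np.array([[0.0, 1.0], [10.0, 1.0]])

def assign(X, W):
    """Index of the best matching center for every point."""
    # Pairwise Euclidean distances, shape (n_points, n_centers).
    dists = np.linalg.norm(X[:, None, :] - W[None, :, :], axis=2)
    return dists.argmin(axis=1)

def quantization_error(X, W):
    """Mean squared distance of every point to its best matching center."""
    m = assign(X, W)
    return np.mean(np.sum((X - W[m]) ** 2, axis=1))

def update_centers(X, m, K):
    """New center of every cluster: the mean of the points assigned to it."""
    return np.array([X[m == k].mean(axis=0) for k in range(K)])

m = assign(X, W)
E = quantization_error(X, W)
W_new = update_centers(X, m, K=len(W))
print(m, E, W_new)
```

In this toy example every point has squared distance 1 to its center, so $E = 1$, and the update step leaves the centers unchanged: the configuration is already a fixed point of the algorithm.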

## K-means for color clustering in images

In [None]:
import time  # needed for the pause between live plots in kmeans_rgb

import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial import distance
import imageio.v3 as iio
from IPython import display  # needed for the live plot in kmeans_rgb
from ipywidgets import interact, interactive, fixed, interact_manual
import ipywidgets as widgets
%matplotlib ipympl
# from matplotlib.colors import rgb_to_hsv, hsv_to_rgb

img = iio.imread('peppers.png', pilmode = 'RGB')

def kmeans_rgb(img, k, threshold=0, do_display=None):
    """
    k-means clustering in RGB space.

    Args:
        img (numpy.ndarray): an RGB image
        k (int): the number of clusters
        threshold (float): Maximal change for convergence criterion.
        do_display (bool): Whether or not to plot intermediate steps.
            If False, the change is printed instead; if None, the
            function runs silently.

    Returns:
        cluster (numpy.ndarray): an array with the same height and
            width as `img`, containing for each pixel the index of
            the cluster it belongs to
        centers (numpy.ndarray): 'number of clusters' x 3 array.
            RGB color for each cluster center.
    """

    # Transform image into n_pixels 3-dimensional vectors. Work in float,
    # so that means are not truncated and differences of unsigned
    # integers do not wrap around.
    vec = img.reshape((-1, img.shape[2])).astype(float)
    n_pixels = vec.shape[0]

    # Initialize random center vectors from data set.
    random_indices = np.random.choice(n_pixels, size=k, replace=False)
    centers = vec[random_indices]

    change = float('Inf')
    while change > threshold:
        # Remember previous centers.
        old_centers = centers.copy()

        # Calculate distance and best matching center vector.
        cluster = distance.cdist(vec, centers).argmin(axis=1)

        # Recalculate cluster centers.
        for i in range(k):
            idx = cluster == i
            if idx.any():
                centers[i] = vec[idx].mean(axis=0)
            else:
                # No vector is a match for this center vector.
                # Re-initialize center vector.
                centers[i] = vec[np.random.randint(n_pixels)]

        # Total distance the centers moved in this iteration.
        change = np.linalg.norm(centers - old_centers, axis=1).sum()

        if do_display:
            plt.imshow(centers[cluster].reshape(img.shape).astype(img.dtype))
            plt.title('change: {:.2f}'.format(change))
            display.clear_output(wait=True)
            display.display(plt.gcf())
            time.sleep(0.1)
        elif do_display is not None:
            print(change)

    cluster = cluster.reshape(img.shape[:2])

    # Return the centers in the dtype of the image, so that
    # `centers[cluster]` can be displayed directly.
    return cluster, centers.astype(img.dtype)

theta = 0.01
def cb(k):
    cluster, centers_rgb = kmeans_rgb(img, k, theta)

    # Random colors per cluster make neighboring clusters easier to tell apart.
    centers_random = np.random.rand(*centers_rgb.shape)

    plt.subplot(312)
    plt.axis('off')
    plt.imshow(centers_rgb[cluster])
    plt.title('clustered')
    plt.subplot(313)
    plt.axis('off')
    plt.imshow(centers_random[cluster])
    plt.title('pseudo')



In [None]:
fig = plt.figure(figsize=(9, 15))
plt.subplot(311)
plt.axis('off')
plt.imshow(img)
plt.title('original')

plt.show()
interact(cb, k=widgets.IntSlider(min=1,max=32,step=1,value=7));

# Demo 1: Clustering handwritten digits

Based on [A demo of K-Means clustering on the handwritten digits data](https://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_digits.html) from the scikit-learn website.

In this example we compare the various initialization strategies for K-means in terms of runtime and quality of the results.

As the ground truth is known here, we also apply different cluster quality metrics to judge the goodness of fit of the cluster labels to the ground truth.

## Load the dataset

We will start by loading the `digits` dataset. This dataset contains handwritten digits from 0 to 9. In the context of clustering, one would like to group the images such that all images in a cluster show the same digit.

In [None]:
import numpy as np
from sklearn.datasets import load_digits

data, labels = load_digits(return_X_y=True)
(n_samples, n_features), n_digits = data.shape, np.unique(labels).size

## Inspect the data

It is usually a good idea to inspect the data before processing it further. This way you can confirm that the data was loaded as expected, and you may get hints about which problems to expect and which methods to apply.

In [None]:
print(f"# digits: {n_digits}; # samples: {n_samples}; # features: {n_features}")
print(f"Shape and dtype of the data: {data.shape}, dtype={data.dtype}, ({data.min()}-{data.max()})")
print(f"Shape and dtype of the labels: {labels.shape}, dtype={labels.dtype}, ({labels.min()}-{labels.max()})")

In [None]:
import matplotlib.pyplot as plt
rows, columns = 2,4

plt.figure(figsize=(12,8))
plt.gray()
for row, column in np.ndindex(rows, columns):
    plt.subplot(rows, columns, row*columns + column + 1)
    index = np.random.choice(n_samples)
    plt.title(f"labels[{index}] = {labels[index]}")
    plt.imshow(data[index].reshape(8,8))
plt.show()

## Define our evaluation benchmark

We will first define our evaluation benchmark. With this benchmark, we intend to compare different initialization methods for KMeans. Our benchmark will:
 * create a pipeline which will scale the data using a [`StandardScaler`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler);
 * train and time the pipeline fitting;
 * measure the performance of the clustering obtained via different metrics.
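
To make the scaling step concrete, here is a small NumPy sketch of what standardization does to each feature: subtract the mean and divide by the standard deviation. The helper `standardize` and the toy matrix are hypothetical and only serve as illustration; the benchmark itself uses `StandardScaler`. Features that are constant across the data set (which can happen for border pixels of an image) are left at zero here instead of being divided by zero.

```python
import numpy as np

def standardize(X):
    """Scale every column of X to zero mean and unit variance."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # constant column: avoid division by zero
    return (X - mean) / std

# Three features on very different scales; the last one is constant.
X = np.array([[1.0, 10.0, 5.0],
              [2.0, 20.0, 5.0],
              [3.0, 30.0, 5.0]])
X_scaled = standardize(X)
print(X_scaled.mean(axis=0))
print(X_scaled.std(axis=0))
```

After scaling, the first two columns are identical although their original ranges differ by a factor of ten. This matters for K-means because the Euclidean distance would otherwise be dominated by the features with the largest range.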

In [None]:
from time import time
from sklearn import metrics
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def bench_k_means(kmeans, name, data, labels):
    """Benchmark to evaluate the KMeans initialization methods.

    Parameters
    ----------
    kmeans : KMeans instance
        A :class:`~sklearn.cluster.KMeans` instance with the initialization
        already set.
    name : str
        Name given to the strategy. It will be used to show the results in a
        table.
    data : ndarray of shape (n_samples, n_features)
        The data to cluster.
    labels : ndarray of shape (n_samples,)
        The labels used to compute the clustering metrics which requires some
        supervision.
    """
    t0 = time()
    estimator = make_pipeline(StandardScaler(), kmeans).fit(data)
    fit_time = time() - t0
    results = [name, fit_time, estimator[-1].inertia_]

    # Define the metrics which require only the true labels and estimator
    # labels
    clustering_metrics = [
        metrics.homogeneity_score,
        metrics.completeness_score,
        metrics.v_measure_score,
        metrics.adjusted_rand_score,
        metrics.adjusted_mutual_info_score,
    ]
    results += [m(labels, estimator[-1].labels_) for m in clustering_metrics]

    # The silhouette score requires the full dataset
    results += [
        metrics.silhouette_score(data, estimator[-1].labels_,
                                 metric="euclidean", sample_size=300)
    ]

    # Show the results
    formatter_result = ("{:9s}\t{:.3f}s\t{:.0f}\t{:.3f}\t{:.3f}"
                        "\t{:.3f}\t{:.3f}\t{:.3f}\t{:.3f}")
    print(formatter_result.format(*results))

Cluster quality metrics evaluated (see [Clustering performance evaluation](https://scikit-learn.org/stable/modules/clustering.html#clustering-evaluation) for definitions and discussions of the metrics):
* homogeneity score (homo): how homogeneous are the clusters, i.e. does each cluster contain only points with one label (based on the conditional entropy of the classes given the cluster assignments)? Best is 1.0, worst is 0.0.
* completeness score (compl): are all members of a class assigned to one cluster (based on the conditional entropy of the clusters given the classes)? Best is 1.0, worst is 0.0.
* V measure (v-meas): the harmonic mean of homogeneity and completeness. Best is 1.0.
* adjusted Rand index (ARI): compares cluster assignments to ground truth class labels (by counting how all pairs of points are labeled). 1.0 is best; values close to 0.0 are expected for random assignments, and the score can become negative.
* adjusted mutual information (AMI): agreement of the cluster assignment and the ground truth assignment. Best is 1.0.
* silhouette coefficient (silhouette): relates similarity of a sample and all other points in its cluster to the similarity of that sample to points in the next neighboring cluster. 1.0 means highly dense and well separated clustering, -1.0 means incorrect clustering.
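
The entropy-based scores can be reproduced by hand on a small example. The following is a sketch under the definitions given in the list above; the helper names are assumptions made for illustration, and the benchmark itself uses the functions from `sklearn.metrics`.

```python
import numpy as np

def entropy(labels):
    """Entropy (natural log) of a label array."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))

def conditional_entropy(a, b):
    """H(a | b): average entropy of a within each group of b."""
    a, b = np.asarray(a), np.asarray(b)
    return sum((b == v).mean() * entropy(a[b == v]) for v in np.unique(b))

def homogeneity(classes, clusters):
    h_classes = entropy(classes)
    if h_classes == 0:
        return 1.0
    return 1.0 - conditional_entropy(classes, clusters) / h_classes

def completeness(classes, clusters):
    # Completeness is homogeneity with the roles of both labelings swapped.
    return homogeneity(clusters, classes)

classes = np.array([0, 0, 1, 1])    # ground truth
clusters = np.array([0, 0, 1, 2])   # class 1 is split into two clusters

h = homogeneity(classes, clusters)
c = completeness(classes, clusters)
v = 2 * h * c / (h + c)
print(h, c, v)
```

Every cluster contains only one class, so the homogeneity is 1.0. Class 1 is spread over two clusters, which lowers the completeness to 2/3 and the V measure to 0.8.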

## Run the benchmark

We will compare three approaches:
* an initialization using `k-means++`. This method is stochastic and we will run the initialization `n_init` times;
* a random initialization. This method is stochastic as well and we will run the initialization `n_init` times;
* an initialization based on a PCA projection. Here, we will use the components of the PCA to initialize KMeans. This method is deterministic and a single initialization suffices.
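
To give an idea of why `k-means++` is stochastic, here is a minimal sketch of the basic seeding idea: the first center is drawn uniformly, and every further center is drawn with a probability proportional to its squared distance to the closest center chosen so far. The function `kmeans_pp_init` is a hypothetical helper for illustration and not scikit-learn's implementation, which differs in its details. The sketch assumes that the data contains at least `k` distinct points.

```python
import numpy as np

def kmeans_pp_init(X, k, rng):
    """Pick k rows of X as initial centers (basic k-means++ style seeding)."""
    n = X.shape[0]
    centers = [X[rng.integers(n)]]  # first center: uniform at random
    for _ in range(k - 1):
        # Squared distance of every point to its closest chosen center.
        diffs = X[:, None, :] - np.array(centers)[None, :, :]
        d2 = (diffs ** 2).sum(axis=2).min(axis=1)
        # Far-away points are more likely to become the next center.
        centers.append(X[rng.choice(n, p=d2 / d2.sum())])
    return np.array(centers)

rng = np.random.default_rng(0)
# Three tight pairs of points (illustrative values).
X = np.array([[0.0, 0.0], [0.1, 0.0],
              [5.0, 5.0], [5.1, 5.0],
              [10.0, 0.0], [10.1, 0.0]])
centers = kmeans_pp_init(X, k=3, rng=rng)
print(centers)
```

A point that is already a center has distance zero and therefore cannot be drawn again, so the returned centers are always distinct data points.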

In [None]:
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

n_init=4

print(82 * '_')
print('init\t\ttime\tinertia\thomo\tcompl\tv-meas\tARI\tAMI\tsilhouette')

kmeans = KMeans(init="k-means++", n_clusters=n_digits, n_init=n_init,
                random_state=0)
bench_k_means(kmeans=kmeans, name="k-means++", data=data, labels=labels)

kmeans = KMeans(init="random", n_clusters=n_digits, n_init=n_init, random_state=0)
bench_k_means(kmeans=kmeans, name="random", data=data, labels=labels)

pca = PCA(n_components=n_digits).fit(data)
kmeans = KMeans(init=pca.components_, n_clusters=n_digits, n_init=1)
bench_k_means(kmeans=kmeans, name="PCA-based", data=data, labels=labels)

print(82 * '_')

## Visualize the results on PCA-reduced data

PCA allows us to project the data from the original 64-dimensional space into a lower-dimensional space. Here, we use PCA to project into a 2-dimensional space and plot the data and the clusters in this new space.
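
The projection itself can be sketched with plain NumPy via a singular value decomposition. This is only an illustration on synthetic data: the function name `pca_project` and the data are assumptions, and the notebook relies on `sklearn.decomposition.PCA`.

```python
import numpy as np

def pca_project(X, n_components):
    """Project X onto its first principal components.

    Returns the projected data and the fraction of the total variance
    explained by each of the kept components.
    """
    Xc = X - X.mean(axis=0)                       # center every feature
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = S ** 2 / np.sum(S ** 2)
    return Xc @ Vt[:n_components].T, explained[:n_components]

rng = np.random.default_rng(0)
# Synthetic data: 200 samples with 5 features that are driven by
# only 2 latent directions plus a little noise.
latent = rng.normal(size=(200, 2)) * np.array([5.0, 2.0])
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + 0.1 * rng.normal(size=(200, 5))

Z, ratio = pca_project(X, n_components=2)
print(Z.shape, ratio)
```

Because the synthetic data is essentially two-dimensional, the two kept components explain almost all of the variance. For real data such as `digits`, the explained variance tells how much information is lost by the projection.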

In [None]:
import matplotlib.pyplot as plt

reduced_data = PCA(n_components=2).fit_transform(data)
kmeans = KMeans(init="k-means++", n_clusters=n_digits, n_init=4)
kmeans.fit(reduced_data)

# Step size of the mesh. Decrease to increase the quality of the VQ.
h = .02

# Plot the decision boundary. For that, we will assign a color to each
# point in the mesh [x_min, x_max]x[y_min, y_max].
x_min, x_max = reduced_data[:, 0].min() - 1, reduced_data[:, 0].max() + 1
y_min, y_max = reduced_data[:, 1].min() - 1, reduced_data[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

# Obtain labels for each point in mesh. Use last trained model.
Z = kmeans.predict(np.c_[xx.ravel(), yy.ravel()])

# Put the result into a color plot
Z = Z.reshape(xx.shape)
plt.figure(1)
plt.clf()
plt.imshow(Z, interpolation="nearest",
           extent=(xx.min(), xx.max(), yy.min(), yy.max()),
           cmap=plt.cm.Paired, aspect="auto", origin="lower")

print(reduced_data.shape, labels.shape)
plt.plot(reduced_data[:, 0], reduced_data[:, 1], 'k.', markersize=2)
# Plot the centroids as a white X
centroids = kmeans.cluster_centers_
plt.scatter(centroids[:, 0], centroids[:, 1], marker="x", s=169, linewidths=3,
            color="w", zorder=10)
plt.title("K-means clustering on the digits dataset (PCA-reduced data)\n"
          "Centroids are marked with white cross")
plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)
plt.xticks(())
plt.yticks(())
plt.show()

## Clustering after dimension reduction

Dimension reduction may improve the results of many machine learning algorithms, as it reduces redundancy and removes small variations that are often due to noise. The following code compares different degrees of dimension reduction with PCA:

In [None]:
print(82 * '_')
print('init\t\ttime\tinertia\thomo\tcompl\tv-meas\tARI\tAMI\tsilhouette')

kmeans = KMeans(init='k-means++', n_clusters=n_digits, n_init=4)
bench_k_means(kmeans=kmeans, name=f"Original ({data.shape[1]})", data=data, labels=labels)

# Project the data onto the first principal components.
for n_components in (32, 16, 8, 4, 2, 1):
    reduced = PCA(n_components=n_components).fit_transform(data)
    kmeans = KMeans(init='k-means++', n_clusters=n_digits, n_init=4)
    bench_k_means(kmeans=kmeans, name=f"{n_components} PCs", data=reduced, labels=labels)

print(82 * '_')