https://github.com/datascience-course/2023-datascience-lectures/blob/main/22-Clustering1/22-Clustering1.ipynb

[K-means clustering](https://www.wikiwand.com/en/K-means_clustering) is one of the simpler clustering algorithms. The idea is to represent each cluster by a vector whose coordinates are obtained by averaging the coordinates of the observations that belong to it. An observation belongs to the cluster it is closest to, where "closest" depends on a metric provided by the user. This is a bit of a [chicken or the egg](https://www.wikiwand.com/en/Chicken_or_the_egg) kind of problem. On the one hand, we want to know which cluster each observation belongs to. On the other hand, the position of each cluster depends on the observations assigned to it. We'll get back to this.
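
To make the two halves of the problem concrete, here is a minimal sketch on four made-up 2D points. The data, the starting centroids and the helper names `assign` and `update` are illustrative assumptions, and Euclidean distance is just one possible metric. Each function is easy on its own; the difficulty is that each one needs the other's output.

```python
import numpy as np

# Four made-up points forming two obvious groups
points = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])


def assign(points, centroids):
    """Return, for each point, the index of the closest centroid (Euclidean)."""
    # distances[i, j] is the distance between point i and centroid j
    distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return distances.argmin(axis=1)


def update(points, labels, k):
    """Return each centroid as the mean of the points assigned to it."""
    return np.array([points[labels == j].mean(axis=0) for j in range(k)])


# Knowing the centroids gives the assignments...
centroids = np.array([[1.0, 1.0], [4.0, 4.0]])
labels = assign(points, centroids)

# ...and knowing the assignments gives the centroids
new_centroids = update(points, labels, k=2)
print(labels)
print(new_centroids)
```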

First let's prepare the classic iris dataset.

In [1]:
import pandas as pd
from sklearn import datasets

X, y = datasets.load_iris(return_X_y=True)
X = pd.DataFrame(data=X, columns=['Sepal length', 'Sepal width', 'Petal length', 'Petal width'])

X.head()


,Sepal length,Sepal width,Petal length,Petal width
0,5.1,3.5,1.4,0.2
1,4.9,3.0,1.4,0.2
2,4.7,3.2,1.3,0.2
3,4.6,3.1,1.5,0.2
4,5.0,3.6,1.4,0.2


In [2]:
X.plot.scatter(x='Sepal length', y='Sepal width');


## Explanation

It turns out that finding the optimal solution to the K-means problem is [NP-hard](https://www.wikiwand.com/en/NP-hardness) (see this [ELI5](https://www.reddit.com/r/explainlikeimfive/comments/1glcly/eli5_np_nphard_npcomplete/) for some intuition). This doesn't mean we can't find an approximate (and good enough) solution. The most popular approximate method is [Lloyd's algorithm](https://www.wikiwand.com/en/Lloyd%27s_algorithm), and it isn't just used for K-means.

Lloyd's algorithm goes as follows:

1. Generate random centroids (a fancy name for the center of each cluster).
2. Assign each observation to the closest centroid.
3. Move each centroid to the average of the observations that are assigned to it.
4. Repeat from step 2 until a termination criterion is reached.

Usually the termination criterion is simply a maximum number of iterations provided by the user. More refined implementations also stop early once convergence is detected, for example when the centroids stop moving between iterations.
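
The steps above, including a simple convergence check, can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the function name `lloyd`, the `tol` threshold, the choice of Euclidean distance and the initialisation from randomly chosen observations are all assumptions made for the example.

```python
import numpy as np


def lloyd(points, k, max_iter=100, tol=1e-6, seed=0):
    """Hypothetical minimal Lloyd's algorithm with a convergence check."""
    rng = np.random.default_rng(seed)

    # Step 1: pick k distinct observations as the initial centroids
    centroids = points[rng.choice(len(points), size=k, replace=False)]

    for n_iter in range(1, max_iter + 1):
        # Step 2: assign each observation to its closest centroid
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)

        # Step 3: move each centroid to the mean of its observations
        # (a centroid with no observations stays where it is)
        new_centroids = centroids.copy()
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:
                new_centroids[j] = members.mean(axis=0)

        # Step 4: stop early once the centroids have (almost) stopped moving
        shift = np.abs(new_centroids - centroids).max()
        centroids = new_centroids
        if shift < tol:
            break

    return centroids, labels, n_iter


# Two well separated synthetic blobs (made-up data)
rng = np.random.default_rng(42)
blobs = np.vstack([rng.normal(0, 0.5, size=(50, 2)), rng.normal(5, 0.5, size=(50, 2))])
centroids, labels, n_iter = lloyd(blobs, k=2)
print(n_iter, centroids)
```

On data this well separated the loop should stop long before `max_iter`, which is the point of the check: the iteration cap becomes a safety net rather than the usual exit.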


## With scikit-learn

You can find the documentation for scikit-learn's implementation of K-means [here](http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html). Try and play around with the parameters. You can also use different variables of the iris dataset.
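
As a starting point for experimenting, the sketch below varies a few of the constructor parameters listed in the documentation (`init`, `n_init`, `max_iter`) and reads back the fitted attributes `inertia_` and `n_iter_`. The particular values are arbitrary choices for illustration, not recommendations, and the data is reloaded so that the snippet runs on its own.

```python
from sklearn import cluster, datasets

# Reload iris so the snippet is self-contained; keep two of the four features
features, _ = datasets.load_iris(return_X_y=True)
features = features[:, [0, 3]]  # sepal length and petal width

for init in ['k-means++', 'random']:
    model = cluster.KMeans(
        n_clusters=3,
        init=init,        # how the initial centroids are chosen
        n_init=10,        # number of restarts; the best run is kept
        max_iter=300,     # cap on Lloyd iterations per run
        random_state=42,
    )
    model.fit(features)

    # inertia_ is the sum of squared distances to the closest centroid
    print(init, round(model.inertia_, 3), model.n_iter_)
```

For a fixed number of clusters, a lower inertia means tighter clusters. Inertia tends to keep falling as the number of clusters grows, so on its own it is not a fair way to compare different values of `n_clusters`.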

In [3]:
import matplotlib.cm as cm
import matplotlib.pyplot as plt
import numpy as np
from sklearn import cluster


columns = ['Sepal length', 'Petal width']

k = 3
kmeans = cluster.KMeans(n_clusters=k, random_state=42)
kmeans.fit(X[columns])

centroids = kmeans.cluster_centers_
labels = kmeans.labels_

fig, ax = plt.subplots()
colors = iter(cm.viridis(np.linspace(0, 1, k)))

for i, centroid in enumerate(centroids):
    color = next(colors)

    # Plot the points assigned to cluster i
    members = X.loc[labels == i, columns].values
    ax.scatter(members[:, 0], members[:, 1], color=color, alpha=0.7)

    # Plot the centroid (linewidth must be a number, not a string)
    ax.scatter(centroid[0], centroid[1], color='white', marker='*', s=300,
               edgecolor=color, linewidth=2)

ax.set_xlabel(columns[0])
ax.set_ylabel(columns[1])


<Figure size 640x480 with 1 Axes>

The big question is: how do we choose $k$? We can usually take a good guess if we can visualize the data, but with more than two or three features that is rarely possible. What we need is a metric. The idea is to loop over a range of $k$ values and measure the quality of the resulting clusters. Obviously "quality" is subject to interpretation. A commonly used metric is the [silhouette score](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.silhouette_score.html#sklearn.metrics.silhouette_score), which ranges from -1 (worst) to 1 (best).

In [4]:
from sklearn import metrics

for k in [2, 3, 4, 5, 6]:
    k_means = cluster.KMeans(n_clusters=k, random_state=42)
    labels = k_means.fit_predict(X)
    silhouette = metrics.silhouette_score(X, labels)
    print('k = {} gives a silhouette score of {}'.format(k, silhouette))


k = 2 gives a silhouette score of 0.6810461692117462
k = 3 gives a silhouette score of 0.5528190123564095
k = 4 gives a silhouette score of 0.49805050499728726
k = 5 gives a silhouette score of 0.4887488870931055
k = 6 gives a silhouette score of 0.3648340039670025
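
To see what the score measures, the sketch below recomputes it by hand. For each observation, $a$ is the mean distance to the other members of its own cluster and $b$ is the mean distance to the members of the nearest other cluster; its silhouette is $(b - a) / \max(a, b)$, and the overall score is the average over all observations. The helper `silhouette_of` is a hypothetical name for this example, it assumes Euclidean distance and no single-point clusters, and the data is reloaded so the snippet runs on its own.

```python
import numpy as np
from sklearn import cluster, datasets, metrics

data, _ = datasets.load_iris(return_X_y=True)
labels = cluster.KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(data)


def silhouette_of(i, data, labels):
    """Silhouette of observation i: (b - a) / max(a, b)."""
    distances = np.linalg.norm(data - data[i], axis=1)

    # a: mean distance to the other members of its own cluster
    same = labels == labels[i]
    same[i] = False
    a = distances[same].mean()

    # b: mean distance to the members of the nearest other cluster
    b = min(distances[labels == other].mean()
            for other in set(labels) if other != labels[i])

    return (b - a) / max(a, b)


by_hand = np.mean([silhouette_of(i, data, labels) for i in range(len(data))])
print(by_hand, metrics.silhouette_score(data, labels))
```

The two printed values should agree. Note that in the output above $k = 2$ scores highest even though iris contains three species, so the silhouette is best treated as a guide rather than a verdict.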














### Let's code it ourselves

Let's write a simple version of Lloyd's algorithm. For the fun of it we'll plot the progress of the algorithm.

In [5]:
import time

import numpy as np


def l1_distance(x, y):
    return np.abs(x - y).sum()


def k_means(X, k=3, distance=l1_distance, n_iterations=10, ax=None):

    # Make sure we're working with a numpy.ndarray and not a pandas.DataFrame
    if isinstance(X, pd.DataFrame):
        X = X.values

    # Create the initial centroids by randomly perturbing the per-feature means
    feature_means = X.mean(axis=0)
    centroids = [np.random.uniform(0.5, 1.5, size=len(feature_means)) * feature_means for _ in range(k)]

    for i in range(n_iterations):

        # We'll store the indices of the points of each cluster inside a dictionary
        clusters = {c: [] for c in range(k)}

        # Iterate over each data point
        for j, x in enumerate(X):

            # Compute the distance with each centroid
            distances = [distance(x, centroid) for centroid in centroids]

            # Find the closest centroid
            closest = np.argmin(distances)
            clusters[closest].append(j)

        # Update each centroid
        for j, points in clusters.items():

            # No update is needed if there are no points assigned
            if not points:
                continue

            # The centroid becomes the average of the points assigned to it
            centroids[j] = X[points].mean(axis=0)

        # Plot the current disposition
        if ax is not None:
            colors = iter(cm.viridis(np.linspace(0, 1, k)))
            ax.clear()
            ax.set_title('Iteration {}'.format(i + 1))

            for c, centroid in enumerate(centroids):

                # Use the same color for the points and their centroid
                color = next(colors)

                # Plot the points belonging to the centroid
                members = clusters[c]
                xs = [X[j][0] for j in members]
                ys = [X[j][1] for j in members]
                ax.scatter(xs, ys, color=color, alpha=0.7)

                # Plot the centroid (linewidth must be a number, not a string)
                ax.scatter(centroid[0], centroid[1], color='white', marker='*', s=300,
                           edgecolor=color, linewidth=2)

            ax.figure.canvas.draw()
            time.sleep(0.3)

    return centroids, clusters


fig, ax = plt.subplots()
plt.ion()

centroids, clusters = k_means(X[['Sepal length', 'Sepal width']], k=3, ax=ax)


<Figure size 640x480 with 1 Axes>