# $k$-Means Clustering
*Curtis Miller*

Here I demonstrate clustering using the $k$-means algorithm.

## Clustering the Iris Dataset

The first example applies $k$-means clustering to the iris dataset. I first load the dataset and plot the first two features, colored by species.

In [None]:
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
iris_obj = load_iris()
iris_data = iris_obj.data
species = iris_obj.target
iris_data[:5,:]

In [None]:
plt.scatter(iris_data[:, 0], iris_data[:, 1], c=species, cmap=plt.cm.brg)
plt.xlabel("Sepal Length")
plt.ylabel("Sepal Width")
plt.show()

Next I import the `KMeans` class, which performs $k$-means clustering, and fit it to the data.

In [None]:
from sklearn.cluster import KMeans

In [None]:
irisclust = KMeans(n_clusters=3, init='random')    # Three clusters with cluster centers chosen as random dataset points
irisclust.fit(iris_data)
irisclust.cluster_centers_    # The coordinates of cluster centers
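
As a rough illustration of what the fitting step is doing, here is a minimal NumPy sketch of the standard $k$-means iteration (often called Lloyd's algorithm): assign every point to its nearest center, move each center to the mean of its points, and repeat until the centers stop moving. This is not scikit-learn's implementation; the function name `lloyd_kmeans`, its parameters, and the toy data are my own assumptions for the example.

```python
import numpy as np


def lloyd_kmeans(X, k, n_iter=100, seed=0):
    """Illustrative k-means (hypothetical helper, not scikit-learn's code)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Squared distance from every point to every center, shape (n, k)
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        new_centers = centers.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:    # Leave an empty cluster's center where it is
                new_centers[j] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    # Final assignment against the final centers
    dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return centers, dists.argmin(axis=1)


# Toy data: two well-separated groups of three points each
points = np.array([[0., 0.], [0., 1.], [1., 0.],
                   [10., 10.], [10., 11.], [11., 10.]])
toy_centers, toy_labels = lloyd_kmeans(points, k=2)
print(toy_centers)
print(toy_labels)
```

Like `init='random'` above, the sketch starts from randomly chosen data points, so which cluster gets which label number depends on the seed.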

In [None]:
# Visualizing the clustering
plt.scatter(iris_data[:, 0], iris_data[:, 1], c=irisclust.predict(iris_data), cmap=plt.cm.brg)
plt.scatter(irisclust.cluster_centers_[:, 0], irisclust.cluster_centers_[:, 1],
            c=irisclust.predict(irisclust.cluster_centers_), cmap=plt.cm.brg, marker='^', s=200,
            edgecolors='k')
plt.xlabel("Sepal Length")
plt.ylabel("Sepal Width")
plt.show()

## Image Compression with $k$-Means

$k$-means can also be used for image compression. Each pixel, usually represented by its RGB values, is treated as a data point and assigned to a cluster by the $k$-means algorithm; the pixel is then recolored with the centroid of its cluster. The number of clusters is therefore the number of unique colors that need to be stored. Beyond those colors, we only need to store the dimensions of the image and, for each pixel, which of the colors it was assigned.
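
To make the storage argument concrete, here is a back-of-the-envelope sketch. It assumes 8 bits per color channel and that each pixel's color index is stored in the smallest whole number of bits that can distinguish the colors; it ignores the few bytes needed for the image dimensions. The function names and the 100 by 100 image size are hypothetical, chosen only for illustration.

```python
def raw_image_bits(height, width, channels=3, bits_per_channel=8):
    """Bits needed to store a full color value for every pixel."""
    return height * width * channels * bits_per_channel


def palette_image_bits(height, width, n_colors, channels=3, bits_per_channel=8):
    """Bits needed for a palette of n_colors plus one palette index per pixel."""
    palette = n_colors * channels * bits_per_channel
    # Smallest number of bits that can distinguish n_colors values (at least 1)
    index_bits = max(1, (n_colors - 1).bit_length())
    return palette + height * width * index_bits


# A hypothetical 100 x 100 image reduced to ten colors
print(raw_image_bits(100, 100))            # 240000
print(palette_image_bits(100, 100, 10))    # 40240
```

Under these assumptions, ten colors need 4 bits per pixel instead of 24, so the palette form is roughly a sixth of the raw size.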

I demonstrate this approach with an image of a poison dart frog, which we will compress with $k$-means into ten unique colors.

In [None]:
from PIL import Image
import numpy as np

In [None]:
frog = np.array(Image.open("frog.png").convert("RGB")) / 255    # Divide by 255 to scale the channel values to [0, 1]

In [None]:
frog.shape

In [None]:
frog[:5, :5, 0]

In [None]:
frog[:5, :5, 1]

In [None]:
frog[:5, :5, 2]

In [None]:
plt.imshow(frog)

In [None]:
def kmeans_compression(img, n_clusters):
    """Recolor an RGB image so each pixel takes the centroid color of its k-means cluster"""
    h, w, d = img.shape
    assert d == 3
    img_data = img.reshape(h * w, d)    # The new array should have a row per pixel
    img_clust = KMeans(n_clusters=n_clusters, init='random').fit(img_data)    # The actual k-means clustering step
    centroids = img_clust.cluster_centers_    # The RGB (normalized) values for the new pixels
    clust_pixels = img_clust.predict(img_data)    # Which pixel gets which new value
    new_img_data = centroids[clust_pixels]
    return new_img_data.reshape(h, w, d)

In [None]:
newfrog = kmeans_compression(frog, 10)

In [None]:
plt.imshow(newfrog)

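The original image may use a different color for every pixel, so storing it requires a full RGB value per pixel. The recolored image contains at most ten distinct colors, so it could be stored as a ten-entry color table plus a small index for each pixel, which needs far less memory, at the cost of image quality. Note that `newfrog` as returned above is still a full-size array; the savings appear only when the image is stored in that color-table form.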
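
As a sketch of what that compact form could look like in NumPy, the example below stores a color table and one `uint8` index per pixel, then rebuilds the full image by indexing the table. Random values stand in for the centroids and cluster assignments that `kmeans_compression` computes, and the tiny image size is hypothetical. Using `uint8` indices assumes at most 256 colors.

```python
import numpy as np

rng = np.random.default_rng(0)
h, w, k = 4, 6, 10    # Hypothetical tiny image and number of colors

palette = rng.random((k, 3))                # Stand-in for the centroids, values in [0, 1)
indices = rng.integers(0, k, size=h * w)    # Stand-in for each pixel's cluster number

# What kmeans_compression returns: a full float array with one RGB triple per pixel
full = palette[indices].reshape(h, w, 3)

# Compact form: 8-bit color table plus one 8-bit index per pixel
compact_palette = np.round(palette * 255).astype(np.uint8)
compact_indices = indices.astype(np.uint8).reshape(h, w)

# Rebuilding the image is a single indexing operation
restored = compact_palette[compact_indices] / 255

print(full.nbytes, compact_palette.nbytes + compact_indices.nbytes)
```

The restored image differs from `full` only by the rounding of each channel to 8 bits.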