# Introduction to the Clustering of Images

Images can be clustered in much the same way as texts. For complex images it is usually necessary to use a convolutional neural network (CNN) such as VGG16 purely as a feature extractor: we remove the final (prediction) layer so that the network outputs a feature vector for each image, and these feature vectors can then be clustered.

In our case we will use a simpler example. We will cluster the digits dataset.

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
import numpy as np
from scipy.stats import mode

from sklearn.datasets import load_digits
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
from sklearn.cluster import DBSCAN, MiniBatchKMeans, KMeans, AgglomerativeClustering

from sklearn.decomposition import PCA
from sklearn.manifold import TSNE


from time import time

In [None]:
# Loading the digits dataset
digits = load_digits()
digits.data.shape

Let's have a look at what the digits look like.

In [None]:
# Plot the first 100 digits of the dataset
def plot_digits(data):
    fig, ax = plt.subplots(10, 10, figsize=(8, 8),
                           subplot_kw=dict(xticks=[], yticks=[]))
    fig.subplots_adjust(hspace=0.05, wspace=0.05)
    for i, axi in enumerate(ax.flat):
        im = axi.imshow(data[i].reshape(8, 8), cmap='binary')
        im.set_clim(0, 16)
plot_digits(digits.data)

We can cluster the digits with a simple `KMeans` clustering model.

In [None]:
# Create the KMeans model and assign each digit to a cluster
kmeans = KMeans(n_clusters=10, random_state=0)
clusters = kmeans.fit_predict(digits.data)
kmeans.cluster_centers_.shape

The result is 10 clusters in 64 dimensions. Notice that the cluster centers themselves are 64-dimensional points, and can be interpreted as the "typical" digit within each cluster. Let's see what these cluster centers look like:

In [None]:
# Plot the cluster centers
fig, ax = plt.subplots(2, 5, figsize=(8, 3))
centers = kmeans.cluster_centers_.reshape(10, 8, 8)
for axi, center in zip(ax.flat, centers):
    axi.set(xticks=[], yticks=[])
    axi.imshow(center, interpolation='nearest', cmap=plt.cm.binary)

We see that even without the labels, KMeans is able to find clusters whose centers are recognizable digits, with perhaps the exception of 1 and 8.
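
The reason the centers can be read as "typical" digits is that each center is the mean of the samples assigned to its cluster. The following is a minimal sketch of that property, assuming synthetic 64-dimensional data as a stand-in for the digits (the data, `group_means` and the choice of three clusters are illustrative and not part of the notebook):

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the digits: three well-separated groups of 64-dimensional points
rng = np.random.default_rng(0)
group_means = np.array([0.0, 8.0, 16.0])
X = np.vstack([rng.normal(loc=m, scale=0.5, size=(40, 64)) for m in group_means])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Compare each cluster center with the mean of the samples assigned to that cluster
centers_match = []
for k in range(3):
    member_mean = X[km.labels_ == k].mean(axis=0)
    centers_match.append(np.allclose(km.cluster_centers_[k], member_mean))

print(centers_match)
```

On the digits data the same averaging happens pixel by pixel, which is why a center looks like a blurred version of the digits in its cluster.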

Because KMeans is an unsupervised algorithm, it doesn't know the correct labels: the cluster numbers it assigns are arbitrary and do not correspond to the digits themselves. Before we can properly evaluate the performance of the algorithm, we need to fix this by matching each learned cluster label with the true label that occurs most often in that cluster.

In [None]:
# Create a labels array that maps each learned cluster label to the most common true label in that cluster
labels = np.zeros_like(clusters)
for i in range(10):
    mask = (clusters == i)
    labels[mask] = mode(digits.target[mask])[0]
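
To make the matching step easier to follow, here is a small sketch on hand-made toy data. The arrays `clusters` and `targets` are hypothetical, and `np.bincount(...).argmax()` is used here as an alternative to `scipy.stats.mode` for finding the most common label:

```python
import numpy as np

# Hypothetical toy data: cluster ids from a clustering step and the known true labels
clusters = np.array([2, 2, 2, 0, 0, 0, 1, 1, 1, 1])
targets = np.array([7, 7, 1, 3, 3, 3, 1, 1, 1, 7])

labels = np.zeros_like(clusters)
for cluster_id in np.unique(clusters):
    mask = clusters == cluster_id
    # np.bincount counts how often each true label occurs; argmax picks the most common one
    labels[mask] = np.bincount(targets[mask]).argmax()

print(labels)  # prints [7 7 7 3 3 3 1 1 1 1]
```

Every sample in a cluster receives the majority label of that cluster, so the two samples whose true label differs from the majority are counted as errors in the evaluation.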

Now we can check how accurate our unsupervised clustering was in finding similar digits within the data:

In [None]:
# Plot the confusion matrix
mat = confusion_matrix(digits.target, labels)
sns.heatmap(mat.T, square=True, annot=True, fmt='d', cbar=False,
            xticklabels=digits.target_names,
            yticklabels=digits.target_names)
plt.xlabel('true label')
plt.ylabel('predicted label');

print('Accuracy: ', accuracy_score(digits.target, labels))
print('==============================================================')
print(confusion_matrix(digits.target,labels))
print('==============================================================')
print(classification_report(digits.target,labels))

With a simple `KMeans` algorithm we were able to correctly cluster **~79%** of the input digits. As we might expect from the visualization of the cluster centers, the main point of confusion was between the eights and ones. Nevertheless, it is still an impressive example of how we can build a digit classifier without a reference to any known label.

Let's try to take this even further. We can use the **t-distributed stochastic neighbor embedding** (t-SNE) algorithm to preprocess the data before performing KMeans. t-SNE is a nonlinear embedding algorithm that is particularly adept at preserving points within clusters.

In [None]:
# Project the data: this step will take several seconds
tsne = TSNE(n_components=2, init='random', random_state=0)
digits_proj = tsne.fit_transform(digits.data)

# Compute the clusters
kmeans = KMeans(n_clusters=10, random_state=0)
clusters = kmeans.fit_predict(digits_proj)

# Permute the labels
labels = np.zeros_like(clusters)
for i in range(10):
    mask = (clusters == i)
    labels[mask] = mode(digits.target[mask])[0]

# Plot the confusion matrix
mat = confusion_matrix(digits.target, labels)
sns.heatmap(mat.T, square=True, annot=True, fmt='d', cbar=False,
            xticklabels=digits.target_names,
            yticklabels=digits.target_names)
plt.xlabel('true label')
plt.ylabel('predicted label');

# Compute the accuracy
print('Accuracy: ', accuracy_score(digits.target, labels))
print('==============================================================')
print(confusion_matrix(digits.target, labels))
print('==============================================================')
print(classification_report(digits.target, labels))

Now we achieve **93% accuracy** without using the labels. This is the power of unsupervised learning when used carefully: it can extract information from the dataset that might be difficult to extract by hand or eye.

Let's also try **DBSCAN**:

In [None]:
# Project the data: this step will take several seconds
tsne = TSNE(n_components=2, init='random', random_state=0)
digits_proj = tsne.fit_transform(digits.data)

# Compute the clusters
clustering = DBSCAN(eps=5, min_samples=15)
clusters = clustering.fit_predict(digits_proj)

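# Permute the labels
# Note: DBSCAN marks noise points as -1 and may find more or fewer than 10 clusters.
# This loop only maps cluster ids 0-9, so all other points keep the default label 0.
labels = np.zeros_like(clusters)
for i in range(10):
    mask = (clusters == i)
    if mask.any():  # skip cluster ids that DBSCAN did not produce
        labels[mask] = mode(digits.target[mask])[0]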

# Plot the confusion matrix
mat = confusion_matrix(digits.target, labels)
sns.heatmap(mat.T, square=True, annot=True, fmt='d', cbar=False,
            xticklabels=digits.target_names,
            yticklabels=digits.target_names)
plt.xlabel('true label')
plt.ylabel('predicted label');

# Compute the accuracy
print('Accuracy: ', accuracy_score(digits.target, labels))
print('==============================================================')
print(confusion_matrix(digits.target, labels))
print('==============================================================')
print(classification_report(digits.target, labels))

With an **accuracy** of **79%**, DBSCAN does not work as well as KMeans on the t-SNE projection. The algorithm also appears to struggle with correctly identifying ones and nines.
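
Unlike KMeans, DBSCAN does not take the number of clusters as input, and it labels points that belong to no dense region as noise with the cluster id `-1`. The following sketch counts clusters and noise points on synthetic 2D data. The data and the values of `eps` and `min_samples` are assumptions chosen for this illustration, not the settings used for the digits:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Synthetic data: three dense blobs plus two isolated points far away from them
rng = np.random.default_rng(0)
blob_centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
blobs = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2)) for c in blob_centers])
outliers = np.array([[50.0, 50.0], [-50.0, -50.0]])
points = np.vstack([blobs, outliers])

found = DBSCAN(eps=1.5, min_samples=5).fit_predict(points)

# Noise points carry the label -1 and are not counted as a cluster
n_noise = int(np.sum(found == -1))
n_clusters = len(set(found.tolist()) - {-1})
print("clusters:", n_clusters, "noise points:", n_noise)
```

Checking these two numbers on the digits projection is a quick way to see whether DBSCAN actually produced ten clusters before the cluster ids are matched to digit labels.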

## Bonus

You can also cluster elements in one picture by grouping or aggregating the pixel values of an image into a certain number of natural classes (groups) based on statistical similarity. That can for example be used for color compression. Imagine you have an image with millions of colors. In most images, a large number of the colors will be unused, and many of the pixels in the image will have similar or even identical colors.
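
As a rough sketch of the color compression idea, the pixels of an image can be clustered by color and each pixel replaced by the center of its cluster. The random image and the choice of 4 colors below are assumptions for illustration only:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical 16x16 RGB image with random colors
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(16, 16, 3), dtype=np.uint8)

# One row per pixel, one column per color channel
pixels = image.reshape(-1, 3).astype(float)
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(pixels)

# Replace every pixel by the center of the cluster it belongs to
compressed = km.cluster_centers_[km.labels_].reshape(image.shape).astype(np.uint8)

n_before = len(np.unique(image.reshape(-1, 3), axis=0))
n_after = len(np.unique(compressed.reshape(-1, 3), axis=0))
print(n_before, "colors reduced to", n_after)
```

The compressed image has the same shape as the original but uses at most as many distinct colors as there are clusters.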

In our example here we will use clustering to highlight particular landforms on satellite imagery.

In [None]:
import rasterio as rio
from rasterio.plot import show
from sklearn import cluster
import matplotlib.pyplot as plt
import matplotlib.colors as mc
import numpy as np

We are going to use a satellite image of Seville ([Source](http://www.esa.int/ESA_Multimedia/Missions/Sentinel-2/(sortBy)/published/(result_type)/images)). We will load the image with Rasterio, an open source Python library that reads and writes raster datasets such as satellite imagery and terrain models in formats like GeoTIFF and JP2.

In [None]:
seville_raster = rio.open("images/Seville_Spain.jpg") 

To visualize this image nicely, we will need to adjust its contrast first by stretching it.

In [None]:
seville_arr = seville_raster.read()
vmin, vmax = np.nanpercentile(seville_arr, (5,95))  # 5-95% contrast stretch

fig, ax = plt.subplots(figsize=[20,20], ncols=1,nrows=1)
show(seville_raster, cmap='gray', vmin=vmin, vmax=vmax, ax=ax)
ax.set_axis_off()
fig.savefig("images/seville_new.jpg", bbox_inches='tight')
plt.show()
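
The percentile stretch used above can be written out by hand to see what it does. This is a sketch with a random array standing in for one image band (the array `band` is hypothetical): values at or below the 5th percentile become 0, values at or above the 95th percentile become 1, and everything in between is scaled linearly.

```python
import numpy as np

# Hypothetical single band with pixel values between 0 and 255
rng = np.random.default_rng(0)
band = rng.integers(0, 256, size=(32, 32)).astype(float)

# 5-95% contrast stretch: rescale so that vmin maps to 0 and vmax maps to 1, then clip
vmin, vmax = np.nanpercentile(band, (5, 95))
stretched = np.clip((band - vmin) / (vmax - vmin), 0.0, 1.0)

print(stretched.min(), stretched.max())
```

Passing `vmin` and `vmax` to the plotting function, as in the cell above, has a comparable visual effect without modifying the underlying data.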

Printing the shape of this image shows only its height and width, in the form (height, width). Before we go on, we need an array with the shape (height, width, channels), where the channels are the RGB color channels (red, green, blue).
We can build it by creating an empty array from the image size, band count and data type in the metadata. A for loop then reads each channel/band and writes it into the empty array. At the end of the loop we have a new array with the required shape, the same size and the same number of channels as the original image.

In [None]:
# print the shape of the original image
seville_raster.shape

In [None]:
# create an empty array with same dimension and data type
imgxyb = np.empty((seville_raster.height, seville_raster.width, seville_raster.count), seville_raster.meta['dtype'])

In [None]:
# loop through the raster's bands to fill the empty array
for band in range(imgxyb.shape[2]):
    imgxyb[:,:,band] = seville_raster.read(band+1)

In [None]:
print(imgxyb.shape)
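
The band-by-band loop can also be expressed as a single axis move. The sketch below assumes a small synthetic array in (bands, height, width) order, which is the layout that a Rasterio `read()` call without arguments returns, and checks that `np.moveaxis` gives the same result as the loop:

```python
import numpy as np

# Synthetic stand-in for a raster read in (bands, height, width) order
bands_first = np.arange(3 * 4 * 5).reshape(3, 4, 5)

# Loop version: copy each band into the last axis of a (height, width, bands) array
looped = np.empty((4, 5, 3), dtype=bands_first.dtype)
for band in range(3):
    looped[:, :, band] = bands_first[band]

# Same result by moving the band axis to the end
moved = np.moveaxis(bands_first, 0, -1)
print(moved.shape, np.array_equal(moved, looped))
```

Both versions produce the same (height, width, channels) array; the loop makes the individual steps more explicit.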

Now we are almost ready to train our model. First, we need to combine the X (width) and Y (height) dimensions into one dimension, so that we have a 2D array with one row per pixel instead of a 3D array. This array can be fed into a `KMeans` cluster model.

In [None]:
# convert to a 2D array with one row per pixel (only the first three bands, RGB, are kept)
rgb = imgxyb[:, :, :3]
img2d = rgb.reshape(-1, rgb.shape[2])
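
To see what the flattening does to individual pixels, here is a small sketch on a made-up array (the sizes are arbitrary). The pixel at position (row, col) ends up in row `row * width + col` of the 2D array, and reshaping back restores the original image:

```python
import numpy as np

height, width, channels = 3, 4, 3
img = np.arange(height * width * channels).reshape(height, width, channels)

# One row per pixel, one column per channel
flat = img.reshape(-1, channels)

# The pixel at (row, col) is found at index row * width + col in the flat array
row, col = 2, 1
same_pixel = np.array_equal(flat[row * width + col], img[row, col])

# Reshaping back gives the original image again
restored = flat.reshape(height, width, channels)
print(flat.shape, same_pixel, np.array_equal(restored, img))
```

This round trip is what later allows the per-pixel cluster labels to be reshaped back into an image.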

Training: The most important parameter to set is `n_clusters`, which is the number of clusters that we want to group our pixels into. We choose 4 classes for clarity, but you can choose as many classes as you can see in the image.

In [None]:
# create the clustering model and fit it to the pixels
cl = KMeans(n_clusters=4)
param = cl.fit(img2d)
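
If it is hard to decide on the number of classes by eye, one common heuristic (not used in this notebook) is to fit the model for several values of `n_clusters` and compare the `inertia_` attribute, the sum of squared distances of the samples to their closest cluster center. The sketch below assumes synthetic pixel colors scattered around three made-up base colors:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical pixel colors scattered around three base colors
rng = np.random.default_rng(0)
base_colors = np.array([[30.0, 60.0, 200.0], [40.0, 160.0, 60.0], [200.0, 180.0, 120.0]])
pixels = np.vstack([rng.normal(loc=c, scale=8.0, size=(100, 3)) for c in base_colors])

# Fit KMeans for several cluster counts and record the inertia of each fit
inertias = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(pixels)
    inertias[k] = model.inertia_

for k, value in inertias.items():
    print(k, round(value, 1))
```

The inertia usually drops sharply until the number of clusters matches the structure in the data and flattens afterwards, so the bend in the curve can serve as a starting point for `n_clusters`.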

In [None]:
cl.labels_

In [None]:
# get the class label of each pixel and reshape the labels back into the image's (height, width) layout (one band only)
img_cl = cl.labels_
img_cl = img_cl.reshape(imgxyb[:,:,0].shape)

In [None]:
img_cl.shape

To show the resulting image, we use a custom color map where you can control the color of each class. 

In [None]:
# Create a custom color map to represent our 4 different classes
cmap = mc.LinearSegmentedColormap.from_list("", ["#B90E0A","navy","green","#E1AD01"])
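
Because the class labels are discrete, an alternative to a linearly interpolated color map is a `ListedColormap`, which assigns exactly one color to each class. This is a sketch of that alternative, reusing the four colors from above. The names `discrete_cmap`, `norm` and `class_ids` are illustrative:

```python
import numpy as np
import matplotlib.colors as mc

# One fixed color per class, in class order 0, 1, 2, 3
class_colors = ["#B90E0A", "navy", "green", "#E1AD01"]
discrete_cmap = mc.ListedColormap(class_colors)

# Bin edges placed halfway between the integer class ids
norm = mc.BoundaryNorm(boundaries=[-0.5, 0.5, 1.5, 2.5, 3.5], ncolors=discrete_cmap.N)

# Tiny label image containing each class once
class_ids = np.array([[0, 1], [2, 3]])
rgba = discrete_cmap(norm(class_ids))
print(rgba.shape)
```

With real labels, the color map and norm could be passed together to the plotting call, for example `plt.imshow(img_cl, cmap=discrete_cmap, norm=norm)`.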

In [None]:
# Show the resulting array and save it as jpg image
plt.figure(figsize=[20,20])
plt.imshow(img_cl, cmap=cmap)
plt.axis('off')
plt.savefig("images/seville_clustered.jpg", bbox_inches='tight')
plt.show()

You can also try the code with other pictures. (Maybe try the code with a selfie with different colors and number of clusters... ;))