# notMNIST letters clustering with k-means

In this notebook, we'll apply the k-means clustering algorithm to analyze notMNIST letters using a GPU and the [RAPIDS](https://rapids.ai/) libraries (cudf, cuml).

**Note that a GPU is required for this notebook.**

This version of the notebook has been tested with RAPIDS version 0.15.

First, the needed imports. 

In [None]:
%matplotlib inline

from pml_utils import show_clusters

import cudf
import numpy as np
import pandas as pd

import os
import urllib.request

from cuml import KMeans
from cuml import __version__ as cuml_version

from sklearn.cluster import KMeans as sklearn_KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn import __version__ as sklearn_version

import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

print('Using cudf version:', cudf.__version__)
print('Using cuml version:', cuml_version)
print('Using sklearn version:', sklearn_version)

Then we load the notMNIST data. The first time the notebook is run, the data has to be downloaded, which can take a while. The data is stored as NumPy arrays in host (CPU) memory.

In [None]:
def load_not_mnist(directory, filename):
    filepath = os.path.join(directory, filename)
    if os.path.isfile(filepath):
        print('Not downloading, file already exists:', filepath)
    else:
        # Create the target directory, including any missing parents
        os.makedirs(directory, exist_ok=True)
        url_base = 'https://a3s.fi/mldata/'
        url = url_base + filename
        print('Downloading {} to {}'.format(url, filepath))
        urllib.request.urlretrieve(url, filepath)
    return np.load(filepath)

In [None]:
DATA_DIR = os.path.expanduser('~/data/notMNIST/')
os.makedirs(DATA_DIR, exist_ok=True)

X = load_not_mnist(DATA_DIR, 'notMNIST_large_images.npy').reshape(-1, 28*28)
X = X.astype(np.float32)
y = load_not_mnist(DATA_DIR, 'notMNIST_large_labels.npy')

print()
print('notMNIST data loaded:', len(X))
print('X:', type(X), 'shape:', X.shape)
print('y:', type(y), 'shape:', y.shape)

Let's convert our data to a cuDF DataFrame in device (GPU) memory. 

In [None]:
%%time

cu_X = cudf.DataFrame.from_pandas(pd.DataFrame(X))

print('cu_X:', type(cu_X), 'shape:', cu_X.shape)

## k-means

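[K-means](https://docs.rapids.ai/api/cuml/stable/api.html#k-means-clustering) clusters data by trying to separate samples into *k* groups of equal variance. It uses an iterative two-step algorithm that alternates between assigning each sample to its nearest centroid and recomputing each centroid as the mean of the samples assigned to it. It requires the number of clusters as a parameter.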
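The two steps of the iteration (assign every sample to its nearest centroid, then move every centroid to the mean of its members) can be sketched in plain NumPy. This is an illustrative sketch only, not the cuML implementation: the function name `lloyd_kmeans`, the random initialisation, the fixed iteration count and the toy data are all assumptions made for the example.

```python
import numpy as np

def lloyd_kmeans(X, k, n_iter=20, seed=0):
    """Plain k-means (Lloyd's algorithm); illustrative, not the cuML code."""
    rng = np.random.default_rng(seed)
    # Initialise the centroids with k distinct randomly chosen samples
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(np.float64)
    for _ in range(n_iter):
        # Step 1 (assignment): label each sample with its nearest centroid
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Step 2 (update): move each centroid to the mean of its members
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centers[j] = members.mean(axis=0)
    return labels, centers

# Toy data (assumption): two well-separated 2-D blobs of 50 points each
rng = np.random.default_rng(42)
toy = np.vstack([rng.normal(0.0, 0.1, size=(50, 2)),
                 rng.normal(5.0, 0.1, size=(50, 2))])
toy_labels, toy_centers = lloyd_kmeans(toy, k=2)
print('cluster sizes:', np.bincount(toy_labels, minlength=2))
```

The pairwise distance matrix in this sketch needs memory proportional to samples × clusters × features, so it is only suitable for small inputs.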

In [None]:
%%time

n_clusters_kmeans = 10

kmeans = KMeans(n_clusters=n_clusters_kmeans)
kmeans.fit(cu_X)

# Copy the results from device (GPU) memory back to host NumPy arrays
kmeans_labels = kmeans.labels_.to_array()
kmeans_cluster_centers = kmeans.cluster_centers_.as_matrix()

As a comparison, we can run K-means clustering using scikit-learn.
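A minimal sketch of such a scikit-learn run is shown here on small synthetic data so that it is self-contained. The names `X_demo`, `y_demo` and `sk_kmeans` and the blob data are assumptions for illustration. In the notebook, one would presumably fit on the host array `X` with `n_clusters=n_clusters_kmeans` instead, which may take noticeably longer on the CPU.

```python
import numpy as np
from sklearn.cluster import KMeans as sklearn_KMeans
from sklearn.metrics import adjusted_rand_score

# Synthetic stand-in for X (assumption): three well-separated blobs in 5-D
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(c, 0.5, size=(100, 5))
                    for c in (0.0, 10.0, 20.0)]).astype(np.float32)
y_demo = np.repeat(np.arange(3), 100)

# CPU k-means; fit_predict returns the cluster label of every sample
sk_kmeans = sklearn_KMeans(n_clusters=3, n_init=10, random_state=0)
sk_labels = sk_kmeans.fit_predict(X_demo)

print('centroids shape:', sk_kmeans.cluster_centers_.shape)
print('ARI vs. true blobs: %.3f' % adjusted_rand_score(y_demo, sk_labels))
```

Cluster numbering is arbitrary, so GPU and CPU labelings should be compared with a permutation-invariant score such as `adjusted_rand_score` rather than element by element.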

The sizes of the clusters:

In [None]:
plt.hist(kmeans_labels, bins=range(kmeans.n_clusters+1),
         rwidth=0.5)
plt.xticks(0.5+np.arange(kmeans.n_clusters),
           np.arange(kmeans.n_clusters))
plt.title('Cluster sizes');
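The same counts can also be read as numbers instead of from a plot. A small sketch using `np.bincount`, with a hypothetical label vector `demo_labels` standing in for `kmeans_labels`:

```python
import numpy as np

# Hypothetical labels standing in for kmeans_labels (4 clusters, one empty)
demo_labels = np.array([0, 2, 2, 1, 2, 0])

# minlength keeps clusters that received no samples visible as zeros
sizes = np.bincount(demo_labels, minlength=4)
print(sizes)  # → [2 1 3 0]
```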

The k-means centroids are vectors in the same space as the original data, so we can take a look at them:

In [None]:
plt.figure(figsize=(kmeans.n_clusters, 1))

for i in range(kmeans.n_clusters):
    plt.subplot(1, kmeans.n_clusters, i+1)
    plt.axis('off')
    plt.imshow(kmeans_cluster_centers[i,:].reshape(28,28), cmap="gray")
    plt.title(str(i))

Let's also draw some letters from each cluster:

In [None]:
show_clusters(kmeans_labels, kmeans.n_clusters, X)

### Evaluation

Since we know the correct labels for the notMNIST letters, we can evaluate the quality of the clustering. We'll use the [adjusted Rand index](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.adjusted_rand_score.html), which considers all pairs of samples and counts the pairs that are assigned to the same or to different clusters in the predicted and true clusterings. The index is 1.0 for clusterings that are identical up to a renaming of the clusters, close to 0.0 for random label assignments, and can be negative for clusterings that agree less than chance would. Higher values denote better clusterings.

In [None]:
print("Adjusted Rand index: %.3f"
      % adjusted_rand_score(y, kmeans_labels))
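The pair counting behind the adjusted Rand index can be sketched directly in pure Python. This is an illustrative O(n²) version for small inputs, not scikit-learn's implementation. The function name `pair_counting_ari` and the toy label lists are assumptions for the example.

```python
from itertools import combinations
from math import comb

def pair_counting_ari(labels_true, labels_pred):
    """Adjusted Rand index from explicit pair counts (O(n^2), small inputs only)."""
    n = len(labels_true)
    same_true = same_pred = same_both = 0
    for i, j in combinations(range(n), 2):
        t = labels_true[i] == labels_true[j]  # pair together in the true clustering
        p = labels_pred[i] == labels_pred[j]  # pair together in the predicted clustering
        same_true += int(t)
        same_pred += int(p)
        same_both += int(t and p)
    # Agreement expected by chance, given the two sets of cluster sizes
    expected = same_true * same_pred / comb(n, 2)
    max_index = (same_true + same_pred) / 2
    if max_index == expected:  # degenerate case, e.g. a single cluster in both
        return 1.0
    return (same_both - expected) / (max_index - expected)

# Same grouping with swapped cluster names: perfect agreement
print('%.3f' % pair_counting_ari([0, 0, 1, 1], [1, 1, 0, 0]))  # → 1.000
# Every true pair is split apart: agreement below chance level
print('%.3f' % pair_counting_ari([0, 0, 1, 1], [0, 1, 0, 1]))  # → -0.500
```

The second toy case shows that the index is adjusted for chance: a clustering that systematically separates samples belonging together scores below zero.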