# Lab: Clustering MNIST

Use scikit-learn's handwritten digits data set (`load_digits`, a small MNIST-like set of 8x8 images) to compare the K-Means and Spectral Clustering algorithms.

In [1]:
# get the data

import numpy as np
from sklearn.datasets import load_digits

data, labels = load_digits(return_X_y=True)
(n_samples, n_features), n_digits = data.shape, np.unique(labels).size

print(f"# digits: {n_digits}; # samples: {n_samples}; # features {n_features}")

# digits: 10; # samples: 1797; # features 64


## Task 1
* create train and test sets
* cluster the train set using [**K-Means**](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html) and [**Spectral Clustering**](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.SpectralClustering.html#sklearn.cluster.SpectralClustering) with *k=10*
* evaluate the clustering results on the test set using the
    * [silhouette score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.silhouette_score.html#sklearn.metrics.silhouette_score)
    * [Fowlkes-Mallows score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.fowlkes_mallows_score.html#sklearn.metrics.fowlkes_mallows_score)
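
The cells in this notebook cluster and score the full data set, so the train/test step of the task is not shown. A minimal sketch of that step for K-Means follows, assuming a stratified 75/25 split and `n_init=10` (both are illustrative choices, not part of the lab). `SpectralClustering` has no `predict` method, so this pattern of fitting on train and assigning test samples only applies directly to K-Means.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.metrics import fowlkes_mallows_score, silhouette_score
from sklearn.model_selection import train_test_split

data, labels = load_digits(return_X_y=True)

# Assumption: hold out 25% of the samples, stratified so that every digit
# appears in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    data, labels, test_size=0.25, stratify=labels, random_state=0
)

# Learn the centroids from the train split only.
kmeans = KMeans(n_clusters=10, n_init=10, random_state=0).fit(X_train)

# Assign each held-out sample to its nearest learned centroid.
test_clusters = kmeans.predict(X_test)

# Silhouette needs only the features; Fowlkes-Mallows needs the true digits.
sil_test = silhouette_score(X_test, test_clusters)
fm_test = fowlkes_mallows_score(y_test, test_clusters)
print(f"silhouette (test): {sil_test:.3f}; Fowlkes-Mallows (test): {fm_test:.3f}")
```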

In [67]:
from sklearn.cluster import KMeans, SpectralClustering
from sklearn.metrics import silhouette_score, fowlkes_mallows_score

### K-Means

In [71]:
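X = data.reshape(len(data), -1)  # data is already (n_samples, 64), so this is a no-op
X = X.astype(float) / 16.  # load_digits pixel intensities range from 0 to 16, not 0 to 255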

In [72]:
kmeans = KMeans(n_clusters=n_digits, random_state=0).fit(X)

In [73]:
kmeans.labels_.shape

(1797,)

In [75]:
data.shape

(1797, 64)

In [78]:
silhouette_score(X, kmeans.labels_)

0.1825191642460056

In [76]:
fowlkes_mallows_score(labels, kmeans.labels_)

0.702734687012768

### Spectral Clustering

In [65]:
# fit on the raw 0-16 pixel values (data, not X); gamma is the RBF kernel coefficient
clustering = SpectralClustering(n_clusters=n_digits, gamma=1.0/10**2.0, assign_labels='discretize', random_state=0).fit(data)

In [66]:
silhouette_score(data, clustering.labels_)

0.15213942866788666

In [79]:
fowlkes_mallows_score(labels, clustering.labels_)

0.6570851074305768
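
The Fowlkes-Mallows scores above compare the cluster assignments with the true digit labels pair by pair: FM = TP / sqrt((TP + FP) * (TP + FN)), where TP counts sample pairs that share both a true label and a cluster. The sketch below recomputes the score from a contingency matrix on a small hand-made example and checks it against scikit-learn. The helper name `fowlkes_mallows_by_hand` and the toy labels are hypothetical, chosen only for illustration.

```python
import numpy as np
from sklearn.metrics import fowlkes_mallows_score
from sklearn.metrics.cluster import contingency_matrix


def pairs(counts):
    """Number of unordered pairs that can be drawn from each count."""
    counts = np.asarray(counts, dtype=np.int64)
    return counts * (counts - 1) // 2


def fowlkes_mallows_by_hand(labels_true, labels_pred):
    # c[i, j] = number of samples with true label i placed in cluster j
    c = contingency_matrix(labels_true, labels_pred)
    tp = pairs(c).sum()                      # same label and same cluster
    same_cluster = pairs(c.sum(axis=0)).sum()  # TP + FP
    same_label = pairs(c.sum(axis=1)).sum()    # TP + FN
    return tp / np.sqrt(same_cluster * same_label)


# Hypothetical toy labels: three true classes, three predicted clusters.
toy_true = [0, 0, 1, 1, 2, 2]
toy_pred = [1, 1, 0, 0, 0, 2]

fm_manual = fowlkes_mallows_by_hand(toy_true, toy_pred)
fm_sklearn = fowlkes_mallows_score(toy_true, toy_pred)
print(fm_manual, fm_sklearn)
```

Because the score only looks at which pairs are grouped together, it does not matter that cluster ids and digit labels use different numbering.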

## Task 2
* find the best *k* (in terms of the Task 1 scores) for *K-Means* and *Spectral Clustering*


In [88]:
import numpy as np

res = []
for k in range(10, 20):
    kmeans = KMeans(n_clusters=k, random_state=0).fit(data)
    res.append(silhouette_score(data, kmeans.labels_))

In [89]:
np.argmax(res)  # position in range(10, 20), not k itself; add 10 to get the best k

5
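
The loop above scans K-Means only. The same scan for Spectral Clustering might look like the sketch below. It assumes the Task 1 settings (`gamma=0.01`, `assign_labels="discretize"`) carry over unchanged and that the silhouette score is the selection criterion. The names `ks`, `spectral_scores` and `best_k_spectral` are introduced here for illustration.

```python
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.datasets import load_digits
from sklearn.metrics import silhouette_score

data, labels = load_digits(return_X_y=True)

ks = list(range(10, 20))  # same candidate values as the K-Means loop
spectral_scores = []
for k in ks:
    # Assumption: reuse the Task 1 hyperparameters for every k.
    model = SpectralClustering(
        n_clusters=k, gamma=0.01, assign_labels="discretize", random_state=0
    ).fit(data)
    spectral_scores.append(silhouette_score(data, model.labels_))

# argmax returns a position in `ks`, so look the value up instead of
# reading the position as k.
best_k_spectral = ks[int(np.argmax(spectral_scores))]
print(f"best k for Spectral Clustering by silhouette: {best_k_spectral}")
```

Indexing `ks` with the argmax avoids the off-by-offset reading of the result: position 5 in `range(10, 20)` corresponds to k = 15.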

## Task 3
* visualize the best clusters from Task 2 with the [embedding projector](http://projector.tensorflow.org/)
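
The embedding projector can load tab-separated files: one with the vectors (no header row) and, optionally, one with per-sample metadata. A minimal export sketch follows. It assumes K-Means with k = 15 (position 5 in `range(10, 20)` from the Task 2 output), and the file names, the temporary output directory and the metadata column names are illustrative choices.

```python
import tempfile
from pathlib import Path

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits

data, labels = load_digits(return_X_y=True)

best_k = 15  # assumption: taken from the K-Means silhouette scan in Task 2
clusters = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit_predict(data)

out_dir = Path(tempfile.mkdtemp())  # illustrative location; any folder works
vectors_path = out_dir / "vectors.tsv"
metadata_path = out_dir / "metadata.tsv"

# One row per sample, one tab-separated column per pixel, no header row.
np.savetxt(vectors_path, data, fmt="%g", delimiter="\t")

# Metadata with more than one column starts with a header row.
with open(metadata_path, "w") as f:
    f.write("cluster\tdigit\n")
    for cluster_id, digit in zip(clusters, labels):
        f.write(f"{cluster_id}\t{digit}\n")

print(f"wrote {vectors_path} and {metadata_path}")
```

Both files can then be uploaded through the projector's load dialog, and the points colored by the `cluster` or `digit` column to see how well the clusters match the digits.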