# MNIST data

We cluster the MNIST data set with Persistable. Using the interactive mode, it is easy to identify the basic cluster structure in the data and to find parameters that lead to a clustering that matches the labels well.

Importantly, we pre-process the data using the UMAP dimensionality reduction algorithm, following [Leland McInnes' notebook Clustering evaluation on high dimensional data](https://gist.github.com/lmcinnes).

In [1]:
import persistable
import umap
import numpy as np
import json
from sklearn.datasets import fetch_openml
from sklearn.metrics import adjusted_rand_score
from sklearn.metrics import confusion_matrix


## Load data

In [2]:
mnist = fetch_openml("MNIST_784")
raw_mnist = mnist.data.astype(np.float32)

## Apply UMAP to reduce the dimensionality of the data

This may take a few minutes.

In [3]:
umap_mnist = umap.UMAP(n_neighbors=10, n_components=4, min_dist=1e-8, 
                       random_state=42, n_epochs=500).fit_transform(raw_mnist)

## Launch an instance of the Persistable interactive mode with pre-chosen settings

We played around with the parameters of the Component Counting Function and Prominence Vineyard to find settings that do a nice job of identifying cluster structure in MNIST. Feel free to adjust these settings to see the data from different points of view.

### Basic usage of the Persistable interactive mode

- Run the cell below to open the graphical user interface.
- To see the Component Counting Function, click "Compute".
- To see the Prominence Vineyard, in the box "Interactive inputs selection", choose "Family of lines". Now, one sees the two chosen lines that determine the Prominence Vineyard. Next, click "Compute" under "Prominence Vineyard".
- To get a clustering, in the box "Parameter selection", choose "On".
- To get a clustering that matches the MNIST labels reasonably well, select Gap number 10. Then click "Choose parameter".
- To get the labels for this clustering, run the cell below the graphical user interface.

In [4]:
# load a state dictionary to pass to start_ui
with open('MNIST-state.json', 'r') as fp:
    state = json.load(fp)

# create Persistable object
p = persistable.Persistable(umap_mnist, subsample=10000, n_neighbors=500)

# start UI
pi = persistable.PersistableInteractive(p)
port = pi.start_ui(ui_state=state, jupyter_mode='inline')

In [5]:
# get clustering with parameters chosen via the interactive mode
cluster_labels = pi.cluster()
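
The comparison with the MNIST labels treats points with a label `>= 0` as clustered and leaves out the rest. The toy sketch below shows that convention on a small hypothetical label list. The values are made up for illustration, and using `-1` for unclustered points is an assumption. It computes the cluster sizes and the fraction of points that were clustered.

```python
from collections import Counter

# Hypothetical labels for eight points (not real Persistable output).
# Assumption: a negative label such as -1 marks a point left unclustered.
toy_labels = [0, 1, -1, 0, 2, -1, 1, 0]

# Keep only the points that were assigned to a cluster.
clustered = [label for label in toy_labels if label >= 0]

# Number of points in each cluster.
cluster_sizes = Counter(clustered)

# Fraction (between 0 and 1) of points that were clustered.
fraction_clustered = len(clustered) / len(toy_labels)

print(dict(cluster_sizes))
print(fraction_clustered)
```

Note that the quantity computed this way is a fraction between 0 and 1, not a percentage out of 100.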

## Comparing with MNIST labels

In [6]:
# get indices of points clustered by Persistable
clustered_points = (cluster_labels >= 0)

# print adjusted rand index, and percentage of data points clustered
ari = adjusted_rand_score(mnist.target[clustered_points], cluster_labels[clustered_points])
pct_clustered = (np.sum(clustered_points) / mnist.target.shape[0])
print('adjusted rand index: ' + str(ari))
print('percentage of data points clustered: ' + str(pct_clustered))

# print confusion matrix
print('Confusion matrix:')
print(confusion_matrix(mnist.target.astype(np.int16)[clustered_points], cluster_labels[clustered_points]))

adjusted rand index: 0.9390434332763671
percentage of data points clustered: 0.9397285714285715
Confusion matrix:
[[   0    2    3    0    2    5    6    7   25 6846]
 [   0    2    0    1    2   14 7797   31    4    1]
 [   3    5    3   22   14  123   47 6714   11   41]
 [   1   23   57 6550   43   44    9   41    4    2]
 [5143   73    0    1    2   12   45    4   25    5]
 [   5   19 5192   54   15    6    1   16   71   17]
 [   6    0   28    0    1    0   16    3 6795   24]
 [   8   56    0    0    1 7095   75   22    0    2]
 [  11   48   64   79 6253   29   77   15   31   14]
 [  30 5539    8   85   13  103   14    6    4   15]]
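
The rows of this matrix are the MNIST digits 0 to 9 and the columns are the cluster labels, so the large entries sit off the diagonal: cluster numbering is arbitrary. The sketch below copies the matrix by hand from the output above and works out which digit dominates each cluster. It also recomputes the adjusted Rand index from the table using the standard pair-counting formula, which is not necessarily how scikit-learn implements it internally. The variable names are illustrative only.

```python
from math import comb

# Confusion matrix copied from the output above.
# Rows: MNIST digits 0-9. Columns: Persistable cluster labels 0-9.
confusion = [
    [0, 2, 3, 0, 2, 5, 6, 7, 25, 6846],
    [0, 2, 0, 1, 2, 14, 7797, 31, 4, 1],
    [3, 5, 3, 22, 14, 123, 47, 6714, 11, 41],
    [1, 23, 57, 6550, 43, 44, 9, 41, 4, 2],
    [5143, 73, 0, 1, 2, 12, 45, 4, 25, 5],
    [5, 19, 5192, 54, 15, 6, 1, 16, 71, 17],
    [6, 0, 28, 0, 1, 0, 16, 3, 6795, 24],
    [8, 56, 0, 0, 1, 7095, 75, 22, 0, 2],
    [11, 48, 64, 79, 6253, 29, 77, 15, 31, 14],
    [30, 5539, 8, 85, 13, 103, 14, 6, 4, 15],
]

# Total number of clustered points.
n_clustered = sum(sum(row) for row in confusion)

# Transpose so that each entry of `columns` is one cluster.
columns = [list(col) for col in zip(*confusion)]

# For each cluster, the digit that contributes the most points.
cluster_to_digit = [col.index(max(col)) for col in columns]

# Points that carry the majority digit of their cluster.
majority_hits = sum(max(col) for col in columns)
majority_fraction = majority_hits / n_clustered

# Adjusted Rand index from the contingency table (pair-counting form).
sum_cells = sum(comb(v, 2) for row in confusion for v in row)
sum_rows = sum(comb(sum(row), 2) for row in confusion)
sum_cols = sum(comb(sum(col), 2) for col in columns)
expected = sum_rows * sum_cols / comb(n_clustered, 2)
ari = (sum_cells - expected) / (0.5 * (sum_rows + sum_cols) - expected)

print("cluster -> digit:", cluster_to_digit)
print("clustered points:", n_clustered)
print("majority fraction:", majority_fraction)
print("adjusted rand index:", ari)
```

Each cluster is dominated by a different digit, so the mapping from clusters to digits is a permutation. This is why the adjusted Rand index, which ignores how clusters are numbered, is a convenient score here. The total of 65781 clustered points is consistent with the printed fraction of the 70000 MNIST points.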
