In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np

import warnings
warnings.filterwarnings('ignore')

## K-means clustering of the MNIST dataset

This project applies the k-means clustering algorithm to cluster images of handwritten digits.

The [MNIST](http://yann.lecun.com/exdb/mnist/) dataset is a large database of handwritten digits. We will analyse a smaller, MNIST-like dataset that ships with Scikit-learn, in which each digit image is reduced to 8x8 grayscale pixels.

It is a very well-known dataset in the machine learning community and can be loaded directly from Scikit-learn:

In [None]:
from sklearn.datasets import load_digits

digits = load_digits()

print(digits.DESCR)

In `digits`, `data` contains the pixel feature vectors and `target` contains the labels.

We assign the feature vectors to `X` and the target to `y`: 

In [None]:
X = digits.data
y = digits.target

Print the number of rows and columns in `X` and `y`:

In [None]:
#Start code here

#End code here

The following code shows a random datapoint:

In [None]:
import random

plt.grid(False)  # the `b` keyword was renamed to `visible` in recent Matplotlib releases
idx = random.randint(0,X.shape[0]-1)
plt.imshow(X[idx].reshape(8,8),cmap=plt.cm.gray_r)
plt.title("label = %i"%y[idx])
plt.show()

The `KMeans` class in Scikit-learn has the following parameters:

In [None]:
from sklearn.cluster import KMeans

help(KMeans)

The most important are `n_clusters` (the number of cluster centroids, K, that K-means should find) and `init` (the method used to initialize the cluster centers). These parameters are called hyperparameters because they must be set by the user. This is in contrast to model parameters, which are optimized by the learning algorithm.
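To make the distinction concrete, here is a minimal sketch on toy data (the two blobs and the names `X_toy` and `km_toy` are assumptions for illustration, not part of the exercise). The constructor arguments are hyperparameters, while `cluster_centers_` holds the model parameters learned during `fit()`.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data (an assumption for illustration): two well-separated 2-D blobs.
rng = np.random.RandomState(0)
X_toy = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(50, 2)),
    rng.normal(loc=5.0, scale=0.5, size=(50, 2)),
])

# Hyperparameters: chosen by the user before fitting.
km_toy = KMeans(n_clusters=2, init="k-means++", n_init=1, random_state=0)

# Model parameters: the centroids, learned from the data during fit().
km_toy.fit(X_toy)
print(km_toy.cluster_centers_.shape)  # (2, 2): n_clusters x n_features
```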

Also notice the hyperparameter `n_init`, which sets the number of times the K-means algorithm is run with different centroid seeds. The final result is the best of these runs, i.e. the one with the lowest inertia.
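As a rough illustration of what several initializations buy you, the sketch below runs K-means five times by hand, each with a single random initialization, and keeps the run with the lowest inertia. The toy blobs and the manual loop over seeds are assumptions for illustration; Scikit-learn's internal bookkeeping may differ in detail.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data (an assumption for illustration): three 2-D blobs.
rng = np.random.RandomState(0)
X_toy = np.vstack([
    rng.normal(loc=center, scale=1.0, size=(40, 2))
    for center in (0.0, 4.0, 8.0)
])

# One K-means run per seed, each from a single random initialization.
inertias = []
for seed in range(5):
    km = KMeans(n_clusters=3, init="random", n_init=1, random_state=seed)
    km.fit(X_toy)
    inertias.append(km.inertia_)

# Keeping the run with the lowest inertia mimics the idea behind n_init > 1.
best_seed = int(np.argmin(inertias))
print("inertia per run:", np.round(inertias, 2))
print("best run:", best_seed)
```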

Cluster the data into 10 groups with just one random cluster center initialization. Set `random_state` equal to zero:

In [None]:
#Start code here

#initialize K-means here
cls_kmns = 

#store clusters here
kmeans_result = 

#End code here

print(kmeans_result)

On the help page of the Scikit-learn KMeans implementation there is a section "Attributes" that lists additional results computed during K-means clustering. For instance, `cluster_centers_` contains the 10 cluster centers computed by the K-means algorithm.
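One of these attributes, `inertia_`, is the sum of squared distances from each sample to its assigned cluster center. The following sketch recomputes it by hand on random toy data (the data and the variable names are assumptions for illustration) and compares it with the value stored by the fitted estimator.

```python
import numpy as np
from sklearn.cluster import KMeans

# Random toy data (an assumption for illustration).
rng = np.random.RandomState(1)
X_toy = rng.normal(size=(100, 3))

km = KMeans(n_clusters=4, n_init=1, random_state=0).fit(X_toy)

# Look up the center assigned to each sample, then sum the squared distances.
assigned_centers = km.cluster_centers_[km.labels_]
manual_inertia = np.sum((X_toy - assigned_centers) ** 2)

print(manual_inertia, km.inertia_)
```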

What is the inertia for the obtained clusters?

In [None]:
#Start code here

#End code here

The following code plots the 10 cluster centers:

In [None]:
fig, ax = plt.subplots(2, 5, figsize=(8, 3))
for axi, center in zip(ax.flat, cls_kmns.cluster_centers_):
    axi.set(xticks=[], yticks=[])
    axi.imshow(center.reshape(8, 8), cmap=plt.cm.binary)

Create a Pandas DataFrame `label_compare` with two columns: `label_cluster` that contains the labels assigned by the K-means clustering and `label_true` that contains the true (observed) label for each datapoint in `X`: 

In [None]:
#Start code here

label_compare = 

#End code here

print(label_compare)

Create a Pandas DataFrame `tmp` that contains all the rows in `label_compare` that were assigned to cluster center 0:

In [None]:
#Start code here

tmp = 

#End code here

Print the first 20 rows in `tmp`:

In [None]:
#Start code here

print(tmp.head(20))

#End code here

It should be clear that the cluster labels assigned by K-means are arbitrary indices: they do not correspond to the true label shared by the majority of the images in a cluster (i.e. the mode of the true labels in that cluster).
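A small pure-Python sketch shows the idea on made-up labels (the lists `cluster_ids` and `true_labels` are hypothetical): each cluster id is paired with the most common true label among its members.

```python
from collections import Counter

# Hypothetical toy labels: cluster ids from K-means and the true digits.
cluster_ids = [0, 0, 0, 1, 1, 1, 1, 2, 2]
true_labels = [7, 7, 1, 3, 3, 3, 8, 0, 0]

# Collect the true labels that fall into each cluster.
members = {}
for cluster_id, label in zip(cluster_ids, true_labels):
    members.setdefault(cluster_id, []).append(label)

# The most common true label in a cluster is that cluster's mode.
cluster_to_mode = {
    cluster_id: Counter(labels).most_common(1)[0][0]
    for cluster_id, labels in members.items()
}
print(cluster_to_mode)  # {0: 7, 1: 3, 2: 0}
```

Here cluster 0 mostly contains sevens, so its index 0 says nothing about the digit it represents.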

A Pandas Series has the function `mode()` to compute the mode of the values in a Series. Print the mode of the column `label_true` in `tmp`: 

In [None]:
#Start code here

#End code here

Compute the mode for each cluster in `label_compare` and add these modes to the Python list `label_mapper`:

In [None]:
label_mapper = []

for label_cluster in range(0,10):
    #Start code here
    
    #End code here
    
for label_cluster in range(0,10):
    print("Mode for cluster labeled {} = {}".format(label_cluster,label_mapper[label_cluster]))

Use the `map()` function to add a column `label_mode` to `label_compare` that contains the mode for each cluster label in `label_cluster`:
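As a hint on how `map()` behaves, here is a toy sketch (the DataFrame `toy` and the list `toy_mapper` are hypothetical). `Series.map` accepts a dict, a Series, or a function, so one option is to turn a list indexed by cluster id into a dict first.

```python
import pandas as pd

# Hypothetical cluster ids and a per-cluster list of modes (index = cluster id).
toy = pd.DataFrame({"label_cluster": [2, 0, 1, 0, 2]})
toy_mapper = [7, 3, 0]

# Convert the list to a dict {cluster id: mode} and map each cluster id to it.
toy["label_mode"] = toy["label_cluster"].map(dict(enumerate(toy_mapper)))
print(toy["label_mode"].tolist())  # [0, 7, 3, 7, 0]
```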

In [None]:
#Start code here

#End code here

print(label_compare)

Finally, we can compare the mode of the labels in each cluster with the true labels of the datapoints.
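Accuracy is simply the fraction of datapoints whose predicted label equals the true label. The toy arrays below are assumptions for illustration; the sketch checks that the manual fraction agrees with `accuracy_score`.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical true labels and per-cluster mode predictions.
y_true_toy = np.array([7, 7, 1, 3, 3, 3, 8, 0, 0])
y_pred_toy = np.array([7, 7, 7, 3, 3, 3, 3, 0, 0])

# Fraction of positions where prediction and truth agree (7 out of 9 here).
manual_accuracy = np.mean(y_true_toy == y_pred_toy)
print(manual_accuracy, accuracy_score(y_true_toy, y_pred_toy))
```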

Print the accuracy of the K-means label modes (that can be seen as the class predictions computed by K-means):

In [None]:
from sklearn.metrics import accuracy_score

#Start code here

#End code here

The following code computes and plots a confusion matrix for the K-means predictions:

In [None]:
from sklearn.metrics import confusion_matrix

mat = confusion_matrix(label_compare["label_true"],label_compare["label_mode"])
sns.heatmap(mat.T, square=True, annot=True, fmt='d', cbar=False,
            xticklabels=digits.target_names,
            yticklabels=digits.target_names)
plt.xlabel('true label')
plt.ylabel('predicted label')

For what digit do the K-means clusters make the most mistakes?
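One way to read the answer off a confusion matrix is to count the off-diagonal entries per true label. The sketch below does this on made-up labels (the toy lists are assumptions for illustration). Note that `confusion_matrix` puts true labels on the rows, whereas the heatmap above plots the transpose.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels for three classes.
y_true_toy = [0, 0, 0, 1, 1, 2, 2, 2, 2]
y_pred_toy = [0, 0, 1, 1, 1, 2, 0, 0, 2]

# Rows are true labels, columns are predicted labels.
mat_toy = confusion_matrix(y_true_toy, y_pred_toy)

# Mistakes per true label: everything in a row except the diagonal entry.
errors_per_label = mat_toy.sum(axis=1) - np.diag(mat_toy)
print(errors_per_label)                  # [1 0 2]
print(int(np.argmax(errors_per_label)))  # 2
```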

Apply K-means clustering again, but first normalize the feature vectors using `StandardScaler()` (write the normalized feature vectors to `X_norm`). What is the accuracy of the K-means predictions now?
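As a reminder of what `StandardScaler()` does, the sketch below applies it to a tiny made-up matrix (an assumption for illustration) and compares the result with subtracting the column mean and dividing by the column standard deviation. The third column is constant, which is a case worth knowing about when some pixels never change: the scaler leaves such a column at zero instead of dividing by zero.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy matrix: two columns on very different scales and one constant column.
X_toy = np.array([[1.0, 100.0, 5.0],
                  [2.0, 300.0, 5.0],
                  [3.0, 500.0, 5.0]])

X_toy_norm = StandardScaler().fit_transform(X_toy)

# Manual standardization of the two non-constant columns.
varying = X_toy[:, :2]
manual = (varying - varying.mean(axis=0)) / varying.std(axis=0)

print(X_toy_norm)
```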

In [None]:
from sklearn.preprocessing import StandardScaler

#Start code here

X_norm = 

kmeans_result = 

label_compare = 

label_mapper = []
for label_cluster in range(0,10):


#End code here

print(accuracy_score(label_compare["label_true"],label_compare["label_mode"]))


Use the `TSNE` class in Scikit-learn to project the 64-dimensional feature vectors in `X` to a 2-dimensional space. Set the `perplexity` hyperparameter to 30.
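For orientation, here is a minimal sketch of the `TSNE` call pattern on random toy data (the data and the names `tsne_toy` and `X_toy_embedded` are assumptions for illustration). The embedding is obtained with `fit_transform`, and `perplexity` must be smaller than the number of samples.

```python
import numpy as np
from sklearn.manifold import TSNE

# Random toy data (an assumption for illustration): 100 samples, 10 features.
rng = np.random.RandomState(0)
X_toy = rng.normal(size=(100, 10))

# Project to 2 dimensions; random_state makes the embedding reproducible.
tsne_toy = TSNE(n_components=2, perplexity=30, random_state=0)
X_toy_embedded = tsne_toy.fit_transform(X_toy)
print(X_toy_embedded.shape)  # (100, 2)
```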

In [None]:
from sklearn.manifold import TSNE

#Start code here

prj_tsne = 
X_embedded =

#End code here

print(X_embedded)

Create a Pandas DataFrame `tsne_result` that contains the two columns in `X_embedded` with column names `t-SNE_1` and `t-SNE_2`:

In [None]:
#Start code here

tsne_result = 

#End code here

print(tsne_result)

Add a column `label` to `tsne_result` that contains the true label `y`:

In [None]:
#Start code here

#End code here

print(tsne_result)

To plot the t-SNE result we first convert the `label` column to a string (to understand why we do this, just run the notebook while skipping the following line of code):

In [None]:
tsne_result["label"] = tsne_result["label"].astype(str) 

We can use the Python [Seaborn](https://seaborn.pydata.org/) library to plot the t-SNE result:

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

sns.scatterplot(x="t-SNE_1",y="t-SNE_2",hue="label",data=tsne_result)

Apply K-means clustering to `X_embedded` and again create a Pandas DataFrame `label_compare` with columns `label_true` that contains the true labels and `label_mode` that contains the K-means predicted labels:

In [None]:
#Start code here

kmeans_result = 

label_compare = 

label_mapper = []
for label_cluster in range(0,10):

label_compare["label_mode"] =
 
#End code here

print(label_compare)

What is the accuracy of the K-means label predictions now? 

In [None]:
#Start code here

#End code here