# Why this Dataset?
Let's be frank up front. MNIST is the most classic example of a machine learning dataset. It has been explored many times by many people, and we do not have any novelty to add to the solution. **Unless we try an entirely different approach to the problem.** Much of the existing work on this dataset uses some form of *Supervised Learning*, be it linear models, decision trees, GBTs or CNNs.  
What we are going to do in this notebook is find a unique approach to this problem through an ***Unsupervised Algorithm***. Sounds novel now? 😎

# Problem Statement
Our goal is to correctly identify digits from a dataset of tens of thousands of handwritten images.  
The scoring metric for this competition is also a classical one for multi-class classification: **Accuracy**.  
How often do we encounter this nowadays! 😆

## Data Description:-
1. The data files train.csv and test.csv contain gray-scale images of hand-drawn digits, from zero through nine.
2. Each image is 28 pixels in height and 28 pixels in width, for a total of 784 pixels. Each pixel has a single pixel-value associated with it, indicating the lightness or darkness of that pixel, with higher numbers meaning darker. This pixel-value is an integer between 0 and 255, inclusive.
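
As a quick illustration of this layout, here is a minimal sketch that builds one synthetic 784-value row and reshapes it into a 28 x 28 grid. The values are random and the names `row` and `image` are made up for this example; this is not the competition data.

```python
import numpy as np

# Hypothetical stand-in for one CSV row: 784 integer pixel values in [0, 255]
rng = np.random.default_rng(0)
row = rng.integers(0, 256, size=784)

# Rows are stored flattened; reshape recovers the 28 x 28 grid (row-major order)
image = row.reshape(28, 28)

print(image.shape)
print(image.min(), image.max())
```

With row-major reshaping, the pixel at grid position `(i, j)` is simply `row[i * 28 + j]`.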

## Expected Outcome:-
* Correctly identify the digit from the pixel values of a grayscale image.

## Problem Category:-
* From the data and objective it is evident that this is a **multi-class classification problem** in the **Computer Vision** domain.

# About this Notebook
* This Notebook proposes a novel approach towards attempting the classical MNIST problem as an unsupervised learning problem.
* Since MNIST is a beginner dataset, this notebook will also be a **beginner friendly** notebook.
* This notebook assumes the reader has some basic idea regarding unsupervised learning methods and types.
    * If you are not confident, I would highly recommend brushing up on these topics with [this Notebook](https://www.kaggle.com/manabendrarout/unsupervised-learning-for-beginners), which I wrote some time back. It is very extensive and should cover all the basics.
* This notebook also attempts to project other general use cases and inferences from Unsupervised learning models.

# Imports
With these points in mind, let's start by importing the basic libraries that we will need throughout this notebook.

In [None]:
# Aesthetics
import warnings
import sklearn.exceptions
warnings.filterwarnings('ignore', category=DeprecationWarning)
warnings.filterwarnings('ignore', category=FutureWarning)
warnings.filterwarnings("ignore", category=sklearn.exceptions.UndefinedMetricWarning)

# General
import pandas as pd
pd.set_option('display.max_columns', None)
import numpy as np
import os
import time
import random
from scipy.stats import mode

# Visualisation
import matplotlib.pyplot as plt
from matplotlib import offsetbox
%matplotlib inline
import seaborn as sns
sns.set(style="whitegrid")
from plotly import graph_objs as go
import plotly.express as px
import plotly.figure_factory as ff
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot

# Machine Learning
# Dimensionality Reduction
from sklearn.manifold import Isomap, TSNE
from sklearn.decomposition import PCA
# Clustering
from sklearn.mixture import GaussianMixture
# Metrics
from sklearn import metrics
from sklearn.metrics import confusion_matrix, classification_report

Many models/libraries have a random initialization state that can differ from one run to another. This can lead to differences in performance purely due to randomness and not due to any change in code or algorithm. To account for this, let's fix the randomness by seeding with a fixed integer so that our performance measurements are more reproducible.

In [None]:
RANDOM_SEED = 42

In [None]:
def seed_everything(seed=RANDOM_SEED):
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)
    random.seed(seed)

In [None]:
seed_everything()
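
To see what seeding buys us, here is a small self-contained sketch (it redefines the same helper so it can run on its own) that seeds, draws a few random numbers, seeds again, and checks that the draws repeat. The variable names `first`, `second` and `same` are made up for this example.

```python
import os
import random
import numpy as np

def seed_everything(seed=42):
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)
    random.seed(seed)

# Draw once after seeding...
seed_everything()
first = (random.random(), np.random.rand(3))

# ...then reseed and draw again: the numbers should repeat exactly
seed_everything()
second = (random.random(), np.random.rand(3))

same = first[0] == second[0] and np.array_equal(first[1], second[1])
print(same)  # True
```

One caveat: setting `PYTHONHASHSEED` from inside a running interpreter does not change hashing for that process, since Python reads it at startup. Estimators that accept `random_state` are still best seeded explicitly, as done later in this notebook.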

Now we will get reproducible results every time we run this code. Moving on, let's read the train and test set files...

In [None]:
data_path = '../input/digit-recognizer'

train_file_path = os.path.join(data_path, 'train.csv')
sample_sub_path = os.path.join(data_path, 'sample_submission.csv')
test_file_path = os.path.join(data_path, 'test.csv')

print(f'Training File path: {train_file_path}')
print(f'Sample Submission File path: {sample_sub_path}')
print(f'Test Files path: {test_file_path}')

In [None]:
train_df = pd.read_csv(train_file_path)
sample_sub_df = pd.read_csv(sample_sub_path)
test_df = pd.read_csv(test_file_path)

# EDA
Let's look at some example images...

In [None]:
# Reshape the flat 784-pixel rows into 28x28x1 images
train_data = train_df.drop(['label'], axis=1).values.reshape(-1,28,28,1)
test_data = test_df.values.reshape(-1,28,28,1)

In [None]:
num_examples = 10
plt.figure(figsize=(20,20))
for i in range(num_examples):
    plt.subplot(1, num_examples, i+1)
    plt.imshow(train_data[i], cmap='Greys')
    plt.axis('off')
plt.show()

Great! Let's see what the class balance is in this particular problem set...

In [None]:
fig, ax = plt.subplots(figsize=(18, 6))
sns.set_style("whitegrid")
sns.countplot(x='label', data=train_df);
plt.ylabel("No. of Observations", size=20);
plt.xlabel("Class Name", size=20);
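
The same balance check can also be done numerically. The sketch below uses randomly generated labels as a stand-in for `train_df.label` (an assumption for illustration, not the real data) and reports each class's share plus the ratio between the largest and smallest class.

```python
import numpy as np

# Hypothetical labels standing in for train_df.label (not the real data)
rng = np.random.default_rng(42)
labels = rng.integers(0, 10, size=1000)

# Count samples per digit and convert to shares of the whole set
counts = np.bincount(labels, minlength=10)
shares = counts / counts.sum()

# A ratio close to 1 means the largest and smallest classes are similar in size
imbalance_ratio = counts.max() / counts.min()

for digit, (count, share) in enumerate(zip(counts, shares)):
    print(f'{digit}: {count} ({share:.1%})')
print(f'max/min ratio: {imbalance_ratio:.2f}')
```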

Okay, the classes look pretty balanced. Now let's convert the images into 2D and 3D embeddings and see how closely they resemble one another...

In [None]:
iso = Isomap(n_components=2)

# Using only 1/10 of the data as full data takes a lot of time to run
iso.fit(train_df.drop(['label'], axis=1)[::10])
data_2d = iso.transform(train_df.drop(['label'], axis=1)[::10])

iso_df = pd.DataFrame(data_2d)
iso_df['label'] = train_df.label.values[::10]
iso_df.columns = ['x', 'y', 'label']
# Converting label to string to get discrete colors in plot
iso_df['label'] = iso_df['label'].astype(str)
iso_df.head()

In [None]:
fig = px.scatter(iso_df, x='x', y='y', color='label',
                 hover_data=['label'])
fig.update_layout(title = 'MNIST ISOmap 2D')
fig.show()

In [None]:
iso = Isomap(n_components=3)

# Using only 1/10 of the data as full data takes a lot of time to run
iso.fit(train_df.drop(['label'], axis=1)[::10])
data_3d = iso.transform(train_df.drop(['label'], axis=1)[::10])

iso_df = pd.DataFrame(data_3d)
iso_df['label'] = train_df.label.values[::10]
iso_df.columns = ['x', 'y', 'z', 'label']
# Converting label to string to get discrete colors in plot
iso_df['label'] = iso_df['label'].astype(str)
iso_df.head()

In [None]:
fig = px.scatter_3d(iso_df, x='x', y='y', z='z', color='label',
                    hover_data=['label'])
fig.update_layout(title = 'MNIST ISOmap 3D')
fig.show()

**NOTE:- The above plot is interactive. Feel free to rotate/zoom to clearly observe the clusters.**  

Another dimensionality reduction technique is PCA. Let's see how those clusters look...

In [None]:
pca = PCA(n_components=2)

# Using only 1/5 of the data as full data takes a lot of time to run
pca.fit(train_df.drop(['label'], axis=1)[::5])
data_2d = pca.transform(train_df.drop(['label'], axis=1)[::5])

pca_df = pd.DataFrame(data_2d)
pca_df['label'] = train_df.label.values[::5]
pca_df.columns = ['x', 'y', 'label']
# Converting label to string to get discrete colors in plot
pca_df['label'] = pca_df['label'].astype(str)
pca_df.head()

In [None]:
fig = px.scatter(pca_df, x='x', y='y', color='label',
                 hover_data=['label'])
fig.update_layout(title = 'MNIST PCA 2D')
fig.show()

In [None]:
pca = PCA(n_components=3)

# Using only 1/5 of the data as full data takes a lot of time to run
pca.fit(train_df.drop(['label'], axis=1)[::5])
data_3d = pca.transform(train_df.drop(['label'], axis=1)[::5])

pca_df = pd.DataFrame(data_3d)
pca_df['label'] = train_df.label.values[::5]
pca_df.columns = ['x', 'y', 'z', 'label']
# Converting label to string to get discrete colors in plot
pca_df['label'] = pca_df['label'].astype(str)
pca_df.head()

In [None]:
fig = px.scatter_3d(pca_df, x='x', y='y', z='z', color='label',
                    hover_data=['label'])
fig.update_layout(title = 'MNIST PCA 3D')
fig.show()

**NOTE:- The above plot is interactive. Feel free to rotate/zoom to clearly observe the clusters.**  
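
When reading PCA plots it helps to know how much of the total variance the kept components actually retain. Here is a minimal sketch on synthetic data (the matrix `X` is made up so that most variance lies along three directions; it is not the pixel data) using the `explained_variance_ratio_` attribute of a fitted `PCA`.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for the pixel matrix: 200 samples, 20 features,
# constructed so that most variance lies along 3 hidden directions
rng = np.random.default_rng(42)
latent = rng.normal(size=(200, 3)) * np.array([10.0, 5.0, 2.0])
mixing = rng.normal(size=(3, 20))
X = latent @ mixing + rng.normal(scale=0.1, size=(200, 20))

pca = PCA(n_components=3).fit(X)

# Fraction of total variance captured by each kept component
ratios = pca.explained_variance_ratio_
print(ratios)
print(f'Variance kept by 3 components: {ratios.sum():.3f}')
```

On real image data the retained fraction can be far lower than in this toy case, so some overlap in a PCA plot may reflect discarded information rather than genuine similarity between digits.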

Another interesting and powerful dimensionality reduction technique is tSNE. Let's see how we do on that algorithm...

In [None]:
tsne = TSNE(n_components=2, random_state=RANDOM_SEED)

# Using only 1/5 of the data as full data takes a lot of time to run
data_2d = tsne.fit_transform(train_df.drop(['label'], axis=1)[::5])

tsne_df = pd.DataFrame(data_2d)
tsne_df['label'] = train_df.label.values[::5]
tsne_df.columns = ['x', 'y', 'label']
# Converting label to string to get discrete colors in plot
tsne_df['label'] = tsne_df['label'].astype(str)
tsne_df.head()

In [None]:
fig = px.scatter(tsne_df, x='x', y='y', color='label',
                 hover_data=['label'])
fig.update_layout(title = 'MNIST tSNE 2D')
fig.show()

In [None]:
tsne = TSNE(n_components=3, random_state=RANDOM_SEED)

# Using only 1/5 of the data as full data takes a lot of time to run
data_3d = tsne.fit_transform(train_df.drop(['label'], axis=1)[::5])

tsne_df = pd.DataFrame(data_3d)
tsne_df['label'] = train_df.label.values[::5]
tsne_df.columns = ['x', 'y', 'z', 'label']
# Converting label to string to get discrete colors in plot
tsne_df['label'] = tsne_df['label'].astype(str)
tsne_df.head()

In [None]:
fig = px.scatter_3d(tsne_df, x='x', y='y', z='z', color='label',
                    hover_data=['label'])
fig.update_layout(title = 'MNIST tSNE 3D')
fig.show()

**NOTE:- The above plot is interactive. Feel free to rotate/zoom to clearly observe the clusters.**  

Now that we have seen various embeddings of MNIST projected onto 2 and 3 axes, you can observe some interesting things here:-
* Digits like 6 and 0 which look very close to one another have a lot of overlap in the cluster space.
* Digits like 4 and 0 which look very different are far apart in the projected plot.
* One other interesting thing is the overlap between 1 and 7 cluster spaces. I am guessing that might be due to some people writing ones with hats on top which makes them look closer to 7.
* Interestingly it also shows that some 4 and 7 are also closely related. I guess these are because some people write 7 with a horizontal line in between which makes it seem closer to 4.

As you can see we can derive similar interesting insights from the Data using unsupervised learning, dimensionality reduction and plotting.  
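
Observations like these can be roughly quantified by comparing the distances between class centroids in the embedding. The sketch below uses a made-up 2D embedding in which two classes are deliberately placed close together; the centers, class names and spread are assumptions for illustration, not values measured from MNIST.

```python
import numpy as np

# Hypothetical 2D embedding: three classes, two of them placed close together
rng = np.random.default_rng(42)
centers = {'0': (0.0, 0.0), '6': (1.0, 0.5), '4': (8.0, 8.0)}
points, labels = [], []
for name, center in centers.items():
    points.append(rng.normal(loc=center, scale=1.0, size=(100, 2)))
    labels += [name] * 100
points = np.vstack(points)
labels = np.array(labels)

# Mean position of each class in the embedding
names = sorted(centers)
centroids = np.array([points[labels == n].mean(axis=0) for n in names])

# Pairwise Euclidean distances between class centroids
diff = centroids[:, None, :] - centroids[None, :, :]
dist = np.sqrt((diff ** 2).sum(axis=-1))

for i, a in enumerate(names):
    for j in range(i + 1, len(names)):
        print(f'{a} vs {names[j]}: {dist[i, j]:.2f}')
```

Treat such numbers with care on t-SNE output: that method mainly preserves local neighbourhoods, so distances between far-apart clusters are not reliably meaningful.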

Another interesting thing is to look at the kinds of variation within a single class... To visualize this better, let's create a helper function that will output image thumbnails at the locations of the projections.

In [None]:
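def plot_components(data, model, images=None, ax=None, thumb_frac=0.05, cmap='gray'):
    ax = ax or plt.gca()
    # Project the data to 2D and draw every point as a black dot
    proj = model.fit_transform(data)
    ax.plot(proj[:, 0], proj[:, 1], '.k')
    if images is not None:
        # Minimum squared distance between thumbnails, as a fraction of the plot extent
        min_dist_2 = (thumb_frac * max(proj.max(0) - proj.min(0))) ** 2
        shown_images = np.array([2 * proj.max(0)])
        for i in range(data.shape[0]):
            # Skip thumbnails that would overlap ones already drawn
            dist = np.sum((proj[i] - shown_images) ** 2, 1)
            if np.min(dist) < min_dist_2:
                continue
            shown_images = np.vstack([shown_images, proj[i]])
            imagebox = offsetbox.AnnotationBbox(
                offsetbox.OffsetImage(images[i], cmap=cmap),
                proj[i])
            ax.add_artist(imagebox)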

Let's plot the ones (every fifth sample, to keep it fast) and see how varied they are from one another...

In [None]:
fig, ax = plt.subplots(figsize=(10, 10))
model = Isomap(n_neighbors=5, n_components=2, eigen_solver='dense')
data = train_df[train_df.label == 1][::5]
data = data.drop(['label'], axis=1).values
plot_components(data, model, images=data.reshape((-1, 28, 28)), ax=ax, thumb_frac=0.05, cmap='gray_r')

Fascinating!  
As you can see, this gives us an idea of the variety of Ones we have in our dataset.
Some observations:-
1. Towards the left there are ones which are tilting left.
2. Towards the middle there are straight ones.
3. Towards the right there are ones which tilt right.
4. Towards bottom there are ones which are thinner.
5. Towards the top there are ones which are thicker.

Let's try the same on Zeros and let's see what comes up...

In [None]:
fig, ax = plt.subplots(figsize=(10, 10))
model = Isomap(n_neighbors=5, n_components=2, eigen_solver='dense')
data = train_df[train_df.label == 0][::5]
data = data.drop(['label'], axis=1).values
plot_components(data, model, images=data.reshape((-1, 28, 28)), ax=ax, thumb_frac=0.05, cmap='gray_r')

Here we can observe that:-
1. On the left side of the plot there are perfectly circular(ish) zeros.
2. Towards the right-hand side the zeros are slightly tilted towards the right.
3. On bottom the zeros are marked with a thinner pen/pencil.
4. Towards the top the Zeros are a lot thicker.

Similarly, we can find a lot of interesting and beautiful insights in data using Unsupervised learning. Unsupervised learning is often overlooked, as people are more interested in supervised learning. But as you can see, Unsupervised learning can be really powerful, insightful and cool. And the best thing is that we do not need to do much: we just have to infer characteristics from the already segmented data.  

Now moving onto the coolest bit, for which we all are here... Can we solve the MNIST problem through unsupervised learning?  
Let's find out... *(SPOILER ALERT:- `YES`)*  

# Model Creation

Before clustering, let's reduce the dimensionality of the digits using TSNE.

In [None]:
tsne = TSNE(n_components=3,
            n_jobs=-1,
            random_state=RANDOM_SEED)
digits_proj = tsne.fit_transform(train_df.drop(['label'], axis=1))

Now predicting the clusters using Gaussian Mixture Model.

In [None]:
gmm = GaussianMixture(
    n_components=train_df.label.nunique(),
    random_state=RANDOM_SEED)
clusters = gmm.fit_predict(digits_proj)

Usually clustering also involves optimizing the number of clusters. But in this scenario we already know how many clusters we need the data to be split into, so we will not be guessing/optimizing the number of clusters.  
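
For cases where the number of clusters is not known in advance, one common option is to fit several mixtures and compare an information criterion such as BIC (lower is better). This is only a sketch on synthetic blobs, not something this notebook runs on MNIST; the data, the range of `k`, and the names `bics` and `best_k` are assumptions for illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic data: three well-separated 2D blobs (not MNIST)
rng = np.random.default_rng(42)
true_centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(150, 2)) for c in true_centers])

# Fit one mixture per candidate cluster count and record its BIC
bics = {}
for k in range(1, 7):
    gmm = GaussianMixture(n_components=k, random_state=42).fit(X)
    bics[k] = gmm.bic(X)

# The candidate with the lowest BIC is the preferred cluster count
best_k = min(bics, key=bics.get)
print(best_k)
```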

Now that we have our clusters, let's assign labels to them. We will take the mode (most frequent true label) of each cluster and assign it to all the members of that cluster.

In [None]:
labels = np.zeros_like(clusters)
for i in range(10):
    mask = (clusters == i)
    labels[mask] = mode(train_df.label[mask])[0]
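
To make the majority-vote step concrete, here is a toy example with eight samples and two clusters. The arrays are invented for illustration, and `np.bincount(...).argmax()` is used here in place of `scipy.stats.mode` purely to keep the sketch dependency-light.

```python
import numpy as np

# Toy example: 8 samples, 2 clusters, hypothetical true labels
clusters = np.array([0, 0, 0, 0, 1, 1, 1, 1])
true_labels = np.array([7, 7, 7, 1, 3, 3, 3, 3])

predicted = np.zeros_like(true_labels)
mapping = {}
for c in np.unique(clusters):
    mask = clusters == c
    # Most frequent true label among the members of cluster c
    majority = int(np.bincount(true_labels[mask]).argmax())
    mapping[int(c)] = majority
    predicted[mask] = majority

accuracy = (predicted == true_labels).mean()
print(mapping)    # {0: 7, 1: 3}
print(accuracy)   # 0.875
```

The single `1` that landed in the cluster dominated by `7` gets relabelled as `7`, which is exactly the kind of error that caps the accuracy of this approach.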

Let's find the mapping from cluster number to actual digit so that we can apply it to the test set...

In [None]:
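mapping_dict = {}
for i in range(10):
    mask = (clusters == i)
    # np.atleast_1d keeps this working whether scipy's mode returns a scalar or an array
    mapping_dict[i] = int(np.atleast_1d(mode(train_df.label[mask])[0])[0])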

In [None]:
print(mapping_dict)
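
Applying such a mapping to unseen data needs a projection that can be reused on new rows. scikit-learn's `TSNE` only offers `fit` and `fit_transform`, so the sketch below swaps in `PCA`, which does have `transform`. This is a hedged illustration on synthetic blobs, not the pipeline used in this notebook; `make_blobs_toy` and all the data are made up for the example.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(42)

def make_blobs_toy(n_per_class):
    # Hypothetical data: 3 well-separated classes in 10 dimensions
    centers = np.zeros((3, 10))
    centers[0, 0] = centers[1, 1] = centers[2, 2] = 20.0
    X = np.vstack([rng.normal(loc=c, scale=1.0, size=(n_per_class, 10))
                   for c in centers])
    y = np.repeat(np.arange(3), n_per_class)
    return X, y

X_train, y_train = make_blobs_toy(100)
X_test, y_test = make_blobs_toy(30)

# PCA (unlike TSNE) exposes transform(), so unseen rows can be projected
pca = PCA(n_components=2, random_state=42).fit(X_train)
gmm = GaussianMixture(n_components=3, random_state=42).fit(pca.transform(X_train))

# Name each cluster by the majority training label among its members
train_clusters = gmm.predict(pca.transform(X_train))
mapping = {int(c): int(np.bincount(y_train[train_clusters == c]).argmax())
           for c in np.unique(train_clusters)}

# Project and cluster the held-out rows, then translate clusters to labels
# (-1 marks a cluster that had no training members)
test_clusters = gmm.predict(pca.transform(X_test))
test_pred = np.array([mapping.get(int(c), -1) for c in test_clusters])
test_accuracy = (test_pred == y_test).mean()
print(test_accuracy)
```

With t-SNE, one alternative is to embed the train and test rows together in a single `fit_transform` call and then split the result, at the cost of recomputing the embedding whenever new data arrives.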

Now let's check the classification metrics...

In [None]:
acc = metrics.accuracy_score(train_df.label.values, labels)
print(f'Accuracy Score: {acc}')

In [None]:
sns.heatmap(confusion_matrix(train_df.label.values, labels), annot=True, fmt='g', cmap="YlGnBu");

In [None]:
print(classification_report(train_df.label.values, labels))

I know this score might not look impressive especially considering what other "Supervised" algorithms can do with this dataset.  

But let me put this in perspective...  

**We are getting >85% accuracy WITHOUT TRAINING ON THE LABELS. They are only used afterwards to name the clusters.**  

How does that sound? Not bad... Right?  

**If you liked this notebook and use parts of it in your code, please upvote this kernel. It keeps me inspired to come up with unique approaches like this one and share them with the community.**  

Thanks and happy kaggling!