## Visualising hand-written digits with t-SNE

In this exercise, we are going to visualise a subset of the [MNIST dataset](http://yann.lecun.com/exdb/mnist/). This is a very famous dataset comprised of 70,000 images of handwritten digits.

In its original form, this dataset is labelled, and much research has gone into developing supervised classification models that can outperform humans on it. It also provides an interesting test case for t-SNE: we can discard the labels, apply t-SNE, and then see how our low-dimensional representation compares to the actual labels.

### Loading the data

As usual, let's start by loading in some essential libraries: 


In [None]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
%matplotlib inline

### Loading the data


* Load the `handwritten_digits.csv` file from the `data` directory, without the dataset header.


In [None]:
# Your code here...
digits = pd.read_csv("./data/handwritten_digits.csv", header=None)


### Loading the data

* Inspect the data (such as the `shape`, `head` method, etc.)

In [None]:
# Your code here...
print(digits.shape)
digits.head()


### Plot some digits

After briefly inspecting the size of the dataset, let's plot some digits so you can see what the data really looks like. However, before we do so, we need to tinker a little with the raw data.

If you called the `head()` method on the loaded data, you might have noticed that the visible output consisted entirely of zeros. This is because each row of our DataFrame has 784 columns, where each column represents one pixel of an image. As the background to the handwritten digits is blank, these pixels have a value of zero ([black](https://homepages.inf.ed.ac.uk/rbf/HIPR2/value.htm)). Since our images are square, we need to turn each row into an array and reshape it to $28 \times 28$ before we can plot it.
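To make the reshaping step concrete, here is a minimal sketch on a made-up 9-pixel row rather than a real 784-pixel digit. It assumes the pixels are stored row by row, which is what a plain `reshape` expects; `flat_row`, `side` and `grid` are illustrative names only.

```python
import numpy as np

# A made-up flat "image" of 9 pixel values, standing in for one 784-column row.
flat_row = np.arange(9)

# For a square image, the side length is the square root of the pixel count.
side = int(np.sqrt(flat_row.size))

# reshape fills the grid row by row (NumPy's default C order),
# so the first `side` values become the top row of the image.
grid = flat_row.reshape(side, side)

print(grid)
```

The same call with 784 values gives a side length of 28.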

### Plot some digits

* Use the `.values` property to convert the `digits` DataFrame to arrays
* Then use a list comprehension or a for loop to `reshape` each row of the array into a $28 \times 28$ array, storing the results in a list called `images`.

In [None]:
# Your code here...
# Reshape the rows from the digits dataframe and plot n
images = [image.reshape(28,28) for image in digits.values]


In [None]:
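def plot_examples(images, n):
    # Draw the first n images side by side in a single row
    for i in range(0, n):
        plt.subplot(1, n, i + 1)
        plt.axis('off')
        plt.imshow(images[i], cmap=plt.cm.gray)
    # Show the figure once, after every subplot has been drawn
    plt.show()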

### Plot some digits

* Pass `images` to the `plot_examples` function
* Choose how many plots you wish to view and pass that as the `n` argument (10 is a good default)

In [None]:
# Your code here...
plot_examples(images, 10)


### Initial Dimensional Reduction

Since our images are $28 \times 28$, we can think of our data as having 784 dimensions, where each dimension represents the value of a specific pixel.

This is a manageable number of dimensions, but it could cause t-SNE to fit slowly.

What we can do, then, is reduce the number of dimensions while retaining as much of the variation in the data as possible. PCA is a great way of doing this because it uses the correlation between dimensions to find a small number of new variables that together retain as much of the variation in the original data as possible.

This is achieved by calculating the eigenvectors of the covariance matrix, which conveniently have the property of pointing along the major directions of variation in the data.
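The eigenvector description above can be sketched with plain NumPy. This is a toy illustration on synthetic two-dimensional data, not how `sklearn` implements PCA internally; the data and all variable names are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 2-D data stretched along one direction (illustrative only).
x = rng.normal(size=500)
data = np.column_stack([x, 0.5 * x + 0.1 * rng.normal(size=500)])

# Centre the data, then compute the covariance matrix of the features.
centred = data - data.mean(axis=0)
cov = np.cov(centred, rowvar=False)

# eigh is meant for symmetric matrices and returns eigenvalues in ascending
# order, so sort them to put the largest-variance direction first.
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Fraction of the total variance captured by each principal component.
explained_variance_ratio = eigenvalues / eigenvalues.sum()

# Project the centred data onto the principal directions.
projected = centred @ eigenvectors

print(explained_variance_ratio)
```

Each eigenvalue equals the variance of the data along its eigenvector, which is why dividing by their sum gives the share of variation explained by each component.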

Since two and three dimensional plots are much easier to make sense of, let's calculate the first two principal components and see how much of the variation in the actual dataset they account for (as ever, `sklearn` has our back!)

### Initial Dimensional Reduction

* Import `PCA` from `sklearn.decomposition` and `fit_transform` our digit data
* In doing so, pass $2$ as the `n_components` argument
* Save the pca transformed data as a new variable, `pca_digits`

In [None]:
# Your code here...
# Import and run PCA on the digits data
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
pca_digits = pca.fit_transform(digits.values)


### Initial Dimensional Reduction

* Extract the `explained_variance_ratio_` and `print` the output

In [None]:
# Your code here...
print("Explained variation per principal component: {}".format(pca.explained_variance_ratio_))


### Plot PCA

Given that the first two principal components account for over 17% of the variance, let's see if that is enough to visually tell the digits apart:


* Load the labels for the digits (`handwritten_digits_labels.csv`) from the `data` directory, call this DataFrame `labels`.
* Define a function that takes a labels DataFrame and creates a scatter plot of the `PC1` and `PC2` columns

In [None]:
# Your code here...
def plot_pca(labels):
    fig = plt.figure(figsize=(10, 10))
    plt.scatter(
        labels.PC1.values, 
        labels.PC2.values, 
        marker='.', 
        linewidths=0.5, 
        c=labels[0],
    )
    plt.show()

# Load the labels (the plot itself is drawn in a later cell)
labels = pd.read_csv("./data/handwritten_digits_labels.csv", header=None)


### Plot PCA

* Assign the output of the PCA to two new columns in the `labels` DataFrame (`PC1` and `PC2`)

In [None]:
# Your code here...
labels["PC1"] = pca_digits[:,0]
labels["PC2"] = pca_digits[:,1]


### Plot PCA

* Pass the `labels` DataFrame to the `plot_pca` function

In [None]:
# Your code here...
plot_pca(labels)



The graph shows that the first two principal components definitely hold some important information. While there is a lot of overlap, and some of the labels are more dispersed than others, the labels are clearly clustered.

As mentioned before, t-SNE is recommended for use on data with a dimensionality of around 50. Obviously, 784 is substantially higher than 50, and while t-SNE will still run on our raw data, the steps involved in the algorithm are computationally intensive, so the code will run slowly.

To get around this, let's run PCA again but this time retain 50 principal components and then use this dimensionally reduced data to further reduce the data into two dimensions with t-SNE:



* Re-run `PCA` with `n_components` set to $50$
* Store the output of the PCA as a variable called `digits_50`

In [None]:
# Your code here...
# Run PCA with 50 components
pca = PCA(n_components=50)
digits_50 = pca.fit_transform(digits.values)



* Extract the `explained_variance_ratio_` and `print` the output


In [None]:
# Your code here...
print("Cumulative variance explained by first 50 Principal Components: {}".format(np.sum(pca.explained_variance_ratio_)))


### Run t-SNE
Impressively, the first 50 components account for over 83% of the variance in the digits data set.
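The cumulative figure is simply a running sum of the per-component ratios, and the same running sum can be used to decide how many components to keep. The sketch below uses hypothetical ratios in place of `pca.explained_variance_ratio_`, and the `target` threshold is an arbitrary choice for illustration.

```python
import numpy as np

# Hypothetical explained-variance ratios, sorted from largest to smallest,
# standing in for pca.explained_variance_ratio_ (not real MNIST values).
ratios = np.array([0.40, 0.25, 0.15, 0.10, 0.06, 0.04])

# Running total of the variance explained by the first k components.
cumulative = np.cumsum(ratios)

# Smallest number of components whose cumulative share reaches the target.
target = 0.85
n_needed = int(np.searchsorted(cumulative, target) + 1)

print(cumulative)
print(n_needed)
```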

Now let's feed this data into the t-SNE algorithm and plot the results:




* Import `TSNE` from `sklearn.manifold`


In [None]:
# Your code here...
from sklearn.manifold import TSNE



* To improve the speed of t-SNE, filter `digits_50` to the first $5000$ data points


In [None]:
# Your code here...
digits_50 = digits_50[:5000]


* Instantiate the `TSNE` object, setting `n_components` to $2$ and `perplexity` to $40$, and chain a call to `fit_transform` on `digits_50`
* Save the t-SNE transformed data as a new variable, `digits_tsne`

### Warning: running the cell below can take a while! ~5min

In [None]:
# Your code here...
# Run t-SNE
digits_tsne = TSNE(
    n_components=2, 
    perplexity=40, 
    verbose=2).fit_transform(digits_50)


### t-SNE output analysis

Now that we have our fitted t-SNE model, let's once again plot the results. Before we colour each point with its respective label (or **ground truth**), let's see what reasonable inferences we can make about the t-SNE output as it is:


* Reload the labels for the digits (`handwritten_digits_labels.csv`) from the `data` directory as the variable `labels`.

In [None]:
# Your code here...
# Reload the labels so they can be paired with the t-SNE output
labels = pd.read_csv(
    "./data/handwritten_digits_labels.csv", 
    header=None
)



* Because we only fitted the t-SNE on the first $5000$ data points, filter the `labels` accordingly


In [None]:
# Your code here...
# Copy the slice so new columns can be added without a SettingWithCopyWarning
labels = labels[:5000].copy()


* Assign the output of the t-SNE to two new columns in the `labels` dataframe (`TSNE1` and `TSNE2`)

In [None]:
# Your code here...
labels["TSNE1"] = digits_tsne[:,0]
labels["TSNE2"] = digits_tsne[:,1]


* Pass this DataFrame to the `plot_tsne` function and use it to visualise the t-SNE embedding

In [None]:
def plot_tsne(labels):
    fig = plt.figure(figsize=(10, 10))
    plt.scatter(
        labels.TSNE1.values, 
        labels.TSNE2.values, 
        marker='.', 
        linewidths=0.5, 
    )
    plt.show()
    
# Your code here...
plot_tsne(labels)


Since there are 10 digits, we have 10 classes. But even if we did not know this about our data, it is relatively easy to discern 8 or 9 clusters. The t-SNE algorithm has separated our data into clearly distinct and tightly grouped clusters.

Given this output, it's quite easy to see why t-SNE has been used as a dimensionality reduction method to generate inputs for a variety of clustering methods.

To get a sense of how well the algorithm has performed in visualising our high dimensional data, let's generate the same plot, but colour each point by its respective label.


* Write a modified version of the `plot_tsne` function, called `plot_coloured_tsne`, that takes the $5000$-row `labels` DataFrame and colours each point by its label (note that the colour labels are the `0th` column of the DataFrame)

In [None]:
# Your code here...
# Plot the results with points coloured by digit
def plot_coloured_tsne(labels):
    fig = plt.figure(figsize=(10, 10))
    plt.scatter(
        labels.TSNE1.values, 
        labels.TSNE2.values, 
        marker='.', 
        linewidths=0.5,
        c=labels[0],
    )
    plt.show()

plot_coloured_tsne(labels)


## Limitations

While t-SNE is a powerful tool for visualising high dimensional data, it comes with a number of shortcomings.

The first is that t-SNE works best at visualising data when the dimensionality is limited (albeit far higher than humans can visualise directly). When the dimensionality is very high, it is highly recommended to use another dimensionality reduction method first (e.g. PCA for dense data or TruncatedSVD for sparse data) to reduce the number of dimensions to a reasonable amount (e.g. 50).

Perhaps more seriously, since the computation and memory required by t-SNE scale quadratically with the sample size, its applicability is limited to data sets with only a few thousand input objects; beyond that, learning becomes too slow to be practical (and the memory requirements become too large). That said, the authors have updated the method with a [tree-based algorithm](https://lvdmaaten.github.io/publications/papers/JMLR_2014.pdf) which scales much better to larger data sets.
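To get a feel for what quadratic scaling means in practice, the back-of-the-envelope sketch below estimates the memory needed for a single dense matrix of pairwise values between all points. It assumes 8-byte floating point entries and ignores everything else the algorithm stores, so treat the numbers as rough lower bounds; `pairwise_matrix_gigabytes` is a hypothetical helper written for this example.

```python
# Assumption: each pairwise value is stored as an 8-byte float64.
BYTES_PER_FLOAT64 = 8


def pairwise_matrix_gigabytes(n_points):
    """Size in GB (10**9 bytes) of a dense n_points x n_points float64 matrix."""
    return n_points ** 2 * BYTES_PER_FLOAT64 / 1e9


subset_gb = pairwise_matrix_gigabytes(5_000)   # the subset used in this exercise
full_gb = pairwise_matrix_gigabytes(70_000)    # the full MNIST dataset

print(f"5,000 points:  {subset_gb:.1f} GB")
print(f"70,000 points: {full_gb:.1f} GB")
```

Going from 5,000 to 70,000 points multiplies the sample size by 14 but the matrix size by 196, which is why we restricted ourselves to a subset above. Note that `sklearn.manifold.TSNE` uses the Barnes-Hut (tree-based) approximation by default (`method='barnes_hut'`), so it avoids the full quadratic cost.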