# 03: Projections

High-dimensional data is hard to visualize because a chart has only a limited number of visual channels (such as position, color, size, and shape) that can encode dimensions at once. One approach is to project the data into fewer dimensions, often two. Techniques for doing this include t-SNE, UMAP, PCA, and locally linear embedding. Let's see how projections can be useful for exploring machine learning datasets and model predictions.

## Imports

In [None]:
# If you're running this on colab, then you can uncomment the below command to
# install the pmlb library.
# !pip install pmlb

In [None]:
import base64
from io import BytesIO
from itertools import product

import altair as alt
import numpy as np
import pandas as pd
import pmlb
from PIL import Image

from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.manifold import TSNE
from sklearn.model_selection import train_test_split, GridSearchCV

In [None]:
# If you're running this code locally, then you can uncomment this to automatically
# save the chart data in files, rather than including the data in the spec. 

# !mkdir -p data
# alt.data_transformers.enable('json', prefix='data/altair-data')

## Data Preparation and t-SNE

Load the dataset and take a random sample of it.

In [None]:
mnist = pmlb.fetch_data('mnist')

In [None]:
mnist_small = mnist.sample(n=5000)

Separate the feature values from the target labels. Split the dataset into train and test sets.

In [None]:
X = mnist_small.drop(columns=['target'])
y = mnist_small['target'].values

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5)

Create 2D projections of the dataset.

In [None]:
tsne = TSNE(n_components=2, learning_rate='auto', init='random', perplexity=3)
X_train_embedded = tsne.fit_transform(X_train)
X_test_embedded = tsne.fit_transform(X_test)

Create a [data URL](https://developer.mozilla.org/en-US/docs/Web/HTTP/Basics_of_HTTP/Data_URLs) for each image \[[1](https://stackoverflow.com/a/70751215/5016634)\].

In [None]:
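def encode_images(matrix, prefix):
    """Encode each row of `matrix` (a flattened square image) as a PNG data URL.

    `prefix` is currently unused; it is kept so that existing calls keep working.
    """
    urls = []

    for row in matrix:
        # Each row is a flattened d x d grayscale image.
        d = int(np.sqrt(len(row)))
        pixels = row.reshape((d, d)).astype(np.uint8)

        img = Image.fromarray(pixels)

        with BytesIO() as buffer:
            img.save(buffer, 'png')
            # b64encode, unlike encodebytes, does not insert newlines into the payload.
            data = base64.b64encode(buffer.getvalue()).decode('utf-8')

        urls.append(f'data:image/png;base64,{data}')

    return urls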
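
To check that a data URL really carries the image, it can be decoded back into pixels. The sketch below is a standalone round trip on a synthetic 8×8 image; the array `pixels` and the other names are illustrative assumptions rather than data from the notebook.

```python
import base64
from io import BytesIO

import numpy as np
from PIL import Image

# A synthetic 8x8 grayscale image (assumption: stands in for one digit).
pixels = np.arange(64, dtype=np.uint8).reshape(8, 8)

# Encode: array -> PNG bytes -> base64 text -> data URL.
with BytesIO() as buffer:
    Image.fromarray(pixels).save(buffer, 'png')
    payload = base64.b64encode(buffer.getvalue()).decode('utf-8')
url = f'data:image/png;base64,{payload}'

# Decode: split off the header, then reverse each step.
header, encoded = url.split(',', 1)
decoded = np.array(Image.open(BytesIO(base64.b64decode(encoded))))

round_trip_ok = np.array_equal(decoded, pixels)
print(header, round_trip_ok)
```

PNG is lossless, so the decoded array should match the original exactly.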

In [None]:
# The images are embedded as data URLs, so no output directory is needed.

images_train = encode_images(X_train.values, 'train')
images_test = encode_images(X_test.values, 'test')

Prepare the data frames for visualization.

In [None]:
df_train = pd.DataFrame(X_train_embedded, columns=['x-tsne', 'y-tsne'])
df_train['target'] = y_train
df_train['image'] = images_train

df_train.head()

In [None]:
df_test = pd.DataFrame(X_test_embedded, columns=['x-tsne', 'y-tsne'])
df_test['target'] = y_test
df_test['image'] = images_test

df_test.head()

## Visualizing Training Data

**Exercise 1:** Create a scatterplot of the projection in `df_train`. What would make this scatterplot more useful?

**Exercise 2:** Replace the circles with digits. Is this effective?

**Exercise 3:** Color the circles by their target label. Make the scatterplot support panning and zooming. Do you notice anything interesting? What would make this visualization more useful?

**Exercise 4:** Add a tooltip to the visualization that shows the image and the target label. Does this lead to any interesting findings?

**Exercise 5:** Follow this [example](https://altair-viz.github.io/user_guide/marks.html#image-mark) of `mark_image` to create a scatterplot that shows the images rather than circles. You'll likely only want to show a subset of the data. Do you find this visualization more or less useful than the previous one?

Let's add another projection to compare to t-SNE.

In [None]:
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train)

df_train['x-pca'] = X_train_pca[:,0]
df_train['y-pca'] = X_train_pca[:,1]
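
Unlike t-SNE, PCA is a linear projection: a fitted `PCA` object can project new points with `transform`, and it reports how much of the variance each component retains. A minimal sketch, assuming the small `load_digits` dataset as a stand-in for the MNIST sample (the `*_demo` names are illustrative):

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# Small digits dataset (assumption: used here instead of the MNIST sample).
X_demo = load_digits().data

pca_demo = PCA(n_components=2)
X_demo_pca = pca_demo.fit_transform(X_demo)

# Fraction of the total variance captured by each of the two components.
ratios = pca_demo.explained_variance_ratio_
print(ratios, ratios.sum())

# A fitted PCA can project points it has not seen during fitting.
projected = pca_demo.transform(X_demo[:5])
print(projected.shape)
```

A low total in `ratios.sum()` is a reminder that a 2D PCA view discards most of the structure in the data, which is worth keeping in mind when comparing it to the t-SNE view.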

**Exercise 6:** Create side-by-side scatterplots for the t-SNE projection and the PCA projection.

**Exercise 7:** Add a brush selection to the scatterplots so that selecting points in one of the scatterplots highlights them in the other.

## Modeling

**Exercise 8:** Train a model on this dataset. You can use [scikit-learn](https://scikit-learn.org/stable/) or any library of your choosing. Save the predictions on the test dataset in `df_test`.

**Exercise 9:** Create a confusion matrix. What are the most common mistakes that the model makes?
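
As a hint for Exercises 8 and 9, here is a minimal sketch of training a classifier and tabulating its confusion matrix with scikit-learn. The choice of `RandomForestClassifier`, the `load_digits` dataset, and the fixed `random_state` values are assumptions for illustration; any model and the notebook's own train/test split could be used instead.

```python
import pandas as pd
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Small digits dataset (assumption: stands in for the MNIST sample).
X_demo, y_demo = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.5, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_tr, y_tr)
pred = model.predict(X_te)

# Rows are true labels, columns are predicted labels.
cm = pd.DataFrame(
    confusion_matrix(y_te, pred, labels=model.classes_),
    index=model.classes_,
    columns=model.classes_,
)
print(cm)
```

Off-diagonal cells with large counts point to the pairs of digits that the model confuses most often.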

**Exercise 10:** Create a scatterplot that shows the t-SNE projection of the test data. How can the scatterplot be made useful for exploring incorrectly classified points? Can you show which points were correctly classified vs. incorrectly classified? Can you provide a way for the user to see both the target and predicted labels?