In [None]:
# Import some basic libraries
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set_context('paper')

# Hands-on Activity 13.1: Dimensionality Reduction

## Objectives
+ Understand the dimensionality reduction problem
+ Use principal component analysis to solve the dimensionality reduction problem

Throughout this lecture we will be using the [MNIST dataset](https://en.wikipedia.org/wiki/MNIST_database).
The MNIST dataset consists of 70,000 images of handwritten digits from $0$ to $9$ (60,000 for training and 10,000 for testing).
The dataset is a standard benchmark in machine learning.
Here is how to get the dataset from the tensorflow library:

In [None]:
# Import tensorflow
import tensorflow as tf
# Download the data
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()

The dataset comes with inputs (images of digits) and labels (the digit that each image shows).
We are not going to use the labels in this lecture as we will be doing unsupervised learning.
Let's look at the dimensions of the training dataset:

In [None]:
x_train.shape

The training dataset is a 3D array.
The first dimension is 60,000. This is the number of different images that we have.
Then each image consists of 28x28 pixels.
Here is the first image in terms of numbers:

In [None]:
x_train[0]

Each number is the intensity of one pixel, an integer between 0 and 255.
With the reversed grayscale colormap used below, 0 is drawn as a white pixel and 255 as a black pixel.
Values between 0 and 255 correspond to some shade of gray.
Here is how to visualize the first image:

In [None]:
plt.imshow(x_train[0], cmap=plt.cm.gray_r, interpolation='nearest')

In this handout, I want to work with just images of threes.
So, let me just keep all the threes and throw away all other data:

In [None]:
threes = x_train[y_train == 3]
threes.shape

We have 6,131 threes. That's enough.
Now, each image is a 28x28 matrix.
We do not like that.
We would like to have vectors instead of matrices.
So, we need to *vectorize* the matrices.
That's easy to do. We just have to reshape them.

In [None]:
vectorized_threes = threes.reshape((threes.shape[0], threes.shape[1] * threes.shape[2]))
vectorized_threes.shape

Okay. You see that we now have 6,131 vectors each with 784 dimensions.
That is our dataset.
Let's apply PCA to it to reduce its dimensionality.
We are going to use the [PCA class of scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html).
Here is how to import the class:

In [None]:
from sklearn.decomposition import PCA

And here is how to initialize the model and fit it to the data:

In [None]:
pca = PCA(n_components=0.98, whiten=True).fit(vectorized_threes)

For the complete definition of the inputs to the ``PCA`` class, see its documentation.
The particular parameters that I define above have the following effect:
- ``n_components``: If you set this to an integer, the PCA will have this many components. If you set it to a number between $0$ and $1$, say 0.98, then PCA will keep as many components as it needs in order to capture 98% of the variance of the data. I use the second type of input.
- ``whiten``: This ensures that the projections have unit variance. If you don't specify this then their variance will be the corresponding eigenvalue. Setting ``whiten=True`` is consistent with the theory developed in the video.
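
To see both parameters in action without downloading anything, here is a minimal sketch on synthetic data. The array `X`, its size, and the names `pca_frac`, `pca_full`, `Z_white`, and `Z_plain` are made up for illustration and are not part of the handout; I pass `svd_solver='full'` explicitly only so that the result does not depend on the default solver.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for the image vectors (an assumption for illustration):
# 500 samples in 20 dimensions with very different variances per direction.
rng = np.random.default_rng(0)
scales = np.linspace(5.0, 0.1, 20)
X = rng.normal(size=(500, 20)) * scales

# A fraction for n_components keeps the fewest components whose
# cumulative explained variance exceeds that fraction.
pca_frac = PCA(n_components=0.98, svd_solver='full', whiten=True).fit(X)

# Cross-check against a PCA that keeps every component.
pca_full = PCA(svd_solver='full').fit(X)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)
k_needed = int(np.argmax(cumulative > 0.98)) + 1
print(pca_frac.n_components_, k_needed)

# With whiten=True the projections have unit sample variance;
# without it, their variance is the corresponding eigenvalue.
Z_white = pca_frac.transform(X)
Z_plain = PCA(n_components=0.98, svd_solver='full').fit_transform(X)
print(Z_white.var(axis=0, ddof=1)[:3])
print(Z_plain.var(axis=0, ddof=1)[:3], pca_frac.explained_variance_[:3])
```

The two printed component counts should agree, and the whitened variances should all be one up to rounding.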

Okay, so now that the model is trained let's investigate it.
First, we asked PCA to keep enough components so that it can describe 98% of the variance.
How many did it actually keep?
Here is how to check this:

In [None]:
pca.n_components_

It kept 227 components. This doesn't look very impressive but we will take it for now.

Now, let's focus on the eigenvalues of the covariance matrix.
Here is how to get them:

In [None]:
fig, ax = plt.subplots(dpi=150)
ax.plot(pca.explained_variance_)
ax.set_xlabel('$i$')
ax.set_ylabel(r'$\lambda_i$');

Remember that the sum of the first $k$ eigenvalues, $\sum_{i=1}^k\lambda_i$, tells you how much variance is explained by a model that keeps the first $k$ PCA components.
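
Here is a small sketch of that statement using only NumPy. The data are synthetic and the names `C`, `eigenvalues`, and `explained_fraction` are mine, so take it as an illustration rather than what scikit-learn does internally. Dividing the partial sums by the sum of all eigenvalues (which equals the total variance) turns them into fractions.

```python
import numpy as np

# Synthetic data (an assumption for illustration): 6 features with decreasing spread.
rng = np.random.default_rng(1)
X = rng.normal(size=(400, 6)) * np.array([4.0, 3.0, 2.0, 1.0, 0.5, 0.1])

# Eigenvalues of the sample covariance matrix, sorted from largest to smallest.
C = np.cov(X, rowvar=False)
eigenvalues = np.linalg.eigvalsh(C)[::-1]

# The sum of all eigenvalues equals the total variance (the trace of C).
total_variance = np.trace(C)

# Fraction of the variance explained by the first k components, for k = 1, ..., 6.
explained_fraction = np.cumsum(eigenvalues) / total_variance
print(np.round(explained_fraction, 3))
```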

Okay.
As we discussed in the lecture videos, each of the observations can be expanded as follows:
$$
\mathbf{x}_j = \mathbf{m} + \sum_{i=1}^k z_{ji}\sqrt{\lambda_i}\mathbf{v}_i.
$$
Let's first visualize the mean $\mathbf{m}$.
It is this vector:

In [None]:
pca.mean_.shape

so let's reshape it and plot it as an image:

In [None]:
fig, ax = plt.subplots(dpi=64)
ax.imshow(pca.mean_.reshape((28, 28)), cmap=plt.cm.gray_r, interpolation='nearest')
ax.set_xticks([])
ax.set_xticklabels([])
ax.set_yticks([])
ax.set_yticklabels([]);
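
Before moving on, it may help to check the expansion of $\mathbf{x}_j$ numerically. The sketch below uses synthetic data and a hypothetical model called `pca_demo` (so that it does not overwrite `pca`). It assumes ``whiten=True``, as in the model above, and compares the hand-written sum with scikit-learn's ``inverse_transform``.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data (an assumption for illustration): 300 samples in 10 dimensions.
rng = np.random.default_rng(2)
X = rng.normal(size=(300, 10)) * np.linspace(3.0, 0.2, 10)

pca_demo = PCA(n_components=4, whiten=True).fit(X)
Z_demo = pca_demo.transform(X)

# m + sum_i z_ji * sqrt(lambda_i) * v_i, written as one matrix product.
scaled_directions = (np.sqrt(pca_demo.explained_variance_)[:, None]
                     * pca_demo.components_)
X_expanded = pca_demo.mean_ + Z_demo @ scaled_directions

# scikit-learn's own reconstruction should agree with the manual expansion.
X_sklearn = pca_demo.inverse_transform(Z_demo)
print(np.allclose(X_expanded, X_sklearn))
```

Because only four of the ten components are kept here, the expansion reproduces the reconstruction and not the original `X`.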

Now let's go for the eigenvectors $\mathbf{v}_i$.
Here is where they are:

In [None]:
pca.components_.shape

and here is how to visualize them as images:

In [None]:
for i in range(5):
    fig, ax = plt.subplots(dpi=64)
    ax.imshow(pca.components_[i, :].reshape((28, 28)), cmap=plt.cm.gray_r, interpolation='nearest')
    ax.set_xticks([])
    ax.set_xticklabels([])
    ax.set_yticks([])
    ax.set_yticklabels([])
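
If you want to convince yourself that the rows of ``components_`` really are eigenvectors of the covariance matrix, here is a quick check on synthetic data. The names `pca_demo`, `V`, `gram`, and `residual` are mine, and the data are random, so this is only a sanity check and not part of the original analysis.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic correlated data (an assumption for illustration).
rng = np.random.default_rng(3)
X = rng.normal(size=(200, 8)) @ rng.normal(size=(8, 8))

pca_demo = PCA(n_components=5).fit(X)
V = pca_demo.components_            # shape (5, 8): one eigenvector per row

# The rows are orthonormal, so V V^T is the 5 x 5 identity matrix.
gram = V @ V.T
print(np.round(gram, 6))

# Each row satisfies C v_i = lambda_i v_i for the sample covariance matrix C.
C = np.cov(X, rowvar=False)
residual = C @ V.T - V.T * pca_demo.explained_variance_
print(np.abs(residual).max())
```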

Now, let's visualize the first two principal components $\mathbf{z}_j$ of each observation $\mathbf{x}_j$.
This will essentially project the dataset from 784 dimensions to two dimensions.
Here is how to find the principal components:

In [None]:
Z = pca.transform(vectorized_threes)
Z.shape

Visualize the first two:

In [None]:
fig, ax = plt.subplots(dpi=150)
ax.scatter(Z[:, 0], Z[:, 1])
ax.set_xlabel('$z_1$')
ax.set_ylabel('$z_2$');

Alright! Each dot in this plot corresponds to an image of a 3.
This is nice, but not the best thing we can do in terms of visualization.
It would be nice if we could plot the actual image instead of a dot.
Here is how to do this:

In [None]:
# The following code is a modification of the code found here:
# https://stackoverflow.com/questions/35651932/plotting-img-with-matplotlib
from matplotlib.offsetbox import OffsetImage, AnnotationBbox
def imscatter(x, y, images, cmap=plt.cm.gray_r, ax=None, zoom=1):
    # Fall back to the current axes so that ax=None does not raise an error
    if ax is None:
        ax = plt.gca()
    x, y = np.atleast_1d(x, y)
    artists = []
    for x0, y0, image in zip(x, y, images):
        im = OffsetImage(image, zoom=zoom, cmap=cmap, interpolation='nearest')
        ab = AnnotationBbox(im, (x0, y0), xycoords='data', frameon=False)
        artists.append(ax.add_artist(ab))
    ax.update_datalim(np.column_stack([x, y]))
    ax.autoscale()
    return artists

Here it is:

In [None]:
fig, ax = plt.subplots(dpi=150)
imscatter(Z[:, 0], Z[:, 1], threes, ax=ax, zoom=0.2)
ax.set_xlabel('$z_1$')
ax.set_ylabel('$z_2$');

In this plot you can clearly see the interpretation of the principal components.
The first principal component seems to rotate the three about an axis coming out of the screen.
The second principal component seems to change the thickness of the bottom of the three.
We can study these effects in more detail using an interactive plot.
Here you go:

In [None]:
from ipywidgets import interactive
def visualize_pca_component(i=0, z=0.0):
    fig, ax = plt.subplots(dpi=64)
    x = pca.mean_ + z * np.sqrt(pca.explained_variance_[i]) * pca.components_[i]
    ax.imshow(x.reshape((28, 28)), cmap=plt.cm.gray_r, interpolation='nearest')
    ax.set_xticks([])
    ax.set_xticklabels([])
    ax.set_yticks([])
    ax.set_yticklabels([])
interactive(visualize_pca_component, i=[0, 1, 2, 3], z=np.linspace(-2.0, 2.0, 100))
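
The widget only renders inside a live notebook. If you are reading a static copy, a rough alternative is to compute the same images for a handful of $z$ values and plot them side by side. The helper `component_sweep` below is hypothetical (it is not part of the handout or of any library), and the toy inputs stand in for ``pca.mean_``, ``pca.explained_variance_[i]``, and ``pca.components_[i]``.

```python
import numpy as np

def component_sweep(mean, eigenvalue, eigenvector, zs, shape):
    """Return one image per z: mean + z * sqrt(eigenvalue) * eigenvector."""
    zs = np.asarray(zs, dtype=float)
    vectors = mean + zs[:, None] * np.sqrt(eigenvalue) * eigenvector
    return vectors.reshape((len(zs),) + tuple(shape))

# Toy inputs (assumptions for illustration): a flat 4 x 4 "mean image" and a
# unit eigenvector that only touches the first pixel.
toy_mean = np.full(16, 0.5)
toy_eigenvector = np.eye(16)[0]
images = component_sweep(toy_mean, 4.0, toy_eigenvector, [-2.0, 0.0, 2.0], (4, 4))
print(images.shape)
```

With the fitted model you would call something like ``component_sweep(pca.mean_, pca.explained_variance_[0], pca.components_[0], np.linspace(-2, 2, 5), (28, 28))`` and pass each returned image to ``ax.imshow``.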

### Questions

+ Keeping the index $i=0$ fixed, play with the corresponding $z$ and observe that it rotates the three.
+ Change $i$ to $1,2$ and $3$ and study the effect of the corresponding principal component.

Now we are going to study the reconstruction error for the validation dataset.
First, throw away everything that is not a three:

In [None]:
valid_threes = x_test[y_test==3]
valid_threes.shape

We have 1,010 images for validation.
We still need to vectorize.
Let's do it:

In [None]:
vectorized_valid_threes = valid_threes.reshape(valid_threes.shape[0], valid_threes.shape[1] * valid_threes.shape[2])

Now, what I am going to do is project all the validation points:

In [None]:
Z_valid = pca.transform(vectorized_valid_threes)

And then reconstruct them and compare them to the originals:

In [None]:
# Which validation example would you like to look at?
idx = 1
# How many components would you like to use for the reconstruction?
n_components = 1
# Here is how to reconstruct:
x = pca.inverse_transform(
    np.hstack([Z_valid[idx][:n_components], 
               np.zeros((Z_valid.shape[1] - n_components,))]))

# Visualize the original image
fig, ax = plt.subplots(dpi=64)
ax.imshow(valid_threes[idx], cmap=plt.cm.gray_r, interpolation='nearest')
ax.set_xticks([])
ax.set_xticklabels([])
ax.set_yticks([])
ax.set_yticklabels([])

# Visualize the reconstructed image
fig, ax = plt.subplots(dpi=64)
ax.imshow(x.reshape((28, 28)), cmap=plt.cm.gray_r, interpolation='nearest')
ax.set_xticks([])
ax.set_xticklabels([])
ax.set_yticks([])
ax.set_yticklabels([]);
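
Looking at images is qualitative. If you also want a number, one option is the mean squared reconstruction error as a function of how many components you keep. The sketch below follows the same zero-padding idea as the code above, but on synthetic data; the names `X_fit`, `X_held`, `pca_demo`, and `reconstruction_mse` are mine, and the choice of mean squared error is an assumption, not something prescribed in the lecture.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-ins (assumptions) for the training and validation threes.
rng = np.random.default_rng(4)
scales = np.linspace(3.0, 0.1, 12)
X_fit = rng.normal(size=(400, 12)) * scales
X_held = rng.normal(size=(100, 12)) * scales

pca_demo = PCA(whiten=True).fit(X_fit)   # keeps all 12 components
Z_held = pca_demo.transform(X_held)

def reconstruction_mse(n_keep):
    """Mean squared error after zeroing all but the first n_keep components."""
    Z_trunc = Z_held.copy()
    Z_trunc[:, n_keep:] = 0.0
    X_rec = pca_demo.inverse_transform(Z_trunc)
    return float(np.mean((X_held - X_rec) ** 2))

errors = [reconstruction_mse(k) for k in range(1, 13)]
print(np.round(errors, 4))
```

The error can only go down as components are added, and it reaches zero here because this toy model keeps every component. The model fitted to the threes keeps 227 of 784, so its reconstructions stay imperfect.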

### Questions
+ Play with the code block above increasing ``n_components`` to 2, 4, 8, and so on up to 227. Observe how the reconstruction becomes better (but not perfect).
+ Repeat the above question, but also change the ``idx`` variable so that you see more examples of threes.
+ Go back a few code blocks, and change your validation set to include only fives.
You must change this:
```
valid_threes = x_test[y_test==3]
valid_threes.shape
```
to this:
```
valid_threes = x_test[y_test==5]
valid_threes.shape
```
Don't bother renaming ``valid_threes``.
Can the PCA model constructed with threes describe fives? Why or why not?
+ Repeat the previous question with a couple of other digits.