# Machine Learning Exercise 7 - K-Means Clustering & PCA

In this exercise we'll implement K-means clustering, first on a simple 2D data set to see how it works and then to compress an image.  The second half of the exercise covers principal component analysis (PCA).

## K-means clustering

To start out we're going to implement and apply K-means to a simple 2-dimensional data set to gain some intuition about how it works.  K-means is an iterative, unsupervised clustering algorithm that groups similar instances together into clusters.  The algorithm starts by guessing an initial centroid for each cluster, and then repeatedly assigns each instance to the nearest centroid and re-computes every centroid from the instances assigned to it.  The first piece that we're going to implement is a function that finds the closest centroid for each instance in the data.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
from scipy.io import loadmat
%matplotlib inline

In [None]:
def find_closest_centroids(X, centroids):
    m = X.shape[0]
    k = centroids.shape[0]
    idx = np.zeros(m)
    
    #CODE A FUNCTION THAT RETURNS FOR EVERY POINT THE CLOSEST CENTROID
    
    return idx
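
If you get stuck on the stub above, here is one possible sketch of the assignment step using NumPy broadcasting.  The name `closest_centroids_sketch` and the toy data are made up for illustration, and unlike the stub it returns integer indices rather than floats.

```python
import numpy as np

def closest_centroids_sketch(X, centroids):
    """Return, for each row of X, the index of the nearest centroid."""
    # diffs[i, j, :] = X[i] - centroids[j], built by broadcasting
    diffs = X[:, np.newaxis, :] - centroids[np.newaxis, :, :]
    # squared Euclidean distance from every point to every centroid, shape (m, k)
    sq_dists = np.sum(diffs ** 2, axis=2)
    # index of the smallest distance in each row
    return np.argmin(sq_dists, axis=1)

# toy data (hypothetical): two points near the origin, one near (5, 4)
toy_X = np.array([[0.0, 0.0], [0.9, 1.1], [5.0, 5.0]])
toy_centroids = np.array([[0.0, 0.0], [5.0, 4.0]])
toy_idx = closest_centroids_sketch(toy_X, toy_centroids)
```

Comparing squared distances is enough here because the square root does not change which centroid is closest.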

Let's test the function to make sure it's working as expected.  We'll use the test case provided in the exercise.

In [None]:
data = loadmat('data/ex7data2.mat')
X = data['X']
initial_centroids = np.array([[3, 3], [6, 2], [8, 5]])

idx = find_closest_centroids(X, initial_centroids)
idx[0:3]

The output matches the expected values in the text (remember our arrays are zero-indexed instead of one-indexed so the values are one lower than in the exercise).  Next we need a function to compute the centroid of a cluster.  The centroid is simply the mean of all of the examples currently assigned to the cluster.

In [None]:
def compute_centroids(X, idx, k):
    m, n = X.shape
    centroids = np.zeros((k, n))
    #CODE A FUNCTION THAT COMPUTES THE NEW CENTROIDS
    
    return centroids
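
One way the update step could look is sketched below.  `centroids_sketch` and the toy arrays are illustrative names only, and leaving an empty cluster's centroid at zero is an assumption of this sketch rather than something the exercise specifies.

```python
import numpy as np

def centroids_sketch(X, idx, k):
    """Mean of the rows of X assigned to each of the k clusters."""
    n = X.shape[1]
    centroids = np.zeros((k, n))
    for j in range(k):
        members = X[idx == j]          # rows currently assigned to cluster j
        if len(members) > 0:           # assumption: empty clusters stay at zero
            centroids[j] = members.mean(axis=0)
    return centroids

# toy data (hypothetical): first two points in cluster 0, last point in cluster 1
toy_X = np.array([[0.0, 0.0], [2.0, 2.0], [10.0, 0.0]])
toy_idx = np.array([0, 0, 1])
toy_centroids = centroids_sketch(toy_X, toy_idx, 2)
```

On the toy data the first centroid lands at (1, 1), the midpoint of its two members, and the second sits on its single member at (10, 0).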

In [None]:
compute_centroids(X, idx, 3)

This output also matches the expected values from the exercise.  So far so good.  The next part involves actually running the algorithm for some number of iterations and visualizing the result.  This step was implemented for us in the exercise, but since it's not that complicated I'll build it here from scratch.  In order to run the algorithm we just need to alternate between assigning examples to the nearest cluster and re-computing the cluster centroids.

In [None]:
def run_k_means(X, initial_centroids, max_iters):
    m, n = X.shape
    k = initial_centroids.shape[0]
    idx = np.zeros(m)
    centroids = initial_centroids
    #CODE A FUNCTION THAT IMPLEMENTS THE K-MEANS ALGORITHM
    return idx, centroids
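
For reference, the alternation described above can be sketched in a self-contained form as follows.  `k_means_sketch` is a hypothetical name, the loop runs for a fixed number of iterations with no convergence check, and the handling of empty clusters (leave the centroid where it was) is an assumption.

```python
import numpy as np

def k_means_sketch(X, initial_centroids, max_iters):
    """Alternate assignment and update steps for a fixed number of iterations."""
    centroids = initial_centroids.astype(float).copy()
    k = centroids.shape[0]
    idx = np.zeros(X.shape[0], dtype=int)
    for _ in range(max_iters):
        # assignment step: nearest centroid by squared Euclidean distance
        sq_dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        idx = sq_dists.argmin(axis=1)
        # update step: move each centroid to the mean of its members
        for j in range(k):
            members = X[idx == j]
            if len(members) > 0:       # assumption: empty clusters keep their centroid
                centroids[j] = members.mean(axis=0)
    return idx, centroids

# toy data (hypothetical): two well-separated pairs of points
toy_X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
toy_idx, toy_centroids = k_means_sketch(toy_X, np.array([[1.0, 1.0], [9.0, 9.0]]), 5)
```

Note that the returned `idx` comes from the assignment made just before the last centroid update, so if the loop stops before convergence the two can be slightly out of sync.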

In [None]:
idx, centroids = run_k_means(X, initial_centroids, 10)

In [None]:
cluster1 = X[np.where(idx == 0)[0],:]
cluster2 = X[np.where(idx == 1)[0],:]
cluster3 = X[np.where(idx == 2)[0],:]

fig, ax = plt.subplots(figsize=(12,8))
ax.scatter(cluster1[:,0], cluster1[:,1], s=30, color='r', label='Cluster 1')
ax.scatter(cluster2[:,0], cluster2[:,1], s=30, color='g', label='Cluster 2')
ax.scatter(cluster3[:,0], cluster3[:,1], s=30, color='b', label='Cluster 3')
ax.legend()

One step we skipped over is a process for initializing the centroids.  K-means only converges to a local optimum, so the initial centroids can change both how quickly it converges and which clustering it ends up with.  We're tasked with creating a function that selects random examples and uses them as the initial centroids.

In [None]:
def init_centroids(X, k):
    m, n = X.shape
    centroids = np.zeros((k, n))
    #CODE A FUNCTION THAT INITIALISES THE CENTROIDS RANDOMLY
    return centroids
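
A minimal sketch of random initialization is shown below, assuming we want `k` distinct examples.  The name `init_centroids_sketch` and the optional `seed` argument are additions for illustration and are not part of the exercise's function signature.

```python
import numpy as np

def init_centroids_sketch(X, k, seed=None):
    """Pick k distinct rows of X at random to serve as initial centroids."""
    rng = np.random.default_rng(seed)
    # sample row indices without replacement so no example is chosen twice
    rows = rng.choice(X.shape[0], size=k, replace=False)
    return X[rows].copy()

# toy data (hypothetical): ten distinct 2D points
toy_X = np.arange(20, dtype=float).reshape(10, 2)
toy_init = init_centroids_sketch(toy_X, 3, seed=0)
```

Sampling without replacement avoids starting two clusters on the same example, which would leave one of them empty after the first assignment step.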

In [None]:
init_centroids(X, 3)

Our next task is to apply K-means to image compression.  The intuition here is that we can use clustering to find a small number of colors that are most representative of the image, and replace each original 24-bit color with the closest color in that small palette using the cluster assignments.  Here's the image we're going to compress.

In [None]:
from IPython.display import Image
Image(filename='data/bird_small.png')

The raw pixel data has been pre-loaded for us so let's pull it in.

In [None]:
image_data = loadmat('data/bird_small.mat')
image_data

In [None]:
A = image_data['A']
A.shape

Now we need to apply some pre-processing to the data and feed it into the K-means algorithm.

In [None]:
# normalize value ranges
A = A / 255.

# reshape the array
X = np.reshape(A, (A.shape[0] * A.shape[1], A.shape[2]))
X.shape

In [None]:
# randomly initialize the centroids
initial_centroids = init_centroids(X, 16)

# run the algorithm
idx, centroids = run_k_means(X, initial_centroids, 10)

# get the closest centroids one last time
idx = find_closest_centroids(X, centroids)

# map each pixel to the centroid value
X_recovered = centroids[idx.astype(int),:]
X_recovered.shape

In [None]:
# reshape to the original dimensions
X_recovered = np.reshape(X_recovered, (A.shape[0], A.shape[1], A.shape[2]))
X_recovered.shape

In [None]:
plt.imshow(X_recovered)

Cool!  You can see that we created some artifacts in the compression but the main features of the image are still there.  That's it for K-means.  We'll now move on to principal component analysis.
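
To put a rough number on the compression, we can count bits.  The sketch below assumes a 128 x 128 RGB image with 8 bits per channel (check `A.shape` for the real dimensions) and a 16-color palette; `palette_compression_bits` is a hypothetical helper.

```python
import math

def palette_compression_bits(height, width, k, bits_per_channel=8, channels=3):
    """Bits needed before and after replacing each pixel with a palette index."""
    original = height * width * channels * bits_per_channel
    palette = k * channels * bits_per_channel      # storing the k centroid colors
    index_bits = math.ceil(math.log2(k))           # bits per pixel to name a centroid
    compressed = palette + height * width * index_bits
    return original, compressed

# assumed image size: 128 x 128 pixels, 16 clusters as in the cell above
original_bits, compressed_bits = palette_compression_bits(128, 128, 16)
ratio = original_bits / compressed_bits
```

Under these assumptions the image shrinks from 393,216 bits to 65,920 bits, a factor of roughly 6.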

## Principal component analysis

PCA is a linear transformation that finds the "principal components", or directions of greatest variance, in a data set.  It can be used for dimension reduction among other things.  In this exercise we're first tasked with implementing PCA and applying it to a simple 2-dimensional data set to see how it works.  Let's start off by loading and visualizing the data set.

In [None]:
data = loadmat('data/ex7data1.mat')
data

In [None]:
X = data['X']

fig, ax = plt.subplots(figsize=(12,8))
ax.scatter(X[:, 0], X[:, 1])

The algorithm for PCA is fairly simple.  After ensuring that the data is normalized, the output is simply the singular value decomposition of the covariance matrix of the original data.

In [None]:
def pca(X):
    # normalize each feature (column) to zero mean and unit variance
    X = (X - X.mean(axis=0)) / X.std(axis=0)
    
    # compute the covariance matrix (n x n)
    cov = (X.T @ X) / X.shape[0]
    
    # perform SVD; NumPy returns the third factor already transposed (V^T)
    U, S, V = np.linalg.svd(cov)
    
    return U, S, V

In [None]:
U, S, V = pca(X)
U, S, V
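
A quick sanity check on synthetic data can confirm what the decomposition gives us.  This sketch assumes per-feature normalization and uses made-up data; it checks that the columns of `U` are orthonormal and that the singular values of the covariance matrix coincide with its eigenvalues.

```python
import numpy as np

# synthetic, correlated 3-feature data (hypothetical)
rng = np.random.default_rng(0)
mix = np.array([[2.0, 0.0, 0.0], [0.5, 1.0, 0.0], [0.0, 0.0, 0.1]])
toy = rng.normal(size=(200, 3)) @ mix

# normalize each feature, then form the covariance matrix
toy = (toy - toy.mean(axis=0)) / toy.std(axis=0)
cov = toy.T @ toy / toy.shape[0]

# SVD of a symmetric positive semi-definite matrix
U, S, Vt = np.linalg.svd(cov)

# eigenvalues in descending order, for comparison with S
eigvals = np.linalg.eigvalsh(cov)[::-1]
```

Because every feature was scaled to unit variance, the entries of `S` sum to the number of features, which makes each value easy to read as a share of the total variance.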

Now that we have the principal components (matrix U), we can use these to project the original data into a lower-dimensional space.  For this task we'll implement a function that computes the projection and selects only the top K components, effectively reducing the number of dimensions.

In [None]:
def project_data(X, U, k):
    U_reduced = U[:,:k]
    return np.dot(X, U_reduced)

In [None]:
Z = project_data(X, U, 1)
Z

We can also attempt to recover the original data by reversing the steps we took to project it.

In [None]:
def recover_data(Z, U, k):
    U_reduced = U[:,:k]
    return np.dot(Z, U_reduced.T)

In [None]:
X_recovered = recover_data(Z, U, 1)
X_recovered

In [None]:
fig, ax = plt.subplots(figsize=(12,8))
ax.scatter(X_recovered[:, 0], X_recovered[:, 1])

Notice that the projection axis for the first principal component was basically a diagonal line through the data set.  When we reduced the data to one dimension, we lost the variations around that diagonal line, so in our reproduction everything falls along that diagonal.
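
The same effect can be checked numerically on made-up data.  In this sketch (all names are illustrative), projecting onto every component and recovering gives back the data exactly, while keeping one component leaves a residual that is orthogonal to that component.

```python
import numpy as np

# synthetic 2D data scattered around a diagonal line (hypothetical)
rng = np.random.default_rng(1)
t = rng.normal(size=100)
toy = np.column_stack([t, t + 0.1 * rng.normal(size=100)])
toy = toy - toy.mean(axis=0)

U, S, Vt = np.linalg.svd(toy.T @ toy / toy.shape[0])

# keep one component: project, then map back to 2D
Z1 = toy @ U[:, :1]
rec1 = Z1 @ U[:, :1].T

# keep both components: the round trip is lossless
rec2 = (toy @ U) @ U.T

# what was lost by dropping the second component
residual = toy - rec1
```

The mean squared length of `residual` equals the second singular value `S[1]`, which is exactly the variance along the discarded direction.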

Our last task in this exercise is to apply PCA to images of faces.  By using the same dimension reduction techniques we can capture the "essence" of the images using much less data than the original images.

In [None]:
faces = loadmat('data/ex7faces.mat')
X = faces['X']
X.shape

The exercise code includes a function that will render the first 100 faces in the data set in a grid.  Rather than try to reproduce that here, you can look in the exercise text for an example of what they look like.  We can at least render one image fairly easily though.

In [None]:
face = np.reshape(X[3,:], (32, 32))

In [None]:
plt.imshow(face)

Yikes, that looks awful.  These are only 32 x 32 grayscale images though (it's also rendering sideways, but we can ignore that for now).  Anyway, let's proceed.  Our next step is to run PCA on the faces data set and take the top 100 principal components.

In [None]:
U, S, V = pca(X)
Z = project_data(X, U, 100)

Now we can attempt to recover the original structure and render it again.

In [None]:
X_recovered = recover_data(Z, U, 100)
face = np.reshape(X_recovered[3,:], (32, 32))
plt.imshow(face)

Observe that we lost some detail, though not as much as you might expect for a 10x reduction in the number of dimensions.
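
One way to quantify "some detail" is the fraction of variance retained by the top `k` components, which can be read off the singular values.  The helper `variance_retained` and the toy values below are hypothetical; in this notebook you could call it as `variance_retained(S, 100)` with the `S` returned by `pca(X)`.

```python
import numpy as np

def variance_retained(S, k):
    """Fraction of total variance captured by the first k components."""
    S = np.asarray(S, dtype=float)
    return S[:k].sum() / S.sum()

# toy singular values (hypothetical), sorted in descending order
toy_S = np.array([5.0, 3.0, 1.0, 1.0])
toy_fraction = variance_retained(toy_S, 2)
```

With the toy values, the first two components account for 8 of the 10 units of variance, so the fraction is 0.8.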



### PCA and MNIST


In [None]:
from sklearn.datasets import load_digits
digits = load_digits()
digits.data.shape


Now we can reduce the dimensionality of the data set with PCA and visualise it in a 2D plane.  Note that `load_digits` returns scikit-learn's small digits data set (8 x 8 images, 64 features each) rather than the full MNIST set.  Look in the scikit-learn documentation for a class that performs PCA and apply it to this data set.

In [None]:
#ADD YOUR CODE HERE
#The results of the PCA will be stored in the array called projected


plt.scatter(projected[:, 0], projected[:, 1],
            c=digits.target, edgecolor='none', alpha=0.5,
            cmap=plt.get_cmap('Spectral', 10))
plt.xlabel('component 1')
plt.ylabel('component 2')
plt.colorbar();




### Choose the number of components
To find the right number of components to keep in your PCA, you can plot the cumulative sum of explained variance to see how much information is kept in your reduced data. Plot the cumulative explained variance as a function of the number of dimensions kept and use it to decide how many dimensions to keep.

In [None]:
### INSERT YOUR CODE HERE
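
As a hint, here is a NumPy-only sketch of the quantity to plot, run on synthetic data rather than the digits.  `components_for_variance` is a hypothetical helper; with scikit-learn you would more likely take the cumulative sum of the fitted PCA object's `explained_variance_ratio_` attribute.

```python
import numpy as np

def components_for_variance(X, threshold):
    """Smallest number of components whose cumulative variance ratio reaches threshold."""
    Xc = X - X.mean(axis=0)
    # singular values of the centered data; their squares are proportional to variances
    s = np.linalg.svd(Xc, compute_uv=False)
    cumulative = np.cumsum(s ** 2 / np.sum(s ** 2))
    # first position where the cumulative ratio reaches the threshold
    k = int(np.searchsorted(cumulative, threshold) + 1)
    return min(k, len(cumulative)), cumulative

# synthetic data (hypothetical) with feature variances of roughly 16, 4 and 1
rng = np.random.default_rng(0)
toy = rng.normal(size=(500, 3)) * np.array([4.0, 2.0, 1.0])
k50, cumulative = components_for_variance(toy, 0.5)
k90, _ = components_for_variance(toy, 0.9)
```

Plotting `cumulative` against the component count gives the curve asked for above; the "elbow" where it flattens is a common, if informal, place to cut.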

### PCA as a tool for compression
Earlier, we used K-means to compress a picture. PCA can compress data too: the plot you drew in the previous section shows that keeping only a few components still preserves most of the information. Write a piece of code to visualise how the reconstructed picture evolves with the number of components kept.

In [None]:
fig, axes = plt.subplots(8, 8, figsize=(8, 8))
fig.subplots_adjust(hspace=0.1, wspace=0.1)

for i, ax in enumerate(axes.flat):
    ###fill this code using only one picture ####

    ax.imshow(im.reshape((8, 8)), cmap='binary')
    ax.text(0.95, 0.05, 'n = {0}'.format(i + 1), ha='right',
            transform=ax.transAxes, color='green')
    ax.set_xticks([])
    ax.set_yticks([])

### PCA as a noise filter
PCA can also be used to remove noise from data. The idea is that components whose variance is much larger than the noise level are relatively unaffected by the noise. Therefore, if we reconstruct the data from only these components, we should keep most of the signal and discard much of the noise.

In [None]:
def plot_digits(data):
    fig, axes = plt.subplots(4, 10, figsize=(10, 4),
                             subplot_kw={'xticks':[], 'yticks':[]},
                             gridspec_kw=dict(hspace=0.1, wspace=0.1))
    for i, ax in enumerate(axes.flat):
        ax.imshow(data[i].reshape(8, 8),
                  cmap='binary', interpolation='nearest',
                  clim=(0, 16))
plot_digits(digits.data)



**EXERCISE:** Now add some random normal noise to this data

In [None]:
### INSERT YOUR CODE HERE

Now, perform PCA on the noisy data, keeping only enough components to explain 50% of the variance. Once you have done this, apply the inverse transform to the projected data to obtain the filtered version.

In [None]:
### INSERT YOUR CODE HERE
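
To see the filtering idea in isolation, here is a NumPy-only sketch on synthetic data where the clean signal is known.  Everything in it is an assumption for illustration: the signal is rank 1 by construction, the noise is Gaussian with standard deviation 0.5, and the 50% threshold is applied to the noisy data's variance.  With scikit-learn, passing a float between 0 and 1 as `n_components` to `PCA` and then calling `inverse_transform` is one way to do the equivalent on the digits.

```python
import numpy as np

rng = np.random.default_rng(0)

# clean signal (hypothetical): one latent factor spread over 20 features
latent = rng.normal(scale=3.0, size=(500, 1))
mixing = rng.normal(size=(1, 20))
clean = latent @ mixing

# add random normal noise
noisy = clean + rng.normal(scale=0.5, size=clean.shape)

# PCA via SVD of the centered noisy data
mean = noisy.mean(axis=0)
centered = noisy - mean
U, s, Vt = np.linalg.svd(centered, full_matrices=False)

# keep the fewest components that explain at least 50% of the variance
cumulative = np.cumsum(s ** 2) / np.sum(s ** 2)
k = int(np.searchsorted(cumulative, 0.5) + 1)
components = Vt[:k]

# project onto the kept components, then map back to the original space
filtered = (centered @ components.T) @ components + mean

# mean squared error against the known clean signal
noisy_error = np.mean((noisy - clean) ** 2)
filtered_error = np.mean((filtered - clean) ** 2)
```

Here a single component clears the threshold and the filtered data sits closer to the clean signal than the noisy data does.  On real data the picture is less tidy, since a 50% cut usually discards some genuine signal along with the noise.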

Draw conclusions: