# Part 1 - Principal Component Analysis

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import util
import datasets
from typing import Tuple

## Part 1.1 - Implement PCA [15%]

In [None]:
def pca(X: np.ndarray, K: int) -> Tuple[np.ndarray, np.ndarray,  np.ndarray]:
    """
    X is an N*D matrix of data (N points in D dimensions)
    K is the desired maximum target dimensionality (K <= min{N,D})

    should return a tuple (P, Z, evals)
    
    where P is the projected data (N*K) where
    the first dimension is the highest variance,
    the second dimension is the second highest variance, etc.

    Z is the projection matrix (D*K) that projects the data into
    the low dimensional space (i.e., P = X @ Z).

    evals is a K dimensional array of eigenvalues (sorted in decreasing order)
    """
    
    N, D = X.shape

    # make sure we don't look for too many eigs!
    if K > N:
        K = N
    if K > D:
        K = D

    ### TODO: YOUR CODE HERE
    raise NotImplementedError

    return (P, Z, evals)
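
One common way to approach the TODO is an eigendecomposition of the sample covariance matrix. The sketch below is one possible approach, not the reference solution. The name `pca_sketch` is hypothetical, and it assumes the covariance is estimated from mean-centered data (via `np.cov`) while the projection follows the docstring's P = X Z.

```python
import numpy as np
from typing import Tuple

def pca_sketch(X: np.ndarray, K: int) -> Tuple[np.ndarray, np.ndarray, np.ndarray]:
    """Hypothetical PCA sketch via eigendecomposition of the sample covariance."""
    N, D = X.shape
    K = min(K, N, D)

    # np.cov centers the data internally; rowvar=False treats columns as variables.
    C = np.atleast_2d(np.cov(X, rowvar=False))

    # eigh is meant for symmetric matrices and returns eigenvalues in ascending order.
    evals, evecs = np.linalg.eigh(C)

    # Reorder so the highest-variance direction comes first, then keep the top K.
    order = np.argsort(evals)[::-1][:K]
    evals = evals[order]
    Z = evecs[:, order]          # D x K projection matrix

    P = X @ Z                    # N x K projected data, as in the docstring
    return P, Z, evals
```

Using `np.linalg.eigh` rather than `np.linalg.eig` is a design choice here: the covariance matrix is symmetric, so `eigh` returns real eigenvalues and orthonormal eigenvectors.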

Our first test of PCA will be on Gaussian data with a known covariance matrix. First, let's generate some data and see how it looks.

In [None]:
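M = np.array([[3,2],[2,4]])       # true covariance matrix
(U,S,VT) = np.linalg.svd(M)
D = np.diag(np.sqrt(S))

Si = U @ D @ VT                   # symmetric square root of M, so Si @ Si = M
x = np.random.randn(1000,2) @ Si  # 1000 samples whose covariance is M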
plt.plot(x[:,0], x[:,1], 'b.');

We can also see what the sample covariance is!

In [None]:
np.cov(x.T)

Note that the sample covariance of the data is close to the true covariance. If you run this with 100,000 data points (instead of 1,000), you should get something even closer to
$\begin{bmatrix} 3 & 2 \\ 2 & 4 \end{bmatrix}$.
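
To see the effect of sample size, the following sketch repeats the sampling for two values of N and reports the largest entry-wise gap between the sample covariance and the true covariance. It uses a seeded `np.random.default_rng` generator (an assumption made for reproducibility; the cell above uses the unseeded `np.random.randn`), and the helper name `cov_error` is hypothetical.

```python
import numpy as np

M = np.array([[3.0, 2.0], [2.0, 4.0]])   # true covariance

# Symmetric square root of M, built the same way as in the cell above.
U, S, VT = np.linalg.svd(M)
Si = U @ np.diag(np.sqrt(S)) @ VT

def cov_error(n: int, rng: np.random.Generator) -> float:
    """Largest absolute difference between the sample covariance and M."""
    x = rng.standard_normal((n, 2)) @ Si
    return float(np.max(np.abs(np.cov(x.T) - M)))

rng = np.random.default_rng(0)   # seed chosen arbitrarily
for n in (1_000, 100_000):
    print(n, cov_error(n, rng))
```

The error typically shrinks roughly like $1/\sqrt{N}$, so a hundredfold increase in data buys about one extra decimal digit of accuracy.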

Now, let's run PCA on this data. We basically know what should happen, but let's verify it anyway (because the data is random, the numbers won't match exactly). We can project the data onto the first eigenvector and plot it in red, and onto the second eigenvector in green.

In [None]:
(P, Z, evals) = pca(x, 2)

# Rank-one projections: the component of each point along one eigenvector.
x0 = np.outer(x @ Z[:,0], Z[:,0])
x1 = np.outer(x @ Z[:,1], Z[:,1])

plt.plot(x[:,0], x[:,1], 'b.', x0[:,0], x0[:,1], 'r.', x1[:,0], x1[:,1], 'g.');
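
The red and green point sets are rank-one projections: each point is replaced by its component along a single eigenvector. When both eigenvectors are kept (K = D = 2) and they are orthonormal, the two components add back up to the original point. Here is a small self-contained check, using `np.linalg.eigh` on the true covariance as a stand-in for the output of `pca` (an assumption, since `pca` is left as an exercise):

```python
import numpy as np

rng = np.random.default_rng(0)                     # arbitrary seed
M = np.array([[3.0, 2.0], [2.0, 4.0]])
x = rng.multivariate_normal(np.zeros(2), M, size=1000)

# Stand-in for Z: orthonormal eigenvectors of M, highest eigenvalue first.
evals, evecs = np.linalg.eigh(M)
Z = evecs[:, ::-1]

# Component of every point along each eigenvector (N x 2 arrays).
x0 = np.outer(x @ Z[:, 0], Z[:, 0])
x1 = np.outer(x @ Z[:, 1], Z[:, 1])

# With an orthonormal basis of the full space, the components sum to the data.
print(np.allclose(x0 + x1, x))
```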

## Part 1.2 - Visualization of MNIST [5%]

Let's work with some [handwritten digits](https://en.wikipedia.org/wiki/MNIST_database). Before we try PCA on them, let's visualize the digits. Specifically, implement the function `draw_digits`.

In [None]:
def draw_digits(X: np.ndarray, Y: np.ndarray):
    ### TODO: YOUR CODE HERE
    raise NotImplementedError

In [None]:
(X, Y) = datasets.load_digits()
draw_digits(X, Y)
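
One way to approach `draw_digits` is to arrange the images into a single mosaic and display it with `plt.imshow`. The sketch below covers only the tiling step and rests on assumptions: it presumes each row of `X` is a flattened 28×28 image (consistent with the 784 dimensions used later), and the helper name `tile_digits` and its `n_cols` and `side` parameters are hypothetical.

```python
import numpy as np

def tile_digits(X: np.ndarray, n_cols: int = 10, side: int = 28) -> np.ndarray:
    """Arrange flattened side*side images (rows of X) into one 2-D mosaic."""
    n = X.shape[0]
    n_rows = int(np.ceil(n / n_cols))

    # Pad with blank images so the grid is completely filled.
    padded = np.zeros((n_rows * n_cols, side * side), dtype=X.dtype)
    padded[:n] = X

    # Axes are (grid row, grid col, pixel row, pixel col); swapping the middle
    # two makes each grid row a horizontal strip of images before flattening.
    grid = padded.reshape(n_rows, n_cols, side, side).swapaxes(1, 2)
    return grid.reshape(n_rows * side, n_cols * side)
```

Calling `plt.imshow(tile_digits(X[:50]), cmap='gray')` followed by `plt.axis('off')` would then show the first 50 images; the labels in `Y` could be added with `plt.title` or `plt.text`.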

Now, let's look at some "eigendigits."

In [None]:
(P, Z, evals) = pca(X, 784)
evals

Eventually, the eigenvalues drop to zero (some may be slightly negative due to floating-point error).
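
Zero eigenvalues indicate a rank-deficient covariance matrix, for example when some input dimensions are constant across all samples or when there are fewer points than dimensions. Here is a toy illustration with random data rather than MNIST, assuming the eigenvalues are computed with `np.linalg.eigh` on the sample covariance:

```python
import numpy as np

rng = np.random.default_rng(0)            # arbitrary seed
X = rng.standard_normal((5, 10))          # 5 points in 10 dimensions

# Centering removes one degree of freedom, so the covariance has rank at most 4.
C = np.cov(X, rowvar=False)
evals = np.linalg.eigh(C)[0][::-1]        # eigenvalues, largest first

print(evals)                              # trailing entries are ~0, possibly tiny negatives

# Clip round-off negatives before normalizing or taking square roots.
evals_clipped = np.clip(evals, 0.0, None)
```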

## Part 1.3 - Normalized Eigenvalues [10%]

Plot the normalized eigenvalues for the MNIST digits. How many eigenvectors do you have to include before you've accounted for 90% of the variance? 95%?
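
Normalizing means dividing each eigenvalue by the sum of all eigenvalues, so each entry is the fraction of total variance along that direction; the cumulative sum then gives the variance explained by the top k directions. Here is a minimal sketch on made-up eigenvalues (not MNIST results), with a hypothetical helper `num_components_for` and assuming the eigenvalues are sorted largest first:

```python
import numpy as np

def num_components_for(evals: np.ndarray, threshold: float) -> int:
    """Smallest k such that the top-k eigenvalues cover `threshold` of the variance."""
    evals = np.clip(np.asarray(evals, dtype=float), 0.0, None)  # drop round-off negatives
    normalized = evals / evals.sum()          # fraction of total variance per direction
    cumulative = np.cumsum(normalized)        # assumes evals are sorted largest first
    k = int(np.searchsorted(cumulative, threshold)) + 1
    return min(k, len(evals))

toy_evals = np.array([6.0, 2.0, 1.0, 1.0])    # made-up values for illustration
print(num_components_for(toy_evals, 0.75))
print(num_components_for(toy_evals, 0.95))
```

Plotting the normalized eigenvalues (or their cumulative sum) with `plt.plot` gives the curve the question asks for.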

**ANSWER:**

## Part 1.4 - Visualization of Dimensionality Reduction [5%]

Now, let's plot the top 50 eigenvectors:

In [None]:
draw_digits(Z.T[:50,:], np.arange(50))

Do these look like digits? Should they? Why or why not?

**ANSWER:**