# Introduction to PCA

In this notebook we are going to go through the PCA algorithm step by step on a simple data set. We'll also generate the data and visualize the results. Let's get started!

## Step 1: Import some modules

In the next cell we are going to import the following modules:
* `numpy` - for doing math
* `matplotlib.pyplot` - for visualization
* `seaborn` - for making our visualizations look good

In [None]:
# This cell imports the modules we will use throughout the notebook

import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set()

## Step 2: Generate the data

Now we are going to generate some data. The data will be two-dimensional and drawn from a multivariate normal distribution with the following mean and covariance:

\begin{equation*}
\mu = (-1,2) \qquad \Sigma = \left(
\begin{array}{cc}
4 & 2 \\
2 & 2
\end{array}
\right)
\end{equation*}

We will generate $N = 1000$ points $X = \{ x_1, \ldots, x_N \}$ from this distribution, and store them in the $1000 \times 2$ numpy array `X`.

In [None]:
# This cell generates the data

N = 1000
mean = (-1, 2)
cov = [[4, 2], [2, 2]]
X = np.random.multivariate_normal(mean, cov, N)

## Step 3: Plot the data

Now it's your turn. Let's plot the data using `plt.scatter`. The `alpha` input parameter is a nice trick here for getting a feel for where the points are more densely sampled. You should also make sure both axes are drawn to the same scale, so the shape of the data is not distorted! If you're not sure about something, look it up! There is plenty of Python documentation online.
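
If you get stuck, here is a minimal sketch of one way to do it. It regenerates a stand-in for `X` with a fixed seed so the block runs on its own (the seed and the figure size are assumptions, not part of the exercise), and it uses `set_aspect("equal")` to put both axes on the same scale.

```python
import numpy as np
import matplotlib.pyplot as plt

# Stand-in for the array X from Step 2 (the seed is an arbitrary choice)
rng = np.random.default_rng(0)
X = rng.multivariate_normal((-1, 2), [[4, 2], [2, 2]], 1000)

fig, ax = plt.subplots(figsize=(6, 6))
# A low alpha makes overlapping points look darker, which reveals dense regions
ax.scatter(X[:, 0], X[:, 1], alpha=0.3)
# Equal aspect: one unit along x is drawn as long as one unit along y
ax.set_aspect("equal")
ax.set_xlabel("x1")
ax.set_ylabel("x2")
ax.set_title("Samples from the 2D normal distribution")
```

In the notebook itself you can reuse the `X` from Step 2 instead of regenerating it.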

In [None]:
# Plot the data in this cell


## Step 4: PCA

Now we are going to implement the PCA algorithm. We will break it down into sub-steps spread across multiple cells.

### Step 4a: Compute the sample mean and center the data

The first step of PCA is to compute the sample mean of the data and use it to center the data. Recall the sample mean is

\begin{equation*}
\mu_N = \frac{1}{N} \sum_{i=1}^N x_i
\end{equation*}

and the mean-centered data $\bar{X} = \{ \bar{x}_1, \ldots, \bar{x}_N \}$ takes the form

\begin{equation*}
\bar{x}_i = x_i - \mu_N
\end{equation*}

When you are done with these steps, print out $\mu_N$ to verify it is close to $\mu$, and plot your mean-centered data to verify it is centered at the origin!
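
As a hint, the two formulas above map onto NumPy almost directly. The sketch below assumes the points are stored as the rows of `X`, as in Step 2, and regenerates a stand-in `X` with an arbitrary seed so that it runs on its own. The names `mu_N` and `X_bar` are just illustrative choices.

```python
import numpy as np

# Stand-in for the array X from Step 2 (the seed is an arbitrary choice)
rng = np.random.default_rng(0)
X = rng.multivariate_normal((-1, 2), [[4, 2], [2, 2]], 1000)

# Sample mean: average over the rows (axis=0), giving one value per coordinate
mu_N = X.mean(axis=0)

# Centering: broadcasting subtracts mu_N from every row of X
X_bar = X - mu_N

print("sample mean:", mu_N)
print("mean of centered data:", X_bar.mean(axis=0))
```

The second printed vector should be zero up to floating-point round-off.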

In [None]:
# In this cell compute the sample mean, center the data, and plot the centered data


### Step 4b: Compute the sample covariance

Now we are going to use the mean-centered data to compute the sample covariance of the data. Recall it is given by:

\begin{equation*}
\Sigma_N = \frac{1}{N-1} \sum_{i=1}^N \bar{x}_i \bar{x}_i^T = \frac{1}{N-1} \sum_{i=1}^N (x_i - \mu_N) (x_i - \mu_N)^T
\end{equation*}

where the data points $x_i \in \mathbb{R}^p$ (in this example $p = 2$) are column vectors and $x^T$ is the transpose of $x$, so each term in the sum is a $p \times p$ matrix. In the next cell, compute the sample covariance matrix. Print it out and compare it to the one used to generate the data!
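
One possible way to do this: when the centered points are stored as the rows of an $N \times p$ array, the sum of the $p \times p$ outer products over all points equals a single matrix product, `X_bar.T @ X_bar`. The sketch below uses that identity and cross-checks the result against `np.cov`. The stand-in data and the variable names are assumptions made so the block is self-contained.

```python
import numpy as np

# Stand-in for the array X from Step 2 (the seed is an arbitrary choice)
rng = np.random.default_rng(0)
X = rng.multivariate_normal((-1, 2), [[4, 2], [2, 2]], 1000)

N = X.shape[0]
X_bar = X - X.mean(axis=0)

# Rows of X_bar are the centered points, so this product sums their outer products
Sigma_N = X_bar.T @ X_bar / (N - 1)
print(Sigma_N)

# Cross-check: np.cov treats rows as variables by default, hence rowvar=False
print(np.allclose(Sigma_N, np.cov(X, rowvar=False)))
```

The printed matrix should be close to, but not exactly equal to, the covariance used to generate the data.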

In [None]:
# Compute the sample covariance matrix in this cell, and print it out



### Step 4c: Diagonalize the sample covariance matrix to obtain the principal components

Now we are ready to solve for the principal components! To do so we diagonalize the sample covariance matrix $\Sigma_N$. We can use the function `np.linalg.eig` to do so. It returns the eigenvalues of $\Sigma_N$ and a matrix whose columns are the corresponding eigenvectors; note that the eigenvalues are not necessarily sorted. Once you have these, carry out the following tasks:
* Compute the percentage of the total variance captured by the first principal component
* Plot the mean-centered data and lines along the first and second principal components
* Project the mean-centered data onto the first and second principal components, and plot the projected data. What do you observe?
* Approximate the data as
\begin{equation*}
x_i \approx \tilde{x}_i := \mu_N + \langle \bar{x}_i, v_0 \rangle v_0
\end{equation*}
where $v_0$ is the first principal component. What do you observe?
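
Here is a rough sketch of the numerical part of these tasks (the plotting is left to you). It assumes the points are the rows of `X`, sorts the eigenpairs explicitly, and uses the centered points in the rank-one approximation. The stand-in data and all variable names are illustrative assumptions.

```python
import numpy as np

# Stand-in for the array X from Step 2 (the seed is an arbitrary choice)
rng = np.random.default_rng(0)
X = rng.multivariate_normal((-1, 2), [[4, 2], [2, 2]], 1000)

mu_N = X.mean(axis=0)
X_bar = X - mu_N
Sigma_N = np.cov(X, rowvar=False)

# Eigenvectors are the columns of eigvecs; the order is not guaranteed
eigvals, eigvecs = np.linalg.eig(Sigma_N)
order = np.argsort(eigvals)[::-1]      # indices for descending eigenvalues
eigvals = eigvals[order]
eigvecs = eigvecs[:, order]

# Fraction of the total variance captured by the first principal component
explained = eigvals[0] / eigvals.sum()
print(f"first component captures {100 * explained:.1f}% of the variance")

# Projection: coordinates of each centered point in the principal-component basis
Z = X_bar @ eigvecs

# Rank-one approximation: keep only the coordinate along v0, then add the mean back
v0 = eigvecs[:, 0]
X_tilde = mu_N + np.outer(X_bar @ v0, v0)
```

For the covariance used to generate the data, the eigenvalues are $3 \pm \sqrt{5}$, so the first component should capture roughly 87% of the variance; the sample value will differ slightly.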

In [None]:
# Diagonalize the sample covariance matrix in this cell, and complete the tasks described in the cell above.
# Add additional cells as needed!


## Step 5: Writing your own PCA function

In the cell below, write your own PCA function. It should take the data as input and return the principal components and their associated eigenvalues, sorted in descending order of eigenvalue. Can you think of a way to make it more efficient than the algorithm outlined above?
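
As one possible answer (not the only one), the sketch below takes the singular value decomposition of the centered data instead of forming and diagonalizing the covariance matrix. The right singular vectors are the principal components, and the squared singular values divided by $N - 1$ are the eigenvalues of the sample covariance. Whether this is actually faster depends on the shape of the data. Another reasonable option is `np.linalg.eigh`, which exploits the symmetry of the covariance matrix. The function name `pca` and its return convention are assumptions.

```python
import numpy as np


def pca(X):
    """Sketch of PCA via SVD.

    X is an (N, p) array with one data point per row. Returns (components,
    eigenvalues): components is (p, k) with one principal component per
    column, where k = min(N, p), and eigenvalues holds the matching
    sample-covariance eigenvalues in descending order.
    """
    X = np.asarray(X, dtype=float)
    X_bar = X - X.mean(axis=0)
    # X_bar = U @ diag(S) @ Vt; np.linalg.svd returns S in descending order
    _, S, Vt = np.linalg.svd(X_bar, full_matrices=False)
    eigenvalues = S**2 / (X.shape[0] - 1)
    return Vt.T, eigenvalues


# Usage on stand-in data (the seed is an arbitrary choice)
rng = np.random.default_rng(0)
X = rng.multivariate_normal((-1, 2), [[4, 2], [2, 2]], 1000)
components, eigenvalues = pca(X)
print(eigenvalues)
print(components)
```

Keep in mind that principal components are only defined up to sign, so the columns may differ by a factor of $-1$ from those returned by `np.linalg.eig`.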

In [None]:
# Put your PCA function in this cell


## Step 6: Test your PCA function on other data

Go to the `load_data` notebook and load up some data sets. Visualize the data and try out your PCA function!