# Principal Component Analysis

Principal Component Analysis (PCA) is a fast and flexible unsupervised method for dimensionality reduction in data. 

In this section, PCA will be applied to perform dimensionality reduction.

Let's start with a small dataset to build intuition about how PCA works.

In [None]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

# Generate and plot a 2d dataset.
rng = np.random.RandomState(1)
X = np.dot(rng.rand(2, 2), rng.randn(2, 100)).T
plt.scatter(X[:, 0], X[:, 1])
plt.show()

Can you see the linear relationship between the x and y?

In unsupervised learning, instead of predicting y from x, we try to learn about the relationship between them. In PCA, this relationship is quantified by finding a list of the principal axes in the data. Let's do this using scikit-learn.

First, import `PCA` from `sklearn.decomposition` and create an instance with `n_components=2`. Then fit it to the `X` data.

We are interested in two attributes of the fitted `pca` object:

- `pca.components_`: a set of eigenvectors, one per row, that point in the directions of maximum variance in the data.

- `pca.explained_variance_`: the amount of variance in the data along each of those directions (the corresponding eigenvalues).
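
The fitting step described above could look like the following minimal sketch, assuming scikit-learn is installed. It regenerates the same `X` as the first cell so that it runs on its own; the variable name `pca` is the one the rest of this section refers to.

```python
import numpy as np
from sklearn.decomposition import PCA

# Same toy dataset as in the first cell, regenerated so this cell is standalone.
rng = np.random.RandomState(1)
X = np.dot(rng.rand(2, 2), rng.randn(2, 100)).T

# Keep both components: nothing is discarded yet, we only look for the axes.
pca = PCA(n_components=2)
pca.fit(X)

print(pca.components_)          # one unit-length direction per row
print(pca.explained_variance_)  # variance of the data along each direction
```

The rows of `pca.components_` are sorted by decreasing explained variance, so the first row is the direction along which the data spreads the most.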

We can visualize these two components by drawing each one as an arrow whose length is proportional to the square root of the corresponding explained variance.

To do that, you can call the `draw_vector` function below.

The arrows then show the two directions of maximum variance, and the squared length of each arrow reflects the variance explained in that direction.

In [None]:
def draw_vector(v0, v1, ax=None):
    """Draw a red arrow from point v0 to point v1 on the given (or current) axes."""
    ax = ax or plt.gca()
    arrowprops = dict(arrowstyle='->', color='red', linewidth=2)
    ax.annotate('', v1, v0, arrowprops=arrowprops)

The length of the vector is a measure of the variance of the data when projected onto that axis!

We can then use those directions to "explain" most of the behaviour of our observations: most of the differences between observations occur along those axes.
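
One possible way to draw the arrows discussed above is sketched here. It repeats the data generation and the `draw_vector` helper so that it is self-contained. The factor of 3 applied to each arrow is an arbitrary choice made only so the arrows are easy to see, and the list `arrow_tips` is a hypothetical name used for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

def draw_vector(v0, v1, ax=None):
    """Draw a red arrow from point v0 to point v1."""
    ax = ax or plt.gca()
    arrowprops = dict(arrowstyle='->', color='red', linewidth=2)
    ax.annotate('', v1, v0, arrowprops=arrowprops)

rng = np.random.RandomState(1)
X = np.dot(rng.rand(2, 2), rng.randn(2, 100)).T
pca = PCA(n_components=2).fit(X)

fig, ax = plt.subplots()
ax.scatter(X[:, 0], X[:, 1], alpha=0.3)

arrow_tips = []
for variance, direction in zip(pca.explained_variance_, pca.components_):
    # Scale the unit direction by 3 standard deviations (arbitrary factor),
    # and start the arrow at the mean of the data.
    tip = pca.mean_ + direction * 3 * np.sqrt(variance)
    draw_vector(pca.mean_, tip, ax=ax)
    arrow_tips.append(tip)

ax.axis('equal')  # same scale on both axes so the arrows look perpendicular
```

In a notebook with the inline backend the figure appears at the end of the cell; in a plain script, call `plt.show()` afterwards.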

### Now what happens when we use those components?

We can use those components (`pca.components_`, the red arrows on the plot) to transform every sample of our dataset into a new space where its variance is clearer and hence easier to visualize.

To do that, first transform the dataset into the new space with the `transform` method of your `pca` object, then plot the result.
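
A minimal sketch of that step, again regenerating `X` and refitting `pca` so the cell runs by itself. The name `X_pca` is an assumption chosen for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

rng = np.random.RandomState(1)
X = np.dot(rng.rand(2, 2), rng.randn(2, 100)).T
pca = PCA(n_components=2).fit(X)

# transform() centers the data and expresses every sample in the
# coordinate system defined by the principal axes.
X_pca = pca.transform(X)

fig, ax = plt.subplots()
ax.scatter(X_pca[:, 0], X_pca[:, 1], alpha=0.3)
ax.set_xlabel('component 1')
ax.set_ylabel('component 2')
```

With the default settings (no whitening), this transformation amounts to subtracting `pca.mean_` and projecting onto the rows of `pca.components_`.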

We have re-expressed the whole dataset along its principal axes, a nicer form in which we can now study the behaviour __between__ the observations. It's no longer crushed into a tightly packed line.

__Try to understand how we passed from the original dataset, then took the two red arrows which represent the directions of most variance, to transform the observations to this new plot.__

## Dimensionality Reduction with PCA

Using PCA for dimensionality reduction involves zeroing out one or more of the smallest principal components, resulting in a lower-dimensional projection of the data that preserves the maximal data variance.

We can then use these new representations as features for any model we want. This is very useful when you have lots of features and want to keep a compact set that is the most representative of what you want to model.

__Let's load a face image dataset and apply PCA.__

In [None]:
# Load data
from sklearn.datasets import fetch_lfw_people
faces = fetch_lfw_people(min_faces_per_person=70, resize=0.4)

You have access to:
- `faces.images`: the images as matrices of __50 x 37 pixels__, which you can plot
- `faces.data`: the flattened version, where each image is a vector of __1850__ values (because 50 x 37 = 1850)
- `faces.target`: the annotation of every face with the corresponding person (as a number index into `faces.target_names`)

In [None]:
# Let’s visualize some faces.
fig = plt.figure(figsize=(7,10))
for i in range(15):
    plt.subplot(5, 5, i + 1)
    plt.title(faces.target_names[faces.target[i]], size=12)
    plt.imshow(faces.images[i], cmap=plt.cm.gray)
    plt.xticks(())
    plt.yticks(())


In this scikit-learn dataset we have 50 × 37 pixel images (1850 dimensions!). That many dimensions is often too much for training the algorithms we studied in the previous exercises (for example SVM).

That's why we use PCA to reduce these features to a more manageable size, while maintaining most of the information of the dataset.

__👉Apply PCA to the dataset (both fit and transform), to reduce dimensions to 150, by setting `n_components=150`__

__👉Put your transformation into a variable named `projected`__
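
One way to approach this exercise is sketched below. The helper `fit_and_project` is a hypothetical name, and the random array `stand_in` only mimics the number of columns of `faces.data` so that the cell runs without downloading anything; it is not face data.

```python
import numpy as np
from sklearn.decomposition import PCA

def fit_and_project(data, n_components=150, random_state=0):
    """Fit a PCA on `data` and return the fitted PCA and the projected data."""
    pca = PCA(n_components=n_components, random_state=random_state)
    projected = pca.fit_transform(data)  # fit and transform in one call
    return pca, projected

# Stand-in with 1850 columns, like faces.data (assumption: random values).
stand_in = np.random.RandomState(0).rand(300, 1850)
pca_demo, projected_demo = fit_and_project(stand_in)

print(projected_demo.shape)        # (300, 150)
print(pca_demo.components_.shape)  # (150, 1850)
```

In this notebook the call would presumably be `pca, projected = fit_and_project(faces.data)`, which gives the `pca` and `projected` variables used in the cells that follow.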

The face dataset was projected onto only the first 150 principal components! Again, what we call components are directions of most variance of the dataset.

This means that we no longer need 1850 pixels to describe each image, just 150 values.

__How is that possible?__

The PCA has found the 150 directions that are most representative of what distinguishes faces from one another, so every image can be described with just 150 values.

They are the directions of most variance.

You can access them in `pca.components_`. Look at the first component of this array of components, and its shape.

As you can see, it's a vector of 1850 values. **We now have 150 components of 1850 values each.**

One face is now described as a weighted sum of those components, added to the mean face.

We're gonna reconstruct one image from its reduced representation to see how it works.

__👉study the code below__

In [None]:
%matplotlib inline

num_dimensions = 150

# We do our reconstruction over the image at index 12
image_original = faces.images[12]
image_compressed = projected[12]

# We start the reconstruction from the mean over all images (computed by pca).
# Copy it, otherwise the in-place += below would modify pca.mean_ itself.
image_reconstructed = pca.mean_.copy()

# Reconstruct the image by weighting every component by the corresponding entry of the compressed representation
for i in range(num_dimensions):
    image_reconstructed += pca.components_[i] * image_compressed[i]
    
# Plot the original and the compressed image.
fig, ax = plt.subplots(1, 2, figsize = (5,5))
ax[0].imshow(image_original, cmap=plt.cm.gray)
ax[0].set_title('Original Image')
ax[1].imshow(image_reconstructed.reshape(faces.images[0].shape), cmap=plt.cm.gray)
ax[1].set_title('Compressed reconstructed Image')
for axis in fig.axes:  # separate name so the `ax` array is not overwritten
    axis.axis('off')
plt.tight_layout()
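
The loop above rebuilds the image as the mean plus a weighted sum of components. As a hedged sketch, the cell below checks on stand-in random data (not the faces) that this matches scikit-learn's `inverse_transform` when PCA is used with its default settings. The names `data`, `pca_demo`, `manual` and `builtin` are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
data = rng.rand(200, 64)  # stand-in for faces.data (random values)
pca_demo = PCA(n_components=10).fit(data)
compressed = pca_demo.transform(data)

index = 12

# Manual reconstruction: mean plus the components weighted by the
# entries of the compressed representation (same idea as the loop above).
manual = pca_demo.mean_ + compressed[index] @ pca_demo.components_

# Built-in equivalent; it is given a 2D slice and returns one row.
builtin = pca_demo.inverse_transform(compressed[index:index + 1])[0]

print(np.allclose(manual, builtin))  # True
```

The reconstruction is lossy: only 10 of the 64 directions are kept here, so `manual` is an approximation of `data[index]`, not an exact copy.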

## How to Choose the Number of Components?

In practice, it is very important to find how many components are needed to describe the data without losing too much information. This can be determined visually by plotting the cumulative sum of `explained_variance_ratio_` as a function of the number of components.

This curve quantifies how much of the total variance is contained within the first components. For example:
- The first 20 components contain more than 75% of the variance,
- while we need only about 70 components to describe 90% of the variance!
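
The curve described above could be produced along the following lines. This is a sketch on synthetic data whose columns have decreasing spread, so the numbers it prints are not the ones quoted for the faces. The helper `components_needed` is a hypothetical name.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

def components_needed(pca, threshold):
    """Smallest number of leading components whose cumulative
    explained-variance ratio reaches `threshold` (between 0 and 1)."""
    cumulative = np.cumsum(pca.explained_variance_ratio_)
    if cumulative[-1] < threshold:
        raise ValueError("the fitted components do not reach this threshold")
    # searchsorted returns the first index where cumulative >= threshold
    return int(np.searchsorted(cumulative, threshold)) + 1

# Synthetic stand-in: 50 features with decreasing standard deviation.
rng = np.random.RandomState(0)
data = rng.randn(500, 50) * np.linspace(5, 0.1, 50)
pca_full = PCA().fit(data)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

fig, ax = plt.subplots()
ax.plot(np.arange(1, len(cumulative) + 1), cumulative)
ax.set_xlabel('number of components')
ax.set_ylabel('cumulative explained variance')

print(components_needed(pca_full, 0.90))
```

For the faces, the same helper could be applied to a PCA fitted on `faces.data`, keeping in mind that it can only count among the components that were actually fitted.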

## Now as a machine learning specialist, what is your use of this transformation?

You have this dataset of faces and you want to build a face recognition engine. 

You can now train your supervised algorithm on the low-dimensional representation you just created, which is still representative of the faces!

### 1. Train test split the face dataset

Use the `train_test_split` function from scikit-learn to separate __the original faces dataset__ into training and testing sets: `Xtrain`, `Xtest`, `ytrain`, `ytest`.

### 2. Transform your training set to reduce the number of dimensions / features

Fit a PCA __over the training data only__ and transform your training data into the reduced dimension (150 features, for example).

Using this same PCA (fitted only on the training set), transform your testing set as well.
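
Steps 1 and 2 could be sketched as follows. To stay runnable without a download, the cell uses random stand-in arrays `X_all` and `y_all` in place of `faces.data` and `faces.target`; the test size, the random seeds and the names `Xtrain_pca` and `Xtest_pca` are assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split

# Stand-in for faces.data / faces.target (random values, 7 fake classes).
rng = np.random.RandomState(0)
X_all = rng.rand(400, 1850)
y_all = rng.randint(0, 7, size=400)

# Step 1: split the ORIGINAL data before any PCA is fitted.
Xtrain, Xtest, ytrain, ytest = train_test_split(
    X_all, y_all, test_size=0.25, random_state=42, stratify=y_all)

# Step 2: fit the PCA on the training set only...
pca = PCA(n_components=150, random_state=42)
Xtrain_pca = pca.fit_transform(Xtrain)
# ...and reuse the same fitted PCA on the test set (transform, never fit).
Xtest_pca = pca.transform(Xtest)

print(Xtrain_pca.shape, Xtest_pca.shape)  # (300, 150) (100, 150)
```

Fitting the PCA on the training set only keeps information from the test set out of the learned components, so the test score stays an honest estimate.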

### 3. Cross validate your choice of best hyperparameters

Run a cross-validated grid search for an SVM in which you tune the hyperparameters `C` (between 1000 and 10000) and `gamma` (between 0.0001 and 0.1).

Fit that cross-validated grid search on the newly transformed training set.

Print the [classification report](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html#sklearn.metrics.classification_report) of your best model over the testing set.
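
A possible shape for step 3 is sketched below. The data comes from `make_classification` as a stand-in for the PCA-transformed faces, the RBF kernel is assumed because `gamma` is being tuned, and the specific grid values and the 5-fold setting are illustrative choices within the ranges given above.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Stand-in for the PCA-transformed faces (synthetic, 3 fake classes).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=10,
    n_classes=3, random_state=0)
Xtrain_pca, Xtest_pca, ytrain, ytest = train_test_split(
    X_demo, y_demo, random_state=0, stratify=y_demo)

# Candidate values inside the ranges suggested in the text.
param_grid = {
    'C': [1e3, 5e3, 1e4],
    'gamma': [1e-4, 1e-3, 1e-2, 1e-1],
}

# 5-fold cross-validated search, fitted on the training set only.
grid = GridSearchCV(SVC(kernel='rbf'), param_grid, cv=5)
grid.fit(Xtrain_pca, ytrain)
print(grid.best_params_)

# By default the best model is refitted on the whole training set,
# so the grid object can predict directly.
y_pred = grid.predict(Xtest_pca)
print(classification_report(ytest, y_pred))
```

With the real data, `Xtrain_pca`, `Xtest_pca`, `ytrain` and `ytest` would come from the split and PCA steps above, and passing `target_names=faces.target_names` to `classification_report` should label the rows with the people's names.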

Try to improve this score by choosing a better number of PCA components.