# Principal Component Analysis

Let's discuss PCA! PCA is an unsupervised technique for transforming data rather than a full predictive machine learning algorithm, so we will just have a lecture on this topic and no full machine learning project (although we will walk through the cancer data set with PCA).

## PCA Review

Make sure to watch the video lecture and theory presentation for a full overview of PCA! 
Remember that PCA is just a transformation of your data that attempts to find the directions (combinations of features) that explain the most variance in your data. To recap:

- This notebook uses Python and scikit-learn to illustrate the concept of PCA.
- We won't have a portfolio project, although we will use the breast cancer dataset.
- PCA transforms the data to find which components explain most of the variance.

<img src='PCA.png' />

## Explanation of the Image Above
- If we have 2 components, as in the image on the top left, we try to get rid of the components that do not explain much of the variance in the data.
- We can transform the data with either the first or the second component dropped and then compare how much variance each version retains.
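The idea in the bullets above can be sketched on a small toy example. The data here is hypothetical (two correlated features generated with NumPy, not taken from the image), and the variable names `toy`, `toy_pca`, `toy_ratio` and `toy_1d` are made up for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical toy data: two strongly correlated features.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 0.5 * x + rng.normal(scale=0.1, size=200)
toy = np.column_stack([x, y])

# Keep both components to see how the variance is split between them.
toy_pca = PCA(n_components=2).fit(toy)
toy_ratio = toy_pca.explained_variance_ratio_
print(toy_ratio)

# Dropping the second component keeps only the high-variance direction.
toy_1d = PCA(n_components=1).fit_transform(toy)
print(toy_1d.shape)
```

Because almost all of the spread lies along one direction, the first ratio should be close to 1, which is why dropping the second component loses very little information in this sketch.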

## Libraries

In [None]:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns
%matplotlib inline

## The Data

Let's work with the cancer data set again since it had so many features.

In [None]:
from sklearn.datasets import load_breast_cancer

# Load the dataset directly from scikit-learn, which ships a few built-in datasets that its documentation uses for examples.

In [None]:
# Load the dataset and store it in a variable
cancer = load_breast_cancer()

In [None]:
type(cancer)  # The returned Bunch object behaves like a dictionary

In [None]:
cancer.keys()

In [None]:
print(cancer['DESCR'])  # Print the description; note the large number of attributes (30) relative to the number of instances (569).
# The dataset also contains targets and their names. Predicting whether a tumour is benign or malignant
# would be a supervised learning task; here we instead look for the components that explain
# most of the variance in the dataset.

In [None]:
df = pd.DataFrame(cancer['data'], columns=cancer['feature_names'])
# Keys used in this notebook: 'DESCR', 'data', 'feature_names', 'target_names', 'target'


In [None]:
df.head()

In [None]:
cancer['target_names']

- We won't apply any ML algorithm for prediction or classification here; instead we will do PCA. If we were given this dataset and planned to apply a classification algorithm to it, running PCA first would give us an idea of what matters for deciding whether a tumour belongs to class 0 or class 1.

## PCA Visualization

As we've noticed before, it is difficult to visualize high-dimensional data. We can use PCA to find the first two principal components and visualize the data in this new two-dimensional space with a single scatter plot. Before we do this, though, we'll need to scale our data so that each feature has unit variance.

In [None]:
from sklearn.preprocessing import StandardScaler

We scale our data so that each feature has zero mean and unit variance before we use PCA on the cancer dataset; otherwise features measured on larger scales would dominate the components.

In [None]:
# As before, we instantiate the scaler like any other scikit-learn estimator.
scaler = StandardScaler()

In [None]:
scaler.fit(df) # Fitting the scaler to features

In [None]:
# Now we transform
scaled_data = scaler.transform(df)
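As a quick sanity check on the scaling step, the sketch below rebuilds the same objects so that it runs on its own and confirms that every column of the scaled data has (approximately) zero mean and unit standard deviation. The names `col_means` and `col_stds` are introduced here for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler

cancer = load_breast_cancer()
df = pd.DataFrame(cancer['data'], columns=cancer['feature_names'])

# fit_transform is shorthand for calling fit and then transform on the same data.
scaled_data = StandardScaler().fit_transform(df)

# Per-feature mean and (population) standard deviation after scaling.
col_means = scaled_data.mean(axis=0)
col_stds = scaled_data.std(axis=0)
print(np.allclose(col_means, 0), np.allclose(col_stds, 1))
```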

PCA in scikit-learn follows a very similar process to the other preprocessing tools that come with the library. We instantiate a PCA object, find the principal components using the fit method, then apply the rotation and dimensionality reduction by calling transform().

We can also specify how many components we want to keep when creating the PCA object.
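Besides an integer count, scikit-learn's `PCA` also accepts a fraction between 0 and 1 for `n_components`, in which case it keeps as many components as are needed to reach that share of the variance. The sketch below is only an illustration of that option; the 0.95 threshold and the names `features`, `scaled` and `pca_95` are assumptions, not part of the original walkthrough.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

features = load_breast_cancer()['data']
scaled = StandardScaler().fit_transform(features)

# Keep the smallest number of components explaining at least 95% of the variance.
# svd_solver='full' is passed explicitly because the fractional form relies on it.
pca_95 = PCA(n_components=0.95, svd_solver='full').fit(scaled)

print(pca_95.n_components_)                    # number of components kept
print(pca_95.explained_variance_ratio_.sum())  # share of variance retained
```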

In [None]:
from sklearn.decomposition import PCA

In [None]:
pca = PCA(n_components=2)  # We try to visualise the entire 30-dimensional dataset using just two principal components.

In [None]:
pca.fit(scaled_data)

In [None]:
# Transform the data to its first two principal components.
x_pca = pca.transform(scaled_data)

In [None]:
scaled_data.shape

In [None]:
x_pca.shape  # Transformed and reduced to the first two principal components.
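To see how much of the original variance the two retained components account for, a fitted `PCA` object exposes `explained_variance_ratio_`. The sketch below rebuilds the earlier objects so it runs on its own; `ratio` is a name introduced here.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

cancer = load_breast_cancer()
scaled_data = StandardScaler().fit_transform(cancer['data'])

pca = PCA(n_components=2).fit(scaled_data)

# Fraction of the total variance explained by each kept component,
# ordered from the most to the least important.
ratio = pca.explained_variance_ratio_
print(ratio)
print(ratio.sum())  # share of variance kept by the 2-D projection
```

The sum is below 1 because the remaining 28 components, which we discarded, still carry some of the variance.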

In [None]:
# After reducing the 30 dimensions to 2, we plot these two dimensions using matplotlib.
plt.figure(figsize=(8,6))
# Grab all the rows from column 0 and plot them against all the rows from column 1.
# c=cancer['target'] colours the points according to benign or malignant.
plt.scatter(x_pca[:,0],x_pca[:,1],c=cancer['target'],cmap='plasma')
plt.xlabel("First Principal Component")
plt.ylabel("Second Principal Component")
plt.grid(True)

- The plot above shows the power of PCA: based on only the first and second components, we can see a clear separation between benign and malignant tumours.

- Using just these two components, the two classes can be separated quite easily.

## Interpreting the components 

Unfortunately, this great power of dimensionality reduction comes at a cost: it is no longer easy to understand what these components represent.

The components correspond to combinations of the original features. The components themselves are stored as an attribute of the fitted PCA object:

In [None]:
pca.components_

In this NumPy array, each row represents a principal component, and each column relates back to one of the original features. We can visualize this relationship with a heatmap:

In [None]:
df_comp = pd.DataFrame(pca.components_, columns=cancer['feature_names'])
# Holds the weight of each of the 30 features in principal components 0 and 1.

In [None]:
plt.figure(figsize=(12,6))
sns.heatmap(df_comp,cmap='plasma')

This heatmap and its color bar show the weight (loading) of each original feature in each principal component. Because the data were standardized, these weights are proportional to the correlation between each feature and the component.
Each principal component is shown here as a row; the higher the number, or the hotter the colour (i.e. towards yellow), the more strongly that component is related to the feature in that column.
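To make "combinations of the original features" concrete, the sketch below checks that the transformed data is just the centred, scaled data multiplied by the transposed component matrix, and that the component rows are orthonormal. It rebuilds the earlier objects so it runs on its own; `manual` and `gram` are names introduced here, and the check assumes the default `whiten=False`.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

cancer = load_breast_cancer()
scaled_data = StandardScaler().fit_transform(cancer['data'])

pca = PCA(n_components=2).fit(scaled_data)
x_pca = pca.transform(scaled_data)

# Each new coordinate is a weighted sum of the 30 centred features,
# with the weights taken from the matching row of pca.components_.
manual = (scaled_data - pca.mean_) @ pca.components_.T
print(np.allclose(manual, x_pca))

# The rows of components_ have unit length and are mutually orthogonal.
gram = pca.components_ @ pca.components_.T
print(np.allclose(gram, np.eye(2)))
```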


- **Do some extra reading on PCA from ISLR.**

## Conclusion

Hopefully this information is useful to you when dealing with high-dimensional data!
Once we have the principal components, we can feed the reduced version x_pca into a classification algorithm, for example a logistic regression on x_pca instead of on the entire dataframe of features. As the plot shows, the two categories are almost separable by a straight line, so an SVM would be a good choice for data of this nature.
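A minimal sketch of that last step, assuming a logistic regression on two principal components, could look like the following. The train/test split, the `random_state` value and the names `model` and `accuracy` are choices made for this illustration, not part of the original lecture.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

cancer = load_breast_cancer()

# Hold out part of the data so the score reflects unseen samples.
X_train, X_test, y_train, y_test = train_test_split(
    cancer['data'], cancer['target'],
    test_size=0.3, random_state=101, stratify=cancer['target'])

# Scale, reduce to two principal components, then classify.
model = make_pipeline(StandardScaler(), PCA(n_components=2), LogisticRegression())
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
print(accuracy)
```

Wrapping the steps in a pipeline means the scaler and PCA are fitted on the training split only, so no information from the test split leaks into the transformation. Swapping `LogisticRegression()` for `sklearn.svm.SVC(kernel='linear')` would be one way to try the SVM suggested above.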