#Principal Component Analysis

Principal Component Analysis (PCA) is a linear dimensionality reduction technique that extracts information from a high-dimensional space by projecting the data onto a lower-dimensional subspace.

1. It tries to preserve the essential parts of the data, which have more variation, and remove the non-essential parts, which have less variation.

2. PCA is an unsupervised dimensionality reduction technique: it uses only the correlations between features and needs no labels. Similar data points tend to land close together in the projected space, so groupings can become visible without any supervision.

3. PCA is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables (entities each of which takes on various numerical values) into a set of values of linearly uncorrelated variables called principal components.

Note: Features, dimensions, and variables all refer to the same thing here.

4. PCA is used in data visualization to reduce the number of dimensions to plot, and in ML to speed up an algorithm by reducing the number of input features.

5. When the data is projected from a higher-dimensional space into a lower one (say, three dimensions), those three dimensions are the three principal components that capture (or hold) most of the variance (information) of your data.

Principal components have both direction and magnitude. 

* The direction is the axis along which the data is most spread out, that is, has the most variance.
* The magnitude is the amount of variance in the data that the principal component captures when the data is projected onto that axis.
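
To make direction and magnitude concrete, here is a minimal sketch that computes principal components by hand with NumPy. It assumes synthetic two-feature data (not the notebook's dataset) and uses the eigendecomposition of the covariance matrix, which is one common way to compute PCA and not necessarily what scikit-learn does internally. The eigenvectors are the directions, the eigenvalues are the magnitudes, and the projected components come out uncorrelated.

```python
import numpy as np

# Synthetic 2-D data with correlated features (illustrative only).
rng = np.random.default_rng(0)
base = rng.normal(size=500)
X = np.column_stack([base, 0.5 * base + 0.1 * rng.normal(size=500)])

# Center the data and compute the covariance matrix of the features.
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)

# Eigenvectors give the directions (principal axes); eigenvalues give the
# variance captured along each axis. eigh returns them in ascending order,
# so reorder them from largest to smallest.
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Project the centered data onto the principal axes.
scores = Xc @ eigvecs

print("directions (columns):\n", eigvecs)
print("variance per component:", eigvals)
print("variance of projected data:", scores.var(axis=0, ddof=1))
print("correlation between components:", np.corrcoef(scores, rowvar=False)[0, 1])
```

The variance of each projected column matches the corresponding eigenvalue, and the correlation between the two components is zero up to floating-point error.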

Here we are going to use the Breast Cancer dataset. It has two classes, **Malignant** and **Benign**, which indicate whether or not the patient has breast cancer.



In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer

**load_breast_cancer** gives you both the data and the labels. Use **.data** to fetch the feature matrix and **.target** to fetch the labels.

In [None]:
breast = load_breast_cancer()

In [None]:
breast
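
Before relabelling the classes later on, it helps to check which integer stands for which class. The sketch below is self-contained and assumes only that scikit-learn is installed; it reads the `target_names` attribute of the returned object, where position `i` holds the name for integer label `i`, and counts the samples per class.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

breast = load_breast_cancer()

# target_names[i] is the class name for integer label i.
print(breast.target_names)

# Count how many samples carry each label.
counts = {}
for value, name in enumerate(breast.target_names):
    counts[str(name)] = int(np.sum(breast.target == value))
    print(value, name, counts[str(name)])
```

In scikit-learn's copy of the dataset, label 0 is malignant and label 1 is benign.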

In [None]:
breast_data = breast.data
print(breast_data.shape)

In [None]:
breast_labels=breast.target
print(breast_labels.shape)

In [None]:
# reshape the labels into a single column; -1 infers the number of rows
labels=np.reshape(breast_labels,(-1,1))
labels.shape

Next, create a dataframe from the feature matrix. The labels are added to it as an extra column once the feature columns have been named.

In [None]:
breast_dataset = pd.DataFrame(breast_data)
print(breast_dataset.head())

Print the features in the breast cancer dataset.

In [None]:
features = breast.feature_names
print(features)
print(len(features))

In [None]:
breast_dataset.columns = features
breast_dataset.head()

In [None]:
breast_dataset['label']=labels
breast_dataset.head()

Since the original labels are in 0,1 format, you will change them to class names using the **.map** function. In scikit-learn's copy of the dataset, 0 is malignant and 1 is benign (see **breast.target_names**). Assigning the result back to the column modifies the dataframe breast_dataset.

In [None]:
# 0 -> Malignant, 1 -> Benign, matching breast.target_names
breast_dataset['label'] = breast_dataset['label'].map({0: 'Malignant', 1: 'Benign'})
breast_dataset.tail()

#PCA
You start by standardizing the data, because PCA's output depends on the scale of the features: a feature measured in large units would otherwise dominate the components.

In [None]:
from sklearn.preprocessing import StandardScaler
x = breast_dataset.loc[:, features].values
x = StandardScaler().fit_transform(x) # standardizing the features (zero mean, unit variance)
print(x.shape)

Check that the mean is approximately 0 and the standard deviation is approximately 1.

In [None]:
np.mean(x),np.std(x)
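
To see why the scale matters, here is a minimal sketch on synthetic data (two independent features, one with a much larger spread, both invented for illustration). The helper `explained_variance_ratio` is a hypothetical name defined here, not a scikit-learn function. Without standardization the large-scale feature takes almost all of the variance; after standardization the two features contribute roughly equally.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 300
# Two independent synthetic features on very different scales.
X = np.column_stack([
    rng.normal(scale=1000.0, size=n),
    rng.normal(scale=1.0, size=n),
])

def explained_variance_ratio(data):
    """Share of total variance along each principal axis, largest first."""
    centered = data - data.mean(axis=0)
    eigvals = np.linalg.eigvalsh(np.cov(centered, rowvar=False))[::-1]
    return eigvals / eigvals.sum()

# Standardize each feature (column) to zero mean and unit variance.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

raw_ratio = explained_variance_ratio(X)
std_ratio = explained_variance_ratio(X_std)
print("raw:", raw_ratio)
print("standardized:", std_ratio)

# Per-feature check of the standardization.
print("column means:", X_std.mean(axis=0))
print("column stds:", X_std.std(axis=0))
```

Checking the mean and standard deviation per column, as in the last two lines, is a stricter test than a single mean and standard deviation over the whole array.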

Let's convert the normalized features into a tabular format with the help of DataFrame.

In [None]:
feat_cols = ['feature'+str(i) for i in range(x.shape[1])]

In [None]:
normalised_breast = pd.DataFrame(x,columns=feat_cols)
print(normalised_breast)

In [None]:
normalised_breast.tail()

Projecting the thirty-dimensional Breast Cancer data to two-dimensional principal components.

In [None]:
from sklearn.decomposition import PCA
pca_breast = PCA(n_components=2)
principalComponents_breast = pca_breast.fit_transform(x)

In [None]:
principalComponents_breast.shape

In [None]:
principal_breast_Df = pd.DataFrame(data = principalComponents_breast
             , columns = ['principal component 1', 'principal component 2'])
principal_breast_Df.head()

In [None]:
breast_dataset

You can inspect **explained_variance_ratio_**. It gives the fraction of the total variance (information) that each principal component holds after the data is projected onto the lower-dimensional subspace.

In [None]:
print('Explained variation per principal component: {}'.format(pca_breast.explained_variance_ratio_))
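
The ratio can also be reproduced by hand, which shows what the numbers mean. This is a minimal sketch on synthetic data (five features generated from two latent factors, invented for illustration), assuming the usual definition: the variance along each principal axis divided by the total variance. It uses the singular value decomposition of the centered data and is not a copy of scikit-learn's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: 200 samples, 5 correlated features (illustrative only).
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + 0.3 * rng.normal(size=(200, 5))

# Center the data, then take the SVD. The squared singular values are
# proportional to the variance along each principal axis.
Xc = X - X.mean(axis=0)
_, singular_values, _ = np.linalg.svd(Xc, full_matrices=False)
explained_variance = singular_values**2 / (X.shape[0] - 1)
ratio = explained_variance / explained_variance.sum()

print("ratio per component:", ratio)
print("kept by the first two components:", ratio[:2].sum())
print("cumulative:", np.cumsum(ratio))
```

The sum of the first two entries is the share of the variance that survives a projection onto two components; whatever is left over is the information lost by the projection.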

Plot the data projected onto the first two principal components, colored by class.

In [None]:
import matplotlib.pyplot as plt
plt.figure(figsize=(10,10))
plt.xticks(fontsize=12)
plt.yticks(fontsize=14)
plt.xlabel('Principal Component - 1',fontsize=20)
plt.ylabel('Principal Component - 2',fontsize=20)
plt.title("Principal Component Analysis of Breast Cancer Dataset",fontsize=20)
targets = ['Benign', 'Malignant']
colors = ['r', 'g']
for target, color in zip(targets,colors):
    indicesToKeep = breast_dataset['label'] == target
    plt.scatter(principal_breast_Df.loc[indicesToKeep, 'principal component 1']
               , principal_breast_Df.loc[indicesToKeep, 'principal component 2'], c = color, s = 25)

plt.legend(targets,prop={'size': 15})

#### PCA for Iris dataset

In [None]:
from sklearn.datasets import load_iris

In [None]:
iris=load_iris()
iris

In [None]:
iris_data=iris.data
iris_target=iris.target

In [None]:
iris_target.shape

In [None]:
#convert to column matrix
iris_target_col=np.reshape(iris_target, (150,1))
iris_target_col

In [None]:
#create iris dataframe
iris_dataset = pd.DataFrame(iris_data, columns=iris.feature_names)
print(iris_dataset.head())

In [None]:
iris_dataset['label']=iris_target_col
iris_dataset.head()

In [None]:
# map the integer labels to species names
iris_dataset['label'] = iris_dataset['label'].map({0: 'setosa', 1: 'versicolor', 2: 'virginica'})
iris_dataset.tail()

In [None]:
from sklearn.decomposition import PCA
pca_iris = PCA(n_components=2)
principalComponents_iris = pca_iris.fit_transform(iris_data)

In [None]:
principal_iris_Df = pd.DataFrame(data = principalComponents_iris
             , columns = ['PC1', 'PC2'])
principal_iris_Df.head()

In [None]:
print('Explained variation per principal component: {}'.format(pca_iris.explained_variance_ratio_))

In [None]:
principal_iris_Df['label']=iris_target_col
principal_iris_Df.head()

In [None]:
plt.figure(figsize=(10,10))
plt.xticks(fontsize=12)
plt.yticks(fontsize=14)
plt.xlabel('Principal Component - 1',fontsize=20)
plt.ylabel('Principal Component - 2',fontsize=20)
plt.title("Principal Component Analysis of Iris Dataset",fontsize=20)
targets = ['setosa', 'versicolor', 'virginica']
colors = ['r', 'g', 'b']
for target, color in zip(targets,colors):
    indicesToKeep = iris_dataset['label'] == target
    plt.scatter(principal_iris_Df.loc[indicesToKeep, 'PC1']
               , principal_iris_Df.loc[indicesToKeep, 'PC2'], c = color, s = 25)

plt.legend(targets,prop={'size': 15})

#### Try PCA for your own dataset and visualize it in 3D