# Iris Dataset Visualization using Principal Component Analysis (PCA)

## Introduction
PCA reduces the number of dimensions of a dataset. It is useful when there are too many features to comprehend and it is unclear which ones to keep.
>**Rule of thumb to choose PCA**: <br>
>1.Do you want to reduce the number of variables, but aren’t able to identify variables to completely remove from consideration?<br>
>2.Do you want to ensure your variables are independent of one another?<br>
>3.Are you comfortable making your independent variables less interpretable?<br>
>If yes to all, PCA is the right method. If no to question 3, PCA might not be your ideal solution.
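
Question 2 points at a property that is easy to check numerically: the principal components of a dataset are uncorrelated with one another (which, in general, is weaker than full statistical independence). The following is a minimal NumPy sketch, not the scikit-learn implementation, and the two-feature data is synthetic and assumed purely for illustration.

```python
import numpy as np

# Synthetic data (an assumption for illustration): two strongly correlated features.
rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = 0.8 * x + 0.2 * rng.normal(size=500)
X = np.column_stack([x, y])

# Centre the data, then take the eigenvectors of its covariance matrix.
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)

# Project the centred data onto the principal axes (the eigenvectors).
scores = Xc @ eigvecs

corr_before = np.corrcoef(X, rowvar=False)[0, 1]
corr_after = np.corrcoef(scores, rowvar=False)[0, 1]
print(f"correlation between original features:    {corr_before:.3f}")
print(f"correlation between principal components: {corr_after:.3f}")
```

The original features are highly correlated, while the projected columns have a correlation of zero up to floating-point rounding. The price is interpretability: each new column is a mix of the original features, which is what question 3 asks about.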


In [None]:
import pandas as pd 
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn import datasets
from sklearn import model_selection
import matplotlib.pyplot as plt
%matplotlib inline

### Load Data
Here we load the Iris dataset from **scikit-learn**. We will use `iris.data` as the features and `iris.target` as the class labels.

In [None]:
iris = datasets.load_iris()

`dir(iris)` lists the attributes of the iris dataset object.<br> `iris.data.shape` shows the shape of the data.<br>
`iris.target_names` shows the classes that we want to classify.<br>
`iris.feature_names` shows the names of the features.

In [None]:
dir(iris)

In [None]:
iris.target_names

In [None]:
iris.feature_names

In [None]:
iris.data.shape

In [None]:
np.unique(iris.target)

In [None]:
data = iris.data.astype(np.float32)
target = iris.target.astype(np.float32)

In [None]:
pd.DataFrame(data=data, columns=iris.feature_names).head()

Use StandardScaler to scale the data before applying PCA.
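
As a rough picture of what this scaling step computes, the sketch below standardizes a small hand-made array with plain NumPy. The values are hypothetical; the only assumption about `StandardScaler` is its documented default behaviour of subtracting each column's mean and dividing by its population standard deviation (`ddof=0`).

```python
import numpy as np

# Toy feature matrix (hypothetical values): 4 samples, 2 features on very different scales.
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0],
              [4.0, 700.0]])

# Standardize each column: subtract its mean, divide by its standard deviation.
# ddof=0 is the population standard deviation.
mean = X.mean(axis=0)
std = X.std(axis=0, ddof=0)
X_scaled = (X - mean) / std

print(X_scaled.mean(axis=0))  # approximately 0 for every column
print(X_scaled.std(axis=0))   # approximately 1 for every column
```

Without this step, the feature with the largest numeric range would dominate the principal components simply because of its units.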

In [None]:
scaled_data = StandardScaler().fit_transform(data)

In [None]:
pd.DataFrame(data=scaled_data, columns=iris.feature_names).head()

Set the number of **principal components** to 2.

In [None]:
# Keep the first two principal components
pca = PCA(n_components=2)

In [None]:
principalComponents = pca.fit_transform(scaled_data)

In [None]:
principaldf = pd.DataFrame(data=principalComponents,
                           columns=['Principal component 1', 'Principal component 2'])

In [None]:
principaldf.head()

In [None]:
targetdf = pd.DataFrame(data = iris.target,
                        columns = ["Iris Class"])

In [None]:
finaldf = pd.concat([principaldf, targetdf], axis=1)
finaldf.head()

Specify the **targets** to be the **labels** of the iris dataset.

In [None]:
# Use the iris class labels as the plot targets
fig = plt.figure(figsize = (8,8))
ax = fig.add_subplot(1,1,1) 
ax.set_xlabel('Principal Component 1', fontsize = 13)
ax.set_ylabel('Principal Component 2', fontsize = 13)
ax.set_title('2D Data Visualization after PCA', fontsize = 15)

targets = np.unique(iris.target)

colors = ['b', 'g', 'r']
for target, color in zip(targets,colors):
    indicesToKeep = finaldf["Iris Class"] == target
    ax.scatter(finaldf.loc[indicesToKeep, 'Principal component 1'],
               finaldf.loc[indicesToKeep, 'Principal component 2'],
               c = color,
               s = 50)
ax.legend(iris.target_names)
ax.grid()

**explained_variance_ratio_** : array, shape (n_components,)
Percentage of variance explained by each of the selected components.

If n_components is not set then all components are stored and the sum of the ratios is equal to 1.0.
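
To make the definition concrete, here is a small NumPy sketch of how such ratios can be derived from the singular values of the centred data. It uses synthetic data (an assumption for illustration) and is a sketch of the idea rather than scikit-learn's actual code path.

```python
import numpy as np

# Synthetic data (an assumption for illustration): 200 samples, 4 features
# with deliberately different spreads.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 4)) * np.array([3.0, 2.0, 1.0, 0.5])

# Centre the data; the squared singular values of the centred matrix give
# the variance captured along each principal axis.
Xc = X - X.mean(axis=0)
singular_values = np.linalg.svd(Xc, compute_uv=False)
explained_variance = singular_values**2 / (len(X) - 1)

# Each ratio is that component's share of the total variance.
ratio = explained_variance / explained_variance.sum()

print(ratio)            # sorted from largest to smallest
print(ratio.sum())      # 1.0 (up to rounding) because all components are kept
print(ratio[:2].sum())  # share of variance held by the first two components
```

Keeping only the first two components, as in this notebook, means reporting only the first two entries of such an array, so their sum is below 1.0.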

Print out the variance ratio.

In [None]:
# Show the share of variance explained by each principal component
pca.explained_variance_ratio_

After PCA, the dimension of the dataset was reduced from four to two.

0.7296+0.2285=0.9581.

**95.81%** of the variance (the information in the data) was retained.
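
The retained percentage can also be read as the complement of a reconstruction error: whatever variance the kept components do not explain is exactly what is lost when the data is mapped back to the original feature space. The sketch below checks this with NumPy on synthetic four-feature data (an assumption for illustration, not the Iris data).

```python
import numpy as np

# Synthetic 4-feature data with correlated columns (the mixing matrix is arbitrary).
rng = np.random.default_rng(1)
X = rng.normal(size=(150, 4)) @ rng.normal(size=(4, 4))
Xc = X - X.mean(axis=0)

# Principal axes are the rows of Vt; s holds the singular values.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

k = 2
scores = Xc @ Vt[:k].T    # 150 x 2 reduced representation
X_back = scores @ Vt[:k]  # 150 x 4 reconstruction in the centred feature space

retained = (s[:k] ** 2).sum() / (s ** 2).sum()
lost = ((Xc - X_back) ** 2).sum() / (Xc ** 2).sum()

print(f"variance retained by {k} components: {retained:.4f}")
print(f"relative reconstruction error:      {lost:.4f}")
print(f"sum:                                {retained + lost:.4f}")
```

The two quantities add up to 1, which is why summing the first two explained-variance ratios tells us how much of the dataset survives the reduction.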

Let us try another dataset. This time we will use the well-known Breast Cancer dataset,
which we can load directly from scikit-learn.

# Breast Cancer Data Visualization using PCA

In [None]:
bcancer=datasets.load_breast_cancer()

In [None]:
dir(bcancer)

Let's check how many classes this dataset has.

In [None]:
bcancer.target_names

There are **two** target classes in the breast cancer dataset: malignant and benign.

**Malignant** means the tumour is cancerous (**harmful**), whereas **benign** means it is non-cancerous (**not harmful**).

In [None]:
bcancer.feature_names

In [None]:
bcancer.data.shape

In the breast cancer dataset, there are 30 features or columns of data.

There are 569 rows of sample data or entries.

In [None]:
data = bcancer.data.astype(np.float32)
target = bcancer.target.astype(np.float32)

In [None]:
np.unique(bcancer.target)

In [None]:
pd.DataFrame(data = data, columns = bcancer.feature_names).head()

**Scale** the data before applying PCA.

In [None]:
# Standardize every feature to zero mean and unit variance
scaled_data = StandardScaler().fit_transform(data)

In [None]:
pd.DataFrame(data = scaled_data, columns = bcancer.feature_names).head()

In [None]:
pca = PCA(n_components=2)
principalComponents = pca.fit_transform(scaled_data)
principaldf = pd.DataFrame(data=principalComponents, 
                           columns=['Principal component 1', 'Principal component 2'])
principaldf.head()

In [None]:
targetdf = pd.DataFrame(data=bcancer.target,
                        columns=["Breast Cancer Class"])

In [None]:
finaldf = pd.concat([principaldf, targetdf], axis = 1)
finaldf.head()

Specify the "targets" to be the labels of the breast cancer dataset.

In [None]:
# Use the breast cancer class labels as the plot targets
fig = plt.figure(figsize = (8,8))
ax = fig.add_subplot(1,1,1) 
ax.set_xlabel('Principal Component 1', fontsize = 13)
ax.set_ylabel('Principal Component 2', fontsize = 13)
ax.set_title('2D Data Visualization after PCA', fontsize = 15)

targets = np.unique(bcancer.target)

colors = ['r', 'b']
for target, color in zip(targets,colors):
    indicesToKeep = finaldf["Breast Cancer Class"] == target
    ax.scatter(finaldf.loc[indicesToKeep, 'Principal component 1'],
               finaldf.loc[indicesToKeep, 'Principal component 2'],
               c = color,
               s = 50)
ax.legend(bcancer.target_names)
ax.grid()

Print out the variance ratio.

In [None]:
# Show the share of variance explained by each principal component
pca.explained_variance_ratio_

After PCA, the dimension of the dataset was reduced from **thirty to two**.

0.4427+0.1897=0.6324.

**63.24%** of the variance (the information in the data) was retained.
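
When two components keep too little variance for a given purpose, one common approach is to look at the cumulative explained variance and pick the smallest number of components that reaches a chosen threshold. The sketch below illustrates this with NumPy on a synthetic 30-feature dataset; the data, the 95% threshold, and the variable names are all assumptions for illustration and do not reproduce the breast cancer results.

```python
import numpy as np

# Synthetic stand-in for a 30-feature dataset: a few latent factors plus noise.
rng = np.random.default_rng(7)
latent = rng.normal(size=(300, 5))
X = latent @ rng.normal(size=(5, 30)) + 0.3 * rng.normal(size=(300, 30))

# Standardize, then compute the explained variance ratio of every component.
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
s = np.linalg.svd(Xs, compute_uv=False)  # Xs is already centred
ratio = s**2 / (s**2).sum()
cumulative = np.cumsum(ratio)

# Smallest number of components whose cumulative ratio reaches the threshold.
threshold = 0.95  # hypothetical target, chosen for illustration
n_needed = int(np.searchsorted(cumulative, threshold) + 1)

print(f"first two components retain {cumulative[1]:.2%} of the variance")
print(f"{n_needed} components are needed to retain at least {threshold:.0%}")
```

scikit-learn's `PCA` also accepts a float between 0 and 1 for `n_components`, which selects the number of components by explained variance in a similar spirit.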

Despite the information lost, the two classes of the breast cancer dataset are still **clearly separated** in the two-dimensional plot.