# PCA Implementation using Python and Scikit-Learn

Today, I am learning about PCA and how it is used to reduce the number of features in a data set so that my machine learning models train faster.

I will be using the following libraries:

 - `pandas`
 - `sklearn`
 - `plotly.express`
 - `numpy`


### What is PCA?

Principal Component Analysis (PCA) is an unsupervised learning algorithm that projects a data set with high dimensionality onto a smaller number of new axes. This process is called dimensionality reduction. The new axes are called principal components (PCs): each one is a linear combination of the original features, and they are ordered by how much of the variance in the data they capture.
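
To make the definition concrete, here is a minimal sketch of PCA done by hand with NumPy on the Iris data, compared against scikit-learn. It assumes the standard covariance-eigenvector formulation; the variable names (`X_std`, `manual_scores`, and so on) are only for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardize the four iris features (zero mean, unit variance)
X_std = StandardScaler().fit_transform(load_iris().data)

# Covariance matrix of the standardized features (4 x 4)
cov = np.cov(X_std, rowvar=False)

# eigh is meant for symmetric matrices; it returns eigenvalues in ascending order
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# Reorder so the direction with the largest variance comes first
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Project the data onto the first two principal components
manual_scores = X_std @ eigenvectors[:, :2]

# scikit-learn should agree up to the sign of each component
sklearn_scores = PCA(n_components=2).fit_transform(X_std)
print(np.allclose(np.abs(manual_scores), np.abs(sklearn_scores)))
```

The comparison uses absolute values because the direction of each principal component is only defined up to a sign flip.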

### How Can Dimensionality Reduction Be Used?

Dimensionality reduction can speed up the training of a machine learning model. For example, suppose you want to train a neural network that you built in TensorFlow, and your data set has 1000 features and a very large number of rows. PCA can compress those 1000 features down to 5 components, or however many you choose. The first PC captures the largest share of the variance in the data and each later PC captures less, so the last PCs often hold mostly noise.
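
One way to see how the information is spread across the components is to look at the share of variance each one explains. The sketch below does this on the standardized Iris data with scikit-learn's `explained_variance_ratio_` attribute; the names `pca_full`, `ratios`, and `cumulative` are just illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_std = StandardScaler().fit_transform(load_iris().data)

# Keep all four components so every share of the variance is visible
pca_full = PCA().fit(X_std)

ratios = pca_full.explained_variance_ratio_  # share of variance per component
cumulative = np.cumsum(ratios)               # running total across components

for i, (r, c) in enumerate(zip(ratios, cumulative), start=1):
    print(f"PC{i}: {r:.3f} of the variance, {c:.3f} cumulative")
```

On this data the first two components together account for well over 90% of the variance, which is why a plot of PC1 against PC2 already separates the species fairly well.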

In [1]:
import pandas as pd
import plotly.express as px
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris

In [2]:
# Loading the Iris Dataset
iris = load_iris()
iris_features = iris.feature_names
iris_label = iris.target_names
print("Features:", iris_features)
print("Labels:", iris_label)

Features: ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Labels: ['setosa' 'versicolor' 'virginica']


In [3]:
# Assigning the features and labels to X and y
X = iris.data
y = iris.target

In [4]:
# Exploring the Dimensionality of the Data
print("Shape of X:", X.shape)
print("Shape of y:", y.shape)

Shape of X: (150, 4)
Shape of y: (150,)


In [5]:
# Performing the PCA
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
pca = PCA(n_components=3)
pca.fit(X_scaled)
Scores = pca.transform(X_scaled)
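
The cell above fits the scaler and the PCA on all 150 rows, which is fine for exploring the data. If the components are later fed into a model, a common precaution is to fit both steps on the training split only and then apply them to the test split. The sketch below shows one way to do that with `train_test_split` and a scikit-learn pipeline; the `test_size`, `random_state`, and variable names are assumptions for illustration and do not come from the notebook.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# test_size and random_state are illustrative choices, not from the notebook
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Chain scaling and PCA so both are fitted on the training rows only
reducer = make_pipeline(StandardScaler(), PCA(n_components=3))
train_scores = reducer.fit_transform(X_train)

# The test rows are only transformed, using the statistics learned above
test_scores = reducer.transform(X_test)

print(train_scores.shape, test_scores.shape)
```

Keeping the two steps in one pipeline means the test data never influences the scaling statistics or the component directions.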

In [6]:
# Making a DataFrame of the scores of the PCA
df = pd.DataFrame(Scores, columns = ['PC1', 'PC2', 'PC3'])
df.head()

| | PC1 | PC2 | PC3 |
|---|---|---|---|
| 0 | -2.264703 | 0.480027 | 0.127706 |
| 1 | -2.080961 | -0.674134 | 0.234609 |
| 2 | -2.364229 | -0.341908 | -0.044201 |
| 3 | -2.299384 | -0.597395 | -0.09129 |
| 4 | -2.389842 | 0.646835 | -0.015738 |


In [7]:
# Adding the Target Column to the DataFrame
# Map each integer class code (0, 1, 2) to its species name
species_names = {0: 'Setosa', 1: 'Versicolor', 2: 'Virginica'}
df['Species'] = [species_names[i] for i in y]
df.head()

| | PC1 | PC2 | PC3 | Species |
|---|---|---|---|---|
| 0 | -2.264703 | 0.480027 | 0.127706 | Setosa |
| 1 | -2.080961 | -0.674134 | 0.234609 | Setosa |
| 2 | -2.364229 | -0.341908 | -0.044201 | Setosa |
| 3 | -2.299384 | -0.597395 | -0.09129 | Setosa |
| 4 | -2.389842 | 0.646835 | -0.015738 | Setosa |


### PCA Complete!!!