Implementing PCA in Python with Scikit-Learn
------------------------------------------------------------------

With high-performance CPUs and GPUs widely available, most regression, classification, and clustering problems can be tackled with machine learning and deep learning models. However, several factors still cause performance bottlenecks when developing such models.

A large number of features in the dataset is one such factor: it affects both the training time and the accuracy of machine learning models. There are several options for dealing with a large number of features in a dataset.

1. Train the models on the original set of features, which can take days or weeks if the number of features is very high.

2. Reduce the number of variables by merging correlated variables.

3. __Extract a smaller set of features that accounts for most of the variance in the data.__ Different statistical techniques are used for this purpose, e.g. linear discriminant analysis, factor analysis, and principal component analysis (PCA).

About Principal Component Analysis
-----------------------------------------------------
Principal component analysis, or PCA, is a statistical technique that converts high-dimensional data to low-dimensional data by constructing a small number of new features, called principal components, that capture as much of the information in the dataset as possible. Each principal component is a linear combination of the original features, and the components are ranked by the amount of variance in the data that they capture. The direction that captures the highest variance is the first principal component.

The direction that captures the second-highest variance, while being orthogonal to the first, is the second principal component, and so on. It is important to mention that principal components are uncorrelated with each other.
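
As a rough illustration of the claim that principal components are ordered by variance and uncorrelated with each other, the sketch below computes them by hand with NumPy on made-up data. The synthetic dataset and all variable names are assumptions for illustration only; scikit-learn's `PCA` class, used later in this notebook, does the equivalent work internally.

```python
import numpy as np

# Synthetic data: 200 samples, 3 features, two of them strongly correlated.
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
X = np.hstack([
    base + 0.1 * rng.normal(size=(200, 1)),
    2.0 * base + 0.1 * rng.normal(size=(200, 1)),
    rng.normal(size=(200, 1)),
])

# Center the data, then eigendecompose its covariance matrix.
X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)

# eigh returns eigenvalues in ascending order; put the largest variance first.
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Project the centered data onto the principal directions.
Z = X_centered @ eigvecs

# The covariance of the projected data is diagonal: the components are
# uncorrelated, and their variances are the eigenvalues in decreasing order.
cov_Z = np.cov(Z, rowvar=False)
off_diagonal = cov_Z - np.diag(np.diag(cov_Z))
print(np.round(cov_Z, 6))
```

The off-diagonal entries of the printed matrix are zero up to floating-point error, which is what "uncorrelated" means here.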

Advantages of PCA
----------------------------
There are two main advantages of dimensionality reduction with PCA.

1. The training time of algorithms is reduced significantly with fewer features.

2. It is not always possible to analyze data in high dimensions. For instance, if a dataset has 100 features, the total number of pairwise scatter plots required to visualize the data would be (100(100-1))/2 = 4950. Analyzing data this way is not practical.
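
The scatter plot count above is simply the number of ways to choose 2 features out of n. A small sketch to check the arithmetic (the helper name `pairwise_plot_count` is made up for this example):

```python
from math import comb


def pairwise_plot_count(n_features: int) -> int:
    """Number of distinct feature pairs, i.e. n * (n - 1) / 2."""
    return comb(n_features, 2)


print(pairwise_plot_count(100))  # 4950
print(pairwise_plot_count(4))    # 6, e.g. the four iris measurements
```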

Normalization of Features
-------------------------------------

__It is imperative to mention that a feature set must be normalized before applying PCA__. For instance, if a feature set has data expressed in units of kilograms, light years, or millions, the features vary on very different scales. If PCA is applied to such a feature set, the resulting loadings for features with high variance will also be large. Hence, the principal components will be biased towards features with high variance, leading to misleading results.
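
To make the effect of scale concrete, here is a minimal sketch on made-up data: two independent features, one of which has a standard deviation 1000 times larger than the other. The data and variable names are assumptions for illustration; `StandardScaler` is the same scaler used later in this notebook.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Two independent features on very different scales (made-up units).
small = rng.normal(loc=0.0, scale=1.0, size=500)
large = rng.normal(loc=0.0, scale=1000.0, size=500)
X = np.column_stack([small, large])

# PCA on the raw data: the large-scale feature dominates the first component.
pca_raw = PCA(n_components=2).fit(X)

# PCA after standardizing each feature to zero mean and unit variance.
pca_scaled = PCA(n_components=2).fit(StandardScaler().fit_transform(X))

print("raw    :", pca_raw.explained_variance_ratio_)
print("scaled :", pca_scaled.explained_variance_ratio_)
print("first raw component loadings:", pca_raw.components_[0])
```

On the raw data, almost all of the variance is attributed to the first component, which points along the large-scale feature. After scaling, the two features contribute on roughly equal terms.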

Finally, the last point to remember before we start coding is that __PCA is a statistical technique and can only be applied to numeric data. Therefore, categorical features must be converted into numerical features before PCA can be applied.__
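
One common way to convert a categorical column to numbers is one-hot encoding. The sketch below uses pandas' `get_dummies` on a tiny made-up frame; the column names and values are hypothetical. Whether one-hot columns are a good input for PCA depends on the data, so treat this as one possible approach rather than a recommendation.

```python
import pandas as pd

# Hypothetical frame with one numeric and one categorical feature.
df = pd.DataFrame({
    "weight_kg": [1.2, 3.4, 2.2, 5.1],
    "colour": ["red", "green", "red", "blue"],
})

# Replace the categorical column with one 0/1 indicator column per category.
encoded = pd.get_dummies(df, columns=["colour"], dtype=float)

print(encoded.columns.tolist())
print(encoded.dtypes)
```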

Important Note
>>In PCA, if we have 10 features, it works on these 10 features and makes 10 new features, each of which is a weighted combination (a linear combination) of all 10 old features. Note that the weights are different for each new feature.
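
The weights mentioned in the note are available in scikit-learn as the `components_` attribute of a fitted `PCA` object. The sketch below, which assumes the copy of the iris data bundled with scikit-learn rather than the seaborn one loaded later, shows that each new feature is the centered data multiplied by one row of weights.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardize the four iris measurements, then fit PCA with all components.
X = StandardScaler().fit_transform(load_iris().data)
pca = PCA().fit(X)

# One row of weights per new feature, one column per original feature.
print(pca.components_.shape)

# Reproduce pca.transform(X) by hand: center, then apply the weights.
manual = (X - pca.mean_) @ pca.components_.T
print(np.allclose(manual, pca.transform(X)))

# Each weight vector has unit length; the weights need not sum to 1.
print(np.linalg.norm(pca.components_, axis=1))
```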

PCA Main ideas:
--

https://www.youtube.com/watch?v=HMOI_lkzW08


For a more detailed understanding:
https://www.youtube.com/watch?v=FgakZw6K1QQ&t=759s

Implementing PCA with Scikit-Learn
---------------------------------------------------

In [1]:
# Importing Libraries
import seaborn as sns
import numpy as np
import pandas as pd  
import matplotlib.pyplot as plt
from sklearn import svm, datasets

# Load the iris dataset bundled with seaborn
irisdata = sns.load_dataset('iris')
irisdata.head()

,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa


In [4]:
irisdata.tail()

,sepal_length,sepal_width,petal_length,petal_width,species
145,6.7,3.0,5.2,2.3,virginica
146,6.3,2.5,5.0,1.9,virginica
147,6.5,3.0,5.2,2.0,virginica
148,6.2,3.4,5.4,2.3,virginica
149,5.9,3.0,5.1,1.8,virginica


In [2]:
irisdata.sample(5)

,sepal_length,sepal_width,petal_length,petal_width,species
71,6.1,2.8,4.0,1.3,versicolor
101,5.8,2.7,5.1,1.9,virginica
70,5.9,3.2,4.8,1.8,versicolor
137,6.4,3.1,5.5,1.8,virginica
62,6.0,2.2,4.0,1.0,versicolor


In [5]:
irisdata.shape

(150, 5)

In [6]:
# Separate the four measurements (features) from the species label (target)
X = irisdata.drop('species', axis=1)
y = irisdata['species']

# Train Test Split -> use train_test_split()
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size = 0.20)

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
print(X_train)

[[ 0.94136185 -1.25182181  1.07945685  0.70079785]
 [-0.16733041 -0.78312098  0.67484116  0.83386073]
 [ 0.07904565 -1.95487306  0.03901651 -0.36370521]
 [-0.16733041 -0.78312098  0.09681875 -0.36370521]
 [-1.02964661  1.09168236 -1.46384175 -1.42820827]
 [-1.89196281 -0.31442014 -1.46384175 -1.42820827]
 [ 1.06454988 -0.08006973  0.90605013  1.0999865 ]
 [-1.52239872  0.38863111 -1.521644   -1.42820827]
 [-0.29051843 -1.25182181  0.61703892  0.96692361]
 [-0.90645858  2.49778486 -1.40603951 -1.56127116]
 [-1.27602266  1.32603278 -1.46384175 -1.56127116]
 [ 1.557302   -0.08006973  1.07945685  0.43467208]
 [ 1.31092594  0.38863111  0.4436322   0.16854632]
 [ 0.6949858  -0.54877056  0.38582996  0.3016092 ]
 [ 1.06454988 -0.54877056  0.50143444  0.16854632]
 [ 2.4196182   1.79473361  1.4262703   0.96692361]
 [-0.66008252 -0.08006973  0.32802772  0.3016092 ]
 [ 1.80367805 -0.54877056  1.25286357  0.83386073]
 [-1.64558675  0.38863111 -1.46384175 -1.42820827]
 [ 0.44860974 -1.25182181  0.55

In [7]:
from sklearn.decomposition import PCA
pca = PCA()
X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)

In the code above, we create a PCA object named pca. We did not specify the number of components in the constructor. Hence, all four principal components are kept for both the training and test sets.

The PCA class exposes the explained_variance_ratio_ attribute, which holds the fraction of the total variance explained by each principal component.

In [9]:
explained_variance = pca.explained_variance_ratio_
explained_variance

array([0.72717031, 0.22940862, 0.03774842, 0.00567264])

Note: these numbers can differ for each participant, depending on the train-test split.
It can be seen that the first principal component is responsible for 72.72% of the variance. Similarly, the second principal component accounts for 22.94% of the variance in the dataset. Collectively, (72.72 + 22.94) 95.66% of the variance in the feature set is captured by the first two principal components.
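
The cumulative share can be computed directly from the ratios. The sketch below copies the values printed above (a different split gives different values) and picks the smallest number of components that reaches a threshold; the 95% threshold and the variable names are assumptions for illustration.

```python
import numpy as np

# Ratios copied from the output above; they depend on the train-test split.
explained_variance = np.array([0.72717031, 0.22940862, 0.03774842, 0.00567264])

# Running total of variance explained by the first k components.
cumulative = np.cumsum(explained_variance)
print(np.round(cumulative, 4))

# Smallest number of components whose cumulative share reaches the threshold.
threshold = 0.95  # assumed cut-off, not prescribed by the notebook
n_components = int(np.searchsorted(cumulative, threshold) + 1)
print(n_components)  # 2
```

scikit-learn's PCA can also make this selection itself when n_components is given as a fraction between 0 and 1, for example `PCA(n_components=0.95)`.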

In [10]:
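from sklearn.decomposition import PCA
# X_train already holds the four principal components from In [7],
# so refitting with n_components=2 keeps the two with the highest variance.
pca = PCA(n_components=2)
X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)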

In [11]:
explained_variance = pca.explained_variance_ratio_
explained_variance

array([0.72717031, 0.22940862])

In [12]:
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(max_depth=2, random_state=0)
classifier.fit(X_train,y_train)

y_pred = classifier.predict(X_test)

from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test,y_pred)
ac = accuracy_score(y_test,y_pred)

In [13]:
cm

array([[15,  0,  0],
       [ 0,  7,  1],
       [ 0,  4,  3]], dtype=int64)

In [14]:
ac

0.8333333333333334

In [15]:
from sklearn.decomposition import PCA
# X_train now holds two components; this keeps only the first one.
pca = PCA(n_components=1)
X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)

In [16]:
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(max_depth=2, random_state=0)
classifier.fit(X_train,y_train)

y_pred = classifier.predict(X_test)

from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test,y_pred)
ac = accuracy_score(y_test,y_pred)

In [17]:
cm

array([[15,  0,  0],
       [ 0,  8,  0],
       [ 0,  3,  4]], dtype=int64)

In [18]:
ac

0.9

In [20]:
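from sklearn.decomposition import PCA
# Caution: X_train was reduced to one component in In [15], so PCA() here
# still yields a single component and the results below repeat In [17]-[18].
# To evaluate all four components, re-run In [6] before fitting PCA().
pca = PCA()
X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)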

In [21]:
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(max_depth=2, random_state=0)
classifier.fit(X_train,y_train)

y_pred = classifier.predict(X_test)

from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test,y_pred)
ac = accuracy_score(y_test,y_pred)

In [22]:
cm

array([[15,  0,  0],
       [ 0,  8,  0],
       [ 0,  3,  4]], dtype=int64)

In [23]:
ac

0.9

It can be seen from the output that with only one principal component, the random forest algorithm is able to correctly predict 27 out of 30 instances, resulting in 90% accuracy. (This accuracy value will change when the train-test split changes.)

In [None]:
# Results with 2 and 3 Principal Components
#------------------------------------------
# Exercise: try this yourself, starting from the top of this notebook.
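
One possible way to approach this exercise is sketched below. It is not the notebook's own solution: it assumes the copy of the iris data bundled with scikit-learn, fixes `random_state=0` for the split so that reruns are repeatable, and wraps scaling, PCA, and the classifier in a pipeline so that the scaled data is never overwritten between runs. The `scores` dictionary is a name made up for this example.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0
)

scores = {}
for k in (1, 2, 3, 4):
    # Each pipeline scales and reduces the original data from scratch,
    # so results for different k do not depend on earlier runs.
    model = make_pipeline(
        StandardScaler(),
        PCA(n_components=k),
        RandomForestClassifier(max_depth=2, random_state=0),
    )
    model.fit(X_train, y_train)
    scores[k] = accuracy_score(y_test, model.predict(X_test))
    print(f"{k} component(s): accuracy = {scores[k]:.4f}")
```

Because the pipeline fits the scaler and PCA on the training data only, the test set is transformed with the same parameters, just as in the cells above.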