 ## <div align="center">  Principal Component Analysis in Python </div>
 <div align="center">**quite practical and far from any theoretical concepts**</div>
<div style="text-align:center">last update: <b>10/16/2018</b></div>



---------------------------------------------------------------------
Fork and Run this kernel on GitHub:
> ###### [ GitHub](https://github.com/mjbahmani/10-steps-to-become-a-data-scientist)


-------------------------------------------------------------------------------------------------------------
 **I hope you find this kernel helpful and some UPVOTES would be very much appreciated**
 
 -----------


 <a id="0"></a> <br>
**Notebook Content**
1. [Introduction](#1)
1. [What is the PCA Approach?](#2)
1. [How to Calculate Principal Component Analysis?](#3)
1. [Reusable Principal Component Analysis](#4)
1. [References](#5)


 <a id="1"></a> <br>
## 1- Introduction

The sheer size of data in the modern age is not only a challenge for computer hardware but also a main bottleneck for the performance of many machine learning algorithms. The main goal of PCA is to identify patterns in data by detecting the correlation between variables. Reducing the dimensionality only makes sense if a strong correlation between variables exists. In a nutshell, this is what PCA is all about: finding the directions of maximum variance in high-dimensional data and projecting the data onto a lower-dimensional subspace while retaining most of the information.[3]


 <a id="2"></a> <br>
## 2- What is the PCA Approach?

1. Standardize the data.
1. Obtain the eigenvectors and eigenvalues from the covariance matrix or correlation matrix, or perform Singular Value Decomposition.
1. Sort the eigenvalues in descending order and choose the k eigenvectors that correspond to the k largest eigenvalues, where k is the number of dimensions of the new feature subspace and d is the number of original features (k ≤ d).
1. Construct the projection matrix W from the selected k eigenvectors.
1. Transform the original dataset X via W to obtain a k-dimensional feature subspace Y.
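
The five steps above can be sketched directly in NumPy. This is only a minimal illustration, not a reference implementation: the helper name `pca_steps`, the random test data, and the choice of `k=2` are all assumptions made for the example, and the sketch assumes no column is constant (otherwise standardizing would divide by zero).

```python
import numpy as np

def pca_steps(X, k):
    """Hypothetical helper that follows the five steps listed above."""
    # 1. Standardize: zero mean and unit (sample) variance per column
    X_std = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # 2. Eigendecomposition of the covariance matrix
    #    (eigh is suitable because a covariance matrix is symmetric)
    cov = np.cov(X_std, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    # 3. Sort eigenvalues in descending order and keep the k largest
    order = np.argsort(eigvals)[::-1][:k]
    # 4. Build the d x k projection matrix W from the selected eigenvectors
    W = eigvecs[:, order]
    # 5. Project the standardized data onto the k-dimensional subspace
    Y = X_std @ W
    return Y, W, eigvals[order]

# Made-up data: 100 samples, 3 features, third feature correlated with the first
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X[:, 2] = X[:, 0] + 0.1 * rng.normal(size=100)

Y, W, top_eigvals = pca_steps(X, k=2)
print(Y.shape)      # samples x k
print(W.shape)      # d x k
print(top_eigvals)  # variances along the kept components
```

Because the columns of W are orthonormal eigenvectors, the projected features in Y are uncorrelated, and their variances equal the selected eigenvalues.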

 <a id="3"></a> <br>
## 3- How to Calculate Principal Component Analysis?
There is no pca() function in NumPy, but we can easily calculate the Principal Component Analysis step-by-step using NumPy functions.

The example below defines a small 3×2 matrix, centers the data in the matrix, calculates the covariance matrix of the centered data, and then the eigendecomposition of the covariance matrix. The eigenvectors are taken as the principal components, the eigenvalues give the variance along each component, and the eigenvectors are then used to project the original data.[2]

In [None]:
from numpy import array
from numpy import mean
from numpy import cov
from numpy.linalg import eig
# define a matrix
A = array([[1, 2], [3, 4], [5, 6]])
print(A)
# calculate the mean of each column
M = mean(A, axis=0)
print(M)
# center columns by subtracting column means
C = A - M
print(C)
# calculate covariance matrix of centered matrix
V = cov(C.T)
print(V)
# eigendecomposition of covariance matrix
values, vectors = eig(V)
print(vectors)
print(values)
# project data
P = vectors.T.dot(C.T)
print(P.T)

Running the example first prints the original matrix, followed by the column means, the centered matrix, and its covariance matrix, then the eigenvectors and eigenvalues of that covariance matrix, and finally the projection of the original matrix.

Interestingly, the second eigenvalue is zero, so only the first eigenvector is required: we could project our 3×2 matrix onto a 3×1 matrix with no loss of information for this dataset.
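
To make the point about the first eigenvector concrete, the sketch below computes the share of variance carried by each eigenvalue and projects the same matrix onto the first component only. It is an illustrative variation on the example above, not part of the original tutorial: it uses `eigh` with an explicit sort (an assumption made here so the ordering is guaranteed), and the variable names `ratio` and `reconstructed` are introduced only for this example.

```python
import numpy as np

A = np.array([[1, 2], [3, 4], [5, 6]])
mean = A.mean(axis=0)
C = A - mean                          # centered data
V = np.cov(C, rowvar=False)           # 2 x 2 covariance matrix

# eigh returns eigenvalues in ascending order, so sort them descending
values, vectors = np.linalg.eigh(V)
order = np.argsort(values)[::-1]
values, vectors = values[order], vectors[:, order]

# fraction of the total variance explained by each component
ratio = values / values.sum()
print(ratio)

# keep only the first eigenvector: 3 x 2 data becomes 3 x 1
W = vectors[:, :1]
P = C @ W
print(P)

# map the 1-D projection back to the original space and compare
reconstructed = P @ W.T + mean
print(np.allclose(reconstructed, A))
```

The sign of an eigenvector is arbitrary, so the projected values may come out negated from one run or library to another; the reconstruction is unaffected.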

 <a id="4"></a> <br>
## 4- Reusable Principal Component Analysis
We can calculate a Principal Component Analysis on a dataset using the PCA() class in the scikit-learn library. The benefit of this approach is that once the projection is calculated, it can be applied to new data again and again quite easily.

When creating the class, the number of components can be specified as a parameter.

The class is first fit on a dataset by calling the fit() method, and then the original dataset or other data can be projected into a subspace with the chosen number of dimensions by calling the transform() method.

Once fit, the eigenvalues and principal components can be accessed on the PCA instance via the explained_variance_ and components_ attributes. Each row of components_ is one principal component.

The example below demonstrates using this class by first creating an instance, fitting it on a 3×2 matrix, accessing the values and vectors of the projection, and transforming the original data.

In [None]:
# Principal Component Analysis
from numpy import array
from sklearn.decomposition import PCA
# define a matrix
A = array([[1, 2], [3, 4], [5, 6]])
print(A)
# create the PCA instance
pca = PCA(n_components=2)
# fit on data
pca.fit(A)
# access values and vectors
print(pca.components_)
print(pca.explained_variance_)
# transform data
B = pca.transform(A)
print(B)
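
The reuse described at the start of this section can be sketched as follows: fit once, then apply the same projection to data the model has not seen. The `new_data` values are made up for illustration, and reducing to a single component is a choice made here, not something the original example does.

```python
import numpy as np
from sklearn.decomposition import PCA

A = np.array([[1, 2], [3, 4], [5, 6]])

# fit once on the original data, keeping a single component
pca = PCA(n_components=1)
pca.fit(A)

# hypothetical new observations, projected with the already-fitted model
new_data = np.array([[2, 3], [7, 8]])
projected = pca.transform(new_data)
print(projected)

# share of the training variance captured by the kept component
print(pca.explained_variance_ratio_)

# map the projection back to the original 2-D space
restored = pca.inverse_transform(projected)
print(restored)
```

Here the new points happen to lie on the same line as the training data, so inverse_transform() recovers them exactly; in general some information is lost when components are dropped.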

 <a id="5"></a> <br>
## 5- References
1. [scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html)
2. [calculate-principal-component-analysis-scratch-python](https://machinelearningmastery.com/calculate-principal-component-analysis-scratch-python/)
3. [plot.ly](https://plot.ly/ipython-notebooks/principal-component-analysis/)

