### PCA (Principal Component Analysis)

1) It is an unsupervised learning algorithm used for dimensionality reduction.<br>
2) It reduces the number of features (columns).<br>
3) It can serve as a data cleaning or data preprocessing technique applied before other ML algorithms.<br>
4) After applying PCA, we can therefore apply regression, classification, or even clustering.<br>
5) The retained principal components can reduce noise by compressing a large number of features into just a few components. Principal components are orthogonal projections of the data onto a lower-dimensional space.<br>
6) In theory, PCA produces as many principal components as there are features in the training dataset. In practice, we do not keep all of them. Each successive principal component explains as much as possible of the variance left over by the preceding components, so the first few components are usually enough to approximate the original dataset.<br>

### PCA Steps

#### 1) Standardization (StandardScaler)
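Standardized value (z) = (Actual Value - Mean) / Standard Deviation, computed separately for each feature so that every feature ends up with mean 0 and standard deviation 1.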
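
The formula above can be sketched in NumPy as follows. The toy array `X` is hypothetical, and the population standard deviation (`ddof=0`) is an assumption chosen here for illustration.

```python
import numpy as np

# Hypothetical toy data: 5 samples (rows) x 2 features (columns)
X = np.array([[170.0, 65.0],
              [160.0, 58.0],
              [180.0, 80.0],
              [175.0, 72.0],
              [165.0, 60.0]])

mean = X.mean(axis=0)        # per-feature mean
std = X.std(axis=0)          # per-feature population standard deviation (ddof=0, an assumption)
X_std = (X - mean) / std     # z = (value - mean) / standard deviation

print(X_std.mean(axis=0))    # each column mean is approximately 0
print(X_std.std(axis=0))     # each column standard deviation is approximately 1
```

Standardizing per column matters because PCA is driven by variance: a feature measured in large units would otherwise dominate the components.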

#### 2) Covariance Matrix Computation
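a) In the second step, PCA computes the covariance matrix of the standardized dataset.<br>
b) Covariance describes the direction of the linear relationship between 2 variables: it is positive when they tend to increase together and negative when one tends to decrease as the other increases.<br>
c) The covariance matrix is a square matrix whose shape depends on the number of variables: n variables give an n x n matrix.<br>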

<img src="pca1.png" align="left" width="350">
<img src="pca2.png" align="middle" width="400">

<pre>
Covariance matrix C for 2 variables x and y
C = [cov(x,x)  cov(x,y)]
    [cov(y,x)  cov(y,y)]

cov(x,x) is the variance of x, and cov(x,y) = cov(y,x), so C is symmetric.
</pre>
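
As a rough illustration of the covariance matrix, the sketch below builds two hypothetical, positively related variables and passes them to `np.cov`. The data and the seed are made up for the example; `rowvar=False` tells NumPy that the columns are the variables.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 0.8 * x + 0.2 * rng.normal(size=200)   # y rises with x, so cov(x, y) should be positive

data = np.column_stack([x, y])             # shape (200, 2): samples as rows, variables as columns
data_std = (data - data.mean(axis=0)) / data.std(axis=0)

cov_matrix = np.cov(data_std, rowvar=False)   # 2 variables -> 2 x 2 matrix
print(cov_matrix.shape)
print(cov_matrix)
```

Note that `np.cov` divides by N - 1 by default, so the diagonal entries of the standardized data come out slightly above 1.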

#### Eigen Values and Eigen Vectors

<b>Eigen Vector</b><br>
An eigenvector of a square matrix A is a nonzero vector v
such that Av = λv holds for some scalar λ.

<b>Eigen Value</b><br>
An eigenvalue of a square matrix A is a scalar λ such that Av = λv has a non-trivial (nonzero) solution v.
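
The two definitions can be checked numerically. This is a minimal sketch on a hypothetical 2 x 2 symmetric matrix whose eigenvalues are 1 and 3; it only verifies the defining equation Av = λv for each pair returned by `np.linalg.eig`.

```python
import numpy as np

# Hypothetical symmetric matrix, chosen only for illustration
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

eig_val, eig_vec = np.linalg.eig(A)

for i in range(len(eig_val)):
    lam = eig_val[i]
    v = eig_vec[:, i]                        # eigenvectors are the COLUMNS of eig_vec
    print(lam, np.allclose(A @ v, lam * v))  # checks Av = λv

# Any nonzero multiple of an eigenvector is still an eigenvector
v0 = eig_vec[:, 0]
print(np.allclose(A @ (5 * v0), eig_val[0] * (5 * v0)))
```

Because an eigenvector is only defined up to scale, NumPy returns each one normalized to unit length.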


#### 3) Compute the eigenvectors and eigenvalues of the covariance matrix to identify the principal components

a) Eigenvectors and eigenvalues are the linear algebra quantities that PCA computes from the covariance matrix in order to determine the principal components of the data.<br>
b) Principal components are new variables constructed as linear combinations (mixtures) of the initial variables. The combinations are chosen so that the new variables (the principal components) are uncorrelated and most of the information in the initial variables is compressed into the first components. For example, 20-dimensional data gives 20 principal components, but PCA puts the maximum possible information in the first component, the maximum remaining information in the second, and so on.<br>
c) Organizing information this way lets you reduce dimensionality without losing much information: discard the components that carry little information and treat the remaining components as your new variables.

#### How PCA constructs Principal Components
a) There are as many principal components as there are variables in the data. They are constructed so that the first principal component accounts for the largest possible variance in the dataset.<br>
b) Every eigenvalue has a corresponding eigenvector, and the number of such pairs equals the number of dimensions of the data. For example, a 3-dimensional dataset has 3 variables, so there are 3 eigenvalues with 3 corresponding eigenvectors.<br>
c) <b>The eigenvectors of the covariance matrix are the directions of the axes along which there is the most variance (most information); these are what we call principal components. The eigenvalues are the coefficients attached to the eigenvectors, and they give the amount of variance carried by each principal component.</b><br>

d) <b>By ranking the eigenvectors in order of their eigenvalues, highest to lowest (descending order), you get the principal components in order of significance.</b><br>
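
The ranking described in the points above can be sketched as follows. The covariance matrix is hypothetical, and `np.linalg.eigh` is used here as an assumption of this sketch because a covariance matrix is symmetric; it returns eigenvalues in ascending order, so they are reversed explicitly.

```python
import numpy as np

# Hypothetical symmetric covariance matrix for 3 standardized features
cov_matrix = np.array([[1.0, 0.8, 0.1],
                       [0.8, 1.0, 0.2],
                       [0.1, 0.2, 1.0]])

eig_val, eig_vec = np.linalg.eigh(cov_matrix)   # ascending eigenvalues, eigenvectors as columns

order = np.argsort(eig_val)[::-1]               # indices from largest to smallest eigenvalue
eig_val = eig_val[order]
eig_vec = eig_vec[:, order]                     # reorder the columns to match

explained_ratio = eig_val / eig_val.sum()       # share of total variance per component
print(eig_val)
print(explained_ratio)
```

Reordering the columns of `eig_vec` together with `eig_val` keeps each eigenvector paired with its own eigenvalue.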


#### 4) Feature Vector
a) Computing the eigenvectors and ordering them by their eigenvalues in descending order gives the principal components in order of significance. In this step we choose whether to keep all of these components or to discard those of lesser significance (low eigenvalues), and we form a matrix from the remaining eigenvectors, which we call the feature vector.<br>
b) The feature vector is therefore simply a matrix whose columns are the eigenvectors of the components we decide to keep. Choosing it is the first step towards dimensionality reduction.
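
One possible way to decide how many components to keep is a cumulative-variance threshold. The sketch below assumes a threshold of 0.75 and uses made-up eigenvalues with a placeholder identity matrix standing in for the eigenvectors; none of these values come from real data.

```python
import numpy as np

# Hypothetical eigenvalues, already sorted in descending order
eig_val = np.array([2.5, 1.0, 0.4, 0.1])
# Placeholder orthonormal eigenvectors (as columns), for illustration only
eig_vec = np.eye(4)

threshold = 0.75                                   # assumed share of variance to retain
cumulative = np.cumsum(eig_val) / eig_val.sum()    # running share of total variance
# Smallest k such that the first k components reach the threshold
k = int(np.searchsorted(cumulative, threshold) + 1)

feature_vector = eig_vec[:, :k]                    # keep the first k eigenvectors as columns
print(cumulative)
print(k, feature_vector.shape)
```

Here the first component alone carries 62.5% of the variance, so two components are needed to pass 75%.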


#### 5) Recast the Data Along the Principal Components Axes

a) In the last step, the aim is to use the feature vector formed from the eigenvectors of the covariance matrix to reorient the data from the original axes to the axes represented by the principal components (hence the name Principal Component Analysis).<br>
<b>b) FinalDataSet = StandardizedOriginalDataSet * FeatureVector</b><br>
#### OR
<b>b) FinalDataSet = np.dot(StandardizedOriginalDataSet, FeatureVector)</b>
<br>
where <br>
StandardizedOriginalDataSet = standardized data with one sample per row<br>
FeatureVector = matrix whose columns are the kept eigenvectors (step 4)<br>
FinalDataSet = resulting principal components after dimensionality reduction<br>
If the eigenvectors are stored as rows instead, multiply by FeatureVector^T (T = transpose).
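
The projection step can be sketched as below. The random data is hypothetical, and the sketch assumes samples are stored as rows and the kept eigenvectors as columns of `feature_vector`, so the shapes line up as (100, 3) · (3, 2); if the eigenvectors were stored as rows, a transpose would be needed.

```python
import numpy as np

rng = np.random.default_rng(42)
# Hypothetical data: 100 samples, 3 features, with feature 1 strongly tied to feature 0
X = rng.normal(size=(100, 3))
X[:, 1] = X[:, 0] + 0.1 * X[:, 1]

X_std = (X - X.mean(axis=0)) / X.std(axis=0)
cov_matrix = np.cov(X_std, rowvar=False)

eig_val, eig_vec = np.linalg.eigh(cov_matrix)
order = np.argsort(eig_val)[::-1]
eig_val, eig_vec = eig_val[order], eig_vec[:, order]

feature_vector = eig_vec[:, :2]                # shape (3, 2): top-2 eigenvectors as columns
final_data = np.dot(X_std, feature_vector)     # (100, 3) . (3, 2) -> (100, 2)

print(final_data.shape)
# Covariance of the projected data: off-diagonal entries are approximately 0
print(np.cov(final_data, rowvar=False).round(6))
```

The near-zero off-diagonal entries reflect that the new variables are uncorrelated, and the diagonal holds the two largest eigenvalues.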

#### Summary of PCA Steps
1) Standardize the dataset.<br>
2) Compute the covariance matrix of the standardized data.<br>
3) Compute the eigenvalues and eigenvectors of the covariance matrix. Arrange the eigenvectors (the directions of the axes with maximum variance) in decreasing order of their eigenvalues (variance).<br>
4) Select a threshold for the cumulative share of variance to retain (for example, 75% or more). Keep the smallest number of leading eigenvectors whose eigenvalues sum to at least that share of the total; these form the feature vector.<br>
5) Reorient the standardized data along the principal components:<br>
Principal components = np.dot(Standardized dataset, Feature_vector)<br>
where Feature_vector holds the kept eigenvectors as columns. If the eigenvectors are stored as rows instead, use Feature_vector.T (T = transpose).
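
Putting the five steps together, a minimal end-to-end sketch might look like this. The function name `pca_numpy`, the default threshold of 0.75, and the synthetic data are all assumptions made for illustration; this is not a library API.

```python
import numpy as np

def pca_numpy(X, threshold=0.75):
    """Illustrative PCA via eigen-decomposition (hypothetical helper, not a library API)."""
    # 1) Standardize each feature
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)
    # 2) Covariance matrix of the standardized data
    cov_matrix = np.cov(X_std, rowvar=False)
    # 3) Eigen-decomposition, sorted by descending eigenvalue
    eig_val, eig_vec = np.linalg.eigh(cov_matrix)
    order = np.argsort(eig_val)[::-1]
    eig_val, eig_vec = eig_val[order], eig_vec[:, order]
    # 4) Keep the fewest components whose cumulative variance share reaches the threshold
    cumulative = np.cumsum(eig_val) / eig_val.sum()
    k = int(np.searchsorted(cumulative, threshold) + 1)
    feature_vector = eig_vec[:, :k]
    # 5) Project the standardized data onto the kept components
    return X_std @ feature_vector, cumulative[:k]

rng = np.random.default_rng(1)
base = rng.normal(size=(200, 2))
# Hypothetical data: 4 features built from 2 underlying signals plus small noise
X = np.column_stack([base[:, 0], base[:, 0], base[:, 1], base[:, 1]])
X = X + 0.05 * rng.normal(size=(200, 4))

components, cumulative = pca_numpy(X, threshold=0.75)
print(components.shape)
print(cumulative)
```

Since the four features here are driven by only two signals, two components are enough to pass the threshold.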

#### Computing Eigen Values and Eigen Vectors

<pre>
import numpy as np  # numpy is aliased as np
</pre>

<pre>
a = np.array([[2,0,0],[0,4,5],[0,4,3]])
print(a)
print(a.shape) # rows=3, cols=3
print(a.ndim)  # 2
</pre>

Output:
<pre>
[[2 0 0]
 [0 4 5]
 [0 4 3]]
(3, 3)
2
</pre>

<pre>
# eig returns the eigenvalues and a matrix whose COLUMNS are the matching eigenvectors
eig_val,eig_vec = np.linalg.eig(a)
print('Eigen values\n',eig_val)
print('Eigen vectors\n',eig_vec)
</pre>

Output:
<pre>
Eigen values
 [-1.  8.  2.]
Eigen vectors
 [[ 0.          0.          1.        ]
 [ 0.70710678 -0.78086881  0.        ]
 [-0.70710678 -0.62469505  0.        ]]
</pre>
