# Principal Component Analysis

- Principal Component Analysis (PCA) is a dimensionality-reduction method. It transforms a large set of variables into a smaller set of uncorrelated components that still retains most of the information (variance) in the original data.

- Take the whole dataset of d-dimensional samples, ignoring the class labels.
- Compute the d-dimensional mean vector (i.e., the mean of every dimension over the whole dataset).
- Compute the scatter matrix (alternatively, the covariance matrix) of the whole dataset.
- Compute the eigenvectors ($\mathbf{e}_1, \mathbf{e}_2, \dots, \mathbf{e}_d$) and corresponding eigenvalues ($\lambda_1, \lambda_2, \dots, \lambda_d$).
- Sort the eigenvectors by decreasing eigenvalue and choose the $k$ eigenvectors with the largest eigenvalues to form a $d \times k$ matrix $\mathbf{W}$ (where every column represents an eigenvector).
- Use this $d \times k$ eigenvector matrix to transform the samples onto the new subspace. This can be summarized by the equation $\mathbf{y} = \mathbf{W}^T \mathbf{x}$ (where $\mathbf{x}$ is a $d \times 1$ vector representing one sample, and $\mathbf{y}$ is the transformed $k \times 1$ sample in the new subspace).
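
The steps above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the function name `pca_project`, the synthetic data, and the choice to subtract the mean before projecting are assumptions made for the example (centering is a common convention, but the list above does not state it explicitly).

```python
import numpy as np

def pca_project(X, k):
    """Project the rows of X (n samples x d features) onto the top-k principal axes.

    Hypothetical helper written for illustration only.
    """
    # d-dimensional mean vector; subtracting it is an assumption of this sketch
    mean = X.mean(axis=0)
    X_centered = X - mean

    # d x d covariance matrix; rowvar=False treats columns as variables
    cov = np.cov(X_centered, rowvar=False)

    # eigh is meant for symmetric matrices and returns eigenvalues in ascending order
    eigvals, eigvecs = np.linalg.eigh(cov)

    # Indices of the k largest eigenvalues, in decreasing order
    order = np.argsort(eigvals)[::-1][:k]
    W = eigvecs[:, order]  # d x k matrix, one eigenvector per column

    # y = W^T x for every sample, written row-wise as X W
    Y = X_centered @ W
    return Y, W, eigvals[order]

# Synthetic data with very different variances per feature
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5)) * np.array([5.0, 2.0, 1.0, 0.5, 0.1])

Y_demo, W_demo, top_eigvals = pca_project(X_demo, k=2)
print(Y_demo.shape)  # (200, 2)
print(W_demo.shape)  # (5, 2)
```

Because the columns of `W` are orthonormal eigenvectors, the covariance of the projected data is diagonal, with the selected eigenvalues on the diagonal.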

In [57]:
# Imports

import numpy as np
import pandas as pd
from sklearn.utils import shuffle
import matplotlib.pyplot as plt

In [50]:
def pca(X, y):
    """Run PCA on X (n_samples x n_features) and plot the results; y colors the scatter plot."""
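    # Center the data so that the projection below is taken about the mean
    X_centered = X - X.mean(axis=0)

    # Covariance matrix of the features (np.cov expects variables in rows, hence the transpose)
    covar_X = np.cov(X_centered.T)
    # Eigenvalues and eigenvectors of the symmetric covariance matrix
    # (eigh returns the eigenvalues in ascending order)
    eigval, eigvec = np.linalg.eigh(covar_X)

    # Sort the eigenvalues and their eigenvectors in descending order
    idx = np.argsort(-eigval)
    eigval = eigval[idx]
    # Clip tiny negative eigenvalues caused by floating-point round-off
    eigval = np.maximum(eigval, 0)
    eigvec = eigvec[:, idx]

    # Project the centered data onto the sorted eigenvectors; the columns of Z are decorrelated
    Z = X_centered.dot(eigvec)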
    
    # Plot the two components that carry the most information (variance)
    plt.scatter(Z[:, 0], Z[:, 1], s=100, c=y, alpha=0.3)
    plt.xlabel("First Highest Information Component")
    plt.ylabel("Second Highest Information Component")
    plt.show()
    
    # Eigenvalues, i.e. the variance captured by each component
    plt.plot(eigval)
    plt.title("Variance of each component")
    plt.show()

    # Cumulative sum of the eigenvalues
    plt.plot(np.cumsum(eigval))
    plt.title("Cumulative variance")
    plt.show()  
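
A cumulative-variance curve like the one plotted above is often read to decide how many components to keep. The sketch below shows one way to do that numerically. The helper name `choose_k`, the threshold values, and the example eigenvalues are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

def choose_k(eigval, threshold=0.95):
    """Smallest k whose leading eigenvalues reach `threshold` of the total variance.

    Hypothetical helper; assumes `eigval` is sorted in descending order
    and is non-negative, as produced inside pca() above.
    """
    ratio = eigval / eigval.sum()   # fraction of variance per component
    cumulative = np.cumsum(ratio)   # running total, ends at 1.0
    # First position where the running total reaches the threshold
    k = int(np.searchsorted(cumulative, threshold)) + 1
    # Guard against round-off leaving the last value slightly below the threshold
    k = min(k, len(eigval))
    return k, cumulative

# Made-up eigenvalues chosen so the ratios are exact binary fractions
eigval_demo = np.array([4.0, 2.0, 1.0, 1.0])
k, cumulative = choose_k(eigval_demo, threshold=0.8)
print(k)           # 3
print(cumulative)  # [0.5   0.75  0.875 1.   ]
```

The threshold is a modelling choice: a higher value keeps more components and discards less variance.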