# Homework: PCA

Principal Component Analysis
---

In this problem you'll implement dimensionality reduction using the Principal Component Analysis (PCA) technique.

The gist of the PCA algorithm for computing principal components is as follows:
- Calculate the covariance matrix of the data points X.
- Calculate its eigenvectors and the corresponding eigenvalues.
- Sort the eigenvectors by their eigenvalues in decreasing order.
- Choose the first k eigenvectors whose cumulative explained variance satisfies the target explained variance.
- Transform the original data of shape m observations times n features into m observations times k selected features.
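
The steps above can be sketched compactly in NumPy. This is only an illustrative sketch, not the class you are asked to complete: the function name `pca_sketch`, the fixed `k`, and the random demo data are assumptions made for the example, and it centers the data without the standardization step that the skeleton below performs.

```python
import numpy as np

def pca_sketch(X, k):
    """Project X (m x n) onto its top-k principal components (illustrative only)."""
    # Center the data so the covariance is computed around the mean
    X_centered = X - X.mean(axis=0)
    # Covariance matrix of shape (n, n)
    cov = X_centered.T @ X_centered / X.shape[0]
    # eigh suits symmetric matrices and returns eigenvalues in ascending order
    eig_vals, eig_vecs = np.linalg.eigh(cov)
    # Reorder eigenpairs by decreasing eigenvalue
    order = np.argsort(eig_vals)[::-1]
    eig_vals, eig_vecs = eig_vals[order], eig_vecs[:, order]
    # Keep the first k eigenvectors as columns of the projection matrix
    W = eig_vecs[:, :k]
    # Transform (m x n) data into (m x k)
    return X_centered @ W, eig_vals

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))   # made-up data: 200 observations, 5 features
Z, eig_vals = pca_sketch(X_demo, k=2)
print(Z.shape)  # (200, 2)
```

In the assignment, `k` is not fixed in advance; it follows from the target explained variance.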


The skeleton for the *PCA* class is below. Scroll down to find more information about your tasks.

In [16]:
import math
import pickle
import gzip
import numpy as np
import pandas
import matplotlib.pylab as plt
%matplotlib inline

In [17]:
from sklearn.preprocessing import StandardScaler
    
class PCA:
    def __init__(self, target_explained_variance=None):
        self.target_explained_variance = target_explained_variance
        self.feature_size = -1

    ''' standardize - Standardize the data
            Input:
                X - Data
            Algorithm:
                * Use StandardScaler to standardize the data
                * Fit the transform on the data
            Output:
                Returns the standardized data
    '''
    def standardize(self, X):
        scaler = StandardScaler()
        X_std = scaler.fit_transform(X)
        return X_std

    ''' compute_mean_vector - Compute the mean vector
            Input:
                X_std - Standardized data
            Algorithm:
                * Compute the mean of the data
            Output:
                Returns the mean vector
    '''
    def compute_mean_vector(self, X_std):
        mean_vec = np.mean(X_std, axis=0)
        return mean_vec

    ''' compute_cov - Compute the covariance matrix
            Input:
                X_std - Standardized data
                mean_vec - Mean vector
            Algorithm:
                * Get the shape of the data
                * Center the data
                * Compute the covariance matrix
            Output:
                Returns the covariance matrix
    '''
    def compute_cov(self, X_std, mean_vec):
        m = X_std.shape[0]
        X_centered = X_std - mean_vec
        cov_mat = np.dot(X_centered.T, X_centered) / (m)
        return cov_mat

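    ''' compute_eigen_vector - Compute the eigen vector
            Input:
                cov_mat - Covariance matrix
            Algorithm:
                * Use numpy's linalg.eigh, since the covariance matrix is symmetric
                * Order the eigen pairs by decreasing eigen value
            Output:
                Returns the eigen values and eigen vectors
    '''
    def compute_eigen_vector(self, cov_mat):
        # eigh returns real eigenvalues in ascending order for symmetric input
        eig_vals, eig_vecs = np.linalg.eigh(cov_mat)
        order = np.argsort(eig_vals)[::-1]
        return eig_vals[order], eig_vecs[:, order]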

    ''' compute_explained_variance - Compute the explained variance
            Input:
                eigen_vals - Eigen values
            Algorithm:
                * Sum up the eigen values
                * Sort the eigen values
                * Compute the explained variance
            Output:
                Returns the explained variance
    '''
    def compute_explained_variance(self, eigen_vals):
        total = np.sum(eigen_vals)
        # Avoid shadowing the built-in sorted()
        sorted_vals = np.sort(eigen_vals)[::-1]
        var_exp = sorted_vals / total
        return var_exp

    ''' cumulative_sum - Compute the cumulative sum
            Input:
                var_exp - Explained variance
            Algorithm:
                * Use numpy's cumsum to compute the cumulative sum
            Output:
                Returns the cumulative sum
    '''
    def cumulative_sum(self, var_exp):
        return np.cumsum(var_exp)

    ''' compute_weight_matrix - Compute the weight matrix
            Input:
                eig_pairs - Eigen pairs, sorted by decreasing eigen value
                cum_var_exp - Cumulative explained variance
            Algorithm:
                * Iterate over the eigen pairs
                * If the cumulative variance does not exceed the target explained variance
                    * Append the eigen vector to the matrix
                * Else break
                * Stack the matrix
            Output:
                Returns the weight matrix
    '''
    def compute_weight_matrix(self, eig_pairs, cum_var_exp):
        matrix_w = []
        for i, (eig_val, eig_vec) in enumerate(eig_pairs):
            if cum_var_exp[i] <= self.target_explained_variance:
                matrix_w.append(eig_vec.reshape(-1, 1))
            else:
                break
        matrix_w = np.hstack(matrix_w)
        return matrix_w

    ''' transform_data - Transform the data
            Input:
                X_std - Standardized data
                matrix_w - Weight matrix
            Algorithm:
                * Dot product of the standardized data and the weight matrix
            Output:
                Returns the transformed data
    '''
    def transform_data(self, X_std, matrix_w):
        return X_std.dot(matrix_w)

    ''' fit - Fit the data
            Input:
                X - Data
            Algorithm:
                * Standardize the data
                * Compute the mean vector
                * Compute the covariance matrix
                * Compute the eigen vector
                * Compute the explained variance
                * Compute the cumulative sum
                * Compute the weight matrix
                * Transform the data
            Output:
                Returns the transformed data
    '''
    def fit(self, X):
        self.feature_size = X.shape[1]
        X_std = self.standardize(X)
        mean_vec = self.compute_mean_vector(X_std)
        cov_mat = self.compute_cov(X_std, mean_vec)
        eig_vals, eig_vecs = self.compute_eigen_vector(cov_mat)
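        eig_pairs = [(np.abs(eig_vals[i]), eig_vecs[:,i]) for i in range(len(eig_vals))]
        # Sort the pairs by decreasing eigenvalue so they line up with cum_var_exp
        eig_pairs.sort(key=lambda pair: pair[0], reverse=True)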
        var_exp = self.compute_explained_variance(eig_vals)
        cum_var_exp = self.cumulative_sum(var_exp)
        # This step calculates the matrix_w
        matrix_w = self.compute_weight_matrix(eig_pairs=eig_pairs,cum_var_exp=cum_var_exp) 
        print(len(matrix_w),len(matrix_w[0]))
        return self.transform_data(X_std=X_std, matrix_w=matrix_w)


To Do: Complete the helper functions above.

Complete `fit()` so that it uses all the helper functions to compute the reduced-dimension data.

Run PCA on the *Fashion-MNIST dataset* to reduce the dimension of the data.

Fashion-MNIST data consists of samples with *784 dimensions*.

Report the reduced dimension $k$ for a target explained variance of **0.99**.
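
One way to read $k$ off the eigenvalues is to take the cumulative sum of the sorted explained-variance ratios and find where it first reaches the target. The sketch below assumes that convention. The helper name `smallest_k` and the demo eigenvalues are hypothetical and not part of the assignment.

```python
import numpy as np

def smallest_k(eig_vals, target):
    """Hypothetical helper: smallest k whose top-k eigenvalues reach `target`."""
    eig_vals = np.asarray(eig_vals, dtype=float)
    # Explained-variance ratio of each component, largest first
    ratios = np.sort(eig_vals)[::-1] / eig_vals.sum()
    cum_var_exp = np.cumsum(ratios)
    # First index whose cumulative ratio is >= target, converted to a count
    k = int(np.searchsorted(cum_var_exp, target)) + 1
    # Guard against round-off leaving the final cumulative sum just below 1.0
    return min(k, len(eig_vals))

demo_eig_vals = [4.0, 3.0, 2.0, 1.0]   # made-up values with ratios 0.4, 0.3, 0.2, 0.1
k_demo = smallest_k(demo_eig_vals, target=0.8)
print(k_demo)  # 3
```

Note that this "reach the target" convention can differ by one from a loop that keeps components only while the cumulative variance stays at or below the target, as `compute_weight_matrix` does, so state which convention you used when reporting $k$.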

In [18]:
# Use context managers so the files are closed after loading
with open('./data/fashionmnist/train_images.pkl', 'rb') as f:
    X_train = pickle.load(f)
with open('./data/fashionmnist/train_image_labels.pkl', 'rb') as f:
    y_train = pickle.load(f)

X_train = X_train[:1500]
y_train = y_train[:1500]

In [19]:
pca_handler = PCA(target_explained_variance=0.99)
X_train_updated = pca_handler.fit(X_train)

784 409
