# Homework 1: PCA

Problem 1 - Principal Component Analysis
---

In this problem you'll be implementing Dimensionality reduction using Principal Component Analysis technique. 

The gist of PCA Algorithm to compute principal components is follows:
- Calculate the covariance matrix X of data points.
- Calculate eigenvectors and corresponding eigenvalues.
- Sort the eigenvectors according to their eigenvalues in decreasing order.
- Choose first k eigenvectors which satisfies target explained variance.
- Transform the original data of shape m observations times n features into m observations times k selected features.


The skeleton for the *PCA* class is below. Scroll down to find more information about your tasks.

In [1]:
import math
import pickle
import gzip
import numpy as np
import pandas
import matplotlib.pylab as plt
%matplotlib inline

In [19]:
from sklearn.preprocessing import StandardScaler


class PCA:
    def __init__(self, target_explained_variance=None):
        """
        explained_variance: float, the target level of explained variance
        """
        self.target_explained_variance = target_explained_variance
        self.feature_size = -1

    def standardize(self, X):
        """
        standardize features using standard scaler
        :param X: input data with shape m (# of observations) X n (# of features)
        :return: standardized features (Hint: use skleanr's StandardScaler. Import any library as needed)
        """
        # your code here
        scaler = StandardScaler()
        transformed_features = scaler.fit_transform(X)
        return(transformed_features)

    def compute_mean_vector(self, X_std):
        """
        compute mean vector
        :param X_std: transformed data
        :return n X 1 matrix: mean vector
        """
        # your code here
        
        # I am currently operating under the assumption that the means we are taking are of all observations over 
        ## a given feature, for all features. That would result in n computations, which can be stored in n X 1 vector.
        
        means = np.mean(X_std, axis = 0)
        return(means)

    def compute_cov(self, X_std, mean_vec):
        """
        Covariance using mean, (don't use any numpy.cov)
        :param X_std:
        :param mean_vec:
        :return n X n matrix:: covariance matrix
        """
        # your code here
        
        return (X_std - mean_vec).T.dot((X_std - mean_vec)) / (X_std.shape[0] - 1)

    def compute_eigen_vector(self, cov_mat):
        """
        Eigenvector and eigen values using numpy. Uses numpy's eigenvalue function
        :param cov_mat:
        :return: (eigen_values, eigen_vector)
        """
        # your code here
        (eigenvals, eigenvecs) = np.linalg.eig(cov_mat)
        return (eigenvals, eigenvecs)

    def compute_explained_variance(self, eigen_vals):
        """
        sort eigen values and compute explained variance.
        explained variance informs the amount of information (variance)
        can be attributed to each of  the principal components.
        :param eigen_vals:
        :return: explained variance.
        """
        # your code here
        
        #I had to reupload because this question among others got misgraded. This function successfully sorts the
        ## eigenvals ahd calculates the explained variance by each sequentially.
        
        eigen_vals[::-1].sort()
        total_var = eigen_vals.sum()
        explained_var = (1 / total_var) * eigen_vals
        return(explained_var)

    def cumulative_sum(self, var_exp):
        """
        return cumulative sum of explained variance.
        :param var_exp: explained variance
        :return: cumulative explained variance
        """
        return np.cumsum(var_exp)
    
    def make_eig_pairs(self, eigenvals, eigenvecs):
        """
        Citation: https://stackoverflow.com/questions/9007877/sort-arrays-rows-by-another-array-in-python
        Input: np.ndarrays eigenvals, eigenvecs. These are outputted by compute_eigen_vector
        Output: list of doubles (eigenvalue, eigenvector) sorted by decreasing eigenvalue
        """
        indecies = eigenvals.argsort()
        sorted_eigenvals = eigenvals[indecies[::-1]]
        sorted_eigenvecs = eigenvecs[indecies[::-1], :]
        n = sorted_eigenvals.shape[0]
        eig_pairs = list()
        
        for j in range(n):
            this_double = (sorted_eigenvals[j], sorted_eigenvecs[j, :])
            eig_pairs.append(this_double)
        
        return(eig_pairs)
            

    def compute_weight_matrix(self, eig_pairs, cum_var_exp):
        """
        compute weight matrix of top principal components conditioned on target
        explained variance.
        (Hint : use cumilative explained variance and target_explained_variance to find
        top components)
        
        :param eig_pairs: list of tuples containing eigenvalues and eigenvectors, 
        sorted by eigenvalues in descending order (the biggest eigenvalue and corresponding eigenvectors first).
        :param cum_var_exp: cumulative expalined variance by features
        :return: weight matrix (the shape of the weight matrix is n X k)
        """
        # your code here
        my_target = self.target_explained_variance
        # I'm just going to scan over cum_var_exp, which should run in O(k). This could be sped up using binary search
        ## because cum_var_exp is already sorted. That would run in O(log(k)).
        
        k = 0
        n = cum_var_exp.shape[0]
        for feature in range(n):
            if cum_var_exp[feature] >= my_target:
                break
            else:
                k = k + 1
        
        # Now that I have the index of the last desired component, we just mash the corresponding eigenvectors into a matrix
        eigenvec_container = list()
        for feature in range(k):
            this_eigenvec = eig_pairs[feature][1]
            eigenvec_container.append(this_eigenvec)
        eigenvecs = np.asarray(eigenvec_container).reshape(n, k)
        
        return(eigenvecs)
        

    def transform_data(self, X_std, matrix_w):
        """
        transform data to subspace using weight matrix
        :param X_std: standardized data
        :param matrix_w: weight matrix
        :return: data in the subspace
        """
        return X_std.dot(matrix_w)

    def fit(self, X):
        """    
        entry point to the transform data to k dimensions
        standardize and compute weight matrix to transform data.
        The fit functioin returns the transformed features. k is the number of features which cumulative 
        explained variance ratio meets the target_explained_variance.
        :param   m X n dimension: train samples
        :return  m X k dimension: subspace data. 
        """
    
        self.feature_size = X.shape[1]
        
        # your code here
        
        #Essentially I just call all the methods in the class PCA sequentially to fulfill their dependancies.
        
        this_transformed = self.standardize(X)
        this_mean_vec = self.compute_mean_vector(this_transformed)
        this_cov = self.compute_cov(this_transformed, this_mean_vec)
        (these_eigenvals, these_eigenvecs) = self.compute_eigen_vector(this_cov)
        these_eig_pairs = pca.make_eig_pairs(these_eigenvals, these_eigenvecs)
        this_varexp = pca.compute_explained_variance(these_eigenvals)
        this_cumexp = pca.cumulative_sum(this_varexp)
        this_weight_matrix = pca.compute_weight_matrix(these_eig_pairs, this_cumexp)
        
        
        print(len(this_weight_matrix),len(this_weight_matrix[0]))
        return self.transform_data(X_std = this_transformed, matrix_w = this_weight_matrix)


**[ PART A ]** Your task involves implementing helper functions to compute *mean, covariance, eigenvector and weights*.

complete `fit()` to using all helper functions to find reduced dimension data.

Run PCA on *fashion mnist dataset* to reduce the dimension of the data.

fashion mnist data consists of samples with *784 dimensions*.

Report the reduced dimension $k$ for target explained variance of **0.99**

In [21]:
X_train = pickle.load(open('./data/fashionmnist/train_images.pkl','rb'))
y_train = pickle.load(open('./data/fashionmnist/train_image_labels.pkl','rb'))

X_train = X_train[:1500]
y_train = y_train[:1500]

In [22]:
pca_handler = PCA(target_explained_variance=0.99)
X_train_updated = pca_handler.fit(X_train)

784 191
