# Implementing a Principal Component Analysis (PCA)
Apr 13, 2014<br>
by Sebastian Raschka

## Sections

-  [Introduction](#introduction)
   -  [What is a "good" subspace?]()
   -  [Summarizing the PCA approach]()
-  [Generating some 3-dimensional sample data]()
   -  [Why are we choosing a 3-dimensional sample?]()
-  [1. Taking the whole dataset ignoring the class labels]()
-  [2. Computing the d-dimensional mean vector]()
-  [3. a) Computing the Scatter Matrix]()
-  [3. b) Computing the Covariance Matrix (alternatively to the scatter matrix)]()
-  [4. Computing eigenvectors and corresponding eigenvalues]()
   -  [Checking the eigenvector-eigenvalue calculation]()
   -  [Visualizing the eigenvectors]()
-  [5.1. Sorting the eigenvectors by decreasing eigenvalues]()
-  [5.2. Choosing _k_ eigenvectors with the largest eigenvalues]()
-  [6. Transforming the samples onto the new subspace]()
-  [Using the PCA() class from the matplotlib.mlab library]()
   -  [Class attributes of PCA()]()
-  [Differences between the step by step approach and matplotlib.mlab.PCA()]()
-  [Using the PCA() class from the sklearn.decomposition library to confirm our results]()

## Introduction

The main purposes of a principal component analysis are to analyze data in order to identify patterns, and to use those patterns to reduce the dimensions of the dataset with minimal loss of information.

Here, our desired outcome of the principal component analysis is to project a feature space (our dataset consisting of $n$ $d$-dimensional samples) onto a smaller subspace that represents our data "well". A possible application would be a pattern classification task, where we want to reduce the computational costs and the error of the parameter estimation by reducing the number of dimensions of our feature space, that is, by extracting a subspace that describes our data "best".

### What is a "good" subspace?

Let's assume that our goal is to reduce the dimensions of a $d$-dimensional dataset by projecting it onto a $k$-dimensional subspace (where $k < d$). So, how do we know what size we should choose for $k$, and how do we know if we have a feature space that represents our data "well"? Later, we will compute eigenvectors (the components) from the so-called scatter matrix of our dataset (or alternatively from the covariance matrix). Each of those eigenvectors is associated with an eigenvalue, which tells us how much of the data's spread (variance) lies along the direction of that eigenvector; the eigenvectors themselves only define directions and are usually normalized to unit length. If we observe that all the eigenvalues are of very similar magnitude, this is a good indicator that our data is already in a "good" subspace, since no direction could be dropped without losing a comparable share of the information. If instead some of the eigenvalues are much higher than others, we might be interested in keeping only those eigenvectors with the much larger eigenvalues, since they contain more information about our data distribution. Vice versa, eigenvalues that are close to 0 are less informative, and we might consider dropping the corresponding eigenvectors when we construct the new feature subspace.
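To make the role of the eigenvalues more concrete, here is a minimal sketch using NumPy. The synthetic data, the random seed, and the 90% variance threshold are assumptions chosen purely for illustration; they are not part of the step-by-step approach described in this article. The sketch computes the eigenvalues of the covariance matrix, expresses each one as a share of the total variance, and picks the smallest $k$ whose eigenvectors together retain the chosen share.

```python
import numpy as np

# Hypothetical data (assumption): 200 samples, 3 features whose
# standard deviations differ strongly (5.0, 1.0 and 0.1).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) * np.array([5.0, 1.0, 0.1])

# 3x3 covariance matrix; rowvar=False means one sample per row.
cov_mat = np.cov(X, rowvar=False)

# eigvalsh is meant for symmetric matrices and returns the eigenvalues
# in ascending order, so we reverse them to get the largest first.
eig_vals = np.linalg.eigvalsh(cov_mat)[::-1]

# Share of the total variance associated with each eigenvalue.
explained = eig_vals / eig_vals.sum()
cumulative = np.cumsum(explained)

# Smallest k whose eigenvectors retain at least 90% of the variance.
# The 0.90 threshold is an arbitrary illustrative choice.
k = int(np.searchsorted(cumulative, 0.90)) + 1

print("eigenvalues:       ", eig_vals)
print("explained variance:", explained)
print("chosen k:          ", k)
```

A rule of thumb like this is only one possible way to pick $k$; inspecting the sorted eigenvalues directly, as described above, works just as well for small $d$.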

### Summarizing the PCA approach

Listed below are the 6 general steps for performing a principal component analysis, which we will investigate in the following sections.

1. [Take the whole dataset consisting of $d$-dimensional samples ignoring the class labels]()
2. [Compute the $d$-dimensional mean vector]() (i.e. the means for every dimension of the whole dataset)
3. [Compute the scatter matrix (alternatively, the covariance matrix) of the whole dataset]()
4. [Compute eigenvectors ($e_1,e_2,...,e_d$) and corresponding eigenvalues ($\lambda_1,\lambda_2,...,\lambda_d$)]()
5. [Sort the eigenvectors by decreasing eigenvalues and choose the $k$ eigenvectors with the largest eigenvalues to form a $d \times k$-dimensional matrix $W$]() (where every column represents an eigenvector)
6. [Use this $d \times k$ eigenvector matrix to transform the samples onto the new subspace.]() This can be summarized by the mathematical equation: $y=W^T \times x$ (where $x$ is a $d \times 1$-dimensional vector representing one sample, and $y$ is the transformed $k \times 1$-dimensional sample in the new subspace.)
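
The six steps above can be sketched compactly in NumPy. This is only an illustrative outline under a few assumptions, not the implementation developed in the following sections: the function name `pca_project` is hypothetical, samples are stored as rows of an $n \times d$ array (so $y = W^T \times x$ becomes `X @ W` for all samples at once), and `numpy.linalg.eigh` is used because the scatter matrix is symmetric.

```python
import numpy as np

def pca_project(X, k):
    """Project the rows of X (n samples x d features) onto k components.

    Hypothetical helper for illustration; returns (Y, W) where W is the
    d x k eigenvector matrix and Y holds the n transformed samples.
    """
    # Step 1: X is the whole dataset, class labels already dropped.
    # Step 2: d-dimensional mean vector.
    mean_vec = X.mean(axis=0)

    # Step 3: d x d scatter matrix, the sum of outer products of the
    # mean-centered samples.
    centered = X - mean_vec
    scatter = centered.T @ centered

    # Step 4: eigenvalues and eigenvectors (columns of eig_vecs).
    eig_vals, eig_vecs = np.linalg.eigh(scatter)

    # Step 5: sort by decreasing eigenvalue, keep the top k columns.
    order = np.argsort(eig_vals)[::-1]
    W = eig_vecs[:, order[:k]]

    # Step 6: y = W^T x for every sample; with samples as rows this
    # is the matrix product X @ W.
    Y = X @ W
    return Y, W

# Usage with made-up data: 40 samples in 3 dimensions, reduced to 2.
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 3))
Y, W = pca_project(X, k=2)
print(W.shape, Y.shape)
```

Note that step 6 follows the equation as stated and transforms the original samples; subtracting the mean vector first would only shift the projected data so that it is centered at the origin.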