# Lec 5. Principal Component Analysis

1. Classifier works well on the training set. $\Leftarrow$ ERM (Empirical Risk Minimization);

2. The hypothesis space is not too large;

3. Filter out noise and __extract__ the __relevant information__ from the given data set. __Decrease__ both __noise__ and __redundancy(correlated data)__ in the data set;

Face Recognition

|No.|Type|Method
|:-|:-|:-
|1|Unsupervised|PCA (Principal Component Analysis)
|2|Supervised|LDA (Linear Discriminant Analysis)
|3|Kernel Methods|Kernel Methods

## 0. Principal Component Analysis

### 0.1 What is PCA?

- __P__rincipal __C__omponent __A__nalysis (PCA) is a __statistical procedure__ that uses an __orthogonal transformation__ to convert a set of __observations of possibly correlated variables__ into a set of values of __linearly uncorrelated variables__ called __principal components__. 


- __Purpose__: The goal of principal component analysis is to identify the most __meaningful basis__ to __re-express__ a data set. The hope is that this __new basis__ will __filter out__ the __noise__ and __reveal hidden structure__.



- The __number__ of __p__rincipal __c__omponents is __less than or equal to__ the __number__ of __original variables__.


### 0.2 Assumptions of PCA

1. __Linearity__
    - Assumes the principal components are linear combinations of the original variables;
    
2. The importance of __mean__ and __covariance__
    - Assumes the mean and covariance describe the data sufficiently. There is no guarantee that the directions of maximum variance will contain good features for discrimination;
    
3. That __large variances__ have __important__ dynamics
    - Assumes that components with larger variance correspond to interesting dynamics and lower ones correspond to noise.
    
__Note__:

Where regression determines a line of best fit to a data set, factor analysis determines several orthogonal lines of best fit to the data set.

The orthogonality of  principal components implies that PCA finds the __most uncorrelated__ components to explain as __much variation__ in the data as possible.

### 0.3 How Is the PCA Transformation Defined?
- This transformation is defined in such a way that the first principal component has the largest possible variance (that is, accounts for as much of the variability in the data as possible), and each succeeding component in turn has the highest variance possible under the constraint that it is orthogonal to the preceding components. 


- The resulting vectors are an uncorrelated orthogonal basis set. 


- PCA is sensitive to the relative scaling of the original variables.
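The sensitivity to scaling can be illustrated with a small NumPy sketch. The data, the factor of 100, and the helper `first_pc` are assumptions made up for this illustration; the layout follows the $d\times N$ convention (rows are variables).

```python
import numpy as np

rng = np.random.default_rng(0)
N = 500
# Two independent variables with unit variance, stored as rows (d x N layout).
X = rng.standard_normal((2, N))

def first_pc(X):
    """Unit eigenvector of the sample covariance with the largest eigenvalue."""
    Xc = X - X.mean(axis=1, keepdims=True)
    S = Xc @ Xc.T / (X.shape[1] - 1)
    eigvals, eigvecs = np.linalg.eigh(S)  # eigenvalues in ascending order
    return eigvecs[:, -1]

# Rescale the second variable, e.g. a change of units.
X_scaled = X.copy()
X_scaled[1] *= 100.0
pc_scaled = first_pc(X_scaled)

# Standardize each variable to unit variance before PCA.
X_std = X_scaled / X_scaled.std(axis=1, keepdims=True)
pc_std = first_pc(X_std)

print("first PC after rescaling:     ", np.abs(pc_scaled))
print("first PC after standardizing: ", np.abs(pc_std))
```

After rescaling, the first component points almost entirely along the rescaled variable. Standardizing the variables first is one common way to remove this dependence on units.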

### 0.4 Where to use?

1) Image Compression;

2) Eigenfaces;

3) Finding Patterns in data of high dimension;

### 0.5 Mathematics

1) Standard deviation;

2) Covariance;

3) Eigenvectors;

4) Eigenvalues;

5) Orthogonal matrix;

6) __Spectral Theorem__: A real __matrix__ is __symmetric__ if and only if it is __orthogonally diagonalizable__;

- A __symmetric matrix__ __A__ can be written as $A = EDE^T$, where __D__ is a __diagonal matrix__ of A's eigenvalues and __E__ is an orthogonal matrix whose columns are A's __eigenvectors__;

- [Gram–Schmidt process](https://en.wikipedia.org/wiki/Gram%E2%80%93Schmidt_process)

- A symmetric matrix is diagonalized by a matrix of its orthonormal eigenvectors.

7) __singular value decomposition__ (SVD): __PCA__ is intimately related to the mathematical technique of __singular value decomposition__ (SVD);
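Items 6) and 7) can be checked numerically. The sketch below uses randomly generated data (an assumption for illustration) to verify $S = EDE^T$ for a sample covariance matrix and to show how the singular values of the centred data relate to the eigenvalues of $S$.

```python
import numpy as np

rng = np.random.default_rng(1)
d, N = 4, 200
X = rng.standard_normal((d, N))
Xc = X - X.mean(axis=1, keepdims=True)   # centre each variable (row)
S = Xc @ Xc.T / (N - 1)                  # symmetric d x d sample covariance

# Spectral theorem: S = E D E^T with orthonormal columns in E.
eigvals, E = np.linalg.eigh(S)           # eigenvalues in ascending order
S_rebuilt = E @ np.diag(eigvals) @ E.T

# SVD of the centred data: Xc = U diag(s) V^T, hence S = U diag(s^2/(N-1)) U^T.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
eigvals_from_svd = s**2 / (N - 1)        # descending order

print(np.allclose(S_rebuilt, S))
print(np.allclose(eigvals[::-1], eigvals_from_svd))
```

The left singular vectors `U` span the same directions as the eigenvectors in `E`, up to sign and ordering.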

## 1. Definition of Principal Component

__Input__: $X_i\in R^d,X=[X_1,X_2,...,X_N]\in R^{\space d\times N}\space,\space i\space =\space 1,...,N$;

__Output__: Projection Matrix, $w\in R^{\space d\times d'} \Rightarrow w^TX\in R^{\space d'\times N}\space\Rightarrow ww^TX\in R^{\space d\times N}$

### 1.1 Find:
- The variances of the d random variables, and
- The structure of the covariances or correlations between the d variables.

If d is large, we want to focus on only $d' < d$ variables while still preserving most of the information.

### 1.2 Questions: 
    
1. Which variables can we ignore?

    1) dependent variables -> __high covariance__ 
    
    2) constant
    
    3) __noise__ -> __low variance__
    
    4) constant + noise
    
2. Which variables do we want to keep?

    1) __low covariance__ $\approx$ uncorrelated, which means this variable doesn't depend much on the others.
    
    2) changes a lot ->  large variances have important dynamics
        
        a) high variance can be treated as 'data'
        
        b) low variance can be treated as 'noise' 
## $\text{signal-to-noise ratio}: SNR=\frac{\sigma_{data}^2}{\sigma_{noise}^2}$

    So maximizing the SNR amounts to maximizing the variance along the chosen axis.

3. How do we describe the 'most important' features using math?

    - Variance
    
4. How do we represent our data, so that the most important feature can be extracted easily?

    - Changes of basis;

5. What are we looking for?

    Re-express sample $x$ as $w^Tx$, which decreases both noise and redundancy;
    
    Since we want $w^Tx$ to have __minimal covariance__ and __maximal__ variance, the ideal covariance matrix of $w^Tx$ is a __diagonal__ matrix (with zeros in the off-diagonal terms).
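To make the questions above concrete, here is a small sketch with three made-up variables: `x2` is almost a multiple of `x1` (redundant), and `x3` is a constant plus a little noise. The coefficients are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 1000
x1 = rng.standard_normal(N)
x2 = 2.0 * x1 + 0.05 * rng.standard_normal(N)  # dependent on x1 -> high covariance
x3 = 5.0 + 0.01 * rng.standard_normal(N)       # constant + noise -> low variance
X = np.vstack([x1, x2, x3])                    # d x N, rows are variables

S = np.cov(X)        # np.cov treats rows as variables by default
R = np.corrcoef(X)   # correlation matrix (covariance normalised by std devs)

print("covariance matrix:\n", np.round(S, 4))
print("corr(x1, x2) =", round(R[0, 1], 4))
print("var(x3)      =", S[2, 2])
```

The large off-diagonal entry between `x1` and `x2` signals redundancy, and the tiny variance of `x3` marks it as a candidate to drop. A diagonal covariance matrix would show neither effect.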


### 1.3 Definition of Principal Component

#### Step 1:

For the linear function

$\alpha_{1}^{T}X\space=\space \alpha_{11}x_{1}+...+\alpha_{1d}x_{d}=\sum\limits_{j=1}^{d}\alpha_{1j}x_{j}$

where $\alpha_{1}$ is chosen so that $\alpha_{1}^TX$ has maximum variance.

#### Step 2:

For the linear function $\alpha_{2}^TX$, which is uncorrelated with $\alpha_{1}^TX$,

$\alpha_{2}^TX = \alpha_{21}x_{1}+...+\alpha_{2d}x_{d}=\sum\limits_{j=1}^{d}\alpha_{2j}x_{j}$

where $\alpha_{2}$ is chosen so that $\alpha_{2}^{T}X$ has maximum variance subject to that constraint.

#### Step k:

$\alpha_{k}^{T}X$ is the kth PC.

Stacking these vectors as columns gives $w = [\alpha_1,\alpha_2,...,\alpha_k]$, so  
    
## \begin{equation}
w^T=\begin{bmatrix}
\alpha_1^T \\
\alpha_2^T \\
\vdots \\
\alpha_k^T
\end{bmatrix}
\end{equation}

and __w__ has __orthonormal columns__ (it is an __orthogonal matrix__ when $k = d$), i.e.

## $w^Tw=I$
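A quick numerical check of this property, as a sketch: the matrix `Sigma` below is an arbitrary symmetric positive semi-definite matrix invented for the example. It also shows that $ww^T$ is generally not the identity when fewer than $d$ components are kept; it is a projection onto the retained subspace.

```python
import numpy as np

rng = np.random.default_rng(3)
d, d_prime = 5, 2
A = rng.standard_normal((d, d))
Sigma = A @ A.T                         # arbitrary symmetric PSD matrix

eigvals, eigvecs = np.linalg.eigh(Sigma)
w = eigvecs[:, ::-1][:, :d_prime]       # top d' eigenvectors as columns (d x d')

gram = w.T @ w                          # d' x d', should be the identity
P = w @ w.T                             # d x d, projection onto span of w

print(np.allclose(gram, np.eye(d_prime)))  # orthonormal columns
print(np.allclose(P, np.eye(d)))           # not the identity when d' < d
print(np.allclose(P @ P, P))               # idempotent, i.e. a projection
```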

## 1.2 How to Find

Let $\Sigma$ be the covariance matrix of X (if $\Sigma$ is unknown, replace it by the sample covariance matrix S). It quantifies the covariance between all possible pairs of measurements.

For kth PC: 

### $Y_k = \alpha_k^TX$

where, 

- ### $\alpha_{k}$ is an eigenvector of $\Sigma$ corresponding to its kth largest eigenvalue $\lambda_{k}$;

- ### Furthermore, if $\alpha_{k}^{T}\alpha_{k}=1$ (unit vector) $\Rightarrow$ $var[Y_{k}]=\lambda_{k}$;
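The claim $var[Y_k]=\lambda_k$ can be checked on synthetic data. In this sketch the mixing matrix `M` and the sample size are assumptions, and the sample covariance S stands in for $\Sigma$.

```python
import numpy as np

rng = np.random.default_rng(4)
d, N = 3, 2000
M = rng.standard_normal((d, d))
X = M @ rng.standard_normal((d, N))      # correlated variables, d x N
Xc = X - X.mean(axis=1, keepdims=True)
S = Xc @ Xc.T / (N - 1)                  # sample covariance matrix

eigvals, eigvecs = np.linalg.eigh(S)     # ascending order
order = np.argsort(eigvals)[::-1]        # indices sorted by descending eigenvalue
lam = eigvals[order]                     # lambda_1 >= lambda_2 >= ...
alpha = eigvecs[:, order]                # column k is alpha_k (unit length)

Y = alpha.T @ Xc                         # row k holds the kth PC scores
var_Y = Y.var(axis=1, ddof=1)            # sample variance of each PC

print(np.allclose(var_Y, lam))
```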

## 1.3 Derive -> How to get $\alpha$ ?

__Input__: $X_i\in R^d,X=[X_1,X_2,...,X_N]\in R^{\space d\times N}\space,\space i\space =\space 1,...,N$;

__Output__: Projection Matrix, $w\in R^{\space d\times d'} \Rightarrow w^TX\in R^{\space d'\times N}\space\Rightarrow ww^TX\in R^{\space d\times N}$

### 1.3.1  Find 1st Principal Component

__Proof__:

###    \begin{equation}
    \max\space var[\alpha_1^TX]=\alpha_1^T\Sigma\alpha_1 \\
    \text{s.t.}\space \alpha_1^T\alpha_1 = 1,\space \text{i.e.}\space \sum_{i=1}^{d}\alpha_{1i}^2=1 \\
    \overset{Lagrange}{\Rightarrow} \max \space \alpha_1^T\Sigma\alpha_1 - \lambda(\alpha_1^T\alpha_1-1) \\
    \frac{d}{d\alpha_1} [\alpha_1^T\Sigma\alpha_1-\lambda(\alpha_1^T\alpha_1-1)]=0 \\
    \Rightarrow 2\Sigma\alpha_1-2\lambda\alpha_1=0 \\
    \Rightarrow \Sigma\alpha_1-\lambda\alpha_1 = 0 \\
    \Rightarrow (\Sigma - \lambda I_{d})\alpha_1 = 0 \\
    \alpha_1 \in R^{d}, \Sigma \in R^{d\times d}, \lambda \in R
    \end{equation}

__Note__:

### \begin{equation}
Var(\alpha_1^TX) = E\{[\alpha_1^TX-E(\alpha_1^TX)][\alpha_1^TX-E(\alpha_1^TX)]^T\} \\
= E[\alpha_1^TXX^T\alpha_1-\alpha_1^TXE(\alpha_1^TX)-X^T\alpha_1E(\alpha_1^TX)+E(\alpha_1^TX)^2]\\
= \alpha_1^TE(XX^T)\alpha_1 - E(\alpha_1^TX)^2 - E(X^T\alpha_1)E(\alpha_1^TX) + E(\alpha_1^TX)^2\\
= \alpha_1^TE(XX^T)\alpha_1 - E(X^T\alpha_1)E(\alpha_1^TX)\\
= \alpha_1^T\Sigma\alpha_1 \text{(if the mean of X is zero)}
\end{equation}

### \begin{equation}
(\Sigma - \lambda I_d)\alpha_1 = 0 \\
1^{st}: \lambda\space \text{is an eigenvalue of}\space \Sigma, \text{and}\\
2^{nd}: \alpha_1 \space\text{is the corresponding eigenvector} \\
\Rightarrow \max\space Var(\alpha_1^T X) = \max\space \alpha_1^T\Sigma\alpha_1 \\
= \max\space \alpha_1^T\lambda\alpha_1 \\
3^{rd}: = \max\space \lambda, \text{ since}\space \alpha_1^T\alpha_1=1
\end{equation}

### Summary

1) The Lagrange multiplier $\lambda$ is an eigenvalue of the covariance matrix of X, $\Sigma$;

2) $\alpha_1$ is the corresponding eigenvector;

3) $\Sigma$ is a symmetric matrix;

4) Finding the first PC $\iff$ finding the maximum eigenvalue of the covariance matrix of X, $\Sigma$;

$\lambda$ must be as large as possible, so $\alpha_1$ is the eigenvector corresponding to the largest eigenvalue of $\Sigma$;

### \begin{equation}
\Rightarrow \lambda_1 = \alpha_1^{*T}\Sigma\alpha_1^{*} = Var(\alpha_1^{*T}X)
\end{equation}
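As a sanity check of this result, the sketch below compares $\alpha^T\Sigma\alpha$ for many random unit vectors against the largest eigenvalue. The matrix `Sigma` and the number of trial vectors are assumptions for illustration; random sampling only illustrates the bound, it does not prove it.

```python
import numpy as np

rng = np.random.default_rng(5)
d = 4
A = rng.standard_normal((d, d))
Sigma = A @ A.T                                  # arbitrary symmetric PSD matrix

eigvals, eigvecs = np.linalg.eigh(Sigma)         # ascending order
lam1, alpha1 = eigvals[-1], eigvecs[:, -1]       # largest eigenpair

# Random unit vectors as competing candidates for alpha_1.
V = rng.standard_normal((d, 10000))
V /= np.linalg.norm(V, axis=0)
variances = np.einsum("ij,ij->j", V, Sigma @ V)  # v^T Sigma v for each column v

print("lambda_1              =", lam1)
print("alpha_1^T Sigma alpha_1 =", alpha1 @ Sigma @ alpha1)
print("best random candidate =", variances.max())
```

No random direction exceeds $\lambda_1$, and the eigenvector $\alpha_1$ attains it.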

### 1.3.2  Find 2nd Principal Component

### \begin{equation}
\max\space \alpha_2^T\Sigma\alpha_2 \\
\text{s.t.}\space \alpha_2^T\alpha_2 = 1, \underbrace{cov[\alpha_1^TX,\alpha_2^TX]=0}_\text{make PCs uncorrelated}
\end{equation}

__Proof__:
### \begin{equation}
cov[\alpha_1^TX,\alpha_2^TX] \\
= E\{[\alpha_1^TX-E(\alpha_1^TX)][\alpha_2^TX-E(\alpha_2^TX)]^T\} \\
= \alpha_1^TE[(X-\mu_X)(X-\mu_X)^T]\alpha_2 \\
= \alpha_1^T\Sigma\alpha_2 \\
= \alpha_2^T\Sigma\alpha_1 \\
= \alpha_2^T\lambda_1\alpha_1 \\
= \lambda_1\alpha_2^T\alpha_1 = 0 \\
\Rightarrow
\left \{
  \begin{aligned}
    &\alpha_1^T\Sigma\alpha_2 = 0 && \\
    &\alpha_2^T\Sigma\alpha_1 = 0 && \\
    &\alpha_1^T\alpha_2 = 0 && \\
    &\alpha_2^T\alpha_1 = 0 && \\
  \end{aligned} \right. \\
\overset{Lagrange}{\Rightarrow} \alpha_2^T\Sigma\alpha_2 - \lambda(\alpha_2^T\alpha_2-1)-\phi(\alpha_2^T\alpha_1)\\
\frac{d}{d\alpha_2}[\alpha_2^T\Sigma\alpha_2 - \lambda(\alpha_2^T\alpha_2-1)-\phi(\alpha_2^T\alpha_1)] = 0 \\
\Rightarrow 2\Sigma\alpha_2 - 2\lambda\alpha_2-\phi\alpha_1 = 0 \\
\Rightarrow \Sigma\alpha_2 - \lambda\alpha_2-\tfrac{\phi}{2}\alpha_1 = 0 \\
\overset{\text{left-multiply by }\alpha_1^T}{\Rightarrow} \underset{=0}{\alpha_1^T\Sigma\alpha_2} - \lambda\underset{=0}{\alpha_1^T\alpha_2} - \tfrac{\phi}{2}\underset{=1}{\alpha_1^T\alpha_1} = 0 \\
\Rightarrow \phi = 0 \\
\Rightarrow (\Sigma - \lambda I_{d})\alpha_2 = 0
\end{equation}

### \begin{equation}
(\Sigma - \lambda I_{d})\alpha_2 = 0 \\
\Rightarrow\lambda\space \text{is an eigenvalue of}\space \Sigma, \text{and}\\
\alpha_2 \space\text{is the corresponding eigenvector} \\
\Rightarrow \max\space Var(\alpha_2^T X) = \max\space \alpha_2^T\Sigma\alpha_2 \\
= \max\space \alpha_2^T\lambda\alpha_2 \\
= \max\space \lambda, \text{ since}\space \alpha_2^T\alpha_2=1
\end{equation}

### Summary

1) The Lagrange multiplier $\lambda$ is an eigenvalue of the covariance matrix of X, $\Sigma$;

2) $\alpha_2$ is the corresponding eigenvector;

3) Finding the second PC $\iff$ finding the second-largest eigenvalue of the covariance matrix of X, $\Sigma$;

$\lambda$ must be as large as possible subject to $\alpha_2^T\alpha_1=0$, so (assuming distinct eigenvalues) $\alpha_2$ is the eigenvector corresponding to the second-largest eigenvalue of $\Sigma$;

### \begin{equation}
\Rightarrow \lambda_2 = \alpha_2^{*T}\Sigma\alpha_2^{*} = Var(\alpha_2^{*T}X)
\end{equation}
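The second component can also be checked numerically. The sketch below verifies that the two projections are uncorrelated, and uses deflation (subtracting $\lambda_1\alpha_1\alpha_1^T$ from $\Sigma$) as an alternative way to see the constraint at work. Deflation is not part of the derivation above; it is swapped in here only as an illustration, and `Sigma` is an invented matrix.

```python
import numpy as np

rng = np.random.default_rng(6)
d = 4
A = rng.standard_normal((d, d))
Sigma = A @ A.T                                   # arbitrary symmetric PSD matrix

eigvals, eigvecs = np.linalg.eigh(Sigma)          # ascending order
lam1, alpha1 = eigvals[-1], eigvecs[:, -1]        # first PC direction
lam2, alpha2 = eigvals[-2], eigvecs[:, -2]        # second PC direction

# Constraint from the derivation: cov[alpha_1^T X, alpha_2^T X] = alpha_1^T Sigma alpha_2.
cross_cov = alpha1 @ Sigma @ alpha2

# Deflation: remove the variance explained by alpha_1, then maximise again.
Sigma_deflated = Sigma - lam1 * np.outer(alpha1, alpha1)
defl_vals, defl_vecs = np.linalg.eigh(Sigma_deflated)
top_val, top_vec = defl_vals[-1], defl_vecs[:, -1]

print("alpha_1^T Sigma alpha_2 =", cross_cov)
print("lambda_2 =", lam2, " top eigenvalue after deflation =", top_val)
print("|alpha_2 . top_vec| =", abs(alpha2 @ top_vec))
```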

### 1.3.3 General Case

### \begin{equation}
\lambda_k = \alpha_k^{*T}\Sigma\alpha_k^{*} = Var(\alpha_k^{*T}X)
\end{equation}

We have now proved the statements in __1.2__, namely:

### $Y_k = \alpha_k^TX$

where, 

- ### $\alpha_{k}$ is an eigenvector of $\Sigma$ corresponding to its kth largest eigenvalue $\lambda_{k}$;

- ### Furthermore, if $\alpha_{k}^{T}\alpha_{k}=1$ (unit vector) $\Rightarrow$ $var[Y_{k}]=\lambda_{k}$;

This means that, to get w, we compute the eigenvalues and eigenvectors of the sample covariance matrix, sort the eigenvalues in descending order, and take the eigenvectors of the d' largest eigenvalues as the columns of w.

## 2. Solving PCA Using Eigenvector Decomposition

Now, we have

__Input__: $X_i\in R^d,X=[X_1,X_2,...,X_N]\in R^{\space d\times N}\space,\space i\space =\space 1,...,N$;

__Output__: Projection matrix $w$ and projected data $Y = w^TX$, $w\in R^{\space d\times d'} \Rightarrow w^TX\in R^{\space d'\times N}\space\Rightarrow ww^TX\in R^{\space d\times N}\space\text{(reconstruction)}$

## $Y = w^TX$

Since we want the covariance matrix of __Y__ to be a __diagonal matrix__, the next step is to compute __cov[Y]__.

### 2.1 Covariance Matrix of Y

## \begin{equation}
cov[Y] = cov[w^TX] = E\{[w^TX-E(w^TX)][w^TX-E(w^TX)]^T\} \\
\overset{\text{since mean of X is zero}}{=} E[w^TX(w^TX)^T] =w^TE[XX^T]w = w^T\Sigma w
\end{equation}

### Note:


__Spectral Theorem__: A __matrix__ is __symmetric__ if and only if it is __orthogonally diagonalizable__.

1) If A is __orthogonally diagonalizable__, then A is a __symmetric matrix__.

Proof: 
## \begin{equation}
A = ED E^T \\
A ^T = (ED E^T)^T = (E^T)^TD^TE^T = ED E^T = A \\
\Rightarrow A\space \text{is symmetric}
\end{equation}
Q.E.D

2) If A is a __symmetric matrix__, then A is __orthogonally diagonalizable__.

Proof:

By induction on the size n of the matrix. Base case: for every $1\times1$ matrix $A = [a]$, we have $A = [1][a][1]^T = UDU^T$ with $U=[1]$ and $D=[a]$;

Now, assume that

(**) every $(n-1)\times(n-1)$ symmetric matrix is orthogonally diagonalizable.

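By the Spectral Theorem, since $\Sigma$ is a symmetric matrix, we can write (taking $w$ to be the full $d\times d$ matrix of orthonormal eigenvectors and $\lambda$ the diagonal matrix of eigenvalues):

## \begin{equation}
\Sigma = w \lambda w^T \\
\Rightarrow cov[Y] = w^T \Sigma w = w^T w\lambda w^T w = (w^Tw)\lambda(w^Tw) = \lambda
\end{equation}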

- This means $\Sigma$ is diagonalized by w, so __cov[Y]__ is a diagonal matrix;

- For the covariance matrix of Y, all __off-diagonal__ terms are __zero__, and the __diagonal__ terms are the __eigenvalues__ of __cov[X]__, where __X__ has __zero mean__.
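The effect on the off-diagonal terms can be seen directly. This sketch compares the covariance matrix before and after the change of basis, on synthetic correlated data (the mixing matrix `M` is an assumption for illustration).

```python
import numpy as np

rng = np.random.default_rng(7)
d, N = 3, 1000
M = rng.standard_normal((d, d))
X = M @ rng.standard_normal((d, N))        # correlated variables, d x N
Xc = X - X.mean(axis=1, keepdims=True)     # zero-mean data

Sigma = Xc @ Xc.T / (N - 1)                # covariance of X, generally not diagonal
eigvals, w = np.linalg.eigh(Sigma)         # full d x d eigenvector matrix

Y = w.T @ Xc                               # change of basis
cov_Y = Y @ Y.T / (N - 1)                  # equals w^T Sigma w
off_diag = cov_Y - np.diag(np.diag(cov_Y))

print("cov[X]:\n", np.round(Sigma, 3))
print("cov[Y]:\n", np.round(cov_Y, 3))
```

The diagonal of `cov_Y` matches `eigvals` (in the ascending order returned by `eigh`), and the off-diagonal entries vanish up to floating-point error.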

### 2.2 Process of Computing PCA

1) Subtract the mean of each measurement (each row of X);

2) Compute the eigenvalues and eigenvectors of cov[X];

3) Take the eigenvectors corresponding to the d' largest eigenvalues as the columns of w;

4) Compute $w^TX$;

Now, instead of working on X, we can focus on $w^TX$.
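The steps above can be put together in a short function. This is a minimal sketch under the $d\times N$ layout used in these notes; the function name `pca`, its return values, and the synthetic data are assumptions for illustration.

```python
import numpy as np

def pca(X, d_prime):
    """PCA by eigendecomposition. X is d x N (columns are samples)."""
    mean = X.mean(axis=1, keepdims=True)
    Xc = X - mean                                 # 1) subtract the mean
    S = Xc @ Xc.T / (X.shape[1] - 1)              #    sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(S)          # 2) eigen-decomposition
    order = np.argsort(eigvals)[::-1][:d_prime]   # 3) indices of d' largest eigenvalues
    w = eigvecs[:, order]                         #    projection matrix, d x d'
    Y = w.T @ Xc                                  # 4) projected data, d' x N
    return w, Y, mean, eigvals[order]

rng = np.random.default_rng(8)
# Synthetic data lying close to a 2-D subspace of a 5-D space.
basis = rng.standard_normal((5, 2))
X = basis @ rng.standard_normal((2, 300)) + 0.01 * rng.standard_normal((5, 300))

w, Y, mean, lam = pca(X, d_prime=2)
X_rec = w @ Y + mean                              # reconstruction w w^T (X - mean) + mean
rel_err = np.linalg.norm(X - X_rec) / np.linalg.norm(X - mean)
print("relative reconstruction error:", rel_err)
```

Because the data are nearly two-dimensional, keeping two components reconstructs them with a small error. On real data the error depends on how quickly the eigenvalues decay.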

## 3. PCA Process

### 3.1 Step1 -> Subtract the mean from each of the dimensions for X

Subtract the mean $\rightarrow$ now all the data have mean 0. In this way, the __covariance matrix__ of __X__ can be calculated as:

## $cov[X] = E[XX^T]$

## \begin{equation}
\bar{X}=\frac{1}{N}\sum\limits_{i=1}^{N}X_i=\frac{1}{N}X\mathbf{1}_N
\end{equation}

where $\mathbf{1}_N$ is the all-ones vector of length N.

### $\text{for}\space i=1,...,N;\space \widetilde{X_i} = X_i-\bar{X}.$
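A sketch of the centring step in NumPy. Writing the mean as a product with an all-ones vector is an assumption about the intended matrix notation, and the offsets added to the data are invented. The covariance estimate here divides by N, the direct sample analogue of $E[XX^T]$; dividing by N-1 is the other common convention.

```python
import numpy as np

rng = np.random.default_rng(9)
d, N = 3, 50
offsets = np.array([[10.0], [-2.0], [0.5]])      # made-up non-zero means
X = rng.standard_normal((d, N)) + offsets        # d x N, columns are samples X_i

ones = np.ones((N, 1))
X_bar = X @ ones / N                             # mean vector, d x 1
X_tilde = X - X_bar                              # subtract the mean from every column

S = X_tilde @ X_tilde.T / N                      # sample estimate of E[X X^T] after centring

print("mean before:", X.mean(axis=1))
print("mean after: ", X_tilde.mean(axis=1))
```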
