# Principal Component Analysis

Here comes the next fitting technique. Principal component analysis (PCA) looks like the most promising fitting method so far, based on the paper by de Oliveira-Costa et al. (2008). It is an efficient way to compress data: we can fit the data with as few parameters as possible while maintaining accuracy. Using too many parameters risks overfitting.

Summary of steps for PCA:
<ol>
<li>Standardize the data (to ensure the data is at the same scale)</li>
<li>Find the covariance matrix </li>
<li>Compute eigenvalues and eigenvectors </li>
<li>Rank eigenvectors </li>
</ol>
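
As a rough orientation, the four steps can be sketched end to end on synthetic data. The array `X`, the random seed, and the noise level below are made up for illustration and are not the pixel data used in this notebook. The sketch also uses `np.linalg.eigh`, which is intended for symmetric matrices, rather than the `np.linalg.eig` call used later on.

```python
import numpy as np

# Synthetic stand-in for real data: 200 samples of 3 strongly correlated variables.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 1))
X = np.hstack([latent, 2.0 * latent, -latent]) + 0.1 * rng.normal(size=(200, 3))

# Step 1: center each column (dividing by the column std as well would fully standardize).
B = X - X.mean(axis=0)

# Step 2: covariance matrix, p x p.
C = B.T @ B / (X.shape[0] - 1)

# Step 3: eigenvalues and eigenvectors of the symmetric matrix C.
eigval, eigvec = np.linalg.eigh(C)

# Step 4: rank from largest to smallest eigenvalue.
order = np.argsort(eigval)[::-1]
eigval, eigvec = eigval[order], eigvec[:, order]

# Fraction of the total variance carried by each component.
explained = eigval / eigval.sum()
print(explained)
```

Because the three synthetic columns are driven by one hidden variable, nearly all of the variance should land in the first component.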

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import time
%matplotlib notebook 

In [2]:
data = np.load('500pixels.npz')
data_matrix = data['arr_0'] #matrix of intensity 500x30
x_freq = [0.01, 0.022, 0.045, 0.408, 1.42,2.326]

In [3]:
overlap_data = np.load('overlapping_pixels.npz')
overlap_matrix = overlap_data['arr_0']
overlap_x = np.arange(overlap_matrix.shape[1])
overlap_dict = dict(zip(x_freq,
                        (overlap_matrix[:,i] for i in range(overlap_matrix.shape[1])) #one column per frequency
                       ))

In [4]:
# %%time
# plt.figure()
# plt.plot(list(overlap_dict.keys()),list(overlap_dict.values()))
# plt.yscale('log')
# plt.savefig('graph')

In [5]:
# %%time
# plt.figure()
# for i in range(overlap_matrix.shape[0]):
#     plt.plot(x_freq,overlap_matrix[i])
#     plt.yscale('log')

## Standardization and Finding Covariance

To begin, we need to calculate the covariance matrix. Like the data, the covariance matrix is a two-dimensional array; for an $n \times p$ data matrix it has size $p \times p$, with one row and one column per variable. There are different ways to calculate the covariance matrix, but essentially each one requires centering the data set by subtracting off the mean. The steps used here to calculate the covariance matrix are as follows:

###  Standardize data set
To standardize it, I subtracted the mean of each column from the data set so that every column is centered on zero. Strictly speaking this is mean-centering; putting the columns on the same scale would also require dividing each one by its standard deviation, which is what the correlation matrix further down accounts for.
    Let $X$ be the data matrix of dimensions $n \times p$ such that $X = \begin{bmatrix}
            x_{11} & \ldots & x_{1p} \\
            \vdots &  & \vdots \\
            x_{n1} & \ldots & x_{np}\\
            \end{bmatrix} $
    <ol>
      <li>Found the mean of each column of the matrix, then put the means in a vector. Let $j = 1, \ldots, p$.
                 $$u_{j} = \frac{1}{n} \sum_{i=1}^{n}{x_{ij}} \quad \text{where} \quad \bar{u} = \begin{bmatrix}
                                                                                u_{1} \\
                                                                                \vdots \\
                                                                                u_{p} \\
                                                                                \end{bmatrix}$$</li>
            <li>Multiplied a vector of ones ($\bar{h}$, of size $n \times 1$) by the transposed mean vector ($1 \times p$) to create an $n \times p$ matrix of the mean values.
                 $$\bar{h}\bar{u}^{T}$$</li>
            <li>Subtracted the mean matrix ($n \times p$) from the $n \times p$ data matrix to form the $B$ matrix ($n \times p$).
                  $$B = X - \bar{h}\bar{u}^{T}$$</li>
</ol>
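
To make the three steps concrete, here is a small sketch on a made-up $3 \times 2$ matrix. The numbers and the names `B_outer` and `B_broadcast` are hypothetical. It also shows that NumPy broadcasting gives the same centered matrix without building $\bar{h}\bar{u}^{T}$ explicitly.

```python
import numpy as np

# Tiny made-up data matrix: n = 3 rows, p = 2 columns.
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 60.0]])
n, p = X.shape

u = X.mean(axis=0).reshape(p, 1)   # column means as a p x 1 vector
h = np.ones((n, 1))                # vector of ones, n x 1

B_outer = X - h @ u.T              # explicit h u^T construction from the steps above
B_broadcast = X - X.mean(axis=0)   # same subtraction, done by broadcasting

print(np.allclose(B_outer, B_broadcast))
print(B_outer.mean(axis=0))        # each column of B should now average to zero
```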

### Find the covariance matrix
 With the centered matrix found ($B$ matrix), the covariance matrix of size $p \times p$ can be calculated with
    $$C = \frac{1}{n-1} B^{T}B$$

Credit: I based this off the wiki page for PCA (https://en.wikipedia.org/wiki/Principal_component_analysis#Derivation_of_PCA_using_the_covariance_method)
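
One way to sanity-check the formula is to compare it with NumPy's built-in `np.cov`. The sketch below uses random made-up data, not the notebook's data files, and assumes the variables are stored in columns, hence `rowvar=False`.

```python
import numpy as np

# Made-up data: n = 50 rows (samples), p = 4 columns (variables).
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))

# Covariance from the formula C = B^T B / (n - 1).
B = X - X.mean(axis=0)
C = B.T @ B / (X.shape[0] - 1)

# np.cov uses the same n - 1 normalization by default;
# rowvar=False tells it that columns are the variables.
C_ref = np.cov(X, rowvar=False)

print(np.allclose(C, C_ref))
```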

In [6]:
#args matrix of data (n x p)
#returns covariance of data (p x p) and mean-centered data set (n x p)
def cov_matrix(x_matrix):
    mean_vector = np.c_[np.mean(x_matrix,axis=0)] #mean of each column of the input, as a p x 1 vector
    ones_vector = np.ones([x_matrix.shape[0],1]) #one vector (n x 1)
#     print('Mean value ', np.dot(ones_vector,mean_vector.T), 'and mean vector \n', mean_vector)
    b_matrix = x_matrix - np.dot(ones_vector,mean_vector.T) #subtracting mean value 
    cov = (np.dot(b_matrix.T,b_matrix))/(x_matrix.shape[0]-1) #covariance matrix formula
    return cov,b_matrix 

## Correlation Matrix

The correlation matrix is defined by:
$$R_{jk} = \frac{C_{jk}}{\sigma_{j} \sigma_{k}}$$
where $\sigma_{j} = \sqrt{C_{jj}}$ is the standard deviation of variable $j$, so that $-1 \leq R_{jk} \leq 1$ and $R_{jj} = 1$.
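
A minimal sketch of this definition, checked against `np.corrcoef`, is shown here. The data are random and deliberately put on very different scales. The names `R` and `R_ref` are illustrative only.

```python
import numpy as np

# Made-up data whose three columns have very different scales.
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 3)) * np.array([1.0, 10.0, 100.0])

C = np.cov(X, rowvar=False)        # covariance, columns are variables
sigma = np.sqrt(np.diag(C))        # standard deviation of each variable
R = C / np.outer(sigma, sigma)     # R_jk = C_jk / (sigma_j * sigma_k)

R_ref = np.corrcoef(X, rowvar=False)

print(np.allclose(R, R_ref))
print(np.diag(R))                  # diagonal entries should all be 1
```

Dividing by $\sigma_{j}\sigma_{k}$ removes the scale of each variable, which is why PCA on the correlation matrix can rank components differently from PCA on the covariance matrix.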

In [41]:
#arg matrix of data
#returns correlation matrix
def corr_matrix(x_matrix):
    c, stand = cov_matrix(x_matrix) #finding covariance of data matrix
    sigma = np.sqrt(np.diag(c)) #finding sigma vector from covariance 
    c /= sigma[:,None] #divide each row j of c_matrix by sigma_j
    c /= sigma[None,:] #divide each column k of c_matrix by sigma_k
    return c

## Eigenvalues and Eigenvectors

With the covariance matrix, we can find eigenvectors $\bar{v}$ such that $C\bar{v}=\lambda \bar{v}$ for eigenvalue $\lambda$. For a $p \times p$ covariance matrix there will be $p$ eigenvectors with a corresponding set of eigenvalues. To determine how much information or variance is attributed to each principal component, you can calculate the explained variance: take the sum of all the eigenvalues and divide each eigenvalue by that sum. The result is the fraction of the total variance that is explained by each eigenvalue.

With the explained variance, we can rank the eigenvalues, together with their corresponding eigenvectors, from highest to lowest to determine an order of significance.
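
As a small worked example, take a hypothetical $2 \times 2$ covariance matrix whose eigenvalues are easy to verify by hand (3 and 1). The matrix is invented for illustration. The sketch uses `np.linalg.eigh`, which assumes a symmetric input and returns eigenvalues in ascending order, so they are reordered afterwards.

```python
import numpy as np

# Hypothetical covariance matrix; its eigenvalues are 3 and 1.
C = np.array([[2.0, 1.0],
              [1.0, 2.0]])

eigval, eigvec = np.linalg.eigh(C)     # ascending order for symmetric input

# Rank from largest to smallest, keeping eigenvectors aligned with eigenvalues.
order = np.argsort(eigval)[::-1]
eigval, eigvec = eigval[order], eigvec[:, order]

explained = eigval / eigval.sum()      # 3/4 and 1/4 of the total variance
print(eigval, explained)
```

Here the first component explains 75% of the variance and the second explains 25%.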

In [42]:
#args matrix
#returns eigenvalues, eigenvectors, and list of (eigval, eigvec) tuples sorted from greatest to least eigenvalue
def eig_values(c):
    eigval,eigvec = np.linalg.eig(c) #finding eigenvalues and eigenvectors
    eig_pairs = [(eigval[i],eigvec[:,i]) for i in range(eigvec.shape[1])] #pairing each eigval with its eigvec
    #sorting by eigenvalue only (greatest to least); comparing whole tuples would fail on tied eigenvalues
    eig_pairs.sort(key=lambda pair: pair[0], reverse=True)
    return eigval, eigvec, eig_pairs

In [43]:
#args array of eigenvalues
#returns dictionary of ranked eigenvalues (keys are the rank, values are the explained variance of each eigenvalue)
def ordered_eigval(e):
    total_eig = np.sum(e)
    var_exp = np.sort(e/total_eig)[::-1] #calculating explained variance and sorting in descending order
    e_dict = dict(zip(np.arange(1,var_exp.size+1),var_exp)) #adding ordered eigenvalues to dictionary with rank as key
    return e_dict

In [29]:
cov,stand_matrix = cov_matrix(data_matrix) #covariance and standardized data matrix
corr = corr_matrix(data_matrix) #correlation matrix

#eigenvalues and eigenvectors for covariance
eigval,eigvec,eigpairs = eig_values(cov)
eigval_dict = ordered_eigval(eigval)

#eigenvalues and eigenvectors for correlation
eigval2,eigvec2,eigpairs2 = eig_values(corr)
eigval_dict2 = ordered_eigval(eigval2)

In [45]:
plt.figure()
plt.scatter(list(eigval_dict.keys()),list(eigval_dict.values()))
plt.plot(list(eigval_dict.keys()),list(eigval_dict.values()))
plt.yscale('log')
plt.xlabel('Principal component number')
plt.ylabel('Fraction of variance explained')
plt.title('Rank of Eigenvalues for Covariance Matrix')
plt.show()

In [51]:
plt.figure()
plt.scatter(list(eigval_dict2.keys()),list(eigval_dict2.values()))
plt.plot(list(eigval_dict2.keys()),list(eigval_dict2.values()))
plt.yscale('log')
plt.xlabel('Principal component number')
plt.ylabel('Fraction of variance explained')
plt.title('Rank of Eigenvalues for Correlation Matrix')

In [48]:
plt.figure()
plt.title('Eigenvectors for Covariance Matrix')
for i in np.arange(eigvec.shape[1]):
#     plt.plot(eigvec[:,i])
    plt.quiver([0,0],*eigvec[:,i],scale=1.15,color='b')
plt.show()

In [49]:
plt.figure()
plt.title('Eigenvectors for Correlation Matrix')
for i in np.arange(eigvec2.shape[1]):
#     plt.plot(eigvec[:,i])
    plt.quiver([0,0],*eigvec2[:,i],scale=1.15,color='orange')
plt.show()

## Feature Vector and Final Data

Based on the explained variance, you can leave out the components that are less significant, which results in the final data having fewer dimensions than the original. We put the eigenvectors of importance, one per column, in the <b>feature vector</b>.

After forming the feature vector, we can determine the final data set, which reorients the data so that it is represented by the principal components instead of the original axes. You multiply the transpose of the feature vector by the transpose of the centered data matrix (the $B$ matrix).
$$\text{FinalData} = \text{FeatureVector}^{T} \, B^{T}$$
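
The projection can be sketched on synthetic two-column data, along with an approximate reconstruction to show how little is lost by keeping only the first component. The data, seed, and the names `X_approx` and `rel_err` are assumptions made for this illustration. The reconstruction step is an addition that is not part of the notebook's own pipeline.

```python
import numpy as np

# Made-up data: two columns that mostly follow one hidden variable t.
rng = np.random.default_rng(3)
t = rng.normal(size=(300, 1))
X = np.hstack([t, 0.5 * t]) + 0.05 * rng.normal(size=(300, 2))

mean = X.mean(axis=0)
B = X - mean                               # centered data, n x p
C = B.T @ B / (X.shape[0] - 1)             # covariance, p x p

eigval, eigvec = np.linalg.eigh(C)         # ascending eigenvalues
feature_vec = eigvec[:, [-1]]              # p x 1, eigenvector of the largest eigenvalue

final_data = feature_vec.T @ B.T           # FinalData = FeatureVector^T B^T, shape 1 x n

# Map back to the original axes and add the mean back to approximate X.
X_approx = (feature_vec @ final_data).T + mean
rel_err = np.linalg.norm(X - X_approx) / np.linalg.norm(B)

print(final_data.shape, rel_err)
```

The variance of the projected data equals the eigenvalue of the component that was kept, which is one way to check that the feature vector and the data matrix are oriented consistently.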

In [14]:
feature_vec = np.array(eigpairs[0][1]).reshape(cov.shape[0],1) #feature vector: top-ranked eigenvector as a p x 1 column
# print(feature_vec)
final_data = np.dot(feature_vec.T,stand_matrix.T) #multiplying transpose of feature vector and transpose of centered matrix
print(final_data.shape)

(1, 500)
