# Principal Component Analysis

Here comes the next fitting technique. Principal component analysis (PCA) looks like the most promising fitting method, as shown in de Oliveira-Costa et al. (2008). It is an efficient way to compress data: we can fit the data with as few parameters as possible while maintaining accuracy. Using too many parameters risks overfitting.

Summary of steps for PCA:
<ol>
<li>Standardize the data (to ensure the data is at the same scale)</li>
<li>Find the covariance matrix </li>
<li>Compute eigenvalues and eigenvectors </li>
<li>Rank eigenvectors </li>
</ol>
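
The four steps above can be sketched end to end on a small synthetic matrix. This is only a minimal illustration, not the pipeline used in the rest of this notebook: the array `X_demo`, the random seed, and the other `_demo` names are made up for the example, and `np.linalg.eigh` is used on the assumption that the covariance matrix is symmetric.

```python
import numpy as np

# Hypothetical toy data: 200 samples of 3 strongly correlated variables
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
X_demo = np.hstack([base, 2.0 * base, -base]) + 0.1 * rng.normal(size=(200, 3))

# 1. Center each column by subtracting its mean
B_demo = X_demo - X_demo.mean(axis=0)

# 2. Covariance matrix (p x p)
C_demo = B_demo.T @ B_demo / (X_demo.shape[0] - 1)

# 3. Eigenvalues and eigenvectors (eigh is intended for symmetric matrices)
evals_demo, evecs_demo = np.linalg.eigh(C_demo)

# 4. Rank from largest to smallest eigenvalue
order_demo = np.argsort(evals_demo)[::-1]
evals_demo = evals_demo[order_demo]
evecs_demo = evecs_demo[:, order_demo]

# Fraction of the total variance carried by each component
print(evals_demo / evals_demo.sum())
```

Because the three toy columns are almost multiples of one another, nearly all of the variance should land in the first component.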

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import time
%matplotlib notebook 

In [29]:
#args: file name of the data (inside Data/) and list of frequencies
#returns data matrix and data dictionary
def data_matrix(path, freq):
    data = np.load('Data/'+path)
    d_matrix = data['arr_0'] #data matrix (rows are pixels, columns are frequencies)
    #each column of the matrix holds the intensity of every pixel at one frequency,
    #so iterate over the columns (shape[1]) and pair each column with its frequency
    data_dict = dict(zip(freq,(d_matrix[:,i] for i in range(d_matrix.shape[1])))) #frequencies are keys, intensity columns are values
    return d_matrix, data_dict

In [50]:
freq_clean = [0.01, 0.022, 0.045, 0.408, 1.42,2.326] #clean frequency
clean_matrix,clean_dict = data_matrix('all_overlap.npz',freq_clean)

freq_noisy = [0.005, 0.01 , 0.016, 0.022, 0.034, 0.045, 0.227, 0.408, 0.914, 1.42 , 1.87, 2.326, 3.145] #noisy frequency
noisy_matrix,noisy_dict = data_matrix('noisy_data.npz',freq_noisy)

smooth_matrix,smooth_dict = data_matrix('smooth_data.npz',freq_clean)
smooth_noisy_matrix,smooth_noisy_dict = data_matrix('smooth_noisy_data.npz',freq_noisy)

#storing variables to use in other notebooks
%store clean_dict
%store freq_clean
%store noisy_dict
%store smooth_dict
%store smooth_noisy_dict
%store clean_matrix
%store smooth_matrix

#retrieving stored variables from Least_Squares file
%store -r pow_freq_fit

Stored 'clean_dict' (dict)
Stored 'freq_clean' (list)
Stored 'noisy_dict' (dict)
Stored 'smooth_dict' (dict)
Stored 'smooth_noisy_dict' (dict)
Stored 'clean_matrix' (ndarray)
Stored 'smooth_matrix' (ndarray)


In [51]:
fit_y = np.exp(pow_freq_fit) #finding y values from log form
mean_y = np.mean(clean_matrix, axis = 0) #mean of clean spectra

num_rows = clean_matrix.shape[0] #number of rows of clean matrix
num_col = clean_matrix.shape[1] #number of columns of clean matrix

#dividing each row of clean matrix by the fitted values (fit_y); broadcasting replaces the row loop
avg_clean_matrix = clean_matrix / fit_y.reshape(1,num_col)

avg_clean_dict = dict(zip(freq_clean,(avg_clean_matrix[:,i] for i in range(avg_clean_matrix.shape[1]))))

#storing values to use in other notebooks
%store avg_clean_dict
%store mean_y
%store fit_y

Stored 'avg_clean_dict' (dict)
Stored 'mean_y' (ndarray)
Stored 'fit_y' (ndarray)


In [22]:
plt.figure()
plt.scatter(freq_clean,mean_y)
plt.plot(freq_clean,fit_y)
plt.title('Average of Clean Data')
plt.yscale('log')
plt.show()


## Standardization and Finding Covariance

To begin, we need to calculate the covariance matrix. For a data matrix with $n$ rows (observations) and $p$ columns (variables), the covariance matrix is a square $p \times p$ matrix; for the clean data, $p = 6$ (one column per frequency). There are different ways to calculate the covariance matrix, but essentially it requires centering the data set by subtracting off the mean. The steps used here to calculate the covariance matrix are as follows:
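
### Standardize data set
To standardize it, I subtracted the mean of each column from the data set so that every column is centered on zero. Note that this centers the data but does not rescale it; the correlation matrix further down also divides by the standard deviations.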
Let $X$ be the $n \times p$ data matrix such that
$$X = \begin{bmatrix}
x_{11} & \ldots & x_{1p} \\
\vdots &  & \vdots \\
x_{n1} & \ldots & x_{np}
\end{bmatrix}$$
<ol>
<li>Found the mean of each column of the matrix and put the means in a vector. Let $j = 1, \ldots, p$.
$$u_{j} = \frac{1}{n} \sum_{i=1}^{n} x_{ij} \quad \text{where} \quad \bar{u} = \begin{bmatrix}
u_{1} \\
\vdots \\
u_{p}
\end{bmatrix}$$</li>
<li>Multiplied a vector of ones ($\bar{h}$, of size $n \times 1$) by the transposed mean vector ($1 \times p$) to create an $n \times p$ matrix of the mean values.
$$\bar{h}\bar{u}^{T}$$</li>
<li>Subtracted the mean matrix ($n \times p$) from the $n \times p$ data matrix to form the $B$ matrix ($n \times p$).
$$B = X - \bar{h}\bar{u}^{T}$$</li>
</ol>

### Find the covariance matrix
With the mean-subtracted matrix $B$ found, the covariance matrix of size $p \times p$ can be calculated with
    $$C = \frac{1}{n-1} B^{T}B$$

Credit: I based this off the wiki page for PCA (https://en.wikipedia.org/wiki/Principal_component_analysis#Derivation_of_PCA_using_the_covariance_method)

In [6]:
#args matrix of data (rows are observations, columns are variables)
#returns covariance of data and mean-subtracted data set
def cov_matrix(x_matrix):
    mean_vector = np.mean(x_matrix,axis=0) #mean of each column (the u vector)
    #broadcasting subtracts each column mean from every row, same as X - h u^T
    b_matrix = x_matrix - mean_vector
    cov = np.dot(b_matrix.T,b_matrix)/(x_matrix.shape[0]-1) #covariance matrix formula
    return cov,b_matrix
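
As a quick sanity check, the formula $C = \frac{1}{n-1} B^{T}B$ can be compared against NumPy's built-in `np.cov`. The sketch below is self-contained and uses a made-up random matrix `X_check` rather than the sky data; the formula is written inline so the block runs on its own.

```python
import numpy as np

# Hypothetical test matrix: 50 rows (observations), 4 columns (variables)
rng = np.random.default_rng(1)
X_check = rng.normal(size=(50, 4))

# Covariance from the formula C = B^T B / (n - 1)
B_check = X_check - X_check.mean(axis=0)
C_manual = B_check.T @ B_check / (X_check.shape[0] - 1)

# np.cov treats rows as variables by default, so rowvar=False is needed here
C_numpy = np.cov(X_check, rowvar=False)

print(np.allclose(C_manual, C_numpy))
```

The two results should agree to floating-point precision, since `np.cov` also divides by $n-1$ by default.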

## Correlation Matrix

The correlation matrix is defined by:
$$R_{jk} = \frac{C_{jk}}{\sigma_{j} \sigma_{k}}$$
where $\sigma_{j} = \sqrt{C_{jj}}$ is the standard deviation of column $j$, so that $-1 \leq R_{jk} \leq 1$ and $R_{jj} = 1$.

In [7]:
#arg matrix of data
#returns correlation matrix
def corr_matrix(x_matrix):
    c, stand = cov_matrix(x_matrix) #finding covariance of data matrix
    sigma = np.sqrt(np.diag(c)) #standard deviation of each column, from the diagonal of the covariance
    c /= sigma[:,None] #divide row j of the covariance matrix by sigma_j
    c /= sigma[None,:] #divide column k of the covariance matrix by sigma_k
    return c
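
The definition of $R_{jk}$ can be checked the same way against `np.corrcoef`. Again this is only a sketch on a made-up matrix (`X_scaled`, with columns deliberately put on very different scales), not on the notebook's data.

```python
import numpy as np

# Hypothetical data whose columns have very different scales
rng = np.random.default_rng(2)
X_scaled = rng.normal(size=(50, 4)) * np.array([1.0, 10.0, 0.1, 5.0])

# Covariance, then R_jk = C_jk / (sigma_j * sigma_k)
C_scaled = np.cov(X_scaled, rowvar=False)
sigma_scaled = np.sqrt(np.diag(C_scaled))
R_manual = C_scaled / np.outer(sigma_scaled, sigma_scaled)

# NumPy's built-in correlation matrix (columns are variables)
R_numpy = np.corrcoef(X_scaled, rowvar=False)

print(np.allclose(R_manual, R_numpy))
```

Dividing by $\sigma_{j}\sigma_{k}$ removes the scale of each column, which is why the diagonal of the result is all ones.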

## Eigenvalues and Eigenvectors

With the covariance matrix, we can find eigenvectors $\bar{v}$ such that $C\bar{v}=\lambda \bar{v}$ for eigenvalue $\lambda$. For a $p \times p$ covariance matrix there will be $p$ eigenvectors, each with a corresponding eigenvalue. To determine how much information (variance) is attributed to each principal component, you can calculate the explained variance: sum all the eigenvalues and divide each eigenvalue by that sum. The result is the fraction of the total variance that is explained by each eigenvalue.

With the explained variance, we can rank the eigenvalues, together with their corresponding eigenvectors, from highest to lowest to determine an order of significance.
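
A small worked example may make the explained variance concrete. The $2 \times 2$ matrix `C_toy` below is hypothetical and chosen so that its eigenvalues are exactly 3 and 1, which gives explained variances of $3/4$ and $1/4$. It uses `np.linalg.eigh`, which assumes a symmetric matrix and returns eigenvalues in ascending order.

```python
import numpy as np

# Hypothetical 2x2 covariance matrix with eigenvalues 3 and 1
C_toy = np.array([[2.0, 1.0],
                  [1.0, 2.0]])

# eigh returns eigenvalues in ascending order, so reverse to rank them
evals_toy, evecs_toy = np.linalg.eigh(C_toy)
evals_toy = evals_toy[::-1]
evecs_toy = evecs_toy[:, ::-1]

# Explained variance: each eigenvalue divided by the sum of all eigenvalues
explained_toy = evals_toy / evals_toy.sum()
print(explained_toy)  # approximately [0.75 0.25]
```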

In [8]:
#args matrix
#returns eigenvalues, eigenvectors, and list of (eigval, eigvec) tuples sorted from greatest to least
def eig_values(c):
    eigval,eigvec = np.linalg.eig(c) #finding eigenvalues and eigenvectors
    eig_pairs = [(eigval[i],eigvec[:,i]) for i in range(eigvec.shape[1])] #creating tuples of eigval and eigvec
    #sort on the eigenvalue only; comparing whole tuples would fall back to comparing arrays when eigenvalues tie
    eig_pairs.sort(key=lambda pair: pair[0], reverse=True) #greatest to least
    return eigval, eigvec, eig_pairs

In [9]:
#args array of eigenvalues
#returns dictionary of ranked eigenvalues (keys number of rank and values explained variance of each eigenvalue)
def ordered_eigval(e):
    total_eig = np.sum(e)
    var_exp = np.sort(e/total_eig)[::-1] #explained variance, sorted in descending order
    e_dict = dict(zip(np.arange(1,var_exp.size+1),var_exp)) #adding ordered explained variance to dictionary with rank as key
    return e_dict

In [23]:
cov,stand_matrix = cov_matrix(clean_matrix) #covariance and standardized data matrix
corr = corr_matrix(clean_matrix) #correlation matrix
avg_cov,s = cov_matrix(avg_clean_matrix)
noisy_cov,noisy_stand_matrix = cov_matrix(noisy_matrix)

#eigenvalues and eigenvectors for covariance
eigval,eigvec,eigpairs = eig_values(cov)
eigval_dict = ordered_eigval(eigval)

#eigenvalues and eigenvectors for correlation
eigval2,eigvec2,eigpairs2 = eig_values(corr)
eigval_dict2 = ordered_eigval(eigval2)

#eigenvalues and eigenvectors for averaged data
eigval3,eigvec3,eigpairs3 = eig_values(avg_cov)
eigval_dict3 = ordered_eigval(eigval3)

#eigenvalues and eigenvectors for noisy data
eigval4,eigvec4,eigpairs4 = eig_values(noisy_cov)
eigval_dict4 = ordered_eigval(eigval4)


In [49]:
print(corr)

[[1.         0.99665931 0.98630643 0.86412482 0.67437979 0.58979366]
 [0.99665931 1.         0.99645151 0.8978503  0.71405513 0.62832145]
 [0.98630643 0.99645151 1.         0.92841164 0.75480163 0.66942596]
 [0.86412482 0.8978503  0.92841164 1.         0.93339775 0.87601192]
 [0.67437979 0.71405513 0.75480163 0.93339775 1.         0.99039294]
 [0.58979366 0.62832145 0.66942596 0.87601192 0.99039294 1.        ]]


In [30]:
def graph_eigval(title, eig_dict):
    ranks = list(eig_dict.keys()) #rank of each principal component
    var_exp = list(eig_dict.values()) #explained variance of each component
    plt.figure()
    plt.scatter(ranks,var_exp)
    plt.plot(ranks,var_exp)
    plt.yscale('log')
    plt.xlabel('Principal component number')
    plt.ylabel('Fraction of variance explained')
    plt.title(title)
    plt.show()

In [33]:
graph_eigval('Rank of Eigenvalues for Covariance Matrix',eigval_dict)
graph_eigval('Rank of Eigenvalues for Correlation Matrix',eigval_dict2)
graph_eigval('Rank of Eigenvalues for Averaged Data',eigval_dict3)
graph_eigval('Rank of Eigenvalues for Noisy Covariance Matrix',eigval_dict4)


In [52]:
plt.figure()
plt.title('Eigenvectors for Covariance Matrix')
for i in np.arange(1):
    plt.plot(eigvec[:,i])
#     plt.quiver([0,0],*eigvec[:,i],scale=1.15,color='b')
plt.show()

plt.figure()
plt.title('Eigenvectors for Correlation Matrix')
for i in np.arange(eigvec2.shape[1]):
    plt.plot(eigvec2[:,i])
#     plt.quiver([0,0],*eigvec2[:,i],scale=1.15,color='orange')
plt.show()


## Feature Vector and Final Data

Based on the explained variance, you can leave out the components that are less significant, which results in the final data having fewer dimensions than the original. We put the eigenvectors of importance, as columns, in the <b>feature vector</b>.

After forming the feature vector, we can determine the final data set, which reorients the data so that it is represented by the principal components instead of the original axes. You multiply the transpose of the feature vector by the transpose of the standardized data matrix (the $B$ matrix).
$$\text{FinalData} = \text{FeatureVector}^{T} \, B^{T}$$

In [17]:
feature_vec = np.array(eigpairs[0][1]).reshape(cov.shape[0],1) #feature vector
# print(feature_vec)
final_data = np.dot(feature_vec.T,stand_matrix.T) #multiplying transpose of feature vector and transpose of normalized matrix
print(final_data.shape)

(1, 163860)
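
To see what keeping only the leading component does, the projection can be written out on toy data and then mapped back to the original axes. This is a hedged sketch: the `_demo` names and the synthetic matrix are made up, and the reconstruction step (`X_approx`) is an extra illustration that is not part of the cells above.

```python
import numpy as np

# Hypothetical toy data: 100 samples of 3 nearly collinear variables
rng = np.random.default_rng(3)
base = rng.normal(size=(100, 1))
X_demo = np.hstack([base, 0.5 * base, -2.0 * base]) + 0.05 * rng.normal(size=(100, 3))

# Center the data and form the covariance matrix
mean_demo = X_demo.mean(axis=0)
B_demo = X_demo - mean_demo
C_demo = B_demo.T @ B_demo / (B_demo.shape[0] - 1)

# Keep the k eigenvectors with the largest eigenvalues as the feature vector
evals_demo, evecs_demo = np.linalg.eigh(C_demo)
order_demo = np.argsort(evals_demo)[::-1]
k = 1
feature_demo = evecs_demo[:, order_demo[:k]]          # shape (p, k)

# FinalData = FeatureVector^T B^T
final_demo = feature_demo.T @ B_demo.T                # shape (k, n)

# Map back to the original axes and add the mean back
X_approx = (feature_demo @ final_demo).T + mean_demo  # shape (n, p)

# Relative size of what was lost by dropping the other components
rel_err = np.linalg.norm(X_demo - X_approx) / np.linalg.norm(B_demo)
print(final_demo.shape, rel_err)
```

A small `rel_err` means the dropped components carried little of the variance, which is the sense in which PCA compresses the data.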
