# Functions for Principal Component Analysis

Here comes the next fitting technique. Wooo! Principal component analysis (PCA) looks like the most promising fitting method so far, as shown in de Oliveira-Costa et al. (2008). It is an efficient way to compress data: we can fit the data with as few parameters as possible while maintaining accuracy. Using too many parameters risks overfitting.

Summary of steps for PCA:
<ol>
<li>Standardize the data</li>
<li>Find the covariance matrix </li>
<li>Compute eigenvalues and eigenvectors </li>
<li>Rank eigenvectors </li>
</ol>

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import time
%matplotlib notebook 

In [2]:
#args path to data and frequency
#returns data matrix and data dictionary
def data_matrix(path, freq):
    data = np.load('Data/'+path)
    d_matrix = data['arr_0'] #data matrix
    #for the y values iterate through number of pixels and get the column of intensity 
    #column of intensity corresponds to all intensity values of that certain freq
    #iterate over the columns (shape[1]), one column per frequency
    data_dict = dict(zip(freq,(d_matrix[:,i] for i in range(d_matrix.shape[1]))))#x points keys and ypoints  values
    return d_matrix, data_dict

## Standardization and Finding Covariance

To begin, we need to calculate the covariance matrix. Like the data, the covariance matrix is a 2-dimensional array; for an $n \times p$ data matrix it has size $p \times p$. There are different ways to calculate the covariance matrix, but essentially it requires centering the data set by subtracting off the mean. The steps used here to calculate the covariance matrix are as follows:
<br>
###  Standardize data set
To standardize it, I just subtracted the mean value of each column from the data set so that every column is centered on zero. (This centers the data; it does not rescale it.)
    Let $X$ be the matrix of the data of $n \times p$ dimensions such that $X = \begin{bmatrix}
            x_{11} & \ldots & x_{1p} \\
            \vdots &  & \vdots \\
            x_{n1} & \ldots & x_{np}\\
            \end{bmatrix} $
    <ol>
      <li>Found the mean of each column of the matrix then put it in a vector. Let $j = 1,\ldots,p$.
                 $$u_{j} = \frac{1}{n} \sum_{i=1}^{n}{x_{ij}} \quad \text{where} \quad \bar{u} = \begin{bmatrix}
                                                                                u_{1} \\
                                                                                \vdots \\
                                                                                u_{p} \\
                                                                                \end{bmatrix}$$</li>
            <li>Multiplied a vector of ones ($\bar{h}$, of size $n \times 1$) and the transposed mean vector ($1 \times p$) to create an $n \times p$ matrix of the mean values.
                 $$\bar{h}\bar{u}^{T}$$</li>
            <li>Subtracted the mean matrix ($n \times p$) from the $n \times p$ matrix of the data set to form the $B$ matrix ($n \times p$).
                  $$B = X - \bar{h}\bar{u}^{T}$$</li>
</ol>

### Find the covariance matrix
 With the standardized matrix found ($B$ matrix), the covariance matrix of size $p \times p$ can be calculated with
    $$C = \frac{1}{n-1} B^{T}B$$

Credit: I based this off the wiki page for PCA (https://en.wikipedia.org/wiki/Principal_component_analysis#Derivation_of_PCA_using_the_covariance_method)
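
As a quick sanity check, here is a minimal sketch of the steps above on toy data, compared against NumPy's built-in `np.cov`. The matrix `X`, its size, and the random seed are made up for illustration and are not part of the real data set.

```python
import numpy as np

# Hypothetical toy data: n = 5 samples (rows), p = 3 variables (columns).
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))

n = X.shape[0]
u = X.mean(axis=0)          # mean of each column, shape (p,)
h = np.ones((n, 1))         # column vector of ones, shape (n, 1)
B = X - h @ u[None, :]      # B = X - h u^T, the mean-centered data
C = B.T @ B / (n - 1)       # p x p covariance matrix

# np.cov treats rows as variables by default, so pass rowvar=False.
print(np.allclose(C, np.cov(X, rowvar=False)))
```

If the formula is implemented correctly, the two results should agree to floating-point precision.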

In [3]:
#args matrix of data (rows are samples, columns are variables)
#returns covariance of data and mean-centered data set
def cov_matrix(x_matrix):
    mean_vector = np.c_[np.mean(x_matrix,axis=0)] #mean of each column as a (p,1) column vector
    ones_vector = np.ones([x_matrix.shape[0],1]) #(n,1) vector of ones
    b_matrix = x_matrix - np.dot(ones_vector,mean_vector.T) #subtracting the mean of each column
    cov = (np.dot(b_matrix.T,b_matrix))/(x_matrix.shape[0]-1) #covariance matrix formula
    return cov,b_matrix 

## Correlation Matrix

The correlation matrix is defined by:
$$R_{jk} = \frac{C_{jk}}{\sigma_{j} \sigma_{k}}$$
so that $-1 \leq R_{jk} \leq 1$ and $R_{jj} = 1$.
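
A small sketch of this definition, checked against `np.corrcoef`. The toy matrix `X` is a made-up example, and `np.outer(sigma, sigma)` is just one way to build the $\sigma_{j}\sigma_{k}$ denominator.

```python
import numpy as np

# Hypothetical toy data: 50 samples (rows), 4 variables (columns).
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))

C = np.cov(X, rowvar=False)        # p x p covariance matrix
sigma = np.sqrt(np.diag(C))        # standard deviation of each variable
R = C / np.outer(sigma, sigma)     # R_jk = C_jk / (sigma_j * sigma_k)

print(np.allclose(R, np.corrcoef(X, rowvar=False)))
print(np.allclose(np.diag(R), 1.0))
```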

In [4]:
#arg matrix of data
#returns correlation matrix
def corr_matrix(x_matrix):
    c, _ = cov_matrix(x_matrix) #finding covariance of data matrix (centered data not needed here)
    sigma = np.sqrt(np.diag(c)) #standard deviation of each variable from the diagonal of the covariance
    c /= sigma[:,None] #divide row j of c_matrix by sigma_j
    c /= sigma[None,:] #divide column k of c_matrix by sigma_k
    return c

## Eigenvalues and Eigenvectors

With the covariance matrix, we can find eigenvectors $\bar{v}$ such that $C\bar{v}=\lambda \bar{v}$ for eigenvalue $\lambda$. A $p \times p$ covariance matrix has $p$ eigenvectors, each with a corresponding eigenvalue. To determine how much of the variance is attributed to each principal component, we can calculate the explained variance: sum all the eigenvalues and divide each eigenvalue by that sum. The result is the fraction of the total variance explained by each principal component.

With the explained variance, we can rank the eigenvectors by their eigenvalues from highest to lowest to determine an order of significance.
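
Here is a minimal sketch of the explained-variance calculation on toy data. The data and column scales are invented for illustration. Note that this sketch uses `np.linalg.eigh`, which is meant for symmetric matrices and returns eigenvalues in ascending order; that is my substitution here, not what the function in the next cell uses.

```python
import numpy as np

# Hypothetical toy data: three columns with very different variances.
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 3)) * np.array([5.0, 1.0, 0.1])
C = np.cov(X, rowvar=False)

# eigh handles symmetric matrices and returns ascending eigenvalues.
eigval, eigvec = np.linalg.eigh(C)

# Rank from highest to lowest eigenvalue, keeping eigenvectors aligned.
order = np.argsort(eigval)[::-1]
eigval = eigval[order]
eigvec = eigvec[:, order]

# Fraction of the total variance explained by each principal component.
explained = eigval / eigval.sum()
print(explained)
```

The fractions sum to one, and with scales this different the first component should carry almost all of the variance.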

In [5]:
#args matrix
#returns eigenvalues, eigenvectors, and list of (eigval, eigvec) tuples sorted from greatest to least eigenvalue
def eig_values(c):
    eigval,eigvec = np.linalg.eig(c) #finding eigenvalues and eigenvectors
    eig_pairs = [(eigval[i],eigvec[:,i]) for i in range(eigvec.shape[1])] #creating a list of (eigval, eigvec) tuples
    #sort on the eigenvalue only; sorting whole tuples compares the arrays when eigenvalues tie and raises an error
    eig_pairs.sort(key=lambda pair: pair[0], reverse=True) #greatest to least
    return eigval, eigvec, eig_pairs

In [6]:
#args eigenvalue
#returns dictionary of ranked eigenvalues (keys number of rank and values explained variance of each eigenvalue)
def ordered_eigval(e):
    total_eig = np.sum(e)
    var_exp = np.sort(e/total_eig) #calculating and sorting explained variance (ascending)
    var_exp = var_exp[::-1] #reverse array to descending order
    e_dict = dict(zip(np.arange(1,var_exp.size+1),var_exp)) #adding ordered explained variances to dictionary with rank as key
    return e_dict

In [1]:
#args title of graph, dictionary that holds ranked eigenvalues
#returns graph of ranked eigenvalues
def graph_eigval(title, eig_dict):
    plt.figure()
    plt.scatter(list(eig_dict.keys()),list(eig_dict.values())) #convert dict views to lists before plotting
    plt.plot(list(eig_dict.keys()),list(eig_dict.values()))
    plt.yscale('log')
    plt.xlabel('Principal component number')
    plt.ylabel('Fraction of variance explained')
    plt.title(title)
    plt.show()

## Cubic Spline Interpolation

Interpolation is the estimation of a value between known data points, where the estimated curve passes through all of the given points. Spline interpolation is a type of interpolation where the interpolant is a piecewise polynomial known as a spline. A cubic spline is smooth (its first and second derivatives are continuous at every data point) while using the lowest polynomial degree that allows this. It also avoids Runge's phenomenon, where interpolating with a single high-degree polynomial can result in unexpected oscillations.
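
A minimal sketch with `scipy.interpolate.CubicSpline`, using made-up sample points from a sine curve, to show that the spline passes through every given point and can then be evaluated anywhere in between.

```python
import numpy as np
from scipy.interpolate import CubicSpline

# Hypothetical sample points taken from a sine curve.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.sin(x)

spline = CubicSpline(x, y)          # piecewise cubic through (x, y)

# Evaluate between the original points.
x_new = np.linspace(0.0, 4.0, 9)
y_new = spline(x_new)

# The interpolant reproduces the original points.
print(np.allclose(spline(x), y))
```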

Summary of steps for cubic spline interpolation for PCA:
<ol>
<li>Normalize data
    <ul>
    <li>Find rms: square each column of data (each column holds the intensity values at one frequency), average those results, then take the square root. The result is one rms value per frequency </li>
        <li>Divide each column of data by the corresponding rms value for that frequency</li> </ul></li>
<li>Use normalized data to find principal component fits</li>
<li>Interpolate rms values and eigenvectors with log frequency
   <ul> <li>Note: frequency values substituted into the interpolation function are uniform in log</li></ul></li>
<li>Find fits by multiplying coefficients and interpolated eigenvectors</li>
<li>Undo normalization by multiplying with interpolated rms</li>
</ol>
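
The steps above can be sketched end to end as follows. Everything here is an assumption for illustration: the synthetic two-power-law "sky", the frequencies, the choice of two components, and the names `reconstruct`, `pcs`, and `coef`. To keep the reconstruction simple, this sketch also does PCA on the uncentered second-moment matrix of the normalized data rather than on a mean-subtracted covariance matrix.

```python
import numpy as np
from scipy.interpolate import CubicSpline

# Hypothetical synthetic data: 100 pixels, 6 frequencies, two power laws.
rng = np.random.default_rng(3)
freq = np.logspace(1.0, 3.0, 6)                    # uniform in log
a = rng.uniform(1.0, 2.0, size=(100, 1))
b = rng.uniform(0.1, 0.5, size=(100, 1))
data = a * (freq / 100.0) ** -2.5 + b * (freq / 100.0) ** -2.1  # (npix, nfreq)

# 1. Normalize: one rms value per frequency (column).
rms = np.sqrt(np.mean(data ** 2, axis=0))
z = data / rms

# 2. Principal components of the normalized data (uncentered, see above).
m = z.T @ z / z.shape[0]
eigval, eigvec = np.linalg.eigh(m)
top = np.argsort(eigval)[::-1][:2]                 # two largest eigenvalues
pcs = eigvec[:, top]                               # (nfreq, 2)
coef = z @ pcs                                     # (npix, 2)

# 3. Interpolate rms values and eigenvectors with log frequency.
log_freq = np.log10(freq)
rms_spline = CubicSpline(log_freq, rms)
pc_spline = CubicSpline(log_freq, pcs, axis=0)

def reconstruct(freq_new):
    """Steps 4 and 5: coefficients times interpolated eigenvectors,
    then undo the normalization with the interpolated rms."""
    log_new = np.log10(freq_new)
    return (coef @ pc_spline(log_new).T) * rms_spline(log_new)

fit = reconstruct(np.array([50.0, 200.0, 800.0]))
print(fit.shape)
```

Because the synthetic data has only two underlying components, evaluating `reconstruct(freq)` at the original frequencies should give back `data`, which is a handy check that the normalization is undone correctly.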

In [None]:
#args data matrix
#returns rms for each frequency and normalized matrix
def normalized_rms(self,matrix):
    rms_freq = np.zeros(len(self.freq))
    for i in np.arange(len(self.freq)):
        freq_sqr = np.square(matrix[:,i]) #squaring each column of data matrix
        rms_freq[i] = np.sqrt(np.mean(freq_sqr)) #taking square root of mean of each squared column and adding to rms_freq array
    nor_matrix = matrix/rms_freq #broadcasting divides each column by its rms value
    return rms_freq,nor_matrix

In [None]:
from scipy.interpolate import CubicSpline

#args freq and intensity to interpolate, new frequency values, coefficients, rms freq values
#returns cubic interpolated fit
def cub_interfit(self,inter_freq,inter_intensity,freq_new,coef,rms_freq):
    rms_inter = CubicSpline(inter_freq,rms_freq) #interpolating rms with freq
    inter = CubicSpline(inter_freq,inter_intensity,axis=0) #interpolated eigvec for 2 pc
    inter = inter(freq_new) #substituting new freq
    try: #multiplying coef and interpolated eigvec
        fit = np.dot(coef.T,inter.T) 
    except ValueError: #shape mismatch when inter is a 1d array
        fit = np.dot(coef.T,inter.reshape(1,inter.shape[0])) #inter 1d array need to reshape
    fit = fit * rms_inter(freq_new)
    return fit