# Test: Distributed/Statistical Analysis/PCA

## Principal component analysis

Principal component analysis (PCA) is a standard tool in modern data analysis. Its main goal is to identify patterns in data by detecting correlations between variables, so that the dimensionality of the data can be reduced. In a nutshell, PCA finds the directions of maximum variance in high-dimensional data and projects the data onto a lower-dimensional subspace while retaining most of the information.
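The idea can be sketched in a few lines of NumPy. This is only an illustration on synthetic data (the array `X`, the seed and the choice `k = 2` are assumptions made for the example), not the implementation used later in this notebook: center the variables, take the eigenvectors of their covariance matrix, and project onto the leading ones.

```python
import numpy as np

# Synthetic data (an assumption for illustration): 200 samples, 3 features,
# two of which are driven by the same latent variable.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 1))
X = np.hstack([
    latent + 0.1 * rng.normal(size=(200, 1)),
    2.0 * latent + 0.1 * rng.normal(size=(200, 1)),
    rng.normal(size=(200, 1)),
])

# 1. Center each variable.
X_centered = X - X.mean(axis=0)

# 2. Covariance matrix of the variables (rows are samples).
cov = np.cov(X_centered, rowvar=False)

# 3. Eigen-decomposition; eigh returns eigenvalues in ascending order,
#    so reorder them from largest to smallest.
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# 4. Project onto the first k directions of maximum variance.
k = 2
X_projected = X_centered @ eigenvectors[:, :k]

print(eigenvalues)
print(X_projected.shape)  # (200, 2)
```

The variance of each projected column equals the corresponding eigenvalue, which is why the eigenvalues are read as "variance explained" by each component.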

PCA was invented in 1901 by Karl Pearson, as an analogue of the principal axis theorem in mechanics; it was later independently developed and named by Harold Hotelling in the 1930s.

## Use cases

The main question with PCA is how far to reduce the dimensionality of the data. There are two common ways to answer it:

- Set a threshold on the eigenvalues (1 in general) and keep only the components whose eigenvalue is above this threshold.
- Fix the amount of variation to retain and keep the leading components, ordered by decreasing eigenvalue, until that amount is reached.
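Both rules can be sketched in plain Python. The eigenvalues below are the six values reported by the full PCA later in this notebook; the 95% target share is a hypothetical choice made only for the example.

```python
# Eigenvalues of the six components, copied from the full PCA output in this notebook.
eigenvalues = [3.60506879, 1.92461596, 0.20280846, 0.12318931, 0.07951498, 0.07133133]

# Rule 1: keep components whose eigenvalue exceeds a threshold (1 for standardized data).
threshold = 1.0
n_by_threshold = sum(1 for value in eigenvalues if value > threshold)

# Rule 2: keep the leading components until a target share of the variance is retained.
target_share = 0.95  # hypothetical target, chosen for illustration
total = sum(eigenvalues)
cumulative = 0.0
n_by_share = 0
for value in eigenvalues:
    cumulative += value
    n_by_share += 1
    if cumulative / total >= target_share:
        break

print(n_by_threshold, n_by_share)  # 2 3
```

With these values the threshold rule keeps two components (about 92% of the variance), while a 95% target requires a third one, so the two rules do not always agree.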

## Limitations

- PCA is sensitive to the scaling of the variables. This means that whenever the variables have different units (like temperature and mass), PCA is a somewhat arbitrary method of analysis. One way of making PCA less arbitrary is to standardize the variables to unit variance, which amounts to using the correlation matrix instead of the covariance matrix as the basis for PCA.

- To ensure that the first principal component describes the direction of maximum variance, variables must be centered. If not, the first principal component might instead correspond more or less to the mean of the data. A mean of zero is needed for finding a basis that minimizes the mean square error of the approximation of the data.

- PCA is a popular primary technique in pattern recognition, but it is not optimized for class separability. Linear discriminant analysis is an alternative that is optimized for class separability.
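The scaling sensitivity described in the first limitation can be seen on a small synthetic example. The two variables below (a mass in grams and a temperature in degrees Celsius) and the helper `first_component` are hypothetical and exist only for this sketch.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 500
# Two independent hypothetical variables on very different scales.
mass_g = rng.normal(loc=70_000.0, scale=10_000.0, size=n)  # grams
temp_c = rng.normal(loc=37.0, scale=0.5, size=n)           # degrees Celsius
X = np.column_stack([mass_g, temp_c])


def first_component(data):
    """Return the unit eigenvector of the covariance matrix with the largest eigenvalue."""
    centered = data - data.mean(axis=0)
    values, vectors = np.linalg.eigh(np.cov(centered, rowvar=False))
    return vectors[:, np.argmax(values)]


# Without scaling, the variable with the largest numeric variance dominates.
pc1_raw = first_component(X)

# After standardizing to unit variance, both variables carry the same weight.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
pc1_std = first_component(X_std)

print(np.abs(pc1_raw))
print(np.abs(pc1_std))
```

On the raw data the first component is almost exactly the mass axis, purely because of the unit chosen; expressing the mass in kilograms would change the result. After standardization the two loadings have equal magnitude.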

## Implementation

Download the study database from the subproject: sample-data-db-setup

In [1]:
import pandas as pd

# Load the synthetic study data and keep the six brain-region columns used for the PCA.
dataRaw = pd.read_csv('Data/desd-synthdata.csv')
data = dataRaw[['righthippocampus', 'lefthippocampus',
                'rightamygdala', 'leftamygdala',
                'rightlateralventricle', 'leftlateralventricle']]
# Drop rows with missing values before fitting.
data = data.dropna()
data.head()

|   | righthippocampus | lefthippocampus | rightamygdala | leftamygdala | rightlateralventricle | leftlateralventricle |
|---|---|---|---|---|---|---|
| 1 | 3.7933 | 3.4613 | 0.89412 | 0.95116 | 16.1951 | 17.7235 |
| 2 | 3.5737 | 3.3827 | 0.86274 | 0.89655 | 24.7413 | 35.4198 |
| 3 | 3.4143 | 3.1983 | 0.86853 | 0.89788 | 17.2587 | 19.3711 |
| 4 | 2.9331 | 2.6429 | 0.68437 | 0.70803 | 42.9125 | 32.0855 |
| 5 | 3.0757 | 2.8996 | 0.80229 | 0.79138 | 14.9264 | 20.9043 |


### Standardization

In [46]:
# scikit-learn's PCA centers the data but does not scale it, so standardize explicitly
from sklearn.preprocessing import StandardScaler
data_std = StandardScaler().fit_transform(data)
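As a rough check of what this step does, the sketch below reproduces the arithmetic by hand in NumPy on synthetic data (the array `X` is a hypothetical stand-in for the study columns). It assumes the default `StandardScaler` behaviour, which divides by the population standard deviation (`ddof=0`), and shows that the covariance of the standardized data is the correlation matrix of the raw data.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical stand-in for the study columns: 100 rows, 3 columns on different scales.
X = rng.normal(size=(100, 3)) * np.array([1.0, 10.0, 100.0]) + np.array([3.0, 1.0, 20.0])

# Subtract each column's mean and divide by its population standard deviation (ddof=0).
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Covariance of the standardized data (normalized by n) versus correlation of the raw data.
cov_of_std = np.cov(X_std, rowvar=False, bias=True)
corr_of_raw = np.corrcoef(X, rowvar=False)

print(np.allclose(cov_of_std, corr_of_raw))  # True
```

This is why running PCA on standardized data is often described as PCA on the correlation matrix.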

In [65]:
from sklearn.decomposition import PCA

# Keep the first two principal components and project the standardized data onto them.
sklearn_pca = PCA(n_components=2)
data_transformed = sklearn_pca.fit_transform(data_std)
data_transformed

array([[-2.4008611 , -0.19547397],
       [-1.51101883,  1.98498473],
       [-1.24524255, -0.06743021],
       ...,
       [-1.16294986,  2.00695644],
       [ 0.54006644, -0.34693453],
       [-0.70289511, -2.47390013]])

In [66]:
eigenValues = sklearn_pca.explained_variance_
eigenValues

array([3.60506879, 1.92461596])

## PCA plot


In [74]:
# Fit all six components to inspect the full eigenvalue spectrum.
pcaFull = PCA(n_components=6)
pcaFull.fit(data_std)
eig = pcaFull.explained_variance_
eig

array([3.60506879, 1.92461596, 0.20280846, 0.12318931, 0.07951498,
       0.07133133])

In [67]:
eigenVectors = sklearn_pca.components_
eigenVectors

array([[-0.50039386, -0.496709  , -0.49942028, -0.49825384,  0.05034485,
         0.0517113 ],
       [ 0.05209064,  0.05784484,  0.0018352 ,  0.03249855,  0.70481511,
         0.7043555 ]])
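A quick sanity check on these vectors, sketched with NumPy on the values printed above: the rows should have unit length and be orthogonal to each other, so their Gram matrix should be close to the identity (up to the rounding of the printed digits).

```python
import numpy as np

# The two eigenvectors printed above (rows of components_).
components = np.array([
    [-0.50039386, -0.496709, -0.49942028, -0.49825384, 0.05034485, 0.0517113],
    [0.05209064, 0.05784484, 0.0018352, 0.03249855, 0.70481511, 0.7043555],
])

# Entry (i, j) is the dot product of eigenvector i with eigenvector j.
gram = components @ components.T
print(np.round(gram, 6))
```

The loadings also suggest a reading of the two components: the first weighs the hippocampus and amygdala columns almost equally, while the second is carried by the two lateral ventricle columns.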

# TEST in MIP
### Export result

The PCA results (eigenvalues, eigenvectors and transformed data) are written to a file that will be used in the integration test phase.

In [69]:
import json

# Collect the PCA results as plain lists so they can be serialized to JSON.
dict_export = {}
dict_export['eigen_values'] = eigenValues.tolist()
dict_export['eigen_vectors'] = eigenVectors.tolist()
dict_export['data_transformed'] = data_transformed.tolist()

# Write the reference results used by the integration tests.
with open('Output/PCA.json', 'w') as fp:
    json.dump(dict_export, fp)
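On the consuming side, an integration test would typically reload the file and compare it with freshly computed values within a tolerance. The sketch below shows one possible shape for that comparison; the helper `all_close`, the tolerance and the small `reference` payload are hypothetical, and a temporary directory stands in for `Output/`.

```python
import json
import math
import os
import tempfile

# Small hypothetical payload with the same kind of keys as the export above.
reference = {
    "eigen_values": [3.60506879, 1.92461596],
    "eigen_vectors": [[-0.5, -0.5], [0.05, 0.7]],
}


def all_close(expected, actual, tol=1e-6):
    """Recursively compare nested lists of numbers within an absolute tolerance."""
    if isinstance(expected, list) or isinstance(actual, list):
        return (
            isinstance(expected, list)
            and isinstance(actual, list)
            and len(expected) == len(actual)
            and all(all_close(e, a, tol) for e, a in zip(expected, actual))
        )
    return math.isclose(expected, actual, abs_tol=tol)


# Round-trip through a JSON file, as the integration test would do.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "PCA.json")
    with open(path, "w") as fp:
        json.dump(reference, fp)
    with open(path) as fp:
        loaded = json.load(fp)

matches = all(all_close(reference[key], loaded[key]) for key in reference)
print(matches)  # True
```

Eigenvectors are only defined up to sign, so a comparison against another PCA implementation may need to accept a vector or its negation, and likewise for the corresponding column of the transformed data.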

## Other methods

There exist similar methods:

- Linear Discriminant Analysis (LDA) is a supervised method that aims to find the directions that maximize the separation (or discrimination) between classes relative to the variance within each class, which can be useful in pattern classification problems.
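The difference between the two methods can be sketched on synthetic two-class data. Everything below is an assumption made for illustration: the data, the seed, and the use of the two-class Fisher discriminant direction as a stand-in for a full LDA implementation.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 500
# Hypothetical two-class data: large shared variance along x, class shift along y.
class_0 = np.column_stack([rng.normal(0.0, 5.0, n), rng.normal(0.0, 0.5, n)])
class_1 = np.column_stack([rng.normal(0.0, 5.0, n), rng.normal(2.0, 0.5, n)])
X = np.vstack([class_0, class_1])

# PCA direction: leading eigenvector of the covariance of the pooled data (labels ignored).
values, vectors = np.linalg.eigh(np.cov(X, rowvar=False))
pca_direction = vectors[:, np.argmax(values)]

# Fisher discriminant direction: S_w^{-1} (m1 - m0), with S_w the within-class scatter.
m0, m1 = class_0.mean(axis=0), class_1.mean(axis=0)
s_within = (class_0 - m0).T @ (class_0 - m0) + (class_1 - m1).T @ (class_1 - m1)
lda_direction = np.linalg.solve(s_within, m1 - m0)
lda_direction /= np.linalg.norm(lda_direction)

print(np.abs(pca_direction))
print(np.abs(lda_direction))
```

PCA picks the x axis, where the variance is largest but the classes overlap completely, whereas the discriminant direction is close to the y axis, where the classes actually differ.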