# Principal Component Analysis in MDAnalysis

2019

Author: [Lily Wang](http://minium.com.au) [(@lilyminium)](https://github.com/lilyminium)

Inspired by the MDAnalysis PCA tutorial by [Kathleen Clark](https://becksteinlab.physics.asu.edu/people/75/kathleen-clark) [(@kaceyreidy)](https://github.com/kaceyreidy)

In this tutorial we:

* use PCA to analyse and visualise large macromolecular conformational changes in the enzyme adenylate kinase (AdK)
* use PCA to compare the conformational differences between ???

## Background

Principal component analysis (PCA) is a statistical technique that decomposes a system of observations into linearly uncorrelated variables called **principal components**. These components are ordered so that the first principal component accounts for the largest variance in the data, and each following component accounts for less and less variance. PCA is often applied to molecular dynamics trajectories to extract the large-scale conformational motions or "essential dynamics" of a protein. The frame-by-frame conformational fluctuation can be considered a linear combination of the essential dynamics yielded by the PCA.
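
The "linear combination" statement above can be checked numerically. The following is a minimal sketch on random data, not an MDAnalysis call: the array `coords` and its size are hypothetical stand-ins for flattened trajectory coordinates. It projects each frame's deviation from the mean onto the components and then rebuilds the deviation from those weights.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 50 "frames" of 4 atoms, flattened to 12 coordinates each
coords = rng.normal(size=(50, 12))

# Fluctuation of each frame about the mean structure
deviations = coords - coords.mean(axis=0)

# Principal components are the eigenvectors (columns) of the covariance matrix
cov = np.cov(deviations, rowvar=False)
_, components = np.linalg.eigh(cov)

# Weight of every component in every frame
weights = deviations @ components

# Each frame's fluctuation is the weighted sum of the components
reconstructed = weights @ components.T
max_error = np.abs(reconstructed - deviations).max()
print(max_error)
```

Because the full set of components forms an orthonormal basis, the reconstruction error is at the level of floating-point round-off. Keeping only the first few columns of `components` would instead give an approximation dominated by the largest motions.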

In MDAnalysis, the method is as follows:

1. Optionally align each frame in your trajectory to the first frame.
2. Construct a 3N x 3N covariance matrix for the N atoms in your trajectory. Optionally, you can provide a mean; otherwise the covariance is computed relative to the average structure over the trajectory.
3. Diagonalise the covariance matrix. The eigenvectors are the principal components, and their eigenvalues are the associated variances.
4. Sort the eigenvalues so that the principal components are ordered by variance.
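
Steps 2 to 4 above can be sketched with plain NumPy. This is an illustration under stated assumptions rather than the MDAnalysis implementation: the `positions` array is synthetic and assumed to be already aligned, the covariance is normalised by `n_frames - 1`, and the names `variance` and `p_components` are reused here only for readability.

```python
import numpy as np

rng = np.random.default_rng(42)
n_frames, n_atoms = 200, 5

# Hypothetical, already-aligned trajectory with shape (n_frames, n_atoms, 3)
positions = rng.normal(size=(n_frames, n_atoms, 3))
# Exaggerate the motion of atom 0 along x so one component dominates
positions[:, 0, 0] *= 10.0

# Step 2: flatten to (n_frames, 3N) and build the 3N x 3N covariance matrix
x = positions.reshape(n_frames, 3 * n_atoms)
deviations = x - x.mean(axis=0)
cov = deviations.T @ deviations / (n_frames - 1)

# Step 3: diagonalise; eigh is appropriate because cov is symmetric
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# Step 4: eigh returns ascending eigenvalues, so reorder by decreasing variance
order = np.argsort(eigenvalues)[::-1]
variance = eigenvalues[order]
p_components = eigenvectors[:, order]  # one component per column

print(cov.shape)
print(variance[:3])
```

In this toy system the first component points almost entirely along the x coordinate of atom 0, because that is where the variance was placed.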

<div class="alert alert-warning">
    
**Note**
    
Principal component analysis algorithms are deterministic, but the solutions are not unique. For example, you could change the sign of an eigenvector without altering the PCA. Different algorithms are likely to produce different answers due to variations in implementation, so `MDAnalysis` may return different solutions from, say, `cpptraj`.
</div>
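
The sign ambiguity mentioned in the note can be seen on a small example. This sketch uses a hypothetical 2 x 2 covariance matrix and checks that both an eigenvector and its negative satisfy the eigenvalue equation, so either is an equally valid principal component.

```python
import numpy as np

# Hypothetical symmetric covariance matrix (eigenvalues 1 and 3)
cov = np.array([[2.0, 1.0],
                [1.0, 2.0]])

eigenvalues, eigenvectors = np.linalg.eigh(cov)
lam = eigenvalues[-1]      # largest eigenvalue
v = eigenvectors[:, -1]    # its eigenvector

# Both v and -v satisfy cov @ v = lam * v
original_ok = np.allclose(cov @ v, lam * v)
flipped_ok = np.allclose(cov @ (-v), lam * (-v))

# When comparing two tools, compare components up to sign
overlap = abs(np.dot(v, -v))

print(original_ok, flipped_ok, overlap)
```

When comparing components from two programs, the absolute value of the dot product is therefore a more useful measure of agreement than an element-wise comparison.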

## Large conformational changes in adenylate kinase

In MDAnalysis, analysis modules usually need to be imported explicitly. The `pca` module contains the `PCA` class that we will use for analysis. We also import the AdK files from the MDAnalysis test suite.

In [1]:
import MDAnalysis as mda
import MDAnalysis.analysis.pca as pca
from MDAnalysisTests.datafiles import PSF, DCD

import nglview as nv

As usual, we start off by creating a universe.

In [2]:
u = mda.Universe(PSF, DCD)

Unlike other analyses, `pca.PCA` can only be applied to `Universe`s. The default `PCA` arguments are:

```python
my_pca = pca.PCA(u, select='all', align=False, mean=None, n_components=None)
```

By default (`align=False`), your trajectory will not be aligned to any structure. If you set `align=True`, every frame will be aligned to the first frame of your trajectory, based on the atoms in your `select` string. 

As PCA is usually used to extract large-scale conformational motions, we select only the backbone atoms here.

In [3]:
pc = pca.PCA(u, select="backbone", align=True)

Once you set up the class, you can run the analysis with `.run(start=None, stop=None, step=None, verbose=None)`. The `start`, `stop` and `step` arguments allow you to specify the frames to compute the analysis over. The default arguments compute over every frame.

In [4]:
pc.run(verbose=True)

<MDAnalysis.analysis.pca.PCA at 0x11c506ef0>

The principal components are accessible in `.p_components`.

In [5]:
print(pc.p_components.shape)
pc.p_components[0]

(2565, 2565)


array([ 0.02725098,  0.00156086,  0.00816821, ..., -0.01783826,
        0.04746114,  0.04257271])

The variance of each principal component is in `.variance`. For example, to get the variance explained by the first principal component:

In [6]:
pc.variance[0]

281443.5086197605

This variance is somewhat meaningless by itself. It is much more intuitive to consider the variance of a principal component as a fraction of the total variance in the data. MDAnalysis also tracks the cumulative fraction of variance in `.cumulated_variance`. As shown below, the first principal component accounts for 98.7% of the total trajectory variance. The first four components combined (index 3) account for 99.9% of the total variance.

In [7]:
print(pc.cumulated_variance[0])
print(pc.cumulated_variance[3])

0.9873464381554058
0.999419901112709
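
To make the relationship between the two attributes concrete, the sketch below assumes that the cumulated variance is the running sum of the sorted variances divided by their total. The numbers are hypothetical and are not the AdK values; `n_needed` is an illustrative name, not an MDAnalysis attribute.

```python
import numpy as np

# Hypothetical variances, sorted from largest to smallest
variance = np.array([8.0, 1.5, 0.4, 0.1])

# Running sum of variance as a fraction of the total
cumulated = np.cumsum(variance) / variance.sum()

# Smallest number of components that covers at least 90% of the variance
n_needed = int(np.searchsorted(cumulated, 0.9) + 1)

print(cumulated)
print(n_needed)
```

A threshold like this is a common way to decide how many components to keep for later projections.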


The average structure is also saved as an `AtomGroup` in `.mean_atoms`.

In [15]:
print(pc.mean_atoms.positions)

[[16.297781   6.8397956 -7.622989 ]
 [14.900139   7.062459  -7.235277 ]
 [14.185768   5.8268375 -6.879689 ]
 ...
 [13.035071  15.354209  -3.8042812]
 [13.695147  15.725297  -4.988666 ]
 [12.63667   15.566869  -6.1185045]]


In [14]:
mean_structure = mda.Merge(pc.mean_atoms)
nv.show_mdanalysis(mean_structure)

NGLWidget()