# Analysis Pipeline

In this notebook I build a working pipeline that takes topology and trajectory files as input and outputs metrics and plots intended to help us capture a signal that differentiates **agonists from antagonists**.

The flow in general is:

1. **Input**: A directory containing two subdirectories (agonists and antagonists)
2. Read the files (currently using [MDAnalysis](https://www.mdanalysis.org/))
3. Apply methods and measurements (Radius of Gyration, RMSD, RMSF, SASA, PCA)
4. **Output**: Metrics, plots and conclusions

In [2]:
import MDAnalysis

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from MDAnalysis.analysis.rms import RMSF
import MDAnalysis.analysis.pca as pca

import re
import os
import logging
from tqdm import tqdm

## The AnalysisActor class

The class is responsible for reading the files, running the analysis methods, and storing the results.
Explanations of the metrics, useful plots, and conclusions can be found in later stages of the notebook.

In [3]:
class AnalysisActor:
    '''
    The AnalysisActor object takes a single topology and trajectory as input and performs the analysis.
    A full list of input formats: https://www.mdanalysis.org/docs/documentation_pages/coordinates/init.html#supported-coordinate-formats
    
    Args:
        topology (str): The topology file (.pdb, .gro etc.)
        trajectory (str): The trajectory file (.xtc etc.)
        sim_name (str): The simulation name
        
    Attributes:
        uni: The universe of atoms created by MDAnalysis
        name (str): The simulation name (passed in as sim_name)
        rg_list (List[float]): Radius of gyration of each frame
        rmsf_res (Object: MDAnalysis.analysis.rms.RMSF): Object containing the RMSF of each atom
        pca_res (Object: MDAnalysis.analysis.pca.PCA): Object containing eigenvectors and eigenvalues of the covariance matrix
    '''
    
    def __init__(self, topology, trajectory, sim_name):
        self.uni = MDAnalysis.Universe(topology, trajectory)
        self.name = sim_name
        self.rg_list = []
        self.rmsf_res = None
        self.pca_res = None
        
        
    def info(self):
        ''' Prints basic info of the universe of atoms '''
        print(f'\n<<< Info of {self.name} >>>')
        print(f'\tNumber of Frames: {len(self.uni.trajectory)}')
        print(f'\tNumber of Atoms: {len(self.uni.atoms)}')
        print(f'\tNumber of Residues: {len(self.uni.residues)}')
    
    
    def perform_analysis(self, metrics=[]):
        '''
        Runs the analysis methods for calculating the metrics specified by metrics argument
        
        Args:
            metrics (List[str]): A list of the metrics to be calculated. Available:
                                 Empty List []: All of the available metrics
                                 'Rg': Radius of Gyration
                                 'RMSF': Root Mean Square Fluctuations
                                 'SASA': Solvent Accessible Surface Area
                                 'PCA': Principal Component Analysis
        '''
        
        # Calculate Radius of Gyration as time progresses
        if "Rg" in metrics or len(metrics) == 0:
            for frame in self.uni.trajectory:
                self.rg_list.append(self.uni.atoms.radius_of_gyration())
                
        # Calculate Solvent Accessible Surface Area (not implemented yet)
#         if "SASA" in metrics or len(metrics) == 0:
#             logging.warning("SASA NOT IMPLEMENTED")
            
        # Calculate Root Mean Square Fluctuation
        if "RMSF" in metrics or len(metrics) == 0:
            # TODO: Look more into the alignment step
            # Store the result on the instance (a local variable would be discarded)
            self.rmsf_res = RMSF(self.uni.atoms).run()

        # Perform PCA on the CA atoms
        if "PCA" in metrics or len(metrics) == 0:
            self.pca_res = pca.PCA(self.uni, select='name CA')
            self.pca_res.run()
        

## Reading the trajectories

Care must be taken to read the trajectory files in an organized and efficient way.
The current approach is:

1. Input: a path that points to a directory containing two subdirectories named **"Agonists"** and **"Antagonists"**
2. Extract the filepaths
3. Create the dictionary:
```python 
{
    "Agonists": List[AnalysisActor],
    "Antagonists": List[AnalysisActor]
}
```
   
**The trajectory and topology files are expected to have the extensions .xtc and .pdb respectively,
although this can easily be extended to more formats.**
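
As a minimal sketch of how steps 1 and 2 could be extended to more formats, the helper below pairs one topology file with one trajectory file per simulation directory, based on configurable extension sets. The names `find_simulation_files` and `collect_file_pairs` and the extra extensions (`.gro`, `.trr`, `.dcd`) are assumptions made for illustration and are not part of the notebook's code. The demo runs on empty placeholder files in a temporary directory.

```python
import tempfile
from pathlib import Path

# Hypothetical extension sets; the notebook itself only handles .pdb and .xtc
TOPOLOGY_EXTS = {".pdb", ".gro"}
TRAJECTORY_EXTS = {".xtc", ".trr", ".dcd"}


def find_simulation_files(sim_dir):
    """Return (topology, trajectory) paths found in sim_dir, or None if either is missing."""
    top = traj = None
    for path in sorted(Path(sim_dir).iterdir()):
        suffix = path.suffix.lower()
        if suffix in TOPOLOGY_EXTS and top is None:
            top = path
        elif suffix in TRAJECTORY_EXTS and traj is None:
            traj = path
    if top is None or traj is None:
        return None
    return top, traj


def collect_file_pairs(root):
    """Map each class directory to a list of (sim_name, topology, trajectory) tuples."""
    pairs = {"Agonists": [], "Antagonists": []}
    for which_dir in pairs:
        class_dir = Path(root) / which_dir
        if not class_dir.is_dir():
            continue
        for sim_dir in sorted(p for p in class_dir.iterdir() if p.is_dir()):
            found = find_simulation_files(sim_dir)
            if found is not None:  # incomplete simulations are skipped
                pairs[which_dir].append((sim_dir.name, *found))
    return pairs


# Demo on placeholder files (no real simulation data involved)
with tempfile.TemporaryDirectory() as root:
    for rel in ["Agonists/ligand_a/top.pdb", "Agonists/ligand_a/traj.xtc",
                "Antagonists/ligand_b/top.gro", "Antagonists/ligand_b/traj.trr",
                "Antagonists/incomplete/top.pdb"]:
        file_path = Path(root) / rel
        file_path.parent.mkdir(parents=True, exist_ok=True)
        file_path.touch()
    demo_pairs = collect_file_pairs(root)

demo_summary = {k: [(name, top.name, traj.name) for name, top, traj in v]
                for k, v in demo_pairs.items()}
print(demo_summary)
```

Each tuple could then be turned into an actor with `AnalysisActor(str(top), str(traj), name)`. Directories that lack either file are skipped here, so no actor is built from an incomplete pair.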


In [4]:
root_directory = '../datasets/New_AI_MD/'

dir_list = os.listdir(root_directory)
if 'Agonists' not in dir_list: logging.error('Agonists directory not found')
if 'Antagonists' not in dir_list: logging.error('Antagonists directory not found')

analysis_actors_dict = {"Agonists":[], "Antagonists":[]}

# Iterating through the directories tree in order to fill the analysis_actors_dict
# A warning is raised when reading the Lorcaserin simulation
for which_dir in ['Agonists', 'Antagonists']:
    simulations = os.listdir(os.path.join(root_directory, which_dir))
    for which_sim in tqdm(simulations, desc=which_dir):
        sim_dir = os.path.join(root_directory, which_dir, which_sim)
        top = ""
        traj = ""
        for file in os.listdir(sim_dir):
            if file.endswith(".xtc"):
                traj = os.path.join(sim_dir, file)
            elif file.endswith(".pdb"):
                top = os.path.join(sim_dir, file)
        if traj == "" or top == "":
            # Skip this simulation: a Universe cannot be built without both files
            logging.error("Failed to find topology or trajectory file in: " + sim_dir)
            continue
        analysis_actors_dict[which_dir].append(AnalysisActor(top, traj, which_sim))
            

  alpha = np.rad2deg(np.arccos(np.dot(y, z) / (ly * lz)))
  beta = np.rad2deg(np.arccos(np.dot(x, z) / (lx * lz)))
  gamma = np.rad2deg(np.arccos(np.dot(x, y) / (lx * ly)))
  if np.all(box > 0.0) and alpha < 180.0 and beta < 180.0 and gamma < 180.0:
Agonists: 100%|██████████| 10/10 [00:00<00:00, 21.27it/s]
Antagonists: 100%|██████████| 13/13 [00:00<00:00, 22.03it/s]


## Performing the Calculations

Having read all of the trajectory and topology files, we now calculate the features that the rest of the analysis builds on.

Currently the calculations include:
* Radius of Gyration
* RMSF
* PCA
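
To make the first two quantities concrete, here is a small pure-Python sketch of their standard definitions: the mass-weighted radius of gyration of one frame, and the per-atom RMSF over several frames. The function names and the toy coordinates are illustrative assumptions. In the notebook the actual values come from MDAnalysis (`radius_of_gyration()` and `RMSF`), and the sketch assumes the frames are already aligned.

```python
import math


def radius_of_gyration(coords, masses):
    """Mass-weighted radius of gyration of a single frame.

    coords: list of (x, y, z) tuples, one per atom
    masses: list of atomic masses, in the same order as coords
    """
    total_mass = sum(masses)
    # Center of mass of the frame
    com = [sum(m * c[k] for m, c in zip(masses, coords)) / total_mass
           for k in range(3)]
    # Mass-weighted sum of squared distances from the center of mass
    weighted_sq = sum(m * sum((c[k] - com[k]) ** 2 for k in range(3))
                      for m, c in zip(masses, coords))
    return math.sqrt(weighted_sq / total_mass)


def rmsf(frames):
    """Per-atom root mean square fluctuation around the mean position.

    frames: list of frames, each a list of (x, y, z) tuples
    """
    n_frames = len(frames)
    n_atoms = len(frames[0])
    result = []
    for i in range(n_atoms):
        # Mean position of atom i over all frames
        mean = [sum(f[i][k] for f in frames) / n_frames for k in range(3)]
        # Mean squared deviation of atom i from its mean position
        msd = sum(sum((f[i][k] - mean[k]) ** 2 for k in range(3))
                  for f in frames) / n_frames
        result.append(math.sqrt(msd))
    return result


# Toy data (hypothetical): two atoms of equal mass, two frames
frame_a = [(-1.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
frame_b = [(-1.0, 0.0, 0.0), (3.0, 0.0, 0.0)]

rg_a = radius_of_gyration(frame_a, [12.0, 12.0])
rmsf_values = rmsf([frame_a, frame_b])
print(rg_a, rmsf_values)
```

In the toy data the first atom never moves, so its RMSF is zero, while the second atom oscillates by one unit around its mean position.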

In [None]:
# The calculations are performed by calling the 'perform_analysis' method of our AnalysisActor objects

# Agonists
for which_actor in tqdm(analysis_actors_dict['Agonists'], desc="Agonists Calculations"):
    which_actor.perform_analysis()

Agonists Calculations:  40%|████      | 4/10 [01:12<01:50, 18.42s/it]