Root Mean Square Fluctuation (RMSF)

RMSF is a measure of the deviation of the position of a particle i with respect to a reference position **over time**.

**Difference between RMSD and RMSF**: The latter is averaged over time, giving a value for each particle i. For the RMSD the average is taken over the particles, giving time specific values. So **RMSD is time specific** and **RMSF is atom specific** [(ref)](http://www.drugdesign.gr/uploads/7/6/0/2/7602318/lecture_mdanalysis.pdf).

\begin{equation*}
RMSF_i = \left[ \frac{1}{T} \sum_{j=1}^{T} \mid r_i(t_j) - r_i^\text{ref} \mid ^ 2 \right] ^ {\frac{1}{2}}
\end{equation*}

$T$ is the total simulation time, measured in frames  
$r_i^\text{ref}$ is the reference position of particle $i$, e.g. the **time-averaged** position of the same particle $i$  
$\mid r_i(t_j) - r_i^\text{ref} \mid$ is the Euclidean distance of particle $i$ on frame $j$ from $r_i^\text{ref}$
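
As a minimal sketch of the formula above, assume the trajectory is available as a NumPy array of shape `(frames, particles, 3)` and that the reference is the time-averaged position. Under those assumptions RMSF and RMSD differ only in the axis that is averaged. The array `coords` and both function names are hypothetical, the coordinates are synthetic random numbers rather than data from this notebook, and no structural alignment is performed.

```python
import numpy as np

def rmsf_per_particle(coords):
    """RMSF of each particle around its time-averaged position.

    coords: array of shape (n_frames, n_particles, 3)
    returns: array of shape (n_particles,)
    """
    ref = coords.mean(axis=0)                    # time-averaged position r_i^ref
    sq_dist = ((coords - ref) ** 2).sum(axis=2)  # squared distance per frame and particle
    return np.sqrt(sq_dist.mean(axis=0))         # average over frames (time)

def rmsd_per_frame(coords, ref):
    """RMSD of each frame from a reference structure, averaged over particles."""
    sq_dist = ((coords - ref) ** 2).sum(axis=2)
    return np.sqrt(sq_dist.mean(axis=1))         # average over particles

# Synthetic example: 100 frames, 5 particles (not real simulation data)
rng = np.random.default_rng(0)
coords = rng.normal(size=(100, 5, 3))

rmsf = rmsf_per_particle(coords)          # one value per particle
rmsd = rmsd_per_frame(coords, coords[0])  # one value per frame
print(rmsf.shape, rmsd.shape)
```

The RMSF result has one entry per particle, while the RMSD result has one entry per frame, which is the distinction drawn above.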

This time the $x$-axis holds the residue ids, so the approach is slightly different:
the $y$-axis shows the RMSF value of each residue on the $x$-axis.

In [3]:
import MDAnalysis
from MDAnalysis.analysis.rms import RMSF
from MDAnalysis.analysis import contacts
from MDAnalysis.analysis.align import AlignTraj 
import MDAnalysis.analysis.pca as pca
import MDAnalysis.analysis.distances as dist_analysis

from MDSimsEval.AnalysisActorClass import AnalysisActor
from MDSimsEval.utils import create_analysis_actor_dict
import mdtraj as md_traj

from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import MinMaxScaler
from scipy import stats

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.gridspec import GridSpec
import seaborn as sns
import imgkit

import re
import os
import subprocess
import logging
import math
import itertools
from operator import itemgetter

from tqdm.notebook import tqdm
from IPython.display import display

In [4]:
analysis_actors_dict = create_analysis_actor_dict('../datasets/New_AI_MD/')

# # Agonists
# for which_ligand in tqdm(analysis_actors_dict['Agonists'], desc="Agonists Calculations"):
#     which_ligand.perform_analysis(metrics=["RMSF"])
    
# # Antagonists
# for which_ligand in tqdm(analysis_actors_dict['Antagonists'], desc="Antagonists Calculations"):
#     which_ligand.perform_analysis(metrics=["RMSF"])

Agonists | Donitriptan: 100%|██████████| 15/15 [00:59<00:00,  3.98s/it]  
Antagonists | Ziprasione: 100%|██████████| 18/18 [02:06<00:00,  7.05s/it]   


## Residue Selection using Statistical Tests

Our goal is to find the residues that best differentiate agonists from antagonists. In this part of the notebook we apply a statistical test, residue by residue, between the agonists' RMSF values and the antagonists' RMSF values of that residue. We then select the residues whose p-value is at or below a given threshold.
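
The per-residue test described above can be sketched on synthetic data. This is only an illustration under stated assumptions: each group is a 2D array with one row per ligand and one column per residue, the values are random numbers rather than RMSF values from the simulations, and the names `agonists`, `antagonists` and `select_residues` are hypothetical.

```python
import numpy as np
from scipy import stats

def select_residues(group_a, group_b, threshold=0.05):
    """Return (residue_index, p_value) pairs with p_value <= threshold, sorted by p_value."""
    # With axis=0, ttest_ind runs one independent two-sample t-test per column (residue)
    _, p_values = stats.ttest_ind(group_a, group_b, axis=0)
    selected = np.flatnonzero(p_values <= threshold)
    pairs = [(int(i), float(p_values[i])) for i in selected]
    return sorted(pairs, key=lambda pair: pair[1])

# Synthetic data: 15 "agonists" and 18 "antagonists", 4 residues each
rng = np.random.default_rng(0)
agonists = rng.normal(loc=1.0, scale=0.1, size=(15, 4))
antagonists = rng.normal(loc=1.0, scale=0.1, size=(18, 4))
antagonists[:, 2] += 1.0  # make residue 2 clearly different between the groups

picked = select_residues(agonists, antagonists, threshold=0.05)
print(picked)
```

Residue 2 is shifted on purpose, so it should come out with the smallest p-value. Any other residue that appears in the list passed the threshold by chance.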

In [91]:
from MDSimsEval.rmsf_analysis import get_avg_rmsf_per_residue
from MDSimsEval.rmsf_analysis import reset_rmsf_calculations
from scipy import stats

def stat_test_residues(analysis_actors_dict, stat_test=stats.ttest_ind, threshold=0.05, start=-1, stop=-1):
    """
    Finds the most differentiating residues based on a statistical test on their RMSF. If start==-1 or stop==-1
    the RMSF is not recalculated and the previously calculated values are used.
    
    | For example, for the T-test we have:
    | Null Hypothesis: The RMSFs of a specific residue have identical average (expected) values in the two groups
    
    Args:
        analysis_actors_dict: ``{ "Agonists": List[AnalysisActor.class], "Antagonists": List[AnalysisActor.class] }``
        stat_test (scipy.stats): A statistical test method with the interface of scipy.stats methods
        threshold (float): The p-value threshold of the accepted and returned residues
        start (int): The starting frame of the calculations
        stop (int): The stopping frame of the calculations
        
    Returns:
        A list of ``(residue id, p-value)`` tuples (residue ids are 0-indexed) for the residues whose p-value is
        at or below the threshold, sorted by ascending p-value. Eg ``[(10, 0.03), (5, 0.042), ...]``
        
    """
    # Recalculate the RMSF calculations on the input window if we need a reset
    if start != -1 and stop != -1:
        reset_rmsf_calculations(analysis_actors_dict, start=start, stop=stop)
    
    stacked_agons_rmsf = np.array([get_avg_rmsf_per_residue(ligand) for ligand in analysis_actors_dict['Agonists']])
    stacked_antagons_rmsf = np.array([get_avg_rmsf_per_residue(ligand) for ligand in analysis_actors_dict['Antagonists']])
    
    # Get the p_value of each residue
    p_values = []
    for agon_res_rmsf, antagon_res_rmsf in zip(stacked_agons_rmsf.T, stacked_antagons_rmsf.T):
        p_values.append(stat_test(agon_res_rmsf, antagon_res_rmsf)[1])
        
    # Select the p_values that pass the threshold
    enumed_pvalues = np.array(list(enumerate(p_values)))
    enumed_pvalues = enumed_pvalues[enumed_pvalues[:, 1] <= threshold]
    
    # Transform the ndarray to a list of tuples
    enumed_pvalues = [(int(pair[0]), pair[1]) for pair in enumed_pvalues]
    
    # Return in ascending order of p_values
    return sorted(enumed_pvalues, key=lambda x: x[1])

In [106]:
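results = {}
thresh = 0.2

def perform_3_tests(start, stop, threshold):
    """Counts the residues selected by the T-test, Kolmogorov-Smirnov and Levene tests on a frame window."""
    ret_vals = []
    # Only the first call passes start and stop, so the RMSF is recalculated once per window
    ret_vals.append(len(stat_test_residues(analysis_actors_dict, threshold=threshold, stat_test=stats.ttest_ind, start=start, stop=stop)))
    # The next two calls use the default start=-1, stop=-1 and reuse the RMSF values calculated above
    ret_vals.append(len(stat_test_residues(analysis_actors_dict, threshold=threshold, stat_test=stats.ks_2samp)))
    ret_vals.append(len(stat_test_residues(analysis_actors_dict, threshold=threshold, stat_test=stats.levene)))

    return ret_vals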

results['0 - 2500'] = perform_3_tests(0, 2500, thresh)
results['0 - 1250'] = perform_3_tests(0, 1250, thresh)
results['1250 - 2500'] = perform_3_tests(1250, 2500, thresh)

for start in np.arange(0, 2500, 500):
    results[f'{start} - {start+500}'] = perform_3_tests(start, start+500, thresh)



In [107]:
df_tests = pd.DataFrame.from_dict(results, orient='index', columns=['T-test', 'Kolmogorov-Smirnov', 'Levene'])

display(df_tests)

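| Frame window | T-test | Kolmogorov-Smirnov | Levene |
|--------------|--------|--------------------|--------|
| 0 - 2500     | 2      | 10                 | 6      |
| 0 - 1250     | 2      | 39                 | 3      |
| 1250 - 2500  | 17     | 23                 | 5      |
| 0 - 500      | 14     | 22                 | 8      |
| 500 - 1000   | 2      | 18                 | 9      |
| 1000 - 1500  | 2      | 19                 | 4      |
| 1500 - 2000  | 7      | 20                 | 3      |
| 2000 - 2500  | 152    | 73                 | 107    |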
