# Machine Learning on 5HT2A Receptor

The goal of this notebook is to use the features extracted and analyzed in the Analysis notebook.  
We want to differentiate agonist ligands from antagonist ligands using features extracted from the MD simulations.

In [49]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from AnalysisActor.AnalysisActorClass import AnalysisActor
from AnalysisActor.utils import create_analysis_actor_dict

from sklearn.model_selection import KFold
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

import re
import os
import subprocess
import logging
import math
import itertools
from operator import itemgetter

from tqdm.notebook import tqdm
from IPython.display import display

## Reading the Simulations

We will use the `AnalysisActor` package I wrote, which extracts low-level features from the simulations. I call them low-level because, for example, the `AnalysisActor.class` of a ligand gives us the $Rg$ of the ligand on each frame. These per-frame values must then be reduced to the per-ligand features I have analyzed, such as the mean and standard deviation.
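
To make "low-level" concrete, the sketch below computes a mass-weighted radius of gyration for every frame from raw coordinates and then reduces the series to a mean and a standard deviation. It is illustrative only: the function name `radius_of_gyration_per_frame` and the random toy coordinates are assumptions, and `AnalysisActor` may compute $Rg$ differently.

```python
import numpy as np


def radius_of_gyration_per_frame(positions, masses):
    """Mass-weighted radius of gyration for each frame.

    positions: array of shape (n_frames, n_atoms, 3)
    masses:    array of shape (n_atoms,)
    """
    positions = np.asarray(positions, dtype=float)
    masses = np.asarray(masses, dtype=float)
    total_mass = masses.sum()

    # Center of mass of every frame, shape (n_frames, 3)
    com = np.einsum('a,fad->fd', masses, positions) / total_mass

    # Squared distance of every atom from its frame's center of mass
    sq_dist = ((positions - com[:, None, :]) ** 2).sum(axis=2)

    # Mass-weighted mean of the squared distances, then the square root
    return np.sqrt((sq_dist * masses).sum(axis=1) / total_mass)


# Hypothetical toy trajectory: 5 frames, 4 atoms of equal mass
rng = np.random.default_rng(0)
toy_positions = rng.normal(size=(5, 4, 3))
toy_masses = np.ones(4)

# Low-level feature: one value per frame
rg_series = radius_of_gyration_per_frame(toy_positions, toy_masses)
# Reduced features: two values per ligand
rg_mean, rg_std = rg_series.mean(), rg_series.std()
```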

In [4]:
# Reading the simulations
analysis_actors_dict = create_analysis_actor_dict('../datasets/New_AI_MD/')

Agonists | Donitriptan: 100%|██████████| 15/15 [00:48<00:00,  3.23s/it]  
Antagonists | Ziprasione: 100%|██████████| 18/18 [01:56<00:00,  6.47s/it]   


The `analysis_actors_dict` is a dictionary:
```python 
{
    "Agonists": List[AnalysisActor.class],
    "Antagonists": List[AnalysisActor.class]
}
```
At this point the dictionary has only read the trajectories; none of the metrics have been calculated. To calculate them we must call the `AnalysisActor.perform_analysis` method, which takes as an argument a list of the metrics to be calculated.  
  
**Caution**: The calculations need memory both to run and to store their results, so monitor the memory usage. If this becomes too big a problem we can solve it in a "dynamic" way: instead of keeping the trajectories in memory, we load them only briefly while the calculations are executed. 
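
One way the "dynamic" approach could look is sketched below: each ligand is loaded, reduced to its features, and released before the next one is read. The names `stream_features` and `fake_loader` are hypothetical and do not exist in `AnalysisActor`; the loader stands in for reading a trajectory and computing a per-frame metric.

```python
import numpy as np


def stream_features(sources, load_series, start=0, stop=None):
    """Yield one [mean, std] row per source, holding one series in memory at a time.

    sources:     iterable of identifiers such as paths (hypothetical)
    load_series: callable returning a 1-D per-frame array for one source (hypothetical)
    """
    for source in sources:
        series = np.asarray(load_series(source), dtype=float)[start:stop]
        row = [series.mean(), series.std()]
        del series  # drop the per-frame data before loading the next source
        yield row


# Toy stand-in for reading a trajectory and computing a per-frame metric
def fake_loader(seed):
    return np.random.default_rng(seed).normal(loc=21.0, scale=0.1, size=2500)


# Only the reduced features (one row per source) are kept
features = np.array(list(stream_features(range(3), fake_loader)))
```

The trade-off is that any change to the feature window requires reloading the trajectories, so this only pays off when memory is the limiting factor.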

In [7]:
# Possible arguments for "metrics" list:
#     Empty List [] (default): All of the available metrics will be calculated
#     'Rg': Radius of Gyration
#     'RMSF': Root Mean Square Fluctuations
#     'SASA': Solvent Accessible Surface Area
#     'PCA': Principal Component Analysis
#     'Hbonds': Hydrogen Bonds
#     'Salt': Calculate number of salt bridges

# Iterate on all the ligands
for which_ligand in tqdm(analysis_actors_dict['Agonists'] + analysis_actors_dict['Antagonists'], desc="Ligand Calculations"):
    which_ligand.perform_analysis(metrics=["Rg", "SASA"])





## Extracting the Features

From the calculations we must now extract ML-appropriate features. The features used in our case are:
* Mean of Rg
* Std of Rg
* Mean of SASA
* Std of SASA

One parameter to consider is the frame window over which the features are computed. Our simulations are 2,500 frames long, and using all of them may not be ideal: our analysis shows that the event happens after 1,200 frames in most cases. However, as a starting point we will use all of the frames.  
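
The effect of the window can be illustrated with a small sketch. The helper `window_features` and the synthetic series (a shift in the mean after frame 1,200) are assumptions made for illustration, not data from the simulations.

```python
import numpy as np


def window_features(series, start=0, stop=None):
    """Mean and standard deviation of a per-frame series over frames [start, stop)."""
    window = np.asarray(series, dtype=float)[start:stop]
    return window.mean(), window.std()


# Hypothetical per-frame series of 2,500 frames whose mean shifts after frame 1,200
rng = np.random.default_rng(0)
toy_series = np.concatenate([
    rng.normal(21.0, 0.1, size=1200),
    rng.normal(21.5, 0.1, size=1300),
])

# Features over all frames versus only the frames after the shift
full_mean, full_std = window_features(toy_series, 0, 2500)
late_mean, late_std = window_features(toy_series, 1200, 2500)
```

In this toy case the full-window standard deviation is inflated by the shift itself, while the late window describes only the post-shift behaviour, which is why the choice of `start` and `stop` can change what the features capture.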
  
**As labels we will use:**
* Agonist: 1
* Antagonist: 0

In [17]:
# Calculate the features of every ligand over the frame window [start, stop)
start, stop = 0, 2500

full_dataset = []

# Agonists are labeled 1 and antagonists 0; agonist rows are appended first
for ligand_class, label in [('Agonists', 1), ('Antagonists', 0)]:
    for which_ligand in analysis_actors_dict[ligand_class]:
        # Rg
        rg = which_ligand.get_radius_of_gyration()[start:stop]
        mean_rg = np.mean(rg)
        std_rg = np.std(rg)

        # SASA
        sasa = which_ligand.get_sasa()[1][start:stop]
        mean_sasa = np.mean(sasa)
        std_sasa = np.std(sasa)

        # For each ligand we create a vector of [Mean Rg, Std Rg, Mean SASA, Std SASA, Ligand_Label]
        full_dataset.append([mean_rg, std_rg, mean_sasa, std_sasa, label])

    
dataset_df = pd.DataFrame(full_dataset, columns=['RgMean', 'RgStd', 'SASAMean', 'SASAstd', 'LigandLabel'])

display(dataset_df)

|  | RgMean | RgStd | SASAMean | SASAstd | LigandLabel |
|---|---|---|---|---|---|
| 0 | 20.919072 | 0.087596 | 156.570844 | 2.668977 | 1 |
| 1 | 21.012352 | 0.100447 | 157.272992 | 3.496758 | 1 |
| 2 | 21.066259 | 0.139269 | 158.865216 | 3.338327 | 1 |
| 3 | 21.054636 | 0.099477 | 160.425158 | 3.087227 | 1 |
| 4 | 21.065666 | 0.086757 | 160.09378 | 2.563157 | 1 |
| 5 | 20.987462 | 0.091215 | 155.726417 | 2.701195 | 1 |
| 6 | 21.111873 | 0.072859 | 157.394022 | 2.367292 | 1 |
| 7 | 21.056187 | 0.12194 | 157.885306 | 3.425428 | 1 |
| 8 | 20.954715 | 0.082948 | 155.38164 | 2.691224 | 1 |
| 9 | 20.87954 | 0.073616 | 153.995794 | 2.660802 | 1 |


## Splitting and Training


The main problem in our task is the small number of data points. This means that a good (or bad) result may be due to chance and not reflect reality. Taking that into account, we will use k-fold cross-validation so that the evaluation generalizes as well as possible.
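
With so few samples, a single 5-fold split can itself be noisy. One possible mitigation, sketched here on synthetic data, is to repeat the stratified split several times and report the spread of the scores. The toy dataset from `make_classification` (33 samples, 4 features) is an assumption standing in for the real features, and repeating the folds is a suggestion rather than what the cells in this notebook do.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Synthetic stand-in for the real data: 33 samples, 4 features (assumption)
X_toy, y_toy = make_classification(
    n_samples=33, n_features=4, n_informative=2, n_redundant=0, random_state=0
)

# 5 stratified folds repeated 10 times gives 50 scores instead of 5
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=0)
scores = cross_val_score(
    LogisticRegression(max_iter=1000), X_toy, y_toy, cv=cv, scoring='accuracy'
)

print(f'Accuracy: {scores.mean():.3f} +/- {scores.std():.3f} over {len(scores)} folds')
```

Fixing `random_state` also makes the splits reproducible between runs.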

In [None]:
# Iris dataset is used for debugging
iris = pd.read_csv(filepath_or_buffer='../datasets/misc/iris.csv')

X = np.array(iris)[:, :-1]
y = np.array(iris)[:, -1]


In [76]:
# Separate the features from the labels. The rows are ordered by class:
# the first k rows are agonists, followed by the antagonists
X = np.array(dataset_df)[:, :-1]
y = np.array(dataset_df)[:, -1]


# This dict will be used to save the metrics of each fold
total_metrics_train = {
    "acc": 0,
    "f1": 0,
    "rec": 0,
    "auc": 0
}

total_metrics_test = {
    "acc": 0,
    "f1": 0,
    "rec": 0,
    "auc": 0
}

def calculate_metrics(y_true, y_pred, total_metrics, y_pred_probs=None, print_enabled=True):
    # Calculate metrics
    acc = metrics.accuracy_score(y_true, y_pred)
    f1 = metrics.f1_score(y_true, y_pred)
    rec = metrics.recall_score(y_true, y_pred)
    if y_pred_probs is not None:
        auc = metrics.roc_auc_score(y_true, y_pred_probs)
    
    # Update total metrics
    total_metrics['acc'] += acc
    total_metrics['f1'] += f1
    total_metrics['rec'] += rec
    if y_pred_probs is not None:
        total_metrics['auc'] += auc
    
    # Print metrics
    if print_enabled:
        print(f'\t\tAccuraccy: {acc}')
        print(f'\t\tRecall: {rec}')
        print(f'\t\tF1_Score: {f1}')
        if y_pred_probs is not None:
            print(f'\t\tAUC: {auc}')
    
def print_total_metrics(total_metrics, splits):
    print('> Total Metrics')
    print(f'\tAccuraccy: {total_metrics["acc"] / splits}')
    print(f'\tRecall: {total_metrics["rec"] / splits}')
    print(f'\tF1_Score: {total_metrics["f1"] / splits}')
    if total_metrics['auc'] != 0:
        print(f'\tAUC: {total_metrics["auc"] / splits}')
    
which_split = 0
kf = StratifiedKFold(n_splits=5, shuffle=True)
for train_index, test_index in kf.split(X, y):
    print(f'> Split: {which_split}')
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index].astype(int), y[test_index].astype(int)
    
    # We will start with simple models like Logistic Regression
    clf = LogisticRegression().fit(X_train, y_train)
    
    # Predict on training set
    train_pred = clf.predict(X_train)
    train_pred_proba = clf.predict_proba(X_train)[:, 1]

    # Predict on test set
    test_pred = clf.predict(X_test)
    test_pred_proba = clf.predict_proba(X_test)[:, 1]
    
    # Metrics on the train set
    print('\tTraining Metrics')
    calculate_metrics(y_train, train_pred, total_metrics_train, y_pred_probs=train_pred_proba, print_enabled=True)
    
    # Metrics on the validation set
    print('\tValidation Metrics')
    calculate_metrics(y_test, test_pred, total_metrics_test, y_pred_probs=test_pred_proba, print_enabled=True)
    
    which_split += 1
    
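# Average the metrics over all folds, for both the training and the validation sets
print('Training')
print_total_metrics(total_metrics_train, which_split)
print('Validation')
print_total_metrics(total_metrics_test, which_split)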

> Split: 0
	Training Metrics
		Accuraccy: 0.5
		Recall: 0.08333333333333333
		F1_Score: 0.13333333333333333
		AUC: 0.5773809523809523
	Validation Metrics
		Accuraccy: 0.5714285714285714
		Recall: 0.0
		F1_Score: 0.0
		AUC: 0.3333333333333333
> Split: 1
	Training Metrics
		Accuraccy: 0.6538461538461539
		Recall: 0.4166666666666667
		F1_Score: 0.5263157894736842
		AUC: 0.6607142857142857
	Validation Metrics
		Accuraccy: 0.42857142857142855
		Recall: 0.3333333333333333
		F1_Score: 0.3333333333333333
		AUC: 0.25
> Split: 2
	Training Metrics
		Accuraccy: 0.5769230769230769
		Recall: 0.25
		F1_Score: 0.35294117647058826
		AUC: 0.5833333333333334
	Validation Metrics
		Accuraccy: 0.42857142857142855
		Recall: 0.0
		F1_Score: 0.0
		AUC: 0.25
> Split: 3
	Training Metrics
		Accuraccy: 0.5555555555555556
		Recall: 0.0
		F1_Score: 0.0
		AUC: 0.5722222222222222
	Validation Metrics
		Accuraccy: 0.5
		Recall: 0.0
		F1_Score: 0.0
		AUC: 0.4444444444444444
> Split: 4
	Training Metrics
		Accuraccy: 0.629

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
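
The warning above suggests scaling the data or raising `max_iter`. A minimal sketch of both, assuming a synthetic dataset whose feature scales only loosely resemble the table above, wraps `StandardScaler` and `LogisticRegression` in a pipeline so that the scaler is fitted on the training folds only. The toy data and the choice of `max_iter=1000` are assumptions, not results from the simulations.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with very different feature scales (assumption)
rng = np.random.default_rng(0)
n_per_class = 16
X_toy = np.column_stack([
    rng.normal(21.0, 0.1, size=2 * n_per_class),   # scale similar to RgMean
    rng.normal(0.1, 0.02, size=2 * n_per_class),   # scale similar to RgStd
    rng.normal(157.0, 2.0, size=2 * n_per_class),  # scale similar to SASAMean
    rng.normal(3.0, 0.4, size=2 * n_per_class),    # scale similar to SASAstd
])
y_toy = np.repeat([1, 0], n_per_class)

# The scaler is refitted inside every training fold, so no test-fold statistics leak in
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X_toy, y_toy, cv=cv, scoring='roc_auc')
```

Because the toy labels are independent of the toy features, the scores here carry no meaning; the point is only the structure of the pipeline.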
