# SVM Model for Mechanism Prediction - Evaluation & Validation
## Overview

This notebook is based on the original work "Building Predictive Models for Mechanism-of-Action Classification from Phenotypic Assay Data Sets". The aim is to understand the approach and build a working model. It is NOT intended to replicate the original work exactly, but rather to achieve similar or better performance.

The [original work](http://journals.sagepub.com/doi/abs/10.1177/1087057113505324) built a predictive model for assigning a mechanism class to compounds and bioactive agents. The model used 84 features and 309 environmental chemicals. Knowing the mechanism class of a chemical helps in evaluating the safety and efficacy of compounds. The classes include inhibitors of mitochondrial and microtubule function, among others.

## Read the data

In [1]:
import numpy as np
import pandas as pd
from scripts.profile_reader2 import ProfileReader

# Show up to 100 rows and long cell contents without truncation.
pd.options.display.max_rows = 100
pd.options.display.max_colwidth = 1000

# Load the profile data and the mechanism annotations.
pr = ProfileReader(data_file='data\\Final_Berg JBS 2013 Supplemental Table 3_For SVM14Dec2017.xlsx',
                   mechanism_file='data\\Final_Berg JBS 2013 Supplemental Table 3_For SVM14Dec2017 - Mechanisms.xlsx')

# Get the profiles indexed by mechanism class.
data = pr.get_profile(index='mech')

data.head(5)

Unnamed: 0,BrEPI_IL-1b/TNF-a/IFN-g_24:CD87/uPAR,BrEPI_IL-1b/TNF-a/IFN-g_24:CXCL10/IP-10,BrEPI_IL-1b/TNF-a/IFN-g_24:CXCL9/MIG,BrEPI_IL-1b/TNF-a/IFN-g_24:HLA-DR,BrEPI_IL-1b/TNF-a/IFN-g_24:IL-1alpha,BrEPI_IL-1b/TNF-a/IFN-g_24:MMP-1,BrEPI_IL-1b/TNF-a/IFN-g_24:PAI-I,BrEPI_IL-1b/TNF-a/IFN-g_24:SRB,BrEPI_IL-1b/TNF-a/IFN-g_24:tPA,BrEPI_IL-1b/TNF-a/IFN-g_24:uPA,...,CASMC_HCL_IL-1b/TNF-a/IFN-g_24:CD87/uPAR,CASMC_HCL_IL-1b/TNF-a/IFN-g_24:CXCL8/IL-8,CASMC_HCL_IL-1b/TNF-a/IFN-g_24:CXCL9/MIG,CASMC_HCL_IL-1b/TNF-a/IFN-g_24:HLA-DR,CASMC_HCL_IL-1b/TNF-a/IFN-g_24:IL-6,CASMC_HCL_IL-1b/TNF-a/IFN-g_24:LDLR,CASMC_HCL_IL-1b/TNF-a/IFN-g_24:M-CSF,CASMC_HCL_IL-1b/TNF-a/IFN-g_24:Proliferation,CASMC_HCL_IL-1b/TNF-a/IFN-g_24:Serum Amyloid A,CASMC_HCL_IL-1b/TNF-a/IFN-g_24:SRB
AhR agonist,0.031453,0.020148,-0.006715,-0.027737,-0.022183,0.110827,-0.112386,-0.003285,0.102482,0.018481,...,-0.023821,0.022642,0.006361,-0.022993,0.021574,0.021963,-0.019432,-0.040229,-0.015235,-0.030172
AhR agonist,0.032902,0.010241,-0.004796,0.02303,0.029068,0.095447,-0.076931,0.001966,0.142174,0.01466,...,-0.027825,0.022642,0.00998,-0.02036,0.035095,0.025294,-0.025464,-0.023608,-0.002966,-0.010213
AhR agonist,0.02311,-0.003607,-0.002454,0.019399,0.038214,0.095619,-0.109271,0.000255,0.129162,-0.001623,...,-0.03197,0.023423,0.014892,-0.024019,0.034887,0.030041,-0.023914,-0.029077,0.001851,-0.019381
AhR agonist,0.031829,-0.008869,-0.003555,0.011887,0.030835,0.090178,-0.070916,-0.001388,0.110995,0.003789,...,-0.035239,0.022086,0.013848,-0.011024,0.049676,0.029844,-0.026556,-0.044014,-0.003483,-0.006186
Calcineurin inhibitor,-0.035647,-0.006448,0.012811,0.014311,-0.027989,-0.085317,-0.10013,-0.008895,-0.089039,-0.080649,...,-0.002686,-0.038858,0.01031,0.037441,-0.073437,0.017599,-0.021265,0.009795,-0.009959,0.000975


## Features
Each feature label combines the cell type and stimulation (System) with the readout name (Readout), separated by a colon. There are 8 Systems. The two tables below list the readouts and the readout count for each System.

In [5]:
pr.get_system_markers(agg=True)

| | System | Marker |
|---|---|---|
| 0 | BrEPI_IL-1b/TNF-a/IFN-g_24 | CD87/uPAR, CXCL10/IP-10, CXCL9/MIG, HLA-DR, IL-1alpha, MMP-1, PAI-I, SRB, tPA, uPA |
| 1 | CASMC_HCL_IL-1b/TNF-a/IFN-g_24 | CCL2/MCP-1, CD106/VCAM-1, CD141/Thrombomodulin, CD142/Tissue Factor, CD87/uPAR, CXCL8/IL-8, CXCL9/MIG, HLA-DR, IL-6, LDLR, M-CSF, Proliferation, SRB, Serum Amyloid A |
| 2 | HDFn_IL-1b/TNF-a/IFN-g/EGF/FGF/PDGFbb_24 | CD106/VCAM-1, CXCL10/IP-10, CXCL8/IL-8, CXCL9/MIG, Collagen III, EGFR, M-CSF, MMP-1, PAI-I, Proliferation_72hr, SRB, TIMP-2 |
| 3 | HEK/HDFn_IL-1b/TNF-a/IFN-g/TGF-b_24 | CCL2/MCP-1, CD54/ICAM-1, CXCL10/IP-10, IL-1alpha, MMP-9, SRB, TIMP-2, uPA |
| 4 | HUVEC/PBMC_LPS_24 | CCL2/MCP-1, CD106/VCAM-1, CD142/Tissue Factor, CD40, CD62E/E-Selectin, CXCL8/IL-8, IL-1alpha, M-CSF, SRB, sPGE2, sTNF-alpha |
| 5 | HUVEC/PBMC_SEB/TSST_24 | CCL2/MCP-1, CD38, CD40, CD62E/E-Selectin, CD69, CXCL8/IL-8, CXCL9/MIG, PBMC Cytotoxicity, Proliferation, SRB |
| 6 | HUVEC_IL-1b/TNF-a/IFN-g_24 | CCL2/MCP-1, CD106/VCAM-1, CD141/Thrombomodulin, CD142/Tissue Factor, CD54/ICAM-1, CD62E/E-Selectin, CD87/uPAR, CXCL8/IL-8, CXCL9/MIG, HLA-DR, Proliferation, SRB |
| 7 | HUVEC_IL-4/Histamine_24 | CCL2/MCP-1, CCL26/Eotaxin-3, CD106/VCAM-1, CD62P/P-selectin, CD87/uPAR, SRB, VEGFR2 |


In [6]:
pr.get_system_marker_count()

| | System | Marker count |
|---|---|---|
| 0 | BrEPI_IL-1b/TNF-a/IFN-g_24 | 10 |
| 1 | CASMC_HCL_IL-1b/TNF-a/IFN-g_24 | 14 |
| 2 | HDFn_IL-1b/TNF-a/IFN-g/EGF/FGF/PDGFbb_24 | 12 |
| 3 | HEK/HDFn_IL-1b/TNF-a/IFN-g/TGF-b_24 | 8 |
| 4 | HUVEC/PBMC_LPS_24 | 11 |
| 5 | HUVEC/PBMC_SEB/TSST_24 | 10 |
| 6 | HUVEC_IL-1b/TNF-a/IFN-g_24 | 12 |
| 7 | HUVEC_IL-4/Histamine_24 | 7 |
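
The System:Readout label convention can be illustrated with a minimal sketch. The three labels are taken from the tables above, but the helper `split_feature_label` and the variable names are assumptions made for illustration and are not part of `ProfileReader`. The per-System counts are copied from the count table and should add up to the 84 features mentioned in the overview.

```python
from collections import Counter

# A small sample of column labels in the System:Readout format.
feature_labels = [
    "BrEPI_IL-1b/TNF-a/IFN-g_24:CD87/uPAR",
    "BrEPI_IL-1b/TNF-a/IFN-g_24:SRB",
    "HUVEC_IL-4/Histamine_24:VEGFR2",
]


def split_feature_label(label):
    """Split 'System:Readout' at the first colon into (system, readout)."""
    system, readout = label.split(":", 1)
    return system, readout


pairs = [split_feature_label(label) for label in feature_labels]
readouts_per_system = Counter(system for system, _ in pairs)
print(readouts_per_system)

# Readout counts per System, copied from the count table above.
system_counts = {
    "BrEPI_IL-1b/TNF-a/IFN-g_24": 10,
    "CASMC_HCL_IL-1b/TNF-a/IFN-g_24": 14,
    "HDFn_IL-1b/TNF-a/IFN-g/EGF/FGF/PDGFbb_24": 12,
    "HEK/HDFn_IL-1b/TNF-a/IFN-g/TGF-b_24": 8,
    "HUVEC/PBMC_LPS_24": 11,
    "HUVEC/PBMC_SEB/TSST_24": 10,
    "HUVEC_IL-1b/TNF-a/IFN-g_24": 12,
    "HUVEC_IL-4/Histamine_24": 7,
}
total_features = sum(system_counts.values())
print(total_features)
```

Splitting at the first colon only is a deliberate choice: readout names contain slashes and underscores, and limiting the split keeps the sketch safe if a readout name ever contained a colon.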


## Targets
Each target to be predicted is a mechanism class. The table below lists the mechanisms and the number of profiles representing each one.


In [7]:
pr.get_mechanism_count()

| | Mechanism | Count |
|---|---|---|
| 0 | mTOR inhibitor | 27 |
| 1 | HMG-CoA reductase inhibitor | 25 |
| 2 | Mitochondrial inhibitor | 24 |
| 3 | Microtubule disruptor | 18 |
| 4 | Microtubule stabilizer | 18 |
| 5 | GR agonist | 16 |
| 6 | RAR/RXR agonist | 15 |
| 7 | PDE IV inhibitor | 15 |
| 8 | p38 MAPK inhibitor | 14 |
| 9 | EGFR inhibitor | 14 |


----
## Data Exploration
We begin by looking at one mechanism: mTOR inhibitor.

### Feature observations
Each feature in the data set is a biomarker readout, labeled as System:Readout. Together, the readouts make up a compound profile, generated by measuring changes in the levels of a set of biomarkers (proteins, mediators, known disease risk factors, etc.) in each System. The tables above show 7 to 14 readouts per System. Each feature indicates how a cell type (System) responds to a drug by producing more or less of the measured readout.
We hope that a cell's response to a drug class shows up as a characteristic pattern in the profile. The profile is a vector of continuous values that represents a point in 84-dimensional space, each readout being one dimension. Each value is the logarithm of the n-fold increase or decrease in the expression level of a biomarker. Values typically range from -2 to +2.
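
As a small worked example of the log-ratio idea, the sketch below converts a treated and a control level into a logarithmic fold change. The base of the logarithm is not stated here, so base 10 is an assumption, and the function name `log_ratio` and the input values are hypothetical.

```python
import math


def log_ratio(treated, control, base=10.0):
    """Logarithm of the fold change of a readout relative to its control.

    The base (10 here) is an assumption, not taken from the source data.
    """
    return math.log(treated / control, base)


# Hypothetical readout levels: a 2-fold increase and a 2-fold decrease.
increase = log_ratio(200.0, 100.0)
decrease = log_ratio(50.0, 100.0)
print(increase, decrease)
```

One reason to work on a log scale is symmetry: an n-fold increase and an n-fold decrease get values of equal magnitude and opposite sign, so neither direction dominates the profile vector.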

### Calculate Statistics
Summary statistics for the profiles are calculated in [this notebook](Profile%20Statistics.ipynb).

### Profile plots
Plots of all data points, grouped by mechanism class, can be seen in [this notebook](Profile%20Plotting%20for%20each%20Mechanims.ipynb).

### Mechanism box plots
The visualization of readout value distribution as box plots can be seen in [this notebook](Box%20plots%20of%20mechanism%20classes.ipynb). 



### Generating random profiles to use as negative class
You can see the plots of the negative profiles [in this notebook](Print%20Random%20Profiles.ipynb).

----
## Develop Model

### Define performance metric
It is important to establish a single numerical measure of how well our model performs. The F-beta score is a good measure because it takes both precision and recall into account. Choosing 0.5 for beta rewards higher precision.
	Precision (P): TP / (TP + FP)
	Recall (R): TP / (TP + FN)
	(Precision is referred to by the original paper as PPV)
	F0.5 gives recall half the weight of precision, that is, precision counts twice as much as recall.
	
$F_\beta = (1 + \beta^2) * \frac{precision * recall}{(\beta^2 * precision) + recall}$
    
$F_{0.5} = (1.25) * \frac{precision * recall}{(0.25 * precision) + recall}$
	
From <https://en.wikipedia.org/wiki/Precision_and_recall> 

For the purposes of comparing with the original work, PPV will be calculated, too.
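
The formulas above can be checked with a minimal pure-Python sketch. The function names and the TP/FP/FN counts are hypothetical and chosen only to show that F0.5 favors a high-precision classifier, while F1 treats precision and recall equally.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def f_beta(precision, recall, beta=0.5):
    """F-beta score; beta < 1 weights precision more heavily than recall."""
    denominator = beta ** 2 * precision + recall
    if denominator == 0:
        return 0.0
    return (1 + beta ** 2) * precision * recall / denominator


# Hypothetical counts for a high-precision, low-recall classifier.
p, r = precision_recall(tp=8, fp=2, fn=8)  # precision 0.8 (PPV), recall 0.5
f_half = f_beta(p, r, beta=0.5)
f_one = f_beta(p, r, beta=1.0)

# Swapping precision and recall lowers F0.5 but leaves F1 unchanged.
f_half_swapped = f_beta(r, p, beta=0.5)
f_one_swapped = f_beta(r, p, beta=1.0)
print(f_half, f_one, f_half_swapped, f_one_swapped)
```

The precision value `p` is the same quantity as the PPV reported for comparison with the original work.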

### Does the model capture differences among classes?

----
## Optimize Model

### Grid Search
 - Grid search is used to find the optimal hyper-parameter values of a learning algorithm. It calculates a performance score for each combination in a range of hyper-parameter values and picks the best-scoring set. It is a 'grid' in the sense that the combinations of hyper-parameter values can be summarized and displayed in a grid or table.
 - To apply it, a set of hyper-parameters is varied over predefined ranges of values and the model is trained with each combination. The optimal set of hyper-parameters is then reported. The model is then tested (evaluated) using a portion of the data not used for training.
 - The most important hyper-parameter when working with SVM is C, which controls the balance between limiting margin violations and keeping the separation as wide as possible. Reducing C tends to make the model generalize better because it emphasizes the regularization term, although too small a value can underfit. A large value of C gives higher weight to individual training points and therefore increases variance.
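
The procedure described in the list above can be sketched with scikit-learn's `GridSearchCV`. This is an illustration under stated assumptions, not the notebook's actual training code: the data are synthetic (generated by `make_classification` with 84 features to mirror the profile width), and the linear kernel, the C values, the 5-fold cross-validation, and the macro-averaged F0.5 scorer are all choices made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the 84-feature profiles (NOT the real data).
X, y = make_classification(
    n_samples=300, n_features=84, n_informative=20, n_classes=3, random_state=0
)

# Hold out a test portion that the grid search never sees.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# F0.5 averaged over classes, matching the metric defined earlier.
f_half_scorer = make_scorer(fbeta_score, beta=0.5, average="macro")

# Candidate values for C (assumed range, for illustration only).
param_grid = {"C": [0.01, 0.1, 1, 10]}

search = GridSearchCV(SVC(kernel="linear"), param_grid, scoring=f_half_scorer, cv=5)
search.fit(X_train, y_train)

best_c = search.best_params_["C"]
test_score = fbeta_score(y_test, search.predict(X_test), beta=0.5, average="macro")
print(best_c, test_score)
```

The per-combination scores are available in `search.cv_results_`, which is the 'grid' that can be displayed as a table.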

----
## Analyzing Model Performance
In this section we'll take a look at the model's learning and testing performance on subsets of the training data. Graphing the model's performance against the hyper-parameter "C" and the number of training points can reveal details that may not be apparent from the results alone.

### Learning Curves
This graph visualizes the SVM model's training and testing performance with increasing data set size. The shaded region denotes the uncertainty of the curve, measured as the standard deviation.
The model is scored on both the training and testing sets using F1, the harmonic mean of precision and recall.
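
The numbers behind such a learning curve can be computed with scikit-learn's `learning_curve`, as in the sketch below. The data are synthetic, and the kernel, the value of C, the training-size grid, and the `f1_macro` scoring are assumptions made for illustration. Plotting is left out, but the mean would form the curve and the mean plus or minus the standard deviation would form the shaded region.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.svm import SVC

# Synthetic stand-in for the 84-feature profiles (NOT the real data).
X, y = make_classification(
    n_samples=300, n_features=84, n_informative=20, random_state=0
)

# Score the model on 5 increasing training-set sizes with 5-fold cross-validation.
train_sizes, train_scores, test_scores = learning_curve(
    SVC(kernel="linear", C=1.0),
    X,
    y,
    train_sizes=np.linspace(0.2, 1.0, 5),
    cv=5,
    scoring="f1_macro",
    shuffle=True,
    random_state=0,
)

# Rows are training sizes, columns are cross-validation folds.
train_mean, train_std = train_scores.mean(axis=1), train_scores.std(axis=1)
test_mean, test_std = test_scores.mean(axis=1), test_scores.std(axis=1)

for n, tr, te, sd in zip(train_sizes, train_mean, test_mean, test_std):
    print(f"n={n:3d}  train F1={tr:.3f}  test F1={te:.3f} (+/- {sd:.3f})")
```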



-----
## Evaluating Model Performance
We'll construct the optimized model and use it to make predictions on the compounds in client data sets.

----
## Making Predictions
Now that the SVM model has been trained on the given set of data, it can be used to make predictions on new sets of input compounds. We can use these predictions to learn whether a compound exhibits features that make it likely to belong to a mechanism class.
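
A minimal sketch of the prediction step is shown below. Everything in it is an assumption for illustration: the profiles are synthetic random vectors, only two class labels are used (borrowed from the mechanism table), and the kernel and C are not the notebook's tuned values.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic stand-in profiles: two well-separated clusters in 84 dimensions.
X_a = rng.normal(loc=-0.1, scale=0.02, size=(20, 84))
X_b = rng.normal(loc=0.1, scale=0.02, size=(20, 84))
X_train = np.vstack([X_a, X_b])
y_train = np.array(["mTOR inhibitor"] * 20 + ["GR agonist"] * 20)

model = SVC(kernel="linear", C=1.0).fit(X_train, y_train)

# Hypothetical new compounds drawn near the second cluster.
new_profiles = rng.normal(loc=0.1, scale=0.02, size=(3, 84))

predicted = model.predict(new_profiles)           # predicted mechanism class
margins = model.decision_function(new_profiles)   # signed distance from the boundary
print(predicted, margins)
```

In the two-class case, the sign of `decision_function` selects the class and its magnitude indicates how far a profile lies from the decision boundary, which gives a rough sense of how strongly a compound resembles a mechanism class.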
