# Classification Notebook
## Main questions of interest:
<ol> 
<li>Which descriptors best predict potency?  How do we validate these?
    <ul>
    <li>[...]</li>
    <li>[...]</li>
    </ul> 
</li>

<li>Can we augment the data set with predicted negative data (molecules expected to be inactive) to improve our machine learning models? Are there certain characteristics of negative data sets that are the most useful for training?
    <ul>

        <li>[...]</li>
        <li>[...]</li>
    </ul>
</li>

<li>Given the limited size of the data set and the high cost of experiments, can we use ML to identify the missing data that would be best for model training?
    <ul>

        <li>[...]</li>
        <li>[...]</li>
    </ul>
</li>
<li>Which compounds cluster most closely with OSM-S-106?
    <ul>
        <li>[...]</li>
        <li>[...]</li>
    </ul>
</li>
<li>Would this provide clues as to the mechanism of OSM-S-106?
    <ul>

        <li>[...]</li>
        <li>[...]</li>
    </ul>
</li>
<li>How well do more advanced ML models perform over simple methods like multiple linear regression, SVM, and random forest?
    <ul>

        <li>[...]</li>
        <li>[...]</li>
    </ul></li>
</ol>

## Import Libraries
<hr>

In [1]:
# Core
import numpy as np
import pandas as pd
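import os
import subprocess
import sys

def install_package(name):
    """Install a package with pip for the interpreter running this notebook."""
    # Call pip through the current interpreter instead of piping a sudo
    # password into a shell command.
    subprocess.check_call([sys.executable, "-m", "pip", "install", name])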

# Stats
from statsmodels.regression import linear_model
import statsmodels.api as sm

# ML
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression

# Plotting
import matplotlib.pyplot as plt
import seaborn as sns



## Import Data
<hr>

Data provided from: https://www.dropbox.com/sh/c9mbk8e2e8jxqfy/AADSmMbdoZduyG7Eq0HwOTT_a?dl=0

### Series3_6.15.17_padel.csv
This contains the data on OSM-S-106 and other OSM compounds. The field "IC50" describes potency: a smaller IC50 corresponds to higher potency, although a value of zero is impossible. A value of >40 means that the compound is not active enough to be interesting to us. OSM-S-106 has IC50 = 0.036.
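As a minimal sketch of how this description could become a classification target, the snippet below labels compounds as active or inactive. The cutoff of 40 comes from the text above; treating exactly 40 as active, the toy values other than 0.036, and the names `ACTIVITY_CUTOFF`, `is_active`, and `neg_log_ic50` are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd

# Toy IC50 values (hypothetical, except 0.036 which is quoted for OSM-S-106).
ic50 = pd.Series([0.036, 1.2, 40.0, 55.0, 250.0], name="IC50")

# Assumed cutoff: anything above 40 is treated as inactive.
ACTIVITY_CUTOFF = 40.0
is_active = (ic50 <= ACTIVITY_CUTOFF).astype(int)

# IC50 is strictly positive, so a negative log transform is well defined
# and makes larger values mean higher potency.
neg_log_ic50 = -np.log10(ic50)

print(pd.DataFrame({"IC50": ic50, "is_active": is_active, "neg_log_ic50": neg_log_ic50}))
```

The log transform is optional; it only spreads out the very small IC50 values so that potent compounds are easier to compare.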

### *_decoys_padel.csv
8 data sets called *_decoys_padel.csv. These large data sets contain compounds predicted to have minimal or no activity (IC50 can be assigned >200?).

### Selleck_filtered_padel_corrected.csv
This is a set of well-characterized drugs from a vendor. We wish to identify the drugs that are most similar to OSM-S-106 and predicted to be potent.
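One possible way to rank drugs by similarity to OSM-S-106 is Euclidean distance in standardized descriptor space. The sketch below is an illustration, not the method used in this notebook: the descriptor columns `desc_a` and `desc_b`, the drug names, and all values are hypothetical stand-ins for the real PaDEL descriptors.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical descriptor table standing in for the vendor drug set.
drugs = pd.DataFrame(
    {"desc_a": [1.0, 2.0, 10.0, 1.1], "desc_b": [0.5, 0.4, 3.0, 0.6]},
    index=["drug_w", "drug_x", "drug_y", "drug_z"],
)

# Hypothetical descriptor values for the reference compound.
reference = pd.DataFrame({"desc_a": [1.0], "desc_b": [0.52]}, index=["OSM-S-106"])

# Standardize so that no single descriptor dominates the distance.
scaler = StandardScaler().fit(drugs)
drugs_scaled = scaler.transform(drugs)
ref_scaled = scaler.transform(reference)

# Euclidean distance from every drug to the reference, smallest first.
distances = np.linalg.norm(drugs_scaled - ref_scaled, axis=1)
ranking = pd.Series(distances, index=drugs.index, name="distance").sort_values()

print(ranking)
```

Other distance measures or a fingerprint-based similarity could be substituted; the scaling step matters mainly when descriptors have very different ranges.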

In [None]:
# Read in main data
main_df = pd.read_csv("Series3_6.15.17_padel.csv")

# Read in decoy ("placebo") data
placebo_1 = pd.read_csv("Akt1_decoys_padel.csv")
placebo_2 = pd.read_csv("AmpC_decoys_padel.csv")
placebo_3 = pd.read_csv("cp3a4_decoys_padel.csv")
placebo_4 = pd.read_csv("cxcr4_decoys_padel.csv")
placebo_5 = pd.read_csv("HIVpr_decoys_padel.csv")
placebo_6 = pd.read_csv("HIVrt_decoys_padel.csv")
placebo_7 = pd.read_csv("Kif11_decoys_padel.csv")
# NOTE: placebo_8 is the Selleck vendor drug set, not a decoy file.
placebo_8 = pd.read_csv("Selleck_filtered_padel_corrected.csv")

# Insert a dummy response column (IC50 = 250.0, i.e. assumed inactive)
# at position 1 of each frame. A scalar is broadcast to every row.
for placebo in (placebo_1, placebo_2, placebo_3, placebo_4,
                placebo_5, placebo_6, placebo_7, placebo_8):
    placebo.insert(1, "IC50", 250.0)

# Gather our dataframes for collective manipulation
frames = [main_df,   placebo_1, placebo_2, 
          placebo_3, placebo_4, placebo_5,
          placebo_6, placebo_7, placebo_8]
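
If the gathered frames later need to be stacked into a single table, one option is `pd.concat` restricted to the columns they share. This is a sketch under assumptions: the toy frames, the column names `Name`, `desc_a`, and `extra`, and the `source` index level are hypothetical and not taken from the real files.

```python
import pandas as pd

# Toy stand-ins for the main data set and one decoy set.
main_toy = pd.DataFrame(
    {"Name": ["m1", "m2"], "IC50": [0.036, 55.0], "desc_a": [1.0, 2.0], "extra": [7, 8]}
)
decoy_toy = pd.DataFrame({"Name": ["d1"], "IC50": [250.0], "desc_a": [3.0]})

# join="inner" keeps only the columns present in every frame;
# keys= records which frame each row came from in the outer index level.
combined = pd.concat(
    [main_toy, decoy_toy], keys=["main", "decoy"], names=["source", "row"], join="inner"
)

print(combined)
```

Keeping the source label in the index makes it easy to separate measured compounds from assumed-inactive ones again after shared preprocessing.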


### Remove NAs from main dataset

In [None]:
# Keep only complete rows: this drops any row with a null in ANY column,
# not just rows with a null IC50 response.
row_mask = frames[0].notnull().all(axis=1)
frames[0] = frames[0].loc[row_mask, :]
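
The mask above keeps only rows with no null in any column, which is stricter than requiring a non-null response. The sketch below contrasts the two filters on a toy frame; the column `desc_a` and all values other than 0.036 are hypothetical.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {"IC50": [0.036, np.nan, 12.0], "desc_a": [1.0, 2.0, np.nan]}
)

# Strict filter: a null in any column removes the row.
complete_rows = toy.loc[toy.notnull().all(axis=1), :]

# Looser filter: only a null response (IC50) removes the row.
response_present = toy.dropna(subset=["IC50"])

print(len(complete_rows), len(response_present))
```

Which filter is appropriate depends on whether rows with missing descriptors can be imputed or must be discarded.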