## Overview ##

PubChem is a site run by the NIH that hosts raw data from chemical experiments. Here we analyze the data hosted at PubChem for assay 1030 (AID 1030), which looks for inhibitors of the protein encoded by the gene ALDH1A1. You can access the page for this assay [here](https://pubchem.ncbi.nlm.nih.gov/bioassay/1030).

## Results ##

We begin the featurization process with the SMILES string, a common text representation of a molecule among chemists. Because the length of this string varies, each molecule is converted to a fixed-length Morgan fingerprint. The fingerprints are then used to train various regression models on the activity score, and the model outputs are thresholded so that the models can act as binary classifiers. The aim is to see whether the continuous activity score has predictive value. More specifically, our goal is to find the highest precision in the 'True' label class, while keeping an eye on recall so as not to miss potentially useful compounds.
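
To make the idea of a regression output used as a binary classifier concrete, here is a minimal sketch in plain Python. The scores, the cutoff, and the helper names `binarize` and `precision_recall_true` are illustrative assumptions and are not taken from the assay data; the notebook itself relies on scikit-learn's `classification_report` for these numbers.

```python
def binarize(scores, threshold):
    """Turn continuous scores into True/False labels with a strict cutoff."""
    return [s > threshold for s in scores]


def precision_recall_true(y_true, y_pred):
    """Precision and recall for the True class from two lists of booleans."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up activity scores (not from AID 1030) and made-up model predictions.
true_scores = [0, 0, 12, 45, 0, 80, 3, 0]
pred_scores = [0.0, 5.0, 20.0, 0.0, 0.0, 60.0, 0.2, 30.0]

threshold = 0.4  # hypothetical cutoff, applied to both labels and predictions
y_true = binarize(true_scores, threshold)
y_pred = binarize(pred_scores, threshold)

precision, recall = precision_recall_true(y_true, y_pred)
print("precision(True) = {:.2f}, recall(True) = {:.2f}".format(precision, recall))
```

Precision counts how many compounds predicted 'True' really are active, while recall counts how many active compounds were found. Raising the cutoff on the predictions typically trades recall for precision, which is why the two are usually read together.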

In [1]:
# Exploratory data analysis (regression)

In [2]:
import pickle
import numpy as np
import pandas as pd
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem, rdMolDescriptors
from sklearn.ensemble import RandomForestClassifier
from imblearn.under_sampling import RandomUnderSampler
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold
import sys

import warnings
warnings.filterwarnings('ignore')

global_random_state = 42
k_fold_splits = 2

np.random.seed(global_random_state)

import logging
logger = logging.getLogger()
logger.setLevel(logging.INFO)

formatter = logging.Formatter('%(asctime)s - %(levelname)s - %(message)s')

fh = logging.FileHandler('log_regression.txt')
fh.setLevel(logging.INFO)
fh.setFormatter(formatter)
logger.addHandler(fh)

ch = logging.StreamHandler(sys.stdout)
ch.setLevel(logging.INFO)
ch.setFormatter(formatter)
logger.addHandler(ch)

In [3]:
# Load assay info. Note: This CSV was obtained from PubChem bioassay aka PCBA, via searching for AID 1030 
# and downloading the datatable

ba_df = pd.read_csv("AID_1030_datatable_all.csv")

# Load compound info
cs_df = pd.read_csv("AID_1030_compound_smiles.csv",sep='\t',header=0)

# Merge the two
full_df = ba_df.merge(cs_df,on='PUBCHEM_CID')

# Cleanup the compound ID column
full_df["PUBCHEM_CID"] = full_df["PUBCHEM_CID"].astype(int)
full_df["PUBCHEM_ACTIVITY_SCORE"] = full_df["PUBCHEM_ACTIVITY_SCORE"].astype(int)


compound_ids = list()
smiles_list = list()
fingerprints = list()
activities = list()


for index, row in full_df.iterrows():
    cid = row["PUBCHEM_CID"]
    smiles_string = row["Smiles"]
    mol = Chem.MolFromSmiles(smiles_string)
    is_active = row["PUBCHEM_ACTIVITY_OUTCOME"] == "Active"
    activity_score = row["PUBCHEM_ACTIVITY_SCORE"]
    if mol is None:
        # RDKit returns None for SMILES it cannot parse; such compounds are skipped
        logger.warning("Molecule failed featurization at index: {0}".format(index))
    else: 
        fingerprint = rdMolDescriptors.GetMorganFingerprintAsBitVect(mol,2,nBits=2048,useChirality=False,
                                                                     useBondTypes=False,useFeatures=False)
        
        # Convert the RDKit bit vector to a NumPy array (pattern from the RDKit documentation)
        arr = np.zeros((1,))
        DataStructs.ConvertToNumpyArray(fingerprint, arr)
        fingerprint = arr
        
        compound_ids.append(cid)
        smiles_list.append(smiles_string)
        fingerprints.append(fingerprint)
        activities.append(activity_score)
    
    if index % 10000 == 0:
        logger.info("Processed index: {0}".format(index))

# Convert fingerprints and activity scores to NumPy arrays (scores as floats, for regression)

X = np.array(fingerprints)
y = np.array(activities,dtype=float)


with open("data.regression.nonsampled.pickle","wb") as f:
    pickle.dump((X,y),f)

2017-09-24 15:15:06,819 - INFO - Processed index: 0
2017-09-24 15:15:14,442 - INFO - Processed index: 10000
2017-09-24 15:15:21,863 - INFO - Processed index: 20000
2017-09-24 15:15:29,500 - INFO - Processed index: 30000
2017-09-24 15:15:37,187 - INFO - Processed index: 40000
2017-09-24 15:15:44,634 - INFO - Processed index: 50000
2017-09-24 15:15:51,897 - INFO - Processed index: 60000
2017-09-24 15:15:58,585 - INFO - Processed index: 70000
2017-09-24 15:16:06,000 - INFO - Processed index: 80000
2017-09-24 15:16:13,257 - INFO - Processed index: 90000
2017-09-24 15:16:20,820 - INFO - Processed index: 100000
2017-09-24 15:16:27,989 - INFO - Processed index: 110000
2017-09-24 15:16:35,501 - INFO - Processed index: 120000
2017-09-24 15:16:43,208 - INFO - Processed index: 130000
2017-09-24 15:16:50,753 - INFO - Processed index: 140000
2017-09-24 15:16:58,095 - INFO - Processed index: 150000
2017-09-24 15:17:05,133 - INFO - Processed index: 160000
2017-09-24 15:17:12,057 - INFO - Processed in
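
The featurization cell maps SMILES strings of any length to vectors of the same length. The toy sketch below illustrates only that fixed-length property, by hashing overlapping substrings of a string into a small bit vector. It is an analogy, not the Morgan algorithm: real Morgan fingerprints hash circular atom environments of the molecular graph, which RDKit handles in the cell above. The function name `toy_folded_bits`, the substring width, and the 64-bit length are assumptions chosen for illustration.

```python
import hashlib


def toy_folded_bits(text, n_bits=64, width=3):
    """Hash every overlapping substring of `text` into a fixed-length bit list.

    Toy stand-in for fingerprint folding; NOT the Morgan algorithm.
    """
    bits = [0] * n_bits
    for start in range(max(len(text) - width + 1, 1)):
        fragment = text[start:start + width]
        # Use a stable hash (Python's built-in hash() is salted per process).
        digest = hashlib.sha1(fragment.encode("utf-8")).digest()
        position = int.from_bytes(digest[:4], "big") % n_bits
        bits[position] = 1
    return bits


# Two SMILES strings of different lengths (ethanol and aspirin).
short_bits = toy_folded_bits("CCO")
long_bits = toy_folded_bits("CC(=O)OC1=CC=CC=C1C(=O)O")

# Both vectors have the same length, so they can be stacked into one matrix.
print(len(short_bits), len(long_bits))
print(sum(short_bits), sum(long_bits))
```

Because every molecule ends up with the same number of bits, the vectors can be stacked into a single 2-D array, which is the shape that scikit-learn estimators expect for `X`.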

In [4]:
# First we look at using a DecisionTreeRegressor, but since the performance on active compounds is so poor, we don't

import pickle
import numpy as np
import pandas as pd
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem, rdMolDescriptors
from collections import Counter
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error, classification_report
from sklearn.model_selection import KFold, train_test_split

smiles_list = None
compound_ids = None
fingerprints = None
activities = None

global_random_state = 42

with open('data.regression.nonsampled.pickle', 'rb') as f:
    # Load the (X, y) tuple written by the featurization cell
    (X, y) = pickle.load(f)

# Print the number of compounds loaded
logger.info("Successfully loaded {0} compounds.".format(len(X)))

kf = KFold(n_splits=k_fold_splits,shuffle=True,random_state=global_random_state)

mse_avg = 0

for train_index, test_index in kf.split(X, y):

    X_train = X[train_index]
    X_test = X[test_index]
    y_train = y[train_index]
    y_test = y[test_index]

    # Fit a regression tree on the continuous activity score
    regressor = DecisionTreeRegressor(random_state=global_random_state)
    regressor.fit(X_train,y_train)
    y_pred = regressor.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    logger.info("Computed mse score of: {}".format(mse))
    mse_avg = mse_avg + mse
    
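    # Binarize predictions and labels. PUBCHEM_ACTIVITY_SCORE was cast to int during
    # loading (PubChem scores run from 0 to 100), so "> .4" is equivalent to
    # "score >= 1" for the true labels.
    y_pred_binary = y_pred > .4
    y_test_binary = y_test > .4

    logger.info("How good is it as a classifier at 0.4 threshold?")
    logger.info(classification_report(y_test_binary, y_pred_binary))
    
    # Stricter cutoff on the predictions only; the true labels keep the 0.4 cutoff.
    # With integer-valued scores, this differs from the 0.4 case only where a
    # predicted value falls between 0.4 and 0.9.
    logger.info("How good is it at predicting True compounds at 0.9 threshold?")
    y_pred_binary = y_pred > .9
    y_test_binary = y_test > .4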
    
    logger.info(classification_report(y_test_binary, y_pred_binary))
    
mse_avg = mse_avg / k_fold_splits
logger.info("Average mse score is: {}".format(mse_avg))

# Note: Unfortunately these results are not directly comparable to the ROC AUC reported in MoleculeNet: https://arxiv.org/pdf/1703.00564.pdf
# MoleculeNet looks at a different metric (roc_auc) and also a different task (multitask prediction across 128 bioassays simultaneously vs. binary classification on a single assay here)

2017-09-24 15:18:18,290 - INFO - Successfully loaded 220364 compounds.
2017-09-24 15:24:48,929 - INFO - Computed mse score of: 317.520810371325
2017-09-24 15:24:48,930 - INFO - How good is it as a classifier at 0.4 threshold?
2017-09-24 15:24:48,956 - INFO -              precision    recall  f1-score   support

      False       0.75      0.77      0.76     74223
       True       0.50      0.48      0.49     35959

avg / total       0.67      0.67      0.67    110182

2017-09-24 15:24:48,957 - INFO - How good is it at predicting True compounds at 0.9 threshold?
2017-09-24 15:24:48,983 - INFO -              precision    recall  f1-score   support

      False       0.75      0.77      0.76     74223
       True       0.50      0.48      0.49     35959

avg / total       0.67      0.67      0.67    110182

2017-09-24 15:31:29,189 - INFO - Computed mse score of: 316.89158771956033
2017-09-24 15:31:29,190 - INFO - How good is it as a classifier at 0.4 threshold?
2017-09-24 15:31:29,216 - 