## Overview ##

PubChem is a site run by the NIH that hosts raw data from chemical experiments. Here we analyze the data hosted at PubChem for assay 1030, which screens for inhibitors of the enzyme encoded by the gene ALDH1A1. The assay page is available [here](https://pubchem.ncbi.nlm.nih.gov/bioassay/1030).

## Results ##

We begin featurization from each molecule's SMILES string, a text representation of molecular structure that is common among chemists. Because the length of this string varies from molecule to molecule, each one is converted to a fixed-length Morgan fingerprint (2048 features here, matching the `input_dim` of the networks below). These fingerprints are then used to train various binary classifiers.
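
The featurization code itself is not shown in this section. As a rough illustration of how a variable-length string can be folded into a fixed-length bit vector, the toy sketch below hashes SMILES substrings into 2048 bits. It is not the Morgan algorithm, which hashes circular atom neighborhoods of the molecular graph (available through RDKit); `toy_fingerprint` and its parameters are hypothetical names chosen for this example.

```python
import hashlib

import numpy as np


def toy_fingerprint(smiles, n_bits=2048, max_len=3):
    """Fold every substring of length 1..max_len into a fixed-length bit vector.

    Toy stand-in for a real fingerprint: it only illustrates hashing and folding.
    """
    bits = np.zeros(n_bits, dtype=np.uint8)
    for size in range(1, max_len + 1):
        for start in range(len(smiles) - size + 1):
            fragment = smiles[start:start + size]
            # hashlib gives a hash that is stable across Python sessions
            digest = hashlib.md5(fragment.encode("utf-8")).digest()
            index = int.from_bytes(digest[:4], "little") % n_bits
            bits[index] = 1
    return bits


ethanol = toy_fingerprint("CCO")
aspirin = toy_fingerprint("CC(=O)OC1=CC=CC=C1C(=O)O")

# Both molecules map to vectors of the same length despite different SMILES lengths
print(ethanol.shape, aspirin.shape)
print(int(ethanol.sum()), int(aspirin.sum()))
```

Whatever the length of the input string, the output always has `n_bits` entries, which is the property a fixed-width input layer needs.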

In [14]:
# Exploratory data analysis and visualization

In [15]:
import pickle
import numpy as np
import pandas as pd
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem, rdMolDescriptors
from sklearn.metrics import roc_auc_score, classification_report
from sklearn.model_selection import train_test_split


import warnings
warnings.filterwarnings('ignore')

global_random_state = 42

np.random.seed(global_random_state)


active_pct = 0.073125471
inactive_pct = 1 - active_pct

# To account for class imbalance, weight each class by the other class's frequency:
# inactives (label 0) get the active fraction, actives (label 1) get the inactive fraction.
class_weights = { 0: active_pct, 1: inactive_pct }

import logging
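logger = logging.getLogger()
# Drop handlers left over from earlier runs of this cell; otherwise every message is logged several times.
logger.handlers.clear()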
logger.setLevel(logging.INFO)

formatter = logging.Formatter('%(asctime)s - %(levelname)s - %(message)s')

fh = logging.FileHandler('log_dnn.txt')
fh.setLevel(logging.INFO)
fh.setFormatter(formatter)
logger.addHandler(fh)

ch = logging.StreamHandler()
ch.setLevel(logging.INFO)
ch.setFormatter(formatter)
logger.addHandler(ch)
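
The effect of the `class_weights` mapping defined in the cell above can be checked with a little arithmetic. The sketch below assumes a hypothetical dataset of 100,000 compounds (chosen only for illustration, not the real dataset size) and shows that the two classes end up with nearly the same total weight.

```python
active_pct = 0.073125471          # fraction of actives, as in the cell above
inactive_pct = 1 - active_pct

n_total = 100_000                 # hypothetical dataset size, for illustration only
n_active = round(n_total * active_pct)
n_inactive = n_total - n_active

class_weights = {0: active_pct, 1: inactive_pct}

# Total weight contributed by each class to the loss
weighted_inactive = n_inactive * class_weights[0]
weighted_active = n_active * class_weights[1]

# How much more a single active counts than a single inactive
weight_ratio = class_weights[1] / class_weights[0]

print(f"one active counts as much as {weight_ratio:.1f} inactives")
print(f"total weight: inactives {weighted_inactive:.1f}, actives {weighted_active:.1f}")
```

Because each class is weighted by the other's frequency, both totals come out close to `n_total * active_pct * inactive_pct`, so neither class dominates the loss.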

In [16]:
import keras
print(keras.backend.backend())

tensorflow


In [17]:
# What about a deep neural network?
# Sample code from: https://machinelearningmastery.com/tutorial-first-neural-network-python-keras/

from keras.models import Sequential
from keras.layers import Dense
from keras import metrics
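# Note: keras.wrappers.scikit_learn was removed from later Keras releases; the SciKeras package provides a replacement.
from keras.wrappers.scikit_learn import KerasClassifier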
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score
from sklearn.metrics import roc_auc_score, f1_score

import pickle

k_fold_splits = 2
global_random_state = 42

with open('data.classification.undersampled.pickle', 'rb') as f:
    # Load the pickled (X, y) pair: the feature matrix and the activity labels.
    (X, y) = pickle.load(f)

def create_model() :
    model = Sequential()
    model.add(Dense(12, input_dim=2048, activation='relu'))
    model.add(Dense(8, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=["accuracy"])
    return model

skf = StratifiedKFold(n_splits=k_fold_splits,shuffle=True,random_state=global_random_state)

roc_auc_avg = 0
f1_score_avg = 0

for train_index, test_index in skf.split(X,y) :

    X_train = X[train_index]
    X_test = X[test_index]
    y_train = y[train_index]
    y_test = y[test_index]

    classifier = KerasClassifier(build_fn=create_model, epochs=10, batch_size=1000, verbose=1)
    classifier.fit(X_train,y_train)
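    # predict() returns hard 0/1 labels, so the AUC below is computed from labels
    # rather than probabilities and equals the balanced accuracy.
    y_pred = classifier.predict(X_test)
    auc = roc_auc_score(y_test, y_pred, average='macro', sample_weight=None)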
    logger.info("\nComputed roc_auc score of: {}".format(auc))
    logger.info(classification_report(y_test, y_pred))
    roc_auc_avg = roc_auc_avg + auc
    
    
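    # y_pred already holds 0/1 labels, so thresholding at 0.4 leaves it unchanged.
    y_pred_binarized = y_pred[0:] >= .4
    y_test_binarized = y_test[0:] >= .4
    # f1_score expects the true labels first, then the predictions.
    fscore = f1_score(y_test_binarized, y_pred_binarized)
    logger.info("When using the network as an active/inactive classifier, f1 score of: {}".format(fscore))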
    f1_score_avg = f1_score_avg + fscore

    
roc_auc_avg = roc_auc_avg / k_fold_splits
f1_score_avg = f1_score_avg / k_fold_splits
logger.info("Average roc_auc score is: {}".format(roc_auc_avg))
logger.info("Average f1_score is: {}".format(f1_score_avg))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10

2017-09-24 08:54:10,241 - INFO - 
Computed roc_auc score of: 0.673411122144985
2017-09-24 08:54:10,250 - INFO -              precision    recall  f1-score   support

          0       0.65      0.74      0.69      8056
          1       0.70      0.60      0.65      8056

avg / total       0.68      0.67      0.67     16112


Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10

2017-09-24 08:54:14,361 - INFO - 
Computed roc_auc score of: 0.6760397268777156
2017-09-24 08:54:14,369 - INFO -              precision    recall  f1-score   support

          0       0.68      0.66      0.67      8055
          1       0.67      0.69      0.68      8055

avg / total       0.68      0.68      0.68     16110
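
The logged ROC AUC values sit very close to the average of the two per-class recalls in each report (for the first fold, (0.74 + 0.60) / 2 = 0.67). That is what happens when AUC is computed from hard 0/1 labels instead of predicted probabilities. The sketch below demonstrates this with made-up scores; `pairwise_auc` is a hypothetical helper that computes AUC as the fraction of (active, inactive) pairs ranked correctly, which is the quantity a ROC AUC function reports.

```python
import numpy as np


def pairwise_auc(y_true, scores):
    """ROC AUC as the fraction of (active, inactive) pairs ranked correctly.

    Tied pairs count as half a correct ranking.
    """
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]
    return float((diff > 0).mean() + 0.5 * (diff == 0).mean())


# Made-up labels and predicted probabilities, for illustration only
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
probs = np.array([0.10, 0.20, 0.45, 0.60, 0.40, 0.55, 0.70, 0.90])
labels = (probs >= 0.5).astype(int)   # what a hard predict() would return

auc_probs = pairwise_auc(y_true, probs)
auc_labels = pairwise_auc(y_true, labels)

recall_active = labels[y_true == 1].mean()
recall_inactive = 1 - labels[y_true == 0].mean()
balanced_accuracy = (recall_active + recall_inactive) / 2

print(auc_probs, auc_labels, balanced_accuracy)
```

With hard labels the AUC collapses to the balanced accuracy, so scoring on predicted probabilities (for example from `predict_proba`) would give a more informative ranking metric.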


In [18]:
# Let's try a larger DNN on the full, non-undersampled dataset, with class_weight set.
# It didn't help: ROC AUC came out at about 0.64, versus about 0.67 on the undersampled data.

from keras.models import Sequential
from keras.layers import Dense
from keras import metrics
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score
from sklearn.metrics import roc_auc_score, f1_score

import pickle

k_fold_splits = 2
global_random_state = 42

logger.debug("Trying a larger network")

with open('data.classification.nonsampled.pickle', 'rb') as f:
    # Load the pickled (X, y) pair: the feature matrix and the activity labels.
    (X, y) = pickle.load(f)

def create_model() :
    model = Sequential()
    model.add(Dense(1024, input_dim=2048, activation='relu'))
    model.add(Dense(512, activation='relu'))
    model.add(Dense(256, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=["accuracy"])
    return model


# Fix the random seed so the split is reproducible.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=global_random_state)

classifier = KerasClassifier(build_fn=create_model, epochs=20, batch_size=1000, verbose=1,class_weight=class_weights)
classifier.fit(X_train,y_train)
# predict() returns hard 0/1 labels, so this AUC equals the balanced accuracy.
y_pred = classifier.predict(X_test)
auc = roc_auc_score(y_test, y_pred, average='macro', sample_weight=None)
logger.info("Computed roc_auc score of: {}".format(auc))
logger.info(classification_report(y_test, y_pred))

# y_pred already holds 0/1 labels, so thresholding at 0.4 leaves it unchanged.
# (The running averages from the k-fold cell are not updated here: there is only one split.)
y_pred_binarized = y_pred[0:] >= .4
y_test_binarized = y_test[0:] >= .4
fscore = f1_score(y_test_binarized, y_pred_binarized)
logger.info("When using the network as an active/inactive classifier, f1 score of: {}".format(fscore))


2017-09-24 08:54:14,425 - DEBUG - Trying a larger network


Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20

2017-09-24 08:55:37,984 - INFO - Computed roc_auc score of: 0.6354741989195873
2017-09-24 08:55:38,006 - INFO -              precision    recall  f1-score   support

          0       0.95      0.94      0.94     67470
          1       0.31      0.33      0.32      5251

avg / total       0.90      0.90      0.90     72721
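
To put a number on "larger network", the sketch below counts trainable parameters for the two fully connected architectures defined in the cells above, using the layer widths from their `Dense` calls. `dense_param_count` is a hypothetical helper written for this example; it assumes every layer has a bias term, which is the default for Keras `Dense` layers.

```python
def dense_param_count(layer_sizes):
    """Weights plus biases for a stack of fully connected layers."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


# Input width followed by the width of each Dense layer
small_network = [2048, 12, 8, 1]
large_network = [2048, 1024, 512, 256, 1]

small_params = dense_param_count(small_network)
large_params = dense_param_count(large_network)

print(f"small network: {small_params:,} parameters")
print(f"large network: {large_params:,} parameters")
print(f"ratio: {large_params / small_params:.0f}x")
```

The larger network has over a hundred times as many parameters as the first one, almost all of them in the first hidden layer.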
