# Deep neural network modelling

In this notebook, a deep neural network is applied to the drug discovery exercise of the Tox-21 challenge. We divide the exercise into five main components.

1. Imports and code admin (needed in particular because this notebook was run on Google Colab for access to a GPU).

2. Obtaining the data from the previous preprocessing steps. This is supplemented with a few basic data continuity checks.

3. Splitting the data into training and testing sets for each data type input.

4. Data engineering, for example the bag-of-words paradigm for the SMILES strings; some PCA is also introduced for the sparse datasets.

5. A grid search over specific data inputs and DNN layers.

For Google Colab, the basic prescription is to copy the drugdiscovery library into a Google Drive folder and import the library in the Colab notebook from the drive. Google Drive is mounted in the following step:

In [1]:
from google.colab import drive
drive.mount('/content/gdrive')

root = '/content/gdrive/My Drive/drug_discovery/DrugDiscovery/'
# root is the root folder where the library is held.

Mounted at /content/gdrive


### 1. Imports

In [None]:

# Basic imports
import pandas as pd
import numpy as np
import pickle

# Sklearn data preprocessing imports
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

import warnings
warnings.filterwarnings('ignore')

# drugdiscovery code imports
import sys
sys.path.insert(1, root)
sys.path.insert(1, root + 'drugdiscovery')

import drugdiscovery as dd
from drugdiscovery import preprocessing as pp
from drugdiscovery import deeplearning as dl

# torch imports
import torch
import torch.nn.functional as F
import torch.nn as nn
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

print(device)

Continue if using colab and not planning to use rdkit
cuda:0


### 2. Obtaining the data

In [None]:
source_data = root+'data/data_dups_removed_with_H.csv'
molecular_descriptors = root + 'data/molecular_descriptors_from_source.csv'
fingerprint = root + 'data/morgan_fingerprint_from_source_10.csv'
toxicophores = root +'data/known_toxic.csv'

In [None]:
source_data, molecular_descriptors, fingerprint, toxicophores = \
[pd.read_csv(x, index_col = 0) for x in [source_data, molecular_descriptors, fingerprint, toxicophores]]

In [None]:
[x.isnull().sum().any() for x in [source_data, molecular_descriptors, fingerprint, toxicophores]]

[True, False, False, False]

So there are some nulls in the source data. These come from the fact that not all targets are filled in for every molecule, and they will be taken care of with masks.

In [None]:
source_data.isnull().sum()[source_data.isnull().sum()!=0]

SR-HSE           1425
NR-AR             575
SR-ARE           2082
NR-Aromatase     2077
NR-ER-LBD         905
NR-AhR           1327
SR-MMP           2101
NR-ER            1708
NR-PPAR-gamma    1436
SR-p53           1112
SR-ATAD5          787
NR-AR-LBD        1116
dtype: int64

In [None]:
# test that all dfs have the same indices, since the indices are non-trivial after duplicate deletion.
[(source_data.index == x.index).all() for x in [molecular_descriptors, fingerprint, toxicophores]]

[True, True, True]
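
The check above compares index values elementwise, which relies on all frames having the same length. As a minimal sketch of a slightly more defensive variant, `pandas.Index.equals` can be used instead; the toy frames `a`, `b` and `c` below are made up for illustration and are not part of the notebook's data.

```python
import pandas as pd

# Toy frames with a non-trivial index, as left behind after dropping duplicate rows.
a = pd.DataFrame({"x": [1, 2, 3]}, index=[0, 2, 5])
b = pd.DataFrame({"y": [4, 5, 6]}, index=[0, 2, 5])
c = pd.DataFrame({"z": [7, 8]}, index=[0, 2])

# Index.equals returns False on a length or value mismatch.
aligned = [a.index.equals(other.index) for other in (b, c)]
print(aligned)  # [True, False]
```

Here the shorter frame `c` is simply reported as misaligned.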

### 3. Separating into training and testing

In this section we carry out three tasks:

1. We take the target data set from the source data and create masks for the data points where there are null values, so that any training or metric computation ignores these entries in the target.

2. We then separate the data and the masks into training/testing data.

3. We use the training/testing index structure to obtain the same training/testing split for the other data inputs.

We start by extracting the raw data from the source data:

In [None]:
source_features = ['FW','SMILES']
targets = ['SR-HSE','NR-AR', 'SR-ARE', 'NR-Aromatase', 'NR-ER-LBD', 'NR-AhR', 'SR-MMP',\
       'NR-ER', 'NR-PPAR-gamma', 'SR-p53', 'SR-ATAD5', 'NR-AR-LBD']


# dealing with the source data
raw_y = source_data[targets]
raw_X = source_data[source_features]


Next, we know that `raw_y` contains nulls, from which we would like to create masks:

In [None]:
null_mask = np.array(np.logical_not(raw_y.isnull().values),int)
raw_y = raw_y.fillna(0.0)
mask_df = pd.DataFrame(null_mask, columns = [r+'_mask' for r in raw_y.columns], index = raw_y.index)

# The masks are attached to the raw target set, so it is easier to move this data around.
raw_y = pd.concat([raw_y, mask_df], axis=1)
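
To make the masking step concrete, here is a small sketch on a made-up target table. The frame `toy_y` and its values are purely illustrative; only the column-naming convention (`_mask` suffix) follows the cell above.

```python
import numpy as np
import pandas as pd

# Toy target table: NaN marks an assay that was not measured for that molecule.
toy_y = pd.DataFrame(
    {"NR-AR": [0.0, np.nan, 1.0], "SR-p53": [np.nan, 1.0, 0.0]},
    index=[3, 7, 9],
)

# 1 where a label is present, 0 where it is missing.
toy_mask = toy_y.notnull().astype(int).add_suffix("_mask")

# Fill the gaps with a placeholder; the mask records that these are not real labels.
toy_filled = pd.concat([toy_y.fillna(0.0), toy_mask], axis=1)
print(toy_filled)
```

Keeping targets and masks in one frame means a single split keeps them row-aligned.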

Next, the data is split into training/testing data:

In [None]:
test_size = 0.1
X_train_source, X_test_source, y_train, y_test = train_test_split(raw_X, raw_y, test_size = test_size, random_state=42)

training_index = X_train_source.index
testing_index = X_test_source.index

The indices `training_index` and `testing_index` are non-trivial and are kept so that they can be applied to the other data inputs.

In [None]:
#fingerprints
fp_cols = list(fingerprint.columns)
fp_cols.remove('DSSTox_CID')
fp_cols.remove('SMILES')
X_train_fp, X_test_fp = fingerprint[fp_cols].loc[training_index],fingerprint[fp_cols].loc[testing_index]

# descriptors
desc_cols = list(molecular_descriptors.columns)
desc_cols.remove('SMILES')
X_train_desc, X_test_desc = molecular_descriptors[desc_cols].loc[training_index],\
                                                molecular_descriptors[desc_cols].loc[testing_index]

# known toxic
X_train_tox, X_test_tox = toxicophores.loc[training_index],\
                                                toxicophores.loc[testing_index]

# finally, we separate the targets from the mask data.
y_train, mask_train = y_train[targets],y_train[mask_df.columns]
y_test, mask_test = y_test[targets],y_test[mask_df.columns]

So we now have four data sources, plus the masks:

1. Source data (which includes the SMILES and molecular weight)
2. Fingerprints
3. The molecular descriptors of each molecule
4. The Dice similarity for the fingerprints of molecular scaffolds with known toxicophores.
5. We also have the training/testing masks, which we will use to mask over the nulls in the target dataset.

### 4. Data engineering

In this section we perform some data engineering tasks:
1. Extract the molecular bag of words
2. Standardise all the datasets

In [None]:
def transform(train, test, apply_transformer):
  train_new = apply_transformer.fit_transform(train)
  test_new = apply_transformer.transform(test)
  return train_new, test_new
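
The helper above fits the transformer on the training data only and then applies the fitted transformer to the test data, so no test-set statistics leak into preprocessing. A minimal sketch of that behaviour on made-up numbers (the arrays `toy_train` and `toy_test` are illustrative only):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

def transform(train, test, apply_transformer):
    # Fit on the training data only, then reuse the fitted parameters on the test data.
    train_new = apply_transformer.fit_transform(train)
    test_new = apply_transformer.transform(test)
    return train_new, test_new

toy_train = np.array([[0.0], [2.0], [4.0]])
toy_test = np.array([[2.0], [6.0]])

train_s, test_s = transform(toy_train, toy_test, StandardScaler())

# The training column has mean 0 and unit variance; the test column is scaled
# with the training mean and standard deviation, so it need not be centred.
print(train_s.ravel(), test_s.ravel())
```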

In [None]:
# The source dataset
smiles = X_train_source['SMILES'].values
bow = pp.BagOfWordsMols(smiles)
bow_train = bow.fit()
bow_test = bow.transform(X_test_source['SMILES'].values)

bow_train = np.insert(bow_train, 0, X_train_source['FW'], 1)
bow_test = np.insert(bow_test, 0, X_test_source['FW'], 1)

# Standardise the data
bow_train, bow_test = transform(bow_train, bow_test, StandardScaler())

In [None]:
# Standardise the toxicophores dataset
X_train_tox, X_test_tox = transform(X_train_tox, X_test_tox, StandardScaler())

In [None]:
# Standardise the molecular descriptors dataset
X_train_desc, X_test_desc = transform(X_train_desc, X_test_desc, StandardScaler())

At this point, our data inputs are as follows:

1. bow_train, bow_test : standardised molecular bag of words of the source SMILES, with the FW column prepended.

2. X_train_fp, X_test_fp : the molecular fingerprints of the candidate molecules.

3. X_train_tox, X_test_tox : standardised Dice similarities between known toxicophores and scaffolds of the candidate molecules.

4. X_train_desc, X_test_desc : standardised molecular descriptors for the candidate molecules.

5. y_train, y_test : the target variables, with missing labels filled with 0 and flagged by the masks.

6. mask_train, mask_test : masks over the entries of y_train and y_test that were originally null, so that these targets are excluded from backpropagation in the neural networks and from any metric computation.
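
How the masks enter the loss is handled inside `drugdiscovery` and is not shown in this notebook. As a hedged illustration of the general idea only, the sketch below computes a masked binary cross-entropy in NumPy; the function `masked_bce` and all numbers are hypothetical and are not the library's implementation.

```python
import numpy as np

def masked_bce(probs, targets, mask, eps=1e-7):
    """Mean binary cross-entropy over the entries where mask == 1."""
    probs = np.clip(probs, eps, 1.0 - eps)
    per_entry = -(targets * np.log(probs) + (1.0 - targets) * np.log(1.0 - probs))
    # Masked-out entries contribute nothing to the sum or to the count.
    return float((per_entry * mask).sum() / max(mask.sum(), 1))

probs = np.array([[0.9, 0.2], [0.4, 0.7]])
targets = np.array([[1.0, 0.0], [0.0, 1.0]])  # the 0.0 at [0, 1] is a filled-in null
mask = np.array([[1, 0], [1, 1]])

loss = masked_bce(probs, targets, mask)

# Changing the prediction at the masked position leaves the loss unchanged.
probs_changed = probs.copy()
probs_changed[0, 1] = 0.99
loss_changed = masked_bce(probs_changed, targets, mask)
print(loss, loss_changed)
```

Because the placeholder label at a masked position never reaches the loss, the fill value used for the nulls does not influence training in this scheme.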

### 5. Modelling



In [None]:
from sklearn.decomposition import TruncatedSVD

X_train_fp_svd, X_test_fp_svd = transform(X_train_fp, X_test_fp, TruncatedSVD(1024))
bow_train_svd, bow_test_svd = transform(bow_train, bow_test, TruncatedSVD(50))



We now have two more datasets that we can add to the data input space. We take either X_train_fp or X_train_fp_svd, and either bow_train or bow_train_svd, and similarly on the test side.

In [None]:
from sklearn.decomposition import PCA
def prepare_data(with_pca, batch_size, y_train, y_test, mask_train, mask_test, training_data, testing_data):
  """
  Concatenate the inputs; if with_pca is set, use PCA to halve the number of features.
  """
  X_train = np.concatenate(training_data,1)
  X_test = np.concatenate(testing_data,1)

  if with_pca:
    N,p = X_train.shape
    pca_shape = int(p/2)
    pca = PCA(pca_shape)
    X_train = pca.fit_transform(X_train)
    X_test = pca.transform(X_test)

  train_set, test_set, train_loader, number_of_batches = dl.get_data(X_train, y_train, mask_train, X_test, y_test, mask_test,batch_size)
  return train_set, test_set, train_loader, number_of_batches

In [None]:
# we enumerate valid configurations here (we write them out as strings to save on local space)

train_inputs = [
          #'bow_train',
          'X_train_fp',
          'X_train_tox',
          'X_train_desc',
]
import collections
import itertools

In [None]:
all_combinations = []
for r in range(len(train_inputs) +1):

    combinations_object = itertools.combinations(train_inputs, r)
    combinations_list = list(combinations_object)
    all_combinations += combinations_list
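
The loop above enumerates every subset of the inputs, including the empty one, which the next cell skips with `[1:]`. An equivalent, more compact sketch that starts from subsets of size 1 (the variable names `inputs` and `non_empty_subsets` are illustrative):

```python
import itertools

inputs = ["X_train_fp", "X_train_tox", "X_train_desc"]

# Chain together the combinations of every size from 1 to len(inputs).
non_empty_subsets = list(
    itertools.chain.from_iterable(
        itertools.combinations(inputs, r) for r in range(1, len(inputs) + 1)
    )
)
print(len(non_empty_subsets))  # 2**3 - 1 = 7
```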

In [None]:
training_configurations = []
for x in all_combinations[1:]:
  config = '[' + ','.join(list(x)) + ']'
  training_configurations.append(config)
  if 'bow_train' in config:
    config = config.replace("bow_train", "bow_train_svd")
    training_configurations.append(config)
  if 'X_train_fp' in config:
    config = config.replace("X_train_fp", "X_train_fp_svd")
    training_configurations.append(config)

testing_configurations = []
for configs in training_configurations:
    config = configs.replace("train", "test")
    testing_configurations.append(config)

In [None]:
len(training_configurations) == len(testing_configurations)

True

In [None]:
training_configurations

['[X_train_fp]',
 '[X_train_fp_svd]',
 '[X_train_tox]',
 '[X_train_desc]',
 '[X_train_fp,X_train_tox]',
 '[X_train_fp_svd,X_train_tox]',
 '[X_train_fp,X_train_desc]',
 '[X_train_fp_svd,X_train_desc]',
 '[X_train_tox,X_train_desc]',
 '[X_train_fp,X_train_tox,X_train_desc]',
 '[X_train_fp_svd,X_train_tox,X_train_desc]']

In [None]:
len(training_configurations)

11

In [None]:
from IPython.display import clear_output
def run(epochs, layers, config_n, final_pca):
  print('Working in config ',config_n)
  print('INPUTS : ',training_configurations[config_n])
  print('LAYERS : ',layers)
  print('WITH FINAL PCA : ',final_pca)

  train_set, test_set, train_loader, number_of_batches = prepare_data(final_pca, 128,  y_train, y_test, mask_train, \
                                                                        mask_test, eval(training_configurations[config_n]),eval(testing_configurations[config_n]))
  p = train_set[0][0].shape[0]
  activations = [torch.relu]*len(layers)
  early_stopper = dl.EarlyStopping(patience=10)
  model = dl.net(p, 12, seed = 12345, hidden_layers = layers, activations = activations).to(device)
  optimizer = torch.optim.Adam(model.parameters(), lr=4e-5,weight_decay= 1e-5)
  criterion = nn.BCELoss()
  scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, 'min', verbose = False)
  training_module = dl.Trainer(model, optimizer,criterion,epochs, root, device, scheduler ,early_stopper)
  model_name = 'config_N'+ str(config_n) +'_'+'_'.join([str(l) for l in layers])
  results = training_module.train_model(train_loader, test_set, number_of_batches, targets, mask_train, mask_test, model_name)
  if final_pca:
    results.to_csv(root + 'models/dnn_models/' + model_name + '_wpca.csv')
  else:
    results.to_csv(root + 'models/dnn_models/' + model_name + '.csv')
  clear_output()
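
The `run` function above resolves the configuration strings with `eval`, which depends on the arrays existing as global variables. A possible alternative, sketched here under the assumption that the arrays can be collected in a dictionary, is an explicit lookup. The names `registry` and `resolve` and the placeholder arrays are hypothetical and not part of `drugdiscovery`.

```python
import numpy as np

# Hypothetical stand-ins for the real feature matrices.
registry = {
    "X_train_fp": np.zeros((4, 8)),
    "X_train_tox": np.ones((4, 2)),
    "X_train_desc": np.ones((4, 3)),
}

def resolve(config, registry):
    """Turn a string such as '[a,b]' into [registry['a'], registry['b']]."""
    names = [name.strip() for name in config.strip("[]").split(",")]
    return [registry[name] for name in names]

arrays = resolve("[X_train_fp,X_train_tox]", registry)
X = np.concatenate(arrays, axis=1)
print(X.shape)  # (4, 10)
```

With this approach an unknown input name fails with a `KeyError` at lookup time, and no string is executed as code.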

In [None]:
epochs = 200
layers = [[1024],[1024,2048],[1024,2048,4196]]


# range(10, 11) selects only configuration 10: [X_train_fp_svd,X_train_tox,X_train_desc]
for k in range(10, 11):
  for layer_config in layers:
    for with_pca in [True, False]:
      run(epochs, layer_config, k, with_pca)