# D-GCAN Deep Dive

In this tutorial, we take a deep dive into D-GCAN and show how it builds a drug-likeness prediction model from scratch.

Let's start!

## Part I: Overview of D-GCAN and Data

Drug-likeness is widely used as a criterion to distinguish drug-like molecules from non-drugs. Reliable computational methods for predicting the drug-likeness of compounds are crucial for triaging unpromising molecules and accelerating the drug discovery process. In this study, a deep learning method based on a graph convolutional attention network (D-GCAN) was developed to predict drug-likeness directly from molecular structures. The model combines the advantages of graph convolution and the attention mechanism, and it uses the molecular graph as the encoding method. Results showed that D-GCAN outperformed other state-of-the-art models for drug-likeness prediction.

A dataset with enough drugs and non-drugs is a prerequisite for training accurate deep neural network models for drug-likeness prediction. In this study, D-GCAN was trained on the dataset released by Beker, which consists of a drug set and a non-drug set (abbreviated as Drugs and Non-drugs). The Drugs set includes 2136 FDA small-molecule drugs assembled from DrugBank. The Non-drugs set was chosen from ZINC15: compounds with a maximum fingerprint-based Tanimoto similarity to drugs above 0.85 were removed, and standard binary classification was used to iteratively refine the set of reliable negatives. Since the negative set is much larger than the positive set, it was randomly down-sampled to create a balanced dataset for model training. The dataset was randomly divided into training, validation, and test sets at a ratio of 8:1:1.

Two additional datasets, the non-US dataset and the bRo5 dataset, were used to test the performance of the model. The non-US dataset is composed of 1281 worldwide drugs from DrugBank and an equal number of non-drugs from ZINC15. The bRo5 dataset includes 135 FDA and non-US drugs beyond Ro5 space (bRo5). The GDB-13 database was used to test the ability of D-GCAN to screen large-scale data. It consists of about 977 million drug-like small molecules according to Lipinski's rule. All molecules contain up to 13 heavy atoms and are stored as canonical SMILES. None of the independent test datasets or the validation dataset were used in the training process.
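The dataset construction described above (similarity filter, balanced down-sampling, 8:1:1 split) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' preprocessing code: fingerprints are represented as plain sets of on-bit indices, and the helper names `tanimoto`, `filter_negatives`, and `balance_and_split` are hypothetical. The iterative classifier-based refinement of the negative set is not shown.

```python
import random


def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    a, b = set(a), set(b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def filter_negatives(candidates, drug_fps, threshold=0.85):
    """Keep candidates whose maximum similarity to any drug is not above the threshold."""
    return [
        fp for fp in candidates
        if max((tanimoto(fp, d) for d in drug_fps), default=0.0) <= threshold
    ]


def balance_and_split(positives, negatives, seed=0):
    """Down-sample negatives to the positive count, then split 8:1:1."""
    rng = random.Random(seed)
    n_neg = min(len(positives), len(negatives))
    sampled = rng.sample(negatives, n_neg)        # random down-sampling
    data = [(x, 1) for x in positives] + [(x, 0) for x in sampled]
    rng.shuffle(data)
    n_train = len(data) * 8 // 10                 # 80% training
    n_valid = len(data) // 10                     # 10% validation
    train = data[:n_train]
    valid = data[n_train:n_train + n_valid]
    test = data[n_train + n_valid:]               # remainder is the test set
    return train, valid, test
```

Fixing the seed keeps the split reproducible, which matters when comparing hyperparameter settings on the same validation set.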


## Part II: Training the model

In [1]:
import train

In [2]:


tes = train.train('../dataset/data_test.txt',  # test set
    radius=1,            # hops of the radius subgraph: 1, 2
    dim=52,              # dimension of the graph convolution layers
    layer_hidden=4,      # number of graph convolution layers
    layer_output=10,     # number of dense layers
    dropout=0.45,        # dropout rate: 0-1
    batch_train=8,       # batch size for the training set
    batch_test=8,        # batch size for the test set
    lr=3e-4,             # learning rate: 1e-5, 1e-4, 3e-4, 5e-4, 1e-3, 3e-3, 5e-3
    lr_decay=0.85,       # learning rate decay: 0.5, 0.75, 0.85, 0.9
    decay_interval=25,   # number of iterations between learning rate decays: 10, 25, 30, 50
    iteration=140,       # number of training iterations
    N=5000,              # length of embedding: 2000, 3000, 5000, 7000
    dataset_train='../dataset/data_train.txt')  # training set



The code uses a GPU!
----------------------------------------------------------------------------------------------------
Just a moment......
----------------------------------------------------------------------------------------------------
../dataset/data_train.txt
../dataset/data_test.txt
The preprocess has finished!
# of training data samples: 3802
# of test data samples: 428
----------------------------------------------------------------------------------------------------
Creating a model.
# of model parameters: 311698
----------------------------------------------------------------------------------------------------
Start training.
The result is saved in the output directory every epoch!
The training will finish in about 0 hours 21 minutes.
----------------------------------------------------------------------------------------------------
Epoch	Time(sec)	Loss_train	Loss_test	AUC_train	AUC_test
1	9.334350300000011	318.02376973629	33.23613902926445	0.6330387783115992	0.5116822

79	507.72747260000006	178.98830798268318	21.47413921356201	0.9541025527486882	0.8598130841121495
80	516.8672275	178.3373854458332	21.658688694238663	0.9525628500896672	0.8901869158878505
81	526.2272582	176.8597036600113	22.024581998586655	0.9535130737042532	0.8644859813084113
82	537.2631751	177.2030012011528	21.766023725271225	0.957143212965218	0.8691588785046729
83	548.0884053000001	176.38141465187073	21.79708757996559	0.9572966712422787	0.8714953271028038
84	559.8119843000001	174.46399101614952	21.843281388282776	0.956292066110213	0.8574766355140186
85	569.1060591	175.65917918086052	21.409487038850784	0.9559244027719352	0.9018691588785047
86	576.5913633	176.77976202964783	21.49503728747368	0.9538743717758542	0.8785046728971962
87	584.3266782000001	179.85141596198082	21.427332252264023	0.9514658102154229	0.9018691588785047
88	595.5166383000001	178.52282038331032	21.403560250997543	0.9539066132353267	0.8644859813084113
89	605.9234084000001	177.05544209480286	21.411171078681946	0.953515
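The test AUC in the log above fluctuates from epoch to epoch, so it can help to parse the table and look up the best epoch. The sketch below is a hedged example: it assumes only the six-column layout shown in the log header, splits on any whitespace, and skips separator lines, status messages, and incomplete rows. The function name `parse_training_log` is hypothetical, and the sample rows are copied from the output above.

```python
COLUMNS = ["Epoch", "Time(sec)", "Loss_train", "Loss_test", "AUC_train", "AUC_test"]

SAMPLE_LOG = """\
Start training.
Epoch Time(sec) Loss_train Loss_test AUC_train AUC_test
85 569.1060591 175.65917918086052 21.409487038850784 0.9559244027719352 0.9018691588785047
86 576.5913633 176.77976202964783 21.49503728747368 0.9538743717758542 0.8785046728971962
87 584.3266782000001 179.85141596198082 21.427332252264023 0.9514658102154229 0.9018691588785047
88 595.5166383000001 178.52282038331032 21.403560250997543 0.9539066132353267 0.8644859813084113
"""


def parse_training_log(text):
    """Return one dict per complete epoch row; everything else is ignored."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != len(COLUMNS):
            continue  # separators, status messages, truncated rows
        try:
            epoch = int(fields[0])
            values = [float(v) for v in fields[1:]]
        except ValueError:
            continue  # the header row is not numeric
        rows.append(dict(zip(COLUMNS, [epoch] + values)))
    return rows


rows = parse_training_log(SAMPLE_LOG)
# max() returns the first row with the highest AUC_test when there is a tie.
best = max(rows, key=lambda r: r["AUC_test"])
print(best["Epoch"], best["AUC_test"])
```

Selecting an epoch by test-set AUC is shown only to illustrate reading the log; for model selection, the validation set is the appropriate reference.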

## Part III: Testing the performance of D-GCAN on an independent dataset

We provide the trained model, which can be used directly. Here we test it on the bRo5 dataset:



In [4]:
import predict

In [5]:
test = predict.predict('../dataset/bRo5.txt',
    radius=1,
    property=True,       # True if drug-likeness is known
    dim=52,
    layer_hidden=4,
    layer_output=10,
    dropout=0.45,
    batch_train=8,
    batch_test=8,
    lr=3e-4,
    lr_decay=0.85,
    decay_interval=25,
    iteration=140,
    N=5000)

The code uses a GPU!
../dataset/bRo5.txt
SMILESis error
bacc: 0.9580740740740741
pre: 0.9696969696969697
rec: 0.9481481481481482
f1: 0.9588014981273408
mcc: 0.9155786319049269
sp: 0.968
q_: 0.9453125
acc: 0.9576923076923077
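The metrics printed above follow from a binary confusion matrix. The sketch below uses the standard textbook definitions and is not taken from `predict.py`. The counts (TP = 128, FN = 7, TN = 121, FP = 4) are inferred here from the reported ratios, and reading `q_` as the negative predictive value is an assumption that happens to match the printed number. The function name `classification_metrics` is hypothetical.

```python
from math import sqrt


def classification_metrics(tp, fn, tn, fp):
    """Standard binary classification metrics from confusion-matrix counts.

    Assumes every denominator is non-zero (no empty class or prediction group).
    """
    rec = tp / (tp + fn)   # recall (sensitivity)
    sp = tn / (tn + fp)    # specificity
    pre = tp / (tp + fp)   # precision
    npv = tn / (tn + fn)   # negative predictive value (assumed to be q_)
    mcc = (tp * tn - fp * fn) / sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "bacc": (rec + sp) / 2,
        "pre": pre,
        "rec": rec,
        "f1": 2 * pre * rec / (pre + rec),
        "mcc": mcc,
        "sp": sp,
        "q_": npv,
        "acc": (tp + tn) / (tp + fn + tn + fp),
    }


# Counts inferred from the reported ratios, not read from the model output.
metrics = classification_metrics(tp=128, fn=7, tn=121, fp=4)
for name, value in metrics.items():
    print(f"{name}: {value}")
```

Balanced accuracy and MCC are the more informative figures when the two classes are not the same size, since plain accuracy can be dominated by the larger class.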


Feedback is also appreciated. You can send me an email (jinyusun@csu.edu.cn)!