# D-GCAN Deep Dive

In this tutorial, we take a deep dive into D-GCAN and show how it builds a drug-likeness prediction model from scratch.

Let's start!

## Part I: Overview of D-GCAN and Data

Drug-likeness has been widely used as a criterion to distinguish drug-like molecules from non-drugs. Developing reliable computational methods to predict the drug-likeness of compounds is crucial for triaging unpromising molecules and accelerating the drug discovery process. In this study, a deep learning method based on a graph convolutional attention network (D-GCAN) was developed to predict drug-likeness directly from molecular structures. The model combines the advantages of graph convolution and the attention mechanism. Results showed that D-GCAN outperformed other state-of-the-art models for drug-likeness prediction. Molecular graphs were used as the encoding method for drug-likeness prediction.
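To make the idea of a molecular graph concrete, here is a minimal sketch that encodes ethanol (SMILES `CCO`) as a list of atoms and an adjacency matrix. The atom and bond lists are written by hand for illustration, and the helper names `adjacency_matrix` and `neighbors` are hypothetical; D-GCAN's own preprocessing is not shown in this tutorial and may differ.

```python
# Hand-written molecular graph for ethanol (SMILES "CCO"), heavy atoms only.
# Illustrative encoding only; not taken from the D-GCAN preprocessing code.
atoms = ["C", "C", "O"]
bonds = [(0, 1), (1, 2)]  # undirected bonds between atom indices


def adjacency_matrix(n_atoms, bonds):
    """Build a symmetric 0/1 adjacency matrix from a list of bonds."""
    adj = [[0] * n_atoms for _ in range(n_atoms)]
    for i, j in bonds:
        adj[i][j] = 1
        adj[j][i] = 1
    return adj


def neighbors(adj, i):
    """Return the indices of the atoms bonded to atom i."""
    return [j for j, bonded in enumerate(adj[i]) if bonded]


adjacency = adjacency_matrix(len(atoms), bonds)
for i, symbol in enumerate(atoms):
    print(symbol, "is bonded to", [atoms[j] for j in neighbors(adjacency, i)])
```

A graph convolution layer updates each atom's features from its neighbors in this graph, and an attention mechanism weights how much each atom contributes to the final prediction.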

A dataset with enough drugs and non-drugs is a prerequisite for training accurate deep neural network models for drug-likeness prediction. In this study, the D-GCAN model was trained on the dataset released by Beker, which consists of a drug set and a non-drug set (abbreviated as Drugs and Non-drugs). The Drugs set includes 2136 FDA small-molecule drugs assembled from DrugBank. The Non-drugs set was chosen from ZINC15. Compounds with a maximum fingerprint-based Tanimoto similarity to drugs above 0.85 were removed, and standard binary classification was used to iteratively refine the set of reliable negatives. Since the negative set is much larger than the positive set, it was randomly down-sampled to create a balanced dataset for model training. The dataset was randomly divided into training, validation, and test sets at a ratio of 8:1:1.

In addition, two further datasets, the non-US dataset and the bRo5 dataset, were used to test the performance of the model. The non-US dataset consists of 1281 worldwide drugs from DrugBank and an equal number of non-drugs from ZINC15. The bRo5 dataset includes 135 FDA and non-US drugs beyond the Ro5 space (bRo5). The GDB-13 database was used to test the ability of D-GCAN to screen large-scale data. It consists of about 977 million drug-like small molecules according to Lipinski's rule. All molecules contain up to 13 heavy atoms and are stored as canonical SMILES. Neither the independent test datasets nor the validation dataset was used in the training process.
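The dataset construction described above (similarity filtering, down-sampling, and an 8:1:1 split) can be sketched roughly as follows. This is a simplified illustration, not the original pipeline: fingerprints are assumed to be sets of on-bit indices, the data are toy values, all function names are hypothetical, and the iterative refinement of the negative set is omitted.

```python
import random


def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of on-bits."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def filter_negatives(candidates, drugs, threshold=0.85):
    """Drop candidates whose maximum similarity to any drug exceeds threshold."""
    return [c for c in candidates
            if max((tanimoto(c, d) for d in drugs), default=0.0) <= threshold]


def down_sample(negatives, n_positives, rng):
    """Randomly pick as many negatives as there are positives."""
    return rng.sample(negatives, min(n_positives, len(negatives)))


def split_8_1_1(items, rng):
    """Shuffle and split into training, validation, and test sets (8:1:1)."""
    items = list(items)
    rng.shuffle(items)
    n_train = len(items) * 8 // 10
    n_valid = len(items) // 10
    return (items[:n_train],
            items[n_train:n_train + n_valid],
            items[n_train + n_valid:])


rng = random.Random(0)  # fixed seed so the toy example is repeatable

# Toy fingerprints: 10 "drugs" and 28 candidate "non-drugs".
drugs = [frozenset({i, i + 1, i + 2}) for i in range(0, 30, 3)]
candidates = drugs[:3] + [frozenset({1000 + i, 2000 + i}) for i in range(25)]

filtered = filter_negatives(candidates, drugs)   # removes the 3 drug copies
sampled = down_sample(filtered, len(drugs), rng)  # balance with the positives
dataset = [(fp, 1) for fp in drugs] + [(fp, 0) for fp in sampled]
train_set, valid_set, test_set = split_8_1_1(dataset, rng)
print(len(train_set), len(valid_set), len(test_set))
```

In practice the fingerprints would come from a cheminformatics toolkit rather than hand-built sets; the filtering and splitting logic stays the same.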


## Part II: Training the model

In [1]:
import train

In [2]:
result = train.train('../dataset/data_test.txt')

The code uses a GPU!
----------------------------------------------------------------------------------------------------
Just a moment......
----------------------------------------------------------------------------------------------------
../dataset/data_train.txt
../dataset/data_test.txt
The preprocess has finished!
# of training data samples: 3802
# of test data samples: 428
----------------------------------------------------------------------------------------------------
Creating a model.
# of model parameters: 311698
----------------------------------------------------------------------------------------------------
Start training.
The result is saved in the output directory every epoch!
The training will finish in about 0 hours 16 minutes.
----------------------------------------------------------------------------------------------------
Epoch	Time(sec)	Loss_train	Loss_test	AUC_train	AUC_test
1	7.063134700000001	318.02376973629	33.3182889521122	0.6330387783115992	0.5
2	13.3

81	481.3716675	176.8597036600113	21.900392472743988	0.9535130737042532	0.9065420560747663
82	487.6844983	177.2030012011528	21.63964667916298	0.957143212965218	0.9135514018691588
83	493.789744	176.38141465187073	21.673029512166977	0.9572966712422787	0.9112149532710281
84	499.83378749999997	174.46399101614952	21.718599021434784	0.956292066110213	0.9112149532710281
85	505.73110719999994	175.65917918086052	21.284449100494385	0.9559244027719352	0.9182242990654206
86	511.55495319999994	176.77976202964783	21.370542109012604	0.9538743717758542	0.9205607476635514
87	517.4436017	179.85141596198082	21.303212136030197	0.9514658102154229	0.9158878504672897
88	523.2955598000001	178.52282038331032	21.278790563344955	0.9539066132353267	0.9158878504672897
89	529.1086349000001	177.05544209480286	21.28691440820694	0.9535158412115041	0.9135514018691588
90	534.9974391000001	176.31908676028252	21.241813361644745	0.955243734363584	0.9228971962616822
91	540.8404270000001	176.20382365584373	21.768229216337204	
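To pick the best epoch from a log like the one above, the tab-separated rows can be parsed with a few lines of Python. This is a sketch under the assumption that the saved results use the same six-column layout as the printed table; the `parse_log` helper is hypothetical and not part of the D-GCAN code. The sample rows are copied from the output above, including the truncated final row, which the parser skips.

```python
COLUMNS = ["epoch", "time_sec", "loss_train", "loss_test", "auc_train", "auc_test"]


def parse_log(text):
    """Parse tab-separated training rows, skipping headers and truncated lines."""
    rows = []
    for line in text.strip().splitlines():
        fields = line.split("\t")
        if len(fields) != len(COLUMNS):
            continue  # truncated or unrelated line
        try:
            epoch = int(fields[0])
            values = [float(x) for x in fields[1:]]
        except ValueError:
            continue  # header row or malformed numbers
        rows.append(dict(zip(COLUMNS, [epoch] + values)))
    return rows


sample = "\n".join([
    "Epoch\tTime(sec)\tLoss_train\tLoss_test\tAUC_train\tAUC_test",
    "88\t523.2955598000001\t178.52282038331032\t21.278790563344955\t0.9539066132353267\t0.9158878504672897",
    "89\t529.1086349000001\t177.05544209480286\t21.28691440820694\t0.9535158412115041\t0.9135514018691588",
    "90\t534.9974391000001\t176.31908676028252\t21.241813361644745\t0.955243734363584\t0.9228971962616822",
    "91\t540.8404270000001\t176.20382365584373\t21.768229216337204\t",
])

rows = parse_log(sample)
best = max(rows, key=lambda r: r["auc_test"])
print("best epoch:", best["epoch"], "AUC_test:", best["auc_test"])
```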

## Part III: Testing the performance of D-GCAN on an independent dataset

We provide the trained model, which can be used directly as follows. Here, we test it on an independent dataset containing non-US drugs.



In [3]:
import predict

# Store the result under a new name so the `predict` module is not overwritten.
prediction = predict.predict('../dataset/world_wide.txt', property=True)

The code uses a GPU!
../dataset/world_wide.txt
bacc_dev: 0.9344262295081968
pre_dev: 0.9317300232738557
rec_dev: 0.9375487900078064
f1_dev: 0.9346303501945525
mcc_dev: 0.868869402802378
sp_dev: 0.9313036690085871
q__dev: 0.9371563236449332
acc_dev: 0.9344262295081968
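The reported metrics can be related to one another through a confusion matrix. The sketch below uses standard definitions; it is not taken from `predict.py`. The counts (TP = 1201, FP = 88, TN = 1193, FN = 80) are an assumption: they were back-calculated from the reported recall and specificity together with the stated dataset size of 1281 drugs and 1281 non-drugs. Under these counts, `q__dev` agrees with the negative predictive value, which is also an inference rather than something the source states.

```python
import math


def classification_metrics(tp, fp, tn, fn):
    """Standard binary classification metrics from confusion-matrix counts."""
    rec = tp / (tp + fn)   # recall (sensitivity)
    sp = tn / (tn + fp)    # specificity
    pre = tp / (tp + fp)   # precision
    npv = tn / (tn + fn)   # negative predictive value
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "acc": (tp + tn) / (tp + fp + tn + fn),
        "bacc": (rec + sp) / 2,
        "pre": pre,
        "rec": rec,
        "f1": 2 * pre * rec / (pre + rec),
        "mcc": (tp * tn - fp * fn) / mcc_den,
        "sp": sp,
        "npv": npv,
    }


# Assumed counts, inferred from the reported values (not given in the source).
metrics = classification_metrics(tp=1201, fp=88, tn=1193, fn=80)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Because the non-US dataset is balanced, accuracy and balanced accuracy coincide, which matches the identical `acc_dev` and `bacc_dev` values above.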


Feedback is appreciated; you can email me at jinyusun@csu.edu.cn.