# DeepCoy

This notebook serves as a simple workflow for generating and selecting decoy molecules using DeepCoy. 

Feel free to adapt and alter the workflow presented to the needs of your project. The workflow closely follows the methods described in our manuscript, [Generating Property-Matched Decoys Using Deep Learning](https://www.biorxiv.org/content/10.1101/2020.08.26.268193v1). 

For questions, comments or feedback, please email imrie@stats.ox.ac.uk.

## Imports

In [1]:
import sys
sys.path.append("../")
sys.path.append("../evaluation/")

In [2]:
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import Draw
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem.Draw import MolDrawing, DrawingOptions
from rdkit.Chem import MolStandardize

import numpy as np

from itertools import product
from joblib import Parallel, delayed
import re
from collections import defaultdict

from IPython.display import clear_output
IPythonConsole.ipython_useSVG = True

from DeepCoy import DenseGGNNChemModel
from data.prepare_data import read_file, preprocess
from select_and_evaluate_decoys import select_and_evaluate_decoys

## Basic settings

In [3]:
# Whether to use a GPU for generating molecules with DeepCoy
use_gpu = True

## Preprocess actives data

We first need to preprocess the data used by DeepCoy. The active molecules should be supplied in a text file, with one entry on each line. 

E.g.
```
c1(ccc(cc1)c1c(ocn1)c1ccc2n(c1)c(nn2)c1c(cccc1)OC)F
c1c(c(cc(c1)C(=O)NOC)Nc1c2c(c(cn2ncn1)C(=O)NCc1ccccc1)C)C
```

Each line should contain only the SMILES string of the molecule to generate decoys for, with no other entries.

In this example, we will use the actives for [DEKOIS 2.0 target P38-alpha](http://www.dekois.com/). To keep the example fast, we will only use the first 10 actives.
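
Before preprocessing, it can help to check that the input matches this format. The following is a minimal sketch and not part of DeepCoy: `first_n_smiles` is a hypothetical helper that keeps the first `n` non-empty lines and rejects any line holding more than one whitespace-separated entry. It does not check that the SMILES strings themselves are valid.

```python
def first_n_smiles(lines, n=10):
    """Return the first n SMILES strings, one per input line.

    Hypothetical helper (not part of DeepCoy). Raises ValueError if a
    non-empty line holds anything besides a single SMILES string.
    """
    smiles = []
    for line_number, line in enumerate(lines, start=1):
        entries = line.split()
        if not entries:
            continue  # skip blank lines
        if len(entries) != 1:
            raise ValueError(
                f"line {line_number}: expected one SMILES string, "
                f"got {len(entries)} entries"
            )
        smiles.append(entries[0])
        if len(smiles) == n:
            break
    return smiles


# The two example entries shown above, with a blank line in between
example_lines = [
    "c1(ccc(cc1)c1c(ocn1)c1ccc2n(c1)c(nn2)c1c(cccc1)OC)F\n",
    "\n",
    "c1c(c(cc(c1)C(=O)NOC)Nc1c2c(c(cn2ncn1)C(=O)NCc1ccccc1)C)C\n",
]
subset = first_n_smiles(example_lines, n=10)
print(len(subset))
```

To apply the helper to a file, an open file handle can be passed as `lines`, and the returned strings written back out one per line.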

In [4]:
data_path = './P38-alpha_actives.smi'

raw_data = read_file(data_path)
preprocess(raw_data, "zinc", "P38-alpha_actives")

Finished reading: 10 / 10
Parsing smiles as graphs.
Processed: 10 / 10
Saving data.
Length raw data: 	10
Length processed data: 	10


## Load DeepCoy model and generate decoys

Let's now set up DeepCoy and generate candidate decoys. The settings below generate 100 candidate decoys for each active molecule (in our paper, we generated 1000 candidates per active).

On a consumer laptop, generating 1000 candidate decoys should take under 10 minutes with a GPU and around 10-15 minutes with CPU only. The exact time will vary depending on your hardware. The generation process is fully parallelisable if you need to generate large numbers of decoys.
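
One way to parallelise generation is to split the actives into chunks and run the workflow separately on each chunk. The following is a minimal sketch under that assumption; `split_into_chunks` is a hypothetical helper, not a DeepCoy function, and the placeholder strings stand in for real SMILES.

```python
def split_into_chunks(items, num_chunks):
    """Split items into num_chunks contiguous parts of near-equal size.

    Hypothetical helper (not part of DeepCoy).
    """
    if num_chunks < 1:
        raise ValueError("num_chunks must be at least 1")
    base, extra = divmod(len(items), num_chunks)
    chunks = []
    start = 0
    for i in range(num_chunks):
        # The first `extra` chunks each take one additional item
        size = base + (1 if i < extra else 0)
        chunks.append(items[start:start + size])
        start += size
    return chunks


# Placeholder names standing in for the SMILES strings of 10 actives
actives = [f"active_{i}" for i in range(10)]
chunks = split_into_chunks(actives, 3)
print([len(chunk) for chunk in chunks])  # [4, 3, 3]
```

Each chunk could then be written to its own input file and preprocessed and processed independently, for example on separate machines or GPUs.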

In [5]:
import os
if not use_gpu:
    os.environ['CUDA_VISIBLE_DEVICES'] = '-1'
else:
    os.environ['CUDA_VISIBLE_DEVICES'] = '0'

In [6]:
# Arguments for DeepCoy
args = defaultdict(None)
args['--dataset'] = 'zinc'
args['--config'] = '{"generation": true, \
                     "batch_size": 1, \
                     "number_of_generation_per_valid": 100, \
                     "train_file": "molecules_P38-alpha_actives.json", \
                     "valid_file": "molecules_P38-alpha_actives.json", \
                     "output_name": "P38-alpha_example_decoys.smi", \
                     "use_subgraph_freqs": false}'
args['--freeze-graph-model'] = False
args['--restore'] = '../models/DeepCoy_DUDE_model_e09.pickle'
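
The `--config` value is a JSON string. As an alternative to writing it by hand, it could be built from a dictionary with the standard `json` module, which avoids line-continuation backslashes and quoting mistakes. This is a sketch that assumes the model accepts any valid JSON string with the same keys; the keys and values are copied from the settings above.

```python
import json

# Same settings as the hand-written config string, expressed as a dict
config = {
    "generation": True,
    "batch_size": 1,
    "number_of_generation_per_valid": 100,
    "train_file": "molecules_P38-alpha_actives.json",
    "valid_file": "molecules_P38-alpha_actives.json",
    "output_name": "P38-alpha_example_decoys.smi",
    "use_subgraph_freqs": False,
}

# json.dumps converts Python True/False to JSON true/false
config_json = json.dumps(config)
print(config_json)
```

The resulting string could then be assigned to `args['--config']` in place of the literal.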

In [7]:
# Setup model and generate molecules
model = DenseGGNNChemModel(args)
model.train()
# Free up some memory
model = ''

Run 2021-01-18-13-15-29_16228 starting with following parameters:
{"task_sample_ratios": {}, "use_edge_bias": true, "clamp_gradient_norm": 1.0, "out_layer_dropout_keep_prob": 1.0, "tie_fwd_bkwd": true, "random_seed": 0, "batch_size": 1, "num_epochs": 10, "epoch_to_generate": 10, "number_of_generation_per_valid": 100, "maximum_distance": 50, "use_argmax_generation": false, "residual_connection_on": true, "residual_connections": {"2": [0], "4": [0, 2], "6": [0, 2, 4], "8": [0, 2, 4, 6], "10": [0, 2, 4, 6, 8], "12": [0, 2, 4, 6, 8, 10], "14": [0, 2, 4, 6, 8, 10, 12]}, "num_timesteps": 7, "hidden_size": 100, "encoding_size": 8, "kl_trade_off_lambda": 0.3, "learning_rate": 0.001, "graph_state_dropout_keep_prob": 1, "compensate_num": 0, "train_file": "molecules_P38-alpha_actives.json", "valid_file": "molecules_P38-alpha_actives.json", "try_different_starting": true, "num_different_starting": 1, "generation": true, "use_graph": true, "label_one_hot": false, "multi_bfs_path": false, "bfs_path_

## Assess generated decoys

Now we need to select a final set of decoys from the candidate decoys.

We will select 20 decoys per active.

In [8]:
chosen_properties = "ALL"
num_decoys_per_active = 20

results = select_and_evaluate_decoys('P38-alpha_example_decoys.smi', file_loc='./', output_loc='./', 
                                     dataset=chosen_properties, num_cand_dec_per_act=num_decoys_per_active*2, num_dec_per_act=num_decoys_per_active)

Processing:  P38-alpha_example_decoys.smi


The following results are calculated and returned in `results`, in this order:
- File name - Name of input file
- Chosen properties - Name of the property set chosen
- Number of actives in input file
- Number of actives after applying the minimum size filter
- Number of candidate decoys
- Number of unique candidate decoys
- AUC ROC - 1NN - Performance as measured by AUC ROC of 1-nearest neighbour (1NN) algorithm in 10-fold cross-validation using all of the chosen properties
- AUC ROC - RF - Performance as measured by AUC ROC of random forest (RF) algorithm in 10-fold cross-validation using all of the chosen properties
- DOE score - Deviation from Optimal Embedding score, a measure of property matching
- LADS score - Latent Active in Decoy Set score
- Average Doppelganger score - A measure of the structural similarity between actives and decoys
- Maximum Doppelganger score - A measure of the structural similarity between actives and decoys
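
To avoid indexing `results` by position, the entries can be paired with names. The sketch below assumes `results` is a sequence ordered as in the list above, which is consistent with the indices 8, 10 and 11 used when printing the scores in this notebook. `RESULT_FIELDS` and `label_results` are hypothetical names, not part of DeepCoy, and the example values are placeholders rather than real output.

```python
# Field names in the assumed order of the entries in `results`
RESULT_FIELDS = [
    "File name",
    "Chosen properties",
    "Number of actives in input file",
    "Number of actives after minimum size filter",
    "Number of candidate decoys",
    "Number of unique candidate decoys",
    "AUC ROC - 1NN",
    "AUC ROC - RF",
    "DOE score",
    "LADS score",
    "Average Doppelganger score",
    "Maximum Doppelganger score",
]


def label_results(results):
    """Pair each entry of results with its field name (hypothetical helper).

    zip stops at the shorter input, so any extra entries are ignored.
    """
    return dict(zip(RESULT_FIELDS, results))


# Placeholder values for illustration only, not real DeepCoy output
example_results = ["example.smi", "ALL"] + [None] * 10
labelled = label_results(example_results)
print(labelled["File name"])
```

With real output, `label_results(results)["DOE score"]` would then correspond to `results[8]`.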


In [9]:
print("DOE score: \t\t\t%.3f" % results[8])
print("Average Doppelganger score: \t%.3f" % results[10])
print("Max Doppelganger score: \t%.3f" % results[11])

DOE score: 			0.076
Average Doppelganger score: 	0.210
Max Doppelganger score: 	0.243
