# Compound representation learning and property prediction

In this tutorial, we will go through how to run a Graph Neural Network (GNN) model for compound property prediction. In particular, we will demonstrate how to pretrain the model and finetune it on downstream tasks. If you are interested in more details, please refer to the README for "[info graph](https://github.com/PaddlePaddle/PaddleHelix/apps/pretrained_compound/info_graph)" and "[pretrained GNN](https://github.com/PaddlePaddle/PaddleHelix/apps/pretrained_compound/pretrain_gnns)".

# Part I: Pretraining

In this part, we will show how to pretrain a compound GNN model. The pretraining strategies here are adapted from the pretrain GNNs work and include attribute masking, context prediction and supervised pretraining.

Visit `pretrain_attrmask.py`, `pretrain_contextpred.py` and `pretrain_supervised.py` for more details.

In [1]:
import os
import numpy as np
import sys
sys.path.insert(0, os.getcwd() + "/..")
os.chdir("../apps/pretrained_compound/pretrain_gnns")

The Pahelix framework is built upon PaddlePaddle, a deep learning framework.

In [2]:
import paddle
import paddle.fluid as fluid
from paddle.fluid.incubate.fleet.collective import fleet
from pahelix.datasets import load_zinc_dataset
from pahelix.featurizers import PreGNNAttrMaskFeaturizer
from pahelix.utils.compound_tools import CompoundConstants
from pahelix.model_zoo import PreGNNAttrmaskModel

[INFO] 2020-12-18 19:57:50,304 [mp_reader.py:   23]:	ujson not install, fail back to use json instead


In [3]:
# switch to paddle static graph mode.
paddle.enable_static()



## Build the static graph
We build the static graph with Paddle `Program` and `Executor`. Here, `model_config` holds the model configuration. `PreGNNAttrmaskModel` is an unsupervised pretraining model that randomly masks the atom types of nodes and then tries to predict the masked atom types. We use the Adam optimizer with a learning rate of 0.001.
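To make the attribute masking idea concrete, here is a minimal sketch in plain Python. It is not the PaddleHelix implementation: the function `mask_atom_types`, the `MASK_TOKEN` value and the toy molecule are all assumptions made for illustration. The default ratio of 0.15 mirrors the `mask_ratio` passed to the featurizer later in this tutorial.

```python
import random

MASK_TOKEN = -1  # hypothetical id for a masked atom, not PaddleHelix's actual value


def mask_atom_types(atom_types, mask_ratio=0.15, seed=0):
    """Hide a random subset of atom types and return the prediction targets."""
    rng = random.Random(seed)
    num_masked = max(1, int(len(atom_types) * mask_ratio))
    masked_idx = rng.sample(range(len(atom_types)), num_masked)
    masked = list(atom_types)
    targets = {}
    for i in masked_idx:
        targets[i] = atom_types[i]  # label the model has to recover
        masked[i] = MASK_TOKEN      # what the model actually sees
    return masked, targets


atoms = [6, 6, 8, 7, 6, 6, 16, 8, 6, 6]  # atomic numbers of a toy molecule
masked, targets = mask_atom_types(atoms)
print(masked)
print(targets)
```

The pretraining loss is then computed only at the masked positions, by comparing the predicted atom type with the stored target.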

To train on a GPU, uncomment the line `exe = fluid.Executor(fluid.CUDAPlace(0))`; `fluid.CPUPlace()` is for CPU training.

In [4]:
model_config = {
    "dropout_rate": 0.5,# dropout rate
    "gnn_type": "gin",  # other choices like "gat", "gcn".
    "layer_num": 5,     # the number of gnn layers.
}
train_prog = fluid.Program()
startup_prog = fluid.Program()
with fluid.program_guard(train_prog, startup_prog):
    with fluid.unique_name.guard():
        model = PreGNNAttrmaskModel(model_config=model_config)
        model.forward()
        opt = fluid.optimizer.Adam(learning_rate=0.001)
        opt.minimize(model.loss)

exe = fluid.Executor(fluid.CPUPlace())
# exe = fluid.Executor(fluid.CUDAPlace(0))
exe.run(startup_prog)
print(model.loss)



var mean_0.tmp_0 : LOD_TENSOR.shape(1,).dtype(FP32).stop_gradient(False)


## Dataset loading and feature extraction
### Download the dataset using wget
First, we need to download a small dataset for this demo. If you do not have `wget` on your machine, you can instead
copy the URL below into your web browser to download the data. In that case, remember to move the downloaded file manually to the
path "../apps/pretrained_compound/pretrain_gnns/".
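If neither `wget` nor a browser is convenient, the same download can be sketched with the Python standard library. The helper names `local_name` and `download_and_extract` are made up for this example; the URL is the one used in the cell below. Unlike `wget --no-check-certificate`, `urllib` verifies TLS certificates by default, so this sketch may behave differently behind some proxies.

```python
import os
import tarfile
import urllib.parse
import urllib.request

URL = ("https://baidu-nlp.bj.bcebos.com/"
       "PaddleHelix%2Fdatasets%2Fcompound_datasets%2Fchem_dataset_small.tgz")


def local_name(url):
    """File name the archive gets on disk (the last URL path segment)."""
    return os.path.basename(urllib.parse.urlparse(url).path)


def download_and_extract(url, dest_dir="."):
    """Download the archive into dest_dir (if missing) and unpack it there."""
    archive = os.path.join(dest_dir, local_name(url))
    if not os.path.exists(archive):
        urllib.request.urlretrieve(url, archive)
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(dest_dir)
    return archive


# Run from the pretrain_gnns directory, for example:
# download_and_extract(URL)
```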

In [5]:
### Download a toy dataset for demonstration:
!wget "https://baidu-nlp.bj.bcebos.com/PaddleHelix%2Fdatasets%2Fcompound_datasets%2Fchem_dataset_small.tgz" --no-check-certificate
!tar -zxf "PaddleHelix%2Fdatasets%2Fcompound_datasets%2Fchem_dataset_small.tgz"
!ls "./chem_dataset_small"
### Download the full dataset as you want:
# !wget "http://snap.stanford.edu/gnn-pretrain/data/chem_dataset.zip" --no-check-certificate
# !unzip "chem_dataset.zip"
# !ls "./chem_dataset"

--2020-12-18 19:58:05--  https://baidu-nlp.bj.bcebos.com/PaddleHelix%2Fdatasets%2Fcompound_datasets%2Fchem_dataset_small.tgz
Connecting to 172.19.56.199:3128... connected.
Proxy request sent, awaiting response... 200 OK
Length: 609563 (595K) [application/gzip]
Saving to: “PaddleHelix%2Fdatasets%2Fcompound_datasets%2Fchem_dataset_small.tgz”


2020-12-18 19:58:12 (231 KB/s) - “PaddleHelix%2Fdatasets%2Fcompound_datasets%2Fchem_dataset_small.tgz” saved [609563/609563]

tox21  zinc_standard_agent


`PreGNNAttrMaskFeaturizer` is used together with `PreGNNAttrmaskModel`. It inherits from the base class `Featurizer`, which handles feature extraction. A `Featurizer` has two functions: `gen_features` converts a single raw SMILES string into a single graph data item, and `collate_fn` aggregates a list of graph data items into one large batch.
The Zinc dataset is used as the pretraining dataset.
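The division of labour between the two functions can be illustrated with a toy version. This is only a sketch: `toy_gen_features` and `toy_collate_fn` are hypothetical stand-ins, and the dictionary layout is an assumption rather than the format PaddleHelix actually feeds to the model. The point is that batching graphs means building one large disjoint graph by shifting node ids.

```python
def toy_gen_features(num_nodes, edges):
    """Stand-in for per-molecule featurization: one small graph as a dict."""
    return {"num_nodes": num_nodes, "edges": list(edges)}


def toy_collate_fn(graphs):
    """Merge small graphs into one disjoint batch graph by shifting node ids."""
    batch_edges, graph_id = [], []
    offset = 0
    for gid, g in enumerate(graphs):
        batch_edges += [(u + offset, v + offset) for u, v in g["edges"]]
        graph_id += [gid] * g["num_nodes"]  # which molecule each node belongs to
        offset += g["num_nodes"]
    return {"num_nodes": offset, "edges": batch_edges, "graph_id": graph_id}


batch = toy_collate_fn([
    toy_gen_features(3, [(0, 1), (1, 2)]),  # a 3-atom chain
    toy_gen_features(2, [(0, 1)]),          # a 2-atom molecule
])
print(batch["edges"])     # [(0, 1), (1, 2), (3, 4)]
print(batch["graph_id"])  # [0, 0, 0, 1, 1]
```

The `graph_id` list is what lets a graph-level readout pool node embeddings back into one vector per molecule.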

In [9]:
featurizer = PreGNNAttrMaskFeaturizer(
        model.graph_wrapper, 
        atom_type_num=len(CompoundConstants.atom_num_list),
        mask_ratio=0.15)
### Load only the first 1000 samples of the toy dataset to speed things up
dataset = load_zinc_dataset("./chem_dataset_small/zinc_standard_agent/raw", featurizer=featurizer)
dataset = dataset[:1000]
### Load the full dataset:
# dataset = load_zinc_dataset("./chem_dataset/zinc_standard_agent/raw", featurizer=featurizer)
print("dataset num: %s" % (len(dataset)))

dataset num: 1000


## Start training
Now we train the attrmask model for 2 epochs for demonstration purposes. Data loading is accelerated with 4 worker processes. The pretrained model is then saved to "./model/pretrain_attrmask", which will serve as the initial model for the downstream tasks.

In [10]:
def train(exe, train_prog, model, dataset, featurizer):
    data_gen = dataset.iter_batch(
            batch_size=256, num_workers=4, shuffle=True, collate_fn=featurizer.collate_fn)
    list_loss = []
    for batch_id, feed_dict in enumerate(data_gen):
        train_loss, = exe.run(train_prog, 
                feed=feed_dict, fetch_list=[model.loss], return_numpy=False)
        list_loss.append(np.array(train_loss).mean())
    return np.mean(list_loss)

for epoch_id in range(2):
    train_loss = train(exe, train_prog, model, dataset, featurizer)
    print("epoch:%d train/loss:%s" % (epoch_id, train_loss))
fluid.io.save_params(exe, './model/pretrain_attrmask', train_prog)

epoch:0 train/loss:0.7213694
epoch:1 train/loss:0.70324224
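The epoch loop above relies on `dataset.iter_batch` to shuffle and chunk the data. A minimal single-process sketch of that pattern is given below; `iter_batches` and `run_epoch` are hypothetical names, and whether the real `iter_batch` keeps a smaller final batch is an assumption.

```python
import random


def iter_batches(samples, batch_size, shuffle=True, seed=0):
    """Yield lists of samples, optionally in shuffled order."""
    indices = list(range(len(samples)))
    if shuffle:
        random.Random(seed).shuffle(indices)
    for start in range(0, len(indices), batch_size):
        yield [samples[i] for i in indices[start:start + batch_size]]


def run_epoch(samples, batch_size, step_fn):
    """Call step_fn on every batch and return the mean of the batch losses."""
    losses = [step_fn(batch) for batch in iter_batches(samples, batch_size)]
    return sum(losses) / len(losses)


sizes = [len(b) for b in iter_batches(list(range(1000)), 256)]
print(sizes)  # [256, 256, 256, 232]
```

With 1000 samples and a batch size of 256, an epoch in this sketch consists of four steps. Because the reported loss is a mean over batch means, a smaller last batch counts as much as a full one.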


This completes the pretraining steps; you can adjust them as needed.

# Part II: Downstream finetuning
Below we show how to finetune the pretrained model on downstream tasks.

Visit `finetune.py` for more details.

In [11]:
from pahelix.utils.paddle_utils import load_partial_params
from pahelix.utils.splitters import \
    RandomSplitter, IndexSplitter, ScaffoldSplitter, RandomScaffoldSplitter
from pahelix.datasets import *

from model import DownstreamModel
from featurizer import DownstreamFeaturizer
from utils import calc_rocauc_score

The downstream datasets are usually small and cover different tasks. For example, the BBBP dataset is used to predict blood-brain barrier permeability, and the Tox21 dataset is used to predict the toxicity of compounds. Here we use the Tox21 dataset for demonstration.

In [12]:
task_names = get_default_tox21_task_names()
print(task_names)

['NR-AR', 'NR-AR-LBD', 'NR-AhR', 'NR-Aromatase', 'NR-ER', 'NR-ER-LBD', 'NR-PPAR-gamma', 'SR-ARE', 'SR-ATAD5', 'SR-HSE', 'SR-MMP', 'SR-p53']


## Build the static graph
We build the static graph with Paddle `Program` and `Executor`. Here, `model_config` holds the model configuration. Note that the model architecture settings must match those of the pretraining model, otherwise loading the pretrained parameters will fail. `DownstreamModel` is a supervised GNN model that predicts the tasks listed in `task_names`. We again use the Adam optimizer with a learning rate of 0.001.

We use `train_prog` and `test_prog` to hold static graphs related to training and testing. They have the same network architecture but the functioning of some operators will change, like `Dropout`, `BatchNorm`.
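To see why the same architecture needs two programs, consider how dropout behaves in each mode. The sketch below uses the common "inverted dropout" convention and a hypothetical helper `toy_dropout`; Paddle's own dropout operator may use a different scaling convention, so treat this only as an illustration of the train/test difference.

```python
import random


def toy_dropout(values, rate, is_test, rng=None):
    """Inverted dropout: zero and rescale in training, identity at test time."""
    if is_test or rate == 0.0:
        return list(values)
    rng = rng or random.Random(0)
    keep = 1.0 - rate
    # Each value survives with probability `keep` and is scaled by 1 / keep.
    return [v / keep if rng.random() < keep else 0.0 for v in values]


acts = [1.0, 2.0, 3.0, 4.0]
print(toy_dropout(acts, rate=0.5, is_test=False))  # random zeros, survivors doubled
print(toy_dropout(acts, rate=0.5, is_test=True))   # [1.0, 2.0, 3.0, 4.0]
```

Evaluating with the training-mode behaviour would make predictions noisy, which is why evaluation runs on `test_prog`.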

To train on a GPU, uncomment the line containing `fluid.CUDAPlace(0)`; `fluid.CPUPlace()` is for CPU training.

In [13]:
model_config = {
    "dropout_rate": 0.5,# dropout rate
    "gnn_type": "gin",  # other choices like "gat", "gcn".
    "layer_num": 5,     # the number of gnn layers.
    "num_tasks": len(task_names), # number of targets to predict for the downstream task.
}
train_prog = fluid.Program()
test_prog = fluid.Program()
startup_prog = fluid.Program()
with fluid.program_guard(train_prog, startup_prog):
    with fluid.unique_name.guard():
        model = DownstreamModel(model_config=model_config)
        model.train()
        opt = fluid.optimizer.Adam(learning_rate=0.001)
        opt.minimize(model.loss)
with fluid.program_guard(test_prog, fluid.Program()):
    with fluid.unique_name.guard():
        model = DownstreamModel(model_config=model_config)
        model.train(is_test=True)

exe = fluid.Executor(fluid.CPUPlace())
# exe = fluid.Executor(fluid.CUDAPlace(0))
exe.run(startup_prog)



[]

## Load pretrained models
Load the model saved during the pretraining phase. Here we load the model "pretrain_attrmask" as an example.

In [14]:
load_partial_params(exe, './model/pretrain_attrmask', train_prog)

Load parameters from ./model/pretrain_attrmask.
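The pretraining model and the downstream model share the GNN encoder but have different output heads, so only part of the saved parameters can be reused. The internals of `load_partial_params` are not shown in this tutorial; the following is a sketch of the general idea under the assumption that parameters are matched by name. The function `load_partial` and all parameter names are hypothetical.

```python
def load_partial(model_params, checkpoint):
    """Copy checkpoint values into model_params wherever the names match."""
    loaded, skipped = [], []
    for name in model_params:
        if name not in checkpoint:
            skipped.append(name)  # e.g. a new task head: keeps its initial value
            continue
        if len(checkpoint[name]) != len(model_params[name]):
            raise ValueError("shape mismatch for %s" % name)
        model_params[name] = list(checkpoint[name])
        loaded.append(name)
    return loaded, skipped


model_params = {"gnn.layer_0.w": [0.0, 0.0], "finetune_head.w": [0.0, 0.0, 0.0]}
checkpoint = {"gnn.layer_0.w": [0.5, -0.5], "attrmask_head.w": [1.0]}
loaded, skipped = load_partial(model_params, checkpoint)
print(loaded)   # ['gnn.layer_0.w']
print(skipped)  # ['finetune_head.w']
```

The shape check in this sketch mirrors the requirement that the architecture settings of the downstream model match those used in pretraining.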


## Dataset loading and feature extraction
`DownstreamFeaturizer` is used together with `DownstreamModel`. It inherits from the base class `Featurizer`, which handles feature extraction. A `Featurizer` has two functions: `gen_features` converts a single raw SMILES string into a single graph data item, and `collate_fn` aggregates a list of graph data items into one large batch.

The Tox21 dataset is used as the downstream dataset, and we use `ScaffoldSplitter` to split it into train/valid/test sets. `ScaffoldSplitter` first orders the compounds by Bemis-Murcko scaffold, then takes the first `frac_train` proportion as the train set, the next `frac_valid` proportion as the valid set, and the rest as the test set. Because structurally similar compounds tend to land in the same set, `ScaffoldSplitter` gives a better estimate of how well the model generalizes to out-of-distribution samples than a random split does. Other splitters such as `RandomSplitter`, `RandomScaffoldSplitter` and `IndexSplitter` are also available.
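The scaffold split can be sketched as follows. This is not the PaddleHelix code: it assumes the scaffold of each compound has already been computed (the strings "A" to "D" are placeholders for scaffold SMILES), and it follows one common variant that places the largest scaffold groups first and keeps each group intact. The function name `scaffold_split` is hypothetical.

```python
from collections import defaultdict


def scaffold_split(scaffolds, frac_train=0.8, frac_valid=0.1):
    """Assign whole scaffold groups to train, valid and test, in that order."""
    groups = defaultdict(list)
    for idx, scaffold in enumerate(scaffolds):
        groups[scaffold].append(idx)
    # Largest groups first; ties broken by first index so the result is deterministic.
    ordered = sorted(groups.values(), key=lambda g: (-len(g), g[0]))
    n = len(scaffolds)
    train_cut = frac_train * n
    valid_cut = (frac_train + frac_valid) * n
    train, valid, test = [], [], []
    for group in ordered:
        if len(train) + len(group) <= train_cut:
            train.extend(group)
        elif len(train) + len(valid) + len(group) <= valid_cut:
            valid.extend(group)
        else:
            test.extend(group)
    return train, valid, test


scaffolds = ["A"] * 6 + ["B"] * 2 + ["C"] + ["D"]
train, valid, test = scaffold_split(scaffolds)
print(train, valid, test)  # [0, 1, 2, 3, 4, 5, 6, 7] [8] [9]
```

In this toy run no scaffold appears in more than one set, so the valid and test compounds are structurally unseen during training.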

In [15]:
featurizer = DownstreamFeaturizer(model.graph_wrapper)
### Load the toy dataset:
dataset = load_tox21_dataset(
        "./chem_dataset_small/tox21/raw", task_names, featurizer=featurizer)
### Load the full dataset:
# dataset = load_tox21_dataset(
#         "./chem_dataset/tox21/raw", task_names, featurizer=featurizer)

# splitter = RandomSplitter()
splitter = ScaffoldSplitter()
train_dataset, valid_dataset, test_dataset = splitter.split(
        dataset, frac_train=0.8, frac_valid=0.1, frac_test=0.1)
print("Train/Valid/Test num: %s/%s/%s" % (
        len(train_dataset), len(valid_dataset), len(test_dataset)))



Train/Valid/Test num: 6264/783/784


## Start training
Now we finetune the model, initialized from the attrmask pretraining, for 4 epochs for demonstration purposes. Since a downstream dataset such as Tox21 contains more than one sub-task, the performance of the model is evaluated by the average ROC-AUC over all sub-tasks.

In [16]:
def train(exe, train_prog, model, train_dataset, featurizer):
    data_gen = train_dataset.iter_batch(
        batch_size=64, num_workers=4, shuffle=True, collate_fn=featurizer.collate_fn)
    list_loss = []
    for batch_id, feed_dict in enumerate(data_gen):
        train_loss, = exe.run(train_prog, feed=feed_dict, fetch_list=[model.loss], return_numpy=False)
        list_loss.append(np.array(train_loss).mean())
    return np.mean(list_loss)

def evaluate(exe, test_prog, model, test_dataset, featurizer):
    """
    In the dataset, a proportion of labels are blank. So we use a `valid` tensor
    to help eliminate these blank labels in both training and evaluation phase.
    
    Returns:
        the average roc-auc of all sub-tasks.
    """
    data_gen = test_dataset.iter_batch(
        batch_size=64, num_workers=4, shuffle=False, collate_fn=featurizer.collate_fn)
    total_pred = []
    total_label = []
    total_valid = []
    for batch_id, feed_dict in enumerate(data_gen):
        pred, = exe.run(test_prog, feed=feed_dict, fetch_list=[model.pred], return_numpy=False)
        total_pred.append(np.array(pred))
        total_label.append(feed_dict['finetune_label'])
        total_valid.append(feed_dict['valid'])
    total_pred = np.concatenate(total_pred, 0)
    total_label = np.concatenate(total_label, 0)
    total_valid = np.concatenate(total_valid, 0)
    return calc_rocauc_score(total_label, total_pred, total_valid)

for epoch_id in range(4):
    train_loss = train(exe, train_prog, model, train_dataset, featurizer)
    val_auc = evaluate(exe, test_prog, model, valid_dataset, featurizer)
    test_auc = evaluate(exe, test_prog, model, test_dataset, featurizer)
    print("epoch:%s train/loss:%s" % (epoch_id, train_loss))
    print("epoch:%s val/auc:%s" % (epoch_id, val_auc))
    print("epoch:%s test/auc:%s" % (epoch_id, test_auc))
fluid.io.save_params(exe, './model/tox21', train_prog)

Valid ratio: 0.7603235
Task evaluated: 12/12
Valid ratio: 0.7513818
Task evaluated: 12/12
epoch:0 train/loss:0.49475822
epoch:0 val/auc:0.6658935384290235
epoch:0 test/auc:0.6521651483374383
Valid ratio: 0.7603235
Task evaluated: 12/12
Valid ratio: 0.7513818
Task evaluated: 12/12
epoch:1 train/loss:0.25100735
epoch:1 val/auc:0.6873462784930324
epoch:1 test/auc:0.6848870489412003
Valid ratio: 0.7603235
Task evaluated: 12/12
Valid ratio: 0.7513818
Task evaluated: 12/12
epoch:2 train/loss:0.22103228
epoch:2 val/auc:0.6841346893995142
epoch:2 test/auc:0.6818656406099143
Valid ratio: 0.7603235
Task evaluated: 12/12
Valid ratio: 0.7513818
Task evaluated: 12/12
epoch:3 train/loss:0.21569714
epoch:3 val/auc:0.6623265146967362
epoch:3 test/auc:0.6324247392008124
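The docstring of `evaluate` mentions that blank labels are excluded through a `valid` tensor before the ROC-AUC is averaged over sub-tasks. The internals of `calc_rocauc_score` are not shown here, so the following is only a plain-Python sketch of that idea; `roc_auc` and `masked_mean_auc` are hypothetical helpers, and the tiny label matrix is made up.

```python
def roc_auc(labels, scores):
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        return None  # undefined when only one class is present
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def masked_mean_auc(labels, preds, valid):
    """Average per-task ROC-AUC, using only entries where valid is nonzero."""
    aucs = []
    for t in range(len(labels[0])):
        rows = [i for i in range(len(labels)) if valid[i][t]]
        auc = roc_auc([labels[i][t] for i in rows], [preds[i][t] for i in rows])
        if auc is not None:
            aucs.append(auc)
    return sum(aucs) / len(aucs)


labels = [[1, 0], [0, 1], [1, 0], [0, 0]]
preds = [[0.9, 0.2], [0.1, 0.8], [0.4, 0.3], [0.6, 0.5]]
valid = [[1, 1], [1, 1], [1, 0], [1, 1]]  # third compound has no label for task 1
print(masked_mean_auc(labels, preds, valid))  # 0.875
```

Here task 0 scores 0.75 and task 1 scores 1.0 once the blank entry is dropped, giving an average of 0.875.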


# Part III: Downstream Inference
In this part, we will briefly introduce how to use the trained downstream model to do inference on the given SMILES sequences.

## Build the static graph
This part is basically the same as in Part II.

In [17]:
model_config = {
    "dropout_rate": 0.5,# dropout rate
    "gnn_type": "gin",  # other choices like "gat", "gcn".
    "layer_num": 5,     # the number of gnn layers.
    "num_tasks": len(task_names), # number of targets to predict for the downstream task.
}
inference_prog = fluid.Program()
startup_prog = fluid.Program()
with fluid.program_guard(inference_prog, startup_prog):
    with fluid.unique_name.guard():
        model = DownstreamModel(model_config=model_config)
        model.inference()

exe = fluid.Executor(fluid.CPUPlace())
# exe = fluid.Executor(fluid.CUDAPlace(0))
exe.run(startup_prog)



[]

## Load finetuned models
Load the finetuned model from part II.

In [18]:
load_partial_params(exe, './model/tox21', inference_prog)

Load parameters from ./model/tox21.


## Start Inference
Do inference on the given SMILES sequence. We directly call `gen_features` and `collate_fn` of the `DownstreamFeaturizer` to convert the raw SMILES sequence into the model input.

Taking Tox21 as an example, the finetuned downstream model makes predictions for all 12 of its sub-tasks.

In [19]:
SMILES="O=C1c2ccccc2C(=O)C1c1ccc2cc(S(=O)(=O)[O-])cc(S(=O)(=O)[O-])c2n1"
featurizer = DownstreamFeaturizer(model.graph_wrapper, is_inference=True)
feed_dict = featurizer.collate_fn([featurizer.gen_features({'smiles': SMILES})])
pred, = exe.run(inference_prog, feed=feed_dict, fetch_list=[model.pred], return_numpy=False)
probs = np.array(pred)[0]
print('SMILES:%s' % SMILES)
print('Predictions:')
for name, prob in zip(task_names, probs):
    print("  %s:\t%s" % (name, prob))

SMILES:O=C1c2ccccc2C(=O)C1c1ccc2cc(S(=O)(=O)[O-])cc(S(=O)(=O)[O-])c2n1
Predictions:
  NR-AR:	0.07395526
  NR-AR-LBD:	0.060779173
  NR-AhR:	0.32708064
  NR-Aromatase:	0.105057806
  NR-ER:	0.23960893
  NR-ER-LBD:	0.10899512
  NR-PPAR-gamma:	0.08047223
  SR-ARE:	0.3242342
  SR-ATAD5:	0.103697956
  SR-HSE:	0.08612242
  SR-MMP:	0.321919
  SR-p53:	0.15215957
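As a small follow-up, the printed predictions can be ranked to see which sub-tasks the model considers most likely to be active. The values below are copied from the output above; the cutoff `THRESHOLD = 0.3` is an arbitrary choice for illustration, not a recommended decision threshold.

```python
task_probs = {
    "NR-AR": 0.07395526, "NR-AR-LBD": 0.060779173, "NR-AhR": 0.32708064,
    "NR-Aromatase": 0.105057806, "NR-ER": 0.23960893, "NR-ER-LBD": 0.10899512,
    "NR-PPAR-gamma": 0.08047223, "SR-ARE": 0.3242342, "SR-ATAD5": 0.103697956,
    "SR-HSE": 0.08612242, "SR-MMP": 0.321919, "SR-p53": 0.15215957,
}
THRESHOLD = 0.3  # hypothetical cutoff, chosen only for this example

# Sort tasks from the highest to the lowest predicted probability.
ranked = sorted(task_probs.items(), key=lambda kv: kv[1], reverse=True)
flagged = [name for name, prob in ranked if prob >= THRESHOLD]
print(flagged)  # ['NR-AhR', 'SR-ARE', 'SR-MMP']
```

Keep in mind that these numbers come from a model finetuned for only 4 epochs on a toy setup, so they are for demonstration only.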
