In this tutorial, we will go through how to run a Graph Neural Network (GNN) model for compound property prediction. In particular, we will demonstrate how to pretrain the model and how to finetune it on downstream tasks.

# Part I: Pretraining

In this part, we will show how to pretrain a compound GNN model. The pretraining strategies here are adapted from the pretrain_gnns work and include attribute masking, context prediction and supervised pretraining.

Visit `pretrain_attrmask.py`, `pretrain_contextpred.py` and `pretrain_supervised.py` for more details.

In [1]:
import os
import numpy as np
import sys
sys.path.insert(0, os.getcwd() + "/..")
os.chdir("../apps/pretrained_compound/pretrain_gnns")

The Pahelix framework is built upon PaddlePaddle, a deep learning framework.

In [5]:
import paddle
import paddle.fluid as fluid
from paddle.fluid.incubate.fleet.collective import fleet
from pahelix.datasets import load_zinc_dataset
from pahelix.featurizer import PreGNNAttrMaskFeaturizer
from pahelix.utils.compound_tools import CompoundConstants
from pahelix.model_zoo import PreGNNAttrmaskModel

In [6]:
# switch to paddle static graph mode.
paddle.enable_static()

## Build the static graph
We build the static graph with Paddle Programs and Executors. Here, `model_config` holds the model configuration. `PreGNNAttrmaskModel` is an unsupervised pretraining model that randomly masks the atom types of some nodes and then uses the masked atom types as the prediction targets. We use the Adam optimizer with a learning rate of 0.001.
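
To make the masking idea concrete, here is a minimal pure-Python sketch. It is not the `PreGNNAttrmaskModel` or `PreGNNAttrMaskFeaturizer` implementation; the function name `mask_atom_types` and the placeholder `MASK_TOKEN` are hypothetical and used for illustration only.

```python
import random

MASK_TOKEN = -1  # hypothetical placeholder id for a masked atom


def mask_atom_types(atom_types, mask_ratio=0.15, seed=0):
    """Mask a random subset of a non-empty atom type list.

    Returns (masked_atom_types, masked_indices, targets), where targets
    are the original atom types at the masked positions.
    """
    rng = random.Random(seed)
    num_nodes = len(atom_types)
    # Mask at least one node so there is always a prediction target.
    num_masked = max(1, int(num_nodes * mask_ratio))
    masked_indices = sorted(rng.sample(range(num_nodes), num_masked))
    targets = [atom_types[i] for i in masked_indices]
    masked = list(atom_types)
    for i in masked_indices:
        masked[i] = MASK_TOKEN
    return masked, masked_indices, targets


# Atomic numbers of a toy 10-atom molecule.
atom_types = [6, 6, 8, 7, 6, 6, 6, 8, 6, 6]
masked, masked_indices, targets = mask_atom_types(atom_types)
print(masked, masked_indices, targets)
```

The model only sees `masked` as input and is trained to recover `targets` at `masked_indices`, so no manual labels are needed.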

To train on the GPU, uncomment the `fluid.CUDAPlace(0)` line and use it in place of the `fluid.CPUPlace()` line, which is for CPU training.

In [8]:
model_config = {
    "dropout_rate": 0.5,# dropout rate
    "gnn_type": "gin",  # other choices like "gat", "gcn".
    "layer_num": 5,     # the number of gnn layers.
}
train_prog = fluid.Program()
startup_prog = fluid.Program()
with fluid.program_guard(train_prog, startup_prog):
    with fluid.unique_name.guard():
        model = PreGNNAttrmaskModel(model_config=model_config)
        model.forward()
        opt = fluid.optimizer.Adam(learning_rate=0.001)
        opt.minimize(model.loss)

exe = fluid.Executor(fluid.CPUPlace())
# exe = fluid.Executor(fluid.CUDAPlace(0))
exe.run(startup_prog)
print(model.loss)

var mean_0.tmp_0 : fluid.VarType.LOD_TENSOR.shape(1,).astype(VarType.FP32)


## Dataset loading and feature extraction
`PreGNNAttrMaskFeaturizer` is used along with `PreGNNAttrmaskModel`. It inherits from the base class `Featurizer`, which is used for feature extraction. The `Featurizer` has two functions: `gen_features` converts a single raw SMILES string into a single graph data item, and `collate_fn` aggregates a sublist of graph data into one large batch.
The ZINC dataset is used as the pretraining dataset.
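
As an illustration of what aggregating graphs into one batch usually means, the sketch below merges several small graphs into a single disjoint graph by offsetting node indices. The dict layout (`num_nodes`, `edges`, `node_feat`) and the function name `collate_graphs` are assumptions made for this example, not the data format used by `collate_fn` in Pahelix.

```python
def collate_graphs(graphs):
    """Merge small graphs into one disjoint batch graph.

    Each graph is a dict with "num_nodes", "edges" (list of (src, dst))
    and "node_feat" (one entry per node). This layout is hypothetical.
    """
    batch_edges, batch_feat, graph_ids = [], [], []
    offset = 0
    for graph_id, g in enumerate(graphs):
        # Shift node indices so that graphs do not share nodes.
        batch_edges.extend((s + offset, d + offset) for s, d in g["edges"])
        batch_feat.extend(g["node_feat"])
        # Remember which graph each node came from (needed for pooling).
        graph_ids.extend([graph_id] * g["num_nodes"])
        offset += g["num_nodes"]
    return {
        "num_nodes": offset,
        "edges": batch_edges,
        "node_feat": batch_feat,
        "graph_ids": graph_ids,
    }


g1 = {"num_nodes": 2, "edges": [(0, 1), (1, 0)], "node_feat": [6, 8]}
g2 = {"num_nodes": 3, "edges": [(0, 1), (1, 2)], "node_feat": [6, 6, 7]}
batch = collate_graphs([g1, g2])
print(batch["edges"])      # edges of g2 are shifted by 2
print(batch["graph_ids"])
```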

In [9]:
featurizer = PreGNNAttrMaskFeaturizer(
        model.graph_wrapper, 
        atom_type_num=len(CompoundConstants.atom_num_list),
        mask_ratio=0.15)
dataset = load_zinc_dataset("../../../data/chem_dataset/zinc_standard_agent/raw", featurizer=featurizer)
print("dataset num: %s" % (len(dataset)))

dataset num: 1000


## Start training
Now we train the attrmask model for 2 epochs for demonstration purposes. Data loading is accelerated with 4 worker processes (`num_workers=4`). The pretrained model is then saved to `./model/pretrain_attrmask`, which will serve as the initial model for the downstream tasks.

In [10]:
def train(exe, train_prog, model, dataset, featurizer):
    data_gen = dataset.iter_batch(
            batch_size=256, num_workers=4, shuffle=True, collate_fn=featurizer.collate_fn)
    list_loss = []
    for batch_id, feed_dict in enumerate(data_gen):
        train_loss, = exe.run(train_prog, 
                feed=feed_dict, fetch_list=[model.loss], return_numpy=False)
        list_loss.append(np.array(train_loss).mean())
    return np.mean(list_loss)

for epoch_id in range(2):
    train_loss = train(exe, train_prog, model, dataset, featurizer)
    print("epoch:%d train/loss:%s" % (epoch_id, train_loss))
fluid.io.save_params(exe, './model/pretrain_attrmask', train_prog)

epoch:0 train/loss:4.3884435
epoch:1 train/loss:1.53111


The above covers the pretraining steps; you can adjust them as needed.

# Part II: Downstream finetuning
Below we introduce how to use the pretrained model for finetuning on downstream tasks.

Visit `finetune.py` for more details.

In [11]:
from pahelix.utils.paddle_utils import load_partial_params
from pahelix.utils.splitters import \
    RandomSplitter, IndexSplitter, ScaffoldSplitter, RandomScaffoldSplitter
from pahelix.datasets import *

from model import DownstreamModel
from featurizer import DownstreamFeaturizer
from utils import calc_rocauc_score

The downstream datasets are usually small and cover different tasks. For example, the BBBP dataset is used to predict blood-brain barrier permeability, and the Tox21 dataset is used to predict the toxicity of compounds. Here we use the Tox21 dataset for demonstration.

In [12]:
task_names = get_default_tox21_task_names()
# task_names = get_default_sider_task_names()
print(task_names)

['NR-AR', 'NR-AR-LBD', 'NR-AhR', 'NR-Aromatase', 'NR-ER', 'NR-ER-LBD', 'NR-PPAR-gamma', 'SR-ARE', 'SR-ATAD5', 'SR-HSE', 'SR-MMP', 'SR-p53']


## Build the static graph
We build the static graph with Paddle Programs and Executors. Here, `model_config` holds the model configuration. Note that the model architecture settings must match those of the pretraining model, otherwise loading the pretrained parameters will fail. `DownstreamModel` is a supervised GNN model that predicts the tasks listed in `task_names`. We use the Adam optimizer with a learning rate of 0.001.
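
A quick sanity check along these lines can catch a mismatch before loading. This is a minimal sketch: the helper `check_config_alignment` is hypothetical, and the choice of `gnn_type` and `layer_num` as the keys that define the shared architecture is an assumption based on the configurations shown in this tutorial.

```python
# Keys assumed to define the shared GNN architecture (an assumption).
ARCHITECTURE_KEYS = ("gnn_type", "layer_num")


def check_config_alignment(pretrain_config, downstream_config,
                           keys=ARCHITECTURE_KEYS):
    """Return the architecture keys whose values differ between configs."""
    return [k for k in keys
            if pretrain_config.get(k) != downstream_config.get(k)]


pretrain_config = {"dropout_rate": 0.5, "gnn_type": "gin", "layer_num": 5}
downstream_config = {"dropout_rate": 0.5, "gnn_type": "gin",
                     "layer_num": 5, "num_tasks": 12}

mismatched = check_config_alignment(pretrain_config, downstream_config)
print("mismatched keys:", mismatched)

# A downstream model with fewer layers would be flagged.
shallow_config = dict(downstream_config, layer_num=3)
print("mismatched keys:", check_config_alignment(pretrain_config, shallow_config))
```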

To train on the GPU, uncomment the `fluid.CUDAPlace(0)` line and use it in place of the `fluid.CPUPlace()` line, which is for CPU training.

In [14]:
model_config = {
    "dropout_rate": 0.5,# dropout rate
    "gnn_type": "gin",  # other choices like "gat", "gcn".
    "layer_num": 5,     # the number of gnn layers.
    "num_tasks": len(task_names), # number of targets to predict for the downstream task.
}
train_prog = fluid.Program()
startup_prog = fluid.Program()
with fluid.program_guard(train_prog, startup_prog):
    with fluid.unique_name.guard():
        model = DownstreamModel(model_config=model_config)
        model.forward()
        test_prog = train_prog.clone(for_test=True)
        adam = fluid.optimizer.Adam(learning_rate=0.001)
        adam.minimize(model.loss)

exe = fluid.Executor(fluid.CPUPlace())
# exe = fluid.Executor(fluid.CUDAPlace(0))
exe.run(startup_prog)

[]

## Load pretrained models
Load the model saved in the pretraining phase. Here we load `pretrain_attrmask` as an example.

In [15]:
load_partial_params(exe, './model/pretrain_attrmask', train_prog)

Load parameters from ./model/pretrain_attrmask.
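
Conceptually, partial loading restores only the parameters that the saved model and the current model have in common, and leaves the rest (for example a new task head) at their fresh initialization. The sketch below illustrates that idea with plain dicts. It is not the implementation of `load_partial_params`; the function `select_loadable_params` and all parameter names are hypothetical.

```python
def select_loadable_params(saved_params, model_params):
    """Split model parameters into loadable and skipped ones.

    Both arguments map a parameter name to its shape tuple. A parameter
    is loadable when the same name with the same shape was saved.
    """
    loadable, skipped = [], []
    for name, shape in model_params.items():
        if saved_params.get(name) == shape:
            loadable.append(name)
        else:
            skipped.append(name)
    return loadable, skipped


# Hypothetical parameter names and shapes, for illustration only.
saved_params = {
    "gnn_layer_0.w": (300, 300),
    "gnn_layer_1.w": (300, 300),
    "attrmask_head.w": (300, 119),
}
model_params = {
    "gnn_layer_0.w": (300, 300),
    "gnn_layer_1.w": (300, 300),
    "finetune_head.w": (300, 12),
}

loadable, skipped = select_loadable_params(saved_params, model_params)
print("loaded from checkpoint:", loadable)
print("kept at initialization:", skipped)
```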


## Dataset loading and feature extraction
`DownstreamFeaturizer` is used along with `DownstreamModel`. It inherits from the base class `Featurizer`, which is used for feature extraction. The `Featurizer` has two functions: `gen_features` converts a single raw SMILES string into a single graph data item, and `collate_fn` aggregates a sublist of graph data into one large batch.

The Tox21 dataset is used as the downstream dataset, and we use `ScaffoldSplitter` to split it into train/valid/test sets. `ScaffoldSplitter` first orders the compounds according to their Bemis-Murcko scaffolds, then takes the first `frac_train` proportion as the train set, the next `frac_valid` proportion as the valid set and the rest as the test set. Because structurally similar compounds end up in the same set, `ScaffoldSplitter` gives a better estimate of how well the model generalizes to out-of-distribution samples. Note that other splitters such as `RandomSplitter`, `RandomScaffoldSplitter` and `IndexSplitter` are also available.
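
The ordering-then-slicing procedure described above can be sketched as follows. This is an illustration, not the `ScaffoldSplitter` code: scaffolds are assumed to be precomputed strings (in practice they would come from a cheminformatics toolkit), and ordering larger scaffold groups first is an assumption of this sketch.

```python
def scaffold_split(scaffolds, frac_train=0.8, frac_valid=0.1):
    """Split compound indices by scaffold.

    `scaffolds` holds one precomputed scaffold string per compound.
    Returns (train_indices, valid_indices, test_indices).
    """
    groups = {}
    for idx, scaffold in enumerate(scaffolds):
        groups.setdefault(scaffold, []).append(idx)
    # Put compounds with the same scaffold next to each other.
    # Larger groups first (ties broken by scaffold string) is an assumption.
    ordered_keys = sorted(groups, key=lambda k: (-len(groups[k]), k))
    ordered = [i for key in ordered_keys for i in groups[key]]
    n = len(ordered)
    train_end = int(frac_train * n)
    valid_end = int((frac_train + frac_valid) * n)
    return ordered[:train_end], ordered[train_end:valid_end], ordered[valid_end:]


# Ten toy compounds with four distinct scaffolds.
scaffolds = (["c1ccccc1"] * 5 + ["C1CCCCC1"] * 3
             + ["c1ccncc1", "C1CCOC1"])
train_idx, valid_idx, test_idx = scaffold_split(scaffolds)
print(train_idx, valid_idx, test_idx)
```

With truncated cut points like these, 7831 compounds would be divided into 6264/783/784, which is consistent with the split sizes printed by the cell below.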

In [18]:
featurizer = DownstreamFeaturizer(model.graph_wrapper)
dataset = load_tox21_dataset(
        "../../../data/chem_dataset/tox21/raw", task_names, featurizer=featurizer)
# dataset = load_sider_dataset(
#         "../../../data/chem_dataset/sider/raw", task_names, featurizer=featurizer)

# splitter = RandomSplitter()
splitter = ScaffoldSplitter()
train_dataset, valid_dataset, test_dataset = splitter.split(
        dataset, frac_train=0.8, frac_valid=0.1, frac_test=0.1)
print("Train/Valid/Test num: %s/%s/%s" % (
        len(train_dataset), len(valid_dataset), len(test_dataset)))



Train/Valid/Test num: 6264/783/784


## Start training
Now we finetune the model initialized from the attrmask pretraining for 4 epochs for demonstration purposes. Since each downstream dataset contains more than one sub-task, the performance of the model is evaluated by the average ROC-AUC over all sub-tasks.
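
To show how an average ROC-AUC over sub-tasks can ignore blank labels, here is a small pure-Python sketch. It is not the `calc_rocauc_score` function from `utils`, whose implementation is not shown here; the helper names and the toy data are hypothetical. ROC-AUC is computed as the probability that a positive sample is scored above a negative one, with ties counted as 0.5.

```python
def roc_auc(labels, scores):
    """Pairwise ROC-AUC; returns None if only one class is present."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        return None
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_masked_roc_auc(labels, preds, valid):
    """Average ROC-AUC over sub-tasks, skipping entries where valid is 0.

    labels, preds and valid are lists of rows with one column per sub-task.
    Sub-tasks with a single class among valid labels are skipped.
    """
    num_tasks = len(labels[0])
    aucs = []
    for t in range(num_tasks):
        task_labels = [row[t] for row, v in zip(labels, valid) if v[t]]
        task_scores = [row[t] for row, v in zip(preds, valid) if v[t]]
        auc = roc_auc(task_labels, task_scores)
        if auc is not None:
            aucs.append(auc)
    return sum(aucs) / len(aucs)


# Four compounds, two sub-tasks; one label of the second task is blank.
labels = [[1, 0], [0, 1], [1, 0], [0, 0]]
preds = [[0.9, 0.2], [0.1, 0.8], [0.4, 0.3], [0.6, 0.5]]
valid = [[1, 1], [1, 1], [1, 0], [1, 1]]
avg_auc = average_masked_roc_auc(labels, preds, valid)
print(avg_auc)
```

In this toy example the first sub-task scores 0.75 and the second 1.0, so the average is 0.875.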

In [19]:
def train(exe, train_prog, model, train_dataset, featurizer):
    data_gen = train_dataset.iter_batch(
        batch_size=64, num_workers=4, shuffle=True, collate_fn=featurizer.collate_fn)
    list_loss = []
    for batch_id, feed_dict in enumerate(data_gen):
        train_loss, = exe.run(train_prog, feed=feed_dict, fetch_list=[model.loss], return_numpy=False)
        list_loss.append(np.array(train_loss).mean())
    return np.mean(list_loss)

def evaluate(exe, test_prog, model, test_dataset, featurizer):
    """
    In the dataset, a proportion of labels are blank. So we use a `valid` tensor
    to help eliminate these blank labels in both training and evaluation phase.
    
    Returns:
        the average roc-auc of all sub-tasks.
    """
    data_gen = test_dataset.iter_batch(
            batch_size=64, num_workers=4, shuffle=False, collate_fn=featurizer.collate_fn)
    total_pred = []
    total_label = []
    total_valid = []
    for batch_id, feed_dict in enumerate(data_gen):
        pred, = exe.run(test_prog, feed=feed_dict, fetch_list=[model.pred], return_numpy=False)
        total_pred.append(np.array(pred))
        total_label.append(feed_dict['finetune_label'])
        total_valid.append(feed_dict['valid'])
    total_pred = np.concatenate(total_pred, 0)
    total_label = np.concatenate(total_label, 0)
    total_valid = np.concatenate(total_valid, 0)
    return calc_rocauc_score(total_label, total_pred, total_valid)

for epoch_id in range(4):
    train_loss = train(exe, train_prog, model, train_dataset, featurizer)
    val_auc = evaluate(exe, test_prog, model, valid_dataset, featurizer)
    test_auc = evaluate(exe, test_prog, model, test_dataset, featurizer)
    print("epoch:%s train/loss:%s" % (epoch_id, train_loss))
    print("epoch:%s val/auc:%s" % (epoch_id, val_auc))
    print("epoch:%s test/auc:%s" % (epoch_id, test_auc))
# fluid.io.save_params(exe, './model/sider', train_prog)
fluid.io.save_params(exe, './model/tox21', train_prog)

epoch:0 train/loss:0.503272
epoch:0 val/auc:0.6493802688609531
epoch:0 test/auc:0.6282874648082166
epoch:1 train/loss:0.25345027
epoch:1 val/auc:0.6470544426655516
epoch:1 test/auc:0.6279795005837258
epoch:2 train/loss:0.21940874
epoch:2 val/auc:0.6548664758200556
epoch:2 test/auc:0.6585136404994928
epoch:3 train/loss:0.21591327
epoch:3 val/auc:0.6972151038228583
epoch:3 test/auc:0.7012427094328548


The above covers the finetuning steps; you can adjust them as needed.