# Predicting drug-target interaction

In this tutorial, we walk through how to run a MolTrans model for compound-protein affinity prediction. In particular, we demonstrate how to train, validate, and test the model on classification and regression tasks, working within the folder `/apps/drug_target_interaction/moltrans_dti/`.

## MolTrans

**MolTrans** stands for Molecular Interaction Transformer, a model for drug-target interaction prediction. It leverages massive unlabeled biomedical data to extract high-quality sub-structures of drugs and proteins. The pipeline works as follows. First, an FCS (frequent consecutive sub-sequence) mining module decomposes the input drug and protein into explicit sequences of sub-structures using a BPE-based decomposition method. Then, the latent representations of these sub-structures are fed into an augmented transformer module to obtain contextual embeddings for each sub-structure of the drug and the protein. Next, in the interaction prediction module, drug sub-structures and protein sub-structures are paired, and each pair receives an interaction score. A CNN layer is then applied to the resulting interaction map to capture higher-order interactions. Finally, a decoder module outputs a score indicating the probability that the drug and the target interact.

![title](./figures/moltrans_model.png)
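
To make the interaction prediction module more concrete, the sketch below builds a pairwise interaction map from two sets of sub-structure embeddings with NumPy. It assumes that the pairwise score is a dot product between embeddings and uses random vectors in place of transformer outputs. The array names and the dot-product scoring are illustrative assumptions, not code taken from `double_towers.py`.

```python
import numpy as np

# Illustrative sizes (hypothetical here): 50 drug sub-structures,
# 545 protein sub-structures, 384-dimensional embeddings
n_drug_sub, n_prot_sub, emb_size = 50, 545, 384

rng = np.random.default_rng(0)
# Random stand-ins for the contextual embeddings from the transformer module
drug_emb = rng.standard_normal((n_drug_sub, emb_size))
prot_emb = rng.standard_normal((n_prot_sub, emb_size))

# Assumed scoring rule: dot product for every drug/protein sub-structure pair
interaction_map = drug_emb @ prot_emb.T

print(interaction_map.shape)  # (50, 545)
```

A map of this shape is what a CNN layer could then scan for local patterns among neighbouring sub-structures.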

The scripts for MolTrans are in `/apps/drug_target_interaction/moltrans_dti/`, so we change to this folder for the later steps.

In [0]:
import os
import sys

sys.path.insert(0, os.path.abspath(os.path.join(os.getcwd(), "..")))
os.chdir('../apps/drug_target_interaction/moltrans_dti/')
os.listdir(os.getcwd())

['train_cls.py',
 'train_reg.py',
 'README.md',
 'pretrained_model',
 'LICENSE',
 'config.json',
 '.DS_Store',
 'helper',
 'vocabulary',
 'double_towers.py',
 'util_function.py',
 'finetune_model',
 'preprocess.py',
 'requirement.txt']

## Prepare dataset

Download all the required datasets using `wget`. If `wget` is not available on your machine, you can instead
copy the URL below into your web browser to download the archive. In that case, remember to move the data manually into
the path `/apps/drug_target_interaction/moltrans_dti/`.
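
If neither `wget` nor a browser is convenient, the download can also be done from Python. The following is a minimal sketch using only the standard library. The helper name `download_and_extract` is hypothetical, and unlike the `wget` call with `--no-check-certificate`, it performs normal certificate verification.

```python
import tarfile
import urllib.request
from pathlib import Path

DATASET_URL = "https://baidu-nlp.bj.bcebos.com/PaddleHelix/datasets/dti_datasets/dti_dataset.tgz"


def download_and_extract(url, dest_dir="."):
    """Download a .tgz archive into dest_dir and unpack it there."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    archive_path = dest / url.rsplit("/", 1)[-1]
    # Skip the download if the archive is already present
    if not archive_path.exists():
        urllib.request.urlretrieve(url, str(archive_path))
    with tarfile.open(str(archive_path), "r:gz") as tar:
        tar.extractall(str(dest))
    return archive_path


# Usage (downloads about 187 MB); run from the moltrans_dti folder:
# download_and_extract(DATASET_URL)
```

The call is left commented out so that the archive is fetched only when you explicitly ask for it.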

In [1]:
# download and decompress the data
!wget "https://baidu-nlp.bj.bcebos.com/PaddleHelix/datasets/dti_datasets/dti_dataset.tgz" --no-check-certificate
!tar -zxvf "dti_dataset.tgz"
!ls "./dataset"

--2021-05-10 13:15:07--  https://baidu-nlp.bj.bcebos.com/PaddleHelix/datasets/dti_datasets/dti_dataset.tgz
Connecting to 172.19.61.250:3128... connected.
Proxy request sent, awaiting response... 200 OK
Length: 196384974 (187M) [application/gzip]
Saving to: "dti_dataset.tgz"


2021-05-10 13:15:30 (8.86 MB/s) - "dti_dataset.tgz" saved [196384974/196384974]

./dataset/
./dataset/classification/
./dataset/._.DS_Store
./dataset/.DS_Store
./dataset/regression/
./dataset/regression/benchmark/
./dataset/regression/._DAVIS
./dataset/regression/DAVIS/
./dataset/regression/._.DS_Store
./dataset/regression/.DS_Store
./dataset/regression/._BindingDB
./dataset/regression/BindingDB/
./dataset/regression/._KIBA
./dataset/regression/KIBA/
./dataset/regression/._ChEMBL
./dataset/regression/ChEMBL/
./dataset/regression/ChEMBL/._.DS_Store
./dataset/regression/ChEMBL/.DS_Store
./dataset/regression/ChEMBL/._Chem_SMILES.txt
./dataset/regression/ChEMBL/Chem_SMILES.txt
./dataset/regression/ChEMBL/._Chem_Affini

## Install dependencies

Before proceeding, install all the packages listed in `requirement.txt`.

In [2]:
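# Read the pinned dependencies; the context manager closes the file afterwards
with open("requirement.txt", "r") as f:
    requirements = f.read().splitlines()
requirements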

['paddlepaddle==2.0.2',
 'visualdl==2.1.1',
 'scikit-learn==0.24.1',
 'scipy==1.6.1',
 'subword-nmt==0.3.7',
 'PyYAML==5.4.1',
 'numpy==1.19.5',
 'pandas==1.2.3']
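
The packages are typically installed with `pip install -r requirement.txt`. To check afterwards whether the installed versions match the pins, something like the sketch below can help. The helper names are hypothetical, and `importlib.metadata` requires Python 3.8 or newer.

```python
from importlib import metadata


def parse_requirements(lines):
    """Turn 'name==version' lines into a {name: version} dict."""
    pinned = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, version = line.partition("==")
        pinned[name.strip()] = version.strip()
    return pinned


def check_installed(pinned):
    """Return {name: (wanted, found)} for packages that are missing or differ."""
    mismatches = {}
    for name, wanted in pinned.items():
        try:
            found = metadata.version(name)
        except metadata.PackageNotFoundError:
            found = None
        if found != wanted:
            mismatches[name] = (wanted, found)
    return mismatches


pinned = parse_requirements(["paddlepaddle==2.0.2", "numpy==1.19.5", "pandas==1.2.3"])
print(check_installed(pinned))  # an empty dict means every pin is satisfied
```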

## Initialize model

First, import related packages and modules. For details of `MolTransModel`, please refer to `double_towers.py`.

In [3]:
import paddle
import numpy as np
from paddle import nn
from paddle import io
from helper import utils
from double_towers import MolTransModel

Then, set the hyperparameters for the networks. The defaults are stored in `config.json`; you can adjust them to suit your needs.

In [4]:
lr = 5e-4
model_config = {
    "drug_max_seq": 50,               # Max length of drug sequence
    "target_max_seq": 545,            # Max length of protein sequence
    "emb_size": 384,                  # Embedding size
    "input_drug_dim": 23532,          # Length of drug vocabulary
    "input_target_dim": 16693,        # Length of protein vocabulary
    "interm_size": 1536,              # Latent size
    "num_attention_heads": 12,        # Number of attention heads
    "flatten_dim": 81750,             # Flatten size 
    "layer_size": 2,                  # Layer size of transformer blocks
    "dropout_ratio": 0.1,             # Dropout rate
    "attention_dropout_ratio": 0.1,   # Dropout rate within attention
    "hidden_dropout_ratio": 0.1       # Dropout rate within hidden states
}
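
Before training, it can be worth sanity-checking a configuration once it has been read from JSON. The sketch below round-trips a few of the values above through a temporary file (standing in for `config.json`, whose exact key layout is not shown here) and checks two relationships. The first, that the embedding size divides evenly across attention heads, is a standard requirement of multi-head attention. The second is only an observation about the numbers: `flatten_dim` equals 3 × 50 × 545, which would be consistent with a three-channel feature map over the 50 × 545 interaction grid.

```python
import json
import os
import tempfile

config = {
    "drug_max_seq": 50,
    "target_max_seq": 545,
    "emb_size": 384,
    "num_attention_heads": 12,
    "flatten_dim": 81750,
}

# Round-trip through a temporary JSON file, standing in for config.json
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "config_example.json")
    with open(path, "w") as f:
        json.dump(config, f)
    with open(path) as f:
        loaded = json.load(f)

# Multi-head attention needs the embedding size to split evenly across heads
assert loaded["emb_size"] % loaded["num_attention_heads"] == 0
head_dim = loaded["emb_size"] // loaded["num_attention_heads"]

# Observation only: flatten_dim is a whole multiple of the interaction grid size
cells = loaded["drug_max_seq"] * loaded["target_max_seq"]
channels = loaded["flatten_dim"] // cells

print(head_dim, cells, channels)  # 32 27250 3
```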

Set the device (GPU if available) for later tasks, and fix the random seeds for reproducibility.

In [5]:
# Set seed for reproduction
paddle.seed(2)
np.random.seed(3)

# Set device as CUDA_VISIBLE_DEVICES='your device number'
use_cuda = paddle.is_compiled_with_cuda()
device = 'cuda:0' if use_cuda else 'cpu'
device = device.replace('cuda', 'gpu')
device = paddle.set_device(device)

Next, we initialize the model with the specified configuration. The optimizer used here is Adam.

In [6]:
model = MolTransModel(model_config)
model = model.cuda()
optim = utils.Adam(parameters=model.parameters(), learning_rate=lr)

## Classification task

In this tutorial, we use the DAVIS dataset as an example. For the classification task, every drug-target pair whose Kd is smaller than 30 is labeled positive.
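
The labeling rule can be illustrated in a few lines of NumPy. The Kd values below are made up for the example, and the comparison is assumed to be strict, following the wording "smaller than 30".

```python
import numpy as np

KD_THRESHOLD = 30  # threshold stated in the text; units as in the dataset

# Hypothetical Kd values for five drug-target pairs
kd_values = np.array([0.5, 12.0, 30.0, 250.0, 10000.0])

# A pair is positive (1) only when Kd is strictly below the threshold
labels = (kd_values < KD_THRESHOLD).astype(int)

print(labels.tolist())  # [1, 1, 0, 0, 0]
```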

### Preprocess data

Load the DAVIS dataset, which contains training, validation, and testing sets.

In [7]:
import pandas as pd
data_path = './dataset/classification/DAVIS'
training_set = pd.read_csv(data_path + '/train.csv')
validation_set = pd.read_csv(data_path + '/val.csv')
testing_set = pd.read_csv(data_path + '/test.csv')
print(len(training_set), len(validation_set), len(testing_set))

2086 3006 6011


Use `DataEncoder` and `utils.BaseDataLoader` to transform the input data and batch it. For details of `DataEncoder`, please refer to `preprocess.py`.

In [8]:
import paddle
from helper import utils
from preprocess import DataEncoder

training_data = DataEncoder(training_set.index.values, training_set.Label.values, training_set)
train_loader = utils.BaseDataLoader(training_data, batch_size=64, shuffle=True, 
                                        drop_last=False, num_workers=0)
validation_data = DataEncoder(validation_set.index.values, validation_set.Label.values, validation_set)
validation_loader = utils.BaseDataLoader(validation_data, batch_size=64, shuffle=False, 
                                        drop_last=False, num_workers=0)
testing_data = DataEncoder(testing_set.index.values, testing_set.Label.values, testing_set)
testing_loader = utils.BaseDataLoader(testing_data, batch_size=64, shuffle=False, 
                                        drop_last=False, num_workers=0)
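
Each batch later unpacks into index sequences and masks (`d_out`, `mask_d_out`, `t_out`, `mask_t_out`). As a rough illustration of what such a pair might look like, the sketch below truncates or right-pads a list of vocabulary indices to a fixed length and builds a matching 0/1 mask. The helper `pad_and_mask`, the padding index 0, and the token ids are hypothetical; the actual logic lives in `preprocess.py` and may differ.

```python
import numpy as np


def pad_and_mask(token_ids, max_len, pad_id=0):
    """Truncate or right-pad token_ids to max_len and build a 0/1 mask."""
    ids = list(token_ids)[:max_len]
    n_real = len(ids)
    padded = np.array(ids + [pad_id] * (max_len - n_real), dtype=np.int64)
    mask = np.array([1] * n_real + [0] * (max_len - n_real), dtype=np.int64)
    return padded, mask


# Hypothetical vocabulary indices for a drug with four sub-structures
d_out, mask_d_out = pad_and_mask([17, 203, 5, 88], max_len=8)
print(d_out.tolist())       # [17, 203, 5, 88, 0, 0, 0, 0]
print(mask_d_out.tolist())  # [1, 1, 1, 1, 0, 0, 0, 0]
```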

### Train, validate and test

**Basic settings**. Ideally, `max_epoch` should be at least **200** for better performance.

In [9]:
import paddle
from paddle import nn

# Basic settings
optimal_auc = 0
log_iter = 50
log_step = 0
max_epoch = 10

# Set loss function
sig = paddle.nn.Sigmoid()
loss_fn = paddle.nn.BCELoss()

**Training part**. Each batch from `train_loader` is passed through the model, the outputs are squashed with a sigmoid, and `BCELoss` is computed as the training objective.

In [10]:
# Training
for epoch in range(max_epoch):
    print("=====Start Training=====")
    model.train()
    for batch_id, data in enumerate(train_loader):
        d_out, mask_d_out, t_out, mask_t_out, label = data
        temp = model(d_out.long().cuda(), t_out.long().cuda(), mask_d_out.long().cuda(), mask_t_out.long().cuda())
        label = paddle.cast(label, "float32")
        predicts = paddle.squeeze(sig(temp))
        loss = loss_fn(predicts, label)

        optim.clear_grad()
        loss.backward()
        optim.step()

        if batch_id % log_iter == 0:
            print("Training at epoch: {}, step: {}, loss is: {}".format(epoch, batch_id, loss.cpu().detach().numpy()))
            log_step += 1  
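
For intuition, the sigmoid-plus-BCE computation in the loop above can be reproduced in plain NumPy. This is a sketch of the formula with mean reduction, using made-up logits and labels; it is not Paddle's implementation.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def bce_loss(probs, labels, eps=1e-12):
    """Mean binary cross-entropy between probabilities and 0/1 labels."""
    probs = np.clip(probs, eps, 1.0 - eps)
    return float(np.mean(-(labels * np.log(probs) + (1 - labels) * np.log(1 - probs))))


logits = np.array([0.0, 2.0, -2.0])  # hypothetical raw model outputs
labels = np.array([1.0, 1.0, 0.0])

probs = sigmoid(logits)
print(round(bce_loss(probs, labels), 4))  # 0.3157
```

The uncertain first prediction (probability 0.5) contributes ln 2 ≈ 0.693 to the sum, while the two confident, correct predictions contribute about 0.127 each.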

**Evaluation function for classification task**. Several metrics are reported, including AUROC, AUPRC, precision, recall, and accuracy.

In [11]:
from sklearn.metrics import (roc_auc_score, average_precision_score, f1_score, roc_curve, confusion_matrix, 
                             precision_score, recall_score, auc, mean_squared_error)

# Evaluation function
def cls_test(data_generator, model):
    """
    Test for classification task
    """
    y_pred = []
    y_label = []
    loss_res = 0.0
    count = 0.0

    model.eval()    
    for _, data in enumerate(data_generator):
        d_out, mask_d_out, t_out, mask_t_out, label = data
        temp = model(d_out.long().cuda(), t_out.long().cuda(), mask_d_out.long().cuda(), mask_t_out.long().cuda())
        predicts = paddle.squeeze(sig(temp))
        label = paddle.cast(label, "float32")

        loss = loss_fn(predicts, label)
        loss_res += loss
        count += 1

        predicts = predicts.detach().cpu().numpy()
        label_id = label.to('cpu').numpy()
        y_label = y_label + label_id.flatten().tolist()
        y_pred = y_pred + predicts.flatten().tolist()
    loss = loss_res / count

    fpr, tpr, threshold = roc_curve(y_label, y_pred)
    # ROC-based stand-in for precision; the first ROC point (0, 0) yields 0/0 = NaN
    precision = tpr / (tpr + fpr)
    f1 = 2 * precision * tpr / (tpr + precision + 1e-05)
    # Skip the first five thresholds when searching for the best F1
    optimal_threshold = threshold[5:][np.argmax(f1[5:])]
    print("Optimal threshold: {}".format(optimal_threshold))

    # Convert the list to an array before thresholding
    y_pred_res = [(1 if i else 0) for i in np.asarray(y_pred) >= optimal_threshold]
    auroc = auc(fpr, tpr)
    print("AUROC: {}".format(auroc))
    print("AUPRC: {}".format(average_precision_score(y_label, y_pred)))

    # sklearn's confusion_matrix layout is [[TN, FP], [FN, TP]]
    cf_mat = confusion_matrix(y_label, y_pred_res)
    print("Confusion Matrix: \n{}".format(cf_mat))
    print("Precision: {}".format(precision_score(y_label, y_pred_res)))
    print("Recall: {}".format(recall_score(y_label, y_pred_res)))

    total_res = sum(sum(cf_mat))
    accuracy = (cf_mat[0, 0] + cf_mat[1, 1]) / total_res
    print("Accuracy: {}".format(accuracy))
    # Sensitivity = TP / (FN + TP), i.e. the same quantity as recall
    sensitivity = cf_mat[1, 1] / (cf_mat[1, 0] + cf_mat[1, 1])
    print("Sensitivity: {}".format(sensitivity))
    # Specificity = TN / (TN + FP)
    specificity = cf_mat[0, 0] / (cf_mat[0, 0] + cf_mat[0, 1])
    print("Specificity: {}".format(specificity))
    outputs = np.asarray([(1 if i else 0) for i in np.asarray(y_pred) >= 0.5])
    return (roc_auc_score(y_label, y_pred), 
            f1_score(y_label, outputs), loss.item())

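**Validation part**. AUROC is used to evaluate the performance of the model, and the checkpoint with the highest validation AUROC is saved as the best model. In `train_cls.py`, validation runs at the end of each training epoch.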

In [12]:
# Validation
print("=====Start Validation=====")
with paddle.no_grad():
    auroc, f1, loss = cls_test(validation_loader, model) 
    print("Validation at epoch: {}, AUROC: {}, F1: {}, loss is: {}".format(epoch, auroc, f1, loss))
        
    # Save best model
    if auroc > optimal_auc:
        optimal_auc = auroc
        print("Saving the best_model...")
        print("Best AUROC: {}".format(optimal_auc))
        paddle.save(model.state_dict(), 'DAVIS_bestAUC_model_cls1')

**Testing part**. Load the best model and test.

In [13]:
# Load the trained model
params_dict = paddle.load('DAVIS_bestAUC_model_cls1')
model.set_dict(params_dict)

# Testing
print("=====Start Testing=====")
with paddle.no_grad():
    try:
        auroc, f1, loss = cls_test(testing_loader, model)
        print("Testing result: AUROC: {}, F1: {}, Testing loss is: {}".format(auroc, f1, loss))
    except Exception as err:
        # Report the cause instead of silently swallowing it
        print("Testing failed: {}".format(err))

The full classification pipeline is also available as a script, `train_cls.py`, which can be run from the command line:

In [14]:
!CUDA_VISIBLE_DEVICES='6' python train_cls.py --epochs 10

Starting Time: 1620636839.3391168
W0510 16:53:59.340250 20693 device_context.cc:320] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 10.2, Runtime API Version: 10.1
W0510 16:53:59.344445 20693 device_context.cc:330] device: 0, cuDNN Version: 7.6.
=====Start Initial Testing=====
Optimal threshold: 0.08874599635601044
AUROC: 0.4449206833787793
AUPRC: 0.04431604679412547
Confusion Matrix: 
[[   1 5707]
 [   0  303]]
Precision: 0.05041597337770383
Recall: 1.0
Accuracy: 0.05057394776243553
Sensitivity: 0.0001751927119831815
Specificity: 1.0
Initial testing set: AUROC: 0.4449206833787793, F1: 0.0625, Testing loss: 0.5268805027008057
=====Start Training=====
Training at epoch: 0, step: 0, loss is: [0.77942777]
=====Start Validation=====
Optimal threshold: 0.46215638518333435
AUROC: 0.7542153460997891
AUPRC: 0.16601991199123586
Confusion Matrix: 
[[1357 1489]
 [  17  143]]
Precision: 0.08762254901960784
Recall: 0.89375
Accuracy: 0.499001996007984
Sensitivity: 0.4768095
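
The metrics printed for the initial test can be checked by hand from the first confusion matrix in the log. The sketch below assumes scikit-learn's layout, in which rows are true labels and columns are predictions, that is `[[TN, FP], [FN, TP]]`.

```python
import numpy as np

# First confusion matrix from the log: rows are true labels, columns are predictions
cf_mat = np.array([[1, 5707],
                   [0, 303]])
tn, fp = cf_mat[0]
fn, tp = cf_mat[1]

precision = tp / (tp + fp)
recall = tp / (tp + fn)        # identical to sensitivity
specificity = tn / (tn + fp)
accuracy = (tp + tn) / cf_mat.sum()

print(precision, recall, specificity, accuracy)
```

The precision, recall, and accuracy agree with the logged values. Under the standard definitions, however, 1.0 is the sensitivity (it equals recall) and roughly 0.000175 is the specificity, so the two labels in the log appear to be interchanged.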

## Regression task

The original MolTrans paper covers only the classification task, but we also provide a regression task here. In this tutorial, we use the same DAVIS dataset that is used in GraphDTA and DGraphDTA as an example. In practice, drug-target interactions are measured with various metrics such as Kd, IC50, and Ki, so it is more reasonable to predict a continuous interaction score than a binary label.
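
Raw Kd values span several orders of magnitude, so regression benchmarks on DAVIS commonly work on a log scale. A widespread convention (used, for example, in the GraphDTA preprocessing) is pKd = -log10(Kd / 1e9) for Kd given in nM. Whether `load_davis_dataset` applies exactly this transform is an assumption here; the helper below is a hypothetical illustration with made-up values.

```python
import numpy as np


def kd_nm_to_pkd(kd_nm):
    """Convert Kd in nanomolar to pKd = -log10(Kd in molar)."""
    return -np.log10(np.asarray(kd_nm, dtype=float) / 1e9)


# Hypothetical Kd values in nM; 10000 nM maps to 5.0 and 1 nM maps to 9.0
print(kd_nm_to_pkd([10000.0, 30.0, 1.0]))
```

On this scale a stronger binder (smaller Kd) receives a larger score.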

### Preprocess data

Load the DAVIS benchmark dataset and transform it. For details of the `load_davis_dataset` function, please refer to `util_function.py`.

In [15]:
import paddle
import pandas as pd
from helper import utils
from preprocess import DataEncoder
from util_function import load_davis_dataset

trainset, testset = load_davis_dataset()
trainset_smiles = [d['smiles'] for d in trainset]
trainset_protein = [d['protein'] for d in trainset]
trainset_aff = [d['aff'] for d in trainset]

testset_smiles = [d['smiles'] for d in testset]
testset_protein = [d['protein'] for d in testset]
testset_aff = [d['aff'] for d in testset]

df_data_t = pd.DataFrame(zip(trainset_smiles, trainset_protein, trainset_aff))
df_data_t.rename(columns={0:'SMILES', 1: 'Target Sequence', 2: 'Label'}, inplace=True)
df_data_tt = pd.DataFrame(zip(testset_smiles, testset_protein, testset_aff))
df_data_tt.rename(columns={0:'SMILES', 1: 'Target Sequence', 2: 'Label'}, inplace=True)

# `args` is not defined in this notebook, so num_workers is set to 0 as in the classification loaders
reg_training_data = DataEncoder(df_data_t.index.values, df_data_t.Label.values, df_data_t)
reg_train_loader = utils.BaseDataLoader(reg_training_data, batch_size=64, 
                                    shuffle=True, drop_last=False, num_workers=0)
reg_validation_data = DataEncoder(df_data_tt.index.values, df_data_tt.Label.values, df_data_tt)
reg_validation_loader = utils.BaseDataLoader(reg_validation_data, batch_size=64, 
                                    shuffle=False, drop_last=False, num_workers=0)

### Train and evaluate

**Basic settings**. Ideally, `max_epoch` should be at least **200** for better performance.

In [16]:
import paddle
from paddle import nn

# Basic setting
optimal_mse = 10000
optimal_CI = 0
log_iter = 50
log_step = 0
max_epoch = 10

# Set loss function
reg_loss_fn = paddle.nn.MSELoss()

**Training part**. Each batch from `reg_train_loader` is passed through the model, and `MSELoss` on the raw outputs is used as the training objective.

In [17]:
# Training
for epoch in range(max_epoch):
    print("=====Go for Training=====")
    model.train()        
    # Regression Task
    for batch_id, data in enumerate(reg_train_loader):
        d_out, mask_d_out, t_out, mask_t_out, label = data
        temp = model(d_out.long().cuda(), t_out.long().cuda(), mask_d_out.long().cuda(), mask_t_out.long().cuda())
        label = paddle.cast(label, "float32")
        predicts = paddle.squeeze(temp)
        loss = reg_loss_fn(predicts, label)

        optim.clear_grad()
        loss.backward()
        optim.step()

        if batch_id % log_iter == 0:
            print("Training at epoch: {}, step: {}, loss is: {}".format(epoch, batch_id, loss.cpu().detach().numpy()))
            log_step += 1

**Evaluation function for regression task**. MSE and the concordance index (CI) are used as metrics.

In [18]:
from preprocess import concordance_index1

# Evaluation function
def reg_test(data_generator, model):
    """
    Test for regression task
    """
    y_pred = []
    y_label = []

    model.eval()
    for _, data in enumerate(data_generator):
        d_out, mask_d_out, t_out, mask_t_out, label = data
        temp = model(d_out.long().cuda(), t_out.long().cuda(), mask_d_out.long().cuda(), mask_t_out.long().cuda())

        label = paddle.cast(label, "float32")
        predicts = paddle.squeeze(temp, axis=1)

        # Note: only the loss of the last batch is returned
        loss = reg_loss_fn(predicts, label)
        predict_id = predicts.detach().cpu().numpy()
        label_id = label.to('cpu').numpy()

        y_label = y_label + label_id.flatten().tolist()
        y_pred = y_pred + predict_id.flatten().tolist()

    # Compute MSE once, over all collected predictions
    total_label = np.array(y_label)
    total_pred = np.array(y_pred)
    mse = ((total_label - total_pred) ** 2).mean(axis=0)
    return (mse, concordance_index1(total_label, total_pred), loss.item())
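
The concordance index measures how often a pair of samples with different true affinities is ranked in the right order by the predictions. A naive reference version might look like the sketch below. It is not the `concordance_index1` function from `preprocess.py`, whose handling of ties and whose performance may differ; here, tied predictions are assumed to count as half.

```python
import numpy as np


def concordance_index(y_true, y_pred):
    """Naive O(n^2) concordance index; tied predictions count as half."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    concordant = 0.0
    comparable = 0
    for i in range(len(y_true)):
        for j in range(i + 1, len(y_true)):
            if y_true[i] == y_true[j]:
                continue  # pairs with equal labels are not comparable
            comparable += 1
            true_diff = y_true[i] - y_true[j]
            pred_diff = y_pred[i] - y_pred[j]
            if pred_diff == 0:
                concordant += 0.5
            elif (true_diff > 0) == (pred_diff > 0):
                concordant += 1.0
    return concordant / comparable if comparable else 0.0


print(concordance_index([5.0, 6.0, 7.0], [0.1, 0.2, 0.3]))  # 1.0
print(concordance_index([5.0, 6.0, 7.0], [0.3, 0.2, 0.1]))  # 0.0
print(concordance_index([5.0, 6.0, 7.0], [0.2, 0.2, 0.2]))  # 0.5
```

A value near 0.5 therefore corresponds to random ranking, which is what an untrained model would be expected to produce.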

**Evaluation part**. CI and MSE are used to evaluate the performance of the model. A best model is saved separately for each metric: one with the lowest MSE and one with the highest CI.

In [19]:
# Evaluation
print("=====Go for Validation=====")
with paddle.no_grad():
    mse, CI, reg_loss = reg_test(reg_validation_loader, model)
    print("Validation at epoch: {}, MSE: {}, CI: {}, loss is: {}".format(epoch, mse, CI, reg_loss))
        
    # Save best model
    if mse < optimal_mse:
        optimal_mse = mse
        print("Saving the best_model with best MSE...")
        print("Best MSE: {}".format(optimal_mse))
        paddle.save(model.state_dict(), 'DAVIS_bestMSE_model_reg1')
    if CI > optimal_CI:
        optimal_CI = CI
        print("Saving the best_model with best CI...")
        print("Best CI: {}".format(optimal_CI))
        paddle.save(model.state_dict(), 'DAVIS_bestCI_model_reg1')

Likewise, the full regression pipeline is available as a script, `train_reg.py`, which can be run from the command line:

In [20]:
!CUDA_VISIBLE_DEVICES='5' python train_reg.py --epochs 10

Starting Time: 1620644443.8780835
W0510 19:00:43.879191 15912 device_context.cc:320] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 10.2, Runtime API Version: 10.1
W0510 19:00:43.883350 15912 device_context.cc:330] device: 0, cuDNN Version: 7.6.
=====Go for Initial Testing=====
Testing result: MSE: 36.81499150313501, CI: 0.4923231465886854
=====Go for Training=====
Training at epoch: 0, step: 0, loss is: [36.40392]
Training at epoch: 0, step: 50, loss is: [0.7971545]
Training at epoch: 0, step: 100, loss is: [1.5158722]
Training at epoch: 0, step: 150, loss is: [0.59505415]
Training at epoch: 0, step: 200, loss is: [0.6165484]
Training at epoch: 0, step: 250, loss is: [1.2195725]
Training at epoch: 0, step: 300, loss is: [0.85159874]
Training at epoch: 0, step: 350, loss is: [0.6839756]
=====Go for Validation=====
Validation at epoch: 0, MSE: 0.8015637226389766, CI: 0.6708589398191237, loss is: 0.8781716227531433
Saving the best_model with best MSE...
Best MSE

Besides the examples shown above, you can try the other drug-target interaction datasets in `apps/drug_target_interaction/moltrans_dti/dataset`. Please refer to the scripts for details, or open an issue on the GitHub repo if you have any questions.