<a href="https://colab.research.google.com/github/Jahan08/Advanced-AI-algorithm-practice-hERG-data/blob/main/hERG_AI_model_with_GROVER.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction to GROVER

In this tutorial, we will go over what GROVER is and how to get it up and running.

GROVER (Graph Representation frOm self-superVised mEssage passing tRansformer) is a framework proposed by Tencent AI Lab. It uses self-supervised tasks at the node, edge, and graph levels to learn rich structural and semantic information about molecules from large unlabelled molecular datasets. GROVER integrates message passing networks into a Transformer-style architecture to produce more expressive molecular encodings.

Reference Paper: [Rong, Yu, et al. "Grover: Self-supervised message passing transformer on large-scale molecular data." Advances in Neural Information Processing Systems (2020).](https://drug.ai.tencent.com/publications/GROVER.pdf)

## Colab

This tutorial and the rest in this sequence are designed to be done in Google Colab. If you'd like to open this notebook in Colab, you can use the following link.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepchem/deepchem/blob/master/examples/tutorials/Introduction_to_GROVER.ipynb)

## Setup

To run DeepChem within Colab, you'll need to run the following installation commands. They take about 5 minutes to complete. You can of course run this tutorial locally if you prefer. In that case, review these cells before running them, since they install packages into your current environment and assume Colab paths such as `/content`.

## Install and set up required modules
We will first clone the repository, then install it as a library. We will also install deepchem and descriptastorus.

NOTE: The [original GROVER repository](https://github.com/tencent-ailab/grover) does not contain a `setup.py` file, so we currently use a fork that does.

In [1]:
# Clone the forked repository.
#%cd drive/MyDrive
!git clone https://github.com/atreyamaj/grover.git

Cloning into 'grover'...
remote: Enumerating objects: 206, done.
remote: Counting objects: 100% (16/16), done.
remote: Compressing objects: 100% (12/12), done.
remote: Total 206 (delta 11), reused 4 (delta 4), pack-reused 190
Receiving objects: 100% (206/206), 2.21 MiB | 2.59 MiB/s, done.
Resolving deltas: 100% (14/14), done.


In [2]:
# Navigate to the working folder.
%cd grover

/content/grover


In [3]:
# Install the forked repository.
!pip install -e ./

Obtaining file:///content/grover
  Preparing metadata (setup.py) ... done
Installing collected packages: grover
  Running setup.py develop for grover
Successfully installed grover-1.0.0


In [4]:
# Clone descriptastorus into the grover folder.
!git clone https://github.com/bp-kelley/descriptastorus

Cloning into 'descriptastorus'...
remote: Enumerating objects: 2045, done.
remote: Counting objects: 100% (298/298), done.
remote: Compressing objects: 100% (117/117), done.
remote: Total 2045 (delta 186), reused 286 (delta 180), pack-reused 1747
Receiving objects: 100% (2045/2045), 52.75 MiB | 11.84 MiB/s, done.
Resolving deltas: 100% (944/944), done.


In [5]:
# Install deepchem.
!pip install deepchem

Collecting deepchem
  Downloading deepchem-2.7.1-py3-none-any.whl (693 kB)
Collecting scipy<1.9 (from deepchem)
  Downloading scipy-1.8.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (42.2 MB)
Collecting rdkit (from deepchem)
  Downloading rdkit-2023.3.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (29.7 MB)
Installing collected packages: scipy, rdkit, deepchem
  Attempting uninstall: scipy
    Found existing installation: scipy 1.11.2
    Uninstalling scipy-1.11.2:
      Successfully uninstalled scipy-1.11.2
Successfully installed deepchem-2.7.1 rdkit-2023.3.3 scipy-1.8.1


In [9]:
# Install descriptastorus from the local clone.
!pip install /content/grover/descriptastorus

Processing ./descriptastorus
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: descriptastorus
  Building wheel for descriptastorus (setup.py) ... done
  Created wheel for descriptastorus: filename=descriptastorus-2.5.0.23-py3-none-any.whl size=1083538 sha256=c9225b77ab1b04c334a56fc406b87a088d61eed2a7a1b5b1b2c89c13b5ec8367
  Stored in directory: /tmp/pip-ephem-wheel-cache-lge56n9g/wheels/33/a7/79/a735914259445b357d6b62ac74806fd7d62f69c1305e4dd5fe
Successfully built descriptastorus
Installing collected packages: descriptastorus
  Attempting uninstall: descriptastorus
    Found existing installation: descriptastorus 2.5.0.23
    Uninstalling descriptastorus-2.5.0.23:
      Successfully uninstalled descriptastorus-2.5.0.23
Successfully installed descriptastorus-2.5.0.23
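
Before moving on, it can help to confirm that the packages installed above are visible to the notebook kernel. The following is a minimal sketch rather than part of the original notebook: the helper name `check_installed` is hypothetical, and the module names are assumed to match the package names installed above.

```python
import importlib.util


def check_installed(module_names):
    """Map each top-level module name to True if Python can locate it."""
    return {
        name: importlib.util.find_spec(name) is not None
        for name in module_names
    }


# Assumed import names for the packages installed in the cells above.
status = check_installed(["grover", "deepchem", "descriptastorus", "rdkit"])
for name, found in status.items():
    print(f"{name}: {'ok' if found else 'MISSING'}")
```

If any entry is reported as missing, rerun the corresponding install cell before continuing.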


## Extracting semantic motif labels
The semantic motif labels are extracted by `scripts/save_features.py` with the feature generator `fgtasklabel`.

In [10]:
!python scripts/save_features.py --data_path exampledata/pretrain/tryout.csv  \
                                --save_path exampledata/pretrain/tryout.npz   \
                                --features_generator fgtasklabel \
                                --restart

100% 5970/5970 [00:15<00:00, 381.77it/s]
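
An `.npz` file is a zip archive with one `.npy` member per stored array, so its contents can be listed without loading any data. The sketch below is an illustrative addition: the helper name `list_npz_arrays` is hypothetical, and since this tutorial does not state which array names `save_features.py` writes, the code prints them rather than assuming them.

```python
import zipfile
from pathlib import Path


def list_npz_arrays(npz_path):
    """List the array names stored in an .npz archive without loading the arrays."""
    with zipfile.ZipFile(npz_path) as archive:
        # Each array is stored as "<name>.npy" inside the zip archive.
        return [
            Path(member).stem
            for member in archive.namelist()
            if member.endswith(".npy")
        ]


npz_file = Path("exampledata/pretrain/tryout.npz")  # path used in the cell above
if npz_file.exists():
    print(list_npz_arrays(npz_file))
else:
    print(f"{npz_file} not found; run the feature extraction cell first.")
```

When NumPy is available, `numpy.load(path).files` reports the same names.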


## Extracting atom/bond contextual properties (vocabulary)
The atom/bond contextual properties (the vocabulary) are extracted by `scripts/build_vocab.py`.

In [11]:
!python scripts/build_vocab.py --data_path exampledata/pretrain/tryout.csv  \
                             --vocab_save_folder exampledata/pretrain  \
                             --dataset_name tryout

Building atom vocab from file: exampledata/pretrain/tryout.csv
50000it [00:03, 13801.26it/s]
atom vocab size 324
Building bond vocab from file: exampledata/pretrain/tryout.csv
50000it [00:21, 2332.31it/s]
bond vocab size 353
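
To give a feel for what a contextual-property vocabulary might contain, the sketch below builds one from a hand-written toy graph. Each atom is summarized by its element plus counts of its (neighbour element, bond type) pairs, and the vocabulary is the set of distinct summaries. This is one plausible reading of the idea in the reference paper, not the code in `scripts/build_vocab.py`; the toy molecule, the key format, and the function names are all assumptions made for illustration.

```python
from collections import Counter

# Toy three-atom fragment described by hand (hypothetical data, no RDKit needed).
toy_atoms = {0: "C", 1: "N", 2: "O"}
toy_bonds = [(0, 1, "SINGLE"), (0, 2, "DOUBLE")]


def atom_context_key(atom_idx, atoms, bonds):
    """Summarize an atom by its element and its counted (neighbour, bond type) pairs."""
    counts = Counter()
    for a, b, bond_type in bonds:
        if a == atom_idx:
            counts[(atoms[b], bond_type)] += 1
        elif b == atom_idx:
            counts[(atoms[a], bond_type)] += 1
    parts = [f"{sym}-{bond}{n}" for (sym, bond), n in sorted(counts.items())]
    return "_".join([atoms[atom_idx]] + parts)


def build_vocab(molecules):
    """Count how often each context key occurs over (atoms, bonds) molecules."""
    vocab = Counter()
    for atoms, bonds in molecules:
        for idx in atoms:
            vocab[atom_context_key(idx, atoms, bonds)] += 1
    return vocab


vocab = build_vocab([(toy_atoms, toy_bonds)])
print(sorted(vocab))
print("toy atom vocab size", len(vocab))
```

On real data the number of distinct keys grows with chemical diversity, which is what the "atom vocab size" and "bond vocab size" lines in the output above report.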


## Splitting the data
To accelerate data loading and reduce memory cost in the multi-GPU pretraining scenario, the unlabelled molecular data need to be split into several parts using `scripts/split_data.py`.

In [12]:
!python scripts/split_data.py --data_path exampledata/pretrain/tryout.csv  \
                             --features_path exampledata/pretrain/tryout.npz  \
                             --sample_per_file 100  \
                             --output_path exampledata/pretrain/tryout

Number of files: 60
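
The file count follows directly from the dataset size and `--sample_per_file`. A minimal sketch of that arithmetic is shown here; it assumes the data are chunked consecutively, which is an assumption about `split_data.py` rather than something this tutorial confirms, and the function names are hypothetical.

```python
import math


def count_split_files(num_samples, samples_per_file):
    """Number of files needed when each file holds at most samples_per_file rows."""
    return math.ceil(num_samples / samples_per_file)


def chunk(items, samples_per_file):
    """Yield consecutive slices of at most samples_per_file items."""
    for start in range(0, len(items), samples_per_file):
        yield items[start:start + samples_per_file]


# 5970 molecules with 100 per file, as in the cell above.
print(count_split_files(5970, 100))

parts = list(chunk(list(range(5970)), 100))
print(len(parts), "files; last file holds", len(parts[-1]), "samples")
```

With 5,970 samples and 100 per file, the last file is only partly filled, which is why the count rounds up to 60.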


## Running pretraining on a single GPU

In [13]:
!python main.py pretrain \
               --data_path exampledata/pretrain/tryout \
               --save_dir model/tryout \
               --atom_vocab_path exampledata/pretrain/tryout_atom_vocab.pkl \
               --bond_vocab_path exampledata/pretrain/tryout_bond_vocab.pkl \
               --batch_size 32 \
               --dropout 0.1 \
               --depth 5 \
               --num_attn_head 1 \
               --hidden_size 100 \
               --epochs 3 \
               --init_lr 0.0002 \
               --max_lr 0.0004 \
               --final_lr 0.0001 \
               --weight_decay 0.0000001 \
               --activation PReLU \
               --backbone gtrans \
               --embedding_output_type both

Namespace(parser_name='pretrain', cuda=True, enable_multi_gpu=False, data_path='exampledata/pretrain/tryout', fg_label_path=None, atom_vocab_path='exampledata/pretrain/tryout_atom_vocab.pkl', bond_vocab_path='exampledata/pretrain/tryout_bond_vocab.pkl', embedding_output_type='both', save_dir='model/tryout', save_interval=9999999999, hidden_size=100, bias=False, depth=5, dropout=0.1, activation='PReLU', undirected=False, weight_decay=1e-07, num_attn_head=1, num_mt_block=1, dist_coff=0.1, backbone='gtrans', epochs=3, batch_size=32, warmup_epochs=2.0, init_lr=0.0002, max_lr=0.0004, final_lr=0.0001, bond_drop_rate=0, dense=False, fine_tune_coff=1, no_cache=True)
Loading data
Loading data:
Number of files: 60
Number of samples: 5970
Samples/file: 100
Splitting data with seed 0.
Total size = 5,970 | train size = 5,400 | val size = 570
atom vocab size: 324, bond vocab size: 353, Number of FG tasks: 85
Pre-loaded test data: 6
No checkpoint found %d
GROVEREmbedding(
  (encoders): GTransEncoder(
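
The `--init_lr`, `--max_lr`, and `--final_lr` flags, together with `warmup_epochs=2.0` in the namespace above, suggest a schedule that warms up and then decays. The sketch below assumes a Noam-style schedule (linear warm-up to `max_lr`, then exponential decay to `final_lr`); whether GROVER's scheduler matches this exactly is not confirmed here. The function name and the steps-per-epoch value are assumptions.

```python
def noam_style_lr(step, steps_per_epoch, warmup_epochs, total_epochs,
                  init_lr, max_lr, final_lr):
    """Learning rate at a given optimizer step under an assumed warm-up/decay schedule."""
    warmup_steps = int(warmup_epochs * steps_per_epoch)
    total_steps = int(total_epochs * steps_per_epoch)
    if warmup_steps > 0 and step <= warmup_steps:
        # Linear increase from init_lr to max_lr during warm-up.
        return init_lr + (max_lr - init_lr) * step / warmup_steps
    # Exponential decay from max_lr to final_lr over the remaining steps.
    decay_steps = max(total_steps - warmup_steps, 1)
    gamma = (final_lr / max_lr) ** (1.0 / decay_steps)
    return max_lr * gamma ** (step - warmup_steps)


schedule = dict(
    steps_per_epoch=169,  # assumed: ceil(5400 train samples / batch size 32)
    warmup_epochs=2.0,
    total_epochs=3,
    init_lr=0.0002,
    max_lr=0.0004,
    final_lr=0.0001,
)
for step in (0, 169, 338, 507):
    print(step, f"{noam_style_lr(step, **schedule):.6f}")
```

Under these assumptions the rate peaks at the end of the second epoch and decays during the third.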

# Training and Finetuning

## Extracting molecular features

Given a labelled molecular dataset, we can extract additional molecular features and use them when training and finetuning from the existing pretrained model. The feature matrix is stored as an `.npz` file.

In [14]:
!python scripts/save_features.py --data_path exampledata/finetune/hERG.csv \
                                --save_path exampledata/finetune/hERG.npz \
                                --features_generator rdkit_2d_normalized \
                                --restart

100% 2968/2968 [03:21<00:00, 14.74it/s]


## Finetuning with existing data
Given the labelled dataset and the molecular features, we can use the `finetune` subcommand of `main.py` to finetune the pretrained model.

In [15]:
!python main.py finetune --data_path exampledata/finetune/hERG.csv \
                        --features_path exampledata/finetune/hERG.npz \
                        --save_dir model/finetune/hERG/ \
                        --checkpoint_path model/tryout/model.ep3 \
                        --dataset_type classification \
                        --split_type scaffold_balanced \
                        --ensemble_size 1 \
                        --num_folds 3 \
                        --no_features_scaling \
                        --ffn_hidden_size 200 \
                        --batch_size 32 \
                        --epochs 10 \
                        --init_lr 0.00015

Fold 0
Loading data
Number of tasks = 1
Splitting data with seed 0
100% 2968/2968 [00:01<00:00, 1630.21it/s]
Total scaffolds = 1,422 | train scaffolds = 1,136 | val scaffolds = 146 | test scaffolds = 140
Label averages per scaffold, in decreasing order of scaffold frequency,capped at 10 scaffolds and 20 labels: [(array([1.]), array([1])), (array([1.]), array([2])), (array([0.]), array([1])), (array([1.]), array([1])), (array([1.]), array([2])), (array([1.]), array([1])), (array([1.]), array([2])), (array([0.]), array([1])), (array([1.]), array([2])), (array([1.]), array([4]))]
Class sizes
Activity 0: 35.58%, 1: 64.42%
Total size = 2,968 | train size = 2,374 | val size = 296 | test size = 298
Loading model 0 from model/tryout/model.ep3
Loading pretrained parameter "grover.encoders.edge_blocks.0.heads.0.mpn_q.act_func.weight".
Loading pretrained parameter "grover.encoders.edge_blocks.0.heads.0.mpn_q.W_h.weight".
Loading pretrained parameter "grover.encoders.edge_blocks.0.heads.0.mpn_k.ac
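
The log above reports a scaffold-based split of 2,374 / 296 / 298 molecules, roughly 80/10/10, in which molecules sharing a scaffold stay in the same subset. The sketch below is a simplified greedy variant of that idea on hypothetical data. It is not the repository's `scaffold_balanced` implementation, and the function name, the toy scaffold labels, and the 80/10/10 fractions are assumptions.

```python
from collections import defaultdict


def scaffold_split(scaffold_of, fractions=(0.8, 0.1, 0.1)):
    """Assign whole scaffold groups to train/val/test, largest groups first."""
    groups = defaultdict(list)
    for mol_id, scaffold in scaffold_of.items():
        groups[scaffold].append(mol_id)

    total = len(scaffold_of)
    train_cap = fractions[0] * total
    val_cap = fractions[1] * total

    train, val, test = [], [], []
    for group in sorted(groups.values(), key=len, reverse=True):
        if len(train) + len(group) <= train_cap:
            train.extend(group)
        elif len(val) + len(group) <= val_cap:
            val.extend(group)
        else:
            test.extend(group)
    return train, val, test


# Hypothetical molecule ids mapped to hypothetical scaffold labels.
toy_scaffolds = {
    "m0": "A", "m1": "A", "m2": "A", "m3": "A",
    "m4": "B", "m5": "B", "m6": "B",
    "m7": "C", "m8": "D", "m9": "E",
}
train, val, test = scaffold_split(toy_scaffolds)
print(len(train), len(val), len(test))
```

Because groups are never divided, the realized sizes only approximate the requested fractions, which is consistent with the slightly uneven validation and test sizes in the log.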

# Predicting output

## Extracting molecular features

If the finetuned model uses molecular features as input, we need to generate those features for the target molecules as well.

In [38]:
!python scripts/save_features.py --data_path exampledata/finetune/hERG-house.csv \
                                --save_path exampledata/finetune/hERG-house.npz \
                                --features_generator rdkit_2d_normalized \
                                --restart

100% 17/17 [00:01<00:00, 10.02it/s]


## Predicting output with the finetuned model

In [39]:
!python main.py predict --data_path exampledata/finetune/hERG-house.csv \
               --features_path exampledata/finetune/hERG-house.npz \
               --checkpoint_dir ./model \
               --no_features_scaling \
               --output data_pre.csv

Loading training args
Loading data
Validating SMILES
Test size = 17
Predicting...
  0% 0/3 [00:00<?, ?it/s]Loading pretrained parameter "grover.encoders.edge_blocks.0.heads.0.mpn_q.act_func.weight".
Loading pretrained parameter "grover.encoders.edge_blocks.0.heads.0.mpn_q.W_h.weight".
Loading pretrained parameter "grover.encoders.edge_blocks.0.heads.0.mpn_k.act_func.weight".
Loading pretrained parameter "grover.encoders.edge_blocks.0.heads.0.mpn_k.W_h.weight".
Loading pretrained parameter "grover.encoders.edge_blocks.0.heads.0.mpn_v.act_func.weight".
Loading pretrained parameter "grover.encoders.edge_blocks.0.heads.0.mpn_v.W_h.weight".
Loading pretrained parameter "grover.encoders.edge_blocks.0.act_func.weight".
Loading pretrained parameter "grover.encoders.edge_blocks.0.layernorm.weight".
Loading pretrained parameter "grover.encoders.edge_blocks.0.layernorm.bias".
Loading pretrained parameter "grover.encoders.edge_blocks.0.W_i.weight".
Loading pretrained parameter "grover.encoders.ed

## Output

The predictions are saved to `data_pre.csv`, the file passed to `--output`.

In [41]:
import pandas as pd
df = pd.read_csv("/content/grover/data_pre.csv")
df.head(10)

| Unnamed: 0.1 | Unnamed: 0 | Activity |
|---|---|---|
| 0 | `OC(C1=CC=CC=C1)(C2CCN(CCC[C@@H](C3=CC=C(C(C)(C...` | 0.946972 |
| 1 | `O=C(NCCCCN1CCN(C2=CC(C(F)(F)F)=NC(C(C)(C)C)=N2...` | 0.630853 |
| 2 | `O=C(COC(C=CN1)=NC1=O)NC23CC4C[C@H](C2)C[C@@H](...` | 0.140539 |
| 3 | `O=C(C1=CC(CCC(N2)=O)=C2C=C1)NC3=CC(C#CC4=NC=CC...` | 0.488667 |
| 4 | `O=C(CN1CCCC[C@@H]1C(N)=O)C(C(N2CC)=O)=C(N(C2=O...` | 0.293114 |
| 5 | `O=C(CN1CCCC[C@@H]1C(N)=O)C(C(N2CC)=O)=C(N(C2=O...` | 0.30918 |
| 6 | `O=C(N)CN(C(C1=CNC(N(C2=O)CC3=CC=C(F)C=C3)=C1C(...` | 0.154679 |
| 7 | `O=C1N(C2CCC(NC2=O)=O)CC3=C1C=CC(CNC4=NC(C(C=C5...` | 0.494759 |
| 8 | `CC(C=C1)=C(Cl)C=C1NC2=NC(C3=CC(CN(C4=O)C5C(NC(...` | 0.586602 |
| 9 | `O=C(CN1CCCC[C@@H]1C(N2CCOCC2)=O)C(C(N3CC)=O)=C...` | 0.155188 |
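
The `Activity` values are continuous scores, so a cut-off is needed to turn them into class labels. The sketch below is an illustrative addition: it assumes the values are scores for class 1 and uses a 0.5 threshold, neither of which is stated in this tutorial. The numbers are copied from the `Activity` column shown above.

```python
# Scores copied from the Activity column of the first ten predictions above.
activity = [
    0.946972, 0.630853, 0.140539, 0.488667, 0.293114,
    0.30918, 0.154679, 0.494759, 0.586602, 0.155188,
]

THRESHOLD = 0.5  # assumed cut-off; adjust to trade sensitivity against specificity


def to_labels(scores, threshold=THRESHOLD):
    """Convert scores to 0/1 labels, assigning 1 when the score reaches the threshold."""
    return [1 if score >= threshold else 0 for score in scores]


labels = to_labels(activity)
print(labels)
print(sum(labels), "of", len(labels), "molecules predicted as class 1")
```

Two of the scores (0.488667 and 0.494759) sit just under 0.5, so the chosen threshold matters for borderline molecules.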
