# Dataset Construction to Explore Chemical Space with 3D Geometry and Deep Learning

This is a tutorial on how to train a simplified PhysNet (sPhysNet) on the Frag20 dataset described in this [paper]($PLACEHOLDER).

## 1. Download the code and data

First, download the code for data preprocessing (/dataProviders) and model training (/PhysDime-Seq). You can do it with:


` $ git clone --recurse-submodules -j8 https://github.com/SongXia-NYU/sPhysNet.git`

You need to add both folders to the `PYTHONPATH` environment variable:

` $ export PYTHONPATH=./dataProviders/:$PYTHONPATH `

` $ export PYTHONPATH=./PhysDime-Seq/:$PYTHONPATH `

before you launch this notebook: 

` $ jupyter notebook `

This tutorial file is supposed to sit in the /sPhysNet folder, at the same level as /dataProviders and /PhysDime-Seq.
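If the notebook was started without exporting `PYTHONPATH`, the same effect can be approximated from inside the notebook by extending `sys.path`. The sketch below assumes the notebook's working directory is the /sPhysNet folder; the helper name `add_repo_paths` is hypothetical and not part of the repository.

```python
import sys
from pathlib import Path


def add_repo_paths(repo_root="."):
    """Prepend the two code folders to sys.path if they are not already there.

    Hypothetical helper: assumes repo_root is the sPhysNet folder that
    contains dataProviders/ and PhysDime-Seq/.
    """
    added = []
    for sub in ("dataProviders", "PhysDime-Seq"):
        path = str(Path(repo_root).resolve() / sub)
        if path not in sys.path:
            sys.path.insert(0, path)
            added.append(path)
    return added


# Run once near the top of the notebook, before importing project modules.
added_paths = add_repo_paths(".")
```

Because the paths are resolved to absolute ones, they keep working after a later `%cd` into a subfolder.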

Download the Frag20 data from our website:

In [1]:
# TODO
! wget ${website}

wget: missing URL
Usage: wget [OPTION]... [URL]...

Try `wget --help' for more options.


You will need to extract the *.tar.bz2 file to wherever is convenient for you. You may need to change the line below:

In [2]:
# change me
frag20_data_root = "/ext3"
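The extraction step can also be done from Python with the standard `tarfile` module. This is a minimal sketch, not part of the repository: the helper name `extract_frag20_archives` and the idea that all archives sit in one download folder are assumptions. Only extract archives you trust.

```python
import tarfile
from pathlib import Path


def extract_frag20_archives(archive_dir, dst_root):
    """Extract every *.tar.bz2 archive found in archive_dir into dst_root.

    Hypothetical helper; returns the names of the archives it extracted.
    """
    dst = Path(dst_root)
    dst.mkdir(parents=True, exist_ok=True)
    extracted = []
    for archive in sorted(Path(archive_dir).glob("*.tar.bz2")):
        # "r:bz2" opens a bzip2-compressed tar archive for reading
        with tarfile.open(archive, mode="r:bz2") as tar:
            tar.extractall(path=dst)
        extracted.append(archive.name)
    return extracted


# Example call (paths are placeholders to adapt):
# extract_frag20_archives("/path/to/downloads", frag20_data_root)
```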

In [1]:
from tqdm import tqdm
from glob import glob
import os
import os.path as osp
import torch
import argparse


def check_frag20_data(root):
    for n_heavy in tqdm(range(9, 21)):
        csv_file = osp.join(root, "Frag20_{}_target.csv".format(n_heavy))
        pt_file = osp.join(root, "Frag20_{}_extra_target.pt".format(n_heavy))
        sdf_folder = osp.join(root, "Frag20_{}_data".format(n_heavy))
        for f in [csv_file, pt_file, sdf_folder]:
            if not osp.exists(f):
                raise ValueError("file/folder: {} doesn't exist! Is your root correct?".format(f))
    print("Frag20 data status: Normal")

In [4]:
check_frag20_data(frag20_data_root)

100%|██████████| 12/12 [00:00<00:00, 12.99it/s]

Frag20 data status: Normal
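`check_frag20_data` stops at the first missing path. When setting up a new machine it can be handier to see everything that is missing at once. The variant below is a sketch under the same file-naming assumptions as the function above; the name `find_missing_frag20_files` is hypothetical.

```python
import os.path as osp


def find_missing_frag20_files(root, heavy_range=range(9, 21)):
    """Return every expected Frag20 file/folder under root that does not exist."""
    missing = []
    for n_heavy in heavy_range:
        expected = [
            osp.join(root, "Frag20_{}_target.csv".format(n_heavy)),
            osp.join(root, "Frag20_{}_extra_target.pt".format(n_heavy)),
            osp.join(root, "Frag20_{}_data".format(n_heavy)),
        ]
        missing.extend(p for p in expected if not osp.exists(p))
    return missing


# Example call inside the notebook:
# for path in find_missing_frag20_files(frag20_data_root):
#     print("missing:", path)
```

An empty list corresponds to the "Frag20 data status: Normal" message.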





## 2. Data preprocessing

Note that in the downloaded data, the geometries and targets are stored in different formats. Therefore we need to preprocess them into a single `torch_geometric.data.InMemoryDataset` format. I have written a function for this:

In [5]:
from dataProviders.GaussUtils.GaussInfo import sdf_to_pt
dst_dir = "dataProviders/data/processed"
os.makedirs(dst_dir, exist_ok=True)



Data preprocessing extracts the geometries and targets and also calculates `edge_index`. The whole process takes a few hours (about two and a half hours in the run shown below).
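To make the `edge_index` step concrete, here is a minimal sketch of a distance-cutoff neighbor list in plain Python. It is not the code inside `sdf_to_pt`: the function name `cutoff_edge_index`, the strict `<` comparison, and the toy coordinates are assumptions for illustration. The 10.0 cutoff mirrors the `--cutoff=10.0` setting in the training config shown later in this tutorial.

```python
import math


def cutoff_edge_index(coords, cutoff=10.0):
    """Return [sources, targets] for all ordered atom pairs closer than cutoff.

    Illustrative O(n^2) sketch; edges are directed, so each close pair
    appears twice (i -> j and j -> i), and self-loops are excluded.
    """
    sources, targets = [], []
    n_atoms = len(coords)
    for i in range(n_atoms):
        for j in range(n_atoms):
            if i != j and math.dist(coords[i], coords[j]) < cutoff:
                sources.append(i)
                targets.append(j)
    return [sources, targets]


# Toy geometry: atoms 0 and 1 are 1.5 apart, atom 2 is far from both.
coords = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (20.0, 0.0, 0.0)]
edge_index = cutoff_edge_index(coords, cutoff=10.0)
print(edge_index)  # [[0, 1], [1, 0]]
```

When every atom lies within the cutoff of every other atom, this rule gives n(n-1) directed edges. The 17-atom sample printed later in this tutorial has a `BN_edge_index` of shape `[2, 272]`, and 272 = 17 × 16, which is consistent with that picture.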

In [6]:
# ------------- Frag20 Data preprocess------------- #

for n_heavy in range(9, 21):
    sdf_to_pt(n_heavy=n_heavy, src_root=frag20_data_root, dst_root=dst_dir)

processing heavy: 9: 100%|██████████| 158535/158535 [22:09<00:00, 119.24it/s]
processing heavy: 10: 100%|██████████| 143180/143180 [29:36<00:00, 80.61it/s] 
processing heavy: 11: 100%|██████████| 17269/17269 [03:28<00:00, 82.85it/s] 
processing heavy: 12: 100%|██████████| 21502/21502 [05:00<00:00, 71.51it/s] 
processing heavy: 13: 100%|██████████| 25793/25793 [06:51<00:00, 62.61it/s] 
processing heavy: 14: 100%|██████████| 30331/30331 [09:01<00:00, 56.03it/s] 
processing heavy: 15: 100%|██████████| 31719/31719 [10:20<00:00, 51.13it/s]
processing heavy: 16: 100%|██████████| 35584/35584 [12:41<00:00, 46.72it/s]
processing heavy: 17: 100%|██████████| 36201/36201 [14:15<00:00, 42.31it/s]
processing heavy: 18: 100%|██████████| 32501/32501 [14:04<00:00, 38.49it/s]
processing heavy: 19: 100%|██████████| 23474/23474 [11:19<00:00, 34.55it/s]
processing heavy: 20: 100%|██████████| 10207/10207 [05:48<00:00, 29.28it/s]
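Since this loop runs for a long time, it may help to skip heavy-atom counts that were already processed when restarting after an interruption. The sketch below assumes that `sdf_to_pt` writes one `frag20_{n}_raw.pt` file per heavy-atom count directly under `dst_root`, which matches the listing of `dataProviders/data/processed/` in this tutorial but is not verified against the function itself. The helper name `pending_heavy_counts` is hypothetical, and a partially written file from a crashed run would be treated as finished.

```python
import os.path as osp


def pending_heavy_counts(dst_dir, heavy_range=range(9, 21)):
    """Return the heavy-atom counts whose output file is not yet in dst_dir."""
    return [
        n for n in heavy_range
        if not osp.exists(osp.join(dst_dir, "frag20_{}_raw.pt".format(n)))
    ]


# Usage inside the notebook (sdf_to_pt, frag20_data_root and dst_dir
# come from the cells above):
# for n_heavy in pending_heavy_counts(dst_dir):
#     sdf_to_pt(n_heavy=n_heavy, src_root=frag20_data_root, dst_root=dst_dir)
```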


In [8]:
! ls dataProviders/data/processed/

Frag20_eMol9_QM.pt
eMol9_raw.pt
frag20_10_raw.pt
frag20_11_raw.pt
frag20_12_raw.pt
frag20_13_raw.pt
frag20_14_raw.pt
frag20_15_raw.pt
frag20_16_raw.pt
frag20_17_raw.pt
frag20_18_raw.pt
frag20_19_raw.pt
frag20_20_raw.pt
frag20_9_raw.pt
frag20_eMol9_split.pt
frag20reducedAllSolRef-Bmsg-cutoff-10.00-sorted-defined_edge-lr-QM.pt


In [7]:

# the processed files live in dst_dir ("dataProviders/data/processed"), as listed above
frag20_20_raw = torch.load("dataProviders/data/processed/frag20_20_raw.pt")
frag20_20_raw[0].keys

['R',
 'Z',
 'Q',
 'D',
 'F',
 'E',
 'N',
 'BN_edge_index',
 'L_edge_index',
 'num_L_edge',
 'num_BN_edge']

You also need to follow the same procedure to preprocess the eMol9 dataset:

In [3]:
# ------------ eMol9 data preprocess ---------------- #

# change me
eMol9_data_root = "/scratch/sx801/data/eMol9/eMol9_dataset"

In [4]:
from dataProviders.GaussUtils.GaussInfo import sdf_to_pt_eMol9

In [5]:
sdf_to_pt_eMol9(src_root=eMol9_data_root, dst_root=dst_dir)

100%|██████████| 88234/88234 [15:54<00:00, 92.45it/s] 


In [7]:
! ls dataProviders/data/processed/eMol9*

dataProviders/data/processed/eMol9_raw.pt


In [2]:
# ----------- Concat datasets ----------------- #

# recommended memory: 32GB
from DataPrepareUtils import concat_im_datasets
data_root = "./dataProviders/data"
datasets = ["frag20_{}_raw.pt".format(i) for i in range(9, 21)]
datasets.append("eMol9_raw.pt")
concat_im_datasets(root=data_root, datasets=datasets, out_name="Frag20_eMol9_QM.pt")

frag20_9_raw.pt: 100%|██████████| 158535/158535 [00:25<00:00, 6105.57it/s]
frag20_10_raw.pt: 100%|██████████| 143180/143180 [00:23<00:00, 6061.26it/s]
frag20_11_raw.pt: 100%|██████████| 17269/17269 [00:02<00:00, 7031.08it/s]
frag20_12_raw.pt: 100%|██████████| 21502/21502 [00:03<00:00, 7070.70it/s]
frag20_13_raw.pt: 100%|██████████| 25793/25793 [00:03<00:00, 7028.68it/s]
frag20_14_raw.pt: 100%|██████████| 30331/30331 [00:06<00:00, 5044.81it/s]
frag20_15_raw.pt: 100%|██████████| 31719/31719 [00:04<00:00, 6991.87it/s]
frag20_16_raw.pt: 100%|██████████| 35584/35584 [00:07<00:00, 5025.57it/s]
frag20_17_raw.pt: 100%|██████████| 36201/36201 [00:05<00:00, 7054.65it/s]
frag20_18_raw.pt: 100%|██████████| 32501/32501 [00:04<00:00, 7038.79it/s]
frag20_19_raw.pt: 100%|██████████| 23474/23474 [00:03<00:00, 7073.40it/s]
frag20_20_raw.pt: 100%|██████████| 10207/10207 [00:01<00:00, 7070.00it/s]
eMol9_raw.pt: 100%|██████████| 88234/88234 [00:15<00:00, 5865.10

saving... it is recommended to have 32GB memory


## 3. Model training

Now that we have prepared the Frag20 and eMol9 datasets with QM geometries, we are ready to train a model on QM-optimized geometries. First we need to move into the training folder:

In [1]:
! pwd
%cd PhysDime-Seq
! pwd

/scratch/sx801/scripts/sPhysNet
/scratch/sx801/scripts/sPhysNet/PhysDime-Seq
/scratch/sx801/scripts/sPhysNet/PhysDime-Seq


In [2]:
from DummyIMDataset import DummyIMDataset
frag20_eMol9_QM_dataset = DummyIMDataset(root="../dataProviders/data", dataset_name="Frag20_eMol9_QM.pt")

In [3]:
len(frag20_eMol9_QM_dataset)

654530
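As a quick sanity check, the dataset length should equal the sum of the per-file molecule counts reported by the progress bars during preprocessing. The snippet below only redoes that arithmetic with the numbers printed in this tutorial; the dictionary name `counts` is just for illustration.

```python
# Molecule counts copied from the preprocessing progress bars above.
counts = {
    "frag20_9_raw.pt": 158535,
    "frag20_10_raw.pt": 143180,
    "frag20_11_raw.pt": 17269,
    "frag20_12_raw.pt": 21502,
    "frag20_13_raw.pt": 25793,
    "frag20_14_raw.pt": 30331,
    "frag20_15_raw.pt": 31719,
    "frag20_16_raw.pt": 35584,
    "frag20_17_raw.pt": 36201,
    "frag20_18_raw.pt": 32501,
    "frag20_19_raw.pt": 23474,
    "frag20_20_raw.pt": 10207,
    "eMol9_raw.pt": 88234,
}

total = sum(counts.values())
print(total)  # 654530, matching len(frag20_eMol9_QM_dataset)
```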

To get the same result as the paper, we will load the same train/test split:

In [6]:
split = torch.load("../dataProviders/frag20_eMol9_split.pt")

In [7]:
# train-valid split is randomly generated on the fly
train_perm = torch.randperm(len(split["train_index"]))
train_perm

tensor([583261,  19821, 213143,  ...,  90419, 482983, 121258])

In [8]:
frag20_eMol9_QM_dataset.train_index = split["train_index"][train_perm[:-1000]]
frag20_eMol9_QM_dataset.val_index = split["train_index"][train_perm[-1000:]]
frag20_eMol9_QM_dataset.test_index = split["test_index"]
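Because the permutation above is unseeded, the 1000 validation molecules change from run to run. The plain-Python sketch below shows the same "shuffle, then hold out the last 1000" logic with a fixed seed. It uses `random.Random` instead of `torch.randperm`, and the function name `seeded_train_valid_split` and the default seed are assumptions. In the notebook itself, passing a seeded `torch.Generator` through the `generator=` argument of `torch.randperm` would serve the same purpose.

```python
import random


def seeded_train_valid_split(train_index, n_valid=1000, seed=0):
    """Shuffle train_index reproducibly and hold out the last n_valid entries."""
    if not 0 < n_valid < len(train_index):
        raise ValueError("n_valid must be between 1 and len(train_index) - 1")
    rng = random.Random(seed)
    perm = list(range(len(train_index)))
    rng.shuffle(perm)
    train = [train_index[i] for i in perm[:-n_valid]]
    valid = [train_index[i] for i in perm[-n_valid:]]
    return train, valid


# Toy example with 5000 indices standing in for split["train_index"].
toy_train, toy_valid = seeded_train_valid_split(list(range(5000)), n_valid=1000, seed=0)
print(len(toy_train), len(toy_valid))  # 4000 1000
```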

In [9]:
from train import train

In [10]:
# Use this config file to train the sPhysNet described in the paper
! cat config-sPhysNet-Frag20-eMol9-QM.txt

--debug_mode=False
--modules=P-noOut P-noOut P C
--bonding_type=BN BN BN BN
--activations=ssp ssp ssp
--expansion_fn=(P_BN,P-noOut_BN):gaussian_64_10.0 C_BN:coulomb_10.0
--n_feature=160
--n_dime_before_residual=1
--n_dime_after_residual=2
--n_output_dense=3
--n_phys_atomic_res=1
--n_phys_interaction_res=1
--n_phys_output_res=1
--n_bi_linear=8
--num_epochs=1000
--warm_up_steps=0
--data_provider=frag20_eMol9_combine
--test_interval=-1
--learning_rate=0.001
--ema_decay=0.999
--l2lambda=0.0
--nh_lambda=0.01
--restrain_non_bond_pred=True
--decay_steps=620000
--decay_rate=0.1
--batch_size=100
--valid_batch_size=32
--force_weight=0
--charge_weight=1
--dipole_weight=1
--use_trained_model=False
--max_norm=1000.0
--log_file_name=training.log
--normalize=True
--shared_normalize_param=True
--edge_version=cutoff
--cutoff=10.0
--boundary_factor=100.
--remove_atom_ids=-1
--folder_prefix=exp-sPhysNet-QM
--comment=original PhysNet, coulomb correction
--coulomb_ch

In [11]:
from utils.utils_functions import add_parser_arguments
config_name = "config-sPhysNet-Frag20-eMol9-QM.txt"
# set up parser and arguments
parser = argparse.ArgumentParser(fromfile_prefix_chars='@')
parser = add_parser_arguments(parser)

args, unknown = parser.parse_known_args(["@" + config_name])
args.config_name = config_name
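The `@` prefix tells `argparse` to read arguments from a file, one argument per line, which is why each line of the config file is a complete `--name=value` entry. The self-contained demo below illustrates this with a throwaway config file. The three registered arguments are hypothetical stand-ins for what `add_parser_arguments` registers, and `--not_registered` is an invented flag that shows how `parse_known_args` collects unrecognized entries instead of failing.

```python
import argparse
import os
import tempfile

# Stand-in parser: the real argument definitions come from add_parser_arguments.
parser = argparse.ArgumentParser(fromfile_prefix_chars="@")
parser.add_argument("--n_feature", type=int)
parser.add_argument("--cutoff", type=float)
parser.add_argument("--modules", type=str)

config_text = (
    "--n_feature=160\n"
    "--cutoff=10.0\n"
    "--modules=P-noOut P-noOut P C\n"
    "--not_registered=1\n"
)

with tempfile.TemporaryDirectory() as tmp_dir:
    config_path = os.path.join(tmp_dir, "demo-config.txt")
    with open(config_path, "w") as f:
        f.write(config_text)
    # Each line of the file is treated as exactly one command-line argument.
    demo_args, demo_unknown = parser.parse_known_args(["@" + config_path])

print(demo_args.n_feature, demo_args.cutoff)  # 160 10.0
print(demo_args.modules)                      # P-noOut P-noOut P C
print(demo_unknown)                           # ['--not_registered=1']
```

Because a whole line is one argument, a value with spaces such as the `--modules` entry arrives as a single string; how the training code splits it afterwards is not shown here.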

In [12]:
frag20_eMol9_QM_dataset[0]

Data(BN_edge_index=[2, 272], D=[1, 3], E=[1], F=[1, 3], L_edge_index=[2, 0], N=[1], Q=[1], R=[17, 3], Z=[17], num_BN_edge=[1], num_L_edge=[1])

In [13]:
frag20_eMol9_QM_dataset.data.num_L_edge.sum()

tensor(7892096)

In [None]:
train(args, data_provider=frag20_eMol9_QM_dataset, use_tqdm=True)

REMOVING ATOM -1 FROM DATASET


epoch: 0: 5969it [15:36,  6.37it/s]
epoch: 1: 5969it [14:12,  7.00it/s]
epoch: 2: 5969it [14:13,  7.00it/s]
epoch: 3: 5969it [14:12,  7.00it/s]
epoch: 4: 5969it [14:13,  6.99it/s]
epoch: 5: 5969it [14:04,  7.07it/s]
epoch: 6: 5969it [14:10,  7.02it/s]
epoch: 7: 5969it [14:10,  7.01it/s]
epoch: 8: 5969it [14:11,  7.01it/s]
epoch: 9: 5969it [14:12,  7.00it/s]
epoch: 10: 5969it [14:10,  7.02it/s]
epoch: 11: 5969it [14:12,  7.00it/s]
epoch: 12: 5969it [14:11,  7.01it/s]
epoch: 13: 5969it [14:11,  7.01it/s]
epoch: 14: 5969it [14:11,  7.01it/s]
epoch: 15: 5969it [14:11,  7.01it/s]
epoch: 16: 5969it [14:14,  6.99it/s]
epoch: 17: 5969it [14:12,  7.00it/s]
epoch: 18: 5969it [14:12,  7.00it/s]
epoch: 19: 5969it [14:12,  7.00it/s]
epoch: 20: 5969it [14:13,  6.99it/s]
epoch: 21: 5969it [14:11,  7.01it/s]
epoch: 22: 5969it [14:13,  7.00it/s]
epoch: 23: 5969it [14:11,  7.01it/s]
epoch: 24: 5969it [14:12,  7.01it/s]
epoch: 25: 5969it [14:11,  7.01it/s]
epoch: 26: 5969it [14:12,  7.00it/s]
epoch: 27: 