Data-reindeer/MOCO

Improving Molecular Pretraining with Complementary Featurizations

This repository provides the source code for the paper Improving Molecular Pretraining with Complementary Featurizations. Here we consider four kinds of views:

  • 2D Graph
  • 3D Geometry
  • Morgan Fingerprint
  • SMILES String

Environments

numpy             1.21.2
networkx          2.6.3
scikit-learn      1.0.2
pandas            1.3.4
python            3.7.11
torch             1.10.2+cu113
torch-geometric   2.0.3
transformers      4.17.0
rdkit             2020.09.1.0
ase               3.22.1
descriptastorus   2.3.0.5
ogb               1.3.3
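
The pinned versions above can be compared against the local installation with a short script. This is only a sketch: the `find_mismatches` and `default_lookup` helpers are hypothetical and not part of this repository, the lookup assumes each package was installed under the distribution name listed above (a conda-installed `rdkit` may not report a version this way), and on Python 3.7 it assumes the `importlib_metadata` backport is available.

```python
# Pinned versions copied from the list above; the Python version is not checked here.
PINNED = {
    "numpy": "1.21.2",
    "networkx": "2.6.3",
    "scikit-learn": "1.0.2",
    "pandas": "1.3.4",
    "torch": "1.10.2+cu113",
    "torch-geometric": "2.0.3",
    "transformers": "4.17.0",
    "rdkit": "2020.09.1.0",
    "ase": "3.22.1",
    "descriptastorus": "2.3.0.5",
    "ogb": "1.3.3",
}


def default_lookup(name):
    """Return the installed version of `name`, or None if it is not found."""
    try:
        from importlib.metadata import PackageNotFoundError, version  # Python 3.8+
    except ImportError:
        # Assumption: on Python 3.7 the importlib_metadata backport is installed.
        from importlib_metadata import PackageNotFoundError, version
    try:
        return version(name)
    except PackageNotFoundError:
        return None


def find_mismatches(pinned=PINNED, lookup=default_lookup):
    """Map each package whose installed version differs to (expected, found)."""
    mismatches = {}
    for name, expected in pinned.items():
        found = lookup(name)
        if found != expected:
            mismatches[name] = (expected, found)
    return mismatches


if __name__ == "__main__":
    for name, (expected, found) in find_mismatches().items():
        print(f"{name}: expected {expected}, found {found}")
```

A package that is not installed is reported with `None` as the found version.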

Dataset Preprocessing

Datasets

  • Geometric Ensemble Of Molecules (GEOM)
mkdir datasets
cd datasets
mkdir -p GEOM/raw
mkdir -p GEOM/processed

wget https://dataverse.harvard.edu/api/access/datafile/4327252
mv 4327252 rdkit_folder.tar.gz
tar -xvf rdkit_folder.tar.gz
  • Chem Datasets
wget http://snap.stanford.edu/gnn-pretrain/data/chem_dataset.zip
unzip chem_dataset.zip
mv dataset molecule_datasets
  • Other Chem Datasets
    • malaria
    • cep
wget -O malaria-processed.csv https://raw.githubusercontent.com/HIPS/neural-fingerprint/master/data/2015-06-03-malaria/malaria-processed.csv
mkdir -p ./molecule_datasets/malaria/raw
mv malaria-processed.csv ./molecule_datasets/malaria/raw/malaria.csv

wget -O cep-processed.csv https://raw.githubusercontent.com/HIPS/neural-fingerprint/master/data/2015-06-02-cep-pce/cep-processed.csv
mkdir -p ./molecule_datasets/cep/raw
mv cep-processed.csv ./molecule_datasets/cep/raw/cep.csv
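
Before moving on to preprocessing, it may help to confirm that the commands above produced the expected directory layout. The following is a minimal sketch, assuming it is run from the directory that contains `datasets`. The `missing_paths` helper is hypothetical, and it checks only the paths that the commands above create explicitly, not the contents of the extracted archives.

```python
from pathlib import Path

# Paths created by the commands above, relative to the `datasets` directory.
EXPECTED = [
    "GEOM/raw",
    "GEOM/processed",
    "molecule_datasets",
    "molecule_datasets/malaria/raw/malaria.csv",
    "molecule_datasets/cep/raw/cep.csv",
]


def missing_paths(root="datasets", expected=EXPECTED):
    """Return the expected paths that do not exist under `root`."""
    base = Path(root)
    return [rel for rel in expected if not (base / rel).exists()]


if __name__ == "__main__":
    missing = missing_paths()
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("All expected dataset paths are present.")
```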

Preprocessing

Before preprocessing the datasets, train the RoBERTa model and store the resulting SMILES embeddings. Precomputing the embeddings reduces memory usage in the later steps.

cd src
python SMILES_train.py
python SMILES_process.py
  • GEOM preprocessing
python dataset_preparation.py --n_mol 50000 --n_conf 5 --n_upper 1000
  • Downstream preprocessing (Classification)
python molecule_preparation.py
  • Downstream preprocessing (Regression)
cd src/datasets
python regression_datasets.py
python qm9_data.py

Experiments

Because the view encoders have different training dynamics, we search the learning rate and dropout ratio of each encoder over [1e-3, 1e-4, ..., 1e-7] and [0, 0.3, 0.5], respectively. The commands below give the hyperparameter combinations used for the classification and regression tasks.
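
The search space described above can be enumerated as a simple grid. This is a sketch rather than the search procedure used for the paper: it assumes the elided learning rates are the intermediate powers of ten (1e-5 and 1e-6) and that both ranges are crossed exhaustively for each encoder. The `search_grid` helper is hypothetical.

```python
from itertools import product

# Search ranges quoted in the paragraph above.
# Assumption: the ellipsis stands for the intermediate powers of ten.
LEARNING_RATES = [float(f"1e-{k}") for k in range(3, 8)]  # 1e-3 ... 1e-7
DROPOUT_RATIOS = [0, 0.3, 0.5]


def search_grid(learning_rates=LEARNING_RATES, dropout_ratios=DROPOUT_RATIOS):
    """Return every (learning rate, dropout ratio) pair in the search space."""
    return list(product(learning_rates, dropout_ratios))


if __name__ == "__main__":
    grid = search_grid()
    print(len(grid), "candidate settings per encoder")
    for lr, dropout in grid:
        print(f"lr={lr:g} dropout_ratio={dropout}")
```

Under these assumptions each encoder has five learning rates and three dropout ratios to try, so the grid holds 15 candidate settings per encoder.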

  • Pre-training for classification
cd src
python pretrain.py --dataset=Final_GEOM_FULL_nmol50000_nconf5 --lr=0.0001 --gnn_lr_scale=1 --schnet_lr_scale=0.1 --fp_lr_scale=0.1 --mlp_lr_scale=10 --fuse_lr_scale=0.01 --dropout_ratio=0
  • Fine-tune for classification
python finetune_supervised.py --input_model_file='../runs/Classification_models/' --lr=0.0001 --gnn_lr_scale=1 --schnet_lr_scale=0.1 --fp_lr_scale=0.1 --mlp_lr_scale=10 --fuse_lr_scale=0.001 --dropout_ratio=0.5
  • Pre-training for regression
cd src
python pretrain_regression.py --dataset=Final_GEOM_FULL_nmol50000_nconf5 --lr=0.001 --gnn_lr_scale=0.1 --schnet_lr_scale=0.1 --fp_lr_scale=0.1 --mlp_lr_scale=1 --fuse_lr_scale=0.1 --dropout_ratio=0
  • Fine-tune for regression
python finetune_QM9.py --input_model_file='../runs/Regression_models/' --lr=0.001 --gnn_lr_scale=0.1 --schnet_lr_scale=0.1 --fp_lr_scale=0.1 --mlp_lr_scale=1 --fuse_lr_scale=0.01 --dropout_ratio=0.5

python finetune_regression.py --input_model_file='../runs/Regression_models/' --lr=0.001 --gnn_lr_scale=1 --schnet_lr_scale=0.1 --fp_lr_scale=0.1 --mlp_lr_scale=10 --fuse_lr_scale=0.01 --dropout_ratio=0.5
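
The commands above combine a base `--lr` with one `*_lr_scale` flag per encoder. The sketch below shows how such a combination could translate into per-encoder learning rates, assuming each scale simply multiplies the base rate. That reading is an assumption drawn from the flag names, not something confirmed here against the training scripts, and `effective_learning_rates` is a hypothetical helper.

```python
def effective_learning_rates(lr, **scales):
    """Multiply the base rate by each scale (assumed semantics of *_lr_scale)."""
    return {name: lr * scale for name, scale in scales.items()}


if __name__ == "__main__":
    # Values taken from the classification pre-training command above.
    rates = effective_learning_rates(
        0.0001, gnn=1, schnet=0.1, fp=0.1, mlp=10, fuse=0.01
    )
    for encoder, rate in rates.items():
        print(f"{encoder}: {rate:.0e}")
```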
