MolLM: A Unified Language Model to Integrate Biomedical Text with 2D and 3D Molecular Representations

📖 Paper

This repository contains the code for the MolLM model. It is organized as follows:

  • dataset-creation: Code to generate our dataset of 160k graph-text pairs
  • downstream: Code for MolLM and the downstream tasks
  • environments: Conda environments
    • Use the base environment for general model usage and pretraining
    • Use the other task-specific environments for each downstream task (see the example after this list)
  • pretrain: Code for pretraining MolLM
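
A typical environment setup might look like the following; the environment file name and environment name here are assumptions, so check the environments folder for the actual files:

# Minimal sketch, assuming a base environment file exists under environments/.
# The file name and environment name below are placeholders.
conda env create -f environments/base.yml
conda activate mollm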

Model checkpoint files are provided as zip archives downloadable from this Hugging Face model. The paths within that repository correspond to the locations where the zip archives should be decompressed. The dataset is available at this Hugging Face dataset.
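
As a rough sketch, downloading and unpacking a checkpoint could look like the following; the repository ID and zip file name are placeholders for the actual ones linked above:

# Hypothetical example; substitute the real Hugging Face repo ID and zip name.
# Requires the huggingface_hub package (pip install huggingface_hub).
huggingface-cli download <hf-repo-id> --local-dir checkpoints
unzip checkpoints/<checkpoint>.zip -d <path-given-in-the-repo>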

Pre-training

Use the training script pretrain/train.sh to pretrain the model. Inside the script, the GPUs used, batch size, number of epochs, 3D setting, and checkpoint save location can be viewed and modified as desired.
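
A minimal invocation, assuming the script is launched from its own directory after editing those variables in place:

# Edit train.sh first to set the GPU IDs, batch size, epochs,
# the 3D setting, and the checkpoint save location.
cd pretrain
bash train.sh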

Downstream Tasks

See the section on environments above for the appropriate Conda environment to use for each downstream task.

For graph-text cross-modality retrieval, within the downstream/GraphTextRetrieval folder, the finetune-sentence.sh and finetune-paragraph.sh scripts finetune the model for the task at the sentence and paragraph levels, respectively. Similarly, the test-sent.sh and test-paragraph.sh scripts perform the evaluation. These scripts can be modified to change the pretrained epoch and the GPUs used.
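
For example, a sentence-level run might look like this, assuming the scripts are launched from within that folder:

cd downstream/GraphTextRetrieval
bash finetune-sentence.sh   # finetune at the sentence level
bash test-sent.sh           # evaluate sentence-level retrieval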

For molecule captioning, within the downstream/MoleculeCaption folder, the scripts starting with train- train the small and base versions in both the 2D and 3D settings. The test-base.sh script provides an example of using the evaluation script. These scripts can be modified to change the pretrained checkpoints (for both MolT5 and MolLM) and the GPUs used.
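
A sketch of the workflow; the exact train- script name for a given size and setting may differ from the placeholder below:

cd downstream/MoleculeCaption
bash train-base.sh   # placeholder name; use the train- script for your size/setting
bash test-base.sh    # example evaluation run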

For molecule editing, within the downstream/MoleculeEditing folder, the scripts starting with run_ generate molecules for various prompts. The eval.py script takes in the generated molecule file, the desired metric, the desired change in that metric, and an output path, and writes out an evaluation of the generation. These scripts can be modified to change the pretrained checkpoint, prompts, and GPUs used.
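
Schematically, the flow is generation followed by evaluation; the run_ script name and the eval.py argument order below are assumptions based on the description above:

cd downstream/MoleculeEditing
bash run_<prompt>.sh   # generate molecules for a given prompt
# Assumed argument order: generated file, metric, desired change, output path.
python eval.py <generated_molecules> <metric> <desired_change> <output_path>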

Finally, for molecule prediction, within the downstream/MoleculePrediction folder, the finetune_arg.sh script takes a MoleculeNet dataset as an argument, finetunes the model on it, and reports the evaluation scores. The script can be modified to change the pretrained checkpoint and the GPUs used.
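
For instance, assuming MoleculeNet datasets are referred to by their usual short names (e.g., bbbp, tox21):

cd downstream/MoleculePrediction
bash finetune_arg.sh bbbp   # pass a MoleculeNet dataset name as the argument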

Cite us

@article{tang2023mollm,
  title={MolLM: A Unified Language Model to Integrate Biomedical Text with 2D and 3D Molecular Representations},
  author={Tang, Xiangru and Tran, Andrew and Tan, Jeffrey and Gerstein, Mark},
  journal={bioRxiv},
  pages={2023--11},
  year={2023},
  publisher={Cold Spring Harbor Laboratory}
}
