MolLM: A Unified Language Model to Integrate Biomedical Text with 2D and 3D Molecular Representations
📖 Paper
This repository contains the code for the MolLM model, organized as follows:
- `dataset-creation`: Code to generate our dataset of 160k graph-text pairs
- `downstream`: Code for MolLM and the downstream tasks
- `environments`: Conda environments
  - Use the `base` environment for general model usage and pretraining
  - Use the other, task-specific environments for each downstream task
- `pretrain`: Code for pretraining MolLM
Model checkpoint files are available as zip archives downloadable from this Hugging Face model. The paths within this repository correspond to the locations where the zip archives should be decompressed. The dataset is available at this Hugging Face dataset.
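A minimal sketch of placing a downloaded checkpoint archive; the filename `pretrain_checkpoints.zip` below is a hypothetical stand-in, not the actual release name (see the Hugging Face model page for those):

```shell
# Hypothetical filename; substitute the zip actually downloaded from Hugging Face.
CKPT_ZIP=pretrain_checkpoints.zip
if [ -f "$CKPT_ZIP" ]; then
  # The folder structure inside the zip mirrors this repository, so extracting
  # at the repository root places the checkpoints where the scripts expect them.
  unzip "$CKPT_ZIP"
else
  echo "download $CKPT_ZIP from the Hugging Face model page first"
fi
```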
Use the training script `pretrain/train.sh` to pretrain the model. Inside the script, the GPUs used, batch size, number of epochs, 3D setting, and checkpoint save location can be viewed and modified as desired.
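From the repository root, the launch is a single command (a sketch; the settings themselves live inside the script):

```shell
# Launch pretraining; edit train.sh first to set GPUs, batch size, epochs,
# the 3D setting, and the checkpoint save location.
if [ -f pretrain/train.sh ]; then
  (cd pretrain && bash train.sh)
else
  echo "run this from the MolLM repository root"
fi
```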
See the note on environments above for the appropriate Conda environment to use for each downstream task.
For graph-text cross-modality retrieval, within the `downstream/GraphTextRetrieval` folder, the `finetune-sentence.sh` and `finetune-paragraph.sh` scripts finetune the model for the task at the sentence and paragraph levels, respectively. Similarly, the `test-sent.sh` and `test-paragraph.sh` scripts perform the evaluation. These scripts can be modified to change the pretrained epoch and the GPUs used.
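A typical sentence-level sequence looks like the following sketch (the paragraph level is analogous with the `-paragraph` scripts):

```shell
if [ -d downstream/GraphTextRetrieval ]; then
  cd downstream/GraphTextRetrieval
  bash finetune-sentence.sh   # finetune at the sentence level
  bash test-sent.sh           # evaluate the finetuned checkpoint
else
  echo "expected to run from the MolLM repository root"
fi
```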
For molecule captioning, within the `downstream/MoleculeCaption` folder, the scripts starting with `train-` train the small and base versions in both the 2D and 3D settings. The `test-base.sh` script provides an example of using the evaluation script. These scripts can be modified to change the pretrained checkpoints (for both MolT5 and MolLM) and the GPUs used.
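A sketch of the captioning workflow; rather than assuming a specific `train-` script name, this lists the available variants and then runs the example evaluation:

```shell
if [ -d downstream/MoleculeCaption ]; then
  cd downstream/MoleculeCaption
  ls train-*.sh     # list the train- variants (small/base, 2D/3D)
  bash test-base.sh # example evaluation run
else
  echo "expected to run from the MolLM repository root"
fi
```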
For molecule editing, within the `downstream/MoleculeEditing` folder, the scripts starting with `run_` generate molecules for various prompts. `eval.py` takes the generated molecule file, the desired metric, the desired change in that metric, and an output path, and writes an evaluation of the generation. These scripts can be modified to change the pretrained checkpoint, the prompts, and the GPUs used.
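A hypothetical evaluation invocation; the argument order and values below (file name, `logP` metric, `+0.5` change, output path) are illustrative assumptions based on the description above, not the script's documented interface:

```shell
# Assumed argument order: generated file, metric, desired change, output path.
if [ -f downstream/MoleculeEditing/eval.py ]; then
  python downstream/MoleculeEditing/eval.py \
    generated_molecules.txt logP +0.5 eval_output.txt
else
  echo "run from the MolLM repository root after generating molecules"
fi
```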
Finally, for molecule prediction, within the `downstream/MoleculePrediction` folder, the `finetune_arg.sh` script takes a MoleculeNet dataset as an argument, finetunes the model on it, and reports the evaluation scores. The script can be modified to change the pretrained checkpoint and the GPUs used.
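For example, to finetune and evaluate on one MoleculeNet dataset; `bbbp` here is an assumed example key, not a verified argument value:

```shell
if [ -f downstream/MoleculePrediction/finetune_arg.sh ]; then
  (cd downstream/MoleculePrediction && bash finetune_arg.sh bbbp)
else
  echo "finetune_arg.sh not found; run from the MolLM repository root"
fi
```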
@article{tang2023mollm,
title={MolLM: A Unified Language Model to Integrate Biomedical Text with 2D and 3D Molecular Representations},
author={Tang, Xiangru and Tran, Andrew and Tan, Jeffrey and Gerstein, Mark},
journal={bioRxiv},
pages={2023--11},
year={2023},
publisher={Cold Spring Harbor Laboratory}
}