Molecule Captioning

Molecule captioning is a multi-modal task that aims to generate text describing the functions and properties of a molecule.

Features

  • Supported models: MolT5, GraphMVP with a MolT5 decoder, MoMu, BioMedGPT-1.6B, and MolFM.
  • Supported dataset: ChEBI-20.
  • Supported evaluation metrics: BLEU-2, BLEU-4, ROUGE, METEOR, and Text2Mol score.

Additional Packages

To evaluate the performance of molecule captioning, run the following:

pip install spacy
pip install rouge_score
pip install nltk

python
>>> import nltk
>>> nltk.download('wordnet')
>>> nltk.download('omw-1.4')
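To verify that the packages and NLTK data are in place, you can run a quick sanity check (a minimal sketch; the example sentences below are made up and not taken from ChEBI-20):

# sanity check for the caption-evaluation dependencies
from nltk.translate.bleu_score import sentence_bleu
from nltk.translate.meteor_score import meteor_score
from rouge_score import rouge_scorer

reference = "The molecule is a member of the class of flavonoids."
candidate = "The molecule is a flavonoid."
ref_tokens, cand_tokens = reference.split(), candidate.split()

# BLEU-2: equal weights on unigrams and bigrams
print("BLEU-2:", sentence_bleu([ref_tokens], cand_tokens, weights=(0.5, 0.5)))
# METEOR requires the wordnet and omw-1.4 data downloaded above
print("METEOR:", meteor_score([ref_tokens], cand_tokens))

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
print("ROUGE:", scorer.score(reference, candidate))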

Data Preparation

Download ChEBI-20 and put the files under datasets/molcap/chebi-20. For Text2Mol evaluation, download cids_to_smiles.pkl here and test.txt here, and put them under assets/text2mol.
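Once downloaded, you can inspect the data with a few lines of Python (a minimal sketch; it assumes the standard ChEBI-20 layout of tab-separated CID / SMILES / description columns, which may differ in your copy):

import pandas as pd

# load the test split; ChEBI-20 files are assumed to be tab-separated with a header row
df = pd.read_csv("datasets/molcap/chebi-20/test.txt", sep="\t")
print(df.columns.tolist())    # expected: ['CID', 'SMILES', 'description']
print(len(df), "test molecules")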

Model Preparation

Download SciBERT and MolT5 and put them under ckpts/text_ckpts/. Note the distinction between MolT5-smiles2caption (fine-tuned for molecule captioning) and the base MolT5 (not fine-tuned).

To reproduce MoMu, download the MoMu checkpoint following the instructions here and put it under ckpts/fusion_ckpts/momu.

To reproduce MolFM and BioMedGPT-1.6B, download the checkpoints here (password: 7a6b) and put them under ckpts/fusion_ckpts/.

For Text2Mol evaluation, download the Text2Mol checkpoint test_outputfinal_weights.320.pt here and put it under ckpts/fusion_ckpts/text2mol/.
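After these steps, you can confirm the checkpoint layout with a short script (a minimal sketch; the exact files inside ckpts/text_ckpts/ and ckpts/fusion_ckpts/ depend on which models you downloaded):

import os

# paths taken from the preparation steps above
expected = [
    "ckpts/text_ckpts",                                             # SciBERT, MolT5
    "ckpts/fusion_ckpts/momu",                                      # MoMu
    "ckpts/fusion_ckpts/text2mol/test_outputfinal_weights.320.pt",  # Text2Mol
    "assets/text2mol/cids_to_smiles.pkl",                           # Text2Mol data
]
for path in expected:
    status = "found  " if os.path.exists(path) else "MISSING"
    print(f"{status} {path}")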

Training and Evaluation

You can run the Bash scripts under scripts/multimodal/molcap/:

scripts/multimodal/molcap/
├── train.sh              # train the molecule captioning model
├── test.sh               # test the molecule captioning model
└── test_from_file.sh     # report evaluation metrics for molecule captions in a given file

Example:

bash scripts/multimodal/molcap/train.sh cuda:0 # replace with your own CUDA device or cpu

You can also modify the scripts or directly use the following command:

python open_biomed/tasks/multi_modal_task/molcap.py \
--device DEVICE \                         # gpu device id
--mode MODE \                             # training mode, select from [train, test, traintest]
--config_path CONFIG_PATH \               # configuration file, see configs/molcap/ for more details
--dataset DATASET \                       # dataset name, now only chebi-20 is available
--dataset_path DATASET_PATH \             # path to the dataset
--output_path OUTPUT_PATH \               # path to save checkpoint for training
--caption_save_path CAPTION_SAVE_PATH \   # path to save generated captions
--num_workers NUM_WORKERS \               # number of workers when loading data
--epochs EPOCHS \                         # number of training epochs
--patience PATIENCE \                     # number of tolerant epochs for early-stopping
--weight_decay WEIGHT_DECAY \             # weight decay, default is 0
--lr LR \                                 # learning rate, default is 1e-4
--batch_size BATCH_SIZE \                 # batch size, default is 32
--logging_steps LOGGING_STEPS \           # steps for printing training information
--text2mol_bert_path TEXT2MOL_BERT_PATH \ # path to SciBERT
--text2mol_data_path TEXT2MOL_DATA_PATH \ # path to `cids_to_smiles.pkl`
--text2mol_ckpt_path TEXT2MOL_CKPT_PATH   # path to Text2Mol checkpoint

# If you followed the preparation steps in the sections above, you can leave the text2mol arguments at their default values.
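For example, a training run on ChEBI-20 might look like the following (the config file name and output paths are assumptions for illustration; check configs/molcap/ for the actual configuration files):

python open_biomed/tasks/multi_modal_task/molcap.py \
--device cuda:0 \
--mode train \
--config_path configs/molcap/molt5.json \
--dataset chebi-20 \
--dataset_path datasets/molcap/chebi-20 \
--output_path ckpts/finetune_ckpts/molcap \
--caption_save_path outputs/molcap_captions.txt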