Repository containing tools, training/inference code, and analysis for the MMPT project.
This repo organizes the code and data used to train, run inference with, and analyze MMPT models. It includes training and inference scripts, example data, and analysis notebooks.
- model/ - Training and inference scripts
- analysis/ - Jupyter notebooks and analysis utilities
- data/ - Example input data used by scripts and notebooks
  - chembl_250320_MVR33so.RGP.csv
  - chembl_neutralized_250320_smiles.txt
  - pmv17.mmp.csv
  - pmv_2017_to_pmv_2021_mmps.csv
See the LICENSE file for licensing terms.
- Create and activate a Python virtual environment:

    python -m venv .venv
    source .venv/bin/activate

- Install runtime dependencies:

    pip install -r requirements.txt

Note: requirements.txt pins dependency versions so that the environment is reproducible and the code runs consistently across systems and over time.
Key training flags available in model/train_MMPT.py include:
- --model_type (T5 | GPT | T5Chem | RealT5Chem)
- --dataset (dataset name)
- --epochs, --batch_size, --learning_rate, --output_dir
- --use_wandb to enable Weights & Biases logging
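As an illustration, the flags above can be assembled into a command line from Python. This is a minimal sketch, not the project's documented invocation: the helper `build_train_command` is hypothetical and not part of the repository, the dataset name and output directory are placeholders, and it assumes that `--use_wandb` is a switch that takes no value and that the script is run from the repository root.

```python
import sys

# Model types listed for --model_type in this README.
MODEL_TYPES = ("T5", "GPT", "T5Chem", "RealT5Chem")


def build_train_command(model_type, dataset, output_dir,
                        epochs=None, batch_size=None,
                        learning_rate=None, use_wandb=False):
    """Return an argv list for model/train_MMPT.py (hypothetical helper)."""
    if model_type not in MODEL_TYPES:
        raise ValueError(
            f"model_type must be one of {MODEL_TYPES}, got {model_type!r}"
        )

    cmd = [sys.executable, "model/train_MMPT.py",
           "--model_type", model_type,
           "--dataset", dataset,
           "--output_dir", output_dir]

    # Only pass optional flags that were explicitly set.
    optional = {
        "--epochs": epochs,
        "--batch_size": batch_size,
        "--learning_rate": learning_rate,
    }
    for flag, value in optional.items():
        if value is not None:
            cmd += [flag, str(value)]

    # Assumption: --use_wandb is a switch without a value.
    if use_wandb:
        cmd.append("--use_wandb")
    return cmd


if __name__ == "__main__":
    # "my_dataset" and "runs/t5" are placeholders, not names from the repository.
    print(" ".join(build_train_command("T5", "my_dataset", "runs/t5", epochs=5)))
```

Passing the returned list to `subprocess.run(cmd, check=True)` would launch the script; building the list first makes it easy to log or inspect the exact command.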
Key inference flags available in model/inference_MMPT.py include:
- --model_type, --model_path (required)
- --dataset, --batch_size, --max_length
- --num_beams, --num_return_sequences, --early_stopping
Adjust num_beams and num_return_sequences so that num_beams >= num_return_sequences.
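This constraint can be checked before launching inference. The sketch below is a hypothetical helper, not code from model/inference_MMPT.py; it only encodes the rule stated above, which reflects the fact that beam search cannot return more sequences than the number of beams it keeps.

```python
def check_beam_settings(num_beams: int, num_return_sequences: int) -> None:
    """Raise ValueError unless num_beams >= num_return_sequences >= 1."""
    if num_return_sequences < 1:
        raise ValueError("num_return_sequences must be at least 1")
    if num_beams < num_return_sequences:
        raise ValueError(
            f"num_beams ({num_beams}) must be >= "
            f"num_return_sequences ({num_return_sequences})"
        )


if __name__ == "__main__":
    check_beam_settings(num_beams=10, num_return_sequences=5)  # valid, no error
    try:
        check_beam_settings(num_beams=3, num_return_sequences=5)
    except ValueError as err:
        print(f"Invalid settings: {err}")
```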
Open the notebooks in analysis/ for visualizations and exploratory analyses. Example notebooks:
Note: The analysis notebooks require external sample data files (model outputs, REINVENT results) that are not included in this repository. To use these notebooks:
- Prepare or obtain the required sample data files
- Update the DATA_ROOT configuration variable at the top of each notebook to point to your data directory
- Run the notebook cells with your configured data path
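A notebook's first cell might set and validate the data path along these lines. This is a sketch under assumptions: only the name `DATA_ROOT` comes from this README, while the placeholder path, the helper `resolve_data_file`, and the example file name are hypothetical and should be replaced with whatever each notebook actually expects.

```python
from pathlib import Path

# Placeholder path; point this at the directory holding your sample data.
DATA_ROOT = Path("path/to/your/data")


def resolve_data_file(data_root, relative_name):
    """Return the path to a data file, failing early with a clear message."""
    root = Path(data_root).expanduser()
    if not root.is_dir():
        raise FileNotFoundError(f"DATA_ROOT does not exist: {root}")
    path = root / relative_name
    if not path.is_file():
        raise FileNotFoundError(f"Expected data file is missing: {path}")
    return path


if __name__ == "__main__":
    try:
        # "model_outputs.csv" is a hypothetical file name used for illustration.
        print(resolve_data_file(DATA_ROOT, "model_outputs.csv"))
    except FileNotFoundError as err:
        print(err)
```

Failing early in the first cell avoids running long cells only to hit a missing-file error later.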
The analysis utilities (analysis/calc_properties.py, analysis/canonicalize.py, analysis/get_umap.py) are provided for post-processing model outputs and can be adapted for your own analysis workflows.
To start a notebook server locally:
    pip install jupyterlab
    jupyter lab

Some example datasets are stored in data/.
- GPU not detected: make sure CUDA and a compatible torch build are installed. Check torch.cuda.is_available().
- Tokenizer / model errors: ensure --model_path points to the trained model directory containing the tokenizer and model weights.
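Both checks can be scripted. The following is a minimal diagnostic sketch: the helper names are hypothetical, and because this README does not list the files a trained model directory contains, the second helper only lists the directory for manual inspection instead of checking for specific file names.

```python
import importlib.util
from pathlib import Path


def describe_gpu_status() -> str:
    """Summarize whether the installed torch build can see a CUDA device."""
    if importlib.util.find_spec("torch") is None:
        return "torch is not installed in this environment"
    import torch

    if torch.cuda.is_available():
        return (f"CUDA available: {torch.cuda.device_count()} device(s), "
                f"torch {torch.__version__}")
    # torch.version.cuda is None for CPU-only builds.
    return (f"CUDA not available (torch {torch.__version__}, "
            f"built for CUDA {torch.version.cuda})")


def list_model_dir(model_path):
    """Return the sorted file names in the directory given to --model_path."""
    path = Path(model_path)
    if not path.is_dir():
        raise NotADirectoryError(f"--model_path is not a directory: {path}")
    return sorted(p.name for p in path.iterdir())


if __name__ == "__main__":
    print(describe_gpu_status())
    print(list_model_dir("."))  # replace "." with your model directory
```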
If you need help or want changes to this README (commands, examples, or extra sections), open an issue.
This project is licensed under the terms in the LICENSE file.