- See `modulm.yml`. Run the following command to create a new Anaconda environment named `modulm`:

```bash
conda env create -f modulm.yml
```

- Dataset

- The datasets used in this project can be downloaded from MolTC.
- All data should be placed in the `data` folder.
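If it helps to sanity-check the setup, the short sketch below simply verifies that a `data` folder is present and lists whatever was placed inside it. It assumes you run it from the repository root and does not assume any particular dataset names.

```python
# Minimal sanity check (assumes it is run from the repository root):
# confirm the data/ folder exists and list the datasets placed inside it.
from pathlib import Path

data_dir = Path("data")
if not data_dir.is_dir():
    raise FileNotFoundError("Expected a 'data' folder in the repository root.")
for entry in sorted(data_dir.iterdir()):
    print(entry)
```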
- LLM Backbone Models

- The backbones of different Large Language Models (LLMs) can be downloaded from Hugging Face.
- Please make sure the downloaded LLMs are stored in the `backbone` folder.
- It is worth noting that, to ensure a fair comparison, ModuLM adopts a configuration similar to Galactica's. This means that if you download other families of LLMs as extensions, you will need to make some modifications to the pre-tokenizer in the tokenizer vocabulary; for the specific changes, refer to the tokenizer used in Galactica.
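As a rough illustration of what to look at, the sketch below loads two locally stored backbones with Hugging Face `transformers` and prints how each pre-tokenizes a SMILES string. The folder names under `backbone/` are hypothetical placeholders, and the actual vocabulary edits should still follow the Galactica tokenizer.

```python
# Hedged sketch: compare the pre-tokenizer of a newly downloaded backbone with
# Galactica's before editing the tokenizer vocabulary. The local paths below
# are hypothetical placeholders for whatever you stored under backbone/.
from transformers import AutoTokenizer


def show_pretokenization(path: str, text: str) -> None:
    tok = AutoTokenizer.from_pretrained(path)
    backend = getattr(tok, "backend_tokenizer", None)  # only fast tokenizers expose this
    pre = backend.pre_tokenizer if backend is not None else None
    if pre is None:
        print(path, "-> no inspectable pre-tokenizer")
    else:
        print(path, "->", pre.pre_tokenize_str(text))


smiles = "C1=CC=CC=C1"  # benzene, as a quick probe of SMILES handling
show_pretokenization("backbone/galactica-1.3b", smiles)  # hypothetical path
show_pretokenization("backbone/DeepSeek-1.5B", smiles)   # hypothetical path
```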
- In this section, we explain how to use our ModuLM framework for training.
- We provide dataset processing methods in the `dataproces` folder, including 2D molecular graph processing and 3D molecular conformation processing. You can choose the processing approach that fits your needs, e.g.:

```bash
python ZhangDDI.py
python ChChMiner.py
python ZhangDDI_3d.py
python ChChMiner_3d.py
python CombiSolv.py
python CombiSolv_3d.py
```
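For orientation only, the sketch below contrasts the two input types with RDKit: a 2D molecular graph read off the atoms and bonds versus a 3D conformation produced by embedding. The actual scripts in `dataproces` may process the data differently.

```python
# Illustrative only: the scripts in dataproces/ may differ. This just contrasts
# a 2D molecular graph (atoms + bonds) with a 3D conformation (coordinates).
from rdkit import Chem
from rdkit.Chem import AllChem

smiles = "CCO"  # ethanol as a toy example
mol = Chem.MolFromSmiles(smiles)

# 2D view: the molecular graph
atoms = [atom.GetSymbol() for atom in mol.GetAtoms()]
bonds = [(b.GetBeginAtomIdx(), b.GetEndAtomIdx()) for b in mol.GetBonds()]
print("atoms:", atoms)
print("bonds:", bonds)

# 3D view: embed a conformation and read Cartesian coordinates
mol_3d = Chem.AddHs(mol)
AllChem.EmbedMolecule(mol_3d, randomSeed=42)
print("coordinates shape:", mol_3d.GetConformer().GetPositions().shape)
```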
- We provide the configuration of the best-performing model from our paper; you can run it directly with:

```bash
python demo.py
```
- You can specify the dataset to use with the configuration below; the rest of the dataset configuration can follow the default settings. If you wish to make modifications, you can edit the configuration file yourself.

```json
{
    "root": "data/DDI/DeepDDI/"
}
```

- You can specify the molecular data format by setting `use_3d` or `use_2d` to `true`. Correspondingly, you can choose the encoder for the selected data format, as shown in the following example:
```json
{
    "use_3d": true,
    "graph3d": "unimol"
}
```

- You can select the backbone and choose between fine-tuning and pretraining with the configuration shown below. Note that different methods correspond to different datasets.

```json
{
    "mode": "ft",
    "backbone": "DeepSeek-1.5B",
    "min_len": 10,
    "max_len": 40
}
```
- You can use the default configuration for LoRA that we provide.
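If you do want to adjust it, the sketch below shows what a typical LoRA setup looks like with the `peft` library. The rank, dropout, target modules, and backbone path are illustrative assumptions, not ModuLM's shipped defaults.

```python
# Hedged sketch of a typical LoRA configuration using the peft library.
# The values and module names are illustrative assumptions, not ModuLM's defaults.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

lora_config = LoraConfig(
    r=8,                                   # low-rank dimension
    lora_alpha=32,                         # scaling factor
    lora_dropout=0.1,
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt
    task_type="CAUSAL_LM",
)

model = AutoModelForCausalLM.from_pretrained("backbone/DeepSeek-1.5B")  # hypothetical path
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```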
- The reference configuration for training is as follows:
```json
{
    "batch_size": 12,
    "max_epochs": 30,
    "save_every_n_epochs": 5,
    "scheduler": "linear_warmup_cosine_lr",
    "seed": 42,
    "warmup_lr": 1e-06,
    "warmup_steps": 1000,
    "weight_decay": 0.05
}
```

- The specific methods for calculating the model performance evaluation metrics are provided in the `test_result` folder (see the sketch below).
- The configuration file offers more parameters to choose from; you can modify them according to your specific needs.
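As a rough reference for what such evaluation looks like, the sketch below computes common binary classification scores with scikit-learn on toy arrays; the scripts in `test_result` may use a different or larger metric set.

```python
# Hedged sketch: common metrics for binary DDI-style prediction on toy data.
# The actual scripts in test_result/ may compute a different set of metrics.
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

y_true = np.array([1, 0, 1, 1, 0])            # toy ground-truth labels
y_prob = np.array([0.9, 0.2, 0.7, 0.4, 0.1])  # toy predicted probabilities
y_pred = (y_prob >= 0.5).astype(int)

print("accuracy:", accuracy_score(y_true, y_pred))
print("f1      :", f1_score(y_true, y_pred))
print("roc-auc :", roc_auc_score(y_true, y_prob))
```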