- See `modulm.yml`. Run the following command to create a new Anaconda environment named `modulm`:

```bash
conda env create -f modulm.yml
```

- Dataset

- The datasets used in this project can be downloaded from MolTC.
- All data should be placed in the `data` folder.
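If it helps to sanity-check the setup, the short sketch below simply verifies that a `data` folder is present and lists whatever was placed inside it. It assumes you run it from the repository root and does not assume any particular dataset names.

```python
# Minimal sanity check (assumes it is run from the repository root):
# confirm the data/ folder exists and list the datasets placed inside it.
from pathlib import Path

data_dir = Path("data")
if not data_dir.is_dir():
    raise FileNotFoundError("Expected a 'data' folder in the repository root.")
for entry in sorted(data_dir.iterdir()):
    print(entry)
```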
- LLM Backbone Models

- The backbones of different Large Language Models (LLMs) can be downloaded from Hugging Face.
- Please make sure the downloaded LLMs are stored in the `backbone` folder.
- It is worth noting that, to ensure a fair comparison, ModuLM adopts a configuration similar to Galactica's. This means that if you download other families of LLMs as extensions, you will need to make some modifications to the pre-tokenizer in the tokenizer vocabulary; for the specific changes, refer to the tokenizer used in Galactica.
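As a rough illustration of what to look at, the sketch below loads two locally stored backbones with Hugging Face `transformers` and prints how each pre-tokenizes a SMILES string. The folder names under `backbone/` are hypothetical placeholders, and the actual vocabulary edits should still follow the Galactica tokenizer.

```python
# Hedged sketch: compare the pre-tokenizer of a newly downloaded backbone with
# Galactica's before editing the tokenizer vocabulary. The local paths below
# are hypothetical placeholders for whatever you stored under backbone/.
from transformers import AutoTokenizer


def show_pretokenization(path: str, text: str) -> None:
    tok = AutoTokenizer.from_pretrained(path)
    backend = getattr(tok, "backend_tokenizer", None)  # only fast tokenizers expose this
    pre = backend.pre_tokenizer if backend is not None else None
    if pre is None:
        print(path, "-> no inspectable pre-tokenizer")
    else:
        print(path, "->", pre.pre_tokenize_str(text))


smiles = "C1=CC=CC=C1"  # benzene, as a quick probe of SMILES handling
show_pretokenization("backbone/galactica-1.3b", smiles)  # hypothetical path
show_pretokenization("backbone/DeepSeek-1.5B", smiles)   # hypothetical path
```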
- In this section, we explain how to use our ModuLM framework for training.
- We provide dataset processing methods in the `dataproces` folder, including 2D molecular graph processing and 3D molecular conformation processing. You can choose the processing approach that fits your needs, e.g.:

```bash
python ZhangDDI.py
python ChChMiner.py
python ZhangDDI_3d.py
python ChChMiner_3d.py
python CombiSolv.py
python CombiSolv_3d.py
```
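For orientation only, the sketch below contrasts the two input types with RDKit: a 2D molecular graph read off the atoms and bonds versus a 3D conformation produced by embedding. The actual scripts in `dataproces` may process the data differently.

```python
# Illustrative only: the scripts in dataproces/ may differ. This just contrasts
# a 2D molecular graph (atoms + bonds) with a 3D conformation (coordinates).
from rdkit import Chem
from rdkit.Chem import AllChem

smiles = "CCO"  # ethanol as a toy example
mol = Chem.MolFromSmiles(smiles)

# 2D view: the molecular graph
atoms = [atom.GetSymbol() for atom in mol.GetAtoms()]
bonds = [(b.GetBeginAtomIdx(), b.GetEndAtomIdx()) for b in mol.GetBonds()]
print("atoms:", atoms)
print("bonds:", bonds)

# 3D view: embed a conformation and read Cartesian coordinates
mol_3d = Chem.AddHs(mol)
AllChem.EmbedMolecule(mol_3d, randomSeed=42)
print("coordinates shape:", mol_3d.GetConformer().GetPositions().shape)
```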
- We provide the configuration of the best-performing model from our paper; you can run it directly with:

```bash
python demo.py
```
- You can specify the dataset to use with the configuration below; the rest of the dataset configuration can follow the default settings. If you wish to make modifications, you can edit the configuration file yourself.

```json
{
    "root": "data/DDI/DeepDDI/"
}
```

- You can specify the molecular data format by setting `use_3d` or `use_2d` to `true`. Correspondingly, you can choose the encoder for the selected data format, as shown in the following example:
```json
{
    "use_3d": true,
    "graph3d": "unimol"
}
```

- You can select the backbone and choose between fine-tuning and pretraining with the configuration shown below. Note that different methods correspond to different datasets.

```json
{
    "mode": "ft",
    "backbone": "DeepSeek-1.5B",
    "min_len": 10,
    "max_len": 40
}
```
- You can use the default configuration for LoRA that we provide.
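If you do want to adjust it, the sketch below shows what a typical LoRA setup looks like with the `peft` library. The rank, dropout, target modules, and backbone path are illustrative assumptions, not ModuLM's shipped defaults.

```python
# Hedged sketch of a typical LoRA configuration using the peft library.
# The values and module names are illustrative assumptions, not ModuLM's defaults.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

lora_config = LoraConfig(
    r=8,                                   # low-rank dimension
    lora_alpha=32,                         # scaling factor
    lora_dropout=0.1,
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt
    task_type="CAUSAL_LM",
)

model = AutoModelForCausalLM.from_pretrained("backbone/DeepSeek-1.5B")  # hypothetical path
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```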
- The reference configuration for training is as follows:
```json
{
    "batch_size": 12,
    "max_epochs": 30,
    "save_every_n_epochs": 5,
    "scheduler": "linear_warmup_cosine_lr",
    "seed": 42,
    "warmup_lr": 1e-06,
    "warmup_steps": 1000,
    "weight_decay": 0.05
}
```

- The specific methods for calculating the model performance evaluation metrics are provided in the `test_result` folder (see the sketch below).
- The configuration file offers more parameters to choose from; you can modify them according to your specific needs.
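As a rough reference for what such evaluation looks like, the sketch below computes common binary classification scores with scikit-learn on toy arrays; the scripts in `test_result` may use a different or larger metric set.

```python
# Hedged sketch: common metrics for binary DDI-style prediction on toy data.
# The actual scripts in test_result/ may compute a different set of metrics.
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

y_true = np.array([1, 0, 1, 1, 0])            # toy ground-truth labels
y_prob = np.array([0.9, 0.2, 0.7, 0.4, 0.1])  # toy predicted probabilities
y_pred = (y_prob >= 0.5).astype(int)

print("accuracy:", accuracy_score(y_true, y_pred))
print("f1      :", f1_score(y_true, y_pred))
print("roc-auc :", roc_auc_score(y_true, y_prob))
```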