ModuLM

Requirements

See modulm.yml. Run the following command to create a new conda environment named modulm:

conda env create -f modulm.yml
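After the environment is created, activate it before running any of the scripts below:

conda activate modulm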

Dataset and Backbone

  • Dataset

  • The datasets used in this project can be downloaded from the MolTC repository.
    All data should be placed in the /data folder.

  • LLM Backbone Models

  • The backbones of different Large Language Models (LLMs) can be downloaded from Hugging Face.
    Please make sure the downloaded LLMs are stored in the backbone folder.

  • It is worth noting that, to ensure a fair comparison, ModuLM adopts a configuration similar to Galactica's. This means that if you download other series of LLMs for extension, you will need to modify the pretokenizer in the tokenizer vocabulary; for the specific changes, refer to the tokenizer used in Galactica.
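  • If you want to see how your downloaded LLM's pretokenizer differs from Galactica's, one option is to compare the pre_tokenizer section of the two tokenizer.json files. The sketch below is a minimal example; it assumes both models ship a Hugging Face fast-tokenizer file (tokenizer.json), and the folder names under backbone/ are placeholders for wherever you stored the models.

import json
from pathlib import Path

def pre_tokenizer_config(model_dir):
    # tokenizer.json is the standard Hugging Face fast-tokenizer file;
    # its "pre_tokenizer" entry is the part that should mirror Galactica's setup.
    with open(Path(model_dir) / "tokenizer.json") as f:
        return json.load(f)["pre_tokenizer"]

print(pre_tokenizer_config("backbone/galactica-1.3b"))   # placeholder path
print(pre_tokenizer_config("backbone/DeepSeek-1.5B"))    # placeholder path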

Usage of ModuLM

  • In this section, we explain how to use the ModuLM framework for training.

Data Process

  • We provide dataset processing methods in the dataproces folder, including 2D molecular graph processing and 3D molecular conformation processing. You can choose the processing approach that fits your needs, e.g.:
python ZhangDDI.py
python ChChMiner.py
python ZhangDDI_3d.py
python ChChMiner_3d.py
python CombiSolv.py
python CombiSolv_3d.py
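  • As a rough illustration of what the 3D processing scripts (e.g. ZhangDDI_3d.py) compute, the sketch below generates a single 3D conformation from a SMILES string with RDKit; the actual preprocessing used by ModuLM lives in the dataproces scripts above.

from rdkit import Chem
from rdkit.Chem import AllChem

smiles = "CCO"                                # illustrative molecule (ethanol)
mol = Chem.AddHs(Chem.MolFromSmiles(smiles))  # add explicit hydrogens before embedding
AllChem.EmbedMolecule(mol, randomSeed=42)     # generate one 3D conformer
AllChem.MMFFOptimizeMolecule(mol)             # relax it with the MMFF94 force field
coords = mol.GetConformer().GetPositions()    # (num_atoms, 3) array of xyz coordinates
print(coords.shape)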

ModuLM Config

  • We provide the configuration of the best-performing model from our paper, and you can run it directly with python demo.py.

  • You can specify the dataset to use with the config below; the rest of the dataset configuration can follow the default settings. If you wish to make modifications, edit the configuration file yourself.

{
  "root": "data/DDI/DeepDDI/"
}
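  • Before launching training, it can be useful to check that the dataset root in your config actually points at the processed data. A minimal check, assuming the settings are stored in a JSON file (the file name below is a placeholder):

import json, os

cfg_path = "config.json"   # placeholder: use the configuration file you pass to demo.py
with open(cfg_path) as f:
    cfg = json.load(f)
assert os.path.isdir(cfg["root"]), f"dataset root not found: {cfg['root']}"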
  • You can specify the molecular data format by setting use_3d or use_2d to true. Correspondingly, you can choose an encoder for the selected data format, as shown in the following example:
{
  "use_3d": true,
  "graph3d": "unimol"
}
  • You can select the backbone and choose between fine-tuning and pretraining with the configuration shown below. Note that different methods correspond to different datasets:
{
  "mode": "ft",
  "backbone": "DeepSeek-1.5B",
  "min_len": 10,
  "max_len": 40
}
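  • As a quick check that a downloaded backbone is readable from the backbone folder, you can load it with Hugging Face transformers. A minimal sketch, assuming the folder name mirrors the backbone value in the config:

from transformers import AutoModelForCausalLM, AutoTokenizer

backbone_dir = "backbone/DeepSeek-1.5B"   # placeholder: wherever you stored the weights
tokenizer = AutoTokenizer.from_pretrained(backbone_dir)
model = AutoModelForCausalLM.from_pretrained(backbone_dir)
print(model.config.model_type, "parameters:", sum(p.numel() for p in model.parameters()))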
  • You can use the default configuration for LoRA that we provide.
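  • If you prefer to define the LoRA settings yourself, one way to express such a setup is with the PEFT library, sketched below; the rank, alpha, and target modules are illustrative assumptions, not ModuLM's defaults, and whether ModuLM uses PEFT internally is not assumed here.

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("backbone/DeepSeek-1.5B")  # placeholder path
lora_cfg = LoraConfig(
    r=8,                                   # low-rank dimension (illustrative)
    lora_alpha=32,                         # scaling factor (illustrative)
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections; model-dependent
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()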

  • The reference configuration for training is as follows:

{
  "batch_size": 12,
  "max_epochs": "30",
  "save_every_n_epochs": 5,
  "scheduler": "linear_warmup_cosine_lr",
  "seed": 42,
  "warmup_lr": 1e-06,
  "warmup_steps": 1000,
  "weight_decay": "0.05"
}
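  • The scheduler name linear_warmup_cosine_lr suggests a linear warm-up followed by cosine decay. A PyTorch sketch of that schedule is shown below, wired to the warm-up values from the config above; the peak learning rate and total step count are illustrative, and ModuLM's own implementation may differ.

import math
import torch

def linear_warmup_cosine_lr(optimizer, warmup_steps, total_steps, warmup_lr, base_lr):
    # Linear warm-up from warmup_lr to base_lr, then cosine decay towards zero.
    def lr_lambda(step):
        if step < warmup_steps:
            start = warmup_lr / base_lr
            return start + (1.0 - start) * step / max(1, warmup_steps)
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * progress))
    return torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

params = [torch.nn.Parameter(torch.zeros(1))]                      # stand-in for model parameters
optimizer = torch.optim.AdamW(params, lr=1e-4, weight_decay=0.05)  # peak lr 1e-4 is illustrative
scheduler = linear_warmup_cosine_lr(optimizer, warmup_steps=1000,
                                    total_steps=30_000,            # illustrative
                                    warmup_lr=1e-6, base_lr=1e-4)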
  • The specific methods for calculating the model performance evaluation metrics are provided in the test_result folder; a minimal sketch of typical metrics is shown after this list.
  • The configuration file offers more parameters for you to choose from. You can modify different parameters according to your specific needs.
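  • As a minimal illustration of common classification metrics (the authoritative scripts are in test_result), the sketch below uses scikit-learn on a few illustrative predictions:

from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

y_true = [1, 0, 1, 1, 0]            # illustrative ground-truth labels
y_prob = [0.9, 0.2, 0.7, 0.4, 0.1]  # illustrative predicted probabilities
y_pred = [int(p >= 0.5) for p in y_prob]

print("Accuracy:", accuracy_score(y_true, y_pred))
print("F1:      ", f1_score(y_true, y_pred))
print("AUROC:   ", roc_auc_score(y_true, y_prob))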
