ProtLLM: An Interleaved Protein-Language LLM with Protein-as-Word Pre-Training

Le Zhuo*, Zewen Chi*, Minghao Xu*, Heyan Huang, Heqi Zheng, Conghui He, Xian-Ling Mao, Wentao Zhang

This repository hosts the code, data and model weights of ProtLLM, a versatile cross-modal large language model for both protein-centric and protein-language tasks.

TODOs

  • Release the code for retrieval.
  • Release the raw InterPT dataset.
  • Update the Hugging Face version of ProtLLM.
  • ...

Setup

Environment

  1. Clone this repository and navigate to the ProtLLM folder:
git clone https://github.com/ProtLLM/ProtLLM.git
cd ProtLLM
  2. Install the package:
conda create -n protllm python=3.10 -y
conda activate protllm
pip install -e .
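
To confirm the installation, a quick import check can help; note that protllm as the import name is an assumption based on the repository name, not something this README confirms:

# Assumes the package is importable as "protllm"; adjust if the module name differs.
python -c "import protllm"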

Data & Checkpoints

We release the pre-processed version of our InterPT dataset, all datasets for downstream tasks, and the pre-trained checkpoints on Hugging Face.
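
For example, the data and checkpoints can be fetched with the Hugging Face CLI. This is a minimal sketch only; the repository IDs below are placeholders, so substitute the actual names from our Hugging Face release:

pip install -U "huggingface_hub[cli]"
# Placeholder repo IDs; replace with the released checkpoint and dataset names.
huggingface-cli download ProtLLM/ProtLLM --local-dir ./checkpoints
huggingface-cli download ProtLLM/InterPT --repo-type dataset --local-dir ./data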

Training

Pre-training

For pre-training, first download the pre-processed dataset from Hugging Face, then run the following script:

bash scripts/pretrain.sh

Fine-tuning

We provide fine-tuning scripts to reproduce all results of ProtLLM on various protein-centric tasks, including Enzyme Commission (EC) number prediction, Gene Ontology (GO) term prediction, and Protein-Protein Interaction (PPI) prediction. By default, we use the pre-trained ProtST-ESM-2 as the protein encoder, which can be downloaded from the ProtST repository. After downloading the processed dataset from Hugging Face, you can run the following script to fine-tune ProtLLM on a specific downstream task:

bash scripts/finetune.sh

The detailed hyperparameters and settings for each task can be found in the appendix of our paper. Note that we also fine-tune the weights of the protein encoder for the GO and EC prediction tasks, which can be done by setting --lr_ratio to 0.1 in the fine-tuning script.
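
As a sketch, assuming scripts/finetune.sh forwards extra command-line flags to the training entry point (the actual script may hard-code its arguments instead), enabling encoder tuning for EC prediction might look like:

# Hypothetical invocation: --lr_ratio is named in this README, but the
# pass-through mechanism and the task name "ec" are assumptions.
bash scripts/finetune.sh --task ec --lr_ratio 0.1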

Evaluation

Fine-tuning

After fine-tuning ProtLLM on protein-centric tasks, you can evaluate its performance by running the following script:

bash scripts/eval.sh

Remember to set --task to the target task name and --n_labels to the number of labels for that task. You should also set the LoRA hyperparameters --sft_lora_r and --sft_lora_alpha to the values you used in the fine-tuning script.
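
For instance, assuming scripts/eval.sh passes its arguments through to the evaluation entry point, an EC evaluation might look like the following; the values are illustrative and only the flag names come from this README:

# Illustrative values; match them to your own fine-tuning run.
bash scripts/eval.sh --task ec --n_labels 538 --sft_lora_r 8 --sft_lora_alpha 16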

In-context Learning

Run the following script to perform in-context learning with ProtLLM (using PPI prediction as an example):

bash scripts/icl.sh

You can specify the --n_demo argument to control the number of demonstration samples.
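
For example, assuming the script forwards the flag, a 4-shot PPI run would be:

# --n_demo sets how many demonstration samples precede each query.
bash scripts/icl.sh --n_demo 4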

Contact

If you have any questions related to the code or the paper, feel free to contact Le Zhuo, Zewen Chi, and Minghao Xu.

Citation

If you find our work useful in your research, please consider citing ProtLLM:

@article{zhuo2024protllm,
  title={ProtLLM: An Interleaved Protein-Language LLM with Protein-as-Word Pre-Training},
  author={Le Zhuo and Zewen Chi and Minghao Xu and Heyan Huang and Heqi Zheng and Conghui He and Xian-Ling Mao and Wentao Zhang},
  journal={arXiv preprint arXiv:2403.07920},
  year={2024}
}
