This repository contains code for the paper "InsCL: A Data-efficient Continual Learning Paradigm for Fine-tuning Large Language Models with Instructions". We propose a novel paradigm called Instruction-based Continual Learning (InsCL). InsCL dynamically replays previous data based on task similarity, calculated by Wasserstein Distance with instructions. We further introduce an Instruction Information Metric (InsInfo) to quantify the complexity and diversity of instructions. Guided by InsInfo, InsCL steers the replay process toward high-quality data.
To set up the experiments, install the dependencies from the pip requirements file. We chiefly need PyTorch and the transformers package from Hugging Face; creating a dedicated conda environment is a good idea.
pip install -r requirements.txt
We obtain 16 categories by integrating the English tasks in the SuperNI dataset (loaded from https://github.com/allenai/natural-instructions), and conduct further experiments based on these 16 reallocated tasks. The details of the task composition are shown in Appendix A.2 of the InsCL paper. We randomly hold out 20% of the instances of each task to test the LLM at different training stages, and store the train and test sets in the 'dataset' folder.
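For reference, a minimal sketch of this 80/20 holdout split; the file names and the JSON layout here are illustrative assumptions, not the repository's actual format:

```python
import json
import random

random.seed(42)  # fix the seed so the split is reproducible

def split_task(task_file, train_path, test_path, test_ratio=0.2):
    # Illustrative 80/20 holdout; assumes each task file is a JSON list of instances.
    with open(task_file) as f:
        instances = json.load(f)
    random.shuffle(instances)
    n_test = int(len(instances) * test_ratio)
    with open(test_path, "w") as f:
        json.dump(instances[:n_test], f, indent=2)
    with open(train_path, "w") as f:
        json.dump(instances[n_test:], f, indent=2)
```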
High-performing LLMs demonstrate the ability to annotate queries with tag entities, with precision and consistency verified through manual annotation. Consequently, we employ GPT-4 (OpenAI, 2023) as an intention tagger and clean the raw tags, representing instructions at a fine-grained entity level.
cd replay
python get_api_output.py
You can define your own function to call the API and store the output tags, as in the sketch below.
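A minimal sketch of such a function, assuming the official openai Python client; the prompt wording and the one-tag-per-line output format are illustrative assumptions, not the paper's exact setup:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def tag_instruction(instruction: str) -> list[str]:
    # Ask GPT-4 to annotate the instruction with fine-grained intention tags.
    prompt = (
        "Annotate the following instruction with fine-grained intention tags, "
        "one tag per line:\n" + instruction
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    content = resp.choices[0].message.content
    return [tag.strip() for tag in content.splitlines() if tag.strip()]
```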
After annotating the instructions, run the following command to sample replay data into the 'replay/replay_data' folder.
python InsCL_sampling.py --emb_file_path encoded_ins_dist.pkl --style curriculum_pWdist
Here we load a prepared file from 'emb_file_path' that stores the original instructions, their embeddings, and their distributions. Please generate the corresponding instruction embeddings and modify the file path as needed. The 'style' argument controls the training order of tasks and the calculation method of Wasserstein Distance. When the real distribution of instructions cannot be obtained, remove the '_pWdist' suffix at the end of 'style'.
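For illustration, the Wasserstein Distance between the instruction embeddings of two tasks can be computed with POT roughly as follows; the function and variable names are ours, not the repository's:

```python
import numpy as np
import ot  # POT: Python Optimal Transport

def wasserstein_distance(emb_prev, emb_cur, dist_prev=None, dist_cur=None):
    """Wasserstein Distance between two sets of instruction embeddings.

    emb_prev, emb_cur: (n, d) and (m, d) arrays of instruction embeddings.
    dist_prev, dist_cur: optional probability weights over the instructions;
    uniform weights are used when the real distributions are unavailable.
    """
    n, m = len(emb_prev), len(emb_cur)
    a = dist_prev if dist_prev is not None else np.full(n, 1.0 / n)
    b = dist_cur if dist_cur is not None else np.full(m, 1.0 / m)
    # Cost matrix: pairwise Euclidean distances between embeddings.
    M = ot.dist(emb_prev, emb_cur, metric="euclidean")
    # Exact optimal transport cost (earth mover's distance).
    return ot.emd2(a, b, M)
```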
To merge the replay data with the source training data, run:
bash merge_data.sh
The script calls the training function in a sequential loop to simulate incremental learning through staged fine-tuning.
Here we define 'data_dir' as 'dataset/train' and 'root_model' as 'llama_7B' (loaded from https://huggingface.co/baffo32/decapoda-research-llama-7B-hf). You can modify the dataset path and model path as needed.
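For reference, a minimal Python sketch of the staged loop the script implements; 'train.py', its flags, and the task list are placeholders for the actual training entry point:

```python
import subprocess

DATA_DIR = "dataset/train"
ROOT_MODEL = "llama_7B"
TASKS = ["task_01", "task_02"]  # the 16 reallocated tasks, in training order

model_path = ROOT_MODEL
for stage, task in enumerate(TASKS):
    out_dir = f"checkpoints/stage_{stage}"
    # Each stage fine-tunes the checkpoint produced by the previous stage.
    subprocess.run(
        [
            "python", "train.py",
            "--data_dir", f"{DATA_DIR}/{task}",
            "--model_path", model_path,
            "--output_dir", out_dir,
        ],
        check=True,
    )
    model_path = out_dir  # the next stage continues from this checkpoint
```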
cd ..
bash run_train.sh
Run the script to evaluate the model with Rouge-L.
bash run_evaluate.sh
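Rouge-L can be computed per example with the rouge-score package, for instance as below; this is a sketch, and the repository's evaluation script may aggregate scores differently:

```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

def rouge_l(reference: str, prediction: str) -> float:
    # F-measure of the longest-common-subsequence overlap.
    return scorer.score(reference, prediction)["rougeL"].fmeasure

print(rouge_l("The cat sat on the mat.", "A cat sat on the mat."))
```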
This repo relies on the POT package for calculating Wasserstein Distance. We are grateful to the authors and maintainers of the project.