This is the code repository for the paper *Task-Specific Skill Localization in Fine-tuned Language Models* (appeared at ICML 2023).
- Task-Specific Skill Localization in Fine-tuned Language Models
- Quick Links
- Abstract
- Overview on Grafting
- Fine-tuning and Grafting
- Bugs or Questions
- Citation
Pre-trained language models can be fine-tuned to solve diverse NLP tasks, including in few-shot settings. Thus fine-tuning allows the model to quickly pick up task-specific “skills,” but there has been limited study of where these newly learnt skills reside inside the massive model. This paper introduces the term skill localization for this problem and proposes a solution. Given the downstream task and a model fine-tuned on that task, a simple optimization is used to identify a very small subset of parameters (~0.01% of model parameters) responsible for more than 95% of the model’s performance, in the sense that grafting the fine-tuned values for just this tiny subset onto the pre-trained model performs almost as well as the fine-tuned model. While reminiscent of recent works on parameter-efficient fine-tuning, the novel aspects here are that: (i) no further re-training is needed on the subset (unlike, say, with lottery tickets); (ii) notable improvements are seen over vanilla fine-tuning with respect to calibration of predictions in-distribution (40-90% error reduction) as well as the quality of predictions out-of-distribution (OOD). In models trained on multiple tasks, a stronger notion of skill localization is observed: the sparse regions corresponding to different tasks are almost disjoint, and their overlap (when it happens) is a proxy for task similarity.
We provide a yml file that can be used to create a conda environment containing all the necessary packages.
Create the environment with `conda env create -n icl_as_ft --file task_skill.yml`.
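Then activate it with `conda activate icl_as_ft` before running any of the commands below.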
Please follow the repository LM-BFF-main to download all the data. The rest of the code assumes that there is a `data` folder containing all the necessary datasets.
Please create two new folders, `log_files` and `ckpt_paths`, before running the code below. The results of fine-tuning and grafting are stored in a log file inside `log_files` (please see run_experiment.sh and run_graft_experiment.sh for the filenames), and the model checkpoints are stored in `ckpt_paths`.
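For example, from the repository root:

```bash
mkdir -p log_files ckpt_paths
```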
To run grafting, we need both a pre-trained model and a fine-tuned model (please refer to the overview section for the grafting formulation). This section provides code to train RoBERTa and GPT-2 models on different downstream tasks.
Please refer to run_experiment.sh for the arguments that we use to train a model on a task. For all the SGD-trained single-task models in the paper, the current command line in run_task.sh suffices. A filled-in example invocation follows the argument list below.
```bash
TAG=exp \
TYPE=$TYPE \
TASK=$TASK \
K=$K \
BS=$BS \
LR=$LR \
SEED=$seed \
modelseed=$modelseed \
uselmhead=$uselmhead \
useCLS=$useCLS \
max_step=$max_step \
fixhead=$fixhead \
fixembeddings=$fixembeddings \
MODEL=$model \
train_bias_only=$train_bias_only \
MODELNAME=$model \
bash run_experiment.sh
```
- `TYPE`: There are three types of training:
  - `finetune`: Linear head-tuning of BERT-based models
  - `prompt`: Prompt-based fine-tuning of BERT-based models
  - `autoregressive`: Prompt-based fine-tuning of GPT models
- `TASK`: Please refer to run_experiment.sh for the exact task names to use
- `K`: Number of training samples per class
- `BS`: Batch size to use for training
- `LR`: Learning rate to use for training
- `SEED`: Data seed
- `modelseed`: Model seed for training
- `uselmhead`: Whether to use the language model head (true for prompt-based training)
- `useCLS`: Whether to use CLS (for finetune experiments; if False, we train a linear head on mask tokens)
- `train_bias_only`: Whether to train biases only
- `fixembeddings`: Whether to fix embeddings (helps for SGD-based training)
- `fixhead`: Whether to fix the head (we fix the LM head in all our prompt-based training experiments)
- `MODEL`: roberta-base/gpt-2
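For example, a prompt-based fine-tuning run of roberta-base on SST-2 might look like the following. The values here are illustrative placeholders, not the exact grids used in the paper; consult run_experiment.sh for those.

```bash
TAG=exp \
TYPE=prompt \
TASK=SST-2 \
K=16 \
BS=4 \
LR=1e-5 \
SEED=42 \
modelseed=0 \
uselmhead=true \
useCLS=false \
max_step=1000 \
fixhead=true \
fixembeddings=true \
MODEL=roberta-base \
train_bias_only=false \
MODELNAME=roberta-base \
bash run_experiment.sh
```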
With a pre-trained and a fine-tuned model in hand, we perform grafting to find a minimal set of parameters from the fine-tuned model that can be grafted onto the pre-trained model while still retaining most of the fine-tuned model's performance.
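Roughly, and in slightly simplified notation, grafting searches for a sparse binary mask $\gamma \in \{0,1\}^d$ over the $d$ model parameters and evaluates the grafted model

$$\theta_{\mathrm{graft}} = \gamma \odot \theta_{\mathrm{ft}} + (1 - \gamma) \odot \theta_{\mathrm{pre}},$$

where $\theta_{\mathrm{pre}}$ and $\theta_{\mathrm{ft}}$ are the pre-trained and fine-tuned parameters, respectively. The mask is optimized to minimize the task loss of $\theta_{\mathrm{graft}}$ under a sparsity budget (controlled by `sparsitylevel` below).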
Please refer to run_graft_experiment.sh for the arguments that we use to train a graft on a task. For all the SGD-trained single-task models in the paper, the current command line in run_graft_task.sh suffices. An illustrative invocation follows the argument list below.
```bash
TAG=exp \
TYPE=$TYPE \
TASK=$TASK \
K=$K \
LR=$lr \
SEED=$seed \
MODEL=$model_path \
modelseed=$modelseed \
uselmhead=$uselmhead \
useCLS=$useCLS \
num_train_epochs=100 \
mask_path=$mask_path \
sparsitylevel=$sparsitylevel \
pretrained_model=$modelbase \
fixhead=$fixhead \
fixembeddings=$fixembeddings \
truncate_head=True \
no_train=$no_train \
checkpoint_location=$location \
bash run_graft_experiment.sh
```
- `TYPE`: Same as the trained model
- `TASK`: Please refer to run_experiment.sh for the exact task names to use
- `K`: Number of training samples per class
- `LR`: Learning rate to use for graft training
- `SEED`: Data seed
- `modelseed`: Seed for graft training
- `uselmhead`: Same as the trained model's hyperparameter
- `useCLS`: Same as the trained model's hyperparameter
- `train_bias_only`: Same as the trained model's hyperparameter
- `fixembeddings`: Same as the trained model's hyperparameter
- `fixhead`: Same as the trained model's hyperparameter
- `MODEL`: Path to the fine-tuned model
- `pretrained_model`: roberta-base/gpt-2
- `sparsitylevel`: Sparsity level of the base patch
- `mask_path`: Path to store the mask
- `no_train`: Whether to skip mask training (if True, we load the mask from `checkpoint_location` instead of training one)
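Putting it together, an illustrative grafting run might look like this. All paths and values below are placeholders (e.g. the checkpoint and mask paths are hypothetical), not the paper's exact settings; see run_graft_experiment.sh for those.

```bash
TAG=exp \
TYPE=prompt \
TASK=SST-2 \
K=16 \
LR=1e-2 \
SEED=42 \
MODEL=ckpt_paths/my_finetuned_model \
modelseed=0 \
uselmhead=true \
useCLS=false \
num_train_epochs=100 \
mask_path=ckpt_paths/my_mask \
sparsitylevel=0.0001 \
pretrained_model=roberta-base \
fixhead=true \
fixembeddings=true \
truncate_head=True \
no_train=False \
checkpoint_location=ckpt_paths \
bash run_graft_experiment.sh
```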
If you have any questions related to the code, feel free to email Abhishek (ap34@cs.princeton.edu). If you encounter a problem or bug when using the code, you can also open an issue.
Please cite our paper if you make use of our code:
```bibtex
@article{panigrahi2023task,
  title={Task-Specific Skill Localization in Fine-tuned Language Models},
  author={Panigrahi, Abhishek and Saunshi, Nikunj and Zhao, Haoyu and Arora, Sanjeev},
  journal={arXiv preprint arXiv:2302.06600},
  year={2023}
}
```