This directory contains code and instructions in support of our paper RepoFusion: Training Code Models to Understand Your Repository. For more details, please see the paper.
- Python >= 3.7
- PyTorch
- Transformers
- Accelerate
- Datasets
- NumPy
- pandas
- tqdm
We create and release Stack-Repo, a dataset of 200 Java repositories from GitHub with permissive licenses and near-deduplicated files, augmented with three types of repository contexts:
- Prompt Proposal (PP) Contexts: These contexts are based on the prompt proposals from the paper [Repository-Level Prompt Generation for Large Language Models of Code](https://arxiv.org/abs/2206.12839).
- BM25 Contexts: These contexts are obtained based on BM25 similarity scores.
- RandomNN Contexts: These contexts are obtained using the nearest neighbors in the representation space of an embedding model.

The dataset contains three folders corresponding to our train, validation, and test splits. Each split contains a separate folder per repository, where each repository folder holds all the .java files of the repository in their original directory structure, along with three .json files corresponding to the PP, BM25, and RandomNN repo contexts.
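For intuition about the BM25 contexts above, the sketch below scores tokenized candidate code chunks against a query with Okapi BM25. This is only an illustration: the tokenization, chunking, and parameter values (`k1`, `b`) actually used to build Stack-Repo are not specified here and are assumptions.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs_tokens, k1=1.2, b=0.75):
    """Okapi BM25 score of each tokenized document against the query."""
    n_docs = len(docs_tokens)
    avgdl = sum(len(d) for d in docs_tokens) / n_docs
    # document frequency and smoothed idf for each distinct query term
    df = {t: sum(1 for d in docs_tokens if t in d) for t in set(query_tokens)}
    idf = {t: math.log((n_docs - df[t] + 0.5) / (df[t] + 0.5) + 1) for t in df}
    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        # length normalization factor for this document
        norm = k1 * (1 - b + b * len(doc) / avgdl)
        scores.append(sum(
            idf[t] * tf[t] * (k1 + 1) / (tf[t] + norm)
            for t in query_tokens if t in tf
        ))
    return scores
```

The highest-scoring chunks would then be kept as the repo contexts for a given hole.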
For more details of the dataset, including usage and download instructions, please see here.
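The split/repository layout described above can be traversed with a short sketch like the following. It only enumerates files; the exact schema of the context .json files is not documented here, so nothing is assumed about their contents.

```python
from pathlib import Path

def iter_repos(split_dir):
    """Yield (repo_name, java_files, context_files) for every repository
    folder inside a Stack-Repo split (train, validation, or test)."""
    for repo in sorted(p for p in Path(split_dir).iterdir() if p.is_dir()):
        java_files = sorted(repo.rglob("*.java"))     # original directory structure
        context_files = sorted(repo.rglob("*.json"))  # PP, BM25, RandomNN contexts
        yield repo.name, java_files, context_files
```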
The trained checkpoints can be downloaded here. We have released the following checkpoints:

- RepoFusion_PPC: RepoFusion model trained with prompt proposal repo contexts. This is our best-performing model.
- RepoFusion_BM25: RepoFusion model trained with BM25 repo contexts.
- RepoFusion_RandomNN: RepoFusion model trained with RandomNN repo contexts.
- finetuned_codet5base_512: Our finetuned CodeT5-base model. This was used as the initialization for our RepoFusion models.
- finetuned_codet5large_512: Our finetuned CodeT5-large model. This was used as a baseline.
RepoFusion can be trained using train_reader.py and evaluated with test_reader.py. Please see RepoFusion/src/options.py for a complete list of arguments.

train_reader.py provides the code to train a model. An example usage of the script is given below:
```shell
# --n_context: number of repo contexts, N in the paper
# --text_maxlength: max length of a repo context, l in the paper
# --initialize_from_pretrained: whether to initialize from a pretrained or finetuned model
# --passage_mode: type of repo context creation and ordering strategy to use;
#   no-truncation-codex-last corresponds to NT-Prior-Last in the paper
torchrun --standalone --nnodes=1 --nproc_per_node=1 RepoFusion/train_reader.py \
    --dataset_path=../the-stack-repo/ \
    --data_file_pattern='*/hole_and_PP_contexts.json' \
    --n_context=32 \
    --text_maxlength=768 \
    --scheduler=linear \
    --lr=1e-5 \
    --save_freq=5000 \
    --initialize_from_pretrained \
    --finetuned_model_path=../trained_checkpoints/finetuned_codet5base_512 \
    --passage_mode=no-truncation-codex-last \
    --checkpoint_dir=../checkpoints/ \
    --name=PPC_768_32
```
You can evaluate your model or a pretrained model with test_reader.py. An example usage of the script for evaluating a trained RepoFusion model is provided below:
```shell
# --num_of_eval_examples_per_gpu: -1 evaluates all examples
torchrun --standalone --nnodes=1 --nproc_per_node=1 RepoFusion/test_reader.py \
    --dataset_path=../the-stack-repo/ \
    --eval_split_name=test \
    --output_dir=../../outputs/ \
    --passage_mode=no-truncation-codex-last \
    --trained_model_path=../trained_checkpoints/RepoFusion_PPC \
    --n_context=32 \
    --text_maxlength=768 \
    --per_gpu_batch_size 1 \
    --num_of_eval_examples_per_gpu=-1 \
    --name=PPC_test_768_32
```
To evaluate with a model other than CodeT5-base, add arguments for model_name, model_size, and model_type. For example, to evaluate CodeGen-2B-multi, add --model_name=Salesforce/codegen-2B --model_size=multi --model_type=codegen to the above command.

In addition, use passage_mode=toprule+prior for post+prior experiments and passage_mode=pretrained for prior experiments. When evaluating the finetuned CodeT5 models, use passage_mode=finetuned. In all these cases, do not provide the trained_model_path argument.
The predictions from the models, along with the ground truth, can be found in the outputs directory. Use calculate_metrics.py to calculate the completion success rates.
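For intuition, the completion success rate can be sketched as an exact-match rate over (prediction, ground-truth) pairs. The whitespace normalization here is an assumption for illustration; the actual criterion is implemented in calculate_metrics.py.

```python
def completion_success_rate(predictions, references):
    """Fraction of holes completed successfully, where success is taken
    here to mean an exact match after stripping surrounding whitespace
    (an assumed normalization)."""
    assert len(predictions) == len(references)
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return hits / len(predictions)
```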
For details on the implementation of finetuning CodeT5 models, please refer to the README file inside the Finetuning_CodeT5 directory.
Most of our implementation is built on top of the following sources:
[1] G. Izacard, E. Grave. Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering.

@misc{izacard2020leveraging,
  title={Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering},
  author={Gautier Izacard and Edouard Grave},
  url={https://arxiv.org/abs/2007.01282},
  year={2020},
  publisher={arXiv},
}
[2] D. Shrivastava, H. Larochelle, D. Tarlow. Repository-Level Prompt Generation for Large Language Models of Code.

@article{shrivastava2022repository,
  title={Repository-level prompt generation for large language models of code},
  author={Shrivastava, Disha and Larochelle, Hugo and Tarlow, Daniel},
  journal={arXiv preprint arXiv:2206.12839},
  year={2022}
}
If you use our data, trained checkpoints, or code, please cite us as below:

@article{shrivastava2023repofusion,
  title={RepoFusion: Training Code Models to Understand Your Repository},
  author={Shrivastava, Disha and Kocetkov, Denis and de Vries, Harm and Bahdanau, Dzmitry and Scholak, Torsten},
  journal={arXiv preprint arXiv:2306.10998},
  year={2023}
}
See the LICENSE file for more details.