This directory contains code and instructions in support of our paper RepoFusion: Training Code Models to Understand Your Repository. For more details, please see the paper.



Download data

We create and release Stack-Repo, a dataset of 200 Java repositories from GitHub with permissive licenses and near-deduplicated files that are augmented with three types of repository contexts.

  • Prompt Proposal (PP) Contexts: These contexts are based on the prompt proposals from the paper Repository-Level Prompt Generation for Large Language Models of Code [2].

  • BM25 Contexts: These contexts are obtained based on the BM25 similarity scores.

  • RandomNN Contexts: These contexts are obtained using the nearest neighbors in the representation space of an embedding model.

The dataset contains three folders corresponding to our train, validation, and test splits. Each split contains a separate folder for each repository. Each repository folder contains all the .java files of the repository in their original directory structure, along with three .json files corresponding to the PP, BM25, and RandomNN repo contexts.

For more details of the dataset including the usage and instructions to download, please see here.
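To get a feel for the layout described above, the following is a minimal sketch that walks one split folder and reports, for each repository, how many .java files it holds and which context .json files it finds. The function name `summarize_split` is hypothetical, and the sketch assumes the context files sit at the top level of each repository folder; it does not parse the JSON contents, since their schema is not described here.

```python
from pathlib import Path


def summarize_split(split_dir):
    """Map each repository folder in a split to its .java count and context files."""
    summary = {}
    for repo_dir in sorted(Path(split_dir).iterdir()):
        if not repo_dir.is_dir():
            continue  # skip stray files next to the repository folders
        # .java files keep the original directory structure, so search recursively.
        num_java_files = sum(1 for _ in repo_dir.rglob("*.java"))
        # Assumption: the context .json files are stored directly in the repo folder.
        context_files = sorted(p.name for p in repo_dir.glob("*.json"))
        summary[repo_dir.name] = {
            "num_java_files": num_java_files,
            "context_files": context_files,
        }
    return summary
```

For example, `summarize_split("../the-stack-repo/test")` would return one entry per repository, assuming the split folder is named `test`.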

Trained Checkpoints

The trained checkpoints can be downloaded here. We have released the following checkpoints:

  • RepoFusion_PPC: RepoFusion model trained with prompt proposal repo contexts. This is our best-performing model.

  • RepoFusion_BM25: RepoFusion model trained with BM25 repo contexts.

  • RepoFusion_RandomNN: RepoFusion model trained with RandomNN repo contexts.

  • finetuned_codet5base_512: Our finetuned CodeT5-base model. This was used as initialization for our RepoFusion models.

  • finetuned_codet5large_512: Our finetuned CodeT5-large model. This was used as a baseline.

Implementation of RepoFusion

RepoFusion can be trained and evaluated with the scripts described below. Please see RepoFusion/src/ for a complete list of arguments.

The training script provides the code to train a model. An example usage of the script is given below:

# --n_context: number of repo contexts, $N$ in the paper
# --text_maxlength: max length of each repo context, $l$ in the paper
# --initialize_from_pretrained: whether to initialize from a pretrained model or a finetuned model
# --passage_mode: repo context creation and ordering strategy to use;
#   no-truncation-codex-last corresponds to NT-Prior-Last in the paper
torchrun --standalone --nnodes=1 --nproc_per_node=1 RepoFusion/ \
        --dataset_path=../the-stack-repo/ \
        --data_file_pattern=*/hole_and_PP_contexts.json \
        --n_context=32 \
        --text_maxlength=768 \
        --scheduler=linear \
        --lr=1e-5 \
        --save_freq=5000 \
        --initialize_from_pretrained \
        --finetuned_model_path=../trained_checkpoints/finetuned_codet5base_512 \
        --passage_mode=no-truncation-codex-last \
        --checkpoint_dir=../checkpoints/


You can evaluate your own model or a pretrained model with the evaluation script. An example usage of the script for evaluating a trained RepoFusion model is provided below.

# --num_of_eval_examples_per_gpu=-1 evaluates on all examples
torchrun --standalone --nnodes=1 --nproc_per_node=1 RepoFusion/ \
        --dataset_path=../the-stack-repo/ \
        --eval_split_name=test \
        --output_dir=../../outputs/ \
        --passage_mode=no-truncation-codex-last \
        --trained_model_path=../trained_checkpoints/RepoFusion_PPC \
        --n_context=32 \
        --text_maxlength=768 \
        --per_gpu_batch_size 1 \
        --num_of_eval_examples_per_gpu=-1

To evaluate with a model other than CodeT5-base, add arguments for model_name, model_size and model_type. For example, to evaluate CodeGen-2B-multi, add --model_name=Salesforce/codegen-2B --model_size=multi --model_type=codegen to the above command.

In addition, use passage_mode=toprule+prior for post+prior experiments and passage_mode=pretrained for prior experiments. When evaluating on CodeT5 finetuned models, use passage_mode=finetuned. In all these cases, do not provide the trained_model_path argument.
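The rules above can be summarized in a small helper that assembles the experiment-specific flags. This is only an illustrative sketch: `build_eval_args` and the experiment labels used as dictionary keys are hypothetical and not part of the repository. The flag names and values are the ones mentioned in this README, and `no-truncation-codex-last` is used for RepoFusion only because it is the mode shown in the example command.

```python
# Hypothetical mapping from experiment type to the passage_mode value named above.
PASSAGE_MODES = {
    "repofusion": "no-truncation-codex-last",  # mode used in the example command
    "post+prior": "toprule+prior",
    "prior": "pretrained",
    "finetuned": "finetuned",
}


def build_eval_args(experiment, trained_model_path=None,
                    model_name=None, model_size=None, model_type=None):
    """Return the experiment-specific command-line flags as a list of strings."""
    if experiment not in PASSAGE_MODES:
        raise ValueError(f"unknown experiment type: {experiment!r}")
    args = [f"--passage_mode={PASSAGE_MODES[experiment]}"]
    if experiment == "repofusion":
        if trained_model_path is not None:
            args.append(f"--trained_model_path={trained_model_path}")
    elif trained_model_path is not None:
        # For the other modes, trained_model_path must not be provided.
        raise ValueError("do not pass trained_model_path with this passage_mode")
    # Optional overrides for models other than CodeT5-base.
    for flag, value in (("model_name", model_name),
                        ("model_size", model_size),
                        ("model_type", model_type)):
        if value is not None:
            args.append(f"--{flag}={value}")
    return args
```

The returned list can be appended to the shared flags of the evaluation command (dataset path, output directory, and so on).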

The predictions from the models, along with the ground truth, can be found in the outputs directory. Use the accompanying script to calculate the completion success rates.
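As a rough illustration of what a completion success rate measures, the sketch below counts a prediction as successful when it matches the ground truth exactly after trimming surrounding whitespace. Both the matching rule and the in-memory list of pairs are assumptions made for this example; the format of the files in the outputs directory is not described here, and the repository's own script should be used for reported numbers.

```python
def completion_success_rate(pairs):
    """Fraction of (prediction, ground_truth) pairs that match exactly.

    Assumption: surrounding whitespace is ignored when comparing.
    """
    pairs = list(pairs)
    if not pairs:
        return 0.0  # avoid dividing by zero when there is nothing to score
    hits = sum(1 for prediction, truth in pairs
               if prediction.strip() == truth.strip())
    return hits / len(pairs)
```

For example, if three out of four predictions match their ground truth, the function returns 0.75.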

Finetuning CodeT5

For details on the implementation of finetuning CodeT5 models, please refer to the README file inside the Finetuning_CodeT5 directory.


Most of our implementation is built on top of the following sources:

[1] G. Izacard, E. Grave. Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering. arXiv.

[2] D. Shrivastava, H. Larochelle, D. Tarlow. Repository-Level Prompt Generation for Large Language Models of Code. arXiv preprint arXiv:2206.12839.


If you use our data, trained checkpoints or code, please cite us as below:

D. Shrivastava, D. Kocetkov, H. de Vries, D. Bahdanau, T. Scholak. RepoFusion: Training Code Models to Understand Your Repository. arXiv preprint arXiv:2306.10998.


See the LICENSE file for more details.


This repository contains code for data preparation and experiments on pretraining LLMs with repository-level context in various ways.