C# Data Extraction for "Learning to Represent Edits"

Learning to Represent Edits

This repo contains scripts to extract the GitHub code-edit datasets used in "Learning to Represent Edits" by Yin et al., 2018.


First, create a conda environment that includes all required libraries.

conda env create -f environment.yml

source activate github_edits  # activate the environment

You also need to install .NET Core 2.1.

Run the script in the repo's root folder


This script will (1) crawl GitHub to clone the repos listed in sampled_repos.txt; (2) extract commits using DumpCommitData/; and (3) filter the extracted commits, canonicalize them (e.g., by renaming locally defined variables), and extract the Abstract Syntax Trees of the previous and updated code in each commit.
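
To make the canonicalization in step (3) concrete, here is a minimal Python sketch of renaming locally defined variables to placeholder names. It is an illustration only, not the repo's implementation, which runs in the C# tool under DumpCommitData/ and has access to real syntax information. The function name `canonicalize_locals`, the `VAR0`, `VAR1`, ... placeholder scheme, and the assumption that the set of local names is already known are all hypothetical.

```python
def canonicalize_locals(tokens, local_names):
    """Rename locally defined variables to VAR0, VAR1, ... (hypothetical scheme).

    tokens      -- list of code tokens (strings)
    local_names -- set of identifiers known to be locally defined (assumed given)

    Placeholders are assigned in order of first appearance, so the same
    variable always maps to the same placeholder within one snippet.
    """
    mapping = {}
    renamed = []
    for tok in tokens:
        if tok in local_names:
            if tok not in mapping:
                mapping[tok] = f"VAR{len(mapping)}"
            renamed.append(mapping[tok])
        else:
            # Keywords, punctuation, and non-local identifiers are left alone.
            renamed.append(tok)
    return renamed, mapping
```

Applying the same mapping to both the previous and the updated code of a commit would keep the two versions aligned, so that only the actual edit differs between them.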

The final output, DumpCommitData/github_commits.dataset.jsonl, is a JSON Lines file in which each line is one JSON-serialized entry. Each entry has the following fields:

| Field | Description |
| --- | --- |
| Id | Id of the entry, format is `{ProjectName} |
| PrevCodeChunk | Untokenized previous code (i.e., code before editing) |
| UpdatedCodeChunk | Untokenized updated code (i.e., code after editing) |
| PrevCodeChunkTokens | Tokenized previous code |
| UpdatedCodeChunkTokens | Tokenized updated code |
| PrevCodeAST | Json-serialized Abstract Syntax Tree of the previous code |
| UpdatedCodeAST | Json-serialized Abstract Syntax Tree of the updated code |
| PrecedingContext | Tokenized 3 lines of code before the edit |
| SucceedingContext | Tokenized 3 lines of code after the edit |
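
As a rough guide to consuming the output, the following Python sketch reads the file one line at a time and pulls out a few of the fields listed above. The helper names `read_edits` and `summarize` are hypothetical, and the sketch assumes that the tokenized fields are stored as JSON arrays; check a line of your own output before relying on that.

```python
import json


def read_edits(path):
    """Yield one dict per non-empty line of a JSON Lines file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


def summarize(entry):
    """Return the entry id and the token counts before and after the edit.

    Assumes PrevCodeChunkTokens and UpdatedCodeChunkTokens are JSON arrays.
    """
    return {
        "id": entry["Id"],
        "prev_tokens": len(entry["PrevCodeChunkTokens"]),
        "updated_tokens": len(entry["UpdatedCodeChunkTokens"]),
    }
```

For example, `for entry in read_edits("DumpCommitData/github_commits.dataset.jsonl"): print(summarize(entry))` would print one summary per extracted edit. Reading lazily with a generator avoids loading the whole dataset into memory.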


If you use this extractor in academic work, please consider citing:

   author = {{Yin}, P. and {Neubig}, G. and {Allamanis}, M. and {Brockschmidt}, M. and {Gaunt}, A.~L.},
   title = "{Learning to Represent Edits}",
   journal = {ArXiv e-prints},
   archivePrefix = "arXiv",
   eprint = {1810.13337},
   year = 2018,
   month = oct,


This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit

When you submit a pull request, a CLA-bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., label, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact with any additional questions or comments.
