Research Artifact - Generative AI for Pull Request Descriptions: Adoption, Impact, and Developer Interventions

NAIST-SE/CopilotForPRsEarlyAdoption

This is the research artifact for "Generative AI for Pull Request Descriptions: Adoption, Impact, and Developer Interventions". The repository lists the PRs studied on GitHub, both with and without the use of Copilot for PRs. It also provides the features of PRs that were or were not generated by Copilot for PRs (for RQ2), the coding results for RQ3, and the scripts. The artifact is intended to let researchers replicate the results of the paper and reuse the Copilot for PRs dataset in further research.

The following three research questions were constructed to guide the study.

  • RQ1: To what extent do developers use Copilot for PRs in the code review process?
  • RQ2: How are the code reviews affected by the use of Copilot for PRs?
    • RQ2.1: Is there a relationship between the use of Copilot for PRs and review time?
    • RQ2.2: Is there a relationship between the use of Copilot for PRs and the likelihood of a PR being merged?
  • RQ3: How do developers adopt the content suggested by Copilot?
    • RQ3.1: What kinds of supplementary information complement the content suggested by Copilot?
    • RQ3.2: What kinds of content suggested by Copilot undergo subsequent editing by developers?

This artifact provides datasets, scripts, and other relevant material, structured as follows:

Contents

  • data - a directory of the dataset
    • LLMPRs.csv - a list of PRs powered by Copilot for PRs
    • control_prs_df.csv - a list of PRs not powered by Copilot for PRs
    • LLMPRsComments.csv - raw data of PR comments from PRs powered by Copilot for PRs
    • control_comments_df.csv - raw data of PR comments from PRs not powered by Copilot for PRs
    • cleanedLLMPRsComments.csv - a list of cleaned PR comments (bot removal) from PRs powered by Copilot for PRs
    • cleaned_control_comments_df.csv - a list of cleaned PR comments (bot removal) from PRs not powered by Copilot for PRs
    • edit_contents.csv - raw data of editorial histories of PRs powered by Copilot for PRs
    • edit_contents_developers.csv - editorial histories of PRs powered by Copilot for PRs after filtering
    • edit_contents_developers_with_diff.csv - PRs powered by Copilot for PRs with Post-Copilot Edits
    • control_metrics.csv - the metrics used for R scripts from the control group
    • treatment_metrics.csv - the metrics used for R scripts from the treatment group
    • groundtruthbots.csv - a list of bots from Golzadeh et al.
    • coded_sample.csv - the coded editorial revisions in RQ3
  • scripts - a directory of the scripts
    • env - a directory of environmental variables
      • tokens.txt - a list of GitHub access tokens
    • CollectCopilot4prs.ipynb - Notebook file that collects raw data for RQ1-3
    • ParseHistory.ipynb - Notebook file that parses editorial revisions and prepares for manual inspection in RQ3
    • BuildingResults.ipynb - Notebook file that builds results in RQ1-3
    • PMW_review.R - R script that uses propensity score weighting to estimate review time in RQ2.1
    • PMW_merge.R - R script that uses propensity score weighting to estimate the likelihood of a PR being merged in RQ2.2
  • LICENSE.md - MIT License
  • README.md - this file
  • requirements.txt - required libraries for Notebook files
  • requirements_for_R_scripts.txt - required packages for R scripts
  • STATUS.txt - targeting ACM badges
  • INSTALL.txt - installation process of this artifact
  • FSE_Copilots_For_PRs.pdf - a copy of the accepted paper in PDF format
  • .gitattributes
  • .gitignore
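
The two R scripts listed above apply propensity score weighting to compare the treatment group (PRs with Copilot for PRs) against the control group. As a rough illustration of the general idea only, the Python sketch below applies inverse probability weights to propensity scores that are assumed to be already estimated, then compares weighted group means. The function names, the toy values, and the choice of inverse probability weighting are assumptions made for this example; the weighting scheme and covariates actually used are defined in PMW_review.R and PMW_merge.R.

```python
from typing import List, Sequence


def ipw_weights(treated: Sequence[int], propensity: Sequence[float]) -> List[float]:
    """Inverse probability weights: 1/p for treated units, 1/(1-p) for controls."""
    weights = []
    for t, p in zip(treated, propensity):
        if not 0.0 < p < 1.0:
            raise ValueError("propensity scores must lie strictly between 0 and 1")
        weights.append(1.0 / p if t == 1 else 1.0 / (1.0 - p))
    return weights


def weighted_mean_difference(outcome: Sequence[float],
                             treated: Sequence[int],
                             propensity: Sequence[float]) -> float:
    """Weighted mean outcome of treated units minus that of control units."""
    w = ipw_weights(treated, propensity)

    def wmean(group: int) -> float:
        num = sum(wi * y for wi, y, t in zip(w, outcome, treated) if t == group)
        den = sum(wi for wi, t in zip(w, treated) if t == group)
        return num / den

    return wmean(1) - wmean(0)


if __name__ == "__main__":
    # Toy values only (hypothetical, not taken from the dataset).
    review_hours = [10.0, 14.0, 20.0, 30.0]
    used_copilot = [1, 1, 0, 0]
    scores = [0.5, 0.25, 0.5, 0.2]
    print(weighted_mean_difference(review_hours, used_copilot, scores))
```

A negative result in this toy setting would mean a shorter weighted mean review time for the treated group; it says nothing about the findings of the paper.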

Provenance

The replication package comprises scripts and a dataset, accessible at DOI

Important Notice

As of December 15, 2023, GitHub has discontinued the Copilot for PRs feature, converting the copilot4prs bot to a ghost account. To replicate the dataset, replace 'copilot4prs' with 'ghost' in the CollectCopilot4prs.ipynb notebook.
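
The substitution can be illustrated with a small sketch. The constant name, the helper function, and the record layout below are hypothetical and are not taken from CollectCopilot4prs.ipynb; the point is only that any filter keyed on the former bot login now needs to match 'ghost'.

```python
# Login that GitHub now reports for the former copilot4prs bot (per the notice above).
BOT_LOGIN = "ghost"  # previously "copilot4prs"


def edits_by_bot(edits, bot_login=BOT_LOGIN):
    """Return the edit records whose editor login matches the bot login.

    `edits` is a list of dicts with a hypothetical "editor" key.
    """
    return [e for e in edits if e.get("editor") == bot_login]


if __name__ == "__main__":
    sample = [
        {"pr": 1, "editor": "ghost"},
        {"pr": 1, "editor": "alice"},
        {"pr": 2, "editor": "copilot4prs"},
    ]
    print(edits_by_bot(sample))
```

Because 'ghost' is also the placeholder GitHub shows for other deleted accounts, records matched this way may need an additional check before they are attributed to Copilot for PRs.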

Environments

The specific installation process is described in INSTALL.txt. The artifact requires the following:

  • A functional Python environment, compatible with the versions used in the notebooks, with all necessary libraries installed as specified in requirements.txt.
  • An R installation, preferably of the same version used for script development, with all required packages installed as indicated in requirements_for_R_scripts.txt.
  • Access to Jupyter Notebooks, either through an Anaconda installation or a direct Python setup.
  • A computer with sufficient processing power and memory to handle the computational demands of the scripts.
  • A stable internet connection, especially necessary if scripts involve fetching data from online sources.
  • An operating system (Windows, macOS, Linux) compatible with the software and tools used.
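
One way to confirm the Python side of these requirements is to compare requirements.txt against the packages installed in the current environment. The sketch below is an assumption-laden helper, not part of the artifact: it assumes one requirement per line, ignores version pins, and the function name is hypothetical.

```python
import re
from importlib import metadata
from pathlib import Path


def missing_requirements(requirements_path):
    """Return requirement names from the file that are not installed.

    Assumes one requirement per line; version pins and extras are ignored.
    """
    missing = []
    for raw in Path(requirements_path).read_text().splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line or line.startswith("-"):
            continue  # skip blanks, comments, and pip options
        # Keep only the distribution name that precedes any specifier.
        name = re.split(r"[<>=!~\[; ]", line, maxsplit=1)[0]
        if not name:
            continue
        try:
            metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    print(missing_requirements("requirements.txt"))
```

An empty list suggests that every listed package is present, although it does not confirm that the installed versions match the pinned ones.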

Installation and Replication

Please follow the instructions in INSTALL.txt step by step to replicate this study. All necessary data is included in this artifact, allowing you to reproduce all results by running BuildingResults.ipynb without the need to prepare the dataset separately.
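
Before running BuildingResults.ipynb, it may help to confirm that the data files are in place. The sketch below checks for the file names listed under Contents; the helper name is hypothetical, the relative path `data` assumes the script is run from the repository root, and the notebook may not read every file in the list.

```python
from pathlib import Path

# File names as listed under "Contents" in this README.
EXPECTED_DATA_FILES = [
    "LLMPRs.csv",
    "control_prs_df.csv",
    "LLMPRsComments.csv",
    "control_comments_df.csv",
    "cleanedLLMPRsComments.csv",
    "cleaned_control_comments_df.csv",
    "edit_contents.csv",
    "edit_contents_developers.csv",
    "edit_contents_developers_with_diff.csv",
    "control_metrics.csv",
    "treatment_metrics.csv",
    "groundtruthbots.csv",
    "coded_sample.csv",
]


def missing_data_files(data_dir="data"):
    """Return the expected file names that are absent from data_dir."""
    base = Path(data_dir)
    return [name for name in EXPECTED_DATA_FILES if not (base / name).is_file()]


if __name__ == "__main__":
    absent = missing_data_files()
    print("All data files found." if not absent else f"Missing: {absent}")
```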

Skills

  • Proficiency in Python programming, including familiarity with data analysis libraries like Pandas, NumPy, and Matplotlib.
  • Competence in R programming, particularly for statistical analysis, and familiarity with relevant R libraries.
  • Experience with Jupyter Notebooks, including running and modifying notebook cells and interpreting outputs.
  • Skills in data analysis and interpretation.
  • Basic knowledge of version control systems, particularly Git, for accessing code repositories like GitHub.
  • Ability to troubleshoot common software installation, library dependencies, and environment configuration issues.

Citation BibTeX

@inproceedings{copilotforpr,
  title={Generative AI for Pull Request Descriptions: Adoption, Impact, and Developer Interventions},
  author={Xiao, Tao and Hata, Hideaki and Treude, Christoph and Matsumoto, Kenichi},
  booktitle={Proceedings of the ACM on Software Engineering (PACMSE)},
  number={FSE 2024},
  year={2024}
}

Authors