A tool for automatically assessing the correctness of patches generated by program repair systems. We consider the human-written patch as the ground-truth oracle and use Random tests generated based on the Ground Truth (RGT): a machine patch that fails any RGT test behaves differently from the ground truth and is classified as overfitting. See Automated Patch Assessment for Program Repair at Scale.
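A minimal sketch of this decision rule (the helper name, the use of `defects4j test -s`, and the output parsing are simplifying assumptions, not the actual run.py implementation):

```python
import subprocess

def assess_patch(work_dir, rgt_suite):
    """Classify a machine patch by running an RGT test suite against it.

    work_dir:  Defects4J checkout with the machine patch applied
    rgt_suite: tar.bz2 archive of tests generated on the ground-truth program
    """
    # `defects4j test -s` runs an external test suite and prints a summary
    # line such as "Failing tests: 0" (the parsing below is a simplification)
    out = subprocess.check_output(
        ["defects4j", "test", "-w", work_dir, "-s", rgt_suite],
        universal_newlines=True)
    failing = int(out.split("Failing tests:")[1].split()[0])
    # Any failing RGT test means the patched program deviates from the
    # ground-truth behavior, i.e. the patch is overfitting.
    return "overfitting" if failing > 0 else "no overfitting detected"
```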
If you use this repo, please cite:
@article{Ye2021EMSE,
  author  = {Ye, He and Martinez, Matias and Monperrus, Martin},
  title   = {Automated Patch Assessment for Program Repair at Scale},
  journal = {Empirical Software Engineering},
  volume  = {26},
  issn    = {1573-7616},
  doi     = {10.1007/s10664-020-09920-w},
  year    = {2021}
}
├── Patches: 257 patches from Dcorrect and 381 patches from Doverfitting
│
├── RGT: tests from Evosuite2019, Randoop2019, EvosuiteASE15, RandoopASE15 and EvosuiteEMSE18
│
├── DiffTGen
│   ├── Results: the overfitting patches found by the DiffTGen runs
│   └── runDrr.py: a script to reproduce the DiffTGen experiment (see details below)
│
├── statistics: our experimental statistics for all RQs
│
└── run.py: a script to reproduce all experiments
- JDK 1.7
- OS: Linux or Mac
- Set DEFECTS4J_HOME="home_of_defects4j"
- Add the defects4j submodule and check out commit 486e2b4 (note: our experiments depend on several Defects4J commands):
git submodule add https://github.com/rjust/defects4j
cd defects4j
git reset --hard 486e2b49d806cdd3288a64ee3c10b3a25632e991
To assess an individual patch for Defects4J:
./run.py patch_assessment <patch_id> <dataset:Dcorrect|Doverfitting> <RGT:ASE15_Evosuite|ASE15_Randoop|EMSE18_Evosuite|2019_Evosuite|2019_Randoop>
example: ./run.py patch_assessment patch1-Lang-35-ACS.patch Dcorrect 2019_Evosuite
To perform different sanity checks:
./run.py applicable_check
./run.py plausible_check
To identify flaky tests:
./run.py flaky_check <patch_id> <dataset:Dcorrect|Doverfitting> <RGT:ASE15_Evosuite|ASE15_Randoop|EMSE18_Evosuite|2019_Evosuite|2019_Randoop>
example: ./run.py flaky_check patch1-Lang-35-ACS.patch Dcorrect 2019_Evosuite
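The underlying idea is to run the same RGT suite several times on the ground-truth program and flag tests whose verdict changes across runs. A sketch under the same assumptions as above (the helper name and the output parsing are illustrative, not the actual flaky_check implementation):

```python
import subprocess

def find_flaky(work_dir, rgt_suite, runs=3):
    """Flag RGT tests whose pass/fail verdict is unstable across runs."""
    failures = []  # one set of failing test names per run
    for _ in range(runs):
        out = subprocess.check_output(
            ["defects4j", "test", "-w", work_dir, "-s", rgt_suite],
            universal_newlines=True)
        # Failing tests are listed after the summary line, one per line,
        # prefixed with "- " (again a simplification of the real output).
        failures.append({line.strip().lstrip("- ")
                         for line in out.splitlines()
                         if line.strip().startswith("- ")})
    # Flaky = fails in at least one run but not in every run.
    return set.union(*failures) - set.intersection(*failures)
```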
To reproduce our experiments with RGT patch assessment:
RQ1: ./run.py RQ1
RQ3: ./run.py RQ3
RQ4: ./run.py RQ4
RQ5: cd ./statistics && ./RQ5-randomness-script.py <Evosuite2019|Randoop2019>
- patches_overview.csv: an overview of the patches.
- The RGT test generation logs for Evosuite2019 and Randoop2019.
- Evosuite2019 and Randoop2019 fail to generate test cases for 31/3510 and 1080/3510 executions, respectively. 2.2% and 2.4% of the generated tests are flaky; see the logs Flaky_Check_For_Evosuite2019 and Flaky_Check_For_Randoop2019.
- We run Evosuite2019 and Randoop2019 over 257 patches from Dcorrect. The statistics for each test execution are available at RQ1 and RQ2_Result.
- The detailed execution logs of RGT (for comparison with DiffTGen) are available at Evosuite2019_Execution_on_Doverfitting and Randoop2019_Execution_on_Doverfitting.
- Overfitting patches found by Evosuite2019 and Randoop2019 are individually summarized in the statistics. Together, they found 274 overfitting patches.
- Overfitting patches found by DiffTGen are summarized in the DiffTGen Result.
- Overfitting patches found by RGT from previous research: EvosuiteASE15, RandoopASE15, EvosuiteEMSE18 are summarized here.
- Experiment statistics of RQ4_RGT_From_Previous_on_Dcorrect.csv: 9 misclassified patches found.
- Experiment statistics of RQ4_RGT_From_Previous_on_Doverfitting.csv: 219 misclassified patches found.
- For each of the 30 RGT test-generation runs, the number of failures identifying overfitting patches is available at Evosuite2019 and Randoop2019.
- We develop a script to simulate 1000 random groups, each containing 30 test-generation runs (see the sketch after this list).
- Statistics recording the average number of test-generation runs needed to achieve 80%, 85%, 90%, 95% and 100% effectiveness: 1000 groups of Evosuite and 1000 groups of Randoop.
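A sketch of what such a simulation can look like, assuming the outcome of each of the 30 test-generation runs is recorded per patch as a boolean (True meaning that run's tests expose the overfitting patch); the actual logic lives in RQ5-randomness-script.py:

```python
import random

def simulate(per_run_detection, groups=1000, runs_per_group=30):
    """Average number of runs needed to reach each effectiveness level.

    per_run_detection: {patch_id: [bool] * runs_per_group}, True meaning
    that run's generated tests expose the overfitting patch.
    """
    thresholds = [0.80, 0.85, 0.90, 0.95, 1.00]
    runs_needed = {t: [] for t in thresholds}
    total = float(len(per_run_detection))
    for _ in range(groups):
        # One random group = the 30 runs considered in a random order.
        order = random.sample(range(runs_per_group), runs_per_group)
        detected, pending = set(), list(thresholds)
        for i, run in enumerate(order, 1):
            detected.update(
                p for p, runs in per_run_detection.items() if runs[run])
            # Record the first run index that reaches each threshold.
            while pending and len(detected) / total >= pending[0]:
                runs_needed[pending.pop(0)].append(i)
    # Averages are taken over the groups that reach each threshold.
    return {t: sum(v) / float(len(v)) for t, v in runs_needed.items() if v}
```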
- For more details about Defects4J, see the original repository of the Defects4J benchmark.
- For more details about DiffTGen, see the original repository of DiffTGen.