Defect classes selection bias |
2014 |
A critical review of "automatic patch generation learned from human-written patches": Essay on the problem statement and the evaluation of automatic software repair |
https://dl.acm.org/doi/abs/10.1145/2568225.2568324 |
Using different defect classes when evaluating multiple APR techniques. |
Rigorously, automatic repair approaches can be compared only if they address similar defect classes. The bias is also considered mitigated, to a certain extent, when multiple APR techniques are evaluated on the same datasets. |
not applicable |
yes |
yes |
yes |
yes |
yes |
yes |
yes |
yes |
yes |
yes |
yes |
yes |
yes |
yes |
yes |
yes |
yes |
yes |
yes |
not applicable |
yes |
yes |
yes |
yes |
yes |
yes |
yes |
yes |
yes |
yes |
yes |
yes |
yes |
yes |
yes |
34 |
2 |
0 |
0 |
36 |
Fix acceptability metric bias |
2014 |
A critical review of "automatic patch generation learned from human-written patches": Essay on the problem statement and the evaluation of automatic software repair |
https://dl.acm.org/doi/abs/10.1145/2568225.2568324 |
Using fix acceptability as an evaluation metric. |
The bias is mitigated if the fix acceptability metric (i.e., manually judging if the APR-generated patch is acceptable) is not used in the APR evaluation. |
yes |
yes |
yes |
yes |
yes |
yes |
yes |
yes |
yes |
yes |
yes |
yes |
yes |
yes |
yes |
yes |
yes |
yes |
yes |
yes |
yes |
yes |
yes |
yes |
yes |
yes |
yes |
yes |
yes |
yes |
yes |
yes |
yes |
yes |
yes |
yes |
36 |
0 |
0 |
0 |
36 |
Non-manual validation bias |
2015 |
An analysis of patch plausibility and correctness for generate-and-validate patch generation systems |
https://dl.acm.org/doi/abs/10.1145/2771783.2771791 |
Assessing APR-generated patches without manual validation. |
The bias is mitigated if the APR-generated patch is validated manually. |
yes |
no |
yes |
yes |
yes |
no |
yes |
yes |
yes |
yes |
no |
yes |
yes |
yes |
yes |
yes |
yes |
yes |
yes |
yes |
no |
no |
yes |
yes |
no |
yes |
yes |
yes |
yes |
no |
yes |
no |
yes |
no |
yes |
yes |
27 |
0 |
9 |
0 |
36 |
Non-independent test validation bias |
2015 |
Is the cure worse than the disease? overfitting in automated program repair |
https://dl.acm.org/doi/abs/10.1145/2786805.2786825 |
Assessing APR-generated patches without independent tests. |
The bias is mitigated if the APR-generated patch is assessed by held-out tests that are not included in the original test suite used in patch validation. |
no |
no |
no |
no |
no |
no |
yes |
not applicable |
no |
no |
yes |
no |
no |
yes |
yes |
no |
yes |
not applicable |
no |
no |
yes |
no |
yes |
no |
no |
no |
no |
yes |
no |
no |
yes |
yes |
yes |
yes |
no |
yes |
13 |
2 |
21 |
0 |
36 |
NCP vs. NTCE metric bias |
2018 |
How to Measure the Performance of Automated Program Repair |
https://ieeexplore.ieee.org/abstract/document/8612557/ |
Using Number of Test Case Executions (NTCE for short) as an efficiency metric rather than Number of Candidate Patches before a valid patch is found (NCP for short). |
The bias does not exist at all if the repair efficiency is not evaluated in the paper. The bias is mitigated if the NCP metric is used during the APR evaluation. |
not applicable |
no |
no |
yes |
yes |
no |
not applicable |
yes |
yes |
yes |
not applicable |
yes |
yes |
no |
yes |
yes |
not applicable |
yes |
not applicable |
yes |
not applicable |
yes |
yes |
not applicable |
no |
yes |
yes |
not applicable |
yes |
no |
no |
no |
yes |
yes |
yes |
not applicable |
19 |
9 |
8 |
0 |
36 |
Defect classes evaluation bias |
2018 |
Do automated program repair techniques repair hard and important bugs? |
https://link.springer.com/article/10.1007/s10664-017-9550-0 |
Whether APR techniques can repair hard and important bugs is not evaluated. |
Rigorously, it is expected that researchers discuss the complexity and importance of the repaired bugs using the metrics proposed by the bias study (e.g., Priority of the Defect). The bias is also considered mitigated, to a certain extent, when the repaired defects are publicly available for further checks or are discussed in the paper. |
yes |
yes |
yes |
yes |
yes |
no |
yes |
yes |
yes |
yes |
yes |
no |
yes |
no |
yes |
no |
yes |
yes |
yes |
yes |
yes |
no |
yes |
yes |
no |
yes |
yes |
yes |
yes |
yes |
yes |
no |
yes |
yes |
yes |
yes |
29 |
0 |
7 |
0 |
36 |
Only-manual validation bias |
2019 |
On reliability of patch correctness assessment |
https://ieeexplore.ieee.org/abstract/document/8812054/ |
Assessing APR-generated patches only by author annotation (i.e., the patch correctness is validated by authors of the APR tool). |
The bias does not exist at all if no manual checks are performed for APR-generated patch correctness assessment. The bias is mitigated if both held-out tests and manual checks are used for patch correctness assessment. |
no |
not applicable |
no |
no |
no |
not applicable |
yes |
not applicable |
no |
no |
yes |
no |
no |
yes |
yes |
no |
yes |
not applicable |
no |
yes |
yes |
no |
yes |
no |
no |
no |
no |
yes |
no |
not applicable |
yes |
not applicable |
yes |
yes |
no |
yes |
13 |
6 |
17 |
0 |
36 |
Only-independent test validation bias |
2019 |
On reliability of patch correctness assessment |
https://ieeexplore.ieee.org/abstract/document/8812054/ |
Assessing APR-generated patches only by held-out tests (i.e., tests that are not included in the original test suite used in patch validation). |
The bias does not exist at all if no held-out tests are used for APR-generated patch correctness assessment. The bias is mitigated if both held-out tests and manual checks are used for patch correctness assessment. |
not applicable |
not applicable |
not applicable |
not applicable |
not applicable |
not applicable |
yes |
yes |
yes |
not applicable |
no |
not applicable |
not applicable |
yes |
yes |
not applicable |
yes |
yes |
yes |
not applicable |
yes |
not applicable |
yes |
not applicable |
not applicable |
not applicable |
yes |
yes |
yes |
not applicable |
yes |
no |
yes |
no |
not applicable |
yes |
16 |
17 |
3 |
0 |
36 |
Fault localization bias |
2019 |
You cannot fix what you cannot find! an investigation of fault localization bias in benchmarking automated program repair systems |
https://ieeexplore.ieee.org/abstract/document/8730164/ |
Using inconsistent fault localization configurations when evaluating APR techniques. |
The bias does not exist at all when no fault localization is performed. The bias is mitigated if multiple APR techniques use the same fault localization configuration during evaluation. |
not applicable |
yes |
yes |
yes |
yes |
yes |
not applicable |
not applicable |
yes |
yes |
yes |
yes |
yes |
yes |
yes |
yes |
not applicable |
not applicable |
yes |
yes |
not applicable |
yes |
yes |
yes |
yes |
no |
yes |
yes |
no |
yes |
yes |
no |
not applicable |
yes |
yes |
not applicable |
25 |
8 |
3 |
0 |
36 |
Subject bugs selection bias |
2019 |
Attention please: Consider Mockito when evaluating newly proposed automated program repair techniques |
https://dl.acm.org/doi/abs/10.1145/3319008.3319349 |
Excluding Mockito bugs when evaluating APR techniques with Defects4J. |
The bias does not exist at all when Defects4J is not used. The bias is mitigated if the Mockito bugs are included when using Defects4J for evaluation. |
not applicable |
no |
no |
no |
no |
yes |
yes |
no |
yes |
yes |
yes |
no |
no |
yes |
not applicable |
not applicable |
not applicable |
yes |
yes |
yes |
yes |
yes |
not applicable |
yes |
yes |
no |
yes |
yes |
yes |
not applicable |
not applicable |
yes |
not applicable |
no |
yes |
yes |
19 |
8 |
9 |
0 |
36 |
Flaky test inclusion bias |
2019 |
"Flakime: Laboratory-controlled test flakiness impact assessment. a case study on mutation testing and program repair" & "On the Impact of Flaky Tests in Automated Program Repair" |
https://arxiv.org/abs/1912.03197 & https://ieeexplore.ieee.org/abstract/document/9425948/ |
Including flaky tests when evaluating APR techniques. |
Rigorously, this bias is very hard to mitigate, as it is difficult to identify whether flaky tests exist in the bug dataset; eliminating flaky tests remains an open challenge. Thus, the bias is also considered mitigated, to a certain extent, when only a few tests are used or curated datasets are used. |
yes |
yes |
yes |
yes |
yes |
yes |
yes |
yes |
yes |
yes |
yes |
yes |
yes |
yes |
yes |
no |
yes |
yes |
yes |
no |
no |
yes |
yes |
yes |
yes |
yes |
yes |
yes |
yes |
no |
no |
no |
yes |
yes |
yes |
yes |
30 |
0 |
6 |
0 |
36 |
Benchmark selection bias |
2019 |
Empirical review of Java program repair tools: a large-scale experiment on 2,141 bugs and 23,551 repair attempts |
https://dl.acm.org/doi/abs/10.1145/3338906.3338911 |
Using a single dataset when evaluating APR techniques. |
Rigorously, the bias is mitigated if multiple datasets are used for APR evaluation. The bias is also considered mitigated, to a certain extent, when buggy programs from different sources are used. |
no |
no |
no |
no |
no |
no |
yes |
no |
yes |
yes |
not applicable |
no |
no |
yes |
no |
no |
yes |
yes |
yes |
yes |
not applicable |
no |
no |
no |
no |
yes |
no |
not applicable |
no |
yes |
no |
yes |
yes |
no |
no |
yes |
13 |
3 |
20 |
0 |
36 |
NCP vs. Time metric bias |
2020 |
On the Efficiency of Test Suite based Program Repair: A Systematic Assessment of 16 Automated Repair Systems for Java Programs |
https://dl.acm.org/doi/abs/10.1145/3377811.3380338 |
Using repair time as an efficiency metric rather than the NCP. |
The bias does not exist at all if the repair efficiency is not evaluated in the paper. The bias is mitigated if the NCP metric is discussed during the APR evaluation. |
not applicable |
# |
# |
# |
# |
# |
# |
# |
# |
# |
not applicable |
# |
# |
# |
yes |
no |
not applicable |
not applicable |
not applicable |
yes |
not applicable |
no |
yes |
not applicable |
yes |
yes |
yes |
not applicable |
yes |
# |
# |
# |
yes |
# |
# |
not applicable |
8 |
9 |
2 |
17 |
36 |
Tool exception bias |
2020 |
Understanding the Non-Repairability Factors of Automated Program Repair Techniques |
https://ieeexplore.ieee.org/abstract/document/9359317 |
Not handling exceptions raised by APR techniques during the evaluation. |
The bias does not exist at all if the repair recall metric is not calculated. The bias is mitigated if bug-fixing attempts that ended with unexpected results are excluded when calculating repair recall. |
not applicable |
# |
# |
# |
# |
# |
# |
# |
# |
# |
not applicable |
# |
# |
# |
not applicable |
not applicable |
not applicable |
not applicable |
not applicable |
not applicable |
not applicable |
not applicable |
not applicable |
no |
not applicable |
not applicable |
not applicable |
not applicable |
not applicable |
# |
# |
# |
yes |
# |
# |
not applicable |
1 |
17 |
1 |
17 |
36 |
Bug processing bias |
2021 |
A critical review on the evaluation of automated program repair systems |
https://www.sciencedirect.com/science/article/pii/S0164121220302156 |
Using future test cases that are not available at the time the bug is reported for dataset construction. |
The bias is mitigated when buggy programs with future test cases are excluded during the APR evaluation. |
# |
# |
# |
# |
# |
# |
# |
# |
# |
# |
# |
# |
# |
# |
# |
# |
# |
# |
# |
# |
# |
# |
# |
# |
# |
# |
# |
# |
# |
# |
# |
# |
# |
# |
# |
# |
0 |
0 |
0 |
36 |
36 |
NTCE vs. NCP metric bias |
2021 |
How Does Regression Test Selection Affect Program Repair? An Extensive Study on 2 Million Patches |
https://arxiv.org/abs/2105.07311 |
Using NCP as an efficiency metric rather than NTCE. |
The bias does not exist at all if the repair efficiency is not evaluated in the paper. The bias is mitigated if the NTCE metric (i.e., Number of Test Case Executions) is used during the APR evaluation. |
# |
# |
# |
# |
# |
# |
# |
# |
# |
# |
# |
# |
# |
# |
# |
# |
# |
# |
# |
# |
# |
# |
# |
# |
# |
# |
# |
# |
# |
# |
# |
# |
# |
# |
# |
# |
0 |
0 |
0 |
36 |
36 |
Inaccurate ground truth bias |
2021 |
Is the Ground Truth Really Accurate? Dataset Purification for Automated Program Repair |
https://ieeexplore.ieee.org/abstract/document/9426017/ |
Using inaccurate ground truth (i.e., human-written patches) to assess correctness of APR-generated patches. |
Rigorously, the bias is mitigated if no inaccurate ground truth is used for patch correctness assessment. Similar to the "Flaky test inclusion bias", it might be very hard to identify if the ground-truth patch is really accurate. |
# |
# |
# |
# |
# |
# |
# |
# |
# |
# |
# |
# |
# |
# |
# |
# |
# |
# |
# |
# |
# |
# |
# |
# |
# |
# |
# |
# |
# |
# |
# |
# |
# |
# |
# |
# |
0 |
0 |
0 |
36 |
36 |