results_of_investigation_on_known_bias_mitigation.md

File metadata and controls

19 lines (19 loc) · 34.3 KB
| Bias Name | Year Revealed | Revealing Study | Paper Link | Bias Definition | Bias Mitigation Judgement Criteria | presler2021sqlreair | venugopal2020modification | yuan2020making | yuan2020toward | yuan2020evolutionary | villanueva2020novelty | chen2020contract | xu2020restore | lutellier2020coconut | li2020dlfix | bohme2020human | oo2020automatic | koyuncu2020fixminer | motwani2020automatically | bian2021refining | khalilian2021cgenprog | gao2021beyond | shariffdeen2021concolic | ye2021neural | jiang2021cure | baudry2021software | trujillo2021novel | mesecan2021crnrepair | qin2021impact | lou2021does | kechagia2021evaluating | liu2021critical | He2021A | yang2021evaluating | abdessalem2020automated | yu2020smart | jiang2020input | shariffdeen2020automated | motwani2020quality | liu2020efficiency | ginelli2020comprehensive | yes | not applicable | no | # | total |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Defect classes selection bias | 2014 | A critical review of "automatic patch generation learned from human-written patches": Essay on the problem statement and the evaluation of automatic software repair | https://dl.acm.org/doi/abs/10.1145/2568225.2568324 | Using different defect classes when evaluating multiple APR techniques. | Rigorously, automatic repair approaches can be compared only if they address similar defect classes. The bias is also considered mitigated, to a certain extent, when multiple APR techniques are evaluated on the same datasets. | not applicable | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | not applicable | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | 34 | 2 | 0 | 0 | 36 |
| Fix acceptability metric bias | 2014 | A critical review of "automatic patch generation learned from human-written patches": Essay on the problem statement and the evaluation of automatic software repair | https://dl.acm.org/doi/abs/10.1145/2568225.2568324 | Using fix acceptability as an evaluation metric. | The bias is mitigated if the fix acceptability metric (i.e., manually judging whether the APR-generated patch is acceptable) is not used in the APR evaluation. | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | 36 | 0 | 0 | 0 | 36 |
| Non-manual validation bias | 2015 | An analysis of patch plausibility and correctness for generate-and-validate patch generation systems | https://dl.acm.org/doi/abs/10.1145/2771783.2771791 | Assessing APR-generated patches without manual validation. | The bias is mitigated if the APR-generated patch is validated manually. | yes | no | yes | yes | yes | no | yes | yes | yes | yes | no | yes | yes | yes | yes | yes | yes | yes | yes | yes | no | no | yes | yes | no | yes | yes | yes | yes | no | yes | no | yes | no | yes | yes | 27 | 0 | 9 | 0 | 36 |
| Non-independent test validation bias | 2015 | Is the cure worse than the disease? Overfitting in automated program repair | https://dl.acm.org/doi/abs/10.1145/2786805.2786825 | Assessing APR-generated patches without independent tests. | The bias is mitigated if the APR-generated patch is assessed by held-out tests that are not included in the original test suite used for patch validation. | no | no | no | no | no | no | yes | not applicable | no | no | yes | no | no | yes | yes | no | yes | not applicable | no | no | yes | no | yes | no | no | no | no | yes | no | no | yes | yes | yes | yes | no | yes | 13 | 2 | 21 | 0 | 36 |
| NCP vs. NTCE metric bias | 2018 | How to Measure the Performance of Automated Program Repair | https://ieeexplore.ieee.org/abstract/document/8612557/ | Using the Number of Test Case Executions (NTCE) as an efficiency metric rather than the Number of Candidate Patches before a valid patch is found (NCP). | The bias does not exist at all if repair efficiency is not evaluated in the paper. The bias is mitigated if the NCP metric is used during the APR evaluation. | not applicable | no | no | yes | yes | no | not applicable | yes | yes | yes | not applicable | yes | yes | no | yes | yes | not applicable | yes | not applicable | yes | not applicable | yes | yes | not applicable | no | yes | yes | not applicable | yes | no | no | no | yes | yes | yes | not applicable | 19 | 9 | 8 | 0 | 36 |
| Defect classes evaluation bias | 2018 | Do automated program repair techniques repair hard and important bugs? | https://link.springer.com/article/10.1007/s10664-017-9550-0 | Not evaluating whether APR techniques can repair hard and important bugs. | Rigorously, it is expected that the researchers discuss the complexity and importance of repaired bugs using the metrics proposed by the bias study (e.g., Priority of the Defect). The bias is also considered mitigated, to a certain extent, when the repaired defects are publicly available for further checking or are discussed in the paper. | yes | yes | yes | yes | yes | no | yes | yes | yes | yes | yes | no | yes | no | yes | no | yes | yes | yes | yes | yes | no | yes | yes | no | yes | yes | yes | yes | yes | yes | no | yes | yes | yes | yes | 29 | 0 | 7 | 0 | 36 |
| Only-manual validation bias | 2019 | On reliability of patch correctness assessment | https://ieeexplore.ieee.org/abstract/document/8812054/ | Assessing APR-generated patches only by author annotation (i.e., patch correctness is validated by the authors of the APR tool). | The bias does not exist at all if no manual checks are performed for APR-generated patch correctness assessment. The bias is mitigated if both held-out tests and manual checks are used for patch correctness assessment. | no | not applicable | no | no | no | not applicable | yes | not applicable | no | no | yes | no | no | yes | yes | no | yes | not applicable | no | yes | yes | no | yes | no | no | no | no | yes | no | not applicable | yes | not applicable | yes | yes | no | yes | 13 | 6 | 17 | 0 | 36 |
| Only-independent test validation bias | 2019 | On reliability of patch correctness assessment | https://ieeexplore.ieee.org/abstract/document/8812054/ | Assessing APR-generated patches only by held-out tests (i.e., tests that are not included in the original test suite used for patch validation). | The bias does not exist at all if no held-out tests are used for APR-generated patch correctness assessment. The bias is mitigated if both held-out tests and manual checks are used for patch correctness assessment. | not applicable | not applicable | not applicable | not applicable | not applicable | not applicable | yes | yes | yes | not applicable | no | not applicable | not applicable | yes | yes | not applicable | yes | yes | yes | not applicable | yes | not applicable | yes | not applicable | not applicable | not applicable | yes | yes | yes | not applicable | yes | no | yes | no | not applicable | yes | 16 | 17 | 3 | 0 | 36 |
| Fault localization bias | 2019 | You cannot fix what you cannot find! An investigation of fault localization bias in benchmarking automated program repair systems | https://ieeexplore.ieee.org/abstract/document/8730164/ | Using inconsistent fault localization configurations when evaluating APR techniques. | The bias does not exist at all when no fault localization is performed. The bias is mitigated if multiple APR techniques use the same fault localization configuration during evaluation. | not applicable | yes | yes | yes | yes | yes | not applicable | not applicable | yes | yes | yes | yes | yes | yes | yes | yes | not applicable | not applicable | yes | yes | not applicable | yes | yes | yes | yes | no | yes | yes | no | yes | yes | no | not applicable | yes | yes | not applicable | 25 | 8 | 3 | 0 | 36 |
| Subject bugs selection bias | 2019 | Attention please: Consider Mockito when evaluating newly proposed automated program repair techniques | https://dl.acm.org/doi/abs/10.1145/3319008.3319349 | Excluding Mockito bugs when evaluating APR techniques with Defects4J. | The bias does not exist at all when Defects4J is not used. The bias is mitigated if the Mockito bugs are included when using Defects4J for evaluation. | not applicable | no | no | no | no | yes | yes | no | yes | yes | yes | no | no | yes | not applicable | not applicable | not applicable | yes | yes | yes | yes | yes | not applicable | yes | yes | no | yes | yes | yes | not applicable | not applicable | yes | not applicable | no | yes | yes | 19 | 8 | 9 | 0 | 36 |
| Flaky test inclusion bias | 2019 | "FlakiMe: Laboratory-controlled test flakiness impact assessment. A case study on mutation testing and program repair" & "On the Impact of Flaky Tests in Automated Program Repair" | https://arxiv.org/abs/1912.03197 & https://ieeexplore.ieee.org/abstract/document/9425948/ | Including flaky tests when evaluating APR techniques. | Rigorously, the bias is very hard to mitigate, as eliminating flaky tests is still an open challenge and it is hard to identify whether flaky tests exist in a bug dataset. Thus, the bias is considered mitigated, to a certain extent, when only a few tests are used or curated datasets are used. | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes | no | yes | yes | yes | no | no | yes | yes | yes | yes | yes | yes | yes | yes | no | no | no | yes | yes | yes | yes | 30 | 0 | 6 | 0 | 36 |
| Benchmark selection bias | 2019 | Empirical review of Java program repair tools: a large-scale experiment on 2,141 bugs and 23,551 repair attempts | https://dl.acm.org/doi/abs/10.1145/3338906.3338911 | Using a single dataset when evaluating APR techniques. | Rigorously, the bias is mitigated if multiple datasets are used for APR evaluation. The bias is also considered mitigated, to a certain extent, when buggy programs from different sources are used. | no | no | no | no | no | no | yes | no | yes | yes | not applicable | no | no | yes | no | no | yes | yes | yes | yes | not applicable | no | no | no | no | yes | no | not applicable | no | yes | no | yes | yes | no | no | yes | 13 | 3 | 20 | 0 | 36 |
| NCP vs. Time metric bias | 2020 | On the Efficiency of Test Suite based Program Repair: A Systematic Assessment of 16 Automated Repair Systems for Java Programs | https://dl.acm.org/doi/abs/10.1145/3377811.3380338 | Using repair time as an efficiency metric rather than the NCP. | The bias does not exist at all if repair efficiency is not evaluated in the paper. The bias is mitigated if the NCP metric is discussed during the APR evaluation. | not applicable | # | # | # | # | # | # | # | # | # | not applicable | # | # | # | yes | no | not applicable | not applicable | not applicable | yes | not applicable | no | yes | not applicable | yes | yes | yes | not applicable | yes | # | # | # | yes | # | # | not applicable | 8 | 9 | 2 | 17 | 36 |
| Tool exception bias | 2020 | Understanding the Non-Repairability Factors of Automated Program Repair Techniques | https://ieeexplore.ieee.org/abstract/document/9359317 | Not addressing exceptions of APR techniques during the evaluation. | The bias does not exist at all if the repair recall metric is not calculated. The bias is mitigated if bug-fixing attempts that ended with unexpected results are excluded from the repair recall calculation. | not applicable | # | # | # | # | # | # | # | # | # | not applicable | # | # | # | not applicable | not applicable | not applicable | not applicable | not applicable | not applicable | not applicable | not applicable | not applicable | no | not applicable | not applicable | not applicable | not applicable | not applicable | # | # | # | yes | # | # | not applicable | 1 | 17 | 1 | 17 | 36 |
| Bug processing bias | 2021 | A critical review on the evaluation of automated program repair systems | https://www.sciencedirect.com/science/article/pii/S0164121220302156 | Using future test cases that are not available at the time the bug is reported for dataset construction. | The bias is mitigated when buggy programs with future test cases are excluded during the APR evaluation. | # | # | # | # | # | # | # | # | # | # | # | # | # | # | # | # | # | # | # | # | # | # | # | # | # | # | # | # | # | # | # | # | # | # | # | # | 0 | 0 | 0 | 36 | 36 |
| NTCE vs. NCP metric bias | 2021 | How Does Regression Test Selection Affect Program Repair? An Extensive Study on 2 Million Patches | https://arxiv.org/abs/2105.07311 | Using NCP as an efficiency metric rather than NTCE. | The bias does not exist at all if repair efficiency is not evaluated in the paper. The bias is mitigated if the NTCE metric (i.e., Number of Test Case Executions) is used during the APR evaluation. | # | # | # | # | # | # | # | # | # | # | # | # | # | # | # | # | # | # | # | # | # | # | # | # | # | # | # | # | # | # | # | # | # | # | # | # | 0 | 0 | 0 | 36 | 36 |
| Inaccurate ground truth bias | 2021 | Is the Ground Truth Really Accurate? Dataset Purification for Automated Program Repair | https://ieeexplore.ieee.org/abstract/document/9426017/ | Using inaccurate ground truth (i.e., human-written patches) to assess the correctness of APR-generated patches. | Rigorously, the bias is mitigated if no inaccurate ground truth is used for patch correctness assessment. Similar to the "Flaky test inclusion bias", it might be very hard to identify whether the ground truth patch is really accurate. | # | # | # | # | # | # | # | # | # | # | # | # | # | # | # | # | # | # | # | # | # | # | # | # | # | # | # | # | # | # | # | # | # | # | # | # | 0 | 0 | 0 | 36 | 36 |
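The three efficiency-metric rows above (NCP vs. NTCE, NCP vs. Time, NTCE vs. NCP) hinge on how two counters diverge inside a generate-and-validate repair loop. The following is a minimal illustrative sketch, not code from any cited tool; `candidate_patches`, `test_suite`, and `passes` are hypothetical stand-ins:

```python
def evaluate_repair(candidate_patches, test_suite, passes):
    """Run a generate-and-validate loop, tracking both efficiency metrics."""
    ncp = 0   # Number of Candidate Patches tried before a valid patch is found
    ntce = 0  # Number of Test Case Executions across all candidates
    for patch in candidate_patches:
        ncp += 1
        valid = True
        for test in test_suite:
            ntce += 1
            if not passes(patch, test):
                valid = False
                break  # short-circuit: remaining tests are never executed
        if valid:
            return patch, ncp, ntce
    return None, ncp, ntce  # no valid patch found
```

Because validation short-circuits on the first failing test, NTCE depends on test ordering and regression test selection while NCP does not, which is why the two metrics can rank the same repair systems differently.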
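Several rows (non-manual validation, non-independent test validation, and the two only-* validation biases) concern complementary patch-assessment strategies. A minimal sketch of how the assessments combine, assuming hypothetical `run_tests` and `manual_review` helpers rather than any cited tool's API:

```python
def assess_patch(patch, repair_suite, held_out_suite, run_tests, manual_review):
    """Classify a patch using repair tests, held-out tests, and manual review."""
    if not run_tests(patch, repair_suite):
        return "implausible"      # fails the test suite used during repair
    if not run_tests(patch, held_out_suite):
        return "overfitting"      # plausible, but fails independent held-out tests
    if not manual_review(patch):
        return "incorrect"        # passes all tests, yet judged wrong by a human
    return "likely correct"
```

Relying on held-out tests alone or manual review alone reintroduces the respective only-* validation bias; the table treats a paper as mitigating both when the two checks are applied together.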