# Experiment 2

To validate our results in a more complex setting, we examine how each distance measure ranks an expert annotation against a single other high-quality candidate repair found by a state-of-the-art automated repair technique. 

We use the state-of-the-art semantic Automated Repair Tool (ART) Refactory to find a candidate repair for each incorrect solution in our annotated dataset. To obtain a high-quality repair, we run the ART giving it access to the same pool of candidate repairs as used in the first experiment (without the expert solution). Using this pool of correct programs, Refactory generates a bigger suite of semantically equivalent code by refactoring all these available working solutions to a problem. Then, given an incorrect program, Refactory analyzes its control flow structure to find a closely matching working program to compare for isolating the buggy components of the buggy solution. As such, the candidate repair generated by Refactory should be better or at least as appropriate as the best candidates in the original pool (which, once again, might contain the student's own correction to the problem).

We repeat the previous experiment (experiment 1) using the candidate repair found for each buggy solution. The main difference with the first experiment is that we compare the expert annotation/repair against the single candidate obtained using Refactory. Therefore, the ranking error for each buggy program becomes a binary classification error. We report the total classification error --  the number of times the ART candidate repair was favored over the expert annotation -- for all metrics.

In [1]:
import os, sys
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from datasets import disable_caching

#### General settings

In [2]:
sys.path.append("../")
sys.path.append("../../")
disable_caching()
sns.set_theme("paper")
plt.rcParams['font.size'] = '7'
sns.set(font_scale=1.1)

In [3]:
from src.common import dist_funcs, new_assignments_id

## Let's load our data

In [4]:
CONFIG_PATH = '../configs/conf.json'

In [5]:
from src.utils.files import read_config

config = read_config(CONFIG_PATH)
config

DotMap(save_path='../data/', split_year=False, _ipython_display_=DotMap(), _repr_mimebundle_=DotMap())

### Loading the Refactory results dataframe

In [6]:
def extract_index(file_name):
    return int(file_name.split("_")[-1][:-3])

In [7]:
from warnings import warn

questions = os.listdir(config.save_path)
questions = [q for q in questions if q.startswith("question")]
key_f = lambda q: int(q.split('_')[-1])
questions = sorted(questions, key=key_f)

dataframe = []
for q in questions:
    q_path = os.path.join(config.save_path, q, 'refactory_online.csv')
    if not os.path.exists(q_path):
        warn(f"Results for assignment {q} are not available")
        continue
    dataframe.append(pd.read_csv(q_path))
    
dataframe = pd.concat(dataframe, axis=0, ignore_index=True)
dataframe["index"] = dataframe["File Name"].apply(extract_index).astype(int)
dataframe = dataframe.set_index("index")
dataframe = dataframe.sort_index()
dataframe

Unnamed: 0_level_0,Question,Sampling Rate,Experiment ID,File Name,Status,Match (Rfty Code),Match (Ori Code),Buggy Code,Buggy Mutation,Refactored Correct Code,...,Online Refactoring Time,GCR Time,Stru. Mutation Time,Block Mapping Time,Variable Mapping Time,Specification&Synthesis Time,Total Time,#Passed Test Case,#Test Case,RPS
index,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
12,question_5,100,0,wrong_5_12.py,success_wo_mut,1.0,1,def reverse ( a ) :\n i = 0\n while ( i ...,def reverse ( a ) :\n i = 0\n while ( i ...,def reverse ( a ) :\n i = 0\n while ( i ...,...,0.069,0.331,0.0,0.0,0.033,35.700,36.142,8.0,8.0,0.034
16,question_6,100,0,wrong_6_16.py,success_wo_mut,1.0,1,def reverse ( a ) :\n print ( a . reverse (...,def reverse ( a ) :\n print ( a . reverse (...,def reverse ( a ) :\n return a [ : : ( - 1 ...,...,0.012,0.013,0.0,0.0,0.005,0.042,0.076,3.0,3.0,0.700
49,question_7,100,0,wrong_7_49.py,success_wo_mut,1.0,1,"def search ( str , letter ) :\n if ( letter...","def search ( str , letter ) :\n if ( letter...","def search ( str , letter ) :\n if ( letter...",...,0.095,0.095,0.0,0.0,0.015,0.008,0.220,9.0,9.0,0.286
150,question_3,100,0,wrong_3_150.py,success_wo_mut,1.0,1,def maximum ( l ) :\n if ( len ( l ) == 1 )...,def maximum ( l ) :\n if ( len ( l ) == 1 )...,def maximum ( l ) :\n if ( len ( l ) == 1 )...,...,0.248,0.061,0.0,0.0,0.009,0.128,0.450,4.0,4.0,0.029
213,question_7,100,0,wrong_7_213.py,success_wo_mut,1.0,0,"def search ( string , letter ) :\n if ( let...","def search ( string , letter ) :\n if ( let...","\n\ndef search(str, letter):\n if True:\n ...",...,0.139,0.002,0.0,0.0,0.015,0.183,0.346,9.0,9.0,0.316
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
42327,question_7,100,0,wrong_7_42327.py,success_wo_mut,1.0,1,"def search ( str , letter ) :\n if ( letter...","def search ( str , letter ) :\n if ( letter...","def search ( str , letter ) :\n if ( letter...",...,0.094,0.095,0.0,0.0,0.014,0.008,0.217,9.0,9.0,0.286
42374,question_1,100,0,wrong_1_42374.py,success_wo_mut,1.0,1,def count_letters ( s ) :\n total = 0\n ...,def count_letters ( s ) :\n total = 0\n ...,def count_letters ( s ) :\n counter = 0\n ...,...,0.124,0.011,0.0,0.0,1.593,3.548,5.283,5.0,5.0,0.344
42384,question_3,100,0,wrong_3_42384.py,success_wo_mut,1.0,1,def maximum ( l ) :\n if ( len ( l ) == 1 )...,def maximum ( l ) :\n if ( len ( l ) == 1 )...,def maximum ( l ) :\n if ( len ( l ) == 1 )...,...,0.248,0.115,0.0,0.0,0.009,0.111,0.489,4.0,4.0,0.029
42412,question_7,100,0,wrong_7_42412.py,success_wo_mut,1.0,1,"def search ( str , letter ) :\n if ( letter...","def search ( str , letter ) :\n if ( letter...","def search ( str , letter ) :\n if ( letter...",...,0.095,0.095,0.0,0.0,0.015,0.007,0.219,9.0,9.0,0.286


### Loading the dataframe used to obtain the Refactory's repair

In [8]:
from datasets import load_from_disk

dataset = load_from_disk(os.path.join(config.save_path, 'hgf'))
original_df = dataset.to_pandas()
# We only take the incorrect ones
original_df = original_df[~original_df.correct]
original_df = original_df.set_index("submission_id")
original_df = original_df.sort_index()
original_df

Unnamed: 0_level_0,func_code,assignment_id,func_name,description,test,annotation,user,academic_year,correct,__index_level_0__
submission_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
12,def reverse(a):\n i = 0\n while i < len(...,reverse_by_swap,reverse,Reverse a list of elements.,assert reverse([])==[] and reverse([0])==[0] a...,def reverse(a):\n i = 0\n while i < len(...,4a1f2726-b713-40f0-b544-9de55d617a12,2017,False,1219
12,def reverse(a):\n i = 0\n while i < len(...,reverse_by_swap,reverse,Reverse a list of elements.,assert reverse([])==[] and reverse([0])==[0] a...,def reverse(a):\n i = 0\n while i < len(...,4a1f2726-b713-40f0-b544-9de55d617a12,2017,False,370
16,def reverse(a):\n print(a.reverse()),reverse_iter,reverse,Reverse a list of elements.,"assert reverse([])==[] and reverse([20, 10, 0,...",def reverse(a):\n print(a.reverse())\n r...,03141ef3-f364-4b7c-9f52-990a173ac162,2016,False,164
49,"def search(str, letter):\n if letter in str...",search_iter,search,Return whether a letter is part of a string,"assert search('','0')==False and search('0','0...","def search(str, letter):\n if letter in str...",e380b6f8-84c6-4978-a85b-78c22ace6b9b,2017,False,800
49,"def search(str, letter):\n if letter in str...",search_iter,search,Return whether a letter is part of a string,"assert search('','0')==False and search('0','0...","def search(str, letter):\n if letter in str...",e380b6f8-84c6-4978-a85b-78c22ace6b9b,2017,False,1649
...,...,...,...,...,...,...,...,...,...,...
42384,def maximum(l):\n if len(l) == 1:\n ...,maximum,maximum,Return the maximum element in a list of numbers.,"assert maximum([0])==0 and maximum([67, 1, 2, ...",def maximum(l):\n if len(l) == 1:\n ...,0412928d-97c6-46f2-980b-7d98214b9765,2017,False,1033
42412,"def search(str, letter):\n if letter in str...",search_iter,search,Return whether a letter is part of a string,"assert search('','0')==False and search('0','0...","def search(str, letter):\n if letter in str...",6618fe7e-6fd3-499b-a742-8d68ec712ad3,2017,False,1423
42412,"def search(str, letter):\n if letter in str...",search_iter,search,Return whether a letter is part of a string,"assert search('','0')==False and search('0','0...","def search(str, letter):\n if letter in str...",6618fe7e-6fd3-499b-a742-8d68ec712ad3,2017,False,574
42462,"def swap(a, i, j):\n tmp = a[i]\n a[i] =...",reverse_by_swap,reverse,Reverse a list of elements.,assert reverse([])==[] and reverse([0])==[0] a...,"def swap(a, i, j):\n tmp = a[i]\n a[i] =...",30a4c165-17bc-4bdf-a096-e2a252a403eb,2017,False,487


Let's merge these together.
Both dataframe and the original dataset should have the same lenght. Are there mismatches?

In [9]:
results_df = pd.concat([dataframe, original_df], axis=1)
results_df = results_df.replace(new_assignments_id)
results_df

Unnamed: 0,Question,Sampling Rate,Experiment ID,File Name,Status,Match (Rfty Code),Match (Ori Code),Buggy Code,Buggy Mutation,Refactored Correct Code,...,func_code,assignment_id,func_name,description,test,annotation,user,academic_year,correct,__index_level_0__
12,question_5,100,0,wrong_5_12.py,success_wo_mut,1.0,1,def reverse ( a ) :\n i = 0\n while ( i ...,def reverse ( a ) :\n i = 0\n while ( i ...,def reverse ( a ) :\n i = 0\n while ( i ...,...,def reverse(a):\n i = 0\n while i < len(...,reverse_by_swap,reverse,Reverse a list of elements.,assert reverse([])==[] and reverse([0])==[0] a...,def reverse(a):\n i = 0\n while i < len(...,4a1f2726-b713-40f0-b544-9de55d617a12,2017,False,1219
12,question_5,100,0,wrong_5_12.py,success_wo_mut,1.0,1,def reverse ( a ) :\n i = 0\n while ( i ...,def reverse ( a ) :\n i = 0\n while ( i ...,def reverse ( a ) :\n i = 0\n while ( i ...,...,def reverse(a):\n i = 0\n while i < len(...,reverse_by_swap,reverse,Reverse a list of elements.,assert reverse([])==[] and reverse([0])==[0] a...,def reverse(a):\n i = 0\n while i < len(...,4a1f2726-b713-40f0-b544-9de55d617a12,2017,False,370
16,question_6,100,0,wrong_6_16.py,success_wo_mut,1.0,1,def reverse ( a ) :\n print ( a . reverse (...,def reverse ( a ) :\n print ( a . reverse (...,def reverse ( a ) :\n return a [ : : ( - 1 ...,...,def reverse(a):\n print(a.reverse()),reverse_iter,reverse,Reverse a list of elements.,"assert reverse([])==[] and reverse([20, 10, 0,...",def reverse(a):\n print(a.reverse())\n r...,03141ef3-f364-4b7c-9f52-990a173ac162,2016,False,164
49,question_7,100,0,wrong_7_49.py,success_wo_mut,1.0,1,"def search ( str , letter ) :\n if ( letter...","def search ( str , letter ) :\n if ( letter...","def search ( str , letter ) :\n if ( letter...",...,"def search(str, letter):\n if letter in str...",search_iter,search,Return whether a letter is part of a string,"assert search('','0')==False and search('0','0...","def search(str, letter):\n if letter in str...",e380b6f8-84c6-4978-a85b-78c22ace6b9b,2017,False,800
49,question_7,100,0,wrong_7_49.py,success_wo_mut,1.0,1,"def search ( str , letter ) :\n if ( letter...","def search ( str , letter ) :\n if ( letter...","def search ( str , letter ) :\n if ( letter...",...,"def search(str, letter):\n if letter in str...",search_iter,search,Return whether a letter is part of a string,"assert search('','0')==False and search('0','0...","def search(str, letter):\n if letter in str...",e380b6f8-84c6-4978-a85b-78c22ace6b9b,2017,False,1649
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
42384,question_3,100,0,wrong_3_42384.py,success_wo_mut,1.0,1,def maximum ( l ) :\n if ( len ( l ) == 1 )...,def maximum ( l ) :\n if ( len ( l ) == 1 )...,def maximum ( l ) :\n if ( len ( l ) == 1 )...,...,def maximum(l):\n if len(l) == 1:\n ...,maximum,maximum,Return the maximum element in a list of numbers.,"assert maximum([0])==0 and maximum([67, 1, 2, ...",def maximum(l):\n if len(l) == 1:\n ...,0412928d-97c6-46f2-980b-7d98214b9765,2017,False,1033
42412,question_7,100,0,wrong_7_42412.py,success_wo_mut,1.0,1,"def search ( str , letter ) :\n if ( letter...","def search ( str , letter ) :\n if ( letter...","def search ( str , letter ) :\n if ( letter...",...,"def search(str, letter):\n if letter in str...",search_iter,search,Return whether a letter is part of a string,"assert search('','0')==False and search('0','0...","def search(str, letter):\n if letter in str...",6618fe7e-6fd3-499b-a742-8d68ec712ad3,2017,False,1423
42412,question_7,100,0,wrong_7_42412.py,success_wo_mut,1.0,1,"def search ( str , letter ) :\n if ( letter...","def search ( str , letter ) :\n if ( letter...","def search ( str , letter ) :\n if ( letter...",...,"def search(str, letter):\n if letter in str...",search_iter,search,Return whether a letter is part of a string,"assert search('','0')==False and search('0','0...","def search(str, letter):\n if letter in str...",6618fe7e-6fd3-499b-a742-8d68ec712ad3,2017,False,574
42462,question_5,100,0,wrong_5_42462.py,fail_exception,,1,,,,...,"def swap(a, i, j):\n tmp = a[i]\n a[i] =...",reverse_by_swap,reverse,Reverse a list of elements.,assert reverse([])==[] and reverse([0])==[0] a...,"def swap(a, i, j):\n tmp = a[i]\n a[i] =...",30a4c165-17bc-4bdf-a096-e2a252a403eb,2017,False,487


We decided to remove the results for the reverse_recur assignment since we have only 5 annotations for this one (not enough to matter)

In [10]:
results_df = results_df[results_df.assignment_id != "reverse_recur"]
results_df

Unnamed: 0,Question,Sampling Rate,Experiment ID,File Name,Status,Match (Rfty Code),Match (Ori Code),Buggy Code,Buggy Mutation,Refactored Correct Code,...,func_code,assignment_id,func_name,description,test,annotation,user,academic_year,correct,__index_level_0__
12,question_5,100,0,wrong_5_12.py,success_wo_mut,1.0,1,def reverse ( a ) :\n i = 0\n while ( i ...,def reverse ( a ) :\n i = 0\n while ( i ...,def reverse ( a ) :\n i = 0\n while ( i ...,...,def reverse(a):\n i = 0\n while i < len(...,reverse_by_swap,reverse,Reverse a list of elements.,assert reverse([])==[] and reverse([0])==[0] a...,def reverse(a):\n i = 0\n while i < len(...,4a1f2726-b713-40f0-b544-9de55d617a12,2017,False,1219
12,question_5,100,0,wrong_5_12.py,success_wo_mut,1.0,1,def reverse ( a ) :\n i = 0\n while ( i ...,def reverse ( a ) :\n i = 0\n while ( i ...,def reverse ( a ) :\n i = 0\n while ( i ...,...,def reverse(a):\n i = 0\n while i < len(...,reverse_by_swap,reverse,Reverse a list of elements.,assert reverse([])==[] and reverse([0])==[0] a...,def reverse(a):\n i = 0\n while i < len(...,4a1f2726-b713-40f0-b544-9de55d617a12,2017,False,370
16,question_6,100,0,wrong_6_16.py,success_wo_mut,1.0,1,def reverse ( a ) :\n print ( a . reverse (...,def reverse ( a ) :\n print ( a . reverse (...,def reverse ( a ) :\n return a [ : : ( - 1 ...,...,def reverse(a):\n print(a.reverse()),reverse_iter,reverse,Reverse a list of elements.,"assert reverse([])==[] and reverse([20, 10, 0,...",def reverse(a):\n print(a.reverse())\n r...,03141ef3-f364-4b7c-9f52-990a173ac162,2016,False,164
49,question_7,100,0,wrong_7_49.py,success_wo_mut,1.0,1,"def search ( str , letter ) :\n if ( letter...","def search ( str , letter ) :\n if ( letter...","def search ( str , letter ) :\n if ( letter...",...,"def search(str, letter):\n if letter in str...",search_iter,search,Return whether a letter is part of a string,"assert search('','0')==False and search('0','0...","def search(str, letter):\n if letter in str...",e380b6f8-84c6-4978-a85b-78c22ace6b9b,2017,False,800
49,question_7,100,0,wrong_7_49.py,success_wo_mut,1.0,1,"def search ( str , letter ) :\n if ( letter...","def search ( str , letter ) :\n if ( letter...","def search ( str , letter ) :\n if ( letter...",...,"def search(str, letter):\n if letter in str...",search_iter,search,Return whether a letter is part of a string,"assert search('','0')==False and search('0','0...","def search(str, letter):\n if letter in str...",e380b6f8-84c6-4978-a85b-78c22ace6b9b,2017,False,1649
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
42384,question_3,100,0,wrong_3_42384.py,success_wo_mut,1.0,1,def maximum ( l ) :\n if ( len ( l ) == 1 )...,def maximum ( l ) :\n if ( len ( l ) == 1 )...,def maximum ( l ) :\n if ( len ( l ) == 1 )...,...,def maximum(l):\n if len(l) == 1:\n ...,maximum,maximum,Return the maximum element in a list of numbers.,"assert maximum([0])==0 and maximum([67, 1, 2, ...",def maximum(l):\n if len(l) == 1:\n ...,0412928d-97c6-46f2-980b-7d98214b9765,2017,False,1033
42412,question_7,100,0,wrong_7_42412.py,success_wo_mut,1.0,1,"def search ( str , letter ) :\n if ( letter...","def search ( str , letter ) :\n if ( letter...","def search ( str , letter ) :\n if ( letter...",...,"def search(str, letter):\n if letter in str...",search_iter,search,Return whether a letter is part of a string,"assert search('','0')==False and search('0','0...","def search(str, letter):\n if letter in str...",6618fe7e-6fd3-499b-a742-8d68ec712ad3,2017,False,1423
42412,question_7,100,0,wrong_7_42412.py,success_wo_mut,1.0,1,"def search ( str , letter ) :\n if ( letter...","def search ( str , letter ) :\n if ( letter...","def search ( str , letter ) :\n if ( letter...",...,"def search(str, letter):\n if letter in str...",search_iter,search,Return whether a letter is part of a string,"assert search('','0')==False and search('0','0...","def search(str, letter):\n if letter in str...",6618fe7e-6fd3-499b-a742-8d68ec712ad3,2017,False,574
42462,question_5,100,0,wrong_5_42462.py,fail_exception,,1,,,,...,"def swap(a, i, j):\n tmp = a[i]\n a[i] =...",reverse_by_swap,reverse,Reverse a list of elements.,assert reverse([])==[] and reverse([0])==[0] a...,"def swap(a, i, j):\n tmp = a[i]\n a[i] =...",30a4c165-17bc-4bdf-a096-e2a252a403eb,2017,False,487


## Let's take a look at how well Refactory really performs 

#### Rexecuting the codes and  looking at what is the real success percentage

We notice that Refactory sometimes produces incorrect results but the tool classifies them as correct.
To avoid that, let's determine correctness ourselves. We'll only analyze the Results of Refactory on the codes
which were successfully corrected

In [11]:
from src.utils.TestResults import TestResults

results_df.loc[pd.isnull(results_df.Repair), "Repair"] = ""
results_df = TestResults().get_correctness(results_df, "Repair")
results_df

Unnamed: 0,Question,Sampling Rate,Experiment ID,File Name,Status,Match (Rfty Code),Match (Ori Code),Buggy Code,Buggy Mutation,Refactored Correct Code,...,func_code,assignment_id,func_name,description,test,annotation,user,academic_year,correct,__index_level_0__
12,question_5,100,0,wrong_5_12.py,success_wo_mut,1.0,1,def reverse ( a ) :\n i = 0\n while ( i ...,def reverse ( a ) :\n i = 0\n while ( i ...,def reverse ( a ) :\n i = 0\n while ( i ...,...,def reverse(a):\n i = 0\n while i < len(...,reverse_by_swap,reverse,Reverse a list of elements.,assert reverse([])==[] and reverse([0])==[0] a...,def reverse(a):\n i = 0\n while i < len(...,4a1f2726-b713-40f0-b544-9de55d617a12,2017,True,1219
12,question_5,100,0,wrong_5_12.py,success_wo_mut,1.0,1,def reverse ( a ) :\n i = 0\n while ( i ...,def reverse ( a ) :\n i = 0\n while ( i ...,def reverse ( a ) :\n i = 0\n while ( i ...,...,def reverse(a):\n i = 0\n while i < len(...,reverse_by_swap,reverse,Reverse a list of elements.,assert reverse([])==[] and reverse([0])==[0] a...,def reverse(a):\n i = 0\n while i < len(...,4a1f2726-b713-40f0-b544-9de55d617a12,2017,True,370
16,question_6,100,0,wrong_6_16.py,success_wo_mut,1.0,1,def reverse ( a ) :\n print ( a . reverse (...,def reverse ( a ) :\n print ( a . reverse (...,def reverse ( a ) :\n return a [ : : ( - 1 ...,...,def reverse(a):\n print(a.reverse()),reverse_iter,reverse,Reverse a list of elements.,"assert reverse([])==[] and reverse([20, 10, 0,...",def reverse(a):\n print(a.reverse())\n r...,03141ef3-f364-4b7c-9f52-990a173ac162,2016,True,164
49,question_7,100,0,wrong_7_49.py,success_wo_mut,1.0,1,"def search ( str , letter ) :\n if ( letter...","def search ( str , letter ) :\n if ( letter...","def search ( str , letter ) :\n if ( letter...",...,"def search(str, letter):\n if letter in str...",search_iter,search,Return whether a letter is part of a string,"assert search('','0')==False and search('0','0...","def search(str, letter):\n if letter in str...",e380b6f8-84c6-4978-a85b-78c22ace6b9b,2017,False,800
49,question_7,100,0,wrong_7_49.py,success_wo_mut,1.0,1,"def search ( str , letter ) :\n if ( letter...","def search ( str , letter ) :\n if ( letter...","def search ( str , letter ) :\n if ( letter...",...,"def search(str, letter):\n if letter in str...",search_iter,search,Return whether a letter is part of a string,"assert search('','0')==False and search('0','0...","def search(str, letter):\n if letter in str...",e380b6f8-84c6-4978-a85b-78c22ace6b9b,2017,False,1649
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
42384,question_3,100,0,wrong_3_42384.py,success_wo_mut,1.0,1,def maximum ( l ) :\n if ( len ( l ) == 1 )...,def maximum ( l ) :\n if ( len ( l ) == 1 )...,def maximum ( l ) :\n if ( len ( l ) == 1 )...,...,def maximum(l):\n if len(l) == 1:\n ...,maximum,maximum,Return the maximum element in a list of numbers.,"assert maximum([0])==0 and maximum([67, 1, 2, ...",def maximum(l):\n if len(l) == 1:\n ...,0412928d-97c6-46f2-980b-7d98214b9765,2017,True,1033
42412,question_7,100,0,wrong_7_42412.py,success_wo_mut,1.0,1,"def search ( str , letter ) :\n if ( letter...","def search ( str , letter ) :\n if ( letter...","def search ( str , letter ) :\n if ( letter...",...,"def search(str, letter):\n if letter in str...",search_iter,search,Return whether a letter is part of a string,"assert search('','0')==False and search('0','0...","def search(str, letter):\n if letter in str...",6618fe7e-6fd3-499b-a742-8d68ec712ad3,2017,False,1423
42412,question_7,100,0,wrong_7_42412.py,success_wo_mut,1.0,1,"def search ( str , letter ) :\n if ( letter...","def search ( str , letter ) :\n if ( letter...","def search ( str , letter ) :\n if ( letter...",...,"def search(str, letter):\n if letter in str...",search_iter,search,Return whether a letter is part of a string,"assert search('','0')==False and search('0','0...","def search(str, letter):\n if letter in str...",6618fe7e-6fd3-499b-a742-8d68ec712ad3,2017,False,574
42462,question_5,100,0,wrong_5_42462.py,fail_exception,,1,,,,...,"def swap(a, i, j):\n tmp = a[i]\n a[i] =...",reverse_by_swap,reverse,Reverse a list of elements.,assert reverse([])==[] and reverse([0])==[0] a...,"def swap(a, i, j):\n tmp = a[i]\n a[i] =...",30a4c165-17bc-4bdf-a096-e2a252a403eb,2017,False,487


In [12]:
groups = results_df.groupby("assignment_id")
success_percentage = groups.apply(lambda gdf: (gdf.correct.sum() / len(gdf)) * 100)
success_percentage

assignment_id
count_letters        60.000000
index_iter            0.000000
maximum              97.894737
minimum              97.761194
reverse_by_swap      33.944954
reverse_iter         92.682927
search_iter          24.185249
search_recur         16.802168
sumup               100.000000
swap_keys_values     94.000000
dtype: float64

In [13]:
non_working = results_df[~results_df.correct]
non_working

Unnamed: 0,Question,Sampling Rate,Experiment ID,File Name,Status,Match (Rfty Code),Match (Ori Code),Buggy Code,Buggy Mutation,Refactored Correct Code,...,func_code,assignment_id,func_name,description,test,annotation,user,academic_year,correct,__index_level_0__
49,question_7,100,0,wrong_7_49.py,success_wo_mut,1.0,1,"def search ( str , letter ) :\n if ( letter...","def search ( str , letter ) :\n if ( letter...","def search ( str , letter ) :\n if ( letter...",...,"def search(str, letter):\n if letter in str...",search_iter,search,Return whether a letter is part of a string,"assert search('','0')==False and search('0','0...","def search(str, letter):\n if letter in str...",e380b6f8-84c6-4978-a85b-78c22ace6b9b,2017,False,800
49,question_7,100,0,wrong_7_49.py,success_wo_mut,1.0,1,"def search ( str , letter ) :\n if ( letter...","def search ( str , letter ) :\n if ( letter...","def search ( str , letter ) :\n if ( letter...",...,"def search(str, letter):\n if letter in str...",search_iter,search,Return whether a letter is part of a string,"assert search('','0')==False and search('0','0...","def search(str, letter):\n if letter in str...",e380b6f8-84c6-4978-a85b-78c22ace6b9b,2017,False,1649
437,question_8,100,0,wrong_8_437.py,success_wo_mut,1.0,1,"def search ( string , letter ) :\n if ( str...","def search ( string , letter ) :\n if ( str...","def search ( str , letter ) :\n if ( letter...",...,"def search(string, letter):\n if string[0] ...",search_recur,search,Return whether a letter is part of a string,"assert search('','0')==False and search('0','0...","def search(string, letter):\n if string == ...",2a5e3eed-41c7-46e6-9bee-3acc21c1f81b,2017,False,1759
437,question_8,100,0,wrong_8_437.py,success_wo_mut,1.0,1,"def search ( string , letter ) :\n if ( str...","def search ( string , letter ) :\n if ( str...","def search ( str , letter ) :\n if ( letter...",...,"def search(string, letter):\n if string[0] ...",search_recur,search,Return whether a letter is part of a string,"assert search('','0')==False and search('0','0...","def search(string, letter):\n if string == ...",2a5e3eed-41c7-46e6-9bee-3acc21c1f81b,2017,False,910
472,question_7,100,0,wrong_7_472.py,success_wo_mut,1.0,1,"def search ( str , letter ) :\n if ( letter...","def search ( str , letter ) :\n if ( letter...","def search ( str , letter ) :\n if ( letter...",...,"def search(str, letter):\n if letter in str...",search_iter,search,Return whether a letter is part of a string,"assert search('','0')==False and search('0','0...","def search(str, letter):\n if letter in str...",bc728955-e4e8-48d1-9acb-b83b3fd023ba,2017,False,1526
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
42327,question_7,100,0,wrong_7_42327.py,success_wo_mut,1.0,1,"def search ( str , letter ) :\n if ( letter...","def search ( str , letter ) :\n if ( letter...","def search ( str , letter ) :\n if ( letter...",...,"def search(str, letter):\n if letter in str...",search_iter,search,Return whether a letter is part of a string,"assert search('','0')==False and search('0','0...","def search(str, letter):\n if letter in str...",e380b6f8-84c6-4978-a85b-78c22ace6b9b,2017,False,808
42412,question_7,100,0,wrong_7_42412.py,success_wo_mut,1.0,1,"def search ( str , letter ) :\n if ( letter...","def search ( str , letter ) :\n if ( letter...","def search ( str , letter ) :\n if ( letter...",...,"def search(str, letter):\n if letter in str...",search_iter,search,Return whether a letter is part of a string,"assert search('','0')==False and search('0','0...","def search(str, letter):\n if letter in str...",6618fe7e-6fd3-499b-a742-8d68ec712ad3,2017,False,1423
42412,question_7,100,0,wrong_7_42412.py,success_wo_mut,1.0,1,"def search ( str , letter ) :\n if ( letter...","def search ( str , letter ) :\n if ( letter...","def search ( str , letter ) :\n if ( letter...",...,"def search(str, letter):\n if letter in str...",search_iter,search,Return whether a letter is part of a string,"assert search('','0')==False and search('0','0...","def search(str, letter):\n if letter in str...",6618fe7e-6fd3-499b-a742-8d68ec712ad3,2017,False,574
42462,question_5,100,0,wrong_5_42462.py,fail_exception,,1,,,,...,"def swap(a, i, j):\n tmp = a[i]\n a[i] =...",reverse_by_swap,reverse,Reverse a list of elements.,assert reverse([])==[] and reverse([0])==[0] a...,"def swap(a, i, j):\n tmp = a[i]\n a[i] =...",30a4c165-17bc-4bdf-a096-e2a252a403eb,2017,False,487


### Preparing the distance computations

In [14]:
from src.utils.code import clean_code

results_df = results_df[results_df.correct] # take only the Refactory corrections which are actually correct
rename = {
    "func_code": "buggy_code",
    "Repair": "candidate_code",
    "annotation": "expert_code"
}
results_df = results_df.rename(columns=rename)
results_df = results_df[["buggy_code", "candidate_code", "expert_code", "assignment_id"]]

results_df = results_df[results_df.buggy_code.astype(bool)]
results_df["buggy_code"] = results_df["buggy_code"].apply(clean_code)

results_df = results_df[results_df.expert_code.astype(bool)]
results_df["expert_code"] = results_df["expert_code"].apply(clean_code)

results_df = results_df[results_df.candidate_code.astype(bool)]
results_df["candidate_code"] = results_df["candidate_code"].apply(clean_code)
results_df

Unnamed: 0,buggy_code,candidate_code,expert_code,assignment_id
12,def reverse(a):\n i = 0\n while i < len(...,def reverse(a):\n i = 0\n while i < len(...,def reverse(a):\n i = 0\n while i < len(...,reverse_by_swap
12,def reverse(a):\n i = 0\n while i < len(...,def reverse(a):\n i = 0\n while i < len(...,def reverse(a):\n i = 0\n while i < len(...,reverse_by_swap
16,def reverse(a):\n print(a.reverse()),def reverse(a):\n return a[::-1],def reverse(a):\n print(a.reverse())\n r...,reverse_iter
150,def maximum(l):\n if len(l) == 1:\n ...,def maximum(l):\n if len(l) == 1:\n ...,def maximum(l):\n if len(l) == 1:\n ...,maximum
213,"def search(string, letter):\n if letter in ...","def search(string, letter):\n if letter in ...","def search(string, letter):\n if letter in ...",search_iter
...,...,...,...,...
42174,"def search(string, *letter):\n i = 0\n i...","def search(string, letter):\n if letter in ...","def search(string, letter):\n if string == ...",search_recur
42374,def count_letters(s):\n total = 0\n if n...,def count_letters(s):\n total = 0\n if n...,def count_letters(s):\n total = 0\n if n...,count_letters
42374,def count_letters(s):\n total = 0\n if n...,def count_letters(s):\n total = 0\n if n...,def count_letters(s):\n total = 0\n if n...,count_letters
42384,def maximum(l):\n if len(l) == 1:\n ...,def maximum(l):\n if len(l) == 1:\n ...,def maximum(l):\n if len(l) == 1:\n ...,maximum


In [15]:
for b, r, e in results_df[results_df.assignment_id == "maximum"][["buggy_code", "candidate_code", "expert_code"]].to_numpy():
    print(b)
    print(r)
    print(e)
    print("---")

def maximum(l):
    if len(l) == 1:
        return l[0]
    m = minimum(l[1:])
    return m if m > l[0] else l[0]
def maximum(l):
    if len(l) == 1:
        return l[0]
    m = maximum(l[1:])
    return m if m > l[0] else l[0]
def maximum(l):
    if len(l) == 1:
        return l[0]
    m = maximum(l[1:])
    return m if m > l[0] else l[0]
---
def maximum(a):
    if len(a) == 1:
        return a[0]
    if a[0] > minimum(a[1:]):
        return a[0]
    else:
        return minimum(a[1:])
def maximum(a):
    if len(a) == 1:
        return a[0]
    if a[0] > maximum(a[1:]):
        return a[0]
    else:
        return maximum(a[1:])
def maximum(a):
    if len(a) == 1:
        return a[0]
    if a[0] > maximum(a[1:]):
        return a[0]
    else:
        return maximum(a[1:])
---
def maximum(a):
    if len(a) == 1:
        return a[0]
    if a[0] > minimum(a[1:]):
        return a[0]
    else:
        return minimum(a[1:])
def maximum(a):
    if len(a) == 1:
        return a[0]
    if a[0

### Distance computations between different codes 

### Let's compute the classification error between the expert annotation and refactory candidate repair

Let's compute the number of times where, if we would use the sequence edit distance, or the string edit distance, we would select the candidate repair (the Refactory output) over the true goal.

In [16]:
from itertools import product, combinations

get_name = lambda c: c.split('_')[0]
from_to = list(combinations(["buggy_code", "expert_code", "candidate_code"], 2))
elements = list(product(from_to, dist_funcs))
for (from_, target), dist_f in elements:
    col_name = f"{dist_f.__name__}-{get_name(from_)}_{get_name(target)}"
    buggies = results_df[from_].to_list()
    corrections = results_df[target].to_list()
    results_df[col_name] = list(map(dist_f, buggies, corrections))

results_df = results_df.reset_index(drop=True)
results_df



Unnamed: 0,buggy_code,candidate_code,expert_code,assignment_id,codebleu_dist-buggy_expert,codebleu_dist-buggy_candidate,codebleu_dist-expert_candidate
0,def reverse(a):\n i = 0\n while i < len(...,def reverse(a):\n i = 0\n while i < len(...,def reverse(a):\n i = 0\n while i < len(...,reverse_by_swap,0.036765,0.036765,0.000000
1,def reverse(a):\n i = 0\n while i < len(...,def reverse(a):\n i = 0\n while i < len(...,def reverse(a):\n i = 0\n while i < len(...,reverse_by_swap,0.036765,0.036765,0.000000
2,def reverse(a):\n print(a.reverse()),def reverse(a):\n return a[::-1],def reverse(a):\n print(a.reverse())\n r...,reverse_iter,0.367936,0.743558,0.806406
3,def maximum(l):\n if len(l) == 1:\n ...,def maximum(l):\n if len(l) == 1:\n ...,def maximum(l):\n if len(l) == 1:\n ...,maximum,0.187393,0.187393,0.000000
4,"def search(string, letter):\n if letter in ...","def search(string, letter):\n if letter in ...","def search(string, letter):\n if letter in ...",search_iter,0.319754,0.428543,0.289745
...,...,...,...,...,...,...,...
750,"def search(string, *letter):\n i = 0\n i...","def search(string, letter):\n if letter in ...","def search(string, letter):\n if string == ...",search_recur,0.787896,0.878620,0.692724
751,def count_letters(s):\n total = 0\n if n...,def count_letters(s):\n total = 0\n if n...,def count_letters(s):\n total = 0\n if n...,count_letters,0.323579,0.457042,0.342078
752,def count_letters(s):\n total = 0\n if n...,def count_letters(s):\n total = 0\n if n...,def count_letters(s):\n total = 0\n if n...,count_letters,0.323579,0.457042,0.342078
753,def maximum(l):\n if len(l) == 1:\n ...,def maximum(l):\n if len(l) == 1:\n ...,def maximum(l):\n if len(l) == 1:\n ...,maximum,0.183167,0.183167,0.000000


In [17]:
def compute_error(sub_df):
    r = {}
    for dist_n in dist_names:
        bcd = sub_df[f"{dist_n}-buggy_candidate"]
        bed = sub_df[f"{dist_n}-buggy_expert"]
        r[dist_n] = sub_df[bcd < bed].shape[0]
               
    return pd.Series(r)
     

dist_names = [d.__name__ for d in dist_funcs]
targets = [c.split('_')[0] for c in ["candidate", "expert"]]
dist_names, targets

error = results_df.groupby("assignment_id").apply(compute_error)

error.columns = [c.replace("_dist", '').upper() for c in error.columns]
error = error.sort_values(by=error.first_valid_index(), ascending=False, axis=1)


selected_columns = [c for c in error.columns if "RPS" not in c]
selected_columns = ["TED", "SEQ", "STR", "TED_NORM", "SEQ_NORM", "STR_NORM","BLEU", "CODEBLEU", "ROUGE1", "ROUGELCSUM"]
error = error[selected_columns]

# adding the number of solutions per assignment as well as the success percentage
nb_code = results_df.groupby("assignment_id").buggy_code.count()
nb_code.name = "#prog"
error = pd.concat([nb_code, error], axis=1)
total = error.sum(axis=0).astype(int)
total.name = "total"
error.loc["total"] = total
error = error.astype(int)
error = error.rename(columns = {
            "TED": 'ted', 'SEQ': 'seq', 'STR': 'str',
            "TED_NORM": "nted", "STR_NORM": "nstr", "SEQ_NORM": "nseq", 
            'BLEU': 'bleu', "CODEBLEU": "codebleu", "ROUGE1": "rouge", "ROUGELCSUM": "rougeLCS"})
print(error.to_latex(multicolumn=True, multirow=True, column_format='r|c|ccc|ccc|ccc'))
error

KeyError: "['TED', 'SEQ', 'STR', 'TED_NORM', 'SEQ_NORM', 'STR_NORM', 'BLEU', 'ROUGE1', 'ROUGELCSUM'] not in index"

We can observe that the number of times were we observe that the rouge distance metric misclassifies our elements is consistantly lower than for the string distance measure

### Let's look at the distances a bit deeper

#### Average distance between buggy->expert, and buggy->candidate

In [None]:
# melt the dataframe
df = results_df.melt(
    id_vars="assignment_id",
    var_name="measure",
    value_name="value",
    value_vars=[c for c in results_df.columns if "-" in c])
# rename the distance metrics
df["distance_metric"] = df["measure"].apply(lambda dm: dm.split("-")[0])
df["distance_metric"] = df["distance_metric"].apply(lambda c: c.replace("_dist", '').upper())
df["from"] = df["measure"].apply(lambda dm: dm.split("-")[1].split("_")[0])
df["to"] = df["measure"].apply(lambda dm: dm.split("-")[1].split("_")[1])
df = df.replace({"ROUGELCSUM": "ROUGELCS"})
df

In [None]:
df.distance_metric.unique()

In [None]:
def plot_univariate(metric):
    print("Metric", metric)
    sub_df = df[(df.distance_metric == metric) & (df["from"] == "buggy")]
    g = sns.displot(data=sub_df, x="value", hue="to", col="distance_metric", kde=True)
    sns.move_legend(g, "center", bbox_to_anchor=(0.50, 0.65), ncol=11, title=None, frameon=True)
    plt.savefig(f'images/{metric}_hist.pdf', dpi=100,  bbox_inches='tight')

In [None]:
def plot_ecdf(metric):
    sub_df = df[(df.distance_metric == metric) & (df["from"] == "buggy")]
    g = sns.displot(data=sub_df, x="value", hue="to", kind="ecdf", col="distance_metric")
    sns.move_legend(g, "center", bbox_to_anchor=(0.50, 0.30), ncol=11, title=None, frameon=True)
    plt.savefig(f'images/{metric}_ecdf.pdf', dpi=100,  bbox_inches='tight')

In [None]:
for metric in ["STR", "SEQ", "ROUGELCS", "SEQ_NORM", "STR_NORM"]:
    plot_univariate(metric)
    plot_ecdf(metric)