# Experiment 2

To validate our results in a more complex setting, we examine how each distance measure ranks an expert annotation against a single other high-quality candidate repair found by a state-of-the-art automated repair technique. 

We use Refactory, a state-of-the-art semantic Automated Repair Tool (ART), to find a candidate repair for each incorrect solution in our annotated dataset. To obtain a high-quality repair, we give the ART access to the same pool of candidate repairs used in the first experiment (without the expert solution). From this pool of correct programs, Refactory generates a larger suite of semantically equivalent code by refactoring all available working solutions to a problem. Then, given an incorrect program, Refactory analyzes its control-flow structure to find a closely matching working program, which it compares against the incorrect program to isolate the buggy components. As such, the candidate repair generated by Refactory should be at least as appropriate as the best candidates in the original pool (which, once again, might contain the student's own correction to the problem).

We repeat experiment 1 using the candidate repair found for each buggy solution. The main difference is that we now compare the expert annotation/repair against the single candidate obtained with Refactory. The ranking error for each buggy program therefore becomes a binary classification error. For every metric, we report the total classification error, that is, the number of times the ART candidate repair was favored over the expert annotation.
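
As a minimal sketch of this error count, the toy example below uses made-up distances (the column names and values are illustrative assumptions, not results from the dataset). An error is counted whenever the candidate repair is strictly closer to the buggy program than the expert repair is.

```python
import pandas as pd

# Hypothetical distances from each buggy program to its two possible repairs.
toy = pd.DataFrame({
    "buggy_expert":    [3, 5, 2, 4],
    "buggy_candidate": [4, 2, 2, 1],
})

# The metric "favors" the ART candidate when it is strictly closer than the expert repair.
misclassified = toy["buggy_candidate"] < toy["buggy_expert"]
total_error = int(misclassified.sum())
print(total_error)
```

Ties are not counted as errors here because the comparison is strict.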

In [None]:
import os, sys
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from datasets import disable_caching

#### General settings

In [None]:
sys.path.append("../")
sys.path.append("../../")
disable_caching()
sns.set_theme("paper")
plt.rcParams['font.size'] = '7'
sns.set(font_scale=1.1)

In [None]:
from src.common import dist_funcs, new_assignments_id

## Let's load our data

In [None]:
CONFIG_PATH = '../configs/conf.json'

In [None]:
from src.utils.files import read_config

config = read_config(CONFIG_PATH)
config

### Loading the Refactory results dataframe

In [None]:
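def extract_index(file_name):
    # Take the last "_"-separated chunk and drop its final 3 characters
    # (presumably the ".py" extension) to recover the integer submission index.
    return int(file_name.split("_")[-1][:-3])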

In [None]:
from warnings import warn

questions = os.listdir(config.save_path)
questions = [q for q in questions if q.startswith("question")]
key_f = lambda q: int(q.split('_')[-1])
questions = sorted(questions, key=key_f)

dataframe = []
for q in questions:
    q_path = os.path.join(config.save_path, q, 'refactory_online.csv')
    if not os.path.exists(q_path):
        warn(f"Results for assignment {q} are not available")
        continue
    dataframe.append(pd.read_csv(q_path))
    
dataframe = pd.concat(dataframe, axis=0, ignore_index=True)
dataframe["index"] = dataframe["File Name"].apply(extract_index).astype(int)
dataframe = dataframe.set_index("index")
dataframe = dataframe.sort_index()
dataframe

### Loading the dataframe used to obtain Refactory's repairs

In [None]:
from datasets import load_from_disk

dataset = load_from_disk(os.path.join(config.save_path, 'hgf'))
original_df = dataset.to_pandas()
# We only take the incorrect ones
original_df = original_df[~original_df.correct]
original_df = original_df.set_index("submission_id")
original_df = original_df.sort_index()
original_df

Let's merge these together.
Both `dataframe` and the original dataset should have the same length. Are there mismatches?
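
One way to answer that question before merging is to compare the two indices directly. The sketch below uses small stand-in frames (their contents are invented for illustration); the same two checks could be run on `dataframe` and `original_df`.

```python
import pandas as pd

# Toy stand-ins for the two frames; both are indexed by submission id.
refactory_toy = pd.DataFrame({"Repair": ["a", "b", "c"]}, index=[0, 1, 3])
original_toy = pd.DataFrame({"func_code": ["x", "y", "z"]}, index=[0, 1, 2])

# Equal lengths alone do not guarantee that the ids line up.
same_length = len(refactory_toy) == len(original_toy)

# Ids that appear in exactly one of the two frames.
mismatched_ids = refactory_toy.index.symmetric_difference(original_toy.index)
print(same_length, list(mismatched_ids))
```

This matters because `pd.concat(..., axis=1)` performs an outer join on the index by default, so any id present in only one frame yields a row padded with NaN values.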

In [None]:
results_df = pd.concat([dataframe, original_df], axis=1)
results_df = results_df.replace(new_assignments_id)
results_df

We remove the results for the `reverse_recur` assignment because it has only 5 annotations, too few to matter.

In [None]:
results_df = results_df[results_df.assignment_id != "reverse_recur"]
results_df

## Let's take a look at how well Refactory really performs 

#### Re-executing the code and measuring the real success percentage

We notice that Refactory sometimes produces incorrect repairs that the tool nevertheless classifies as correct.
To avoid relying on that label, let's determine correctness ourselves. We will only analyze the results of Refactory on the programs
that were successfully corrected.
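
The `TestResults` helper used in this notebook is not shown here. Purely as an illustration of the idea, a repair can be re-executed against input/output cases as sketched below; `passes_tests`, the test cases, and the signature of `maximum` are hypothetical and are not the project's actual test harness.

```python
def passes_tests(code, func_name, cases):
    """Return True if `code` defines `func_name` and it satisfies every (args, expected) case."""
    namespace = {}
    try:
        exec(code, namespace)  # run the repaired program in its own namespace
        func = namespace[func_name]
        return all(func(*args) == expected for args, expected in cases)
    except Exception:
        # Syntax errors, missing functions and runtime errors all count as failures.
        return False


# Hypothetical test cases and repairs for an assignment named "maximum".
cases = [((1, 2), 2), ((5, 3), 5)]
good = "def maximum(a, b):\n    return a if a > b else b"
bad = "def maximum(a, b):\n    return a"
print(passes_tests(good, "maximum", cases), passes_tests(bad, "maximum", cases))
```

An empty repair string fails this check as well, since the expected function is never defined. Note that `exec` offers no sandboxing or timeout, so a real harness would need both.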

In [None]:
from src.utils.TestResults import TestResults

results_df.loc[pd.isnull(results_df.Repair), "Repair"] = ""
results_df = TestResults().get_correctness(results_df, "Repair")
results_df

In [None]:
groups = results_df.groupby("assignment_id")
success_percentage = groups.apply(lambda gdf: (gdf.correct.sum() / len(gdf)) * 100)
success_percentage

In [None]:
non_working = results_df[~results_df.correct]
non_working

### Preparing the distance computations

In [None]:
from src.utils.code import clean_code

results_df = results_df[results_df.correct] # take only the Refactory corrections which are actually correct
rename = {
    "func_code": "buggy_code",
    "Repair": "candidate_code",
    "annotation": "expert_code"
}
results_df = results_df.rename(columns=rename)
results_df = results_df[["buggy_code", "candidate_code", "expert_code", "assignment_id"]]

results_df = results_df[results_df.buggy_code.astype(bool)]
results_df["buggy_code"] = results_df["buggy_code"].apply(clean_code)

results_df = results_df[results_df.expert_code.astype(bool)]
results_df["expert_code"] = results_df["expert_code"].apply(clean_code)

results_df = results_df[results_df.candidate_code.astype(bool)]
results_df["candidate_code"] = results_df["candidate_code"].apply(clean_code)
results_df

In [None]:
for b, r, e in results_df[results_df.assignment_id == "maximum"][["buggy_code", "candidate_code", "expert_code"]].to_numpy():
    print(b)
    print(r)
    print(e)
    print("---")

### Distance computations between different codes 

### Let's compute the classification error between the expert annotation and the Refactory candidate repair

For each distance measure (for example, the sequence edit distance or the string edit distance), let's count how many times it would select the candidate repair (the Refactory output) over the true goal (the expert annotation).
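
To make the comparison concrete, here is a generic Levenshtein edit distance applied to a made-up buggy/expert/candidate triple. It is only a sketch: the distance functions actually used in this notebook come from `src.common.dist_funcs` and may be implemented differently, and the three code snippets are invented.

```python
def levenshtein(a, b):
    """Edit distance between two sequences (insert, delete, substitute each cost 1)."""
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,             # delete x
                current[j - 1] + 1,          # insert y
                previous[j - 1] + (x != y),  # substitute (free when equal)
            ))
        previous = current
    return previous[-1]


# Invented example programs (single lines, for brevity).
buggy = "return a"
expert = "return a + b"
candidate = "return sum([a, b])"

d_expert = levenshtein(buggy, expert)
d_candidate = levenshtein(buggy, candidate)
picks_candidate = d_candidate < d_expert  # True would count as one classification error
print(d_expert, d_candidate, picks_candidate)
```

Because the function accepts any sequence, passing lists of tokens instead of strings gives a sequence-level variant of the same distance.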

In [None]:
from itertools import product, combinations

get_name = lambda c: c.split('_')[0]
from_to = list(combinations(["buggy_code", "expert_code", "candidate_code"], 2))
elements = list(product(from_to, dist_funcs))
for (from_, target), dist_f in elements:
    col_name = f"{dist_f.__name__}-{get_name(from_)}_{get_name(target)}"
    buggies = results_df[from_].to_list()
    corrections = results_df[target].to_list()
    results_df[col_name] = list(map(dist_f, buggies, corrections))

results_df = results_df.reset_index(drop=True)
results_df

In [None]:
def compute_error(sub_df):
    r = {}
    for dist_n in dist_names:
        bcd = sub_df[f"{dist_n}-buggy_candidate"]
        bed = sub_df[f"{dist_n}-buggy_expert"]
        r[dist_n] = sub_df[bcd < bed].shape[0]
               
    return pd.Series(r)
     

dist_names = [d.__name__ for d in dist_funcs]
targets = [c.split('_')[0] for c in ["candidate", "expert"]]
dist_names, targets

error = results_df.groupby("assignment_id").apply(compute_error)

error.columns = [c.replace("_dist", '').upper() for c in error.columns]
error = error.sort_values(by=error.first_valid_index(), ascending=False, axis=1)


# keep a fixed subset of metrics, in a fixed order, for the report
selected_columns = ["TED", "SEQ", "STR", "TED_NORM", "SEQ_NORM", "STR_NORM","BLEU", "CODEBLEU", "ROUGE1", "ROUGELCSUM"]
error = error[selected_columns]

# adding the number of solutions per assignment
nb_code = results_df.groupby("assignment_id").buggy_code.count()
nb_code.name = "#prog"
error = pd.concat([nb_code, error], axis=1)
total = error.sum(axis=0).astype(int)
total.name = "total"
error.loc["total"] = total
error = error.astype(int)
error = error.rename(columns = {
            "TED": 'ted', 'SEQ': 'seq', 'STR': 'str',
            "TED_NORM": "nted", "STR_NORM": "nstr", "SEQ_NORM": "nseq", 
            'BLEU': 'bleu', "CODEBLEU": "codebleu", "ROUGE1": "rouge", "ROUGELCSUM": "rougeLCS"})
# 12 columns in total: index, #prog, and the 10 selected metrics
print(error.to_latex(multicolumn=True, multirow=True, column_format='r|c|ccc|ccc|cccc'))
error

We observe that the ROUGE distance metric misclassifies fewer elements than the string distance measure, and does so consistently.

### Let's look at the distances a bit deeper

#### Average distance between buggy->expert, and buggy->candidate

In [None]:
# melt the dataframe
df = results_df.melt(
    id_vars="assignment_id",
    var_name="measure",
    value_name="value",
    value_vars=[c for c in results_df.columns if "-" in c])
# rename the distance metrics
df["distance_metric"] = df["measure"].apply(lambda dm: dm.split("-")[0])
df["distance_metric"] = df["distance_metric"].apply(lambda c: c.replace("_dist", '').upper())
df["from"] = df["measure"].apply(lambda dm: dm.split("-")[1].split("_")[0])
df["to"] = df["measure"].apply(lambda dm: dm.split("-")[1].split("_")[1])
df = df.replace({"ROUGELCSUM": "ROUGELCS"})
df
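
The parsing above relies on the column naming scheme `<distance function name>-<from>_<to>`. The snippet below walks through it on a single name; `str_dist-buggy_expert` is an assumed example of such a column name, not one confirmed from the data.

```python
# Assumed example of a distance column name: "<function name>-<from>_<to>".
measure = "str_dist-buggy_expert"

metric_part, pair_part = measure.split("-")
metric = metric_part.replace("_dist", "").upper()  # short upper-case metric label
source, target = pair_part.split("_")              # the two programs being compared
print(metric, source, target)
```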

In [None]:
df.distance_metric.unique()

In [None]:
def plot_univariate(metric):
    print("Metric", metric)
    sub_df = df[(df.distance_metric == metric) & (df["from"] == "buggy")]
    g = sns.displot(data=sub_df, x="value", hue="to", col="distance_metric", kde=True)
    sns.move_legend(g, "center", bbox_to_anchor=(0.50, 0.65), ncol=11, title=None, frameon=True)
    os.makedirs('images', exist_ok=True)  # make sure the output folder exists
    plt.savefig(f'images/{metric}_hist.pdf', dpi=100,  bbox_inches='tight')

In [None]:
def plot_ecdf(metric):
    sub_df = df[(df.distance_metric == metric) & (df["from"] == "buggy")]
    g = sns.displot(data=sub_df, x="value", hue="to", kind="ecdf", col="distance_metric")
    sns.move_legend(g, "center", bbox_to_anchor=(0.50, 0.30), ncol=11, title=None, frameon=True)
    os.makedirs('images', exist_ok=True)  # make sure the output folder exists
    plt.savefig(f'images/{metric}_ecdf.pdf', dpi=100,  bbox_inches='tight')

In [None]:
for metric in ["STR", "SEQ", "ROUGELCS", "SEQ_NORM", "STR_NORM"]:
    plot_univariate(metric)
    plot_ecdf(metric)