#### This notebook tests whether the number of patterns correlates with the mutation score

Research Question: Does the presence of code patterns in a function statistically correlate with the mutation score for that function?

In [17]:
import pandas as pd

In [42]:
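def process_json_file(filepath):
    """Load one analysis JSON file and return its non-test functions."""
    sub_df = pd.read_json(filepath)
    # Drop rows that have a missing value in any column
    sub_df = sub_df.dropna()
    # Exclude test functions; only the code under test is analyzed
    sub_df = sub_df[~sub_df['function_name'].str.startswith("test_")]
    return sub_df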
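
To illustrate the two filtering steps in `process_json_file`, here is a minimal sketch on a small in-memory frame. The records are hypothetical and carry only two columns; the real JSON files contain more fields.

```python
import pandas as pd

# Hypothetical records, for illustration only
toy = pd.DataFrame(
    {
        "function_name": ["bubble_sort", "test_bubble_sort", "merge", "helper"],
        "mutation_score": [0.0, 50.0, None, 100.0],
    }
)

# Same two steps as process_json_file: drop incomplete rows, then drop tests
cleaned = toy.dropna()
cleaned = cleaned[~cleaned["function_name"].str.startswith("test_")]

print(cleaned["function_name"].tolist())
```

Here `merge` is removed because its score is missing, and `test_bubble_sort` is removed because of its `test_` prefix.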

In [43]:
import os

In [50]:
dataframes = []
folder_path = "../json_analysis"

# Load every JSON file in the folder and tag its rows with the source filename
for filename in os.listdir(folder_path):
    if filename.endswith(".json"):
        filepath = os.path.join(folder_path, filename)
        sub_df = process_json_file(filepath)
        sub_df.insert(0, "filename", filename)
        dataframes.append(sub_df)

# Combine the per-file frames into one and renumber the index
df = pd.concat(dataframes, ignore_index=True)

In [51]:
# Count the patterns detected in each function
df['num_patterns'] = df['patterns'].apply(len)

In [52]:
df

Unnamed: 0,filename,function_name,function_scope,patterns,mutants,mutation_score,num_patterns
0,101_sorting.json,bubble_sort,21-38,"[{'lineno': 21, 'coloffset': 0, 'linematch': '...","[{'name': 'Mutant #167', 'line': 21, 'descript...",0.0,11
1,101_sorting.json,insertion_sort,41-60,"[{'lineno': 41, 'coloffset': 0, 'linematch': '...","[{'name': 'Mutant #182', 'line': 44, 'descript...",0.0,7
2,101_sorting.json,merge,63-93,"[{'lineno': 63, 'coloffset': 0, 'linematch': '...","[{'name': 'Mutant #191', 'line': 64, 'descript...",0.0,19
3,101_sorting.json,generate_random_number,17-22,"[{'lineno': 17, 'coloffset': 0, 'linematch': '...","[{'name': 'Mutant #166', 'line': 18, 'descript...",0.0,6
4,101_sorting.json,generate_random_container,25-33,"[{'lineno': 25, 'coloffset': 0, 'linematch': '...","[{'name': 'Mutant #170', 'line': 25, 'descript...",0.0,7
5,101_sorting.json,run_sorting_algorithm,36-52,"[{'lineno': 36, 'coloffset': 0, 'linematch': '...","[{'name': 'Mutant #178', 'line': 36, 'descript...",0.0,8
6,101_sorting.json,run_sorting_algorithm_experiment_campaign,55-81,"[{'lineno': 55, 'coloffset': 0, 'linematch': '...","[{'name': 'Mutant #187', 'line': 57, 'descript...",0.0,7
7,101_sorting.json,listsorting,45-78,"[{'lineno': 45, 'coloffset': 0, 'linematch': '...","[{'name': 'Mutant #183', 'line': 46, 'descript...",0.0,17
8,101_intersection.json,human_readable_boolean,30-36,"[{'lineno': 30, 'coloffset': 0, 'linematch': '...","[{'name': 'Mutant #11', 'line': 34, 'descripti...",0.0,7
9,101_intersection.json,generate_random_container,39-47,"[{'lineno': 39, 'coloffset': 0, 'linematch': '...","[{'name': 'Mutant #13', 'line': 40, 'descripti...",33.333333,9


#### Pearson's correlation coefficient

In [39]:
import scipy.stats as stats

In [40]:
correlation, p_value = stats.pearsonr(df['num_patterns'], df['mutation_score'])

In [41]:
print("Correlation coefficient (r):", correlation)
print("p-value:", p_value)

Correlation coefficient (r): -0.062768625330273
p-value: 0.7557786416213805


Observation:
* The correlation coefficient (r) is very close to 0, indicating a very weak negative correlation between the number of code patterns and the mutation score.
* The p-value (about 0.76) is well above the typical threshold (often set at 0.05), so a correlation this weak could easily arise by chance, and there is not enough evidence to claim a statistically significant relationship.

--> There's not enough evidence to conclude a statistically significant correlation between the number of patterns and the mutation score.
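
To make the link between r and the p-value concrete, the sketch below recomputes the two-sided p-value from the t-statistic t = r * sqrt((n - 2) / (1 - r^2)) with n - 2 degrees of freedom and compares it with the value returned by `stats.pearsonr`. It uses only the ten rows visible in the table output above, not the full dataset, so its numbers will differ from the r and p-value reported in this notebook. The names `t_stat` and `p_manual` are illustrative.

```python
import math

from scipy import stats

# The ten (num_patterns, mutation_score) pairs shown in the table preview;
# this is a subset of the data, used here only for illustration
num_patterns = [11, 7, 19, 6, 7, 8, 7, 17, 7, 9]
mutation_score = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 33.333333]

r, p_scipy = stats.pearsonr(num_patterns, mutation_score)

# Under the null hypothesis of no correlation, this statistic follows a
# t-distribution with n - 2 degrees of freedom
n = len(num_patterns)
t_stat = r * math.sqrt((n - 2) / (1 - r**2))

# Two-sided p-value: probability of a |t| at least this large by chance
p_manual = 2 * stats.t.sf(abs(t_stat), df=n - 2)

print("r:", r)
print("p-value (scipy):", p_scipy)
print("p-value (from t-statistic):", p_manual)
```

The two p-values should agree up to floating-point error, which shows that for a fixed sample size the p-value is determined entirely by r.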