#### This notebook tests whether the number of patterns correlates with the mutation score

Research Question: Does the number of code patterns in a function statistically correlate with the mutation score for that function?

In [1]:
import pandas as pd

In [2]:
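def process_json_file(filepath):
    """Load one analysis file, keeping only complete rows for non-test functions."""
    sub_df = pd.read_json(filepath)
    # Drop rows that contain missing values.
    sub_df = sub_df.dropna()
    # Exclude test functions, whose names start with "test_".
    sub_df = sub_df[~sub_df['function_name'].str.startswith("test_")]
    return sub_df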

In [3]:
import os

In [4]:
dataframes = []
folder_path = "../json_analysis"

for filename in os.listdir(folder_path):
    if filename.endswith(".json"):
        filepath = os.path.join(folder_path, filename)
        sub_df = process_json_file(filepath)
        sub_df.insert(0,"filename",filename)
        dataframes.append(sub_df)

df = pd.concat(dataframes, ignore_index=True)

In [5]:
df['num_patterns'] = df['patterns'].apply(len)

In [6]:
df

Unnamed: 0,filename,function_name,function_scope,patterns,mutants,mutation_score,num_patterns
0,diagrams_output.json,pre_mutation,3-11,"[{'lineno': 6, 'coloffset': 4, 'linematch': 'i...","[{'name': 'Mutant #367', 'line': 9, 'descripti...",0.000000,5
1,diagrams_output.json,render,193-198,"[{'lineno': 194, 'coloffset': 8, 'linematch': ...","[{'name': 'Mutant #107', 'line': 196, 'descrip...",0.000000,4
2,diagrams_output.json,__init__,446-487,"[{'lineno': 466, 'coloffset': 8, 'linematch': ...","[{'name': 'Mutant #224', 'line': 449, 'descrip...",11.764706,25
3,diagrams_output.json,append,515-525,"[{'lineno': 518, 'coloffset': 12, 'linematch':...","[{'name': 'Mutant #247', 'line': 516, 'descrip...",25.000000,6
4,diagrams_output.json,connect,527-540,"[{'lineno': 528, 'coloffset': 8, 'linematch': ...","[{'name': 'Mutant #251', 'line': 533, 'descrip...",100.000000,11
...,...,...,...,...,...,...,...
318,dummy_project.json,setSSN,306-308,"[{'lineno': 306, 'coloffset': 4, 'linematch': ...","[{'name': 'Mutant #175', 'line': 307, 'descrip...",100.000000,4
319,dummy_project.json,setEmail,309-311,"[{'lineno': 309, 'coloffset': 4, 'linematch': ...","[{'name': 'Mutant #176', 'line': 310, 'descrip...",100.000000,4
320,dummy_project.json,setAddress,312-314,"[{'lineno': 312, 'coloffset': 4, 'linematch': ...","[{'name': 'Mutant #177', 'line': 313, 'descrip...",100.000000,4
321,dummy_project.json,create_person,315-318,"[{'lineno': 315, 'coloffset': 4, 'linematch': ...","[{'name': 'Mutant #178', 'line': 316, 'descrip...",100.000000,5


#### Pearson's correlation coefficient

In [7]:
import scipy.stats as stats

In [8]:
correlation, p_value = stats.pearsonr(df['num_patterns'], df['mutation_score'])

In [9]:
print("Correlation coefficient (r):", correlation)
print("p-value:", p_value)

Correlation coefficient (r): 0.04403846493962099
p-value: 0.4302391627315757


Observations:
* The correlation coefficient (r ≈ 0.044) is very close to 0 and positive, indicating at most a very weak positive linear relationship between the number of code patterns and the mutation score.
* The p-value (≈ 0.43) is well above the conventional 0.05 threshold, so the null hypothesis of no linear correlation cannot be rejected; a coefficient this small is consistent with chance.

--> There is not enough evidence to conclude that the number of patterns and the mutation score are significantly correlated.
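
To make the link between r, the sample size, and the p-value concrete, here is a minimal sketch of the standard t-test for a Pearson coefficient. The helper name `pearson_p_value` and the synthetic arrays `x` and `y` are assumptions for illustration only; they are not part of this notebook's data. Under the null hypothesis of no linear correlation (and approximately normal data), t = r * sqrt((n - 2) / (1 - r^2)) follows a Student t distribution with n - 2 degrees of freedom.

```python
import numpy as np
from scipy import stats


def pearson_p_value(r, n):
    """Two-sided p-value for a Pearson r computed from n paired observations.

    Hypothetical helper for illustration; scipy.stats.pearsonr already
    reports this value.
    """
    # Test statistic under H0: no linear correlation.
    t_stat = r * np.sqrt((n - 2) / (1.0 - r ** 2))
    # Two-sided tail probability of a Student t with n - 2 degrees of freedom.
    return 2.0 * stats.t.sf(abs(t_stat), df=n - 2)


# Synthetic stand-in data (NOT the notebook's df): two unrelated variables.
rng = np.random.default_rng(0)
x = rng.integers(1, 30, size=300).astype(float)   # e.g. pattern counts
y = rng.uniform(0.0, 100.0, size=300)             # e.g. mutation scores

r, p_scipy = stats.pearsonr(x, y)
p_manual = pearson_p_value(r, len(x))

print("r:", r)
print("p-value (scipy): ", p_scipy)
print("p-value (manual):", p_manual)
```

The two p-values should agree up to floating-point error. With the notebook's data, the same helper could be called as `pearson_p_value(correlation, len(df))`. Because the statistic grows with sqrt(n - 2), a coefficient near 0.04 would need a far larger sample before it could reach significance at the 0.05 level.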