Imports definition

In [1]:
import pandas as pd
import numpy as np
from scipy.stats import shapiro, wilcoxon, ttest_rel

# Load the dataset

In [35]:
# Load the data
data = pd.read_csv('result_analysis/input/DatasetExecution.csv')
data.replace("No NOD tests", np.nan, inplace=True)

# Build a copy that excludes hive (row 5) because of its unstable execution
dataExclHive = data[data['project'] != 'hive'].copy()

data

Unnamed: 0,project,isolated_tests_OD_detected,isolated_tests_NOD_detected,isolated_tests_repaired,isolated_detection_OD_recall,isolated_detection_NOD_recall,isolated_repair_recall,isolated_detection_time,isolated_repair_time,integrated_tests_OD_detected,integrated_tests_NOD_detected,integrated_tests_repaired,integrated_detection_OD_recall,integrated_detection_NOD_recall,integrated_repair_recall,integrated_total_time
0,spring-cloud-netflix,8,0,0,0.8,,0.0,44,9,8,0,0,0.8,,0.0,39
1,json-iterator,20,1,0,1.0,0.125,0.0,9,8,20,1,0,1.0,0.125,0.0,17
2,nifi,0,0,0,0.0,,0.0,494,143,1,0,0,0.047619,,0.0,174
3,fastjson,0,0,1,0.0,0.0,0.066667,12,70,0,0,0,0.0,0.0,0.0,112
4,wasp,28,89,0,1.0,0.773913043,0.0,215,12,28,89,0,1.0,0.773913043,0.0,226
5,hive,3,0,0,0.157895,,0.0,1440,74,2,0,0,0.105263,,0.0,792
6,spring-cloud-kubernetes,30,0,0,1.0,,0.0,481,34,29,0,0,0.966667,,0.0,456
7,visualee,47,0,46,1.0,,0.978723,10,222,47,0,45,1.0,,0.957447,227
8,ormlite-core,87,0,0,1.0,,0.0,61,40,87,0,0,1.0,,0.0,76
9,remoting,0,0,0,0.0,0.0,0.0,380,26,0,0,0,0.0,0.0,0.0,558


# Obtain Descriptive Statistics

__Recall descriptive statistics (OD)__

In [22]:
recall_columns_OD = [
    'isolated_detection_OD_recall', 'isolated_repair_recall', 
    'integrated_detection_OD_recall', 'integrated_repair_recall'
]

data[recall_columns_OD].describe()

Unnamed: 0,isolated_detection_OD_recall,isolated_repair_recall,integrated_detection_OD_recall,integrated_repair_recall
count,10.0,10.0,10.0,10.0
mean,0.595789,0.104539,0.591955,0.095745
std,0.4848,0.307871,0.481166,0.302771
min,0.0,0.0,0.0,0.0
25%,0.039474,0.0,0.06203,0.0
50%,0.9,0.0,0.883333,0.0
75%,1.0,0.0,1.0,0.0
max,1.0,0.978723,1.0,0.957447


Conclusions: From a manual inspection, the mean detection recall for OD flaky tests is approximately the same for the isolated and integrated executions of the tools (0.596 vs. 0.592). The same holds for the mean repair recall (0.105 vs. 0.096).
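Because means can hide where the two setups diverge, a per-project view of the paired differences can complement the summary statistics. The following is a minimal sketch that re-enters the recall values from the dataset table by hand; the `recall` frame and the `diff_*` column names are introduced here for illustration only and are not part of the original analysis.

```python
import pandas as pd

# Recall values copied by hand from the dataset table above.
recall = pd.DataFrame(
    {
        "isolated_detection": [0.8, 1.0, 0.0, 0.0, 1.0, 0.157895, 1.0, 1.0, 1.0, 0.0],
        "integrated_detection": [0.8, 1.0, 0.047619, 0.0, 1.0, 0.105263, 0.966667, 1.0, 1.0, 0.0],
        "isolated_repair": [0.0, 0.0, 0.0, 0.066667, 0.0, 0.0, 0.0, 0.978723, 0.0, 0.0],
        "integrated_repair": [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.957447, 0.0, 0.0],
    },
    index=[
        "spring-cloud-netflix", "json-iterator", "nifi", "fastjson", "wasp",
        "hive", "spring-cloud-kubernetes", "visualee", "ormlite-core", "remoting",
    ],
)

# Paired differences (isolated minus integrated) per project.
recall["diff_detection"] = recall["isolated_detection"] - recall["integrated_detection"]
recall["diff_repair"] = recall["isolated_repair"] - recall["integrated_repair"]

# Keep only the projects where the two setups disagree on at least one metric.
differs = recall[(recall["diff_detection"] != 0) | (recall["diff_repair"] != 0)]
print(differs[["diff_detection", "diff_repair"]])
```

On these values, five of the ten projects show a non-zero difference in at least one of the two recall metrics; the remaining projects are identical in both setups.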

__Recall descriptive statistics (NOD)__

In [33]:
recall_columns_NOD = [
    'isolated_detection_NOD_recall', 'integrated_detection_NOD_recall',
]

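# Keep only the projects that have NOD tests (rows with a NOD recall value)
data_nod = data.dropna(subset=recall_columns_NOD)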

data_nod[recall_columns_NOD].astype(float).describe()

Unnamed: 0,isolated_detection_NOD_recall,integrated_detection_NOD_recall
count,4.0,4.0
mean,0.224728,0.224728
std,0.370835,0.370835
min,0.0,0.0
25%,0.0,0.0
50%,0.0625,0.0625
75%,0.287228,0.287228
max,0.773913,0.773913


Conclusions: From a manual inspection, the mean detection recall for NOD flaky tests is the same for the isolated and integrated executions of the tools (0.225 in both cases, over the 4 projects with NOD tests).

__Time descriptive statistics (with hive)__

In [37]:
data['sum_isolated_time'] = data['isolated_detection_time'] + data['isolated_repair_time']

time_columns = ['isolated_detection_time', 'isolated_repair_time', 'sum_isolated_time', 'integrated_total_time']
data[time_columns].describe()

Unnamed: 0,isolated_detection_time,isolated_repair_time,sum_isolated_time,integrated_total_time
count,10.0,10.0,10.0,10.0
mean,314.6,63.8,378.4,267.7
std,441.697785,69.222347,449.881021,254.50215
min,9.0,8.0,17.0,17.0
25%,20.0,15.5,86.75,85.0
50%,138.0,37.0,229.5,200.0
75%,455.75,73.0,487.75,398.75
max,1440.0,222.0,1514.0,792.0


Conclusions: From a manual inspection, the mean total time for running the detection and repair tools in isolation (378.4) is higher than the mean time for the integrated run (267.7).

__Time descriptive statistics (without hive)__

In [38]:
dataExclHive['sum_isolated_time'] = dataExclHive['isolated_detection_time'] + dataExclHive['isolated_repair_time']

time_columns = ['isolated_detection_time', 'isolated_repair_time', 'sum_isolated_time', 'integrated_total_time']
dataExclHive[time_columns].describe()

Unnamed: 0,isolated_detection_time,isolated_repair_time,sum_isolated_time,integrated_total_time
count,9.0,9.0,9.0,9.0
mean,189.555556,62.666667,252.222222,209.444444
std,208.752911,73.322916,220.414824,186.248564
min,9.0,8.0,17.0,17.0
25%,12.0,12.0,82.0,76.0
50%,61.0,34.0,227.0,174.0
75%,380.0,70.0,406.0,227.0
max,494.0,222.0,637.0,558.0


Conclusions: If the hive anomaly is excluded, the difference between the isolated and integrated execution times shrinks. However, the mean summed time of the isolated executions (252.2) is still higher than the mean time of the integrated execution (209.4).
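Since the means are sensitive to a few long-running projects, it can help to look at the time difference project by project. The sketch below re-enters the times from the dataset table by hand; the `times` frame and the `time_saved` column are hypothetical names used only for this illustration.

```python
import pandas as pd

# Times copied by hand from the dataset table above (units as recorded in the CSV).
times = pd.DataFrame(
    {
        "isolated_detection_time": [44, 9, 494, 12, 215, 1440, 481, 10, 61, 380],
        "isolated_repair_time": [9, 8, 143, 70, 12, 74, 34, 222, 40, 26],
        "integrated_total_time": [39, 17, 174, 112, 226, 792, 456, 227, 76, 558],
    },
    index=[
        "spring-cloud-netflix", "json-iterator", "nifi", "fastjson", "wasp",
        "hive", "spring-cloud-kubernetes", "visualee", "ormlite-core", "remoting",
    ],
)

# Positive values mean the integrated run was faster for that project.
times["time_saved"] = (
    times["isolated_detection_time"]
    + times["isolated_repair_time"]
    - times["integrated_total_time"]
)

faster = int((times["time_saved"] > 0).sum())
slower = int((times["time_saved"] < 0).sum())
tied = int((times["time_saved"] == 0).sum())

print(times["time_saved"].sort_values())
print(f"integrated faster: {faster}, slower: {slower}, tied: {tied}")
```

On these values the integrated run is faster for seven projects, slower for two (fastjson and remoting), and tied for one (json-iterator), with nifi and hive accounting for most of the total time saved.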

# Addressing MQ2

MQ2: Can a unified approach for detecting and repairing flakiness caused by order-dependent tests be as effective as the existing alternative approaches?

### Hypothesis test (Detection)

__Define the hypothesis:__

Null Hypothesis (H0): There is no difference in the effectiveness between the unified approach and existing alternative approaches for detecting order-dependent test flakiness.

Alternative Hypothesis (H1): There is a difference in the effectiveness between the unified approach and existing alternative approaches for detecting order-dependent test flakiness.

__Select the significance level:__

5%.

__Verify normality of the data using the Shapiro–Wilk test__

In [45]:
isolated_detection_OD_recall = data['isolated_detection_OD_recall']
integrated_detection_OD_recall = data['integrated_detection_OD_recall']

# Shapiro-Wilk for isolated_detection_OD_recall
stat_isolated, p_isolated = shapiro(isolated_detection_OD_recall)
print("Shapiro-Wilk Test for isolated_detection_OD_recall:")
print("p-value:", p_isolated)

# Shapiro-Wilk test for integrated_detection_OD_recall
stat_integrated, p_integrated = shapiro(integrated_detection_OD_recall)
print("\nShapiro-Wilk Test integrated_detection_OD_recall:")
print("p-value:", p_integrated)

Shapiro-Wilk Test for isolated_detection_OD_recall:
p-value: 0.001281676592161565

Shapiro-Wilk Test integrated_detection_OD_recall:
p-value: 0.0014673270193433048


Since both p-values are below the defined significance level, we reject the null hypothesis of normality for both data groups. This suggests that neither the isolated nor the integrated detection recall is normally distributed.

Considering this, the Wilcoxon signed-rank test will be used to test the hypothesis.

__Wilcoxon signed-rank test__

In [54]:
statistic, p_value = wilcoxon(isolated_detection_OD_recall, integrated_detection_OD_recall)
print("Wilcoxon Signed-Rank Test:")
print("p-value:", p_value)

Wilcoxon Signed-Rank Test:
p-value: 0.5929800980174267


Since the p-value is greater than the defined significance level, we fail to reject the null hypothesis.
As such, the detection component of the integration seems to be as effective at identifying flaky tests as the detection tool run in isolation.
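To make the mechanics of the signed-rank test concrete, the sketch below computes the rank sums by hand for the detection recall. It assumes the default behaviour of SciPy's `wilcoxon` (`zero_method="wilcox"`), which discards pairs whose difference is zero; the recall values are re-entered from the dataset table, and the variable names are illustrative only.

```python
import numpy as np
from scipy.stats import rankdata

# OD detection recall copied by hand from the dataset table above.
isolated = np.array([0.8, 1.0, 0.0, 0.0, 1.0, 0.157895, 1.0, 1.0, 1.0, 0.0])
integrated = np.array([0.8, 1.0, 0.047619, 0.0, 1.0, 0.105263, 0.966667, 1.0, 1.0, 0.0])

diff = isolated - integrated
nonzero = diff[diff != 0]          # zero differences carry no sign and are dropped
ranks = rankdata(np.abs(nonzero))  # rank the absolute differences

w_plus = ranks[nonzero > 0].sum()   # rank sum where isolated recall was higher
w_minus = ranks[nonzero < 0].sum()  # rank sum where integrated recall was higher

print("informative pairs:", len(nonzero))
print("W+ =", w_plus, "W- =", w_minus)
```

Only three of the ten pairs have a non-zero difference, so the test rests on very little information. The non-significant result is therefore better read as an absence of evidence for a difference than as a demonstration of equal effectiveness.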

### Hypothesis test (Repair)

__Define the hypothesis:__

Null Hypothesis (H0): There is no difference in the effectiveness between the unified approach and existing alternative approaches for repairing order-dependent test flakiness.

Alternative Hypothesis (H1): There is a difference in the effectiveness between the unified approach and existing alternative approaches for repairing order-dependent test flakiness.

__Select the significance level:__

5%.

__Verify normality of the data using the Shapiro–Wilk test__

In [49]:
isolated_repair_recall = data['isolated_repair_recall']
integrated_repair_recall = data['integrated_repair_recall']

# Shapiro-Wilk for isolated_repair_recall
stat_isolated, p_isolated = shapiro(isolated_repair_recall)
print("Shapiro-Wilk Test for isolated_repair_recall:")
print("p-value:", p_isolated)

# Shapiro-Wilk test for integrated_repair_recall
stat_integrated, p_integrated = shapiro(integrated_repair_recall)
print("\nShapiro-Wilk Test integrated_repair_recall:")
print("p-value:", p_integrated)

Shapiro-Wilk Test for isolated_repair_recall:
p-value: 2.4309789957156995e-07

Shapiro-Wilk Test integrated_repair_recall:
p-value: 1.0036928213864587e-07


Since both p-values are below the defined significance level, we reject the null hypothesis of normality for both data groups. This suggests that neither the isolated nor the integrated repair recall is normally distributed.

Considering this, the Wilcoxon signed-rank test will be used to test the hypothesis.

__Wilcoxon signed-rank test__

In [50]:
statistic, p_value = wilcoxon(isolated_repair_recall, integrated_repair_recall)
print("Wilcoxon Signed-Rank Test:")
print("p-value:", p_value)

Wilcoxon Signed-Rank Test:
p-value: 0.17971249487899976


Since the p-value is greater than the defined significance level, we fail to reject the null hypothesis.
As such, the repair component of the integration seems to be as effective at repairing flaky tests as the repair tool run in isolation.

### Conclusion

A unified approach for detecting and repairing flakiness caused by order-dependent tests can be as effective as the existing isolated alternatives, both in terms of detection and repair.

# Addressing MQ3

MQ3: Can a unified approach for detecting and repairing flakiness caused by order-dependent tests be less time-consuming than the existing alternative approaches?

### Hypothesis test (Including Hive)

__Define the hypothesis:__

Null Hypothesis (H0): There is no difference in the time taken between the unified approach and existing alternative approaches for detecting and repairing order-dependent test flakiness.

Alternative Hypothesis (H1): There is a difference in the time taken between the unified approach and existing alternative approaches for detecting and repairing order-dependent test flakiness.

__Select the significance level:__

5%.

__Verify normality of the data using the Shapiro–Wilk test__

In [53]:
sum_isolated_time = data['sum_isolated_time']
integrated_total_time = data['integrated_total_time']

# Shapiro-Wilk for sum_isolated_time
stat_isolated, p_isolated = shapiro(sum_isolated_time)
print("Shapiro-Wilk Test for sum_isolated_time:")
print("p-value:", p_isolated)

# Shapiro-Wilk test for integrated_total_time
stat_integrated, p_integrated = shapiro(integrated_total_time)
print("\nShapiro-Wilk Test integrated_total_time:")
print("p-value:", p_integrated)

Shapiro-Wilk Test for sum_isolated_time:
p-value: 0.005632316702184004

Shapiro-Wilk Test integrated_total_time:
p-value: 0.10583784750839886


The p-value for the integrated total time (0.106) is above the defined significance level, so normality cannot be rejected for that group. The p-value for the summed isolated execution times (0.0056) is below the significance level, so that group is not normally distributed.

Since normality is rejected for one of the groups, the Wilcoxon signed-rank test will be used to test the hypothesis.

__Wilcoxon signed-rank test__

In [55]:
statistic, p_value = wilcoxon(sum_isolated_time, integrated_total_time)
print("Wilcoxon Signed-Rank Test:")
print("p-value:", p_value)

Wilcoxon Signed-Rank Test:
p-value: 0.21352435403618242


Since the p-value is greater than the defined significance level, we fail to reject the null hypothesis.
As such, there does not seem to be a statistically significant difference in the time taken between the unified approach and the existing alternative approaches for detecting and repairing order-dependent test flakiness.
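MQ3 asks specifically whether the unified approach is less time-consuming, which is a directional question. As a complement to the two-sided test, the sketch below runs a one-sided variant; it assumes a SciPy version whose `wilcoxon` accepts the `alternative` argument and re-enters the times from the dataset table by hand. It is an illustration, not part of the original analysis.

```python
import numpy as np
from scipy.stats import wilcoxon

# Summed isolated times and integrated times (hive included), copied from the table.
sum_isolated_time = np.array([53, 17, 637, 82, 227, 1514, 515, 232, 101, 406])
integrated_total_time = np.array([39, 17, 174, 112, 226, 792, 456, 227, 76, 558])

# Two-sided test, matching the hypotheses stated for this section.
p_two_sided = wilcoxon(sum_isolated_time, integrated_total_time).pvalue

# One-sided test: H1 is that isolated times tend to exceed integrated times.
p_one_sided = wilcoxon(
    sum_isolated_time, integrated_total_time, alternative="greater"
).pvalue

print("two-sided p-value:", p_two_sided)
print("one-sided p-value:", p_one_sided)
```

The one-sided p-value is smaller than the two-sided one because most of the signed ranks favour the integrated run. A one-sided test is only appropriate if the direction was fixed before looking at the data.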

### Hypothesis test (Excluding Hive)

__Define the hypothesis:__

Null Hypothesis (H0): There is no difference in the time taken between the unified approach and existing alternative approaches for detecting and repairing order-dependent test flakiness.

Alternative Hypothesis (H1): There is a difference in the time taken between the unified approach and existing alternative approaches for detecting and repairing order-dependent test flakiness.

__Select the significance level:__

5%.

__Verify normality of the data using the Shapiro–Wilk test__

In [58]:
sum_isolated_time = dataExclHive['sum_isolated_time']
integrated_total_time = dataExclHive['integrated_total_time']

# Shapiro-Wilk for sum_isolated_time
stat_isolated, p_isolated = shapiro(sum_isolated_time)
print("Shapiro-Wilk Test for sum_isolated_time:")
print("p-value:", p_isolated)

# Shapiro-Wilk test for integrated_total_time
stat_integrated, p_integrated = shapiro(integrated_total_time)
print("\nShapiro-Wilk Test integrated_total_time:")
print("p-value:", p_integrated)

Shapiro-Wilk Test for sum_isolated_time:
p-value: 0.25234480140365034

Shapiro-Wilk Test integrated_total_time:
p-value: 0.1609466302956885


With the hive project excluded because of its anomalous execution, both data groups have a p-value above the defined significance level, so normality cannot be rejected for either group.

Since neither group shows evidence against normality, the paired t-test will be used to test the hypothesis.
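The normality assumption of the paired t-test applies to the per-project differences rather than to each group separately. As an additional check, the sketch below applies the Shapiro-Wilk test to those differences; the times are re-entered by hand from the dataset table (hive excluded) and `paired_diff` is a name introduced only for this illustration.

```python
import numpy as np
from scipy.stats import shapiro

# Summed isolated times and integrated times with hive excluded, copied from the table.
sum_isolated_time = np.array([53, 17, 637, 82, 227, 515, 232, 101, 406])
integrated_total_time = np.array([39, 17, 174, 112, 226, 456, 227, 76, 558])

# The paired t-test is a one-sample t-test on these differences.
paired_diff = sum_isolated_time - integrated_total_time

stat_diff, p_diff = shapiro(paired_diff)
print("Shapiro-Wilk Test for paired differences:")
print("p-value:", p_diff)
```

If this p-value fell below the significance level, the Wilcoxon signed-rank test would be the safer choice for this comparison as well.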

__Paired t-test__

In [60]:
statistic, p_value = ttest_rel(sum_isolated_time, integrated_total_time)
print("Paired t-Test for Time Comparison:")
print("p-value:", p_value)

Paired t-Test for Time Comparison:
p-value: 0.46728123446021697


Since the p-value is greater than the defined significance level, we fail to reject the null hypothesis.
As such, there does not seem to be a statistically significant difference in the time taken between the unified approach and the existing alternative approaches for detecting and repairing order-dependent test flakiness.
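A p-value alone does not show how large the time difference might be. The sketch below estimates the mean paired difference and a 95% confidence interval based on the t distribution; the times are re-entered by hand from the dataset table (hive excluded), and the variable names are illustrative rather than taken from the original notebook.

```python
import numpy as np
from scipy import stats

# Summed isolated times and integrated times with hive excluded, copied from the table.
sum_isolated_time = np.array([53, 17, 637, 82, 227, 515, 232, 101, 406])
integrated_total_time = np.array([39, 17, 174, 112, 226, 456, 227, 76, 558])

paired_diff = sum_isolated_time - integrated_total_time
n = len(paired_diff)

mean_diff = paired_diff.mean()              # average time saved by the integrated run
sem = paired_diff.std(ddof=1) / np.sqrt(n)  # standard error of the mean difference
t_crit = stats.t.ppf(0.975, df=n - 1)       # two-sided 95% critical value

ci_low = mean_diff - t_crit * sem
ci_high = mean_diff + t_crit * sem

print(f"mean difference: {mean_diff:.2f}")
print(f"95% CI: ({ci_low:.2f}, {ci_high:.2f})")
```

The interval contains zero, which is consistent with the non-significant paired t-test. Its width shows that, with nine projects, the data are compatible with the integrated run being either noticeably faster or somewhat slower.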

### Conclusion

Whether or not the hive project is included, no statistically significant difference in execution time was found. A unified approach for detecting and repairing flakiness caused by order-dependent tests therefore appears to be about as time-consuming as running the existing detection and repair tools in isolation.