# Statistical Analysis 

## Class imbalance, WSS95%, and TNR95%

To evaluate the differences between the WSS95% and TNR95% metrics across datasets, a naïve Bayes (NB) + TF-IDF simulation was run on 24 of the 26 synergy datasets. Brouwer_2019 and Walker_2018 were excluded because their large size implied a high expected computation time. The rank order per metric and a Wilcoxon signed-rank test were computed. Spearman correlations were computed to examine associations between the variables.
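
For orientation, both metrics can be sketched from a ranked list of labels. The sketch below assumes the commonly used definitions WSS@r = (TN + FN)/N - (1 - r) and TNR@r = TN/(TN + FP), both evaluated at the first point in the screening order where recall reaches r. The function name `wss_and_tnr_at_recall` and the toy labels are hypothetical, and the simulation software may handle edge cases differently.

```python
import math


def wss_and_tnr_at_recall(ranked_labels, recall=0.95):
    """Return (WSS@recall, TNR@recall) for labels given in screening order.

    ranked_labels: 1 = relevant, 0 = irrelevant (assumed encoding).
    """
    n = len(ranked_labels)
    n_relevant = sum(ranked_labels)
    n_irrelevant = n - n_relevant
    if n_relevant == 0 or n_irrelevant == 0:
        raise ValueError("need at least one relevant and one irrelevant record")

    # Number of relevant records that must be found to reach the target recall.
    needed = math.ceil(recall * n_relevant)

    found = 0
    screened = 0
    for label in ranked_labels:
        screened += 1
        found += label
        if found >= needed:
            break

    fp = screened - found      # irrelevant records that had to be screened
    tn = n_irrelevant - fp     # irrelevant records never screened
    fn = n_relevant - found    # relevant records missed

    wss = (tn + fn) / n - (1 - recall)
    tnr = tn / n_irrelevant
    return wss, tnr


# Toy example: 3 relevant records among 10, in screening order.
wss, tnr = wss_and_tnr_at_recall([1, 1, 0, 1, 0, 0, 0, 0, 0, 0])
print(round(wss, 3), round(tnr, 3))
```

Under these definitions WSS depends on the total number of records N, whereas TNR is normalised by the number of irrelevant records only, which is why the two can rank datasets differently when class imbalance varies.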

### Dataset
After running the simulation on the 24 datasets, the relevant metrics file can be accessed in the /output/tables folder.


In [38]:
import os 
import numpy as np
import pandas as pd
from scipy.stats import spearmanr
from scipy.stats import shapiro
from scipy.stats import wilcoxon

In [14]:
# Load the metrics file (SYN.csv) from the working directory
os.chdir('C:/.../')
syn = pd.read_csv('SYN.csv')

In [10]:
syn

Unnamed: 0.1,Unnamed: 0,file_name,recall_0.1,recall_0.25,recall_0.5,recall_0.75,recall_0.9,wss_0.95,erf_0.1,atd,...,fn_1,tnr_0.1,tnr_0.25,tnr_0.5,tnr_0.75,tnr_0.8,tnr_0.85,tnr_0.9,tnr_0.95,tnr_1
0,13,Appenzeller-Herzog_2019_0.json,0.88,0.96,0.96,1.0,1.0,0.794845,0.8,169.96,...,0,0.997189,0.991216,0.981026,0.962052,0.929023,0.925861,0.921293,0.901616,0.429726
1,14,Bos_2018_0.json,1.0,1.0,1.0,1.0,1.0,0.815217,0.888889,65.777778,...,0,0.0,0.995274,0.993014,0.985206,0.984179,0.984179,0.983357,0.983357,0.963427
2,15,Chou_2003_0.json,0.857143,0.857143,0.928571,0.928571,1.0,0.449108,0.785714,207.642857,...,0,0.998943,0.998943,0.983615,0.970402,0.930761,0.930761,0.930761,0.559725,0.223573
3,0,Chou_2004_0.json,0.25,0.375,0.625,1.0,1.0,0.208845,0.125,572.625,...,0,0.0,0.983333,0.582099,0.448148,0.448148,0.448148,0.397531,0.397531,0.396914
4,1,Donners_2021_0.json,0.642857,0.928571,0.928571,1.0,1.0,0.6875,0.571429,32.071429,...,0,1.0,0.991736,0.954545,0.917355,0.917355,0.917355,0.85124,0.834711,0.318182
5,16,Hall_2012_0.json,0.990291,1.0,1.0,1.0,1.0,0.910818,0.893204,122.563107,...,0,0.999424,0.998273,0.994705,0.990677,0.990447,0.989641,0.988605,0.984692,0.806054
6,2,Jeyaraman_2020_0.json,0.568421,0.778947,0.989474,0.989474,0.989474,0.543052,0.473684,166.673684,...,0,0.997217,0.988868,0.956401,0.833952,0.785714,0.769944,0.66141,0.648423,0.068646
7,12,Leenaars_2019_0.json,1.0,1.0,1.0,1.0,1.0,0.894664,0.875,31.875,...,0,1.0,0.99931,0.999137,0.99534,0.99534,0.99534,0.993096,0.990853,0.975664
8,17,Leenaars_2020_0.json,0.475945,0.82646,0.984536,0.998282,1.0,0.576795,0.376289,1031.510309,...,0,0.996984,0.981604,0.926267,0.83655,0.820115,0.783625,0.736429,0.679885,0.197678
9,3,Meijboom_2021_0.json,0.666667,0.944444,1.0,1.0,1.0,0.688636,0.555556,88.861111,...,0,0.996445,0.978673,0.945498,0.895735,0.895735,0.853081,0.832938,0.787915,0.597156


In [22]:
#add rank, differences columns
syn_wt = syn[['file_name', 'wss_0.95', 'tnr_0.95']]
syn_wt = syn_wt.assign(tnr_wss_difference=syn_wt['tnr_0.95'] - syn_wt['wss_0.95'])
syn_wt['rank_wss_0.95'] = (syn_wt['wss_0.95'].rank(ascending=False)).astype(int)
syn_wt['rank_tnr_0.95'] = (syn_wt['tnr_0.95'].rank(ascending=False)).astype(int)
syn_wt['rank_difference'] = syn_wt['rank_tnr_0.95'] - syn_wt['rank_wss_0.95']
syn_wt['class_imbalance'] = [0.9,0.2,0.8,0.6,5.8,1.2,8.2,0.3,8.1,4.2,7.6,2.1,12.4,21.9,2.1,0.8,14.8,1,0.8,12.3,1.7,0.8,1.4,0.4]

#Format file name
remove_right = '_0.json'
remove_left = 'metrics_sim_'
# removesuffix/removeprefix (pandas >= 1.4) drop the exact substring;
# rstrip/lstrip strip a set of characters and would turn '..._2020' into '..._202'
syn_wt['file_name'] = syn_wt['file_name'].str.removesuffix(remove_right)
syn_wt['file_name'] = syn_wt['file_name'].str.removeprefix(remove_left)

In [25]:
syn_wt = syn_wt[['file_name', 'wss_0.95', 'tnr_0.95', 'rank_wss_0.95','rank_tnr_0.95', 'rank_difference', 'tnr_wss_difference','class_imbalance']]


Unnamed: 0,file_name,wss_0.95,tnr_0.95,rank_wss_0.95,rank_tnr_0.95,rank_difference,tnr_wss_difference,class_imbalance
0,Appenzeller-Herzog_2019,0.794845,0.901616,7,6,-1,0.106771,0.9
1,Bos_2018,0.815217,0.983357,5,3,-2,0.16814,0.2
2,Chou_2003,0.449108,0.559725,18,18,0,0.110617,0.8
3,Chou_2004,0.208845,0.397531,22,22,0,0.188686,0.6
4,Donners_2021,0.6875,0.834711,12,11,-1,0.147211,5.8
5,Hall_2012,0.910818,0.984692,1,2,1,0.073874,1.2
6,Jeyaraman_2020,0.543052,0.648423,17,17,0,0.105371,8.2
7,Leenaars_2019,0.894664,0.990853,2,1,-1,0.096189,0.3
8,Leenaars_2020,0.576795,0.679885,16,16,0,0.10309,8.1
9,Meijboom_2021,0.688636,0.787915,11,12,1,0.099279,4.2
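
A note on cleaning the file names: `str.rstrip` and `str.lstrip` treat their argument as a set of characters rather than as a literal suffix or prefix, so any trailing `0` in a year such as `2020` is also removed. The plain-Python sketch below illustrates the difference on one file name from the metrics table; `str.removesuffix` requires Python 3.9 or later, and the pandas `.str` accessor offers the same method from pandas 1.4 onward.

```python
name = "Jeyaraman_2020_0.json"
suffix = "_0.json"

# rstrip removes every trailing character that occurs in the set {_, 0, ., j, s, o, n},
# so it keeps stripping past the suffix and eats the final "0" of the year.
stripped_by_charset = name.rstrip(suffix)

# removesuffix removes the literal suffix once, and only if it is present.
stripped_by_suffix = name.removesuffix(suffix)

print(stripped_by_charset)
print(stripped_by_suffix)
```

The same caveat applies to `lstrip` with a prefix such as `metrics_sim_`: any leading characters of the dataset name that happen to be in that character set would be stripped as well.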


### Normality tests (significance level 0.05)

In [32]:
print(shapiro(syn_wt['wss_0.95']))
print(shapiro(syn_wt['tnr_0.95']))
print(shapiro(syn_wt['tnr_wss_difference']))
print(shapiro(syn_wt['class_imbalance']))

ShapiroResult(statistic=0.9047209024429321, pvalue=0.02713652327656746)
ShapiroResult(statistic=0.9089611768722534, pvalue=0.03348847106099129)
ShapiroResult(statistic=0.8883335590362549, pvalue=0.012280217371881008)
ShapiroResult(statistic=0.7565858960151672, pvalue=6.2245708249975e-05)


The results are significant for all variables (p < 0.05), so the normality assumption is not met. Hence, Spearman correlations are computed.
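
The decision rule can be written out explicitly. The sketch below copies the Shapiro-Wilk p-values from the output above and compares each with the 0.05 significance level; the variable names `shapiro_p`, `normality_rejected`, and `use_spearman` are illustrative only.

```python
alpha = 0.05

# p-values copied from the Shapiro-Wilk output above
shapiro_p = {
    "wss_0.95": 0.02713652327656746,
    "tnr_0.95": 0.03348847106099129,
    "tnr_wss_difference": 0.012280217371881008,
    "class_imbalance": 6.2245708249975e-05,
}

# Normality is rejected for a variable when its p-value falls below alpha.
normality_rejected = {name: p < alpha for name, p in shapiro_p.items()}

# A rank-based correlation is preferred as soon as any variable departs from normality.
use_spearman = any(normality_rejected.values())

for name, rejected in normality_rejected.items():
    print(f"{name}: normality {'rejected' if rejected else 'not rejected'}")
print("Use Spearman correlation:", use_spearman)
```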

## Wilcoxon signed-rank test

In [37]:
# Paired observations: each dataset's rank under WSS95% (x) and under TNR95% (y)
x = syn_wt['rank_wss_0.95'].values
y = syn_wt['rank_tnr_0.95'].values

# Perform the Wilcoxon signed-rank test
statistic, p_value = wilcoxon(x, y, zero_method='wilcox')

# Print the test statistic and p-value
print("Test Statistic:", statistic)
print("P-value:", p_value)

Test Statistic: 68.0
P-value: 1.0


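The result is not significant (p = 1.0): there is no evidence that the rank orders under WSS95% and TNR95% differ systematically.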
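
To make the test statistic concrete, the sketch below computes it by hand for the ten datasets displayed in the table above, so the value differs from the 24-dataset result. It assumes the two-sided statistic is the smaller of the positive and negative rank sums, with zero differences dropped as in `zero_method='wilcox'`; the helper `signed_rank_statistic` is hypothetical and not part of SciPy.

```python
def signed_rank_statistic(x, y):
    """Return (statistic, W+, W-) for paired samples x and y."""
    # Drop pairs with zero difference (the 'wilcox' convention).
    diffs = [a - b for a, b in zip(x, y) if a != b]

    # Rank the absolute differences, giving tied values their average rank.
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1

    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return min(w_plus, w_minus), w_plus, w_minus


# Ranks of the ten datasets shown in the table above.
rank_wss = [7, 5, 18, 22, 12, 1, 17, 2, 16, 11]
rank_tnr = [6, 3, 18, 22, 11, 2, 17, 1, 16, 12]

statistic, w_plus, w_minus = signed_rank_statistic(rank_wss, rank_tnr)
print("W+ =", w_plus, "W- =", w_minus, "statistic =", statistic)
```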

## Spearman Correlation: Class imbalance & WSS95%

In [33]:
# Extract the values from the columns
x = syn_wt['wss_0.95'].values
y = syn_wt['class_imbalance'].values

# Calculate the Spearman correlation coefficient and p-value
correlation, p_value = spearmanr(x, y)

# Print the correlation coefficient and p-value
print("Spearman correlation coefficient:", correlation)
print("p-value:", p_value)

Spearman correlation coefficient: -0.6197446438307457
p-value: 0.0012382368885044117
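
As a reminder of what the coefficient measures, Spearman's rho is the Pearson correlation of the rank-transformed values. The dependency-free sketch below applies this to the ten datasets displayed in the table above, so its result will not match the 24-dataset coefficient; the helper names `average_ranks`, `pearson`, and `spearman` are illustrative.

```python
def average_ranks(values):
    """Rank values from 1..n, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def pearson(a, b):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    return cov / (var_a * var_b) ** 0.5


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# WSS95% and class imbalance for the ten datasets shown in the table above.
wss = [0.794845, 0.815217, 0.449108, 0.208845, 0.6875,
       0.910818, 0.543052, 0.894664, 0.576795, 0.688636]
imbalance = [0.9, 0.2, 0.8, 0.6, 5.8, 1.2, 8.2, 0.3, 8.1, 4.2]

print("rho (ten displayed rows only):", spearman(wss, imbalance))
```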


## Spearman Correlation: Class imbalance & TNR95%

In [34]:
# Extract the values from the columns
x = syn_wt['tnr_0.95'].values
y = syn_wt['class_imbalance'].values

# Calculate the Spearman correlation coefficient and p-value
correlation, p_value = spearmanr(x, y)

# Print the correlation coefficient and p-value
print("Spearman correlation coefficient:", correlation)
print("p-value:", p_value)

Spearman correlation coefficient: -0.6114639488709819
p-value: 0.0014999459545429728


## Spearman Correlation: Class imbalance & difference (TNR95%-WSS95%)

In [35]:
# Extract the values from the columns
x = syn_wt['tnr_wss_difference'].values
y = syn_wt['class_imbalance'].values

# Calculate the Spearman correlation coefficient and p-value
correlation, p_value = spearmanr(x, y)

# Print the correlation coefficient and p-value
print("Spearman correlation coefficient:", correlation)
print("p-value:", p_value)

Spearman correlation coefficient: 0.1573332042355128
p-value: 0.46281639212584
