In [2]:
from scipy.stats import ks_2samp
import pandas as pd
from scipy import stats
import numpy as np
from scipy.stats import shapiro

### For the results of a two-sample t-test to be valid, the following assumptions should be met:
- The observations in one sample should be independent of the observations in the other sample. **Given**
- The data should be approximately normally distributed. **Will be tested**
- The two samples should have approximately the same variance. If this assumption is not met, you should instead perform Welch’s t-test. **Will be tested**
- The data in both samples was obtained using a random sampling method. **Given**

## Classification Results CNN vs ViT

H0: The mean performance score of the CNN model is equal to the mean performance score of the ViT model.
<br>H1: The mean performance score of the CNN model is not equal to the mean performance score of the ViT model.


In [2]:
results_cnn = [64.0625, 69.375, 55.59322033898305, 60.15625, 53.125]
results_vit = [37.8125, 31.864406779661017, 48.125, 50.390625, 36.328125]

### Test for normality
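If the p-value of the Shapiro-Wilk test is larger than 0.05, we do not reject the null hypothesis of normality and treat the data as approximately normal. With only five observations per sample the test has little power, so this is a weak check.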

In [3]:
print(shapiro(results_cnn))
print(shapiro(results_vit))

ShapiroResult(statistic=0.9702812433242798, pvalue=0.8770557045936584)
ShapiroResult(statistic=0.9036481380462646, pvalue=0.43037623167037964)
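
The decision rule above can be wrapped in a small helper so that it is applied the same way to every sample. This is a minimal sketch: the function name `looks_normal` and the `alpha` parameter are illustrative and not part of the original notebook.

```python
from scipy.stats import shapiro

results_cnn = [64.0625, 69.375, 55.59322033898305, 60.15625, 53.125]
results_vit = [37.8125, 31.864406779661017, 48.125, 50.390625, 36.328125]


def looks_normal(sample, alpha=0.05):
    """Return (p_value, passed) for a Shapiro-Wilk test at level alpha.

    passed is True when normality is NOT rejected (p_value > alpha).
    """
    result = shapiro(sample)
    return float(result.pvalue), bool(result.pvalue > alpha)


for name, sample in [("CNN", results_cnn), ("ViT", results_vit)]:
    p_value, passed = looks_normal(sample)
    print(f"{name}: p = {p_value:.3f}, normality not rejected: {passed}")
```

Both samples pass this check, in line with the printed p-values above.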


### Test for same variance

In [4]:
print(np.var(results_cnn))
print(np.var(results_vit))

34.00704718471703
50.86982893183533


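The two variances differ (about 34.0 vs. 50.9; note that `np.var` returns the population variance, `ddof=0`, by default). Equal variances are therefore not assumed, and Welch's t-test is performed.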
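
The comparison above is done by eye. As a possible complement, the sketch below computes the sample variances with `ddof=1` and runs Levene's test via `scipy.stats.levene`. Levene's test is not used in the original notebook and is added here only as an assumption about how the check could be formalised. With five observations per group it has little power, so defaulting to Welch's t-test remains reasonable whatever its outcome.

```python
import numpy as np
from scipy.stats import levene

results_cnn = [64.0625, 69.375, 55.59322033898305, 60.15625, 53.125]
results_vit = [37.8125, 31.864406779661017, 48.125, 50.390625, 36.328125]

# Unbiased sample variances (divide by n - 1 instead of n).
var_cnn = np.var(results_cnn, ddof=1)
var_vit = np.var(results_vit, ddof=1)
print(f"sample variance CNN: {var_cnn:.3f}, ViT: {var_vit:.3f}")

# Levene's test: H0 is that both groups have equal variance.
levene_result = levene(results_cnn, results_vit, center="median")
print(levene_result)
```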

In [None]:
stats.ttest_ind(results_cnn, results_vit, equal_var = False)

TtestResult(statistic=4.245859096615145, pvalue=0.003073694853434769, df=7.696221372788065)

Typically, a significance level (e.g., α = 0.05) is chosen before running the test. A p-value below this threshold indicates a statistically significant difference in mean performance between the two models.

Since the p-value (≈ 0.003) is below 0.05, we reject H0: the mean performance scores of the CNN and ViT models are not equal. The positive test statistic shows that the CNN scores higher on average.
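
To make explicit what `stats.ttest_ind(..., equal_var=False)` computes, the sketch below recomputes Welch's t statistic and the Welch-Satterthwaite degrees of freedom by hand. The helper name `welch_t` is illustrative and not part of the original notebook.

```python
import numpy as np
from scipy import stats

results_cnn = [64.0625, 69.375, 55.59322033898305, 60.15625, 53.125]
results_vit = [37.8125, 31.864406779661017, 48.125, 50.390625, 36.328125]


def welch_t(a, b):
    """Return (t, df, two-sided p) for Welch's unequal-variance t-test."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Variance of each sample mean, based on the unbiased variance (ddof=1).
    va = a.var(ddof=1) / a.size
    vb = b.var(ddof=1) / b.size
    t = (a.mean() - b.mean()) / np.sqrt(va + vb)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (va + vb) ** 2 / (va**2 / (a.size - 1) + vb**2 / (b.size - 1))
    p = 2 * stats.t.sf(abs(t), df)
    return t, df, p


t_stat, df, p_value = welch_t(results_cnn, results_vit)
print(t_stat, df, p_value)
```

The values agree with the `TtestResult` printed above (t ≈ 4.246, df ≈ 7.70, p ≈ 0.0031).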

## Classification Results CNN vs AST

H0: The mean performance score of the CNN model is equal to the mean performance score of the AST model.
<br>H1: The mean performance score of the CNN model is not equal to the mean performance score of the AST model.


In [3]:
results_cnn = [64.0625, 69.375, 55.59322033898305, 60.15625, 53.125]
# NOTE: these values look like fractions in [0, 1], while results_cnn is in percent.
results_ast = [0.8680555555555556, 0.9357142857142857, 0.9142857142857143, 0.8785714285714286, 0.9214285714285714]

### Test for normality
If the p-value of the Shapiro-Wilk test is larger than 0.05, we do not reject the null hypothesis of normality and treat the data as approximately normal.

In [4]:
print(shapiro(results_cnn))
print(shapiro(results_ast))

ShapiroResult(statistic=0.9702812433242798, pvalue=0.8770557045936584)
ShapiroResult(statistic=0.9124070405960083, pvalue=0.48218968510627747)


### Test for same variance

In [5]:
print(np.var(results_cnn))
print(np.var(results_ast))
# If the variances are not similar, continue with Welch's t-test as above.

34.00704718471703
0.0006706412194507425


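The two variances differ by several orders of magnitude (about 34.0 vs. 0.0007), so equal variances are not assumed and Welch's t-test is performed. Much of this gap likely reflects a difference in scale: `results_cnn` is in percent, while `results_ast` appears to be on a 0-1 scale.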

In [6]:
stats.ttest_ind(results_cnn, results_ast, equal_var = False)

TtestResult(statistic=20.426175832638407, pvalue=3.39126914658412e-05, df=4.000157765233905)

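Typically, a significance level (e.g., α = 0.05) is chosen before running the test. A p-value below this threshold indicates a statistically significant difference in mean performance between the two models.

Since the p-value (≈ 3.4e-05) is below 0.05, we reject H0: the mean performance scores are not equal. However, the two samples appear to be on different scales (percent vs. fraction), so the sign and size of the test statistic should not be interpreted until both are expressed in the same units.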
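
The CNN scores range from about 53 to 69, whereas the AST scores lie between 0.87 and 0.94, which suggests that the AST values are fractions rather than percentages. Assuming that is the case (the notebook does not confirm it), a minimal sketch of the comparison on a common percent scale could look like this. The name `results_ast_pct` is illustrative.

```python
from scipy import stats

results_cnn = [64.0625, 69.375, 55.59322033898305, 60.15625, 53.125]
results_ast = [0.8680555555555556, 0.9357142857142857, 0.9142857142857143,
               0.8785714285714286, 0.9214285714285714]

# Assumption: results_ast holds fractions in [0, 1]; convert to percent.
results_ast_pct = [100 * score for score in results_ast]

# Welch's t-test on samples that now share the same unit.
rescaled = stats.ttest_ind(results_cnn, results_ast_pct, equal_var=False)
print(rescaled)
```

On a common scale the sign of the statistic flips: the AST mean is higher than the CNN mean, so the statistic is negative.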