# Test for statistical significance
Two baseline generation methods are compared:
1. NLPrompt: the opinion concepts are described in natural language.
2. OntoPrompt: a serialized description of the ontology is added to the prompt.

In [2]:
import scipy.stats as stats
import numpy as np

## Shapiro-Wilk Test

### Statistical test of whether the samples follow a normal distribution
The inputs are the results from Table 5: F1 scores averaged across models, together with their standard deviations.

The normality test is followed by:
**Welch's t-test (Independent t-test)**
and
**Mann-Whitney U Test**


In [8]:

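# Data for Method A (NLPrompt): mean F1 scores and their standard deviations
# (the standard deviations are recorded here but not used by the tests below)
means_A = np.array([55.34, 54.91, 55.81, 55.48, 54.65, 53.41])
stdevs_A = np.array([5.29, 4.01, 5.69, 6.4, 4.37, 7.38])

# Data for Method B (OntoPrompt): mean F1 scores and their standard deviations
means_B = np.array([56.65, 55.88, 55.92, 55.0, 55.71, 55.08, 56.15])
stdevs_B = np.array([3.31, 3.32, 3.52, 3.5, 3.6, 3.38, 4.14])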

# Perform normality tests
shapiro_A = stats.shapiro(means_A)
shapiro_B = stats.shapiro(means_B)

# Perform an independent t-test (Welch's t-test)
t_test = stats.ttest_ind(means_A, means_B, equal_var=False)

# Perform a Mann-Whitney U test (non-parametric alternative)
mann_whitney = stats.mannwhitneyu(means_A, means_B, alternative='two-sided')

# Output results
print("Shapiro-Wilk Normality Test (Method A, i.e. NLPrompt):", shapiro_A)
print("Shapiro-Wilk Normality Test (Method B, i.e. OntoPrompt):", shapiro_B)
print("Independent t-test (Welch's t-test):", t_test)
print("Mann-Whitney U Test:", mann_whitney)

Shapiro-Wilk Normality Test (Method A, i.e. NLPrompt): ShapiroResult(statistic=0.9055399894714355, pvalue=0.40771064162254333)
Shapiro-Wilk Normality Test (Method B, i.e. OntoPrompt): ShapiroResult(statistic=0.9426997303962708, pvalue=0.6631444692611694)
Independent t-test (Welch's t-test): Ttest_indResult(statistic=-2.0329499813774965, pvalue=0.07391524561488866)
Mann-Whitney U Test: MannwhitneyuResult(statistic=7.0, pvalue=0.05128205128205128)
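
To make the two reported statistics less opaque, the following is a minimal sketch that recomputes them by hand from the same mean F1 scores. The helper names `welch_t` and `mann_whitney_u` are hypothetical and introduced only for illustration; the sketch assumes the standard Welch-Satterthwaite degrees of freedom and the pairwise-count definition of U, and it is not part of the original analysis.

```python
import numpy as np
from scipy import stats

# Same mean F1 scores as in the cell above
means_A = np.array([55.34, 54.91, 55.81, 55.48, 54.65, 53.41])
means_B = np.array([56.65, 55.88, 55.92, 55.0, 55.71, 55.08, 56.15])


def welch_t(a, b):
    """Welch's t statistic, Welch-Satterthwaite df, and two-sided p-value."""
    va = a.var(ddof=1) / a.size  # squared standard error of mean(a)
    vb = b.var(ddof=1) / b.size  # squared standard error of mean(b)
    t = (a.mean() - b.mean()) / np.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (a.size - 1) + vb ** 2 / (b.size - 1))
    p = 2 * stats.t.sf(abs(t), df)
    return t, df, p


def mann_whitney_u(a, b):
    """U statistic of the first sample: pairs with a > b, ties counted as 0.5."""
    greater = np.sum(a[:, None] > b[None, :])
    ties = np.sum(a[:, None] == b[None, :])
    return float(greater + 0.5 * ties)


t_stat, dof, p_value = welch_t(means_A, means_B)
u_stat = mann_whitney_u(means_A, means_B)

print(f"Welch t = {t_stat:.4f}, df = {dof:.2f}, p = {p_value:.4f}")
print(f"Mann-Whitney U (Method A) = {u_stat}")
```

The hand-computed values should agree with the `scipy.stats` output above (t of about -2.033 and U = 7.0). The non-integer degrees of freedom are what distinguish Welch's test from the equal-variance t-test.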


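## Interpreting the statistical significance tests
The Shapiro-Wilk p-values (about 0.41 and 0.66) give no evidence against normality for either method, so Welch's t-test is applicable. Neither Welch's t-test (p ≈ 0.074) nor the Mann-Whitney U test (p ≈ 0.051) indicates a statistically significant difference between Method A and Method B at the 95% confidence level (significance level 0.05).
However, both p-values are below 0.10, so the difference would count as significant at a 90% confidence threshold. This suggests a **potential trend towards significance**.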
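
The threshold comparison can be sketched in a few lines. The p-values are copied from the printed output; the dictionary names `p_values` and `decisions` are hypothetical and used only for illustration.

```python
# p-values copied from the printed test results
p_values = {
    "Welch's t-test": 0.07391524561488866,
    "Mann-Whitney U": 0.05128205128205128,
}

# Significance levels corresponding to 95% and 90% confidence
alphas = (0.05, 0.10)

# A result is called significant when its p-value is below the chosen level
decisions = {
    name: {alpha: p < alpha for alpha in alphas}
    for name, p in p_values.items()
}

for name, by_alpha in decisions.items():
    for alpha, significant in by_alpha.items():
        verdict = "significant" if significant else "not significant"
        print(f"{name} at alpha={alpha:.2f}: {verdict}")
```

Both tests come out as not significant at 0.05 and significant at 0.10, which is the borderline pattern described above.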
