# Empirical Investigation: Statistical Test

In [1]:
import pandas as pd
from statsmodels.stats.weightstats import ttest_ind
from scipy import stats

In [6]:
data = {
    'ff_train_acc': [0.947119951248169, 0.9496999979019165, 0.9481199979782104, 0.9467799663543701, 0.9473399519920349, 
                     0.9502599835395813, 0.9484399557113647, 0.9496999979019165, 0.9482399821281433, 0.9493399858474731, 
                     0.9495399594306946],
    'ff_test_acc': [0.946399986743927, 0.9513999819755554, 0.9474999904632568, 0.9440999627113342, 0.948199987411499, 
                    0.9488999843597412, 0.9476999640464783, 0.9490000009536743, 0.9472999572753906, 0.946399986743927, 
                    0.9472000002861023],
    'fb_train_acc': [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0],
    'fb_test_acc': [0.9697999954223633, 0.9722999930381775, 0.972599983215332, 0.9722999930381775, 0.9738999605178833, 
                    0.972000002861023, 0.9734999537467957, 0.9710999727249146, 0.9734999537467957, 0.9705999493598938, 
                    0.9715999960899353]
}

df = pd.DataFrame(data)

In [7]:
# Two-sample t-test; statsmodels returns (t-statistic, p-value, degrees of freedom)
p = ttest_ind(df['ff_test_acc'], df['fb_test_acc'])
print("p-value:", p)

p-value: (-36.24328609914901, 1.0211810892963202e-19, 20.0)


The result of the t-test is a tuple of three values:

The first value (-36.24328609914901) is the t-statistic, which measures the difference between the means of two groups (in this case, df['ff_test_acc'] and df['fb_test_acc']) in units of standard error.

The second value (1.0211810892963202e-19) is the p-value: the probability of observing a t-statistic at least as extreme as this one, assuming the null hypothesis (the means of the two groups are equal) is true. Here the p-value is very close to 0, which is strong evidence against the null hypothesis.

The third value (20.0) is the degrees of freedom. For this two-sample test with pooled variance it equals the total number of observations minus the two estimated group means: 11 + 11 - 2 = 20.
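
The t-statistic and degrees of freedom can be reproduced by hand, which makes the definitions above concrete. The following is a minimal sketch, assuming the pooled-variance (equal-variance) form of the test, which is the default in both `statsmodels` and `scipy`. The helper name `pooled_t_statistic` is hypothetical; the data are the test accuracies from the cell above.

```python
import numpy as np
from scipy import stats

# Test accuracies copied from the DataFrame above.
ff_test_acc = [0.946399986743927, 0.9513999819755554, 0.9474999904632568,
               0.9440999627113342, 0.948199987411499, 0.9488999843597412,
               0.9476999640464783, 0.9490000009536743, 0.9472999572753906,
               0.946399986743927, 0.9472000002861023]
fb_test_acc = [0.9697999954223633, 0.9722999930381775, 0.972599983215332,
               0.9722999930381775, 0.9738999605178833, 0.972000002861023,
               0.9734999537467957, 0.9710999727249146, 0.9734999537467957,
               0.9705999493598938, 0.9715999960899353]


def pooled_t_statistic(a, b):
    """Return (t-statistic, degrees of freedom) for a pooled-variance t-test."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n_a, n_b = len(a), len(b)
    dof = n_a + n_b - 2
    # Pooled variance: weighted average of the two sample variances.
    pooled_var = ((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)) / dof
    # Standard error of the difference between the two sample means.
    std_err = np.sqrt(pooled_var * (1.0 / n_a + 1.0 / n_b))
    return (a.mean() - b.mean()) / std_err, dof


t_manual, dof = pooled_t_statistic(ff_test_acc, fb_test_acc)
# Reference value from scipy for comparison (index 0 is the t-statistic).
t_reference = stats.ttest_ind(ff_test_acc, fb_test_acc)[0]
print(t_manual, t_reference, dof)
```

The manual value should agree with the library result up to floating-point rounding.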

Taken together, the small p-value and the large-magnitude t-statistic give strong evidence against the null hypothesis that df['ff_test_acc'] and df['fb_test_acc'] have equal means.

A p-value below a chosen significance level (conventionally 0.05) is treated as sufficient evidence to reject the null hypothesis; a p-value above it means the data do not provide enough evidence to reject it. Here the p-value is 1.0211810892963202e-19, far below 0.05, so the difference in mean test accuracy between the two models is statistically significant.

The sign of the t-statistic gives the direction of the difference: it is negative because the mean of df['ff_test_acc'] (the first argument) is lower than the mean of df['fb_test_acc']. On this data the fb model therefore reaches a higher mean test accuracy than the ff model, roughly 0.972 versus 0.948. The test itself does not explain why; that requires further investigation.
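
To see which model scores higher and by how much, the group means can be compared directly. The sketch below also reruns the test without the equal-variance assumption (Welch's t-test, via `equal_var=False` in `scipy.stats.ttest_ind`). Welch's test is not used in the original analysis; it is included here only as an assumed robustness check.

```python
import numpy as np
from scipy import stats

# Test accuracies copied from the DataFrame above.
ff_test_acc = np.array([0.946399986743927, 0.9513999819755554, 0.9474999904632568,
                        0.9440999627113342, 0.948199987411499, 0.9488999843597412,
                        0.9476999640464783, 0.9490000009536743, 0.9472999572753906,
                        0.946399986743927, 0.9472000002861023])
fb_test_acc = np.array([0.9697999954223633, 0.9722999930381775, 0.972599983215332,
                        0.9722999930381775, 0.9738999605178833, 0.972000002861023,
                        0.9734999537467957, 0.9710999727249146, 0.9734999537467957,
                        0.9705999493598938, 0.9715999960899353])

mean_ff = ff_test_acc.mean()
mean_fb = fb_test_acc.mean()
mean_diff = mean_fb - mean_ff  # positive when fb is the more accurate model

# Welch's t-test: does not assume the two groups share a variance.
welch_t, welch_p = stats.ttest_ind(ff_test_acc, fb_test_acc, equal_var=False)

print(f"mean ff: {mean_ff:.4f}, mean fb: {mean_fb:.4f}, difference: {mean_diff:.4f}")
print(f"Welch t: {welch_t:.2f}, p: {welch_p:.3g}")
```

If the Welch result leads to the same conclusion as the pooled test, the finding does not hinge on the equal-variance assumption.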


In [4]:
# Load data into a pandas DataFrame
data = [[0.930679976940155,0.9305999875068665,1.0,0.9716999530792236],
        [0.9301999807357788,0.9286999702453613,1.0,0.9745999574661255],
        [0.9290199875831604,0.9273999929428101,1.0,0.97489994764328],
        [0.9310199618339539,0.9300999641418457,1.0,0.9732999801635742],
        [0.9288399815559387,0.9293999671936035,1.0,0.9726999998092651],
        [0.9312199950218201,0.9271999597549438,1.0,0.9731999635696411],
        [0.9293799996376038,0.9276999831199646,1.0,0.9714999794960022],
        [0.9312199950218201,0.9318999648094177,1.0,0.9740999937057495],
        [0.9301199913024902,0.9287999868392944,1.0,0.973800003528595],
        [0.9304199814796448,0.9320999979972839,1.0,0.9733999967575073],
        [0.9295199513435364,0.9276999831199646,1.0,0.9736999869346619],
        [0.9306399822235107,0.9294999837875366,1.0,0.9726999998092651],
        [0.9275199770927429,0.9299999475479126,1.0,0.9728999733924866],
        [0.9293199777603149,0.9296999573707581,1.0,0.9723999500274658],
        [0.9285999536514282,0.9283999800682068,1.0,0.9751999974250793]]
df = pd.DataFrame(data, columns=['ff_train_acc','ff_test_acc','fb_train_acc','fb_test_acc'])

In [5]:
# Two-sample t-test; statsmodels returns (t-statistic, p-value, degrees of freedom)
p = ttest_ind(df['ff_test_acc'], df['fb_test_acc'])
print("p-value:", p)


p-value: (-91.60628683450511, 3.028381352208943e-36, 28.0)


The first value, -91.60628683450511, is the t-statistic. It is the test statistic used to determine the p-value.

The second value, 3.028381352208943e-36, is the p-value: the probability of observing a t-statistic at least as extreme as the one calculated, assuming the null hypothesis is true. A p-value below 0.05 is conventionally considered statistically significant, and this value is far below that threshold.

The third value, 28.0, is the degrees of freedom of the t-distribution (15 + 15 - 2 = 28 for the two groups of 15 runs). It determines which t-distribution the p-value is computed from.

In this case, the p-value is very small, so the null hypothesis (the means of the two groups are equal) can be rejected in favor of the alternative hypothesis (the means of the two groups are different). In other words, the mean of ff_test_acc is statistically different from the mean of fb_test_acc.
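
The link between the three values can be checked directly: given the t-statistic and the degrees of freedom, the two-sided p-value is the probability mass in both tails of the t-distribution beyond |t|. The sketch below uses `scipy.stats.t.sf` (the survival function) and plugs in the numbers reported above. The variable names are illustrative.

```python
from scipy import stats

# Values reported by the t-test above.
t_statistic = -91.60628683450511
dof = 28  # 15 + 15 - 2

# Two-sided p-value: area in both tails beyond |t| under a t-distribution
# with `dof` degrees of freedom.
p_value = 2 * stats.t.sf(abs(t_statistic), dof)
print(p_value)
```

The result should match the second element of the tuple up to floating-point rounding.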


In [None]:
ff_test_acc = [0.9305999875068665, 0.9286999702453613, 0.9273999929428101, 0.9300999641418457, 0.9293999671936035, 0.9271999597549438, 0.9276999831199646, 0.9318999648094177, 0.9287999868392944, 0.9320999979972839, 0.9276999831199646, 0.9294999837875366, 0.9299999475479126, 0.9296999573707581, 0.9283999800682068]
fb_test_acc = [0.9716999530792236, 0.9745999574661255, 0.97489994764328, 0.9732999801635742, 0.9726999998092651, 0.9731999635696411, 0.9714999794960022, 0.9740999937057495, 0.973800003528595, 0.9733999967575073, 0.9736999869346619, 0.9726999998092651, 0.9728999733924866, 0.9723999500274658, 0.9751999974250793]
t_statistic, p_value = stats.ttest_ind(ff_test_acc, fb_test_acc)


In [None]:
print(t_statistic)
print(p_value)

-91.60628683450511
3.028381352208943e-36


The p-value of 3.028381352208943e-36 is extremely small. If the null hypothesis were true (that is, if ff_test_acc and fb_test_acc had equal means), a difference in sample means at least as large as the one observed would be extremely unlikely.
In other words, random variation between runs alone is a very implausible explanation for the observed difference.
We can therefore reject the null hypothesis: this is strong evidence that ff_test_acc and fb_test_acc have different means.
The values match the statsmodels result above because both functions run the same pooled-variance t-test on the same data.
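
A p-value says that the means differ but not by how much. One way to quantify the gap is a 95% confidence interval for the difference in means. The sketch below assumes the same pooled-variance model as the t-test; the 95% level is an assumed, conventional choice and is not part of the original analysis.

```python
import numpy as np
from scipy import stats

ff_test_acc = np.array([0.9305999875068665, 0.9286999702453613, 0.9273999929428101,
                        0.9300999641418457, 0.9293999671936035, 0.9271999597549438,
                        0.9276999831199646, 0.9318999648094177, 0.9287999868392944,
                        0.9320999979972839, 0.9276999831199646, 0.9294999837875366,
                        0.9299999475479126, 0.9296999573707581, 0.9283999800682068])
fb_test_acc = np.array([0.9716999530792236, 0.9745999574661255, 0.97489994764328,
                        0.9732999801635742, 0.9726999998092651, 0.9731999635696411,
                        0.9714999794960022, 0.9740999937057495, 0.973800003528595,
                        0.9733999967575073, 0.9736999869346619, 0.9726999998092651,
                        0.9728999733924866, 0.9723999500274658, 0.9751999974250793])

n_ff, n_fb = len(ff_test_acc), len(fb_test_acc)
dof = n_ff + n_fb - 2
mean_diff = fb_test_acc.mean() - ff_test_acc.mean()

# Pooled variance and standard error of the difference in means.
pooled_var = ((n_ff - 1) * ff_test_acc.var(ddof=1)
              + (n_fb - 1) * fb_test_acc.var(ddof=1)) / dof
std_err = np.sqrt(pooled_var * (1.0 / n_ff + 1.0 / n_fb))

# Critical value for a two-sided 95% interval.
t_crit = stats.t.ppf(0.975, dof)
ci_low = mean_diff - t_crit * std_err
ci_high = mean_diff + t_crit * std_err

print(f"difference in means (fb - ff): {mean_diff:.4f}")
print(f"95% CI: [{ci_low:.4f}, {ci_high:.4f}]")
```

An interval that lies entirely above zero is consistent with rejecting the null hypothesis at the 5% level.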