__This notebook compares the results of the multilabel and multihead models after training each on 5 folds of data. We compare their F1-scores on the validation set of every fold. The model structures and hyperparameters were similar. In the first case each fold was trained for 15 epochs, in the second one for 10 epochs.__

In [1]:
import numpy as np
import pandas as pd

from scipy import stats

ORG

In [2]:
org_results = pd.DataFrame([[0.38590785907859076, 0.561263387184607], 
              [0.3990470518165575, 0.2720606826801517], 
              [0.3997897687456202, 0.5333068204414396], 
              [0.40506735086593976, 0.6096403978576893], 
               [0.39066655228617997, 0.5914396887159532]], columns=['multilabel', 'multihead'])
org_results

Unnamed: 0,multilabel,multihead
0,0.385908,0.561263
1,0.399047,0.272061
2,0.39979,0.533307
3,0.405067,0.60964
4,0.390667,0.59144


In [3]:
org_results.describe()

Unnamed: 0,multilabel,multihead
count,5.0,5.0
mean,0.396096,0.513542
std,0.007683,0.138098
min,0.385908,0.272061
25%,0.390667,0.533307
50%,0.399047,0.561263
75%,0.39979,0.59144
max,0.405067,0.60964


Hypotheses:

$$H_0: \mu_1 = \mu_2, \: H_1: \mu_1 < \mu_2 $$

Paired two-sample t-test for comparing the results (the statistic is built from the per-fold differences $D_i$):

$$T(X_1, X_2) = \frac{\overline{X}_1-\overline{X}_2}{S/\sqrt{n}}$$

$$ S = \sqrt{\frac{1}{n-1}\sum\limits_{i=1}^n(D_i-\overline{D})^2} $$

$$ D_i = X_{1i} - X_{2i},\: \overline{D} = \frac{1}{n}\sum_iD_i $$
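
To connect the statistic to the p-values computed in the cells that follow: assuming the per-fold differences $D_i$ are approximately normal, under $H_0$ the statistic follows Student's t-distribution with $n-1$ degrees of freedom. For the one-sided alternative $H_1: \mu_1 < \mu_2$ the p-value is the left tail of that distribution, where $F_{t(n-1)}$ denotes its CDF and $n = 5$ is the number of folds:

$$ T \sim t(n-1), \qquad p = F_{t(n-1)}(T) $$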

In [4]:
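def t_stat(df):
    """Paired t-statistic for the 'multilabel' vs 'multihead' columns."""
    n = df.shape[0]
    # Per-fold differences and their mean
    D = df['multilabel'] - df['multihead']
    D_mean = D.mean()
    # Sample standard deviation of the differences (n - 1 in the denominator)
    S = np.sqrt(np.sum((D - D_mean)**2) / (n - 1))
    T_stat = (df['multilabel'].mean() - df['multihead'].mean()) / (S / np.sqrt(n))
    return T_stat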

In [5]:
p_value = stats.distributions.t(5-1).cdf(t_stat(org_results))

In [6]:
p_value

0.06649185548449224
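
As a cross-check, the same one-sided paired test can be run with `scipy.stats.ttest_rel`. This is a minimal sketch, assuming SciPy 1.6 or newer (needed for the `alternative` argument); the arrays re-enter the ORG scores from the table above, and the names `multilabel`, `multihead`, `t_manual`, `p_manual` and `res` are introduced here only for illustration.

```python
import numpy as np
from scipy import stats

# ORG F1-scores per fold, copied from the table above
multilabel = np.array([0.38590785907859076, 0.3990470518165575,
                       0.3997897687456202, 0.40506735086593976,
                       0.39066655228617997])
multihead = np.array([0.561263387184607, 0.2720606826801517,
                      0.5333068204414396, 0.6096403978576893,
                      0.5914396887159532])

# Manual paired statistic: mean difference over its standard error
d = multilabel - multihead
n = d.size
t_manual = d.mean() / (d.std(ddof=1) / np.sqrt(n))
p_manual = stats.t(n - 1).cdf(t_manual)  # left tail for H1: mu_1 < mu_2

# Library version of the same test (assumes SciPy >= 1.6)
res = stats.ttest_rel(multilabel, multihead, alternative='less')

print(t_manual, p_manual)
print(res.statistic, res.pvalue)
```

Both lines should print the same statistic and p-value, which is a quick way to confirm that the hand-written `t_stat` implements a paired test.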

LOC

In [7]:
loc_results = pd.DataFrame([[0.4448608137044968, 0.9191919191919192], 
              [0.42109683379524643, 0.8053146336962539], 
              [0.43189483698002246, 0.8173060080574531], 
              [0.40022721739836076, 0.8514367230487059], 
               [0.4083086053412463, 0.7913279132791328]], columns=['multilabel', 'multihead'])
loc_results

Unnamed: 0,multilabel,multihead
0,0.444861,0.919192
1,0.421097,0.805315
2,0.431895,0.817306
3,0.400227,0.851437
4,0.408309,0.791328


In [8]:
loc_results.describe()

Unnamed: 0,multilabel,multihead
count,5.0,5.0
mean,0.421278,0.836915
std,0.01789,0.051092
min,0.400227,0.791328
25%,0.408309,0.805315
50%,0.421097,0.817306
75%,0.431895,0.851437
max,0.444861,0.919192


In [9]:
p_value = stats.distributions.t(5-1).cdf(t_stat(loc_results))

In [10]:
p_value

1.4586896771528625e-05

PER

In [11]:
per_results = pd.DataFrame([[0.3999264841021871, 0.3431126012852752], 
              [0.39782157676348556, 0.9160912324754134], 
              [0.3920551131788494, 0.3526273672979461], 
              [0.381796017878911, 0.32028289972215207], 
               [0.3820734152933996, 0.339974126778784]], columns=['multilabel', 'multihead'])
per_results

Unnamed: 0,multilabel,multihead
0,0.399926,0.343113
1,0.397822,0.916091
2,0.392055,0.352627
3,0.381796,0.320283
4,0.382073,0.339974


In [12]:
per_results.describe()

Unnamed: 0,multilabel,multihead
count,5.0,5.0
mean,0.390735,0.454418
std,0.008535,0.258352
min,0.381796,0.320283
25%,0.382073,0.339974
50%,0.392055,0.343113
75%,0.397822,0.352627
max,0.399926,0.916091


In this case let's run a two-tailed test:

$$H_0: \mu_1 = \mu_2, \: H_1: \mu_1 \neq \mu_2 $$

because here the multihead results are worse on most folds (4 of 5).
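
For the two-sided alternative the p-value counts both tails of the same $t(n-1)$ distribution, with $F_{t(n-1)}$ its CDF and $T$ the paired statistic defined earlier:

$$ p = 2\left(1 - F_{t(n-1)}(|T|)\right) $$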

In [33]:
p_value = 2*(1-stats.distributions.t(5-1).cdf(abs(t_stat(per_results))))

In [34]:
p_value

0.6053697535708902

So we definitely cannot reject the null hypothesis of equal means.

Since we ran three tests (one per label), let's correct the p-values with Holm's method, which controls the familywise error rate (FWER), i.e. the probability of making at least one type I error across the whole group of tests.

In [15]:
from statsmodels.stats.multitest import multipletests

In [35]:
result = multipletests([0.06649185548449224, 1.4586896771528625e-05, 0.6053697535708902],
                          method='holm')

In [36]:
print("Labels: [ORG LOC PER]")
print(f"rejected null hypotheses:{result[0]}")
print(f"corrected p-values:{result[1]}")

Labels: [ORG LOC PER]
rejected null hypotheses:[False  True False]
corrected p-values:[1.32983711e-01 4.37606903e-05 6.05369754e-01]
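
The corrected p-values above can be reproduced by hand, which makes the step-down procedure explicit. The sketch below is an illustrative re-implementation, not the `statsmodels` source: `holm_adjust` is a hypothetical helper name, and the 0.05 significance level is assumed to match the `multipletests` default.

```python
import numpy as np

def holm_adjust(p_values):
    """Holm step-down adjusted p-values, returned in the input order."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)  # indices of p-values from smallest to largest
    # The k-th smallest p-value (k = 0..m-1) is multiplied by (m - k)
    scaled = p[order] * (m - np.arange(m))
    # Adjusted values must not decrease along the sorted order; cap at 1
    adjusted_sorted = np.minimum(np.maximum.accumulate(scaled), 1.0)
    adjusted = np.empty(m)
    adjusted[order] = adjusted_sorted  # undo the sorting
    return adjusted

# Raw p-values for ORG, LOC, PER as computed above
raw = [0.06649185548449224, 1.4586896771528625e-05, 0.6053697535708902]
adjusted = holm_adjust(raw)
rejected = adjusted <= 0.05  # assumed significance level

print(adjusted)
print(rejected)
```

With three tests the smallest p-value (LOC) is multiplied by 3, the next one (ORG) by 2, and the largest (PER) by 1.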


Formally, we cannot reject the null hypothesis of equal means in the ORG and PER cases. However, in the ORG case the multihead model scores higher on most folds, and in the PER case its results are comparable to those of the multilabel model.