__This notebook compares the multilabel and multihead models after training each on 5 folds of data. We compare their F1-scores on the validation set of every fold. The model architectures and hyperparameters were selected by grid search and are quite similar.__

In [1]:
import numpy as np
import pandas as pd

from scipy import stats

ORG

In [2]:
org_results = pd.DataFrame([[0.410358911966629, 0.594737713056429], 
              [0.41146540027137035, 0.651019622931897], 
              [0.4056111538790522, 0.44372713578652895], 
              [0.4260671968836227, 0.6835066864784548], 
               [0.4325370121130552, 0.6584513518484458]], columns=['multilabel', 'multihead'])
org_results

fold,multilabel,multihead
0,0.410359,0.594738
1,0.411465,0.65102
2,0.405611,0.443727
3,0.426067,0.683507
4,0.432537,0.658451


In [3]:
org_results.describe()

statistic,multilabel,multihead
count,5.0,5.0
mean,0.417208,0.606289
std,0.011487,0.096494
min,0.405611,0.443727
25%,0.410359,0.594738
50%,0.411465,0.65102
75%,0.426067,0.658451
max,0.432537,0.683507


Hypotheses:

$$H_0: \mu_1 = \mu_2, \: H_1: \mu_1 < \mu_2 $$

Paired t-test for comparing the results (both models are evaluated on the same folds, so the samples are paired):

$$T(X_1, X_2) = \frac{\overline{X}_1-\overline{X}_2}{S/\sqrt{n}}$$

$$ S = \sqrt{\frac{1}{n-1}\sum\limits_{i=1}^n(D_i-\overline{D})^2} $$

$$ D_i = X_{1i} - X_{2i},\: \overline{D} = \frac{1}{n}\sum_iD_i $$
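
To connect the statistic to the p-values computed in the cells that follow: assuming the per-fold differences $D_i$ are independent and approximately normal (an assumption this notebook does not check), the statistic follows Student's t-distribution with $n-1$ degrees of freedom under $H_0$, so the one-sided p-value for $H_1: \mu_1 < \mu_2$ is the left tail of that distribution:

$$ \overline{X}_1-\overline{X}_2 = \overline{D}, \quad T \sim t(n-1) \text{ under } H_0, \quad p = P(t_{n-1} \le T) = F_{t(n-1)}(T) $$

With $n = 5$ folds this gives $n - 1 = 4$ degrees of freedom.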

In [4]:
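def t_stat(df):
    """Paired t-statistic for the per-fold differences multilabel - multihead."""
    n = df.shape[0]
    D = df['multilabel'] - df['multihead']
    D_mean = D.mean()
    # Sample standard deviation of the differences (n - 1 in the denominator)
    S = np.sqrt(np.sum((D - D_mean)**2) / (n - 1))
    # Difference of the means divided by the standard error of the mean difference
    T_stat = (df['multilabel'].mean() - df['multihead'].mean()) / (S / np.sqrt(n))
    return T_stat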

In [5]:
p_value = stats.distributions.t(5-1).cdf(t_stat(org_results))

In [6]:
p_value

0.0044111410254475195
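
As a sanity check, the same one-sided paired test can probably be reproduced with SciPy's built-in `scipy.stats.ttest_rel`. This is only a sketch: it assumes SciPy 1.6 or newer (where the `alternative` argument is available), and the arrays simply repeat the ORG scores from the table above.

```python
import numpy as np
from scipy import stats

# ORG F1-scores per fold, copied from org_results above
multilabel = np.array([0.410358911966629, 0.41146540027137035, 0.4056111538790522,
                       0.4260671968836227, 0.4325370121130552])
multihead = np.array([0.594737713056429, 0.651019622931897, 0.44372713578652895,
                      0.6835066864784548, 0.6584513518484458])

# Paired t-test with H1: mean(multilabel) < mean(multihead)
# (assumes SciPy >= 1.6 for the `alternative` keyword)
res = stats.ttest_rel(multilabel, multihead, alternative='less')
print(res.statistic, res.pvalue)
```

If the hand-written `t_stat` is correct, `res.pvalue` should agree with the p-value printed above up to floating-point error.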

LOC

In [7]:
loc_results = pd.DataFrame([[0.4362387290682696, 0.8915617365156926], 
              [0.43545956805625313, 0.8271012006861063], 
              [0.4487244466970772, 0.880085653104925], 
              [0.41058079355951693, 0.7750452079566005], 
               [0.4169403630077787, 0.8613957084863716]], columns=['multilabel', 'multihead'])
loc_results

fold,multilabel,multihead
0,0.436239,0.891562
1,0.43546,0.827101
2,0.448724,0.880086
3,0.410581,0.775045
4,0.41694,0.861396


In [8]:
loc_results.describe()

statistic,multilabel,multihead
count,5.0,5.0
mean,0.429589,0.847038
std,0.015541,0.047067
min,0.410581,0.775045
25%,0.41694,0.827101
50%,0.43546,0.861396
75%,0.436239,0.880086
max,0.448724,0.891562


In [9]:
p_value = stats.distributions.t(5-1).cdf(t_stat(loc_results))

In [10]:
p_value

8.303535724950009e-06

PER

In [11]:
per_results = pd.DataFrame([[0.397561622051418, 0.8792354474370113], 
              [0.400448005513914, 0.3316230083715906], 
              [0.3974851554313657, 0.37963131958386565], 
              [0.4090497737556561, 0.3154701718907988], 
               [0.4072156050405662, 0.7682743837084672]], columns=['multilabel', 'multihead'])
per_results

fold,multilabel,multihead
0,0.397562,0.879235
1,0.400448,0.331623
2,0.397485,0.379631
3,0.40905,0.31547
4,0.407216,0.768274


In [12]:
per_results.describe()

statistic,multilabel,multihead
count,5.0,5.0
mean,0.402352,0.534847
std,0.005449,0.26768
min,0.397485,0.31547
25%,0.397562,0.331623
50%,0.400448,0.379631
75%,0.407216,0.768274
max,0.40905,0.879235


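In this case let's use a two-tailed test:

$$H_0: \mu_1 = \mu_2, \: H_1: \mu_1 \neq \mu_2 $$

because the multihead results are noticeably worse here: on three of the five folds the multihead model scores below the multilabel model.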

In [13]:
p_value = 2*(1-stats.distributions.t(5-1).cdf(abs(t_stat(per_results))))

In [14]:
p_value

0.3317938085675962

So we definitely cannot reject the null hypothesis of equal means.
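
To see why the PER test is so far from significant, it may help to look at the per-fold differences directly. The sketch below repeats the PER scores from the table above and recomputes the two-sided p-value with the survival function `stats.t.sf` instead of `1 - cdf`; the variable names are illustrative and not part of the original notebook.

```python
import numpy as np
from scipy import stats

# PER F1-scores per fold, copied from per_results above
multilabel = np.array([0.397561622051418, 0.400448005513914, 0.3974851554313657,
                       0.4090497737556561, 0.4072156050405662])
multihead = np.array([0.8792354474370113, 0.3316230083715906, 0.37963131958386565,
                      0.3154701718907988, 0.7682743837084672])

diff = multilabel - multihead                   # D_i for every fold
n = diff.size
n_multihead_better = int((diff < 0).sum())      # folds where multihead wins

# Paired t-statistic and two-sided p-value with n - 1 degrees of freedom
t_value = diff.mean() / (diff.std(ddof=1) / np.sqrt(n))
p_two_sided = 2 * stats.t.sf(abs(t_value), df=n - 1)

print(diff.round(3), n_multihead_better, p_two_sided)
```

The differences change sign across folds and their spread is large relative to their mean, which is what drives the large p-value.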

Since we ran three tests (one per entity type), let's adjust the p-values with Holm's method, which controls the familywise error rate (FWER), i.e. the probability of making at least one type I error across the whole family of tests.

In [15]:
from statsmodels.stats.multitest import multipletests

In [16]:
result = multipletests([0.0044111410254475195, 8.303535724950009e-06, 0.3317938085675962],
                          method='holm')

In [17]:
print(f"Labels: [ORG LOC PER]")
print(f"rejected null hypotheses:{result[0]}")
print(f"corrected p-values:{result[1]}")

Labels: [ORG LOC PER]
rejected null hypotheses:[ True  True False]
corrected p-values:[8.82228205e-03 2.49106072e-05 3.31793809e-01]
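
The corrected p-values above can be reproduced by hand, which makes the step-down logic of Holm's method explicit. This is a minimal sketch: `holm_adjust` is a hypothetical helper written for illustration, not a statsmodels function, and the 0.05 threshold is the default `alpha` of `multipletests`.

```python
import numpy as np

def holm_adjust(p_values):
    """Holm step-down adjusted p-values (illustrative helper)."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)                        # smallest p-value first
    # The k-th smallest p-value (k = 0, 1, ...) is multiplied by (m - k)
    scaled = p[order] * (m - np.arange(m))
    # Enforce monotonicity and cap at 1
    adjusted_sorted = np.minimum(np.maximum.accumulate(scaled), 1.0)
    adjusted = np.empty(m)
    adjusted[order] = adjusted_sorted            # restore the original order
    return adjusted

# Raw p-values in the order [ORG, LOC, PER]
raw = [0.0044111410254475195, 8.303535724950009e-06, 0.3317938085675962]
adjusted = holm_adjust(raw)
rejected = adjusted <= 0.05                      # alpha = 0.05 is an assumption
print(adjusted, rejected)
```

With three tests, the smallest p-value (LOC) is multiplied by 3, the next one (ORG) by 2, and the largest (PER) by 1.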


Formally, after Holm's correction we reject the null hypothesis of equal means in the ORG and LOC cases, but cannot reject it in the PER case. In the ORG and LOC cases the multihead model is better on every fold; in the PER case its results are comparable to the multilabel model on average, but vary much more from fold to fold.