# Significance vs. Generalizability

The goal of this notebook is to show, with experiments on two (controlled) distributions of rankings, that significance and generalizability are distinct concepts. 
The intuition can be summarized in two ways. 
First, significance is a property of a single sample, while generalizability is a property of a distribution. 
Second, significance is a property of a sample, while generalizability concerns how well experimental results approximate the true distribution. 

Concretely, consider the following setup. 

We are comparing $n_a = 5$ alternatives on 20 experimental conditions. As detailed in the paper, Section 3.2, this is essentially equivalent to sampling from some distribution $\mathbb P$ with support $S_5$, the set of permutations of $5$ alternatives, i.e., rankings without ties. 
The research question we are trying to answer is always the following: "Is there an alternative that is better than the others?"
The currently recommended procedure is a Friedman test followed by Conover-Iman post-hoc comparisons.  
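
To make the first step of this procedure concrete, here is a minimal sketch of the Friedman statistic for untied rankings, written in plain Python. The helper `friedman_statistic` and the two example samples are illustrative assumptions, not part of `genexpy` or SciPy; the critical value is the 95th percentile of a chi-square distribution with 4 degrees of freedom, which is the usual large-sample reference for 5 alternatives.

```python
def friedman_statistic(rank_rows):
    """Friedman chi-square statistic for rankings without ties.

    rank_rows: one ranking per experimental condition; each ranking lists
    the 0-based rank of every alternative (0 = best).
    """
    n = len(rank_rows)       # number of experimental conditions
    k = len(rank_rows[0])    # number of alternatives
    # Rank sum of each alternative, using the conventional 1-based ranks.
    rank_sums = [sum(row[j] + 1 for row in rank_rows) for j in range(k)]
    return 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * n * (k + 1)


CRITICAL_CHI2_DF4 = 9.488  # chi-square critical value, 4 degrees of freedom, level 0.05

# Hypothetical sample: 11 conditions rank (0,1,2,3,4), 9 conditions swap the top two.
sample_a = 11 * [[0, 1, 2, 3, 4]] + 9 * [[1, 0, 2, 3, 4]]
# Hypothetical sample: 10 conditions with one ranking, 10 with its reverse.
sample_b = 10 * [[0, 1, 2, 3, 4]] + 10 * [[4, 3, 2, 1, 0]]

for name, sample in [("sample_a", sample_a), ("sample_b", sample_b)]:
    q = friedman_statistic(sample)
    verdict = "significant" if q > CRITICAL_CHI2_DF4 else "not significant"
    print(f"{name}: Q = {q:.2f} ({verdict})")
```

The statistic only depends on the rank sums, so a sample in which every alternative has the same rank sum yields a statistic of zero regardless of how much the individual rankings disagree.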

More concretely, we will consider two distributions: 

\begin{align}
   \mathbb P_1: &(01234) \mapsto 0.55\\
   &(10234) \mapsto 0.45.
\end{align}

\begin{align}
   \mathbb P_2: &(01234) \mapsto 0.5\\
   &(43210) \mapsto 0.5.
\end{align}

Each distribution shows a different aspect of the problem.  
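
As a rough illustration of what each distribution is designed to show, the sketch below computes the expected rank of every alternative and the probability that it is ranked first. The dictionary encoding and the helper names `expected_ranks` and `prob_ranked_first` are assumptions made for this example only; they are not part of `genexpy`.

```python
# Each distribution as {ranking: probability}. A ranking lists the rank of
# alternatives 0..4 in order (0 = best), as in the definitions above.
P1 = {(0, 1, 2, 3, 4): 0.55, (1, 0, 2, 3, 4): 0.45}
P2 = {(0, 1, 2, 3, 4): 0.5, (4, 3, 2, 1, 0): 0.5}


def expected_ranks(pmf):
    """Expected rank of each alternative under the distribution."""
    na = len(next(iter(pmf)))
    return [sum(p * r[j] for r, p in pmf.items()) for j in range(na)]


def prob_ranked_first(pmf):
    """Probability that each alternative receives rank 0."""
    na = len(next(iter(pmf)))
    return [sum(p for r, p in pmf.items() if r[j] == 0) for j in range(na)]


for name, pmf in [("P1", P1), ("P2", P2)]:
    print(name, "expected ranks:", [round(x, 2) for x in expected_ranks(pmf)])
    print(name, "P(ranked first):", [round(x, 2) for x in prob_ranked_first(pmf)])
```

Under $\mathbb P_1$ the two top alternatives are close but not interchangeable, while under $\mathbb P_2$ all alternatives share the same expected rank even though each individual ranking has a clear winner.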

In [3]:
import numpy as np
import pandas as pd

from scikit_posthocs import posthoc_conover_friedman
from scipy.stats import friedmanchisquare
from tqdm.auto import tqdm

from genexpy import random 
from genexpy import kernels as ku
from genexpy.utils import rankings as ru

na = 5  # number of alternatives
n = 20  # sample size
nrep = 1000  # number of samples

pval = 0.05  # significance level for the tests

## Significantly-best does not mean always-best

With $\mathbb P_1$, we show that even when an alternative is significantly better than the others within a sample, the significantly best alternative is not necessarily the same across samples. 

In [7]:
r0 = np.array([0, 1, 2, 3, 4])
r1 = np.array([1, 0, 2, 3, 4])
base_rm = np.stack(55*[r0] + 45*[r1], axis=1)
base_sample = ru.SampleAM.from_rank_vector_matrix(base_rm)
distr1 = random.PMFDistribution.from_sample(base_sample)

distr1.support, distr1.pmf

(SampleAM([b'\x01\x01\x01\x01\x01\x00\x01\x01\x01\x01\x00\x00\x01\x01\x01\x00\x00\x00\x01\x01\x00\x00\x00\x00\x01',
           b'\x01\x00\x01\x01\x01\x01\x01\x01\x01\x01\x00\x00\x01\x01\x01\x00\x00\x00\x01\x01\x00\x00\x00\x00\x01'],
          dtype='|S25'),
 array([0.55, 0.45]))

In [10]:
out = []
for _ in tqdm(list(range(nrep))):
    
    # --- Sample
    sample = distr1.sample(n=n)
    rank_matrix = sample.to_rank_vector_matrix()
    
    # get best alternatives (most times rank(alternative)==0)
    best = np.where((rank_matrix == 0).sum(axis=1) == (rank_matrix == 0).sum(axis=1).max())[0]
    
    # --- Tests
    # Friedman
    pval_friedman = friedmanchisquare(*rank_matrix)[1]
    sig_friedman = (pval_friedman <= pval)
    
    # if Friedman is significant, proceed with Conover-Iman
    if sig_friedman:
        # pairwise p-values, computed once and reused below
        pval_conoveriman = posthoc_conover_friedman(rank_matrix.T)
        # C-I is significant if the best alternatives are significantly better than anything else
        sig_conoveriman = (pval_conoveriman.values[best] < pval).sum() == na - len(best)
    else:
        sig_conoveriman = False
    
    out.append(dict(
        na=na, 
        n=n,
        pval=pval, 
        distr=str(distr1),
        sig_friedman=sig_friedman,
        sig_conoveriman=sig_conoveriman,
        best=tuple(best)
    ))
out = pd.DataFrame(out)


We now proceed by filtering the significant results and checking their corresponding best alternatives.

In [14]:
sig_f = out.query("sig_friedman")
sig_ci = out.query("sig_conoveriman")

if len(sig_f) > 0:
    print(f"F-Significant: {len(sig_f)} / {len(out)} = {len(sig_f) / len(out)}")
    print(f"CI-Significant: {len(sig_ci)} / {len(out)} = {len(sig_ci) / len(out)}")
    print(f"Distribution of significantly best alternatives: \n{sig_ci.groupby('best')['best'].count()}")
else:
    print("No significant tests.")

F-Significant: 1000 / 1000 = 1.0
CI-Significant: 290 / 1000 = 0.29
Distribution of significantly best alternatives: 
best
(0,)    236
(1,)     54
Name: best, dtype: int64


The output is clear: 54 of the 290 Conover-Iman-significant samples (about 19%) identify alternative 1 as the best, whereas the remaining 236 (about 81%) identify alternative 0.
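
The disagreement is not surprising once we look at how often a sample of size 20 favors each alternative. The sketch below computes, with a plain binomial calculation, the probability that alternative 1 is ranked first more often than alternative 0 in a sample from $\mathbb P_1$. The helper `prob_majority` is a hypothetical name introduced here, and the calculation only concerns which alternative tops the sample; it says nothing about whether the Conover-Iman comparison is significant, so it is not expected to reproduce the counts above.

```python
from math import comb


def prob_majority(n, p):
    """P(X > n - X) for X ~ Binomial(n, p), i.e. strictly more than half successes."""
    return sum(comb(n, x) * p**x * (1 - p) ** (n - x) for x in range(n // 2 + 1, n + 1))


n_conditions = 20   # sample size used in the notebook
p_swap = 0.45       # probability of the ranking that puts alternative 1 first

p_alt1_wins = prob_majority(n_conditions, p_swap)
p_alt0_wins = prob_majority(n_conditions, 1 - p_swap)
p_tie = comb(n_conditions, n_conditions // 2) * (p_swap * (1 - p_swap)) ** (n_conditions // 2)

print(f"P(alternative 1 first more often) = {p_alt1_wins:.3f}")
print(f"P(alternative 0 first more often) = {p_alt0_wins:.3f}")
print(f"P(tie)                            = {p_tie:.3f}")
```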

## Non-significant does not mean non-generalizable

This second experiment shows that non-significant results can nonetheless be generalizable. To compute generalizability, we need to set two parameters. More details on their choice are in the case studies (`./demos/case_studies`) and in the paper.
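
To give an idea of what such a computation involves, here is a minimal, library-free sketch: a Jaccard kernel on the top-$k$ alternatives, a biased (V-statistic) estimate of the MMD between two samples of rankings, and the fraction of random disjoint half-splits whose MMD stays below a threshold. Everything in it is an assumption made for illustration: the kernel definition, the MMD estimator, the helper names, and the threshold `eps` are hypothetical stand-ins and may differ from what `genexpy` implements, in particular from how it derives the threshold from the similarity parameter.

```python
import random


def top_k(ranking, k=1):
    """Set of alternatives whose rank is below k (0 = best)."""
    return frozenset(a for a, r in enumerate(ranking) if r < k)


def jaccard_kernel(r1, r2, k=1):
    """Jaccard similarity between the top-k sets of two rankings (assumed definition)."""
    t1, t2 = top_k(r1, k), top_k(r2, k)
    return len(t1 & t2) / len(t1 | t2)


def mmd(xs, ys, kernel):
    """Biased (V-statistic) estimate of the MMD between two samples."""
    kxx = sum(kernel(a, b) for a in xs for b in xs) / len(xs) ** 2
    kyy = sum(kernel(a, b) for a in ys for b in ys) / len(ys) ** 2
    kxy = sum(kernel(a, b) for a in xs for b in ys) / (len(xs) * len(ys))
    return max(kxx + kyy - 2 * kxy, 0.0) ** 0.5


def generalizability(sample, eps, kernel, rep=200, seed=0):
    """Fraction of random disjoint half-splits whose MMD is at most eps."""
    rng = random.Random(seed)
    half = len(sample) // 2
    hits = 0
    for _ in range(rep):
        shuffled = rng.sample(sample, len(sample))  # random permutation of the sample
        hits += mmd(shuffled[:half], shuffled[half:2 * half], kernel) <= eps
    return hits / rep


# Hypothetical sample of 20 rankings: 10 copies of a ranking and 10 of its reverse.
sample = 10 * [(0, 1, 2, 3, 4)] + 10 * [(4, 3, 2, 1, 0)]
eps = 0.3  # hypothetical threshold, chosen only for this example
print(f"estimated generalizability: {generalizability(sample, eps, jaccard_kernel):.3f}")
```

With $k = 1$ and no ties, the kernel is 1 when two rankings agree on the best alternative and 0 otherwise, so the MMD only compares how often each alternative is ranked first in the two halves.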

In [17]:
delta = 0.05  # similarity threshold for the rankings
kernel = ku.rankings.JaccardKernel(k=1)

In [22]:
r0 = np.array([0, 1, 2, 3, 4])
r1 = np.array([4, 3, 2, 1, 0])
base_rm = np.stack(50*[r0] + 50*[r1], axis=1)
base_sample = ru.SampleAM.from_rank_vector_matrix(base_rm)
distr2 = random.PMFDistribution.from_sample(base_sample)

distr2.support, distr2.pmf

(SampleAM([b'\x01\x01\x01\x01\x01\x00\x01\x01\x01\x01\x00\x00\x01\x01\x01\x00\x00\x00\x01\x01\x00\x00\x00\x00\x01',
           b'\x01\x00\x00\x00\x00\x01\x01\x00\x00\x00\x01\x01\x01\x00\x00\x01\x01\x01\x01\x00\x01\x01\x01\x01\x01'],
          dtype='|S25'),
 array([0.5, 0.5]))

In [24]:
out = []
for _ in tqdm(list(range(nrep))):
    
    # --- Sample
    sample = distr2.sample(n=n)
    rank_matrix = sample.to_rank_vector_matrix()
    
    # get best alternatives (most times rank(alternative)==0)
    best = np.where((rank_matrix == 0).sum(axis=1) == (rank_matrix == 0).sum(axis=1).max())[0]
    
    # --- Tests
    # Friedman
    pval_friedman = friedmanchisquare(*rank_matrix)[1]
    sig_friedman = (pval_friedman <= pval)
    
    # if Friedman is significant, proceed with Conover-Iman
    if sig_friedman:
        # pairwise p-values, computed once and reused below
        pval_conoveriman = posthoc_conover_friedman(rank_matrix.T)
        # C-I is significant if the best alternatives are significantly better than anything else
        sig_conoveriman = (pval_conoveriman.values[best] < pval).sum() == na - len(best)
    else:
        sig_conoveriman = False
    
    # --- Generalizability
    # compute generalizability for n // 2
    mmd = kernel.mmd_distribution(sample=sample, n=n//2, rep=200, method="embedding", replace=False, disjoint=True)
    generalizability = np.mean(mmd <= kernel.get_eps(delta, na=na))

    out.append(dict(
        na=na, 
        n=n,
        pval=pval, 
        distr=str(distr2),
        sig_friedman=sig_friedman,
        sig_conoveriman=sig_conoveriman,
        best=tuple(best),
        generalizability=generalizability
    ))
out = pd.DataFrame(out)


As before, we keep only the significant results and check the best alternatives they identify. We additionally report the average generalizability over all samples.

In [26]:
sig_f = out.query("sig_friedman")
sig_ci = out.query("sig_conoveriman")

if len(sig_f) > 0:
    print(f"F-Significant: {len(sig_f)} / {len(out)} = {len(sig_f) / len(out)}")
    print(f"CI-Significant: {len(sig_ci)} / {len(out)} = {len(sig_ci) / len(out)}")
    print(f"Distribution of significantly best alternatives: \n{sig_ci.groupby('best')['best'].count()}")
else:
    print("No significant tests.")
    
print(f"Average generalizability: {out['generalizability'].mean():.3f} ({out['generalizability'].std():.3f})")

F-Significant: 128 / 1000 = 0.128
CI-Significant: 1 / 1000 = 0.001
Distribution of significantly best alternatives: 
best
(0,)    1
Name: best, dtype: int64
Average generalizability: 0.737 (0.103)


We observe that, although almost no sample is Conover-Iman-significant (1 out of 1000), the generalizability estimated from these samples is consistently medium-high, with a mean of 0.737 and a standard deviation of 0.103. 