# IR Lab Tutorial: Statistical Analysis

This tutorial shows how to conduct a hypothesis test to compare two retrieval approaches.
The two runs compared in this example are loaded from the TIRA cache.

## Step 1: Ensure that libraries are imported

In [1]:
# This command loads and starts PyTerrier so that it also works in TIRA.

from tira.third_party_integrations import ensure_pyterrier_is_loaded

ensure_pyterrier_is_loaded()

No etc/terrier.properties, using terrier.default.properties for bootstrap configuration.
Java started and loaded: pyterrier.java, pyterrier.terrier.java [version=5.7 (build: craigm 2022-11-10 18:30), helper_version=0.0.7]
The following code will have the same effect:
pt.java.add_package('com.github.terrierteam', 'terrier-prf', '-SNAPSHOT')
pt.terrier.set_version('5.7')
pt.terrier.set_helper_version('0.0.7')
pt.java.mavenresolver.offline()
pt.java.init() # optional, forces java initialisation


In [2]:
# PyTerrier must be imported after `ensure_pyterrier_is_loaded` is called.

from pyterrier import started, init

if not started():
    init()



## Step 2: Load the dataset

In [3]:
from pyterrier import get_dataset

dataset_train = get_dataset('irds:ir-lab-wise-2024/subsampled-ms-marco-deep-learning-20241201-training')
dataset_train

IRDSDataset('ir-lab-wise-2024/subsampled-ms-marco-deep-learning-20241201-training')

In [4]:
dataset_validation = get_dataset('irds:ir-lab-wise-2024/subsampled-ms-marco-rag-20250105-training')
dataset_validation

IRDSDataset('ir-lab-wise-2024/subsampled-ms-marco-rag-20250105-training')

In [5]:
dataset_test = get_dataset('irds:ir-lab-wise-2024/subsampled-ms-marco-ir-lab-20250105-test')
dataset_test

IRDSDataset('ir-lab-wise-2024/subsampled-ms-marco-ir-lab-20250105-test')

In [6]:
# For development, let's use the training set for the experiments.
dataset = dataset_train

## Step 3: Create the retrieval pipeline with TIRA

In this example, we will use two existing submitted runs and load the approaches via the TIRA API.

In [7]:
from tira.rest_api_client import Client

tira_client = Client()

The approach IDs below follow the structure: `<task>/<team>/<submission>`

In [21]:
approach_baseline = tira_client.pt.from_retriever_submission(
    approach='ir-lab-wise-2024/ir-wise-24-suchmaschinen/BM25 + ReRanking (monoT5 BL)',
    dataset='subsampled-ms-marco-deep-learning-20241201-training',
)
approach_baseline

<tira.pyterrier_util.TiraSourceTransformer at 0x12326b160>

In [None]:
# approach_baseline = tira_client.pt.from_retriever_submission(
#     approach='ir-lab-wise-2024/ir-wise-24-tutors/Retrieval Baseline',
#     dataset='subsampled-ms-marco-deep-learning-20241201-training',
# )
# approach_baseline

Download: 1.11MiB [00:00, 20.3MiB/s]

Download finished. Extract...
Extraction finished:  /Users/till/.tira/extracted_runs/ir-lab-wise-2024/subsampled-ms-marco-deep-learning-20241201-training/ir-wise-24-tutors





<tira.pyterrier_util.TiraSourceTransformer at 0x105a0c9d0>

In [14]:
approach_new = tira_client.pt.from_retriever_submission(
    approach='ir-lab-wise-2024/ir-wise-24-suchmaschinen/BM25 + ReRanking (mono+duoT5)',
    dataset='subsampled-ms-marco-deep-learning-20241201-training',
)
approach_new

Download: 18.2kiB [00:00, 3.28MiB/s]

Download finished. Extract...
Extraction finished:  /Users/till/.tira/extracted_runs/ir-lab-wise-2024/subsampled-ms-marco-deep-learning-20241201-training/ir-wise-24-suchmaschinen





<tira.pyterrier_util.TiraSourceTransformer at 0x123a29120>

In [None]:
# approach_new = tira_client.pt.from_retriever_submission(
#     approach='ir-lab-wise-2024/ir-wise-24-th25/BM25 + MonoT5 Rerank',
#     dataset='subsampled-ms-marco-deep-learning-20241201-training',
# )
# approach_new

Download: 1.37MiB [00:00, 15.4MiB/s]


Download finished. Extract...
Extraction finished:  /Users/till/.tira/extracted_runs/ir-lab-wise-2024/subsampled-ms-marco-deep-learning-20241201-training/ir-wise-24-th25


<tira.pyterrier_util.TiraSourceTransformer at 0x1231b7b80>

## Step 4: Measure effectiveness

Now let us measure the nDCG@10 effectiveness of both systems on the training dataset loaded in Step 2.

In [22]:
from pyterrier.pipelines import Experiment

experiment = Experiment(
    retr_systems=[
        approach_baseline,
        approach_new,
    ],
    topics=dataset_train.get_topics("query"),
    qrels=dataset_train.get_qrels(),
    eval_metrics=["ndcg_cut_10"],
    names=[
        "monoT5",
        "monoT5+duoT5",
    ],
    perquery=True,
)
experiment.sample(n=10)

Unnamed: 0,name,qid,measure,value
92,monoT5,405717,ndcg_cut_10,0.595927
139,monoT5+duoT5,640502,ndcg_cut_10,1.0
175,monoT5+duoT5,1114819,ndcg_cut_10,0.877514
60,monoT5,168216,ndcg_cut_10,0.812071
189,monoT5+duoT5,405717,ndcg_cut_10,0.634123
121,monoT5+duoT5,121171,ndcg_cut_10,0.548893
89,monoT5,1114646,ndcg_cut_10,0.885673
29,monoT5,174463,ndcg_cut_10,0.711944
58,monoT5,489204,ndcg_cut_10,0.224446
4,monoT5,1064670,ndcg_cut_10,0.801862
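
To get a feeling for what a single nDCG@10 value expresses, the following minimal sketch computes the measure by hand for one query. It assumes linear gains and a $\log_2$ rank discount, which is one common formulation; the exact variant behind `ndcg_cut_10` is not spelled out in this tutorial. The relevance labels and the helper `ndcg_at_k` are hypothetical.

```python
import math


def ndcg_at_k(ranked_gains, all_gains, k=10):
    """nDCG@k with linear gains and a log2(rank + 1) discount (assumed variant)."""

    def dcg(gains):
        # Sum the gains of the top-k documents, discounted by their rank.
        return sum(
            gain / math.log2(rank + 1)
            for rank, gain in enumerate(gains[:k], start=1)
        )

    # The ideal ranking sorts all judged documents by decreasing relevance.
    ideal = dcg(sorted(all_gains, reverse=True))
    return dcg(ranked_gains) / ideal if ideal > 0 else 0.0


# Hypothetical relevance labels of the documents a system returned, in rank order.
ranked_gains = [3, 0, 2, 1]
# Hypothetical relevance labels of all judged documents for the same query.
all_gains = [3, 2, 1, 0]

score = ndcg_at_k(ranked_gains, all_gains)
perfect = ndcg_at_k([3, 2, 1, 0], all_gains)
print(score, perfect)
```

A ranking that returns the judged documents in ideal order reaches a score of 1, and any other ordering scores lower.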


This data frame shows the nDCG@10 values measured for each query and both systems. \
So we have pairs of measurements where the same metric (i.e., nDCG@10) is measured using the same input (e.g., query #1) but for two different systems.
Let's re-arrange the data frame so that the effectiveness values are in separate columns, not rows.

In [23]:
experiment_baseline = experiment[experiment["name"] == "monoT5"]\
    .drop(columns=["name"])
experiment_approach = experiment[experiment["name"] == "monoT5+duoT5"]\
    .drop(columns=["name"])

experiment_paired = experiment_baseline.merge(
    experiment_approach,
    on=["qid", "measure"],
    suffixes=("_baseline", "_approach"),
)
experiment_paired.head(n=10)

Unnamed: 0,qid,measure,value_baseline,value_approach
0,1030303,ndcg_cut_10,0.627356,0.73375
1,1037496,ndcg_cut_10,0.912539,0.90215
2,1037798,ndcg_cut_10,0.199613,0.220807
3,1043135,ndcg_cut_10,0.845994,0.73045
4,104861,ndcg_cut_10,1.0,1.0
5,1051399,ndcg_cut_10,0.821843,0.817671
6,1063750,ndcg_cut_10,0.871021,0.841178
7,1064670,ndcg_cut_10,0.801862,0.752252
8,1071750,ndcg_cut_10,0.726809,0.688409
9,1103812,ndcg_cut_10,0.744751,0.659619
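
As an alternative to filtering and merging, the same pairing could be sketched with `DataFrame.pivot`. The data frame below is a hypothetical stand-in in the same long format as `experiment`; the values are made up for illustration.

```python
import pandas as pd

# Hypothetical per-query results in the same long format as `experiment`.
long_format = pd.DataFrame({
    "name": ["monoT5", "monoT5", "monoT5+duoT5", "monoT5+duoT5"],
    "qid": ["q1", "q2", "q1", "q2"],
    "measure": ["ndcg_cut_10"] * 4,
    "value": [0.50, 0.80, 0.60, 0.70],
})

# One row per query, one column per system.
wide_format = long_format.pivot(index="qid", columns="name", values="value")

# Per-query difference between the new approach and the baseline.
differences = wide_format["monoT5+duoT5"] - wide_format["monoT5"]
print(wide_format)
print(differences)
```

Note that `pivot` only works like this when a single measure is present; with several measures, the merge on `qid` and `measure` shown above keeps them apart.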


## Step 5: Conduct hypothesis tests

On this _paired_ measurement data, we can now conduct _paired_ t-tests to test for statistical significance of given hypotheses.
Remember that the choice of your test depends (amongst other factors) on how the hypothesis is formulated.
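
To see why the pairing matters, here is a minimal sketch of what a paired t-test computes: it only looks at the per-query differences and divides their mean by their standard error. The nDCG@10 values are hypothetical and not taken from the runs above.

```python
import numpy as np
from scipy.stats import ttest_rel

# Hypothetical nDCG@10 values for five queries (not taken from the runs above).
baseline = np.array([0.62, 0.91, 0.20, 0.85, 0.74])
approach = np.array([0.73, 0.90, 0.22, 0.73, 0.80])

# The paired t-test operates on the per-query differences only.
differences = approach - baseline
n = len(differences)

# t = mean difference / standard error of the mean difference.
t_manual = differences.mean() / (differences.std(ddof=1) / np.sqrt(n))

# SciPy computes the same statistic.
t_scipy = ttest_rel(approach, baseline).statistic
print(t_manual, t_scipy)
```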

Let us test some hypotheses to get a feeling of what this means:

#### Hypothesis 1: The new approach has a significantly different nDCG@10 on the chosen dataset than the baseline.
(Hint: For your own tests, you'd want to replace the approach and dataset names with the actual names above.)

Significance test: two-sided paired t-test \
Significance level: $\alpha = 0.05$ (i.e., the effect is only considered significant if $p < 0.05$)

In [24]:
from scipy.stats import ttest_rel

ttest_rel(
    experiment_paired["value_approach"],
    experiment_paired["value_baseline"],
    alternative='two-sided',
).pvalue

0.8316784914254621

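The above value is the $p$-value: the probability of observing a difference at least as extreme as the measured one, assuming that the null hypothesis (no difference between the two systems) is true. \
If $p$ is lower than our significance level $\alpha$, we can reject the null hypothesis in favor of hypothesis 1. Here, $p \approx 0.83$ is well above $\alpha = 0.05$, so we cannot reject the null hypothesis: the data show no significant difference between the two systems.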
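
Applied to the p-value printed above, the decision rule can be sketched as follows. The helper `reject_null` is a hypothetical name introduced only for this illustration.

```python
ALPHA = 0.05  # significance level chosen before running the test


def reject_null(p_value, alpha=ALPHA):
    """Return True if the null hypothesis can be rejected at level alpha."""
    return p_value < alpha


# p-value of the two-sided paired t-test printed above.
p_hypothesis_1 = 0.8316784914254621

decision = reject_null(p_hypothesis_1)
print(decision)
```

The significance level should be fixed before looking at the data; choosing it after seeing $p$ would defeat the purpose of the test.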

Now it would be great to find out which is better. \
One way could be to formulate a hypothesis with a predefined "direction". In this example we assume our new approach to be better.

#### Hypothesis 2: The new approach has a significantly higher nDCG@10 on the chosen dataset than the baseline.

Significance test: one-sided paired t-test \
Significance level: $\alpha = 0.05$ (or $p < 0.05$)

In [27]:
from scipy.stats import ttest_rel

ttest_rel(
    experiment_paired["value_approach"],
    experiment_paired["value_baseline"],
    alternative='greater',
).pvalue

1.6693111780797145e-35

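Again, if $p$ is lower than our significance level $\alpha$, we can reject the null hypothesis in favor of hypothesis 2. Note, however, that the value printed above does not fit the other two results: on the same data, the p-values of the two one-sided tests add up to 1, so with $p \approx 0.58$ for the opposite direction (tested next), this test should yield $p \approx 0.42$. The printed output most likely stems from a run on different data, so hypothesis 2 should not be considered confirmed on its basis.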

Let us test the opposite direction: the new approach could be worse w.r.t. nDCG@10 than the baseline.

#### Hypothesis 3: The new approach has a significantly lower nDCG@10 on the chosen dataset than the baseline.

Significance test: one-sided paired t-test \
Significance level: $\alpha = 0.05$ (or $p < 0.05$)

In [26]:
from scipy.stats import ttest_rel

ttest_rel(
    experiment_paired["value_approach"],
    experiment_paired["value_baseline"],
    alternative='less',
).pvalue

0.5841607542872689

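Again, if $p$ is lower than our significance level $\alpha$, we can reject the null hypothesis in favor of hypothesis 3. Here, $p \approx 0.58$ is above $\alpha = 0.05$, so the data do not support the claim that the new approach is significantly worse than the baseline.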
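
The three variants of the test are closely related, which makes for a useful sanity check of your own results. The following sketch runs all three on synthetic, hypothetical data (generated with a fixed seed, not taken from the runs above) and shows that the two one-sided p-values add up to 1 and that the two-sided p-value is twice the smaller one-sided value.

```python
import numpy as np
from scipy.stats import ttest_rel

# Synthetic, hypothetical scores for 50 queries (not taken from the runs above).
rng = np.random.default_rng(seed=0)
baseline = rng.uniform(0.2, 1.0, size=50)
approach = np.clip(baseline + rng.normal(0.0, 0.1, size=50), 0.0, 1.0)

# Run the same paired t-test with all three alternatives.
p_two_sided = ttest_rel(approach, baseline, alternative="two-sided").pvalue
p_greater = ttest_rel(approach, baseline, alternative="greater").pvalue
p_less = ttest_rel(approach, baseline, alternative="less").pvalue

print(p_two_sided, p_greater, p_less)
print(p_greater + p_less)             # the one-sided p-values sum to 1
print(2 * min(p_greater, p_less))     # equals the two-sided p-value
```

If your own three results do not satisfy these relations, the tests were probably run on different data, for example because a notebook cell was not re-executed.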