# Experiments
[PyTerrier](https://pyterrier.readthedocs.io/en/stable/experiments.html) aims to make it easy to conduct an information retrieval experiment, namely, to run a transformer pipeline over a set of queries and to evaluate the outcome using standard information retrieval evaluation metrics based on known relevant documents (obtained from a set of relevance assessments, also known as qrels).

Using [PyTerrier Artifacts](https://pyterrier.readthedocs.io/en/stable/artifacts/) allows us to reuse different artifacts (e.g., indexes and cached results).


> Sean MacAvaney. 2025. Artifact Sharing for Information Retrieval Research. In Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '25). Association for Computing Machinery, New York, NY, USA, 3974–3979. https://doi.org/10.1145/3726302.3730147



In [None]:
import pyterrier as pt
from tira.third_party_integrations import ensure_pyterrier_is_loaded
from ir_measures import nDCG

In [None]:
ensure_pyterrier_is_loaded()
ds_id = "radboud-validation-20251114-training"
dataset = pt.datasets.get_dataset(f"irds:ir-lab-wise-2025/{ds_id}")
topics = dataset.get_topics("title")
qrels = dataset.get_qrels()

## Load PyTerrier Artifacts

You can [load artifacts](https://pyterrier.readthedocs.io/en/stable/artifacts/how-to.html#id8) via URL.
The TIRA-URL for a published run is structured as follows:
```
tira:<dataset-name>/<team-name>/<approach-name>
```
Find published runs on the [TIRA leaderboard](https://www.tira.io/task-overview/ir-lab-wise-2025/radboud-validation-20251114-training).
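
The URL pattern above can be assembled programmatically. The following is a minimal sketch; `tira_url` is a hypothetical helper introduced here for illustration and is not part of PyTerrier or TIRA.

```python
def tira_url(dataset_name: str, team_name: str, approach_name: str) -> str:
    """Build a TIRA artifact URL of the form tira:<dataset-name>/<team-name>/<approach-name>.

    Hypothetical helper for illustration only; it performs no validation.
    """
    return f"tira:{dataset_name}/{team_name}/{approach_name}"


# Example with a run that is loaded later in this notebook.
url = tira_url("radboud-validation-20251114-training", "ows", "pyterrier-BM25-on-title")
print(url)
```

The resulting string can then be passed to `pt.Artifact.from_url`.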

In the following, we load some sample runs.

In [32]:
ows_bm_25 = pt.Artifact.from_url("tira:radboud-validation-20251114-training/ows/pyterrier-BM25-on-title")
golden_retrieval = pt.Artifact.from_url("tira:radboud-validation-20251114-training/ks-golden-retrievals/pyterrier-on-default_text-with-DPH-Bo1-DPH")
ows_pl2 = pt.Artifact.from_url("tira:radboud-validation-20251114-training/ows/pyterrier-PL2-on-title")
orakel_monot5 = pt.Artifact.from_url("tira:radboud-validation-20251114-training/ks-orakel/bm25 + monoT5 reranker")
chatnoir_desc = pt.Artifact.from_url("tira:radboud-validation-20251114-training/ows/chatnoir-description-default-10")
chatnoir_title = pt.Artifact.from_url("tira:radboud-validation-20251114-training/ows/chatnoir-title-bm25-100")

As you know, you can explore artifacts (e.g., the run `ows_bm_25`), for instance by inspecting their rankings for a given set of topics.

In [33]:
ows_bm_25(topics)

Unnamed: 0,qid,query,docno,rank,score,name
0,3,split ergo keyboard,e2e37f514110c0b3b7548903495348fac3db79d29df27a...,0,19.304484,pyterrier
1,3,split ergo keyboard,52f5787a001e9835f9868491378b9302398b6227a39e29...,1,16.551886,pyterrier
2,3,split ergo keyboard,38f4a96022cda7b2e85db2015f2949b340bf4116d76ea7...,2,15.835189,pyterrier
3,3,split ergo keyboard,1ddb8e93d2259c719bd72ea589132b4b3368e72cb2dac6...,3,14.537711,pyterrier
4,3,split ergo keyboard,7c83ed04d2853ce11f6cc61c23ef5239fb81c71054c5de...,4,14.537711,pyterrier
...,...,...,...,...,...,...
27437,74,Homeassistant setup,b58981cf5266cc4cccbc8068d95565f480a15752b1d7b2...,852,3.888203,pyterrier
27438,74,Homeassistant setup,bd3e8c047e9a33ad7ae313a508607b812d173c33c0d645...,853,3.741120,pyterrier
27439,74,Homeassistant setup,c205553090dea06a2a759f8486b2186f64565a93d41607...,854,3.741120,pyterrier
27440,74,Homeassistant setup,66a956b91a78d44e38b7dcfda595eb72212b7da66d35d7...,855,3.049076,pyterrier


## PyTerrier Experiments

You can define experiments using [PyTerrier](https://pyterrier.readthedocs.io/en/stable/experiments.html#api). All [`trec_eval`](https://github.com/usnistgov/trec_eval) evaluation measures are available; see the [Evaluation Measure List](https://pyterrier.readthedocs.io/en/stable/experiments.html#available-evaluation-measures).

You can perform [significance testing](https://pyterrier.readthedocs.io/en/stable/experiments.html#significance-testing) by specifying the index of the transformer you consider to be the baseline, e.g., `baseline=0`.
Additional columns are returned for each measure, indicating the number of queries improved compared to the baseline (i.e., `<measure> +`), the number of queries degraded (i.e., `<measure> -`), and the t-test p-value for the difference between each row and the baseline row. For the baseline itself, these values are `NaN` (not applicable).
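
To make these columns concrete, here is a minimal sketch of how the improved and degraded counts and a paired t statistic could be derived from per-query scores. The scores are invented for illustration, and the sketch assumes the test is paired over queries; it stops at the t statistic because the p-value requires the t-distribution, which the standard library does not provide.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical per-query scores (e.g., nDCG@10); not taken from the real runs.
baseline = {"q1": 0.20, "q2": 0.50, "q3": 0.40, "q4": 0.10}
system = {"q1": 0.30, "q2": 0.70, "q3": 0.30, "q4": 0.50}

# Per-query differences between the system and the baseline.
diffs = [system[qid] - baseline[qid] for qid in baseline]

improved = sum(d > 0 for d in diffs)  # corresponds to "<measure> +"
degraded = sum(d < 0 for d in diffs)  # corresponds to "<measure> -"

# Paired t statistic: mean difference divided by its standard error.
t_stat = mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))

print(improved, degraded, round(t_stat, 4))
```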

The cell below shows a [Student's t-test](https://en.wikipedia.org/wiki/Student%27s_t-test) with [Bonferroni correction](https://en.wikipedia.org/wiki/Bonferroni_correction) (i.e., correction for multiple testing in the comparative evaluation of many IR systems).
Multiple testing correction adds two further columns for each measure, indicating whether the null hypothesis can be rejected (i.e., `<measure> reject`) and the corrected p-value (i.e., `<measure> p-value corrected`).

In [34]:
pt.Experiment(
    [chatnoir_title, chatnoir_desc, ows_bm_25, golden_retrieval, ows_pl2, orakel_monot5],  # systems (runs) to compare
    topics,
    qrels,
    ["ndcg_cut.10"],  # measure
    names=["Chatnoir (Title)", "Chatnoir (Descr.)", "BM25 (OWS)", "DPH-Bo1-DPH (Golden)", "PL2 (OWS)", "monoT5 (ORAKEL)"],
    baseline=0, # ID of baseline
    test="t", # test to use; here: Student's t-test
    correction="bonferroni" # correction for multiple testing
)

Unnamed: 0,name,ndcg_cut.10,ndcg_cut.10 +,ndcg_cut.10 -,ndcg_cut.10 p-value,ndcg_cut.10 reject,ndcg_cut.10 p-value corrected
0,Chatnoir (Title),0.24184,,,,False,
1,Chatnoir (Descr.),0.051006,4.0,16.0,0.003056,True,0.015281
2,BM25 (OWS),0.350269,15.0,10.0,0.090867,False,0.454337
3,DPH-Bo1-DPH (Golden),0.494674,22.0,5.0,0.000189,True,0.000943
4,PL2 (OWS),0.323168,15.0,10.0,0.214476,False,1.0
5,monoT5 (ORAKEL),0.513137,21.0,6.0,1.7e-05,True,8.3e-05
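
The corrected p-values in the table above are consistent with the usual Bonferroni rule: multiply each raw p-value by the number of comparisons (here, the five non-baseline systems) and cap the result at 1. The following is a minimal sketch of that arithmetic; `bonferroni` is a hypothetical helper, and the significance level of 0.05 is an assumption. Because the raw p-values are copied from the rounded table output, the results can differ from the table in the last digits.

```python
def bonferroni(p_values, alpha=0.05):
    """Return (corrected p-values, reject decisions) under Bonferroni correction.

    Hypothetical helper for illustration; alpha=0.05 is an assumed significance level.
    """
    m = len(p_values)  # number of comparisons against the baseline
    corrected = [min(p * m, 1.0) for p in p_values]
    reject = [p < alpha for p in corrected]
    return corrected, reject


# Raw p-values of the five non-baseline systems, copied from the table above.
raw = [0.003056, 0.090867, 0.000189, 0.214476, 0.000017]
corrected, reject = bonferroni(raw)

for p, c, r in zip(raw, corrected, reject):
    print(f"{p:.6f} -> {c:.6f} reject={r}")
```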


In [35]:
pt.Experiment(
    [chatnoir_title, chatnoir_desc, ows_bm_25, golden_retrieval, ows_pl2, orakel_monot5],
    topics,
    qrels,
    [nDCG(judged_only=True)@10],
    names=["Chatnoir (Title)", "Chatnoir (Descr.)", "BM25 (OWS)", "DPH-Bo1-DPH (Golden)", "PL2 (OWS)", "monoT5 (ORAKEL)"],
    baseline=0,
    test="t",
    correction="bonferroni"
)

Unnamed: 0,name,nDCG(judged_only=True)@10,nDCG(judged_only=True)@10 +,nDCG(judged_only=True)@10 -,nDCG(judged_only=True)@10 p-value,nDCG(judged_only=True)@10 reject,nDCG(judged_only=True)@10 p-value corrected
0,Chatnoir (Title),0.30224,,,,False,
1,Chatnoir (Descr.),0.055765,3.0,17.0,0.0004814316,True,0.002407
2,BM25 (OWS),0.467444,17.0,9.0,0.0245963,False,0.122981
3,DPH-Bo1-DPH (Golden),0.641055,23.0,4.0,3.981732e-06,True,2e-05
4,PL2 (OWS),0.452354,17.0,9.0,0.04239448,False,0.211972
5,monoT5 (ORAKEL),0.63717,24.0,3.0,4.668333e-07,True,2e-06
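
The second experiment differs from the first only in the measure: `nDCG(judged_only=True)@10` scores each ranking after discarding documents that have no relevance judgment. The sketch below illustrates that idea with a simplified nDCG@k (gain equal to the relevance label, log2 rank discount). It is not the `ir_measures` implementation, and the document IDs and labels are invented.

```python
from math import log2


def dcg(gains, k):
    """Discounted cumulative gain over the first k gains."""
    return sum(g / log2(i + 2) for i, g in enumerate(gains[:k]))


def ndcg_at_k(ranking, qrels, k, judged_only=False):
    """Simplified nDCG@k for one query; unjudged documents count as non-relevant."""
    if judged_only:
        # Drop unjudged documents, so judged ones move up in the ranking.
        ranking = [doc for doc in ranking if doc in qrels]
    gains = [qrels.get(doc, 0) for doc in ranking]
    ideal_dcg = dcg(sorted(qrels.values(), reverse=True), k)
    return dcg(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


# Hypothetical judgments and ranking; "u1" and "u2" are unjudged.
qrels_example = {"d1": 1, "d2": 0, "d3": 1}
ranking_example = ["u1", "d1", "u2", "d3", "d2"]

plain = ndcg_at_k(ranking_example, qrels_example, k=2)
judged = ndcg_at_k(ranking_example, qrels_example, k=2, judged_only=True)
print(round(plain, 4), round(judged, 4))
```

In this toy case the judged-only score is higher because the unjudged documents no longer occupy the top ranks, which matches the direction of the differences between the two result tables above.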


Finally, if necessary, you can request [per-query performances](https://pyterrier.readthedocs.io/en/stable/experiments.html#per-query-effectiveness) using the `perquery=True` kwarg:

In [38]:
pt.Experiment(
    [chatnoir_title, chatnoir_desc, ows_bm_25, golden_retrieval, ows_pl2, orakel_monot5],
    topics,
    qrels,
    eval_metrics=["map", "recip_rank", "ndcg_cut.10"],
    names=["Chatnoir (Title)", "Chatnoir (Descr.)", "BM25 (OWS)", "DPH-Bo1-DPH (Golden)", "PL2 (OWS)", "monoT5 (ORAKEL)"],
    perquery=True
)

Unnamed: 0,name,qid,measure,value
180,BM25 (OWS),13,map,0.242080
181,BM25 (OWS),13,recip_rank,1.000000
182,BM25 (OWS),13,ndcg_cut.10,0.435403
183,BM25 (OWS),15,map,0.334013
184,BM25 (OWS),15,recip_rank,1.000000
...,...,...,...,...
502,monoT5 (ORAKEL),74,recip_rank,1.000000
503,monoT5 (ORAKEL),74,ndcg_cut.10,0.719013
429,monoT5 (ORAKEL),8,map,0.578571
430,monoT5 (ORAKEL),8,recip_rank,1.000000
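
The per-query output is in long format, with one row per system, query, and measure. For side-by-side inspection it can help to group the measures of each (system, query) pair. The following is a minimal sketch using only the standard library; the rows are copied from the first lines of the output above, and `wide` is a name introduced here for illustration.

```python
from collections import defaultdict

# (name, qid, measure, value) rows copied from the output above.
rows = [
    ("BM25 (OWS)", "13", "map", 0.242080),
    ("BM25 (OWS)", "13", "recip_rank", 1.000000),
    ("BM25 (OWS)", "13", "ndcg_cut.10", 0.435403),
    ("BM25 (OWS)", "15", "map", 0.334013),
    ("BM25 (OWS)", "15", "recip_rank", 1.000000),
]

# Group all measures of one (system, query) pair into a single dictionary.
wide = defaultdict(dict)
for name, qid, measure, value in rows:
    wide[(name, qid)][measure] = value

for (name, qid), measures in wide.items():
    print(name, qid, measures)
```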
