# Information Retrieval Lab WiSe 2024/2025: Baseline Retrieval System

This Jupyter notebook serves as a baseline retrieval system that you can improve upon.
We use subsets of the MS MARCO datasets to retrieve passages of web documents.
We will show you how to create a software submission to TIRA from this notebook.

An overview of all corpora that we use in the current course is available at [https://tira.io/datasets?query=ir-lab-wise-2024](https://tira.io/datasets?query=ir-lab-wise-2024). The dataset IDs for loading the datasets are:

- `ir-lab-wise-2024/subsampled-ms-marco-deep-learning-20241201-training`: A subsample of the TREC 2019/2020 Deep Learning tracks on the MS MARCO v1 passage dataset. Use this dataset to tune your system(s).
- `ir-lab-wise-2024/subsampled-ms-marco-rag-20250105-training` (_work in progress_): A subsample of the TREC 2024 Retrieval-Augmented Generation track on the MS MARCO v2.1 passage dataset. Use this dataset to tune your system(s).
- `ir-lab-wise-2024/subsampled-ms-marco-ir-lab-20250105-test` (_work in progress_): The test corpus that we have created together in the course, based on the MS MARCO v2.1 passage dataset. We will use this dataset as the test dataset, i.e., evaluation scores become available only after the submission deadline.

### Step 1: Import libraries

We will use [tira](https://tira.io/), an information retrieval shared task platform, and [ir_datasets](https://ir-datasets.com/) for loading the datasets. Subsequently, we will build a retrieval system with [PyTerrier](https://github.com/terrier-org/pyterrier), an open-source search engine framework.

First, we need to install the required libraries.

In [4]:
!pip3 install --upgrade tira ir-datasets python-terrier 
!pip3 install --upgrade pyterrier-caching pyterrier_t5
# !pip3 install --upgrade git+https://github.com/terrierteam/pyterrier_t5.git



Create an API client to interact with the TIRA platform (e.g., to load datasets and submit runs).

In [3]:
from tira.third_party_integrations import ensure_pyterrier_is_loaded
from tira.rest_api_client import Client

import pyterrier as pt

if not pt.java.started():
    pt.java.init()

ensure_pyterrier_is_loaded()
tira = Client()

Java started and loaded: pyterrier.java, pyterrier.terrier.java [version=5.11 (build: craig.macdonald 2025-01-13 21:29), helper_version=0.0.8]


### Step 2: Load the dataset

We load the dataset by its ir_datasets ID (as listed in the Readme). Just be sure to add the `irds:` prefix before the dataset ID to tell PyTerrier to load the data from ir_datasets.

In [4]:
from pyterrier import get_dataset

pt_dataset = get_dataset('irds:ir-lab-wise-2024/subsampled-ms-marco-deep-learning-20241201-training')
pt_dataset_new = get_dataset('irds:ir-lab-wise-2024/subsampled-ms-marco-rag-20250105-training')
pt_dataset_test = get_dataset('irds:ir-lab-wise-2024/subsampled-ms-marco-ir-lab-20250105-test')

### Step 3: Build an index

We will then create an index from the documents in the dataset we just loaded.

In [5]:
from pyterrier import IterDictIndexer

indexer = IterDictIndexer(
    # Store the index in the `index` directory.
    "../data/index",
    meta={'docno': 50, 'text': 4096},
    # If an index already exists there, then overwrite it.
    overwrite=True,
)
index = indexer.index(pt_dataset.get_corpus_iter())

ir-lab-wise-2024/subsampled-ms-marco-deep-learning-20241201-training documents: 100%|██████████| 68261/68261 [00:08<00:00, 8135.92it/s] 


20:07:24.876 [ForkJoinPool-1-worker-1] WARN org.terrier.structures.indexing.Indexer -- Indexed 1 empty documents


In [6]:
indexer_new = IterDictIndexer(
    # Store the index in the `index_new` directory.
    "../data/index_new",
    meta={'docno': 50, 'text': 4096},
    # If an index already exists there, then overwrite it.
    overwrite=True,
)
index_new = indexer_new.index(pt_dataset_new.get_corpus_iter())

ir-lab-wise-2024/subsampled-ms-marco-rag-20250105-training documents: 100%|██████████| 113227/113227 [00:54<00:00, 2087.20it/s]


In [7]:
indexer_test = IterDictIndexer(
    # Store the index in the `index_test` directory.
    "../data/index_test",
    meta={'docno': 50, 'text': 4096},
    # If an index already exists there, then overwrite it.
    overwrite=True,
)
index_test = indexer_test.index(pt_dataset_test.get_corpus_iter())

ir-lab-wise-2024/subsampled-ms-marco-ir-lab-20250105-test documents: 100%|██████████| 125112/125112 [01:09<00:00, 1796.74it/s]


### Step 4: Define the retrieval pipeline

We will define a simple retrieval pipeline using just BM25 as a baseline. For details, refer to the PyTerrier [documentation](https://pyterrier.readthedocs.io) or [tutorial](https://github.com/terrier-org/ecir2021tutorial).

In [8]:
from pyterrier import BatchRetrieve

bm25 = BatchRetrieve(index, wmodel="BM25")
bm25_new = BatchRetrieve(index_new, wmodel="BM25")
bm25_test = BatchRetrieve(index_test, wmodel="BM25")



In [9]:
bm25

TerrierRetr(../data/index/data.properties,{'terrierql': 'on', 'parsecontrols': 'on', 'parseql': 'on', 'applypipeline': 'on', 'localmatching': 'on', 'filters': 'on', 'decorate': 'on', 'wmodel': 'BM25', 'decorate_batch': 'on'},{'querying.processes': 'terrierql:TerrierQLParser,parsecontrols:TerrierQLToControls,parseql:TerrierQLToMatchingQueryTerms,matchopql:MatchingOpQLParser,applypipeline:ApplyTermPipeline,context_wmodel:org.terrier.python.WmodelFromContextProcess,localmatching:LocalManager$ApplyLocalMatching,qe:QueryExpansion,labels:org.terrier.learning.LabelDecorator,filters:LocalManager$PostFilterProcess,decorate:SimpleDecorateProcess', 'querying.postfilters': 'decorate:SimpleDecorate,site:SiteFilter,scope:Scope', 'querying.default.controls': 'wmodel:DPH,parsecontrols:on,parseql:on,applypipeline:on,terrierql:on,localmatching:on,filters:on,decorate:on', 'querying.allowed.controls': 'scope,qe,qemodel,start,end,site,scope,applypipeline', 'termpipelines': 'Stopwords,PorterStemmer'})
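To make the `BM25` weighting model more concrete, here is a minimal pure-Python sketch of one common BM25 variant. It is an illustration under assumptions, not Terrier's implementation: the function name `bm25_scores`, the toy documents, the whitespace tokenization, and the smoothed IDF formula are all choices made for this example, and Terrier additionally applies stopword removal and stemming (see `termpipelines` in the output above).

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized document in `docs` against the tokenized `query`."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Smoothed IDF (the "1 +" keeps it positive even for frequent terms).
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation (k1) combined with length normalization (b).
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Toy corpus (made up for illustration).
docs = [
    ["cat", "sat"],
    ["cat", "sat", "on", "the", "mat", "today"],
    ["dog", "barked"],
]
scores = bm25_scores(["cat"], docs)
print(scores)
```

Both documents containing "cat" mention it once, but the shorter one scores higher because of the length normalization controlled by `b`.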

### Step 5: Create the run
Next, we apply our retrieval system to the topics to produce a 'run': the ranked list of retrieved documents for each topic.

First, let's have a short look at the topics:

In [11]:
# The `'text'` argument below selects the topic's `text` field as the query.
pt_dataset.get_topics('text')

Unnamed: 0,qid,query
0,1030303,who is aziz hashim
1,1037496,who is rep scalise
2,1043135,who killed nicholas ii of russia
3,1051399,who sings monk theme song
4,1064670,why do hunters pattern their shotguns
...,...,...
92,405717,is cdg airport in main paris
93,182539,example of monotonic function
94,1113437,what is physical description of spruce
95,1129237,hydrogen is a liquid below what temperature


Now, retrieve results for all the topics (may take a while):

In [12]:
run = bm25(pt_dataset.get_topics('text'))

That's it for the retrieval. Here are the first 10 entries of the run:

In [13]:
run.head(10)

Unnamed: 0,qid,docid,docno,rank,score,query
0,1030303,53852,8726436,0,31.681671,who is aziz hashim
1,1030303,56041,8726433,1,25.966276,who is aziz hashim
2,1030303,62116,8726435,2,23.863442,who is aziz hashim
3,1030303,32183,8726429,3,23.391821,who is aziz hashim
4,1030303,35867,8726437,4,21.030669,who is aziz hashim
5,1030303,17637,8726430,5,19.9672,who is aziz hashim
6,1030303,42957,7156982,6,19.9672,who is aziz hashim
7,1030303,21803,8726434,7,19.474804,who is aziz hashim
8,1030303,59828,1305520,8,17.849161,who is aziz hashim
9,1030303,60002,3302257,9,17.832781,who is aziz hashim
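A run like this is usually exchanged as a plain-text file in the TREC run format, with one line per retrieved document: `qid Q0 docno rank score tag`. The sketch below formats the first two rows from the table above. The helper name `to_trec_run_lines` and the tag `bm25-baseline` are hypothetical; when submitting to TIRA, the conversion is handled for you, so this is only meant to show what the file contains.

```python
def to_trec_run_lines(rows, tag):
    """Format (qid, docno, rank, score) rows as TREC run lines."""
    return [
        f"{qid} Q0 {docno} {rank} {score} {tag}"
        for qid, docno, rank, score in rows
    ]

# The first two rows of the run shown above.
rows = [
    ("1030303", "8726436", 0, 31.681671),
    ("1030303", "8726433", 1, 25.966276),
]
lines = to_trec_run_lines(rows, tag="bm25-baseline")
for line in lines:
    print(line)
```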


### Step 6: Improve

Building your own index is already one way to improve upon this baseline (if you want to focus on creating good document representations). Other options include reformulating queries, tuning parameters, or building better retrieval pipelines.

Ideas:
- Try out other lemmatization/stemming approaches
- Text embeddings (transformers): bi-encoders
- Re-ranking (feed the top 10 into an LLM): monoT5 (this is what we do here)
- Hyperparameter tuning: adjust the BM25 parameters (b, k1, k2) via grid search
- Adapt the query pipeline (add or replace stopwords, etc.) + query expansion: add synonyms to the query
- Document expansion: doc2query
- Combine different retrieval approaches: learning to rank
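As an illustration of the grid-search idea from the list above, here is a minimal generic sketch. The names `grid_search` and `toy_evaluate` are hypothetical: in practice, the evaluation function would build a BM25 retriever with the given parameters, run it on the training topics, and return a metric such as nDCG@10. The toy function only stands in for that step so the example runs on its own.

```python
from itertools import product

def grid_search(evaluate, grid):
    """Return (best_params, best_score) over the Cartesian product of `grid`."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical stand-in for "run BM25 with these parameters and return nDCG@10".
def toy_evaluate(k1, b):
    return -((k1 - 1.2) ** 2) - (b - 0.4) ** 2

best_params, best_score = grid_search(
    toy_evaluate,
    {"k1": [0.9, 1.2, 1.5], "b": [0.2, 0.4, 0.6, 0.8]},
)
print(best_params, best_score)
```

Parameters should be tuned on the training datasets only, since scores on the test dataset become available only after the submission deadline.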

In [14]:
import pyterrier as pt
from pyterrier_t5 import MonoT5ReRanker, DuoT5ReRanker
monoT5 = MonoT5ReRanker()
duoT5 = DuoT5ReRanker()

from pyterrier_caching import SparseScorerCache

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565


In [15]:
mono_pipeline = bm25 % 100 >> pt.text.get_text(pt_dataset, "text") >> monoT5
duo_pipeline = mono_pipeline % 10 >> duoT5
duo_only_pipeline = bm25 % 10 >> pt.text.get_text(pt_dataset, "text") >> duoT5 
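The small cutoffs in front of `duoT5` matter because a pairwise re-ranker compares documents with each other rather than scoring each one on its own. The sketch below shows one common aggregation (summing pairwise preference probabilities) and counts the comparisons. It is a simplified illustration, not the `pyterrier_t5` implementation; `pairwise_rerank` and `toy_prefer` are hypothetical names, and the toy preference function is purely illustrative.

```python
from itertools import permutations

def pairwise_rerank(docs, prefer):
    """Re-rank `docs` by summing pairwise preference probabilities.

    `prefer(a, b)` is a hypothetical callable returning the probability
    that document `a` is more relevant than document `b`.
    """
    scores = {d: 0.0 for d in docs}
    n_calls = 0
    for a, b in permutations(docs, 2):  # all ordered pairs: k * (k - 1)
        scores[a] += prefer(a, b)
        n_calls += 1
    ranking = sorted(docs, key=lambda d: scores[d], reverse=True)
    return ranking, n_calls

# Toy preference: the longer string "wins" (purely illustrative).
def toy_prefer(a, b):
    return 1.0 if len(a) > len(b) else 0.0

ranking, n_calls = pairwise_rerank(["aa", "a", "aaaa", "aaa"], toy_prefer)
print(ranking, n_calls)
```

With k candidates this needs k * (k - 1) model calls per query: 90 for a cutoff of 10, but 9,900 for a cutoff of 100.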

In [17]:
# Experiment
from pyterrier import Experiment

In [18]:
# Number of topics
first_x_topics = 3

# Experiment(
#     [bm25, mono_pipeline, duo_pipeline, duo_only_pipeline],
#     topics = pt_dataset.get_topics('text').head(first_x_topics),
#     qrels = pt_dataset.get_qrels(),
#     eval_metrics =['ndcg_cut_10', 'mrt'],
#     names=["BM25", "monoT5", "monoT5+duoT5", "duoT5"],
#     round = 3
# )

In [26]:
pipeline_chatnoir = bm25
mono_cached = SparseScorerCache('monoT5_2.cache', monoT5, verbose=True) # Cache the monoT5 scores

pipeline_mono_t5 = (pipeline_chatnoir >> mono_cached) ^ pipeline_chatnoir
pipeline_duo_t5 = (pipeline_mono_t5 % 5 >> duoT5) ^ pipeline_mono_t5
topics = pt_dataset.get_topics().head(1)
qrels = pt_dataset.get_qrels()
# Run experiment
Experiment(
    retr_systems=[
        pipeline_chatnoir,
        pipeline_mono_t5,
        pipeline_duo_t5,
    ],
    names=[
        "ChatNoir",
        "ChatNoir+monoT5",
        "ChatNoir+monoT5+duoT5",
    ],
    topics=topics,
    qrels=qrels,
    eval_metrics=["ndcg_cut_10"],
    verbose=True,
)

There are multiple query fields available: ('text', 'title', 'query', 'description', 'narrative'). To use with pyterrier, provide variant or modify dataframe to add query column.


monoT5: 100%|██████████| 225/225 [00:35<00:00,  6.38batches/s]
pt.Experiment:  67%|██████▋   | 2/3 [00:35<00:17, 17.76s/system]

Sqlite3ScorerCache('monoT5_2.cache', <pyterrier_t5.MonoT5ReRanker object at 0x14704fa30>, group='query', key='docno'): 100 hit(s), 900 miss(es)
Sqlite3ScorerCache('monoT5_2.cache', <pyterrier_t5.MonoT5ReRanker object at 0x14704fa30>, group='query', key='docno'): 1000 hit(s), 0 miss(es)


duoT5: 100%|██████████| 1/1 [00:02<00:00,  2.94s/queries]
pt.Experiment: 100%|██████████| 3/3 [00:38<00:00, 12.91s/system]

Sqlite3ScorerCache('monoT5_2.cache', <pyterrier_t5.MonoT5ReRanker object at 0x14704fa30>, group='query', key='docno'): 1000 hit(s), 0 miss(es)





Unnamed: 0,name,ndcg_cut_10
0,ChatNoir,0.872746
1,ChatNoir+monoT5,0.0
2,ChatNoir+monoT5+duoT5,0.0


In [14]:
Experiment(
    [bm25, mono_pipeline, duo_pipeline],
    topics = pt_dataset.get_topics('text').tail(first_x_topics),
    qrels = pt_dataset.get_qrels(),
    eval_metrics =['ndcg_cut_10', 'mrt'],
    names=["BM25", "monoT5", "duoT5"],
    round = 3
)

monoT5: 100%|██████████| 75/75 [00:37<00:00,  2.02batches/s]
monoT5: 100%|██████████| 75/75 [00:35<00:00,  2.12batches/s]
duoT5: 100%|██████████| 3/3 [01:13<00:00, 24.42s/queries]


Unnamed: 0,name,ndcg_cut_10,mrt
0,BM25,0.437,81.713
1,monoT5,0.678,12462.558
2,duoT5,0.684,36505.042
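For reference, `ndcg_cut_10` is nDCG at rank 10. The following is a minimal sketch of how such a value can be computed for a single query, assuming linear gains and a log2 rank discount. Evaluation tools differ in details such as the gain function and tie handling, so treat this as an illustration rather than a re-implementation of the evaluator used by `Experiment`. The function names and the toy judgments are made up for this example.

```python
import math

def dcg(gains):
    # The gain at rank r (1-based) is discounted by log2(r + 1).
    return sum(g / math.log2(r + 1) for r, g in enumerate(gains, start=1))

def ndcg_at_k(ranked_docnos, qrels, k=10):
    """nDCG@k for one query; `qrels` maps docno -> graded relevance."""
    gains = [qrels.get(d, 0) for d in ranked_docnos[:k]]
    ideal = sorted(qrels.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0

# Toy relevance judgments (made up for illustration).
qrels = {"d1": 3, "d2": 1, "d3": 0}
ideal_order = ndcg_at_k(["d1", "d2", "d3"], qrels)
reversed_order = ndcg_at_k(["d3", "d2", "d1"], qrels)
print(ideal_order, reversed_order)
```

The value reported by an experiment is the mean of these per-query scores over all topics.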


# Caching

In [15]:
from pyterrier_caching import SparseScorerCache

first_x_topics = 100

def run_pipeline(dataset, append, base_model):
    # Apply the first-stage model to the first x topics; `% 100` keeps only the top 100 documents per topic.
    inp = (base_model % 100).transform(dataset.get_topics('text').head(first_x_topics))
    # Only the top 10 documents per topic, to serve as input for duoT5.
    inp_10 = (base_model % 10).transform(dataset.get_topics('text').head(first_x_topics))
    mono2 = pt.text.get_text(dataset, "text") >> monoT5 # Pipeline for monoT5
    mono_cached = SparseScorerCache('_old_monoT5' + append + '.cache', mono2, verbose=True) # Cache the monoT5 scores
    duo2 = (mono_cached % 100) >> pt.text.get_text(dataset, "text") >> duoT5 # Pipeline for duoT5
    duo_cached = SparseScorerCache('_old_duoT5' + append + '.cache', duo2, verbose=True) # Cache the duoT5 scores

    mono_results = mono_cached.transform(inp) # Apply monoT5 to the BM25 results
    mono_results_10 = (mono_cached % 10).transform(inp) # Only the top 10 documents, to serve as input for duoT5

    duo_results = duo_cached.transform(mono_results_10) # Apply duoT5 to the monoT5 results
    duo_only_results = duo_cached.transform(inp_10) # Apply duoT5 directly to the BM25 results
    return mono_results, duo_results, duo_only_results

mono_results, duo_results, duo_only = run_pipeline(pt_dataset, '', bm25)
mono_results_new, duo_results_new, duo_only_new = run_pipeline(pt_dataset_new, '_new', bm25_new)
mono_results_test, duo_results_test, duo_only_test = run_pipeline(pt_dataset_test, '_test', bm25_test)

Sqlite3ScorerCache('monoT5.cache', (<pyterrier.datasets.IRDSTextLoader object at 0x14f27bc40> >> MonoT5(castorini/monot5-base-msmarco)), group='query', key='docno'): 9533 hit(s), 0 miss(es)
Sqlite3ScorerCache('monoT5.cache', (<pyterrier.datasets.IRDSTextLoader object at 0x14f27bc40> >> MonoT5(castorini/monot5-base-msmarco)), group='query', key='docno'): 9533 hit(s), 0 miss(es)
Sqlite3ScorerCache('duoT5.cache', (<pyterrier.datasets.IRDSTextLoader object at 0x13b29ebf0> >> DuoT5(castorini/duot5-base-msmarco)), group='query', key='docno'): 965 hit(s), 0 miss(es)
Sqlite3ScorerCache('duoT5.cache', (<pyterrier.datasets.IRDSTextLoader object at 0x13b29ebf0> >> DuoT5(castorini/duot5-base-msmarco)), group='query', key='docno'): 965 hit(s), 0 miss(es)
Sqlite3ScorerCache('monoT5_new.cache', (<pyterrier.datasets.IRDSTextLoader object at 0x12b220b20> >> MonoT5(castorini/monot5-base-msmarco)), group='query', key='docno'): 8600 hit(s), 0 miss(es)
Sqlite3ScorerCache('monoT5_new.cache', (<pyterrier.dat
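The hit and miss counts in the output above come from the score cache. The idea can be sketched in a few lines of plain Python: scores are memoized per (query, docno) pair, matching the `group='query', key='docno'` shown in the log. This is a simplified illustration, not the `pyterrier_caching` implementation; the class `ScoreCache` and the function `slow_scorer` are hypothetical.

```python
class ScoreCache:
    """Memoize scores per (query, docno) pair and count hits and misses."""

    def __init__(self, scorer):
        self.scorer = scorer  # callable: (query, docno) -> float
        self.store = {}
        self.hits = 0
        self.misses = 0

    def score(self, query, docno):
        key = (query, docno)
        if key in self.store:
            self.hits += 1
        else:
            self.misses += 1
            self.store[key] = self.scorer(query, docno)
        return self.store[key]

# Hypothetical stand-in for an expensive neural scorer.
def slow_scorer(query, docno):
    return float(len(query) + len(docno))

cache = ScoreCache(slow_scorer)
pairs = [("who is aziz hashim", "8726436"), ("who is aziz hashim", "8726433")]
for _ in range(2):  # the second pass is served entirely from the cache
    for query, docno in pairs:
        cache.score(query, docno)
print(cache.hits, cache.misses)
```

Because a cache like this is keyed only on the query and the document number, reusing one cache for a different model or a different corpus would return stale scores. This is one reason to keep a separate cache file per dataset and per re-ranker, as `run_pipeline` does with its `append` suffix.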

In [18]:
exp = Experiment(
    [bm25, mono_results, duo_results, duo_only],
    topics = pt_dataset.get_topics('text').head(first_x_topics),
    qrels = pt_dataset.get_qrels(),
    eval_metrics =['ndcg_cut_10', 'map', 'recip_rank', 'P_1', 'P_5', 'P_10'],
    names=["BM25", "monoT5", "monoT5+duoT5", "duoT5"],
    round = 3,
    baseline=1
)
exp

Unnamed: 0,name,map,recip_rank,P_1,P_5,P_10,ndcg_cut_10,map +,map -,map p-value,...,P_1 p-value,P_5 +,P_5 -,P_5 p-value,P_10 +,P_10 -,P_10 p-value,ndcg_cut_10 +,ndcg_cut_10 -,ndcg_cut_10 p-value
0,BM25,0.413,0.787,0.701,0.623,0.574,0.489,48.0,49.0,0.8563336,...,8e-06,6.0,54.0,3.234086e-09,11.0,58.0,3.686597e-09,13.0,84.0,7.862038e-13
1,monoT5,0.416,0.962,0.938,0.819,0.735,0.682,,,,...,,,,,,,,,,
2,monoT5+duoT5,0.196,0.978,0.969,0.827,0.735,0.686,5.0,87.0,2.0284829999999997e-19,...,0.083248,11.0,6.0,0.3738256,0.0,0.0,,43.0,36.0,0.265198
3,duoT5,0.149,0.842,0.773,0.685,0.574,0.511,4.0,91.0,2.471478e-22,...,0.000477,8.0,39.0,1.167368e-05,11.0,58.0,3.686597e-09,13.0,79.0,6.471353e-11
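The `+`, `-`, and `p-value` columns compare each system with the chosen baseline on a per-query basis. As far as we understand the `Experiment` output, `+` and `-` count the queries on which a system is better or worse than the baseline, and the p-value comes from a paired significance test (a paired t-test by default). The sketch below computes the win/loss counts and the paired t statistic for made-up per-query scores; the function name and the numbers are hypothetical, and turning the statistic into a p-value is left to a statistics library.

```python
import math
from statistics import mean, stdev

def compare_to_baseline(baseline, system):
    """Per-query comparison of two aligned lists of metric values."""
    diffs = [s - b for b, s in zip(baseline, system)]
    wins = sum(d > 0 for d in diffs)
    losses = sum(d < 0 for d in diffs)
    # Paired t statistic: mean difference divided by its standard error.
    t_stat = mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))
    return wins, losses, t_stat

# Toy per-query nDCG@10 values (made up for illustration).
baseline_scores = [0.50, 0.40, 0.70, 0.20, 0.60]
system_scores = [0.65, 0.40, 0.60, 0.45, 0.80]
wins, losses, t_stat = compare_to_baseline(baseline_scores, system_scores)
print(wins, losses, t_stat)
```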


# Optimizing MonoT5
- We test different values to find the document cutoff at which MonoT5 works best.

In [None]:
mono_cached = SparseScorerCache('_old_monoT5_fix.cache', monoT5, verbose=True) # Cache the monoT5 scores

pipeline_mono_t5 = (bm25 >> mono_cached) ^ bm25
pipeline_duo_t5 = (pipeline_mono_t5 % 5 >> duoT5) ^ pipeline_mono_t5
topics = pt_dataset.get_topics().head(1)
qrels = pt_dataset.get_qrels()
# Run experiment
Experiment(
    retr_systems=[
        pipeline_chatnoir,
        pipeline_mono_t5,
        pipeline_duo_t5,
    ],
    names=[
        "ChatNoir",
        "ChatNoir+monoT5",
        "ChatNoir+monoT5+duoT5",
    ],
    topics=topics,
    qrels=qrels,
    eval_metrics=["ndcg_cut_10"],
    verbose=True,
)
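To see what a document cutoff does to the final ranking, the sketch below re-ranks only the top `k` first-stage results and appends the remaining documents in their original order. This is roughly the effect of combining a rank cutoff (`%`) with the concatenation operator (`^`) in a pipeline, under the assumption that `^` appends the results that are missing from the left-hand side. The function `rerank_top_k` and the toy scores are hypothetical.

```python
def rerank_top_k(first_stage, rerank_score, k):
    """Re-rank the top `k` of `first_stage` and append the untouched rest.

    `first_stage` is a list of docnos in first-stage order;
    `rerank_score(docno)` is a hypothetical re-ranker score.
    """
    head, tail = first_stage[:k], first_stage[k:]
    reranked = sorted(head, key=rerank_score, reverse=True)
    return reranked + tail

bm25_order = ["d1", "d2", "d3", "d4", "d5"]
# Toy re-ranker scores (made up for illustration).
toy_scores = {"d1": 0.1, "d2": 0.9, "d3": 0.5, "d4": 0.99, "d5": 0.7}
results = {k: rerank_top_k(bm25_order, toy_scores.get, k) for k in (2, 3, 5)}
for k, ranking in results.items():
    print(k, ranking)
```

A larger cutoff lets the re-ranker promote documents from deeper in the first-stage ranking (here `d4`), at the price of more model calls per query.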

# Creating the runs

In [20]:
from tira.third_party_integrations import persist_and_normalize_run

In [21]:
run = duo_results
persist_and_normalize_run(
    run,
    # Give your approach a short but descriptive name tag.
    system_name='duoT5-suchMaschinen', 
    default_output='../data/runs',
    upload_to_tira=pt_dataset,
)

The run file is normalized outside the TIRA sandbox, I will store it at "../data/runs".
Done. run file is stored under "../data/runs/run.txt.gz".
Run uploaded to TIRA. Claim ownership via: https://www.tira.io/claim-submission/549cd878-fe0b-43a1-87b3-9f74cf643fed


In [22]:
run = duo_results_new
persist_and_normalize_run(
    run,
    # Give your approach a short but descriptive name tag.
    system_name='duoT5-suchMaschinen', 
    default_output='../data/runs',
    upload_to_tira=pt_dataset_new,
)

The run file is normalized outside the TIRA sandbox, I will store it at "../data/runs".
Done. run file is stored under "../data/runs/run.txt.gz".
Run uploaded to TIRA. Claim ownership via: https://www.tira.io/claim-submission/58e4e0f9-29fb-4560-b2c6-cd1af6c3ade3


In [23]:
run = duo_only_new
persist_and_normalize_run(
    run,
    # Give your approach a short but descriptive name tag.
    system_name='duoT5-suchMaschinen', 
    default_output='../data/runs',
    upload_to_tira=pt_dataset_new,
)

The run file is normalized outside the TIRA sandbox, I will store it at "../data/runs".
Done. run file is stored under "../data/runs/run.txt.gz".
Run uploaded to TIRA. Claim ownership via: https://www.tira.io/claim-submission/82f0b7d2-6172-4eea-bc0d-61077c08d2b1


In [24]:
run = duo_results_test
persist_and_normalize_run(
    run,
    # Give your approach a short but descriptive name tag.
    system_name='duoT5-suchMaschinen', 
    default_output='../data/runs',
    upload_to_tira=pt_dataset_test,
)

The run file is normalized outside the TIRA sandbox, I will store it at "../data/runs".
Done. run file is stored under "../data/runs/run.txt.gz".
Run uploaded to TIRA. Claim ownership via: https://www.tira.io/claim-submission/38158b6d-557f-4088-8195-3b0f314caffc


In [25]:
duo_results_test

Unnamed: 0,qid,docid,docno,score,query,rank,system
0,34,37054,msmarco_v2.1_doc_00_559772133#2_1023411002,19.687506,latex color box,0,duoT5-suchMaschinen
1,34,31673,msmarco_v2.1_doc_51_461434843#3_944006293,5.864996,latex color box,4,duoT5-suchMaschinen
2,34,27794,msmarco_v2.1_doc_51_461434843#2_944004995,15.076330,latex color box,2,duoT5-suchMaschinen
3,34,12541,msmarco_v2.1_doc_50_674652070#3_1371739662,19.379844,latex color box,1,duoT5-suchMaschinen
4,34,11566,msmarco_v2.1_doc_39_941719527#3_1917293465,-37.446684,latex color box,9,duoT5-suchMaschinen
...,...,...,...,...,...,...,...
455,7,34225,msmarco_v2.1_doc_21_267072906#12_604832338,3.838838,mental health impacts of social media,6,duoT5-suchMaschinen
456,7,36236,msmarco_v2.1_doc_21_267072906#17_604841187,11.857235,mental health impacts of social media,1,duoT5-suchMaschinen
457,7,11449,msmarco_v2.1_doc_21_267072906#8_604825387,2.638780,mental health impacts of social media,8,duoT5-suchMaschinen
458,7,43890,msmarco_v2.1_doc_21_267072906#11_604830589,3.210352,mental health impacts of social media,7,duoT5-suchMaschinen


# Comparison: monoT5 vs. monoT5 + duoT5

In [32]:
def run_pipeline(dataset, append, base_model):
    # Apply the first-stage model to the first x topics; `% 100` keeps only the top 100 documents per topic.
    inp = (base_model % 100).transform(dataset.get_topics('text').head(first_x_topics))
    # Only the top 3 documents per topic, to serve as input for duoT5.
    inp_10 = (base_model % 3).transform(dataset.get_topics('text').head(first_x_topics))
    mono2 = pt.text.get_text(dataset, "text") >> monoT5 # Pipeline for monoT5
    mono_cached = SparseScorerCache('_old_monoT5' + append + '.cache', mono2, verbose=True) # Cache the monoT5 scores
    duo2 = pt.text.get_text(dataset, "text") >> duoT5 # Pipeline for duoT5
    duo_cached = SparseScorerCache('_old_duoT5' + append + '.cache', duo2, verbose=True) # Cache the duoT5 scores

    mono_results = mono_cached.transform(inp) # Apply monoT5 to the BM25 results
    mono_results_10 = (mono_cached % 10).transform(inp) # Only the top 10 documents, to serve as input for duoT5

    duo_results = duo_cached.transform(mono_results_10) # Apply duoT5 to the monoT5 results
    duo_only_results = duo_cached.transform(inp_10) # Apply duoT5 directly to the BM25 results
    duo_results2 = pt.Transformer.from_df(mono_results_10) >> duo2
    return mono_results, duo_results, duo_only_results, duo_results2

mono_results, duo_results, duo_only, duo_results2 = run_pipeline(pt_dataset, '', bm25)
#mono_results_new, duo_results_new, duo_only_new = run_pipeline(pt_dataset_new, '_new', bm25_new)
#mono_results_test, duo_results_test, duo_only_test = run_pipeline(pt_dataset_test, '_test', bm25_test)

Sqlite3ScorerCache('monoT5.cache', (<pyterrier.datasets.IRDSTextLoader object at 0x14f41a5f0> >> MonoT5(castorini/monot5-base-msmarco)), group='query', key='docno'): 9533 hit(s), 0 miss(es)
Sqlite3ScorerCache('monoT5.cache', (<pyterrier.datasets.IRDSTextLoader object at 0x14f41a5f0> >> MonoT5(castorini/monot5-base-msmarco)), group='query', key='docno'): 9533 hit(s), 0 miss(es)
Sqlite3ScorerCache('duoT5.cache', (<pyterrier.datasets.IRDSTextLoader object at 0x322dea470> >> DuoT5(castorini/duot5-base-msmarco)), group='query', key='docno'): 965 hit(s), 0 miss(es)
Sqlite3ScorerCache('duoT5.cache', (<pyterrier.datasets.IRDSTextLoader object at 0x322dea470> >> DuoT5(castorini/duot5-base-msmarco)), group='query', key='docno'): 291 hit(s), 0 miss(es)


In [None]:
def run_test():
    mono2 = pt.text.get_text(pt_dataset, "text") >> monoT5 # Pipeline for monoT5
    mono_cached = SparseScorerCache('_old_monoT5.cache', mono2, verbose=True) # Cache the monoT5 scores
    duo2 = pt.text.get_text(pt_dataset, "text") >> duoT5 # Pipeline for duoT5
    duo_cached = SparseScorerCache('_old_duoT5.cache', duo2, verbose=True) # Cache the duoT5 scores
    experiment = []
    for mono_cutoff in range(25, 5000, 25):
        # Apply BM25 to the first x topics, keeping the top `mono_cutoff` documents per topic.
        inp = (bm25 % mono_cutoff).transform(pt_dataset.get_topics('text').head(first_x_topics))
        mono_results = mono_cached.transform(inp) # Apply monoT5 to the BM25 results

        for duo_cutoff in range(5, 6):
            mono_results_cutoff = (mono_cached % duo_cutoff).transform(inp)
            duo_results = duo_cached.transform(mono_results_cutoff)
            exp = Experiment(
                [mono_results, duo_results],
                topics = pt_dataset.get_topics('text'),
                qrels = pt_dataset.get_qrels(),
                eval_metrics =['ndcg_cut_5', 'ndcg_cut_10', "mrt"],
                names=["monoT5", "monoT5+duoT5"],
                round = 3,
                baseline=0
            )
            experiment.append({'mono_cutoff': mono_cutoff, 'duo_cutoff': duo_cutoff, 'mono_ndcg_5': exp['ndcg_cut_5'][0], 'duo_ndcg_5': exp['ndcg_cut_5'][1], 'mono_ndcg': exp['ndcg_cut_10'][0], 'duo_ndcg': exp['ndcg_cut_10'][1], 'p_value': exp['ndcg_cut_10 p-value'][1]})
            print('Mono Cutoff:', mono_cutoff, ', Duo Cutoff:', duo_cutoff, ', NDCG@10 MonoT5:', exp['ndcg_cut_10'][0], ', NDCG@10 DuoT5:', exp['ndcg_cut_10'][1], ', p-value:', exp['ndcg_cut_10 p-value'][1])
    return experiment

experiment = run_test()

Sqlite3ScorerCache('monoT5.cache', (<pyterrier.datasets.IRDSTextLoader object at 0x357637d90> >> MonoT5(castorini/monot5-base-msmarco)), group='query', key='docno'): 2405 hit(s), 0 miss(es)
Sqlite3ScorerCache('monoT5.cache', (<pyterrier.datasets.IRDSTextLoader object at 0x357637d90> >> MonoT5(castorini/monot5-base-msmarco)), group='query', key='docno'): 2405 hit(s), 0 miss(es)
Sqlite3ScorerCache('duoT5.cache', (<pyterrier.datasets.IRDSTextLoader object at 0x14b40c220> >> DuoT5(castorini/duot5-base-msmarco)), group='query', key='docno'): 485 hit(s), 0 miss(es)
Mono Cutoff: 25 , Duo Cutoff: 5 , NDCG@10 MonoT5: 0.622 , NDCG@10 DuoT5: 0.474 , p-value: 1.0907524722825199e-24
Sqlite3ScorerCache('monoT5.cache', (<pyterrier.datasets.IRDSTextLoader object at 0x357637d90> >> MonoT5(castorini/monot5-base-msmarco)), group='query', key='docno'): 2405 hit(s), 0 miss(es)
Sqlite3ScorerCache('duoT5.cache', (<pyterrier.datasets.IRDSTextLoader object at 0x14b40c220> >> DuoT5(castorini/duot5-base-msmarco)

In [None]:
"""
def run_experiment(config: Config) -> DataFrame:
    print(f"Config: {config}")
    print(f"Device: {device('cuda' if cuda_is_available() else 'cpu')}")

    retrieve_cache_dir = CACHE_DIR / "retrieve" / config.dataset
    retrieve_cache_dir.mkdir(parents=True, exist_ok=True)
    if not (retrieve_cache_dir / "pt_meta.json").exists():
        rmtree(retrieve_cache_dir)

    rerank_mono_t5_cache_dir = (
        CACHE_DIR / "rerank" / "mono-t5" / config.dataset / config.topics_variant
    )
    rerank_mono_t5_cache_dir.mkdir(parents=True, exist_ok=True)
    if not (rerank_mono_t5_cache_dir / "pt_meta.json").exists():
        rmtree(rerank_mono_t5_cache_dir)

    experiment_cache_dir = (
        CACHE_DIR / "experiment6" / config.dataset / config.topics_variant
    )
    experiment_cache_dir.mkdir(parents=True, exist_ok=True)

    # Create ChatNoir retriever.
    retriever = ChatNoirRetrieve(
        api_key=environ["CHATNOIR_API_KEY"],
        index=config.index,
        features=Feature.CONTENTS_PLAIN,
        num_results=100,
        verbose=True,
        retries=20,
        search_method="bm25",
    )

    # Cache retriever.
    retriever = RetrieverCache(
        str(retrieve_cache_dir),
        retriever,
        verbose=True,
    ) >> generic(_add_missing_cols)

    # Re-rankers.
    mono_t5 = Lazy(
        lambda: MonoT5ReRanker(
            model="castorini/monot5-base-msmarco",
            verbose=True,
            batch_size=128,
        )
    )
    mono_t5 = SparseScorerCache(
        str(rerank_mono_t5_cache_dir),
        mono_t5,
        verbose=True,
    )
    duo_t5 = Lazy(
        lambda: DuoT5ReRanker(
            model="castorini/duot5-base-msmarco",
            verbose=True,
            batch_size=128,
        )
    )

    # Data
    dataset = get_dataset(f"irds:{config.dataset}")
    topics = dataset.get_topics(variant=config.topics_variant)
    topics = topics[
        topics["query"] != ""
    ]  # Catch empty queries that cause troubles with ChatNoir.
    qrels = dataset.get_qrels()

    # Specify pipelines.
    pipeline_chatnoir = retriever
    pipeline_mono_t5 = (pipeline_chatnoir % 100 >> mono_t5) ^ pipeline_chatnoir
    pipeline_duo_t5 = (pipeline_mono_t5 % 5 >> duo_t5) ^ pipeline_mono_t5

    # Run experiment
    return Experiment(
        retr_systems=[
            pipeline_chatnoir,
            pipeline_mono_t5,
            pipeline_duo_t5,
        ],
        names=[
            "ChatNoir",
            "ChatNoir+monoT5",
            "ChatNoir+monoT5+duoT5",
        ],
        topics=topics,
        qrels=qrels,
        eval_metrics=[nDCG @ 5],
        verbose=True,
        save_dir=str(experiment_cache_dir),
        save_mode="reuse",
    )
"""