# Information Retrieval Lab WiSe 2024/2025: Baseline Retrieval System

This Jupyter notebook serves as a baseline retrieval system that you can improve upon.
We use subsets of the MS MARCO datasets to retrieve passages of web documents.
We will show you how to create a software submission to TIRA from this notebook.

An overview of all corpora that we use in the current course is available at [https://tira.io/datasets?query=ir-lab-wise-2024](https://tira.io/datasets?query=ir-lab-wise-2024). The dataset IDs for loading the datasets are:

- `ir-lab-wise-2024/subsampled-ms-marco-deep-learning-20241201-training`: A subsample of the TREC 2019/2020 Deep Learning tracks on the MS MARCO v1 passage dataset. Use this dataset to tune your system(s).
- `ir-lab-wise-2024/subsampled-ms-marco-rag-20241202-training` (_work in progress_): A subsample of the TREC 2024 Retrieval-Augmented Generation track on the MS MARCO v2.1 passage dataset. Use this dataset to tune your system(s).
- `ir-lab-wise-2024/ms-marco-rag-20241203-test` (_work in progress_): The test corpus that we have created together in the course, based on the MS MARCO v2.1 passage dataset. We will use this dataset as the test dataset, i.e., evaluation scores become available only after the submission deadline.

## Step 1: Import libraries

We will use [tira](https://tira.io/), an information retrieval shared task platform, and [ir_datasets](https://ir-datasets.com/) for loading the datasets. Subsequently, we will build a retrieval system with [PyTerrier](https://github.com/terrier-org/pyterrier), an open-source search engine framework.

First, we need to install the required libraries.

In [1]:
!pip3 install 'tira>=0.0.139' ir-datasets 'python-terrier==0.10.0'


In [2]:
!wget https://files.webis.de/software/pyterrier-plugins/custom-terrier-token-processing-1.0-SNAPSHOT-jar-with-dependencies.jar -O /root/.pyterrier/custom-terrier-token-processing-0.0.1.jar

--2024-12-15 11:26:40--  https://files.webis.de/software/pyterrier-plugins/custom-terrier-token-processing-1.0-SNAPSHOT-jar-with-dependencies.jar
Resolving files.webis.de (files.webis.de)... 141.54.132.200
Connecting to files.webis.de (files.webis.de)|141.54.132.200|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 499865236 (477M) [application/java-archive]
Saving to: ‘/root/.pyterrier/custom-terrier-token-processing-0.0.1.jar’


2024-12-15 11:27:19 (12.4 MB/s) - ‘/root/.pyterrier/custom-terrier-token-processing-0.0.1.jar’ saved [499865236/499865236]



In [3]:
import pyterrier as pt
import pandas as pd
pd.set_option('display.max_colwidth', 0)

if not pt.started():
    pt.init(boot_packages=['mam10eks:custom-terrier-token-processing:0.0.1'])
    from jnius import autoclass

PyTerrier 0.10.0 has loaded Terrier 5.10 (built by craigm on 2024-08-22 17:33) and terrier-helper 0.0.8



Create an API client to interact with the TIRA platform (e.g., to load datasets and submit runs).

In [4]:
from tira.third_party_integrations import ensure_pyterrier_is_loaded
from tira.rest_api_client import Client

ensure_pyterrier_is_loaded()
tira = Client()

## Step 2: Load the dataset

We load the dataset by its ir_datasets ID (as listed in the Readme). Just be sure to add the `irds:` prefix before the dataset ID to tell PyTerrier to load the data from ir_datasets.

In [5]:
from pyterrier import get_dataset

pt_dataset = get_dataset('irds:ir-lab-wise-2024/subsampled-ms-marco-deep-learning-20241201-training')

## Step 3: Build an index

We will then create an index from the documents in the dataset we just loaded.

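Our task is to test different stemmers and lemmatizers and to check which of them gives the best results. We build one index per variant (Porter, no stemming, Snowball, Krovetz, and the Stanford lemmatizer) and compare them with the same BM25 retrieval model.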
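
To see why the choice of stemmer can change retrieval results, here is a toy sketch. The function `toy_stem` is hypothetical and deliberately crude; it is not the Porter, Snowball, or Krovetz algorithm. It only illustrates how conflating word forms lets a query term match more documents.

```python
def toy_stem(token: str) -> str:
    """Hypothetical, crude suffix stripper for illustration only."""
    for suffix in ("ing", "es", "s"):
        # Only strip when a reasonably long stem remains.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def index_terms(text: str, stem: bool) -> set:
    """Lowercase, split on whitespace, and optionally stem each token."""
    tokens = text.lower().split()
    return {toy_stem(t) if stem else t for t in tokens}


doc = "Searching large indexes"
query = "search index"

# Without stemming, the surface forms differ and nothing overlaps.
overlap_plain = index_terms(doc, stem=False) & index_terms(query, stem=False)
# With stemming, "searching" and "indexes" are reduced to the query terms.
overlap_stemmed = index_terms(doc, stem=True) & index_terms(query, stem=True)

print(sorted(overlap_plain))
print(sorted(overlap_stemmed))
```

Real stemmers differ in how aggressively they conflate forms, which is why the variants are compared empirically here.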

In [6]:
from pyterrier import IterDictIndexer

In [7]:
indexer_porter = IterDictIndexer(
    # Store the index in the `index` directory.
    "../data/index",
    meta={'docno': 50, 'text': 4096},
    # If an index already exists there, then overwrite it.
    overwrite=True,
    stemmer='PorterStemmer'
)
index_porter = indexer_porter.index(pt_dataset.get_corpus_iter())


ir-lab-wise-2024/subsampled-ms-marco-deep-learning-20241201-training documents:  38%|███▊      | 25829/68261 [00:03<00:03, 12512.00it/s]



ir-lab-wise-2024/subsampled-ms-marco-deep-learning-20241201-training documents: 100%|██████████| 68261/68261 [00:06<00:00, 10195.65it/s]


11:27:44.807 [ForkJoinPool-1-worker-3] WARN org.terrier.structures.indexing.Indexer -- Indexed 1 empty documents


In [8]:
indexer_none = IterDictIndexer(
    "../data/index_none",
    meta={'docno': 50, 'text': 4096},
    overwrite=True,
    stemmer = None
)

index_none= indexer_none.index(pt_dataset.get_corpus_iter())

ir-lab-wise-2024/subsampled-ms-marco-deep-learning-20241201-training documents:  37%|███▋      | 25190/68261 [00:02<00:03, 12087.09it/s]



ir-lab-wise-2024/subsampled-ms-marco-deep-learning-20241201-training documents: 100%|██████████| 68261/68261 [00:05<00:00, 11723.51it/s]


11:28:05.494 [ForkJoinPool-2-worker-3] WARN org.terrier.structures.indexing.Indexer -- Indexed 1 empty documents


In [9]:
indexer_Snowball = IterDictIndexer(
    "../data/index_Snowball",
    meta={'docno': 50, 'text': 4096},
    overwrite=True,
    stemmer = 'EnglishSnowballStemmer'
)

index_Snowball = indexer_Snowball.index(pt_dataset.get_corpus_iter())

ir-lab-wise-2024/subsampled-ms-marco-deep-learning-20241201-training documents:  38%|███▊      | 26169/68261 [00:02<00:03, 11700.43it/s]



ir-lab-wise-2024/subsampled-ms-marco-deep-learning-20241201-training documents: 100%|██████████| 68261/68261 [00:05<00:00, 11768.28it/s]


11:28:13.707 [ForkJoinPool-3-worker-3] WARN org.terrier.structures.indexing.Indexer -- Indexed 1 empty documents


In [10]:
indexer_LemurKrovetz = IterDictIndexer(
    "../data/index_LemurKrovetz",
    meta={'docno': 50, 'text': 4096},
    overwrite=True,
    stemmer = 'LemurKrovetzStemmer'
)

index_LemurKrovetz = indexer_LemurKrovetz.index(pt_dataset.get_corpus_iter())

ir-lab-wise-2024/subsampled-ms-marco-deep-learning-20241201-training documents:  38%|███▊      | 26151/68261 [00:02<00:03, 13687.82it/s]



ir-lab-wise-2024/subsampled-ms-marco-deep-learning-20241201-training documents: 100%|██████████| 68261/68261 [00:05<00:00, 12169.73it/s]


11:28:21.716 [ForkJoinPool-4-worker-3] WARN org.terrier.structures.indexing.Indexer -- Indexed 1 empty documents


In [11]:
indexer_standfordLemmatizer = IterDictIndexer(
    "../data/index_standfordLemmatizer",
    meta={'docno': 50, 'text': 4096},
    overwrite=True,
    stemmer = 'StanfordLemmatizer'
)

index_standfordLemmatizer = indexer_standfordLemmatizer.index(pt_dataset.get_corpus_iter())

ir-lab-wise-2024/subsampled-ms-marco-deep-learning-20241201-training documents:  38%|███▊      | 26169/68261 [00:25<00:40, 1049.31it/s]



ir-lab-wise-2024/subsampled-ms-marco-deep-learning-20241201-training documents: 100%|██████████| 68261/68261 [01:06<00:00, 1028.00it/s]


11:29:30.664 [ForkJoinPool-5-worker-3] WARN org.terrier.structures.indexing.Indexer -- Indexed 1 empty documents


In [12]:
from pyterrier import BatchRetrieve

bm25_porter = BatchRetrieve(index_porter, wmodel="BM25")
bm25_none = BatchRetrieve(index_none, wmodel="BM25")
bm25_snowball = BatchRetrieve(index_Snowball, wmodel="BM25")
bm25_lemurkrovetz = BatchRetrieve(index_LemurKrovetz, wmodel="BM25")
bm25_standfordLemmatizer = BatchRetrieve(index_standfordLemmatizer, wmodel="BM25")

In [13]:
from pyterrier import Experiment 

Experiment(
    [bm25_porter, bm25_none, bm25_snowball, bm25_lemurkrovetz, bm25_standfordLemmatizer],
    topics = pt_dataset.get_topics("text"),
    qrels = pt_dataset.get_qrels(),
    eval_metrics=["ndcg_cut_10"],
    names=["BM25 with Porter Stemmer", "BM25 with no Stemmer", "BM25 with Snowball Stemmer", "BM25 with LemurKrovetz Stemmer", "BM25 with Stanford Lemmatizer"]
)

| | name | ndcg_cut_10 |
|---|---|---|
| 0 | BM25 with Porter Stemmer | 0.489469 |
| 1 | BM25 with no Stemmer | 0.46889 |
| 2 | BM25 with Snowball Stemmer | 0.489216 |
| 3 | BM25 with LemurKrovetz Stemmer | 0.490341 |
| 4 | BM25 with Stanford Lemmatizer | 0.482512 |


## Interim Conclusion
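The LemurKrovetzStemmer achieves the highest nDCG@10 (0.490341), narrowly ahead of the Porter stemmer (0.489469) and the Snowball stemmer (0.489216). All stemmers and the lemmatizer clearly outperform the index without stemming (0.46889). We therefore continue with the LemurKrovetzStemmer and tune the hyperparameters for it.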
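
For reference, the `ndcg_cut_10` metric can be sketched for a single query as follows. This is a minimal illustration that assumes linear gains and a log2 rank discount; the evaluation library used by PyTerrier may differ in details. The judgments and rankings are hypothetical and not taken from the course data.

```python
import math


def dcg(gains, k=10):
    """Discounted cumulative gain over the first k gains (rank 1 first)."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains[:k], start=1))


def ndcg_at_k(ranked_docnos, qrels, k=10):
    """nDCG@k for one query.

    ranked_docnos: document ids in ranked order.
    qrels: dict mapping document id -> graded relevance label.
    Unjudged documents count as non-relevant (gain 0).
    """
    gains = [qrels.get(docno, 0) for docno in ranked_docnos]
    ideal = sorted(qrels.values(), reverse=True)
    ideal_dcg = dcg(ideal, k)
    return dcg(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


# Hypothetical judgments for one query.
qrels = {"d1": 3, "d2": 2, "d3": 0}

perfect = ndcg_at_k(["d1", "d2", "d3"], qrels)   # ideal ordering
reversed_ = ndcg_at_k(["d3", "d2", "d1"], qrels)  # relevant documents ranked low

print(perfect)
print(reversed_)
```

The score reported by `Experiment` is the mean of this per-query value over all topics.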

## Step 4: Tune the hyperparameters

For the next step, we want to further optimise the retrieval pipeline by tuning the BM25 hyperparameters `b` and `k_1` with a grid search. The grid search on the given data takes roughly two hours. Since the grid contains only 9 combinations, we also evaluated every one of them manually, which is faster. The code below can still be executed, but if you want to take the shortcut, you can skip to the manual testing step (4.5).
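
To make the role of the two tuned parameters concrete, the following sketch computes the textbook BM25 contribution of a single query term. It is an illustration under stated assumptions, not Terrier's implementation, which may differ in details such as the idf variant and the `k_3` query-term factor. `num_docs` matches the corpus size reported by the indexer above; all other statistics are hypothetical.

```python
import math


def bm25_term_score(tf, doc_len, avg_doc_len, df, num_docs, k_1=1.2, b=0.75):
    """Textbook BM25 contribution of one query term to one document."""
    # Inverse document frequency: rare terms weigh more.
    idf = math.log(1 + (num_docs - df + 0.5) / (df + 0.5))
    # b controls how strongly long documents are penalised.
    length_norm = 1 - b + b * doc_len / avg_doc_len
    # k_1 controls how quickly repeated occurrences saturate.
    return idf * tf * (k_1 + 1) / (tf + k_1 * length_norm)


# Hypothetical statistics for a document twice the average length.
stats = dict(tf=3, doc_len=120, avg_doc_len=60, df=100, num_docs=68261)

for b in [0.7, 0.75, 0.8]:
    print(f"b={b}: {bm25_term_score(**stats, b=b):.4f}")
```

For a document longer than average, a larger `b` lowers the score; for a document of exactly average length, `b` has no effect.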



In [14]:
from tira.third_party_integrations import ir_datasets, ensure_pyterrier_is_loaded, persist_and_normalize_run
import pyterrier as pt

ensure_pyterrier_is_loaded()

training_dataset = 'ir-lab-jena-leipzig-wise-2023/training-20231104-training'
validation_dataset = 'ir-lab-jena-leipzig-wise-2023/validation-20231104-training'

In [15]:
dataset = ir_datasets.load(training_dataset)
queries = pt.io.read_topics(ir_datasets.topics_file(training_dataset), format='trecxml')

### Create the Index with the LemurKrovetzStemmer

In [25]:
def create_index(documents):
    indexer = pt.IterDictIndexer("/tmp/index", overwrite=True, meta={'docno': 50, 'text': 4096}, stemmer = 'LemurKrovetzStemmer')
    index_ref = indexer.index(({'docno': i.doc_id, 'text': i.text} for i in documents))
    return pt.IndexFactory.of(index_ref)

In [None]:
index = create_index(dataset.docs_iter())

### Define the Grid Search

In [44]:
from tira.third_party_integrations import ir_datasets, ensure_pyterrier_is_loaded, persist_and_normalize_run
def run_bm25_grid_search_run(index, output_dir, queries):
    """
        defaults: http://terrier.org/docs/current/javadoc/org/terrier/matching/models/BM25.html
        k_1 = 1.2d, k_3 = 8d, b = 0.75d
        We do not tune parameter k_3, as this parameter only impacts queries with redundant terms.
    """
    for b in [0.7, 0.75, 0.8]:
        for k_1 in [1.1, 1.2, 1.3]:
            system = f'bm25-b={b}-k_1={k_1}'
            configuration = {"bm25.b" : b, "bm25.k_1": k_1}
            run_output_dir = output_dir + '/' + system
            !rm -Rf {run_output_dir}
            !mkdir -p {run_output_dir}
            print(f'Run {system}')
            BM25 = pt.BatchRetrieve(index, wmodel="BM25", controls=configuration, verbose=True)
            run = BM25(queries)
            persist_and_normalize_run(run, system, run_output_dir)

### Run the Grid Search on the Training Dataset

In [None]:
run_bm25_grid_search_run(index, 'grid-search/training', queries)

### Load Validation Dataset

In [46]:
dataset_val = ir_datasets.load(validation_dataset)
queries_val = pt.io.read_topics(ir_datasets.topics_file(validation_dataset), format='trecxml')

In [None]:
# Build the index from the validation documents (not the training documents).
index = create_index(dataset_val.docs_iter())

In [None]:
run_bm25_grid_search_run(index, 'grid-search/validation', queries_val)

### Evaluation

In [None]:
!pip3 install tira trectools python-terrier

In [49]:
from trectools import TrecRun, TrecQrel, TrecEval
from tira.rest_api_client import Client
from glob import glob
import pandas as pd
tira = Client()

def load_qrels(dataset):
    return TrecQrel(tira.download_dataset('ir-lab-jena-leipzig-wise-2023', dataset, truth_dataset=True) + '/qrels.txt')

training_qrels = load_qrels('training-20231104-training')
validation_qrels = load_qrels('validation-20231104-training')

In [50]:
def evaluate_run(run_dir, qrels):
    run = TrecRun(run_dir + '/run.txt')
    trec_eval = TrecEval(run, qrels)

    return {
        'run': run.get_runid(),
        'nDCG@10': trec_eval.get_ndcg(depth=10),
        'nDCG@10 (unjudgedRemoved)': trec_eval.get_ndcg(depth=10, removeUnjudged=True),
        'MAP@10': trec_eval.get_map(depth=10),
        'MRR': trec_eval.get_reciprocal_rank()
    }

In [None]:
df = []
for r in glob('grid-search/training/bm25*'):
    df += [evaluate_run(r, training_qrels)]
df = pd.DataFrame(df)
df.sort_values('nDCG@10', ascending=False)

In [None]:
df = []
for r in glob('grid-search/validation/bm25*'):
    df += [evaluate_run(r, validation_qrels)]
df = pd.DataFrame(df)
df.sort_values('nDCG@10', ascending=False)

## Step 4.5: Manual testing
As explained above, we tested all possible combinations of the given hyperparameters manually to tune the pipeline and save time.

In [14]:
configuration1 = {"bm25.b": 0.7, "bm25.k_1": 1.1}
configuration2 = {"bm25.b": 0.75, "bm25.k_1": 1.1}
configuration3 = {"bm25.b": 0.8, "bm25.k_1": 1.1}
configuration4 = {"bm25.b": 0.7, "bm25.k_1": 1.2}
configuration5 = {"bm25.b": 0.75, "bm25.k_1": 1.2}
configuration6 = {"bm25.b": 0.8, "bm25.k_1": 1.2}
configuration7 = {"bm25.b": 0.7, "bm25.k_1": 1.3}
configuration8 = {"bm25.b": 0.75, "bm25.k_1": 1.3}
configuration9 = {"bm25.b": 0.8, "bm25.k_1": 1.3}
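
The nine dictionaries above can also be generated instead of being written out by hand. The following sketch uses `itertools.product`; the list name `configurations` is an assumption and is not used elsewhere in this notebook.

```python
from itertools import product

b_values = [0.7, 0.75, 0.8]
k_1_values = [1.1, 1.2, 1.3]

# k_1 is the outer factor so that the order matches
# configuration1 ... configuration9.
configurations = [
    {"bm25.b": b, "bm25.k_1": k_1}
    for k_1, b in product(k_1_values, b_values)
]

for number, configuration in enumerate(configurations, start=1):
    print(number, configuration)
```

With this list, `configurations[3]` corresponds to `configuration4`.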

In [15]:
# Define indexer_tunedParameters once
indexer_tunedParameters = IterDictIndexer(
    "../data/index_tunedParameters",
    meta={'docno': 50, 'text': 4096},
    overwrite=True,
    stemmer='LemurKrovetzStemmer',
)
index_tunedParameters = indexer_tunedParameters.index(pt_dataset.get_corpus_iter())

bm25_1 = BatchRetrieve(index_tunedParameters, wmodel="BM25", controls=configuration1)
bm25_2 = BatchRetrieve(index_tunedParameters, wmodel="BM25", controls=configuration2)
bm25_3 = BatchRetrieve(index_tunedParameters, wmodel="BM25", controls=configuration3)
bm25_4 = BatchRetrieve(index_tunedParameters, wmodel="BM25", controls=configuration4)
bm25_5 = BatchRetrieve(index_tunedParameters, wmodel="BM25", controls=configuration5)
bm25_6 = BatchRetrieve(index_tunedParameters, wmodel="BM25", controls=configuration6)
bm25_7 = BatchRetrieve(index_tunedParameters, wmodel="BM25", controls=configuration7)
bm25_8 = BatchRetrieve(index_tunedParameters, wmodel="BM25", controls=configuration8)
bm25_9 = BatchRetrieve(index_tunedParameters, wmodel="BM25", controls=configuration9)

ir-lab-wise-2024/subsampled-ms-marco-deep-learning-20241201-training documents:  37%|███▋      | 25008/68261 [00:02<00:03, 12434.24it/s]



ir-lab-wise-2024/subsampled-ms-marco-deep-learning-20241201-training documents: 100%|██████████| 68261/68261 [00:05<00:00, 12111.26it/s]


11:33:37.815 [ForkJoinPool-6-worker-3] WARN org.terrier.structures.indexing.Indexer -- Indexed 1 empty documents


Now test the combinations and find the one with the best nDCG@10:

In [16]:
from pyterrier import Experiment

Experiment(
    [bm25_1, bm25_2, bm25_3, bm25_4, bm25_5, bm25_6, bm25_7, bm25_8, bm25_9],
    topics = pt_dataset.get_topics("text"),
    qrels = pt_dataset.get_qrels(),
    eval_metrics=["ndcg_cut_10"],
    names=["BM25 with tuned parameters 1", "BM25 with tuned parameters 2", "BM25 with tuned parameters 3", "BM25 with tuned parameters 4", "BM25 with tuned parameters 5", "BM25 with tuned parameters 6", "BM25 with tuned parameters 7", "BM25 with tuned parameters 8", "BM25 with tuned parameters 9"]
)

| | name | ndcg_cut_10 |
|---|---|---|
| 0 | BM25 with tuned parameters 1 | 0.490694 |
| 1 | BM25 with tuned parameters 2 | 0.489596 |
| 2 | BM25 with tuned parameters 3 | 0.486928 |
| 3 | BM25 with tuned parameters 4 | 0.491373 |
| 4 | BM25 with tuned parameters 5 | 0.490341 |
| 5 | BM25 with tuned parameters 6 | 0.485663 |
| 6 | BM25 with tuned parameters 7 | 0.490225 |
| 7 | BM25 with tuned parameters 8 | 0.486895 |
| 8 | BM25 with tuned parameters 9 | 0.486829 |


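We can see that combination 4 (b = 0.7, k_1 = 1.2) gives the best result. Compared to the default configuration (combination 5: b = 0.75, k_1 = 1.2), which is not tuned at all, nDCG@10 improves by roughly 0.001, so the gain from tuning is very small.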
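
The selection of the best combination can also be done programmatically. The sketch below copies the nDCG@10 values from the table above into a dictionary keyed by `(b, k_1)`; the variable names are chosen for illustration only.

```python
# nDCG@10 values copied from the table above; keys are (b, k_1).
results = {
    (0.7, 1.1): 0.490694,
    (0.75, 1.1): 0.489596,
    (0.8, 1.1): 0.486928,
    (0.7, 1.2): 0.491373,
    (0.75, 1.2): 0.490341,
    (0.8, 1.2): 0.485663,
    (0.7, 1.3): 0.490225,
    (0.75, 1.3): 0.486895,
    (0.8, 1.3): 0.486829,
}

# Terrier's BM25 defaults according to the docstring in Step 4.
default = (0.75, 1.2)

best = max(results, key=results.get)
gain = results[best] - results[default]

print(f"best: b={best[0]}, k_1={best[1]}, nDCG@10={results[best]:.6f}")
print(f"absolute gain over default: {gain:.6f}")
```

A difference this small may not be statistically significant, so it is worth checking on a second dataset before relying on it.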

## Step 5: Create the run
In the next steps, we apply our retrieval system to the topics to prepare a 'run' file that contains the retrieved documents.

Now, retrieve results for all the topics (may take a while):

In [17]:
bm25_final = BatchRetrieve(index_tunedParameters, wmodel="BM25", controls=configuration4)
run = bm25_final(pt_dataset.get_topics('text'))

That's it for the retrieval. Here are the first 10 entries of the run:

In [18]:
run.head(10)

| | qid | docid | docno | rank | score | query |
|---|---|---|---|---|---|---|
| 0 | 1030303 | 53852 | 8726436 | 0 | 31.713543 | who is aziz hashim |
| 1 | 1030303 | 56041 | 8726433 | 1 | 25.738831 | who is aziz hashim |
| 2 | 1030303 | 62116 | 8726435 | 2 | 23.799521 | who is aziz hashim |
| 3 | 1030303 | 32183 | 8726429 | 3 | 23.361162 | who is aziz hashim |
| 4 | 1030303 | 35867 | 8726437 | 4 | 20.882489 | who is aziz hashim |
| 5 | 1030303 | 17637 | 8726430 | 5 | 19.900218 | who is aziz hashim |
| 6 | 1030303 | 42957 | 7156982 | 6 | 19.900218 | who is aziz hashim |
| 7 | 1030303 | 21803 | 8726434 | 7 | 19.442939 | who is aziz hashim |
| 8 | 1030303 | 59828 | 1305520 | 8 | 17.934895 | who is aziz hashim |
| 9 | 1030303 | 60002 | 3302257 | 9 | 17.74898 | who is aziz hashim |


## Step 6: Persist and upload run to TIRA

The output of our retrieval system is a run file. This run file can be evaluated statistically later, e.g., in a different notebook or by a different person. We therefore first upload the run to TIRA.
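
As an illustration of what a run file contains, the sketch below formats two rows of the run shown above in the conventional six-column TREC run layout (`qid Q0 docno rank score tag`). This is an assumption about the general format; the exact file written by `persist_and_normalize_run` may differ, for example in how ranks are numbered. The helper `to_trec_run_lines` is hypothetical.

```python
def to_trec_run_lines(rows, tag):
    """Format (qid, docno, rank, score) tuples as TREC-style run lines."""
    return [
        f"{qid} Q0 {docno} {rank} {score} {tag}"
        for qid, docno, rank, score in rows
    ]


# First two rows of the run shown above (rank values copied as-is).
rows = [
    ("1030303", "8726436", 0, 31.713543),
    ("1030303", "8726433", 1, 25.738831),
]

lines = to_trec_run_lines(rows, "bm25-modifiedStemmerWithTuning")
for line in lines:
    print(line)
```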

In [None]:
from tira.third_party_integrations import persist_and_normalize_run

persist_and_normalize_run(
    run,
    # Give your approach a short but descriptive name tag.
    system_name='bm25-modifiedStemmerWithTuning', 
    default_output='../data/runs',
    upload_to_tira=pt_dataset,
)

In [None]:
from pyterrier import Experiment 

Experiment(
    [bm25_final],
    topics = pt_dataset.get_topics("text"),
    qrels = pt_dataset.get_qrels(),
    eval_metrics=["ndcg_cut_10"],
    names=["BM25 with Tuned Parameters"]
)

Click on the link in the output of the `persist_and_normalize_run` cell above to claim your submission on TIRA.