## Step 1: Install Dependencies

We will use [TIREx](https://www.tira.io/tirex/components), an integration of [ir-datasets](https://ir-datasets.com/), [TIRA](https://www.tira.io/), and [PyTerrier](https://github.com/terrier-org/pyterrier) for fast experimentation.



In [None]:
!pip3 install ir-datasets 'python-terrier==0.10.0' 'git+https://github.com/tira-io/tira.git@pyterrier-artifacts#egg=tira&subdirectory=python-client' 'git+https://github.com/webis-de/auto-ir-metadata@dev' 'git+https://github.com/mam10eks/autoqrels.git' 'git+https://github.com/OpenWebSearch/wows-code.git#egg=wows-eval&subdirectory=ecir25/wows-eval' --break-system-packages

## Step 2: Import Libraries

We create an API client to interact with the TIRA platform (e.g., to load datasets and submit runs).


In [59]:
from tira.third_party_integrations import ensure_pyterrier_is_loaded
from tira.rest_api_client import Client

ensure_pyterrier_is_loaded()
import pyterrier as pt
tira = Client()


## Step 2b: Load the Dataset

We load a small teaching-oriented dataset by its ir_datasets ID from TIRA. The dataset is a subsample of MS MARCO.


In [4]:
pt_dataset = pt.get_dataset('irds:ir-lab-wise-2024/subsampled-ms-marco-deep-learning-20241201-training')

## Step 3: Build an index

We will then create an index from the documents in the dataset we just loaded.


In [None]:
indexer = pt.IterDictIndexer(
    # Store the index in the `../data/index` directory.
    "../data/index",
    meta={'docno': 50, 'text': 4096},
    # If an index already exists there, then overwrite it.
    overwrite=True,
)
index = indexer.index(pt_dataset.get_corpus_iter())

ir-lab-wise-2024/subsampled-ms-marco-deep-learning-20241201-training documents: 100%|██████████| 68261/68261 [00:08<00:00, 7757.15it/s]


09:44:59.055 [ForkJoinPool-2-worker-1] WARN org.terrier.structures.indexing.Indexer - Indexed 1 empty documents
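
To build intuition for what an index stores, here is a minimal pure-Python sketch of an inverted index. It is a simplification and not how Terrier implements indexing; the function `build_inverted_index` and the toy documents are hypothetical and only mirror the `docno` and `text` fields used above.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to {docno: term frequency}. Illustrative only."""
    index = defaultdict(dict)
    for doc in docs:
        # Naive tokenization: lowercase and split on whitespace.
        for term in doc["text"].lower().split():
            postings = index[term]
            postings[doc["docno"]] = postings.get(doc["docno"], 0) + 1
    return dict(index)


# Hypothetical toy corpus, not taken from the dataset.
toy_docs = [
    {"docno": "d1", "text": "monk theme song"},
    {"docno": "d2", "text": "swiss food fondue food"},
]
toy_index = build_inverted_index(toy_docs)
print(toy_index["food"])
```

A retrieval model can then score only the documents that appear in the postings of the query terms instead of scanning the whole corpus.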


## Step 4: Look at Evaluation Data

We reduce the topics to two information needs to make the evaluation a bit faster:

- `who sings monk theme song`
- `what is the most popular food in switzerland`

In [28]:
topics = pt_dataset.get_topics('title')
topics = topics[topics['query'].isin(('who sings monk theme song', 'what is the most popular food in switzerland'))]

topics

| | qid | query |
|---|---|---|
| 3 | 1051399 | who sings monk theme song |
| 76 | 833860 | what is the most popular food in switzerland |


In [29]:
qrels = pt_dataset.get_qrels()
qrels = qrels[qrels['qid'].isin(('1051399', '833860'))]

qrels

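| | qid | docno | label | iteration |
|---|---|---|---|---|
| 2790 | 833860 | 115142 | 1 | 0 |
| 2791 | 833860 | 1524401 | 0 | 0 |
| 2792 | 833860 | 1524402 | 0 | 0 |
| 2793 | 833860 | 1524403 | 0 | 0 |
| 2794 | 833860 | 1524406 | 0 | 0 |
| ... | ... | ... | ... | ... |
| 10269 | 1051399 | 8815205 | 0 | 0 |
| 10270 | 1051399 | 8818330 | 0 | 0 |
| 10271 | 1051399 | 904304 | 1 | 0 |
| 10272 | 1051399 | 909463 | 0 | 0 |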


## Step 5: Define the retrieval pipeline

We define a simple retrieval pipeline that uses only BM25 as a baseline. We also create an RM3 query expansion component, which Step 8 uses to build an improved pipeline. For details, refer to the PyTerrier [documentation](https://pyterrier.readthedocs.io/) or [tutorial](https://github.com/terrier-org/ecir2021tutorial).

In [128]:
bm25 = pt.BatchRetrieve(index, wmodel="BM25")

query_expansion = pt.rewrite.RM3(index)
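
For readers who want to see what a BM25 score is made of, the following is a rough sketch of the textbook Okapi BM25 formula in plain Python. It assumes the common defaults `k1=1.2` and `b=0.75` and a smoothed IDF; Terrier's implementation may differ in details, and the function `bm25_score` and the toy statistics are hypothetical.

```python
import math


def bm25_score(query_terms, doc_terms, doc_freqs, num_docs, avg_doc_len,
               k1=1.2, b=0.75):
    """Textbook BM25 score of one document for one query (illustrative)."""
    score = 0.0
    doc_len = len(doc_terms)
    for term in query_terms:
        tf = doc_terms.count(term)   # term frequency in this document
        df = doc_freqs.get(term, 0)  # number of documents containing the term
        if tf == 0 or df == 0:
            continue
        # Rare terms get a higher inverse document frequency.
        idf = math.log(1 + (num_docs - df + 0.5) / (df + 0.5))
        # Length normalization dampens the effect of long documents.
        norm = tf + k1 * (1 - b + b * doc_len / avg_doc_len)
        score += idf * tf * (k1 + 1) / norm
    return score


# Hypothetical collection statistics for three toy documents.
doc_freqs = {"monk": 1, "song": 2}
query = ["monk", "song"]
score_a = bm25_score(query, ["monk", "theme", "song"], doc_freqs, 3, 3.0)
score_b = bm25_score(query, ["pop", "song", "chart"], doc_freqs, 3, 3.0)
print(score_a > score_b)
```

The document that matches both query terms, including the rarer term `monk`, receives the higher score.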


## Step 6: Evaluate the Retrieval System

In [129]:
pt.Experiment(
    [bm25],
    names=['BM25'],
    topics=topics,
    qrels=qrels,
    eval_metrics=['P_10', 'recall_100']
)

| | name | P_10 | recall_100 |
|---|---|---|---|
| 0 | BM25 | 0.35 | 0.254545 |
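
As a rough illustration of the two measures, the sketch below computes precision and recall at a cutoff `k` for a single query. The helper names and the toy ranking are hypothetical; `pt.Experiment` reports such values averaged over all topics.

```python
def precision_at_k(ranked_docnos, relevant, k):
    """Fraction of the top-k results that are relevant."""
    hits = sum(1 for docno in ranked_docnos[:k] if docno in relevant)
    return hits / k


def recall_at_k(ranked_docnos, relevant, k):
    """Fraction of all relevant documents found in the top-k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for docno in ranked_docnos[:k] if docno in relevant)
    return hits / len(relevant)


# Hypothetical ranking and relevance judgments for one query.
ranking = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2", "d9"}
print(precision_at_k(ranking, relevant, 2))
print(recall_at_k(ranking, relevant, 4))
```

In this toy case, one of the top two results is relevant, and two of the three relevant documents appear in the top four.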


## Step 7: Improve the Retrieval System

Let's use large language models to try to improve the effectiveness of our retrieval system.

One of the fundamental tasks that large language models might help with is to assess [if a document is relevant to a given query](https://downloads.webis.de/publications/papers/faggioli_2023b.pdf).

![Relevance Assessments with an LLM](relevance-assessments.png "relevance-assessments.png")

In [137]:
from wows_eval import evaluate as wows_evaluate

from autoqrels.zeroshot import GradedMonoPrompt
from auto_ir_metadata import Environment
import pandas as pd
pd.set_option('display.max_colwidth', None)

WOWS_DATASET_ID = 'wows-eval/pointwise-smoke-test-20250128-training'


In [46]:
input_data = tira.pd.inputs(WOWS_DATASET_ID)
input_data.head(3)

| | id | query | unknown |
|---|---|---|---|
| 0 | 32d23068-7440-4891-9958-42325f98a604 | who sings monk theme song | This is a reference to the minor controversy that brewed among Monk fans over the introduction of the new theme song It's A Jungle Out There written and performed by Randy Newman in the second season of Monk. |
| 1 | cde83146-ac3e-4bc5-a959-f2006ac7b8de | who sings monk theme song | Walker, Texas Ranger. Chuck Norris thought “Eyes of a Ranger” would be the perfect theme song for his new show Walker, Texas Ranger. He wanted his friend Randy Travis should sing it, but CBS had a different idea: The network suggested Norris sing the theme himself. |
| 2 | cb7b20d0-def6-46c4-ae44-a78f00b47735 | who sings monk theme song | However, as Brave 's soundtrack reveals, the movie is also noteworthy for being one of the studio's most musical films, especially for one not featuring music by Pixar's go-to songwriter Randy Newman. |


## How Do We Solve This Task?

<img src="prompt-engineering.png" width="300"/>


### Step 1: Describe the task in the prompt:

For instance:

```
Instruction: Indicate if the passage answers the question.
```

### Step 2: In-Context Learning:

Provide a few examples:

```
###
Example 1:
Question: At what age do kids start to read?
Passage: Most kids say 1–2 words by 15 months and 3 or more words by 18 months.
Answer: Not Relevant
```

```
###
Example 2:
Question: What are the 5 P's of drawing?
Passage: The 5 P's of drawing are (1) Patience, (2) Positive feedback, (3) Perseverance, (4) Practicing, and (5) Passion.
Answer: Perfectly Relevant
```
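
The task description and the examples can be assembled into one prompt programmatically. The following is a minimal sketch using plain string formatting; the helper `build_prompt` is hypothetical and is not part of autoqrels, which fills its own template placeholders.

```python
def build_prompt(instruction, examples, question, passage):
    """Assemble a few-shot prompt that ends with an open 'Answer:' slot."""
    parts = [f"Instruction: {instruction}"]
    for number, example in enumerate(examples, start=1):
        parts += [
            "###",
            f"Example {number}:",
            f"Question: {example['question']}",
            f"Passage: {example['passage']}",
            f"Answer: {example['answer']}",
        ]
    # The final example is the pair to be judged; the model completes the answer.
    parts += [
        "###",
        f"Example {len(examples) + 1}:",
        f"Question: {question}",
        f"Passage: {passage}",
        "Answer:",
    ]
    return "\n".join(parts)


examples = [{
    "question": "What are the 5 P's of drawing?",
    "passage": "The 5 P's of drawing are (1) Patience, (2) Positive feedback, "
               "(3) Perseverance, (4) Practicing, and (5) Passion.",
    "answer": "Perfectly Relevant",
}]
prompt = build_prompt(
    "Indicate if the passage answers the question.",
    examples,
    "who sings monk theme song",
    "A placeholder passage to be judged.",
)
print(prompt)
```

Keeping the examples in a list makes it easy to swap, reorder, or combine examples contributed by different people.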


### Step 3: Teamwork :)

Please share your task descriptions and in-context-learning examples in the chat. We will combine and test them to evaluate your prompts.

<img src="teamwork.jpg" width="300"/>


In [47]:
BACKBONE_MODEL = "flan-t5-small"

PROMPT = """Instruction: Indicate if the passage answers the question.
###
Example 1:
Question: At about what age do adults normally begin to lose bone mass?
Passage: For most people, bone mass peaks during the third decade of life. By this age, men typically have accumulated more bone mass than women. After this point, the amount of bone in the skeleton typically begins to decline slowly as removal of old bone exceeds formation of new bone.
Answer: Perfectly relevant
###
Example 2:
Question: when and where did the battle of manassas take place
Passage: Summary of the Battle of Bull Run. The conflict took place close to Manassas Junction, Virginia. Around 35,000 Union soldiers marched from Washing D.C. towards Bull Run (a small river) where a 20,000 troop Confederate force was stationed.
Answer: Irrelevant
###
###
Example 3:
Question: what is lbm in body composition
Passage: They also measured the participants body fat and clean body mass of muscle mass, obtained by subtracting the body fat weight from the total body weight.
Answer: Relevant
###
Example 4:
Question: {{ query_text }}
Passage: {{ unk_doc_text }}
Answer:"""

In [48]:
flan_t5_small_assessor = GradedMonoPrompt(
    backbone=f'google/{BACKBONE_MODEL}',
    prompt=PROMPT,
    dataset=None
)

In [49]:
with Environment().measure() as tracked_experiment:
    predictions = flan_t5_small_assessor.predict(input_data)

predictions.head(3)

100%|██████████| 2/2 [00:01<00:00,  1.06it/s]


| | id | query | unknown | probability_relevant |
|---|---|---|---|---|
| 0 | 32d23068-7440-4891-9958-42325f98a604 | who sings monk theme song | This is a reference to the minor controversy that brewed among Monk fans over the introduction of the new theme song It's A Jungle Out There written and performed by Randy Newman in the second season of Monk. | 0.263609 |
| 1 | cde83146-ac3e-4bc5-a959-f2006ac7b8de | who sings monk theme song | Walker, Texas Ranger. Chuck Norris thought “Eyes of a Ranger” would be the perfect theme song for his new show Walker, Texas Ranger. He wanted his friend Randy Travis should sing it, but CBS had a different idea: The network suggested Norris sing the theme himself. | 0.240609 |
| 2 | cb7b20d0-def6-46c4-ae44-a78f00b47735 | who sings monk theme song | However, as Brave 's soundtrack reveals, the movie is also noteworthy for being one of the studio's most musical films, especially for one not featuring music by Pixar's go-to songwriter Randy Newman. | 0.280311 |


In [None]:
wows_evaluate(
    predictions,
    WOWS_DATASET_ID,
    upload=True,
    system_name=f'auto-qrels-pointwise-{BACKBONE_MODEL}',
    system_description="We use autoqrels [1] with a custom in-context learning prompt for pointwise relevance judgments.\n\n[1] - https://github.com/seanmacavaney/autoqrels",
    environment=tracked_experiment
)

Run uploaded to TIRA. Claim ownership via: https://www.tira.io/claim-submission/a03c0610-e1fa-48fa-bb2e-1737742fb94a


{'tau_ap': 0.07944444444444437,
 'kendall': 0.29523809523809524,
 'spearman': 0.32500000000000007,
 'pearson': 0.32499999999999996}
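
The reported values are correlation coefficients. As a rough guide to how such coefficients behave, the sketch below computes Kendall's tau (without tie handling) and Pearson's correlation in plain Python. How `wows_eval` computes its scores internally is not shown in this notebook, so the functions and toy scores here are hypothetical illustrations only.

```python
import math
from itertools import combinations


def kendall_tau(xs, ys):
    """Kendall's tau-a: (concordant - discordant) pairs over all pairs."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(list(zip(xs, ys)), 2):
        direction = (x1 - x2) * (y1 - y2)
        if direction > 0:
            concordant += 1
        elif direction < 0:
            discordant += 1
    num_pairs = len(xs) * (len(xs) - 1) / 2
    return (concordant - discordant) / num_pairs


def pearson(xs, ys):
    """Pearson correlation: covariance divided by both standard deviations."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical predicted scores and reference grades for four passages.
predicted = [0.1, 0.4, 0.35, 0.8]
reference = [1, 2, 3, 4]
print(kendall_tau(predicted, reference))
print(pearson(predicted, reference))
```

A value of 1 means that both lists order the items identically, 0 means no association, and -1 means the orders are reversed.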

## Step 8: Build a Chain of Search Components


Retrieval pipelines easily become complex.
Here, we use the simplest operator, `>>`, to chain components: the output of each component becomes the input of the next.

In [95]:
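def llm_predictions_to_ranking(predictions):
    # Look up the query ID and document ID that belong to each prediction ID.
    ids = {i['id']: i.to_dict() for _, i in tira.pd.truths(WOWS_DATASET_ID).iterrows()}
    ret = []
    for _, i in predictions.iterrows():
        qid, docno = ids[i['id']]['query_id'], ids[i['id']]['unknown_doc_id']
        # Use the LLM's relevance probability as the retrieval score.
        ret += [{'qid': qid, 'docno': docno, 'score': i['probability_relevant']}]

    # Wrap the scored (qid, docno) pairs as a PyTerrier transformer.
    return pt.Transformer.from_df(pd.DataFrame(ret))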


In [130]:
improved_pipeline = bm25 >> \
    llm_predictions_to_ranking(predictions) >> \
    query_expansion >> \
    bm25
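
To see how an operator like `>>` can chain components, here is a toy sketch based on Python's `__rshift__` method. The `Step` class is hypothetical and much simpler than PyTerrier's transformers, which operate on data frames of queries and results.

```python
class Step:
    """A toy pipeline component that wraps a single function."""

    def __init__(self, func):
        self.func = func

    def __call__(self, data):
        return self.func(data)

    def __rshift__(self, other):
        # `self >> other` builds a new step that runs self first, then other.
        return Step(lambda data: other(self(data)))


lowercase = Step(str.lower)
tokenize = Step(str.split)
count = Step(len)

pipeline = lowercase >> tokenize
print(pipeline("Who Sings Monk"))
print((pipeline >> count)("Who Sings Monk"))
```

Because each chained result is itself a `Step`, pipelines of any length can be composed and reused as building blocks.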

## Step 9: Understand What the Pipeline Does

In [131]:
bm25(topics)['query'].iloc[0]

'who sings monk theme song'

In [132]:
improved_pipeline(topics)['query'].iloc[0]

'applypipeline:off theme^0.197192997 4^0.014035087 real^0.077192985 show^0.014035087 song^0.197192997 sing^0.155087724 search^0.014035087 lyric^0.028070174 monk^0.120000005 who^0.155087724 time^0.028070174'
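
The expanded query consists of `term^weight` pairs. The sketch below parses such a string into a dictionary to make the weights easier to inspect. The helper `parse_weighted_query` is hypothetical, and it simply skips tokens without a `^`, such as the leading `applypipeline:off` control token.

```python
def parse_weighted_query(query):
    """Extract {term: weight} from a string of `term^weight` tokens."""
    weights = {}
    for token in query.split():
        if "^" not in token:
            continue  # skip control tokens such as 'applypipeline:off'
        term, weight = token.rsplit("^", 1)
        weights[term] = float(weight)
    return weights


# A shortened excerpt of the expanded query shown above.
expanded = "applypipeline:off theme^0.197192997 song^0.197192997 monk^0.120000005"
weights = parse_weighted_query(expanded)

# List the terms from highest to lowest weight.
for term, weight in sorted(weights.items(), key=lambda item: item[1], reverse=True):
    print(term, weight)
```

Sorting by weight shows which terms dominate the expanded query; terms such as `lyric` or `show` in the full output were added by the expansion and do not occur in the original query.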

## Step 10: Evaluate and Compare

In [133]:
pt.Experiment(
    [bm25, improved_pipeline],
    names=['BM25', 'Improved'],
    topics=topics,
    qrels=qrels,
    eval_metrics=['P_10', 'recall_100']
)

| | name | P_10 | recall_100 |
|---|---|---|---|
| 0 | BM25 | 0.35 | 0.254545 |
| 1 | Improved | 0.45 | 0.268182 |
