# WOWS-Eval Pairwise Retrieval Baseline

This retrieval baseline for WOWS-EVAL uses a PyTerrier retrieval model to estimate the probability that an unknown document is relevant. For each query, every known relevant document is issued as a query, and all unknown documents are ranked for it. The min-max normalized rank that an unknown document obtains for a known relevant document is then used as the probability that the unknown document is relevant.
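
To make the normalization step concrete, here is a minimal sketch of min-max normalizing ranks. The helper name `min_max_normalize` and the example ranks are hypothetical, and mapping a constant rank set to 0.0 is an assumption made only to avoid a division by zero.

```python
def min_max_normalize(ranks):
    """Map raw ranks to the interval [0, 1] via min-max normalization."""
    lowest, highest = min(ranks.values()), max(ranks.values())
    # Assumption: if all ranks are equal, use a span of 1 so every value maps to 0.0.
    span = (highest - lowest) or 1
    return {doc: (rank - lowest) / span for doc, rank in ranks.items()}


# Hypothetical ranks of three unknown documents for one known relevant document.
example_ranks = {"unk-a": 1, "unk-b": 2, "unk-c": 5}
normalized = min_max_normalize(example_ranks)
print(normalized)
```

With this convention, the unknown document with the smallest rank value maps to 0.0 and the one with the largest rank value maps to 1.0.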

## Step 1: Install Dependencies

In [None]:
!pip3 install 'wows-eval>=0.0.6' python-terrier==0.10.0

## Step 2: Load the Data

Pairwise models take a query, a known relevant document, and a document of unknown relevance to that query as input. They predict the probability that the unknown document is relevant to the query, given the known relevant document, and write it to the field `probability_relevant`. This baseline derives that probability from retrieval ranks, as implemented in Step 3.

In the following, we process the pairwise smoke test dataset. Modify the variable `DATASET_ID` to submit for other datasets. See [tira.io/datasets?query=wows-eval](https://archive.tira.io/datasets?query=wows-eval) for a complete overview of dataset identifiers.

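As a rough illustration of the input and output shapes, the sketch below turns pairwise input rows into prediction rows. The column names `id`, `query`, `relevant`, and `unknown` follow their use later in this notebook; the helper `to_predictions`, the example texts, and the constant placeholder scorer are hypothetical.

```python
def to_predictions(rows, score):
    """Build one prediction per input row using a caller-supplied scoring function."""
    return [
        {"id": row["id"], "probability_relevant": score(row)}
        for row in rows
    ]


# Hypothetical input rows: each pairs a known relevant document with an unknown one.
rows = [
    {"id": "1", "query": "example query", "relevant": "known relevant text", "unknown": "candidate text A"},
    {"id": "2", "query": "example query", "relevant": "known relevant text", "unknown": "candidate text B"},
]

# Placeholder scorer that ignores the row; a real approach would inspect the texts.
predictions = to_predictions(rows, lambda row: 0.5)
print(predictions)
```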

In [1]:
import pyterrier as pt
from tqdm import tqdm
if not pt.started():
    pt.init()

from tira.rest_api_client import Client
from wows_eval import evaluate as wows_evaluate
import pandas as pd
from jnius import autoclass
import numpy as np

# For measuring consumed resources (e.g., GPU, CPU, RAM, etc.)
from tirex_tracker import tracking, ExportFormat

pd.set_option('display.max_colwidth', None)

DATASET_ID = 'wows-eval/pairwise-smoke-test-20250210-training'
#DATASET_ID = 'wows-eval/pairwise-20250309-test'

tira = Client()
input_data = tira.pd.inputs(DATASET_ID)

PyTerrier 0.10.0 has loaded Terrier 5.11 (built by craig.macdonald on 2025-01-13 21:29) and terrier-helper 0.0.8



## Step 3: Implement the Approach

We wrap all computations in a [tirex_tracker.tracking](https://github.com/tira-io/tirex-tracker/) context to measure the resources they consume and to record a snapshot of our code in the [ir-metadata format](https://www.ir-metadata.org/).

In [2]:
class QueryByRelevantDocument:
    def __init__(self):
        self.results = {}
    def process(self, query, retrieval_system, rel, unk):
        if query in self.results:
            raise ValueError('This query was already processed: ' + query)
        
        ret = {}
        tokeniser = autoclass("org.terrier.indexing.tokenisation.Tokeniser").getTokeniser()
        for doc in rel.values():
            doc_text = " ".join(tokeniser.getTokens(doc))
            run = retrieval_system.search(doc_text)
            last_rank = -1
            scores = {}
            for _, i in run.iterrows():
                assert last_rank < i['rank']
                last_rank = i['rank']
                if i['docno'] in unk:
                    scores[unk[i['docno']]] = i['rank']

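            # Min-max normalize the ranks; guard against an empty or constant score set.
            if scores:
                max_score = max(scores.values())
                min_score = min(scores.values())
                span = (max_score - min_score) or 1
                ret[doc] = {k: (v - min_score) / span for k, v in scores.items()}
            else:
                ret[doc] = {}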
        
        self.results[query] = ret


class QueryByUnknownDocument:
    def __init__(self):
        self.results = {}
    def process(self, query, retrieval_system, rel, unk):
        if query in self.results:
            raise ValueError('This query was already processed: ' + query)
        
        ret = {}
        tokeniser = autoclass("org.terrier.indexing.tokenisation.Tokeniser").getTokeniser()
        for doc in unk.values():
            doc_text = " ".join(tokeniser.getTokens(doc))
            run = retrieval_system.search(doc_text)
            last_rank = -1
            dcg = 0
            for _, i in run.iterrows():
                assert last_rank < i['rank']
                last_rank = i['rank']
                if i['rank'] >= 20:
                    break
                if i['docno'] in rel:
                    # DCG gain as in https://github.com/joaopalotti/trectools/blob/master/trectools/trec_eval.py#L499C28-L499C56
                    # PyTerrier ranks start at 0, so use rank + 2 to avoid dividing by log2(1) = 0.
                    dcg += 1. / np.log2(i['rank'] + 2)
            
            ret[doc] = dcg
        # Min-max normalize the DCG scores; guard against a constant score set.
        max_score = max(ret.values())
        min_score = min(ret.values())
        span = (max_score - min_score) or 1
        ret = {k: (v - min_score) / span for k, v in ret.items()}
        ret = {r: ret for r in rel.values()}

        self.results[query] = ret

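The `QueryByUnknownDocument` variant scores each unknown document with a DCG-style sum over the known relevant documents that it retrieves within the top 20 ranks. The following is a minimal standalone sketch of that scoring, assuming 0-based ranks; the function name `dcg_of_relevant_hits` and the example ranking are hypothetical.

```python
import math


def dcg_of_relevant_hits(ranked_docnos, relevant_docnos, cutoff=20):
    """Sum 1 / log2(rank + 2) over relevant documents in the top `cutoff` (0-based ranks)."""
    dcg = 0.0
    for rank, docno in enumerate(ranked_docnos[:cutoff]):
        if docno in relevant_docnos:
            # rank + 2 keeps the discount finite at rank 0, since log2(2) = 1.
            dcg += 1.0 / math.log2(rank + 2)
    return dcg


# Hypothetical ranking obtained by querying with the text of one unknown document.
ranking = ["0-unkn", "1-rel", "2-unkn", "0-rel"]
score = dcg_of_relevant_hits(ranking, {"0-rel", "1-rel"})
print(score)
```

Unknown documents that retrieve known relevant documents near the top receive a higher score before normalization.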

In [3]:
WMODEL = "BM25"
system_name = f'query-by-relevant-doc-{WMODEL}'
#system_name = f'query-by-unknown-doc-{WMODEL}'

!rm -Rf tmp
with tracking(export_file_path='tmp/.metadata.yml', export_format=ExportFormat.IR_METADATA) as tracked:
    queries = set(input_data['query'].unique())

    def known_relevant_documents(query):
        docs = set(input_data[input_data['query'] == query]['relevant'].unique())
        return {f'{i[0]}-rel': i[1] for i in zip(range(len(docs)), docs)}

    def unknown_documents(query):
        docs = set(input_data[input_data['query'] == query]['unknown'].unique())
        return {f'{i[0]}-unkn': i[1] for i in zip(range(len(docs)), docs)}

    if system_name.startswith('query-by-relevant-doc'):
        processor = QueryByRelevantDocument()
    elif system_name.startswith('query-by-unknown-doc'):
        processor = QueryByUnknownDocument()
    else:
        raise ValueError(f'Unknown system name: {system_name}')

    for query in tqdm(queries):
        rel = known_relevant_documents(query)
        unk = unknown_documents(query)

        docs = [{'docno': k, 'text': v} for k, v in rel.items()]+[{'docno': k, 'text': v} for k, v in unk.items()]
        indexer = pt.IterDictIndexer("/tmp/index", overwrite=True, meta={'docno': 100, 'text': 20480})
        index_ref = indexer.index(docs)
        bm25 = pt.BatchRetrieve(index_ref, wmodel=WMODEL)
        processor.process(query, bm25, rel, unk)

    predictions = []
    for _, i in input_data.iterrows():
        res = processor.results[i['query']]
        res = res[i['relevant']]
        predictions.append({
            'id': i['id'],
            'probability_relevant': res.get(i['unknown'], -1)
        })
    predictions = pd.DataFrame(predictions)


100%|██████████| 2/2 [00:00<00:00,  2.38it/s]


## Step 4: Evaluate and Submit Your Run

We use the `wows_evaluate` function imported above to evaluate our predictions and upload them to TIRA.

The `wows_evaluate` function accepts optional parameters that describe your system and attach the resource measurements collected during your computations, in the ir-metadata format, to your submission. Remove or adjust these parameters as needed for your own submission. Call `help(wows_evaluate)` for a full description.

In [4]:
wows_evaluate(
    predictions,
    DATASET_ID,
    tracking_results=tracked,
    upload=True,
    system_name=system_name,
    system_description=f'We use the PyTerrier retrieval model {WMODEL} to assign the probability that an unknown document is relevant by ranking all unknown documents for all known relevant documents. We then use the min-max normalized rank that an unknown document has for a known relevant document as the probability that the unknown document is relevant.'
)

Run uploaded to TIRA. Claim ownership via: https://www.tira.io/claim-submission/59f8a4d8-5fc5-4349-88eb-694db23eb457


| | system | tau_ap | kendall | spearman | pearson |
|---|---|---|---|---|---|
| 0 | query-by-relevant-doc-BM25 | 0.526667 | 0.485714 | 0.621429 | 0.621429 |
