# Information Retrieval project
**Authors:** L.Arduini, D.N.Ghaneh, L.Menchini, C.Petruzzella

**Dataset:** The chosen dataset is MSMARCO Passage dataset [link here](https://ir-datasets.com/msmarco-passage.html)

**Evaluation:** For evaluation the trec-2020-dl dataset has been used

## Instructions to Run

### Prerequisites
1. Python 3.10 or above.
2. Access to a runtime environment with GPU support (e.g., NVIDIA V28 on Google Colab) for optimal performance.

### Running the project
- Switch the runtime to GPU (e.g., NVIDIA V28) for enhanced performance

# 0. Setup environment and dependencies
This section ensures that all necessary packages are installed and loaded.

**Note:** The project uses `ir_datasets`, `nltk`, and `ir_measures`, along with several utilities for processing.

In [33]:
!pip install ir_datasets
!pip install nltk
!pip install ir_measures
!pip install PyStemmer
!pip install pandas
!pip install python-terrier
!pip install --upgrade gdown



In [34]:
import ir_datasets
import ir_measures
from ir_measures import *
import random
import re
import string
import nltk
import time
from collections import Counter, defaultdict
from tqdm.auto import tqdm
import gzip
import pickle
import os
import heapq
import math
import pyterrier as pt
from google.colab import drive
import os
import shutil

# 1. Loading the dataset

This notebook will load the MS MARCO Passage dataset, a standard dataset for Information Retrieval tasks.
It contains passages from various sources and is used to train and evaluate retrieval models.

For testing purposes, also the Vaswani dataset will be used in a development environment.

In [35]:
# ------- Production Environment -------
dataset = ir_datasets.load("msmarco-passage")
# ---------------------------------------

# ------- Development Environment -------
# dataset = ir_datasets.load("vaswani")
# ---------------------------------------

In [36]:
# From Google Drive import lexicon, inverted indexes and document indexes
drive.mount('/content/drive')

repository = "1riRemOldrDhvbpnphe1Co8jadQuC4OOs"
repository_name = "ir-project-files"
!gdown --folder $repository

# Copia i file dalla cartella scaricata a /content/
for item in os.listdir(repository_name):
  s = os.path.join(repository_name, item)
  d = os.path.join('/content/', item)
  if os.path.isfile(s):  # Copia solo se è un file
    shutil.copy2(s, d)

# Rimuovi la cartella scaricata (opzionale)
shutil.rmtree(repository_name)

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Retrieving folder contents
Processing file 1-6YUREBaAr5H2TNd4v66feF7-gXtRSTB document_index.pickle.gz
Processing file 1-6QY48jBNOrtppPDAV80EemnrN-f06yV inverted_file.pickle.gz
Processing file 1-5JPLr2Ug3CTAeoYePwOui8CljX5LpV0 lexicon.pickle.gz
Processing file 1-L0CzWJPrzmllB-3P5dE1qkV9cgyJL7G stats.pickle.gz
Processing file 1-KA42mZpr_KT5E4Jl3isSThNnu2ueUg5 trec_eval_bm25_run_file.txt
Processing file 1-Aj1Q513p315ymq_bgDkidtCbFcevPPo trec_eval_qrels_file.txt
Processing file 1-HW4GID_ULrLq_3CDZu24RoLskqi1pj0 trec_eval_tfidf_run_file.txt
Retrieving folder contents completed
Building directory structure
Building directory structure completed
Downloading...
From (original): https://drive.google.com/uc?id=1-6YUREBaAr5H2TNd4v66feF7-gXtRSTB
From (redirected): https://drive.google.com/uc?id=1-6YUREBaAr5H2TNd4v66feF7-gXtRSTB&confirm=t&uuid=229b4732-3aa4-4923-83db-46c3

# 2. Preprocessing text data
This section defines functions for text preprocessing. Preprocessing steps include:
- Lowercasing
- Replacing symbols and punctuations
- Removing stopwords
- Stemming tokens

The goal is to normalize text data for effective retrieval

In [6]:
from functools import lru_cache
import Stemmer
nltk.download("stopwords", quiet=True)

# ------- Pre Initialization -------
# 1. Compile regex patterns once globally
# 2. Preload stopwords set
# 3. Initialize stemmer

ACRONYM_REGEX = re.compile(r"(?<!\w)\.(?!\d)")
PUNCTUATION_TRANS = str.maketrans("", "", string.punctuation)
STOPWORDS = set(nltk.corpus.stopwords.words('english'))
STEMMER = Stemmer.Stemmer('english')
# ----------------------------------

def preprocess(s):
    """
    Preprocess a string for indexing or querying.

    Args:
        s: The input string.

    Returns:
        A list of preprocessed tokens.
    """

    s = s.lower()
    s = s.replace("&", " and ")
    # normalize quotes and dashes
    s = s.translate(str.maketrans("‘’´“”–-", "'''\"\"--"))
    # remove unnecessary dots in acronyms (but not decimals)
    s = ACRONYM_REGEX.sub("", s)
    # remove punctuation
    s = s.translate(PUNCTUATION_TRANS)
    # strip and remove extra spaces
    s = " ".join(s.split())

    tokens = s.split()
    tokens = [t for t in tokens if t not in STOPWORDS]
    tokens = STEMMER.stemWords(tokens)
    return tokens

In [7]:
def profile(f):
    """
    A decorator that prints the runtime of the decorated function.

    Args:
        f: The function to profile.

    Returns:
        The profiled function.
    """
    def f_timer(*args, **kwargs):
        """
        The profiled function.

        Args:
            *args: The arguments to the function.
            **kwargs: The keyword arguments to the function.

        Returns:
            The result of the function.
        """
        start = time.time()
        result = f(*args, **kwargs)
        end = time.time()
        ms = (end - start) * 1000
        print(f"{f.__name__} ({ms:.3f} ms)")
        return result
    return f_timer

# 3. Building the inverted index
We create an inverted index to store terms with their respective document IDs and term frequencies.
The `build_index` function processes the dataset and constructs a structure that enables efficient term-based searching across documents

In [8]:
@profile
def build_index(dataset):
    """
    Build an inverted index from a dataset.

    Args:
        dataset: The dataset to index.

    Returns:
        A tuple of:
        - The lexicon, a dictionary mapping terms to term IDs and document frequencies.
        - The inverted index, a dictionary mapping term IDs to lists of document IDs and frequencies.
        - The document index, a list of document IDs and document lengths.
        - The index statistics, a dictionary of statistics.
    """
    lexicon = {}
    doc_index = []
    inv_d, inv_f = {}, {}
    termid = 0

    num_docs = 0
    total_dl = 0
    total_toks = 0
    for docid, doc in tqdm(enumerate(dataset.docs_iter()), desc='Indexing', total=dataset.docs_count()):
        tokens = preprocess(doc.text)
        token_tf = Counter(tokens)
        for token, tf in token_tf.items():
            if token not in lexicon:
                lexicon[token] = [termid, 0, 0]
                inv_d[termid], inv_f[termid] =  [], []
                termid += 1
            token_id = lexicon[token][0]
            inv_d[token_id].append(docid)
            inv_f[token_id].append(tf)
            lexicon[token][1] += 1
            lexicon[token][2] += tf
        doclen = len(tokens)
        doc_index.append((str(doc.doc_id), doclen))
        total_dl += doclen
        num_docs += 1


    stats = {
        'num_docs': 1 + docid,
        'num_terms': len(lexicon),
        'num_tokens': total_dl,
    }
    return lexicon, {'docids': inv_d, 'freqs': inv_f}, doc_index, stats

In [9]:
lex, inv, doc, stats = None, None, None, None

files = ['lexicon.pickle.gz', 'inverted_file.pickle.gz', 'document_index.pickle.gz', 'stats.pickle.gz']
if all(os.path.exists(file) for file in files):
    print("All files already exist.")

    for file, var_name in zip(files, ['lex', 'inv', 'doc', 'stats']):
        try:
            if os.path.getsize(file) > 0:  # Verifica se il file non è vuoto
                with gzip.open(file, 'rb') as f:
                    globals()[var_name] = pickle.load(f)
            else:
                print(f"Warning: {file} is empty.")
        except EOFError:
            print(f"Error: {file} is corrupted or incomplete. Rebuilding the index.")
            lex, inv, doc, stats = build_index(dataset)
            break
else:
    # Se i file non esistono o sono corrotti, ricostruisci l'indice
    lex, inv, doc, stats = build_index(dataset)

    # Salva nuovamente i dati nei file compressi solo se necessario
    for data, file in zip([lex, inv, doc, stats], files):
      with gzip.open(file, 'wb') as f:
        print(f"Saving {file}...")
        pickle.dump(data, f)


Indexing:   0%|          | 0/8841823 [00:00<?, ?it/s]

[INFO] Please confirm you agree to the MSMARCO data usage agreement found at <http://www.msmarco.org/dataset.aspx>
[INFO] If you have a local copy of https://msmarco.z22.web.core.windows.net/msmarcoranking/collectionandqueries.tar.gz, you can symlink it here to avoid downloading it again: /root/.ir_datasets/downloads/31644046b18952c1386cd4564ba2ae69
[INFO] [starting] https://msmarco.z22.web.core.windows.net/msmarcoranking/collectionandqueries.tar.gz

https://msmarco.z22.web.core.windows.net/msmarcoranking/collectionandqueries.tar.gz: 0.0%| 0.00/1.06G [00:00<?, ?B/s][A
https://msmarco.z22.web.core.windows.net/msmarcoranking/collectionandqueries.tar.gz: 0.0%| 344k/1.06G [00:00<05:13, 3.37MB/s][A
https://msmarco.z22.web.core.windows.net/msmarcoranking/collectionandqueries.tar.gz: 0.2%| 2.16M/1.06G [00:00<01:41, 10.4MB/s][A
https://msmarco.z22.web.core.windows.net/msmarcoranking/collectionandqueries.tar.gz: 0.7%| 7.63M/1.06G [00:00<00:43, 24.3MB/s][A
https://msmarco.z22.web.core.window

build_index (1396744.589 ms)
Saving lexicon.pickle.gz...
Saving inverted_file.pickle.gz...
Saving document_index.pickle.gz...
Saving stats.pickle.gz...


In [11]:
class InvertedIndex:
    """
    A simple inverted index class.

    Attributes:
        lexicon: The lexicon.
        inv: The inverted index.
        doc: The document index.
        stats: The index statistics.

    Methods:
        num_docs: Get the number of documents in the index.
        get_posting: Get a posting list iterator for a term.
        get_termids: Get the term IDs for a list of tokens.
        get_postings: Get the posting list iterators for a list of term IDs.

    Inner class:
        PostingListIterator: An iterator over a posting list.
    """

    class PostingListIterator:
        """
        An iterator over a posting list.

        Attributes:
            docids: The list of document IDs.
            freqs: The list of term frequencies.
            pos: The current position in the posting list.
            doc: The document index.

        Methods:
            docid: Get the current document ID.
            score: Get the current document score.
            next: Move to the next document.
            is_end_list: Check if the iterator is at the end of the list.
            len: Get the length of the posting list.
        """
        def __init__(self, docids, freqs, doc):
            self.docids = docids
            self.freqs = freqs
            self.pos = 0
            self.doc = doc

        def docid(self):
            if self.is_end_list():
                return math.inf
            return self.docids[self.pos]

        def score(self):
            if self.is_end_list():
                return math.inf
            return self.freqs[self.pos]/self.doc[self.docid()][1]

        def next(self, target = None):
            if not target:
                if not self.is_end_list():
                    self.pos += 1
            else:
                if target > self.docid():
                    try:
                        self.pos = self.docids.index(target, self.pos)
                    except ValueError:
                        self.pos = len(self.docids)

        def is_end_list(self):
            return self.pos == len(self.docids)


        def len(self):
            return len(self.docids)


    def __init__(self, lex, inv, doc, stats):
        self.lexicon = lex
        self.inv = inv
        self.doc = doc
        self.stat = stats

    def num_docs(self):
        return self.stats['num_docs']

    def get_posting(self, termid):
        return InvertedIndex.PostingListIterator(self.inv['docids'][termid], self.inv['freqs'][termid], self.doc)

    def get_termids(self, tokens):
        return [self.lexicon[token][0] for token in tokens if token in self.lexicon]

    def get_postings(self, termids):
        return [self.get_posting(termid) for termid in termids]

inv_index = InvertedIndex(lex, inv, doc, stats)

# 4. Query processing
This section implements the Query Processing task, aiming to rank documents by relevance to a given query using the BM25 and TF-IDF scoring functions with two different approaches:
- **DAAT (Document-at-a-Time)**: Processes documents sequentially, computing scores for all terms in a document before moving to the next document.
- **TAAT (Term-at-a-Time)**: Processes terms sequentially, scoring all documents for a given term before moving to the next term.

In [12]:
# ------- Production Environment --------
trec_dl_2020 = ir_datasets.load("msmarco-passage/trec-dl-2020")
# ---------------------------------------

# ------- Development Environment -------
# trec_dl_2020 = ir_datasets.load("vaswani")
# ---------------------------------------

In [13]:
class TopQueue:
    """
    A simple top-k queue class.

    Attributes:
        queue: The priority queue.
        k: The maximum number of items in the queue.
        threshold: The minimum score threshold.

    Methods:
        size: Get the number of items in the queue.
        would_enter: Check if a score would enter the queue.
        clear: Clear the queue.
        insert: Insert a document into the queue.
    """
    def __init__(self, k=10, threshold=0.0):
        self.queue = []
        self.k = k
        self.threshold = threshold

    def size(self):
        return len(self.queue)

    def would_enter(self, score):
        return score > self.threshold

    def clear(self, new_threshold=None):
        self.queue = []
        if new_threshold:
            self.threshold = new_threshold

    def __repr__(self):
        return f'<{self.size()} items, th={self.threshold} {self.queue}'

    def insert(self, docid, score):
        if score > self.threshold:
            if self.size() >= self.k:
                heapq.heapreplace(self.queue, (score, docid))
            else:
                heapq.heappush(self.queue, (score, docid))
            if self.size() >= self.k:
                self.threshold = max(self.threshold, self.queue[0][0])
            return True
        return False

## 4.1. BM25

In [14]:
# Average document length
avg_dl = inv_index.stat['num_tokens'] / inv_index.stat['num_docs']
# Number of documents
N = inv_index.stat['num_docs']

def bm25(tf, df, dl, k1=1.5, b=0.75):
    """
    Compute the BM25 score.

    Args:
        tf: The term frequency.
        df: The document frequency.
        dl: The document length.
        k1: The k1 parameter.
        b: The b parameter.

    Returns:
        The BM25 score.
    """
    idf = math.log(1 + (N - df + 0.5) / (df + 0.5))
    term_frequency_component = (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * (dl / avg_dl)))
    return idf * term_frequency_component

### 4.1.1 DAAT with BM25

In [15]:
# Calculate document lengths
doc_lengths = defaultdict(int)
for docid, doc_len in inv_index.doc:
    doc_lengths[docid] = doc_len

def min_docid(postings):
    """
    Get the minimum document ID from a list of posting list iterators.

    Args:
        postings: The list of posting list iterators.

    Returns:
        The minimum document ID.
    """
    min_docid = math.inf
    for p in postings:
        if not p.is_end_list():
            min_docid = min(p.docid(), min_docid)
    return min_docid

def daat_bm25(postings, k=10):
    """
    Perform a document-at-a-time (DAAT) scoring using BM25.

    Args:
        postings: The list of posting list iterators.
        k: The maximum number of documents to retrieve.

    Returns:
        A list of (docid, score) tuples.
    """
    top = TopQueue(k)
    current_docid = min_docid(postings)

    while current_docid != math.inf:
        score = 0
        next_docid = math.inf

        for posting in postings:
            if posting.docid() == current_docid:
                tf = posting.freqs[posting.pos]
                df = posting.len()
                dl = doc_lengths[current_docid]

                score += bm25(tf, df, dl)

                posting.next()
            if not posting.is_end_list():
                next_docid = min(next_docid, posting.docid())

        top.insert(current_docid, score)
        current_docid = next_docid

    return sorted(top.queue, reverse=True)

### 4.1.2 TAAT with BM25

In [16]:
def taat_bm25(postings, k=10):
    """
    Perform a term-at-a-time (TAAT) scoring using BM25.

    Args:
        postings: The list of posting list iterators.
        k: The maximum number of documents to retrieve.

    Returns:
        A list of (docid, score) tuples.
    """
    A = defaultdict(float)

    for posting in postings:
        current_docid = posting.docid()

        df = posting.len()

        while current_docid != math.inf:
            tf = posting.freqs[posting.pos]
            dl = doc_lengths[current_docid]

            score = bm25(tf, df, dl)
            A[current_docid] += score

            posting.next()
            current_docid = posting.docid()

    top = TopQueue(k)
    for docid, score in A.items():
        top.insert(docid, score)

    return sorted(top.queue, reverse=True)

## 4.2 TF-IDF

In [17]:
def tfidf_score(tf, df, N):
    """
    Compute the TF-IDF score.

    Args:
        tf: The term frequency.
        df: The document frequency.
        N: The total number of documents.

    Returns:
        The TF-IDF score.
    """
    tf_component = 1 + math.log(tf) if tf > 0 else 0
    idf_component = math.log(N / df) if df > 0 else 0
    return tf_component * idf_component

### 4.2.1 DAAT with TF-IDF

In [18]:
def daat_tfidf(postings, k=10):
    """
    Perform a document-at-a-time (DAAT) scoring using TF-IDF.

    Args:
        postings: The list of posting list iterators.
        k: The maximum number of documents to retrieve.

    Returns:
        A list of (docid, score) tuples.
    """
    top = TopQueue(k)
    current_docid = min_docid(postings)

    while current_docid != math.inf:
        score = 0
        next_docid = math.inf

        for posting in postings:
            if posting.docid() == current_docid:
                tf = posting.freqs[posting.pos]
                df = posting.len()

                score += tfidf_score(tf, df, N)

                posting.next()
            if not posting.is_end_list():
                next_docid = min(next_docid, posting.docid())

        top.insert(current_docid, score)
        current_docid = next_docid

    return sorted(top.queue, reverse=True)

### 4.2.2 TAAT with TF-IDF

In [19]:
def taat_tfidf(postings, k=10):
    """
    Perform a term-at-a-time (TAAT) scoring using TF-IDF.

    Args:
        postings: The list of posting list iterators.
        k: The maximum number of documents to retrieve.

    Returns:
        A list of (docid, score) tuples.
    """
    A = defaultdict(float)

    for posting in postings:
        current_docid = posting.docid()

        df = posting.len()

        while current_docid != math.inf:
            tf = posting.freqs[posting.pos]

            score = tfidf_score(tf, df, N)
            A[current_docid] += score

            posting.next()
            current_docid = posting.docid()

    top = TopQueue(k)
    for docid, score in A.items():
        top.insert(docid, score)

    return sorted(top.queue, reverse=True)

## 4.3 Results

In [20]:
@profile
def query_processing(queries_iter, fn):
    """
    Process a list of queries using a scoring function.

    Args:
        queries_iter: The list of queries.
        fn: The scoring function.

    Returns:
        A list of query results.
    """

    res = []
    for q in queries_iter:
        query = preprocess(q.text)
        termids = inv_index.get_termids(query)
        postings = inv_index.get_postings(termids)
        res.append({'query_id': q.query_id, 'scores': fn(postings)})
    return res

In [21]:
print(query_processing(trec_dl_2020.queries_iter(), daat_bm25))

[INFO] [starting] https://msmarco.z22.web.core.windows.net/msmarcoranking/msmarco-test2020-queries.tsv.gz
[INFO] [finished] https://msmarco.z22.web.core.windows.net/msmarcoranking/msmarco-test2020-queries.tsv.gz: [00:00] [4.13kB] [37.4MB/s]


query_processing (324993.125 ms)
[{'query_id': '1030303', 'scores': [(52.67510724707057, 8726436), (45.396526627551914, 8726435), (45.396526627551914, 8726433), (45.396526627551914, 8726429), (41.90065349198796, 8726437), (41.90065349198796, 8726434), (41.90065349198796, 8726430), (41.90065349198796, 7156982), (28.65245040521046, 1305521), (28.65245040521046, 1305520)]}, {'query_id': '1037496', 'scores': [(37.653476406482234, 7766587), (37.653476406482234, 7766585), (37.653476406482234, 5927420), (37.653476406482234, 4760912), (34.31780748978049, 4905511), (34.31780748978049, 4760914), (34.31780748978049, 4174377), (34.31780748978049, 3725937), (26.558278232025287, 4174376), (25.820548281135697, 4174378)]}, {'query_id': '1043135', 'scores': [(51.32286465880493, 8696961), (49.760449633443734, 650642), (48.18462982973179, 3378240), (47.924771719896725, 8696958), (47.07947263594525, 4994428), (46.7362248086448, 650641), (45.601083136591555, 4355523), (45.601083136591555, 2514458), (44.402

In [22]:
bm25_results = query_processing(trec_dl_2020.queries_iter(), taat_bm25)
print(bm25_results)

query_processing (154979.523 ms)
[{'query_id': '1030303', 'scores': [(52.67510724707057, 8726436), (45.396526627551914, 8726435), (45.396526627551914, 8726433), (45.396526627551914, 8726429), (41.90065349198796, 8726437), (41.90065349198796, 8726434), (41.90065349198796, 8726430), (41.90065349198796, 7156982), (28.65245040521046, 1305521), (28.65245040521046, 1305520)]}, {'query_id': '1037496', 'scores': [(37.653476406482234, 7766587), (37.653476406482234, 7766585), (37.653476406482234, 5927420), (37.653476406482234, 4760912), (34.31780748978049, 4905511), (34.31780748978049, 4760914), (34.31780748978049, 4174377), (34.31780748978049, 3725937), (26.558278232025287, 4174376), (25.820548281135697, 3725932)]}, {'query_id': '1043135', 'scores': [(51.32286465880493, 8696961), (49.760449633443734, 650642), (48.18462982973179, 3378240), (47.924771719896725, 8696958), (47.07947263594525, 4994428), (46.7362248086448, 650641), (45.601083136591555, 4355523), (45.601083136591555, 2514458), (44.402

In [23]:
print(query_processing(trec_dl_2020.queries_iter(), daat_tfidf))

query_processing (292270.526 ms)
[{'query_id': '1030303', 'scores': [(55.02647037627441, 8726436), (34.0269168065793, 1305521), (34.0269168065793, 1305520), (31.804719474231995, 6222298), (31.804719474231995, 1305528), (31.50769510154042, 8726435), (31.50769510154042, 8726433), (31.50769510154042, 8726429), (25.94150062983425, 1451846), (25.578602506010316, 336399)]}, {'query_id': '1037496', 'scores': [(27.74199820891621, 4174376), (26.939404349198032, 7766587), (26.939404349198032, 7766585), (26.939404349198032, 5927420), (26.939404349198032, 4760912), (24.397534227972482, 7766583), (24.397534227972482, 4174378), (24.397534227972482, 3725936), (24.397534227972482, 3725932), (24.397534227972482, 2725825)]}, {'query_id': '1043135', 'scores': [(55.13688402675095, 650641), (43.129282141145325, 2664995), (38.896025775386626, 8696961), (38.838289355178404, 8694697), (38.5456765976694, 4721484), (37.984447996610655, 8694701), (37.131394764280344, 2664986), (36.998760486723356, 2443317), (36.

In [24]:
tfidf_results = query_processing(trec_dl_2020.queries_iter(), taat_tfidf)
print(tfidf_results)

query_processing (123126.564 ms)
[{'query_id': '1030303', 'scores': [(55.02647037627441, 8726436), (34.0269168065793, 1305521), (34.0269168065793, 1305520), (31.804719474231995, 6222298), (31.804719474231995, 1305528), (31.50769510154042, 8726435), (31.50769510154042, 8726433), (31.50769510154042, 8726429), (25.94150062983425, 1451846), (25.578602506010316, 336399)]}, {'query_id': '1037496', 'scores': [(27.74199820891621, 4174376), (26.939404349198032, 7766587), (26.939404349198032, 7766585), (26.939404349198032, 5927420), (26.939404349198032, 4760912), (24.397534227972482, 7766583), (24.397534227972482, 4174378), (24.397534227972482, 3725936), (24.397534227972482, 3725932), (24.397534227972482, 2725825)]}, {'query_id': '1043135', 'scores': [(55.13688402675095, 650641), (43.129282141145325, 2664995), (38.896025775386626, 8696961), (38.838289355178404, 8694697), (38.5456765976694, 4721484), (37.984447996610655, 8694701), (37.131394764280344, 2664986), (36.998760486723356, 2443317), (36.

# 5. Evaluation with TREC-style measures
To evaluate retrieval performance, we use the TREC evaluation method with `ir_measures`.

This section generates a run file and QRELs for the TREC evaluation tool.

In [25]:
for query in list(trec_dl_2020.queries_iter())[:3]:
    print(query)

GenericQuery(query_id='1030303', text='who is aziz hashim')
GenericQuery(query_id='1037496', text='who is rep scalise?')
GenericQuery(query_id='1043135', text='who killed nicholas ii of russia')


In [26]:
for ass in list(trec_dl_2020.qrels_iter())[:3]:
  print(ass)

[INFO] [starting] https://trec.nist.gov/data/deep/2020qrels-pass.txt
[INFO] [finished] https://trec.nist.gov/data/deep/2020qrels-pass.txt: [00:00] [219kB] [2.94MB/s]


TrecQrel(query_id='23849', doc_id='1020327', relevance=2, iteration='0')
TrecQrel(query_id='23849', doc_id='1034183', relevance=3, iteration='0')
TrecQrel(query_id='23849', doc_id='1120730', relevance=0, iteration='0')


## 5.1 Run File generation

In [27]:
def generate_run(results):
    trec_run_list = []
    for doc_scores in results:
        rank = 1
        query_id = doc_scores['query_id']
        scores = doc_scores['scores']

        for score, doc_id in scores:
            line = f"{query_id} Q0 {doc_id} {rank} {score} GOODFELLAS"
            trec_run_list.append(line)
            rank += 1

    return trec_run_list

trec_bm25_run_list = generate_run(bm25_results)
trec_tfidf_run_list = generate_run(tfidf_results)

with open("trec_eval_bm25_run_file.txt", "w") as f:
    for line in trec_bm25_run_list:
        f.write(line + "\n")

with open("trec_eval_tfidf_run_file.txt", "w") as f:
    for line in trec_tfidf_run_list:
        f.write(line + "\n")

## 5.2 Qrels File generation

In [28]:
# Create format for Trec_Eval
qrels_file = []
for qrel in trec_dl_2020.qrels_iter():
    line = f"{qrel.query_id} 0 {qrel.doc_id} {qrel.relevance}"
    qrels_file.append(line)

with open("trec_eval_qrels_file.txt", "w") as f:
    for line in qrels_file:
        f.write(line + "\n")

## 5.3 Results

In [29]:
measures = [P@5, P(rel=2)@5, nDCG@10, AP, AP(rel=2), Bpref, Bpref(rel=2), Judged@10]

qrels = ir_measures.read_trec_qrels('trec_eval_qrels_file.txt')
bm25_run = ir_measures.read_trec_run(('trec_eval_bm25_run_file.txt'))
bm25_results = ir_measures.calc_aggregate(measures, qrels, bm25_run)

qrels = ir_measures.read_trec_qrels('trec_eval_qrels_file.txt')
tfidf_run = ir_measures.read_trec_run(('trec_eval_tfidf_run_file.txt'))
tfidf_results = ir_measures.calc_aggregate(measures, qrels, tfidf_run)

In [30]:
import pandas as pd

# Create DataFrame for comparison
df = pd.DataFrame({
    "BM25": bm25_results,
    "TF-IDF": tfidf_results
})

print(df)

                  BM25    TF-IDF
P(rel=2)@5    0.411111  0.314815
Bpref         0.155477  0.110060
P@5           0.614815  0.481481
AP            0.139214  0.097421
Bpref(rel=2)  0.195314  0.148619
AP(rel=2)     0.173051  0.134207
Judged@10     0.929630  0.874074
nDCG@10       0.473446  0.375236


BM25 proves to be a more effective model for information retrieval on the MSMARCO passage dataset than TF-IDF. This suggests that BM25, which accounts for both term frequency and document length normalization, is better suited for ranking documents meaningfully, whereas TF-IDF relies solely on raw term frequency and inverse document frequency, as reflected in the superior performance of BM25 across almost all metrics."

## 5.4 Comparison with PyTerrier

The same set of queries has been evaluated using PyTerrier, which serves as benchmark, and a comparison between the latter and results obtained  through the implementation seen so far has been performed.

In [31]:
from pyterrier.measures import P, nDCG, AP, Judged

dataset = pt.get_dataset('msmarco_passage')

pt.Experiment(
    [pt.terrier.Retriever.from_dataset('msmarco_passage', 'terrier_stemmed', wmodel='TF_IDF'),
     pt.terrier.Retriever.from_dataset('msmarco_passage', 'terrier_stemmed', wmodel='BM25')],
     dataset.get_topics('test-2020'),
     dataset.get_qrels('test-2020'),
     eval_metrics=[P@5, P(rel=2)@5, nDCG@10, AP, AP(rel=2), Judged@10],
)

terrier-assemblies 5.10 jar-with-dependencies not found, downloading to /root/.pyterrier...
Done
terrier-python-helper 0.0.8 jar not found, downloading to /root/.pyterrier...
Done


Java started (triggered by Retriever.from_dataset) and loaded: pyterrier.java, pyterrier.terrier.java [version=5.10 (build: craigm 2024-08-22 17:33), helper_version=0.0.8]


Downloading msmarco_passage index to /root/.pyterrier/corpora/msmarco_passage/index/terrier_stemmed


data.direct.bf:   0%|          | 0.00/486M [00:00<?, ?iB/s]

data.document.fsarrayfile:   0%|          | 0.00/177M [00:00<?, ?iB/s]

data.inverted.bf:   0%|          | 0.00/377M [00:00<?, ?iB/s]

data.lexicon.fsomapfile:   0%|          | 0.00/100M [00:00<?, ?iB/s]

data.lexicon.fsomaphash:   0%|          | 0.00/0.99k [00:00<?, ?iB/s]

data.lexicon.fsomapid:   0%|          | 0.00/4.47M [00:00<?, ?iB/s]

data.meta.idx:   0%|          | 0.00/67.5M [00:00<?, ?iB/s]

data.meta.zdata:   0%|          | 0.00/193M [00:00<?, ?iB/s]

data.properties:   0%|          | 0.00/4.29k [00:00<?, ?iB/s]

md5sums:   0%|          | 0.00/480 [00:00<?, ?iB/s]

Downloading msmarco_passage topics to /root/.pyterrier/corpora/msmarco_passage/msmarco-test2020-queries.tsv.gz


msmarco-test2020-queries.tsv.gz:   0%|          | 0.00/4.03k [00:00<?, ?iB/s]

Downloading msmarco_passage qrels to /root/.pyterrier/corpora/msmarco_passage/2020qrels-docs.txt


2020qrels-docs.txt:   0%|          | 0.00/213k [00:00<?, ?iB/s]

Unnamed: 0,name,P@5,P(rel=2)@5,nDCG@10,AP,AP(rel=2),Judged@10
0,TerrierRetr(TF_IDF),0.625926,0.392593,0.492575,0.358072,0.292548,0.972222
1,TerrierRetr(BM25),0.625926,0.392593,0.493627,0.358724,0.292988,0.972222


Despite PyTerrier’s higher metrics overall, the proposed custom implementation achieved competitive results in certain scenarios:

- **nDCG@10**: The normalized Discounted Cumulative Gain for BM25 in the custom implementation reached 0.473, approaching PyTerrier's 0.493.
- **BM25 Precision (P@5)**: The custom implementation achieved 0.614, only marginally lower than PyTerrier’s 0.625, demonstrating competitive ranking in the top 5 results.
- **Judged@10**: The custom BM25 implementation yielded a score of 0.929, which is relatively close to PyTerrier’s 0.972.

These results highlight the potential of the custom implementation, especially considering it was developed from scratch without the extensive optimization and tuning present in PyTerrier, which for instance outperformed in  **Average Precision** achieving 0.358 (BM25) compared to 0.139 for the custom implementation, emphasizing how fine-tuning and robust indexing can improve retrieval accuracy.