# Information Retrieval project

**Authors:** L.Arduini, D.N.Ghaneh, L.Menchini, C.Petruzzella

## Abstract
This project focuses on developing and evaluating a custom Information Retrieval (IR) pipeline. The MSMARCO Passage dataset undergoes preprocessing steps such as tokenization, stemming, and stopword removal using NLTK.

An inverted index is implemented to store terms, enabling BM25 and TF-IDF ranking through both DAAT and TAAT approaches. Retrieval performance is evaluated using the ir_measures library for TREC-style assessments, leveraging TREC 2020 queries and QRELs. Additionally, PyTerrier is used for benchmarking the retrieval effectiveness.

The goal is to showcase a complete IR workflow, from text normalization to performance comparison with established tools.

**Dataset:** The [MSMARCO Passage dataset](https://ir-datasets.com/msmarco-passage.html), consisting of 8,841,823 real-world web passages from diverse sources, offers substantial variety, making it ideal for training and evaluating retrieval models under realistic conditions.

**Evaluation:** For evaluation the trec-2020-dl dataset has been used.

## Instructions to Run

### Prerequisites
1. Python 3.10 or above.
2. Access to a runtime environment with GPU support (e.g., NVIDIA V28 on Google Colab) for optimal performance.

### Running the project
- Switch the runtime to GPU (e.g., NVIDIA V28) for enhanced performance.

# 0. Setup environment and dependencies
This section ensures that all necessary packages are installed and loaded.

**Note:** The project uses `ir_datasets`, `nltk`, and `ir_measures`, along with several utilities for processing.

In [1]:
!pip install ir_datasets
!pip install nltk
!pip install ir_measures
!pip install PyStemmer
!pip install pandas
!pip install python-terrier
!pip install --upgrade gdown

Collecting ir_datasets
  Downloading ir_datasets-0.5.9-py3-none-any.whl.metadata (12 kB)
Collecting inscriptis>=2.2.0 (from ir_datasets)
  Downloading inscriptis-2.5.0-py3-none-any.whl.metadata (25 kB)
Collecting lxml>=4.5.2 (from ir_datasets)
  Downloading lxml-5.3.0-cp310-cp310-manylinux_2_28_x86_64.whl.metadata (3.8 kB)
Collecting trec-car-tools>=2.5.4 (from ir_datasets)
  Downloading trec_car_tools-2.6-py3-none-any.whl.metadata (640 bytes)
Collecting lz4>=3.1.10 (from ir_datasets)
  Downloading lz4-4.3.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.7 kB)
Collecting warc3-wet>=0.2.3 (from ir_datasets)
  Downloading warc3_wet-0.2.5-py3-none-any.whl.metadata (2.2 kB)
Collecting warc3-wet-clueweb09>=0.2.5 (from ir_datasets)
  Downloading warc3-wet-clueweb09-0.2.5.tar.gz (17 kB)
  Preparing metadata (setup.py) ... [?25l[?25hdone
Collecting zlib-state>=0.1.3 (from ir_datasets)
  Downloading zlib_state-0.1.9-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_6

In [2]:
import ir_datasets
import ir_measures
from ir_measures import *
import random
import re
import string
import nltk
import time
from collections import Counter, defaultdict
from tqdm.auto import tqdm
import gzip
import pickle
import os
import heapq
import math
import pyterrier as pt
from google.colab import drive
import os
import shutil

# 1. Loading the dataset

This notebook will load the MS MARCO Passage dataset, a standard dataset for Information Retrieval tasks.
It contains passages from various sources and is used to train and evaluate retrieval models.

In [1]:
dataset = ir_datasets.load("msmarco-passage")

NameError: name 'ir_datasets' is not defined

Let’s now import some files from a folder saved in Google Drive. Specifically, we will import the following files: `lexicon`, `inverted indexes`, and `document indexes`.

Why are we importing them instead of computing them ourselves by running the code? We’ll see the reasons later, but in short, it’s to save the reader from a long wait.

In [2]:
# Mount Google Drive to access required files
drive.mount('/content/drive')

# URL of the Google Drive repository containing the project files
repository = "1riRemOldrDhvbpnphe1Co8jadQuC4OOs"
repository_name = "ir-project-files"

# Download the specified folder from the repository
!gdown --folder $repository

# Copy the downloaded files from the repository folder to /content/
# This ensures the files are easily accessible during execution
for item in os.listdir(repository_name):
  s = os.path.join(repository_name, item)
  d = os.path.join('/content/', item)
  if os.path.isfile(s):             # Check if the item is a file before copying
    shutil.copy2(s, d)

# Remove the downloaded repository folder to free up space
shutil.rmtree(repository_name)

NameError: name 'drive' is not defined

# 2. Preprocessing text data
This section defines functions for text preprocessing. Preprocessing steps include:
- Lowercasing
- Replacing symbols and punctuations
- Removing stopwords
- Stemming tokens

The goal is to normalize text data for effective retrieval.

In [5]:
from functools import lru_cache
import Stemmer
nltk.download("stopwords", quiet=True)

# ------- Pre Initialization -------
# Compile reusable resources and pre-load common datasets for efficiency
# 1. Regular expression for removing unnecessary dots in acronyms
# 2. Translation table for stripping punctuation
# 3. Set of English stopwords for filtering irrelevant tokens
# 4. Initialize a stemming tool for word normalization

ACRONYM_REGEX = re.compile(r"(?<!\w)\.(?!\d)")                  # Matches dots not part of decimal numbers
PUNCTUATION_TRANS = str.maketrans("", "", string.punctuation)   # Removes punctuation
STOPWORDS = set(nltk.corpus.stopwords.words('english'))         # Load English stopwords
STEMMER = Stemmer.Stemmer('english')                            # Initialize an English stemmer
# ----------------------------------

def preprocess(s):
    """
    Preprocesses an input string for text analysis tasks such as indexing or querying.

    Args:
        s (str): The input string to preprocess.

    Returns:
        list[str]: A list of processed tokens.
    """

    s = s.lower()
    s = s.replace("&", " and ")
    s = s.translate(str.maketrans("‘’´“”–-", "'''\"\"--"))      # Standardize quotes and dashes for uniformity
    s = ACRONYM_REGEX.sub("", s)                                # Remove unnecessary dots in acronyms (but not decimals)
    s = s.translate(PUNCTUATION_TRANS)                          # Remove all punctuation
    s = " ".join(s.split())                                     # Remove extra spaces and strip leading/trailing spaces

    tokens = s.split()
    tokens = [t for t in tokens if t not in STOPWORDS]          # Filter out stopwords
    tokens = STEMMER.stemWords(tokens)                          # Apply stemming to normalize word forms
    return tokens

In [6]:
def profile(f):
    """
    A decorator to measure and print the runtime of a decorated function.

    Args:
        f (callable): The function to be profiled.

    Returns:
        callable: A wrapped version of the original function that prints its runtime after execution.
    """

    def f_timer(*args, **kwargs):
        """
        The wrapped function that measures execution time.

        Args:
            *args: Positional arguments to pass to the original function.
            **kwargs: Keyword arguments to pass to the original function.

        Returns:
            The result of the original function.
        """
        
        start = time.time()

        result = f(*args, **kwargs)     # Execute the original function
        
        end = time.time()
        ms = (end - start) * 1000       # Calculate runtime in milliseconds
        print(f"{f.__name__} ({ms:.3f} ms)")
        return result                   # Return the result of the function

    return f_timer                      # Return the wrapped function

# 3. Building the inverted index
We create an inverted index to store terms with their respective document IDs and term frequencies.
The `build_index` function processes the dataset and constructs a structure that enables efficient term-based searching across documents

In [7]:
@profile
def build_index(dataset):
    """
    Constructs an inverted index from a dataset.

    The function processes documents to build the following components:
    1. Lexicon: Maps terms to term IDs and tracks document frequency (DF) and term frequency (TF).
    2. Inverted Index: Maps term IDs to lists of document IDs and term frequencies.
    3. Document Index: A list of document IDs and their corresponding document lengths.
    4. Index Statistics: A dictionary summarizing the index statistics.

    Args:
        dataset: The dataset to index.

    Returns:
        tuple: A tuple containing:
            - lexicon (dict): Maps terms to [term ID, document frequency, term frequency].
            - inverted_index (dict): Contains:
                - 'docids' (dict): Maps term IDs to lists of document IDs.
                - 'freqs' (dict): Maps term IDs to lists of term frequencies in the documents.
            - document_index (list): A list of tuples (document ID, document length).
            - stats (dict): Contains:
                - 'num_docs': Total number of documents indexed.
                - 'num_terms': Total number of unique terms.
                - 'num_tokens': Total number of tokens across all documents.
    """

    lexicon = {}                # Maps terms to [term ID, document frequency, term frequency]
    doc_index = []              # Stores document IDs and their lengths
    inv_d, inv_f = {}, {}       # Inverted index components: doc IDs and term frequencies
    termid = 0                  # Counter for assigning unique term IDs

    num_docs = 0                # Number of documents processed
    total_dl = 0                # Total length of the documents (in tokens)

    # Iterate over documents in the dataset
    for docid, doc in tqdm(enumerate(dataset.docs_iter()), desc='Indexing', total=dataset.docs_count()):
        tokens = preprocess(doc.text)               # Preprocess document text into tokens
        token_tf = Counter(tokens)                  # Count term frequencies in the document

        # Populate the lexicon and inverted index
        for token, tf in token_tf.items():          # Assign a new term ID if the token is not in the lexicon
            if token not in lexicon:
                lexicon[token] = [termid, 0, 0]
                inv_d[termid], inv_f[termid] =  [], []
                termid += 1
            
            token_id = lexicon[token][0]            # Get the term ID
            inv_d[token_id].append(docid)
            inv_f[token_id].append(tf)
            lexicon[token][1] += 1                  # Increment document frequency for the term
            lexicon[token][2] += tf                 # Increment total term frequency
        
        # Update document index and statistics
        doclen = len(tokens)
        doc_index.append((str(doc.doc_id), doclen)) # Add document ID and length to the index
        total_dl += doclen
        num_docs += 1

    # Build index statistics
    stats = {
        'num_docs': 1 + docid,                      # Total number of documents indexed
        'num_terms': len(lexicon),                  # Total number of unique terms
        'num_tokens': total_dl,                     # Total number of tokens across all documents
    }

    return lexicon, {'docids': inv_d, 'freqs': inv_f}, doc_index, stats

Now that we have reviewed all the necessary code for preprocessing and building the inverted index, it is evident how computationally demanding this process is, particularly given that the selected dataset contains nearly 9 million documents.

Operations such as tokenization, stopword removal, case normalization and stemming or lemmatization, while relatively straightforward, must be executed for every single document. Moreover, building the index requires processing each document to compute the frequency of every term. This involves updating the lexicon with the term frequency (TF) and document frequency (DF) as well as managing the inverted index by appending document IDs and term frequencies for each term in every document.

Given these considerations, we aim to spare the reader of this notebook from waiting approximately 20 minutes for the index to be built. Instead, the files we downloaded earlier from Google Drive (and will now use) are the ones generated at the end of the preprocessing phase and the construction of the inverted index, along with other necessary data structures for IR.

In [8]:
lex, inv, doc, stats = None, None, None, None               # Initialize variables for the index components

files = ['lexicon.pickle.gz', 'inverted_file.pickle.gz', 'document_index.pickle.gz', 'stats.pickle.gz']
if all(os.path.exists(file) for file in files):             # Check if all required files exist
    print("All files already exist.")

    # Iterate over the list of files and their associated variable names
    for file, var_name in zip(files, ['lex', 'inv', 'doc', 'stats']):
        try:
            if os.path.getsize(file) > 0:                   # Ensure the file is not empty
                with gzip.open(file, 'rb') as f:
                    globals()[var_name] = pickle.load(f)    # Load the file into the corresponding variable
            else:
                print(f"Warning: {file} is empty.")
        except EOFError:
            # If the file is corrupted or incomplete, rebuild the index
            print(f"Error: {file} is corrupted or incomplete. Rebuilding the index.")
            lex, inv, doc, stats = build_index(dataset)
            break
else:
    # If any of the files do not exist, rebuild the index
    lex, inv, doc, stats = build_index(dataset)

    # Save the rebuilt index components back into the respective files
    for data, file in zip([lex, inv, doc, stats], files):
      with gzip.open(file, 'wb') as f:
        print(f"Saving {file}...")
        pickle.dump(data, f)                                # Serialize and save the data


All files already exist.


In [9]:
class InvertedIndex:
    """
    A simple inverted index class. Stores term-document mappings for fast retrieval.

    Attributes:
        lexicon (dict): Maps a token to [termID, docFreq, totalTermFreq].
        inv (dict): Contains 'docids' and 'freqs' lists, indexed by termID.
        doc (list): Each element is (doc_id, doc_length).
        stat (dict): Index statistics (e.g., num_docs, num_terms, num_tokens).

    Methods:
        num_docs() -> int
            Returns the total number of indexed documents.
        get_posting(termid: int) -> PostingListIterator
            Returns a posting list iterator for the given termID.
        get_termids(tokens: list[str]) -> list[int]
            Converts tokens to termIDs if found in the lexicon.
        get_postings(termids: list[int]) -> list[PostingListIterator]
            Returns posting list iterators for each termID.
    """

    class PostingListIterator:
        """
        (Inner class) Iterates over the posting list for a single termID.

        Attributes:
            docids (list[int]): Document IDs containing this term.
            freqs (list[int]): Term frequencies in the corresponding docID.
            pos (int): Current index in the posting list.
            doc (list): Reference to the main document index.

        Methods:
            docid() -> int or math.inf
                Returns the current docID or math.inf if finished.
            score() -> float or math.inf
                Returns freq / doc_length or math.inf if finished.
            next(target: int = None) -> None
                Moves forward or jumps to target docID if specified.
            is_end_list() -> bool
                Checks if the iterator has reached the end.
            len() -> int
                Returns the total number of docIDs for this term.
        """

        def __init__(self, docids, freqs, doc):
            """
            Initialize the iterator with document IDs, frequencies, and a reference to the document index.
            """
            self.docids = docids            # List of document IDs where the term appears
            self.freqs = freqs              # List of term frequencies corresponding to each document ID
            self.pos = 0                    # Start position in the posting list
            self.doc = doc                  # Reference to the main document index

        def docid(self):
            """
            Returns the current document ID or math.inf if the end of the list is reached.
            """
            if self.is_end_list():
                return math.inf
            return self.docids[self.pos]

        def score(self):
            """
            Computes the term frequency normalized by the document length for the current position.
            Returns math.inf if the end of the list is reached.
            """
            if self.is_end_list():
                return math.inf
            return self.freqs[self.pos]/self.doc[self.docid()][1]

        def next(self, target = None):
            """
            Advances to the next position in the posting list or jumps to the target document ID.
            """
            if not target:                              # If no target is specified, move to the next position        
                if not self.is_end_list():
                    self.pos += 1
            else:
                if target > self.docid():               # If a target is specified, jump to its position if it exists
                    try:
                        self.pos = self.docids.index(target, self.pos)
                    except ValueError:
                        self.pos = len(self.docids)     # Move to the end if the target is not found

        def is_end_list(self):
            """
            Checks if the iterator has reached the end of the posting list.
            """
            return self.pos == len(self.docids)


        def len(self):
            """
            Returns the total number of document IDs in the posting list.
            """
            return len(self.docids)


    def __init__(self, lex, inv, doc, stats):
        """
        Initialize the inverted index with its components: lexicon, inverted file, document index, and stats.
        """
        self.lexicon = lex          # Lexicon mapping tokens to [termID, docFreq, totalTermFreq]
        self.inv = inv              # Inverted index with 'docids' and 'freqs'
        self.doc = doc              # List of documents with IDs and lengths
        self.stat = stats           # Index statistics (e.g., number of documents, terms, tokens)

    def num_docs(self):
        """
        Returns the total number of indexed documents.
        """
        return self.stats['num_docs']

    def get_posting(self, termid):
        """
        Returns a PostingListIterator for the given term ID.
        """
        return InvertedIndex.PostingListIterator(self.inv['docids'][termid], self.inv['freqs'][termid], self.doc)

    def get_termids(self, tokens):
        """
        Converts a list of tokens to their corresponding term IDs using the lexicon.
        """
        return [self.lexicon[token][0] for token in tokens if token in self.lexicon]

    def get_postings(self, termids):
        """
        Returns a list of PostingListIterators for the given term IDs.
        """
        return [self.get_posting(termid) for termid in termids]

inv_index = InvertedIndex(lex, inv, doc, stats)

# 4. Query processing
This section implements the Query Processing task, aiming to rank documents by relevance to a given query using the BM25 and TF-IDF scoring functions with two different approaches:
- **DAAT (Document-at-a-Time)**: Processes documents sequentially, computing scores for all terms in a document before moving to the next document.
- **TAAT (Term-at-a-Time)**: Processes terms sequentially, scoring all documents for a given term before moving to the next term.

In [3]:
trec_dl_2020 = ir_datasets.load("msmarco-passage/trec-dl-2020")

NameError: name 'ir_datasets' is not defined

In [11]:
class TopQueue:
    """
    A simple top-k priority queue to maintain the top-scoring items.

    This class uses a min-heap to efficiently store and retrieve the top-k
    items based on their scores. Items are tuples of (score, docid).

    Attributes:
        queue (list[tuple[float, int]]): The priority queue storing (score, docid) pairs.
        k (int): The maximum number of items to maintain in the queue.
        threshold (float): The minimum score required for an item to enter the queue.

    Methods:
        size() -> int:
            Returns the current number of items in the queue.
        would_enter(score: float) -> bool:
            Checks if a given score exceeds the threshold and could enter the queue.
        clear(new_threshold: float = None) -> None:
            Clears the queue and optionally sets a new threshold.
        insert(docid: int, score: float) -> bool:
            Attempts to insert an item into the queue. Updates the threshold if needed.
        __repr__() -> str:
            Returns a string representation of the queue.
    """

    def __init__(self, k=10, threshold=0.0):
        """
        Initializes the TopQueue with a maximum size and an optional threshold.
        """
        self.queue = []                     # Initialize an empty priority queue (min-heap)
        self.k = k                          # Maximum number of items to store
        self.threshold = threshold          # Initial score threshold

    def size(self):
        """
        Returns the current number of items in the queue.
        """
        return len(self.queue)

    def would_enter(self, score):
        """
        Checks if a given score exceeds the current threshold and could enter the queue.
        """
        return score > self.threshold

    def clear(self, new_threshold=None):
        """
        Clears all items from the queue and optionally sets a new threshold.
        """
        self.queue = []                     # Empty the queue
        if new_threshold is not None:
            self.threshold = new_threshold  # Update the threshold if provided

    def __repr__(self):
        """
        Returns a string representation of the queue.
        """
        return f'<{self.size()} items, th={self.threshold} {self.queue}>'

    def insert(self, docid, score):
        """
        Attempts to insert an item into the queue. Maintains the top-k items by score.
        """
        if score > self.threshold:
            if self.size() >= self.k:                                   # If the queue is full
                heapq.heapreplace(self.queue, (score, docid))           # Replace the smallest item
            else:
                heapq.heappush(self.queue, (score, docid))              # Add the item to the queue

            if self.size() >= self.k:                                   # Update the threshold if the queue is full
                self.threshold = max(self.threshold, self.queue[0][0])  # The lowest score becomes the threshold
            return True
        return False

## 4.1. BM25

In [44]:
LOG_E_OF_2 = math.log(2)        # Natural logarithm of 2 for base conversion.
LOG_2_OF_E = 1 / LOG_E_OF_2     # Conversion factor for log base-e to base-2.

# Compute average document length and total number of documents from the index.
avg_dl = inv_index.stat['num_tokens'] / inv_index.stat['num_docs']
N = inv_index.stat['num_docs']

def bm25(tf, df, dl, k1=1.2, b=0.75, k3=8, keyFrequency=1):
    """
    Compute the BM25 relevance score for a term in a document.

    Args:
        tf (int):   Term frequency, the count of the term in the document.
        df (int):   Document frequency, the number of documents containing the term.
        dl (float): Document length, the number of tokens in the document.
        k1 (float): Parameter controlling term frequency saturation.
        b (float):  Parameter controlling document length normalization.
        k3 (float): Parameter for query term weighting.
        keyFrequency (int): Frequency of the term in the query.

    Returns:
        float: The BM25 score for the term in the document with respect to the query.
    """
    idf = math.log(1 + (N - df + 0.5) / (df + 0.5)) * LOG_2_OF_E                # IDF weighting
    K = k1 * ((1 - b) + b * (dl / avg_dl))                                      # Document length adjustment
    term_frequency_component = ((k1 + 1) * tf) / (K + tf)                       # TF component
    query_frequency_component = ((k3 + 1) * keyFrequency) / (k3 + keyFrequency) # Query weight
    return idf * term_frequency_component * query_frequency_component

### 4.1.1 DAAT with BM25

In [13]:
# Precompute document lengths
doc_lengths = defaultdict(int)
for docid, doc_len in inv_index.doc:
    doc_lengths[docid] = doc_len

def min_docid(postings):
    """
    Find the smallest document ID among active posting list iterators.

    Args:
        postings (list[PostingListIterator]): Posting list iterators.

    Returns:
        int: The smallest document ID or math.inf if all lists are exhausted.
    """
    
    min_docid = math.inf
    for p in postings:
        if not p.is_end_list():     # Skip completed lists
            min_docid = min(p.docid(), min_docid)
    return min_docid

def daat_bm25(postings, k=10):
    """
    Perform Document-At-A-Time (DAAT) retrieval with BM25 scoring.

    Args:
        postings (list[PostingListIterator]): Posting lists for terms.
        k (int): Number of top results to retrieve.

    Returns:
        list[tuple[int, float]]: Top-k (docid, score) pairs sorted by score.
    """

    top = TopQueue(k)                               # Initialize top-k priority queue
    current_docid = min_docid(postings)             # Start with the smallest document ID

    while current_docid != math.inf:                # Process documents until all posting lists are exhausted
        score = 0
        next_docid = math.inf

        for posting in postings:
            if posting.docid() == current_docid:    # Check if the term is in the current doc
                tf = posting.freqs[posting.pos]
                df = posting.len()
                dl = doc_lengths[current_docid]

                score += bm25(tf, df, dl)

                posting.next()                      # Move to the next term occurrence
            
            if not posting.is_end_list():           # Update the smallest doc ID for next iteration
                next_docid = min(next_docid, posting.docid())

        top.insert(current_docid, score)            # Add the current doc to the top-k queue
        current_docid = next_docid                  # Move to the next document

    return sorted(top.queue, reverse=True)

### 4.1.2 TAAT with BM25

In [14]:
def taat_bm25(postings, k=10):
    """
    Perform Term-At-A-Time (TAAT) retrieval with BM25 scoring.

    Args:
        postings (list[PostingListIterator]): A list of posting list iterators, one for each query term.
        k (int): The maximum number of top documents to retrieve. Default is 10.

    Returns:
        list[tuple[int, float]]: A sorted list of (docid, score) tuples, ordered by score in descending order.
    """
    A = defaultdict(float)                      # Accumulator for document scores

    # Process one term's posting list at a time
    for posting in postings:
        current_docid = posting.docid()
        df = posting.len()                      # Document frequency for the current term

        while current_docid != math.inf:
            tf = posting.freqs[posting.pos]     # Term frequency in the current document
            dl = doc_lengths[current_docid]     # Length of the current document

            score = bm25(tf, df, dl)            # Compute BM25 score for the term-document pair
            A[current_docid] += score

            posting.next()
            current_docid = posting.docid()

    top = TopQueue(k)

    for docid, score in A.items():              # Insert all documents and their scores into the top-k queue
        top.insert(docid, score)

    return sorted(top.queue, reverse=True)

## 4.2 TF-IDF

In [15]:
def tfidf_score(tf, df, dl, keyFrequency=1, k1 = 1.2, b = 0.75):
    """
    Compute the TF-IDF score using a normalized term frequency formulation.

    Args:
        tf (int): Term frequency in the document.
        df (int): Document frequency, the number of documents containing the term.
        dl (float): Document length, the total number of tokens in the document.
        keyFrequency (int): Query term frequency.
        k1 (float): Term frequency saturation parameter.
        b (float): Length normalization parameter.

    Returns:
        float: The TF-IDF score for the term in the document with respect to the query.
    """
    # Compute normalized term frequency
    tf_robertson = k1 * tf / (tf + (k1 * ((1 - b) + ((b * dl) / avg_dl))))
    # Compute inverse document frequency (IDF) with base-2 logarithm
    idf = math.log((N / df) + 1) * LOG_2_OF_E
    
    return tf_robertson * idf * keyFrequency

### 4.2.1 DAAT with TF-IDF

In [16]:
def daat_tfidf(postings, k=10):
    """
    Perform Document-At-A-Time (DAAT) retrieval using TF-IDF scoring.

    Args:
        postings (list[PostingListIterator]): A list of posting list iterators, one for each query term.
        k (int): The maximum number of top documents to retrieve. Default is 10.

    Returns:
        list[tuple[int, float]]: A sorted list of (docid, score) tuples, ordered by score in descending order.
    """

    top = TopQueue(k)                               # Initialize a priority queue for the top-k results
    current_docid = min_docid(postings)             # Start with the smallest document ID across postings

    while current_docid != math.inf:                # Loop until all documents are processed
        score = 0
        next_docid = math.inf

        for posting in postings:
            if posting.docid() == current_docid:    # Check if the term appears in the current document
                tf = posting.freqs[posting.pos]
                df = posting.len()
                dl = doc_lengths[current_docid]

                score += tfidf_score(tf, df, dl)    # Accumulate the TF-IDF score for this document

                posting.next()

            if not posting.is_end_list():           # Update the next smallest document ID
                next_docid = min(next_docid, posting.docid())

        top.insert(current_docid, score)            # Insert the document and its score into the top-k queue
        current_docid = next_docid                  # Move to the next document to be scored

    return sorted(top.queue, reverse=True)

### 4.2.2 TAAT with TF-IDF

In [17]:
def taat_tfidf(postings, k=10):
    """
    Perform Term-At-A-Time (TAAT) retrieval using TF-IDF scoring.

    Args:
        postings (list[PostingListIterator]): A list of posting list iterators, one for each query term.
        k (int): The maximum number of top documents to retrieve.

    Returns:
        list[tuple[int, float]]: A sorted list of (docid, score) tuples, ordered by score in descending order.
    """
    A = defaultdict(float)                      # Accumulator for document scores

    # Process one term's posting list at a time
    for posting in postings:
        current_docid = posting.docid()

        df = posting.len()

        while current_docid != math.inf:
            tf = posting.freqs[posting.pos]
            dl = doc_lengths[current_docid]

            score = tfidf_score(tf, df, dl)      # Compute TF-IDF score
            A[current_docid] += score

            posting.next()
            current_docid = posting.docid()

    top = TopQueue(k)

    for docid, score in A.items():              # Insert all documents and their scores into the top-k queue
        top.insert(docid, score)

    return sorted(top.queue, reverse=True)

## 4.3 Results

In [18]:
@profile
def query_processing(queries_iter, fn):
    """
    Process a list of queries using a specified scoring function.

    Args:
        queries_iter (iterable): An iterable of query objects.
        fn (callable): A scoring function that takes a list of posting list iterators
            and returns a list of (docid, score) tuples.

    Returns:
        list[dict]: A list of results, each containing:
            - `query_id` (int): The ID of the processed query.
            - `scores` (list[tuple[int, float]]): The list of (docid, score) tuples for the query.
    """

    res = []                                # Store the results for each query

    for q in queries_iter:
        query = preprocess(q.text)                  # Preprocess the query text
        termids = inv_index.get_termids(query)      # Map query tokens to term IDs
        postings = inv_index.get_postings(termids)  # Retrieve posting lists for the term IDs
        
        # Compute scores using the provided scoring function and store the result
        res.append({'query_id': q.query_id, 'scores': fn(postings)})

    return res

In [19]:
print(query_processing(trec_dl_2020.queries_iter(), daat_bm25))

[INFO] Please confirm you agree to the MSMARCO data usage agreement found at <http://www.msmarco.org/dataset.aspx>
[INFO] [starting] https://msmarco.z22.web.core.windows.net/msmarcoranking/msmarco-test2020-queries.tsv.gz
[INFO] [finished] https://msmarco.z22.web.core.windows.net/msmarcoranking/msmarco-test2020-queries.tsv.gz: [00:00] [4.13kB] [17.2MB/s]


query_processing (331738.479 ms)
[{'query_id': '1030303', 'scores': [(52.67510724707057, 8726436), (45.396526627551914, 8726435), (45.396526627551914, 8726433), (45.396526627551914, 8726429), (41.90065349198796, 8726437), (41.90065349198796, 8726434), (41.90065349198796, 8726430), (41.90065349198796, 7156982), (28.65245040521046, 1305521), (28.65245040521046, 1305520)]}, {'query_id': '1037496', 'scores': [(37.653476406482234, 7766587), (37.653476406482234, 7766585), (37.653476406482234, 5927420), (37.653476406482234, 4760912), (34.31780748978049, 4905511), (34.31780748978049, 4760914), (34.31780748978049, 4174377), (34.31780748978049, 3725937), (26.558278232025287, 4174376), (25.820548281135697, 4174378)]}, {'query_id': '1043135', 'scores': [(51.32286465880493, 8696961), (49.760449633443734, 650642), (48.18462982973179, 3378240), (47.924771719896725, 8696958), (47.07947263594525, 4994428), (46.7362248086448, 650641), (45.601083136591555, 4355523), (45.601083136591555, 2514458), (44.402

In [45]:
bm25_results = query_processing(trec_dl_2020.queries_iter(), taat_bm25)
print(bm25_results)

query_processing (160219.688 ms)
[{'query_id': '1030303', 'scores': [(68.04124339861062, 8726436), (60.1427866714432, 8726435), (60.1427866714432, 8726433), (60.1427866714432, 8726429), (56.26487434885108, 8726437), (56.26487434885108, 8726434), (56.26487434885108, 8726430), (56.26487434885108, 7156982), (36.8093899826997, 1305521), (36.8093899826997, 1305520)]}, {'query_id': '1037496', 'scores': [(49.78270676861767, 7766587), (49.78270676861767, 7766585), (49.78270676861767, 5927420), (49.78270676861767, 4760912), (46.08250624801771, 4905511), (46.08250624801771, 4760914), (46.08250624801771, 4174377), (46.08250624801771, 3725937), (34.3057350592058, 4174376), (33.526059262405674, 3725932)]}, {'query_id': '1043135', 'scores': [(67.58094550401972, 8696961), (65.84778521905021, 650642), (64.07233051343995, 3378240), (63.811499082452244, 8696958), (62.87382325532231, 4994428), (61.23387105030339, 4355523), (61.23387105030339, 2514458), (60.077594538221234, 650641), (57.61450176736487, 26

In [21]:
print(query_processing(trec_dl_2020.queries_iter(), daat_tfidf))

query_processing (316966.691 ms)
[{'query_id': '1030303', 'scores': [(37.13602502160565, 8726436), (32.82578122640052, 8726435), (32.82578122640052, 8726433), (32.82578122640052, 8726429), (30.70863607555851, 8726437), (30.70863607555851, 8726434), (30.70863607555851, 8726430), (30.70863607555851, 7156982), (20.09607619370668, 1305521), (20.09607619370668, 1305520)]}, {'query_id': '1037496', 'scores': [(27.164761525836614, 7766587), (27.164761525836614, 7766585), (27.164761525836614, 5927420), (27.164761525836614, 4760912), (25.1453728825241, 4905511), (25.1453728825241, 4760914), (25.1453728825241, 4174377), (25.1453728825241, 3725937), (18.72239393334702, 4174376), (18.296884980316406, 4174378)]}, {'query_id': '1043135', 'scores': [(36.87987445266299, 8696961), (35.933761717917605, 650642), (34.965363265314046, 3378240), (34.82276430970873, 8696958), (34.312225997811325, 4994428), (33.41670021144713, 4355523), (33.41670021144713, 2514458), (32.78034666132486, 650641), (31.43642321437

In [22]:
tfidf_results = query_processing(trec_dl_2020.queries_iter(), taat_tfidf)
print(tfidf_results)

query_processing (142372.439 ms)
[{'query_id': '1030303', 'scores': [(37.13602502160565, 8726436), (32.82578122640052, 8726435), (32.82578122640052, 8726433), (32.82578122640052, 8726429), (30.70863607555851, 8726437), (30.70863607555851, 8726434), (30.70863607555851, 8726430), (30.70863607555851, 7156982), (20.09607619370668, 1305521), (20.09607619370668, 1305520)]}, {'query_id': '1037496', 'scores': [(27.164761525836614, 7766587), (27.164761525836614, 7766585), (27.164761525836614, 5927420), (27.164761525836614, 4760912), (25.1453728825241, 4905511), (25.1453728825241, 4760914), (25.1453728825241, 4174377), (25.1453728825241, 3725937), (18.72239393334702, 4174376), (18.296884980316406, 3725932)]}, {'query_id': '1043135', 'scores': [(36.87987445266299, 8696961), (35.933761717917605, 650642), (34.965363265314046, 3378240), (34.82276430970873, 8696958), (34.312225997811325, 4994428), (33.41670021144713, 4355523), (33.41670021144713, 2514458), (32.78034666132486, 650641), (31.43642321437

# 5. Evaluation with TREC-style measures
To evaluate retrieval performance, we use the TREC evaluation method with `ir_measures`.

This section generates a run file and QRELs for the TREC evaluation tool.

In [23]:
for query in list(trec_dl_2020.queries_iter())[:3]:
    print(query)

GenericQuery(query_id='1030303', text='who is aziz hashim')
GenericQuery(query_id='1037496', text='who is rep scalise?')
GenericQuery(query_id='1043135', text='who killed nicholas ii of russia')


In [24]:
for qrel in list(trec_dl_2020.qrels_iter())[:3]:
  print(qrel)

[INFO] [starting] https://trec.nist.gov/data/deep/2020qrels-pass.txt
[INFO] [finished] https://trec.nist.gov/data/deep/2020qrels-pass.txt: [00:00] [219kB] [4.90MB/s]
                                                                              

TrecQrel(query_id='23849', doc_id='1020327', relevance=2, iteration='0')
TrecQrel(query_id='23849', doc_id='1034183', relevance=3, iteration='0')
TrecQrel(query_id='23849', doc_id='1120730', relevance=0, iteration='0')




## 5.1 Run File generation

In [47]:
def generate_run(results):
    """
    Generate a TREC-formatted run list from query results.

    Args:
        results (list[dict]): A list of query results, where each result contains:
            - `query_id` (int): The ID of the query.
            - `scores` (list[tuple[float, int]]): A list of (score, doc_id) tuples.

    Returns:
        list[str]: A list of strings formatted in TREC run format.
    """

    trec_run_list = []                      # List to store TREC-formatted lines

    for doc_scores in results:              # Iterate over each query result
        rank = 1
        query_id = doc_scores['query_id']
        scores = doc_scores['scores']

        for score, doc_id in scores:
            # Format the result as a TREC-compliant line
            line = f"{query_id} Q0 {doc_id} {rank} {score} GOODFELLAS"
            trec_run_list.append(line)
            rank += 1

    return trec_run_list

# Generate TREC-formatted run lists for BM25 and TF-IDF results
trec_bm25_run_list = generate_run(bm25_results)
trec_tfidf_run_list = generate_run(tfidf_results)

# Write the BM25 run list to a TREC-eval compatible file
with open("trec_eval_bm25_run_file.txt", "w") as f:
    for line in trec_bm25_run_list:
        f.write(line + "\n")

# Write the TF-IDF run list to a separate TREC-eval compatible file
with open("trec_eval_tfidf_run_file.txt", "w") as f:
    for line in trec_tfidf_run_list:
        f.write(line + "\n")

## 5.2 Qrels File generation

In [48]:
qrels_file = []     # List to store lines formatted for TREC-Eval qrels

# Iterate over qrels data provided by the TREC DL 2020 dataset
for qrel in trec_dl_2020.qrels_iter():
    # Format the qrel information as per TREC-Eval requirements
    # Format: <query_id> 0 <doc_id> <relevance>
    line = f"{qrel.query_id} 0 {qrel.doc_id} {qrel.relevance}"
    qrels_file.append(line)

# Write the qrels list to a file in TREC-Eval compatible format
with open("trec_eval_qrels_file.txt", "w") as f:
    for line in qrels_file:
        f.write(line + "\n")

## 5.3 Results

In [49]:
# Define evaluation measures for the retrieval models
measures = [
    P@5,              # Precision at rank 5
    P(rel=2)@5,       # Precision at rank 5, considering relevance level >= 2
    nDCG@10,          # Normalized Discounted Cumulative Gain at rank 10
    AP,               # Average Precision
    AP(rel=2),        # Average Precision, considering relevance level >= 2
    Bpref,            # Binary preference
    Bpref(rel=2),     # Binary preference, considering relevance level >= 2
    Judged@10         # Fraction of top 10 documents that were judged
]

# Load qrels (ground truth relevance judgments)
qrels = ir_measures.read_trec_qrels('trec_eval_qrels_file.txt')

# Evaluate BM25 results using the defined measures
bm25_run = ir_measures.read_trec_run('trec_eval_bm25_run_file.txt')
bm25_results = ir_measures.calc_aggregate(measures, qrels, bm25_run)

# Evaluate TF-IDF results using the same qrels and measures
tfidf_run = ir_measures.read_trec_run('trec_eval_tfidf_run_file.txt')
tfidf_results = ir_measures.calc_aggregate(measures, qrels, tfidf_run)

In [50]:
import pandas as pd

# Create a DataFrame to compare BM25 and TF-IDF evaluation metrics
# Each column represents the results for a retrieval model
df = pd.DataFrame({
    "BM25": bm25_results,
    "TF-IDF": tfidf_results
})

print(df)

                  BM25    TF-IDF
AP(rel=2)     0.176679  0.176508
Bpref(rel=2)  0.197055  0.196541
P(rel=2)@5    0.403704  0.403704
AP            0.140604  0.141106
nDCG@10       0.478191  0.478917
Bpref         0.156412  0.156832
Judged@10     0.925926  0.925926
P@5           0.607407  0.607407


BM25 proves to be a more effective model for information retrieval on the MSMARCO passage dataset than TF-IDF. This suggests that BM25, which accounts for both term frequency and document length normalization, is better suited for ranking documents meaningfully, whereas TF-IDF relies solely on raw term frequency and inverse document frequency, as reflected in the superior performance of BM25 across almost all metrics.

## 5.4 Comparison with PyTerrier

The same set of queries has been evaluated using PyTerrier, which serves as benchmark, and a comparison between the latter and results obtained  through the implementation seen so far has been performed.

In [29]:
from pyterrier.measures import P, nDCG, AP, Judged

# Load the MSMARCO Passage Retrieval dataset
dataset = pt.get_dataset('msmarco_passage')

# Run an experiment comparing TF-IDF and BM25 retrieval models
pt.Experiment(
    [
        # TF-IDF retriever from the Terrier index
        pt.terrier.Retriever.from_dataset('msmarco_passage', 'terrier_stemmed', wmodel='TF_IDF'),
        # BM25 retriever from the Terrier index
        pt.terrier.Retriever.from_dataset('msmarco_passage', 'terrier_stemmed', wmodel='BM25'),
    ],
    dataset.get_topics('test-2020'),                                    # Test topics for the experiment
    dataset.get_qrels('test-2020'),                                     # Ground truth relevance judgments
    eval_metrics=[P@5, P(rel=2)@5, nDCG@10, AP, AP(rel=2), Judged@10],  # Evaluation metrics
)

terrier-assemblies 5.10 jar-with-dependencies not found, downloading to /root/.pyterrier...
Done
terrier-python-helper 0.0.8 jar not found, downloading to /root/.pyterrier...
Done


Java started (triggered by Retriever.from_dataset) and loaded: pyterrier.java, pyterrier.terrier.java [version=5.10 (build: craigm 2024-08-22 17:33), helper_version=0.0.8]


Downloading msmarco_passage index to /root/.pyterrier/corpora/msmarco_passage/index/terrier_stemmed


data.direct.bf:   0%|          | 0.00/486M [00:00<?, ?iB/s]

data.document.fsarrayfile:   0%|          | 0.00/177M [00:00<?, ?iB/s]

data.inverted.bf:   0%|          | 0.00/377M [00:00<?, ?iB/s]

data.lexicon.fsomapfile:   0%|          | 0.00/100M [00:00<?, ?iB/s]

data.lexicon.fsomaphash:   0%|          | 0.00/0.99k [00:00<?, ?iB/s]

data.lexicon.fsomapid:   0%|          | 0.00/4.47M [00:00<?, ?iB/s]

data.meta.idx:   0%|          | 0.00/67.5M [00:00<?, ?iB/s]

data.meta.zdata:   0%|          | 0.00/193M [00:00<?, ?iB/s]

data.properties:   0%|          | 0.00/4.29k [00:00<?, ?iB/s]

md5sums:   0%|          | 0.00/480 [00:00<?, ?iB/s]

Downloading msmarco_passage topics to /root/.pyterrier/corpora/msmarco_passage/msmarco-test2020-queries.tsv.gz


msmarco-test2020-queries.tsv.gz:   0%|          | 0.00/4.03k [00:00<?, ?iB/s]

Downloading msmarco_passage qrels to /root/.pyterrier/corpora/msmarco_passage/2020qrels-docs.txt


2020qrels-docs.txt:   0%|          | 0.00/213k [00:00<?, ?iB/s]

Unnamed: 0,name,P@5,P(rel=2)@5,nDCG@10,AP,AP(rel=2),Judged@10
0,TerrierRetr(TF_IDF),0.625926,0.392593,0.492575,0.358072,0.292548,0.972222
1,TerrierRetr(BM25),0.625926,0.392593,0.493627,0.358724,0.292988,0.972222


Despite PyTerrier’s higher metrics overall, the proposed custom implementation achieved competitive results in certain scenarios:

- **nDCG@10**: The normalized Discounted Cumulative Gain for BM25 in the custom implementation reached 0.473, approaching PyTerrier's 0.493.
- **BM25 Precision (P@5)**: The custom implementation achieved 0.614, only marginally lower than PyTerrier’s 0.625, demonstrating competitive ranking in the top 5 results.
- **Judged@10**: The custom BM25 implementation yielded a score of 0.929, which is relatively close to PyTerrier’s 0.972.

These results highlight the potential of the custom implementation, especially considering it was developed from scratch without the extensive optimization and tuning present in PyTerrier, which for instance outperformed in  **Average Precision** achieving 0.358 (BM25) compared to 0.139 for the custom implementation, emphasizing how fine-tuning and robust indexing can improve retrieval accuracy.