# 🔎 Practical Walkthrough: Information Retrieval with pv211-utils

Welcome to this walkthrough notebook for [**pv211-utils**](https://github.com/MIR-MU/pv211-utils), an open-source Python library that provides an object-oriented interface for building and evaluating information retrieval search engines for various text collections. It is designed to help you implement and experiment with concepts from PV211: Introduction to Information Retrieval, a course taught at the Faculty of Informatics, Masaryk University, Brno, Czech Republic.

Code implementations here are mostly linked to concepts from the book [Introduction to Information Retrieval](https://nlp.stanford.edu/IR-book/) by Manning, Raghavan, & Schütze (2008). You can download the full PDF version [HERE](https://nlp.stanford.edu/IR-book/pdf/irbookonlinereading.pdf).

Structure of this notebook:
- Starting off (cloning and setting up the repo)
- Projects from `notebooks` package (Cranfield, ARQMath, CQADupStack, TREC)
- Loading data for projects with the `datasets` module
- Preprocessing text data with tools from `preprocessing` package
- Information Retrieval Systems from `systems` package


Dear user, when building this notebook, I intended to make something I wish I had when I started this course. It contains descriptions and demonstrations of all functionality included in the PV-211 repository, so you can see how it works just by running the cells. I hope you will find it at least somewhat helpful :)




# Starting off





## How to access the repository?

You can find the official repository here:  

🔗 **GitHub:** [https://github.com/MIR-MU/pv211-utils](https://github.com/MIR-MU/pv211-utils)


Clone it to your local machine with:

```bash
git clone https://github.com/MIR-MU/pv211-utils.git
```


**(!)** Or... you can find other options for cloning the repo (SSH/KRB5) by following the link above.

## Look around the repo: here are some useful files


- ###  `README.md`

    This is the main documentation file for the repo. It contains a short description, and...

    **(!)** Here you can find direct links to every project, both in Google Colab and JupyterHub:
    - First Term Project: Cranfield Collection (23.24% MAP score)

    - Second Term Project: Beir CQADupStack Collection (21.96% MAP score)

    - Alternative Second Term Project: ARQMath Collection (6.62% MAP score)

    - Pre-2023 Second Term Project: TREC Collection (43.06% MAP score)


- ### `requirements.txt`

    Here you can find all the packages used in the repository.  
    To install all dependencies, you can run:

    ```bash
    pip install -r requirements.txt
    ```

    For more details, check the ```README.md``` in the repository root.


- ### `setup.py`

    This script
    - defines how the `pv211-utils` package is built, installed, and distributed;
    - specifies the package metadata (name, version, author);
    - includes optional extras for notebooks or Google Drive downloads

    **(!)** This is an alternative way to **install all dependencies**!
    
    - Just run `pip install .` in the repository root. Then `setup.py` will install everything needed for the library.



- ### `spreadsheet`

    These are the systems that are responsible for the project leaderboards. The leaderboards are linked to the spreadsheets that you can access from IS MUNI.
    Here you can find:
    - **README.md**: includes archived leaderboards from previous years
    - **dynamic-sorting.gs**: a Google Apps Script that keeps the leaderboard spreadsheets updated and sorted.




In [None]:
# setting up pv211-utils from github in google colab (skip if running locally/jupyterhub)
!git clone -q https://github.com/MIR-MU/pv211-utils.git && \
  cd pv211-utils && \
  pip -q install -e .

# Building Blocks of `pv211-utils`: Documents, Queries, Systems, Judgements

## Documents & Queries

There are several objects that are used throughout the repository to work with data. Let us not only look at their structure, but also answer the question: why do we need them at all?

To answer this question, let us demonstrate an extremely simple Information Retrieval system.
Our IR system should help us with a problem: we want to find some kind of text or article by entering a natural language query.

In order for this to make sense, we need:
 1) a collection of text chunks or articles (from now on, we generalize these to "documents")
 2) our query: what do we want to get?

Every other IR system that we will build needs these too. That's why there are reusable classes for them in this repo:
- `DocumentBase` entity contains some kind of document, which is basically a text chunk (article, sentence, answer)
- `QueryBase` entity contains a user query, which will be fed to our IR systems.

Many dataset-specific classes (e.g. `CranfieldDocument`, `ArqmathAnswer`, `BeirQuery`, etc.) inherit from these base classes to unify the way we interact with different collections.
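
To illustrate the idea of a shared base class, here is a minimal sketch that does not use the library itself. `Document`, `Query`, and `ForumAnswer` are hypothetical stand-ins, not the actual `pv211_utils` classes; only the attribute names `document_id`, `query_id`, and `body` are taken from the examples in this notebook.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    """Stand-in for a base document: an ID plus the text body."""
    document_id: str
    body: str


@dataclass(frozen=True)
class Query:
    """Stand-in for a base query: an ID plus the query text."""
    query_id: str
    body: str


@dataclass(frozen=True)
class ForumAnswer(Document):
    """Hypothetical dataset-specific document with one extra field."""
    upvotes: int = 0


def describe(document: Document) -> str:
    """Works for any subclass, because it relies only on the shared attributes."""
    return f"{document.document_id}: {document.body}"


answer = ForumAnswer("a1", "Use the chain rule.", upvotes=3)
print(describe(answer))
```

Code written against the base class keeps working when a new collection adds its own fields, which is the point of the shared interface.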

Let us define a small in-memory collection of documents. Each `DocumentBase` instance must have an ID and its content as attributes.

In [None]:
from collections import OrderedDict
from pv211_utils.entities import DocumentBase, QueryBase

# first we'll make an ordered dictionary of DocumentBase instances
docs = OrderedDict({
    "doc1": DocumentBase("doc1", "Cats are beautiful, playful and curious."),
    "doc2": DocumentBase("doc2", "Dogs are loyal and protect their owners."),
    "doc3": DocumentBase("doc3", "Birds live in trees and fly in the sky."),
})

doc = docs["doc2"]
print(f"Document with id \"{doc.document_id}\": \n{doc.body}")

Document with id "doc2": 
Dogs are loyal and protect their owners.


Now let's define the query with the help of `QueryBase`: what do we want to get?

In [None]:
# defining a query
query = QueryBase("q1", "animals that can fly")

print(f"Query with id \"{query.query_id}\": \"{query.body}\"")

Query with id "q1": "animals that can fly"


## IRSystem & DocPreprocessing

Nice! And now the most important part: what do we need in order to return documents that are relevant to the given query?

Right, an IR system itself.

All retrieval systems inherit from an abstract base class `IRSystem`, but they use different methods under the hood to retrieve documents.  
They all live in the `systems` package and are very simple to use, as they follow the same interface and share the `.search` method, which retrieves results relevant to a query.

**BUT** before passing our text data to any of the systems, we must apply preprocessing in order to normalize, clean, and tokenize the input, both for documents and queries.

For this purpose, we can use classes from the `preprocessing` package, such as `SimpleDocPreprocessing` or a customizable `DocPreprocessing` class.

In [None]:
from pv211_utils.preprocessing import SimpleDocPreprocessing
from pv211_utils.systems.bow import BoWSystem  # or tfidf, bm25, etc.

# initializing a simple preprocessor
preprocessor = SimpleDocPreprocessing()

# initializing a retrieval system (BoW here, but you can try TF-IDF, BM25, Retriever...)
system = BoWSystem(documents=docs, preprocessing=preprocessor)

# performing the search
results = list(system.search(query))

# printing results
for i, doc in enumerate(results):
    print(f"doc #{i+1}: {doc.document_id}: {doc.body}")


Building the dictionary: 100%|██████████| 3/3 [00:00<00:00, 2971.87it/s]
Building the index: 100%|██████████| 3/3 [00:00<?, ?it/s]

doc #1: doc3: Birds live in trees and fly in the sky.
doc #2: doc1: Cats are beautiful, playful and curious.
doc #3: doc2: Dogs are loyal and protect their owners.
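
To see why `doc3` comes out on top, it can help to write a toy ranker by hand. The sketch below is not how `BoWSystem` is implemented: `tokenize` and `OverlapSystem` are hypothetical names, and the scoring (number of shared terms) is a deliberately crude stand-in for a real bag-of-words model. It only mirrors the workflow described above: preprocess the collection once, then call `search` with a query.

```python
import re
from collections import OrderedDict
from typing import Dict, Iterator, List, Tuple


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class OverlapSystem:
    """Toy ranker: scores a document by how many query terms it contains."""

    def __init__(self, documents: Dict[str, str]) -> None:
        # Preprocess every document once, at indexing time.
        self.index = OrderedDict(
            (doc_id, set(tokenize(body))) for doc_id, body in documents.items()
        )

    def search(self, query: str) -> Iterator[Tuple[str, int]]:
        query_terms = set(tokenize(query))
        scored = [
            (doc_id, len(query_terms & terms)) for doc_id, terms in self.index.items()
        ]
        # sorted() is stable, so ties keep their original collection order.
        yield from sorted(scored, key=lambda pair: pair[1], reverse=True)


toy_docs = OrderedDict([
    ("doc1", "Cats are beautiful, playful and curious."),
    ("doc2", "Dogs are loyal and protect their owners."),
    ("doc3", "Birds live in trees and fly in the sky."),
])

ranking = list(OverlapSystem(toy_docs).search("animals that can fly"))
print(ranking)  # doc3 is the only document sharing a term ("fly") with the query
```

Note that the other two documents score zero and are ordered only by their position in the collection, so the tail of such a ranking carries no information.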





## Judgements

Just now we retrieved documents for a query and printed them in ranked order.

However, to evaluate an IR system, we must answer a crucial question: Which documents are actually relevant to the query?

The answer might simply be Yes/No, but it can also be a grade from 1 to 5 or another value representing confidence.

In any case, to store this information about each document, we use **judgements**, incorporated in the `JudgementBase` class:

In [None]:
# creating a tiny set of relevance judgements manually
judgements = {
    (query, docs["doc1"]),
    (query, docs["doc3"]),
}

# performing retrieval
results = list(system.search(query))

print("ranking with relevance labels:\n")

for rank, doc in enumerate(results, start=1):
    is_relevant = any(
        j[1].document_id == doc.document_id for j in judgements
    )
    label = "relevant" if is_relevant else "not relevant"

    print(f"rank {rank}: {doc.document_id} → {label}")


ranking with relevance labels:

rank 1: doc3 → relevant
rank 2: doc1 → relevant
rank 3: doc2 → not relevant
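
With binary judgements in hand, a ranking can be turned into numbers. The following sketch computes precision at k and average precision for the toy ranking above, using plain document IDs instead of library objects. `precision_at_k` and `average_precision` are hypothetical helper names, not functions from `pv211-utils`.

```python
from typing import List, Set


def precision_at_k(ranking: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k results that are relevant."""
    top_k = ranking[:k]
    return sum(doc_id in relevant for doc_id in top_k) / k


def average_precision(ranking: List[str], relevant: Set[str]) -> float:
    """Mean of precision@k over the ranks k where a relevant document appears."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


relevant_ids = {"doc1", "doc3"}

good_ranking = ["doc3", "doc1", "doc2"]   # the ranking printed above
worse_ranking = ["doc2", "doc3", "doc1"]  # same documents, relevant ones ranked lower

print(precision_at_k(good_ranking, relevant_ids, k=2))  # 1.0
print(average_precision(good_ranking, relevant_ids))    # 1.0
print(average_precision(worse_ranking, relevant_ids))   # (1/2 + 2/3) / 2, about 0.583
```

Both rankings contain the same documents, yet average precision rewards the one that places relevant documents earlier.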


# Projects from `notebooks` package (Cranfield, ARQMath, CQADupStack, TREC)

**In the `notebooks` folder you can find `.ipynb` files with the projects which are part of the PV211 course.**

**Notebooks with the projects contain simple, baseline solutions. Here are the projects and the current scores:**


- **First Term Project: Cranfield Collection** (23.24% MAP score) -> open in [Google Colab](https://colab.research.google.com/github/MIR-MU/pv211-utils/blob/spring2025/notebooks/cranfield.ipynb) or [JupyterHub](https://iirhub.cloud.e-infra.cz/)

- **Second Term Project: Beir CQADupStack Collection** (21.96% MAP score) -> open in [Google Colab](https://colab.research.google.com/github/MIR-MU/pv211-utils/blob/main/notebooks/beir_cqadupstack.ipynb) or [JupyterHub](https://iirhub.cloud.e-infra.cz/)

- **Alternative Second Term Project: ARQMath Collection** (6.62% MAP score) -> open in [Google Colab](https://colab.research.google.com/github/MIR-MU/pv211-utils/blob/main/notebooks/arqmath.ipynb) or [JupyterHub](https://iirhub.cloud.e-infra.cz/)

- **Pre-2023 Second Term Project: TREC Collection** (43.06% MAP score) -> open in [Google Colab](https://colab.research.google.com/github/MIR-MU/pv211-utils/blob/main/notebooks/trec.ipynb) or [JupyterHub](https://iirhub.cloud.e-infra.cz/)


⚠️ When working on the projects, don't forget to [keep the code clean](#recommended-resources-to-work-on-the-projects) and upload the `.ipynb` file to the Home Vault in IS MU.

P.S. More detailed instructions are provided in the notebooks (links above)

## Recommended resources to work on the projects



---
- **Documentation and coding style frameworks**

    - **[PEP 8](https://www.python.org/dev/peps/pep-0008/)** – the official Python style guide for consistent code formatting, naming, indentation, line lengths, etc.

    - **[PEP 257](https://www.python.org/dev/peps/pep-0257/)** – conventions for writing clear, concise, and structured Python docstrings.

    - **[NumPy Docstring Standard](https://numpydoc.readthedocs.io/en/latest/format.html#docstring-standard)** – a widely adopted style for writing detailed, sectioned docstrings, especially in scientific and data projects.

    - **[Google Python Style Guide](https://google.github.io/styleguide/pyguide.html)** – an alternative coding and docstring style guide, simpler than NumPy style but still structured.

    - **[reStructuredText (reST)](https://docutils.sourceforge.io/rst.html)** – the markup syntax used in many Python docstrings to format text with headings, lists, links, etc.



---

- **Notebook environments you may use:**

    - **[Google Colab](https://colab.research.google.com/)** – free, cloud-hosted Jupyter notebooks with GPU/TPU support.

    - **[DeepNote](https://deepnote.com/)** – collaborative cloud notebooks with a modern UI.

    - **[JupyterHub](https://iirhub.cloud.e-infra.cz/)** – preconfigured with course resources, allowing you to run pv211-utils notebooks in a ready-made environment.

    - **[Kaggle Notebooks](https://www.kaggle.com/code)** – cloud notebooks with integrated datasets.


---

- **(!)** If you prefer working on your code locally, here are some useful links:

    - **[JupyterLab](https://jupyter.org/)** – interface for running Jupyter notebooks locally, with tabs, terminals, and some really nice extensions.

    - **[VS Code Notebooks](https://code.visualstudio.com/docs/datascience/jupyter-notebooks)** – run, edit, and debug notebooks directly in VS Code, with integrated linting and formatting.

- ...and some tools to keep your code clean when working locally:

    - [flake8](https://flake8.pycqa.org/en/latest/) – checks your code for style issues and simple bugs.

    - [pydocstyle](https://www.pydocstyle.org/en/stable/) – checks that your docstrings follow good conventions.

    - [black](https://black.readthedocs.io/en/stable/) – auto-formats your code so it always looks neat.

---

## Cranfield Collection


Your task is to implement an unsupervised ranked retrieval system, which will produce a list of documents from the Cranfield collection in descending order of relevance to a query.

You are going to work with the **Cranfield Collection**, arguably the most famous and fundamental test collection in the field of Information Retrieval. It was developed in the United Kingdom at the Cranfield Institute of Technology in the late 1950s and early 1960s. In fact, it was the first collection designed to systematically evaluate information retrieval systems.

It contains:
- **1398 abstracts** of journal articles about **aerodynamics and engineering**.
- **225 queries** crafted to test the IR systems
- **Relevance judgements**  that specify, for every query-document pair, whether the document is relevant to the query.

    (!) Cranfield judgments are binary: either a document is relevant (included in the judgements) or not relevant (absent from the set of judgements).
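
One classic unsupervised baseline for this kind of task is TF-IDF weighting with cosine similarity (Manning et al., Chapter 6). The sketch below is a from-scratch illustration on three made-up mini-abstracts, not the baseline from the project notebook; all function names and the example texts are assumptions made for the demo.

```python
import math
from collections import Counter
from typing import Dict, List


def whitespace_tokenize(text: str) -> List[str]:
    """Lowercase and split on whitespace (no stemming or stop-word removal)."""
    return text.lower().split()


def tfidf_vector(tokens: List[str], idf: Dict[str, float]) -> Dict[str, float]:
    """Weight each known term by raw term frequency times inverse document frequency."""
    counts = Counter(tokens)
    return {term: tf * idf[term] for term, tf in counts.items() if term in idf}


def cosine(u: Dict[str, float], v: Dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors stored as dictionaries."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


corpus = {
    "d1": "boundary layer flow over a flat plate",
    "d2": "heat transfer in supersonic flow",
    "d3": "vibration of a thin cylindrical shell",
}

tokenized = {doc_id: whitespace_tokenize(text) for doc_id, text in corpus.items()}

# idf(t) = log(N / df(t)), where df(t) is the number of documents containing t
n_docs = len(tokenized)
doc_freq = Counter(term for tokens in tokenized.values() for term in set(tokens))
idf = {term: math.log(n_docs / df) for term, df in doc_freq.items()}

doc_vectors = {doc_id: tfidf_vector(tokens, idf) for doc_id, tokens in tokenized.items()}
query_vector = tfidf_vector(whitespace_tokenize("shell vibration"), idf)

scores = {doc_id: cosine(query_vector, vec) for doc_id, vec in doc_vectors.items()}
best = max(scores, key=scores.get)
print(best, scores)
```

Query terms that never occur in the collection are simply dropped here, which is one of several design choices a real system would need to revisit.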



**-> The Cranfield Collection** was described in the book by Manning et al. in [Section 8.2](https://nlp.stanford.edu/IR-book/pdf/08eval.pdf). In this section you can also read about the evaluation process of IR systems.

**->** You can work on this project using any tool mentioned above. Here are direct links to [Google Colab](https://colab.research.google.com/github/MIR-MU/pv211-utils/blob/spring2025/notebooks/cranfield.ipynb) and [JupyterHub](https://iirhub.cloud.e-infra.cz/)

**->** If you have any trouble loading the collection, visit the section describing how to use the `datasets` module.

**Let's take a look at the data from the Cranfield collection just to see how it all works... For a deeper demo, check out the notebook directly :)**

In [None]:
from pv211_utils.datasets import CranfieldDataset

# Initializing the dataset and setting the test and validation split
cranfield = CranfieldDataset(test_split_size=0.2)
cranfield.set_validation_split_size(0.1)
print(f"Test split size: {cranfield.test_split_size}")
print(f"Validation split size: {cranfield.validation_split_size}")

Test split size: 0.2
Validation split size: 0.1


In [None]:
# Checking the number of documents and queries
# Loading documents, queries, and judgements
documents = cranfield.load_documents()
queries = cranfield.load_train_queries()
judgements = list(cranfield.load_train_judgements())

print(f"Total documents: {len(documents)}")
print(f"Total train queries: {len(queries)}")
print(f"Total train judgments: {len(judgements)}")

Total documents: 1400
Total train queries: 162
Total train judgments: 1340
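
The 162 queries returned by `load_train_queries()` above are consistent with a two-stage split of the 225 Cranfield queries. The arithmetic below is a sketch under that assumption (test queries are removed first, then the validation share is taken from what remains); the exact rounding and shuffling used by `CranfieldDataset` are not shown here.

```python
total_queries = 225          # size of the Cranfield query set
test_split_size = 0.2        # fraction held out for testing
validation_split_size = 0.1  # fraction of the remaining queries held out for validation

test_queries = round(total_queries * test_split_size)
remaining = total_queries - test_queries
validation_queries = round(remaining * validation_split_size)
train_queries = remaining - validation_queries

print(test_queries, validation_queries, train_queries)  # 45 18 162
```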


In [None]:
# accessing the documents and queries
# pick a judgment (enter an index like 0,1,2, 5 etc.)
index = 21
query, document = judgements[index]

# printing the given judgement, query and the document
print(f"Let's see what we have here...\n")
print(f"Picked judgment index: {index}\n")
print(f"Query ID: {query.query_id}")
print(f"Query text:\n{query.body.strip()}\n")
print(f"Document ID: {document.document_id}")
print(f"Document body:\n...{document.body.strip()[:500]}...")

# this document should be relevant for the given query - check it yourself :)

Let's see what we have here...

Picked judgment index: 21

Query ID: 106
Query text:
experimental techniques in shell vibration .

Document ID: 764
Document body:
...breathing vibrations of a circular shell with an internal liquid .   resonant breathing frequencies and mode shapes are determined experimentally for a thin-walled, circular cylindrical shell containing a nonviscous incompressible liquid .  the resonant frequencies determined for the full shell are in good agreement with those predicted by reissner's shallow-shell vibration theory with the inclusion of an apparent-mass term for the liquid .  the effect of the internal liquid on the shell mode sh...


Now that you have seen the data with your own eyes, it's time to go to the notebook and build a nice Information Retrieval system on top of it.

## CQADupStack Collection

Your task is to implement an unsupervised ranked retrieval system, which will produce a list of documents from the **CQADupStack Collection** in a descending order of relevance to a query.

You are going to work with the **CQADupStack Collection**, a benchmark dataset for community question-answering research. It is part of the larger [BEIR benchmark collection](https://github.com/beir-cellar/beir), and was introduced by Hoogeveen et al. (2015) as "[a] Benchmark Data Set for Community Question-Answering Research."

CQADupStack contains data from **12 different [Stack Exchange](https://stackexchange.com/) subforums**, each consisting of real-world user questions and corresponding answers.
It contains:
- **Community QA pairs** from topics like Android, English, Gaming, GIS, Mathematica, Physics, Programmers, Stats, TeX, Unix, Webmasters, and WordPress.
- **Queries** are real user questions from Stack Exchange forums.
- **Documents** are candidate duplicate questions and answers within the same subforum.
- **Relevance judgements** specify which documents (questions/answers) are considered duplicates of a given query.

    (!) CQADupStack judgments are **binary**: either a document is considered a duplicate (relevant) or it is not.

Your tasks, reviewed by your colleagues and course instructors, are the following:
- Implement a ranked retrieval system that produces a list of documents from CQADupStack in descending order of relevance to each query, as described in [Manning et al., Chapter 6](https://nlp.stanford.edu/IR-book/pdf/06vect.pdf).
- Document your code in accordance with [PEP 257](https://peps.python.org/pep-0257/), ideally using [the NumPy style guide](https://numpydoc.readthedocs.io/en/latest/format.html#docstring-standard). Stick to a consistent coding style following [PEP 8](https://peps.python.org/pep-0008/).
- Reach at least **25% mean average precision at 10** (MAP@10) on the CQADupStack collection, as described in [Manning et al., Section 8.4](https://nlp.stanford.edu/IR-book/pdf/08eval.pdf).
- Upload your `.ipynb` notebook to the homework vault in IS MU. Optionally, include a brief description of your retrieval system and a link to an external service such as Google Colaboratory, DeepNote, or JupyterHub.
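
For orientation, the sketch below shows one common way to compute the MAP@10 score mentioned in the task list. It is an illustration only: `average_precision_at_k` and `mean_average_precision_at_k` are hypothetical names, the IDs are made up, and normalization conventions differ between tools (here AP@k is divided by `min(k, number of relevant documents)`), so the course's own evaluation code may produce slightly different numbers.

```python
from typing import Dict, List, Set


def average_precision_at_k(ranking: List[str], relevant: Set[str], k: int = 10) -> float:
    """AP over the top-k results; documents without a judgement count as non-relevant."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranking[:k], start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / min(k, len(relevant))


def mean_average_precision_at_k(
    rankings: Dict[str, List[str]],
    judgements: Dict[str, Set[str]],
    k: int = 10,
) -> float:
    """Average AP@k over all queries that have a ranking."""
    per_query = [
        average_precision_at_k(ranking, judgements.get(query_id, set()), k)
        for query_id, ranking in rankings.items()
    ]
    return sum(per_query) / len(per_query) if per_query else 0.0


toy_rankings = {
    "q1": ["d3", "d7", "d1"],  # relevant documents at ranks 1 and 3
    "q2": ["d2", "d5", "d9"],  # relevant document at rank 2
}
toy_judgements = {
    "q1": {"d3", "d1"},
    "q2": {"d5"},
}

map_at_10 = mean_average_precision_at_k(toy_rankings, toy_judgements)
print(map_at_10)  # (5/6 + 1/2) / 2, about 0.667
```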

**-> The CQADupStack dataset is described in detail in [Hoogeveen et al., 2015](https://dl.acm.org/doi/10.1145/2838931.2838934), and is included in the BEIR benchmark described on the [BEIR GitHub page](https://github.com/beir-cellar/beir).**

**->** You can work on this project using any tool mentioned above. Here are direct links to [Google Colab](https://colab.research.google.com/github/MIR-MU/pv211-utils/blob/main/notebooks/beir_cqadupstack.ipynb) and [JupyterHub](https://iirhub.cloud.e-infra.cz/)

**->** If you have any trouble loading the collection, visit the section describing how to use the `datasets` module.

**Let's take a look at the data from CQADupStack just to see how it all works... For a deeper demo, check out the notebook directly :)**


In [None]:
from pv211_utils.datasets import BeirDataset

# initializing the BEIR dataset; replace with any available BEIR dataset name
dataset_name = "quora"  # e.g., msmarco, hotpotqa, nq, fever, etc.
beir = BeirDataset(dataset_name=dataset_name)

print(f"Loaded BEIR dataset: {beir.dataset_name}")

Loaded BEIR dataset: quora


In [None]:
# loading documents, queries, and judgments
# NOTICE: some datasets may not have train sets, so better call for test sets
documents = beir.load_documents()
queries = beir.load_test_queries()
judgements = list(beir.load_test_judgements())

print(f"Total documents: {len(documents)}")
print(f"Total test queries: {len(queries)}")
print(f"Total test judgments: {len(judgements)}")

  0%|          | 0/522931 [00:00<?, ?it/s]

  0%|          | 0/522931 [00:00<?, ?it/s]

  0%|          | 0/522931 [00:00<?, ?it/s]

Total documents: 522931
Total test queries: 10000
Total test judgments: 15675


In [None]:
# accessing the documents and queries
# pick a judgment (enter an index like 0,1,2,5 etc.)
index = 211
query, document = judgements[index]

# cleaning the document body for better readability
import re
cleaned_doc_body = re.sub(r'\s+', ' ', document.body).strip()

print(f"Let's see what we have here...\n")
print(f"Picked judgment index: {index}\n")
print(f"Query ID: {query.query_id}")
print(f"Query text:\n{query.body.strip()}\n")
print(f"Document ID: {document.document_id}")
print(f"Document body:\n...{cleaned_doc_body}...")


# Reminder: this document is judged relevant to the query, so check whether it makes sense!

Let's see what we have here...

Picked judgment index: 211

Query ID: 126999
Query text:
What is the outcome of a $15 minimum wage in the U.S.?

Document ID: 342874
Document body:
...What does minimum wage increasing effect to economy and is it a good thing to do?...


## ARQMath Collection

Your task is to implement a supervised ranked retrieval system, which will produce a list of documents from the **ARQMath Collection** in a descending order of relevance to a query.

You are going to work with the **ARQMath Collection**, a benchmark dataset focused on mathematical question answering. ARQMath uses real threads from the Math Stack Exchange site and aims to advance math-aware search systems capable of understanding both **text** and **mathematical notation** in questions and answers.

According to [Zanibbi et al., 2020](https://ceur-ws.org/Vol-2696/paper_271.pdf), ~20% of math-related queries submitted to general-purpose search engines are well-formed questions, highlighting the real need for effective math question answering systems.

ARQMath contains:
- **Questions** posted on Math Stack Exchange.
- **Answers** submitted by the community, forming candidate documents.
- **Queries**, which are real math questions used to evaluate answer retrieval systems.
- **Relevance judgements**, specifying which answers are relevant to each query.

    (!) ARQMath judgments use **graded relevance**, with relevance levels 0 (not relevant), 1 (partially relevant), or 2 (highly relevant). You should use training and validation judgments for building your system, and test judgments only for evaluation.
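
Metrics such as MAP expect binary relevance, so graded labels are often collapsed with a threshold before evaluation. The sketch below shows the idea on made-up data; the IDs, the layout of the `graded` dictionary, and the thresholds are assumptions for illustration, not how `ArqmathDataset` stores or binarizes its judgements.

```python
from typing import Dict, Set, Tuple

# (query_id, answer_id) -> graded relevance, using the 0/1/2 scale described above
graded: Dict[Tuple[str, str], int] = {
    ("q1", "a10"): 2,
    ("q1", "a11"): 1,
    ("q1", "a12"): 0,
    ("q2", "a20"): 0,
    ("q2", "a21"): 2,
}


def binarize(
    graded_judgements: Dict[Tuple[str, str], int],
    threshold: int = 1,
) -> Set[Tuple[str, str]]:
    """Keep only the (query, answer) pairs whose grade reaches the threshold."""
    return {pair for pair, grade in graded_judgements.items() if grade >= threshold}


lenient = binarize(graded, threshold=1)  # partially relevant answers count as relevant
strict = binarize(graded, threshold=2)   # only highly relevant answers count

print(sorted(lenient))
print(sorted(strict))
```

The choice of threshold changes which answers count as hits, so it is worth stating it explicitly whenever you report a score.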

 ![Answer Retrieval Task](https://www.cs.rit.edu/~dprl/ARQMath/assets/images/screen-shot-2019-09-09-at-11.11.57-pm-2656x1229.png)

Your tasks, reviewed by your colleagues and course instructors, are the following:
- Implement a supervised ranked retrieval system that produces a ranked list of answers for each query, as described in [Manning et al., Chapter 15](https://nlp.stanford.edu/IR-book/pdf/15learn.pdf).
- Document your code in accordance with [PEP 257](https://peps.python.org/pep-0257/), ideally using [the NumPy style guide](https://numpydoc.readthedocs.io/en/latest/format.html#docstring-standard). Stick to a consistent coding style following [PEP 8](https://peps.python.org/pep-0008/).
- Reach at least **10% mean average precision at 10** (MAP@10) on the ARQMath collection, as discussed in [Manning et al., Section 8.4](https://nlp.stanford.edu/IR-book/pdf/08eval.pdf).
- You are encouraged to use advanced IR techniques such as tokenization ([Section 2.2](https://nlp.stanford.edu/IR-book/pdf/02bool.pdf)), document representation ([Section 6.4](https://nlp.stanford.edu/IR-book/pdf/06vect.pdf)), tolerant retrieval ([Chapter 3](https://nlp.stanford.edu/IR-book/pdf/03edit.pdf)), relevance feedback, query expansion ([Chapter 9](https://nlp.stanford.edu/IR-book/pdf/09expand.pdf)), and learning to rank ([Chapter 15](https://nlp.stanford.edu/IR-book/pdf/15learn.pdf)).

- Upload your `.ipynb` notebook to the homework vault in IS MU. Optionally, include a brief description of your system and a link to an external service like Google Colaboratory, DeepNote, or JupyterHub.
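
As a taste of tolerant retrieval (Manning et al., Chapter 3), the sketch below corrects a misspelled query term by picking the vocabulary term with the smallest Levenshtein edit distance. The vocabulary and the helper names are made up for the example, and a real system would need something faster than a linear scan over all terms.

```python
from typing import Iterable


def edit_distance(source: str, target: str) -> int:
    """Levenshtein distance via dynamic programming, keeping one row at a time."""
    previous = list(range(len(target) + 1))
    for i, source_char in enumerate(source, start=1):
        current = [i]
        for j, target_char in enumerate(target, start=1):
            substitution = previous[j - 1] + (source_char != target_char)
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]


def closest_term(term: str, vocabulary: Iterable[str]) -> str:
    """Return the vocabulary term with the smallest edit distance to `term`."""
    return min(vocabulary, key=lambda candidate: edit_distance(term, candidate))


toy_vocabulary = ["integral", "derivative", "matrix", "probability"]

print(edit_distance("kitten", "sitting"))          # 3
print(closest_term("probabilty", toy_vocabulary))  # probability
```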

**-> The ARQMath Collection is described in detail in [Zanibbi et al., 2022](https://www.cs.rit.edu/~dprl/ARQMath/index.html) and focuses on developing math-aware retrieval systems for community Q&A sites.**

**->** You can work on this project using any of the recommended tools above. Here are direct links to [Google Colab](https://colab.research.google.com/github/MIR-MU/pv211-utils/blob/spring2025/notebooks/arqmath.ipynb) and [JupyterHub](https://iirhub.cloud.e-infra.cz/).

**->** If you have any trouble loading the collection, visit the section describing how to use the `datasets` module.

---

### ARQMath dataset configuration

- **Years available for splitting**: 2020, 2021, 2022. Choose one year as the **test set**; the other two form the training set.
- **Validation split**: obtained by further splitting the training set.
- **Text formats** available for representing math content:
  - `text`: Plain text without math.
  - `text+latex`: Text + LaTeX math (with `$...$`).
  - `text+prefix`: Text + Tangent-L's prefix math representation.
  - `text+tangentl`: Text + math in Tangent-L's mathtuples.
  - `xhtml+latex`: XHTML + LaTeX.
  - `xhtml+pmml`: XHTML + Presentation MathML.
  - `xhtml+cmml`: XHTML + Content MathML.

**Now let's explore some code on this...**

In [None]:
from pv211_utils.datasets import ArqmathDataset
import re

# initializing ARQMath dataset for 2020 as the test year...
# choose a text format that includes math notation (all of them are listed above)
arqmath = ArqmathDataset(year=2020, text_format="text+latex", validation_split_size=0.1)

print(f"Test year: {arqmath.year}")
print(f"Text format: {arqmath.text_format}")
print(f"Validation split size: {arqmath.validation_split_size}")

Test year: 2020
Text format: text+latex
Validation split size: 0.1


In [None]:
# loading documents (answers), queries, and judgements
answers = arqmath.load_answers()
queries = arqmath.load_test_queries()
judgements = list(arqmath.load_test_judgements())

print(f"Total answers (documents): {len(answers)}")
print(f"Total test queries: {len(queries)}")
print(f"Total test judgements: {len(judgements)}")

Computing MD5: C:\Users\Admin\.cache\pv211-utils\75975b2833889cc12007251c3b5de951
MD5 matches: C:\Users\Admin\.cache\pv211-utils\75975b2833889cc12007251c3b5de951
Computing MD5: C:\Users\Admin\.cache\pv211-utils\75975b2833889cc12007251c3b5de951
MD5 matches: C:\Users\Admin\.cache\pv211-utils\75975b2833889cc12007251c3b5de951
Total answers (documents): 1445495
Total test queries: 77
Total test judgements: 1804


In [None]:

# picking a judgement to inspect: you can change the index to whatever you think is a nice number
index = 211
query, answer = judgements[index]

# some regex magic here to make outputs more readable
cleaned_answer_body = re.sub(r'\s+', ' ', answer.body).strip()

print(f"Let's see what we have here...\n")
print(f"Picked judgement index: {index}\n")
print(f"Query ID: {query.query_id}")
print(f"Query text:\n{query.body.strip()}\n")
print(f"Answer ID: {answer.document_id}")
print(f"Answer body:\n...{cleaned_answer_body[:500]}...")

# reminder: the retrieved answer was judged relevant to this query, so it should make sense.
# even if it contains LaTeX code (not very readable), you can still copy it to e.g. Overleaf and see how it relates to the query.

Let's see what we have here...

Picked judgement index: 211

Query ID: 5
Query text:
A family has two children. Given that one of the children is a boy, what is the probability that both children are boys?   I was doing this question using conditional probability formula.   Suppose, (1) is the event, that the first child is a boy, and (2) is the event that the second child is a boy.  Then the probability of the second child to be boy given that first child is a boys by formula, $P((2)|(1))=\frac{P((2) \cap (1))}{P((1))}=\frac{P((2))P((1))}{P((1))} = P((2))$ ...since second child to be boy doesn't depend on first child and vice versa. Please provide the detailed solution and correct me if I am wrong.

Answer ID: 2107539
Answer body:
...In a real life situation, it depends on how you know that the family has 3 girls. Here are two different scenarios: A. You run into the mother with 3 of her children with her that are all girls, and she tells you that she has a 4th child. Now the chance o

## TREC Collection

Your task is to implement a supervised ranked retrieval system, which will produce a list of documents from the **TREC Collection** in a descending order of relevance to a query.

You are going to work with the **TREC Collection**, one of the most influential benchmarks in Information Retrieval, developed by the U.S. National Institute of Standards and Technology (NIST) as part of the Text REtrieval Conference (TREC) series starting in 1992 (yes, that's where the name comes from).

According to [Manning et al., Section 8.2](https://nlp.stanford.edu/IR-book/pdf/08eval.pdf), TREC datasets such as TRECs 6–8 include **150 information needs (queries)** over about **528,000 newswire and Foreign Broadcast Information Service articles**, creating one of the largest and most realistic testbeds for large-scale information retrieval systems.

The TREC Collection contains:
- Hundreds of thousands of **documents** from real newswire and broadcast sources.
- **Queries** reflecting realistic information needs (information requests) posed to the system.
- **Relevance judgements**, connecting queries with relevant documents, though not exhaustive due to the scale of the collection.

    (!) Because the collection is so large, TREC judgments are **incomplete**: not all documents have been manually judged for every query. This means that mean average precision metrics assume unjudged documents are not relevant.
---
Your tasks, reviewed by your colleagues and course instructors, are the following:
- Implement a supervised ranked retrieval system that produces a ranked list of documents for each query, as described in [Manning et al., Chapter 15](https://nlp.stanford.edu/IR-book/pdf/15learn.pdf).
- Document your code in accordance with [PEP 257](https://peps.python.org/pep-0257/), ideally using [the NumPy style guide](https://numpydoc.readthedocs.io/en/latest/format.html#docstring-standard). Stick to a consistent coding style following [PEP 8](https://peps.python.org/pep-0008/).
- Reach at least **13.5% mean average precision at 10** (MAP@10) on the TREC collection, as discussed in [Manning et al., Section 8.4](https://nlp.stanford.edu/IR-book/pdf/08eval.pdf).
- You are encouraged to apply techniques such as tokenization ([Section 2.2](https://nlp.stanford.edu/IR-book/pdf/02bool.pdf)), document representation ([Section 6.4](https://nlp.stanford.edu/IR-book/pdf/06vect.pdf)), tolerant retrieval ([Chapter 3](https://nlp.stanford.edu/IR-book/pdf/03edit.pdf)), relevance feedback, query expansion ([Chapter 9](https://nlp.stanford.edu/IR-book/pdf/09expand.pdf)), and learning to rank ([Chapter 15](https://nlp.stanford.edu/IR-book/pdf/15learn.pdf)).

- Upload your `.ipynb` notebook to the homework vault in IS MU. Optionally, include a brief description of your system and a link to an external service like [Google Colaboratory](https://colab.research.google.com/), [DeepNote](https://deepnote.com/), or [JupyterHub](https://iirhub.cloud.e-infra.cz/).
---

**-> The TREC Collection is described in detail in [Manning et al., Section 8.2](https://nlp.stanford.edu/IR-book/pdf/08eval.pdf) and serves as a key resource for evaluating large-scale IR systems on real-world news content.**

**->** You can work on this project using any recommended tools above. Here are direct links to [Google Colab](https://colab.research.google.com/github/MIR-MU/pv211-utils/blob/spring2025/notebooks/trec.ipynb) and [JupyterHub](https://iirhub.cloud.e-infra.cz/).

**->** If you have any trouble loading the collection, see the section on using the `datasets` module.

---

You can observe some data from the TREC Collection below.


In [None]:
from pv211_utils.datasets import TrecDataset
import re

# creating the TREC dataset with validation split
trec = TrecDataset(validation_split_size=0.1)

print(f"Validation split size: {trec.validation_split_size}")

# loading documents
documents = trec.load_documents()
print(f"Loaded {len(documents)} documents.")

# loading test queries
queries = trec.load_test_queries()
print(f"Loaded {len(queries)} test queries.")

# loading test judgements
judgements = list(trec.load_test_judgements())
print(f"Loaded {len(judgements)} test judgements.")


index = 0
query, document = judgements[index]

# cleaning the document text for display
cleaned_doc_text = re.sub(r'\s+', ' ', document.body).strip()

print(f"\nLet's see what we have here...\n")
print(f"Picked judgement index: {index}\n")
print(f"Query ID: {query.query_id}")
print(f"Query text:\n{query.body.strip()}\n")
print(f"Document ID: {document.document_id}")
print(f"Document snippet:\n...{cleaned_doc_text[:500]}...")


Validation split size: 0.1
Computing MD5: C:\Users\Admin\.cache\pv211-utils\a75c5f3f7f75085c94f111594ba7d227
MD5 matches: C:\Users\Admin\.cache\pv211-utils\a75c5f3f7f75085c94f111594ba7d227
Loaded 527890 documents.
Loaded 50 test queries.
Computing MD5: C:\Users\Admin\.cache\pv211-utils\a75c5f3f7f75085c94f111594ba7d227
MD5 matches: C:\Users\Admin\.cache\pv211-utils\a75c5f3f7f75085c94f111594ba7d227
Loaded 4726 test judgements.

Let's see what we have here...

Picked judgement index: 0

Query ID: 442
Query text:
Find accounts of selfless heroic acts by individuals or 
small groups for the benefit of others or a cause.

Document ID: FBIS3-27579
Document snippet:
...Language: Chinese Article Type:BFN [Commentator's article: "Learn From Xu Honggang"] [Text] Xu Honggang, the name of a common soldier in the People's Liberation Army, has now spread among more and more people. A squad leader in a communications company in the Jinan Military Region, Xu Honggang bravely stepped forward and fought 

# Loading data for projects with the `datasets` module


The `datasets` module provides a unified, simple interface for working with the following datasets:

- ARQMath
- Cranfield
- TREC
- BEIR

You can find more detailed demonstrations of working with each dataset in the first section.


The `datasets` module encapsulates each dataset in a class that implements the dataset-specific logic for retrieving queries, documents, and relevance judgments, while handling details like file downloads, data integrity checks, and split management (train/validation/test).

## Loading a dataset: example with `CranfieldDataset`

Here is an example of loading the Cranfield dataset and setting the test and validation splits. You can also find a more detailed demo of working with Cranfield in [this section](#cranfield-collection).

In [None]:
from pv211_utils.datasets import CranfieldDataset

# creating a CranfieldDataset instance with a 20% test split and 10% validation split
cranfield = CranfieldDataset(test_split_size=0.2, validation_split_size=0.1)
print(f"Test split size: {cranfield.test_split_size}")
print(f"Validation split size: {cranfield.validation_split_size}")

Test split size: 0.2
Validation split size: 0.1


In [None]:
# loading training queries and judgments + documents
train_queries = cranfield.load_train_queries()
train_judgements = cranfield.load_train_judgements()
documents = cranfield.load_documents()

print(f"We loaded {len(train_queries)} training queries with {len(train_judgements)} relevance judgments.")
print(f"We loaded {len(documents)} documents from the Cranfield collection.")

We loaded 162 training queries with 1340 relevance judgments.
We loaded 1400 documents from the Cranfield collection.
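If you are wondering where the number 162 comes from, the sketch below shows one way such a split can work. It is only an illustration, not the code used inside `CranfieldDataset`: the helper `split_ids`, the seed, and the order of the two splits are assumptions. It assumes the Cranfield collection has 225 queries and that the validation fraction is applied to what remains after the test queries are set aside, which is consistent with the output above.

```python
import random


def split_ids(ids, test_size, validation_size, seed=42):
    """Split IDs into train/validation/test parts (hypothetical helper)."""
    ids = sorted(ids)
    random.Random(seed).shuffle(ids)  # fixed seed keeps the split reproducible
    n_test = round(len(ids) * test_size)
    test, rest = ids[:n_test], ids[n_test:]
    # the validation share is taken from what is left after the test split
    n_validation = round(len(rest) * validation_size)
    validation, train = rest[:n_validation], rest[n_validation:]
    return train, validation, test


train, validation, test = split_ids(range(1, 226), test_size=0.2, validation_size=0.1)
print(len(train), len(validation), len(test))
```

The three parts never overlap, so test queries cannot leak into training.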


## Loading data with `ArqmathDataset` and setting parameters

ARQMath Collection provides:
- 3 years (2020, 2021, 2022) for the train-test split (the year you choose becomes the test data)
- 7 text formats for representing math content.

You can find the available text formats and other dataset characteristics in the section dedicated to the ARQMath project at the beginning of this notebook.

In [None]:
from pv211_utils.datasets import ArqmathDataset

# let's set year 2021 and "text+latex" format
arqmath = ArqmathDataset(year=2021, text_format="text+latex", validation_split_size=0.15)
print(f"ARQMath year: {arqmath.year}, text format: {arqmath.text_format}")

# load test queries
test_queries = arqmath.load_test_queries()
print(f"Loaded {len(test_queries)} test queries for ARQMath.")


ARQMath year: 2021, text format: text+latex
Loaded 100 test queries for ARQMath.


# Preprocessing documents with the `text_preprocessing` and `math_preprocessing` modules

Before using text data in information retrieval systems, we need to preprocess it to make it cleaner. Preprocessing helps by converting text to a consistent format, removing noise, and normalizing words, which improves both efficiency and retrieval quality.


*“It is standard to do some form of normalization on the tokens before indexing. The most standard is to case-fold all tokens down to lowercase... Often, people also remove punctuation, digits, and perform stemming or lemmatization.”*

*(Introduction to Information Retrieval, [Section 2.2](https://nlp.stanford.edu/IR-book/pdf/02voc.pdf))*


The `text_preprocessing` module provides structured, reusable objects for this task. You can easily import classes like `DocPreprocessing` and configure options like lowercasing, accent removal, stop word filtering, stemming, and lemmatization, all in one place!

The `math_preprocessing` module can help with math expressions (for example, in the ARQMath Collection project).



## Simple preprocessing

Before moving to harder stuff, let's explore some really basic examples of how we can preprocess documents. We will use just one sentence as an example.

In [None]:
# feel free to type in any sentence that comes to your mind here
sentence = "What a níce idéa to start my projéct 2 dáys beforé the deadliné... I hope it is not going to end with a defenestration"

In [None]:
from pv211_utils.preprocessing.text_preprocessing import NoneDocPreprocessing  # does nothing except splitting the sentence into words

preprocessor = NoneDocPreprocessing()
tokens = preprocessor(sentence)
print("Tokens with NoneDocPreprocessing:")
print(tokens)

Tokens with NoneDocPreprocessing:
['What', 'a', 'níce', 'idéa', 'to', 'start', 'my', 'projéct', '2', 'dáys', 'beforé', 'the', 'deadliné...', 'I', 'hope', 'it', 'is', 'not', 'going', 'to', 'end', 'with', 'a', 'defenestration']


In [None]:
from pv211_utils.preprocessing.text_preprocessing import LowerDocPreprocessing  # splits the sentence into tokens and lowercases each one

preprocessor = LowerDocPreprocessing()
tokens = preprocessor(sentence)
print("Tokens with LowerDocPreprocessing:")
print(tokens)

Tokens with LowerDocPreprocessing:
['what', 'a', 'níce', 'idéa', 'to', 'start', 'my', 'projéct', '2', 'dáys', 'beforé', 'the', 'deadliné...', 'i', 'hope', 'it', 'is', 'not', 'going', 'to', 'end', 'with', 'a', 'defenestration']


In [None]:
from pv211_utils.preprocessing.text_preprocessing import SimpleDocPreprocessing # can use deacc (remove diacritics from the text) and filter words by length


# with deaccent enabled and max_len=12 we get a nice list of words without accents and without the words that are too long or too short
preprocessor = SimpleDocPreprocessing(deacc=True, min_len=2, max_len=12)
tokens = preprocessor(sentence)
print("Tokens with SimpleDocPreprocessing:")
print(tokens)

Tokens with SimpleDocPreprocessing:
['what', 'nice', 'idea', 'to', 'start', 'my', 'project', 'days', 'before', 'the', 'deadline', 'hope', 'it', 'is', 'not', 'going', 'to', 'end', 'with']
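To see what is going on behind such a preprocessor, here is a rough standard-library sketch of the same three steps: removing accents, lowercasing, and filtering tokens by length. It is not the implementation of `SimpleDocPreprocessing`; the helper names and the letters-only regular expression are assumptions made for this illustration.

```python
import re
import unicodedata


def strip_accents(text):
    """Drop combining accent marks after Unicode decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def simple_tokenize(text, min_len=2, max_len=12):
    """Deaccent, lowercase, keep runs of letters, filter by token length."""
    text = strip_accents(text).lower()
    tokens = re.findall(r"[a-z]+", text)  # digits and punctuation are dropped here
    return [token for token in tokens if min_len <= len(token) <= max_len]


sentence = "What a níce idéa to start my projéct 2 dáys beforé the deadliné... I hope it is not going to end with a defenestration"
print(simple_tokenize(sentence))
```

With the same `min_len` and `max_len` as above, the one-letter words and the 14-letter "defenestration" are filtered out.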


## Improved preprocessing with stop words, stemming and lemmatization

Now let us explore it a bit further and see how we can customize the DocPreprocessing magic with...
1. Stop words
2. Stemming
3. Lemmatization

You can perform these and other preprocessing techniques by importing the `DocPreprocessing` class from the `text_preprocessing` module.

**->** The process of text normalization, including all mentioned above, is further discussed in Manning et al. [Section 2.2](https://nlp.stanford.edu/IR-book/pdf/02voc.pdf). This section is really useful to read if you want to make a good preprocessing pipeline for your specific task.

**->** There are also interesting introductory articles on this topic, for example Text Preprocessing with NLTK on [GeeksForGeeks](https://www.geeksforgeeks.org/nlp/removing-stop-words-nltk-python/) and a more general article on text normalization on [Medium](https://medium.com/@prvaddkkepurakkal/text-preprocessing-in-nlp-19ebe6c9732c).



### Stop Words

Stop words are common words that are of little value in helping select documents matching a user need. For example, in an English text corpus, words like `the`, `is`, `at`, `which`, and `on` are typical stop words. Removing them can reduce the size of the index and improve efficiency (Manning et al., Introduction to Information Retrieval, [Section 2.2](https://nlp.stanford.edu/IR-book/pdf/02voc.pdf)).

In [None]:
from pv211_utils.preprocessing.text_preprocessing import DocPreprocessing

# let us start with stop words
# you can add the stop words you wish and see how they disappear
my_stopwords = ["and", "or", "not", "with"] # "how", "why", "to", "no" ...

preprocessor = DocPreprocessing(stopwords=my_stopwords)
tokens = preprocessor(sentence)
print("Tokens with DocPreprocessing and the list of stopwords:")
print(tokens)


Tokens with DocPreprocessing and the list of stopwords:
['what', 'nice', 'idea', 'start', 'project', 'days', 'before', 'the', 'deadline...', 'hope', 'going', 'end', 'defenestration']


### Stemming

Stemming is a text normalization technique that cuts words down to a stem (not always the root in the linguistic sense) by removing common prefixes or suffixes. This helps group together words with the same core meaning, like running and runner → run, even if the resulting stems aren't valid words. Stemming is fast and simple, making it a popular choice in information retrieval for improving search recall (Manning et al., Introduction to Information Retrieval, [Section 2.2](https://nlp.stanford.edu/IR-book/pdf/02voc.pdf)).

In [None]:
from pv211_utils.preprocessing.text_preprocessing import DocPreprocessing

# this is our own stemmer, but currently it is really stupid...
# feel free to try building your own stemmer, but be cautious: it's a challenge!

def dummy_stemmer(word):
    """
    If the word is shorter than 5 letters, then we don't touch it.
    Otherwise checks if the word ends with "s", if it does - cuts "s" off.
    Also checks for the "-ing" ending and cuts it off.
    Otherwise just cuts off last 2 letters

    """
    if len(word) < 5:
        return word

    elif word.endswith("s"):
        return word[:-1]

    elif word.endswith("ing"):
        return word[:-3]

    else:
        return word[:-2]

preprocessor = DocPreprocessing(stem=dummy_stemmer, stopwords=my_stopwords)
tokens = preprocessor(sentence)

print("Tokens with DocPreprocessing, stopwords, and the dummy stemmer:")
print(tokens)

Tokens with DocPreprocessing, stopwords, and the dummy stemmer:
['what', 'nice', 'idea', 'sta', 'proje', 'days', 'befo', 'the', 'deadline.', 'hope', 'go', 'end', 'defenestrati']


### Lemmatization
Lemmatization reduces words to their dictionary base form (lemma), so that variations like running and ran map to run. Unlike stemming, which simply chops endings, lemmatization uses linguistic rules and vocabulary knowledge, resulting in more meaningful normalization. This helps information retrieval systems better match related words and improves both recall and precision [Manning et al., Introduction to Information Retrieval, Section 2.2].

Now let's try lemmatization using the NLTK library.

In [None]:
# first let's install nltk if not already installed

!pip install nltk






In [None]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Admin\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [None]:
from nltk.stem import WordNetLemmatizer
from pv211_utils.preprocessing.text_preprocessing import DocPreprocessing

# here we create the lemmatizer
wn_lemmatizer = WordNetLemmatizer()

# wrapping it into a function matching DocPreprocessing's expected signature
def nltk_lemmatizer(word):
    return wn_lemmatizer.lemmatize(word)

# create the preprocessor
preprocessor = DocPreprocessing(lemm=nltk_lemmatizer, stopwords=my_stopwords)
tokens = preprocessor(sentence)
print("Tokens with DocPreprocessing + NLTK WordNet lemmatizer:")
print(tokens)

# now look, we literally have lemmas on the output!


Tokens with DocPreprocessing + NLTK WordNet lemmatizer:
['what', 'nice', 'idea', 'start', 'project', 'day', 'before', 'the', 'deadline...', 'hope', 'going', 'end', 'defenestration']


## Math Preprocessing

There are also some tools for preprocessing math expressions, especially useful in the [ARQMath](#arqmath-collection) task.

In [None]:
# Import your math utilities
from pv211_utils.preprocessing.math_preprocessing import (
    exp_to_latex, exp_to_pmathml, exp_to_cmathml
)

# define a simple Python-style math expression (type in whatever you want)
expression = "x**211 + 211*x + 211"

# convert to LaTeX
latex_output = exp_to_latex(expression)
print("\n LaTeX representation:")
print(latex_output)

# convert to presentation MathML
pmathml_output = exp_to_pmathml(expression)
print("\n Presentation MathML:")
print(pmathml_output)

# convert to content MathML
cmathml_output = exp_to_cmathml(expression)
print("\n Content MathML:")
print(cmathml_output)



 LaTeX representation:
x^{211} + 211 x + 211

 Presentation MathML:
<mrow><msup><mi>x</mi><mn>211</mn></msup><mo>+</mo><mrow><mn>211</mn><mo>&InvisibleTimes;</mo><mi>x</mi></mrow><mo>+</mo><mn>211</mn></mrow>

 Content MathML:
<apply><plus/><apply><power/><ci>x</ci><cn>211</cn></apply><apply><times/><cn>211</cn><ci>x</ci></apply><cn>211</cn></apply>


# Information Retrieval Systems from `systems` package

The `systems` package implements a complete suite of information retrieval (IR) components, from classical bag-of-words methods to modern neural reranking.

Some of the tools use the `gensim` library by Radim Řehůřek -> [Find out about gensim here](https://radimrehurek.com/gensim/)

In [4]:
# to work with this section, imports are needed:

from pv211_utils.preprocessing.text_preprocessing import SimpleDocPreprocessing
from pv211_utils.entities import DocumentBase, QueryBase
from collections import OrderedDict

## Bag of Words



The Bag-of-Words (BoW) retrieval system is a foundational ranked retrieval method in information retrieval, representing both documents and queries as unordered collections of word frequencies. By reducing text to word counts, BoW captures the lexical content while discarding grammar and word order, allowing efficient comparison of textual similarity in high-dimensional vector spaces.

For example, when a user submits a query like “the child makes the dog happy”, BoW tokenizes it into words and constructs a vector reflecting the frequency of each word (see below). Each document in the collection is likewise vectorized. The system then calculates a similarity measure (such as cosine similarity) between the query vector and every document vector, returning a ranked list of documents in which those most lexically similar to the query appear first.

*"For ranked retrieval, the key idea of the vector space model is to represent queries and documents as vectors in a common term space and to rank documents according to their proximity to the query vector."*

*(Manning et al., Introduction to Information Retrieval, [Section 6.3](https://nlp.stanford.edu/IR-book/pdf/06vect.pdf))*

While BoW lacks semantic understanding or word order awareness, it is a strong baseline for building more advanced retrieval systems.
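Here is a tiny sketch of the idea using only the standard library. It is not how `BoWSystem` is implemented; the helper names and the whitespace tokenization are assumptions made for this illustration. It builds word-count vectors with `Counter` and compares them with cosine similarity.

```python
import math
from collections import Counter


def bow_vector(text):
    """Represent a text as word counts (lowercased, split on whitespace)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


query_vec = bow_vector("the child makes the dog happy")
same_words = bow_vector("the dog makes the child happy")
other_words = bow_vector("birds can fly in the sky")

print(query_vec)
print(cosine_similarity(query_vec, same_words))
print(cosine_similarity(query_vec, other_words))
```

The first two sentences mean different things, but they contain exactly the same words, so their vectors are identical and the similarity is (up to rounding) 1. This is the word-order blindness mentioned above.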



![BoW](https://aiml.com/wp-content/uploads/2023/02/disadvantage-bow-1024x650.png)

In [None]:
# first let us import everything needed
from pv211_utils.systems.bow import BoWSystem

Let us demonstrate how this system works with a toy set of sentences and a made-up query. The `entities` module helps us do this with `DocumentBase` and `QueryBase`.

In [None]:
# defining a toy set of documents
documents = OrderedDict({
    "doc1": DocumentBase("doc1", "Cats are wonderful animals."),
    "doc2": DocumentBase("doc2", "Dogs are loyal pets."),
    "doc3": DocumentBase("doc3", "Birds can fly in the sky."),
    "doc4": DocumentBase("doc4", "Kittens are amazing creatures."),
    "doc5": DocumentBase("doc5", "I will have F on the Introduction to Information Retrieval")
})

# defining a simple preprocessing strategy
preprocessor = SimpleDocPreprocessing()

# initializing the bag-of-words system with documents and preprocessing as arguments
bow_system = BoWSystem(documents, preprocessor)

Building the dictionary: 100%|██████████| 5/5 [00:00<00:00, 4992.03it/s]
Building the index: 100%|██████████| 5/5 [00:00<?, ?it/s]


After running the BoW system on a query, we will see a list of documents from the most to the least relevant one. This is the output of ranked retrieval.

In [None]:
# creating a query (try to play around with it)
query = QueryBase("q1", "What are cats?")

# performing ranked retrieval for the query
print("retrieval results for the query:")
for doc in bow_system.search(query):
    print(f"- Document ID: {doc.document_id} | Content: {doc.body}")


# NOTE: the kittens sentence (doc4) is not ranked right after the cats sentence (doc1):
# BoW only matches exact tokens and does not know that cats and kittens are related

retrieval results for the query:
- Document ID: doc1 | Content: Cats are wonderful animals.
- Document ID: doc2 | Content: Dogs are loyal pets.
- Document ID: doc4 | Content: Kittens are amazing creatures.
- Document ID: doc3 | Content: Birds can fly in the sky.
- Document ID: doc5 | Content: I will have F on the Introduction to Information Retrieval


## TF-IDF


The `TfidfSystem` implements a ranked retrieval model based on the Term Frequency–Inverse Document Frequency (TF-IDF) weighting scheme. This system represents both documents and queries as vectors in a high-dimensional space where each dimension corresponds to a term from the corpus vocabulary. By computing the cosine similarity between the query vector and document vectors, it ranks documents in order of their estimated relevance to the query.

TF-IDF weighting emphasizes words that are frequent in a specific document but rare across the entire collection, which captures the intuition that such words are often **more informative**. For instance, in a collection of animal texts, words like “whiskers” or “purring” may help distinguish a document about cats, whereas common words like “animal” or “the” occur in almost every document (as in the code example below) and carry little discriminative power.

Manning et al. (2008) in [Chapter 6](https://nlp.stanford.edu/IR-book/pdf/06vect.pdf) highlight that “One of the most effective and widely used weighting schemes is tf-idf weighting.”
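The weighting itself fits in a few lines. The sketch below uses the textbook variant, weight = tf × log10(N / df), on three made-up mini documents. It is only an illustration: the helper name is hypothetical, and `TfidfSystem` may use a different logarithm base, smoothing, or normalization.

```python
import math
from collections import Counter

# three made-up documents, already tokenized
docs = {
    "doc1": "cats are playful animals".split(),
    "doc2": "dogs are loyal animals".split(),
    "doc3": "birds are animals too".split(),
}


def tfidf_weights(docs):
    """Return {doc_id: {term: tf * log10(N / df)}} for a tokenized collection."""
    n_docs = len(docs)
    # document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in docs.values() for term in set(tokens))
    return {
        doc_id: {
            term: tf * math.log10(n_docs / df[term])
            for term, tf in Counter(tokens).items()
        }
        for doc_id, tokens in docs.items()
    }


weights = tfidf_weights(docs)
print(weights["doc1"])
```

Since "are" and "animals" occur in all three documents, their idf is log10(3/3) = 0 and they contribute nothing, while "cats" and "playful" get a positive weight.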


![TF IDF](https://www.seoquantum.com/sites/default/files/tf-idf-2-1-1024x375.png)

**(!) Note:** these TF-IDF cells may crash on Windows: the code uses multiprocessing with `fork`, and Windows doesn't support that start method. If you're on Windows, run it in `Google Colab` or on `Linux/macOS`.

In [5]:
from pv211_utils.systems.tfidf import TfidfSystem

# preparing documents
docs = OrderedDict({
    "doc1": DocumentBase("doc1", "Cats are playful and curious animals."),
    "doc2": DocumentBase("doc2", "Dogs are loyal and protective animals."),
    "doc3": DocumentBase("doc3", "Birds are also animals but they can fly."),
})

# creating a simple preprocessor (lowercasing + tokenization)
preprocessor = SimpleDocPreprocessing()

# creating the TF-IDF system
tfidf = TfidfSystem(docs, preprocessor)

Building the dictionary: 100%|██████████| 3/3 [00:00<00:00, 6449.47it/s]
Building the TF-IDF model: 100%|██████████| 3/3 [00:00<00:00, 4366.03it/s]
Building the TF-IDF index: 100%|██████████| 3/3 [00:00<00:00, 833.91it/s]


In [6]:
# preparing query
query = QueryBase("q1", "Protective animals")

# searching with the query
results = list(tfidf.search(query))

# printing top results
for rank, doc in enumerate(results[:3], 1):
    print(f"Rank {rank}: Document ID {doc.document_id} -> {doc.body}")

Rank 1: Document ID doc2 -> Dogs are loyal and protective animals.
Rank 2: Document ID doc1 -> Cats are playful and curious animals.
Rank 3: Document ID doc3 -> Birds are also animals but they can fly.


## BM25

BM25 (and its extension BM25+) is one of the most effective ranking functions in modern information retrieval.

*“BM25 is one of the best known and most widely used retrieval functions in modern IR systems.”*
*(Manning et al., Introduction to Information Retrieval, [Section 11.4](https://nlp.stanford.edu/IR-book/pdf/11prob.pdf))*

It builds on TF-IDF by adding term frequency saturation and document length normalization, which helps compare shorter and longer documents fairly. In simple terms, BM25 recognizes that longer documents have more words and might match queries more often by chance, so it adjusts scores for document length to avoid unfairly favoring long texts. This addresses a notable drawback of plain TF-IDF weighting.

The scoring formula rewards terms that appear more frequently in a document but applies diminishing returns (term frequency saturation, controlled by the parameter k1), and adjusts the score based on how the document length deviates from the average document length in the corpus (controlled by the parameter b). BM25+ additionally introduces a small parameter delta that is added to the term frequency component, so every matching query term contributes at least a fixed minimum to the score, no matter how long the document is.

This makes BM25 (and BM25+) better than basic TF-IDF, especially in real-world settings where documents vary greatly in length (e.g., product reviews, news articles, forum posts).

**->** BM25+ is an extension of BM25 whose delta parameter counteracts BM25's tendency to over-penalize very long documents, so that such documents can still achieve meaningful scores if they contain important query terms.
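To connect these words to numbers, here is a small sketch of the BM25+ score of a single query term in a single document. It is not the code behind `BM25PlusSystem`: the function name, the default values `k1=1.5`, `b=0.75`, `delta=1.0`, and the idf variant `log((N + 1) / df)` are assumptions chosen for this illustration, and real implementations differ in such details.

```python
import math


def bm25_plus_term_score(tf, df, doc_len, avg_doc_len, n_docs, k1=1.5, b=0.75, delta=1.0):
    """BM25+ contribution of one query term to one document (illustrative)."""
    if tf == 0:
        return 0.0  # the term does not occur in the document
    idf = math.log((n_docs + 1) / df)  # assumed idf variant
    length_norm = 1 - b + b * doc_len / avg_doc_len  # > 1 for longer-than-average documents
    saturated_tf = (k1 + 1) * tf / (k1 * length_norm + tf)  # grows ever more slowly with tf
    return idf * (saturated_tf + delta)  # delta is the BM25+ lower bound


# hypothetical collection statistics
common = dict(df=1, avg_doc_len=10, n_docs=3)

# saturation: each extra occurrence of the term adds less than the previous one
for tf in (1, 2, 3):
    print(tf, round(bm25_plus_term_score(tf, doc_len=10, **common), 3))

# length normalization: the same match counts for less in a much longer document
print(round(bm25_plus_term_score(1, doc_len=100, **common), 3))
```

Setting `delta=0.0` in this sketch gives the classic BM25 term score, which makes it easy to compare the two.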


Here is a nice explanation of the scary BM25 formula.

*Lucene is a search engine library.*
![bm25](https://kmwllc.com/wp-content/uploads/2021/05/bm25demystified-1536x662.png)

In [None]:
from pv211_utils.systems.bm25 import BM25PlusSystem

# creating documents with very different lengths but similar content
docs = OrderedDict({
    "doc1": DocumentBase("doc1", "Cats are cute."),
    "doc2": DocumentBase("doc2", "Cats are wonderful pets that are loved by many people all over the world. They are known for their independence, playfulness, and affectionate behavior. Cats can be great companions in apartments and houses."),
    "doc3": DocumentBase("doc3", "Dogs are loyal and protective animals."),
})

# creating a query about cats
query = QueryBase("query1", "cats are affectionate")

# creating a simple preprocessor
preprocessor = SimpleDocPreprocessing()

# creating the BM25+ system
bm25_system = BM25PlusSystem(docs, preprocessor)

# searching with the query
results = list(bm25_system.search(query))

# printing ranked results
print("Ranked documents by BM25+:")
for rank, doc in enumerate(results, start=1):
    print(f"{rank}. Document ID: {doc.document_id} - Document snippet: {doc.body[:60]}...")


Ranked documents by BM25+:
1. Document ID: doc1 - Document snippet: Cats are cute....
2. Document ID: doc2 - Document snippet: Cats are wonderful pets that are loved by many people all ov...
3. Document ID: doc3 - Document snippet: Dogs are loyal and protective animals....


## Dense Retrieval Systems with `retriever` module


The systems we looked at before (such as BoW, TF-IDF, or BM25) rely purely on token frequency statistics. They will find a relevant sentence only if the query keyword (or its normalized version) is present in it. These systems are called sparse retrievers. Now we will look at dense retrievers: they use neural embeddings pretrained on large amounts of data, which capture the semantic meaning of words. For example, such retrievers can see the similarity between the words "lesson" and "lecture", even though the words themselves look quite different. BERT models are a well-known example.

**Here are some useful links to read about BERT models:**

**-> [Demonstration of BERT with code](https://huggingface.co/docs/transformers/en/model_doc/bert) on HuggingFace**

**-> [BERT: A Comprehensive Introduction](https://medium.com/@lixue421/bert-a-comprehensive-introduction-7d620efcd32f)** on Medium

**-> [Article about BERT](https://en.wikipedia.org/wiki/BERT_(language_model))** on Wikipedia

**-> The original paper: [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805)** by J. Devlin et al. (2019)


After loading the `RetrieverSystem` from the `retriever.py` module, feel free to experiment by adding pretrained models. Below is a code snippet with a demonstration of how it works.

In [None]:
from sentence_transformers import SentenceTransformer
from pv211_utils.systems.retriever import RetrieverSystem

# loading a small Sentence-BERT model (feel free to choose your own; many more can be found on HuggingFace)
model = SentenceTransformer('paraphrase-MiniLM-L6-v2')

# preparing documents
docs = OrderedDict({
    "doc1": DocumentBase("doc1", "The capital of France is Paris."),
    "doc2": DocumentBase("doc2", "Berlin is the capital of Germany."),
    "doc3": DocumentBase("doc3", "Madrid is the capital of Spain."),
})

retriever = RetrieverSystem(model, docs)

# defining a query that doesn't contain same words but is related to France
query = QueryBase("q1", "Where is the Eiffel Tower located?")

# performing the search + prints
results = list(retriever.search(query))
print(f"Top retrieved document ID: {results[0].document_id}")
print(f"Document text: {results[0].body}")

# if you didn't change anything, you should get doc1 as the best match. Bien joué!

Top retrieved document ID: doc1
Document text: The capital of France is Paris.
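Once the model has turned texts into vectors, dense retrieval boils down to comparing those vectors. The sketch below shows that step in isolation. The three-dimensional embeddings are invented for the illustration (a real Sentence-BERT model outputs hundreds of dimensions), and the code is not taken from `RetrieverSystem`.

```python
import math


def cosine(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# hypothetical embeddings, made up by hand for this example
doc_embeddings = {
    "doc1": [0.9, 0.1, 0.0],
    "doc2": [0.1, 0.9, 0.1],
    "doc3": [0.0, 0.2, 0.9],
}
query_embedding = [0.8, 0.2, 0.1]

# rank document IDs from the most to the least similar embedding
ranking = sorted(
    doc_embeddings,
    key=lambda doc_id: cosine(query_embedding, doc_embeddings[doc_id]),
    reverse=True,
)
print(ranking)
```

Note that no words are compared here at all: two texts end up close to each other whenever the model places their vectors close together.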


The `RetrieverSystem` class also supports the query expansion technique, which is discussed in [Manning et al., Chapter 9](https://nlp.stanford.edu/IR-book/pdf/09expand.pdf).

Query expansion automatically enriches a user's query with extra, relevant terms. The idea behind it is simple: the original query might be too short to represent the user's **information need**, so we enrich it with words or phrases that are likely to bring better results. But where can we get those words from? We can extract them from the top documents that were already retrieved. The `search` function of the `RetrieverSystem` class chooses the best-matching document, then uses TF-IDF to extract the top-k sentences from it. The class parameter `top_k_sentences` defines how many sentences will be added to the query, and `no_query_expansion` defines the number of iterations of this process.

Below you can see how the results change if you use **query expansion**. See how documents 2 and 3 swap places after query expansion is applied. This is because the document about Sam mentions donuts in several sentences.

In [None]:
from collections import OrderedDict
from sentence_transformers import SentenceTransformer
from pv211_utils.entities import DocumentBase, QueryBase
from pv211_utils.systems.retriever import RetrieverSystem

# documents with multi-sentence content, but only some sentences are directly relevant
docs = OrderedDict({
    "1": DocumentBase("1",
        "My friend is Sam. "
        "He really enjoys donuts and says they are healthy. "
        "Donuts mean everything to him. "
    ),
    "2": DocumentBase("2",
        "I enjoy donuts very much. "
        "Much better than any salad or fresh juice. "
        "Donuts mean everything to me. "
    ),
    "3": DocumentBase("3",
        "Cats are perfect animals. "
        "They are also good friends. "
        "But you have to learn how to treat them."
    ),
    "4": DocumentBase("4",
        "Gambling doesn't have loosers. "
        "It only has quitters. "
        "Don't give up. "
    ),
})

# loading the dense retriever
retriever_model = SentenceTransformer("all-MiniLM-L6-v2")

# defining the user query
query = QueryBase("q1", "Sam is my friend. ")

# running retriever without query expansion
print("Retrieval WITHOUT query expansion:")
retriever_no_exp = RetrieverSystem(
    retriever=retriever_model,
    answers=docs,
    no_query_expansion=0
)
results_no_exp = list(retriever_no_exp.search(query))
for doc in results_no_exp:
    print(f"→ Doc {doc.document_id}: {doc.body[:80]}...")

# running retriever with query expansion
print("\nRetrieval WITH query expansion:")
retriever_with_exp = RetrieverSystem(
    retriever=retriever_model,
    answers=docs,
    no_query_expansion=1,  # enabling query expansion
    top_k_sentences=3
)
results_with_exp = list(retriever_with_exp.search(query))
for doc in results_with_exp:
    print(f"→ Doc {doc.document_id}: {doc.body[:80]}...")

# see how Doc2 and Doc3 swapped places. Try to understand why.

Retrieval WITHOUT query expansion:
→ Doc 1: My friend is Sam. He really enjoys donuts and says they are healthy. Donuts mean...
→ Doc 3: Cats are perfect animals. They are also good friends. But you have to learn how ...
→ Doc 2: I enjoy donuts very much. Much better than any salad or fresh juice. Donuts mean...
→ Doc 4: Gambling doesn't have loosers. It only has quitters. Don't give up. ...

Retrieval WITH query expansion:
→ Doc 1: My friend is Sam. He really enjoys donuts and says they are healthy. Donuts mean...
→ Doc 2: I enjoy donuts very much. Much better than any salad or fresh juice. Donuts mean...
→ Doc 3: Cats are perfect animals. They are also good friends. But you have to learn how ...
→ Doc 4: Gambling doesn't have loosers. It only has quitters. Don't give up. ...
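If the swap still feels like magic, the toy sketch below replays the same loop with much simpler ingredients. Everything in it is an assumption made for illustration: documents are ranked by plain word overlap instead of embeddings, sentences are scored by raw term counts instead of TF-IDF, and the helper names and shortened documents are made up. It is not the code inside `RetrieverSystem`.

```python
from collections import Counter


def tokenize(text):
    return text.lower().replace(".", " ").split()


def rank(query, docs):
    """Rank document IDs by word overlap with the query (toy scorer)."""
    query_terms = set(tokenize(query))
    scores = {doc_id: len(query_terms & set(tokenize(text))) for doc_id, text in docs.items()}
    return sorted(docs, key=lambda doc_id: scores[doc_id], reverse=True)


def expand(query, docs, top_k_sentences=1):
    """Append the highest-scoring sentences of the best document to the query."""
    best_text = docs[rank(query, docs)[0]]
    term_counts = Counter(tokenize(best_text))
    sentences = [s.strip() for s in best_text.split(".") if s.strip()]
    # a sentence scores higher when its words are frequent in the best document
    sentences.sort(key=lambda s: sum(term_counts[t] for t in tokenize(s)), reverse=True)
    return query + " " + " ".join(sentences[:top_k_sentences])


docs = {
    "1": "My friend is Sam. He really enjoys donuts. Donuts mean everything to him.",
    "2": "I enjoy donuts very much. Donuts mean everything to me.",
    "3": "Cats are perfect animals. A cat is a good friend.",
}
query = "Sam is my friend"

print(rank(query, docs))           # ranking before expansion
expanded_query = expand(query, docs)
print(expanded_query)
print(rank(expanded_query, docs))  # ranking after one round of expansion
```

Before expansion the cat document wins second place because it shares "is" and "friend" with the query. After expansion the query also talks about donuts, so the donut document overtakes it.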


## Hybrid Retrieval Systems with `ranker` and `reranker`


Now, we move beyond individual models to retrieval and ranking systems, which bring everything together to answer queries by efficiently searching a large collection of documents.

Modern IR pipelines often use a two-stage architecture:

##### 1. Retriever stage (dense retrieval)
- Quickly narrows down thousands or millions of candidate documents to the most promising ones using vector similarities.
##### 2. Reranker stage (cross-encoder reranking)
- Precisely scores the top candidates by jointly considering the full text of each document and the query.


Both systems that we have (`RankerSystem` and `RerankerSystem`) use this architecture, but in slightly different ways.

##### `RerankerSystem`
1. encodes all documents with a SentenceTransformer and stores their embeddings in memory.

2. computes cosine similarities directly between the query embedding and all stored document embeddings.

3. selects the top candidates based on similarity scores and reranks them using a CrossEncoder, which jointly scores query-document pairs for deeper semantic understanding.

4. returns documents in the reranked order.
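
The steps above can be sketched without any neural models. This is only a toy illustration, not the library's implementation: the 2-D vectors stand in for SentenceTransformer embeddings, `toy_pair_scores` is a hypothetical stand-in for CrossEncoder scores, and the function name `retrieve_then_rerank` is made up for this example. It also assumes that documents outside the top `no_reranks` keep their first-stage order.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def retrieve_then_rerank(query_vec, doc_vecs, pair_score, no_reranks):
    """Rank every document by cosine similarity, then rerank only the head."""
    # Stage 1: brute-force similarity against all embeddings held in memory.
    ranked = sorted(
        doc_vecs, key=lambda d: cosine(query_vec, doc_vecs[d]), reverse=True
    )
    # Stage 2: reorder only the top `no_reranks` with the expensive scorer.
    head = sorted(ranked[:no_reranks], key=pair_score, reverse=True)
    # Assumption: the remaining documents keep their stage-1 order.
    return head + ranked[no_reranks:]


# Hand-made "embeddings" (hypothetical values, not real model output).
doc_vecs = {"a": [1.0, 0.0], "b": [0.8, 0.6], "c": [0.0, 1.0]}
query_vec = [1.0, 0.1]

# Stand-in for cross-encoder scores: it prefers "b" over "a".
toy_pair_scores = {"a": 0.2, "b": 0.9, "c": 0.5}

order = retrieve_then_rerank(query_vec, doc_vecs, toy_pair_scores.get, no_reranks=2)
print(order)  # ['b', 'a', 'c']
```

Note that "c" stays last even though its pair score beats that of "a": it never reached the second stage, so a larger `no_reranks` trades speed for a chance to rescue such documents.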

##### `RankerSystem`

Like RerankerSystem, it encodes all documents using a SentenceTransformer.

1. instead of storing embeddings in memory, it adds them to a vector database (e.g., FAISS or your own implementation)

2. at query time, it performs fast approximate nearest neighbor (ANN) search in the vector DB to retrieve top-k candidates.

3. reranks the top documents with a CrossEncoder for precise scoring.
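
To make the difference concrete, here is a rough sketch of the "index once, query many times" idea. `ToyVectorIndex` is a hypothetical class written for this notebook, not part of pv211-utils or FAISS, and it performs exact search; a real vector database would replace the scan in `search` with an approximate nearest neighbor structure. The vectors and `toy_pair_scores` are made-up stand-ins for real embeddings and CrossEncoder scores.

```python
import math


def _unit(vector):
    """Scale a vector to unit length so inner product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


class ToyVectorIndex:
    """Hypothetical stand-in for a vector database (exact search, not ANN)."""

    def __init__(self):
        self._entries = []  # list of (doc_id, unit_vector) pairs

    def add(self, doc_id, vector):
        """Store a normalized copy of the document vector."""
        self._entries.append((doc_id, _unit(vector)))

    def search(self, query_vector, k):
        """Return the ids of the k most similar stored vectors."""
        q = _unit(query_vector)
        scored = [
            (sum(a * b for a, b in zip(q, vec)), doc_id)
            for doc_id, vec in self._entries
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [doc_id for _, doc_id in scored[:k]]


# Build the index once, up front.
index = ToyVectorIndex()
index.add("d1", [0.9, 0.1])
index.add("d2", [0.1, 0.9])
index.add("d3", [0.7, 0.7])
index.add("d4", [-1.0, 0.0])

# At query time only k candidates come back, not the whole collection.
candidates = index.search([1.0, 0.2], k=2)

# Stand-in for cross-encoder scores over the candidates.
toy_pair_scores = {"d1": 0.3, "d3": 0.8}
reranked = sorted(candidates, key=toy_pair_scores.get, reverse=True)

print(candidates)  # ['d1', 'd3']
print(reranked)    # ['d3', 'd1']
```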


**-> In general, `RerankerSystem` is simpler, mostly because it does not use an external vector DB. `RankerSystem`, in turn, scales better and is the better fit for large-scale data (millions of documents).**

Here are some quick demonstrations of the `RerankerSystem` with code...



In [None]:
from collections import OrderedDict

from sentence_transformers import SentenceTransformer, CrossEncoder

# let's create a nice minimal DocumentBase subclass
class SimpleDoc(DocumentBase):
    def __init__(self, doc_id: str, text: str):
        self.doc_id = doc_id
        self.text = text

    def __str__(self):
        return self.text

answers = OrderedDict({
    "1": SimpleDoc("1", "The Eiffel Tower is in Paris."),
    "2": SimpleDoc("2", "The Great Wall can be seen from space."),
    "3": SimpleDoc("3", "Mount Everest is the tallest mountain."),
    "4": SimpleDoc("4", "I really like baguette."),
    "5": SimpleDoc("5", "I have never been to Lyon or Bordeaux")
})

# here load retriever and reranker models: feel free to experiment with your own
retriever = SentenceTransformer("all-MiniLM-L6-v2")
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

# making the query document
query = SimpleDoc("q1", "Do you like France?")


In [None]:
from pv211_utils.systems.reranker import RerankerSystem

reranker_sys = RerankerSystem(
    retriever=retriever,
    reranker=reranker,
    answers=answers,
    no_reranks=2
)

results = list(reranker_sys.search(query))
print("\n RerankerSystem Results:")
for doc in results:
    print("-", doc)



 RerankerSystem Results:
- I have never been to Lyon or Bordeaux
- I really like baguette.
- The Eiffel Tower is in Paris.
- The Great Wall can be seen from space.
- Mount Everest is the tallest mountain.
