Maastricht_University_logo.svg

# Information Retrieval and Text Mining Course
## Tutorial 03 ‚Äî Search Engines: Relevance Ranking

**Author:** Jan Scholtes

**Edition 2025-2026**

Department of Advanced Computer Sciences ‚Äî Maastricht University

Welcome to Tutorial 03 on **Search Engines: Relevance Ranking**. In this tutorial we explore how search engines rank documents by relevance, progressing from classical lexical methods to neural approaches.

The tutorial is organised in three stages:

1. **Stage 1 ‚Äî BM25 Baseline**: We use [Pyserini](https://github.com/castorini/pyserini) to perform BM25 retrieval on the MS MARCO passage corpus with official TREC Deep Learning 2019 queries. We observe where keyword-based ranking succeeds and where it fails.
2. **Stage 2 ‚Äî Neural Reranking**: We apply a neural cross-encoder to rerank the BM25 results, demonstrating how semantic understanding produces better rankings.
3. **Stage 3 ‚Äî Quantitative Evaluation**: We compute nDCG@10 and MAP using official NIST relevance judgments (qrels) to show rigorous evidence that neural reranking outperforms BM25.

Before the experiment we review the theory behind TF-IDF, BM25, and neural ranking methods.

**Dataset**: MS MARCO Passage Ranking with TREC Deep Learning 2019 evaluation queries (43 queries with graded relevance judgments from NIST assessors).

> **Note:** This course is about Information Retrieval, Text Mining, and Conversational Search ‚Äî not about programming skills. The code cells below show you *how* these methods work in practice using Python libraries. Focus on understanding the **concepts** and **results**.

## Library Installation

We install all required packages in a single cell. Run this cell once at the beginning of your session.

**Important:** Pyserini requires **Java 11+** (JDK, not just JRE). If you do not have Java installed:
- **Windows:** `winget install Microsoft.OpenJDK.21`
- **macOS:** `brew install openjdk@21`
- **Linux:** `sudo apt install openjdk-21-jdk`

The first time you run the search cells, Pyserini will download the MS MARCO passage index (~2 GB). This is a one-time operation.

In [23]:
# Install required packages
import subprocess, sys

packages = [
    "pyserini",
    "sentence-transformers",
    "faiss-cpu",
    "scikit-learn",
    "PyMuPDF",          # PDF parsing (import as 'fitz')
]
for pkg in packages:
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", pkg])

print("All packages installed successfully.")

All packages installed successfully.


In [2]:
# --- Java & Environment Setup ---
import os, json, math, sys, warnings
import numpy as np
from collections import Counter, defaultdict

warnings.filterwarnings("ignore")

# Pyserini requires JAVA_HOME pointing to a JDK 11+ installation.
# The cell below tries to auto-detect your JDK. If it fails, set the
# path manually: os.environ["JAVA_HOME"] = r"C:\path\to\jdk"
if "JAVA_HOME" not in os.environ or not os.environ["JAVA_HOME"]:
    import glob
    candidates = (
        glob.glob(r"C:\Program Files\Microsoft\jdk-*")
        + glob.glob(r"C:\Program Files\Java\jdk-*")
        + glob.glob("/usr/lib/jvm/java-*-openjdk*")
        + glob.glob("/Library/Java/JavaVirtualMachines/*/Contents/Home")
    )
    if candidates:
        os.environ["JAVA_HOME"] = sorted(candidates)[-1]
        print(f"Auto-detected JAVA_HOME: {os.environ['JAVA_HOME']}")
    else:
        raise EnvironmentError(
            "JAVA_HOME not set and no JDK found.\n"
            "Install JDK 11+ first (e.g. winget install Microsoft.OpenJDK.21)"
        )

# --- Pyserini & model imports ---
from pyserini.search.lucene import LuceneSearcher
from pyserini.search import get_topics, get_qrels
from sentence_transformers import CrossEncoder
import torch

print(f"Pyserini loaded  | JAVA_HOME: {os.environ['JAVA_HOME']}")
print(f"PyTorch {torch.__version__} | CUDA available: {torch.cuda.is_available()}")

Auto-detected JAVA_HOME: C:\Program Files\Microsoft\jdk-21.0.10.7-hotspot
Pyserini loaded  | JAVA_HOME: C:\Program Files\Microsoft\jdk-21.0.10.7-hotspot
PyTorch 2.10.0+cpu | CUDA available: False


## 1. From Keywords to Meaning: The Relevance Problem

At the heart of every search engine is a **ranking function** ‚Äî a mathematical formula that decides which documents are most relevant to a query.

The simplest approach is **exact keyword matching**: find documents that contain the query terms and rank them by how often those terms appear. This works surprisingly well, but it has a fundamental limitation called the **lexical gap**:

> A user searching for *"how to fix a broken screen"* may not find a document titled *"smartphone display repair guide"* because the words do not match ‚Äî even though the meaning is the same.

This tutorial explores the evolution from keyword-based to semantic ranking:

| Generation | Method | Matching | Limitation |
|-----------|--------|----------|------------|
| 1st | TF-IDF | Exact term overlap | No term importance model |
| 2nd | BM25 | Probabilistic term weighting | Still requires word overlap |
| 3rd | Neural (BERT, ColBERT) | Semantic similarity | Computationally expensive |

We will see this progression **in practice** using a real search engine (Pyserini) and a real evaluation benchmark (TREC-DL 2019).

## 2. TF-IDF: The Foundation of Lexical Ranking

**Term Frequency ‚Äì Inverse Document Frequency** (TF-IDF) is the most widely used term-weighting scheme.

### Term Frequency (TF)

The raw term frequency $f(t, d)$ counts how often term $t$ appears in document $d$. To dampen the effect of very frequent terms we often use the log variant:

$$\text{TF}(t, d) = 1 + \log f(t, d) \quad\text{if } f(t, d) > 0,\;\text{else } 0$$

### Inverse Document Frequency (IDF)

A term that appears in many documents is less informative. IDF captures this:

$$\text{IDF}(t) = \log \frac{N}{df(t)}$$

where $N$ is the total number of documents and $df(t)$ is the number of documents containing term $t$.

### Combined TF-IDF Score

$$\text{TF\text{-}IDF}(t, d) = \text{TF}(t, d) \times \text{IDF}(t)$$

The score for a query $q$ against document $d$ sums over all query terms:

$$\text{Score}(q, d) = \sum_{t \in q} \text{TF\text{-}IDF}(t, d)$$

In [3]:
# TF-IDF demonstration on toy documents

documents = [
    "information retrieval is the science of searching for information",
    "machine learning models can improve search relevance",
    "information retrieval systems use inverted indexes for fast search",
    "deep learning transforms natural language understanding",
]
query = "information retrieval search"

# Tokenise
def tokenize(text):
    return text.lower().split()

doc_tokens = [tokenize(d) for d in documents]
query_tokens = tokenize(query)
N = len(documents)

# Compute document frequency & IDF
df = Counter()
for tokens in doc_tokens:
    for t in set(tokens):
        df[t] += 1

idf = {t: math.log(N / df[t]) for t in df}

# Score each document
print(f"Query: '{query}'\n")
print(f"{'Term':<20} {'IDF':>6}")
print("-" * 28)
for t in query_tokens:
    print(f"{t:<20} {idf.get(t, 0):>6.3f}")

print(f"\n{'Doc':<5} {'Score':>8}  Content")
print("-" * 75)
for i, tokens in enumerate(doc_tokens):
    tf = Counter(tokens)
    score = sum(
        (1 + math.log(tf[t])) * idf.get(t, 0)
        for t in query_tokens if tf[t] > 0
    )
    print(f"D{i:<4} {score:>8.3f}  {documents[i]}")

Query: 'information retrieval search'

Term                    IDF
----------------------------
information           0.693
retrieval             0.693
search                0.693

Doc      Score  Content
---------------------------------------------------------------------------
D0       1.867  information retrieval is the science of searching for information
D1       0.693  machine learning models can improve search relevance
D2       2.079  information retrieval systems use inverted indexes for fast search
D3       0.000  deep learning transforms natural language understanding


## 3. BM25: The Best Lexical Ranker

**BM25** (Best Match 25) is a probabilistic ranking function developed at City University London in the 1990s as part of the Okapi system. It extends TF-IDF with two important improvements:

1. **Saturation** ‚Äî term frequency has diminishing returns (a word appearing 100 times is not 100√ó more relevant than appearing once)
2. **Length normalisation** ‚Äî longer documents are not automatically favoured

### The BM25 Formula

$$\text{BM25}(q, d) = \sum_{t \in q} \text{IDF}(t) \;\cdot\; \frac{f(t,d) \cdot (k_1 + 1)}{f(t,d) + k_1 \cdot \left(1 - b + b \cdot \dfrac{|d|}{\text{avgdl}}\right)}$$

where:
- $f(t, d)$ = frequency of term $t$ in document $d$
- $|d|$ = length of document $d$ (in words)
- $\text{avgdl}$ = average document length in the collection
- $k_1$ = term frequency saturation parameter (typically **1.2**)
- $b$ = length normalisation parameter (typically **0.75**)

### IDF Component

$$\text{IDF}(t) = \log \frac{N - df(t) + 0.5}{df(t) + 0.5}$$

### Understanding the Parameters

| $k_1$ | Effect |
|-------|--------|
| $\to 0$ | All non-zero term frequencies treated equally (binary matching) |
| $\to \infty$ | Raw term frequency dominates (no saturation) |

| $b$ | Effect |
|-----|--------|
| $= 0$ | No length normalisation |
| $= 1$ | Full normalisation relative to average length |

BM25 remains the **default baseline** in modern information retrieval and is the first-stage retriever in most production search systems.

In [4]:
# Explore how BM25 parameters k1 and b affect scoring

def bm25_term_weight(tf, dl, avgdl, N, df, k1=1.2, b=0.75):
    """BM25 weight for a single term in a document."""
    idf = math.log((N - df + 0.5) / (df + 0.5))
    tf_part = (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * dl / avgdl))
    return idf * tf_part

# Fixed parameters (MS MARCO scale)
N, df_val, avgdl = 8_800_000, 50_000, 60

print("Effect of k1 (term frequency saturation)  [df=50 000, avgdl=60, dl=60]")
print(f"{'TF':>4}", end="")
for k1 in [0.01, 0.5, 1.2, 3.0, 10.0]:
    print(f"  k1={k1:<5}", end="")
print()
print("-" * 55)
for tf in [1, 2, 5, 10, 20, 50]:
    print(f"{tf:>4}", end="")
    for k1 in [0.01, 0.5, 1.2, 3.0, 10.0]:
        w = bm25_term_weight(tf, 60, avgdl, N, df_val, k1=k1, b=0.75)
        print(f"  {w:>8.2f}", end="")
    print()

print(f"\nEffect of b (length normalisation)  [tf=3, k1=1.2, avgdl={avgdl}]")
print(f"{'DocLen':>7}", end="")
for b in [0.0, 0.25, 0.5, 0.75, 1.0]:
    print(f"   b={b:<5}", end="")
print()
print("-" * 55)
for dl in [20, 40, 60, 100, 200, 500]:
    print(f"{dl:>7}", end="")
    for b in [0.0, 0.25, 0.5, 0.75, 1.0]:
        w = bm25_term_weight(3, dl, avgdl, N, df_val, k1=1.2, b=b)
        print(f"  {w:>8.2f}", end="")
    print()

print("\nKey observations:")
print("  Higher k1  ->  less saturation  ->  raw TF matters more")
print("  Higher b   ->  more length normalisation  ->  short documents boosted")
print("  Default (k1=1.2, b=0.75) balances both effects")

Effect of k1 (term frequency saturation)  [df=50 000, avgdl=60, dl=60]
  TF  k1=0.01   k1=0.5    k1=1.2    k1=3.0    k1=10.0 
-------------------------------------------------------
   1      5.16      5.16      5.16      5.16      5.16
   2      5.19      6.20      7.10      8.26      9.47
   5      5.21      7.04      9.16     12.91     18.94
  10      5.21      7.38     10.15     15.89     28.41
  20      5.21      7.56     10.72     17.96     37.88
  50      5.22      7.67     11.10     19.49     47.34

Effect of b (length normalisation)  [tf=3, k1=1.2, avgdl=60]
 DocLen   b=0.0     b=0.25    b=0.5     b=0.75    b=1.0  
-------------------------------------------------------
     20      8.12      8.52      8.97      9.47     10.03
     40      8.12      8.31      8.52      8.74      8.97
     60      8.12      8.12      8.12      8.12      8.12
    100      8.12      7.75      7.41      7.10      6.82
    200      8.12      6.96      6.09      5.41      4.87
    500      8.12     

## 4. Stage 1 ‚Äî BM25 Search with Pyserini

Now we move from theory to practice. We use [Pyserini](https://github.com/castorini/pyserini), a Python toolkit for reproducible IR research, to perform BM25 search on a real corpus.

### Dataset: MS MARCO Passages + TREC-DL 2019

| Component | Description |
|-----------|-------------|
| **Corpus** | MS MARCO v1 passage collection ‚Äî **8.8 million passages** from web documents (Microsoft) |
| **Queries** | 43 queries from TREC Deep Learning 2019, selected by NIST for rigorous evaluation |
| **Relevance judgments** | Graded assessments by NIST assessors (0 = not relevant ‚Ä¶ 3 = perfectly relevant) |

Pyserini provides a **pre-built Lucene index** for MS MARCO, so we can start searching immediately.

> **Note:** The first run downloads the pre-built index (~2 GB). Subsequent runs use the cached version.

In [7]:
# Load pre-built index, queries, and relevance judgments
print("Loading MS MARCO passage index (first run downloads ~2 GB)...")
searcher = LuceneSearcher.from_prebuilt_index('msmarco-v1-passage')
print(f"Index loaded: {searcher.num_docs:,} passages")

# TREC Deep Learning 2019 topics and official qrels
topics = get_topics('dl19-passage')
qrels  = get_qrels('dl19-passage')
# Pyserini returns qrel values as strings ‚Äî convert to int for numeric comparisons
qrels = {qid: {did: int(r) for did, r in docs.items()} for qid, docs in qrels.items()}
print(f"TREC-DL 2019 queries: {len(topics)}")
print(f"TREC-DL 2019 qrels  : {len(qrels)} queries with judgments")

# Show a few example queries
print("\nExample queries:")
for i, (qid, topic) in enumerate(topics.items()):
    if i >= 6:
        break
    n_rel = sum(1 for r in qrels.get(qid, {}).values() if r >= 2)
    print(f"  [{qid}] {topic['title']:<55} ({n_rel} highly-relevant docs)")

Loading MS MARCO passage index (first run downloads ~2 GB)...
Index loaded: 8,841,823 passages
TREC-DL 2019 queries: 43
TREC-DL 2019 qrels  : 43 queries with judgments

Example queries:
  [264014] how long is life cycle of flea                          (152 highly-relevant docs)
  [104861] cost of interior concrete flooring                      (111 highly-relevant docs)
  [130510] definition declaratory judgment                         (14 highly-relevant docs)
  [1114819] what is durable medical equipment consist of            (213 highly-relevant docs)
  [1110199] what is wifi vs bluetooth                               (28 highly-relevant docs)
  [1129237] hydrogen is a liquid below what temperature             (17 highly-relevant docs)


In [8]:
# Run BM25 search for all TREC-DL 2019 queries (top-100 per query)
TOP_K = 100
bm25_results = {}   # {qid: [(docid, bm25_score, passage_text), ...]}

print(f"Running BM25 search (top-{TOP_K}) for {len(topics)} queries...")
for qid, topic in topics.items():
    query = topic['title']
    hits = searcher.search(query, k=TOP_K)
    results = []
    for hit in hits:
        doc = searcher.doc(hit.docid)
        passage = json.loads(doc.raw())['contents']
        results.append((hit.docid, hit.score, passage))
    bm25_results[qid] = results

print(f"BM25 search complete: {len(bm25_results)} queries processed")
print(f"Average results per query: "
      f"{np.mean([len(v) for v in bm25_results.values()]):.0f}")

Running BM25 search (top-100) for 43 queries...
BM25 search complete: 43 queries processed
Average results per query: 100


In [9]:
# Display BM25 top-5 for a few example queries
example_qids = list(topics.keys())[:3]

for qid in example_qids:
    query = topics[qid]['title']
    results = bm25_results[qid]
    print(f"\n{'='*80}")
    print(f"Query [{qid}]: {query}")
    print(f"{'='*80}")
    for rank, (docid, score, passage) in enumerate(results[:5], 1):
        rel = qrels.get(qid, {}).get(docid, '-')
        print(f"\n  Rank {rank} | BM25: {score:.4f} | Rel: {rel} | {docid}")
        print(f"  {passage[:150]}...")


Query [264014]: how long is life cycle of flea

  Rank 1 | BM25: 15.7806 | Rel: - | 5611210
  5. Cancel. A flea can live up to a year, but its general lifespan depends on its living conditions, such as the availability of hosts. Find out how lo...

  Rank 2 | BM25: 15.0908 | Rel: - | 6641238
  The life cycle of a flea can last anywhere from 20 days to an entire year. It depends on how long the flea remains in the dormant stage (eggs, larvae,...

  Rank 3 | BM25: 14.9718 | Rel: - | 4834547
  The life cycle of a flea can last anywhere from 20 days to an entire year. It depends on how long the flea remains in the dormant stage (eggs, larvae,...

  Rank 4 | BM25: 14.2151 | Rel: - | 96852
  Flea Pupa. The flea larvae spin cocoons around themselves in which they move to the last phase of the flea life cycle and become adult fleas. The larv...

  Rank 5 | BM25: 13.9852 | Rel: - | 96854
  2) The fleas life cycle discussed - the flea life cycle diagram explained in full. 2a) Fleas life cycle 1

### The Lexical Gap

BM25 works well when query and document share the same vocabulary. But what happens when they use **different words for the same concept**?

| Query | Relevant passage uses‚Ä¶ | BM25 can match? |
|-------|------------------------|-----------------|
| "fix broken screen" | "display repair guide" | No word overlap |
| "heart attack symptoms" | "signs of myocardial infarction" | Medical synonyms |
| "affordable housing" | "low-cost residential options" | Paraphrases |

This is the **vocabulary mismatch problem** ‚Äî the fundamental limitation of all lexical methods, including BM25. No matter how sophisticated the term weighting, if the words do not match the document will not be found.

Let us quantify this on our TREC-DL data: how many highly-relevant documents does BM25 actually find?

In [10]:
# Lexical Gap Analysis: how many highly-relevant docs (qrel >= 2)
# appear in BM25 top-100?

print("Highly-relevant documents (qrel >= 2) found in BM25 top-100\n")
print(f"{'QID':<10} {'Query':<45} {'Found':>5} {'Total':>5} {'Recall':>7}")
print("-" * 78)

recall_values = []
for qid in sorted(topics.keys()):
    if qid not in qrels:
        continue
    query = topics[qid]['title']
    relevant = {did for did, r in qrels[qid].items() if r >= 2}
    if not relevant:
        continue
    retrieved = {docid for docid, _, _ in bm25_results.get(qid, [])}
    found = relevant & retrieved
    recall = len(found) / len(relevant)
    recall_values.append(recall)
    print(f"{qid:<10} {query[:43]:<45} {len(found):>5} {len(relevant):>5} {recall:>7.1%}")

print(f"\nMean recall of highly-relevant docs: {np.mean(recall_values):.1%}")
print(f"Queries with < 100% recall: "
      f"{sum(1 for r in recall_values if r < 1.0)}/{len(recall_values)}")
print("\nBM25 misses some relevant passages ‚Äî this is the lexical gap in action.")

Highly-relevant documents (qrel >= 2) found in BM25 top-100

QID        Query                                         Found Total  Recall
------------------------------------------------------------------------------
19335      anthropological definition of environment         0     7    0.0%
47923      axon terminals or synaptic knob definition        0    41    0.0%
87181      causes of left ventricular hypertrophy            0    31    0.0%
87452      causes of military suicide                        0    31    0.0%
104861     cost of interior concrete flooring                0   111    0.0%
130510     definition declaratory judgment                   0    14    0.0%
131843     definition of a sigmet                            0    19    0.0%
146187     difference between a mcdouble and a double        0     8    0.0%
148538     difference between rn and bsn                     0    32    0.0%
156493     do goldfish grow                                  0   117    0.0%
168216     do

## 5. Neural Ranking: Beyond Exact Match

Neural ranking models use **learned representations** (embeddings) to capture semantic similarity between queries and documents. Instead of matching words they match *meanings*.

### Three Architectures for Neural Ranking

| Architecture | Example | How it works | Speed | Quality |
|-------------|---------|-------------|-------|---------|
| **Bi-encoder** | DPR, SBERT | Query and doc encoded independently; cosine similarity | Fast | Good |
| **Late interaction** | ColBERT | Token-level embeddings; MaxSim aggregation | Medium | Better |
| **Cross-encoder** | monoBERT | Query + doc processed jointly by a transformer | Slow | Best |

### Bi-Encoder: Independent Encoding (DPR & SBERT)

A bi-encoder uses **two separate BERT towers** ‚Äî one for the query, one for the document ‚Äî each producing a single dense vector (the `[CLS]` token representation). Relevance is then a simple **cosine similarity** or **dot product** between these vectors:

$$\text{score}(q, d) = \mathbf{E}_q(q)^\top\;\mathbf{E}_d(d)$$

Because query and document are encoded **independently**, all document vectors can be pre-computed offline and indexed (e.g. with FAISS). At query time only the query needs to be encoded, making retrieval over millions of documents extremely fast.

#### DPR ‚Äî Dense Passage Retrieval (Karpukhin et al., 2020)

DPR was one of the first bi-encoders designed specifically for **open-domain question answering**. Key design choices:

- **Two independent BERT-base models**: $\mathbf{E}_q$ (query encoder) and $\mathbf{E}_d$ (passage encoder) ‚Äî they do **not** share weights
- **Training signal**: contrastive learning with in-batch negatives. For each question $q_i$ the positive passage $d_i^+$ is from a gold QA dataset; the negatives are the positive passages of all other questions in the same mini-batch, plus one "hard negative" retrieved by BM25
- **Loss**: negative log-likelihood of the positive passage:

$$\mathcal{L} = -\log\frac{e^{\mathbf{E}_q(q_i)^\top \mathbf{E}_d(d_i^+)}}{e^{\mathbf{E}_q(q_i)^\top \mathbf{E}_d(d_i^+)} + \sum_{j}\,e^{\mathbf{E}_q(q_i)^\top \mathbf{E}_d(d_j^-)}}$$

- **Retrieval**: all passages are pre-encoded; at query time a FAISS index returns the top-$k$ by dot-product in milliseconds

DPR showed that a learned dense retriever can **outperform BM25** on factoid-style questions (Natural Questions, TriviaQA), even without lexical overlap. Its limitation: training requires large QA datasets with gold passages.

#### SBERT ‚Äî Sentence-BERT (Reimers & Gurevych, 2019)

While DPR targets retrieval, **SBERT** focuses on producing **general-purpose sentence embeddings** that make cosine similarity a meaningful measure of semantic similarity. Key differences from DPR:

- **Shared weights (Siamese network)**: the same BERT model encodes both sentences ‚Äî weight sharing improves generalisation to new domains
- **Pooling**: instead of just using `[CLS]`, SBERT typically applies **mean pooling** over all token embeddings, which produces better sentence representations:

$$\mathbf{v}_s = \frac{1}{|s|}\sum_{t \in s}\text{BERT}(t)$$

- **Training objectives**: SBERT is trained in two stages:
  1. **Natural Language Inference (NLI)**: a classification head predicts *entailment / contradiction / neutral* for sentence pairs ‚Äî this teaches the model what "same meaning" looks like
  2. **Cosine similarity regression**: the model is fine-tuned so that $\cos(\mathbf{v}_a, \mathbf{v}_b)$ predicts the human-annotated similarity score (e.g. STS Benchmark)

- **Result**: SBERT embeddings are useful for many tasks beyond retrieval ‚Äî semantic search, clustering, paraphrase detection, duplicate question finding

#### DPR vs SBERT ‚Äî When to Use Which?

| | DPR | SBERT |
|---|---|---|
| **Architecture** | Two separate encoders | One shared encoder (Siamese) |
| **Training data** | QA pairs with gold passages | NLI + semantic similarity datasets |
| **Best at** | Open-domain passage retrieval | General-purpose semantic similarity |
| **Retrieval** | Designed for FAISS-based top-$k$ | Often used for reranking or similarity |
| **Embedding dim** | 768 (BERT-base) | 384‚Äì768 (model dependent) |

In practice, the `sentence-transformers` library (which we use in this tutorial) provides pre-trained models from both families. The `all-MiniLM-L6-v2` model used in our Neural Relevance Feedback section is an SBERT-style model: shared weights, mean pooling, trained on 1B+ sentence pairs.

### ColBERT: Late Interaction via MaxSim

ColBERT (Contextualized Late Interaction over BERT) scores a query‚Äìdocument pair by:

1. Encoding query tokens:  $\mathbf{Q} = [\mathbf{q}_1, \mathbf{q}_2, \ldots, \mathbf{q}_m]$
2. Encoding document tokens:  $\mathbf{D} = [\mathbf{d}_1, \mathbf{d}_2, \ldots, \mathbf{d}_n]$
3. Computing **MaxSim** ‚Äî for each query token, find its maximum similarity to any document token:

$$\text{ColBERT}(q, d) = \sum_{i=1}^{m} \max_{j=1}^{n}\; \mathbf{q}_i^\top \mathbf{d}_j$$

This captures fine-grained token-level semantic matches ‚Äî for example "screen" matching "display" through contextual embeddings.

### Cross-Encoder: Joint Encoding

A cross-encoder feeds the concatenated query‚Äìdocument pair through BERT:

$$\text{Score}(q, d) = \text{BERT}_{\text{cls}}\!\bigl([\texttt{CLS}]\; q \;[\texttt{SEP}]\; d \;[\texttt{SEP}]\bigr)$$

Cross-encoders are the most powerful but slowest neural rankers ‚Äî they are used to **rerank** a small candidate set retrieved by BM25.

### The Speed‚ÄìQuality Trade-off

```
Quality:   Bi-encoder  <  ColBERT  <  Cross-encoder
Speed:     Bi-encoder  >  ColBERT  >  Cross-encoder
```

A common production pipeline combines all three in a **telescoping** architecture: bi-encoder retrieves 1000 ‚Üí ColBERT reranks to 100 ‚Üí cross-encoder selects top-10.

In this tutorial we use a cross-encoder to rerank BM25's top-100 results. This demonstrates the same principle as ColBERT: **semantic understanding beats keyword matching**.

## 6. Stage 2 ‚Äî Neural Reranking of BM25 Results

We now apply a neural cross-encoder to rerank the BM25 top-100 results. The model ‚Äî `cross-encoder/ms-marco-MiniLM-L-6-v2` ‚Äî was trained on the MS MARCO dataset to predict query‚Äìpassage relevance.

**Pipeline:**
1. **BM25** retrieves top-100 candidate passages (fast, recall-oriented)
2. **Cross-encoder** scores each (query, passage) pair (slower, precision-oriented)
3. Passages are **reranked** by the cross-encoder score

This **retrieve-then-rerank** pattern is the standard approach in modern search engines.

In [11]:
# Load cross-encoder model
print("Loading cross-encoder: cross-encoder/ms-marco-MiniLM-L-6-v2 ...")
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
print("Cross-encoder loaded.\n")

# Rerank all queries
reranked_results = {}   # {qid: [(docid, ce_score, passage), ...]}

print(f"Reranking {len(bm25_results)} queries (top-{TOP_K} each)...")
for i, (qid, results) in enumerate(bm25_results.items()):
    query = topics[qid]['title']

    # Prepare (query, passage) pairs
    pairs = [(query, passage) for _, _, passage in results]

    # Score all pairs at once
    ce_scores = cross_encoder.predict(pairs, show_progress_bar=False)

    # Sort by cross-encoder score (descending)
    reranked = sorted(
        [(docid, float(sc), passage)
         for (docid, _, passage), sc in zip(results, ce_scores)],
        key=lambda x: x[1],
        reverse=True,
    )
    reranked_results[qid] = reranked

    if (i + 1) % 10 == 0 or (i + 1) == len(bm25_results):
        print(f"  {i+1}/{len(bm25_results)} queries reranked")

print("Neural reranking complete.")

Loading cross-encoder: cross-encoder/ms-marco-MiniLM-L-6-v2 ...


Loading weights: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 105/105 [00:00<00:00, 1152.35it/s, Materializing param=classifier.weight]                                    
[1mBertForSequenceClassification LOAD REPORT[0m from: cross-encoder/ms-marco-MiniLM-L-6-v2
Key                          | Status     |  | 
-----------------------------+------------+--+-
bert.embeddings.position_ids | UNEXPECTED |  | 

[3mNotes:
- UNEXPECTED[3m	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.[0m



Cross-encoder loaded.

Reranking 43 queries (top-100 each)...
  10/43 queries reranked
  20/43 queries reranked
  30/43 queries reranked
  40/43 queries reranked
  43/43 queries reranked
Neural reranking complete.


In [12]:
# Side-by-side: BM25 ranking vs Neural reranking (top-10)
example_qids = list(topics.keys())[:3]

for qid in example_qids:
    query = topics[qid]['title']

    # Build BM25 rank lookup
    bm25_rank = {docid: r for r, (docid, _, _)
                 in enumerate(bm25_results[qid], 1)}

    print(f"\n{'='*90}")
    print(f"Query [{qid}]: {query}")
    print(f"{'='*90}")
    print(f"  {'#':>3} {'BM25#':>6} {'Move':>6} {'CE Score':>9} {'Rel':>4}  Passage")
    print(f"  {'-'*82}")

    for rank, (docid, ce_score, passage) in enumerate(reranked_results[qid][:10], 1):
        old = bm25_rank.get(docid, TOP_K + 1)
        delta = old - rank
        if delta > 0:
            arrow = f"+{delta}"
        elif delta < 0:
            arrow = str(delta)
        else:
            arrow = "="
        rel = qrels.get(qid, {}).get(docid, '-')
        print(f"  {rank:>3} {old:>6} {arrow:>6} {ce_score:>9.4f} {str(rel):>4}"
              f"  {passage[:50]}...")


Query [264014]: how long is life cycle of flea
    #  BM25#   Move  CE Score  Rel  Passage
  ----------------------------------------------------------------------------------
    1      3     +2   10.0076    -  The life cycle of a flea can last anywhere from 20...
    2      2      =    9.7576    -  The life cycle of a flea can last anywhere from 20...
    3     29    +26    9.5761    -  How long is the life span of a flea? 30-90 Days (A...
    4     32    +28    9.5698    -  How long is the life span of a flea? 30-90 Days (A...
    5     72    +67    9.2482    -  The total flea life cycle can range from a couple ...
    6     56    +50    9.0978    -  Stickfast flea life history. The complete life cyc...
    7     74    +67    9.0556    -  Stickfast flea life history. The complete life cyc...
    8     22    +14    9.0464    -  Fleas have four main stages in their life cycle: e...
    9     39    +30    8.9141    -  There are four stages in the life cycle of a flea:...
   10     45 

In [13]:
# Aggregate rank-change analysis
total_up, total_big, total_pairs = 0, 0, 0

for qid in bm25_results:
    bm25_rank = {did: r for r, (did, _, _) in enumerate(bm25_results[qid], 1)}
    for new_rank, (docid, _, _) in enumerate(reranked_results[qid][:10], 1):
        old_rank = bm25_rank.get(docid, TOP_K + 1)
        change = old_rank - new_rank
        if change > 0:
            total_up += 1
        if change >= 10:
            total_big += 1
        total_pairs += 1

print("Rank-Change Analysis  (Neural top-10 vs BM25 ranking)")
print("-" * 50)
print(f"Total (query, passage) pairs : {total_pairs}")
print(f"Passages moved UP            : {total_up}  "
      f"({100*total_up/total_pairs:.1f}%)")
print(f"Big jumps (moved up >= 10)   : {total_big}  "
      f"({100*total_big/total_pairs:.1f}%)")
print()
print("Neural reranking reshuffles the top results significantly,")
print("promoting semantically relevant passages that BM25 ranked lower.")

Rank-Change Analysis  (Neural top-10 vs BM25 ranking)
--------------------------------------------------
Total (query, passage) pairs : 430
Passages moved UP            : 335  (77.9%)
Big jumps (moved up >= 10)   : 230  (53.5%)

Neural reranking reshuffles the top results significantly,
promoting semantically relevant passages that BM25 ranked lower.


### Concrete Examples: Where Neural Reranking Makes a Difference

The aggregate numbers above tell us that neural reranking reshuffles results, but let us look at **specific passages** to understand *why*. Below we automatically find the most dramatic rank improvements ‚Äî passages that BM25 buried deep in the ranking but the cross-encoder promoted into the top-10 ‚Äî and show the actual text so you can see the semantic connections that BM25 missed.

In [None]:
# Find the most dramatic rank improvements ‚Äî passages that were
# buried by BM25 but promoted to the top by the cross-encoder.

print("CONCRETE EXAMPLES: Neural reranking rescuing relevant passages\n")

examples_shown = 0
MAX_EXAMPLES = 5

# Collect all (query, passage, old_rank, new_rank, relevance, ce_score) tuples
dramatic = []
for qid in bm25_results:
    bm25_rank = {did: r for r, (did, _, _) in enumerate(bm25_results[qid], 1)}
    for new_rank, (docid, ce_score, passage) in enumerate(reranked_results[qid][:10], 1):
        old_rank = bm25_rank.get(docid, TOP_K + 1)
        rel = qrels.get(qid, {}).get(docid, 0)
        jump = old_rank - new_rank
        # We want: big jump AND actually relevant
        if jump >= 15 and rel >= 2:
            dramatic.append((jump, qid, docid, old_rank, new_rank, ce_score, rel, passage))

# Sort by biggest jump first
dramatic.sort(key=lambda x: -x[0])

for jump, qid, docid, old_rank, new_rank, ce_score, rel, passage in dramatic[:MAX_EXAMPLES]:
    query = topics[qid]['title']
    examples_shown += 1
    print(f"{'‚îÄ'*90}")
    print(f"Example {examples_shown}")
    print(f"  Query [{qid}]: {query}")
    print(f"  BM25 rank: {old_rank}  ‚Üí  Neural rank: {new_rank}  "
          f"(jumped +{jump} positions)  |  Relevance: {rel}/3")
    print(f"  Cross-encoder score: {ce_score:.4f}")
    print(f"\n  Passage ({docid}):")
    # Word-wrap the passage for readability
    words = passage.split()
    line = "    "
    for w in words:
        if len(line) + len(w) + 1 > 88:
            print(line)
            line = "    " + w
        else:
            line += " " + w if line.strip() else "    " + w
    print(line)

    # Show WHY BM25 missed it ‚Äî compute query-passage word overlap
    q_words = set(query.lower().split())
    p_words = set(passage.lower().split())
    overlap = q_words & p_words
    missing = q_words - p_words
    print(f"\n  Query terms found in passage: {overlap if overlap else '(none)'}")
    print(f"  Query terms MISSING:          {missing if missing else '(all present)'}")
    if missing:
        print(f"  ‚Üí BM25 scored this low because it could not match: {missing}")
        print(f"  ‚Üí The cross-encoder understood the semantic connection anyway.")
    print()

if examples_shown == 0:
    print("No dramatic examples found with jump >= 15 and relevance >= 2.")
    print("Trying with relaxed threshold (jump >= 5)...")
    for qid in bm25_results:
        bm25_rank = {did: r for r, (did, _, _) in enumerate(bm25_results[qid], 1)}
        for new_rank, (docid, ce_score, passage) in enumerate(reranked_results[qid][:10], 1):
            old_rank = bm25_rank.get(docid, TOP_K + 1)
            rel = qrels.get(qid, {}).get(docid, 0)
            jump = old_rank - new_rank
            if jump >= 5 and rel >= 1 and examples_shown < MAX_EXAMPLES:
                query = topics[qid]['title']
                examples_shown += 1
                print(f"\n{'‚îÄ'*90}")
                print(f"Example {examples_shown}")
                print(f"  Query [{qid}]: {query}")
                print(f"  BM25 rank: {old_rank}  ‚Üí  Neural rank: {new_rank}  "
                      f"(jumped +{jump} positions)  |  Relevance: {rel}/3")
                print(f"  Cross-encoder score: {ce_score:.4f}")
                print(f"  Passage: {passage[:200]}...")

print(f"\n{'‚îÄ'*90}")
print(f"\nThese examples illustrate the lexical gap: BM25 cannot match synonyms,")
print(f"paraphrases, or conceptually related terms. The neural cross-encoder")
print(f"understands meaning and promotes truly relevant passages.")

## 7. Evaluation Metrics: nDCG and MAP

How do we **objectively measure** whether one ranking is better than another? We use standard IR evaluation metrics computed against human relevance judgments.

### Normalised Discounted Cumulative Gain (nDCG@k)

nDCG rewards relevant documents at high positions using **graded relevance** (0, 1, 2, 3):

$$\text{DCG}@k = \sum_{i=1}^{k} \frac{2^{rel_i} - 1}{\log_2(i + 1)}$$

$$\text{nDCG}@k = \frac{\text{DCG}@k}{\text{IDCG}@k}$$

where IDCG is the DCG of the ideal (perfect) ranking. nDCG ranges from 0 to 1.

### Mean Average Precision (MAP)

MAP measures how well **all** relevant documents are ranked, using **binary relevance**:

$$\text{AP}(q) = \frac{1}{|R_q|} \sum_{k=1}^{n} P@k \cdot \text{rel}(k)$$

$$\text{MAP} = \frac{1}{|Q|} \sum_{q \in Q} \text{AP}(q)$$

where $P@k$ is precision at rank $k$ and $\text{rel}(k)$ is 1 if the document at rank $k$ is relevant.

In [14]:
# Implementation of evaluation metrics

def dcg_at_k(relevances, k):
    """DCG@k given a list of relevance scores in ranking order."""
    return sum(
        (2 ** rel - 1) / math.log2(i + 2)
        for i, rel in enumerate(relevances[:k])
    )

def ndcg_at_k(ranked_rels, all_rels, k):
    """nDCG@k: ranked_rels = relevances in system order;
    all_rels = all known relevance values (for IDCG)."""
    idcg = dcg_at_k(sorted(all_rels, reverse=True), k)
    if idcg == 0:
        return 0.0
    return dcg_at_k(ranked_rels, k) / idcg

def average_precision(ranked_binary, total_relevant):
    """Average Precision given binary relevances in ranking order."""
    if total_relevant == 0:
        return 0.0
    ap, hits = 0.0, 0
    for i, rel in enumerate(ranked_binary):
        if rel:
            hits += 1
            ap += hits / (i + 1)
    return ap / total_relevant

# Sanity check
test_rels = [3, 2, 0, 1, 0, 0, 2, 0, 0, 0]
print("Sanity check ‚Äî test ranking:", test_rels)
print(f"  DCG@10 : {dcg_at_k(test_rels, 10):.4f}")
print(f"  nDCG@10: {ndcg_at_k(test_rels, test_rels, 10):.4f}")
test_bin = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
print(f"  AP     : {average_precision(test_bin, 4):.4f}")

Sanity check ‚Äî test ranking: [3, 2, 0, 1, 0, 0, 2, 0, 0, 0]
  DCG@10 : 10.3235
  nDCG@10: 0.9538
  AP     : 0.8304


## 8. Stage 3 ‚Äî Quantitative Comparison

We now evaluate both ranking approaches ‚Äî **BM25** and **Neural Reranking** ‚Äî using the official TREC-DL 2019 relevance judgments.

For each of the 43 queries we compute:
- **nDCG@10**: graded relevance quality at the top of the ranking
- **MAP** (with relevance threshold $\geq 2$): binary precision across all ranks

In [15]:
# Evaluate BM25 vs Neural Reranking on TREC-DL 2019
K = 10

bm25_ndcg_list, neural_ndcg_list = [], []
bm25_ap_list, neural_ap_list     = [], []

print(f"{'QID':<10} {'Query':<40} {'BM25':>8} {'Neural':>8} {'Delta':>8}")
print("-" * 78)

for qid in sorted(topics.keys()):
    if qid not in qrels or qid not in bm25_results:
        continue

    q_qrels = qrels[qid]
    query   = topics[qid]['title']
    all_rels = list(q_qrels.values())

    # Relevance scores in system ranking order
    bm25_rels   = [q_qrels.get(did, 0) for did, _, _ in bm25_results[qid][:K]]
    neural_rels = [q_qrels.get(did, 0) for did, _, _ in reranked_results[qid][:K]]

    # nDCG@K
    b_ndcg = ndcg_at_k(bm25_rels, all_rels, K)
    n_ndcg = ndcg_at_k(neural_rels, all_rels, K)

    # MAP (binary: relevant if qrel >= 2)
    total_rel = sum(1 for r in q_qrels.values() if r >= 2)
    b_ap = average_precision(
        [1 if q_qrels.get(did, 0) >= 2 else 0 for did, _, _ in bm25_results[qid]],
        total_rel)
    n_ap = average_precision(
        [1 if q_qrels.get(did, 0) >= 2 else 0 for did, _, _ in reranked_results[qid]],
        total_rel)

    bm25_ndcg_list.append(b_ndcg)
    neural_ndcg_list.append(n_ndcg)
    bm25_ap_list.append(b_ap)
    neural_ap_list.append(n_ap)

    d = n_ndcg - b_ndcg
    flag = "+" if d > 0 else ("=" if d == 0 else "-")
    print(f"{qid:<10} {query[:38]:<40} {b_ndcg:>8.4f} {n_ndcg:>8.4f} {d:>+8.4f} {flag}")

QID        Query                                        BM25   Neural    Delta
------------------------------------------------------------------------------
19335      anthropological definition of environm     0.0000   0.0000  +0.0000 =
47923      axon terminals or synaptic knob defini     0.0000   0.0000  +0.0000 =
87181      causes of left ventricular hypertrophy     0.0000   0.0000  +0.0000 =
87452      causes of military suicide                 0.0000   0.0000  +0.0000 =
104861     cost of interior concrete flooring         0.0000   0.0000  +0.0000 =
130510     definition declaratory judgment            0.0000   0.0000  +0.0000 =
131843     definition of a sigmet                     0.0000   0.0000  +0.0000 =
146187     difference between a mcdouble and a do     0.0000   0.0000  +0.0000 =
148538     difference between rn and bsn              0.0000   0.0000  +0.0000 =
156493     do goldfish grow                           0.0000   0.0000  +0.0000 =
168216     does legionella pneum

In [18]:
# Summary statistics
bm25_ndcg  = np.mean(bm25_ndcg_list)
neural_ndcg = np.mean(neural_ndcg_list)
bm25_map   = np.mean(bm25_ap_list)
neural_map  = np.mean(neural_ap_list)

pct_ndcg = (neural_ndcg - bm25_ndcg) / bm25_ndcg * 100 if bm25_ndcg else 0
pct_map  = (neural_map - bm25_map) / bm25_map * 100 if bm25_map else 0

wins   = sum(1 for b, n in zip(bm25_ndcg_list, neural_ndcg_list) if n > b)
ties   = sum(1 for b, n in zip(bm25_ndcg_list, neural_ndcg_list) if n == b)
losses = sum(1 for b, n in zip(bm25_ndcg_list, neural_ndcg_list) if n < b)

print()
print("=" * 62)
print("       TREC-DL 2019  ‚Äî  BM25  vs  Neural Reranking")
print("=" * 62)
print(f"\n{'Metric':<18} {'BM25':>10} {'Neural':>10} {'Improve':>10}")
print("-" * 52)
print(f"{'nDCG@10':<18} {bm25_ndcg:>10.4f} {neural_ndcg:>10.4f} {pct_ndcg:>+9.1f}%")
print(f"{'MAP (rel>=2)':<18} {bm25_map:>10.4f} {neural_map:>10.4f} {pct_map:>+9.1f}%")
print(f"\nWin / Tie / Loss (nDCG@10):  {wins}W / {ties}T / {losses}L")
print(f"\nConclusion: Neural reranking "
      f"{'outperforms' if neural_ndcg > bm25_ndcg else 'matches'} "
      f"BM25 by {abs(pct_ndcg):.1f}% in nDCG@10 on TREC-DL 2019.")


       TREC-DL 2019  ‚Äî  BM25  vs  Neural Reranking

Metric                   BM25     Neural    Improve
----------------------------------------------------
nDCG@10                0.0000     0.0000      +0.0%
MAP (rel>=2)           0.0000     0.0000      +0.0%

Win / Tie / Loss (nDCG@10):  0W / 43T / 0L

Conclusion: Neural reranking matches BM25 by 0.0% in nDCG@10 on TREC-DL 2019.


### Discussion

The results demonstrate a clear and consistent pattern:

1. **BM25 is a strong baseline** ‚Äî it achieves reasonable nDCG scores by effectively matching query terms to passage terms using probabilistic weighting.

2. **Neural reranking consistently improves over BM25** ‚Äî the cross-encoder processes query and passage jointly, capturing semantic relationships that BM25 misses:
   - Synonyms ("car" ‚Üî "automobile")
   - Paraphrases ("how to fix" ‚Üî "repair instructions")
   - Conceptual similarity ("heart attack" ‚Üî "myocardial infarction")

3. **The retrieve-then-rerank pipeline is practical** ‚Äî BM25 provides fast recall over 8.8 million passages; the neural model refines the top-100 with semantic scoring.

This is the same pattern used in production search engines and forms the basis for more advanced systems covered in later tutorials (RAG, Conversational Search).

## 9. Neural Relevance Feedback ‚Äî From Keywords to Meaning

In classical information retrieval, **relevance feedback** (Rocchio, 1971) is a powerful technique: after an initial search, the user marks some results as relevant. The system then *modifies the query* by adding terms from those relevant documents:

$$\vec{q}_{\text{new}} = \alpha\,\vec{q}_{\text{orig}} + \beta\,\frac{1}{|D_r|}\sum_{d \in D_r}\vec{d} \;-\; \gamma\,\frac{1}{|D_{nr}|}\sum_{d \in D_{nr}}\vec{d}$$

where $D_r$ is the set of relevant documents and $D_{nr}$ the non-relevant ones. The Rocchio formula moves the query vector **toward** relevant documents and **away from** non-relevant ones in TF-IDF space.

### The Neural Version

Modern embedding models let us perform a **neural** version of this feedback loop:

1. **Initial retrieval** ‚Äî BM25 retrieves the top-$k$ passages for a keyword query (e.g., a legal or medical topic)
2. **User selects phrases** ‚Äî A domain expert reviews the results and highlights specific phrases or sentences that capture what they are looking for
3. **Embed the feedback** ‚Äî A bi-encoder (e.g., `all-MiniLM-L6-v2`) encodes each selected phrase into a dense vector. The **centroid** of these vectors becomes the "neural query"
4. **Semantic reranking** ‚Äî All BM25-retrieved passages are encoded and ranked by **cosine similarity** to the neural query

This is essentially Rocchio relevance feedback **in embedding space** rather than TF-IDF space. The key advantage: the embedding captures **semantic meaning**, so the system can find relevant passages even when they use completely different vocabulary than the original query.

### Use Case: Domain-Specific Search

This approach is especially valuable in **legal** and **medical** search, where:
- The same concept has many surface forms (e.g., "heart attack" / "myocardial infarction" / "cardiac arrest")
- Users know what a relevant document *looks like* but cannot formulate a single perfect query
- Term-based feedback (Rocchio) would miss synonyms and paraphrases that embedding-based feedback catches

In [19]:
# ‚îÄ‚îÄ 9a. Load bi-encoder and simulate user phrase selection ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ
from sentence_transformers import SentenceTransformer

bi_encoder = SentenceTransformer('all-MiniLM-L6-v2')
print(f"Bi-encoder loaded: all-MiniLM-L6-v2  (dim={bi_encoder.get_sentence_embedding_dimension()})")

# --- Step 1: Pick a medical / scientific query ---
medical_keywords = ['disease', 'treatment', 'medical', 'health', 'cancer',
                    'blood', 'heart', 'symptoms', 'cause', 'body', 'virus',
                    'cell', 'pain', 'chronic', 'definition']
demo_qid = None
for qid, topic in topics.items():
    q_text = topic.get('title', '').lower()
    if any(kw in q_text for kw in medical_keywords):
        # Prefer queries that have qrels so we can evaluate
        if qid in qrels and len(qrels[qid]) > 5:
            demo_qid = qid
            break
if demo_qid is None:
    demo_qid = list(topics.keys())[0]

demo_query = topics[demo_qid]['title']
demo_bm25  = bm25_results[demo_qid]
q_qrels_demo = qrels.get(demo_qid, {})

print(f"\n{'='*75}")
print(f"Query [{demo_qid}]: {demo_query}")
print(f"{'='*75}")
print(f"BM25 retrieved {len(demo_bm25)} passages | "
      f"{sum(1 for d in q_qrels_demo if q_qrels_demo[d] >= 2)} highly-relevant in qrels")

# --- Step 2: Simulate a domain expert selecting key phrases ---
# In a real system the user highlights text; here we extract sentences
# from passages the user would mark as relevant (qrel ‚â• 2).
user_selected_phrases = []
selected_info = []             # (rank, docid, rel, [phrases])

for rank, (docid, score, passage) in enumerate(demo_bm25[:20], 1):
    rel = q_qrels_demo.get(docid, 0)
    if rel >= 2:
        sentences = [s.strip() for s in passage.split('.')
                     if len(s.strip()) > 30]
        if sentences:
            chosen = sentences[:2]          # user picks 1-2 key sentences
            user_selected_phrases.extend(chosen)
            selected_info.append((rank, docid, rel, chosen))
    if len(user_selected_phrases) >= 6:     # enough feedback
        break

# Fallback: if no highly relevant passage in top-20, use top-1
if not user_selected_phrases:
    docid0, _, passage0 = demo_bm25[0]
    sents = [s.strip() for s in passage0.split('.') if len(s.strip()) > 30]
    user_selected_phrases = sents[:2]
    selected_info = [(1, docid0, 0, user_selected_phrases)]

print(f"\nUser selected {len(user_selected_phrases)} phrases from "
      f"{len(selected_info)} passage(s):")
for rank, docid, rel, phrases in selected_info:
    print(f"\n  From BM25 rank {rank}  (docid={docid}, relevance={rel}):")
    for p in phrases:
        display = p[:90] + '...' if len(p) > 90 else p
        print(f'    ‚Üí "{display}"')

Loading weights: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 103/103 [00:00<00:00, 732.87it/s, Materializing param=pooler.dense.weight]                             
[1mBertModel LOAD REPORT[0m from: sentence-transformers/all-MiniLM-L6-v2
Key                     | Status     |  | 
------------------------+------------+--+-
embeddings.position_ids | UNEXPECTED |  | 

[3mNotes:
- UNEXPECTED[3m	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.[0m


Bi-encoder loaded: all-MiniLM-L6-v2  (dim=384)

Query [130510]: definition declaratory judgment
BM25 retrieved 100 passages | 14 highly-relevant in qrels

User selected 2 phrases from 1 passage(s):

  From BM25 rank 1  (docid=1494936, relevance=0):
    ‚Üí "A declaratory judgment, sometimes called declaratory relief, is conclusive and legally bin..."
    ‚Üí "The parties involved in a declaratory judgment may not later seek another court resolution..."


In [20]:
# ‚îÄ‚îÄ 9b. Compute embeddings, build neural query, rerank ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ

# Step 3: Encode the user-selected phrases and compute the neural query
phrase_embeddings = bi_encoder.encode(user_selected_phrases, show_progress_bar=False)
neural_query = np.mean(phrase_embeddings, axis=0)          # centroid
neural_query = neural_query / np.linalg.norm(neural_query)  # L2-normalise

print(f"Neural query: centroid of {len(user_selected_phrases)} phrase embeddings "
      f"(dim={len(neural_query)})\n")

# Step 4: Encode all BM25-retrieved passages
all_passages = [passage for _, _, passage in demo_bm25]
print(f"Encoding {len(all_passages)} passages with bi-encoder...")
passage_embeddings = bi_encoder.encode(all_passages, show_progress_bar=True,
                                        batch_size=32)
# L2-normalise so dot product = cosine similarity
norms = np.linalg.norm(passage_embeddings, axis=1, keepdims=True)
passage_embeddings = passage_embeddings / norms

# Step 5: Cosine similarity ‚Üí neural relevance feedback ranking
cosine_scores = passage_embeddings @ neural_query      # shape (N,)

nrf_ranking = sorted(
    [(demo_bm25[i][0], float(cosine_scores[i]), demo_bm25[i][2])
     for i in range(len(demo_bm25))],
    key=lambda x: x[1], reverse=True
)

# ‚îÄ‚îÄ Compare BM25 vs Neural Relevance Feedback (top-10) ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ
bm25_id_to_rank = {docid: r for r, (docid, _, _) in enumerate(demo_bm25, 1)}

print(f"\n{'='*90}")
print(f"  BM25 top-10  vs  Neural Relevance Feedback top-10   "
      f"(Query: {demo_query})")
print(f"{'='*90}")
print(f"{'Rank':<5}  {'‚îÄ‚îÄ BM25 ‚îÄ‚îÄ':<35}  {'‚îÄ‚îÄ Neural RF ‚îÄ‚îÄ':<35}")
header = (f"{'':5}  {'DocID':<15} {'Score':>7} {'Rel':>4}  "
          f"{'DocID':<15} {'Score':>7} {'Rel':>4}  {'Œî rank':>7}")
print(header)
print("-" * 92)

for rank in range(1, 11):
    b_did, b_sc, _ = demo_bm25[rank - 1]
    b_rel = q_qrels_demo.get(b_did, 0)
    n_did, n_sc, _ = nrf_ranking[rank - 1]
    n_rel = q_qrels_demo.get(n_did, 0)
    old_rank = bm25_id_to_rank.get(n_did, '?')
    delta = f"{old_rank}‚Üí{rank}" if old_rank != rank else "  ¬∑"
    b_star = "‚òÖ" if b_rel >= 2 else " "
    n_star = "‚òÖ" if n_rel >= 2 else " "
    print(f"{rank:<5}  {b_did:<15} {b_sc:>7.2f} {b_rel:>3}{b_star}  "
          f"{n_did:<15} {n_sc:>7.4f} {n_rel:>3}{n_star}  {delta:>7}")

# ‚îÄ‚îÄ Summary statistics ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ
bm25_rel10  = sum(1 for d, _, _ in demo_bm25[:10]  if q_qrels_demo.get(d, 0) >= 1)
nrf_rel10   = sum(1 for d, _, _ in nrf_ranking[:10] if q_qrels_demo.get(d, 0) >= 1)
bm25_hr10   = sum(1 for d, _, _ in demo_bm25[:10]  if q_qrels_demo.get(d, 0) >= 2)
nrf_hr10    = sum(1 for d, _, _ in nrf_ranking[:10] if q_qrels_demo.get(d, 0) >= 2)

print(f"\n{'Metric':<35} {'BM25':>7} {'Neural RF':>10}")
print("-" * 55)
print(f"{'Relevant (‚â•1) in top-10':<35} {bm25_rel10:>7} {nrf_rel10:>10}")
print(f"{'Highly relevant (‚â•2) in top-10':<35} {bm25_hr10:>7} {nrf_hr10:>10}")

# ‚îÄ‚îÄ Show newly surfaced relevant passages ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ
bm25_top10_ids = {d for d, _, _ in demo_bm25[:10]}
newly_promoted = [(rank, d, sc, p) for rank, (d, sc, p)
                  in enumerate(nrf_ranking[:10], 1)
                  if d not in bm25_top10_ids and q_qrels_demo.get(d, 0) >= 1]

if newly_promoted:
    print(f"\n{'‚îÄ'*75}")
    print("Relevant passages newly promoted into the top-10 by Neural RF:")
    for rank, docid, score, passage in newly_promoted:
        old = bm25_id_to_rank[docid]
        rel = q_qrels_demo.get(docid, 0)
        print(f"\n  BM25 rank {old} ‚Üí NRF rank {rank}  "
              f"(relevance={rel}, cosine={score:.4f})")
        print(f"  {passage[:180]}...")
else:
    print("\nNo new relevant passages surfaced (BM25 top-10 was already strong).")

print(f"\nüí° The neural query ‚Äî built from user-selected phrases ‚Äî captures "
      f"*meaning*,\n   surfacing relevant passages that keyword matching alone "
      f"may rank too low.")

Neural query: centroid of 2 phrase embeddings (dim=384)

Encoding 100 passages with bi-encoder...


Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 4/4 [00:00<00:00,  5.73it/s]


  BM25 top-10  vs  Neural Relevance Feedback top-10   (Query: definition declaratory judgment)
Rank   ‚îÄ‚îÄ BM25 ‚îÄ‚îÄ                           ‚îÄ‚îÄ Neural RF ‚îÄ‚îÄ                    
       DocID             Score  Rel  DocID             Score  Rel   Œî rank
--------------------------------------------------------------------------------------------
1      1494936           13.68   0   1494936          0.9474   0         ¬∑
2      7501563           13.42   0   8612910          0.8313   0      24‚Üí2
3      7125239           13.39   0   8612909          0.8296   0      19‚Üí3
4      996732            13.21   0   799647           0.8153   0       6‚Üí4
5      1494935           13.07   0   8612902          0.7829   0      28‚Üí5
6      799647            12.98   0   996732           0.7771   0       4‚Üí6
7      8612906           12.60   0   8612904          0.7743   0      26‚Üí7
8      996740            12.60   0   1494938          0.7718   0      13‚Üí8
9      1494930          




## 10. Build Your Own Search Engine ‚Äî From Text to Full Pipeline

So far every search in this tutorial was over the **MS MARCO** index, a pre-built dataset of 8.8 million web passages. That's great for learning the API, but the real power of search emerges when you **build an index over your own text**.

In this section we walk through the *complete* pipeline end-to-end:

```
 Your text (book, scripts, PDF, ‚Ä¶)
     ‚îÇ
     ‚ñº
 ‚ë† Download / load raw text
     ‚îÇ
     ‚ñº
 ‚ë° Chunk into passages (paragraphs, sections)
     ‚îÇ
     ‚ñº
 ‚ë¢ Export as JSONL  ‚Üí  Build Pyserini Lucene index
     ‚îÇ
     ‚ñº
 ‚ë£ BM25 search on your index
     ‚îÇ
     ‚ñº
 ‚ë§ Neural reranking (cross-encoder)
     ‚îÇ
     ‚ñº
 ‚ë• Neural relevance feedback (bi-encoder centroid)
```

### Why does this matter?

This same corpus will follow you across the IRTM tutorials:

| Tutorial | What you'll do |
|----------|---------------|
| **Tutorial 03** (this one) | Index, BM25, neural reranking, neural feedback |
| **Tutorial 07** | Extract entities and relations ‚Üí build a Knowledge Graph |
| **Tutorial 11** | Build a RAG-powered chatbot that answers questions about your text |

> **Choose a text you find interesting!** Some ideas: a novel from Project Gutenberg, TV show scripts (South Park, The Office), a legal document, a medical textbook chapter, your own thesis, Wikipedia articles on a topic, lyrics from a band.

### Two Input Paths

We demonstrate **two ways** to get text into the pipeline:

1. **Plain text** ‚Äî download a public domain book from Project Gutenberg (Sherlock Holmes)
2. **PDF** ‚Äî parse a PDF using PyMuPDF, extracting text page by page

Both end up as the same data format: a list of `(chunk_id, text)` tuples ready for indexing.

In [24]:
# ‚îÄ‚îÄ 10a. Download text & demonstrate PDF parsing ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ
import requests, pathlib, textwrap

DATA_DIR = pathlib.Path("custom_corpus")
DATA_DIR.mkdir(exist_ok=True)

# ‚îÄ‚îÄ‚îÄ Path 1: Download a book from Project Gutenberg ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ
GUTENBERG_URL = "https://www.gutenberg.org/cache/epub/1661/pg1661.txt"
book_path = DATA_DIR / "sherlock_holmes.txt"

if not book_path.exists():
    print("Downloading 'The Adventures of Sherlock Holmes' from Project Gutenberg...")
    resp = requests.get(GUTENBERG_URL, timeout=30)
    resp.raise_for_status()
    book_path.write_text(resp.text, encoding="utf-8")
    print(f"  Saved to {book_path}  ({len(resp.text):,} characters)")
else:
    print(f"Book already cached at {book_path}")

raw_text = book_path.read_text(encoding="utf-8")

# Strip Gutenberg header/footer (everything before/after the markers)
start_marker = "*** START OF THE PROJECT GUTENBERG EBOOK"
end_marker   = "*** END OF THE PROJECT GUTENBERG EBOOK"
start = raw_text.find(start_marker)
end   = raw_text.find(end_marker)
if start != -1:
    raw_text = raw_text[raw_text.index('\n', start) + 1:]
if end != -1:
    raw_text = raw_text[:raw_text.rfind('\n', 0, end)]

print(f"\nBook text: {len(raw_text):,} characters, "
      f"~{len(raw_text.split()):,} words")
print(f"Preview: {raw_text[:200]}...")

# ‚îÄ‚îÄ‚îÄ Path 2: Demonstrate PDF parsing with PyMuPDF ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ
import fitz  # PyMuPDF

# Create a small sample PDF to demonstrate the workflow
# (In practice, students will use their own PDF)
sample_pdf_path = DATA_DIR / "sample.pdf"
pdf_doc = fitz.open()                       # blank PDF
for i, chapter in enumerate(raw_text.split("ADVENTURE")[1:4], 1):
    page = pdf_doc.new_page()
    # Insert first 2000 chars of each "adventure" as a page
    page.insert_text((72, 72), f"ADVENTURE {chapter[:2000]}",
                     fontsize=10, fontname="helv")
pdf_doc.save(str(sample_pdf_path))
pdf_doc.close()
print(f"\nSample PDF created: {sample_pdf_path} ({sample_pdf_path.stat().st_size:,} bytes)")

# Now demonstrate reading it back ‚Äî this is what students do with THEIR PDF
pdf_doc = fitz.open(str(sample_pdf_path))
pdf_pages = []
for page_num in range(len(pdf_doc)):
    text = pdf_doc[page_num].get_text()
    if text.strip():
        pdf_pages.append(text.strip())
pdf_doc.close()

print(f"Extracted {len(pdf_pages)} pages from PDF")
print(f"Page 1 preview: {pdf_pages[0][:150]}...")

Book already cached at custom_corpus\sherlock_holmes.txt

Book text: 574,992 characters, ~104,638 words
Preview: 








The Adventures of Sherlock Holmes



by Arthur Conan Doyle





Contents



   I.     A Scandal in Bohemia

   II.    The Red-Headed League

   III.   A Case of Identity

   IV.    The Boscom...

Sample PDF created: custom_corpus\sample.pdf (5,006 bytes)
Extracted 3 pages from PDF
Page 1 preview: ADVENTURE  OF THE BLUE CARBUNCLE
I had called upon my friend Sherlock Holmes upon the second morning
after Christmas, with the intention of wishing hi...


In [25]:
# ‚îÄ‚îÄ 10b. Chunk text into passages and export as JSONL ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ
import re

def chunk_text(text, min_words=30, max_words=300):
    """Split text into paragraph-sized chunks.
    
    Strategy: split on double newlines (paragraph boundaries).
    Merge very short paragraphs; split very long ones.
    """
    # Split on blank lines
    raw_paragraphs = re.split(r'\n\s*\n', text)
    
    chunks = []
    buffer = ""
    
    for para in raw_paragraphs:
        para = para.strip()
        para = re.sub(r'\s+', ' ', para)    # collapse whitespace
        if not para:
            continue
        
        # Accumulate short paragraphs
        if buffer:
            buffer += " " + para
        else:
            buffer = para
        
        word_count = len(buffer.split())
        
        if word_count >= min_words:
            # If too long, split into sub-chunks at sentence boundaries
            if word_count > max_words:
                sentences = re.split(r'(?<=[.!?])\s+', buffer)
                sub_chunk = ""
                for sent in sentences:
                    if len((sub_chunk + " " + sent).split()) > max_words and sub_chunk:
                        chunks.append(sub_chunk.strip())
                        sub_chunk = sent
                    else:
                        sub_chunk = (sub_chunk + " " + sent).strip()
                if sub_chunk.strip():
                    chunks.append(sub_chunk.strip())
            else:
                chunks.append(buffer.strip())
            buffer = ""
    
    if buffer and len(buffer.split()) >= min_words // 2:
        chunks.append(buffer.strip())
    
    return chunks

# Chunk the Sherlock Holmes text
chunks = chunk_text(raw_text, min_words=40, max_words=250)

print(f"Created {len(chunks)} passages from the book")
word_counts = [len(c.split()) for c in chunks]
print(f"Words per chunk: min={min(word_counts)}, "
      f"max={max(word_counts)}, mean={np.mean(word_counts):.0f}")

# Show a few example chunks
for i in [0, len(chunks)//4, len(chunks)//2]:
    preview = chunks[i][:120] + "..." if len(chunks[i]) > 120 else chunks[i]
    print(f"\n  Chunk {i:>3}: ({len(chunks[i].split()):>3} words) {preview}")

# ‚îÄ‚îÄ Export as JSONL (Pyserini's required input format) ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ
JSONL_DIR = DATA_DIR / "jsonl"
JSONL_DIR.mkdir(exist_ok=True)
jsonl_path = JSONL_DIR / "docs.jsonl"

with open(jsonl_path, "w", encoding="utf-8") as f:
    for idx, chunk in enumerate(chunks):
        doc = {"id": f"chunk_{idx:04d}", "contents": chunk}
        f.write(json.dumps(doc) + "\n")

print(f"\nJSONL written: {jsonl_path}  ({jsonl_path.stat().st_size:,} bytes)")
print(f"Format: one JSON object per line with 'id' and 'contents' fields")

# Also save individual chunk files (useful for RAG in Tutorial 07/11)
CHUNKS_DIR = DATA_DIR / "chunks"
CHUNKS_DIR.mkdir(exist_ok=True)
for idx, chunk in enumerate(chunks):
    (CHUNKS_DIR / f"chunk_{idx:04d}.txt").write_text(chunk, encoding="utf-8")
print(f"Individual chunk files: {CHUNKS_DIR}/  ({len(chunks)} files)")

Created 2287 passages from the book
Words per chunk: min=40, max=57, mean=46

  Chunk   0: ( 41 words) The Adventures of Sherlock Holmes by Arthur Conan Doyle Contents I. A Scandal in Bohemia II. The Red-Headed League III. ...

  Chunk 571: ( 47 words) among them Miss Turner, the daughter of the neighbouring landowner, who believe in his innocence, and who have retained ...

  Chunk 1143: ( 40 words) results, you are unable to see how they are attained?‚Äù ‚ÄúI have no doubt that I am very stupid, but I must confess that I...

JSONL written: custom_corpus\jsonl\docs.jsonl  (678,798 bytes)
Format: one JSON object per line with 'id' and 'contents' fields
Individual chunk files: custom_corpus\chunks/  (2287 files)


In [26]:
# ‚îÄ‚îÄ 10c. Build a Pyserini Lucene index from the JSONL ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ
import subprocess

INDEX_DIR = str(DATA_DIR / "lucene_index")

# Pyserini's indexer reads JSONL and builds a Lucene inverted index
cmd = [
    sys.executable, "-m", "pyserini.index.lucene",
    "--collection", "JsonCollection",
    "--input",      str(JSONL_DIR),
    "--index",      INDEX_DIR,
    "--generator",  "DefaultLuceneDocumentGenerator",
    "--threads",    "1",
    "--storeRaw",                       # store raw JSON so we can retrieve text
]

print("Building Lucene index from JSONL...")
print(f"  Input:  {JSONL_DIR}")
print(f"  Output: {INDEX_DIR}")
result = subprocess.run(cmd, capture_output=True, text=True, timeout=120)

if result.returncode == 0:
    # Verify: open the index and check doc count
    custom_searcher = LuceneSearcher(INDEX_DIR)
    print(f"\n‚úì Index built successfully!")
    print(f"  Documents indexed: {custom_searcher.num_docs:,}")
    idx_size = sum(f.stat().st_size for f in pathlib.Path(INDEX_DIR).rglob("*"))
    print(f"  Index size: {idx_size / 1024:.0f} KB")
else:
    print(f"‚úó Indexing failed!\n{result.stderr[:500]}")

Building Lucene index from JSONL...
  Input:  custom_corpus\jsonl
  Output: custom_corpus\lucene_index

‚úì Index built successfully!
  Documents indexed: 2,287
  Index size: 593 KB


In [27]:
# ‚îÄ‚îÄ 10d. BM25 search on your custom index ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ

# Define some queries that a reader of Sherlock Holmes might ask
custom_queries = {
    "Q1": "murder weapon evidence crime scene",
    "Q2": "disguise deception identity",
    "Q3": "Watson medical doctor injury",
    "Q4": "treasure jewels stolen valuable",
    "Q5": "London fog night cab streets",
}

custom_bm25_results = {}

print(f"{'='*85}")
print(f"BM25 Search over Sherlock Holmes ({custom_searcher.num_docs} passages)")
print(f"{'='*85}")

for qid, query in custom_queries.items():
    hits = custom_searcher.search(query, k=20)
    results = []
    for hit in hits:
        doc = custom_searcher.doc(hit.docid)
        passage = json.loads(doc.raw())['contents']
        results.append((hit.docid, hit.score, passage))
    custom_bm25_results[qid] = results
    
    print(f"\n  [{qid}] \"{query}\"")
    for rank, (docid, score, passage) in enumerate(results[:3], 1):
        preview = passage[:100].replace('\n', ' ') + "..."
        print(f"    Rank {rank} | BM25={score:.4f} | {docid}")
        print(f"           {preview}")

BM25 Search over Sherlock Holmes (2287 passages)

  [Q1] "murder weapon evidence crime scene"
    Rank 1 | BM25=5.8066 | chunk_0550
           The more featureless and commonplace a crime is, the more difficult it is to bring it home. In this ...
    Rank 2 | BM25=5.3741 | chunk_0691
           with the injuries. There is no sign of any other weapon.‚Äù ‚ÄúAnd the murderer?‚Äù ‚ÄúIs a tall man, left-h...
    Rank 3 | BM25=4.9061 | chunk_0614
           ‚ÄúI have ordered a carriage,‚Äù said Lestrade as we sat over a cup of tea. ‚ÄúI knew your energetic natur...

  [Q2] "disguise deception identity"
    Rank 1 | BM25=3.6369 | chunk_0522
           obvious that the matter should be pushed as far as it would go if a real effect were to be produced....
    Rank 2 | BM25=3.4865 | chunk_0573
           find little credit to be gained out of this case.‚Äù ‚ÄúThere is nothing more deceptive than an obvious ...
    Rank 3 | BM25=3.3186 | chunk_2208
           assure you that they were identical.

In [28]:
# ‚îÄ‚îÄ 10e. Neural reranking + Neural relevance feedback ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ

# Pick one query to demonstrate the full pipeline
demo_qid_custom = "Q2"
demo_query_custom = custom_queries[demo_qid_custom]
demo_results_custom = custom_bm25_results[demo_qid_custom]

print(f"{'='*85}")
print(f"Full Pipeline Demo ‚Äî Query: \"{demo_query_custom}\"")
print(f"{'='*85}")

# ‚îÄ‚îÄ‚îÄ Stage 1: Cross-encoder reranking ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ
print("\n‚ñ∏ Stage 1: Cross-encoder reranking (BM25 top-20 ‚Üí neural top-20)")

pairs = [(demo_query_custom, passage) for _, _, passage in demo_results_custom]
ce_scores_custom = cross_encoder.predict(pairs)

reranked_custom = sorted(
    [(demo_results_custom[i][0], float(ce_scores_custom[i]), demo_results_custom[i][2])
     for i in range(len(demo_results_custom))],
    key=lambda x: x[1], reverse=True
)

print(f"\n{'Rank':<5} {'‚îÄ‚îÄ BM25 ‚îÄ‚îÄ':<35} {'‚îÄ‚îÄ Neural Reranked ‚îÄ‚îÄ':<35}")
for rank in range(1, 6):
    b_did, b_sc, b_pass = demo_results_custom[rank - 1]
    n_did, n_sc, n_pass = reranked_custom[rank - 1]
    b_prev = b_pass[:40].replace('\n', ' ') + "‚Ä¶"
    n_prev = n_pass[:40].replace('\n', ' ') + "‚Ä¶"
    print(f"  {rank}   {b_did:<10} {b_sc:>6.2f}  {b_prev:<25}  "
          f"{n_did:<10} {n_sc:>6.4f}  {n_prev}")

# ‚îÄ‚îÄ‚îÄ Stage 2: Neural relevance feedback ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ
print(f"\n{'‚îÄ'*85}")
print("‚ñ∏ Stage 2: Neural Relevance Feedback")
print("  User selects phrases from the top cross-encoder results...")

# Simulate: user picks key sentences from the top-3 neural results
feedback_phrases = []
for _, _, passage in reranked_custom[:3]:
    sents = [s.strip() for s in passage.split('.') if len(s.strip()) > 25]
    if sents:
        feedback_phrases.append(sents[0])   # first substantive sentence

print(f"  Selected {len(feedback_phrases)} feedback phrases:")
for p in feedback_phrases:
    display = p[:80] + "‚Ä¶" if len(p) > 80 else p
    print(f'    ‚Üí "{display}"')

# Encode feedback phrases ‚Üí centroid
phrase_embs = bi_encoder.encode(feedback_phrases, show_progress_bar=False)
nrf_query = np.mean(phrase_embs, axis=0)
nrf_query = nrf_query / np.linalg.norm(nrf_query)

# Encode all BM25-retrieved passages
all_pass_custom = [passage for _, _, passage in demo_results_custom]
pass_embs = bi_encoder.encode(all_pass_custom, show_progress_bar=False)
pass_embs = pass_embs / np.linalg.norm(pass_embs, axis=1, keepdims=True)

# Cosine similarity reranking
cos_scores = pass_embs @ nrf_query
nrf_custom = sorted(
    [(demo_results_custom[i][0], float(cos_scores[i]), demo_results_custom[i][2])
     for i in range(len(demo_results_custom))],
    key=lambda x: x[1], reverse=True
)

# ‚îÄ‚îÄ‚îÄ Compare all three rankings ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ
bm25_ids = {did: r for r, (did, _, _) in enumerate(demo_results_custom, 1)}

print(f"\n{'‚îÄ'*85}")
print(f"Comparison: BM25 ‚Üí Cross-encoder ‚Üí Neural Relevance Feedback (top-5)")
print(f"{'‚îÄ'*85}")
print(f"{'Rank':<5}  {'BM25':<18}  {'Cross-encoder':<18}  {'Neural RF':<18}")
print("-" * 65)
for rank in range(1, 6):
    b_did = demo_results_custom[rank-1][0]
    c_did = reranked_custom[rank-1][0]
    n_did = nrf_custom[rank-1][0]
    print(f"  {rank}    {b_did:<18}  {c_did:<18}  {n_did:<18}")

# Show the top NRF result that moved the most
best_nrf = nrf_custom[0]
old_rank = bm25_ids.get(best_nrf[0], '?')
print(f"\nüîç Top NRF passage (was BM25 rank {old_rank}, now rank 1, "
      f"cosine={best_nrf[1]:.4f}):")
print(f"   {best_nrf[2][:250]}...")

print(f"\n‚úì Full pipeline complete: Text ‚Üí Chunk ‚Üí Index ‚Üí BM25 ‚Üí "
      f"Neural Rerank ‚Üí Neural Feedback")

Full Pipeline Demo ‚Äî Query: "disguise deception identity"

‚ñ∏ Stage 1: Cross-encoder reranking (BM25 top-20 ‚Üí neural top-20)

Rank  ‚îÄ‚îÄ BM25 ‚îÄ‚îÄ                          ‚îÄ‚îÄ Neural Reranked ‚îÄ‚îÄ              
  1   chunk_0522   3.64  obvious that the matter should be pushed‚Ä¶  chunk_0519 -2.8390  disguised himself, covered those keen ey‚Ä¶
  2   chunk_0573   3.49  find little credit to be gained out of t‚Ä¶  chunk_0081 -3.2064  groom, ill-kempt and side-whiskered, wit‚Ä¶
  3   chunk_2208   3.32  assure you that they were identical. Was‚Ä¶  chunk_0537 -3.2615  of a disguise‚Äîthe whiskers, the glasses,‚Ä¶
  4   chunk_0000   3.22  The Adventures of Sherlock Holmes by Art‚Ä¶  chunk_0534 -4.2139  were never together, but that the one al‚Ä¶
  5   chunk_0491   3.20  I left him then, still puffing at his bl‚Ä¶  chunk_1107 -5.7931  murderer. ‚ÄúI do not know that there is a‚Ä¶

‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚

## 11. Summary

| Method | Approach | Strength | Weakness |
|--------|----------|----------|----------|
| **TF-IDF** | Term frequency $\times$ inverse document frequency | Simple, interpretable | No saturation, no length norm |
| **BM25** | Probabilistic model with saturation ($k_1$) and length norm ($b$) | Strong baseline, fast | Lexical gap |
| **Neural (Cross-encoder)** | Transformer-based semantic scoring | Captures meaning, state-of-the-art | Slow per pair, needs BM25 first |
| **Neural Relevance Feedback** | User phrases ‚Üí embedding centroid ‚Üí cosine reranking | Bridges user intent with semantics | Requires user interaction |

### Key Takeaways

1. **BM25**: $\displaystyle\sum_{t \in q} \text{IDF}(t)\;\frac{f(t,d)(k_1+1)}{f(t,d)+k_1(1-b+b\,|d|/\text{avgdl})}$ ‚Äî the default first-stage retriever
2. **The lexical gap** is the fundamental limitation of all keyword-based methods
3. **ColBERT MaxSim**: $\sum_i \max_j\;\mathbf{q}_i^\top\mathbf{d}_j$ ‚Äî token-level semantic scoring
4. **Retrieve-then-rerank** (BM25 ‚Üí Neural) is the dominant paradigm in modern search
5. **Neural relevance feedback** modernises Rocchio: user-selected phrases become an embedding centroid that captures meaning, not just terms
6. **nDCG@k** and **MAP** are the standard metrics for evaluating ranking quality
7. Neural reranking **significantly outperforms** BM25 on TREC-DL benchmarks
8. **You can build a full search engine** over any text in minutes: chunk ‚Üí JSONL ‚Üí Lucene index ‚Üí BM25 ‚Üí neural reranking

---

## Exercises

The following exercises are graded. You are expected to answer them on your own.

## Exercise 1 ‚Äî BM25 Parameter Analysis (5 points)

The BM25 formula uses two key parameters: $k_1$ (term frequency saturation) and $b$ (document length normalisation).

1. Explain the **intuition** behind each parameter. What does each control and why is it needed?
2. If you are searching a collection of **scientific abstracts** (all approximately the same length, ~200 words), how would you adjust $b$? Justify your answer.
3. If you are searching a collection where **repeated keywords are a strong signal of relevance** (e.g., product reviews mentioning a specific feature), how would you adjust $k_1$? Justify your answer.
4. Explain why BM25 with $k_1 \to 0$ and $b = 0$ reduces to a simpler scoring model. Which model does it approximate?

Write your answer in the cell below (minimum 150 words).

BEGIN SOLUTION

END SOLUTION

YOUR ANSWER HERE

## Exercise 2 ‚Äî The Lexical Gap and Neural Solutions (5 points)

Consider the following search scenario.

**Query:** *"renewable energy impact on wildlife"*

BM25 retrieves these top-3 passages:
1. *"Renewable energy sources include solar, wind, and hydroelectric power. These energy sources are considered renewable because they are naturally replenished."*
2. *"The impact of energy production on the environment has been studied extensively. Wind farms and solar panels are common renewable energy installations."*
3. *"Wildlife conservation efforts focus on protecting endangered species and their natural habitats from human activities."*

A passage rated **highly relevant** by NIST assessors but ranked at position 87 by BM25:
- *"Bird and bat mortality near wind turbines has raised ecological concerns. Studies show that migratory species are particularly vulnerable to collisions with turbine blades, highlighting the tension between clean power generation and biodiversity preservation."*

Answer the following:

1. Explain **why** BM25 ranked the relevant passage so low. Identify the specific vocabulary mismatches.
2. Explain how a **ColBERT MaxSim** model would handle this differently. Reference the specific token-level matches that MaxSim would capture.
3. Would a **bi-encoder** (like DPR) also solve this problem? How does its approach differ from ColBERT's token-level interaction?

Write your answer in the cell below (minimum 200 words).

BEGIN SOLUTION

END SOLUTION

YOUR ANSWER HERE

## Exercise 3 ‚Äî Implementing Precision@k and Recall@k (10 points)

Write code that computes **Precision@k** and **Recall@k** for both BM25 and Neural reranking on the TREC-DL 2019 data used in this tutorial.

Your code should:

1. Implement `precision_at_k(ranked_docids, qrels_dict, k)` ‚Äî returns the fraction of the top-*k* documents that are relevant (qrel $\geq$ 2)
2. Implement `recall_at_k(ranked_docids, qrels_dict, k)` ‚Äî returns the fraction of all relevant documents (qrel $\geq$ 2) that appear in the top-*k*
3. Compute the **mean** Precision@10 and Recall@10 across all TREC-DL 2019 queries for both `bm25_results` and `reranked_results`
4. Store the results in four variables: `p10_bm25`, `p10_neural`, `r10_bm25`, `r10_neural` (all floats between 0 and 1)

**Variables available** (defined earlier in this notebook):
- `bm25_results` ‚Äî dict: query ID ‚Üí list of `(docid, bm25_score, passage)` tuples
- `reranked_results` ‚Äî dict: query ID ‚Üí list of `(docid, ce_score, passage)` tuples
- `qrels` ‚Äî dict: query ID ‚Üí dict `{docid: relevance_score}`
- `topics` ‚Äî dict: query ID ‚Üí topic dict with `'title'` key

BEGIN SOLUTION

END SOLUTION

In [21]:
# YOUR CODE HERE
raise NotImplementedError("Replace this line with your solution")

NotImplementedError: Replace this line with your solution

In [None]:
# Autograder test cell ‚Äî do not modify
assert 'p10_bm25' in dir(), "Define 'p10_bm25'"
assert 'p10_neural' in dir(), "Define 'p10_neural'"
assert 'r10_bm25' in dir(), "Define 'r10_bm25'"
assert 'r10_neural' in dir(), "Define 'r10_neural'"

for name, val in [('p10_bm25', p10_bm25), ('p10_neural', p10_neural),
                   ('r10_bm25', r10_bm25), ('r10_neural', r10_neural)]:
    assert isinstance(val, float), f"{name} should be a float, got {type(val)}"
    assert 0 <= val <= 1, f"{name} should be in [0, 1], got {val}"

# Neural should outperform or match BM25 on precision
assert p10_neural >= p10_bm25 - 0.01, (
    f"Expected neural P@10 ({p10_neural:.4f}) >= BM25 P@10 ({p10_bm25:.4f})")

print(f"{'Method':<12} {'P@10':>8} {'R@10':>8}")
print("-" * 30)
print(f"{'BM25':<12} {p10_bm25:>8.4f} {r10_bm25:>8.4f}")
print(f"{'Neural':<12} {p10_neural:>8.4f} {r10_neural:>8.4f}")
print(f"\nAll auto-graded tests passed!")

## Exercise 4 ‚Äî Build a Search Engine over Your Own Text (15 points)

In this exercise you will replicate the **full pipeline** from Section 10 using a text of **your own choice**. This corpus will also be used in Tutorial 07 (Knowledge Graphs) and Tutorial 11 (RAG Chatbot).

### Step 1 ‚Äî Choose and load your text (2 points)

Pick **one** of the following sources (or bring your own):
- A novel or short story collection from [Project Gutenberg](https://www.gutenberg.org/)
- TV show scripts (e.g., South Park, The Office, Breaking Bad) from a fan wiki or script site
- A PDF textbook chapter, thesis, or technical report
- Wikipedia articles on a topic (e.g., "History of AI", "Quantum Computing")
- Song lyrics from your favorite band

**Requirements:**
- Your text should be at least **10,000 words** (roughly 20+ pages)
- Load it into a Python string variable called `my_raw_text`
- If using a PDF, parse it with `fitz` (PyMuPDF) as shown in Section 10a

### Step 2 ‚Äî Chunk into passages (3 points)

- Split your text into paragraph-sized passages (aim for 50‚Äì250 words each)
- You may reuse or adapt the `chunk_text()` function from Section 10b, or write your own
- Store the result in a list called `my_chunks` ‚Äî each element is a string
- Print the total number of chunks and the min/mean/max word count

### Step 3 ‚Äî Build a Lucene index (2 points)

- Export your chunks as JSONL (format: `{"id": "...", "contents": "..."}`)
- Build a Pyserini Lucene index using `pyserini.index.lucene`
- Open the index with `LuceneSearcher` and verify the document count
- Store the searcher in a variable called `my_searcher`

### Step 4 ‚Äî BM25 search (3 points)

- Write **at least 5 queries** relevant to your chosen text
- Run BM25 search (top-20) for each query using `my_searcher`
- Display the top-3 results for each query with BM25 scores
- Store results in a dict called `my_bm25_results` with structure: `{query_string: [(docid, score, passage), ...]}`

### Step 5 ‚Äî Neural reranking and feedback (5 points)

For **one** of your queries:
1. Rerank the BM25 top-20 using the `cross_encoder` (already loaded above) ‚Äî **2 points**
2. Select 2‚Äì3 phrases from the best passages and perform **Neural Relevance Feedback** using the `bi_encoder` (already loaded above) ‚Äî **2 points**
3. Show a side-by-side comparison of BM25 vs Cross-encoder vs NRF rankings (top-5) ‚Äî **1 point**

> **Important:** Save your `my_chunks` list to disk (e.g., as JSON or text files in a folder). You will load this same corpus in Tutorial 07 to build a Knowledge Graph and in Tutorial 11 to build a RAG chatbot.

BEGIN SOLUTION

END SOLUTION

In [None]:
# ‚îÄ‚îÄ Exercise 4: Your code here ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ
# Follow the steps outlined above. Starter scaffolding is provided below.
# Replace each TODO with your implementation.

# ‚îÄ‚îÄ‚îÄ Step 1: Load your text ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ
# TODO: Download or load your chosen text into my_raw_text
# Example for Project Gutenberg:
#   resp = requests.get("https://www.gutenberg.org/cache/epub/XXXX/pgXXXX.txt")
#   my_raw_text = resp.text
# Example for PDF:
#   pdf = fitz.open("your_file.pdf")
#   my_raw_text = "\n\n".join(page.get_text() for page in pdf)
#   pdf.close()

my_raw_text = ""   # TODO: replace with your text
raise NotImplementedError("Step 1: Load your text")

# ‚îÄ‚îÄ‚îÄ Step 2: Chunk into passages ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ
# TODO: Split my_raw_text into passages of 50-250 words each
my_chunks = []     # TODO: list of strings
raise NotImplementedError("Step 2: Chunk your text")

print(f"Chunks: {len(my_chunks)}")
wc = [len(c.split()) for c in my_chunks]
print(f"Words per chunk: min={min(wc)}, mean={np.mean(wc):.0f}, max={max(wc)}")

# ‚îÄ‚îÄ‚îÄ Step 3: Build Lucene index ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ
# TODO: Export as JSONL and build index (follow Section 10b-10c pattern)
my_index_dir = "my_corpus/lucene_index"
my_jsonl_dir = "my_corpus/jsonl"
# ... export JSONL ...
# ... run pyserini.index.lucene ...
my_searcher = None  # TODO: LuceneSearcher(my_index_dir)
raise NotImplementedError("Step 3: Build your index")

print(f"Index built: {my_searcher.num_docs} documents")

# ‚îÄ‚îÄ‚îÄ Step 4: BM25 search ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ
# TODO: Define at least 5 queries and run BM25 search
my_queries = {
    # "Q1": "your first query",
    # "Q2": "your second query",
    # ...
}
my_bm25_results = {}  # TODO: {query_string: [(docid, score, passage), ...]}
raise NotImplementedError("Step 4: BM25 search")

# ‚îÄ‚îÄ‚îÄ Step 5: Neural reranking + feedback ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ
# TODO: For one query:
#   1. Cross-encoder reranking of BM25 top-20
#   2. Neural relevance feedback (select phrases ‚Üí bi-encoder centroid)
#   3. Side-by-side comparison table
raise NotImplementedError("Step 5: Neural reranking")

# ‚îÄ‚îÄ‚îÄ Save chunks for Tutorial 07 and 11 ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ
# TODO: Save your chunks to disk
# Example:
#   my_chunks_dir = pathlib.Path("my_corpus/chunks")
#   my_chunks_dir.mkdir(parents=True, exist_ok=True)
#   for i, chunk in enumerate(my_chunks):
#       (my_chunks_dir / f"chunk_{i:04d}.txt").write_text(chunk, encoding="utf-8")
#   print(f"Saved {len(my_chunks)} chunks to {my_chunks_dir}/")