# NB07: Cross-encoder Reranking

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/RJuro/unistra-nlp2026/blob/main/notebooks/NB07_reranking.ipynb)

**Duration:** 50 minutes

## Learning Goals

By the end of this notebook you will be able to:

1. **Understand** the bi-encoder vs cross-encoder tradeoff -- why we need both.
2. **Implement** a two-stage reranking pipeline (bi-encoder retrieval + cross-encoder reranking).
3. **Measure** the precision improvement that reranking provides over bi-encoder retrieval alone.
4. **Apply** the pipeline to social science retrieval tasks (policy documents, academic papers, case law).

In [None]:
!pip install faiss-cpu sentence-transformers datasets pandas numpy -q

import faiss
import numpy as np
import pandas as pd
import time
from sentence_transformers import SentenceTransformer, CrossEncoder
from datasets import load_dataset

print(f"FAISS version: {faiss.__version__}")
print("All imports successful.")

## 1. The Problem: Bi-encoders Are Fast but Imprecise

In NB06 we built a semantic search system using a **bi-encoder**. Bi-encoders encode the query and each document **independently** into fixed-size vectors, then compare them with cosine similarity. This is extremely fast -- we can search millions of documents in milliseconds using FAISS.

But there is a cost: because the query and document are encoded separately, the model **cannot attend across them**. It misses fine-grained interactions between query terms and document terms. For example, a bi-encoder might struggle to distinguish:

- *"Does smoking cause cancer?"* vs *"Does cancer cause smoking?"*
- *"Python eats mouse"* vs *"Mouse clicks in Python"*
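
To make the failure mode concrete, here is a deliberately extreme toy sketch. It assumes a hypothetical order-blind "embedding" (a bag-of-words count vector), which is not how E5 or any real bi-encoder works. Real bi-encoders do encode some word order, but the example shows why two sentences with the same words can land very close together in vector space.

```python
import math
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Hypothetical order-blind 'embedding': a bag-of-words count vector."""
    return Counter(text.lower().strip("?").split())


def cosine(u: Counter, v: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v)


a = "Does smoking cause cancer?"
b = "Does cancer cause smoking?"
similarity = cosine(toy_embed(a), toy_embed(b))
print(f"cosine similarity: {similarity:.3f}")
```

Both sentences contain exactly the same words, so this toy encoder cannot tell them apart even though they ask opposite causal questions.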

**Cross-encoders** solve this by processing the query and document **together** as a single input. The model can attend to both simultaneously, capturing rich interactions. This yields more accurate relevance scores, but at the cost of speed: document embeddings cannot be pre-computed, so every (query, document) pair needs its own forward pass at query time.

| Property | Bi-encoder | Cross-encoder |
|---|---|---|
| Input | Query and document encoded separately | Query and document encoded together |
| Speed | Very fast (vector similarity) | Slow (full forward pass per pair) |
| Accuracy | Good | Excellent |
| Scalability | Millions of documents | Hundreds of documents |
| Use case | First-stage retrieval | Reranking a small candidate set |

### The Two-Stage Pipeline

The standard approach in modern information retrieval is to combine both models in a **two-stage pipeline**:

![Two-Stage Retrieval Pipeline](https://raw.githubusercontent.com/RJuro/unistra-nlp2026/main/notebooks/figures/reranking_pipeline.png)

**Stage 1 (Bi-encoder):** Quickly narrows the full corpus down to a small set of plausible candidates using vector similarity (typically tens to about 100; the `retrieve_and_rerank` function in this notebook defaults to 20). This is the same approach we used in NB06.

**Stage 2 (Cross-encoder):** Scores each candidate jointly with the query, reorders the candidates by the more accurate cross-encoder score, and returns the top 5.

This gives us the **best of both worlds**: the speed of bi-encoders with the accuracy of cross-encoders.
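
The control flow of the two stages can be sketched without any models. Everything below is an illustrative assumption: the 2-dimensional vectors, the made-up `expensive_scores`, and the helper names `two_stage_search` and `pair_scorer` are placeholders standing in for the bi-encoder index and the cross-encoder.

```python
def two_stage_search(query_vec, doc_vecs, pair_scorer, top_k_retrieve=3, top_k_final=2):
    """Toy two-stage search: cheap dot-product retrieval, then expensive rescoring."""
    # Stage 1: score every document with a cheap dot product
    sims = [sum(q * d for q, d in zip(query_vec, vec)) for vec in doc_vecs]
    candidates = sorted(range(len(doc_vecs)), key=lambda i: sims[i], reverse=True)
    candidates = candidates[:top_k_retrieve]
    # Stage 2: call the expensive scorer on the candidates only
    rescored = sorted(
        ((i, pair_scorer(i)) for i in candidates),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return rescored[:top_k_final]


doc_vecs = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.2], [0.0, 1.0], [0.1, 0.9]]
expensive_scores = {0: 0.2, 1: 0.9, 2: 0.5, 3: 1.0, 4: 0.0}  # made-up "cross-encoder" scores
calls = []  # records which documents the expensive scorer was asked about


def pair_scorer(doc_idx):
    calls.append(doc_idx)
    return expensive_scores[doc_idx]


results = two_stage_search([1.0, 0.0], doc_vecs, pair_scorer)
print(results)
print(f"expensive calls: {len(calls)} of {len(doc_vecs)} documents")
```

Note that document 3 has the highest expensive score but is never returned, because stage 1 did not retrieve it. The second stage can only reorder the candidates it is given.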

In [None]:
# ── Load SciFact (same approach as NB06) ──────────────────────────────────

def load_scifact(max_docs: int | None = None):
    corpus = load_dataset("mteb/scifact", "corpus", split="corpus").to_pandas()
    queries = load_dataset("mteb/scifact", "queries", split="queries").to_pandas()
    qrels_train = load_dataset("mteb/scifact", "default", split="train").to_pandas()
    qrels_test = load_dataset("mteb/scifact", "default", split="test").to_pandas()

    corpus["_id"] = corpus["_id"].astype(str)
    queries["_id"] = queries["_id"].astype(str)
    qrels_train["query-id"] = qrels_train["query-id"].astype(str)
    qrels_train["corpus-id"] = qrels_train["corpus-id"].astype(str)
    qrels_test["query-id"] = qrels_test["query-id"].astype(str)
    qrels_test["corpus-id"] = qrels_test["corpus-id"].astype(str)

    corpus["title"] = corpus["title"].fillna("").astype(str)
    corpus["text"] = corpus["text"].fillna("").astype(str)
    corpus["full_text"] = (corpus["title"].str.strip() + ". " + corpus["text"].str.strip()).str.strip(" .")

    corpus_df = corpus.rename(columns={"_id": "doc_id"})[["doc_id", "title", "text", "full_text"]]
    queries_df = queries.rename(columns={"_id": "query_id", "text": "query"})[["query_id", "query"]]

    if max_docs is not None:
        corpus_df = corpus_df.head(max_docs).reset_index(drop=True)

    return corpus_df.reset_index(drop=True), queries_df, qrels_train, qrels_test


corpus_df, queries_df, qrels_train, qrels_test = load_scifact(max_docs=None)

# Combine train + test qrels for evaluation
qrels = pd.concat([qrels_train, qrels_test], ignore_index=True)

# Build a lookup: query_id -> set of relevant doc_ids
relevant_docs = qrels.groupby("query-id")["corpus-id"].apply(set).to_dict()

print(f"Corpus: {len(corpus_df)} docs | Queries: {len(queries_df)} | Qrels: {len(qrels)} judgments")
print(f"Queries with relevance labels: {len(relevant_docs)}")

# ── Bi-encoder: intfloat/e5-small (same as NB06) ─────────────────────────
MODEL_NAME = "intfloat/e5-small"
bi_encoder = SentenceTransformer(MODEL_NAME)


def format_passage(text: str) -> str:
    return f"passage: {text.strip()}"


def format_query(text: str) -> str:
    return f"query: {text.strip()}"


corpus_inputs = [format_passage(t) for t in corpus_df["full_text"].tolist()]

print(f"\nEncoding corpus with {MODEL_NAME}...")
start = time.time()
corpus_embeddings = bi_encoder.encode(
    corpus_inputs,
    show_progress_bar=True,
    batch_size=64,
    normalize_embeddings=True,
).astype("float32")

index = faiss.IndexFlatIP(corpus_embeddings.shape[1])
index.add(corpus_embeddings)
print(f"Index built in {time.time()-start:.1f}s: {index.ntotal} vectors, {corpus_embeddings.shape[1]} dims")

## 2. Loading the Cross-encoder

We use a cross-encoder trained on MS MARCO, a large-scale passage ranking dataset. The model takes a `(query, document)` pair and outputs a single relevance score.

In [None]:
# Cross-encoder: processes (query, document) pairs jointly
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
print("Cross-encoder loaded!")

## 3. The Reranking Pipeline

Now we combine both stages into a single function. The bi-encoder retrieves a broad set of candidates (using E5 `query:` prefix), and the cross-encoder rescores and reorders them (using raw query-document pairs — cross-encoders have their own input format).

In [None]:
def retrieve_and_rerank(query: str, top_k_retrieve: int = 20, top_k_final: int = 5):
    """Two-stage retrieval: bi-encoder retrieve -> cross-encoder rerank."""

    # Stage 1: Bi-encoder retrieval (fast) — uses E5 query prefix
    q_emb = bi_encoder.encode(
        [format_query(query)], normalize_embeddings=True
    ).astype("float32")
    bi_scores, bi_indices = index.search(q_emb, top_k_retrieve)

    # Stage 2: Cross-encoder reranking (accurate)
    # Cross-encoders take raw (query, document) pairs — no prefix needed
    pairs = [(query, corpus_df.iloc[idx]["full_text"]) for idx in bi_indices[0]]
    cross_scores = cross_encoder.predict(pairs)

    # Sort by cross-encoder score (descending)
    reranked = sorted(
        zip(bi_indices[0], bi_scores[0], cross_scores),
        key=lambda x: x[2],
        reverse=True,
    )

    results = []
    for rank, (idx, bi_score, ce_score) in enumerate(reranked[:top_k_final], 1):
        results.append({
            "rank": rank,
            "doc_id": corpus_df.iloc[idx]["doc_id"],
            "bi_score": float(bi_score),
            "ce_score": float(ce_score),
            "title": corpus_df.iloc[idx]["title"],
            "text": corpus_df.iloc[idx]["full_text"][:200] + "...",
        })
    return pd.DataFrame(results)

In [None]:
def bi_encoder_only(query: str, top_k: int = 5):
    """Bi-encoder retrieval only (no reranking)."""
    q_emb = bi_encoder.encode(
        [format_query(query)], normalize_embeddings=True
    ).astype("float32")
    scores, indices = index.search(q_emb, top_k)
    results = []
    for rank, (score, idx) in enumerate(zip(scores[0], indices[0]), 1):
        results.append({
            "rank": rank,
            "doc_id": corpus_df.iloc[idx]["doc_id"],
            "score": float(score),
            "title": corpus_df.iloc[idx]["title"],
            "text": corpus_df.iloc[idx]["full_text"][:200] + "...",
        })
    return pd.DataFrame(results)

# Test query
query = "What are the risk factors for developing lung cancer?"

print("=" * 70)
print(f"QUERY: {query}")
print("=" * 70)

print("\n--- Bi-encoder only (top 5) ---")
bi_results = bi_encoder_only(query)
for _, r in bi_results.iterrows():
    print(f"  [{r['rank']}] ({r['score']:.3f}) {r['title']}")

print("\n--- Bi-encoder + Cross-encoder reranking (top 5) ---")
reranked_results = retrieve_and_rerank(query)
for _, r in reranked_results.iterrows():
    print(f"  [{r['rank']}] (bi:{r['bi_score']:.3f} -> ce:{r['ce_score']:.3f}) {r['title']}")

## 4. Systematic Evaluation

### 4a. Keyword-proxy evaluation

We start with the same keyword-proxy approach from NB06: a result counts as "relevant" if its result snippet (the first 200 characters of title plus abstract) contains at least one expected term. This is imperfect but gives a quick intuition.

In [None]:
eval_queries = [
    {"query": "How do vaccines protect against viral infections?",
     "relevant_terms": ["vaccine", "immun", "viral", "antibod", "infection"]},
    {"query": "What causes antibiotic resistance in bacteria?",
     "relevant_terms": ["antibiotic", "resist", "bacteria", "antimicrobial"]},
    {"query": "How does smoking affect lung health?",
     "relevant_terms": ["smok", "lung", "tobacco", "cancer", "respiratory"]},
    {"query": "What role does genetics play in obesity?",
     "relevant_terms": ["gene", "obes", "BMI", "weight", "metabol"]},
    {"query": "How does exercise impact mental health?",
     "relevant_terms": ["exercise", "mental", "depress", "anxiety", "physical"]},
]


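def calc_precision(results_df, relevant_terms, text_col="text"):
    """Fraction of results whose text contains at least one relevant term (case-insensitive)."""
    if len(results_df) == 0:
        return 0.0
    # Lowercase the terms too: the text is lowercased, so "BMI" would otherwise never match
    terms = [term.lower() for term in relevant_terms]
    relevant = sum(
        any(term in row[text_col].lower() for term in terms)
        for _, row in results_df.iterrows()
    )
    return relevant / len(results_df)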


print(f"{'Query':<50} {'Bi-enc P@5':>12} {'Reranked P@5':>14}")
print("-" * 78)

bi_precisions, reranked_precisions = [], []

for eq in eval_queries:
    bi_res = bi_encoder_only(eq["query"])
    reranked_res = retrieve_and_rerank(eq["query"])

    bi_p = calc_precision(bi_res, eq["relevant_terms"])
    re_p = calc_precision(reranked_res, eq["relevant_terms"])
    bi_precisions.append(bi_p)
    reranked_precisions.append(re_p)

    print(f"  {eq['query'][:48]:<50} {bi_p:>10.0%}   {re_p:>12.0%}")

print(f"\n{'Average':<50} {np.mean(bi_precisions):>10.0%}   {np.mean(reranked_precisions):>12.0%}")

### 4b. Ground-truth evaluation with qrels

SciFact comes with **human-annotated relevance judgments** (qrels). For each query, experts have labeled which corpus documents are truly relevant. This lets us compute proper Precision@k without guessing keywords.

This is what professional IR evaluation looks like — and it is exactly why benchmark datasets like BEIR, MTEB, and TREC are so valuable.

**Important context:** SciFact has very sparse relevance — most queries have only 1 relevant document in the entire corpus. This means:

- **Precision@5 is capped at 20%** for most queries (1 hit in 5 slots)
- The bi-encoder is already quite good at finding that one needle in the haystack
- Reranking cannot add new documents — it can only **reorder** what was already retrieved

So we expect **modest P@k improvements**. The real value of reranking shows up in two ways:
1. **Rank position within the candidate list**: pushing the relevant doc from, say, position 15 to position 3 (captured by rank-sensitive metrics such as NDCG)
2. **Qualitative ordering**: as we saw in the lung cancer example, irrelevant-but-similar documents get pushed down
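
The precision ceiling is simple arithmetic, sketched here with a hypothetical helper `max_precision_at_k` (not part of the notebook's evaluation code): with `n_relevant` relevant documents, at most `min(n_relevant, k)` of the top `k` slots can be hits.

```python
def max_precision_at_k(n_relevant: int, k: int) -> float:
    """Best achievable Precision@k when only n_relevant documents are relevant."""
    return min(n_relevant, k) / k


for k in (5, 10):
    print(f"1 relevant doc -> max P@{k} = {max_precision_at_k(1, k):.0%}")
```

So even a perfect ranker scores only 20% P@5 and 10% P@10 on a query with a single relevant document.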

In [None]:
def precision_at_k(retrieved_doc_ids: list[str], relevant_set: set[str], k: int) -> float:
    """Precision@k using ground-truth relevance judgments."""
    hits = sum(1 for doc_id in retrieved_doc_ids[:k] if doc_id in relevant_set)
    return hits / k


def recall_at_k(retrieved_doc_ids: list[str], relevant_set: set[str], k: int) -> float:
    """Recall@k: fraction of relevant docs found in top k."""
    if not relevant_set:
        return 0.0
    hits = sum(1 for doc_id in retrieved_doc_ids[:k] if doc_id in relevant_set)
    return hits / len(relevant_set)


def ndcg_at_k(retrieved_doc_ids: list[str], relevant_set: set[str], k: int) -> float:
    """NDCG@k — the standard metric for ranking quality.

    Unlike Precision@k which only asks 'is the relevant doc in the top k?',
    NDCG rewards placing it at rank 1 much more than rank 5.
    This is exactly what reranking is designed to improve.
    """
    dcg = 0.0
    for i, doc_id in enumerate(retrieved_doc_ids[:k]):
        if doc_id in relevant_set:
            dcg += 1.0 / np.log2(i + 2)  # position i is rank i+1; the DCG discount is log2(rank + 1)
    # Ideal DCG: all relevant docs at the top
    ideal_hits = min(len(relevant_set), k)
    idcg = sum(1.0 / np.log2(i + 2) for i in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


# Evaluate on all queries that have relevance judgments
bi_p5, bi_p10, bi_r20, bi_ndcg10 = [], [], [], []
re_p5, re_p10, re_r20, re_ndcg10 = [], [], [], []

for _, row in queries_df.iterrows():
    qid = row["query_id"]
    if qid not in relevant_docs:
        continue

    query_text = row["query"]
    gold = relevant_docs[qid]

    # Bi-encoder top-20
    bi_res = bi_encoder_only(query_text, top_k=20)
    bi_ids = bi_res["doc_id"].tolist()

    # Reranked top-20: retrieve 50 candidates, rerank, keep 20 (the wider pool means Recall@20 can differ from the bi-encoder's)
    re_res = retrieve_and_rerank(query_text, top_k_retrieve=50, top_k_final=20)
    re_ids = re_res["doc_id"].tolist()

    bi_p5.append(precision_at_k(bi_ids, gold, 5))
    bi_p10.append(precision_at_k(bi_ids, gold, 10))
    bi_r20.append(recall_at_k(bi_ids, gold, 20))
    bi_ndcg10.append(ndcg_at_k(bi_ids, gold, 10))

    re_p5.append(precision_at_k(re_ids, gold, 5))
    re_p10.append(precision_at_k(re_ids, gold, 10))
    re_r20.append(recall_at_k(re_ids, gold, 20))
    re_ndcg10.append(ndcg_at_k(re_ids, gold, 10))

print(f"Ground-truth evaluation on {len(bi_p5)} queries with relevance labels\n")
print(f"{'Metric':<20} {'Bi-encoder':>12} {'+ Reranking':>12} {'Delta':>10}")
print("-" * 56)
for name, bi_vals, re_vals in [
    ("Precision@5", bi_p5, re_p5),
    ("Precision@10", bi_p10, re_p10),
    ("NDCG@10", bi_ndcg10, re_ndcg10),
    ("Recall@20", bi_r20, re_r20),
]:
    bm, rm = np.mean(bi_vals), np.mean(re_vals)
    print(f"  {name:<18} {bm:>10.1%}   {rm:>10.1%}   {rm - bm:>+9.1%}")

### Interpreting the results

**Why are the P@k deltas so modest?** This is expected and educational:

1. **Sparsity**: 91% of SciFact queries have exactly 1 relevant document. With 1 relevant doc, P@5 is either 0% or 20% — there's no room for nuance.

2. **The bi-encoder is already good**: E5-small finds the relevant doc in the top 20 for ~85% of queries. For those queries, reranking can help; for the other 15%, the relevant doc was never retrieved, and reranking cannot conjure it.

3. **NDCG@10 tells the real story**: Unlike precision (which is binary — "is it in the top 5?"), NDCG rewards pushing a relevant doc from rank 8 to rank 1. This is exactly what cross-encoder reranking does, and is why NDCG is the standard metric in retrieval benchmarks like MTEB and TREC.
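
A small worked example makes point 3 tangible. It assumes the common single-relevant-document case with binary gains, where the ideal DCG is 1, so NDCG@k reduces to `1 / log2(rank + 1)`. The helper name `single_relevant_ndcg` is illustrative only.

```python
import math


def single_relevant_ndcg(rank: int, k: int = 10) -> float:
    """NDCG@k when exactly one relevant doc sits at 1-indexed `rank`."""
    if rank > k:
        return 0.0
    # With one relevant doc the ideal DCG is 1 / log2(2) = 1, so NDCG equals DCG
    return 1.0 / math.log2(rank + 1)


for rank in (8, 3, 1):
    print(f"rank {rank}: NDCG@10 = {single_relevant_ndcg(rank):.3f}, P@10 = {1 / 10:.0%}")
```

Moving the relevant document from rank 8 to rank 1 roughly triples NDCG@10, while P@10 stays at 10% the whole time.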

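**In production systems with richer relevance** (e.g., web search, where many results could be relevant), reranking gains are typically larger; improvements on the order of 5–15% NDCG are commonly quoted, though the exact figure depends on the corpus and models. SciFact is a conservative test case: if reranking helps here, it is likely to help on collections with more relevant documents per query.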

## 5. Speed vs Accuracy Tradeoff

Cross-encoder reranking adds latency. Let's measure exactly how much, so we can make informed decisions about when the tradeoff is worthwhile.
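
Single timings can be noisy, and the first call often includes one-off setup cost. One way to get steadier numbers, sketched here under the assumption that a few warm-up runs and a median are enough for this notebook, is a small helper like the hypothetical `time_call` below.

```python
import statistics
import time


def time_call(fn, *args, n_runs: int = 10, n_warmup: int = 2, **kwargs) -> float:
    """Median wall-clock seconds per call, after discarding warm-up runs."""
    for _ in range(n_warmup):
        fn(*args, **kwargs)
    durations = []
    for _ in range(n_runs):
        start = time.perf_counter()  # monotonic, high-resolution clock
        fn(*args, **kwargs)
        durations.append(time.perf_counter() - start)
    return statistics.median(durations)


# Stand-in workload so the sketch runs on its own
elapsed = time_call(sum, range(100_000))
print(f"{elapsed * 1000:.3f} ms/call")
```

In this notebook the same helper could be called as `time_call(bi_encoder_only, query)` and `time_call(retrieve_and_rerank, query)`.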

In [None]:
query = "effects of air pollution on respiratory health"

# Time bi-encoder only (perf_counter is a monotonic, high-resolution clock)
start = time.perf_counter()
for _ in range(10):
    bi_encoder_only(query)
bi_time = (time.perf_counter() - start) / 10

# Time reranking pipeline
start = time.perf_counter()
for _ in range(10):
    retrieve_and_rerank(query)
rerank_time = (time.perf_counter() - start) / 10

print(f"Bi-encoder only:  {bi_time*1000:.1f} ms/query")
print(f"With reranking:   {rerank_time*1000:.1f} ms/query")
print(f"Reranking overhead: {(rerank_time-bi_time)*1000:.1f} ms ({rerank_time/bi_time:.1f}x slower)")
print("\nFor social science research, this tradeoff is usually worth it!")

## 6. Exercise: Tune the Pipeline

Experiment with the pipeline to understand how its components affect performance:

1. **Vary `top_k_retrieve`**: Try values of 10, 20, 50, and 100. How does the number of bi-encoder candidates affect final precision and latency?
2. **Try a different cross-encoder**: Replace `cross-encoder/ms-marco-MiniLM-L-6-v2` with another model (e.g., `cross-encoder/ms-marco-TinyBERT-L-2-v2` for speed, or `cross-encoder/ms-marco-MiniLM-L-12-v2` for accuracy).
3. **Add your own queries**: Think of a social science research question and test it.

In [None]:
# YOUR CODE HERE

# Experiment 1: Try different top_k_retrieve values
# for k in [10, 20, 50, 100]:
#     results = retrieve_and_rerank("your query here", top_k_retrieve=k)
#     print(f"top_k_retrieve={k}: ...")

# Experiment 2: Try a different cross-encoder model
# cross_encoder_v2 = CrossEncoder('cross-encoder/ms-marco-TinyBERT-L-2-v2')
# ...

# Experiment 3: Add your own social science query
# my_query = "..."
# retrieve_and_rerank(my_query)

## Bonus: Deploy as a Gradio App

Let's turn our two-stage retrieval pipeline into an interactive search interface. Users can type natural-language queries and see reranked results instantly.

In [None]:
try:
    !pip install gradio -q
    import gradio as gr

    def search_and_rerank(query, top_k=5):
        """Search and rerank, returning formatted markdown results."""
        if not query.strip():
            return "Please enter a search query."
        results = retrieve_and_rerank(query, top_k_retrieve=20, top_k_final=int(top_k))
        output = f"## Results for: *{query}*\n\n"
        for _, row in results.iterrows():
            output += f"**[{row['rank']}]** (score: {row['ce_score']:.3f}) **{row['title']}**\n\n"
            output += f"> {row['text'][:150]}...\n\n---\n\n"
        return output

    demo = gr.Interface(
        fn=search_and_rerank,
        inputs=[
            gr.Textbox(lines=2, placeholder="Enter a search query..."),
            gr.Slider(minimum=1, maximum=10, value=5, step=1, label="Number of results"),
        ],
        outputs=gr.Markdown(label="Search Results"),
        title="Semantic Search with Reranking",
        description="Two-stage retrieval: E5 bi-encoder (fast) + cross-encoder (accurate). Search SciFact scientific abstracts.",
        examples=[["How do vaccines protect against infections?"], ["genetic factors in cancer risk"]],
    )
    demo.launch(share=True)

except ImportError:
    print("Gradio not available. Install with: pip install gradio")

## 7. Summary & Takeaways

**Two-stage retrieval is the standard approach in production search systems.** Nearly every modern search engine -- from Google to academic paper search -- uses some form of this pattern.

Key points:

- **Bi-encoders** encode queries and documents independently. They are fast and scalable (millions of documents) but miss fine-grained query-document interactions.
- **Cross-encoders** process the query and document together. They are much more accurate but too slow to apply to an entire corpus.
- **The two-stage pipeline** combines both: bi-encoder for fast candidate retrieval, cross-encoder for precise reranking. This gives us speed *and* accuracy.
- **The pipeline is modular**: you can swap out the bi-encoder, the cross-encoder, the vector index, or the candidate pool size independently. This makes it easy to experiment and improve.

**Social science applications:**

- **Finding relevant policy documents** in large government archives
- **Academic paper retrieval** for literature reviews
- **Case law search** in legal research
- **Survey response matching** for qualitative analysis
- **Media analysis** -- finding relevant news articles on specific social issues

In the next notebook, we will look at how to fine-tune these models on domain-specific data to further improve retrieval quality.