# NB07: Cross-encoder Reranking

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/RJuro/unistra-nlp2026/blob/main/notebooks/NB07_reranking.ipynb)

**Duration:** 50 minutes

## Learning Goals

By the end of this notebook you will be able to:

1. **Understand** the bi-encoder vs cross-encoder tradeoff -- why we need both.
2. **Implement** a two-stage reranking pipeline (bi-encoder retrieval + cross-encoder reranking).
3. **Measure** the precision improvement that reranking provides over bi-encoder retrieval alone.
4. **Apply** the pipeline to social science retrieval tasks (policy documents, academic papers, case law).

In [None]:
!pip install faiss-cpu sentence-transformers datasets pandas numpy -q

In [None]:
import faiss
import numpy as np
import pandas as pd
import time
from sentence_transformers import SentenceTransformer, CrossEncoder
from datasets import load_dataset

## 1. The Problem: Bi-encoders Are Fast but Imprecise

In NB06 we built a semantic search system using a **bi-encoder**. Bi-encoders encode the query and each document **independently** into fixed-size vectors, then compare them with cosine similarity. This is extremely fast -- we can search millions of documents in milliseconds using FAISS.

But there is a cost: because the query and document are encoded separately, the model **cannot attend across them**. It misses fine-grained interactions between query terms and document terms. For example, a bi-encoder might struggle to distinguish:

- *"Does smoking cause cancer?"* vs *"Does cancer cause smoking?"*
- *"Python eats mouse"* vs *"Mouse clicks in Python"*

**Cross-encoders** solve this by processing the query and document **together** as a single input. The model can attend to both simultaneously, capturing rich interactions. The result is much more accurate relevance scores -- but at the cost of speed, since we cannot pre-compute document embeddings.

| Property | Bi-encoder | Cross-encoder |
|---|---|---|
| Input | Query and document encoded separately | Query and document encoded together |
| Speed | Very fast (vector similarity) | Slow (full forward pass per pair) |
| Accuracy | Good | Excellent |
| Scalability | Millions of documents | Hundreds of documents |
| Use case | First-stage retrieval | Reranking a small candidate set |
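
The scalability row follows from counting model forward passes. The sketch below is back-of-the-envelope arithmetic, not a benchmark: the function name `forward_passes` and the corpus, query, and candidate sizes are illustrative assumptions, and batching is ignored. It also includes the two-stage combination introduced in the next subsection.

```python
def forward_passes(num_docs: int, num_queries: int, candidates: int) -> dict:
    """Count model forward passes for three retrieval strategies.

    Assumes one forward pass per encoded text (bi-encoder) or per
    (query, document) pair (cross-encoder).
    """
    # Bi-encoder: embed every document once offline, then one pass per query.
    bi_only = num_docs + num_queries
    # Cross-encoder over the full corpus: one pass per (query, document) pair.
    cross_only = num_docs * num_queries
    # Two-stage: bi-encoder cost plus `candidates` cross-encoder passes per query.
    two_stage = bi_only + num_queries * candidates
    return {"bi_only": bi_only, "cross_only": cross_only, "two_stage": two_stage}


counts = forward_passes(num_docs=1_000_000, num_queries=1_000, candidates=100)
for name, n in counts.items():
    print(f"{name:<12} {n:>15,}")
```

Under these assumptions, scoring the whole corpus with a cross-encoder costs about a thousand times more forward passes than either alternative, while reranking 100 candidates per query adds only about 10% on top of the bi-encoder alone.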

### The Two-Stage Pipeline

The standard approach in modern information retrieval is to combine both models in a **two-stage pipeline**:

![Two-Stage Retrieval Pipeline](https://raw.githubusercontent.com/RJuro/unistra-nlp2026/main/notebooks/figures/reranking_pipeline.png)

**Stage 1 (Bi-encoder):** Quickly narrows the full corpus down to a small set of plausible candidates using vector similarity, typically around 100 in practice (the code below defaults to 20 via `top_k_retrieve`). This is the same approach we used in NB06.

**Stage 2 (Cross-encoder):** Scores each candidate jointly with the query, reorders the candidates by this more accurate score, and returns the top 5.

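This gives us close to the **best of both worlds**: the speed of bi-encoders with the accuracy of cross-encoders on the documents that matter. One caveat: the cross-encoder can only reorder what stage 1 retrieved, so a relevant document that misses the candidate set cannot be recovered.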
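
The control flow of the two stages can be sketched without any models. In the toy example below, random unit vectors stand in for embeddings, a lossy score over half of the dimensions plays the role of the fast first stage, and an exact score computed one pair at a time plays the role of the slower second stage. The names `cheap_scores`, `precise_score`, and `two_stage` are hypothetical, and only the retrieve-then-rerank structure is meant to carry over to real bi-encoders and cross-encoders.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins (assumptions): 1,000 "documents" as random unit vectors,
# and a query that is a slightly perturbed copy of document 42.
doc_vecs = rng.normal(size=(1000, 32))
doc_vecs /= np.linalg.norm(doc_vecs, axis=1, keepdims=True)
query_vec = doc_vecs[42] + 0.05 * rng.normal(size=32)
query_vec /= np.linalg.norm(query_vec)


def cheap_scores(query, docs):
    """Stage 1 stand-in: lossy score using only the first half of the dimensions."""
    half = docs.shape[1] // 2
    return docs[:, :half] @ query[:half]


def precise_score(query, doc):
    """Stage 2 stand-in (hypothetical): exact score, computed one pair at a time."""
    return float(query @ doc)


def two_stage(query, docs, top_k_retrieve=100, top_k_final=5):
    # Stage 1: score every document cheaply and keep the best candidates.
    candidates = np.argsort(-cheap_scores(query, docs))[:top_k_retrieve]
    # Stage 2: rescore only the candidates with the costlier scorer, then reorder.
    rescored = [(int(i), precise_score(query, docs[i])) for i in candidates]
    rescored.sort(key=lambda pair: pair[1], reverse=True)
    return rescored[:top_k_final]


top = two_stage(query_vec, doc_vecs)
print(top[0][0])
```

Stage 2 never sees documents that stage 1 dropped, which is why the size of the candidate pool matters.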

In [None]:
# Load corpus and build FAISS index (same setup as NB06).
# load_dataset was already imported in the imports cell above.

dataset = load_dataset("mteb/scifact", split="corpus")
corpus_df = dataset.to_pandas().head(300)
corpus_df.columns = ['doc_id', 'title', 'text']
corpus_df['full_text'] = corpus_df['title'] + ". " + corpus_df['text']

# Stage 1: Bi-encoder
bi_encoder = SentenceTransformer('all-MiniLM-L6-v2')
corpus_embeddings = bi_encoder.encode(
    corpus_df['full_text'].tolist(), show_progress_bar=True,
    normalize_embeddings=True, batch_size=64
).astype('float32')

index = faiss.IndexFlatIP(corpus_embeddings.shape[1])
index.add(corpus_embeddings)
print(f"Index built: {index.ntotal} documents")
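
`IndexFlatIP` ranks by inner product, which is why the embeddings are L2-normalized before indexing: for unit-length vectors, the inner product equals cosine similarity. A quick NumPy check on random vectors (illustrative data, no FAISS required):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=8)
b = rng.normal(size=8)

# Cosine similarity computed from the raw vectors.
cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Inner product computed after normalizing each vector to unit length.
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)
inner_product = a_unit @ b_unit

print(np.isclose(cosine, inner_product))
```

If the embeddings were not normalized, inner-product search would favor vectors with larger norms rather than the most similar direction.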

## 2. Loading the Cross-encoder

We use a cross-encoder trained on MS MARCO, a large-scale passage ranking dataset. The model takes a `(query, document)` pair and outputs a single relevance score.

In [None]:
# Cross-encoder: processes (query, document) pairs jointly
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
print("Cross-encoder loaded!")
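
Cross-encoder scores live on a different scale from the bi-encoder's cosine similarities, so the two numbers should not be compared directly. Depending on the model configuration, `predict` may return unbounded logits rather than values in [0, 1]. If probability-like values are easier to read, a sigmoid can be applied afterwards; because it is monotonic, the ranking does not change. The scores below are made-up values for illustration, not model output.

```python
import numpy as np

# Made-up raw scores for four candidates (illustrative, not real model output).
raw_scores = np.array([7.2, -3.5, 0.4, 2.9])

# Sigmoid maps any real number into the open interval (0, 1).
probs = 1.0 / (1.0 + np.exp(-raw_scores))

# Ranking by raw score and by sigmoid output is identical.
order_raw = np.argsort(-raw_scores)
order_prob = np.argsort(-probs)
print(order_raw.tolist(), order_prob.tolist())
```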

## 3. The Reranking Pipeline

Now we combine both stages into a single function. The bi-encoder retrieves a broad set of candidates, and the cross-encoder rescores and reorders them.

In [None]:
def retrieve_and_rerank(query: str, top_k_retrieve: int = 20, top_k_final: int = 5):
    """Two-stage retrieval: bi-encoder retrieve -> cross-encoder rerank."""
    
    # Stage 1: Bi-encoder retrieval (fast)
    q_emb = bi_encoder.encode([query], normalize_embeddings=True).astype('float32')
    bi_scores, bi_indices = index.search(q_emb, top_k_retrieve)
    
    # Stage 2: Cross-encoder reranking (accurate)
    # Create (query, document) pairs for cross-encoder scoring
    pairs = [(query, corpus_df.iloc[idx]['full_text']) for idx in bi_indices[0]]
    cross_scores = cross_encoder.predict(pairs)
    
    # Sort by cross-encoder score
    reranked = sorted(
        zip(bi_indices[0], bi_scores[0], cross_scores),
        key=lambda x: x[2],  # Sort by cross-encoder score
        reverse=True
    )
    
    # Format results
    results = []
    for rank, (idx, bi_score, ce_score) in enumerate(reranked[:top_k_final], 1):
        results.append({
            'rank': rank,
            'bi_score': float(bi_score),
            'ce_score': float(ce_score),
            'title': corpus_df.iloc[idx]['title'],
            'text': corpus_df.iloc[idx]['full_text'][:200] + '...'
        })
    return pd.DataFrame(results)
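
Because stage 2 only reorders the candidates from stage 1, the recall of the candidate pool caps what reranking can achieve. The sketch below computes recall@k for a hypothetical stage 1 ranking; the document ids, the relevance labels, and the helper name `recall_at_k` are all made up for illustration.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Share of relevant documents that appear in the top-k of a ranking."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    return len(relevant & set(ranked_ids[:k])) / len(relevant)


# Hypothetical stage 1 ranking (document ids) and relevance labels.
stage1_ranking = [7, 3, 9, 1, 4, 8, 2, 6, 5, 0]
relevant = [1, 5, 9]

for k in (3, 5, 10):
    print(k, round(recall_at_k(stage1_ranking, relevant, k), 2))
```

In this toy ranking, a pool of 3 candidates contains only one of the three relevant documents, so no reranker could surface the other two. A larger `top_k_retrieve` raises that ceiling at the cost of more cross-encoder calls.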

In [None]:
def bi_encoder_only(query: str, top_k: int = 5):
    """Bi-encoder retrieval only (no reranking)."""
    q_emb = bi_encoder.encode([query], normalize_embeddings=True).astype('float32')
    scores, indices = index.search(q_emb, top_k)
    results = []
    for rank, (score, idx) in enumerate(zip(scores[0], indices[0]), 1):
        results.append({
            'rank': rank,
            'score': float(score),
            'title': corpus_df.iloc[idx]['title'],
            'text': corpus_df.iloc[idx]['full_text'][:200] + '...'
        })
    return pd.DataFrame(results)

# Test query
query = "What are the risk factors for developing lung cancer?"

print("=" * 70)
print(f"QUERY: {query}")
print("=" * 70)

print("\n--- Bi-encoder only (top 5) ---")
bi_results = bi_encoder_only(query)
for _, r in bi_results.iterrows():
    print(f"  [{r['rank']}] ({r['score']:.3f}) {r['title']}")

print("\n--- Bi-encoder + Cross-encoder reranking (top 5) ---")
reranked_results = retrieve_and_rerank(query)
for _, r in reranked_results.iterrows():
    print(f"  [{r['rank']}] (bi:{r['bi_score']:.3f} -> ce:{r['ce_score']:.3f}) {r['title']}")

## 4. Systematic Evaluation

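Let's compare precision at 5 (P@5) across multiple queries. We use a simple keyword-based proxy for relevance: a result counts as "relevant" if its displayed snippet (the first 200 characters of the title plus abstract) contains at least one of the expected terms. This is a rough heuristic rather than a ground-truth relevance judgment, so treat the numbers as indicative only.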

In [None]:
eval_queries = [
    {"query": "How do vaccines protect against viral infections?",
     "relevant_terms": ["vaccine", "immun", "viral", "antibod", "infection"]},
    {"query": "What causes antibiotic resistance in bacteria?",
     "relevant_terms": ["antibiotic", "resist", "bacteria", "antimicrobial"]},
    {"query": "How does smoking affect lung health?",
     "relevant_terms": ["smok", "lung", "tobacco", "cancer", "respiratory"]},
    {"query": "What role does genetics play in obesity?",
     "relevant_terms": ["gene", "obes", "BMI", "weight", "metabol"]},
    {"query": "How does exercise impact mental health?",
     "relevant_terms": ["exercise", "mental", "depress", "anxiety", "physical"]},
]

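def calc_precision(results_df, relevant_terms, text_col='text'):
    """Precision via case-insensitive keyword matching on the truncated result snippet."""
    if len(results_df) == 0:
        return 0.0
    # Lowercase the terms too; otherwise mixed-case terms such as "BMI" never match.
    terms = [term.lower() for term in relevant_terms]
    relevant = sum(
        any(term in row[text_col].lower() for term in terms)
        for _, row in results_df.iterrows()
    )
    return relevant / len(results_df)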

print(f"{'Query':<50} {'Bi-enc P@5':>12} {'Reranked P@5':>14}")
print("-" * 78)

bi_precisions = []
reranked_precisions = []

for eq in eval_queries:
    bi_res = bi_encoder_only(eq['query'])
    reranked_res = retrieve_and_rerank(eq['query'])
    
    bi_p = calc_precision(bi_res, eq['relevant_terms'])
    re_p = calc_precision(reranked_res, eq['relevant_terms'])
    bi_precisions.append(bi_p)
    reranked_precisions.append(re_p)
    
    print(f"  {eq['query'][:48]:<50} {bi_p:>10.0%}   {re_p:>12.0%}")

print(f"\n  {'Average':<50} {np.mean(bi_precisions):>10.0%}   {np.mean(reranked_precisions):>12.0%}")
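
To see what the keyword proxy rewards, here is a standalone version applied to three hand-written snippets (hypothetical texts, not drawn from the corpus). The helper name `keyword_precision` is an assumption; it lowercases both the text and the terms before matching.

```python
def keyword_precision(texts, relevant_terms):
    """Fraction of texts containing at least one term (case-insensitive substring match)."""
    if not texts:
        return 0.0
    terms = [t.lower() for t in relevant_terms]
    hits = sum(any(t in text.lower() for t in terms) for text in texts)
    return hits / len(texts)


snippets = [
    "Genetic variants associated with BMI in adults",
    "Obesity and metabolic syndrome in mice",
    "Sea-level rise along the Atlantic coast",
]
p = keyword_precision(snippets, ["gene", "obes", "BMI"])
print(f"P@3 = {p:.2f}")
```

Substring matching is permissive: `gene` also matches words such as "general", which is one reason this proxy can overstate relevance.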

## 5. Speed vs Accuracy Tradeoff

Cross-encoder reranking adds latency. Let's measure exactly how much, so we can make informed decisions about when the tradeoff is worthwhile.

In [None]:
query = "effects of air pollution on respiratory health"

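# Warm-up calls so one-off setup costs are not included in the timings
bi_encoder_only(query)
retrieve_and_rerank(query)

# Time bi-encoder only (perf_counter is a monotonic, high-resolution clock)
start = time.perf_counter()
for _ in range(10):
    bi_encoder_only(query)
bi_time = (time.perf_counter() - start) / 10

# Time reranking pipeline
start = time.perf_counter()
for _ in range(10):
    retrieve_and_rerank(query)
rerank_time = (time.perf_counter() - start) / 10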

print(f"Bi-encoder only:  {bi_time*1000:.1f} ms/query")
print(f"With reranking:   {rerank_time*1000:.1f} ms/query")
print(f"Reranking overhead: {(rerank_time-bi_time)*1000:.1f} ms ({rerank_time/bi_time:.1f}x slower)")
print(f"\nFor social science research, this tradeoff is usually worth it!")
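
Mean timings over a handful of runs are sensitive to outliers. A sketch that reports the median instead, with untimed warm-up calls, is shown below. The helper name `median_latency_ms` and the dummy workload are illustrative assumptions; passing `lambda: retrieve_and_rerank(query)` would time the pipeline itself.

```python
import statistics
import time


def median_latency_ms(fn, repeats: int = 10, warmup: int = 1) -> float:
    """Median wall-clock latency of fn() in milliseconds."""
    for _ in range(warmup):
        fn()  # Warm-up runs are executed but not timed.
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append((time.perf_counter() - start) * 1000)
    return statistics.median(timings)


# Dummy workload standing in for a retrieval call.
latency = median_latency_ms(lambda: sum(range(10_000)))
print(f"{latency:.3f} ms")
```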

## 6. Exercise: Tune the Pipeline

Experiment with the pipeline to understand how its components affect performance:

1. **Vary `top_k_retrieve`**: Try values of 10, 20, 50, and 100. How does the number of bi-encoder candidates affect final precision and latency?
2. **Try a different cross-encoder**: Replace `cross-encoder/ms-marco-MiniLM-L-6-v2` with another model (e.g., `cross-encoder/ms-marco-TinyBERT-L-2-v2` for speed, or `cross-encoder/ms-marco-MiniLM-L-12-v2` for accuracy).
3. **Add your own queries**: Think of a social science research question and test it.

In [None]:
# YOUR CODE HERE

# Experiment 1: Try different top_k_retrieve values
# for k in [10, 20, 50, 100]:
#     results = retrieve_and_rerank("your query here", top_k_retrieve=k)
#     print(f"top_k_retrieve={k}: ...")

# Experiment 2: Try a different cross-encoder model
# cross_encoder_v2 = CrossEncoder('cross-encoder/ms-marco-TinyBERT-L-2-v2')
# ...

# Experiment 3: Add your own social science query
# my_query = "..."
# retrieve_and_rerank(my_query)

## Bonus: Deploy as a Gradio App

Let's turn our two-stage retrieval pipeline into an interactive search interface. Users can type natural-language queries and see reranked results instantly.

In [None]:
try:
    !pip install gradio -q
    import gradio as gr

    def search_and_rerank(query, top_k=5):
        """Search and rerank, returning formatted markdown results."""
        if not query.strip():
            return "Please enter a search query."
        results = retrieve_and_rerank(query, top_k_retrieve=20, top_k_final=int(top_k))
        output = f"## Results for: *{query}*\n\n"
        for _, row in results.iterrows():
            output += f"**[{row['rank']}]** (score: {row['ce_score']:.3f}) **{row['title']}**\n\n"
            output += f"> {row['text'][:150]}...\n\n---\n\n"
        return output

    demo = gr.Interface(
        fn=search_and_rerank,
        inputs=[
            gr.Textbox(lines=2, placeholder="Enter a search query..."),
            gr.Slider(minimum=1, maximum=10, value=5, step=1, label="Number of results"),
        ],
        outputs=gr.Markdown(label="Search Results"),
        title="Semantic Search with Reranking",
        description="Two-stage retrieval: bi-encoder (fast) → cross-encoder (accurate). Search 300 scientific abstracts.",
        examples=[["How do vaccines protect against infections?"], ["genetic factors in cancer risk"]],
    )
    demo.launch(share=True)

except ImportError:
    print("Gradio not available. Install with: pip install gradio")

## 7. Summary & Takeaways

**Two-stage retrieval is a standard approach in production search systems.** Many modern search systems, from web search engines to academic paper search, use some form of this retrieve-then-rerank pattern.

Key points:

- **Bi-encoders** encode queries and documents independently. They are fast and scalable (millions of documents) but miss fine-grained query-document interactions.
- **Cross-encoders** process the query and document together. They are much more accurate but too slow to apply to an entire corpus.
- **The two-stage pipeline** combines both: bi-encoder for fast candidate retrieval, cross-encoder for precise reranking. This gives us speed *and* accuracy.
- **The pipeline is modular**: you can swap out the bi-encoder, the cross-encoder, the vector index, or the candidate pool size independently. This makes it easy to experiment and improve.

**Social science applications:**

- **Finding relevant policy documents** in large government archives
- **Academic paper retrieval** for literature reviews
- **Case law search** in legal research
- **Survey response matching** for qualitative analysis
- **Media analysis** -- finding relevant news articles on specific social issues

In the next notebook, we will look at how to fine-tune these models on domain-specific data to further improve retrieval quality.