
# Bi‑Encoder Retrieval vs Cross‑Encoder Re‑Ranking — Why Re‑Rankers Are Often More Accurate

<p align="center">
<img src="https://raw.githubusercontent.com/CNUClasses/CPSC471/master/content/lectures/week12/cross_and_biencoder.png" alt="standard" >
</p>
<!-- A bi-encoder produces a sentence embedding for a given sentence.
**Training**: A bi-encoder (dual encoder) uses two weight-tied transformers to map queries and passages into the same embedding space. It is trained on (query, positive, negative) triples with a contrastive loss so that positives score higher than negatives, often using in-batch negatives and hard negatives (passages that are close to the query in vector space but not correct).
**Inference**: The sentence embedding is taken from the pooled output u.

With a cross-encoder, the two sentences are passed simultaneously to the Transformer network. It produces an output value between 0 and 1 indicating the similarity of the input sentence pair. -->

A common pattern chains the two models into a two‑stage pipeline:

1) **Bi‑encoder** → fast candidate generation (dense vector search).  
2) **Cross‑encoder (re‑ranker)** → accurate reordering of those candidates.

> TL;DR: The **bi‑encoder** gives you *speed and scale*. The **re‑ranker** gives you *fine‑grained accuracy*.



## Why is a re‑ranker more accurate than a bi‑encoder? 

| Model | Input/Scoring | Strengths | Weaknesses |
|---|---|---|---|
| **Bi‑encoder** | Encodes **query** and **doc** *independently* into vectors; score = cosine/dot | Very **fast**, **scalable** (precompute doc vectors; index with FAISS/Chroma/Pinecone) | **No token‑level interaction** between query & doc; can miss subtle meaning (negation, entities, context) |
| **Cross‑encoder** (Re‑ranker) | Reads **query + doc together** (e.g., `[CLS] query [SEP] doc`); predicts **relevance** | **Deep token‑level attention**; **context‑sensitive** scoring; higher **accuracy** | **Slower** (must score each pair), no doc precomputation |

**Example:**

**Query:** “Documents not about CNNs”

**Doc:** “This paper discusses convolutional networks”

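- Bi-encoder: likely a high similarity score, because both texts embed close to the CNN topic and the single word “not” barely moves the query vector.

- Re-ranker: can assign low relevance, because it reads “not” in the context of the document's tokens (small re-rankers still get negation wrong at times).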


**Reason the re‑ranker wins:** It attends across tokens of query **and** document jointly, so it can model negation, long‑distance dependencies, and nuanced phrasing that a single fixed vector (bi‑encoder output) cannot fully capture.
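The interface difference can be made concrete with a deliberately crude sketch. Nothing below is a real model: `embed` is a hypothetical bag-of-words stand-in for a bi-encoder, and `toy_joint_score` is a hand-written rule standing in for joint attention. The only point is that the first scorer works from two vectors built in isolation, while the second sees both texts at once.

```python
import math
from collections import Counter

def embed(text):
    """Toy 'bi-encoder': a bag-of-words count vector built without seeing the other text."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u)
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0

def toy_joint_score(query, doc):
    """Toy 'cross-encoder': reads both texts, so a query term that follows 'not'
    counts against the document when the document contains it."""
    doc_tokens = set(doc.lower().split())
    score, negated = 0.0, False
    for tok in query.lower().split():
        if tok == "not":
            negated = True
            continue
        if tok in doc_tokens:
            score += -1.0 if negated else 1.0
    return score

doc = "this paper discusses convolutional networks"
q_plain = "papers about convolutional networks"
q_negated = "papers not about convolutional networks"

# Independent vectors: the negated query still looks similar to the document.
print(round(cosine(embed(q_plain), embed(doc)), 3))    # 0.447
print(round(cosine(embed(q_negated), embed(doc)), 3))  # 0.4

# Joint scoring: the sign flips once "not" is read alongside the document.
print(toy_joint_score(q_plain, doc))    # 2.0
print(toy_joint_score(q_negated, doc))  # -2.0
```

A trained cross-encoder learns such interactions rather than hard-coding them, and it does not handle negation perfectly either.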



## Typical RAG Retrieval Flow (two‑stage)

```
Query  ──► Bi‑encoder vector ──► ANN index (top‑k docs)
                               └─► k candidates
Query + each candidate doc ──► Cross‑encoder (re‑ranker) ──► final ordered list
Top‑m chunks ──► Prompt context for LLM
```
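The flow above can be sketched as a small function with two pluggable scorers. This is a minimal sketch, not the notebook's implementation: `two_stage_retrieve`, `word_overlap`, and `jaccard` are hypothetical names, and the two toy scorers merely stand in for a bi-encoder and a cross-encoder so the control flow is visible without downloading models.

```python
from typing import Callable, List, Sequence, Tuple

Scorer = Callable[[str, str], float]

def two_stage_retrieve(query: str, docs: Sequence[str],
                       first_stage: Scorer, second_stage: Scorer,
                       k: int = 5, m: int = 2) -> List[Tuple[int, float]]:
    """Return the top-m (doc_index, second_stage_score) pairs."""
    # Stage 1: cheap score for every document, keep the k best.
    candidates = sorted(range(len(docs)),
                        key=lambda i: first_stage(query, docs[i]),
                        reverse=True)[:k]
    # Stage 2: expensive score only for the k survivors.
    rescored = [(i, second_stage(query, docs[i])) for i in candidates]
    rescored.sort(key=lambda pair: pair[1], reverse=True)
    return rescored[:m]

def word_overlap(q: str, d: str) -> float:
    """Stand-in for the fast first stage: number of shared words."""
    return float(len(set(q.lower().split()) & set(d.lower().split())))

def jaccard(q: str, d: str) -> float:
    """Stand-in for the slower second stage: shared words over all words."""
    a, b = set(q.lower().split()), set(d.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

toy_docs = [
    "fast vector search at scale",
    "vector search",
    "image classification with cnns",
    "search engines rank pages",
]
top = two_stage_retrieve("vector search", toy_docs, word_overlap, jaccard, k=3, m=2)
print(top)  # [(1, 1.0), (0, 0.4)]
```

The second scorer runs `k` times per query rather than once per document, which is why the expensive model stays affordable.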



## Setup (optional)
If you are running in a fresh environment, uncomment and execute the cell below to install the dependencies.


In [None]:

# %pip install -U sentence-transformers --quiet



## Tiny Teaching Corpus and Queries
We keep a small synthetic corpus and a few queries, each paired with the id of the one document that answers it (its "gold" doc id).


In [1]:

corpus = [
    ("d1", "Transformers use attention to model relationships between tokens in a sequence."),
    ("d2", "Convolutional neural networks are widely used in computer vision for image classification."),
    ("d3", "Retrieval-Augmented Generation (RAG) adds external knowledge via retrieval to improve LLM answers."),
    ("d4", "FAISS supports IVF and HNSW indexes for fast approximate nearest neighbor search at scale."),
    ("d5", "MiniLM and DistilBERT produce compact sentence embeddings suitable for dense retrieval."),
    ("d6", "BM25 is a keyword-based sparse retrieval method that relies on exact and partial token matches."),
    ("d7", "Cross-encoders score a query and a document together, enabling fine-grained token-level interactions."),
    ("d8", "Bi-encoders encode queries and documents independently, enabling fast vector search at scale.")
]

queries = [
    ("How can RAG improve answers from language models?", "d3"),
    ("Which ANN index is graph-based and very fast?", "d4"),
    ("Which small models are good for sentence embeddings?", "d5"),
    ("What retrieval method uses keyword matches?", "d6"),
    ("Why are cross-encoders often more accurate than bi-encoders?", "d7"),
]
len(corpus), len(queries)

(8, 5)


## Stage 1 — Bi‑Encoder Retrieval (fast candidate generation)


In [2]:

from sentence_transformers import SentenceTransformer
import numpy as np

bi_encoder_name = "sentence-transformers/all-MiniLM-L6-v2"
bi_encoder = SentenceTransformer(bi_encoder_name)

doc_ids   = [d for d,_ in corpus]
doc_texts = [t for _,t in corpus]
doc_embs  = bi_encoder.encode(doc_texts, normalize_embeddings=True, show_progress_bar=False)

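def bi_encoder_search(query: str, k: int = 5):
    """Return (indices, scores) of the k corpus documents most similar to the query."""
    qv = bi_encoder.encode([query], normalize_embeddings=True, show_progress_bar=False)[0]
    sims = doc_embs @ qv  # dot product of unit vectors = cosine similarity
    idx = np.argsort(-sims)[:k]  # brute-force top-k; an ANN index would replace this at scale
    return idx, sims[idx]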
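Why a plain dot product is enough once the embeddings are normalized can be shown in a few lines of dependency-free Python. This is an illustrative sketch; `l2_normalize`, `dot`, and `top_k` are hypothetical helpers, and the 2-D vectors are made up.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def top_k(query_vec, doc_vecs, k):
    """Brute-force search: cosine similarity via dot product of unit vectors."""
    q = l2_normalize(query_vec)
    sims = [dot(l2_normalize(d), q) for d in doc_vecs]
    order = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)[:k]
    return [(i, sims[i]) for i in order]

toy_doc_vecs = [[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
hits = top_k([2.0, 0.0], toy_doc_vecs, k=2)
print([i for i, _ in hits])  # [0, 2]
```

Vector length drops out after normalization, so `[0.0, 2.0]` scores 0 against the query despite its larger magnitude.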



### Quick smoke test (bi‑encoder)


In [3]:

test_query = "What is a cross-encoder and why is it useful?"
cand_idx, cand_scores = bi_encoder_search(test_query, k=5)
[(doc_ids[i], float(cand_scores[j]), doc_texts[i]) for j, i in enumerate(cand_idx)]

[('d7',
  0.5529755353927612,
  'Cross-encoders score a query and a document together, enabling fine-grained token-level interactions.'),
 ('d8',
  0.44767284393310547,
  'Bi-encoders encode queries and documents independently, enabling fast vector search at scale.'),
 ('d3',
  0.2939690351486206,
  'Retrieval-Augmented Generation (RAG) adds external knowledge via retrieval to improve LLM answers.'),
 ('d1',
  0.27219441533088684,
  'Transformers use attention to model relationships between tokens in a sequence.'),
 ('d2',
  0.22715233266353607,
  'Convolutional neural networks are widely used in computer vision for image classification.')]


## Stage 2 — Cross‑Encoder Re‑Ranking (accurate rescoring)


In [4]:

from sentence_transformers import CrossEncoder

cross_encoder_name = "cross-encoder/ms-marco-MiniLM-L-6-v2"
cross_encoder = CrossEncoder(cross_encoder_name)

def cross_encoder_rerank(query: str, candidate_indices):
    """Score each (query, candidate) pair jointly; return [(doc_index, score)], best first."""
    pairs = [(query, doc_texts[i]) for i in candidate_indices]
    scores = cross_encoder.predict(pairs)  # raw, unbounded scores for this model; higher = more relevant
    order = np.argsort(-scores)
    return [(candidate_indices[i], float(scores[i])) for i in order]



### Quick smoke test (cross‑encoder)


In [5]:

reranked = cross_encoder_rerank(test_query, cand_idx.tolist())
[(doc_ids[i], score, doc_texts[i]) for i, score in reranked]

[('d7',
  5.196783542633057,
  'Cross-encoders score a query and a document together, enabling fine-grained token-level interactions.'),
 ('d8',
  -1.3642516136169434,
  'Bi-encoders encode queries and documents independently, enabling fast vector search at scale.'),
 ('d2',
  -9.761394500732422,
  'Convolutional neural networks are widely used in computer vision for image classification.'),
 ('d3',
  -10.232885360717773,
  'Retrieval-Augmented Generation (RAG) adds external knowledge via retrieval to improve LLM answers.'),
 ('d1',
  -10.348474502563477,
  'Transformers use attention to model relationships between tokens in a sequence.')]
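The re-ranker scores above are not confined to [0, 1]; they are useful for ordering but awkward for thresholding. If a bounded score is more convenient, a logistic sigmoid squashes them without changing the order. The sketch below assumes only that the scores are real-valued; whether the squashed values behave like calibrated probabilities is not something this notebook checks.

```python
import math

def sigmoid(x: float) -> float:
    """Numerically stable logistic function mapping any real score into (0, 1)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

# First three raw scores copied from the re-ranker output above (d7, d8, d2).
raw_scores = [5.196783542633057, -1.3642516136169434, -9.761394500732422]
squashed = [sigmoid(s) for s in raw_scores]

for raw, p in zip(raw_scores, squashed):
    print(f"{raw:8.3f} -> {p:.4f}")
```

Because the sigmoid is strictly increasing, sorting by the squashed values gives the same ranking as sorting by the raw scores.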


## Evaluation — Precision@k and MRR


In [6]:

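def precision_at_k(ranked_doc_ids, gold_id, k=5):
    """Return 1.0 if the gold doc appears in the top k, else 0.0.

    With one gold doc per query this is a hit rate (success@k), not true
    precision@k, which could be at most 1/k here. The notebook labels it P@k.
    """
    return 1.0 if gold_id in ranked_doc_ids[:k] else 0.0

def reciprocal_rank(ranked_doc_ids, gold_id):
    """Return 1/rank of the gold doc (ranks start at 1), or 0.0 if it is absent."""
    for rank, did in enumerate(ranked_doc_ids, start=1):
        if did == gold_id:
            return 1.0 / rank
    return 0.0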
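A short worked example shows how the two metrics aggregate over queries. The rank lists are hypothetical, and the helper names `mean_reciprocal_rank` and `hit_rate_at_k` are introduced only for this sketch. With five queries, four gold documents at rank 1 and one at rank 2 give an MRR of (4 × 1 + 0.5) / 5 = 0.9.

```python
def mean_reciprocal_rank(gold_ranks):
    """gold_ranks: 1-based rank of the gold doc per query, or None if it was not retrieved."""
    rr = [1.0 / r if r else 0.0 for r in gold_ranks]
    return sum(rr) / len(rr)

def hit_rate_at_k(gold_ranks, k):
    """Fraction of queries whose gold doc is ranked within the top k."""
    return sum(1 for r in gold_ranks if r is not None and r <= k) / len(gold_ranks)

ranks_before = [1, 1, 1, 1, 2]  # hypothetical: one query has its gold doc in second place
ranks_after = [1, 1, 1, 1, 1]   # hypothetical: re-ranking moves it to first place

print(mean_reciprocal_rank(ranks_before))  # 0.9
print(mean_reciprocal_rank(ranks_after))   # 1.0
print(hit_rate_at_k(ranks_before, k=3))    # 1.0
print(hit_rate_at_k(ranks_before, k=1))    # 0.8
```

The hit rate at k=3 cannot tell the two rankings apart, while MRR can, because MRR rewards moving the gold document from rank 2 to rank 1.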


In [7]:

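import time

import numpy as np

def evaluate_pipeline(queries, k_candidates=5, k_eval=3):
    """Score both stages on every query; each result is a (bi-encoder, re-ranker) pair."""
    bi_precisions, rer_precisions = [], []
    bi_mrrs, rer_mrrs = [], []
    bi_times, rer_times = [], []

    for qtext, gold in queries:
        t0 = time.perf_counter()  # monotonic clock, better suited to short intervals than time.time()
        cand_idx, _ = bi_encoder_search(qtext, k=k_candidates)
        bi_times.append(time.perf_counter() - t0)
        bi_ranked = [doc_ids[i] for i in cand_idx]

        t1 = time.perf_counter()  # times the re-ranking stage only, not retrieval + re-ranking
        reranked = cross_encoder_rerank(qtext, cand_idx.tolist())
        rer_times.append(time.perf_counter() - t1)
        rer_ranked = [doc_ids[i] for i,_ in reranked]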

        bi_precisions.append(precision_at_k(bi_ranked, gold, k_eval))
        rer_precisions.append(precision_at_k(rer_ranked, gold, k_eval))
        bi_mrrs.append(reciprocal_rank(bi_ranked, gold))
        rer_mrrs.append(reciprocal_rank(rer_ranked, gold))

    return {
        f"P@{k_eval}": (float(np.mean(bi_precisions)), float(np.mean(rer_precisions))),
        "MRR": (float(np.mean(bi_mrrs)), float(np.mean(rer_mrrs))),
        "Avg Time (ms)": (float(np.mean(bi_times)*1000), float(np.mean(rer_times)*1000))
    }

results = evaluate_pipeline(queries, k_candidates=5, k_eval=3)
results

{'P@3': (1.0, 1.0),
 'MRR': (0.9, 1.0),
 'Avg Time (ms)': (5.287599563598633, 5.764532089233398)}
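Single wall-clock measurements on five queries and eight documents are noisy, so the latency figures above are best read as rough indications. One possible way to get steadier numbers is to discard a warm-up call and take the median of several repeats. `time_call_ms` is a hypothetical helper, not part of the notebook, and `slow_square` is a dummy workload.

```python
import statistics
import time

def time_call_ms(fn, *args, repeats=5, warmup=1):
    """Median wall-clock time of fn(*args) in milliseconds, after warm-up calls."""
    for _ in range(warmup):
        fn(*args)  # warm-up runs are executed but not timed
    samples = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn(*args)
        samples.append((time.perf_counter() - t0) * 1000.0)
    return statistics.median(samples)

def slow_square(n):
    """Dummy workload standing in for a retrieval or re-ranking call."""
    return sum(i * i for i in range(n))

print(f"{time_call_ms(slow_square, 10_000):.3f} ms")
```

In the notebook, the same helper could wrap `bi_encoder_search` or `cross_encoder_rerank` for one query at a time.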


## Visualizing Rank Shifts (before vs after)


In [8]:

def show_rank_shift(query: str, k_candidates=5):
    cand_idx, _ = bi_encoder_search(query, k=k_candidates)
    bi_ranked = [doc_ids[i] for i in cand_idx]
    reranked = cross_encoder_rerank(query, cand_idx.tolist())
    rer_ranked = [doc_ids[i] for i,_ in reranked]

    print(f"QUERY: {query}\n")
    print("Bi-encoder top-k: ", bi_ranked)
    print("Re-ranker top-k:  ", rer_ranked)
    print("\nBi-encoder texts:")
    for did in bi_ranked:
        print(f" - {did}: {doc_texts[doc_ids.index(did)]}")
    print("\nRe-ranker texts:")
    for did in rer_ranked:
        print(f" - {did}: {doc_texts[doc_ids.index(did)]}")

show_rank_shift("Why use a re-ranker with dense retrieval?", k_candidates=5)

QUERY: Why use a re-ranker with dense retrieval?

Bi-encoder top-k:  ['d5', 'd6', 'd8', 'd3', 'd4']
Re-ranker top-k:   ['d5', 'd6', 'd3', 'd8', 'd4']

Bi-encoder texts:
 - d5: MiniLM and DistilBERT produce compact sentence embeddings suitable for dense retrieval.
 - d6: BM25 is a keyword-based sparse retrieval method that relies on exact and partial token matches.
 - d8: Bi-encoders encode queries and documents independently, enabling fast vector search at scale.
 - d3: Retrieval-Augmented Generation (RAG) adds external knowledge via retrieval to improve LLM answers.
 - d4: FAISS supports IVF and HNSW indexes for fast approximate nearest neighbor search at scale.

Re-ranker texts:
 - d5: MiniLM and DistilBERT produce compact sentence embeddings suitable for dense retrieval.
 - d6: BM25 is a keyword-based sparse retrieval method that relies on exact and partial token matches.
 - d3: Retrieval-Augmented Generation (RAG) adds external knowledge via retrieval to improve LLM answers.
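 - d8: Bi-encoders encode queries and documents independently, enabling fast vector search at scale.
 - d4: FAISS supports IVF and HNSW indexes for fast approximate nearest neighbor search at scale.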
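Comparing two lists by eye gets tedious as `k_candidates` grows. A small helper can report only the documents whose position changed. `rank_deltas` is a hypothetical name; the two lists are copied from the output above.

```python
def rank_deltas(before, after):
    """Map each doc id to (old_rank - new_rank); positive means it moved up."""
    new_rank = {doc_id: r for r, doc_id in enumerate(after, start=1)}
    return {doc_id: old - new_rank[doc_id]
            for old, doc_id in enumerate(before, start=1)
            if doc_id in new_rank}

bi_order = ["d5", "d6", "d8", "d3", "d4"]
rer_order = ["d5", "d6", "d3", "d8", "d4"]

moved = {d: delta for d, delta in rank_deltas(bi_order, rer_order).items() if delta != 0}
print(moved)  # {'d8': -1, 'd3': 1}
```

For this query the re-ranker only swaps d3 and d8; the top two positions are unchanged.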


## Summary & Exercises

**Summary**
- **Bi‑encoder**: fast/scalable candidate fetch, but limited by fixed-vector similarity.  
- **Cross‑encoder**: slower, but more accurate via joint token‑level attention.  
- **Best practice**: bi‑encoder → top‑k → cross‑encoder → top‑m to LLM.

**Exercises**
1. Scale the corpus to 100+ items (for example, passages from your own notes), add gold labels for any new queries, and re‑measure P@3/MRR and latency.  
2. Try `k_candidates ∈ {5, 10, 20}` and plot P@3 vs re‑ranker time.  
3. Swap the bi‑encoder (`all-MiniLM-L6-v2`) for `all-mpnet-base-v2` and compare accuracy and latency.  
4. Index `doc_embs` with FAISS IVF/HNSW and measure retrieval latency vs brute force.  
5. Add BM25 (sparse) and do **hybrid** (sparse+dense) before re‑ranking.
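
For exercise 5, one possible way to merge a sparse and a dense ranking before re-ranking is reciprocal rank fusion (RRF), which needs only ranks, not comparable scores. RRF is a technique chosen here for illustration, not something the notebook prescribes. The function name, the conventional constant `k=60`, and both input rankings are assumptions of this sketch.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc ids: score(d) = sum over lists of 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

sparse_ranking = ["d6", "d4", "d3"]  # hypothetical BM25 output
dense_ranking = ["d3", "d6", "d8"]   # hypothetical bi-encoder output

fused = reciprocal_rank_fusion([sparse_ranking, dense_ranking])
print(fused)  # ['d6', 'd3', 'd4', 'd8']
```

Documents found by both retrievers rise to the top, and the fused list can then be passed to the cross-encoder as the candidate set.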
