
# Advanced RAG:
## Hybrid Search and Reranking with Qdrant and Sentence Transformers
This notebook demonstrates how hybrid search and reranking can improve the search quality of a dense retrieval system. We combine dense and sparse search methods, then rerank the results with a cross-encoder to improve relevance.

## 1. Setting Up the Qdrant Client

We begin by connecting to the Qdrant vector database, which stores our indexed documents.

In [5]:
from qdrant_client import QdrantClient, models

client = QdrantClient("http://localhost:6333", timeout=600)
client.count("scifact")

CountResult(count=5183)

## 2. Loading Embedding Models

We’ll use two embedding models:
- **Dense embeddings** for semantic search.
- **Sparse embeddings** for keyword-based search.

In [6]:
from fastembed import TextEmbedding, SparseTextEmbedding

dense_embedding_model = TextEmbedding("sentence-transformers/all-MiniLM-L6-v2")
bm25_embedding_model = SparseTextEmbedding("Qdrant/bm25")

### Explanation
- **Dense Embeddings (`all-MiniLM-L6-v2`):** A Sentence Transformers model, loaded here through FastEmbed, that captures the semantic meaning of text, allowing searches based on concepts rather than exact words.
- **Sparse Embeddings (`BM25`):** A traditional keyword-matching method that excels at finding documents with exact term matches.
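To make the keyword-matching behaviour concrete, here is a minimal sketch of the classic BM25 scoring formula in plain Python. It is an illustration only: the whitespace tokenizer, the toy documents, and the parameter defaults (`k1=1.2`, `b=0.75`) are assumptions, and it does not reproduce how the `Qdrant/bm25` model tokenizes or weights terms.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each document against the query with the classic BM25 formula."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: the number of documents each term occurs in.
    df = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue  # terms absent from the document contribute nothing
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores

# Toy documents (hypothetical, not taken from SciFact)
docs = [
    "stem cells and nanoparticles",
    "biomaterials with inductive properties",
    "weather forecast for tomorrow",
]
scores = bm25_scores("inductive biomaterials", docs)
print(scores)
```

Only the second document shares terms with the query, so it is the only one with a non-zero score. A topically related document that uses different vocabulary scores zero, which is the gap dense embeddings are meant to cover.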

## 3. Performing a Simple Semantic Search

Let’s try a basic search using dense embeddings to see how it works.

In [38]:
query_text = "0-dimensional biomaterials lack inductive properties."

In [39]:
# query_points returns a models.QueryResponse (models was imported in the first cell)
response: models.QueryResponse = client.query_points(
    "scifact",
    query=next(dense_embedding_model.query_embed(query_text)),
    using="all-MiniLM-L6-v2",
    limit=10,
    with_payload=True,
)

len(response.points)

10

In [40]:
response.points[4]

ScoredPoint(id=31715818, version=848, score=0.29541197, payload={'_id': '31715818', 'title': 'New opportunities: the use of nanotechnologies to manipulate and track stem cells.', 'text': 'Nanotechnologies are emerging platforms that could be useful in measuring, understanding, and manipulating stem cells. Examples include magnetic nanoparticles and quantum dots for stem cell labeling and in vivo tracking; nanoparticles, carbon nanotubes, and polyplexes for the intracellular delivery of genes/oligonucleotides and protein/peptides; and engineered nanometer-scale scaffolds for stem cell differentiation and transplantation. This review examines the use of nanotechnologies for stem cell tracking, differentiation, and transplantation. We further discuss their utility and the potential concerns regarding their cytotoxicity.'}, vector=None, shard_key=None, order_value=None)

### Explanation
- We convert the query into a dense vector using the dense embedding model.
- The `query_points` method searches the "scifact" collection for the 10 documents closest to this vector, using the named vector `all-MiniLM-L6-v2`.
- `with_payload=True` returns each document’s stored fields (title and text) alongside its ID and score, so we can inspect the match. The benchmark loops below set `with_payload=False` because they need only IDs and scores.
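Conceptually, a dense search ranks documents by how close their vectors are to the query vector. The sketch below does this by brute force with cosine similarity on toy two-dimensional vectors. It assumes the collection uses cosine distance, which this notebook does not state, and a real vector database relies on an approximate index rather than scanning every document.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k):
    """Return the k (doc_id, score) pairs most similar to the query."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Toy vectors (hypothetical); real all-MiniLM-L6-v2 vectors have 384 dimensions
doc_vecs = {"a": [1.0, 0.0], "b": [0.7, 0.7], "c": [0.0, 1.0]}
results = top_k([1.0, 0.1], doc_vecs, k=2)
print(results)
```

Document `a` points in almost the same direction as the query, so it ranks first, followed by `b`.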

## 4. Benchmarking with BeIR SciFact Dataset

To evaluate our search methods, we’ll use the **BeIR SciFact dataset**, a standard benchmark for information retrieval.

### 4.1 Loading Queries and Ground Truth

In [10]:
from datasets import load_dataset

queries_dataset = load_dataset("BeIR/scifact", "queries", split="queries")
len(queries_dataset)
queries_dataset[0:10]


{'_id': ['0', '2', '4', '6', '9', '10', '11', '12', '14', '15'],
 'title': ['', '', '', '', '', '', '', '', '', ''],
 'text': ['0-dimensional biomaterials lack inductive properties.',
  '1 in 5 million in UK have abnormal PrP positivity.',
  '1-1% of colorectal cancer patients are diagnosed with regional or distant metastases.',
  '10% of sudden infant death syndrome (SIDS) deaths happen in newborns aged less than 6 months.',
  '32% of liver transplantation programs required patients to discontinue methadone treatment in 2001.',
  '4-PBA treatment decreases endoplasmic reticulum stress in response to general endoplasmic reticulum stress markers.',
  '4-PBA treatment raises endoplasmic reticulum stress in response to general endoplasmic reticulum stress markers.',
  '40mg/day dosage of folic acid and 2mg/day dosage of vitamin B12 does not affect chronic kidney disease (CKD) progression.',
  "5'-nucleotidase metabolizes 6MP.",
  '50% of patients exposed to radiation have activated marker

In [11]:
query_qrels = load_dataset("BeIR/scifact-qrels", split="train")
len(query_qrels)

919

In [12]:
query_qrels[0:3]

{'query-id': [0, 2, 4],
 'corpus-id': [31715818, 13734012, 22942787],
 'score': [1, 1, 1]}

### Explanation
- **queries_dataset:** Contains the queries we’ll search with.
- **query_qrels:** Provides ground truth relevance labels (query ID, document ID, and score) to evaluate search performance.

### 4.2 Building the Ground Truth Dataset

We need to format the ground truth into a structure suitable for evaluation.

In [13]:
from ranx import Qrels
from collections import defaultdict

qrels_dict = defaultdict(dict)
for entry in query_qrels:
    query_id = str(entry["query-id"])
    doc_id = str(entry["corpus-id"])
    qrels_dict[query_id][doc_id] = entry["score"]

qrels = Qrels(qrels_dict, name="scifact")

### Explanation
- We create a dictionary where each query ID maps to document IDs and their relevance scores (e.g., 1 for relevant, 0 for irrelevant).
- The `Qrels` object from the `ranx` library will be used later to evaluate search results.

## 5. Precomputing Query Embeddings

To speed up testing, we precompute embeddings for all queries.

In [14]:
import tqdm

dense_vectors, sparse_vectors = [], []
for query in tqdm.tqdm(queries_dataset):
    dense_query_vector = next(dense_embedding_model.query_embed(query["text"]))
    sparse_query_vector = next(bm25_embedding_model.query_embed(query["text"]))

    dense_vectors.append(dense_query_vector)
    sparse_vectors.append(sparse_query_vector)


100%|██████████| 1109/1109 [00:01<00:00, 766.37it/s]


### Explanation
- We loop through each query, generating its dense and sparse embeddings.
- `tqdm` shows a progress bar, making it clear how long the process takes.
- These precomputed vectors save time when testing multiple search pipelines.

## 6. Testing Search Pipelines

We’ll evaluate three search approaches:
1. **Dense embeddings alone**
2. **Sparse embeddings alone**
3. **Hybrid search with Reciprocal Rank Fusion (RRF)**

### 6.1 Dense Embeddings Search

In [15]:
from ranx import Run

run_dict = {}
for query_idx, query in enumerate(queries_dataset):
    query_id = str(query["_id"])

    query_vector = dense_vectors[query_idx]

    results = client.query_points(
        "scifact",
        query=query_vector,
        using="all-MiniLM-L6-v2",
        with_payload=False,
        limit=5,
    )

    run_dict[query_id] = {
        str(point.id): point.score
        for point in results.points
    }

dense_run = Run(run_dict, name="semantic_search")

In [16]:
from ranx import evaluate

evaluate(qrels, dense_run, metrics=["precision@5", "mrr@5"], make_comparable=True)

{'precision@5': np.float64(0.1517923362175525),
 'mrr@5': np.float64(0.5762875978574372)}

### Explanation
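- We search using dense vectors and retrieve the top 5 results per query.
- Results are stored in a `Run` object for evaluation.
- `make_comparable=True` aligns the run with the ground truth: queries without relevance judgments are dropped, and judged queries missing from the run receive empty results. This matters here because we embedded 1109 queries, but `qrels` covers only some of them.
- **Precision@5:** Fraction of the top 5 results that are relevant.
- **MRR@5:** Average, over all queries, of the reciprocal rank of the first relevant result in the top 5 (0 if none appears).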
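The two metrics can be sketched for a single query in a few lines of plain Python. This is an illustrative implementation, not the code `ranx` runs, and the document IDs are made up.

```python
def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k results that are relevant (always divided by k)."""
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / k

def reciprocal_rank_at_k(ranked_ids, relevant_ids, k):
    """1 / rank of the first relevant result in the top k, or 0.0 if there is none."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

# Hypothetical ranking for one query with a single relevant document
ranked = ["d7", "d2", "d9", "d4", "d1"]
relevant = {"d2"}

p_at_5 = precision_at_k(ranked, relevant, k=5)
rr_at_5 = reciprocal_rank_at_k(ranked, relevant, k=5)
print(p_at_5, rr_at_5)
```

MRR@5 is the mean of `reciprocal_rank_at_k` over all queries. Note that a query with only one relevant document can reach at most 0.2 on precision@5, so low absolute precision values are not necessarily a sign of poor retrieval.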

### 6.2 Sparse Embeddings Search

In [17]:
run_dict = {}
for query_idx, query in enumerate(queries_dataset):
    query_id = str(query["_id"])

    query_vector = sparse_vectors[query_idx]

    results = client.query_points(
        "scifact",
        query=models.SparseVector(**query_vector.as_object()),
        using="bm25",
        with_payload=False,
        limit=5,
    )

    run_dict[query_id] = {
        str(point.id): point.score
        for point in results.points
    }

bm25_run = Run(run_dict, name="bm25")
evaluate(qrels, bm25_run, metrics=["precision@5", "mrr@5"], make_comparable=True)

{'precision@5': np.float64(0.16093943139678615),
 'mrr@5': np.float64(0.6474660074165636)}

### Explanation
- Similar to the dense search, but using sparse vectors (BM25) for keyword-based retrieval.
- We evaluate the same metrics to compare performance.

### 6.3 Hybrid Search with Reciprocal Rank Fusion (RRF)

Hybrid search combines dense and sparse results to improve retrieval quality.

In [18]:
hybrid_search_run_dict = {}
hybrid_search_result = {}
for query_idx, query in enumerate(queries_dataset):
    query_id = str(query["_id"])

    dense_query_vector = dense_vectors[query_idx]
    sparse_query_vector = sparse_vectors[query_idx]

    prefetch = [
        models.Prefetch(
            query=dense_query_vector,
            using="all-MiniLM-L6-v2",
            limit=10,
        ),
        models.Prefetch(
            query=models.SparseVector(**sparse_query_vector.as_object()),
            using="bm25",
            limit=10,
        ),
    ]
    hybrid_search_result = client.query_points(
        "scifact",
        prefetch=prefetch,
        query=models.FusionQuery(
            fusion=models.Fusion.RRF,
        ),
        with_payload=False,
        limit=5,
    )

    hybrid_search_run_dict[query_id] = {
        str(point.id): point.score
        for point in hybrid_search_result.points
    }

rrf_run = Run(hybrid_search_run_dict, name="hybrid_search")
evaluate(qrels, rrf_run, metrics=["precision@5", "mrr@5"], make_comparable=True)

{'precision@5': np.float64(0.17132262051915945),
 'mrr@5': np.float64(0.6561186650185415)}

### Explanation
- **Hybrid Search:** Combines semantic (dense) and keyword (sparse) searches.
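- **Prefetch:** Runs a dense and a sparse sub-query, each retrieving its own top 10 results.
- **RRF:** Fuses the two rankings by giving each document a score based on the reciprocal of its rank in each list, then summing those scores. In the simplest form the contribution is 1/rank (1st = 1/1, 2nd = 1/2); the standard formulation is 1/(k + rank), where the constant k dampens the advantage of the very top ranks. Because only ranks are used, the dense and sparse scores need no normalization or weighting.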
- We take the top 5 fused results and evaluate them.

#### Explanation of RRF (Reciprocal Rank Fusion)
- Dense search results: [doc_A, doc_B, doc_C]
- Sparse search results: [doc_C, doc_A, doc_D]

**RRF scores:**
- doc_A: 1/1 (dense) + 1/2 (sparse) = 1.5
- doc_B: 1/2 (dense) + 0 (not in sparse) = 0.5
- doc_C: 1/3 (dense) + 1/1 (sparse) = 1.33
- doc_D: 0 (not in dense) + 1/3 (sparse) = 0.33

**Final ranking**: [doc_A, doc_C, doc_B, doc_D]
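The arithmetic above can be reproduced with a short function. This is a minimal sketch of rank fusion, not Qdrant’s internal implementation. The smoothing constant `k` defaults to 0 here so the scores match the worked example; published formulations of RRF commonly use a larger value such as 60.

```python
from collections import defaultdict

def rrf(rankings, k=0):
    """Fuse ranked lists of document IDs with Reciprocal Rank Fusion."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Documents missing from a list simply receive no contribution from it.
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

dense = ["doc_A", "doc_B", "doc_C"]
sparse = ["doc_C", "doc_A", "doc_D"]

fused = rrf([dense, sparse])
for doc_id, score in fused:
    print(f"{doc_id}: {score:.2f}")
```

For this example, calling `rrf([dense, sparse], k=60)` produces the same order with much smaller gaps between scores, because a larger `k` reduces how strongly the top ranks dominate.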

## 7. Reranking with a Cross-Encoder

Reranking refines the initial search results using a more accurate model.

### 7.1 Loading the Document Corpus

In [19]:
documents_dataset = load_dataset("BeIR/scifact", "corpus", split="corpus")


### 7.2 Setting Up the Cross-Encoder

In [20]:
import torch
from sentence_transformers import CrossEncoder

device = torch.device('mps' if torch.backends.mps.is_available() else 'cpu')
print(f"Using device: {device}")

model = CrossEncoder('cross-encoder/ms-marco-electra-base')
try:
    model.model = model.model.to(device)
    # Report the device actually selected rather than always saying "MPS"
    print(f"Model moved to {device.type.upper()} device")
except Exception as e:
    print(f"Could not move model to {device}, using CPU: {e}")

Using device: mps
Model moved to MPS device


### Explanation
- **Cross-Encoder:** Takes query-document pairs and scores their relevance directly, capturing interactions better than bi-encoders used in initial retrieval.
- We optimize for performance by using available hardware (e.g., MPS on Apple Silicon).

### 7.3 Reranking Function


In [21]:
import os
import concurrent.futures
from tqdm import tqdm

def rerank(pairs, batch_size=24):  # Adjusted batch size for M3 architecture
    # Using batching for more efficient processing
    all_scores = []

    for i in range(0, len(pairs), batch_size):
        batch_pairs = pairs[i:i + batch_size]
        batch_scores = model.predict(batch_pairs)
        all_scores.extend(batch_scores)

    return all_scores

def process_query(query_id, doc_scores, query_texts, document_texts):
    query_text = query_texts.get(query_id, "")
    query_document_pairs = [(query_text, document_texts.get(doc_id, "")) for doc_id in doc_scores.keys()]
    scores = rerank(query_document_pairs)
    return query_id, {doc_id: score for doc_id, score in zip(doc_scores.keys(), scores)}

def reranked_data(data):
    # Cache document and query texts to dictionaries to avoid repeated lookups
    query_texts = {str(query["_id"]): query["text"] for query in queries_dataset}
    document_texts = {str(doc["_id"]): doc["text"] for doc in documents_dataset}

    # Determine optimal number of workers based on CPU cores
    max_workers = min(os.cpu_count() or 4, 8)  # Limiting to 8 concurrent tasks
    print(f"Using {max_workers} parallel workers")

    results = {}
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = {
            executor.submit(process_query, query_id, doc_scores, query_texts, document_texts): query_id
            for query_id, doc_scores in data.items()
        }

        for future in tqdm(concurrent.futures.as_completed(futures),
                          total=len(futures), desc="Reranking queries"):
            query_id = futures[future]
            try:
                _, updated_scores = future.result()
                results[query_id] = updated_scores
            except Exception as e:
                print(f"Error processing query {query_id}: {e}")
                # Keep original scores if reranking fails
                results[query_id] = data[query_id]

    return results

### Explanation
- **rerank:** Processes query-document pairs in batches for efficiency.
- **process_query:** Pairs a query with its candidate documents and reranks them.
- **reranked_data:** Manages parallel processing to rerank all queries’ results.

### 7.4 Applying Reranking

In [22]:
query_vectors = {}
for query_idx, query in enumerate(queries_dataset):
    query_id = str(query["_id"])
    query_vectors[query_id] = {
        "dense": dense_vectors[query_idx],
        "sparse": sparse_vectors[query_idx]
    }

# Hybrid search results per query, to be reranked below
reranker_dict = {}

# Process in batches
batch_size = 10  # Adjust based on your system's memory
for i in range(0, len(query_vectors), batch_size):
    batch_queries = {k: query_vectors[k] for k in list(query_vectors.keys())[i:i+batch_size]}

    # Run the hybrid search for each query in this batch
    for query_id, vectors in batch_queries.items():
        prefetch = [
            models.Prefetch(
                query=vectors["dense"],
                using="all-MiniLM-L6-v2",
                limit=20,
            ),
            models.Prefetch(
                query=models.SparseVector(**vectors["sparse"].as_object()),
                using="bm25",
                limit=20,
            ),
        ]

        hybrid_search_result = client.query_points(
            "scifact",
            prefetch=prefetch,
            query=models.FusionQuery(
                fusion=models.Fusion.RRF,
            ),
            with_payload=False,
            limit=5,
        )

        reranker_dict[query_id] = {
            str(point.id): point.score
            for point in hybrid_search_result.points
        }

    # Show progress
    # print(f"Processed queries {i+1} to {min(i+batch_size, len(query_vectors))}")

# Final reranking in one go
print("Starting final reranking...")
final_data = reranked_data(reranker_dict)

Starting final reranking...
Using 8 parallel workers


Reranking queries: 100%|██████████| 1109/1109 [01:09<00:00, 15.92it/s]


### Explanation
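- For each query, the dense and sparse prefetches each return their top 20 candidates, RRF fuses them, and we keep the top 5 fused results. The cross-encoder then rescores those 5 documents.
- The reranked scores replace the original RRF scores. Because the reranker sees only the 5 documents that survive fusion, it can change their order (and therefore MRR and NDCG) but cannot bring in documents that fusion left out, so precision@5 and recall@5 are fixed by the fusion step. To let the cross-encoder promote new documents, raise the fused `limit` and keep only the top 5 after reranking.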
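One variation worth trying is to hand the reranker more candidates than you finally keep, then truncate after rescoring. The sketch below shows only the rescoring and truncation logic. The helper `rerank_and_truncate` and the hard-coded `toy_scores` are hypothetical stand-ins for the cross-encoder output, not part of this notebook’s pipeline.

```python
def rerank_and_truncate(candidate_ids, score_fn, k):
    """Rescore every candidate, then keep the k best as {doc_id: score}."""
    scored = [(doc_id, score_fn(doc_id)) for doc_id in candidate_ids]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return dict(scored[:k])

# Eight fused candidates in their original (RRF) order
fused_candidates = ["d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8"]

# Hypothetical relevance scores standing in for cross-encoder predictions
toy_scores = {
    "d1": 0.10, "d2": 0.05, "d3": 0.40, "d4": 0.02,
    "d5": 0.01, "d6": 0.90, "d7": 0.03, "d8": 0.60,
}

top = rerank_and_truncate(fused_candidates, toy_scores.get, k=3)
print(top)
```

Here `d6` and `d8` sit outside the first three fused results but end up at the top after rescoring, which cannot happen when the reranker is given only the final top-k list.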

In [23]:
final_data['0']

{'26071782': np.float32(0.0008164996),
 '29638116': np.float32(2.1777332e-05),
 '4346436': np.float32(0.00023488396),
 '10608397': np.float32(0.00070513197),
 '17388232': np.float32(2.6919299e-05)}

In [24]:
post_rerank_run = Run(final_data, name="post-rerank")
evaluate(qrels, post_rerank_run, metrics=["precision@5", "mrr@5"], make_comparable=True)

{'precision@5': np.float64(0.17082818294190358),
 'mrr@5': np.float64(0.6678203543469303)}

## 8. Comparing All Methods

In [25]:
from ranx import compare

compare(
    qrels=qrels,
    runs=[
        dense_run,
        bm25_run,
        rrf_run,
        post_rerank_run,
    ],
    metrics=["precision@5", "recall@5", "mrr@5", "dcg@5", "ndcg@5"],
)

#    Model            P@5      Recall@5    MRR@5    DCG@5    NDCG@5
---  ---------------  -------  ----------  -------  -------  --------
a    semantic_search  0.152    0.682       0.576    0.634    0.592
b    bm25             0.161    0.736ᵃ      0.647ᵃ   0.701ᵃ   0.665ᵃ
c    hybrid_search    0.171ᵃᵇ  0.775ᵃᵇ     0.656ᵃ   0.721ᵃ   0.678ᵃ
d    post-rerank      0.171ᵃᵇ  0.774ᵃᵇ     0.668ᵃ   0.730ᵃᵇ  0.687ᵃᵇ

### Explanation
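- This compares all four methods (dense, sparse, hybrid, and post-rerank) across multiple metrics:
  - **Recall@5:** Fraction of all relevant documents that appear in the top 5.
  - **DCG@5:** Discounted Cumulative Gain, which rewards relevant documents more the higher they are ranked.
  - **NDCG@5:** DCG divided by the DCG of an ideal ranking, so values are comparable across queries.
- Superscript letters mark a statistically significant improvement over the run labelled with that letter. For example, `0.171ᵃᵇ` for `hybrid_search` on P@5 means it significantly outperforms both `semantic_search` (a) and `bm25` (b).
- Hybrid search gives the largest gain in P@5 and Recall@5. Reranking leaves those two essentially unchanged but raises the rank-sensitive metrics (MRR@5, DCG@5, NDCG@5).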
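For reference, the three metrics can be sketched for a single query as follows. This uses the common formulation in which a document at rank r contributes `relevance / log2(r + 1)`; it is an illustration with made-up IDs rather than the exact code `ranx` executes.

```python
import math

def recall_at_k(ranked_ids, relevance, k):
    """Fraction of all relevant documents that appear in the top k."""
    relevant = {doc_id for doc_id, rel in relevance.items() if rel > 0}
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant)
    return hits / len(relevant)

def dcg_at_k(ranked_ids, relevance, k):
    """Sum of relevance grades, each discounted by log2(rank + 1)."""
    return sum(
        relevance.get(doc_id, 0) / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
    )

def ndcg_at_k(ranked_ids, relevance, k):
    """DCG divided by the DCG of the best possible ordering."""
    ideal = sorted(relevance.values(), reverse=True)[:k]
    ideal_dcg = sum(rel / math.log2(rank + 1) for rank, rel in enumerate(ideal, start=1))
    return dcg_at_k(ranked_ids, relevance, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Hypothetical query: three relevant documents, two of them retrieved
ranked = ["d3", "d1", "d8", "d2", "d5"]
relevance = {"d1": 1, "d2": 1, "d9": 1}

recall = recall_at_k(ranked, relevance, k=5)
dcg = dcg_at_k(ranked, relevance, k=5)
ndcg = ndcg_at_k(ranked, relevance, k=5)
print(recall, dcg, ndcg)
```

The relevant documents sit at ranks 2 and 4, so they are discounted by `log2(3)` and `log2(5)`. NDCG falls below 1 both because of those positions and because one relevant document was not retrieved at all.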

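## 9. A Minimal Cross-Encoder Example

To see what cross-encoder scores look like in isolation, we score two query-document pairs with a smaller model.

In [26]:
from sentence_transformers import CrossEncoder

# Load the model once instead of on every call
tiny_model = CrossEncoder('cross-encoder/ms-marco-TinyBERT-L-4')

def rerank_tiny(pairs):
    # Named differently from rerank() in section 7.3 so that function is not overwritten
    return tiny_model.predict(pairs)


scores = rerank_tiny(pairs=[
    ['what is panda?', 'hi'],
    ['what is panda?', 'The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.'],
])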



In [27]:
scores

array([5.657295e-04, 8.626879e-01], dtype=float32)

## 10. Understanding Evaluation Metrics

Here’s what the metrics mean:
- **Precision@5:** Fraction of top 5 results that are relevant.
- **Recall@5:** Fraction of all relevant documents retrieved in the top 5.
- **MRR@5:** Average reciprocal rank of the first relevant result in the top 5.
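- **DCG@5:** Sums the relevance of the top 5 results, discounting each by its rank so that relevant documents count for more near the top.
- **NDCG@5:** DCG divided by the DCG of an ideal ranking, giving a value between 0 and 1 that can be compared across queries.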

These metrics show different aspects of search quality, from precision to ranking effectiveness.

---

## Conclusion

We’ve covered how hybrid search combines dense and sparse methods using RRF, and how reranking with a cross-encoder refines the order of the results. On SciFact, hybrid search raised precision@5 from 0.152 (dense only) to 0.171, and reranking raised MRR@5 from 0.656 to 0.668. Experiment with different queries, candidate limits, or metrics to deepen your understanding!
