# **Explore Advanced Retrievers in LlamaIndex**


### Installing Required Libraries



In [39]:
!pip install -U \
llama-index \
llama-index-llms-gemini \
llama-index-retrievers-bm25 \
llama-index-embeddings-huggingface \
sentence-transformers \
rank-bm25 \
PyStemmer \
google-generativeai


Collecting PyStemmer
  Using cached PyStemmer-3.0.0-cp312-cp312-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.8 kB)


### Importing Required Libraries

We begin by importing all necessary libraries and modules for our advanced retriever demonstrations.


In [40]:
# Core LlamaIndex imports
from llama_index.core import (
    VectorStoreIndex,
    Document,
    SimpleDirectoryReader,
    Settings,
    SummaryIndex,
    StorageContext
)

# LLM and Embedding imports
from llama_index.llms.gemini import Gemini
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# Retriever imports
from llama_index.retrievers.bm25 import BM25Retriever
from llama_index.core.retrievers import (
    QueryFusionRetriever,
    AutoMergingRetriever,
    RecursiveRetriever
)

# Node parsing and processing
from llama_index.core.node_parser import (
    HierarchicalNodeParser,
    SentenceSplitter
)

# Schema and response types
from llama_index.core.schema import (
    NodeWithScore,
    TextNode,
    IndexNode
)

from llama_index.core.response_synthesizers import (
    get_response_synthesizer
)

# Query engine
from llama_index.core.query_engine import RetrieverQueryEngine

# Standard library imports
import os
from typing import List
import Stemmer

print("All libraries imported successfully!")

All libraries imported successfully!


In [41]:
from google import genai
import os

client = genai.Client(api_key=os.environ["GOOGLE_API_KEY"])

for model in client.models.list():
    print(model.name)


models/gemini-2.5-flash
models/gemini-2.5-pro
models/gemini-2.0-flash
models/gemini-2.0-flash-001
models/gemini-2.0-flash-exp-image-generation
models/gemini-2.0-flash-lite-001
models/gemini-2.0-flash-lite
models/gemini-exp-1206
models/gemini-2.5-flash-preview-tts
models/gemini-2.5-pro-preview-tts
models/gemma-3-1b-it
models/gemma-3-4b-it
models/gemma-3-12b-it
models/gemma-3-27b-it
models/gemma-3n-e4b-it
models/gemma-3n-e2b-it
models/gemini-flash-latest
models/gemini-flash-lite-latest
models/gemini-pro-latest
models/gemini-2.5-flash-lite
models/gemini-2.5-flash-image
models/gemini-2.5-flash-preview-09-2025
models/gemini-2.5-flash-lite-preview-09-2025
models/gemini-3-pro-preview
models/gemini-3-flash-preview
models/gemini-3-pro-image-preview
models/nano-banana-pro-preview
models/gemini-robotics-er-1.5-preview
models/gemini-2.5-computer-use-preview-10-2025
models/deep-research-pro-preview-12-2025
models/gemini-embedding-001
models/aqa
models/imagen-4.0-generate-preview-06-06
models/imagen

In [42]:

import os
from google.colab import userdata

# Load API key securely from Colab Secrets
api_key = userdata.get("GOOGLE_API_KEY")

if not api_key:
    raise ValueError("GOOGLE_API_KEY not found in Colab Secrets.")

os.environ["GOOGLE_API_KEY"] = api_key


# LlamaIndex Imports
from llama_index.llms.gemini import Gemini
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core import Settings


# Initialize Gemini LLM
llm = Gemini(
    model="models/gemini-2.5-flash",   # or "gemini-1.5-pro"
    temperature=0.1
)

# Initialize embedding model
embed_model = HuggingFaceEmbedding(
    model_name="BAAI/bge-small-en-v1.5",
    trust_remote_code=True
)

# Configure global settings
Settings.llm = llm
Settings.embed_model = embed_model
Settings.chunk_size = 512
Settings.chunk_overlap = 50


print("✓ Gemini LLM initialized successfully!")
print(f"✓ Model: {llm.model}")
print(f"✓ Embedding Model: {embed_model.model_name}")
print(f"✓ Chunk Size: {Settings.chunk_size}")


  llm = Gemini(


✓ Gemini LLM initialized successfully!
✓ Model: models/gemini-2.5-flash
✓ Embedding Model: BAAI/bge-small-en-v1.5
✓ Chunk Size: 512


### Sample Data Setup

Let's create a comprehensive sample dataset that demonstrates the capabilities of different retriever types. We'll use multiple documents covering various topics to showcase different retrieval patterns.


In [43]:
# Create sample documents for testing
documents = [
    Document(
        text="""Machine Learning Fundamentals

        Machine learning is a subset of artificial intelligence that enables systems to learn and improve from experience without being explicitly programmed. It focuses on developing computer programs that can access data and use it to learn for themselves.

        The process of learning begins with observations or data, such as examples, direct experience, or instruction, in order to look for patterns in data and make better decisions in the future. The primary aim is to allow computers to learn automatically without human intervention or assistance.
        """,
        metadata={"topic": "ML Basics", "difficulty": "beginner"}
    ),
    Document(
        text="""Types of Machine Learning

        There are three main types of machine learning:

        1. Supervised Learning: The algorithm learns from labeled training data, helping predict outcomes for unforeseen data. Common applications include classification and regression tasks.

        2. Unsupervised Learning: The algorithm learns patterns from unlabeled data. The system tries to learn without a teacher. Common applications include clustering and dimensionality reduction.

        3. Reinforcement Learning: The algorithm learns through trial and error, receiving rewards or penalties for actions. It's commonly used in robotics, gaming, and navigation.
        """,
        metadata={"topic": "ML Types", "difficulty": "intermediate"}
    ),
    Document(
        text="""Neural Networks Architecture

        Neural networks are computing systems inspired by biological neural networks in animal brains. They consist of interconnected nodes (neurons) organized in layers:

        - Input Layer: Receives the initial data
        - Hidden Layers: Process information through weighted connections
        - Output Layer: Produces the final result

        Deep learning uses neural networks with multiple hidden layers, enabling the learning of complex patterns. Each neuron applies an activation function to introduce non-linearity, allowing the network to learn sophisticated relationships in data.
        """,
        metadata={"topic": "Deep Learning", "difficulty": "advanced"}
    ),
    Document(
        text="""Data Preprocessing Techniques

        Data preprocessing is a crucial step in machine learning pipelines:

        - Data Cleaning: Remove noise, handle missing values, and eliminate duplicates
        - Normalization: Scale features to a standard range
        - Feature Engineering: Create new features from existing data
        - Encoding: Convert categorical variables to numerical format
        - Data Splitting: Divide data into training, validation, and test sets

        Proper preprocessing significantly impacts model performance and can be the difference between a successful and unsuccessful machine learning project.
        """,
        metadata={"topic": "Data Processing", "difficulty": "intermediate"}
    ),
    Document(
        text="""Model Evaluation Metrics

        Evaluating machine learning models requires appropriate metrics:

        For Classification:
        - Accuracy: Overall correctness of predictions
        - Precision: Quality of positive predictions
        - Recall: Coverage of actual positive cases
        - F1 Score: Harmonic mean of precision and recall

        For Regression:
        - Mean Squared Error (MSE): Average squared differences
        - Root Mean Squared Error (RMSE): Square root of MSE
        - R-squared: Proportion of variance explained
        - Mean Absolute Error (MAE): Average absolute differences
        """,
        metadata={"topic": "Evaluation", "difficulty": "intermediate"}
    )
]

print(f"Created {len(documents)} sample documents")
print("\nDocument topics:")
for i, doc in enumerate(documents, 1):
    print(f"{i}. {doc.metadata['topic']} ({doc.metadata['difficulty']} level)")

Created 5 sample documents

Document topics:
1. ML Basics (beginner level)
2. ML Types (intermediate level)
3. Deep Learning (advanced level)
4. Data Processing (intermediate level)
5. Evaluation (intermediate level)


## Background

Before diving into implementation, let's understand the theoretical foundation of advanced retrievers and their role in RAG systems.


### What are Advanced Retrievers?

Advanced retrievers are sophisticated mechanisms that go beyond simple similarity search to find the most relevant information from a knowledge base. While basic retrievers rely solely on vector similarity, advanced retrievers employ multiple strategies:

**Core Capabilities:**
- **Multi-Strategy Search**: Combine semantic, keyword, and structural approaches
- **Hierarchical Understanding**: Navigate document structures intelligently
- **Query Enhancement**: Generate and fuse multiple query variations
- **Context Preservation**: Maintain document hierarchy and relationships
- **Adaptive Ranking**: Use sophisticated scoring mechanisms

**Key Differences from Basic Retrieval:**
1. Basic: Single embedding similarity → Advanced: Multiple retrieval strategies
2. Basic: Flat document structure → Advanced: Hierarchical understanding
3. Basic: Single query → Advanced: Query expansion and fusion
4. Basic: Fixed chunk size → Advanced: Dynamic context merging


### Why are Advanced Retrievers Important?

Advanced retrievers address critical limitations in production RAG systems:

**1. Improved Recall and Precision**
- Combine semantic and keyword search to catch different query types
- Reduce false negatives through query expansion
- Minimize false positives with better ranking

**2. Better Context Understanding**
- Preserve document structure and hierarchy
- Maintain relationships between sections
- Provide appropriate context windows

**3. Handling Complex Queries**
- Multi-faceted questions requiring different retrieval strategies
- Queries that benefit from multiple perspectives
- Technical queries needing exact keyword matches

**4. Production Reliability**
- Fallback mechanisms for edge cases
- Consistent performance across query types
- Scalable to large document collections

**Real-World Impact:**
- Customer Support: 40% improvement in answer relevance
- Legal Research: 60% reduction in missed relevant cases
- Technical Documentation: 50% faster information retrieval


### Index Types Overview

LlamaIndex supports various index types, each optimized for different retrieval patterns:

**1. Vector Store Index**
- **Best For**: Semantic similarity search
- **How It Works**: Embeds documents and queries into vector space
- **Strengths**: Captures meaning and context
- **Use Cases**: General question answering, conceptual queries

**2. Keyword/BM25 Index**
- **Best For**: Exact term matching
- **How It Works**: Statistical ranking based on term frequency
- **Strengths**: Precise keyword retrieval, no embedding overhead
- **Use Cases**: Technical documentation, legal search, code retrieval

**3. Summary Index**
- **Best For**: Document-level retrieval
- **How It Works**: Creates summaries for document selection
- **Strengths**: Intelligent document filtering
- **Use Cases**: Multi-document QA, research papers, long-form content

**4. Hierarchical Index**
- **Best For**: Structured documents
- **How It Works**: Maintains parent-child relationships
- **Strengths**: Context preservation, dynamic merging
- **Use Cases**: Books, technical manuals, structured reports

**Choosing the Right Index:**
- Single strategy → Simple queries, homogeneous content
- Hybrid approach → Production systems, diverse query types
- Multi-index → Large-scale systems, specialized content types


## Core Retriever Demonstrations

Now let's implement and test each retriever type with practical examples.


### 1. Vector Index Retriever - The Foundation

The Vector Index Retriever is the most common retrieval method, using semantic similarity to find relevant documents.

**How It Works:**
1. Documents are split into chunks and embedded into vectors
2. Queries are embedded using the same model
3. Cosine similarity finds the closest vectors
4. Top-k most similar chunks are retrieved

**Strengths:**
- Captures semantic meaning beyond keywords
- Handles paraphrasing and synonyms well
- Good for conceptual questions

**Limitations:**
- May miss exact keyword matches
- Embedding quality dependent
- Computational overhead for embeddings


In [44]:
# Create Vector Store Index
vector_index = VectorStoreIndex.from_documents(
    documents,
    show_progress=True
)

# Get the retriever
vector_retriever = vector_index.as_retriever(
    similarity_top_k=3  # Retrieve top 3 most similar chunks
)

# Test with a semantic query
query = "How do computers learn without being programmed?"
retrieved_nodes = vector_retriever.retrieve(query)

print(f"Query: {query}\n")
print(f"Retrieved {len(retrieved_nodes)} nodes:\n")

for i, node in enumerate(retrieved_nodes, 1):
    print(f"Result {i} (Score: {node.score:.4f}):")
    print(f"Topic: {node.metadata.get('topic', 'N/A')}")
    print(f"Text preview: {node.text[:200]}...\n")

Parsing nodes:   0%|          | 0/5 [00:00<?, ?it/s]

Generating embeddings:   0%|          | 0/5 [00:00<?, ?it/s]

Query: How do computers learn without being programmed?

Retrieved 3 nodes:

Result 1 (Score: 0.6995):
Topic: ML Basics
Text preview: Machine Learning Fundamentals
        
        Machine learning is a subset of artificial intelligence that enables systems to learn and improve from experience without being explicitly programmed. It...

Result 2 (Score: 0.6321):
Topic: Deep Learning
Text preview: Neural Networks Architecture
        
        Neural networks are computing systems inspired by biological neural networks in animal brains. They consist of interconnected nodes (neurons) organized in...

Result 3 (Score: 0.5991):
Topic: ML Types
Text preview: Types of Machine Learning
        
        There are three main types of machine learning:
        
        1. Supervised Learning: The algorithm learns from labeled training data, helping predict out...



### 2. BM25 Retriever - Advanced Keyword-Based Search

BM25 (Best Matching 25) is a probabilistic ranking function that excels at keyword-based retrieval.

**How It Works:**
1. Tokenizes documents and queries
2. Applies stemming to normalize words
3. Calculates term frequency (TF) and inverse document frequency (IDF)
4. Ranks documents using BM25 scoring formula

**Strengths:**
- Excellent for exact term matching
- No embedding model required
- Fast and efficient
- Great for technical/domain-specific terms

**Use Cases:**
- Legal document search
- Code retrieval
- Technical documentation
- When exact terminology matters


In [45]:
from llama_index.core.node_parser import SentenceSplitter
from llama_index.retrievers.bm25 import BM25Retriever
import Stemmer

parser = SentenceSplitter(
    chunk_size=115,
    chunk_overlap=15
)

nodes = parser.get_nodes_from_documents(documents)

print(f"Created {len(nodes)} nodes for BM25 indexing.\n")


# 2)Initialize BM25 Retriever

bm25_retriever = BM25Retriever.from_defaults(
    nodes=nodes,
    similarity_top_k=3,
    stemmer=Stemmer.Stemmer("english"),
    language="english"
)


#Test Query

query = "supervised learning classification regression"

retrieved_nodes = bm25_retriever.retrieve(query)

print(f"Query: {query}\n")
print(f"Retrieved {len(retrieved_nodes)} nodes:\n")

for i, node in enumerate(retrieved_nodes, 1):
    print(f"Result {i} (Score: {node.score:.4f})")
    print(f"Topic: {node.metadata.get('topic', 'N/A')}")
    print(f"Text: {node.text}\n")


DEBUG:bm25s:Building index from IDs objects


Created 9 nodes for BM25 indexing.

Query: supervised learning classification regression

Retrieved 3 nodes:

Result 1 (Score: 1.8734)
Topic: ML Types
Text: Types of Machine Learning
        
        There are three main types of machine learning:
        
        1. Supervised Learning: The algorithm learns from labeled training data, helping predict outcomes for unforeseen data. Common applications include classification and regression tasks.
        
        2. Unsupervised Learning: The algorithm learns patterns from unlabeled data. The system tries to learn without a teacher. Common applications include clustering and dimensionality reduction.
        
        3.

Result 2 (Score: 0.7301)
Topic: Evaluation
Text: Model Evaluation Metrics
        
        Evaluating machine learning models requires appropriate metrics:
        
        For Classification:
        - Accuracy: Overall correctness of predictions
        - Precision: Quality of positive predictions
        - Recall: Cov

### 3. Document Summary Index Retrievers

Document Summary Index creates summaries of documents and uses them for initial selection, enabling more intelligent document-level retrieval.

**How It Works:**
1. Generates a summary for each document using an LLM
2. Uses summaries for initial document selection
3. Retrieves from selected documents

**Strengths:**
- Intelligent document pre-filtering
- Reduces search space
- Better for multi-document scenarios

**Best For:**
- Research paper collections
- Multi-document QA
- Large document sets


In [46]:
# Create Summary Index
summary_index = SummaryIndex.from_documents(
    documents,
    show_progress=True
)

# Create query engine from summary index
summary_query_engine = summary_index.as_query_engine(
    response_mode="tree_summarize"
)

# Test with a broad query
query = "What are the main concepts in machine learning?"
response = summary_query_engine.query(query)

print(f"Query: {query}\n")
print(f"Response: {response}\n")
print(f"\nSource nodes used: {len(response.source_nodes)}")
for i, node in enumerate(response.source_nodes, 1):
    print(f"{i}. {node.metadata.get('topic', 'N/A')}")


Parsing nodes:   0%|          | 0/5 [00:00<?, ?it/s]

Query: What are the main concepts in machine learning?

Response: Machine learning is a field within artificial intelligence that allows systems to learn and improve from experience without being explicitly programmed. It involves developing computer programs that can access data, identify patterns, and use this understanding to make future decisions automatically.

The primary approaches to machine learning include:
*   **Supervised Learning**, where algorithms learn from labeled data to predict outcomes.
*   **Unsupervised Learning**, which involves discovering patterns in unlabeled data without explicit guidance.
*   **Reinforcement Learning**, where algorithms learn through trial and error, receiving feedback in the form of rewards or penalties.

A crucial step in the machine learning process is **data preprocessing**, which involves techniques such as cleaning data, normalizing features, engineering new features, encoding categorical variables, and splitting data into training, va

### 4. Auto-Merging Retriever - Hierarchical Context Preservation

Auto-Merging Retriever maintains document hierarchy and intelligently merges chunks when multiple child chunks from the same parent are retrieved.

**How It Works:**
1. Documents are parsed into hierarchical chunks (parent-child relationships)
2. Retrieval happens at the child level (smaller chunks)
3. If multiple children from same parent are retrieved, they're merged
4. Provides more complete context automatically

**Strengths:**
- Preserves document structure
- Automatic context expansion
- Better coherence in responses

**Use Cases:**
- Books and long-form content
- Technical manuals
- Structured documents


In [47]:
from llama_index.core.node_parser import HierarchicalNodeParser
from llama_index.core import StorageContext, VectorStoreIndex
from llama_index.core.retrievers import AutoMergingRetriever


# Create hierarchical nodes
node_parser = HierarchicalNodeParser.from_defaults(
    chunk_sizes=[512, 256, 128]
)

hierarchical_nodes = node_parser.get_nodes_from_documents(documents)

print(f"Created {len(hierarchical_nodes)} hierarchical nodes\n")


# Create storage context
storage_context = StorageContext.from_defaults()
storage_context.docstore.add_documents(hierarchical_nodes)


# Build vector index
hierarchical_index = VectorStoreIndex(
    hierarchical_nodes,
    storage_context=storage_context
)

print("Vector index built successfully\n")


# Create auto-merging retriever
base_retriever = hierarchical_index.as_retriever(
    similarity_top_k=6
)

auto_merging_retriever = AutoMergingRetriever(
    base_retriever,
    storage_context=storage_context,
    verbose=True
)

print("Auto-merging retriever initialized\n")


# Test query
query = "Explain the types of machine learning"

retrieved_nodes = auto_merging_retriever.retrieve(query)

print(f"Query: {query}\n")
print(f"Retrieved {len(retrieved_nodes)} merged nodes:\n")

for i, node in enumerate(retrieved_nodes, 1):
    print(f"Node {i}")
    print(f"Topic: {node.metadata.get('topic', 'N/A')}")
    print(f"Text Length: {len(node.text)} characters")
    print(f"Preview: {node.text[:150]}...\n")


Created 16 hierarchical nodes

Vector index built successfully

Auto-merging retriever initialized

> Merging 1 nodes into parent node.
> Parent node id: d6593846-7f7c-429f-86fc-030124a52ca9.
> Parent node text: Machine Learning Fundamentals
        
        Machine learning is a subset of artificial intelli...

Query: Explain the types of machine learning

Retrieved 3 merged nodes:

Node 1
Topic: ML Types
Text Length: 633 characters
Preview: Types of Machine Learning
        
        There are three main types of machine learning:
        
        1. Supervised Learning: The algorithm lear...

Node 2
Topic: ML Types
Text Length: 689 characters
Preview: Types of Machine Learning
        
        There are three main types of machine learning:
        
        1. Supervised Learning: The algorithm lear...

Node 3
Topic: ML Basics
Text Length: 607 characters
Preview: Machine Learning Fundamentals
        
        Machine learning is a subset of artificial intelligence that enables system

### 5. Recursive Retriever - Multi-Level Reference Following

Recursive Retriever can follow references and retrieve from multiple levels of document structure.

**How It Works:**
1. Initial retrieval from index nodes (high-level references)
2. Recursively retrieves from referenced underlying indices
3. Aggregates results from multiple levels

**Strengths:**
- Handles complex document structures
- Follows cross-references
- Multi-level aggregation

**Use Cases:**
- Cross-referenced documents
- Multi-index systems
- Hierarchical knowledge bases


In [48]:
from llama_index.core import VectorStoreIndex
from llama_index.core.schema import IndexNode
from llama_index.core.retrievers import RecursiveRetriever


# Split documents
ml_docs = [doc for doc in documents if "ML" in doc.metadata.get("topic", "")]
advanced_docs = [doc for doc in documents if "ML" not in doc.metadata.get("topic", "")]


# Create sub-indices
ml_index = VectorStoreIndex.from_documents(ml_docs)
advanced_index = VectorStoreIndex.from_documents(advanced_docs)


# Create router nodes
router_nodes = [
    IndexNode(
        text="Covers machine learning fundamentals and types including supervised and unsupervised learning.",
        index_id="ml_index"
    ),
    IndexNode(
        text="Covers neural networks, preprocessing, and evaluation metrics.",
        index_id="advanced_index"
    )
]


# Create top-level index
top_index = VectorStoreIndex(router_nodes)


# Create recursive retriever
recursive_retriever = RecursiveRetriever(
    root_id="vector",
    retriever_dict={
        "vector": top_index.as_retriever(similarity_top_k=1),
        "ml_index": ml_index.as_retriever(similarity_top_k=3),
        "advanced_index": advanced_index.as_retriever(similarity_top_k=3),
    },
    verbose=True
)


# Test query
query = "What is supervised learning?"

results = recursive_retriever.retrieve(query)

print(f"\nQuery: {query}\n")
print(f"Retrieved {len(results)} nodes via recursive routing:\n")

for i, node in enumerate(results, 1):
    print(f"Node {i}")
    print(f"Score: {node.score:.4f}")
    print(f"Topic: {node.metadata.get('topic', 'N/A')}")
    print(f"Preview: {node.text[:200]}...\n")


[1;3;34mRetrieving with query id None: What is supervised learning?
[0m[1;3;38;5;200mRetrieved node with id, entering: ml_index
[0m[1;3;34mRetrieving with query id ml_index: What is supervised learning?
[0m[1;3;38;5;200mRetrieving text node: Machine Learning Fundamentals
        
        Machine learning is a subset of artificial intelligence that enables systems to learn and improve from experience without being explicitly programmed. It focuses on developing computer programs that can access data and use it to learn for themselves.
        
        The process of learning begins with observations or data, such as examples, direct experience, or instruction, in order to look for patterns in data and make better decisions in the future. The primary aim is to allow computers to learn automatically without human intervention or assistance.
[0m[1;3;38;5;200mRetrieving text node: Types of Machine Learning
        
        There are three main types of machine learning:
        
  

### 6. Query Fusion Retriever - Multi-Query Enhancement with Advanced Fusion

Query Fusion Retriever generates multiple query variations and combines results using sophisticated fusion techniques.

**How It Works:**
1. Generates multiple query variations using an LLM
2. Retrieves results for each query variation
3. Fuses results using one of three strategies:
   - **Reciprocal Rank Fusion (RRF)**: Combines based on rank positions
   - **Relative Score Fusion**: Normalizes and combines scores
   - **Distribution-Based Fusion**: Uses score distributions

**Strengths:**
- Handles ambiguous queries
- Improves recall
- Multiple perspectives

**Fusion Methods Compared:**

**Reciprocal Rank Fusion (RRF):**
- Formula: `score = Σ(1 / (k + rank))` where k=60 (typical)
- Focuses on rank position, not raw scores
- Good when different retrievers have incomparable scores
- Example: If doc appears at rank 1 in one query and rank 3 in another:
  - RRF score = 1/(60+1) + 1/(60+3) ≈ 0.0164 + 0.0159 = 0.0323

**Relative Score Fusion:**
- Normalizes scores to [0,1] range for each query
- Takes mean of normalized scores across queries
- Better when scores are meaningful within each retrieval
- Example: Raw scores [0.8, 0.6, 0.4] normalized to [1.0, 0.5, 0.0]

**Distribution-Based Fusion:**
- Uses statistical distribution of scores
- Considers mean and standard deviation
- Good for handling outliers and score variance
- Example: Weights based on how many standard deviations above mean


In [49]:
# Create query fusion retriever with RRF
fusion_retriever_rrf = QueryFusionRetriever(
    retrievers=[vector_retriever, bm25_retriever],
    similarity_top_k=3,
    num_queries=3,  # Generate 3 query variations
    mode="reciprocal_rerank",  # RRF fusion
    use_async=False,
    verbose=True
)

# Test with an ambiguous query
query = "How can I improve model accuracy?"
print(f"Testing RRF Fusion with query: {query}\n")
retrieved_nodes_rrf = fusion_retriever_rrf.retrieve(query)

print(f"\nRetrieved {len(retrieved_nodes_rrf)} nodes with RRF:\n")
for i, node in enumerate(retrieved_nodes_rrf, 1):
    print(f"Node {i} (Score: {node.score:.4f}):")
    print(f"Topic: {node.metadata.get('topic', 'N/A')}")
    print(f"Preview: {node.text[:150]}...\n")

print("\n" + "="*80 + "\n")

# Create query fusion retriever with Relative Score Fusion
fusion_retriever_relative = QueryFusionRetriever(
    retrievers=[vector_retriever, bm25_retriever],
    similarity_top_k=3,
    num_queries=3,
    mode="relative_score",  # Relative score fusion
    use_async=False,
    verbose=True
)

print(f"Testing Relative Score Fusion with query: {query}\n")
retrieved_nodes_relative = fusion_retriever_relative.retrieve(query)

print(f"\nRetrieved {len(retrieved_nodes_relative)} nodes with Relative Score:\n")
for i, node in enumerate(retrieved_nodes_relative, 1):
    print(f"Node {i} (Score: {node.score:.4f}):")
    print(f"Topic: {node.metadata.get('topic', 'N/A')}")
    print(f"Preview: {node.text[:150]}...\n")

print("\n" + "="*80 + "\n")

# Create query fusion retriever with Distribution-Based Fusion
fusion_retriever_dist = QueryFusionRetriever(
    retrievers=[vector_retriever, bm25_retriever],
    similarity_top_k=3,
    num_queries=3,
    mode="dist_based_score",  # Distribution-based fusion
    use_async=False,
    verbose=True
)

print(f"Testing Distribution-Based Fusion with query: {query}\n")
retrieved_nodes_dist = fusion_retriever_dist.retrieve(query)

print(f"\nRetrieved {len(retrieved_nodes_dist)} nodes with Distribution-Based:\n")
for i, node in enumerate(retrieved_nodes_dist, 1):
    print(f"Node {i} (Score: {node.score:.4f}):")
    print(f"Topic: {node.metadata.get('topic', 'N/A')}")
    print(f"Preview: {node.text[:150]}...\n")

Testing RRF Fusion with query: How can I improve model accuracy?

Generated queries:
Machine learning model accuracy improvement techniques
Strategies for enhancing predictive model performance

Retrieved 3 nodes with RRF:

Node 1 (Score: 0.0817):
Topic: Data Processing
Preview: Data Preprocessing Techniques
        
        Data preprocessing is a crucial step in machine learning pipelines:
        
        - Data Cleaning: R...

Node 2 (Score: 0.0500):
Topic: Evaluation
Preview: Model Evaluation Metrics
        
        Evaluating machine learning models requires appropriate metrics:
        
        For Classification:
      ...

Node 3 (Score: 0.0500):
Topic: Evaluation
Preview: Model Evaluation Metrics
        
        Evaluating machine learning models requires appropriate metrics:
        
        For Classification:
      ...



Testing Relative Score Fusion with query: How can I improve model accuracy?

Generated queries:
Strategies for enhancing AI model accuracy
Common reaso

**Choosing the Right Fusion Mode:**

- **Use RRF when**:
  - Combining retrievers with different scoring scales
  - You trust ranking more than absolute scores
  - Scores across retrievers aren't directly comparable

- **Use Relative Score when**:
  - Scores within each retriever are meaningful
  - You want to preserve relative quality differences
  - Retrievers have similar scoring mechanisms

- **Use Distribution-Based when**:
  - Dealing with varying score distributions
  - Need to handle outliers gracefully
  - Want statistical normalization

In practice, **RRF is the most commonly used** as it's robust and works well across different retriever combinations.


## Summary

**Key Concepts:**
- **Vector Index Retriever**: Semantic search using embeddings
- **BM25 Retriever**: Advanced keyword-based search with TF-IDF improvements
- **Document Summary Index**: Intelligent document selection using summaries
- **Auto Merging Retriever**: Hierarchical context preservation
- **Recursive Retriever**: Multi-level reference following
- **Query Fusion Retriever**: Multi-query enhancement with three fusion modes

