In [1]:
import os
import json
from typing import List, Optional
import asyncio
import warnings
import numpy as np
warnings.filterwarnings('ignore')

# Core LlamaIndex imports
from llama_index.core import (
    VectorStoreIndex, 
    SimpleDirectoryReader, 
    Document,
    Settings,
    DocumentSummaryIndex,
    KeywordTableIndex
)
from llama_index.core.retrievers import (
    BaseRetriever,
    VectorIndexRetriever,
    AutoMergingRetriever,
    RecursiveRetriever,
    QueryFusionRetriever
)
from llama_index.core.indices.document_summary import (
    DocumentSummaryIndexLLMRetriever,
    DocumentSummaryIndexEmbeddingRetriever,
)
from llama_index.core.node_parser import SentenceSplitter, HierarchicalNodeParser
from llama_index.core.schema import NodeWithScore, QueryBundle
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.postprocessor import SentenceTransformerRerank
from llama_index.core.embeddings import BaseEmbedding
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# Advanced retriever imports
from llama_index.retrievers.bm25 import BM25Retriever

# Sentence transformers
from sentence_transformers import SentenceTransformer

# Statistical libraries for fusion techniques
try:
    from scipy import stats
    SCIPY_AVAILABLE = True
except ImportError:
    SCIPY_AVAILABLE = False
    print("⚠️ scipy not available - some advanced fusion features will be limited")

print("✅ All imports successful!")

✅ All imports successful!


## LLM

In [2]:
from langchain_ollama.chat_models import ChatOllama
from langchain_core.tools import tool
from langchain_core.messages import HumanMessage, ToolMessage, AIMessage
import os
from typing import Literal
from datetime import datetime
from pydantic import BaseModel


OLLAMA_CLIENT_BASE_URL = "http://localhost:11434"

# --- Initialize ChatOllama instance ---
# This is the main LLM instance, used for both direct queries and tool calls.
llm = ChatOllama(
    model="llama3.1:latest",  # Ensure this model has been pulled on the Ollama server
    temperature=0.0,
    base_url=OLLAMA_CLIENT_BASE_URL,  # Points at the Ollama server; localhost here, change it if the server is remote or tunneled
    api_key="ollama"  # Dummy value, as Ollama doesn't use API keys
)

In [3]:
print("🔧 Initializing HuggingFace embeddings...")
embed_model = HuggingFaceEmbedding(
    model_name="BAAI/bge-small-en-v1.5"
)
print("✅ HuggingFace embeddings initialized!")

🔧 Initializing HuggingFace embeddings...
✅ HuggingFace embeddings initialized!


In [6]:
Settings.llm = llm
Settings.embed_model = embed_model

Advanced retrievers in LlamaIndex go beyond plain vector similarity search to make retrieval more context-aware. They combine multiple techniques such as:

- **Semantic Understanding**: Using embeddings to understand meaning and context
- **Keyword Matching**: Precise term-based search for exact specifications
- **Hierarchical Context**: Maintaining relationships between different levels of information
- **Multi-Query Processing**: Generating and combining results from multiple query variations
- **Fusion Techniques**: Intelligently combining results from different retrieval methods

### Why are Advanced Retrievers Important?

1. **Improved Accuracy**: Advanced retrievers can find more relevant information by using multiple search strategies
2. **Better Context Preservation**: They maintain important relationships between pieces of information
3. **Reduced Hallucination**: More precise retrieval gives the LLM better grounding, which tends to reduce fabricated answers
4. **Scalability**: Efficient retrieval strategies work better with large document collections
5. **Flexibility**: Different retrieval methods can be combined for optimal results

### Index Types Overview

Before exploring advanced retrievers, it's helpful to first understand the three main index types supported by LlamaIndex. Each is designed to support different retrieval scenarios:

**VectorStoreIndex:**
- Stores vector embeddings for each document chunk
- Best suited for semantic retrieval based on meaning
- Commonly used in LLM pipelines and RAG applications

**DocumentSummaryIndex:**
- Generates and stores summaries of documents at indexing time
- Uses summaries to filter documents before retrieving full content
- Especially useful for large and diverse document sets that cannot fit in the context window of an LLM

**KeywordTableIndex:**
- Extracts keywords from documents and maps them to specific content chunks
- Enables exact keyword matching for rule-based or hybrid search scenarios
- Ideal for applications requiring precise term matching
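
To make the keyword-to-chunk mapping concrete, here is a minimal pure-Python sketch of a keyword table. It is not how `KeywordTableIndex` is implemented; the stopword list and the helpers `extract_keywords`, `build_keyword_table`, and `keyword_lookup` are hypothetical names chosen for illustration.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list (an assumption, not LlamaIndex's list)
STOPWORDS = {"a", "an", "and", "can", "from", "in", "is", "of", "on", "that", "the", "to", "with"}

def extract_keywords(text: str) -> set[str]:
    """Lowercase, keep alphabetic tokens, drop stopwords."""
    return {tok for tok in re.findall(r"[a-z]+", text.lower()) if tok not in STOPWORDS}

def build_keyword_table(chunks: list[str]) -> dict[str, set[int]]:
    """Map each keyword to the indexes of the chunks that contain it."""
    table = defaultdict(set)
    for idx, chunk in enumerate(chunks):
        for keyword in extract_keywords(chunk):
            table[keyword].add(idx)
    return dict(table)

def keyword_lookup(table: dict[str, set[int]], query: str) -> list[int]:
    """Rank chunk indexes by how many query keywords they match."""
    counts = defaultdict(int)
    for keyword in extract_keywords(query):
        for idx in table.get(keyword, ()):
            counts[idx] += 1
    return sorted(counts, key=lambda i: (-counts[i], i))

chunks = [
    "Supervised learning uses labeled training data to learn a mapping from inputs to outputs.",
    "Unsupervised learning finds hidden patterns in data without labeled examples.",
    "Computer vision allows machines to interpret and understand visual information from the world.",
]
table = build_keyword_table(chunks)
ranked = keyword_lookup(table, "labeled training data")
print(ranked)
```

Chunks that share no keyword with the query never appear in the result, which is the exact-match behavior the list above describes.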


In [7]:
# Sample data for the lab - AI/ML focused documents
SAMPLE_DOCUMENTS = [
    "Machine learning is a subset of artificial intelligence that focuses on algorithms that can learn from data.",
    "Deep learning uses neural networks with multiple layers to model and understand complex patterns in data.",
    "Natural language processing enables computers to understand, interpret, and generate human language.",
    "Computer vision allows machines to interpret and understand visual information from the world.",
    "Reinforcement learning is a type of machine learning where agents learn to make decisions through rewards and penalties.",
    "Supervised learning uses labeled training data to learn a mapping from inputs to outputs.",
    "Unsupervised learning finds hidden patterns in data without labeled examples.",
    "Transfer learning leverages knowledge from pre-trained models to improve performance on new tasks.",
    "Generative AI can create new content including text, images, code, and more.",
    "Large language models are trained on vast amounts of text data to understand and generate human-like text."
]

# Consistent query examples used throughout the lab
DEMO_QUERIES = {
    "basic": "What is machine learning?",
    "technical": "neural networks deep learning", 
    "learning_types": "different types of learning",
    "advanced": "How do neural networks work in deep learning?",
    "applications": "What are the applications of AI?",
    "comprehensive": "What are the main approaches to machine learning?",
    "specific": "supervised learning techniques"
}

print(f" Loaded {len(SAMPLE_DOCUMENTS)} sample documents")
print(f" Prepared {len(DEMO_QUERIES)} consistent demo queries")
for i, doc in enumerate(SAMPLE_DOCUMENTS[:3], 1):
    print(f"{i}. {doc}")
print("...")

 Loaded 10 sample documents
 Prepared 7 consistent demo queries
1. Machine learning is a subset of artificial intelligence that focuses on algorithms that can learn from data.
2. Deep learning uses neural networks with multiple layers to model and understand complex patterns in data.
3. Natural language processing enables computers to understand, interpret, and generate human language.
...


In [8]:
class AdvancedRetrieversLab:
    def __init__(self):
        print("🚀 Initializing Advanced Retrievers Lab...")
        self.documents = [Document(text=text) for text in SAMPLE_DOCUMENTS]
        self.nodes = SentenceSplitter().get_nodes_from_documents(self.documents)
        
        print("📊 Creating indexes...")
        # Create various indexes
        self.vector_index = VectorStoreIndex.from_documents(self.documents)
        self.document_summary_index = DocumentSummaryIndex.from_documents(self.documents)
        self.keyword_index = KeywordTableIndex.from_documents(self.documents)
        
        print("✅ Advanced Retrievers Lab Initialized!")
        print(f"📄 Loaded {len(self.documents)} documents")
        print(f"🔢 Created {len(self.nodes)} nodes")

# Initialize the lab
lab = AdvancedRetrieversLab()

🚀 Initializing Advanced Retrievers Lab...
📊 Creating indexes...
current doc id: 8f64bd78-64cf-4ccb-8f6d-66a641a1bb79
current doc id: 3fa23e5c-271c-4351-9368-0c7d914a977e
current doc id: cbc4dce2-8bbe-4295-8bbc-730f37931944
current doc id: 01e06f81-ee8c-490d-9e0d-c27872d8a6db
current doc id: 715b64a5-816e-4294-a044-e261bd6f007b
current doc id: 915283a1-bda1-4475-9e7f-4ec140d9091c
current doc id: e6e19497-82f9-4cca-9045-ab127cc2fce8
current doc id: 1bb7c092-e626-41d2-8635-1c9f198b3e26
current doc id: 3bc16a77-f3a1-4a87-9258-8ec3961a1753
current doc id: c23c7b07-58a7-4e1a-b20f-a2b40d14ff97
✅ Advanced Retrievers Lab Initialized!
📄 Loaded 10 documents
🔢 Created 10 nodes


In [9]:
print("=" * 60)
print("1. VECTOR INDEX RETRIEVER")
print("=" * 60)

# Basic vector retriever
vector_retriever = VectorIndexRetriever(
    index=lab.vector_index,
    similarity_top_k=3
)

# Alternative creation method
alt_retriever = lab.vector_index.as_retriever(similarity_top_k=3)

query = DEMO_QUERIES["basic"]  # "What is machine learning?"
nodes = vector_retriever.retrieve(query)

print(f"Query: {query}")
print(f"Retrieved {len(nodes)} nodes:")
for i, node in enumerate(nodes, 1):
    print(f"{i}. Score: {node.score:.4f}")
    print(f"   Text: {node.text[:100]}...")
    print()

1. VECTOR INDEX RETRIEVER
Query: What is machine learning?
Retrieved 3 nodes:
1. Score: 0.8700
   Text: Machine learning is a subset of artificial intelligence that focuses on algorithms that can learn fr...

2. Score: 0.7644
   Text: Reinforcement learning is a type of machine learning where agents learn to make decisions through re...

3. Score: 0.6979
   Text: Supervised learning uses labeled training data to learn a mapping from inputs to outputs....
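
The scores above come from comparing the query embedding with each chunk embedding. As a rough sketch of that idea, assuming cosine similarity as the metric, the toy example below ranks hand-made 2-dimensional vectors. The vectors and the helpers `cosine_similarity` and `top_k` are made up for illustration; real embeddings from `BAAI/bge-small-en-v1.5` have far more dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, chunk_vecs, k=3):
    """Return the k (chunk_id, score) pairs with the highest similarity."""
    scored = [(cosine_similarity(query_vec, vec), chunk_id) for chunk_id, vec in chunk_vecs.items()]
    scored.sort(reverse=True)
    return [(chunk_id, score) for score, chunk_id in scored[:k]]

# Hypothetical toy embeddings
query_vec = [1.0, 0.0]
chunk_vecs = {
    "ml": [0.9, 0.1],
    "vision": [0.1, 0.9],
    "rl": [0.7, 0.7],
}
results = top_k(query_vec, chunk_vecs, k=2)
for chunk_id, score in results:
    print(f"{chunk_id}: {score:.4f}")
```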



In [10]:
print("=" * 60)
print("2. BM25 RETRIEVER")
print("=" * 60)

try:
    import Stemmer
    
    # Create BM25 retriever with default parameters
    bm25_retriever = BM25Retriever.from_defaults(
        nodes=lab.nodes,
        similarity_top_k=3,
        stemmer=Stemmer.Stemmer("english"),
        language="english"
    )
    
    query = DEMO_QUERIES["technical"]  # "neural networks deep learning"
    nodes = bm25_retriever.retrieve(query)
    
    print(f"Query: {query}")
    print("BM25 analyzes exact keyword matches with sophisticated scoring")
    print(f"Retrieved {len(nodes)} nodes:")
    
    for i, node in enumerate(nodes, 1):
        score = node.score if hasattr(node, 'score') and node.score else 0
        print(f"{i}. BM25 Score: {score:.4f}")
        print(f"   Text: {node.text[:100]}...")
        
        # Highlight which query terms appear in the text
        text_lower = node.text.lower()
        query_terms = query.lower().split()
        found_terms = [term for term in query_terms if term in text_lower]
        if found_terms:
            print(f"   → Found terms: {found_terms}")
        print()
    
    print("BM25 vs TF-IDF Comparison:")
    print("TF-IDF Problem: Linear term frequency scaling")
    print("  Example: 10 occurrences → score of 10, 100 occurrences → score of 100")
    print("BM25 Solution: Saturation function")
    print("  Example: 10 occurrences → high score, 100 occurrences → slightly higher score")
    print()
    print("TF-IDF Problem: No document length consideration")
    print("  Example: Long documents dominate results")
    print("BM25 Solution: Length normalization (b parameter)")
    print("  Example: Scores adjusted based on document length vs. average")
    print()
    print("Key BM25 Parameters:")
    print("- k1 ≈ 1.2: Term frequency saturation (how quickly scores plateau)")
    print("- b ≈ 0.75: Document length normalization (0=none, 1=full)")
    print("- IDF weighting: Rare terms get higher scores")
        
except ImportError:
    print("⚠️ BM25Retriever requires 'pip install PyStemmer'")
    print("Demonstrating BM25 concepts with fallback vector search...")
    
    fallback_retriever = lab.vector_index.as_retriever(similarity_top_k=3)
    query = DEMO_QUERIES["technical"]
    nodes = fallback_retriever.retrieve(query)
    
    print(f"Query: {query}")
    print("(Using vector fallback to demonstrate BM25 concepts)")
    
    for i, node in enumerate(nodes, 1):
        print(f"{i}. Vector Score: {node.score:.4f}")
        print(f"   Text: {node.text[:100]}...")
        
        # Show which query terms literally appear in the text (what a keyword scorer would reward)
        text_lower = node.text.lower()
        query_terms = query.lower().split()
        found_terms = [term for term in query_terms if term in text_lower]
        
        if found_terms:
            print(f"   → BM25 would boost this result for terms: {found_terms}")
        print()
    
    print("BM25 Concept Demonstration:")
    print("1. TF-IDF Foundation:")
    print("   - Term Frequency: How often words appear in document")
    print("   - Inverse Document Frequency: How rare words are across collection")
    print("   - TF-IDF = TF × IDF (balances frequency vs rarity)")
    print()
    print("2. BM25 Improvements:")
    print("   - Saturation: Prevents over-scoring repeated terms")
    print("   - Length normalization: Prevents long document bias")
    print("   - Tunable parameters: k1 (saturation) and b (length adjustment)")
    print()
    print("3. Real-world Usage:")
    print("   - Elasticsearch default scoring function")
    print("   - Apache Lucene/Solr standard")
    print("   - TF-IDF, BM25's predecessor, was reported in 83% of text-based recommender systems in digital libraries")
    print("   - Developed by Robertson & Spärck Jones at City University London")

2. BM25 RETRIEVER
Query: neural networks deep learning
BM25 analyzes exact keyword matches with sophisticated scoring
Retrieved 3 nodes:
1. BM25 Score: 2.5203
   Text: Deep learning uses neural networks with multiple layers to model and understand complex patterns in ...
   → Found terms: ['neural', 'networks', 'deep', 'learning']

2. BM25 Score: 0.3372
   Text: Reinforcement learning is a type of machine learning where agents learn to make decisions through re...
   → Found terms: ['learning']

3. BM25 Score: 0.3024
   Text: Machine learning is a subset of artificial intelligence that focuses on algorithms that can learn fr...
   → Found terms: ['learning']

BM25 vs TF-IDF Comparison:
TF-IDF Problem: Linear term frequency scaling
  Example: 10 occurrences → score of 10, 100 occurrences → score of 100
BM25 Solution: Saturation function
  Example: 10 occurrences → high score, 100 occurrences → slightly higher score

TF-IDF Problem: No document length consideration
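  Example: Long documents dominate results
BM25 Solution: Length normalization (b parameter)
  Example: Scores adjusted based on document length vs. average

Key BM25 Parameters:
- k1 ≈ 1.2: Term frequency saturation (how quickly scores plateau)
- b ≈ 0.75: Document length normalization (0=none, 1=full)
- IDF weighting: Rare terms get higher scores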
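
The saturation and length-normalization ideas in this cell can be sketched with the textbook Okapi BM25 formula. This is an illustrative pure-Python version, not the code behind `BM25Retriever`: the helpers `tokenize`, `tf_component`, and `bm25_scores` are hypothetical, no stemming is applied, and the scores will not match the ones printed above.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase alphanumeric tokens; no stemming (a simplification)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def tf_component(tf, k1=1.2, length_norm=1.0):
    """Saturating term-frequency factor; it approaches k1 + 1 as tf grows."""
    return tf * (k1 + 1) / (tf + k1 * length_norm)

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every document against the query with textbook Okapi BM25."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avgdl = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: number of documents containing each term
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        # b controls how strongly longer-than-average documents are penalized
        length_norm = 1 - b + b * len(toks) / avgdl
        score = 0.0
        for term in tokenize(query):
            if term in tf:
                idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
                score += idf * tf_component(tf[term], k1, length_norm)
        scores.append(score)
    return scores

docs = [
    "Deep learning uses neural networks with multiple layers to model and understand complex patterns in data.",
    "Reinforcement learning is a type of machine learning where agents learn to make decisions through rewards and penalties.",
    "Computer vision allows machines to interpret and understand visual information from the world.",
]
scores = bm25_scores("neural networks deep learning", docs)
for tf in (1, 10, 100):
    print(tf, round(tf_component(tf), 3))
```

The loop at the end shows the plateau: going from 10 to 100 occurrences adds much less than going from 1 to 10, and the factor never exceeds `k1 + 1`.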

## 3. Document Summary Index Retrievers

In [11]:
print("=" * 60)
print("3. DOCUMENT SUMMARY INDEX RETRIEVERS")
print("=" * 60)

# LLM-based document summary retriever
doc_summary_retriever_llm = DocumentSummaryIndexLLMRetriever(
    lab.document_summary_index,
    choice_top_k=3  # Number of documents to select
)

# Embedding-based document summary retriever  
doc_summary_retriever_embedding = DocumentSummaryIndexEmbeddingRetriever(
    lab.document_summary_index,
    similarity_top_k=3  # Number of documents to select
)

query = DEMO_QUERIES["learning_types"]  # "different types of learning"

print(f"Query: {query}")

print("\nA) LLM-based Document Summary Retriever:")
print("Uses LLM to select relevant documents based on summaries")
try:
    nodes_llm = doc_summary_retriever_llm.retrieve(query)
    print(f"Retrieved {len(nodes_llm)} nodes")
    for i, node in enumerate(nodes_llm[:2], 1):
        print(f"{i}. Score: {node.score:.4f}" if hasattr(node, 'score') and node.score else f"{i}. (Document summary)")
        print(f"   Text: {node.text[:80]}...")
        print()
except Exception as e:
    print(f"LLM-based retrieval demo: {str(e)[:100]}...")

print("B) Embedding-based Document Summary Retriever:")
print("Uses vector similarity between query and document summaries")
try:
    nodes_emb = doc_summary_retriever_embedding.retrieve(query)
    print(f"Retrieved {len(nodes_emb)} nodes")
    for i, node in enumerate(nodes_emb[:2], 1):
        print(f"{i}. Score: {node.score:.4f}" if hasattr(node, 'score') and node.score else f"{i}. (Document summary)")
        print(f"   Text: {node.text[:80]}...")
        print()
except Exception as e:
    print(f"Embedding-based retrieval demo: {str(e)[:100]}...")

print("Document Summary Index workflow:")
print("1. Generates summaries for each document using LLM")
print("2. Uses summaries to select relevant documents")
print("3. Returns full content from selected documents")

3. DOCUMENT SUMMARY INDEX RETRIEVERS
Query: different types of learning

A) LLM-based Document Summary Retriever:
Uses LLM to select relevant documents based on summaries
Retrieved 3 nodes
1. Score: 9.0000
   Text: Supervised learning uses labeled training data to learn a mapping from inputs to...

2. Score: 8.0000
   Text: Machine learning is a subset of artificial intelligence that focuses on algorith...

B) Embedding-based Document Summary Retriever:
Uses vector similarity between query and document summaries
Retrieved 3 nodes
1. (Document summary)
   Text: Supervised learning uses labeled training data to learn a mapping from inputs to...

2. (Document summary)
   Text: Unsupervised learning finds hidden patterns in data without labeled examples....

Document Summary Index workflow:
1. Generates summaries for each document using LLM
2. Uses summaries to select relevant documents
3. Returns full content from selected documents
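
The three-step workflow above can be sketched in a few lines. This toy version assumes the summaries already exist and uses simple token overlap in place of the LLM or embedding comparison; the data, `tokens`, and `summary_retrieve` are hypothetical and only illustrate the select-by-summary, return-full-content pattern.

```python
import re

def tokens(text):
    """Set of lowercase alphabetic tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))

def summary_retrieve(query, docs, top_k=1):
    """Pick documents by summary overlap, then return their full chunks."""
    query_tokens = tokens(query)
    # Step 2: rank documents using only their summaries
    ranked = sorted(docs, key=lambda d: len(query_tokens & tokens(d["summary"])), reverse=True)
    # Step 3: return the full content of the selected documents
    return [chunk for doc in ranked[:top_k] for chunk in doc["chunks"]]

# Step 1 (summary generation) is assumed to have happened at indexing time
docs = [
    {
        "summary": "Introduction to computer vision",
        "chunks": ["Computer vision allows machines to interpret and understand visual information from the world."],
    },
    {
        "summary": "Overview of supervised and unsupervised learning types",
        "chunks": [
            "Supervised learning uses labeled training data to learn a mapping from inputs to outputs.",
            "Unsupervised learning finds hidden patterns in data without labeled examples.",
        ],
    },
]
selected = summary_retrieve("different types of learning", docs, top_k=1)
for chunk in selected:
    print(chunk)
```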


## 4. Auto Merging Retriever - Hierarchical Context Preservation

Auto Merging Retriever is designed to preserve context in long documents using a hierarchical structure. **It uses hierarchical chunking to break documents into parent and child nodes, and if enough child nodes from the same parent are retrieved, the retriever returns the parent node instead.**
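
The merge rule in bold can be sketched without any library. The version below is a simplified, single-level illustration: the function `auto_merge`, the node ids, and the 0.5 ratio threshold are assumptions made for this example, and the real retriever works on node objects stored in the docstore and can merge across several hierarchy levels.

```python
from collections import defaultdict

def auto_merge(retrieved_ids, parent_of, children_of, ratio_threshold=0.5):
    """Swap sibling chunks for their parent when enough siblings were retrieved."""
    by_parent = defaultdict(list)
    for node_id in retrieved_ids:
        by_parent[parent_of.get(node_id)].append(node_id)

    merged = []
    for parent, kids in by_parent.items():
        # Merge only if the retrieved share of the parent's children exceeds the threshold
        if parent is not None and len(kids) / len(children_of[parent]) > ratio_threshold:
            merged.append(parent)
        else:
            merged.extend(kids)
    return merged

# Hypothetical two-parent hierarchy with four children each
children_of = {"P1": ["a", "b", "c", "d"], "P2": ["e", "f", "g", "h"]}
parent_of = {child: parent for parent, kids in children_of.items() for child in kids}

result = auto_merge(["a", "b", "c", "e"], parent_of, children_of)
print(result)
```

Three of the four children of `P1` were retrieved, so they collapse into `P1`; only one child of `P2` was retrieved, so it stays as an individual chunk.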

In [12]:
print("=" * 60)
print("4. AUTO MERGING RETRIEVER")
print("=" * 60)

# Create hierarchical nodes
node_parser = HierarchicalNodeParser.from_defaults(
    chunk_sizes=[512, 256, 128]
)

hier_nodes = node_parser.get_nodes_from_documents(lab.documents)

# Create storage context with all nodes
from llama_index.core import StorageContext
from llama_index.core.storage.docstore import SimpleDocumentStore
from llama_index.core.vector_stores import SimpleVectorStore

docstore = SimpleDocumentStore()
docstore.add_documents(hier_nodes)

storage_context = StorageContext.from_defaults(docstore=docstore)

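# Create base index
# Note: this indexes every level of the hierarchy. A common alternative is to index only
# the leaf nodes (get_leaf_nodes from llama_index.core.node_parser) and keep parents in the docstore.
base_index = VectorStoreIndex(hier_nodes, storage_context=storage_context)
base_retriever = base_index.as_retriever(similarity_top_k=6)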

# Create auto-merging retriever
auto_merging_retriever = AutoMergingRetriever(
    base_retriever, 
    storage_context,
    verbose=True
)

query = DEMO_QUERIES["advanced"]  # "How do neural networks work in deep learning?"
nodes = auto_merging_retriever.retrieve(query)

print(f"Query: {query}")
print(f"Auto-merged to {len(nodes)} nodes")
for i, node in enumerate(nodes[:3], 1):
    print(f"{i}. Score: {node.score:.4f}" if hasattr(node, 'score') and node.score else f"{i}. (Auto-merged)")
    print(f"   Text: {node.text[:120]}...")
    print()

4. AUTO MERGING RETRIEVER
Query: How do neural networks work in deep learning?
Auto-merged to 2 nodes
1. Score: 0.8570
   Text: Deep learning uses neural networks with multiple layers to model and understand complex patterns in data....

2. Score: 0.6956
   Text: Supervised learning uses labeled training data to learn a mapping from inputs to outputs....



## Recommended Retrievers by Use Case

Based on the source guidance quoted under each heading and the characteristics of each retriever, here are recommended approaches for different scenarios:

**General Q&A Applications:**
- **Primary**: Vector Index Retriever for semantic understanding
- **Enhancement**: Combine with BM25 Retriever using Query Fusion for hybrid approach
- **Benefit**: Combines semantic relevance with keyword matching
- **From authoritative source**: "For general Q&A, use a vector index retriever, potentially combined with a BM25 retriever. This retriever fusion combines semantic relevance with keyword matching."
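
One common way to fuse a vector ranking with a BM25 ranking is reciprocal rank fusion (RRF), which uses only rank positions and so avoids comparing scores on different scales. The sketch below is a generic RRF implementation, not the code inside `QueryFusionRetriever`; the document ids, the two rankings, and the constant `k=60` are illustrative assumptions.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists: each item earns 1 / (k + rank) per list it appears in."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)

# Hypothetical rankings from a vector retriever and a BM25 retriever
vector_ranking = ["ml", "rl", "supervised"]
bm25_ranking = ["dl", "rl", "ml"]

fused = reciprocal_rank_fusion([vector_ranking, bm25_ranking])
print(fused)
```

Documents found by both retrievers (`ml`, `rl`) rise above documents found by only one, which is the hybrid behavior this recommendation relies on.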

**Technical Documentation:**
- **Primary**: BM25 Retriever for exact term matching
- **Enhancement**: Vector Index Retriever as secondary for contextual flexibility
- **Benefit**: Prioritizes exact technical terms while maintaining semantic understanding
- **From authoritative source**: "For technical documents, especially those where exact terms need to be prioritized, consider making BM25 your primary retriever, with the vector index retriever adding contextual flexibility as a secondary retriever."
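
Making BM25 the primary retriever can be expressed as a weighted score fusion: normalize each retriever's scores to a common range, then give BM25 the larger weight. The sketch below assumes min-max normalization and a 0.7 weight for BM25; both choices, the helper names, and the vector scores are hypothetical. The BM25 scores reuse the values printed in the BM25 cell earlier in this notebook.

```python
def min_max_normalize(scores):
    """Rescale scores to [0, 1]; if all scores are equal, map them to 1.0."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}

def weighted_fusion(bm25_scores, vector_scores, bm25_weight=0.7):
    """Combine normalized scores, weighting BM25 as the primary signal."""
    bm25_norm = min_max_normalize(bm25_scores)
    vector_norm = min_max_normalize(vector_scores)
    fused = {}
    for doc in set(bm25_norm) | set(vector_norm):
        fused[doc] = (bm25_weight * bm25_norm.get(doc, 0.0)
                      + (1 - bm25_weight) * vector_norm.get(doc, 0.0))
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)

bm25_scores = {"deep": 2.5203, "reinforcement": 0.3372, "machine": 0.3024}
vector_scores = {"deep": 0.86, "supervised": 0.70, "machine": 0.65}  # made-up values

ranked = weighted_fusion(bm25_scores, vector_scores)
for doc, score in ranked:
    print(f"{doc}: {score:.3f}")
```

Lowering `bm25_weight` shifts the balance toward semantic matches, which is one way to tune how much contextual flexibility the secondary retriever contributes.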

**Long Documents:**
- **Primary**: Auto Merging Retriever
- **Benefit**: Retrieves longer parent versions only if enough shorter child versions are retrieved, preserving context
- **From authoritative source**: "For long documents, the auto merging retriever is a great option, because it will retrieve longer parent versions only if enough shorter child versions are retrieved."

**Research Papers:**
- **Primary**: Recursive Retriever
- **Benefit**: Follows citations and references to retrieve relevant content from cited papers
- **From authoritative source**: "For research papers, use the recursive retriever in order to retrieve relevant content from cited papers."
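
The reference-following idea can be sketched as a depth-first walk over nodes that point to other nodes. This is only a conceptual illustration with a hypothetical data layout and a hypothetical `recursive_retrieve` helper; LlamaIndex's `RecursiveRetriever` has its own API for linking nodes to further retrievers or query engines.

```python
def recursive_retrieve(node_id, nodes, seen=None):
    """Return a node's text plus the text of everything it references, depth-first."""
    seen = set() if seen is None else seen
    # Skip unknown ids and ids already visited (guards against citation cycles)
    if node_id in seen or node_id not in nodes:
        return []
    seen.add(node_id)
    node = nodes[node_id]
    results = [node["text"]]
    for ref in node.get("refs", []):
        results.extend(recursive_retrieve(ref, nodes, seen))
    return results

# Hypothetical papers; paper_c cites paper_a back, forming a cycle
nodes = {
    "paper_a": {"text": "Paper A: survey of retrieval methods", "refs": ["paper_b", "paper_c"]},
    "paper_b": {"text": "Paper B: BM25 analysis", "refs": ["paper_c"]},
    "paper_c": {"text": "Paper C: dense retrieval", "refs": ["paper_a"]},
}
texts = recursive_retrieve("paper_a", nodes)
for text in texts:
    print(text)
```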

**Large Document Collections:**
- **Primary**: Document Summary Index Retriever for initial filtering
- **Enhancement**: Followed by Vector Index Retriever for detailed search within relevant documents
- **Benefit**: Narrows down relevant documents first, then performs detailed retrieval
- **From authoritative source**: "For large document sets, consider using the document summary index retriever to narrow down the number of relevant documents, followed by a vector search within the remaining subset to retrieve the most pertinent content."
