# Lab 3.4.3: LlamaIndex Query Engine with Hybrid Search

**Module:** 3.4 - AI Agents & Agentic Systems  
**Time:** 2 hours  
**Difficulty:** ⭐⭐⭐ (Intermediate)

---

## 🎯 Learning Objectives

By the end of this notebook, you will:
- [ ] Understand LlamaIndex's approach to RAG (vs LangChain)
- [ ] Build different index types for different use cases
- [ ] Implement hybrid search (keyword + semantic)
- [ ] Add reranking for improved retrieval quality
- [ ] Create query engines with automatic citations

---

## 📚 Prerequisites

- Completed: Lab 3.4.1 (RAG Pipeline)
- Knowledge of: Embeddings, vector search basics

---

## 🌍 Real-World Context

**Why LlamaIndex when we have LangChain?**

Think of them like different tools in a toolbox:
- **LangChain**: General-purpose framework for LLM applications (agents, chains, tools)
- **LlamaIndex**: Specialized for data indexing and retrieval (RAG excellence)

**Real-world use cases for LlamaIndex:**
- 📚 **Enterprise Search**: Searching across millions of documents with citations
- 📊 **Structured Data QA**: Querying databases with natural language
- 🔬 **Research Assistants**: Finding relevant papers with context
- 📋 **Compliance Tools**: Answering policy questions with exact source references

---

## 🧒 ELI5: LlamaIndex vs LangChain

> **Imagine you're building a library system...** 📚
>
> **LangChain** is like a construction company. They can build:
> - Libraries (RAG)
> - Offices (agents)
> - Factories (chains)
> - ...and anything else you need!
>
> **LlamaIndex** is like a specialized library architect. They ONLY build libraries, but:
> - Their catalog systems are amazing
> - They have special ways to organize books
> - They're experts at helping you find exactly what you need
>
> **When to use which:**
> - Need a general AI application? → LangChain
> - Need the BEST possible document retrieval? → LlamaIndex
> - Need both? ‚Üí Use them together! (They work great as partners)

---

## Part 1: Environment Setup

In [None]:
# Install LlamaIndex and dependencies (run once)
# Note: BM25Retriever ships in llama-index-retrievers-bm25 (LlamaIndex 0.10+);
# SentenceTransformerRerank requires sentence-transformers
# Minimum version only; pin exact versions if you need reproducibility
# !pip install "llama-index>=0.10.0" llama-index-llms-ollama llama-index-embeddings-ollama llama-index-retrievers-bm25 rank_bm25 sentence-transformers

In [None]:
# Standard imports
import os
import sys
from pathlib import Path
from typing import List, Optional
import time

# LlamaIndex imports
from llama_index.core import (
    VectorStoreIndex,
    SimpleDirectoryReader,
    Document,
    Settings,
    StorageContext,
    load_index_from_storage,
)
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.response_synthesizers import ResponseMode

# Local models via Ollama
from llama_index.llms.ollama import Ollama
from llama_index.embeddings.ollama import OllamaEmbedding

print("LlamaIndex imports successful!")

In [None]:
# Configure LlamaIndex to use local Ollama models
# Everything runs locally on DGX Spark - no cloud API calls needed!

# Initialize embedding model
embed_model = OllamaEmbedding(
    model_name="nomic-embed-text",
    base_url="http://localhost:11434"
)

# Initialize LLM
llm = Ollama(
    model="llama3.1:8b",
    temperature=0.1,
    request_timeout=120.0,
    base_url="http://localhost:11434"
)

# Configure global settings
Settings.llm = llm
Settings.embed_model = embed_model
Settings.chunk_size = 512
Settings.chunk_overlap = 50

print("LlamaIndex configured with local Ollama models:")
print("  - Embedding: nomic-embed-text")
print("  - LLM: llama3.1:8b")
print(f"  - Chunk size: {Settings.chunk_size} tokens")

---

## Part 2: Loading Documents

LlamaIndex has built-in readers for many document types.

In [None]:
# Define paths
DATA_DIR = Path.cwd().parent / "data" / "sample_documents"
INDEX_DIR = Path.cwd().parent / "data" / "llamaindex_storage"

# Load documents
print(f"Loading documents from: {DATA_DIR}")

reader = SimpleDirectoryReader(
    input_dir=str(DATA_DIR),
    required_exts=[".txt", ".md"],
    recursive=True
)

documents = reader.load_data()

print(f"\nLoaded {len(documents)} documents:")
for doc in documents:
    filename = Path(doc.metadata.get('file_path', 'unknown')).name
    print(f"  - {filename}: {len(doc.text)} chars")

In [None]:
# Show a sample document
print("Sample document content:")
print("=" * 60)
print(documents[0].text[:1000])
print("...")
print("=" * 60)
print(f"\nMetadata: {documents[0].metadata}")

---

## Part 3: Creating a Vector Store Index

The VectorStoreIndex is LlamaIndex's primary index type for semantic search.
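
To make "semantic search" concrete: a vector index stores one embedding per node and, at query time, ranks nodes by how close their embeddings are to the query embedding. Here is a minimal sketch of that ranking step, assuming cosine similarity as the distance measure. The helper names and the tiny 3-dimensional vectors are hypothetical, for illustration only; real embeddings come from the embedding model and have hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_by_similarity(query_vec, node_vecs, k=2):
    """Return (node_id, score) pairs for the k nodes most similar to the query."""
    scored = [
        (node_id, cosine_similarity(query_vec, vec))
        for node_id, vec in node_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical toy "embeddings" (made up for this sketch)
toy_node_vecs = {
    "memory_specs": [0.9, 0.1, 0.0],
    "gpu_cores": [0.1, 0.9, 0.0],
    "cooking_recipe": [0.0, 0.1, 0.9],
}
toy_query_vec = [0.8, 0.2, 0.0]

toy_hits = top_k_by_similarity(toy_query_vec, toy_node_vecs, k=2)
for node_id, score in toy_hits:
    print(f"{node_id}: {score:.3f}")
```

The `similarity_top_k` argument used later in this lab plays the role of `k` here.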

In [None]:
# Create node parser (chunker)
node_parser = SentenceSplitter(
    chunk_size=512,  # measured in tokens, not characters
    chunk_overlap=50,  # tokens shared between neighboring chunks
)

# Parse documents into nodes
nodes = node_parser.get_nodes_from_documents(documents)

print(f"Created {len(nodes)} nodes from {len(documents)} documents")
print(f"\nSample node:")
print(f"  Text: {nodes[0].text[:200]}...")
print(f"  Metadata: {nodes[0].metadata}")

In [None]:
# Create the vector index
print("Creating vector index (this will generate embeddings)...")
start_time = time.time()

index = VectorStoreIndex(
    nodes=nodes,
    show_progress=True
)

elapsed = time.time() - start_time
print(f"\nIndex created in {elapsed:.1f} seconds!")

In [None]:
# Persist the index for later use
INDEX_DIR.mkdir(parents=True, exist_ok=True)
index.storage_context.persist(persist_dir=str(INDEX_DIR))
print(f"Index saved to: {INDEX_DIR}")

---

## Part 4: Basic Query Engine

A query engine combines retrieval with response generation.

In [None]:
# Create a basic query engine
query_engine = index.as_query_engine(
    similarity_top_k=5,  # Retrieve top 5 nodes
    response_mode="compact",  # Pack retrieved chunks into as few LLM calls as possible
)

print("Query engine created!")

In [None]:
# Test with a simple query
query = "What is the memory capacity of DGX Spark?"

print(f"Query: {query}")
print("="*60)

start_time = time.time()
response = query_engine.query(query)
elapsed = time.time() - start_time

print(f"\nResponse: {response}")
print(f"\n(Response time: {elapsed:.2f}s)")

In [None]:
# See the source nodes that were retrieved
print("\nSource nodes used:")
print("="*60)

for i, node in enumerate(response.source_nodes, 1):
    # score may be None, so format it defensively instead of applying :.4f directly
    score = f"{node.score:.4f}" if node.score is not None else "N/A"
    filename = Path(node.metadata.get('file_path', 'unknown')).name
    print(f"\n[{i}] Score: {score} | Source: {filename}")
    print(f"    {node.text[:200]}...")

---

## Part 5: Query Engine with Citations

LlamaIndex can automatically add citations to responses!

In [None]:
# Handle different LlamaIndex versions for CitationQueryEngine
try:
    from llama_index.core.query_engine import CitationQueryEngine
except ImportError:
    try:
        from llama_index.query_engine import CitationQueryEngine
    except ImportError:
        print("⚠️ CitationQueryEngine not available in this LlamaIndex version")
        print('   Try: pip install "llama-index>=0.10.0"')
        CitationQueryEngine = None

# Create a citation query engine (if available)
if CitationQueryEngine is not None:
    citation_engine = CitationQueryEngine.from_args(
        index=index,
        similarity_top_k=5,
        citation_chunk_size=512,
    )
    print("✅ Citation query engine created!")
else:
    citation_engine = query_engine  # Fallback to regular engine
    print("⚠️ Using standard query engine (citations not available)")

In [None]:
# Query with citations
query = "How does the unified memory architecture benefit AI workloads on DGX Spark?"

print(f"Query: {query}")
print("="*60)

response = citation_engine.query(query)

print(f"\nResponse with citations:")
print(response.response)

print("\n" + "="*60)
print("Sources:")
for i, node in enumerate(response.source_nodes, 1):
    filename = Path(node.metadata.get('file_path', 'unknown')).name
    print(f"  [{i}] {filename}")
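
Once you have a cited answer, it helps to check which sources the model actually referenced. Below is a minimal sketch, assuming the answer uses bracketed markers such as `[1]` and that the numbers follow the order of `response.source_nodes` (both worth verifying on your installed version). The function names and the toy answer are hypothetical, for illustration only.

```python
import re


def cited_source_numbers(answer_text):
    """Return the distinct citation numbers ([1], [2], ...) in order of first appearance."""
    seen = []
    for match in re.finditer(r"\[(\d+)\]", answer_text):
        number = int(match.group(1))
        if number not in seen:
            seen.append(number)
    return seen


def resolve_citations(answer_text, source_names):
    """Map each cited number (1-based) to a source name, or None if it is out of range."""
    resolved = {}
    for number in cited_source_numbers(answer_text):
        if 1 <= number <= len(source_names):
            resolved[number] = source_names[number - 1]
        else:
            resolved[number] = None  # the model cited a source that was never provided
    return resolved


# Hypothetical answer text and source list (made up for this sketch)
toy_answer = "Claim A [1]. Claim B [3][1]. Claim C [7]."
toy_sources = ["hardware_overview.md", "faq.md", "memory_guide.md"]

toy_resolved = resolve_citations(toy_answer, toy_sources)
print(toy_resolved)
```

With a real response you could pass `response.response` as the answer text and the filenames from `response.source_nodes` as the source list. A `None` value flags a citation that points at no retrieved source.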

---

## Part 6: Hybrid Search (Keyword + Semantic)

### 🧒 ELI5: Why Hybrid Search?

> **Imagine looking for a book about "transformers"...** 🤖
>
> **Semantic search** understands you mean machine learning transformers and finds:
> - "Attention mechanisms in neural networks"
> - "BERT architecture explained"
>
> **Keyword search** finds exact matches:
> - "The Transformer Architecture..."
> - "Building Transformers from scratch"
>
> **Hybrid search** combines both:
> - Gets the semantic understanding
> - PLUS catches exact keyword matches
> - Best of both worlds!

In [None]:
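# In LlamaIndex 0.10+, BM25Retriever ships in its own package:
#   pip install llama-index-retrievers-bm25
from llama_index.retrievers.bm25 import BM25Retriever
from llama_index.core.retrievers import QueryFusionRetriever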

# Create BM25 (keyword) retriever
bm25_retriever = BM25Retriever.from_defaults(
    nodes=nodes,
    similarity_top_k=5
)

# Create vector (semantic) retriever
vector_retriever = index.as_retriever(similarity_top_k=5)

print("Created BM25 and Vector retrievers!")
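
To see what the keyword side contributes, here is a minimal sketch of Okapi BM25 scoring in plain Python. It rewards documents that contain the exact query terms, gives rare terms more weight, and normalizes for document length. The whitespace tokenizer, the default `k1` and `b` values, and the toy documents are assumptions for illustration. This is not the implementation the library uses.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document against the query with the Okapi BM25 formula."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: number of documents in which each term appears
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue  # exact-match only: unseen terms contribute nothing
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            tf = term_freq[term]
            length_norm = 1 - b + b * len(tokens) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


# Hypothetical toy corpus (made up for this sketch)
toy_docs = [
    "gb10 has many cuda cores",
    "the gpu is fast and the memory is large",
    "cuda programming guide for beginners",
]
toy_bm25 = bm25_scores("gb10 cuda", toy_docs)
for doc, score in zip(toy_docs, toy_bm25):
    print(f"{score:.3f}  {doc}")
```

The second document scores exactly zero because it shares no term with the query, even though it is about a GPU. That gap is what the semantic retriever fills, and it is why the two are combined.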

In [None]:
# Create a hybrid retriever using QueryFusion
hybrid_retriever = QueryFusionRetriever(
    retrievers=[vector_retriever, bm25_retriever],
    similarity_top_k=5,
    num_queries=1,  # Use only the original query; no LLM-generated variants
    mode="reciprocal_rerank",  # Use RRF (Reciprocal Rank Fusion)
    use_async=False,  # Avoid nested event-loop errors inside Jupyter
)

print("Hybrid retriever created using Reciprocal Rank Fusion!")
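
Reciprocal Rank Fusion ignores the raw scores (cosine and BM25 scores are not on the same scale) and looks only at each item's rank in every list. Here is a minimal sketch, assuming the commonly used constant `k=60` and 1-based ranks; the exact constant and rank offset inside LlamaIndex are implementation details not shown in this lab. The function name and toy rankings are hypothetical.

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several ranked lists of ids into one list of (id, score), best first."""
    fused_scores = {}
    for ranking in ranked_lists:
        for rank, node_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank); items high in several lists add up
            fused_scores[node_id] = fused_scores.get(node_id, 0.0) + 1.0 / (k + rank)
    return sorted(fused_scores.items(), key=lambda pair: pair[1], reverse=True)


# Hypothetical rankings from two retrievers (made up for this sketch)
toy_vector_ranking = ["a", "b", "c"]
toy_bm25_ranking = ["c", "a", "d"]

toy_fused = reciprocal_rank_fusion([toy_vector_ranking, toy_bm25_ranking])
for node_id, score in toy_fused:
    print(f"{node_id}: {score:.4f}")
```

Items that appear in both lists ("a" and "c") rise to the top. The fused scores are small numbers, so do not compare them directly with the vector or BM25 scores printed in the comparison cell.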

In [None]:
# Compare results: Semantic vs Keyword vs Hybrid
query = "What are the CUDA and Tensor Core counts in Blackwell GB10?"

print(f"Query: {query}\n")

# Vector (semantic) results
print("=" * 60)
print("VECTOR (SEMANTIC) RESULTS")
print("=" * 60)
vector_results = vector_retriever.retrieve(query)
for i, node in enumerate(vector_results[:3], 1):
    print(f"[{i}] Score: {node.score:.4f}")
    print(f"    {node.text[:150]}...\n")

# BM25 (keyword) results
print("=" * 60)
print("BM25 (KEYWORD) RESULTS")
print("=" * 60)
bm25_results = bm25_retriever.retrieve(query)
for i, node in enumerate(bm25_results[:3], 1):
    print(f"[{i}] Score: {node.score:.4f}")
    print(f"    {node.text[:150]}...\n")

# Hybrid results
print("=" * 60)
print("HYBRID (FUSED) RESULTS")
print("=" * 60)
hybrid_results = hybrid_retriever.retrieve(query)
for i, node in enumerate(hybrid_results[:3], 1):
    print(f"[{i}] Score: {node.score:.4f}")
    print(f"    {node.text[:150]}...\n")

In [None]:
# Create a query engine with hybrid retrieval
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.response_synthesizers import get_response_synthesizer

# Create response synthesizer
response_synthesizer = get_response_synthesizer(
    response_mode="compact"
)

# Create hybrid query engine
hybrid_query_engine = RetrieverQueryEngine(
    retriever=hybrid_retriever,
    response_synthesizer=response_synthesizer,
)

print("Hybrid query engine created!")

In [None]:
# Test the hybrid query engine
query = "What are the Tensor Core and CUDA core specifications of DGX Spark?"

print(f"Query: {query}")
print("="*60)

response = hybrid_query_engine.query(query)

print(f"\nResponse: {response}")

---

## Part 7: Adding Reranking

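Reranking improves retrieval by re-scoring the retrieved candidates with a cross-encoder, a model that reads the query and each passage together. It is slower than embedding similarity but usually ranks more accurately, so it is applied only to a short candidate list.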
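
The two-stage shape (retrieve wide, then re-score a short list) can be sketched without any model. In the toy below, a cheap word-overlap scorer plays the first stage and a phrase-overlap scorer stands in for the cross-encoder. Both scorers and the toy passages are hypothetical; a real reranker is a neural model, not a bigram counter.

```python
def word_overlap(query, passage):
    """Cheap first-stage score: number of distinct query words found in the passage."""
    return len(set(query.lower().split()) & set(passage.lower().split()))


def phrase_overlap(query, passage):
    """Toy second-stage score: adjacent query word pairs that are also adjacent in the passage."""
    q = query.lower().split()
    p = passage.lower().split()
    passage_bigrams = set(zip(p, p[1:]))
    return sum(1 for bigram in zip(q, q[1:]) if bigram in passage_bigrams)


def retrieve_then_rerank(query, passages, top_k, top_n):
    # Stage 1: cast a wide net with the cheap scorer
    candidates = sorted(passages, key=lambda p: word_overlap(query, p), reverse=True)[:top_k]
    # Stage 2: re-score only those candidates with the costlier scorer
    return sorted(candidates, key=lambda p: phrase_overlap(query, p), reverse=True)[:top_n]


# Hypothetical toy passages (made up for this sketch)
toy_passages = [
    "the best pizza format and quantization of toppings",
    "quantization format comparison",
    "choosing the best quantization format for inference",
    "weather report for tuesday",
]
toy_reranked = retrieve_then_rerank("best quantization format", toy_passages, top_k=3, top_n=2)
for passage in toy_reranked:
    print(passage)
```

The pizza passage ranks first in stage 1 because it happens to contain all three query words, but it drops out after reranking. The same happens with a real cross-encoder when a passage matches on the surface but not in meaning.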

In [None]:
from llama_index.core.postprocessor import SentenceTransformerRerank

# Create a reranker
# Note: This uses a cross-encoder model for better ranking
reranker = SentenceTransformerRerank(
    model="cross-encoder/ms-marco-MiniLM-L-6-v2",
    top_n=3  # Keep top 3 after reranking
)

print("Reranker initialized!")

In [None]:
# Create a query engine with reranking
reranked_query_engine = index.as_query_engine(
    similarity_top_k=10,  # Retrieve more initially
    node_postprocessors=[reranker],  # Then rerank to top 3
    response_mode="compact"
)

print("Reranked query engine created!")

In [None]:
# Compare: With and without reranking
query = "What quantization formats work best on DGX Spark?"

print(f"Query: {query}\n")

# Without reranking
print("=" * 60)
print("WITHOUT RERANKING")
print("=" * 60)
basic_response = query_engine.query(query)
print(f"Response: {basic_response}\n")

# With reranking
print("=" * 60)
print("WITH RERANKING")
print("=" * 60)
reranked_response = reranked_query_engine.query(query)
print(f"Response: {reranked_response}")

---

## ⚠️ Common Mistakes

### Mistake 1: Not Using Proper Chunk Sizes

In [None]:
# ❌ Wrong: Assuming the default chunk size fits your use case
# Settings.chunk_size = 1024  # Often too large for precise retrieval

# ✅ Right: Tune chunk size for your content (sizes are in tokens)
# For technical docs: 256-512 tokens
# For long-form content: 512-1024 tokens
# For code: 512-768 tokens

Settings.chunk_size = 512
print("Chunk size set to 512 - good for technical documentation")
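
To get a feel for how `chunk_size` and `chunk_overlap` interact, here is a minimal sliding-window chunker over a list of tokens. It is a simplification for illustration: `SentenceSplitter` also tries to respect sentence boundaries, which this sketch does not. The function name and the toy tokens are hypothetical.

```python
def chunk_tokens(tokens, chunk_size, chunk_overlap):
    """Split a token list into windows of chunk_size that share chunk_overlap tokens."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks


# Hypothetical toy tokens t0..t9 (made up for this sketch)
toy_tokens = [f"t{i}" for i in range(10)]
toy_chunks = chunk_tokens(toy_tokens, chunk_size=4, chunk_overlap=1)
for chunk in toy_chunks:
    print(chunk)
```

Each chunk starts with the last token of the previous one. A larger overlap keeps more context across chunk boundaries, at the cost of more chunks to embed and store.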

### Mistake 2: Retrieving Too Few or Too Many Nodes

In [None]:
# ❌ Wrong: Too few nodes
# query_engine = index.as_query_engine(similarity_top_k=1)  # Might miss relevant info

# ❌ Wrong: Too many nodes
# query_engine = index.as_query_engine(similarity_top_k=20)  # Dilutes relevance

# ✅ Right: Balance based on context window and needs
# Without reranking: k=3-5
# With reranking: k=10-20, then rerank to top 3-5

print("Recommended: k=5 for direct retrieval, k=10-20 with reranking")

### Mistake 3: Not Persisting the Index

In [None]:
# ❌ Wrong: Rebuilding index every time (slow!)
# index = VectorStoreIndex.from_documents(documents)  # Takes minutes

# ✅ Right: Persist and reload
def get_or_create_index(documents, index_dir):
    """Load existing index or create new one."""
    index_dir = Path(index_dir)
    
    if (index_dir / "docstore.json").exists():
        print("Loading existing index...")
        storage_context = StorageContext.from_defaults(persist_dir=str(index_dir))
        return load_index_from_storage(storage_context)
    else:
        print("Creating new index...")
        index = VectorStoreIndex.from_documents(documents)
        index.storage_context.persist(persist_dir=str(index_dir))
        return index

print("Always persist your index for faster startup!")

---

## 🎉 Checkpoint

You've learned:
- ✅ How LlamaIndex differs from LangChain
- ✅ Creating vector indices with LlamaIndex
- ✅ Building query engines with citations
- ✅ Implementing hybrid search (BM25 + vector)
- ✅ Adding reranking for better results

---

## ✋ Try It Yourself

Create a query engine that:
1. Uses hybrid retrieval
2. Applies reranking
3. Returns citations

<details>
<summary>💡 Hint</summary>

You can combine all three by creating a `RetrieverQueryEngine` with:
- `hybrid_retriever` as the retriever
- `reranker` in `node_postprocessors`
- A response synthesizer configured for citations
</details>

In [None]:
# Your implementation here
# Create the ultimate query engine combining all techniques!

---

## 🚀 Challenge (Optional)

Implement a **query transformation** that:
1. Takes the user's query
2. Generates multiple related queries
3. Retrieves results for all queries
4. Fuses the results together

This is called **Multi-Query Retrieval** and can significantly improve results for complex questions.

---

## 📖 Further Reading

- [LlamaIndex Documentation](https://docs.llamaindex.ai/)
- [Query Engines Guide](https://docs.llamaindex.ai/en/stable/module_guides/deploying/query_engine/)
- [Retrieval Strategies](https://docs.llamaindex.ai/en/stable/module_guides/querying/retriever/)

---

## 🧹 Cleanup

In [None]:
# Comprehensive cleanup for DGX Spark
import gc

# Clear GPU memory if available
try:
    import torch
    if torch.cuda.is_available():
        torch.cuda.synchronize()
        torch.cuda.empty_cache()
        allocated = torch.cuda.memory_allocated() / 1e9
        print(f"✅ GPU memory cleared ({allocated:.2f} GB still allocated)")
except ImportError:
    pass

# Python garbage collection
gc.collect()
print("✅ Cleanup complete!")

---

## 🎓 Summary

In this notebook, you explored LlamaIndex's powerful retrieval capabilities:

1. **VectorStoreIndex**: Fast semantic search
2. **Query Engine**: Combines retrieval + generation
3. **Citations**: Automatic source attribution
4. **Hybrid Search**: Best of keyword + semantic
5. **Reranking**: Improved result quality

**When to choose LlamaIndex over LangChain for RAG:**
- Need advanced retrieval strategies
- Want built-in citations
- Working with structured data
- Need hybrid search out of the box

**Next up:** Lab 3.4.4 - LangGraph Workflow with Human-in-the-Loop