In [None]:
print('Setup complete.')

# Search Options & Chunking - Demo with gpt-5-mini

**Focus**: dense vs sparse vs hybrid retrieval; metadata filters; chunk A/B testing with gpt-5-mini integration.

This notebook demonstrates different retrieval strategies and their trade-offs in RAG systems using AskSageClient with gpt-5-mini and nvidia/NV-Embed-v2.

## Learning Objectives
- Understand dense retrieval (embedding-based similarity)
- Learn sparse retrieval (keyword-based like BM25)
- Explore hybrid retrieval combining both approaches
- Apply metadata filtering for refined search
- Compare different chunking strategies
- Use gpt-5-mini to generate answers from retrieved context

In [None]:
# Install required packages
!pip install langchain langchain-community faiss-cpu tiktoken rank_bm25 pandas numpy matplotlib sentence-transformers transformers torch asksageclient

import os
import time
import json
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import requests
from typing import List, Dict, Any
from langchain.schema import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter
# FAISS and BM25Retriever live in langchain-community (installed above)
from langchain_community.vectorstores import FAISS
from langchain_community.retrievers import BM25Retriever
from langchain.retrievers import EnsembleRetriever
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer
from langchain.embeddings.base import Embeddings
from asksageclient import AskSageClient

print("✅ All packages installed and modules imported successfully!")

In [None]:
# Function to load credentials from a JSON file
def load_credentials(filename):
    try:
        with open(filename) as file:
            return json.load(file)
    except FileNotFoundError:
        raise FileNotFoundError("The credentials file was not found.")
    except json.JSONDecodeError:
        raise ValueError("Failed to decode JSON from the credentials file.")

# Load the credentials
credentials = load_credentials('../../credentials.json')

# Extract the API key and email from the credentials for use in API requests
api_key = credentials['credentials']['api_key']
email = credentials['credentials']['Ask_sage_user_info']['username']

# Create an instance of the AskSageClient class with the email and api_key
ask_sage_client = AskSageClient(email, api_key)

print("✅ AskSageClient initialized successfully!")

In [None]:
# Custom Embedding class for nvidia/NV-Embed-v2
class NVidiaEmbeddings(Embeddings):
    def __init__(self):
        self.model = SentenceTransformer('nvidia/NV-Embed-v2', trust_remote_code=True)
        
    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        """Embed a list of documents."""
        embeddings = self.model.encode(texts, normalize_embeddings=True)
        return embeddings.tolist()
    
    def embed_query(self, text: str) -> List[float]:
        """Embed a single query text."""
        embedding = self.model.encode([text], normalize_embeddings=True)
        return embedding[0].tolist()

print("✅ Custom NVidiaEmbeddings class defined!")

In [None]:
# gpt-5-mini RAG Helper Function
def ask_gpt_5_mini_with_context(query: str, context_docs: List[Document], client: AskSageClient) -> Dict[str, Any]:
    """
    Use gpt-5-mini to answer a query using retrieved context documents.
    """
    # Prepare context from retrieved documents
    context_text = "\n\n".join([
        f"Document {i+1} (Topic: {doc.metadata.get('topic', 'N/A')}, "
        f"Difficulty: {doc.metadata.get('difficulty', 'N/A')}):\n{doc.page_content}"
        for i, doc in enumerate(context_docs)
    ])
    
    # Create the prompt with context
    prompt = f"""Based on the following context documents, please answer the question comprehensively.

Context Documents:
{context_text}

Question: {query}

Please provide a detailed answer based on the context provided above. If the context doesn't contain enough information to fully answer the question, please indicate what additional information would be helpful."""
    
    try:
        # Make the API call using gpt-5-mini
        response = client.query(
            message=prompt,
            system_prompt="You are a helpful assistant.",
            model="gpt-5-mini",
        )
        
        return {
            "success": True,
            "answer": response.get("response", "No response received"),
            "context_used": len(context_docs),
            "model": "gpt-5-mini"
        }
    except Exception as e:
        return {
            "success": False,
            "error": str(e),
            "context_used": len(context_docs),
            "model": "gpt-5-mini"
        }

print("✅ gpt-5-mini RAG helper function defined!")

## 1. Sample Dataset with Metadata

Let's create a diverse dataset with rich metadata for our retrieval experiments.

In [None]:
# Create sample documents with metadata
sample_docs = [
    {
        "content": "Machine learning algorithms like neural networks require large datasets for training. Deep learning models with millions of parameters need extensive computational resources and GPU acceleration. Popular frameworks include TensorFlow, PyTorch, and scikit-learn for different ML tasks.",
        "metadata": {"topic": "machine_learning", "difficulty": "advanced", "word_count": 32, "author": "Dr. Smith", "year": 2023}
    },
    {
        "content": "Python is a versatile programming language used in data science, web development, and artificial intelligence. Its simple syntax makes it beginner-friendly. Key libraries include NumPy, Pandas, and Matplotlib for data analysis and visualization.",
        "metadata": {"topic": "programming", "difficulty": "beginner", "word_count": 28, "author": "Prof. Johnson", "year": 2022}
    },
    {
        "content": "Natural language processing involves text analysis, sentiment analysis, and language understanding. Modern NLP uses transformer architectures like BERT and GPT. Applications include chatbots, translation services, and document analysis systems.",
        "metadata": {"topic": "nlp", "difficulty": "intermediate", "word_count": 28, "author": "Dr. Smith", "year": 2023}
    },
    {
        "content": "Data visualization helps communicate insights effectively. Popular libraries include matplotlib, seaborn, and plotly for creating charts and interactive dashboards. Good visualizations tell a story and make complex data accessible to stakeholders.",
        "metadata": {"topic": "data_science", "difficulty": "beginner", "word_count": 26, "author": "Prof. Davis", "year": 2022}
    },
    {
        "content": "Reinforcement learning agents learn through interaction with environments. Q-learning and policy gradient methods are fundamental approaches in RL. Applications include game playing, robotics, and autonomous systems optimization.",
        "metadata": {"topic": "machine_learning", "difficulty": "advanced", "word_count": 25, "author": "Dr. Wilson", "year": 2023}
    },
    {
        "content": "Web scraping extracts data from websites using libraries like BeautifulSoup and Scrapy. Always respect robots.txt and rate limits when scraping. Common applications include price monitoring, content aggregation, and market research.",
        "metadata": {"topic": "programming", "difficulty": "intermediate", "word_count": 26, "author": "Prof. Johnson", "year": 2022}
    },
    {
        "content": "Statistical analysis involves hypothesis testing, correlation analysis, and regression modeling. Understanding p-values and confidence intervals is crucial. Modern tools include R, Python's scipy.stats, and specialized software like SPSS.",
        "metadata": {"topic": "statistics", "difficulty": "intermediate", "word_count": 25, "author": "Dr. Brown", "year": 2023}
    },
    {
        "content": "Cloud computing platforms like AWS, Azure, and GCP provide scalable infrastructure. Serverless computing reduces operational overhead significantly. Key services include compute instances, storage solutions, and managed databases.",
        "metadata": {"topic": "cloud", "difficulty": "intermediate", "word_count": 24, "author": "Prof. Davis", "year": 2023}
    }
]

# Convert to LangChain documents
documents = [Document(page_content=doc["content"], metadata=doc["metadata"]) for doc in sample_docs]

print(f"Created {len(documents)} documents with metadata")
print(f"\nExample document:")
print(f"Content: {documents[0].page_content}")
print(f"Metadata: {documents[0].metadata}")

## 2. Chunking Strategy Comparison

Let's test different chunking strategies and see their impact on retrieval.
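To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of fixed-size chunking with overlap. It is a simplification: `RecursiveCharacterTextSplitter` additionally tries to split on separators such as paragraph and sentence boundaries, so its chunks will generally differ from these. The function name `sliding_window_chunks` is hypothetical and used only for illustration.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list:
    """Split text into fixed-size character windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks

example_chunks = sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=1)
print(example_chunks)  # ['abcd', 'defg', 'ghij']
```

Each chunk repeats the last character of the previous one, which is the purpose of the overlap: text near a boundary appears in both neighboring chunks.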

In [None]:
# Strategy A: Smaller chunks (150 chars)
splitter_a = RecursiveCharacterTextSplitter(
    chunk_size=150,
    chunk_overlap=20,
    length_function=len
)

# Strategy B: Larger chunks (300 chars)
splitter_b = RecursiveCharacterTextSplitter(
    chunk_size=300,
    chunk_overlap=30,
    length_function=len
)

# Apply both strategies
chunks_a = splitter_a.split_documents(documents)
chunks_b = splitter_b.split_documents(documents)

print("Chunking Strategy Comparison:")
print(f"Strategy A (150 chars): {len(chunks_a)} chunks")
print(f"Strategy B (300 chars): {len(chunks_b)} chunks")

print(f"\nExample chunk A: '{chunks_a[0].page_content}'")
print(f"Length: {len(chunks_a[0].page_content)} chars")

print(f"\nExample chunk B: '{chunks_b[0].page_content}'")
print(f"Length: {len(chunks_b[0].page_content)} chars")

## 3. Dense Retrieval (Embedding-Based)

Dense retrieval uses embeddings to capture semantic similarity. This notebook uses nvidia/NV-Embed-v2 as the embedding model.
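As a rough illustration of what embedding-based ranking does, the sketch below scores toy 2-dimensional vectors by cosine similarity with NumPy. The vectors and the helper name `cosine_top_k` are made up for illustration; a real vector store such as FAISS uses optimized indexes. For unit-normalized embeddings, ranking by cosine similarity and by Euclidean distance gives the same order.

```python
import numpy as np

def cosine_top_k(query_vec, doc_vecs, k=3):
    """Return (indices, scores) of the k document vectors most similar to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q                      # cosine similarity of each document to the query
    order = np.argsort(-scores)[:k]     # highest similarity first
    return order.tolist(), scores[order].tolist()

# Toy "embeddings": three documents and one query in 2-D
toy_docs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
toy_query = np.array([1.0, 0.2])

top_idx, top_scores = cosine_top_k(toy_query, toy_docs, k=2)
print(top_idx)  # [0, 2]
```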

In [None]:
# Initialize nvidia embeddings
embeddings = NVidiaEmbeddings()
print("✅ NVidia NV-Embed-v2 model loaded successfully!")

# Create dense retriever using Strategy A chunks
dense_vectorstore_a = FAISS.from_documents(chunks_a, embeddings)
dense_retriever_a = dense_vectorstore_a.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 3}
)

# Create dense retriever using Strategy B chunks
dense_vectorstore_b = FAISS.from_documents(chunks_b, embeddings)
dense_retriever_b = dense_vectorstore_b.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 3}
)

print("✅ Dense retrievers created for both chunking strategies")

# Test dense retrieval
test_query = "machine learning algorithms and neural networks"
print(f"\nTest Query: '{test_query}'")

start_time = time.time()
dense_results_a = dense_retriever_a.get_relevant_documents(test_query)
dense_latency_a = time.time() - start_time

print(f"\nDense Retrieval (Strategy A) - Latency: {dense_latency_a:.4f}s")
for i, doc in enumerate(dense_results_a, 1):
    print(f"{i}. {doc.page_content[:100]}...")
    print(f"   Topic: {doc.metadata.get('topic', 'N/A')}")

## 4. Sparse Retrieval (BM25 Keyword-Based)

Sparse retrieval ranks chunks by exact keyword overlap, using term-frequency and document-frequency statistics (here, BM25).
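The following is a minimal sketch of a textbook BM25 scoring function, assuming whitespace tokenization and the common defaults `k1=1.5` and `b=0.75`. It is not the exact formula used by `rank_bm25` or `BM25Retriever`, whose IDF handling and tokenization may differ; the function name `bm25_scores` is hypothetical.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with a textbook BM25 variant."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: number of documents containing each term
    doc_freq = Counter(term for toks in tokenized for term in set(toks))

    scores = []
    for toks in tokenized:
        term_freq = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue  # only exact keyword matches contribute
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            tf = term_freq[term]
            length_norm = 1 - b + b * len(toks) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores

toy_corpus = [
    "neural networks need data",
    "python is a language",
    "cloud platforms scale",
]
toy_scores = bm25_scores("neural networks", toy_corpus)
```

Only the first toy document shares terms with the query, so it is the only one with a non-zero score. A query that paraphrased the content without reusing its words would score zero everywhere, which is the main limitation that dense retrieval addresses.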

In [None]:
# Create BM25 retriever for Strategy A chunks
sparse_retriever_a = BM25Retriever.from_documents(chunks_a)
sparse_retriever_a.k = 3

# Create BM25 retriever for Strategy B chunks
sparse_retriever_b = BM25Retriever.from_documents(chunks_b)
sparse_retriever_b.k = 3

print("✅ Sparse (BM25) retrievers created for both chunking strategies")

# Test sparse retrieval
start_time = time.time()
sparse_results_a = sparse_retriever_a.get_relevant_documents(test_query)
sparse_latency_a = time.time() - start_time

print(f"\nSparse Retrieval (Strategy A) - Latency: {sparse_latency_a:.4f}s")
for i, doc in enumerate(sparse_results_a, 1):
    print(f"{i}. {doc.page_content[:100]}...")
    print(f"   Topic: {doc.metadata.get('topic', 'N/A')}")

## 5. Hybrid Retrieval (Dense + Sparse)

Hybrid retrieval merges the dense and sparse result lists, so a chunk can surface through semantic similarity, exact keyword overlap, or both.
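One common way to merge two ranked lists is weighted Reciprocal Rank Fusion (RRF), which LangChain's `EnsembleRetriever` is documented to use. The sketch below is a simplified illustration under that assumption, not the library's implementation. The document IDs, the helper name `weighted_rrf`, and the constant `c=60` are illustrative choices.

```python
from collections import defaultdict

def weighted_rrf(rankings, weights, c=60):
    """Fuse ranked lists of document IDs; each hit adds weight / (rank + c)."""
    fused = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += weight / (rank + c)
    # Highest fused score first
    return sorted(fused, key=lambda doc_id: fused[doc_id], reverse=True)

dense_ranking = ["d1", "d2", "d3"]   # hypothetical dense results, best first
sparse_ranking = ["d2", "d4", "d1"]  # hypothetical BM25 results, best first

fused_ranking = weighted_rrf([dense_ranking, sparse_ranking], weights=[0.5, 0.5])
print(fused_ranking)  # ['d2', 'd1', 'd4', 'd3']
```

Documents returned by both retrievers (`d1`, `d2`) rise to the top because they collect score from both lists. RRF works on ranks only, so the dense and sparse scores never need to be put on a common scale.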

In [None]:
# Create hybrid retriever for Strategy A
hybrid_retriever_a = EnsembleRetriever(
    retrievers=[dense_retriever_a, sparse_retriever_a],
    weights=[0.5, 0.5]  # Equal weighting
)

# Create hybrid retriever for Strategy B
hybrid_retriever_b = EnsembleRetriever(
    retrievers=[dense_retriever_b, sparse_retriever_b],
    weights=[0.5, 0.5]
)

print("✅ Hybrid retrievers created for both chunking strategies")

# Test hybrid retrieval
start_time = time.time()
hybrid_results_a = hybrid_retriever_a.get_relevant_documents(test_query)
hybrid_latency_a = time.time() - start_time

print(f"\nHybrid Retrieval (Strategy A) - Latency: {hybrid_latency_a:.4f}s")
for i, doc in enumerate(hybrid_results_a, 1):
    print(f"{i}. {doc.page_content[:100]}...")
    print(f"   Topic: {doc.metadata.get('topic', 'N/A')}")

## 6. gpt-5-mini RAG Integration

Now let's use gpt-5-mini to generate comprehensive answers using our retrieved context.

In [None]:
# Test gpt-5-mini with different retrieval strategies
print("=" * 60)
print("gpt-5-mini RAG Comparison")
print("=" * 60)

# Test with dense retrieval
print("\n🔍 DENSE RETRIEVAL + gpt-5-mini")
print("-" * 40)
dense_rag_result = ask_gpt_5_mini_with_context(test_query, dense_results_a, ask_sage_client)
if dense_rag_result["success"]:
    print(f"Answer: {dense_rag_result['answer']}")
    print(f"Context docs used: {dense_rag_result['context_used']}")
else:
    print(f"Error: {dense_rag_result['error']}")

print("\n" + "="*60)

# Test with sparse retrieval
print("\n🔍 SPARSE RETRIEVAL + gpt-5-mini")
print("-" * 40)
sparse_rag_result = ask_gpt_5_mini_with_context(test_query, sparse_results_a, ask_sage_client)
if sparse_rag_result["success"]:
    print(f"Answer: {sparse_rag_result['answer']}")
    print(f"Context docs used: {sparse_rag_result['context_used']}")
else:
    print(f"Error: {sparse_rag_result['error']}")

print("\n" + "="*60)

# Test with hybrid retrieval
print("\n🔍 HYBRID RETRIEVAL + gpt-5-mini")
print("-" * 40)
hybrid_rag_result = ask_gpt_5_mini_with_context(test_query, hybrid_results_a, ask_sage_client)
if hybrid_rag_result["success"]:
    print(f"Answer: {hybrid_rag_result['answer']}")
    print(f"Context docs used: {hybrid_rag_result['context_used']}")
else:
    print(f"Error: {hybrid_rag_result['error']}")

## 7. Metadata Filtering with gpt-5-mini

Apply filters to restrict search to specific document characteristics and use gpt-5-mini for analysis.
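Conceptually, an exact-match metadata filter keeps only the hits whose metadata agrees with every key in the filter. The sketch below shows post-filtering a ranked list, which is assumed here to resemble how the LangChain FAISS wrapper handles `filter` (it retrieves a candidate pool first, then filters). The hit structure and the helper name `filter_ranked_hits` are hypothetical.

```python
def metadata_matches(metadata, flt):
    """True if every key in the filter has the same value in the metadata."""
    return all(metadata.get(key) == value for key, value in flt.items())

def filter_ranked_hits(ranked_hits, flt, k):
    """Keep the ranked order, drop non-matching hits, and return at most k."""
    return [hit for hit in ranked_hits if metadata_matches(hit["metadata"], flt)][:k]

# Hypothetical candidate pool, already ordered by similarity
ranked_hits = [
    {"id": "c1", "metadata": {"topic": "programming", "difficulty": "beginner"}},
    {"id": "c2", "metadata": {"topic": "machine_learning", "difficulty": "advanced"}},
    {"id": "c3", "metadata": {"topic": "nlp", "difficulty": "intermediate"}},
    {"id": "c4", "metadata": {"topic": "machine_learning", "difficulty": "advanced"}},
]

kept = filter_ranked_hits(ranked_hits, {"difficulty": "advanced"}, k=3)
print([hit["id"] for hit in kept])  # ['c2', 'c4']
```

Because filtering happens after ranking, a restrictive filter can return fewer than `k` results when the candidate pool contains few matches, as in this example.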

In [None]:
# Filter for advanced difficulty documents only
filter_advanced = {"difficulty": "advanced"}

# Test filtered dense retrieval
filtered_results = dense_vectorstore_a.similarity_search(
    test_query,
    k=3,
    filter=filter_advanced
)

print(f"Filtered Results (Advanced difficulty only):")
for i, doc in enumerate(filtered_results, 1):
    print(f"{i}. {doc.page_content[:100]}...")
    print(f"   Difficulty: {doc.metadata.get('difficulty')}")
    print(f"   Topic: {doc.metadata.get('topic')}")

# Use gpt-5-mini with filtered results
print("\n" + "="*60)
print("🔍 FILTERED (ADVANCED) + gpt-5-mini")
print("-" * 40)
filtered_rag_result = ask_gpt_5_mini_with_context(test_query, filtered_results, ask_sage_client)
if filtered_rag_result["success"]:
    print(f"Answer: {filtered_rag_result['answer']}")
    print(f"Context docs used: {filtered_rag_result['context_used']}")
else:
    print(f"Error: {filtered_rag_result['error']}")

# Filter by author and year
filter_author_year = {"author": "Dr. Smith", "year": 2023}

author_results = dense_vectorstore_a.similarity_search(
    "natural language processing",
    k=2,
    filter=filter_author_year
)

print(f"\nFiltered Results (Dr. Smith, 2023):")
for i, doc in enumerate(author_results, 1):
    print(f"{i}. {doc.page_content[:100]}...")
    print(f"   Author: {doc.metadata.get('author')}")
    print(f"   Year: {doc.metadata.get('year')}")

## 8. Chunking Strategy Comparison with gpt-5-mini

Let's compare how different chunking strategies affect gpt-5-mini's responses.

In [None]:
# Compare chunking strategies with gpt-5-mini
comparison_query = "What are the key aspects of data science and machine learning?"

# Get results from both chunking strategies
dense_results_strategy_a = dense_retriever_a.get_relevant_documents(comparison_query)
dense_results_strategy_b = dense_retriever_b.get_relevant_documents(comparison_query)

print("=" * 70)
print("CHUNKING STRATEGY COMPARISON WITH gpt-5-mini")
print("=" * 70)

# Strategy A (smaller chunks)
print("\n📊 STRATEGY A (150 chars) + gpt-5-mini")
print("-" * 50)
strategy_a_result = ask_gpt_5_mini_with_context(comparison_query, dense_results_strategy_a, ask_sage_client)
if strategy_a_result["success"]:
    print(f"Answer: {strategy_a_result['answer']}")
    print(f"Context docs used: {strategy_a_result['context_used']}")
else:
    print(f"Error: {strategy_a_result['error']}")

print("\n" + "="*70)

# Strategy B (larger chunks)
print("\n📊 STRATEGY B (300 chars) + gpt-5-mini")
print("-" * 50)
strategy_b_result = ask_gpt_5_mini_with_context(comparison_query, dense_results_strategy_b, ask_sage_client)
if strategy_b_result["success"]:
    print(f"Answer: {strategy_b_result['answer']}")
    print(f"Context docs used: {strategy_b_result['context_used']}")
else:
    print(f"Error: {strategy_b_result['error']}")

## 9. Performance Analysis and Visualization

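Let's plot retrieval latency per method and chunk count per chunking strategy. Each latency comes from a single timed call in the cells above, so small differences between methods may be noise.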
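A single timed call can be skewed by warm-up effects such as model loading or caching. As a hedged sketch of a more stable measurement, the helper below runs a callable several times and reports the median. The name `median_latency` and its `repeats` and `warmup` parameters are illustrative, not part of the notebook's pipeline.

```python
import statistics
import time

def median_latency(fn, *args, repeats=5, warmup=1):
    """Call fn(*args) repeatedly and return the median wall-clock time in seconds."""
    for _ in range(warmup):
        fn(*args)  # untimed warm-up calls
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)

# Stand-in workload; in the notebook this could be a retriever call instead
example_latency = median_latency(sorted, list(range(1000, 0, -1)), repeats=5)
print(f"Median latency: {example_latency:.6f}s")
```

In this notebook, the same helper could wrap a call such as `dense_retriever_a.get_relevant_documents` with `test_query` as its argument.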

In [None]:
# Performance comparison data
retrieval_methods = ['Dense', 'Sparse', 'Hybrid']
latencies = [dense_latency_a, sparse_latency_a, hybrid_latency_a]
chunk_counts = [len(chunks_a), len(chunks_b)]
chunk_strategies = ['Strategy A (150 chars)', 'Strategy B (300 chars)']

# Create subplots
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))

# Plot 1: Retrieval latencies
bars1 = ax1.bar(retrieval_methods, latencies, color=['#1f77b4', '#ff7f0e', '#2ca02c'])
ax1.set_title('Retrieval Method Latency Comparison', fontsize=14, fontweight='bold')
ax1.set_ylabel('Latency (seconds)')
ax1.set_xlabel('Retrieval Method')

# Add value labels on bars
for bar, latency in zip(bars1, latencies):
    height = bar.get_height()
    ax1.text(bar.get_x() + bar.get_width()/2., height + 0.001,
             f'{latency:.4f}s', ha='center', va='bottom')

# Plot 2: Chunk count comparison
bars2 = ax2.bar(chunk_strategies, chunk_counts, color=['#d62728', '#9467bd'])
ax2.set_title('Chunking Strategy Comparison', fontsize=14, fontweight='bold')
ax2.set_ylabel('Number of Chunks')
ax2.set_xlabel('Chunking Strategy')

# Add value labels on bars
for bar, count in zip(bars2, chunk_counts):
    height = bar.get_height()
    ax2.text(bar.get_x() + bar.get_width()/2., height + 0.1,
             f'{count}', ha='center', va='bottom')

plt.tight_layout()
plt.show()

# Summary statistics
print("\n📈 PERFORMANCE SUMMARY")
print("=" * 40)
print(f"Fastest retrieval method: {retrieval_methods[latencies.index(min(latencies))]} ({min(latencies):.4f}s)")
print(f"Most chunks generated: {chunk_strategies[chunk_counts.index(max(chunk_counts))]} ({max(chunk_counts)} chunks)")
print(f"Fewest chunks generated: {chunk_strategies[chunk_counts.index(min(chunk_counts))]} ({min(chunk_counts)} chunks)")

## 10. Summary and Key Findings

This demo explored the key aspects of search options and chunking in RAG systems using AskSageClient with gpt-5-mini and nvidia/NV-Embed-v2:

### Key Updates Made:
1. **LLM Model**: Integrated gpt-5-mini for enhanced RAG responses
2. **Embedding Model**: Used nvidia/NV-Embed-v2 from HuggingFace
3. **Custom Integration**: Created NVidiaEmbeddings class for LangChain compatibility
4. **RAG Pipeline**: Added comprehensive RAG helper function with gpt-5-mini

### Retrieval Methods Compared:
1. **Dense Retrieval**: Embedding-based semantic similarity using nvidia/NV-Embed-v2
2. **Sparse Retrieval**: Keyword-based BM25 matching
3. **Hybrid Retrieval**: Merges the dense and sparse rankings to cover both semantic and exact keyword matches

### Chunking Strategies:
- **Strategy A**: Smaller chunks (150 chars) for precision and granular retrieval
- **Strategy B**: Larger chunks (300 chars) for better context preservation

### Key Features Demonstrated:
- **gpt-5-mini Integration**: Enhanced answer generation using retrieved context
- **Metadata filtering**: Refined search results based on document attributes
- **Performance comparison**: Single-run latency for each retrieval method and chunk counts for each chunking strategy
- **Visualization**: Bar charts of retrieval latency and chunk counts

### Best Practices Identified:
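1. **Hybrid retrieval** is often a reasonable default because it covers both semantic and exact keyword matches; this notebook measures latency only, not answer quality
2. **Metadata filtering** constrains results to documents with specific attributes such as difficulty, author, or year
3. **gpt-5-mini** is used here to synthesize an answer from multiple retrieved chunks; answer quality was not evaluated quantitatively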
4. **Chunking strategy** should be chosen based on your specific use case:
   - Smaller chunks for precise information retrieval
   - Larger chunks for maintaining context and coherence

Combining these retrieval strategies with gpt-5-mini gives a working RAG pipeline that answers queries from retrieved context. Which retrieval method and chunk size work best depends on the corpus and the queries, so evaluate them on your own data.