In [None]:
print('Setup complete.')

# Search Options & Chunking - Lab

**Hands-on**: compare dense-only and hybrid retrieval on one query; record latency and result quality.
**Deliverable**: comparison table.

## Instructions

In this lab, you will implement and compare different retrieval strategies to understand their performance trade-offs. You'll build both dense-only and hybrid retrievers, test them with the same query, and create a detailed comparison table.

## Success Criteria
- Implement dense-only retrieval with embeddings
- Implement hybrid retrieval (dense + sparse)
- Test both methods with the same query
- Measure and record latency for each method
- Evaluate quality of results
- Create a comprehensive comparison table

## Learning Objectives
- Understand practical differences between retrieval methods
- Learn to measure and compare system performance
- Practice building different retrieval strategies
- Develop skills in performance evaluation and analysis

In [None]:
# TODO: Install required packages for Google Colab
# Install: langchain, langchain-openai, langchain-community, faiss-cpu, tiktoken, rank_bm25, pandas, numpy
# Import all necessary modules for document processing, embeddings, vector stores, and retrievers
# Import time for latency measurement and pandas for creating comparison tables
# Set up your OpenAI API key
# Print confirmation that all packages are installed successfully

## Step 1: Prepare Your Dataset

Create a diverse set of documents with rich content for testing retrieval methods.

In [None]:
# TODO: Create 8-10 sample documents on a topic of your choice
# Each document should be 100-200 words long
# Include diverse content that covers different aspects of your chosen topic
# Add metadata to each document (e.g., category, difficulty, author, date)
# Convert to LangChain Document objects
# Print the total number of documents and preview the first document

## Step 2: Document Chunking

Split your documents into appropriate chunks for retrieval processing.
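
To build intuition for what `chunk_size` and `chunk_overlap` control, here is a minimal sketch of a fixed-width character chunker. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on separators such as paragraphs and sentences); the helper name `chunk_text` and the digit-string sample are assumptions made for illustration only.

```python
def chunk_text(text: str, chunk_size: int = 200, chunk_overlap: int = 30) -> list[str]:
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reached the end of the text
    return chunks


# Hypothetical 500-character sample, used only to make the overlap visible.
sample = "".join(str(i % 10) for i in range(500))
chunks = chunk_text(sample)
print([len(c) for c in chunks])
print(chunks[0][-30:] == chunks[1][:30])  # the tail of one chunk starts the next
```

The overlap means a sentence that straddles a boundary still appears whole in at least one chunk, at the cost of storing some text twice.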

In [None]:
# TODO: Initialize a RecursiveCharacterTextSplitter
# Use chunk_size=200 and chunk_overlap=30 as a starting point (small chunks suit 100-200 word documents)
# Split your documents into chunks
# Print statistics: original document count vs chunk count
# Display an example of an original document vs its chunks

## Step 3: Build Dense-Only Retriever

Create a retriever that uses only embedding-based similarity search.
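
Conceptually, dense retrieval ranks chunks by how close their embedding vectors are to the query's embedding. The sketch below shows that idea with hand-written three-dimensional toy vectors standing in for real embeddings; the function names and vectors are assumptions, and cosine similarity is used for illustration (the distance metric your vector store applies may differ).

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def dense_top_k(query_vec: list[float], chunk_vecs: dict[str, list[float]], k: int = 5):
    """Return the k (chunk_id, similarity) pairs most similar to the query."""
    scored = [(cid, cosine_similarity(query_vec, vec)) for cid, vec in chunk_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy vectors (assumed); real embeddings have hundreds or thousands of dimensions.
toy_vectors = {
    "chunk-a": [1.0, 0.0, 0.0],
    "chunk-b": [0.8, 0.6, 0.0],
    "chunk-c": [0.0, 1.0, 0.0],
}
top = dense_top_k([1.0, 0.0, 0.0], toy_vectors, k=2)
print(top)
```

A vector store such as FAISS performs the same kind of nearest-neighbour lookup, but with index structures that scale to many vectors.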

In [None]:
# TODO: Initialize OpenAI embeddings
# Create a FAISS vector store from your document chunks
# Create a retriever from the vector store with k=5 results
# Print confirmation that the dense retriever is ready
# Include the number of vectors in the store

## Step 4: Build Sparse (BM25) Retriever

Create a keyword-based retriever using BM25 algorithm.
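
For reference, the scoring idea behind BM25 can be sketched in a few lines. This is a simplified illustration with whitespace tokenization and one common IDF variant (the "+1 inside the log" form); the `rank_bm25` package that backs `BM25Retriever` may tokenize and compute IDF differently, so treat the function name, the defaults `k1=1.5` and `b=0.75`, and the sample sentences as assumptions.

```python
import math
from collections import Counter


def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score every document against the query with a basic BM25 formula."""
    if not docs:
        return []
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            # Length normalization: long documents need more matches to score well.
            denom = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores


docs = [
    "faiss builds an index over dense vectors",
    "bm25 ranks documents by keyword overlap",
    "hybrid search mixes bm25 with dense vectors",
]
scores = bm25_scores("bm25 keyword", docs)
print(scores)
```

Note that the first sentence scores zero because it shares no exact term with the query; this is the behaviour that dense retrieval is meant to complement.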

In [None]:
# TODO: Create a BM25Retriever from your document chunks
# Set k=5 to match the dense retriever
# Print confirmation that the sparse retriever is ready

## Step 5: Build Hybrid Retriever

Combine dense and sparse retrievers using EnsembleRetriever.
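
LangChain's documentation describes `EnsembleRetriever` as combining ranked lists with weighted Reciprocal Rank Fusion. The following is a minimal sketch of that fusion idea, not the library's code: each list contributes `weight / (c + rank)` to a document's score. The function name, the constant `c=60`, and the chunk IDs are assumptions for illustration.

```python
def weighted_rrf(rankings: list[list[str]], weights: list[float], c: int = 60) -> list[str]:
    """Fuse several ranked ID lists into one list using weighted reciprocal ranks."""
    if len(rankings) != len(weights):
        raise ValueError("need exactly one weight per ranking")
    fused: dict[str, float] = {}
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + weight / (c + rank)
    return sorted(fused, key=lambda doc_id: fused[doc_id], reverse=True)


# Hypothetical rankings from a dense and a sparse retriever.
dense_ranking = ["c1", "c2", "c3"]
sparse_ranking = ["c3", "c1", "c4"]

hybrid_ranking = weighted_rrf([dense_ranking, sparse_ranking], [0.5, 0.5])
print(hybrid_ranking)

# Shifting weight toward the sparse list changes which chunk comes first.
sparse_heavy = weighted_rrf([dense_ranking, sparse_ranking], [0.3, 0.7])
print(sparse_heavy)
```

Because fusion works on ranks rather than raw scores, the dense and sparse retrievers do not need comparable score scales.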

In [None]:
# TODO: Create an EnsembleRetriever combining dense and sparse retrievers
# Use equal weights [0.5, 0.5] as a neutral baseline
# Print confirmation that the hybrid retriever is ready

## Step 6: Define Your Test Query

Choose a specific query that will test both semantic understanding and keyword matching.

In [None]:
# TODO: Define a test query relevant to your document content
# The query should be 5-10 words long
# It should contain both conceptual terms (good for dense) and specific keywords (good for sparse)
# Print the query you'll be testing
# Explain why this query is good for comparing different retrieval methods

## Step 7: Test Dense-Only Retrieval

Measure the performance of your dense retriever.
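
A single timing can be noisy, so one reasonable approach is to repeat the query a few times and report the median. The helper below is a sketch under that assumption; `measure_latency` and `fake_retriever` are hypothetical names, and the fake retriever only sleeps to simulate work. In the lab you would pass a function that calls your real retriever.

```python
import statistics
import time


def measure_latency(retrieve, query: str, runs: int = 5) -> dict:
    """Call retrieve(query) several times and summarize wall-clock latency in seconds."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    timings = []
    results = None
    for _ in range(runs):
        start = time.perf_counter()  # monotonic, high-resolution clock
        results = retrieve(query)
        timings.append(time.perf_counter() - start)
    return {
        "results": results,
        "median_s": statistics.median(timings),
        "min_s": min(timings),
        "max_s": max(timings),
    }


def fake_retriever(query: str) -> list[str]:
    """Stand-in for a real retriever; sleeps briefly to simulate work."""
    time.sleep(0.01)
    return [f"chunk about {query}"]


report = measure_latency(fake_retriever, "hybrid search")
print(f"median latency: {report['median_s']:.4f} s")
```

If the query embedding is computed through a remote API, network time will likely dominate the dense measurement, so it is worth noting in your table what each latency figure includes.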

In [None]:
# TODO: Record the start time with time.perf_counter() before running the query
# Run your test query through the dense retriever
# Record end time and calculate latency
# Display the retrieved documents with their content and metadata
# Store the results and timing for later comparison

## Step 8: Test Hybrid Retrieval

Measure the performance of your hybrid retriever with the same query.

In [None]:
# TODO: Record the start time with time.perf_counter() before running the query
# Run the same test query through the hybrid retriever
# Record end time and calculate latency
# Display the retrieved documents with their content and metadata
# Store the results and timing for comparison

## Step 9: Quality Assessment

Evaluate the quality of results from both retrieval methods.
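
Precision and diversity can be computed mechanically once you have labelled which chunks are relevant; relevance and coverage ratings remain your own judgment. The sketch below assumes you track chunks by an ID and keep a hand-labelled set of relevant IDs. All names, IDs, and categories here are hypothetical placeholders, not results.

```python
def precision_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int = 5) -> float:
    """Fraction of the top-k retrieved chunks that are labelled relevant."""
    top = retrieved_ids[:k]
    if not top:
        return 0.0
    hits = sum(1 for doc_id in top if doc_id in relevant_ids)
    return hits / len(top)  # divides by the number actually retrieved, up to k


def diversity(retrieved_ids: list[str], categories: dict[str, str]) -> int:
    """Number of distinct metadata categories among the retrieved chunks."""
    return len({categories[doc_id] for doc_id in retrieved_ids})


# Hand-labelled placeholders for illustration only.
relevant = {"c1", "c3", "c4"}
categories = {"c1": "basics", "c2": "basics", "c3": "advanced", "c4": "tools"}
dense_ids = ["c1", "c2", "c3"]
hybrid_ids = ["c1", "c3", "c4"]

dense_precision = precision_at_k(dense_ids, relevant)
hybrid_precision = precision_at_k(hybrid_ids, relevant)
print(f"dense precision: {dense_precision:.0%}, diversity: {diversity(dense_ids, categories)}")
print(f"hybrid precision: {hybrid_precision:.0%}, diversity: {diversity(hybrid_ids, categories)}")
```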

In [None]:
# TODO: For each set of results, assess the following quality metrics:
# 1. Relevance: How well do the results match the query intent? (Rate 1-5)
# 2. Diversity: How many different topics/categories are covered?
# 3. Precision: What percentage of results are actually relevant?
# 4. Coverage: Does the result set cover the main aspects of the query?
# Create variables to store these quality scores for both methods

## Step 10: Create Comparison Table

Build a comprehensive comparison table showing all metrics.
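
One detail that is easy to get wrong is the "Winner" column: lower is better for latency, while higher is better for the quality scores. The sketch below shows one way to encode that rule. The helper name `pick_winner` and the numbers are placeholders, not measurements; a list of dictionaries like `rows` can be passed directly to `pandas.DataFrame(rows)` to get the table.

```python
def pick_winner(dense: float, hybrid: float, higher_is_better: bool = True,
                tolerance: float = 1e-9) -> str:
    """Return 'Dense-Only', 'Hybrid', or 'Tie' for one metric."""
    if abs(dense - hybrid) <= tolerance:
        return "Tie"
    dense_wins = dense > hybrid if higher_is_better else dense < hybrid
    return "Dense-Only" if dense_wins else "Hybrid"


# (metric name, dense value, hybrid value, higher_is_better); placeholder values only.
metrics = [
    ("Latency (seconds)", 0.21, 0.34, False),
    ("Relevance Score (1-5)", 4.0, 4.5, True),
    ("Precision Score (% relevant)", 60.0, 60.0, True),
]

rows = [
    {"Metric": name, "Dense-Only": d, "Hybrid": h, "Winner": pick_winner(d, h, better)}
    for name, d, h, better in metrics
]
for row in rows:
    print(row)
```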

In [None]:
# TODO: Create a pandas DataFrame with the following structure:
# Columns: Metric, Dense-Only, Hybrid, Winner
# Rows should include:
# - Latency (seconds)
# - Number of Results
# - Relevance Score (1-5)
# - Diversity Score (unique categories)
# - Precision Score (% relevant)
# - Coverage Score (1-5)
# For each row, determine and mark the "Winner" (Dense-Only, Hybrid, or Tie)
# Lower is better for latency; higher is better for the quality scores
# Display the comparison table with proper formatting

## Step 11: Detailed Results Analysis

Provide detailed analysis of the differences between methods.
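
Set operations make it straightforward to see which chunks both methods returned and which are unique to one of them. This sketch assumes each retrieved chunk can be reduced to a stable identifier (for example, a metadata field you add yourself); the function name and IDs are hypothetical.

```python
def compare_result_sets(dense_ids: list[str], hybrid_ids: list[str]) -> dict[str, list[str]]:
    """Group chunk IDs by whether both methods, or only one, retrieved them."""
    dense_set, hybrid_set = set(dense_ids), set(hybrid_ids)
    return {
        "shared": sorted(dense_set & hybrid_set),
        "dense_only": sorted(dense_set - hybrid_set),
        "hybrid_only": sorted(hybrid_set - dense_set),
    }


# Placeholder IDs for illustration.
diff = compare_result_sets(["c1", "c2", "c3"], ["c1", "c3", "c4"])
for group, ids in diff.items():
    print(f"{group}: {ids}")
```

Chunks in `hybrid_only` are a natural starting point for the analysis: check whether they contain exact query keywords that the dense ranking placed below the cutoff.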

In [None]:
# TODO: Create a detailed analysis including:
# 1. Side-by-side comparison of actual retrieved documents
# 2. Identification of unique results in each method
# 3. Analysis of why certain documents were retrieved by one method but not the other
# 4. Discussion of the trade-offs observed
# Print this analysis in a structured format

## Step 12: Performance Summary

Summarize your findings and provide recommendations.

In [None]:
# TODO: Create a summary that includes:
# 1. Overall winner based on your specific use case and query
# 2. Scenarios where dense-only might be preferred
# 3. Scenarios where hybrid might be preferred
# 4. Key insights about the performance differences
# 5. Recommendations for production use
# Format this as a professional summary report

## Bonus Challenges (Optional)

If you complete the main lab early, try these additional experiments:

In [None]:
# TODO BONUS 1: Weight Optimization
# Test different weight combinations for the hybrid retriever
# Try [0.7, 0.3], [0.3, 0.7], and [0.8, 0.2]
# Determine which weighting works best for your query and content

In [None]:
# TODO BONUS 2: Multiple Query Testing
# Test both methods with 3-5 different queries
# Create an expanded comparison table showing performance across all queries
# Identify patterns in when each method performs better

In [None]:
# TODO BONUS 3: Chunking Strategy Impact
# Create a second set of chunks with different parameters (e.g., chunk_size=400)
# Test how chunking strategy affects the performance of both retrieval methods
# Add chunking strategy results to your comparison table

In [None]:
# TODO BONUS 4: Metadata Filtering Impact
# Test how adding metadata filters affects both dense and hybrid retrieval
# Compare filtered vs unfiltered results for both methods
# Analyze the impact on latency and result quality

## Final Deliverable

Your completed lab should produce a comprehensive comparison table with the following structure:

| Metric | Dense-Only | Hybrid | Winner |
|--------|------------|--------|--------|
| Latency (seconds) | X.XXXX | X.XXXX | [Method or Tie] |
| Number of Results | X | X | [Method or Tie] |
| Relevance Score (1-5) | X.X | X.X | [Method or Tie] |
| Diversity Score (unique categories) | X | X | [Method or Tie] |
| Precision Score (% relevant) | XX% | XX% | [Method or Tie] |
| Coverage Score (1-5) | X.X | X.X | [Method or Tie] |

## Submission Checklist

Before completing this lab, ensure you have:

- [ ] Successfully implemented both dense-only and hybrid retrievers
- [ ] Tested both methods with the same query
- [ ] Measured latency accurately for both methods
- [ ] Evaluated quality using multiple metrics
- [ ] Created a detailed comparison table
- [ ] Provided analysis and recommendations
- [ ] Documented your findings clearly

**Final Deliverable**: A comparison table showing latency and quality metrics for dense-only vs hybrid retrieval methods, along with analysis and recommendations.