# Lab 3.4.1: RAG Pipeline - Solutions

This notebook contains complete solutions for the exercises in the RAG Pipeline notebook.

> **Important:** Run cells in order from top to bottom. Each cell may depend on variables or imports from previous cells.

---

## Exercise: Experiment with Chunk Sizes

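**Solution:** Split the same documents with several chunk-size and overlap settings, then compare how many chunks each setting produces and how long they are on average. These numbers show what the retriever will have to work with at each size.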

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import DirectoryLoader, TextLoader
from pathlib import Path

# Load documents
DATA_DIR = Path.cwd().parent / "data" / "sample_documents"
loader = DirectoryLoader(str(DATA_DIR), glob="**/*.txt", loader_cls=TextLoader)
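documents = loader.load()
if not documents:
    # Fail early: an empty list would cause a division by zero in the analysis below
    raise FileNotFoundError(f"No .txt files found under {DATA_DIR}")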

# Test different chunk sizes
chunk_configs = [
    {"size": 100, "overlap": 10, "desc": "Very small - good for precise retrieval, may lack context"},
    {"size": 256, "overlap": 25, "desc": "Small - balanced for short queries"},
    {"size": 512, "overlap": 50, "desc": "Medium - recommended for most use cases"},
    {"size": 1024, "overlap": 100, "desc": "Large - good for complex topics"},
    {"size": 2000, "overlap": 200, "desc": "Very large - may dilute relevance"},
]

print("Chunk Size Analysis")
print("="*70)

for config in chunk_configs:
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=config["size"],
        chunk_overlap=config["overlap"]
    )
    chunks = splitter.split_documents(documents)
    avg_size = sum(len(c.page_content) for c in chunks) / len(chunks)
    
    print(f"\nChunk size: {config['size']}, Overlap: {config['overlap']}")
    print(f"  Chunks created: {len(chunks)}")
    print(f"  Avg chunk length: {avg_size:.0f} chars")
    print(f"  Note: {config['desc']}")
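
To make the interaction between chunk size and overlap concrete, the sketch below uses a plain fixed-width sliding window. It is a simplified illustration and not the algorithm `RecursiveCharacterTextSplitter` uses (that splitter prefers to break on separators such as paragraphs and sentences). The function name `sliding_window_chunks` and the sample string are hypothetical.

```python
def sliding_window_chunks(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split text into fixed-size character windows that share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # each new window starts this many characters later
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"  # 26 characters
windows = sliding_window_chunks(sample, chunk_size=10, overlap=2)
print(windows)  # ['abcdefghij', 'ijklmnopqr', 'qrstuvwxyz']
```

Each window repeats the last two characters of the previous one, so text that straddles a boundary still appears whole in at least one chunk. A larger overlap lowers the chance of cutting a fact in half, at the cost of more chunks to embed and store.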

## Challenge: Implement Hybrid Search

**Solution:** Combining BM25 (keyword) with vector (semantic) search.

In [None]:
# BM25Retriever needs the rank_bm25 package at runtime: pip install rank_bm25

# Handle different LangChain versions for BM25Retriever
try:
    from langchain_community.retrievers import BM25Retriever
except ImportError:
    try:
        from langchain.retrievers import BM25Retriever
    except ImportError as exc:
        raise ImportError(
            "BM25Retriever not found in langchain_community.retrievers or "
            "langchain.retrievers.\n"
            "Install with: pip install langchain-community rank_bm25"
        ) from exc

from langchain.retrievers import EnsembleRetriever
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_community.llms import Ollama
from langchain.chains import RetrievalQA

# Initialize components
embeddings = OllamaEmbeddings(model="nomic-embed-text")
llm = Ollama(model="llama3.1:8b", temperature=0.3)

# Create chunks
splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=50)
chunks = splitter.split_documents(documents)

# Create vector store
vectorstore = Chroma.from_documents(chunks, embeddings)

# Create BM25 (keyword) retriever
bm25_retriever = BM25Retriever.from_documents(chunks)
bm25_retriever.k = 5

# Create vector (semantic) retriever
vector_retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

# Create ensemble (hybrid) retriever
# Weights: 30% keyword, 70% semantic
ensemble_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, vector_retriever],
    weights=[0.3, 0.7]
)

print("✅ Hybrid retriever created!")
print("  - BM25 weight: 30% (keyword matching)")
print("  - Vector weight: 70% (semantic similarity)")

In [None]:
# Create RAG chain with hybrid retrieval
hybrid_rag = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=ensemble_retriever,
    return_source_documents=True
)

# Test the hybrid RAG
query = "What are the CUDA core and Tensor Core specifications of DGX Spark?"

print(f"Query: {query}")
print("="*60)

result = hybrid_rag.invoke({"query": query})

print(f"\nAnswer: {result['result']}")
print(f"\nSources used: {len(result['source_documents'])}")
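
The cell above reports only how many source chunks were used. A small helper can also show which files they came from, which makes it easier to check that the answer is grounded in the right documents. The sketch below is illustrative: `summarize_sources` and `FakeDocument` are hypothetical names, and it assumes each retrieved document carries a `"source"` entry in its `metadata` (as `TextLoader` normally sets).

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class FakeDocument:
    """Hypothetical stand-in exposing the two attributes the helper reads."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def summarize_sources(source_documents) -> Counter:
    """Count how many retrieved chunks came from each source file."""
    return Counter(doc.metadata.get("source", "unknown") for doc in source_documents)


# Placeholder data only; in the notebook, pass result["source_documents"] instead
demo_docs = [
    FakeDocument("chunk 1", {"source": "a.txt"}),
    FakeDocument("chunk 2", {"source": "b.txt"}),
    FakeDocument("chunk 3", {"source": "a.txt"}),
]
counts = summarize_sources(demo_docs)
for source, n in counts.most_common():
    print(f"{source}: {n} chunk(s)")
```

In the notebook, `summarize_sources(result["source_documents"])` would replace the placeholder list. If one file dominates the count for an unrelated query, that is a hint to revisit the chunk size or the retriever weights.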

## Key Insights

1. **Chunk Size Selection:**
   - Small (100-256): Good for precise factual queries
   - Medium (512): Best balance for most use cases
   - Large (1024+): Better for complex, contextual queries

2. **Hybrid Search Benefits:**
   - Catches exact keyword matches (BM25)
   - Finds semantically similar content (Vector)
   - More robust than either alone

3. **Weight Tuning:**
   - Technical docs: More weight on BM25 (exact terms such as product names and identifiers matter)
   - Conversational: More weight on vector (meaning matters more than exact wording)
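
To see how the weights shift the final ranking, the sketch below fuses two ranked lists with weighted reciprocal rank fusion, the general approach `EnsembleRetriever` is documented to use. This is a simplified standalone version and not LangChain's implementation: the function name `weighted_rrf`, the document IDs, and the constant `c=60` (a commonly used default) are assumptions for illustration.

```python
from collections import defaultdict


def weighted_rrf(rankings, weights, c=60):
    """Fuse ranked lists of document IDs with weighted reciprocal rank fusion."""
    if len(rankings) != len(weights):
        raise ValueError("rankings and weights must have the same length")

    scores = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            # A higher position (smaller rank) contributes a larger share of the weight
            scores[doc_id] += weight / (c + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Hypothetical rankings from a keyword retriever and a vector retriever
bm25_ranking = ["doc_a", "doc_b", "doc_c"]
vector_ranking = ["doc_c", "doc_a", "doc_d"]

semantic_heavy = weighted_rrf([bm25_ranking, vector_ranking], weights=[0.3, 0.7])
keyword_heavy = weighted_rrf([bm25_ranking, vector_ranking], weights=[0.7, 0.3])

print(semantic_heavy)  # ['doc_c', 'doc_a', 'doc_d', 'doc_b']
print(keyword_heavy)   # ['doc_a', 'doc_c', 'doc_b', 'doc_d']
```

With the 30/70 split, the vector retriever's top hit (`doc_c`) wins. Flipping the weights promotes the BM25 top hit (`doc_a`) and lifts the keyword-only match `doc_b` above `doc_d`. Documents found by both retrievers stay near the top in either case, which is why the hybrid tends to be more robust than either retriever alone.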