# Lab 3.4.1: Building a RAG Pipeline with ChromaDB

**Module:** 3.4 - AI Agents & Agentic Systems  
**Time:** 3 hours  
**Difficulty:** ⭐⭐⭐ (Intermediate)

---

## 🎯 Learning Objectives

By the end of this notebook, you will:
- [ ] Understand the RAG architecture and why it's essential for modern AI
- [ ] Load and chunk documents for effective retrieval
- [ ] Create embeddings using local models on DGX Spark
- [ ] Build a vector store with ChromaDB
- [ ] Implement retrieval strategies (dense, sparse, hybrid)
- [ ] Create a complete question-answering RAG pipeline

---

## 📚 Prerequisites

- Completed: Module 12 (Deployment & Inference)
- Knowledge of: Python, basic NLP concepts, embeddings
- Running: Ollama with `nomic-embed-text` and `llama3.1` models

---

## 🌍 Real-World Context

**The Problem:** Large Language Models have a knowledge cutoff and can't access your private documents.

**Real Examples:**
- 📄 **Customer Support:** Your company's support bot needs to answer questions about YOUR products
- 🏥 **Healthcare:** A medical AI needs access to the latest research papers
- 💼 **Legal:** An AI assistant needs to reference specific contracts and legal documents
- 📊 **Finance:** A trading assistant needs real-time market data and company reports

**RAG solves this** by giving the LLM the ability to "look up" information in your documents before answering!

---

## 🧒 ELI5: What is RAG?

> **Imagine you're taking an open-book test...** 📚
>
> Without RAG, an AI is like taking a test with just what you memorized. You might remember a lot, but what about things you never learned or details that changed after you studied?
>
> **With RAG, the AI gets to use the textbook!**
>
> When you ask a question:
> 1. 🔍 The AI first **looks up** relevant pages in the textbook (retrieval)
> 2. 📖 It **reads** those specific pages (context)
> 3. ✍️ Then it **writes** an answer using both what it knows AND what it just read (generation)
>
> **In AI terms:**
> - **Retrieval** = Finding relevant document chunks using similarity search
> - **Augmented** = Adding those chunks to the prompt as context
> - **Generation** = The LLM generates an answer using the augmented context

```
┌─────────────────────────────────────────────────────────────────┐
│                        RAG Pipeline                             │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│   User Question     Document Store        LLM                   │
│        │                  │                │                    │
│        ▼                  │                │                    │
│   ┌─────────┐             │                │                    │
│   │ Embed   │             │                │                    │
│   │ Query   │             │                │                    │
│   └────┬────┘             │                │                    │
│        │                  │                │                    │
│        ▼                  ▼                │                    │
│   ┌──────────────────────────┐             │                    │
│   │   Vector Similarity      │             │                    │
│   │       Search             │             │                    │
│   └───────────┬──────────────┘             │                    │
│               │                            │                    │
│               ▼                            │                    │
│   ┌─────────────────────┐                  │                    │
│   │  Top-K Relevant     │                  │                    │
│   │     Chunks          │──────────────────┼──────────┐         │
│   └─────────────────────┘                  │          │         │
│                                            ▼          │         │
│                                 ┌────────────────────┐│         │
│                                 │ Prompt + Context   ││         │
│                                 │                    ││         │
│                                 └──────────┬─────────┘│         │
│                                            │          │         │
│                                            ▼          │         │
│                                 ┌────────────────────┐          │
│                                 │   Generate Answer  │          │
│                                 └──────────┬─────────┘          │
│                                            │                    │
│                                            ▼                    │
│                                 ┌────────────────────┐          │
│                                 │   Final Response   │          │
│                                 └────────────────────┘          │
└─────────────────────────────────────────────────────────────────┘
```
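
To make the three stages concrete, here is a minimal, dependency-free sketch of the flow in the diagram. It is not how the rest of this lab implements RAG: word overlap stands in for embedding similarity, the three-sentence corpus and all function names are hypothetical, and the final LLM call is left out.

```python
import re
from collections import Counter

# Hypothetical mini corpus standing in for real document chunks.
CHUNKS = [
    "The DGX Spark has 128GB of unified memory.",
    "ChromaDB is a lightweight vector database.",
    "LoRA fine-tunes a model by training small adapter matrices.",
]

def tokenize(text: str) -> Counter:
    """Lowercase word counts; a crude stand-in for an embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def overlap_score(query: str, chunk: str) -> int:
    """Number of word occurrences shared by the query and the chunk."""
    return sum((tokenize(query) & tokenize(chunk)).values())

def retrieve(query: str, k: int = 2) -> list[str]:
    """Retrieval: return the k chunks that share the most words with the query."""
    return sorted(CHUNKS, key=lambda c: overlap_score(query, c), reverse=True)[:k]

def build_prompt(query: str, context: list[str]) -> str:
    """Augmentation: place the retrieved chunks in front of the question."""
    joined = "\n".join(f"- {c}" for c in context)
    return f"Context:\n{joined}\n\nQuestion: {query}\nAnswer:"

question = "How much memory does the DGX Spark have?"
top_chunks = retrieve(question, k=1)
toy_prompt = build_prompt(question, top_chunks)

# Generation: this string is what would be sent to the LLM.
print(toy_prompt)
```

Swapping `overlap_score` for a vector similarity and the Python list for a vector store gives the pipeline built in the following parts.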

---

## Part 1: Environment Setup

Let's start by setting up our environment. We'll use local models on DGX Spark, so documents and queries stay on the machine and no external API calls are needed.

In [None]:
# Install required packages (run once)
# Note: In the NGC container, most packages are pre-installed
# rank_bm25 is required for the hybrid search challenge
# !pip install langchain langchain-community chromadb sentence-transformers rank_bm25

In [None]:
# Standard imports
import os
import sys
from pathlib import Path
from typing import List, Dict, Any
import time

# Add scripts to path
sys.path.insert(0, str(Path.cwd().parent / 'scripts'))

# Check GPU availability
import torch
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

In [None]:
# Import LangChain components
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_community.llms import Ollama
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate

print("LangChain components imported successfully!")

In [None]:
# Verify Ollama is running before proceeding
import requests

def check_ollama():
    """Check if Ollama is running and required models are available."""
    try:
        response = requests.get("http://localhost:11434/api/tags", timeout=5)
        if response.status_code != 200:
            # Previously this case fell through and returned None
            print(f"❌ Ollama responded with HTTP {response.status_code}")
            return False

        models = [m['name'] for m in response.json().get('models', [])]
        print("✅ Ollama is running")
        print(f"   Available models: {', '.join(models[:5])}")

        # Check for required models (substring match, so 'llama3.1:8b' satisfies 'llama3.1')
        required = ['llama3.1', 'nomic-embed-text']
        for req in required:
            if any(req in m for m in models):
                print(f"   ✅ Found {req}")
            else:
                print(f"   ⚠️  Missing {req} - run: ollama pull {req}")
        return True
    except requests.exceptions.ConnectionError:
        print("❌ Ollama is not running!")
        print("   Start it with: ollama serve")
        # 'ollama pull' takes one model per invocation
        print("   Then pull models: ollama pull llama3.1:8b && ollama pull nomic-embed-text")
        return False
    except Exception as e:
        print(f"❌ Error checking Ollama: {e}")
        return False

check_ollama()

### 🔍 What Just Happened?

We imported the core components we'll need:
- **TextSplitter**: Breaks documents into smaller chunks
- **DocumentLoaders**: Load documents from files
- **OllamaEmbeddings**: Create vector embeddings using local Ollama models
- **Chroma**: Our vector database for storing and searching embeddings
- **Ollama**: The LLM for generating answers
- **RetrievalQA**: Chains retrieval and generation together

---

## Part 2: Loading Documents

The first step in any RAG pipeline is loading your documents. Let's load the sample technical documents we've prepared.

In [None]:
# Define paths
DATA_DIR = Path.cwd().parent / "data" / "sample_documents"
CHROMA_DIR = Path.cwd().parent / "data" / "chroma_db"

print(f"Data directory: {DATA_DIR}")
print(f"ChromaDB directory: {CHROMA_DIR}")

# List available documents
if DATA_DIR.exists():
    docs = list(DATA_DIR.glob("*.txt"))
    print(f"\nFound {len(docs)} documents:")
    for doc in docs:
        print(f"  - {doc.name} ({doc.stat().st_size / 1024:.1f} KB)")
else:
    print("Data directory not found. Please run the setup script first.")

In [None]:
# Load all documents from the directory
loader = DirectoryLoader(
    str(DATA_DIR),
    glob="**/*.txt",
    loader_cls=TextLoader,
    show_progress=True
)

documents = loader.load()
print(f"\nLoaded {len(documents)} documents")

# Show a preview of the first document
if documents:
    first_doc = documents[0]
    print(f"\nFirst document source: {first_doc.metadata['source']}")
    print(f"Content preview (first 500 chars):")
    print("-" * 50)
    print(first_doc.page_content[:500])
    print("-" * 50)

---

## Part 3: Chunking Documents

### 🧒 ELI5: Why Do We Need Chunking?

> **Imagine you're looking for a recipe for chocolate chip cookies in a HUGE cookbook...** 📖
>
> You wouldn't read the WHOLE cookbook to find it. You'd:
> 1. Look at the table of contents (too vague)
> 2. Find the "Cookies" section (better!)
> 3. Find the specific page for chocolate chip cookies (perfect!)
>
> **Chunking does the same thing!**
> - Whole document = cookbook (too big to search efficiently)
> - Chunks = individual recipes (just the right size)
>
> **Key insight:** Chunks should be:
> - Small enough to be specific
> - Large enough to be meaningful
> - Overlapping slightly so we don't miss context at boundaries
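
The overlap idea can be sketched in plain Python. This is a simplified fixed-window splitter written for illustration only (the `chunk_text` helper is hypothetical); `RecursiveCharacterTextSplitter`, used in the next cell, additionally tries to break at paragraph and sentence separators rather than at arbitrary characters.

```python
def chunk_text(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split text into windows of chunk_size characters that share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap  # how far each window advances
    pieces = []
    for start in range(0, len(text), step):
        pieces.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return pieces

sample = "abcdefghijklmnopqrstuvwxyz"
pieces = chunk_text(sample, chunk_size=10, overlap=3)
for piece in pieces:
    print(piece)
```

Each window begins with the last three characters of the previous one (`hij`, `opq`, `vwx`), which is what keeps a sentence that straddles a boundary retrievable from at least one chunk.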

In [None]:
# Create the text splitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,        # Maximum characters per chunk
    chunk_overlap=50,      # Characters overlapping between chunks
    length_function=len,
    separators=["\n\n", "\n", ". ", " ", ""]  # Priority of split points
)

# Split documents into chunks
chunks = text_splitter.split_documents(documents)

print(f"Created {len(chunks)} chunks from {len(documents)} documents")
print(f"Average chunk size: {sum(len(c.page_content) for c in chunks) / len(chunks):.0f} characters")

In [None]:
# Examine some chunks
print("Sample chunks:")
print("=" * 60)

for i, chunk in enumerate(chunks[:3]):
    source = Path(chunk.metadata['source']).name
    print(f"\nChunk {i+1} (from {source}):")
    print(f"Length: {len(chunk.page_content)} characters")
    print("-" * 40)
    print(chunk.page_content[:300] + "...")
    print("=" * 60)

### ✋ Try It Yourself: Experiment with Chunk Sizes

Change the `chunk_size` and `chunk_overlap` parameters and observe:
- How does the number of chunks change?
- What happens with very small chunks (100 characters)?
- What happens with very large chunks (2000 characters)?

<details>
<summary>💡 Hint</summary>

**Guidelines for chunk sizes:**
- **Too small (< 200):** Chunks lack context, retrieval may be meaningless
- **Too large (> 1000):** Chunks are too broad, less precise retrieval
- **Sweet spot (300-600):** Good balance of specificity and context
- **Overlap (10-20%):** Prevents losing information at chunk boundaries
</details>

In [None]:
# Your experimentation code here
# Try different chunk_size values: 100, 300, 512, 1000, 2000

chunk_sizes = [100, 300, 512, 1000, 2000]

for size in chunk_sizes:
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=size,
        chunk_overlap=size // 10,  # 10% overlap
    )
    test_chunks = splitter.split_documents(documents)
    print(f"Chunk size {size}: {len(test_chunks)} chunks created")

---

## Part 4: Creating Embeddings

### 🧒 ELI5: What Are Embeddings?

> **Imagine you could describe ANY concept with a set of coordinates, like a map...** 🗺️
>
> On this magic map:
> - "Dog" and "puppy" would be very close together
> - "Dog" and "cat" would be nearby (both pets)
> - "Dog" and "algebra" would be far apart
>
> **Embeddings are these coordinates!**
> - Each chunk of text → a list of numbers (vector)
> - Similar text → similar numbers (close on the map)
> - Different text → different numbers (far on the map)
>
> **Why this matters for RAG:**
> When you ask "What is the GPU in DGX Spark?", we convert your question to coordinates, then find the document chunks with the closest coordinates!

In [None]:
# Initialize the embedding model
# We use Ollama's nomic-embed-text which runs locally on DGX Spark
embeddings = OllamaEmbeddings(
    model="nomic-embed-text",
    base_url="http://localhost:11434"  # Default Ollama URL
)

print("Embedding model initialized!")
print("Model: nomic-embed-text (768 dimensions)")

In [None]:
# Let's see what embeddings look like
sample_text = "The DGX Spark has 128GB of unified memory."

# Create an embedding
start_time = time.time()
sample_embedding = embeddings.embed_query(sample_text)
elapsed = time.time() - start_time

print(f"Text: '{sample_text}'")
print(f"\nEmbedding:")
print(f"  - Dimensions: {len(sample_embedding)}")
print(f"  - First 10 values: {sample_embedding[:10]}")
print(f"  - Time to embed: {elapsed*1000:.1f}ms")

In [None]:
# Demonstrate semantic similarity with embeddings
import numpy as np

def cosine_similarity(v1, v2):
    """Calculate cosine similarity between two vectors."""
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

# Create embeddings for similar and different texts
texts = [
    "The DGX Spark has 128GB of unified memory.",  # Original
    "DGX Spark features 128 gigabytes of shared RAM.",  # Similar meaning
    "The memory capacity is one hundred twenty-eight gigabytes.",  # Similar topic
    "Machine learning uses neural networks.",  # Different topic
    "I like pizza with extra cheese.",  # Completely different
]

print("Semantic Similarity Demo")
print("=" * 60)
print(f"Reference: '{texts[0]}'\n")

ref_embedding = embeddings.embed_query(texts[0])

for text in texts[1:]:
    text_embedding = embeddings.embed_query(text)
    similarity = cosine_similarity(ref_embedding, text_embedding)
    print(f"Similarity: {similarity:.3f} | '{text[:50]}...'")

### 🔍 What Just Happened?

Notice how:
- Texts with similar meaning have high similarity scores (close to 1.0)
- Texts about different topics have lower scores
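- Completely unrelated texts have the lowest scores (with many embedding models these still sit well above 0, so compare scores relative to each other rather than against a fixed threshold)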

This is the magic of embeddings - they capture **semantic meaning**, not just word overlap!

---

## Part 5: Building the Vector Store with ChromaDB

Now we'll store our embeddings in ChromaDB, a lightweight vector database perfect for local development.

In [None]:
# Create the vector store
# This embeds all chunks and stores them in ChromaDB

print(f"Creating vector store with {len(chunks)} chunks...")
print("This may take a minute as we embed each chunk.\n")

start_time = time.time()

# Create and persist the vector store
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory=str(CHROMA_DIR),
    collection_name="dgx_spark_docs"
)

elapsed = time.time() - start_time
print(f"Vector store created in {elapsed:.1f} seconds!")
print(f"Stored at: {CHROMA_DIR}")

In [None]:
# Test the vector store with a simple query
query = "What is the memory capacity of DGX Spark?"

print(f"Query: '{query}'\n")
print("Top 3 most relevant chunks:")
print("=" * 60)

# Perform similarity search
# Note: with Chroma the returned score is a distance, so LOWER means a closer match
results = vectorstore.similarity_search_with_score(query, k=3)

for i, (doc, score) in enumerate(results, 1):
    source = Path(doc.metadata['source']).name
    print(f"\n[{i}] Distance: {score:.4f} | Source: {source}")
    print("-" * 40)
    print(doc.page_content[:300] + "...")
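
One detail worth checking when reading these numbers: as far as LangChain's Chroma wrapper documents it, `similarity_search_with_score` returns a distance, so the smallest value marks the best match. The sketch below uses hypothetical 2-D vectors (real embeddings here have 768 dimensions) to show that a distance and a similarity rank the same items first while sorting in opposite directions. The helper names are made up for this illustration.

```python
import math

def squared_l2(a, b):
    """Squared Euclidean distance: 0 for identical vectors, larger means farther apart."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def toy_cosine(a, b):
    """Cosine similarity: 1 for the same direction, smaller means less similar."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

toy_query = [1.0, 0.0]
toy_vectors = {
    "close": [0.9, 0.1],
    "far": [0.0, 1.0],
}

# Distances are sorted ascending, similarities descending.
by_distance = sorted(toy_vectors, key=lambda n: squared_l2(toy_query, toy_vectors[n]))
by_similarity = sorted(
    toy_vectors, key=lambda n: toy_cosine(toy_query, toy_vectors[n]), reverse=True
)

for name, vec in toy_vectors.items():
    print(f"{name}: distance={squared_l2(toy_query, vec):.3f}, "
          f"cosine={toy_cosine(toy_query, vec):.3f}")
```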

---

## Part 6: Creating the RAG Chain

Now let's combine retrieval with generation to create a complete RAG pipeline!

In [None]:
# Initialize the LLM
# Using Ollama with a local model on DGX Spark
llm = Ollama(
    model="llama3.1:8b",  # Use 8b for faster responses, or 70b for better quality
    temperature=0.3,       # Lower temperature for more factual responses
    base_url="http://localhost:11434"
)

print("LLM initialized: llama3.1:8b")

In [None]:
# Create a custom prompt template for RAG
RAG_PROMPT_TEMPLATE = """You are a helpful AI assistant with access to technical documentation.
Use the following context to answer the question. If the answer cannot be found in the context,
say "I don't have information about that in my knowledge base."

Context:
{context}

Question: {question}

Answer: """

prompt = PromptTemplate(
    template=RAG_PROMPT_TEMPLATE,
    input_variables=["context", "question"]
)

print("RAG prompt template created!")

In [None]:
# Create the retriever
retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 5}  # Retrieve top 5 chunks
)

# Create the RAG chain
rag_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",  # "stuff" = put all context into prompt
    retriever=retriever,
    return_source_documents=True,
    chain_type_kwargs={"prompt": prompt}
)

print("RAG chain created successfully!")
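
Conceptually, the `"stuff"` chain type concatenates the retrieved chunks and substitutes them into the prompt's `{context}` slot before calling the LLM. The following plain-Python sketch mimics that step under stated assumptions: the blank-line separator, the shortened template, and the `stuff_prompt` helper are illustrative and are not LangChain's internal code.

```python
# Shortened stand-in for the RAG prompt template defined earlier.
TOY_TEMPLATE = "Context:\n{context}\n\nQuestion: {question}\n\nAnswer: "

def stuff_prompt(question: str, retrieved_chunks: list[str]) -> str:
    """Join all retrieved chunks and place them in the {context} slot."""
    context = "\n\n".join(retrieved_chunks)
    return TOY_TEMPLATE.format(context=context, question=question)

toy_chunks = [
    "The DGX Spark has 128GB of unified memory.",
    "Unified memory is shared between the CPU and the GPU.",
]
stuffed = stuff_prompt("What is the memory capacity of DGX Spark?", toy_chunks)
print(stuffed)
```

Because every chunk is pasted in verbatim, the prompt length grows linearly with `k`, which is why the number of retrieved chunks has to fit the model's context window.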

In [None]:
# Helper function to query the RAG system
def ask_rag(question: str, show_sources: bool = True) -> str:
    """Query the RAG system and optionally show sources."""
    print(f"Question: {question}")
    print("-" * 60)
    
    start_time = time.time()
    result = rag_chain.invoke({"query": question})
    elapsed = time.time() - start_time
    
    print(f"\nAnswer: {result['result']}")
    print(f"\n(Response time: {elapsed:.2f}s)")
    
    if show_sources:
        print("\nSources used:")
        for i, doc in enumerate(result['source_documents'], 1):
            source = Path(doc.metadata['source']).name
            print(f"  [{i}] {source}")
    
    return result['result']

In [None]:
# Let's test our RAG system!
print("="*60)
print("Testing the RAG Pipeline")
print("="*60 + "\n")

# Test question 1: Factual retrieval
ask_rag("What is the memory capacity of DGX Spark?")

In [None]:
# Test question 2: Technical details
print("\n" + "="*60 + "\n")
ask_rag("How does the unified memory architecture benefit AI workloads?")

In [None]:
# Test question 3: Multi-hop reasoning
print("\n" + "="*60 + "\n")
ask_rag("Can I fine-tune a 70B model on DGX Spark using LoRA without quantization?")

In [None]:
# Test question 4: Out of scope (should gracefully handle)
print("\n" + "="*60 + "\n")
ask_rag("What is the capital of France?")

---

## Part 7: Retrieval Strategies

Let's explore different retrieval strategies to improve results.

In [None]:
# Strategy 1: Similarity Search (Default)
# Simply returns the k most similar chunks

similarity_retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 5}
)

print("Strategy 1: Similarity Search")
print("Returns the k nearest chunks by embedding distance (Chroma defaults to L2; cosine must be configured on the collection).")

In [None]:
# Strategy 2: Maximum Marginal Relevance (MMR)
# Balances relevance with diversity to avoid redundant results

mmr_retriever = vectorstore.as_retriever(
    search_type="mmr",
    search_kwargs={
        "k": 5,
        "fetch_k": 20,     # Fetch more candidates first
        "lambda_mult": 0.5  # Balance between relevance (1) and diversity (0)
    }
)

print("Strategy 2: Maximum Marginal Relevance (MMR)")
print("Balances relevance with diversity to reduce redundancy.")

In [None]:
# Compare the two strategies
query = "What are the key features of DGX Spark?"

print(f"Query: '{query}'\n")

print("=" * 60)
print("SIMILARITY SEARCH RESULTS")
print("=" * 60)
sim_results = similarity_retriever.invoke(query)
for i, doc in enumerate(sim_results, 1):
    print(f"\n[{i}] {doc.page_content[:150]}...")

print("\n" + "=" * 60)
print("MMR SEARCH RESULTS")
print("=" * 60)
mmr_results = mmr_retriever.invoke(query)
for i, doc in enumerate(mmr_results, 1):
    print(f"\n[{i}] {doc.page_content[:150]}...")

### 🔍 What's the Difference?

- **Similarity Search**: May return chunks that are all very similar (redundant)
- **MMR**: Returns diverse results that cover different aspects of the query

**When to use which:**
- **Similarity**: When you want the absolute most relevant chunks
- **MMR**: When your query could be answered from multiple perspectives
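
The selection rule behind MMR can be sketched in a few lines. This is a minimal illustration on hypothetical 3-D vectors, not LangChain's implementation: at each step it picks the candidate that maximises `lambda_mult * relevance - (1 - lambda_mult) * redundancy`, where redundancy is the highest similarity to anything already selected.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def mmr(query_vec, candidates, k, lambda_mult=0.5):
    """Return the indices of k candidates chosen by Maximal Marginal Relevance."""
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:
        def score(i):
            relevance = cosine(query_vec, candidates[i])
            # Similarity to the closest already-selected item (0 if none yet).
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

toy_query_vec = [1.0, 1.0, 0.0]
toy_candidates = [
    [1.0, 0.2, 0.0],  # most relevant
    [1.0, 0.1, 0.0],  # near-duplicate of the first
    [0.0, 1.0, 0.0],  # less relevant, but covers a different aspect
]

by_similarity = sorted(
    range(len(toy_candidates)),
    key=lambda i: cosine(toy_query_vec, toy_candidates[i]),
    reverse=True,
)[:2]
by_mmr = mmr(toy_query_vec, toy_candidates, k=2, lambda_mult=0.5)

print("similarity:", by_similarity)
print("mmr:       ", by_mmr)
```

With these toy vectors, plain similarity returns the two near-duplicates (indices 0 and 1), while MMR keeps index 0 and swaps in index 2. Setting `lambda_mult=1.0` removes the redundancy penalty and reproduces the similarity ranking.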

---

## ⚠️ Common Mistakes

### Mistake 1: Chunks Too Small or Too Large

In [None]:
# ❌ Wrong: Chunks too small
bad_splitter = RecursiveCharacterTextSplitter(
    chunk_size=50,  # Way too small!
    chunk_overlap=10
)

# This creates chunks like: "The DGX Spark is NVIDIA's" - meaningless!

# ✅ Right: Reasonable chunk size
good_splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=50
)

print("Chunk size should be 300-600 characters for most use cases.")

### Mistake 2: Not Including Metadata

In [None]:
# ❌ Wrong: Losing track of where chunks came from
# chunks_no_metadata = [doc.page_content for doc in documents]

# ✅ Right: Keep metadata for source attribution
chunks_with_metadata = text_splitter.split_documents(documents)

# Now we can always trace back to the source
print(f"Example metadata: {chunks_with_metadata[0].metadata}")

### Mistake 3: Using the Wrong Number of Retrieved Chunks

In [None]:
# ❌ Wrong: Too few chunks (might miss relevant info)
bad_retriever = vectorstore.as_retriever(search_kwargs={"k": 1})

# ❌ Wrong: Too many chunks (wastes context window, dilutes relevance)
also_bad_retriever = vectorstore.as_retriever(search_kwargs={"k": 50})

# ✅ Right: Balance based on your context window and needs
good_retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

print("k=3-7 is a good starting point for most use cases.")
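
The trade-off in Mistake 3 can be made concrete with rough arithmetic. The sketch below estimates how many prompt tokens the retrieved context consumes for a given `k`. The figure of about 4 characters per token is a common rule of thumb for English text, not a property of any particular tokenizer, and the helper name is hypothetical.

```python
def estimated_context_tokens(k: int, chunk_size_chars: int,
                             chars_per_token: float = 4.0) -> int:
    """Rough token cost of stuffing k chunks of chunk_size_chars into the prompt."""
    # Assumption: ~4 characters per token; measure with your model's tokenizer if it matters.
    return round(k * chunk_size_chars / chars_per_token)

for k in (1, 5, 50):
    tokens = estimated_context_tokens(k, chunk_size_chars=512)
    print(f"k={k:>2}: about {tokens} context tokens")
```

Under that assumption, `k=5` with 512-character chunks costs roughly 640 tokens, while `k=50` costs roughly 6,400, most of it spent on chunks that ranked far down the list.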

---

## 🎉 Checkpoint

You've learned:
- ✅ How RAG works and why it's important
- ✅ Loading and chunking documents with appropriate sizes
- ✅ Creating embeddings using local models
- ✅ Building a vector store with ChromaDB
- ✅ Implementing different retrieval strategies
- ✅ Creating a complete RAG pipeline with LangChain

---

## 🚀 Challenge (Optional)

### Advanced RAG: Implement a Hybrid Search

Combine keyword search (BM25) with semantic search for better results.

**Hint:** Use `langchain.retrievers.EnsembleRetriever` to combine multiple retrievers.
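
Before wiring up the retrievers, it may help to see how two ranked lists can be merged. The sketch below implements weighted Reciprocal Rank Fusion, which is the approach `EnsembleRetriever` is documented as using; the constant `c=60`, the document IDs, and the `weighted_rrf` helper are assumptions made for this illustration rather than LangChain's code.

```python
def weighted_rrf(rankings, weights, c=60):
    """Fuse several ranked lists of IDs into one list using weighted reciprocal ranks."""
    scores = {}
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            # Higher-ranked items (small rank) contribute more; c damps the difference.
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (rank + c)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical results from a keyword (BM25) and a semantic retriever.
keyword_ranking = ["doc_b", "doc_a", "doc_c"]
semantic_ranking = ["doc_a", "doc_d", "doc_b"]

fused = weighted_rrf([keyword_ranking, semantic_ranking], weights=[0.3, 0.7])
print(fused)
```

Documents that appear in both lists (`doc_a`, `doc_b`) rise to the top, and the 0.7 weight lets the semantic-only hit `doc_d` outrank the keyword-only hit `doc_c`.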

In [None]:
# Challenge: Implement hybrid search
# Requires: pip install rank_bm25
# Your code here...

# # Handle import compatibility for BM25Retriever
# try:
#     from langchain_community.retrievers import BM25Retriever
# except ImportError:
#     from langchain.retrievers import BM25Retriever
#
# from langchain.retrievers import EnsembleRetriever
#
# # Create BM25 (keyword) retriever
# bm25_retriever = BM25Retriever.from_documents(chunks)
# bm25_retriever.k = 5
#
# # Create ensemble retriever
# ensemble_retriever = EnsembleRetriever(
#     retrievers=[bm25_retriever, mmr_retriever],
#     weights=[0.3, 0.7]  # 30% keyword, 70% semantic
# )
#
# # Test hybrid retrieval
# hybrid_results = ensemble_retriever.invoke("What are the key features of DGX Spark?")
# print("Hybrid search results:", len(hybrid_results))

---

## 📖 Further Reading

- [LangChain RAG Documentation](https://python.langchain.com/docs/tutorials/rag/)
- [ChromaDB Documentation](https://docs.trychroma.com/)
- [Ollama Embedding Models](https://ollama.com/library?tag=embedding)
- [RAG Best Practices (Anthropic)](https://docs.anthropic.com/claude/docs/retrieval-augmented-generation-rag)

---

## 🧹 Cleanup

In [None]:
# Cleanup: Free GPU memory and resources
import gc

# Run Python garbage collection first so unreferenced tensors can be released
gc.collect()

# Then release PyTorch's cached GPU memory, if any
# Note: models served by Ollama run in a separate process and are not freed here
try:
    import torch
    if torch.cuda.is_available():
        torch.cuda.synchronize()
        torch.cuda.empty_cache()
        allocated = torch.cuda.memory_allocated() / 1e9
        print(f"✅ GPU cache cleared ({allocated:.2f} GB still allocated)")
except ImportError:
    pass

print("✅ Cleanup complete!")

---

## 🎓 Summary

In this notebook, you built a complete RAG pipeline:

1. **Loaded documents** from files into LangChain Document objects
2. **Chunked documents** using RecursiveCharacterTextSplitter
3. **Created embeddings** with local Ollama models
4. **Stored vectors** in ChromaDB for efficient similarity search
5. **Implemented retrieval** with different strategies (similarity, MMR)
6. **Built a QA chain** combining retrieval with LLM generation

**Next up:** Lab 3.4.2 - Building Custom Tools for AI Agents