# Lab 3.5.1: Basic RAG Pipeline

**Module:** 3.5 - RAG Systems & Vector Databases  
**Time:** 3 hours  
**Difficulty:** ⭐⭐⭐ (Intermediate)

---

## 🎯 Learning Objectives

By the end of this notebook, you will:

- [ ] Understand the complete RAG (Retrieval-Augmented Generation) architecture
- [ ] Load and process documents for a knowledge base
- [ ] Implement effective chunking with overlap
- [ ] Generate embeddings using GPU-accelerated models
- [ ] Store and query vectors in ChromaDB
- [ ] Build an end-to-end RAG pipeline that answers questions

---

## 📚 Prerequisites

- Completed: Module 3.4 (Test-Time Compute)
- Knowledge of: Python, basic NLP concepts, LLM usage
- Ollama running with a model installed (we'll use `qwen3:8b` or similar)

---

## 🌍 Real-World Context

**The Problem:** You're a developer at a company with thousands of internal documents (policies, technical guides, FAQs). Employees waste hours searching for information. A chatbot that just uses an LLM hallucinates answers because it wasn't trained on your company's data.

**The Solution:** RAG! Build a system that retrieves relevant internal documents and uses them to generate accurate, grounded answers. Products such as ChatGPT Enterprise, Microsoft Copilot, and Google's NotebookLM ground their answers in user documents in a broadly similar way.

**Why This Matters:** RAG is one of the most common ways to make LLMs useful for business applications, and it appears frequently in LLM job postings.

---

## 🧒 ELI5: What is RAG?

> **Imagine you're taking an open-book exam.**
>
> Without the book, you'd have to answer every question from memory. Sometimes you'd remember wrong, sometimes you'd just make things up (that's "hallucination" in AI terms!).
>
> With the book, you can look up the right information before answering. You find the relevant pages, read them, and then write your answer based on what you just read.
>
> **RAG gives AI a library card.** Instead of the AI trying to remember everything from its training, it can look up relevant information in YOUR documents before answering.
>
> The "R" in RAG is "Retrieval" - finding the right pages in the book.  
> The "A" is "Augmented" - adding that information to the prompt.  
> The "G" is "Generation" - writing the answer based on what was found.

---

## Part 1: Environment Setup

### Install Required Packages

First, let's install everything we need for our RAG pipeline.

In [None]:
# Install required packages
# Note: Run this cell once, then restart the kernel if needed

!pip install -q \
    langchain==0.3.14 \
    langchain-community==0.3.14 \
    langchain-huggingface==0.1.2 \
    chromadb==0.5.23 \
    sentence-transformers==3.3.1 \
    ollama==0.4.4

print("✅ Packages installed successfully!")

In [None]:
# Core imports
import os
import time
from pathlib import Path
from typing import List, Dict, Any, Optional

# LangChain for document processing
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import (
    DirectoryLoader,
    TextLoader,
    UnstructuredMarkdownLoader
)
from langchain.schema import Document

# Embeddings and Vector Store
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma

# LLM interaction
import ollama

# GPU and memory utilities
import torch
import gc

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

### 🔍 What Just Happened?

We imported our core tools:
- **LangChain**: Framework for building LLM applications
- **ChromaDB**: Vector database for storing embeddings
- **HuggingFaceEmbeddings**: LangChain wrapper around Sentence Transformers embedding models (runs on the GPU when one is available)
- **Ollama**: Local LLM inference

---

## Part 2: Understanding the RAG Architecture

### The RAG Pipeline

```
┌─────────────────────────────────────────────────────────────────┐
│                    INDEXING PHASE (Done Once)                   │
├─────────────────────────────────────────────────────────────────┤
│  Documents → Chunking → Embedding → Vector Database             │
│    📄📄📄      ✂️         🔢🔢🔢       💾                         │
└─────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│                   QUERY PHASE (Each Question)                   │
├─────────────────────────────────────────────────────────────────┤
│  Question → Embed → Search → Retrieve Top-K → Generate Answer   │
│     ❓        🔢      🔍         📄📄📄           💬             │
└─────────────────────────────────────────────────────────────────┘
```
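
To make the two phases concrete before bringing in real tools, here is a minimal sketch using only the standard library. It is a toy under stated assumptions: `embed` uses word counts as a stand-in for a neural embedding model, a plain list stands in for the vector database, and all names are hypothetical rather than part of the lab's code.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts (stand-in for a neural model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# --- Indexing phase (done once): embed every document and store the pair ---
documents = [
    "chunking splits long documents into smaller pieces",
    "embeddings turn text into vectors of numbers",
    "a vector database stores vectors for similarity search",
]
index = [(doc, embed(doc)) for doc in documents]


# --- Query phase (each question): embed, search, keep the top-k ---
def retrieve(question: str, k: int = 2) -> list:
    query_vec = embed(question)
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]


question = "how do embeddings turn text into numbers"
top = retrieve(question, k=1)
print(top[0])
```

The indexing work happens once, while `retrieve` is cheap enough to run per question; the rest of the lab swaps each toy piece for a real component.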

Let's build each component step by step!

## Part 3: Loading Documents

### 🧒 ELI5: Document Loading

> **Think of this like organizing your study materials.**
>
> Before you can study, you need to gather all your textbooks, notes, and handouts. Some are PDFs, some are Word docs, some are web pages. The document loader reads all these different formats and converts them into a standard format we can work with.

Let's load our sample documents:

In [None]:
# Path to our sample documents
DOCS_PATH = Path("../data/sample_documents")

# Check what documents we have
if DOCS_PATH.exists():
    docs_list = list(DOCS_PATH.glob("*.md"))
    print(f"📁 Found {len(docs_list)} markdown documents:")
    for doc in docs_list:
        size_kb = doc.stat().st_size / 1024
        print(f"   - {doc.name} ({size_kb:.1f} KB)")
else:
    print(f"⚠️ Documents directory not found at {DOCS_PATH}")
    print("Please ensure you have the sample_documents folder in ../data/")

In [None]:
def load_documents(docs_path: Path) -> List[Document]:
    """
    Load all markdown documents from a directory.
    
    Args:
        docs_path: Path to directory containing documents
        
    Returns:
        List of LangChain Document objects
    """
    documents = []
    
    # Load each markdown file
    for file_path in docs_path.glob("*.md"):
        try:
            # Read the file content
            with open(file_path, 'r', encoding='utf-8') as f:
                content = f.read()
            
            # Create a Document object with metadata
            doc = Document(
                page_content=content,
                metadata={
                    "source": file_path.name,
                    "file_path": str(file_path),
                    "file_size": file_path.stat().st_size
                }
            )
            documents.append(doc)
            print(f"✅ Loaded: {file_path.name}")
            
        except Exception as e:
            print(f"❌ Error loading {file_path.name}: {e}")
    
    return documents

# Load our documents
raw_documents = load_documents(DOCS_PATH)
print(f"\n📚 Total documents loaded: {len(raw_documents)}")

In [None]:
# Let's peek at one document
if raw_documents:
    sample_doc = raw_documents[0]
    print(f"📄 Sample Document: {sample_doc.metadata['source']}")
    print(f"   Length: {len(sample_doc.page_content):,} characters")
    print(f"\n   First 500 characters:")
    print("   " + "-" * 60)
    print(sample_doc.page_content[:500])
    print("   ...")

### 🔍 What Just Happened?

We loaded our documents into LangChain's `Document` format, which has:
- `page_content`: The actual text
- `metadata`: Information about the source (filename, path, etc.)

Metadata is crucial for RAG because it lets us cite sources!
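
As a rough illustration of why metadata enables citation, the sketch below uses a hypothetical `SimpleDocument` dataclass (not LangChain's class) with the same two fields. The helper `split_with_metadata` and the sample text are assumptions made for this example; the point is only that every piece cut from a document keeps a pointer back to its source file.

```python
from dataclasses import dataclass, field


@dataclass
class SimpleDocument:
    """Hypothetical stand-in with the two fields described above."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def split_with_metadata(doc: SimpleDocument, size: int) -> list:
    """Cut a document into fixed-size pieces; each piece inherits the metadata."""
    return [
        SimpleDocument(
            page_content=doc.page_content[i:i + size],
            metadata={**doc.metadata, "start": i},  # copy, then add the offset
        )
        for i in range(0, len(doc.page_content), size)
    ]


doc = SimpleDocument(
    "Refunds are processed within five business days.",
    {"source": "policy.md"},
)
pieces = split_with_metadata(doc, size=20)

# Any retrieved piece can be traced back to the file it came from
citations = sorted({piece.metadata["source"] for piece in pieces})
print(citations)
```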

---

## Part 4: Chunking Documents

### 🧒 ELI5: Why Chunk?

> **Imagine you're looking for a recipe in a giant cookbook.**
>
> You wouldn't read the entire 500-page book to find how to make cookies. You'd look in the "Desserts" chapter, find the "Cookies" section, and read just those 2-3 pages.
>
> Chunking is breaking our documents into "cookie-sized" pieces so we can retrieve just the relevant parts. Too small = missing context. Too big = including irrelevant info.

### Chunking Strategy

We'll use **RecursiveCharacterTextSplitter** which:
1. Tries to split on double newlines (paragraph breaks)
2. Falls back to single newlines
3. Falls back to sentence boundaries (approximated by `". "`)
4. Falls back to words (spaces), and finally to individual characters

This preserves semantic boundaries as much as possible.
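
The fallback order can be sketched in plain Python. This is a simplified illustration of the idea, not LangChain's implementation: the function name `recursive_split` is hypothetical, it ignores overlap, and it drops the separator it splits on (so a sentence-ending period can be lost).

```python
def recursive_split(text: str, chunk_size: int, separators: list) -> list:
    """Split text into pieces of at most chunk_size characters,
    trying coarse separators first and finer ones only when needed."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators:
        # Last resort: hard cut at the character level
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    pieces = text.split(sep)
    if len(pieces) == 1:
        # Separator not present: try the next, finer one
        return recursive_split(text, chunk_size, finer)

    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep packing pieces into the current chunk
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # A single piece is still too large: recurse with finer separators
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


text = (
    "Para one is short.\n\n"
    "Para two is a bit longer and has two sentences. Here is the second one."
)
chunks = recursive_split(text, chunk_size=40, separators=["\n\n", "\n", ". ", " "])
for chunk in chunks:
    print(len(chunk), repr(chunk))
```

The short paragraph survives intact, while the long one is broken at a sentence boundary first and at word boundaries only where a sentence alone exceeds the limit.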

In [None]:
# Configure our chunking strategy
CHUNK_SIZE = 512       # Target size in characters
CHUNK_OVERLAP = 50     # Overlap between chunks

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=CHUNK_SIZE,
    chunk_overlap=CHUNK_OVERLAP,
    length_function=len,
    separators=["\n\n", "\n", ". ", " ", ""]  # Priority order
)

print(f"📏 Chunk configuration:")
print(f"   Chunk size: {CHUNK_SIZE} characters")
print(f"   Overlap: {CHUNK_OVERLAP} characters")
print(f"   Separators: paragraph → line → sentence → word")

In [None]:
# Split all documents into chunks
chunks = text_splitter.split_documents(raw_documents)

# Fail early with a clear message instead of a ZeroDivisionError below
if not chunks:
    raise ValueError(f"No chunks produced. Check that {DOCS_PATH} contains .md files.")

print(f"✂️ Chunking Results:")
print(f"   Original documents: {len(raw_documents)}")
print(f"   Total chunks: {len(chunks)}")
print(f"   Average chunks per document: {len(chunks) / len(raw_documents):.1f}")

# Analyze chunk sizes
chunk_sizes = [len(c.page_content) for c in chunks]
print(f"\n📊 Chunk Size Statistics:")
print(f"   Min: {min(chunk_sizes)} chars")
print(f"   Max: {max(chunk_sizes)} chars")
print(f"   Average: {sum(chunk_sizes)/len(chunk_sizes):.0f} chars")

In [None]:
# Let's examine a few chunks to see how splitting worked
print("📝 Sample Chunks:")
print("=" * 70)

for i in [0, 5, 10]:  # Look at chunks 0, 5, and 10
    if i < len(chunks):
        chunk = chunks[i]
        print(f"\n🔹 Chunk {i} (from {chunk.metadata['source']}):")
        print(f"   Length: {len(chunk.page_content)} chars")
        print(f"   Content (first 200 chars):")
        print(f"   '{chunk.page_content[:200]}...'")
        print("-" * 70)

### ✋ Try It Yourself: Experiment with Chunk Sizes

What happens if you use different chunk sizes?

1. Try `CHUNK_SIZE = 256` (smaller chunks)
2. Try `CHUNK_SIZE = 1024` (larger chunks)

What do you notice about the number of chunks and their content?

<details>
<summary>💡 Hint</summary>

- Smaller chunks = more chunks, more precise retrieval, but less context
- Larger chunks = fewer chunks, more context, but might include irrelevant info
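- The sweet spot is usually 256-512 tokens (roughly 1,000-2,000 characters of English text, at about 4 characters per token); note that `CHUNK_SIZE` in this lab is measured in characters, not tokens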
</details>
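
Before re-running the splitter, a back-of-the-envelope estimate can show how chunk size drives the chunk count. The sketch below assumes plain fixed-size windows that stop once the end of the text is covered; `estimate_chunk_count` is a hypothetical helper, and the real splitter respects separators, so its counts will differ somewhat.

```python
import math


def estimate_chunk_count(doc_chars: int, chunk_size: int, overlap: int) -> int:
    """Estimate how many fixed-size windows cover a text of doc_chars characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    if doc_chars <= 0:
        return 0
    if doc_chars <= chunk_size:
        return 1
    stride = chunk_size - overlap  # each new chunk advances by this much
    return math.ceil((doc_chars - overlap) / stride)


# A hypothetical 10,000-character document with a 50-character overlap
estimates = {size: estimate_chunk_count(10_000, size, 50) for size in (256, 512, 1024)}
for size, count in estimates.items():
    print(f"chunk_size={size:>4}: about {count} chunks")
```

Halving the chunk size slightly more than doubles the number of chunks, because the fixed overlap eats a larger share of each small chunk.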

---

## Part 5: Creating Embeddings

### 🧒 ELI5: What are Embeddings?

> **Imagine organizing books in a library by "vibes" instead of alphabetically.**
>
> Books about love go near each other, books about war go near each other, sci-fi books cluster together. Even though "Romeo and Juliet" and "Pride and Prejudice" have different titles, they'd be placed near each other because they're both love stories.
>
> Embeddings convert text into numbers (a long list of numbers called a "vector") that capture its meaning. Similar meanings → similar numbers → close together in "embedding space."
>
> When you search for "How does the GPU work?", the embedding of your question will be close to chunks about GPU architecture, even if they don't use the exact words "GPU work."

### Loading the Embedding Model

In [None]:
# We'll use BGE-large, a widely used open-source English embedding model
EMBEDDING_MODEL = "BAAI/bge-large-en-v1.5"

print(f"🔄 Loading embedding model: {EMBEDDING_MODEL}")
print("   This may take a minute on first run (downloading ~1.3GB)...")

start_time = time.time()

# Initialize with GPU support
embedding_model = HuggingFaceEmbeddings(
    model_name=EMBEDDING_MODEL,
    model_kwargs={"device": "cuda" if torch.cuda.is_available() else "cpu"},
    encode_kwargs={
        "normalize_embeddings": True,  # For cosine similarity
        "batch_size": 32  # Process multiple chunks at once
    }
)

load_time = time.time() - start_time
print(f"✅ Model loaded in {load_time:.1f} seconds")
print(f"   Running on: {'GPU 🚀' if torch.cuda.is_available() else 'CPU'}")

In [None]:
# Let's test the embedding model
test_texts = [
    "How do I train a neural network?",
    "What is deep learning?",
    "The weather is nice today."
]

test_embeddings = embedding_model.embed_documents(test_texts)

print(f"üìä Embedding Dimensions: {len(test_embeddings[0])}")
print(f"   (Each chunk becomes a vector of {len(test_embeddings[0])} numbers)")

# Calculate similarity between our test texts
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(f"\n🔗 Similarity Scores:")
print(f"   '{test_texts[0]}' ↔ '{test_texts[1]}'")
print(f"   Similarity: {cosine_similarity(test_embeddings[0], test_embeddings[1]):.3f}")
print(f"   (High = similar topics)")

print(f"\n   '{test_texts[0]}' ↔ '{test_texts[2]}'")
print(f"   Similarity: {cosine_similarity(test_embeddings[0], test_embeddings[2]):.3f}")
print(f"   (Low = different topics)")

### 🔍 What Just Happened?

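The embedding model converted each text into a 1024-dimensional vector. Notice:
- "Neural network training" and "deep learning" are semantically related → higher cosine similarity
- "Neural network training" and "weather" are unrelated → noticeably lower cosine similarity

The exact values depend on the embedding model, so compare the two scores with each other rather than reading either one as an absolute measure.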

This is how RAG finds relevant documents even when they don't share exact keywords!

---

## Part 6: Storing in Vector Database

### 🧒 ELI5: Vector Database

> **Remember our library organized by "vibes"?**
>
> A vector database is like a magical librarian who knows exactly where every book is in this vibes-based organization. When you say "I want a book about romance", they instantly point you to the right section.
>
> Technically, vector databases use special algorithms (like HNSW or IVF) to quickly find the nearest neighbors to your query vector without checking every single document.

### Creating the ChromaDB Collection

In [None]:
# Directory to store our vector database
CHROMA_PATH = "./chroma_db"

# Clean up any existing database (for fresh start)
import shutil
if Path(CHROMA_PATH).exists():
    shutil.rmtree(CHROMA_PATH)
    print(f"🗑️ Removed existing database at {CHROMA_PATH}")

print(f"\n📦 Creating vector database...")
print(f"   Embedding {len(chunks)} chunks...")

start_time = time.time()

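# Create the vector store
# Chroma defaults to squared-L2 distance; request cosine distance so that
# `1 - score` in the cells below is a cosine similarity.
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embedding_model,
    persist_directory=CHROMA_PATH,
    collection_name="rag_documents",
    collection_metadata={"hnsw:space": "cosine"}
)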

index_time = time.time() - start_time
print(f"\n✅ Vector database created in {index_time:.1f} seconds!")
print(f"   Stored at: {CHROMA_PATH}")
print(f"   Collection: rag_documents")
print(f"   Chunks indexed: {len(chunks)}")

In [None]:
# Let's test retrieval with a sample query
test_query = "How much memory does DGX Spark have?"

print(f"🔍 Test Query: '{test_query}'")
print("-" * 60)

# Search for similar chunks
results = vectorstore.similarity_search_with_score(test_query, k=3)

print(f"\n📄 Top 3 Retrieved Chunks:")
for i, (doc, score) in enumerate(results):
    print(f"\n🔹 Result {i+1} (Similarity: {1-score:.3f}):")
    print(f"   Source: {doc.metadata['source']}")
    print(f"   Content: {doc.page_content[:200]}...")

### 🔍 What Just Happened?

ChromaDB:
1. Took each chunk and computed its embedding
2. Stored both the text and embedding
3. Built an index for fast similarity search

When we queried "How much memory does DGX Spark have?", it:
1. Embedded our question
2. Found the 3 most similar chunks
3. Returned them with similarity scores

Notice the retrieved chunks are about DGX Spark memory, even though we didn't use the exact same words!
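
The index exists so that a query does not have to be compared against every stored vector. For contrast, here is a sketch of the exact brute-force search that approximate indexes are designed to avoid. The 3-dimensional vectors and their labels are made up for illustration, and `brute_force_top_k` is a hypothetical helper, not a ChromaDB API.

```python
import heapq
import math


def cosine(a: list, b: list) -> float:
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def brute_force_top_k(query: list, vectors: dict, k: int) -> list:
    """Exact search: score every stored vector and keep the k best.
    Cost grows linearly with the number of stored vectors."""
    scored = ((cosine(query, vec), key) for key, vec in vectors.items())
    return heapq.nlargest(k, scored)


# Toy "embedding space" with made-up coordinates
vectors = {
    "romance": [0.9, 0.1, 0.0],
    "war": [0.1, 0.9, 0.1],
    "sci-fi": [0.0, 0.2, 0.9],
}
results = brute_force_top_k([0.8, 0.2, 0.0], vectors, k=2)
for score, key in results:
    print(f"{key}: {score:.3f}")
```

A full scan like this is fine for a few thousand chunks; approximate indexes such as HNSW trade a little accuracy for much faster lookups on large collections.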

---

## Part 7: Setting Up the LLM

Now we need an LLM to generate answers based on the retrieved context. We'll use Ollama for local inference.

In [None]:
# Check Ollama connection and available models
try:
    models = ollama.list()
    print("✅ Ollama is running!")
    print(f"\n📋 Available models:")
    # ollama>=0.4 returns typed objects; the model tag is in `.model` (not `name`)
    for m in models.models:
        size_gb = (m.size or 0) / 1e9
        print(f"   - {m.model} ({size_gb:.1f} GB)")
except Exception as e:
    print(f"❌ Could not reach Ollama: {e}")
    print("\n💡 To start Ollama:")
    print("   1. Open a terminal")
    print("   2. Run: ollama serve")
    print("   3. In another terminal: ollama pull qwen3:8b")

In [None]:
# Select model to use (modify based on what you have installed)
LLM_MODEL = "qwen3:8b"  # Good balance of quality and speed
# Alternatives: "qwen3:32b" (better quality), "nemotron-3-nano" (faster)

print(f"📝 Using model: {LLM_MODEL}")

# Test the model with a simple query
try:
    test_response = ollama.chat(
        model=LLM_MODEL,
        messages=[{"role": "user", "content": "Say 'Hello!' in exactly one word."}]
    )
    print(f"✅ Model response: {test_response['message']['content']}")
except Exception as e:
    print(f"❌ Error with model {LLM_MODEL}: {e}")
    print(f"\n💡 Try: ollama pull {LLM_MODEL}")

---

## Part 8: Building the RAG Pipeline

Now we'll combine everything into a complete RAG system!

### The RAG Query Function

In [None]:
def rag_query(
    question: str,
    vectorstore: Chroma,
    model: str = "qwen3:8b",
    k: int = 5,
    verbose: bool = True
) -> Dict[str, Any]:
    """
    Complete RAG query: retrieve relevant chunks and generate an answer.
    
    Args:
        question: User's question
        vectorstore: ChromaDB vector store
        model: Ollama model to use
        k: Number of chunks to retrieve
        verbose: Print intermediate steps
        
    Returns:
        Dictionary with answer, sources, and timing info
    """
    start_time = time.time()
    
    # Step 1: Retrieve relevant chunks
    if verbose:
        print(f"🔍 Retrieving top {k} relevant chunks...")
    
    retrieval_start = time.time()
    results = vectorstore.similarity_search_with_score(question, k=k)
    retrieval_time = time.time() - retrieval_start
    
    if verbose:
        print(f"   Retrieved {len(results)} chunks in {retrieval_time*1000:.0f}ms")
    
    # Step 2: Build context from retrieved chunks
    context_parts = []
    sources = []
    
    for doc, score in results:
        # Label each chunk with its source so the LLM can actually cite it
        context_parts.append(f"[Source: {doc.metadata['source']}]\n{doc.page_content}")
        sources.append({
            "source": doc.metadata["source"],
            "similarity": 1 - score,  # Convert distance to similarity
            "preview": doc.page_content[:100] + "..."
        })
    
    context = "\n\n---\n\n".join(context_parts)
    
    # Step 3: Create the RAG prompt
    prompt = f"""You are a helpful AI assistant. Answer the question based ONLY on the provided context.
If the context doesn't contain enough information to answer the question, say so.
Always cite which source document your information comes from.

CONTEXT:
{context}

QUESTION: {question}

ANSWER (be concise and cite sources):"""
    
    # Step 4: Generate answer with LLM
    if verbose:
        print(f"💭 Generating answer with {model}...")
    
    generation_start = time.time()
    response = ollama.chat(
        model=model,
        messages=[{"role": "user", "content": prompt}]
    )
    generation_time = time.time() - generation_start
    
    answer = response["message"]["content"]
    total_time = time.time() - start_time
    
    if verbose:
        print(f"   Generated in {generation_time:.1f}s")
        print(f"   Total time: {total_time:.1f}s")
    
    return {
        "question": question,
        "answer": answer,
        "sources": sources,
        "timing": {
            "retrieval_ms": retrieval_time * 1000,
            "generation_s": generation_time,
            "total_s": total_time
        }
    }

### Let's Test Our RAG System!

In [None]:
# Test Question 1: DGX Spark specifications
question1 = "What is the memory capacity of the DGX Spark and why is it special?"

print("=" * 70)
print(f"❓ Question: {question1}")
print("=" * 70)

result1 = rag_query(question1, vectorstore, model=LLM_MODEL)

print(f"\n💬 Answer:")
print(result1["answer"])

print(f"\n📚 Sources Used:")
for src in result1["sources"][:3]:  # Show top 3 sources
    print(f"   - {src['source']} (similarity: {src['similarity']:.2f})")

In [None]:
# Test Question 2: Technical concept
question2 = "How does LoRA reduce memory requirements for fine-tuning?"

print("=" * 70)
print(f"❓ Question: {question2}")
print("=" * 70)

result2 = rag_query(question2, vectorstore, model=LLM_MODEL)

print(f"\n💬 Answer:")
print(result2["answer"])

print(f"\n📚 Sources Used:")
for src in result2["sources"][:3]:
    print(f"   - {src['source']} (similarity: {src['similarity']:.2f})")

In [None]:
# Test Question 3: Question that should NOT be in our documents
question3 = "What is the recipe for chocolate chip cookies?"

print("=" * 70)
print(f"❓ Question: {question3}")
print("=" * 70)

result3 = rag_query(question3, vectorstore, model=LLM_MODEL)

print(f"\n💬 Answer:")
print(result3["answer"])

print(f"\n📚 Sources Retrieved (not relevant):")
for src in result3["sources"][:3]:
    print(f"   - {src['source']} (similarity: {src['similarity']:.2f})")

### 🔍 What Just Happened?

Our RAG pipeline:
1. **Embedded** the question using the same model as our documents
2. **Retrieved** the 5 most similar chunks from ChromaDB
3. **Constructed** a prompt with the question and retrieved context
4. **Generated** an answer using the LLM, grounded in the retrieved documents

Notice:
- Questions about our docs should get accurate answers grounded in the retrieved chunks
- Questions NOT in our docs (like cookies) should get a reply saying the info isn't available; check the output, since LLM behavior is not guaranteed
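
The prompt-construction step can be isolated into a small function that is easy to test without Ollama or a vector store. The sketch below is one possible shape, not the lab's exact code: `build_rag_prompt`, the `(source, text)` pair format, and the sample strings are all assumptions for illustration.

```python
def build_rag_prompt(question: str, retrieved: list) -> str:
    """Build a grounded prompt.

    retrieved: list of (source_name, chunk_text) pairs, best match first.
    Each chunk is labeled with its source so the model has something to cite.
    """
    blocks = [f"[Source: {source}]\n{text}" for source, text in retrieved]
    context = "\n\n---\n\n".join(blocks)
    return (
        "Answer the question based ONLY on the provided context.\n"
        "If the context doesn't contain enough information, say so.\n"
        "Cite the source name shown in brackets.\n\n"
        f"CONTEXT:\n{context}\n\n"
        f"QUESTION: {question}\n\n"
        "ANSWER:"
    )


# Made-up chunks, only to show the resulting layout
prompt = build_rag_prompt(
    "How is vacation accrued?",
    [
        ("handbook.md", "Employees accrue vacation monthly."),
        ("faq.md", "Unused vacation carries over."),
    ],
)
print(prompt)
```

Keeping this step separate makes it straightforward to inspect exactly what the LLM sees when an answer looks wrong.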

---

## Part 9: Interactive RAG Demo

Let's create an interactive demo where you can ask your own questions!

In [None]:
def interactive_rag(vectorstore: Chroma, model: str = "qwen3:8b"):
    """
    Interactive RAG session - ask questions about your documents!
    Type 'quit' to exit.
    """
    print("\n" + "=" * 70)
    print("🤖 RAG Knowledge Assistant")
    print("=" * 70)
    print("Ask questions about the loaded documents!")
    print("Type 'quit' to exit.\n")
    
    while True:
        question = input("\n❓ Your question: ").strip()
        
        if question.lower() in ['quit', 'exit', 'q']:
            print("\n👋 Goodbye!")
            break
        
        if not question:
            print("Please enter a question.")
            continue
        
        print()
        result = rag_query(question, vectorstore, model=model, verbose=True)
        
        print(f"\n💬 Answer:")
        print("-" * 50)
        print(result["answer"])
        print("-" * 50)
        
        print(f"\n📚 Sources:")
        for src in result["sources"][:3]:
            print(f"   - {src['source']}")

In [None]:
# Uncomment to run interactive mode
# interactive_rag(vectorstore, model=LLM_MODEL)

### ✋ Try It Yourself: Test Questions

Try asking these questions about our documents:

1. "What are the different quantization methods and their trade-offs?"
2. "Explain how the attention mechanism works in Transformers"
3. "What index types are available in FAISS?"
4. "How do I choose between ChromaDB, FAISS, and Qdrant?"

---

## ⚠️ Common Mistakes

### Mistake 1: Chunks Too Small
```python
# ❌ Wrong: Tiny chunks lose context
splitter = RecursiveCharacterTextSplitter(chunk_size=100, chunk_overlap=0)

# ✅ Right: Balanced chunk size with overlap
splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=50)
```
**Why:** Tiny chunks might only contain "The model uses..." without explaining WHAT it uses.

### Mistake 2: No Overlap Between Chunks
```python
# ❌ Wrong: No overlap
splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=0)

# ✅ Right: 10-20% overlap
splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=50)
```
**Why:** Without overlap, a sentence that straddles a chunk boundary is cut in two, so no single chunk contains it whole.
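
A small experiment makes the boundary problem visible. This sketch uses plain fixed-size character windows instead of the recursive splitter, and `window_chunks` is a hypothetical helper; the filler text is constructed so that the marker phrase lands exactly on a chunk boundary.

```python
def window_chunks(text: str, size: int, overlap: int) -> list:
    """Fixed-size character windows; consecutive windows share `overlap` characters."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    stride = size - overlap
    return [text[i:i + size] for i in range(0, len(text), stride)]


# The marker occupies characters 95-103, straddling the boundary at 100
text = "A" * 95 + "KEYPHRASE" + "B" * 96

no_overlap = window_chunks(text, size=100, overlap=0)
with_overlap = window_chunks(text, size=100, overlap=20)

found_without = any("KEYPHRASE" in chunk for chunk in no_overlap)
found_with = any("KEYPHRASE" in chunk for chunk in with_overlap)
print(f"Found whole phrase without overlap: {found_without}")
print(f"Found whole phrase with overlap:    {found_with}")
```

With no overlap the phrase is split between two chunks; with a 20-character overlap the second window starts early enough to contain it whole.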

### Mistake 3: Not Normalizing Embeddings
```python
# ❌ Wrong: Embeddings not normalized
embedding_model = HuggingFaceEmbeddings(model_name="...")

# ✅ Right: Normalize for cosine similarity
embedding_model = HuggingFaceEmbeddings(
    model_name="...",
    encode_kwargs={"normalize_embeddings": True}
)
```
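**Why:** Cosine similarity itself ignores vector length, but many vector stores rank by dot product or L2 distance. With unit-length vectors, all three produce the same ranking, and distances convert cleanly to similarities.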
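
A quick numeric check shows what normalization actually changes. The two vectors below are arbitrary illustrative values; for unit-length vectors, the dot product equals the cosine similarity and the squared L2 distance equals `2 - 2 * cosine`.

```python
import math


def norm(v: list) -> float:
    return math.sqrt(sum(x * x for x in v))


def normalize(v: list) -> list:
    n = norm(v)
    return [x / n for x in v]


def dot(a: list, b: list) -> float:
    return sum(x * y for x, y in zip(a, b))


def cosine(a: list, b: list) -> float:
    return dot(a, b) / (norm(a) * norm(b))


def squared_l2(a: list, b: list) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


a, b = [3.0, 4.0], [4.0, 3.0]
ua, ub = normalize(a), normalize(b)

cos_ab = cosine(a, b)
print(f"cosine(a, b)              = {cos_ab:.2f}")
print(f"raw dot(a, b)             = {dot(a, b):.2f}   (depends on vector length)")
print(f"dot(unit a, unit b)       = {dot(ua, ub):.2f}   (equals the cosine)")
print(f"squared L2(unit a, unit b) = {squared_l2(ua, ub):.2f}   (equals 2 - 2*cosine)")
```

This is why a distance returned by a vector store can only be turned into a similarity once you know which metric the store uses.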

### Mistake 4: Not Handling Missing Information
```python
# ❌ Wrong: No guidance for missing info
prompt = f"Answer this: {question}\nContext: {context}"

# ✅ Right: Tell LLM to acknowledge limitations
prompt = f"""Answer based ONLY on the context.
If the context doesn't have the answer, say 'I don't have information about that.'

Context: {context}
Question: {question}"""
```
**Why:** Without explicit instruction, the LLM may hallucinate answers.
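
A prompt instruction can be complemented by a guard in code that skips the LLM when retrieval finds nothing relevant. This is an optional extra, sketched under assumptions: `answer_or_refuse` is a hypothetical helper, `generate` stands in for the LLM call, and the `0.5` threshold is a placeholder that would need tuning for a given embedding model.

```python
NO_ANSWER = "I don't have information about that."


def answer_or_refuse(scored_chunks: list, generate, min_similarity: float = 0.5) -> str:
    """Return a fixed refusal when no retrieved chunk is similar enough.

    scored_chunks: list of (similarity, chunk_text) pairs.
    generate: callable taking the context string and returning an answer.
    """
    relevant = [text for similarity, text in scored_chunks if similarity >= min_similarity]
    if not relevant:
        return NO_ANSWER  # the LLM is never called, so it cannot hallucinate
    return generate("\n\n".join(relevant))


# Stand-in for the LLM that records what it was asked
calls = []

def fake_generate(context: str) -> str:
    calls.append(context)
    return "generated answer"


weak = answer_or_refuse([(0.2, "chunk a"), (0.1, "chunk b")], fake_generate)
calls_after_weak = len(calls)
strong = answer_or_refuse([(0.9, "chunk a"), (0.1, "chunk b")], fake_generate)
print(weak)
print(strong)
```

The guard also saves generation time on off-topic questions, at the cost of occasionally refusing when the threshold is set too high.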

---

## 🎉 Checkpoint

Congratulations! You've built a complete RAG system from scratch!

You've learned:
- ✅ What RAG is and why it's essential for practical LLM applications
- ✅ How to load and preprocess documents
- ✅ How chunking works and why overlap matters
- ✅ How embeddings capture semantic meaning
- ✅ How vector databases enable fast similarity search
- ✅ How to combine retrieval with generation for grounded answers

---

## 🚀 Challenge (Optional)

### Challenge 1: Add More Documents
Add your own documents (PDF, Word, or text) to the knowledge base.

### Challenge 2: Implement Conversation History
Modify `rag_query` to maintain conversation context for follow-up questions.

### Challenge 3: Add Source Citations
Modify the prompt to include specific page/chunk references in the answer.

### Challenge 4: Measure Quality
Create 10 Q&A pairs with ground-truth answers and measure how well your RAG system performs.

---

## 📖 Further Reading

- [RAG Paper](https://arxiv.org/abs/2005.11401) - The original RAG research paper
- [LangChain RAG Guide](https://python.langchain.com/docs/use_cases/question_answering/)
- [ChromaDB Documentation](https://docs.trychroma.com/)
- [BGE Embedding Models](https://huggingface.co/BAAI/bge-large-en-v1.5)
- [Ollama Documentation](https://ollama.ai/)

---

## 🧹 Cleanup

Free up GPU memory and clean up resources.

In [None]:
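# Clean up
# The vector store holds a reference to the embedding model, so drop both
del vectorstore, embedding_model
gc.collect()
if torch.cuda.is_available():
    torch.cuda.empty_cache()

print("✅ Cleanup complete!")
print(f"\n💡 Note: The vector database persists at {CHROMA_PATH}")
print("   You can reload it later without re-embedding!")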

---

## Next Steps

In the next lab, we'll explore different **chunking strategies** to improve retrieval quality!

➡️ Continue to [Lab 3.5.2: Chunking Strategies](./lab-3.5.2-chunking.ipynb)