# Lab 3.5.3: Vector Database Comparison

**Module:** 3.5 - RAG Systems & Vector Databases  
**Time:** 2 hours  
**Difficulty:** ⭐⭐⭐ (Intermediate)

---

## 🎯 Learning Objectives

By the end of this notebook, you will:

- [ ] Implement the same RAG pipeline with ChromaDB, FAISS, and Qdrant
- [ ] Benchmark indexing time, query latency, and memory usage
- [ ] Understand GPU acceleration with FAISS on DGX Spark
- [ ] Implement metadata filtering with each database
- [ ] Know when to choose each vector database

---

## 📚 Prerequisites

- Completed: Lab 3.5.1-3.5.2
- DGX Spark with GPU (GPU-accelerated FAISS additionally requires a GPU-enabled FAISS build, as noted in Part 1)

---

## 🌍 Real-World Context

**The Decision:** You've built your RAG prototype with ChromaDB. Now your startup is scaling: you need to handle 10 million documents and 1,000 queries/second. Which vector database should you use in production?

**Trade-offs:**
- **ChromaDB**: Easy to use, but slower at scale
- **FAISS**: Blazing fast with GPU, but no built-in filtering
- **Qdrant**: Production features, but more complex setup

**The Goal:** Make an informed decision based on YOUR requirements.

---

## 🧒 ELI5: Vector Databases

> **Imagine three different librarians helping you find books:**
>
> **ChromaDB Librarian** 📚: Friendly and helpful. Knows where everything is, can handle special requests ("only books from 2023"). Works well for small libraries.
>
> **FAISS Librarian** ⚡: SUPER fast because they memorized the entire library layout. But if you want special requests, you have to do that yourself. Best for HUGE libraries.
>
> **Qdrant Librarian** 🏗️: Professional grade. Handles special requests, works with a team (distributed), keeps detailed records. Perfect for a large organization.
>
> On your DGX Spark, the FAISS librarian gets a jetpack (GPU) and becomes even faster!

---

## Part 1: Setup

In [None]:
# Install all vector databases
# Note: On DGX Spark (ARM64), use faiss-cpu. GPU acceleration via faiss-gpu
# requires building from source or using the NGC container with pre-built binaries.
!pip install -q \
    chromadb==0.5.23 \
    faiss-cpu==1.9.0 \
    qdrant-client==1.12.1 \
    langchain langchain-community langchain-huggingface \
    sentence-transformers

print("✅ Vector databases installed!")
print("ℹ️  Note: Using faiss-cpu. For GPU-accelerated FAISS, use NGC containers.")

In [None]:
import os
import time
import shutil
import psutil
from pathlib import Path
from typing import List, Dict, Any, Tuple
from dataclasses import dataclass
import numpy as np

# Vector databases
import chromadb
from chromadb.config import Settings
import faiss
from qdrant_client import QdrantClient
from qdrant_client.models import (
    VectorParams, Distance, PointStruct,
    Filter, FieldCondition, MatchValue
)

# LangChain wrappers
from langchain_community.vectorstores import Chroma, FAISS
from langchain_huggingface import HuggingFaceEmbeddings
from langchain.schema import Document

import torch
import gc

print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")

In [None]:
# Load and prepare documents
DOCS_PATH = Path("../data/sample_documents")

documents = []
for file_path in sorted(DOCS_PATH.glob("*.md")):
    with open(file_path, 'r', encoding='utf-8') as f:
        content = f.read()
    documents.append(Document(
        page_content=content,
        metadata={
            "source": file_path.name,
            "category": "technical",
            "file_size": file_path.stat().st_size
        }
    ))

print(f"📚 Loaded {len(documents)} documents")

In [None]:
# Chunk documents
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=50
)

chunks = splitter.split_documents(documents)
print(f"✂️ Created {len(chunks)} chunks")

In [None]:
# Load embedding model
print("🔄 Loading embedding model...")

embedding_model = HuggingFaceEmbeddings(
    model_name="BAAI/bge-large-en-v1.5",
    model_kwargs={"device": "cuda" if torch.cuda.is_available() else "cpu"},
    encode_kwargs={"normalize_embeddings": True, "batch_size": 32}
)

# Pre-compute embeddings for fair comparison
print("📊 Pre-computing embeddings...")
start = time.time()
chunk_texts = [c.page_content for c in chunks]
chunk_embeddings = embedding_model.embed_documents(chunk_texts)
embedding_time = time.time() - start

print(f"✅ Computed {len(chunk_embeddings)} embeddings in {embedding_time:.2f}s")
print(f"   Embedding dimension: {len(chunk_embeddings[0])}")

---

## Part 2: ChromaDB Implementation

ChromaDB is Python-native and easy to use. Let's benchmark it.

### Utility Functions

Before benchmarking, we need some helper functions:

**psutil** is a Python library for system monitoring:
- `psutil.Process().memory_info().rss` - Returns the Resident Set Size (physical memory used by the process) in bytes
- Useful for measuring memory consumption of vector databases

In [None]:
@dataclass
class BenchmarkResult:
    """Benchmark results for a vector database."""
    db_name: str
    index_time_s: float
    avg_query_time_ms: float
    min_query_time_ms: float
    max_query_time_ms: float
    memory_mb: float
    supports_filtering: bool
    supports_gpu: bool


def get_memory_usage() -> float:
    """
    Get current process memory usage in MB.
    
    Uses psutil.Process().memory_info().rss which returns:
    - RSS (Resident Set Size): Physical memory currently used by the process
    - Measured in bytes, so we divide by 1024² to get megabytes
    """
    process = psutil.Process()
    return process.memory_info().rss / 1024 / 1024


test_queries = [
    "What is the memory capacity of DGX Spark?",
    "How does LoRA reduce training requirements?",
    "Explain the attention mechanism",
    "What is GPTQ quantization?",
    "How does hybrid search work in RAG?"
]

In [None]:
def benchmark_chromadb(
    chunks: List[Document],
    embeddings: List[List[float]],
    test_queries: List[str]
) -> BenchmarkResult:
    """
    Benchmark ChromaDB performance.
    """
    db_path = "./benchmark_chroma"
    if Path(db_path).exists():
        shutil.rmtree(db_path)
    
    memory_before = get_memory_usage()
    
    # Time indexing
    start = time.time()
    
    client = chromadb.PersistentClient(path=db_path)
    collection = client.get_or_create_collection(
        name="documents",
        metadata={"hnsw:space": "cosine"}
    )
    
    # Add documents with embeddings
    collection.add(
        ids=[f"doc_{i}" for i in range(len(chunks))],
        embeddings=embeddings,
        documents=[c.page_content for c in chunks],
        metadatas=[c.metadata for c in chunks]
    )
    
    index_time = time.time() - start
    memory_after = get_memory_usage()
    
    # Time queries
    query_times = []
    for query in test_queries:
        query_emb = embedding_model.embed_query(query)
        
        start = time.time()
        results = collection.query(
            query_embeddings=[query_emb],
            n_results=5
        )
        query_times.append((time.time() - start) * 1000)
    
    # Test filtering
    query_emb = embedding_model.embed_query(test_queries[0])
    filtered = collection.query(
        query_embeddings=[query_emb],
        n_results=5,
        where={"category": "technical"}
    )
    supports_filtering = len(filtered['ids'][0]) > 0
    
    # Cleanup
    del client, collection
    shutil.rmtree(db_path)
    
    return BenchmarkResult(
        db_name="ChromaDB",
        index_time_s=index_time,
        avg_query_time_ms=np.mean(query_times),
        min_query_time_ms=min(query_times),
        max_query_time_ms=max(query_times),
        memory_mb=memory_after - memory_before,
        supports_filtering=supports_filtering,
        supports_gpu=False
    )

print("🔵 Benchmarking ChromaDB...")
chroma_result = benchmark_chromadb(chunks, chunk_embeddings, test_queries)
print(f"   Index time: {chroma_result.index_time_s:.2f}s")
print(f"   Avg query: {chroma_result.avg_query_time_ms:.2f}ms")

---

## Part 3: FAISS Implementation (Optional GPU Acceleration)

FAISS (Facebook AI Similarity Search) is a library for efficient similarity search.

### Key FAISS Concepts and Functions

| Function | Purpose |
|----------|---------|
| `faiss.IndexFlatIP(dim)` | Creates a flat (brute-force) index using Inner Product similarity. Best for small datasets. |
| `faiss.IndexFlatL2(dim)` | Creates a flat index using L2 (Euclidean) distance. |
| `faiss.IndexIVFFlat(quantizer, dim, nlist)` | Creates an IVF (Inverted File) index that clusters vectors for faster approximate search. |
| `faiss.StandardGpuResources()` | Allocates GPU memory for FAISS operations. |
| `faiss.index_cpu_to_gpu(res, gpu_id, index)` | Moves a CPU index to GPU for acceleration. |
| `index.add(vectors)` | Adds vectors to the index. |
| `index.search(query, k)` | Finds k nearest neighbors. Returns (distances, indices). |

**Inner Product (IP) vs L2:** For normalized vectors, IP is equivalent to cosine similarity. We use IP because our embeddings are normalized.
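
To make the equivalence concrete, here is a small pure-Python check (the helper names are illustrative, not FAISS functions). For unit-length vectors the inner product equals the cosine similarity, and the squared L2 distance is `2 - 2 * inner_product`, so ranking by either metric gives the same order.

```python
import math
from typing import List


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def normalize(v: List[float]) -> List[float]:
    """Scale v to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine(a: List[float], b: List[float]) -> float:
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def squared_l2(a: List[float], b: List[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


a = normalize([3.0, 4.0, 0.0])
b = normalize([1.0, 2.0, 2.0])

ip = dot(a, b)
print(f"inner product : {ip:.6f}")
print(f"cosine        : {cosine(a, b):.6f}")
print(f"squared L2    : {squared_l2(a, b):.6f}")
print(f"2 - 2 * ip    : {2 - 2 * ip:.6f}")
```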

In [None]:
def benchmark_faiss(
    chunks: List[Document],
    embeddings: List[List[float]],
    test_queries: List[str],
    use_gpu: bool = True
) -> BenchmarkResult:
    """
    Benchmark FAISS performance with optional GPU acceleration.
    
    FAISS Index Types Used:
    - IndexFlatIP: Exact search using inner product (cosine similarity for normalized vectors)
      Syntax: faiss.IndexFlatIP(dimension) where dimension = embedding size (e.g., 1024)
    
    GPU Acceleration (requires faiss-gpu):
    - faiss.StandardGpuResources(): Allocates GPU memory pool for FAISS
    - faiss.index_cpu_to_gpu(resources, gpu_id, cpu_index): Moves index to GPU
    
    Note: GPU acceleration requires faiss-gpu which must be built from source
    on ARM64 or used via NGC containers. With faiss-cpu, GPU is not available.
    """
    memory_before = get_memory_usage()
    
    # Convert to numpy - FAISS requires float32 numpy arrays
    embeddings_np = np.array(embeddings).astype('float32')
    dimension = embeddings_np.shape[1]
    
    # Check if GPU functions are available in faiss
    has_gpu_support = hasattr(faiss, 'StandardGpuResources')
    
    # Time indexing
    start = time.time()
    
    # Create index - IndexFlatIP for inner product (cosine similarity with normalized vectors)
    index = faiss.IndexFlatIP(dimension)
    
    gpu_used = False
    if use_gpu and torch.cuda.is_available() and has_gpu_support:
        try:
            # Move to GPU (only available with faiss-gpu)
            # StandardGpuResources manages GPU memory allocation
            res = faiss.StandardGpuResources()
            # index_cpu_to_gpu(resources, gpu_device_id, cpu_index)
            index = faiss.index_cpu_to_gpu(res, 0, index)
            gpu_used = True
        except Exception as e:
            print(f"   ⚠️ GPU acceleration not available: {e}")
    
    # Add vectors to the index
    index.add(embeddings_np)
    
    index_time = time.time() - start
    memory_after = get_memory_usage()
    
    # Time queries
    query_times = []
    for query in test_queries:
        query_emb = embedding_model.embed_query(query)
        query_np = np.array([query_emb]).astype('float32')
        
        start = time.time()
        # search returns (distances, indices) - distances shape: (n_queries, k)
        distances, indices = index.search(query_np, k=5)
        query_times.append((time.time() - start) * 1000)
    
    # FAISS doesn't have built-in filtering
    supports_filtering = False
    
    return BenchmarkResult(
        db_name=f"FAISS ({'GPU' if gpu_used else 'CPU'})",
        index_time_s=index_time,
        avg_query_time_ms=np.mean(query_times),
        min_query_time_ms=min(query_times),
        max_query_time_ms=max(query_times),
        memory_mb=memory_after - memory_before,
        supports_filtering=supports_filtering,
        supports_gpu=gpu_used
    )

print("🟢 Benchmarking FAISS (attempting GPU)...")
faiss_gpu_result = benchmark_faiss(chunks, chunk_embeddings, test_queries, use_gpu=True)
print(f"   Mode: {faiss_gpu_result.db_name}")
print(f"   Index time: {faiss_gpu_result.index_time_s:.3f}s")
print(f"   Avg query: {faiss_gpu_result.avg_query_time_ms:.3f}ms")

print("\n🟡 Benchmarking FAISS (CPU only)...")
faiss_cpu_result = benchmark_faiss(chunks, chunk_embeddings, test_queries, use_gpu=False)
print(f"   Index time: {faiss_cpu_result.index_time_s:.3f}s")
print(f"   Avg query: {faiss_cpu_result.avg_query_time_ms:.3f}ms")

### 🔍 GPU Acceleration Deep Dive

Let's compare index types directly. The Flat index runs on the GPU when a GPU-enabled FAISS build is available; the IVF index in this cell stays on the CPU.

**FAISS Index Types:**
- **Flat (Exact)**: Compares query to ALL vectors. Accurate but slow for large datasets.
- **IVF (Approximate)**: Clusters vectors into `nlist` groups. At query time, only searches `nprobe` clusters.
  - `nlist`: Number of clusters. More clusters mean smaller clusters, so each probe scans fewer vectors, but training takes longer and recall drops unless `nprobe` is raised too.
  - `nprobe`: Clusters to search at query time (more = better recall, slower query)
  - Requires training with `index.train(vectors)` before adding vectors
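
The `nlist`/`nprobe` trade-off can be sketched without FAISS. The toy below is a conceptual illustration only, not how FAISS is implemented: it picks centroids by sampling existing vectors (FAISS learns them with k-means during `train`), and all function names are made up for this example. It shows that probing every cluster reproduces the exact result, while fewer probes scan fewer candidates and may miss neighbors.

```python
import math
import random
from typing import Dict, List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def normalize(v: Vector) -> Vector:
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def build_inverted_lists(vectors: List[Vector], centroids: List[Vector]) -> Dict[int, List[int]]:
    """Assign each vector id to the list of its most similar centroid."""
    lists: Dict[int, List[int]] = {c: [] for c in range(len(centroids))}
    for vid, vec in enumerate(vectors):
        best = max(range(len(centroids)), key=lambda c: dot(vec, centroids[c]))
        lists[best].append(vid)
    return lists


def flat_search(query: Vector, vectors: List[Vector], k: int) -> List[int]:
    """Exact search: score every vector."""
    ranked = sorted(((dot(query, v), i) for i, v in enumerate(vectors)), reverse=True)
    return [i for _, i in ranked[:k]]


def ivf_search(query: Vector, vectors: List[Vector], centroids: List[Vector],
               lists: Dict[int, List[int]], k: int, nprobe: int) -> List[int]:
    """Approximate search: score only vectors in the nprobe closest clusters."""
    by_similarity = sorted(range(len(centroids)),
                           key=lambda c: dot(query, centroids[c]), reverse=True)
    candidates = [vid for c in by_similarity[:nprobe] for vid in lists[c]]
    ranked = sorted(((dot(query, vectors[vid]), vid) for vid in candidates), reverse=True)
    return [vid for _, vid in ranked[:k]]


rng = random.Random(0)
vectors = [normalize([rng.gauss(0, 1) for _ in range(8)]) for _ in range(200)]
nlist = 8
centroids = rng.sample(vectors, nlist)  # simplification: sampled, not k-means
lists = build_inverted_lists(vectors, centroids)

query = normalize([rng.gauss(0, 1) for _ in range(8)])
exact = flat_search(query, vectors, k=5)
recalls = {}
for nprobe in (1, 2, 4, 8):
    approx = ivf_search(query, vectors, centroids, lists, k=5, nprobe=nprobe)
    recalls[nprobe] = len(set(exact) & set(approx)) / len(exact)
    print(f"nprobe={nprobe}: recall@5 = {recalls[nprobe]:.2f}")
```

With `nprobe` equal to `nlist` the candidate set is the whole collection, so recall is 1.0 by construction; the interesting region is the smaller `nprobe` values.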

In [None]:
def benchmark_faiss_index_types(
    embeddings: List[List[float]],
    test_queries: List[str]
) -> Dict[str, float]:
    """
    Compare different FAISS index types.
    
    Index Types Demonstrated:
    1. IndexFlatIP - Exact brute-force search (slow but accurate)
    2. IndexIVFFlat - Approximate search using clustering (faster for large datasets)
       - Created with: faiss.IndexIVFFlat(quantizer, dimension, nlist)
       - quantizer: Index used for clustering (usually IndexFlatIP)
       - nlist: Number of clusters to create
       - Must call index.train(vectors) before adding vectors
       - Set index.nprobe to control how many clusters to search
    """
    embeddings_np = np.array(embeddings).astype('float32')
    dimension = embeddings_np.shape[1]
    n_vectors = len(embeddings)
    
    results = {}
    has_gpu_support = hasattr(faiss, 'StandardGpuResources')
    
    # 1. Flat Index (exact search)
    print("   Testing Flat index (exact)...")
    index_flat = faiss.IndexFlatIP(dimension)
    
    # Try GPU if available
    if torch.cuda.is_available() and has_gpu_support:
        try:
            res = faiss.StandardGpuResources()
            index_flat = faiss.index_cpu_to_gpu(res, 0, index_flat)
            print("      (Using GPU)")
        except Exception:
            print("      (Using CPU - GPU not available)")
    else:
        print("      (Using CPU)")
    
    index_flat.add(embeddings_np)
    
    times = []
    for query in test_queries:
        query_np = np.array([embedding_model.embed_query(query)]).astype('float32')
        start = time.time()
        index_flat.search(query_np, 5)
        times.append((time.time() - start) * 1000)
    results["Flat (Exact)"] = np.mean(times)
    
    # 2. IVF Index (approximate, faster)
    print("   Testing IVF index (approximate)...")
    # nlist = number of clusters. Rule of thumb: sqrt(n_vectors) to 4*sqrt(n_vectors)
    # Clamp to at least 1 so tiny datasets do not produce nlist = 0
    nlist = max(1, min(50, n_vectors // 10))
    
    # Create quantizer (used to assign vectors to clusters)
    quantizer = faiss.IndexFlatIP(dimension)
    
    # Create IVF index: IndexIVFFlat(quantizer, dimension, nlist, metric)
    # The metric defaults to L2, so request inner product to match the quantizer
    index_ivf = faiss.IndexIVFFlat(quantizer, dimension, nlist, faiss.METRIC_INNER_PRODUCT)
    
    # IVF requires training to learn cluster centroids
    index_ivf.train(embeddings_np)
    index_ivf.add(embeddings_np)
    
    # nprobe = number of clusters to search (higher = more accurate but slower)
    index_ivf.nprobe = min(5, nlist)
    
    times = []
    for query in test_queries:
        query_np = np.array([embedding_model.embed_query(query)]).astype('float32')
        start = time.time()
        index_ivf.search(query_np, 5)
        times.append((time.time() - start) * 1000)
    results["IVF (Approximate)"] = np.mean(times)
    
    return results

print("📊 FAISS Index Type Comparison:")
faiss_index_results = benchmark_faiss_index_types(chunk_embeddings, test_queries)
for name, time_ms in faiss_index_results.items():
    print(f"   {name}: {time_ms:.3f}ms")

---

## Part 4: Qdrant Implementation

Qdrant is a production-ready vector database with excellent filtering support.

### Key Qdrant Concepts and Functions

| Class/Function | Purpose |
|----------------|---------|
| `QdrantClient(":memory:")` | Creates an in-memory Qdrant instance (or use URL for remote server) |
| `VectorParams(size, distance)` | Defines vector configuration: dimension size and distance metric |
| `Distance.COSINE` | Cosine similarity metric (also: EUCLID, DOT) |
| `PointStruct(id, vector, payload)` | A single vector with its ID and metadata (payload) |
| `client.create_collection(collection_name, vectors_config)` | Creates a new collection to store vectors |
| `client.upsert(collection_name, points)` | Adds or updates points in the collection |
| `client.search(collection_name, query_vector, limit)` | Finds nearest neighbors |
| `Filter(must, should)` | Combines multiple filter conditions (AND/OR logic) |
| `FieldCondition(key, match)` | Filters on a specific metadata field |
| `MatchValue(value)` | Matches exact value in a field |

In [None]:
def benchmark_qdrant(
    chunks: List[Document],
    embeddings: List[List[float]],
    test_queries: List[str]
) -> BenchmarkResult:
    """
    Benchmark Qdrant performance (in-memory mode).
    
    Qdrant API Overview:
    - QdrantClient: Main client for interacting with Qdrant
      - ":memory:" creates an in-memory instance (no persistence)
      - Can also connect to a Qdrant server with QdrantClient(url="http://localhost:6333")
    
    - create_collection: Creates a named collection with vector configuration
      - VectorParams(size=dimension, distance=Distance.COSINE)
    
    - upsert: Adds points (vectors + metadata) to the collection
      - Each point is a PointStruct(id, vector, payload)
      - payload is a dict with metadata (like source, category, etc.)
    
    - search: Finds nearest neighbors with optional filtering
      - query_filter uses Filter with FieldCondition for metadata filtering
    """
    memory_before = get_memory_usage()
    dimension = len(embeddings[0])
    
    # Time indexing
    start = time.time()
    
    # Create in-memory client - use ":memory:" for testing, URL for production
    client = QdrantClient(":memory:")
    
    # Create collection with vector configuration
    # VectorParams defines: size (dimension) and distance metric
    client.create_collection(
        collection_name="documents",
        vectors_config=VectorParams(
            size=dimension,
            distance=Distance.COSINE  # Also: Distance.EUCLID, Distance.DOT
        )
    )
    
    # Create points - each point has: id, vector, and payload (metadata)
    points = [
        PointStruct(
            id=i,
            vector=embeddings[i],
            payload={
                "content": chunks[i].page_content,
                **chunks[i].metadata  # Include all document metadata
            }
        )
        for i in range(len(chunks))
    ]
    
    # Upsert (insert or update) points into the collection
    client.upsert(
        collection_name="documents",
        points=points
    )
    
    index_time = time.time() - start
    memory_after = get_memory_usage()
    
    # Time queries
    query_times = []
    for query in test_queries:
        query_emb = embedding_model.embed_query(query)
        
        start = time.time()
        results = client.search(
            collection_name="documents",
            query_vector=query_emb,
            limit=5
        )
        query_times.append((time.time() - start) * 1000)
    
    # Test filtering - Qdrant's killer feature!
    # Filter uses: must (AND), should (OR) with FieldCondition
    query_emb = embedding_model.embed_query(test_queries[0])
    filtered = client.search(
        collection_name="documents",
        query_vector=query_emb,
        query_filter=Filter(
            must=[FieldCondition(key="category", match=MatchValue(value="technical"))]
        ),
        limit=5
    )
    supports_filtering = len(filtered) > 0
    
    return BenchmarkResult(
        db_name="Qdrant",
        index_time_s=index_time,
        avg_query_time_ms=np.mean(query_times),
        min_query_time_ms=min(query_times),
        max_query_time_ms=max(query_times),
        memory_mb=memory_after - memory_before,
        supports_filtering=supports_filtering,
        supports_gpu=False
    )

print("🟣 Benchmarking Qdrant...")
qdrant_result = benchmark_qdrant(chunks, chunk_embeddings, test_queries)
print(f"   Index time: {qdrant_result.index_time_s:.2f}s")
print(f"   Avg query: {qdrant_result.avg_query_time_ms:.2f}ms")

---

## Part 5: Comprehensive Comparison

In [None]:
# Collect all results
all_results = [chroma_result, faiss_gpu_result, faiss_cpu_result, qdrant_result]

print("\n" + "=" * 90)
print("📊 VECTOR DATABASE BENCHMARK RESULTS")
print("=" * 90)
print(f"{'Database':<20} {'Index(s)':<10} {'Query(ms)':<12} {'Memory(MB)':<12} {'Filtering':<12} {'GPU':<8}")
print("-" * 90)

for r in sorted(all_results, key=lambda x: x.avg_query_time_ms):
    filter_str = "✅" if r.supports_filtering else "❌"
    gpu_str = "✅" if r.supports_gpu else "❌"
    print(f"{r.db_name:<20} {r.index_time_s:<10.3f} {r.avg_query_time_ms:<12.3f} "
          f"{r.memory_mb:<12.1f} {filter_str:<12} {gpu_str:<8}")

print("=" * 90)

In [None]:
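# Calculate speedups relative to ChromaDB
baseline = chroma_result.avg_query_time_ms

print("\n⚡ Speed Comparison (vs ChromaDB):")
for r in all_results:
    if r.db_name != "ChromaDB":
        # ratio > 1 means faster than ChromaDB, ratio < 1 means slower
        ratio = baseline / r.avg_query_time_ms
        label = "faster" if ratio >= 1 else "slower"
        factor = ratio if ratio >= 1 else 1 / ratio
        print(f"   {r.db_name}: {factor:.1f}x {label}")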

---

## Part 6: Feature Comparison

### Filtering Capabilities

In [None]:
# Demonstrate filtering capabilities
print("🔍 Filtering Capabilities Comparison")
print("=" * 70)

# ChromaDB filtering example
print("\n🔵 ChromaDB Filtering:")
print("""```python
collection.query(
    query_embeddings=[query_emb],
    n_results=5,
    where={"category": "technical"},  # Exact match
    where_document={"$contains": "GPU"}  # Document contains
)
```""")

# FAISS filtering (manual)
print("\n🟢 FAISS Filtering (Manual Post-Processing):")
print("""```python
# FAISS doesn't have built-in metadata filtering
# You must filter results after retrieval
distances, indices = index.search(query_np, k=50)  # Over-fetch, then filter
# FAISS pads with -1 when fewer than k results exist; index i maps to chunks[i]
filtered = [i for i in indices[0] if i != -1 and chunks[i].metadata['category'] == 'technical'][:5]
```""")

# Qdrant filtering
print("\n🟣 Qdrant Filtering (Advanced):")
print("""```python
from qdrant_client.models import Range  # needed for range conditions

client.search(
    collection_name="documents",
    query_vector=query_emb,
    query_filter=Filter(
        must=[
            FieldCondition(key="category", match=MatchValue(value="technical")),
            FieldCondition(key="file_size", range=Range(gte=1000, lte=10000))
        ],
        should=[
            FieldCondition(key="source", match=MatchValue(value="guide.md"))
        ]
    ),
    limit=5
)
```""")
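
Post-filtering with a fixed `k=50` can return fewer than five hits when the filter is selective. One common workaround, sketched below under stated assumptions, is to over-fetch and grow the fetch size until enough results pass the filter. `filtered_search` and `search_fn` are hypothetical names: `search_fn(n)` stands in for a call like `index.search(query_np, k=n)` that returns ranked ids, possibly padded with `-1`.

```python
from typing import Callable, Dict, List


def filtered_search(search_fn: Callable[[int], List[int]],
                    metadata: List[Dict[str, str]],
                    predicate: Callable[[Dict[str, str]], bool],
                    k: int = 5,
                    fetch_k: int = 20) -> List[int]:
    """Return up to k ranked ids whose metadata passes predicate."""
    total = len(metadata)
    fetch_k = min(max(fetch_k, k), total)
    while True:
        ids = search_fn(fetch_k)
        # Skip -1 padding and keep only ids that pass the metadata filter.
        hits = [i for i in ids if i != -1 and predicate(metadata[i])]
        if len(hits) >= k or fetch_k >= total:
            return hits[:k]
        fetch_k = min(fetch_k * 2, total)  # not enough hits: widen the search


# Toy data: every fourth item is "technical"; ids are ranked 39, 38, ..., 0.
metadata = [{"category": "technical" if i % 4 == 0 else "other"} for i in range(40)]
ranking = list(range(39, -1, -1))

result = filtered_search(
    search_fn=lambda n: ranking[:n],
    metadata=metadata,
    predicate=lambda m: m["category"] == "technical",
    k=5,
    fetch_k=8,
)
print(result)
```

Each retry repeats the search with a larger `k`, so very selective filters get expensive; that cost is the practical argument for databases that filter during the search itself.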

---

## Part 7: Decision Guide

In [None]:
print("""
📋 VECTOR DATABASE SELECTION GUIDE
══════════════════════════════════════════════════════════════════════

🔵 Choose ChromaDB if:
   ✅ You're prototyping or learning
   ✅ You have < 1M vectors
   ✅ You need simple filtering
   ✅ You want Python-native simplicity
   ❌ Not for: High-performance production, GPU acceleration

🟢 Choose FAISS if:
   ✅ Performance is critical (DGX Spark GPU acceleration!)
   ✅ You have millions of vectors
   ✅ You don't need complex filtering
   ✅ You want the fastest possible search
   ❌ Not for: Built-in filtering, persistence (need to manage yourself)

🟣 Choose Qdrant if:
   ✅ You need production-ready features
   ✅ Complex filtering is required
   ✅ You want distributed deployment
   ✅ You need a managed cloud option
   ❌ Not for: Simplest use cases, GPU acceleration

══════════════════════════════════════════════════════════════════════

🚀 DGX Spark Recommendation:
   Development: ChromaDB (simplicity)
   Performance: FAISS with GPU (blazing fast!)
   Production: Qdrant (features) or FAISS (speed)

""")

---

## ⚠️ Common Mistakes

### Mistake 1: Not Using GPU with FAISS
```python
# ❌ Wrong: CPU-only FAISS on DGX Spark (wasted GPU!)
index = faiss.IndexFlatIP(dimension)

# ✅ Right: GPU-accelerated FAISS
# Requires a GPU-enabled FAISS build; faiss-cpu does not provide these functions
res = faiss.StandardGpuResources()
index = faiss.index_cpu_to_gpu(res, 0, index)
```

### Mistake 2: Using Flat Index for Large Datasets
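```python
# ❌ Wrong: Flat index with millions of vectors (slow!)
index = faiss.IndexFlatIP(dimension)

# ✅ Right: IVF index for large scale
quantizer = faiss.IndexFlatIP(dimension)
# Pass nlist positionally and set the metric to match the inner-product quantizer
index = faiss.IndexIVFFlat(quantizer, dimension, 1000, faiss.METRIC_INNER_PRODUCT)
index.train(embeddings_np)  # float32 NumPy array of shape (n_vectors, dimension)
index.add(embeddings_np)
```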

### Mistake 3: Not Persisting Vector Stores
```python
# ❌ Wrong: In-memory only, lost on restart
client = chromadb.Client()

# ✅ Right: Persistent storage
client = chromadb.PersistentClient(path="./chroma_db")
```

---

## ✋ Try It Yourself

### Exercise 1: Scale Test
Duplicate the chunks 10x and re-run the benchmarks. How do the results change?

### Exercise 2: FAISS IVF Tuning
Experiment with different `nlist` and `nprobe` values for FAISS IVF.

**IVF Tuning Guide:**
- `nlist` (set at index creation): More clusters = smaller clusters and faster queries at a fixed `nprobe`, but slower training and lower recall unless `nprobe` is raised. Try: 50, 100, 200
- `nprobe` (query): More probes = better recall but slower queries. Try: 5, 10, 20
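
When picking `nlist` values to try, it helps to check them against the dataset size first. The helper below is a rough sketch with hypothetical function names: it applies the lab's rule of thumb (`sqrt(n)` to `4 * sqrt(n)`) and, as an assumption based on FAISS's default clustering setting of about 39 training points per centroid, estimates the largest `nlist` that can be trained without a "too few training points" warning.

```python
import math
from typing import Tuple


def suggest_nlist_range(n_vectors: int) -> Tuple[int, int]:
    """Lab rule of thumb: between sqrt(n) and 4 * sqrt(n) clusters."""
    root = math.isqrt(n_vectors)
    low = max(1, root)
    high = max(low, 4 * root)
    return low, high


def max_well_trained_nlist(n_vectors: int, min_points_per_centroid: int = 39) -> int:
    """Largest nlist that still leaves min_points_per_centroid vectors per cluster.

    The default of 39 is an assumption taken from FAISS's clustering defaults.
    """
    return max(1, n_vectors // min_points_per_centroid)


for n in (10_000, 1_000_000):
    low, high = suggest_nlist_range(n)
    print(f"n={n}: try nlist in [{low}, {high}], "
          f"well-trained up to about {max_well_trained_nlist(n)}")
```

For a small chunk collection such as the one in this lab, the second number is the tighter constraint: with a few hundred vectors, `nlist=200` leaves most clusters nearly empty.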

### Exercise 3: Qdrant Quantization
Enable scalar quantization in Qdrant and measure memory savings.

<details>
<summary>💡 Hint for Exercise 3</summary>

**Scalar Quantization** reduces memory by converting float32 vectors to int8:
- Reduces memory by ~4x (32 bits → 8 bits per dimension)
- Slight accuracy loss but usually acceptable
- `always_ram=True` keeps quantized vectors in RAM for fast access

```python
from qdrant_client.models import ScalarQuantization, ScalarQuantizationConfig

# ScalarQuantization: Converts vectors from float32 to int8
# - type="int8": Use 8-bit integers (the scalar type used in this exercise)
# - always_ram=True: Keep quantized vectors in memory for speed

client.update_collection(
    collection_name="documents",
    quantization_config=ScalarQuantization(
        scalar=ScalarQuantizationConfig(
            type="int8",        # Quantization precision
            always_ram=True     # Keep in RAM for fast access
        )
    )
)

# After quantization, memory usage drops significantly
# Trade-off: ~1-2% recall loss for ~4x memory savings
```
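
To see where the ~4x figure comes from, here is a simplified pure-Python sketch of scalar quantization. It is an illustration under stated assumptions, not Qdrant's implementation: it maps each vector's min/max range linearly onto int8 codes, and the function names are made up for this example.

```python
import random
from array import array
from typing import List, Tuple


def quantize_int8(vec: List[float]) -> Tuple[List[int], float, float]:
    """Map floats linearly onto the int8 range [-128, 127]."""
    lo, hi = min(vec), max(vec)
    scale = (hi - lo) / 255 or 1.0  # avoid a zero scale for constant vectors
    codes = [min(255, max(0, round((x - lo) / scale))) - 128 for x in vec]
    return codes, lo, scale


def dequantize_int8(codes: List[int], lo: float, scale: float) -> List[float]:
    """Approximate reconstruction of the original floats."""
    return [(c + 128) * scale + lo for c in codes]


rng = random.Random(0)
vector = [rng.uniform(-1.0, 1.0) for _ in range(1024)]  # 1024 dims, like bge-large

codes, lo, scale = quantize_int8(vector)
restored = dequantize_int8(codes, lo, scale)

# Storage per vector: 4 bytes per float32 value vs 1 byte per int8 code.
float32_bytes = len(array("f", vector)) * array("f").itemsize
int8_bytes = len(array("b", codes)) * array("b").itemsize
max_error = max(abs(a - b) for a, b in zip(vector, restored))

print(f"float32: {float32_bytes} bytes, int8: {int8_bytes} bytes")
print(f"max reconstruction error: {max_error:.5f} (half a step is {scale / 2:.5f})")
```

The rounding error per dimension is bounded by half a quantization step, which is why the recall loss tends to be small for embeddings whose values fall in a narrow range.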
</details>

---

## 🎉 Checkpoint

You've learned:
- ✅ How to implement RAG with ChromaDB, FAISS, and Qdrant
- ✅ How to leverage GPU acceleration with FAISS on DGX Spark
- ✅ The trade-offs between different vector databases
- ✅ When to choose each database for your use case

---

## 🧹 Cleanup

In [None]:
# Clean up
del embedding_model
gc.collect()
torch.cuda.empty_cache()

# Remove temp directories
for p in Path(".").glob("benchmark_*"):
    if p.is_dir():
        shutil.rmtree(p)

print("✅ Cleanup complete!")

---

## Next Steps

In the next lab, we'll implement **hybrid search** combining dense embeddings with sparse BM25 retrieval!

➡️ Continue to [Lab 3.5.4: Hybrid Search](./lab-3.5.4-hybrid-search.ipynb)