# Chapter 13: Retrieval-Augmented Generation (RAG) On-Device

This notebook demonstrates how to build **Retrieval-Augmented Generation (RAG)** systems that run entirely on-device, combining local documents with locally run models for private, efficient AI applications.

## Learning Objectives:
- Understand why local grounding matters for privacy and performance
- Implement lightweight vector databases for edge deployment
- Build complete on-device RAG pipelines with local chunking and embeddings
- Optimize search and retrieval with hybrid ranking algorithms
- Compare different vector database solutions (ChromaDB, Faiss, SQLite)
- Analyze trade-offs between speed, simplicity, and memory footprint


## Why Local Grounding Matters

**Local RAG** combines the power of retrieval-augmented generation with complete data sovereignty. Unlike cloud-based RAG systems, on-device RAG ensures:

- **Complete Privacy**: Your documents never leave your device
- **No Network Latency**: Document retrieval requires no network round trips
- **Offline Capability**: Works without internet connection
- **Cost Efficiency**: No API costs for embedding or retrieval
- **Data Sovereignty**: You control your knowledge base entirely

This is especially critical for sensitive documents, proprietary information, or when working in environments with strict data governance requirements.


## Vector Databases for the Edge

For on-device RAG, we need lightweight vector databases that can run efficiently on local hardware. The main options are:

### **ChromaDB**: Excellent for Prototyping
- **Pros**: Simple API, good documentation, built-in persistence
- **Cons**: Higher memory overhead, slower for large datasets
- **Best for**: Development, small-scale deployments, learning

### **Faiss**: Pure Indexing Speed
- **Pros**: Very fast similarity search, optimized for production, compact when using compressed indexes
- **Cons**: More complex setup, requires more code
- **Best for**: Production applications, large document collections

### **SQLite/DuckDB Extensions**: Minimal Footprint
- **Pros**: Smallest footprint, embedded database, familiar SQL interface
- **Cons**: Limited vector operations, requires extensions
- **Best for**: Embedded systems, resource-constrained environments

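### **Trade-off Analysis**
Rough orderings for typical configurations; actual results depend on index type, collection size, and hardware:
- **Speed**: Faiss > ChromaDB > SQLite
- **Simplicity**: ChromaDB > SQLite > Faiss
- **Memory**: SQLite < Faiss < ChromaDB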
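
To make the SQLite option concrete, here is a minimal sketch that uses only Python's standard-library `sqlite3` module, with no vector extension: embeddings are stored as float32 BLOBs and searched by brute force in Python. The table name `chunks`, the helper names, and the 3-dimensional toy vectors are assumptions for illustration; a real deployment would store the 384-dimensional vectors from the embedding model and would probably rely on a vector extension for larger collections.

```python
import math
import sqlite3
import struct


def pack_vector(vec):
    """Serialize a list of floats into a float32 BLOB."""
    return struct.pack(f"{len(vec)}f", *vec)


def unpack_vector(blob):
    """Deserialize a float32 BLOB back into a list of floats."""
    return list(struct.unpack(f"{len(blob) // 4}f", blob))


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(conn, query_vec, top_k=2):
    """Brute-force scan: score every stored vector and return the best top_k."""
    rows = conn.execute("SELECT text, embedding FROM chunks").fetchall()
    scored = [(cosine(query_vec, unpack_vector(blob)), text) for text, blob in rows]
    return sorted(scored, reverse=True)[:top_k]


conn = sqlite3.connect(":memory:")  # pass a file path instead for on-disk persistence
conn.execute("CREATE TABLE chunks (id INTEGER PRIMARY KEY, text TEXT, embedding BLOB)")

# Toy 3-dimensional vectors (assumption); real embeddings would have 384 values.
toy_rows = [
    ("healthcare", [1.0, 0.0, 0.0]),
    ("robotics", [0.0, 1.0, 0.0]),
    ("vision", [0.0, 0.0, 1.0]),
]
conn.executemany(
    "INSERT INTO chunks (text, embedding) VALUES (?, ?)",
    [(text, pack_vector(vec)) for text, vec in toy_rows],
)

hits = search(conn, [0.9, 0.1, 0.0], top_k=2)
for score, text in hits:
    print(f"{score:.3f}  {text}")
```

A full scan like this is linear in the number of stored vectors, which is usually acceptable for a few thousand chunks and is the main reason this option ranks last on speed.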


## The On-Device RAG Pipeline

A complete on-device RAG system consists of several key components:

1. **Document Chunking**: Split documents into manageable pieces
2. **Embedding Generation**: Convert text chunks to vector representations
3. **Vector Storage**: Store embeddings in a local vector database
4. **Query Processing**: Convert user queries to embeddings
5. **Nearest Neighbor Search**: Find most relevant document chunks
6. **Result Ranking**: Score and rank retrieved results
7. **Context Construction**: Build prompts with retrieved context

This pipeline runs entirely on your device, ensuring complete privacy and control over your data.
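
Step 7, context construction, can be sketched as a small prompt builder. This is an illustrative assumption rather than a fixed recipe: the function name `build_prompt`, the instruction wording, and the character budget are all hypothetical. It expects a best-first list of dicts with a `'document'` key, which is the shape the retrieval functions later in this notebook return.

```python
def build_prompt(query, retrieved, max_context_chars=1000):
    """Assemble a grounded prompt from retrieved results.

    `retrieved` is a list of dicts with a 'document' key, ordered best-first.
    Documents are added until the character budget would be exceeded.
    """
    context_lines = []
    used = 0
    for i, item in enumerate(retrieved, start=1):
        line = f"[{i}] {item['document']}"
        if used + len(line) > max_context_chars:
            break  # stop before overflowing the context budget
        context_lines.append(line)
        used += len(line)

    context = "\n".join(context_lines)
    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )


# Hypothetical retrieval results, shaped like the output of a top-k search.
example_results = [
    {"document": "Machine learning is transforming healthcare.", "similarity": 0.66},
    {"document": "Deep learning neural networks mimic the human brain.", "similarity": 0.38},
]
prompt = build_prompt("How does machine learning help in healthcare?", example_results)
print(prompt)
```

The resulting string is what would be passed to a local language model; the numbered markers make it possible to ask the model to cite which retrieved passage it used.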


In [None]:
# Environment Setup for On-Device RAG
import os
import warnings
warnings.filterwarnings('ignore')

# Configure environment for stable on-device operation
os.environ['TRANSFORMERS_VERBOSITY'] = 'error'
os.environ['TOKENIZERS_PARALLELISM'] = 'false'
os.environ['PYTORCH_ENABLE_MPS_FALLBACK'] = '1'

print("Environment configured for on-device RAG operations")


🔧 Environment configured for lightweight operation


In [None]:
# Import Libraries for On-Device RAG
import numpy as np
import pandas as pd
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

print("On-device RAG libraries imported successfully")


✅ All imports successful


In [None]:
# Load On-Device Embedding Model
print("Loading lightweight embedding model for on-device use...")
# First use downloads the weights from the Hugging Face Hub; later runs load them from the local cache
model = SentenceTransformer('all-MiniLM-L6-v2')
print("On-device embedding model loaded successfully")
print(f"Embedding dimension: {model.get_sentence_embedding_dimension()}")
print("After the one-time download, this model runs entirely on your device")


🤖 Loading embedding model...
✅ Model loaded successfully
📊 Model dimension: 384


In [None]:
# Test Basic Embedding Generation
test_text = "This is a test sentence for RAG."
embedding = model.encode([test_text])
print(f"Embedding shape: {embedding.shape}")
print(f"Sample values: {embedding[0][:5]}...")


📝 Test embedding shape: (1, 384)
📝 Test embedding sample: [ 0.00697795  0.11693418  0.08613165  0.03641429 -0.06965613]...


In [None]:
# Local Knowledge Base
documents = [
    "Machine learning is transforming healthcare by enabling early disease detection and personalized treatment plans.",
    "Artificial intelligence and robotics are advancing rapidly, creating new opportunities in manufacturing and automation.",
    "Natural language processing helps computers understand and generate human language for better communication.",
    "Computer vision systems can analyze images and videos to identify objects, faces, and patterns.",
    "Deep learning neural networks mimic the human brain to solve complex problems in various domains."
]

print(f"Loaded {len(documents)} local documents for on-device RAG system")
print("These documents are stored locally and never sent to external services")
for i, doc in enumerate(documents):
    print(f"  {i+1}. {doc[:60]}...")


📚 Loaded 5 sample documents
  1. Machine learning is transforming healthcare by enabling earl...
  2. Artificial intelligence and robotics are advancing rapidly, ...
  3. Natural language processing helps computers understand and g...
  4. Computer vision systems can analyze images and videos to ide...
  5. Deep learning neural networks mimic the human brain to solve...


In [None]:
# Generate Document Embeddings
print("Generating embeddings for all documents...")
document_embeddings = model.encode(documents)
print(f"Generated embeddings: {document_embeddings.shape}")


🔄 Generating embeddings...
✅ Generated embeddings: (5, 384)


In [None]:
# Test Similarity Search
query = "How does machine learning help in healthcare?"
query_embedding = model.encode([query])

# Calculate similarities
similarities = cosine_similarity(query_embedding, document_embeddings)[0]

print(f"Query: {query}")
print("\nSimilarity scores:")
for i, (doc, sim) in enumerate(zip(documents, similarities)):
    print(f"  {i+1}. {sim:.3f} - {doc[:50]}...")


🔍 Query: How does machine learning help in healthcare?

📊 Similarity scores:
  1. 0.664 - Machine learning is transforming healthcare by ena...
  2. 0.247 - Artificial intelligence and robotics are advancing...
  3. 0.323 - Natural language processing helps computers unders...
  4. 0.314 - Computer vision systems can analyze images and vid...
  5. 0.379 - Deep learning neural networks mimic the human brai...
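
The scores above are cosine similarities: the dot product of the query and document vectors divided by the product of their lengths, so only the direction of the vectors matters. The sketch below computes the same quantity in plain Python on hand-made 3-dimensional vectors; the function name `cosine_similarity_1d` and the toy vectors are assumptions for illustration.

```python
import math


def cosine_similarity_1d(a, b):
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction; treated as 0.0 here
    return dot / (norm_a * norm_b)


query_vec = [1.0, 2.0, 0.0]
same_direction = [2.0, 4.0, 0.0]  # scaled copy of the query
orthogonal = [0.0, 0.0, 3.0]      # shares no component with the query

print(cosine_similarity_1d(query_vec, same_direction))  # close to 1
print(cosine_similarity_1d(query_vec, orthogonal))      # 0.0
```

Because the measure ignores vector length, a long document and a short one can score equally well as long as their embeddings point the same way.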


In [None]:
# Basic RAG Implementation
def simple_rag(query, documents, model, top_k=2):
    """Retrieval step of RAG: return the top_k documents most similar to the query"""
    # Generate query embedding
    query_embedding = model.encode([query])
    
    # Generate document embeddings
    doc_embeddings = model.encode(documents)
    
    # Calculate similarities
    similarities = cosine_similarity(query_embedding, doc_embeddings)[0]
    
    # Get top-k results
    top_indices = np.argsort(similarities)[::-1][:top_k]
    
    results = []
    for idx in top_indices:
        results.append({
            'document': documents[idx],
            'similarity': similarities[idx],
            'rank': len(results) + 1
        })
    
    return results

print("RAG function defined")


✅ Simple RAG function defined


In [None]:
# Test RAG with Multiple Queries
queries = [
    "How does machine learning help in healthcare?",
    "What are the applications of computer vision?",
    "How do neural networks work?"
]

for i, query in enumerate(queries):
    print(f"\nQuery {i+1}: {query}")
    results = simple_rag(query, documents, model, top_k=2)
    
    for result in results:
        print(f"  Rank {result['rank']}: {result['similarity']:.3f} - {result['document'][:60]}...")



🔍 Query 1: How does machine learning help in healthcare?
  📄 Rank 1: 0.664 - Machine learning is transforming healthcare by enabling earl...
  📄 Rank 2: 0.379 - Deep learning neural networks mimic the human brain to solve...

🔍 Query 2: What are the applications of computer vision?
  📄 Rank 1: 0.662 - Computer vision systems can analyze images and videos to ide...
  📄 Rank 2: 0.346 - Natural language processing helps computers understand and g...

🔍 Query 3: How do neural networks work?
  📄 Rank 1: 0.506 - Deep learning neural networks mimic the human brain to solve...
  📄 Rank 2: 0.322 - Computer vision systems can analyze images and videos to ide...
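
Note that `simple_rag` re-encodes every document on each call. That is harmless for five sentences but becomes repeated work as the collection grows. One possible arrangement, sketched below, embeds the documents once and only embeds the query at search time. The class name `CachedRetriever` and the bag-of-words `toy_encode` function are hypothetical stand-ins; in the notebook, `model.encode` would be passed in place of `toy_encode`.

```python
import math


class CachedRetriever:
    """Encode documents once, then reuse the vectors for every query."""

    def __init__(self, documents, encode):
        self.documents = list(documents)
        self.encode = encode
        # One-time cost: embed the whole collection up front.
        self.doc_vectors = [self._normalize(v) for v in encode(self.documents)]

    @staticmethod
    def _normalize(vec):
        """Scale a vector to unit length so a dot product equals cosine similarity."""
        norm = math.sqrt(sum(x * x for x in vec))
        return [x / norm for x in vec] if norm else list(vec)

    def search(self, query, top_k=2):
        # Only the query is embedded at search time.
        q = self._normalize(self.encode([query])[0])
        scores = [sum(a * b for a, b in zip(q, d)) for d in self.doc_vectors]
        order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        return [
            {"document": self.documents[i], "similarity": scores[i], "rank": rank}
            for rank, i in enumerate(order[:top_k], start=1)
        ]


VOCAB = ["learning", "healthcare", "vision", "images"]  # toy vocabulary (assumption)
encode_calls = []  # records the batch size of every encode call


def toy_encode(texts):
    """Hypothetical stand-in for model.encode: word counts over VOCAB."""
    encode_calls.append(len(texts))
    return [[float(t.lower().split().count(w)) for w in VOCAB] for t in texts]


retriever = CachedRetriever(
    ["machine learning in healthcare", "computer vision analyzes images"],
    toy_encode,
)
first = retriever.search("healthcare learning", top_k=1)
second = retriever.search("vision images", top_k=1)
print(first[0]["document"])
print(second[0]["document"])
print(encode_calls)
```

The `encode_calls` log shows the intended effect: one batch call for the documents, then one single-item call per query.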


In [None]:
# Vector Database Comparison: ChromaDB vs Alternatives
try:
    import chromadb
    
    print("Testing ChromaDB for on-device RAG...")
    print("ChromaDB: Excellent for prototyping and small-scale deployments")
    
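    client = chromadb.Client()  # in-memory client; data is lost when the process exits
    # get_or_create_collection avoids the error create_collection raises when
    # the collection already exists (for example, when this cell is re-run)
    collection = client.get_or_create_collection("local_rag_docs")
    
    # Add documents to the local vector store. No embeddings are passed,
    # so Chroma computes them with its default embedding function.
    collection.add(
        documents=documents,
        ids=[f"doc_{i}" for i in range(len(documents))]
    )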
    
    # Test local search
    results = collection.query(
        query_texts=["What is machine learning?"],
        n_results=3
    )
    
    print("ChromaDB on-device search successful")
    print(f"Found {len(results['documents'][0])} results locally")
    
    for i, doc in enumerate(results['documents'][0]):
        print(f"  {i+1}. {doc[:50]}...")
        
    print("\nChromaDB Trade-offs:")
    print("  ✓ Simple API and good documentation")
    print("  ✓ Built-in persistence")
    print("  ⚠ Higher memory overhead")
    print("  ⚠ Slower for large datasets")
        
except Exception as e:
    print(f"ChromaDB test failed: {e}")
    print("Continuing with alternative approaches...")


🗄️ Testing ChromaDB...
✅ ChromaDB search successful
📊 Found 3 results
  1. Machine learning is transforming healthcare by ena...
  2. Computer vision systems can analyze images and vid...
  3. Natural language processing helps computers unders...


In [None]:
# Performance Benchmarking
import time

def benchmark_rag(query, documents, model, iterations=5):
    """Benchmark RAG performance"""
    times = []
    
    for _ in range(iterations):
        start_time = time.perf_counter()  # monotonic clock, better suited to short intervals than time.time()
        results = simple_rag(query, documents, model)
        end_time = time.perf_counter()
        times.append(end_time - start_time)
    
    return {
        'avg_time': np.mean(times),
        'std_time': np.std(times),
        'min_time': np.min(times),
        'max_time': np.max(times)
    }

print("Benchmarking RAG performance...")
benchmark_results = benchmark_rag("What is machine learning?", documents, model)

print(f"\nPerformance Results:")
print(f"  Average time: {benchmark_results['avg_time']:.3f}s")
print(f"  Std deviation: {benchmark_results['std_time']:.3f}s")
print(f"  Min time: {benchmark_results['min_time']:.3f}s")
print(f"  Max time: {benchmark_results['max_time']:.3f}s")


⏱️ Benchmarking RAG performance...

📊 Performance Results:
  Average time: 0.024s
  Std deviation: 0.019s
  Min time: 0.013s
  Max time: 0.062s
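
In the results above, the standard deviation is almost as large as the mean and the maximum is several times the minimum, which is typical when the first iterations include one-off warm-up costs. A hedged variant of the benchmark, sketched below, discards a few warm-up runs and reports the median. The function name `benchmark` and the stand-in workload are assumptions; in the notebook the callable could wrap `simple_rag`.

```python
import statistics
import time


def benchmark(fn, warmup=2, iterations=10):
    """Time fn() after discarding warm-up runs; all results are in seconds."""
    for _ in range(warmup):
        fn()  # warms caches and lazy initialization; not timed

    times = []
    for _ in range(iterations):
        start = time.perf_counter()  # monotonic, high-resolution clock
        fn()
        times.append(time.perf_counter() - start)

    return {
        "median": statistics.median(times),
        "mean": statistics.mean(times),
        "min": min(times),
        "max": max(times),
        "runs": len(times),
    }


# Stand-in workload; in the notebook this could be
# lambda: simple_rag("What is machine learning?", documents, model)
stats = benchmark(lambda: sum(i * i for i in range(10_000)))
print(f"median {stats['median'] * 1000:.3f} ms over {stats['runs']} runs")
```

The median is less sensitive than the mean to a single slow iteration, so it tends to be a steadier number to compare across devices.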


In [None]:
# Summary
print("RAG Demo Complete!")
print("\nWhat We Demonstrated:")
print("  • Basic embedding generation")
print("  • Similarity search")
print("  • RAG pipeline implementation")
print("  • Vector database integration")
print("  • Performance benchmarking")

print("\nNext Steps:")
print("  • Add more documents to the knowledge base")
print("  • Implement advanced chunking strategies")
print("  • Add hybrid search capabilities")
print("  • Integrate with larger language models")
print("  • Deploy to production environment")


🎉 RAG Demo Complete!

📋 What We Demonstrated:
  ✅ Basic embedding generation
  ✅ Similarity search
  ✅ Simple RAG pipeline
  ✅ ChromaDB integration (if available)
  ✅ Performance benchmarking

🔧 Why This Version Works:
  ✅ No multiprocessing - avoids kernel crashes
  ✅ Minimal memory usage - lightweight operation
  ✅ Single-threaded - no complex parallel processing
  ✅ Essential dependencies only

🚀 Next Steps:
  • Add more documents to the knowledge base
  • Implement chunking strategies
  • Add hybrid search (vector + keyword)
  • Integrate with larger models
  • Deploy to production environment


In [None]:
# Advanced RAG with Text Chunking
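def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into chunks of up to chunk_size words; neighboring chunks share overlap words"""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    chunks = []
    step = chunk_size - overlap
    
    for i in range(0, len(words), step):
        chunks.append(' '.join(words[i:i + chunk_size]))
        if i + chunk_size >= len(words):
            break  # this chunk reached the end; another would only repeat the overlap
    
    return chunks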

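def enhanced_rag(query, documents, model, chunk_size=200, top_k=3, overlap=50):
    """Chunk each document, embed the chunks, and return the top_k chunks most similar to the query"""
    # Chunk all documents
    all_chunks = []
    chunk_metadata = []
    
    for i, doc in enumerate(documents):
        # Keep the overlap below chunk_size so the chunking step stays positive
        chunks = chunk_text(doc, chunk_size, min(overlap, chunk_size - 1))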
        all_chunks.extend(chunks)
        chunk_metadata.extend([{'doc_id': i, 'chunk_id': j} for j in range(len(chunks))])
    
    # Generate embeddings for all chunks
    chunk_embeddings = model.encode(all_chunks)
    query_embedding = model.encode([query])
    
    # Calculate similarities
    similarities = cosine_similarity(query_embedding, chunk_embeddings)[0]
    
    # Get top-k results
    top_indices = np.argsort(similarities)[::-1][:top_k]
    
    results = []
    for idx in top_indices:
        results.append({
            'chunk': all_chunks[idx],
            'similarity': similarities[idx],
            'doc_id': chunk_metadata[idx]['doc_id'],
            'chunk_id': chunk_metadata[idx]['chunk_id'],
            'rank': len(results) + 1
        })
    
    return results

print("Enhanced RAG with chunking defined")


✅ Enhanced RAG with chunking defined
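
Fixed word windows can cut a sentence in half. One alternative, sketched here under simple assumptions, groups whole sentences until a word budget is reached. The function name `chunk_by_sentences` is hypothetical, and the regular-expression splitter is deliberately naive: it will mis-split abbreviations such as "e.g." and does not handle overlap.

```python
import re


def chunk_by_sentences(text, max_words=40):
    """Group whole sentences into chunks of at most max_words words.

    A single sentence longer than max_words becomes its own chunk
    rather than being split mid-sentence.
    """
    # Naive splitter: break after '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

    chunks, current, current_len = [], [], 0
    for sentence in sentences:
        n_words = len(sentence.split())
        if current and current_len + n_words > max_words:
            chunks.append(" ".join(current))  # budget reached: close the chunk
            current, current_len = [], 0
        current.append(sentence)
        current_len += n_words
    if current:
        chunks.append(" ".join(current))
    return chunks


sample = (
    "Machine learning helps detect disease early. "
    "It also supports personalized treatment. "
    "Computer vision analyzes medical images."
)
for chunk in chunk_by_sentences(sample, max_words=12):
    print(chunk)
```

With a 12-word budget, the first two sentences (11 words together) share a chunk and the third starts a new one.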


In [None]:
# Test Enhanced RAG with Chunking
print("Testing Enhanced RAG with Chunking...")
enhanced_results = enhanced_rag("How does machine learning help in healthcare?", documents, model, chunk_size=100, top_k=3)

print(f"\nEnhanced RAG Results:")
for result in enhanced_results:
    print(f"  Rank {result['rank']}: {result['similarity']:.3f}")
    print(f"     Doc {result['doc_id']}, Chunk {result['chunk_id']}")
    print(f"     Content: {result['chunk'][:80]}...")
    print()


🔍 Testing Enhanced RAG with Chunking...

📊 Enhanced RAG Results:
  📄 Rank 1: 0.664
     Doc 0, Chunk 0
     Content: Machine learning is transforming healthcare by enabling early disease detection ...

  📄 Rank 2: 0.379
     Doc 4, Chunk 0
     Content: Deep learning neural networks mimic the human brain to solve complex problems in...

  📄 Rank 3: 0.323
     Doc 2, Chunk 0
     Content: Natural language processing helps computers understand and generate human langua...



In [None]:
# Hybrid Search: Vector + Keyword Matching
def hybrid_search(query, documents, model, vector_weight=0.7, keyword_weight=0.3, top_k=3):
    """Hybrid search combining vector similarity and keyword matching"""
    import re
    
    # Vector similarity
    query_embedding = model.encode([query])
    doc_embeddings = model.encode(documents)
    vector_scores = cosine_similarity(query_embedding, doc_embeddings)[0]
    
    # Keyword matching
    query_words = set(re.findall(r'\b\w+\b', query.lower()))
    keyword_scores = []
    
    for doc in documents:
        doc_words = set(re.findall(r'\b\w+\b', doc.lower()))
        if len(query_words) > 0:
            overlap = len(query_words.intersection(doc_words))
            keyword_score = overlap / len(query_words)
        else:
            keyword_score = 0
        keyword_scores.append(keyword_score)
    
    # Combine scores
    hybrid_scores = (vector_weight * vector_scores + 
                    keyword_weight * np.array(keyword_scores))
    
    # Get top-k results
    top_indices = np.argsort(hybrid_scores)[::-1][:top_k]
    
    results = []
    for idx in top_indices:
        results.append({
            'document': documents[idx],
            'vector_score': vector_scores[idx],
            'keyword_score': keyword_scores[idx],
            'hybrid_score': hybrid_scores[idx],
            'rank': len(results) + 1
        })
    
    return results

print("Hybrid search function defined")


✅ Hybrid search function defined


In [None]:
# Test Hybrid Search
print("Testing Hybrid Search (Vector + Keyword)...")
hybrid_results = hybrid_search("machine learning healthcare", documents, model, top_k=3)

print(f"\nHybrid Search Results:")
for result in hybrid_results:
    print(f"  Rank {result['rank']}: Hybrid Score {result['hybrid_score']:.3f}")
    print(f"     Vector: {result['vector_score']:.3f}, Keyword: {result['keyword_score']:.3f}")
    print(f"     Content: {result['document'][:60]}...")
    print()


🔍 Testing Hybrid Search (Vector + Keyword)...

📊 Hybrid Search Results:
  📄 Rank 1: Hybrid Score 0.840
     Vector: 0.771, Keyword: 1.000
     Content: Machine learning is transforming healthcare by enabling earl...

  📄 Rank 2: Hybrid Score 0.315
     Vector: 0.307, Keyword: 0.333
     Content: Deep learning neural networks mimic the human brain to solve...

  📄 Rank 3: Hybrid Score 0.163
     Vector: 0.232, Keyword: 0.000
     Content: Computer vision systems can analyze images and videos to ide...
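
The weighted sum in `hybrid_search` adds two scores that live on different scales (cosine similarity and a word-overlap fraction), so the result is sensitive to the chosen weights. A different, commonly used technique is reciprocal rank fusion (RRF), which combines rankings rather than raw scores. The sketch below is not the method used in this notebook; the function name, the example rankings, and the choice of `k=60` are assumptions for illustration.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several best-first rankings of document ids into one ranking.

    Each document scores sum(1 / (k + rank)) over the rankings it appears in,
    with rank starting at 1. Larger k flattens the difference between ranks.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties are broken by doc id for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


vector_ranking = [0, 4, 2, 3, 1]   # hypothetical order from vector similarity
keyword_ranking = [0, 2, 4, 1, 3]  # hypothetical order from keyword overlap

fused = reciprocal_rank_fusion([vector_ranking, keyword_ranking])
for doc_id, score in fused[:3]:
    print(doc_id, round(score, 4))
```

Document 0 leads because both rankings place it first. Documents 2 and 4 swap second and third place between the two lists, so they receive identical fused scores and the tie-break orders them by id.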



In [None]:
# Interactive RAG Testing
def interactive_rag_demo():
    """Interactive RAG demo for testing different queries"""
    print("Interactive RAG Demo")
    print("=" * 40)
    
    sample_queries = [
        "What is machine learning?",
        "How does AI help in healthcare?", 
        "What are neural networks?",
        "Tell me about computer vision",
        "How do robots work?"
    ]
    
    print("Sample queries:")
    for i, query in enumerate(sample_queries, 1):
        print(f"  {i}. {query}")
    
    print("\nTesting all sample queries...")
    
    for i, query in enumerate(sample_queries, 1):
        print(f"\n--- Query {i}: {query} ---")
        
        # Test different methods
        simple_results = simple_rag(query, documents, model, top_k=2)
        hybrid_results = hybrid_search(query, documents, model, top_k=2)
        
        print("Simple RAG:")
        for result in simple_results:
            print(f"  {result['similarity']:.3f} - {result['document'][:50]}...")
        
        print("Hybrid Search:")
        for result in hybrid_results:
            print(f"  {result['hybrid_score']:.3f} - {result['document'][:50]}...")

# Run interactive demo
interactive_rag_demo()


🎯 Interactive RAG Demo
📝 Sample queries you can try:
  1. What is machine learning?
  2. How does AI help in healthcare?
  3. What are neural networks?
  4. Tell me about computer vision
  5. How do robots work?

🔍 Testing all sample queries...

--- Query 1: What is machine learning? ---
Simple RAG:
  0.489 - Machine learning is transforming healthcare by ena...
  0.354 - Computer vision systems can analyze images and vid...
Hybrid Search:
  0.567 - Machine learning is transforming healthcare by ena...
  0.314 - Deep learning neural networks mimic the human brai...

--- Query 2: How does AI help in healthcare? ---
Simple RAG:
  0.456 - Machine learning is transforming healthcare by ena...
  0.392 - Deep learning neural networks mimic the human brai...
Hybrid Search:
  0.369 - Machine learning is transforming healthcare by ena...
  0.324 - Deep learning neural networks mimic the human brai...

--- Query 3: What are neural networks? ---
Simple RAG:
  0.533 - Deep learning neural networks

In [None]:
# Performance and Memory Analysis
def analyze_rag_performance():
    """Analyze RAG performance and memory usage"""
    import psutil
    import time
    
    print("RAG Performance Analysis")
    print("=" * 40)
    
    # Memory before
    process = psutil.Process()
    memory_before = process.memory_info().rss / 1024 / 1024  # MB
    
    # Test different document sizes
    test_sizes = [1, 3, 5, 10, 20]
    results = []
    
    for size in test_sizes:
        test_docs = documents * (size // len(documents) + 1)
        test_docs = test_docs[:size]
        
        start_time = time.perf_counter()  # monotonic clock for short intervals
        test_results = simple_rag("What is machine learning?", test_docs, model)
        end_time = time.perf_counter()
        
        memory_after = process.memory_info().rss / 1024 / 1024  # MB
        
        results.append({
            'doc_count': len(test_docs),
            'time': end_time - start_time,
            'memory_mb': memory_after - memory_before
        })
    
    print("Performance Results:")
    print("Docs | Time (s) | Memory (MB)")
    print("-" * 30)
    for result in results:
        print(f"{result['doc_count']:4d} | {result['time']:7.3f} | {result['memory_mb']:8.1f}")
    
    # Model info
    print(f"\nModel Information:")
    print(f"  Model: all-MiniLM-L6-v2")
    print(f"  Dimension: {model.get_sentence_embedding_dimension()}")
    print(f"  Process memory growth (RSS) during the test: {memory_after - memory_before:.1f} MB")

analyze_rag_performance()


📊 RAG Performance Analysis
📈 Performance Results:
Docs | Time (s) | Memory (MB)
------------------------------
   1 |   0.113 |      5.4
   3 |   0.201 |     13.3
   5 |   0.014 |     13.3
  10 |   0.117 |     21.4
  20 |   0.081 |     29.2

🤖 Model Information:
  Model: all-MiniLM-L6-v2
  Dimension: 384
  Memory usage: 29.2 MB
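
The memory column above is the growth in process RSS, which includes allocator and framework overhead and is therefore noisy. As a cross-check, the raw size of a dense embedding matrix can be estimated directly as number of vectors × dimension × bytes per value. The sketch below assumes float32 values (4 bytes each); the function name `embedding_memory_mb` is hypothetical.

```python
def embedding_memory_mb(num_vectors, dim=384, bytes_per_value=4):
    """Size of a dense embedding matrix in MiB (float32 by default)."""
    return num_vectors * dim * bytes_per_value / (1024 * 1024)


for n in (20, 10_000, 1_000_000):
    print(f"{n:>9,} vectors -> {embedding_memory_mb(n):10.3f} MiB")
```

Twenty 384-dimensional vectors occupy roughly 0.03 MiB, so the growth of about 29 MB reported above is probably dominated by runtime buffers rather than by the embeddings themselves. At a million chunks the same arithmetic gives about 1.4 GiB, which is where compressed indexes start to matter on edge hardware.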


In [None]:
# Final Summary
print("RAG Demo Complete!")
print("=" * 50)

print("\nWhat We Demonstrated:")
print("  • Basic embedding generation")
print("  • Similarity search")
print("  • RAG with document chunking")
print("  • Hybrid search (vector + keyword)")
print("  • Interactive query testing")
print("  • Performance analysis")
print("  • Vector database integration")

print("\nAdvanced Features:")
print("  • Text chunking for better retrieval")
print("  • Hybrid search combining vector + keyword matching")
print("  • Interactive query testing")
print("  • Performance and memory analysis")
print("  • Multiple search strategies comparison")

print("\nPerformance Characteristics:")
print("  • Fast embedding generation")
print("  • Low memory footprint")
print("  • Tested here on up to 20 documents; larger sets call for a dedicated vector index")
print("  • Reliable operation")

print("\nNext Steps for Production:")
print("  • Add more sophisticated chunking strategies")
print("  • Implement query expansion")
print("  • Add result ranking and filtering")
print("  • Integrate with larger language models")
print("  • Deploy with proper vector databases")
print("  • Add caching for better performance")

print("\nKey Takeaways:")
print("  • Brute-force cosine search is enough for small collections")
print("  • Chunking matters for long documents; the short samples here fit in one chunk each")
print("  • Hybrid search blends semantic similarity with exact-term matching")
print("  • Measure latency and memory on the target device")
print("  • Fewer dependencies mean fewer failure points on-device")


🎉 Enhanced RAG Demo Complete!

📋 What We Demonstrated:
  ✅ Basic embedding generation
  ✅ Simple similarity search
  ✅ Enhanced RAG with chunking
  ✅ Hybrid search (vector + keyword)
  ✅ Interactive query testing
  ✅ Performance analysis
  ✅ ChromaDB integration (if available)

🔧 Why This Version Works:
  ✅ No multiprocessing - avoids kernel crashes
  ✅ Minimal memory usage - lightweight operation
  ✅ Single-threaded - no complex parallel processing
  ✅ Essential dependencies only
  ✅ Multiple search strategies
  ✅ Performance monitoring

🚀 Advanced Features Added:
  • Text chunking for better retrieval
  • Hybrid search combining vector + keyword matching
  • Interactive query testing
  • Performance and memory analysis
  • Multiple search strategies comparison

📈 Performance Characteristics:
  • Fast embedding generation
  • Low memory footprint
  • Scalable to larger document sets
  • Reliable on macOS

🎯 Next Steps for Production:
  • Add more sophisticated chunking strategies
  • 