# 🚀 Private RAG Stack with EmbeddingGemma & SQLite-vec

## 🔒 100% Private | 💰 Zero Cost | 📱 Offline Capable

This notebook demonstrates how to build a complete RAG (Retrieval-Augmented Generation) system using:

- **EmbeddingGemma**: Google's efficient 300M-parameter embedding model
- **SQLite-vec**: Fast vector similarity search in SQLite
- **Qwen3**: An efficient local language model served via Ollama

### What You'll Learn:
1. How to scrape and prepare documentation
2. How to generate embeddings with EmbeddingGemma
3. How to store vectors in SQLite with sqlite-vec
4. How to perform semantic search
5. How to generate contextual responses with a local LLM

## 📚 Step 1: Import Required Libraries

We'll use a minimal set of libraries for our RAG pipeline:

In [1]:
import sqlite3
import sqlite_vec
import ollama
from sentence_transformers import SentenceTransformer
import requests
from bs4 import BeautifulSoup
import struct
import time
import os

## ⚙️ Step 2: Configuration & Helper Functions

Let's set up our configuration and utility functions:

In [2]:
# Configuration
EMBEDDING_MODEL = 'google/embeddinggemma-300m'  # Google's new EmbeddingGemma model
EMBEDDING_DIMS = 256  # Truncated from 768 (Matryoshka): vectors are 3x smaller to store and compare
LLM_MODEL = 'qwen3:4b'  # Efficient 4B-parameter model via Ollama
DB_FILE = "rag_vectors.db"
TABLE_NAME = "documents"

# Vector serialization helper
def serialize_f32(vector):
    """Convert float vector to bytes for SQLite storage"""
    return struct.pack("%sf" % len(vector), *vector)

print("✅ Configuration loaded!")

✅ Configuration loaded!
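To make the storage format concrete, here is a minimal sketch of the round trip. `serialize_f32` is repeated from the cell above; `deserialize_f32` is a hypothetical inverse helper that is not part of the notebook, added only to show that each float occupies 4 bytes in the blob.

```python
import struct

def serialize_f32(vector):
    """Pack a sequence of floats into raw float32 bytes (same helper as above)."""
    return struct.pack("%sf" % len(vector), *vector)

def deserialize_f32(blob):
    """Hypothetical inverse helper: unpack raw float32 bytes into a list of floats."""
    count = len(blob) // 4  # each float32 occupies 4 bytes
    return list(struct.unpack("%sf" % count, blob))

vector = [0.5, -1.25, 2.0]      # values that float32 represents exactly
blob = serialize_f32(vector)
print(len(blob))                # 12 (3 floats x 4 bytes)
print(deserialize_f32(blob))    # [0.5, -1.25, 2.0]
```

Values that float32 cannot represent exactly (most decimals) come back slightly rounded, which is expected for 32-bit storage.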


## 🌐 Step 3: Scrape Documentation

We'll scrape official documentation from key sources to build our knowledge base:

In [5]:
# Documentation sources for our RAG system
docs_to_scrape = {
    'sqlite_vec_python': 'https://alexgarcia.xyz/sqlite-vec/python.html',
    'sqlite_vec_demo': 'https://raw.githubusercontent.com/asg017/sqlite-vec/main/examples/simple-python/demo.py',
    'embeddinggemma_google_blog': 'https://developers.googleblog.com/en/introducing-embeddinggemma/',
    'huggingface_embeddinggemma': 'https://huggingface.co/google/embeddinggemma-300m',
    'huggingface_embeddinggemma_blog': 'https://huggingface.co/blog/embeddinggemma',
    'qwen3_ollama': 'https://ollama.com/library/qwen3',
    'sentence_transformers': 'https://sbert.net/docs/package_reference/sentence_transformer/SentenceTransformer.html'
}

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36'
}

# Create docs directory
os.makedirs('docs', exist_ok=True)

print("📥 Scraping documentation...")
scraped_docs = []

for name, url in docs_to_scrape.items():
    try:
        print(f"📄 Fetching: {name}")
        response = requests.get(url, headers=headers, timeout=30)
        response.raise_for_status()
        
        # Parse and extract text
        soup = BeautifulSoup(response.content, 'html.parser')
        for element in soup(['script', 'style']):
            element.decompose()
        
        text_content = soup.get_text(separator='\n', strip=True)
        
        if text_content.strip():
            # Save locally
            filepath = f"docs/{name}.txt"
            with open(filepath, 'w', encoding='utf-8') as f:
                f.write(f"Source: {url}\n")
                f.write("=" * 80 + "\n\n")
                f.write(text_content)
            
            scraped_docs.append((name, text_content))
            print(f"   ✅ Saved: {name} ({len(text_content)} chars)")
        
        time.sleep(1)  # Be respectful to servers
        
    except Exception as e:
        print(f"   ❌ Error fetching {name}: {e}")

print(f"\n📊 Successfully scraped {len(scraped_docs)} documents")

📥 Scraping documentation...
📄 Fetching: sqlite_vec_python
   ✅ Saved: sqlite_vec_python (4801 chars)
📄 Fetching: sqlite_vec_demo
   ✅ Saved: sqlite_vec_demo (1213 chars)
📄 Fetching: embeddinggemma_google_blog
   ✅ Saved: embeddinggemma_google_blog (9745 chars)
📄 Fetching: huggingface_embeddinggemma
   ✅ Saved: huggingface_embeddinggemma (13384 chars)
📄 Fetching: huggingface_embeddinggemma_blog
   ✅ Saved: huggingface_embeddinggemma_blog (47633 chars)
📄 Fetching: qwen3_ollama
   ✅ Saved: qwen3_ollama (4949 chars)
📄 Fetching: sentence_transformers
   ✅ Saved: sentence_transformers (79545 chars)

📊 Successfully scraped 7 documents


## 🧠 Step 4: Initialize EmbeddingGemma Model

Load Google's EmbeddingGemma for generating high-quality embeddings:

> **Note**: EmbeddingGemma is a gated model. Request access at https://huggingface.co/google/embeddinggemma-300m and run `huggingface-cli login` before loading it.

In [7]:
# Load EmbeddingGemma model
print(f"🤖 Loading {EMBEDDING_MODEL}...")
embedding_model = SentenceTransformer(EMBEDDING_MODEL)
print("✅ EmbeddingGemma loaded successfully!")

# Test the model with a sample
sample_text = "EmbeddingGemma is Google's efficient embedding model"
sample_embedding = embedding_model.encode_document(sample_text, truncate_dim=EMBEDDING_DIMS)
print(f"Sample embedding: {sample_embedding}")
print(f"📊 Sample embedding shape: {sample_embedding.shape}")
print(f"💡 Using {EMBEDDING_DIMS} of 768 dimensions (3x smaller vectors)")

🤖 Loading google/embeddinggemma-300m...
✅ EmbeddingGemma loaded successfully!
Sample embedding: [-1.27548069e-01 -1.03747128e-02  6.00063242e-03  3.82866375e-02
  3.39595117e-02  7.22874403e-02 -3.14668380e-03  6.48741052e-02
  3.73455212e-02 -3.45372595e-02 -2.08048206e-02  1.68610997e-02
 -1.10374615e-02 -3.13950144e-02  9.01036486e-02 -1.35720125e-03
 -3.91047597e-02  8.65481852e-04 -3.12702060e-02  2.52648396e-03
  3.65278162e-02  2.71117315e-02  3.11176339e-03 -1.19482335e-02
 -2.34869476e-02  2.15606447e-02 -1.87460158e-03 -7.75819346e-02
 -2.47754939e-02  1.50251044e-02  7.44022150e-03 -1.15102921e-02
  7.88129028e-03 -6.16652519e-03  2.06527431e-02 -3.50111071e-03
 -3.06493905e-03 -4.67991307e-02  3.31943296e-02  3.31700258e-02
 -2.72768941e-02  5.07247038e-02 -3.14996615e-02  2.65736580e-02
  3.35765183e-02 -4.00158912e-02 -5.88491410e-02 -2.59615704e-02
  2.05058302e-03 -2.49158852e-02  6.53840415e-03  4.06379402e-02
  8.83991644e-03 -1.37844970e-02 -3.76720503e-02  9.1905295
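As a rough illustration of what Matryoshka-style truncation means, the sketch below keeps the leading dimensions of a vector and rescales the result to unit length. The helper name `truncate_and_normalize` and the toy vector are assumptions for illustration; whether `truncate_dim` re-normalizes for you depends on the library's encode options, which this notebook does not show, so the explicit rescale here is an assumption rather than a description of the library.

```python
import math

def truncate_and_normalize(embedding, dims):
    """Keep the first `dims` values, then rescale to unit length (hypothetical helper)."""
    head = list(embedding[:dims])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # avoid division by zero for an all-zero vector
    return [x / norm for x in head]

full = [3.0, 4.0, 12.0, 0.0]           # toy 4-dimensional "embedding"
short = truncate_and_normalize(full, 2)
print(short)                            # [0.6, 0.8]
```

Unit-length vectors make distances comparable across queries, which is why re-normalizing after truncation is commonly recommended for Matryoshka embeddings.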

## 🗄️ Step 5: Setup SQLite Vector Database

Initialize SQLite with the sqlite-vec extension for efficient vector storage and similarity search:

In [10]:
# Initialize SQLite with vector extension
print("🗄️ Setting up vector database...")

conn = sqlite3.connect(DB_FILE)
conn.enable_load_extension(True)
sqlite_vec.load(conn)
conn.enable_load_extension(False)

# Create vector table
conn.execute(f"""
    CREATE VIRTUAL TABLE IF NOT EXISTS {TABLE_NAME} USING vec0(
        text TEXT,
        source TEXT,
        embedding float[{EMBEDDING_DIMS}]
    )
""")
conn.commit()

print("✅ Vector database ready!")

🗄️ Setting up vector database...
✅ Vector database ready!


## 🔄 Step 6: Smart Token-Based Chunking & Embeddings

Process the scraped documents with token-based chunking, so that chunk sizes are measured in the same units the embedding model consumes:

**Why Token-Based Chunking?**
- Uses the **same tokenizer** as EmbeddingGemma, so a chunk's token count matches what the model actually sees
- **Splits on token boundaries** instead of arbitrary character offsets
- **Avoids cutting through a token**, although a boundary can still fall between the sub-word tokens of a single word
- **Consistent chunk sizes** measured in actual tokens, not characters

In [14]:
def token_based_chunking(text, tokenizer, max_tokens=2048, overlap_tokens=100):
    """
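    Token-based chunking using the embedding model's own tokenizer.
    Chunk sizes are measured in the tokens the model consumes,
    unlike character-based chunking.
    """
    # Tokenize the entire text. encode() also adds special tokens such as <bos>,
    # which is why decoded chunks can start with "<bos>".
    tokens = tokenizer.encode(text)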
    
    if len(tokens) <= max_tokens:
        return [text]  # No need to chunk
    
    chunks = []
    start = 0
    
    while start < len(tokens):
        # Get chunk tokens
        end = min(start + max_tokens, len(tokens))
        chunk_tokens = tokens[start:end]
        
        # Decode back to text
        chunk_text = tokenizer.decode(chunk_tokens)
        chunks.append(chunk_text.strip())
        
        # Move start position with overlap
        if end >= len(tokens):
            break
        start = end - overlap_tokens
    
    return chunks

def chunk_text(text, model, max_tokens=2048, overlap_tokens=100):
    """Use token-based chunking with the embedding model's tokenizer."""
    return token_based_chunking(text, model.tokenizer, max_tokens, overlap_tokens)

# Process all documents with intelligent token-based chunking
print("📝 Chunking documents with token-based precision...")
all_chunks = []
all_sources = []
all_embeddings = []

for source_name, content in scraped_docs:
    # Use token-based chunking with the embedding model's tokenizer
    chunks = chunk_text(content, embedding_model, max_tokens=2048, overlap_tokens=100)
    print(f"📄 {source_name}: {len(chunks)} chunks")
    
    # Generate embeddings for all chunks
    chunk_embeddings = []
    for chunk in chunks:
        # Use document encoding with proper prompt
        embedding = embedding_model.encode_document(chunk, truncate_dim=EMBEDDING_DIMS)
        chunk_embeddings.append(embedding)
    
    all_chunks.extend(chunks)
    all_sources.extend([source_name] * len(chunks))
    all_embeddings.extend(chunk_embeddings)

print(f"\n📊 Total chunks created: {len(all_chunks)}")
print(f"🧮 Total embeddings generated: {len(all_embeddings)}")
print(f"💡 Using token-based chunking ensures optimal embedding quality!")

📝 Chunking documents with token-based precision...
📄 sqlite_vec_python: 1 chunks
📄 sqlite_vec_demo: 1 chunks
📄 embeddinggemma_google_blog: 2 chunks
📄 huggingface_embeddinggemma: 2 chunks
📄 huggingface_embeddinggemma_blog: 8 chunks
📄 qwen3_ollama: 1 chunks
📄 sentence_transformers: 13 chunks

📊 Total chunks created: 28
🧮 Total embeddings generated: 28
💡 Using token-based chunking ensures optimal embedding quality!
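To see how the sliding window advances, here is a small sketch that applies the same stepping rule as `token_based_chunking` to plain token counts, so it runs without a tokenizer. The function name `token_windows` and the guard on `overlap_tokens` are additions for illustration: with the loop as written above, an overlap equal to or larger than `max_tokens` would stop `start` from moving forward.

```python
def token_windows(num_tokens, max_tokens, overlap_tokens):
    """Return (start, end) index pairs using the same stepping rule as the chunker."""
    if not 0 <= overlap_tokens < max_tokens:
        raise ValueError("overlap_tokens must be >= 0 and smaller than max_tokens")
    windows = []
    start = 0
    while start < num_tokens:
        end = min(start + max_tokens, num_tokens)
        windows.append((start, end))
        if end >= num_tokens:
            break
        start = end - overlap_tokens  # step back so consecutive windows share tokens
    return windows

print(token_windows(10, 4, 1))  # [(0, 4), (3, 7), (6, 10)]
```

Each window shares its first `overlap_tokens` tokens with the end of the previous one, so a sentence that straddles a boundary appears in both chunks.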


## 💾 Step 7: Store Embeddings in Vector Database

Insert all our document chunks and their embeddings into the SQLite vector database:

In [16]:
# Store all chunks and embeddings
print("💾 Storing embeddings in vector database...")

for i, (chunk, source, embedding) in enumerate(zip(all_chunks, all_sources, all_embeddings)):
    conn.execute(f"""
        INSERT INTO {TABLE_NAME} (rowid, text, source, embedding)
        VALUES (?, ?, ?, ?)
    """, (i + 1, chunk, source, serialize_f32(embedding.tolist())))
    
    if (i + 1) % 10 == 0:
        print(f"🔄 Processed {i + 1}/{len(all_chunks)} chunks...")

conn.commit()
print("\n✅ All embeddings stored successfully!")

# Verify our data
cursor = conn.execute(f"SELECT COUNT(*) FROM {TABLE_NAME}")
count = cursor.fetchone()[0]
print(f"📊 Database contains {count} documents ready for search")

💾 Storing embeddings in vector database...
🔄 Processed 10/28 chunks...
🔄 Processed 20/28 chunks...

✅ All embeddings stored successfully!
📊 Database contains 28 documents ready for search


## 🔍 Step 8: Semantic Search Function

Create a function to perform semantic search using our vector database:

In [17]:
def semantic_search(query_text, top_k=3):
    """Perform semantic search and return relevant documents"""
    
    # Generate query embedding using proper query prompt
    query_embedding = embedding_model.encode_query(query_text, truncate_dim=EMBEDDING_DIMS)
    
    # Search for similar documents
    cursor = conn.execute(f"""
        SELECT rowid, text, source, distance
        FROM {TABLE_NAME}
        WHERE embedding MATCH ?
        ORDER BY distance
        LIMIT ?
    """, (serialize_f32(query_embedding.tolist()), top_k))
    
    results = cursor.fetchall()
    
    print(f"🔍 Found {len(results)} relevant documents:")
    contexts = []
    
    for rowid, text, source, distance in results:
        contexts.append(text)
        print(f"📄 Source: {source} | Distance: {distance:.4f}")
        print(f"📝 Preview: {text[:100]}...\n")
    
    return contexts

# Test semantic search
test_query = "How does EmbeddingGemma work?"
print(f"🧪 Testing search with query: '{test_query}'")
test_results = semantic_search(test_query)
print(f"✅ Search test completed!")

🧪 Testing search with query: 'How does EmbeddingGemma work?'
🔍 Found 3 relevant documents:
📄 Source: huggingface_embeddinggemma_blog | Distance: 0.5917
📝 Preview: <bos>Welcome EmbeddingGemma, Google's new efficient embedding model
Hugging Face
Models
Datasets
Spa...

📄 Source: embeddinggemma_google_blog | Distance: 0.6111
📝 Preview: <bos>Introducing EmbeddingGemma: The Best-in-Class Open Model for On-Device Embeddings
            
...

📄 Source: huggingface_embeddinggemma | Distance: 0.6134
📝 Preview: .70
68.70
Mixed Precision* (768d)
68.03
68.03
Note: QAT models are evaluated after quantization
* Mi...

✅ Search test completed!
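Conceptually, the `MATCH ... ORDER BY distance LIMIT ?` query ranks every stored vector by its distance to the query vector. The sketch below reproduces that idea in pure Python, assuming Euclidean (L2) distance, which as far as I know is sqlite-vec's default for float vectors. The helper names and toy rows are hypothetical, and sqlite-vec's real implementation is native code, not this loop.

```python
import math
import struct

def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def brute_force_knn(query, rows, top_k=3):
    """rows: list of (rowid, float32 blob). Returns [(rowid, distance)], nearest first."""
    scored = []
    for rowid, blob in rows:
        vector = struct.unpack("%sf" % (len(blob) // 4), blob)
        scored.append((rowid, l2_distance(query, vector)))
    scored.sort(key=lambda item: item[1])  # smaller distance means more similar
    return scored[:top_k]

rows = [
    (1, struct.pack("2f", 1.0, 0.0)),
    (2, struct.pack("2f", 0.0, 1.0)),
    (3, struct.pack("2f", 0.5, 0.5)),
]
result = brute_force_knn([1.0, 0.0], rows, top_k=2)
print([rowid for rowid, _ in result])  # [1, 3]
```

Smaller distances mean closer matches, which is why the results above are listed in ascending order of `distance`.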


## 🤖 Step 9: RAG Query Function with Local LLM

Combine semantic search with a local LLM to generate contextual responses:

> **Note**: Make sure you have Ollama installed and the Qwen3 model downloaded: `ollama pull qwen3:4b`

In [18]:
def rag_query(question, top_k=3):
    """Complete RAG pipeline: search + generate response"""
    
    print(f"❓ Question: {question}")
    print("=" * 60)
    
    # Step 1: Semantic search
    contexts = semantic_search(question, top_k)
    
    if not contexts:
        return "❌ No relevant information found."
    
    # Step 2: Build prompt with context
    combined_context = "\n\n".join(contexts)
    prompt = f"""Use the following contexts to answer the question comprehensively.
If you don't know the answer based on the provided contexts, just say that you don't know.

Contexts:
{combined_context}

Question: {question}

Answer:"""
    
    # Step 3: Generate response with local LLM
    print(f"🤖 Generating response with {LLM_MODEL}...\n")
    
    try:
        # Stream response for real-time output
        stream = ollama.chat(
            model=LLM_MODEL,
            messages=[{'role': 'user', 'content': prompt}],
            stream=True
        )
        
        response = ""
        for chunk in stream:
            if 'message' in chunk and 'content' in chunk['message']:
                content = chunk['message']['content']
                print(content, end='', flush=True)
                response += content
        
        print("\n" + "=" * 60)
        return response
        
    except Exception as e:
        error_msg = f"❌ Error with LLM: {e}"
        print(error_msg)
        return error_msg

print("✅ RAG query function ready!")

✅ RAG query function ready!
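Because each retrieved chunk can be up to 2048 tokens long, three chunks make for a large prompt. One possible variation, sketched below, labels each context and caps the total context length. The helper name `build_prompt` and the `max_context_chars` budget are assumptions for illustration, not part of the pipeline above, and the right budget depends on the model's context window.

```python
def build_prompt(question, contexts, max_context_chars=8000):
    """Join labeled contexts under a character budget (hypothetical helper and limit)."""
    kept, used = [], 0
    for i, text in enumerate(contexts, start=1):
        remaining = max_context_chars - used
        if remaining <= 0:
            break  # budget exhausted, drop the remaining contexts
        snippet = text[:remaining]
        kept.append(f"[Context {i}]\n{snippet}")
        used += len(snippet)
    combined = "\n\n".join(kept)
    return (
        "Use the following contexts to answer the question comprehensively.\n"
        "If you don't know the answer based on the provided contexts, "
        "just say that you don't know.\n\n"
        f"Contexts:\n{combined}\n\nQuestion: {question}\n\nAnswer:"
    )

prompt = build_prompt("What is sqlite-vec?", ["a" * 10, "b" * 10], max_context_chars=15)
print(prompt)
```

In this toy call the second context is cut to five characters because the first one already used ten of the fifteen allowed.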


## 🎯 Step 10: Demo Queries - Let's Test Our RAG System!

Now let's put our RAG system to work with some interesting questions:

In [19]:
# Demo Question 1: About EmbeddingGemma
response1 = rag_query("What makes EmbeddingGemma special for mobile applications?")

❓ Question: What makes EmbeddingGemma special for mobile applications?
🔍 Found 3 relevant documents:
📄 Source: embeddinggemma_google_blog | Distance: 0.5936
📝 Preview: <bos>Introducing EmbeddingGemma: The Best-in-Class Open Model for On-Device Embeddings
            
...

📄 Source: huggingface_embeddinggemma_blog | Distance: 0.6205
📝 Preview: <bos>Welcome EmbeddingGemma, Google's new efficient embedding model
Hugging Face
Models
Datasets
Spa...

📄 Source: huggingface_embeddinggemma | Distance: 0.6275
📝 Preview: .70
68.70
Mixed Precision* (768d)
68.03
68.03
Note: QAT models are evaluated after quantization
* Mi...

🤖 Generating response with qwen3:4b...

<think>
Let me analyze the text to understand what makes EmbeddingGemma special for mobile applications.

From the provided text, I can find several key points about EmbeddingGemma and its mobile applications:

1. The text mentions "Use in mobile applications" in the context of the model's capabilities.

2. There's a section about "Mobi

In [20]:
# Demo Question 2: About SQLite-vec
response2 = rag_query("How do I use SQLite-vec with Python?")

❓ Question: How do I use SQLite-vec with Python?
🔍 Found 3 relevant documents:
📄 Source: sqlite_vec_python | Distance: 0.5931
📝 Preview: sqlite-vec in Python | sqlite-vec
🚧🚧🚧 This documentation is a work-in-progress! 🚧🚧🚧
Skip to content
...

📄 Source: sqlite_vec_demo | Distance: 0.6076
📝 Preview: import sqlite3
import sqlite_vec

from typing import List
import struct


def serialize_f32(vector: ...

📄 Source: sentence_transformers | Distance: 0.6535
📝 Preview: start_multi_process_pool.
Parameters
:
pool
(
Dict
[
str
,
object
]
) – A dictionary containing the ...

🤖 Generating response with qwen3:4b...

<think>
Let me analyze the provided contexts to answer the question about how to use SQLite-vec with Python.

The contexts include documentation for sqlite-vec, specifically the Python section. Here's what I can extract:

1. Installation: The documentation says to install the sqlite-vec PyPi package using pip:
   ```bash
   pip install sqlite-vec
   ```

2. Loading the extension: After i

In [21]:
# Demo Question 3: About Qwen3
response3 = rag_query("What are the key features of Qwen3 model?")

❓ Question: What are the key features of Qwen3 model?
🔍 Found 3 relevant documents:
📄 Source: huggingface_embeddinggemma_blog | Distance: 0.6556
📝 Preview: enumerate
(eval_dataset[
"passage_text"
] + train_dataset[
"passage_text"
][:
30_000
]))
relevant_do...

📄 Source: qwen3_ollama | Distance: 0.6615
📝 Preview: qwen3
Models
GitHub
Discord
Turbo
Sign in
Download
Models
Download
GitHub
Discord
Sign in
qwen3
7.8M...

📄 Source: sentence_transformers | Distance: 0.6685
📝 Preview: start_multi_process_pool.
Parameters
:
pool
(
Dict
[
str
,
object
]
) – A dictionary containing the ...

🤖 Generating response with qwen3:4b...

<think>
Let me analyze the question and the provided content to answer what the key features of Qwen3 model are.

The question asks about the key features of Qwen3 model. Looking at the content provided, I can see that there's a section about Qwen3 in the text. Let me extract the relevant information.

From the text:

### Key features of Qwen3:

1. **Significantly enhance

In [22]:
# Demo Question 4: Technical concepts
response4 = rag_query("How does vector similarity search work?")

❓ Question: How does vector similarity search work?
🔍 Found 3 relevant documents:
📄 Source: huggingface_embeddinggemma_blog | Distance: 0.6704
📝 Preview: prominent red spot."
,
"Saturn, famous for its rings, is sometimes mistaken for the Red Planet."
]
#...

📄 Source: sentence_transformers | Distance: 0.6733
📝 Preview: ,
prompt_name
:
str
|
None
=
None
,
prompt
:
str
|
None
=
None
,
batch_size
:
int
=
32
,
show_progre...

📄 Source: huggingface_embeddinggemma_blog | Distance: 0.6756
📝 Preview: <bos>Welcome EmbeddingGemma, Google's new efficient embedding model
Hugging Face
Models
Datasets
Spa...

🤖 Generating response with qwen3:4b...

<think>
Let me analyze this problem carefully. The user has provided a lot of text, and I need to understand what they're asking for. 

First, I see that the user has provided:
1. A description of embedding models (specifically EmbeddingGemma)
2. Some code examples for using EmbeddingGemma with different frameworks
3. A question at the end: "How does vect
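The responses above all begin with a `<think>` block, because the model's reasoning is streamed as part of the message content. If only the final answer is wanted, the returned string could be post-processed along these lines. This is a sketch: `strip_think` is a hypothetical helper, and it assumes the reasoning is always wrapped in a matching `<think>...</think>` pair.

```python
import re

# Non-greedy match so that several separate blocks are removed independently.
THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", re.DOTALL)

def strip_think(response):
    """Remove <think>...</think> blocks from a response string (hypothetical helper)."""
    return THINK_BLOCK.sub("", response).strip()

raw = "<think>\nLet me analyze the contexts.\n</think>\n\nInstall it with pip."
print(strip_think(raw))  # Install it with pip.
```

A response whose `<think>` tag is never closed is left unchanged, so a truncated generation remains visible instead of being silently emptied.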

## 🎉 Congratulations!

You've successfully built a complete private RAG system! Here's what we accomplished:

### ✅ What We Built:
- **Document Scraping**: Automated collection from web sources
- **Token-Based Chunking**: 2048-token chunks with a 100-token overlap, sized with the model's own tokenizer
- **Modern Embeddings**: Google's EmbeddingGemma, truncated to 256 dimensions via Matryoshka Representation Learning
- **Vector Database**: SQLite-vec for fast similarity search
- **Local LLM**: Qwen3 for generating contextual responses
- **Complete Privacy**: Embedding, search, and generation all run locally, with no external API calls at query time

### 🚀 Key Benefits:
- **100% Private**: All processing happens on your machine
- **Zero Cost**: No API fees or usage limits
- **Offline Capable**: Works without internet after initial setup
- **Efficient**: 256-dimensional EmbeddingGemma vectors and SQLite-vec keep storage and search lightweight
- **Scalable**: Should handle thousands of chunks, though this notebook only exercises 28

### 🔄 Next Steps:
- Add more document sources to expand knowledge base
- Experiment with different chunking strategies
- Try other embedding dimensions (128, 512, 768)
- Implement conversation memory for multi-turn chats
- Build a simple web interface with Streamlit or Gradio

### 📚 Resources:
- [EmbeddingGemma Documentation](https://huggingface.co/google/embeddinggemma-300m)
- [SQLite-vec GitHub](https://github.com/asg017/sqlite-vec)
- [Ollama Models](https://ollama.com/library)

Happy building! 🎯

In [None]:
# Cleanup - close database connection
conn.close()
print("🧹 Database connection closed. Demo complete!")