# Search & AI/RAG Pipelines

This notebook demonstrates how to combine **full-text search** with **vector similarity search** to build a document search engine suitable for RAG (Retrieval-Augmented Generation) pipelines.

SurrealDB supports both BM25-based full-text search and HNSW vector indexes natively, making it a strong choice for AI applications that need hybrid retrieval.

## Prerequisites

- SurrealDB running locally (`docker run --rm -p 8000:8000 surrealdb/surrealdb:latest start --user root --pass root`)
- Project dependencies installed (`uv sync`)
- A `.env` file in the project root (optional, falls back to defaults)

In [None]:
# Setup: add project root to path and configure SurrealDB connection
import os
import sys
project_root = os.path.abspath(os.path.join(os.getcwd(), ".."))
sys.path.append(project_root)
from dotenv import load_dotenv
load_dotenv()

from src.surreal_orm import SurrealDBConnectionManager

SurrealDBConnectionManager.set_connection(
    os.getenv("SURREALDB_URL", "ws://localhost:8000"),
    os.getenv("SURREALDB_USER", "root"),
    os.getenv("SURREALDB_PASS", "root"),
    os.getenv("SURREALDB_NAMESPACE", "ns"),
    os.getenv("SURREALDB_DATABASE", "db"),
)

## 1. Full-Text Search (BM25)

Full-text search is ideal when users type keyword-style queries against text content. SurrealDB scores matches with BM25, which ranks documents by term frequency and inverse document frequency, normalized for document length.
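
To make the scoring concrete, here is a minimal pure-Python sketch of textbook Okapi BM25 with the same `k1=1.2` and `b=0.75` values passed to `BM25(1.2, 0.75)` in the index definitions. The exact IDF variant SurrealDB uses internally is an assumption; `bm25_score` and the tiny corpus are illustrative only and not part of the ORM.

```python
import math


def bm25_score(query_terms, doc_tokens, corpus, k1=1.2, b=0.75):
    """Score one tokenized document against a query with textbook Okapi BM25."""
    n_docs = len(corpus)
    avg_len = sum(len(doc) for doc in corpus) / n_docs
    score = 0.0
    for term in query_terms:
        tf = doc_tokens.count(term)  # term frequency in this document
        if tf == 0:
            continue
        df = sum(1 for doc in corpus if term in doc)  # documents containing the term
        # Lucene-style IDF (assumed variant): rarer terms weigh more
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        # b controls how strongly longer documents are penalized
        length_norm = k1 * (1 - b + b * len(doc_tokens) / avg_len)
        score += idf * tf * (k1 + 1) / (tf + length_norm)
    return score


# Lowercased, whitespace-tokenized titles (hypothetical mini corpus)
corpus = [
    "introduction to quantum computing".split(),
    "quantum physics explained".split(),
    "classical computing architecture".split(),
]
scores = [bm25_score(["quantum"], doc, corpus) for doc in corpus]
for doc, score in zip(corpus, scores):
    print(f"{score:.4f}  {' '.join(doc)}")
```

Both matching titles contain "quantum" once, but the shorter title scores higher because of length normalization, and the non-matching title scores zero.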

**When to use FTS:**
- Keyword-based search (e.g., "quantum physics")
- When exact term matching matters
- When you need highlighted snippets in results

> **Note:** Full-text search requires a `DEFINE INDEX ... SEARCH ANALYZER` on the SurrealDB side. The ORM generates the queries, but the index must exist for `@N@` match operators to work.

In [None]:
# Define a model for articles with text fields we'll search against
from src.surreal_orm import BaseSurrealModel, SurrealConfigDict, SearchScore, SearchHighlight

class Article(BaseSurrealModel):
    model_config = SurrealConfigDict(table_name="article")

    id: str | None = None
    title: str
    body: str = ""
    category: str = "general"

In [None]:
# Define a text analyzer and FTS indexes on the SurrealDB side
# These are required for the @N@ match operators used by .search()
await Article.raw_query("""
    DEFINE ANALYZER IF NOT EXISTS simple_analyzer TOKENIZERS blank, class FILTERS lowercase;
    DEFINE INDEX IF NOT EXISTS ft_title ON article FIELDS title
        SEARCH ANALYZER simple_analyzer BM25(1.2, 0.75) HIGHLIGHTS;
    DEFINE INDEX IF NOT EXISTS ft_body ON article FIELDS body
        SEARCH ANALYZER simple_analyzer BM25(1.2, 0.75) HIGHLIGHTS;
""")
print("FTS analyzer and indexes defined on 'article' table")

In [None]:
# Create sample articles for searching
articles = [
    Article(title="Introduction to Quantum Computing", body="Quantum computers use qubits to perform computations.", category="science"),
    Article(title="Quantum Physics Explained", body="Quantum physics studies the behaviour of matter at atomic scales.", category="science"),
    Article(title="Classical Computing Architecture", body="Von Neumann architecture is the foundation of classical computers.", category="technology"),
    Article(title="Machine Learning Basics", body="Machine learning algorithms learn patterns from data.", category="technology"),
    Article(title="Deep Learning and Neural Networks", body="Neural networks with many layers can learn complex representations.", category="technology"),
]

for article in articles:
    await article.save()
    print(f"Saved: {article.title}")

In [None]:
# Basic full-text search on the title field
# This generates: SELECT * FROM article WHERE title @0@ 'quantum'
results = await Article.objects().search(title="quantum").exec()
print(f"Found {len(results)} articles matching 'quantum':")
for r in results:
    print(f"  - {r.title}")

In [None]:
# Full-text search with BM25 relevance scoring and highlighted snippets
# SearchScore(0) references the @0@ match operator in the query
results = await Article.objects().search(title="quantum").annotate(
    relevance=SearchScore(0),
    snippet=SearchHighlight("<b>", "</b>", 0),
).exec()

print("Search results with relevance scores:")
for r in results:
    # The annotated fields are attached to each result instance
    relevance = getattr(r, "relevance", "N/A")
    snippet = getattr(r, "snippet", "N/A")
    print(f"  [{relevance}] {r.title}")
    print(f"     Snippet: {snippet}")

In [None]:
# Multi-field search: search across title AND body simultaneously
# Each field gets its own match reference (@0@, @1@, etc.)
results = await Article.objects().search(title="learning", body="neural").annotate(
    title_score=SearchScore(0),
    body_score=SearchScore(1),
).exec()

print("Multi-field search results:")
for r in results:
    ts = getattr(r, "title_score", "N/A")
    bs = getattr(r, "body_score", "N/A")
    print(f"  {r.title} (title: {ts}, body: {bs})")

## 2. Vector Similarity Search (KNN with HNSW)

Vector search finds documents by semantic similarity rather than keyword matching.
You store embedding vectors alongside your data, then query for the K nearest
neighbours using SurrealDB's HNSW index.
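
As a reference point, the sketch below does an exact (brute-force) K-nearest-neighbour search by cosine distance in plain Python, using the same mock embeddings and query vector as the cells in this section. HNSW is an approximate index, so on large datasets its results can differ slightly from this exact ranking. The helper names `cosine_distance` and `knn` are illustrative and not part of the ORM.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity; 0.0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def knn(query, items, k):
    """Exact KNN: sort every (title, vector) pair by distance to the query."""
    ranked = sorted(items, key=lambda item: cosine_distance(query, item[1]))
    return ranked[:k]


# Copies of the demo embeddings used in this section
embeddings = [
    ("Python Basics", [0.1, 0.9, 0.2, 0.3]),
    ("JavaScript Guide", [0.2, 0.8, 0.3, 0.4]),
    ("Rust Systems", [0.8, 0.1, 0.9, 0.7]),
    ("Go Concurrency", [0.7, 0.2, 0.8, 0.6]),
    ("SQL Databases", [0.4, 0.5, 0.4, 0.5]),
]
query = [0.15, 0.85, 0.25, 0.35]

top3 = [title for title, _ in knn(query, embeddings, 3)]
print(top3)
```

An index such as HNSW avoids scanning every vector like this, trading a small amount of accuracy for much faster lookups.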

**When to use vector search:**
- Semantic search ("find similar documents")
- RAG pipelines (retrieve relevant context for LLMs)
- Recommendation systems

> **Note:** Vector search requires a `DEFINE INDEX ... HNSW` on the field. The ORM generates `<|K|>` KNN queries, but the index must exist.

In [None]:
# Define a model with a vector field for embeddings
# VectorField[4] means 4-dimensional vectors (tiny for demo; real embedding models typically produce 384-4096)
from src.surreal_orm.fields import VectorField

class Document(BaseSurrealModel):
    model_config = SurrealConfigDict(table_name="document")

    id: str | None = None
    title: str
    content: str = ""
    embedding: VectorField[4]  # Small dimension for demo

In [None]:
# Define the HNSW vector index and an FTS index for hybrid search
# The HNSW index is required for the <|K|> KNN operator used by .similar_to()
# The FTS index reuses simple_analyzer, defined in section 1
await Document.raw_query("""
    DEFINE INDEX IF NOT EXISTS vec_idx ON document FIELDS embedding
        HNSW DIMENSION 4 DIST COSINE TYPE F32;
    DEFINE INDEX IF NOT EXISTS ft_content ON document FIELDS content
        SEARCH ANALYZER simple_analyzer BM25(1.2, 0.75) HIGHLIGHTS;
""")
print("HNSW and FTS indexes defined on 'document' table")

In [None]:
# Create documents with mock embeddings
# In production, you'd compute these with an embedding model (e.g., OpenAI ada-002)
docs = [
    Document(title="Python Basics", content="Learn Python programming", embedding=[0.1, 0.9, 0.2, 0.3]),
    Document(title="JavaScript Guide", content="Frontend development with JS", embedding=[0.2, 0.8, 0.3, 0.4]),
    Document(title="Rust Systems", content="Systems programming in Rust", embedding=[0.8, 0.1, 0.9, 0.7]),
    Document(title="Go Concurrency", content="Concurrent programming with Go", embedding=[0.7, 0.2, 0.8, 0.6]),
    Document(title="SQL Databases", content="Relational database design", embedding=[0.4, 0.5, 0.4, 0.5]),
]

for doc in docs:
    await doc.save()
    print(f"Saved: {doc.title} with embedding {doc.embedding}")

In [None]:
# KNN similarity search: find the 3 most similar documents to a query vector
# This generates: SELECT *, vector::distance::knn() AS _knn_distance
#                  FROM document WHERE embedding <|3|> $vec
#                  ORDER BY _knn_distance
query_vector = [0.15, 0.85, 0.25, 0.35]  # Similar to Python/JS docs

results = await Document.objects().similar_to("embedding", query_vector, limit=3).exec()
print("Top 3 similar documents:")
for doc in results:
    distance = getattr(doc, "_knn_distance", "N/A")
    print(f"  - {doc.title} (distance: {distance})")

In [None]:
# Vector search with search effort tuning (ef parameter)
# Higher ef = more accurate but slower; useful for large datasets
results = await Document.objects().similar_to(
    "embedding", query_vector, limit=3, ef=40
).exec()
print("Results with ef=40 (higher accuracy):")
for doc in results:
    print(f"  - {doc.title}")

In [None]:
# Combine vector search with traditional filters
# Useful for scoped retrieval (e.g., only search within a category)
systems_vector = [0.75, 0.15, 0.85, 0.65]  # Similar to Rust/Go docs
results = await Document.objects().filter(
    title__contains="Go"
).similar_to("embedding", systems_vector, limit=5).exec()

print("Filtered vector search (title contains 'Go'):")
for doc in results:
    print(f"  - {doc.title}")

## 3. Hybrid Search (Vector + FTS)

Hybrid search combines semantic understanding from vectors with exact keyword matching from FTS. Results are merged using **Reciprocal Rank Fusion (RRF)**, which scores each document by summing 1/(k + rank) over the ranked lists it appears in, so documents ranked highly by both retrieval methods rise to the top.

**When to use hybrid search:**
- RAG pipelines where both meaning and keywords matter
- Search engines where users expect both exact and fuzzy matches
- When neither pure FTS nor pure vector gives good enough recall
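
The fusion step itself is small enough to sketch in plain Python. The example below is a minimal, generic RRF implementation, not the ORM's or SurrealDB's own code: the function name `reciprocal_rank_fusion`, the record IDs, and the constant `k=60` (a commonly used default) are all assumptions for illustration.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of IDs; each hit contributes 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Hypothetical ranked results from the two retrievers, best match first
fts_ranking = ["doc:rust", "doc:go", "doc:python"]
vector_ranking = ["doc:go", "doc:sql", "doc:rust"]

fused = reciprocal_rank_fusion([fts_ranking, vector_ranking])
print(fused)
```

Because RRF only looks at ranks, it never has to reconcile BM25 scores with vector distances, which live on unrelated scales. Here `doc:go` wins because it ranks near the top of both lists, while documents found by only one retriever fall behind.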

In [None]:
# Cleanup: remove all test data and indexes
await Article.raw_query("REMOVE TABLE IF EXISTS article")
await Document.raw_query("REMOVE TABLE IF EXISTS document")
await Article.raw_query("REMOVE ANALYZER IF EXISTS simple_analyzer")
print("Cleanup complete.")

## Production Notes

- **Embedding dimensions:** Use real embedding models in production. OpenAI `text-embedding-ada-002` produces 1536-dimensional vectors. Cohere and open-source models vary (384-4096).
- **Index configuration:** For large datasets, tune HNSW parameters (`m`, `efc`) in your migration's `CreateIndex` operation.
- **Vector types:** Use `VectorField[1536, "F32"]` for single-precision or `VectorField[1536, "F64"]` for double-precision.
- **FTS analyzers:** Define custom analyzers with tokenizers and filters (e.g., stemming, stop words) using `DefineAnalyzer` in migrations.
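
For the HNSW tuning note above, the sketch below builds a raw SurrealQL statement that sets the two parameters explicitly. It assumes SurrealQL's `EFC` and `M` clauses on `DEFINE INDEX ... HNSW`; the helper `hnsw_index_ddl` and the values 150 and 12 are illustrative, and the arguments of the migration `CreateIndex` operation are not shown in this notebook, so they are not used here.

```python
def hnsw_index_ddl(name, table, field, dimension,
                   dist="COSINE", vector_type="F32", efc=150, m=12):
    """Build a DEFINE INDEX statement with explicit HNSW build parameters.

    efc: candidate list size while building the graph (assumed EFC clause)
    m:   maximum connections per node (assumed M clause)
    """
    return (
        f"DEFINE INDEX IF NOT EXISTS {name} ON {table} FIELDS {field} "
        f"HNSW DIMENSION {dimension} DIST {dist} TYPE {vector_type} "
        f"EFC {efc} M {m};"
    )


ddl = hnsw_index_ddl("vec_idx", "document", "embedding", 1536)
print(ddl)
```

The resulting string could be passed to `raw_query`, as the index-definition cells in this notebook do. Larger values for either parameter generally improve recall at the cost of build time and memory.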