systems vector search
Active contributors: Saksham, Ravi
Vector search powers semantic property recommendations on top of pgvector. Property listings are embedded with Google's Gemini embedding model and stored in a property_embeddings table; the search layer blends vector similarity with PostgreSQL full-text ranking to produce a hybrid relevance score. A scheduler keeps embeddings in sync with property changes, and a backfill script can rebuild the index from scratch.
app/vector/
├── store.py # DB operations: upsert, watermark, advisory lock, hash
├── sync.py # run_property_vector_sync: incremental sync pipeline
├── compose.py # build_embedding_text, build_metadata
├── embedding_client.py # Gemini embed/embed_query wrappers
└── backfill.py # CLI entry point for a one-off backfill
app/services/
└── vector_sync_scheduler.py # Registers the sync job on the shared scheduler
| Abstraction | Location | Purpose |
|---|---|---|
| property_embeddings table | managed via raw SQL in store.py | Stores (property_id, embedding, metadata, emb_text_hash, created_at, updated_at) |
| vector_sync_state table | store.py | Holds the last_watermark for incremental sync |
| compute_text_hash | app/vector/store.py | SHA-256 of the embedding text; skip re-embedding when unchanged |
| build_embedding_text | app/vector/compose.py | Canonical text from title, type, location, numbers, description, amenities, tags |
| build_metadata | app/vector/compose.py | JSON metadata stored alongside the embedding |
| embed / embed_query | app/vector/embedding_client.py | Gemini embed_content with retrieval_document / retrieval_query task types |
| run_property_vector_sync | app/vector/sync.py | Three-phase incremental sync pipeline |
| acquire_advisory_lock | app/vector/store.py | pg_try_advisory_lock to prevent concurrent sync workers |
graph TD
Sched["vector_sync_scheduler<br/>cron or interval"] --> Sync["run_property_vector_sync"]
Sync --> P1["Phase 1: acquire advisory lock<br/>read watermark<br/>fetch changed properties<br/>compute text hashes"]
P1 --> P2["Phase 2: embed texts<br/>NO DB session held"]
P2 --> P3["Phase 3: upsert embeddings<br/>advance watermark"]
P1 -->|unchanged hash| Skip["skip embedding for this row"]
Embed["embed(texts)"] --> Gemini["Gemini embed_content<br/>768-dim vectors"]
Search["semantic search"] --> Query["embed_query(user query)"]
Query --> Hybrid["hybrid: vector cosine + ts_rank<br/>combined_relevance_expr"]
Hybrid --> Builder["PropertyQueryBuilder.apply_sort<br/>SortBy.relevance"]
The sync pipeline is split into three phases so the DB session is released during the network-bound embedding call. Phase 1 acquires a Postgres advisory lock (pg_try_advisory_lock(hashtext('property_vector_sync'))), reads the watermark from vector_sync_state, fetches changed properties (only the columns consumed by build_embedding_text), computes the canonical embedding text and its SHA-256 hash, and compares against the stored emb_text_hash to decide which rows actually need re-embedding. If no rows changed, the lock is released and the run returns early.
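The hash comparison in Phase 1 can be sketched as follows. This is a minimal illustration, not the code in store.py: the signature of compute_text_hash is assumed, and select_rows_to_embed and the sample data are hypothetical names introduced only for this example.

```python
import hashlib


def compute_text_hash(text: str) -> str:
    """Return the SHA-256 hex digest of the canonical embedding text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def select_rows_to_embed(
    candidates: dict[int, str], stored_hashes: dict[int, str]
) -> dict[int, str]:
    """Keep only properties whose text hash differs from the stored one.

    candidates maps property_id -> freshly composed embedding text.
    stored_hashes maps property_id -> emb_text_hash already in the table.
    (Hypothetical helper; the real pipeline may structure this differently.)
    """
    return {
        pid: text
        for pid, text in candidates.items()
        if stored_hashes.get(pid) != compute_text_hash(text)
    }


# Illustrative data only: property 1 is unchanged, property 2 has no embedding yet.
candidates = {1: "2BHK flat in Pune", 2: "Villa in Goa with pool"}
stored_hashes = {1: compute_text_hash("2BHK flat in Pune")}
to_embed = select_rows_to_embed(candidates, stored_hashes)
```

Here only property 2 is passed on to Phase 2. An empty result corresponds to the early-return case described above.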
Phase 2 calls the Gemini embedding API with only the texts whose hash changed. This happens with no DB session held, so the Supabase pooler connection is not held during the network I/O. The embedding client uses google-genai with retries (VECTOR_SYNC_MAX_RETRIES, exponential backoff) and runs the sync call in a thread executor to avoid blocking the event loop.
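A rough sketch of the retry-plus-executor pattern, assuming a blocking embed function is handed in. embed_with_retries, FlakyEmbedder, and the base_delay parameter are hypothetical; the real client wraps google-genai and reads VECTOR_SYNC_MAX_RETRIES from settings.

```python
import asyncio
from typing import Callable, Sequence


async def embed_with_retries(
    embed_fn: Callable[[Sequence[str]], list[list[float]]],
    texts: Sequence[str],
    max_retries: int = 3,
    base_delay: float = 0.5,
) -> list[list[float]]:
    """Run a blocking embed call in a worker thread, retrying with exponential backoff."""
    for attempt in range(max_retries + 1):
        try:
            # asyncio.to_thread keeps the event loop free during the blocking call.
            return await asyncio.to_thread(embed_fn, texts)
        except Exception:
            if attempt == max_retries:
                raise  # retries exhausted: surface the last error
            await asyncio.sleep(base_delay * (2 ** attempt))
    raise RuntimeError("max_retries must be >= 0")


class FlakyEmbedder:
    """Stand-in for the real client: fails a few times, then returns dummy vectors."""

    def __init__(self, failures: int = 2) -> None:
        self.failures = failures
        self.calls = 0

    def __call__(self, texts: Sequence[str]) -> list[list[float]]:
        self.calls += 1
        if self.calls <= self.failures:
            raise ConnectionError("transient failure")
        return [[0.0, 0.0, 0.0] for _ in texts]


embedder = FlakyEmbedder()
vectors = asyncio.run(
    embed_with_retries(embedder, ["a", "b"], max_retries=3, base_delay=0.01)
)
```

No database session appears anywhere in this function, which is the point of Phase 2: the only resource held while waiting on the network is a worker thread.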
Phase 3 opens a fresh background-pool session, upserts embeddings (using a pgvector literal CAST(:emb AS vector) and ON CONFLICT (property_id) DO UPDATE), advances the watermark to the max updated_at (or created_at fallback) of the batch, and commits.
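The upsert and watermark logic might look roughly like this. The column list comes from the property_embeddings description above; the exact SQL, the jsonb cast on metadata, and the helper names to_pgvector_literal and next_watermark are assumptions made for illustration.

```python
from datetime import datetime, timezone
from typing import Optional, Sequence

# Assumed statement shape; the real SQL in store.py may differ.
UPSERT_SQL = """
INSERT INTO property_embeddings (property_id, embedding, metadata, emb_text_hash)
VALUES (:property_id, CAST(:emb AS vector), CAST(:metadata AS jsonb), :emb_text_hash)
ON CONFLICT (property_id) DO UPDATE
SET embedding = EXCLUDED.embedding,
    metadata = EXCLUDED.metadata,
    emb_text_hash = EXCLUDED.emb_text_hash,
    updated_at = now()
"""


def to_pgvector_literal(embedding: Sequence[float]) -> str:
    """Format floats in pgvector's text input form, e.g. '[0.1,0.2,0.3]'."""
    return "[" + ",".join(repr(float(x)) for x in embedding) + "]"


def next_watermark(rows: Sequence[dict]) -> Optional[datetime]:
    """Max of updated_at across the batch, falling back to created_at per row."""
    stamps = [r.get("updated_at") or r.get("created_at") for r in rows]
    stamps = [s for s in stamps if s is not None]
    return max(stamps) if stamps else None


# Illustrative batch: the second row was never updated, so created_at is used.
batch = [
    {
        "updated_at": datetime(2024, 1, 2, tzinfo=timezone.utc),
        "created_at": datetime(2024, 1, 1, tzinfo=timezone.utc),
    },
    {"updated_at": None, "created_at": datetime(2024, 1, 3, tzinfo=timezone.utc)},
]
literal = to_pgvector_literal([0.1, 0.2, 1])
watermark = next_watermark(batch)
```

Passing the embedding as a text literal and casting it server-side avoids needing a driver-level pgvector type adapter.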
At query time, semantic search embeds the user query with embed_query (task type retrieval_query), computes cosine similarity against property_embeddings.embedding, and combines it with the ts_rank from PropertyQueryBuilder's full-text search into a combined_relevance_expr. The builder's apply_sort(sort_by=SortBy.relevance, combined_relevance_expr=...) then orders by that hybrid score. See repositories.
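To make the hybrid score concrete, here is a pure-Python sketch of blending cosine similarity with a text rank. The source does not state how combined_relevance_expr weights the two signals, so the 0.7/0.3 split and the function names below are assumptions. In the database the similarity would typically be derived from pgvector's cosine distance operator (`<=>`) as 1 minus the distance.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def combined_relevance(
    vector_sim: float, text_rank: float, vector_weight: float = 0.7
) -> float:
    """Weighted blend of the two signals (the weighting is an assumed example)."""
    return vector_weight * vector_sim + (1.0 - vector_weight) * text_rank


# Illustrative listings: id -> (embedding, full-text rank).
query_vec = [1.0, 0.0]
listings = {
    "a": ([1.0, 0.0], 0.1),
    "b": ([0.0, 1.0], 0.9),
    "c": ([1.0, 1.0], 0.5),
}
ranked = sorted(
    listings,
    key=lambda k: combined_relevance(
        cosine_similarity(query_vec, listings[k][0]), listings[k][1]
    ),
    reverse=True,
)
```

With these numbers, listing "a" ranks first despite its low text rank, because its embedding matches the query exactly. Listing "b" ranks last even though it has the best text rank.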
The scheduler in app/services/vector_sync_scheduler.py is gated by VECTOR_SYNC_ENABLED. If VECTOR_SYNC_CRON is set it uses a CronTrigger; otherwise it falls back to an IntervalTrigger with VECTOR_SYNC_INTERVAL_SECONDS. The job is registered with max_instances=1 so overlapping runs are skipped.
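The trigger selection can be sketched as a small pure function. choose_trigger is a hypothetical helper, and the accepted truthy values and the 300-second default are assumptions; only the three setting names and max_instances=1 come from the description above.

```python
from typing import Mapping, Optional


def choose_trigger(env: Mapping[str, str]) -> Optional[dict]:
    """Describe the trigger to register, or return None when sync is disabled."""
    # Assumed truthy values; the real settings parser may accept others.
    if env.get("VECTOR_SYNC_ENABLED", "").lower() not in {"1", "true", "yes"}:
        return None
    cron = env.get("VECTOR_SYNC_CRON", "").strip()
    if cron:
        # With APScheduler 3.x this would map to CronTrigger.from_crontab(cron).
        return {"kind": "cron", "expression": cron, "max_instances": 1}
    # Assumed default of 300 seconds when no interval is configured.
    seconds = int(env.get("VECTOR_SYNC_INTERVAL_SECONDS", "300"))
    # With APScheduler 3.x this would map to IntervalTrigger(seconds=seconds).
    return {"kind": "interval", "seconds": seconds, "max_instances": 1}


disabled = choose_trigger({})
cron_cfg = choose_trigger(
    {"VECTOR_SYNC_ENABLED": "1", "VECTOR_SYNC_CRON": "*/15 * * * *"}
)
interval_cfg = choose_trigger(
    {"VECTOR_SYNC_ENABLED": "true", "VECTOR_SYNC_INTERVAL_SECONDS": "120"}
)
```

Keeping the decision separate from scheduler registration makes the precedence (cron first, interval as fallback) easy to test without starting a scheduler.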
- Property search layers vector similarity on top of PropertyQueryBuilder's text rank. See repositories and features/ghar-core.
- Background pool: sync uses AsyncSessionLocalBG so it does not starve HTTP/MCP request traffic. See core-cross-cutting.
- Shared scheduler: the sync job registers on the single AsyncIOScheduler. See infrastructure.
- Gemini embedding model: configured via GOOGLE_API_KEY and GEMINI_EMBED_MODEL in settings.
- Change the embedding text: edit build_embedding_text in app/vector/compose.py. The hash automatically triggers re-embedding on the next sync pass.
- Change the embedding model: update GEMINI_EMBED_MODEL and run a backfill (set VECTOR_SYNC_FORCE=1 or run app/vector/backfill.py).
- Tune sync cadence: set VECTOR_SYNC_CRON or VECTOR_SYNC_INTERVAL_SECONDS.
| File | Role |
|---|---|
| app/vector/store.py | DB ops: upsert, watermark, advisory lock, hash |
| app/vector/sync.py | Three-phase incremental sync pipeline |
| app/vector/compose.py | Embedding text + metadata composition |
| app/vector/embedding_client.py | Gemini embed/embed_query wrappers |
| app/vector/backfill.py | CLI backfill entry point |
| app/services/vector_sync_scheduler.py | Scheduler registration |