Context
DeepWiki-Open currently persists embeddings and indices locally via adalflow’s LocalDB (for example, in ~/.adalflow/databases when running with Docker) and, in some cases, writes embedding caches to disk using pickle (e.g., ./embedding_cache for DashScope batch embeddings). While there appears to be optional S3 persistence for the local database, there is no out-of-the-box integration with external vector databases.
Problem
- Local-only storage limits scalability, multi-repo/multi-tenant scenarios, and operational observability.
- Advanced vector search features (e.g., server-side filtering, HNSW/IVF tuning, partitioning, sharding, and horizontal scaling) are not available with the current LocalDB approach.
- Teams may already operate vector stores (e.g., Milvus) and want to consolidate infra and monitoring.
Proposal
Introduce an optional vector database backend, starting with Milvus, to store embeddings and metadata. LocalDB should remain the default to preserve the current zero-dependency workflow.
High-level design
- Storage abstraction
  - Define a storage interface (e.g., VectorStore) with methods upsert(chunks), query(embedding, k, filters), delete(ids), and health/info (see the sketch below).
  - Provide two implementations:
    - LocalDBVectorStore (existing behavior via adalflow LocalDB).
    - MilvusVectorStore (new).
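  A minimal sketch of what that interface could look like in Python; Chunk and all method signatures are illustrative, not existing DeepWiki-Open code:

  ```python
  from dataclasses import dataclass, field
  from typing import Any, Dict, List, Optional, Protocol


  @dataclass
  class Chunk:
      """Hypothetical chunk record: one embedded slice of a repository file."""
      id: str
      repo_id: str
      path: str
      chunk_index: int
      content: str
      embedding: List[float]
      metadata: Dict[str, Any] = field(default_factory=dict)


  class VectorStore(Protocol):
      """Backend-agnostic storage interface; LocalDB and Milvus would both implement it."""

      def upsert(self, chunks: List[Chunk]) -> None: ...

      def query(
          self,
          embedding: List[float],
          k: int = 10,
          filters: Optional[Dict[str, Any]] = None,
      ) -> List[Chunk]: ...

      def delete(self, ids: List[str]) -> None: ...

      def health(self) -> Dict[str, Any]: ...
  ```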
- Configuration
  - Extend config (e.g., api/config/embedder.json or a new vector_store.json) to select backend and parameters:

    ```json
    {
      "vector_store": {
        "backend": "localdb" | "milvus",
        "milvus": {
          "uri": "http://milvus:19530",
          "user": "root",
          "password": "Milvus",
          "collection_name": "deepwiki_chunks",
          "index_type": "HNSW",
          "metric_type": "COSINE",
          "efConstruction": 200,
          "M": 16
        }
      }
    }
    ```

  - Auto-derive embedding dimension from the configured embedder (see the sketch below).
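  A small sketch of the config loading and dimension probing these bullets imply; the file path, function names, and embedder call shape are assumptions, not current code:

  ```python
  import json


  def load_vector_store_config(path: str = "api/config/vector_store.json") -> dict:
      """Read the proposed vector_store block; default to LocalDB if absent."""
      try:
          with open(path) as f:
              return json.load(f).get("vector_store", {"backend": "localdb"})
      except FileNotFoundError:
          return {"backend": "localdb"}


  def probe_embedding_dim(embed_fn) -> int:
      """Derive the vector dimension by embedding a short probe string.

      embed_fn is assumed to map a string to a list of floats, for whatever
      embedder (OpenAI, Ollama, DashScope, ...) is configured.
      """
      return len(embed_fn("dimension probe"))
  ```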
- Milvus schema and indexing
  - Collection schema (suggested):
    - id: string (doc+chunk id)
    - repo_id: string
    - path: string
    - chunk_index: int
    - content: string (optional, or keep outside if managed elsewhere)
    - metadata: JSON
    - embedding: float vector (dim = embedder output)
  - Build index per chosen metric and index type (see the creation sketch below).
  - Optional partitions per repo_id.
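  A sketch of collection and index creation with the pymilvus MilvusClient API, mirroring the suggested schema and the config example above; the dim value and exact build parameters should be checked against the configured embedder and the pymilvus version in use:

  ```python
  from pymilvus import DataType, MilvusClient

  client = MilvusClient(uri="http://milvus:19530", user="root", password="Milvus")

  # Schema mirroring the suggested fields; dim must match the configured embedder.
  schema = MilvusClient.create_schema(auto_id=False, enable_dynamic_field=False)
  schema.add_field("id", DataType.VARCHAR, is_primary=True, max_length=512)
  schema.add_field("repo_id", DataType.VARCHAR, max_length=256)
  schema.add_field("path", DataType.VARCHAR, max_length=1024)
  schema.add_field("chunk_index", DataType.INT64)
  schema.add_field("content", DataType.VARCHAR, max_length=65535)  # optional
  schema.add_field("metadata", DataType.JSON)
  schema.add_field("embedding", DataType.FLOAT_VECTOR, dim=1536)  # embedder output dim

  # HNSW index with the metric and build parameters from the config example.
  index_params = client.prepare_index_params()
  index_params.add_index(
      field_name="embedding",
      index_type="HNSW",
      metric_type="COSINE",
      params={"M": 16, "efConstruction": 200},
  )

  client.create_collection("deepwiki_chunks", schema=schema, index_params=index_params)
  ```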
- Pipeline integration
  - Where DatabaseManager builds/loads the store, select the backend from config.
  - On ingestion, write chunks and vectors to Milvus (or LocalDB, unchanged).
  - On retrieval, perform top-k vector search with optional metadata filters in Milvus (see the query sketch below).
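  Retrieval would then be a top-k search with a server-side filter, roughly as below; the collection name, filter expression, and search params are placeholders:

  ```python
  def milvus_top_k(client, query_embedding, repo_id, k=10):
      """Top-k vector search restricted to one repository via a metadata filter."""
      hits = client.search(
          collection_name="deepwiki_chunks",
          data=[query_embedding],                    # one query vector
          filter=f'repo_id == "{repo_id}"',          # server-side metadata filter
          limit=k,
          output_fields=["repo_id", "path", "chunk_index", "content", "metadata"],
          search_params={"metric_type": "COSINE", "params": {"ef": 128}},
      )
      # search() returns one result list per query vector; unwrap the single query.
      return hits[0]
  ```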
- Migration and compatibility
  - Provide a migrator to read existing LocalDB entries and write them into Milvus (see the sketch below).
  - Default remains LocalDB unless configured otherwise.
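  A hedged migration sketch, assuming adalflow's LocalDB exposes load_state/get_transformed_data and that transformed documents carry id, text, vector, and meta_data as the existing pipeline uses them; the transformer key and metadata field names are assumptions:

  ```python
  from adalflow.core.db import LocalDB


  def migrate_localdb_to_milvus(db_path, client, collection="deepwiki_chunks"):
      """Copy already-embedded chunks from an existing LocalDB file into Milvus."""
      db = LocalDB.load_state(filepath=db_path)
      docs = db.get_transformed_data(key="split_and_embed")  # assumed transformer key

      rows = [
          {
              "id": doc.id,
              "repo_id": doc.meta_data.get("repo_id", ""),
              "path": doc.meta_data.get("file_path", ""),
              "chunk_index": doc.meta_data.get("chunk_index", 0),
              "content": doc.text,
              "metadata": doc.meta_data,
              "embedding": doc.vector,
          }
          for doc in docs
      ]
      client.insert(collection_name=collection, data=rows)
  ```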
- Ops and DX
  - Optional docker-compose service for Milvus (standalone) for local dev.
  - Health check endpoint/logs to verify connectivity and collection readiness (see the sketch below).
  - Clear error messages for dimension mismatch, auth failures, or index creation failures.
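  The health check could be a small route on the existing FastAPI app; the route path and client construction here are illustrative:

  ```python
  from fastapi import APIRouter
  from pymilvus import MilvusClient

  router = APIRouter()


  @router.get("/health/vector-store")
  def vector_store_health():
      """Report Milvus connectivity and whether the chunk collection exists."""
      try:
          client = MilvusClient(uri="http://milvus:19530", user="root", password="Milvus")
          ready = client.has_collection("deepwiki_chunks")
          return {"backend": "milvus", "reachable": True, "collection_ready": ready}
      except Exception as exc:  # surface a clear error instead of a bare 500
          return {"backend": "milvus", "reachable": False, "error": str(exc)}
  ```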
Open questions
- Preferred default metric (COSINE vs INNER_PRODUCT) for current embedders?
- Whether to store full chunk content in Milvus or only ids/metadata and load content from LocalDB/disk?
- Multi-tenant isolation strategy (separate collections vs partitions by repo_id)?
Acceptance criteria
- Configuration switch to choose LocalDB or Milvus without code changes.
- Ingestion: repository chunks and embeddings successfully persisted to Milvus.
- Retrieval: queries return equivalent or better results compared to LocalDB for the same embedder.
- Documentation: setup guide for Milvus, config examples, and docker-compose snippet.
- Tests: unit tests for the Milvus store and an integration test covering ingest/query.
Alternatives considered
- Qdrant, Weaviate, or Chroma as additional backends; can be added behind the same abstraction later.
- Persisting LocalDB to S3 only (keeps local index semantics; doesn’t solve vector DB requirements).