Context
DeepWiki-Open currently persists embeddings and indices locally via adalflow’s LocalDB (for example, in ~/.adalflow/databases when running with Docker) and, in some cases, writes embedding caches to disk using pickle (e.g., ./embedding_cache for DashScope batch embeddings). While there appears to be optional S3 persistence for the local database, there is no out-of-the-box integration with external vector databases.
Problem
- Local-only storage limits scalability, multi-repo/multi-tenant scenarios, and operational observability.
- Advanced vector search features (e.g., server-side filtering, HNSW/IVF tuning, partitioning, sharding, and horizontal scaling) are not available with the current LocalDB approach.
- Teams may already operate vector stores (e.g., Milvus) and want to consolidate infra and monitoring.
Proposal
Introduce an optional vector database backend, starting with Milvus, to store embeddings and metadata. LocalDB should remain the default to preserve the current zero-dependency workflow.
High-level design
- Storage abstraction
  - Define a storage interface (e.g., VectorStore) with methods upsert(chunks), query(embedding, k, filters), delete(ids), and health/info (see the sketch below).
  - Provide two implementations:
    - LocalDBVectorStore (existing behavior via adalflow LocalDB).
    - MilvusVectorStore (new).
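  A minimal sketch of what that interface could look like in Python; Chunk and all method signatures are illustrative, not existing DeepWiki-Open code:

  ```python
  from dataclasses import dataclass, field
  from typing import Any, Dict, List, Optional, Protocol


  @dataclass
  class Chunk:
      """Hypothetical chunk record: one embedded slice of a repository file."""
      id: str
      repo_id: str
      path: str
      chunk_index: int
      content: str
      embedding: List[float]
      metadata: Dict[str, Any] = field(default_factory=dict)


  class VectorStore(Protocol):
      """Backend-agnostic storage interface; LocalDB and Milvus would both implement it."""

      def upsert(self, chunks: List[Chunk]) -> None: ...

      def query(
          self,
          embedding: List[float],
          k: int = 10,
          filters: Optional[Dict[str, Any]] = None,
      ) -> List[Chunk]: ...

      def delete(self, ids: List[str]) -> None: ...

      def health(self) -> Dict[str, Any]: ...
  ```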
- Configuration
  - Extend config (e.g., api/config/embedder.json or a new vector_store.json) to select backend and parameters:

    ```json
    {
      "vector_store": {
        "backend": "localdb" | "milvus",
        "milvus": {
          "uri": "http://milvus:19530",
          "user": "root",
          "password": "Milvus",
          "collection_name": "deepwiki_chunks",
          "index_type": "HNSW",
          "metric_type": "COSINE",
          "efConstruction": 200,
          "M": 16
        }
      }
    }
    ```

  - Auto-derive embedding dimension from the configured embedder (see the sketch below).
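  A small sketch of the config loading and dimension probing these bullets imply; the file path, function names, and embedder call shape are assumptions, not current code:

  ```python
  import json


  def load_vector_store_config(path: str = "api/config/vector_store.json") -> dict:
      """Read the proposed vector_store block; default to LocalDB if absent."""
      try:
          with open(path) as f:
              return json.load(f).get("vector_store", {"backend": "localdb"})
      except FileNotFoundError:
          return {"backend": "localdb"}


  def probe_embedding_dim(embed_fn) -> int:
      """Derive the vector dimension by embedding a short probe string.

      embed_fn is assumed to map a string to a list of floats, for whatever
      embedder (OpenAI, Ollama, DashScope, ...) is configured.
      """
      return len(embed_fn("dimension probe"))
  ```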
- Milvus schema and indexing
  - Collection schema (suggested):
    - id: string (doc+chunk id)
    - repo_id: string
    - path: string
    - chunk_index: int
    - content: string (optional, or keep outside if managed elsewhere)
    - metadata: JSON
    - embedding: float vector (dim = embedder output)
  - Build index per chosen metric and index type (see the creation sketch below).
  - Optional partitions per repo_id.
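  A sketch of collection and index creation with the pymilvus MilvusClient API, mirroring the suggested schema and the config example above; the dim value and exact build parameters should be checked against the configured embedder and the pymilvus version in use:

  ```python
  from pymilvus import DataType, MilvusClient

  client = MilvusClient(uri="http://milvus:19530", user="root", password="Milvus")

  # Schema mirroring the suggested fields; dim must match the configured embedder.
  schema = MilvusClient.create_schema(auto_id=False, enable_dynamic_field=False)
  schema.add_field("id", DataType.VARCHAR, is_primary=True, max_length=512)
  schema.add_field("repo_id", DataType.VARCHAR, max_length=256)
  schema.add_field("path", DataType.VARCHAR, max_length=1024)
  schema.add_field("chunk_index", DataType.INT64)
  schema.add_field("content", DataType.VARCHAR, max_length=65535)  # optional
  schema.add_field("metadata", DataType.JSON)
  schema.add_field("embedding", DataType.FLOAT_VECTOR, dim=1536)  # embedder output dim

  # HNSW index with the metric and build parameters from the config example.
  index_params = client.prepare_index_params()
  index_params.add_index(
      field_name="embedding",
      index_type="HNSW",
      metric_type="COSINE",
      params={"M": 16, "efConstruction": 200},
  )

  client.create_collection("deepwiki_chunks", schema=schema, index_params=index_params)
  ```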
- Pipeline integration
  - Where DatabaseManager builds/loads the store, select the backend from config.
  - On ingestion, write chunks and vectors to Milvus (or LocalDB, unchanged).
  - On retrieval, perform top-k vector search with optional metadata filters in Milvus (see the query sketch below).
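  Retrieval would then be a top-k search with a server-side filter, roughly as below; the collection name, filter expression, and search params are placeholders:

  ```python
  def milvus_top_k(client, query_embedding, repo_id, k=10):
      """Top-k vector search restricted to one repository via a metadata filter."""
      hits = client.search(
          collection_name="deepwiki_chunks",
          data=[query_embedding],                    # one query vector
          filter=f'repo_id == "{repo_id}"',          # server-side metadata filter
          limit=k,
          output_fields=["repo_id", "path", "chunk_index", "content", "metadata"],
          search_params={"metric_type": "COSINE", "params": {"ef": 128}},
      )
      # search() returns one result list per query vector; unwrap the single query.
      return hits[0]
  ```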
- Migration and compatibility
  - Provide a migrator to read existing LocalDB entries and write them into Milvus (see the sketch below).
  - Default remains LocalDB unless configured otherwise.
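  A hedged migration sketch, assuming adalflow's LocalDB exposes load_state/get_transformed_data and that transformed documents carry id, text, vector, and meta_data as the existing pipeline uses them; the transformer key and metadata field names are assumptions:

  ```python
  from adalflow.core.db import LocalDB


  def migrate_localdb_to_milvus(db_path, client, collection="deepwiki_chunks"):
      """Copy already-embedded chunks from an existing LocalDB file into Milvus."""
      db = LocalDB.load_state(filepath=db_path)
      docs = db.get_transformed_data(key="split_and_embed")  # assumed transformer key

      rows = [
          {
              "id": doc.id,
              "repo_id": doc.meta_data.get("repo_id", ""),
              "path": doc.meta_data.get("file_path", ""),
              "chunk_index": doc.meta_data.get("chunk_index", 0),
              "content": doc.text,
              "metadata": doc.meta_data,
              "embedding": doc.vector,
          }
          for doc in docs
      ]
      client.insert(collection_name=collection, data=rows)
  ```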
- Ops and DX
  - Optional docker-compose service for Milvus (standalone) for local dev.
  - Health check endpoint/logs to verify connectivity and collection readiness (see the sketch below).
  - Clear error messages for dimension mismatch, auth failures, or index creation failures.
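  The health check could be a small route on the existing FastAPI app; the route path and client construction here are illustrative:

  ```python
  from fastapi import APIRouter
  from pymilvus import MilvusClient

  router = APIRouter()


  @router.get("/health/vector-store")
  def vector_store_health():
      """Report Milvus connectivity and whether the chunk collection exists."""
      try:
          client = MilvusClient(uri="http://milvus:19530", user="root", password="Milvus")
          ready = client.has_collection("deepwiki_chunks")
          return {"backend": "milvus", "reachable": True, "collection_ready": ready}
      except Exception as exc:  # surface a clear error instead of a bare 500
          return {"backend": "milvus", "reachable": False, "error": str(exc)}
  ```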
Open questions
- Preferred default metric (COSINE vs INNER_PRODUCT) for current embedders?
- Whether to store full chunk content in Milvus or only ids/metadata and load content from LocalDB/disk?
- Multi-tenant isolation strategy (separate collections vs partitions by repo_id)?
Acceptance criteria
- Configuration switch to choose LocalDB or Milvus without code changes.
- Ingestion: repository chunks and embeddings successfully persisted to Milvus.
- Retrieval: queries return equivalent or better results compared to LocalDB for the same embedder.
- Documentation: setup guide for Milvus, config examples, and docker-compose snippet.
- Tests: unit tests for the Milvus store and an integration test covering ingest/query.
Alternatives considered
- Qdrant, Weaviate, or Chroma as additional backends; can be added behind the same abstraction later.
- Persisting LocalDB to S3 only (keeps local index semantics; doesn’t solve vector DB requirements).