[WIP] Add table-level ANN index with DiskANN backend #103675
fastio wants to merge 4 commits into ClickHouse:master from …
Conversation
… table-level integration
Introduce `rust/workspace/diskann-clickhouse` providing a C ABI over the DiskANN library: index build, in-memory and on-disk search, padded SIMD queries, and the FFI header consumed by the C++ side.
Group-based index with copy-on-write manager, plan optimization and per-part distance dispatch, ProfileEvents/metric kernel, gtests and `0_stateless` tests (04102-04111).
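For readers unfamiliar with the FFI layout referenced in the first commit, the header consumed by the C++ side presumably looks something like the following. This is a hypothetical sketch — the handle type, function names, and signatures are illustrative guesses, not the crate's actual ABI:

```cpp
// Hypothetical sketch of a C ABI over a DiskANN-style index: an opaque
// handle owned by Rust plus plain-C build/search entry points that the
// C++ adapters can call. Names and signatures are illustrative only.
#pragma once
#include <stddef.h>
#include <stdint.h>

extern "C"
{
    typedef struct DiskAnnIndex DiskAnnIndex; /* opaque handle owned by Rust */

    /* Build an index over `count` vectors of dimension `dim` (row-major). */
    DiskAnnIndex * diskann_build(
        const float * vectors, size_t count, size_t dim,
        uint32_t max_degree, uint32_t build_list_size, float alpha);

    /* Search; `query` may need padding to the SIMD width used at build time.
       Writes up to `top_k` ids/distances, returns the number written. */
    size_t diskann_search(
        const DiskAnnIndex * index, const float * query,
        size_t top_k, uint32_t search_list_size,
        uint64_t * out_ids, float * out_distances);

    void diskann_free(DiskAnnIndex * index);
}
```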
@fastio Thanks for this large PR. May I ask what your motivation was to make the index per-table rather than per-part?
Hi @rschu1ze, thanks for the question — happy to share the reasoning.

First, the main reason is that ANN search is a global top-K ranking problem, not a local part-pruning predicate. `ORDER BY distance LIMIT K` requires a global nearest-neighbor order across all active parts, so a per-part index would still need fan-out searches and a global merge — with no principled way to decide how many candidates to over-fetch from each part (too few collapses recall, too many makes the per-part index pointless). That is why the ANN index is modeled as a table-level search structure, while part coverage remains an internal lifecycle concern.

Second, you are right that this PR is too large to review as a single mergeable change. My goal is to use it to discuss and validate the overall design direction first. If the direction makes sense, I will split it into a sequence of smaller PRs with clear boundaries, so each step can be reviewed independently.

Happy to dig into any specific aspect if useful.
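To make the over-fetch dilemma concrete, here is a minimal C++ sketch (hypothetical `PartIndex` type, not the PR's actual classes) of the fan-out plus global merge that a per-part design forces:

```cpp
// Minimal sketch of per-part fan-out: each part can only rank its own rows,
// so every part must be over-queried by some factor and the candidate lists
// re-merged to recover a global top-K. Types and names are hypothetical.
#include <algorithm>
#include <cstdint>
#include <vector>

struct Candidate
{
    uint64_t row_id;
    float distance;
};

struct PartIndex
{
    /// Stub standing in for one part's local ANN index (illustration only).
    std::vector<Candidate> searchTopK(const std::vector<float> & /*query*/, size_t /*k*/) const { return {}; }
};

std::vector<Candidate> globalTopK(
    const std::vector<PartIndex> & parts,
    const std::vector<float> & query,
    size_t k,
    size_t overfetch) /// no principled way to choose this per part
{
    std::vector<Candidate> merged;
    for (const auto & part : parts)
    {
        /// Too small `k * overfetch` collapses recall; too large makes the
        /// per-part index pointless (it degenerates toward a full search).
        auto local = part.searchTopK(query, k * overfetch);
        merged.insert(merged.end(), local.begin(), local.end());
    }
    auto mid = merged.begin() + static_cast<std::ptrdiff_t>(std::min(k, merged.size()));
    std::partial_sort(
        merged.begin(), mid, merged.end(),
        [](const Candidate & a, const Candidate & b) { return a.distance < b.distance; });
    merged.resize(std::min(k, merged.size()));
    return merged;
}
```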
Could you please explain how you synchronize data between the global index and the main table, and how you maintain consistency between them?
@fastio We (@shankar-iyer and I) discussed the vector similarity index 2.0 last week. There are a few concerns with this PR.

This issue does exist, but it is less severe than it seems — not only because parts grow quite large by default (150 GB). Fan-out due to per-part searches only reduces performance; it never reduces recall. Note that one could theoretically reduce the former with some new setting that would only consider N% of the parts for search. Even if all my concerns are invalid, I'd prefer SingleStore's mechanism of building covering vector indexes in addition to the original per-segment indexes, instead of replacing them as in this PR (see sec. 4.2 here). This still doesn't fit the LSM architecture, but it is a little less disruptive.
@CurtizJ, Apologies for the delayed response — I was on vacation for the past five days. Thanks for raising this — it's the right question to settle before the implementation lands.

TL;DR: the main table and the global index are kept eventually consistent. Query correctness does not depend on the index being caught up: it is preserved by partitioning the active parts into indexed and unindexed sets at query time and rerank-merging the two paths.

Consistency model

For a query at time t, let …

We compute: …

How synchronization happens

…
Happy to dig deeper into any of these.
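As a rough illustration of the query-time rule above, here is a minimal sketch (hypothetical names, not the PR's actual classes) of the indexed/unindexed partition and the rerank-merge:

```cpp
// Active parts are split into the set covered by the global index and the
// uncovered remainder; the covered set is answered from the index, the
// remainder by exact per-part search, and the two candidate streams are
// rerank-merged into one top-K. All names here are illustrative stubs.
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

struct Candidate { uint64_t row_id; float distance; };
struct Part { std::string name; };

/// Stubs standing in for the index manager / readers (hypothetical).
static bool coveredByIndex(const Part &) { return true; }
static std::vector<Candidate> annSearchCovered(const std::vector<Part> &, const std::vector<float> &, size_t) { return {}; }
static std::vector<Candidate> exactSearch(const std::vector<Part> &, const std::vector<float> &, size_t) { return {}; }

std::vector<Candidate> searchTopK(const std::vector<Part> & active_parts, const std::vector<float> & q, size_t k)
{
    std::vector<Part> indexed, unindexed;
    for (const auto & part : active_parts)
        (coveredByIndex(part) ? indexed : unindexed).push_back(part);

    /// Index lag only grows `unindexed` (slower exact path); it never
    /// changes which rows are visible, so results stay correct.
    auto from_index = annSearchCovered(indexed, q, k);
    auto from_scan = exactSearch(unindexed, q, k);

    std::vector<Candidate> merged;
    merged.reserve(from_index.size() + from_scan.size());
    merged.insert(merged.end(), from_index.begin(), from_index.end());
    merged.insert(merged.end(), from_scan.begin(), from_scan.end());
    std::sort(merged.begin(), merged.end(),
              [](const Candidate & a, const Candidate & b) { return a.distance < b.distance; });
    if (merged.size() > k)
        merged.resize(k);
    return merged;
}
```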
@rschu1ze, Thank you for the very helpful feedback. On DiskANN: the current PR is a skeleton in which the ANN algorithm is pluggable behind `IMaterializedIndexAlgorithm` / `MaterializedIndexAlgorithmFactory`, so swapping in SPANN later should be straightforward — the DiskANN parts here are placeholders for evaluation rather than a hard commitment to the algorithm. On the topology: the SingleStore-style approach of building a cross-part covering index in addition to per-part indexes (sec. 4.2) is a direction well worth thinking through.
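A rough sketch of what that pluggable seam could look like — `IMaterializedIndexAlgorithm` and `MaterializedIndexAlgorithmFactory` are the names used in the PR, but the member functions below are illustrative guesses, not the actual signatures:

```cpp
// Illustrative sketch of a pluggable ANN algorithm seam: an abstract
// build/search interface plus a registry factory, so "diskann" today and
// e.g. "spann" later can be swapped without touching the index manager.
#include <cstdint>
#include <functional>
#include <map>
#include <memory>
#include <span>
#include <string>
#include <vector>

struct SearchResult { uint64_t row_id; float distance; };

class IMaterializedIndexAlgorithm
{
public:
    virtual ~IMaterializedIndexAlgorithm() = default;
    virtual void build(std::span<const float> vectors, size_t dim) = 0;
    virtual std::vector<SearchResult> search(std::span<const float> query, size_t top_k) const = 0;
};

class MaterializedIndexAlgorithmFactory
{
public:
    using Creator = std::function<std::unique_ptr<IMaterializedIndexAlgorithm>()>;

    static MaterializedIndexAlgorithmFactory & instance()
    {
        static MaterializedIndexAlgorithmFactory factory;
        return factory;
    }

    void registerAlgorithm(const std::string & name, Creator creator) { creators[name] = std::move(creator); }

    std::unique_ptr<IMaterializedIndexAlgorithm> create(const std::string & name) const
    {
        return creators.at(name)(); /// "diskann" today; "spann" could register here later
    }

private:
    std::map<std::string, Creator> creators;
};
```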
Resolves #85766
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
Add an experimental table-level Approximate Nearest Neighbor (ANN) index backed by DiskANN, with new SQL DDL syntax, query planner integration, and supporting `SYSTEM` commands.

Documentation entry for user-facing changes
This change introduces an experimental table-level ANN index for high-dimensional vector search, using DiskANN as the underlying graph index.
Highlights:
• New `ANN` index type for `MergeTree` tables, managed at table level (not per part) via an `ANNIndexManager` and a group-based storage layout (`ANNIndexGroup`, `ANNGroupCoverage`, `ANNGroupStorageDiskFull`).
• DiskANN backend in Rust (`rust/workspace/diskann-clickhouse`) plus C++ adapters (`DiskANNIndexBuilder`, `DiskANNIndexSearcherAdapter`) behind common `IANNIndexBuilder`/`IANNIndexSearcher` interfaces.
• Background index builds via a `BuildANNIndexTask` integrated with `BackgroundJobsAssignee`; vector data persisted via `VectorStreamWriter` and per-part row-id mapping (`PartRowIdMap*`).
• A query-plan optimization, `useANNSearch`, that rewrites eligible nearest-neighbor queries to use the index, with a new `tableANNCoverage` function for diagnostics.
• New `MergeTree` and `Server` settings, `ProfileEvents`, `CurrentMetrics`, `AccessType`, and a `SYSTEM` command for managing ANN indexes.
• Stateless tests (`04102`-`04111`) covering DDL, query path, `EXPLAIN`, merge routing, metric/source validation, prefilter selectivity, and empty/small-table edge cases, plus extensive `gtest` unit tests for each component.

The feature is experimental and gated behind dedicated settings; existing vector similarity index behavior is unchanged.
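As one illustrative detail: the per-part row-id mapping mentioned above (`PartRowIdMap*`) is what lets a global index refer back into individual parts. A simplified sketch of the idea — the layout below is an assumption for illustration, not the PR's actual structure:

```cpp
// The global index stores one id per vector; this map translates that id
// back to a concrete (part, row offset) so results can be read, and so ids
// from parts that were merged away can be filtered out at query time.
#include <cstdint>
#include <optional>
#include <string>
#include <unordered_map>

struct PartRow
{
    std::string part_name; /// which data part the vector came from
    uint64_t row_offset;   /// row number inside that part
};

class PartRowIdMap
{
public:
    void add(uint64_t global_id, PartRow location) { map[global_id] = std::move(location); }

    /// Ids whose part no longer exists resolve to nullopt and are dropped
    /// from the result; the replacement part is handled as unindexed until
    /// the index catches up (the eventual-consistency rule above).
    std::optional<PartRow> resolve(uint64_t global_id) const
    {
        auto it = map.find(global_id);
        if (it == map.end())
            return std::nullopt;
        return it->second;
    }

private:
    std::unordered_map<uint64_t, PartRow> map;
};
```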
Refs: #103671
TODO
• ✅ DiskANN FFI and MergeTree table-level ANN index foundation
• ✅ Group persistence, coverage tracking, background build, invalidation, and GC
• ✅ Query-plan rewrite with mixed indexed/unindexed part reads
• ✅ Basic observability and tests
• ☐ Fix multi-group top-K correctness
• ☐ Implement ANNIndexGroup merge/compaction
• ☐ Add real rescoring/reranking
• ☐ Support multi-replica setups: ReplicatedMergeTree, parallel replicas, lifecycle consistency
• ☐ Wire unused settings, add docs, and stabilize build/gtest/stateless CI
SIFT-1M ANN Benchmark Results
Setup
• Dataset: `sift-128-euclidean` (1M base, 10k query, 128-d L2)
• Index: `ann` (DiskANN/Vamana), `build_cfg=paper` (`max_degree`/`build_search_list_size`/`alpha` per the DiskANN paper)
• Single group layout (`single_group`), fixed `beam_width=8`, `search_io_limit=500`
• `hash_seed` pinned → recall is fully deterministic (identical across all 3 runs)
• Search concurrency ∈ {1, 32}
• `ann_groups=1`

Recall@10 vs Search-List-Size Sweep
(table: Recall@10 by `search_list_size`)

GIST-1M ANN Benchmark Results
Setup
• Dataset: `gist-960-euclidean` (1M base, 1k query, 960-d L2)
• Index: `ann` (DiskANN/Vamana), `build_cfg=gist` (DiskANN-paper-style params, tuned for high-dim)
• Single group layout (`single_group`), fixed `beam_width=8`, `search_io_limit=500`
• `hash_seed` pinned → recall is fully deterministic across all 3 runs
• Search concurrency ∈ {1, 16}
• `ann_groups=1`

Recall@10 vs Search-List-Size Sweep
(table: Recall@10 by `search_list_size`)
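For reference, Recall@10 in these tables is the fraction of the true 10 nearest neighbors that the index returns. A minimal sketch of the computation (hypothetical helper, not part of the benchmark harness):

```cpp
// recall@K = |retrieved top-K ∩ exact top-K| / K, averaged over queries.
#include <algorithm>
#include <cstdint>
#include <vector>

double recallAtK(const std::vector<uint64_t> & retrieved,    /// top-K ids from the ANN index
                 const std::vector<uint64_t> & ground_truth) /// true top-K ids from exact search
{
    size_t hits = 0;
    for (uint64_t id : retrieved)
        if (std::find(ground_truth.begin(), ground_truth.end(), id) != ground_truth.end())
            ++hits;
    return ground_truth.empty() ? 0.0 : static_cast<double>(hits) / ground_truth.size();
}
```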