
[WIP] Add table-level ANN index with DiskANN backend#103675

Draft
fastio wants to merge 4 commits into ClickHouse:master from fastio:feat-knn-step-4

Conversation

fastio (Contributor) commented Apr 28, 2026

Resolves #85766

Changelog category (leave one):

  • Experimental Feature

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Add an experimental table-level Approximate Nearest Neighbor (ANN) index backed by DiskANN, with new SQL DDL syntax, query planner integration, and supporting SYSTEM commands.

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

This change introduces an experimental table-level ANN index for high-dimensional vector search, using DiskANN as the underlying graph index.

Highlights:

  • New ANN index type for MergeTree tables, managed at table level (not per part) via an ANNIndexManager and group-based storage layout (ANNIndexGroup, ANNGroupCoverage, ANNGroupStorageDiskFull).
  • DiskANN integration through a Rust FFI wrapper (rust/workspace/diskann-clickhouse) plus C++ adapters (DiskANNIndexBuilder, DiskANNIndexSearcherAdapter) behind common IANNIndexBuilder / IANNIndexSearcher interfaces.
  • Background BuildANNIndexTask integrated with BackgroundJobsAssignee; vector data persisted via VectorStreamWriter and per-part row-id mapping (PartRowIdMap*).
  • Query planner pass useANNSearch that rewrites eligible nearest-neighbor queries to use the index, with a new tableANNCoverage function for diagnostics (see the usage sketch below).
  • New MergeTree and Server settings, ProfileEvents, CurrentMetrics, AccessType, and a SYSTEM command for managing ANN indexes.
  • Stateless tests (04102-04111) covering DDL, query path, EXPLAIN, merge routing, metric/source validation, prefilter selectivity, and empty/small-table edge cases, plus extensive gtest unit tests for each component.

The feature is experimental and gated behind dedicated settings; existing vector similarity index behavior is unchanged.
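
As a rough usage sketch of the query path described above (illustrative only: the exact DDL syntax added by this PR is not reproduced here, and the argument list of tableANNCoverage is an assumption; the planner-pass and function names come from the highlights):

```sql
-- Plain MergeTree table holding the vectors; the new ANN index DDL from
-- this PR is not shown here, so this sketch omits the index clause.
CREATE TABLE vectors
(
    id UInt64,
    embedding Array(Float32)
)
ENGINE = MergeTree
ORDER BY id;

-- Queries of this shape are candidates for the useANNSearch planner pass,
-- which rewrites eligible nearest-neighbor queries to use the index.
-- The short array literal stands in for a full query vector.
SELECT id
FROM vectors
ORDER BY L2Distance(embedding, [0.1, 0.2])
LIMIT 10;

-- Coverage diagnostics via the new tableANNCoverage function; the
-- (database, table) argument list is a guess for illustration.
SELECT tableANNCoverage('default', 'vectors');
```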

Refs: #103671

TODO

• ✅ DiskANN FFI and MergeTree table-level ANN index foundation
• ✅ Group persistence, coverage tracking, background build, invalidation, and GC
• ✅ Query-plan rewrite with mixed indexed/unindexed part reads
• ✅ Basic observability and tests

• ☐ Fix multi-group top-K correctness
• ☐ Implement ANNIndexGroup merge/compaction
• ☐ Add real rescoring/reranking
• ☐ Support multi-replica setups: ReplicatedMergeTree, parallel replicas, lifecycle consistency
• ☐ Wire unused settings, add docs, and stabilize build/gtest/stateless CI

SIFT-1M ANN Benchmark Results

Setup

  • Machine: 64 cores, 128 GB memory.
  • Dataset: sift-128-euclidean (1M base, 10k query, 128-d L2)
  • Index: table-level ann (DiskANN/Vamana), build_cfg=paper (max_degree / build_search_list_size / alpha per the DiskANN paper)
  • Single group (single_group), fixed beam_width=8, search_io_limit=500
  • 3 repetitions per cell, median reported; hash_seed pinned → recall is fully deterministic (identical across all 3 runs)
  • 1000 queries per cell, K=10, 200-query warm-up
  • Concurrency ∈ {1, 32}
  • Single build: 335 s (~5.6 min), ann_groups=1

Recall@10 vs Search-List-Size Sweep

| search_list_size | Recall@10 | QPS (conc=1) | QPS (conc=32) | p50 ms (c=1) | p99 ms (c=1) | p99 ms (c=32) |
|---:|---:|---:|---:|---:|---:|---:|
| 10 | 0.5471 | 108 | 219 | ~9.2 | 10.6 | 153.7 |
| 30 | 0.8080 | 99 | 192 | ~10.1 | 12.2 | 270.0 |
| 50 | 0.8967 | 91 | 174 | ~10.8 | 14.2 | 330.0 |
| 100 | 0.9621 | 79 | 137 | ~12.6 | 14.2 | 240.0 |
| 200 | 0.9886 | 62 | 91 | ~16.0 | 17.7 | 360.0 |

GIST-1M ANN Benchmark Results

Setup

  • Machine: 64 cores, 128 GB memory.
  • Dataset: gist-960-euclidean (1M base, 1k query, 960-d L2)
  • Index: table-level ann (DiskANN/Vamana), build_cfg=gist (DiskANN-paper-style params, tuned for high-dim)
  • Single group (single_group), fixed beam_width=8, search_io_limit=500
  • 3 repetitions per cell, median reported; hash_seed pinned → recall is fully deterministic across all 3 runs
  • 1000 queries per cell, K=10, 200-query warm-up
  • Concurrency ∈ {1, 16}
  • Single build: 1754 s (~29.2 min), ann_groups=1

Recall@10 vs Search-List-Size Sweep

| search_list_size | Recall@10 | QPS (conc=1) | QPS (conc=16) | p50 ms (c=1) | p99 ms (c=1) | p99 ms (c=16) |
|---:|---:|---:|---:|---:|---:|---:|
| 20 | 0.4617 | 42.7 | 55.1 | ~23.4 | 26.6 | 315.1 |
| 50 | 0.6600 | 39.9 | 50.1 | ~25.0 | 31.5 | 336.3 |
| 100 | 0.7993 | 35.5 | 43.5 | ~28.2 | 33.2 | 388.1 |
| 200 | 0.8963 | 29.0 | 34.6 | ~34.5 | 41.3 | 486.7 |
| 400 | 0.9583 | 21.0 | 24.2 | ~47.6 | 55.1 | 692.8 |

fastio added 3 commits April 15, 2026 22:04

  • Introduce `rust/workspace/diskann-clickhouse` providing a C ABI over the DiskANN library: index build, in-memory and on-disk search, padded SIMD queries, and the FFI header consumed by the C++ side.
  • Group-based index with copy-on-write manager, plan optimization and per-part distance dispatch, ProfileEvents/metric kernel, gtests and `0_stateless` tests (04102-04111).
fastio marked this pull request as draft April 28, 2026 12:58
fastio changed the title Add table-level ANN index with DiskANN backend → [WIP] Add table-level ANN index with DiskANN backend Apr 28, 2026
alexey-milovidov added the can be tested label Apr 28, 2026
clickhouse-gh bot commented Apr 28, 2026

Workflow [PR], commit [e26ebee]

Summary:

| job / test | status | info |
|---|---|---|
| Style check | FAIL | |
| whitespace_check | FAIL | cidb |
| cpp | FAIL | cidb |
| various | FAIL | cidb |
| Fast test | FAIL | |
| Build ClickHouse | FAIL | |
| Build (arm_tidy) | FAIL | |
| Build ClickHouse | FAIL | cidb |
| Docs check | DROPPED | |
| Fast test (arm_darwin) | DROPPED | |
| Build (amd_debug) | DROPPED | |
| Build (amd_asan_ubsan) | DROPPED | |
| Build (amd_tsan) | DROPPED | |
| Build (amd_msan) | DROPPED | |
| Build (amd_binary) | DROPPED | |

AI Review

Summary

This PR adds an experimental table-level ANN index backed by DiskANN, including DDL, planner integration, background build/search lifecycle, and tests. I did not find high-confidence correctness, safety, or compatibility problems that require changes before merge.

ClickHouse Rules

  • Deletion logging
  • Serialization versioning
  • Core-area scrutiny
  • No test removal
  • Experimental gate
  • No magic constants
  • Backward compatibility
  • SettingsChangesHistory.cpp
  • PR metadata quality
  • Safe rollout
  • Compilation time
  • No large/binary files

Final Verdict

  • Status: ✅ Approve

clickhouse-gh bot added the pr-experimental and submodule changed labels Apr 28, 2026
rschu1ze (Member) commented

@fastio Thanks for this large PR. May I ask what your motivation was to make the index per-table and not per-part?

fastio (Contributor, Author) commented Apr 29, 2026

@fastio Thanks for this large PR. May I ask what your motivation was to make the index per-table and not per-part?

Hi @rschu1ze, thanks for the question — happy to share the reasoning.

First, the main reason is that ANN search is a global top-K ranking problem, not a local part-pruning predicate. ORDER BY distance LIMIT K requires a global nearest-neighbor order across all active parts, so a per-part index would still need fan-out searches and a global merge — with no principled way to decide how many candidates to over-fetch from each part (too few collapses recall; too many makes the per-part index pointless). That is why the ANN index is modeled as a table-level search structure, while part coverage remains an internal lifecycle concern.

Second, you are right: this PR is too large to review as a single mergeable change. My goal is to use it to discuss and validate the overall design direction first. If the direction makes sense, I will split it into a sequence of smaller PRs with clear boundaries, so each step can be reviewed independently.

Happy to dig into any specific aspect if useful.

CurtizJ (Member) commented May 1, 2026

@fastio

Could you please explain how you synchronize data in the global index and in the main table and how you maintain consistency between them?

rschu1ze (Member) commented May 3, 2026

@fastio We (@shankar-iyer and I) discussed the vector similarity index 2.0 last week. There are a few concerns with this PR.

  • The industry (examples: BigQuery, Turbopuffer, Elasticsearch, StarRocks) is moving towards SPANN, a.k.a. IVF vector indexes. This has a reason: compared to DiskANN, which came earlier and stores the Vamana graph on disk, SPANN performs only sequential reads, is simpler to implement, and has better trade-offs (based on the information in the SPANN paper). We therefore agreed to implement SPANN; there is already an issue for this: Add SPANN memory-disk hybrid vector similarity index #102146.

  • Global indexes do not fit ClickHouse's architecture well.

    • Problem 1: They must be kept in-sync with the underlying parts.
    • Problem 2: They introduce a need for a stable row id which ClickHouse currently doesn't have.

First, the main reason is that ANN search is a global top-K ranking problem, not a local part-pruning predicate. ORDER BY distance LIMIT K requires a global nearest-neighbor order across all active parts, so a per-part index would still need fan-out searches and a global merge — with no principled way to decide how many candidates to over-fetch from each part (too few collapses recall; too many makes the per-part index pointless). That is why the ANN index is modeled as a table-level search structure, while part coverage remains an internal lifecycle concern.

This issue does exist, but it is less bad than it seems, not least because parts grow quite large by default (150 GB). Fan-out due to per-part searches only reduces performance; it never reduces recall. Note that one could theoretically reduce the former with a new setting that considers only N% of the parts for search.

Even if all my concerns are invalid, I'd prefer SingleStore's mechanism of building covering vector indexes in addition to the original per-segment indexes, rather than replacing them as in this PR (see sec. 4.2 here). This still doesn't fit the LSM architecture, but it is a little less disruptive.

fastio (Contributor, Author) commented May 6, 2026

@fastio

Could you please explain how you synchronize data in the global index and in the main table and how you maintain consistency between them?

@CurtizJ, apologies for the delayed response — I was on vacation for the past five days. Thanks for raising this — it's the right question to settle before the implementation lands.

TL;DR. The main table and the global index are kept eventually consistent. Query correctness does not depend on the index being caught up: it is preserved by partitioning active parts into indexed and unindexed sets at query time and rerank-merging the two paths.

Consistency model

For a query at time t, let

  • A_t = active parts visible to the main table snapshot
  • I_t = parts covered by the index snapshot

We compute:

  • Indexed = A_t ∩ I_t → ANN search, approximate top-K + distance
  • Unindexed = A_t \ I_t → brute-force distance over the rows
  • result = rerank(Indexed ∪ Unindexed)
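
A conceptual SQL analogy for this rerank-merge (purely illustrative: both source names below are hypothetical, and the real merge happens inside the query plan, not via UNION ALL):

```sql
-- Hypothetical illustration of the indexed/unindexed split. The first
-- branch stands for the approximate ANN candidates from covered parts
-- (A_t ∩ I_t); the second for exact brute-force distances over the
-- uncovered parts (A_t \ I_t). A global rerank yields the final top-K.
SELECT id, dist
FROM
(
    SELECT id, dist FROM ann_candidates_indexed_parts
    UNION ALL
    SELECT id, dist FROM brute_force_unindexed_parts
)
ORDER BY dist
LIMIT 10;
```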

How synchronization happens

  • The write path is unchanged. There is zero intrusion into the MergeTree write path.
  • A background task periodically scans the main table's active parts, picks up parts not yet indexed, and builds index data for them.
  • A compaction mechanism rebuilds and merges existing index data.
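
Because coverage converges in the background, it is observable from SQL. A hedged example using the tableANNCoverage diagnostic mentioned in the PR description (its exact signature is an assumption):

```sql
-- Right after new parts are inserted, coverage can be below 100%;
-- queries stay correct because uncovered parts fall back to brute force.
SELECT tableANNCoverage('default', 'vectors');

-- Once the background BuildANNIndexTask has picked up the new parts,
-- the same call reports full coverage again.
```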

Happy to dig deeper into any of these.

fastio (Contributor, Author) commented May 6, 2026

@fastio We (@shankar-iyer and I) discussed the vector similarity index 2.0 last week. There are a few concerns with this PR.

  • The industry (examples: BigQuery, Turbopuffer, Elasticsearch, StarRocks) is moving towards SPANN, a.k.a. IVF vector indexes. This has a reason: compared to DiskANN, which came earlier and stores the Vamana graph on disk, SPANN performs only sequential reads, is simpler to implement, and has better trade-offs (based on the information in the SPANN paper). We therefore agreed to implement SPANN; there is already an issue for this: Add SPANN memory-disk hybrid vector similarity index #102146.

  • Global indexes do not fit ClickHouse's architecture well.

    • Problem 1: They must be kept in-sync with the underlying parts.
    • Problem 2: They introduce a need for a stable row id which ClickHouse currently doesn't have.

First, the main reason is that ANN search is a global top-K ranking problem, not a local part-pruning predicate. ORDER BY distance LIMIT K requires a global nearest-neighbor order across all active parts, so a per-part index would still need fan-out searches and a global merge — with no principled way to decide how many candidates to over-fetch from each part (too few collapses recall; too many makes the per-part index pointless). That is why the ANN index is modeled as a table-level search structure, while part coverage remains an internal lifecycle concern.

This issue does exist, but it is less bad than it seems, not least because parts grow quite large by default (150 GB). Fan-out due to per-part searches only reduces performance; it never reduces recall. Note that one could theoretically reduce the former with a new setting that considers only N% of the parts for search.

Even if all my concerns are invalid, I'd prefer SingleStore's mechanism of building covering vector indexes in addition to the original per-segment indexes, rather than replacing them as in this PR (see sec. 4.2 here). This still doesn't fit the LSM architecture, but it is a little less disruptive.


@rschu1ze, Thank you for the very helpful feedback.

On DiskANN: the current PR is a skeleton in which the ANN algorithm is pluggable behind IMaterializedIndexAlgorithm / MaterializedIndexAlgorithmFactory, so swapping in SPANN later should be straightforward — the DiskANN parts here are placeholders for evaluation rather than a hard commitment to the algorithm.

On the topology: the SingleStore-style approach of building a cross-part covering index in addition to per-part indexes (sec. 4.2) is a direction well worth thinking through.


Labels

can be tested · pr-experimental · submodule changed


Development

Successfully merging this pull request may close these issues: Implement Vamana for ANN Vector Search (much faster than HNSW).
