Enable query condition cache for ORDER BY ... LIMIT n queries#104478
Enable query condition cache for ORDER BY ... LIMIT n queries#104478alexey-milovidov wants to merge 28 commits into
ORDER BY ... LIMIT n queries#104478Conversation
When `tryOptimizeTopK` selects a `ReadFromMergeTree` for TopK dynamic filtering, the running threshold can drop granules before they reach the WHERE/PREWHERE filter. Until now `updateQueryConditionCache` blanket-skipped QCC writes for such reads, because a WHERE result computed under one TopK plan would be wrong to reuse under a different one (a different LIMIT might need granules that were dropped before). Make the cache key for the WHERE filter depend on the TopK plan, instead of skipping it entirely: - Add `condition_hash` to `TopKFilterInfo`, computed deterministically from `(column_name, type_name, num_sort_columns, limit_n, direction)` in `tryOptimizeTopK`. The hash captures everything in the TopK plan that's known at planning time and independent of data. - In `updateQueryConditionCache`, instead of bailing out for TopK reads, combine the TopK `condition_hash` into the WHERE filter's `condition_hash` via `boost::hash_combine`. Same query + same part set + same TopK plan → QCC entry is reusable; different `LIMIT n`, different sort column, or different direction → new key, fresh entry. The change is symmetric: both the QCC writer (`FilterTransform::writeIntoQueryConditionCache`) and the read-side lookup go through the same `FilterStep::condition`, so the new combined hash is used consistently on both sides without further wiring. Originally part of WIP PR #81944 (commit `bec285616a2`), adapted to master's `TopKFilterInfo`/`tryOptimizeTopK` infrastructure.
|
Workflow [PR], commit [7af0425] Summary: ⏳
AI ReviewSummaryThis PR enables Missing context / blind spots
Findings
Tests
Final Verdict |
`ReadFromMergeTree::setTopKColumn` takes its argument by const reference, so wrapping the argument in `std::move` is a no-op and trips the `performance-move-const-arg` clang-tidy check on the arm_tidy build. Build report: https://s3.amazonaws.com/clickhouse-test-reports/PRs/104478/cf70625448dce5be22d92d107504f9ee3efb4baf/build_arm_tidy/build_clickhouse/build_clickhouse.log PR: #104478
`updateQueryConditionCache` salts the WHERE filter's `condition_hash` with `top_k_filter_info->condition_hash` before storing it on `FilterStep::condition` (which `FilterTransform::writeIntoQueryConditionCache` later uses as the cache key). The read path in `MergeTreeDataSelectExecutor::filterPartsByQueryConditionCache`, however, was looking up entries using only `dag->getHash()`, so cache writes done under a TopK plan would never be hit on subsequent reads. Pass `top_k_filter_info` through to `filterPartsByQueryConditionCache` and apply the same `boost::hash_combine` on the read side. Same query + same part set + same TopK plan → cache hit; different `LIMIT n`, sort column, or direction → fresh entry. Reported by `clickhouse-gh` review on #104478.
On Apple platforms, `std::size_t` is `unsigned long` while `UInt64` is
`unsigned long long` — different types of the same width — so the
non-const reference parameter of `boost::hash_combine` does not bind to
a `UInt64` lvalue and the build fails with:
error: no matching function for call to 'hash_combine'
note: candidate function template not viable: no known conversion
from 'UInt64' (aka 'unsigned long long') to 'std::size_t &'
(aka 'unsigned long &') for 1st argument
On x86_64 Linux both aliases happen to resolve to `unsigned long`, which
is why CI Linux builds passed but `Build (amd_darwin)` and
`Build (arm_darwin)` failed.
Switch the local seed variable on both the read path
(`MergeTreeDataSelectExecutor::filterPartsByQueryConditionCache`) and
the write path (`updateQueryConditionCache`) to `size_t`. The implicit
conversions to/from `UInt64` at the cache-API boundary are lossless on
all 64-bit platforms ClickHouse supports.
CI report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=104478&sha=4ab18afb9c043a670ca72211ab85494fd956e3b8&name_0=PR&name_1=Build%20%28amd_darwin%29
PR: #104478
…ition_hash `updateQueryConditionCache` and `MergeTreeDataSelectExecutor::filterPartsByQueryConditionCache` already combine `top_k_filter_info->condition_hash` into the QCC key. There is a third site that writes to the cache: the distributed-index-analysis branch in `ReadFromMergeTree::selectRangesToRead`, which fills QCC with ranges excluded by index analysis. Without the same salt the writes there go under `outputs.front()->getHash()` while reads go under the salted hash, so TopK queries cannot reuse those entries. Apply `boost::hash_combine(hash, top_k_filter_info->condition_hash)` consistently with the other two sites. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
alexey-milovidov
left a comment
There was a problem hiding this comment.
Add functional tests for this optimization.
Add performance tests as well.
| /// fold the deterministic part of the TopK plan into the cache key. Same query + same | ||
| /// part set + same TopK params → cache hit; different LIMIT or sort column → fresh | ||
| /// entry, never reusing a row-set computed under different TopK conditions. | ||
| if (const auto & top_k_filter_info = read_from_merge_tree->getTopKFilterInfo()) |
There was a problem hiding this comment.
I will study this in detail - this is where I got stuck in my branch.
`tryOptimizeTopK` was hard-disabling `reader_settings.use_query_condition_cache` inside `ReadFromMergeTree::setTopKColumn`, so chunks never carried `MarkRangesInfo` and `FilterTransform::writeIntoQueryConditionCache` had nothing to flush. With that flag off, the WHERE-filter QCC entries we now key by `TopKFilterInfo::condition_hash` were never written, so the read paths in `MergeTreeDataSelectExecutor::filterPartsByQueryConditionCache` and the index-analysis branch of `selectRangesToRead` always missed. Drop the unconditional disable in `setTopKColumn`. With the filter pushdown now pulling `__topKFilter` into `query_info.filter_actions_dag`, the determinism check in both write paths (`updateQueryConditionCache` and `selectRangesToRead`) would also bail out, so add `isDeterministicAllowingTopKFilter` (and the matching exception in `isNodeDeterministic`) to treat `__topKFilter` as deterministic for QCC purposes: its non-determinism is bounded by the TopK plan parameters that are already folded into the cache key, and the threshold trajectory only tightens during execution, so a granule that the outer WHERE zeroes out under `__topKFilter` is one whose rows could not have reached the final result under any threshold trajectory. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
A stateless test (`04217_query_condition_cache_topk`) verifies the QCC populates and partitions entries by the TopK plan parameters: re-running the same `WHERE ... ORDER BY <col> LIMIT n` reuses the same cache entry, while changing `LIMIT`, the sort direction, or the sort column writes a fresh entry instead. A performance test (`query_condition_cache_topk`) compares `use_query_condition_cache = 1` vs `0` for an `ORDER BY ... LIMIT n` query under `use_top_k_dynamic_filtering = 1` so the speedup the QCC delivers on subsequent runs of the same TopK plan is visible in CI perf comparisons. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
There is no visible difference on the newly introduced performance test. |
The previous version used `WHERE URL LIKE '%google%' ORDER BY EventTime LIMIT 10` against `hits_100m_single`, which matches at least one row in nearly every granule of `hits`, so the QCC had no granules to skip and the comparison reported essentially the same time with and without the cache (87.9 ms vs 87.3 ms before, 84.5 ms vs 84.0 ms after — all within noise). Switch to a self-contained synthetic table so the granule-level selectivity of the WHERE filter is controlled: * 50M rows in a `MergeTree` with `index_granularity = 8192` (~6 100 granules per part). * `v2` is `rand()` per row, so per-part minmax statistics span the full `UInt32` range and cannot prune parts on `v2 = constant`. * Exactly one row is injected with `v2 = 12345`, so the matching row sits in a single granule; the other ~6 100 granules have zero matches and become skippable in the QCC after the prewarm run. `optimize_move_to_prewhere` is also explicitly disabled so the WHERE predicate stays as a separate `FilterStep` above `ReadFromMergeTree` and the dynamic-filtering branch of `tryOptimizeTopK` applies (it bails out when `getPrewhereInfo()` is already set), matching the path the QCC writer in `updateQueryConditionCache` follows. Local measurement on the new layout (median over 7 runs, after prewarm): * `use_query_condition_cache = 1`: ~2 ms * `use_query_condition_cache = 0`: ~8 ms so the speedup attributable to this PR is visible (~4×) in the CI performance comparison instead of being hidden inside the existing TopK dynamic-filtering speedup.
|
Still no significant difference - correct the test so that it will be visible. |
…eedup
The previous version used `v1 = number` and `LIMIT 10`, which makes the
TopK threshold settle to the 10th-smallest `v1` value across the whole
table. With a random PK (`id = rand()`) those 10 winning rows are
scattered across ~10 random granules, so `__topKFilter` rejects every
row in the remaining ~6 100 granules and FilterTransform never sees them.
Without FilterTransform input there is no QCC write for those granules,
so the cache only ends up covering a handful of granules per part and
the second-run speedup is small (within the ~5% perf-comparison noise
threshold on CI, e.g. 20.7 ms vs 20.7 ms).
Switch the synthetic schema so QCC has full per-granule coverage:
* `v1 = number % 1000` instead of `v1 = number`. Each granule of
8 192 rows now contains ~8 rows with `v1 = 0`. The TopK threshold
still settles to `0` almost immediately, but `__topKFilter` keeps
those ~8 rows per granule and they reach the WHERE filter — letting
QCC record a "no match" entry for every granule on the prewarm run.
* The single matching row (with `v2 = 12345`) is now also forced to
`v1 = 0`, so it survives `__topKFilter` and the query still returns
the planted row (otherwise it would be filtered out by PREWHERE and
the result would be empty).
* Doubled to 100M rows / ~12 200 granules so the cold-cache scan time
is large enough that the speedup is unambiguous.
Local measurement on the new layout (median over 5 runs, after prewarm,
`max_threads=12` to match the perf-comparison default):
* `use_query_condition_cache = 1`: ~2 ms
* `use_query_condition_cache = 0`: ~45 ms
so the speedup attributable to this PR is ~22× and far above the CI
stat threshold instead of being lost inside the existing TopK
dynamic-filtering speedup.
The salt that partitions QCC entries by TopK plan parameters previously omitted two ORDER BY clauses with semantic impact: `NULLS FIRST` / `NULLS LAST` and `COLLATE`. Two queries that differ only by those parts of the sort description currently collide in the cache, so granule decisions made under one ordering rule could be reused under another — which is unsound for Nullable columns or string sorts with different locales. Fold `sort_col_desc.nulls_direction` and (when present) `sort_col_desc.collator->getLocale()` into the `SipHash` seed alongside the other planning-time parameters so each combination gets its own entry. The skip-index path of `tryOptimizeTopK` already rejects Nullable columns and collated sorts on its `skip_index_type_eligible` guard, but the dynamic-filtering path does not, so the QCC salt has to carry these parameters too. Defensive: no test failures observed; the existing `04217_query_condition_cache_topk` continues to pass.
| </settings> | ||
|
|
||
| <!-- QCC enabled: prewarm primes the cache, measured runs reuse it. --> | ||
| <query>SELECT v1 FROM qcc_topk_test WHERE v2 = 12345 ORDER BY v1 ASC LIMIT 10 SETTINGS use_query_condition_cache = 1 FORMAT Null</query> |
There was a problem hiding this comment.
This is still not showing a visible gain in the PR performance discussion.
For a Performance Improvement PR, we need this benchmark to demonstrate a stable delta in CI. Please adjust the benchmark shape so the improvement is measurable (for example, explicit warmup vs measured query phases) and the use_query_condition_cache = 1 case is clearly better than the = 0 baseline.
There was a problem hiding this comment.
This is still live on the current head.
The perf-comparison shards for this PR are currently still PENDING, so there is still no CI-side measurement confirming a stable delta for query_condition_cache_topk.xml under the Performance Improvement contract.
Please keep this thread open until the perf report shows a clear and stable improvement for the use_query_condition_cache = 1 case versus = 0.
There was a problem hiding this comment.
This is still not demonstrated by the current benchmark shape.
The current file still measures the same tests/performance/query_condition_cache_topk.xml:62 / :65 pair, but completed perf runs after the rewrite show the use_query_condition_cache = 1 query within noise rather than visibly faster. For example, on 18908b0c:
amd_release: query 0 old0.1221s, new0.1268s, diff+3.8%, threshold4.8%arm_release: query 0 old0.0704s, new0.0702s, diff-0.3%, threshold0.7%
The disabled-cache query is in the same range, so the Performance Improvement contract still lacks CI evidence. Please reshape the predicate so the baseline really pays the wide-column scan cost and the QCC-on query is a changed faster row in the perf report, while the use_query_condition_cache = 0 query stays neutral.
The PREWHERE write path in `MergeTreeSelectProcessor::read` uses an unsalted `output->getHash()`, so salting only the PREWHERE read side made the keys diverge — for TopK plans with a deterministic PREWHERE (for example, skip-index TopK plans) the lookup never hit the entry the writer had just stored. Restrict the TopK salt to the WHERE lookup path, which already pairs with a salted write in `updateQueryConditionCache` and the secondary write path in `selectRangesToRead`.
With 100M rows the per-run absolute QCC speedup (~10 ms of WHERE-filter scan) is below the comparison harness's noise threshold once ~50 ms of unrelated per-query overhead is folded in, so the test showed no measurable delta between `use_query_condition_cache = 1` and `= 0`. Scale the fill to 500M rows (~5 GiB on disk) so the data-scan cost dominates query overhead and the cached-granule pruning produces a clear, stable improvement in the report.
The previous shape relied on `__topKFilter` keeping ~8 rows per granule for the WHERE filter to evaluate, but in the comparison harness the actual granule-skipping savings on subsequent runs were not visible above the noise floor — both the `use_query_condition_cache = 1` and `= 0` queries landed at ~280 ms on `amd_release`. Reshape the benchmark to a `WHERE` that matches no rows in the table (`payload = 'no_match'` against a `FixedString(8)` filled with random `%1000` values). With every granule yielding `0/N` after the WHERE filter, the prewarm run writes a "skip me" entry for every granule of every part, and every measured run with the QCC enabled returns without reading any data — which is the speedup the QCC-enabled query is supposed to measure. The QCC-disabled query keeps paying the full WHERE-filter cost on every run, giving a clear delta between the two queries in the perf comparison.
The earlier shape of the perf test (100M rows of `FixedString(8)` filled
from `toString(rand() % 1000)`) compressed so well that every granule
read was a few KB. The QCC granule-skipping saved real wall time
(measured ~5x locally), but in the CI perf harness the savings were
swallowed by per-query overhead (planning, network, pipeline setup),
producing the unstable ~1-4% delta the perf comparison job reported.
Rewrite the benchmark so the scan cost dominates over per-query setup:
* 10M rows × `FixedString(256)` filled with `randomFixedString(256)`
(incompressible, ~2.5 GB on disk; every granule decompression is a
measurable chunk of work).
* Same `payload = 'no_match'` WHERE predicate matches no rows, so the
QCC entry written by the prewarm marks every granule as "skip",
and the measured runs return without scanning.
Locally this widens the QCC-on vs QCC-off gap from a fraction of a
millisecond to roughly 20x (1ms cached vs ~36ms full scan on 10M rows),
which is large enough to survive the CI perf comparator's noise floor
on both `amd_release` and `arm_release` shards.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Pushed The two failing checks at
@groeneai, the AST fuzzer signal-6 above is unrelated to this PR — please investigate the |
|
@alexey-milovidov — confirmed the
Root cause matches your description: the analyzer-side CI on #104966 head
I'll continue monitoring #104966 via CIDB; once it merges I'll verify master goes clean on STID 4193-4b55. Tracked in |
…d `COLLATE`
The functional test `04217_query_condition_cache_topk` already covered
the `LIMIT`, sort direction, and sort column dimensions of the TopK QCC
salt. Add focused assertions for the two remaining inputs:
* `Nullable(UInt32)` sort column, `ORDER BY n ASC NULLS FIRST` vs
`... NULLS LAST` — same query shape, different `nulls_direction`
in the salt, must produce a fresh QCC entry.
* `String` sort column with `use_top_k_dynamic_filtering_for_variable_length_types`
so `tryOptimizeTopK` takes the dynamic-filtering branch, and
`ORDER BY s ASC COLLATE 'en_US'` vs `... COLLATE 'fr'` — same
query shape, different collator locale, must produce a fresh
entry.
This protects the new hash inputs in `tryOptimizeTopK` from future
regressions: any change that drops `nulls_direction` or the collator
locale from `TopKFilterInfo::condition_hash` would cause two different
queries to alias the same QCC entry, and these assertions would catch
the drop.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous version of the test put the empty `tuple()` in `WHERE` (and in
a `SELECT`-only position with no shard-pruning WHERE), so:
* `WHERE dummy IN tuple()` is always false; `optimize_skip_unused_shards`
fully prunes the cluster before the rewriter runs (`shards > 1` precondition
fails in `ClusterProxy::executeQuery`), and the empty-tuple branch in
`OptimizeShardingKeyRewriteIn::enterImpl` is never reached.
* `SELECT (dummy IN tuple()) FROM dist` with no WHERE filter does not set
`query_info.optimized_cluster`, so again the rewriter is skipped.
As a result the test reported `[ OK ]` on the master-HEAD release binary used
by the bugfix-validation job, so the validation framework reported
"Failed to reproduce the bug" and marked the check red.
Switch the trigger to the actual fuzz-reduced pattern (PR #104478,
`fuzzer.log.zst`, query around `Fuzzing step 716`):
SELECT (2 IN tuple(-2)) FROM dist
WHERE dummy IN (0, 2)
GROUP BY (dummy IN tuple()) AND materialize(-2147483649) AND (NULL GLOBAL IN (*))
FORMAT Null;
The non-empty `WHERE dummy IN (0, 2)` keeps `optimized_cluster` populated with
two shards, so the rewriter runs and visits the empty `(dummy IN tuple())`
expression in `GROUP BY`. Verified locally:
* master HEAD without the fix: aborts in
`contrib/llvm-project/libcxx/include/__vector/vector.h:444:
libc++ Hardening assertion !empty() failed: back() called on an empty vector`
(exit code 134).
* With the fix: query completes cleanly.
Bugfix-validation report:
https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=104966&sha=95b2f912df1456d5d0b5d09d431882e3b9855fc3&name_0=PR&name_1=Bugfix%20validation%20%28functional%20tests%29
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The Fast test build is compiled without ICU, so any query using `COLLATE` fails with `Collations support is disabled, because ClickHouse was built without ICU library` (`SUPPORT_IS_DISABLED`), which broke `04217_query_condition_cache_topk` in the Fast test job. Split the COLLATE locale assertions out into a separate test `04238_query_condition_cache_topk_collate.sql` tagged `no-fasttest` so the `NULLS FIRST/LAST` and TopK plan-parameter assertions keep running in Fast test, while the COLLATE assertions still run in regular stateless builds that include ICU. CI report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=104478&sha=589ab32c683d193b4699fe7b0aae356701a7ff96&name_0=PR&name_1=Fast%20test Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Master already has `04238_incomplete_utf8_column_name_json.sql`; rename the new COLLATE test to `04240_query_condition_cache_topk_collate.sql` so test numbers stay unique. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous version replaced the original `VirtualColumnUtils::isDeterministic` call at the QCC write site in `selectRangesToRead` with the local `isNodeDeterministic`, which only checks `FUNCTION` nodes. That weakened the gate: a filter containing a non-deterministic constant `COLUMN` node (e.g. `now()`, `today()`) could now pass and have its `condition_hash` written to the query condition cache, then be reused by a later query where the constant value has changed, causing stale / incorrect mark pruning. Restore the original `COLUMN`-node check by introducing `isDeterministicAllowingTopKFilter`, a sibling of `VirtualColumnUtils::isDeterministic` that admits `__topKFilter` (it is the only "officially" non-deterministic function expected here, and its non-determinism is bounded by the salt combined into `condition_hash` below). This mirrors the same helper already used on the write side in `updateQueryConditionCache.cpp`, so the determinism predicates on both sides agree. `isNodeDeterministic` reverts to its master form; the row-policy gate that shares it is unaffected. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The `04240` prefix is now also used by master tests (`04240_intexp2_jit_overflow`, `04240_jit_float_to_big_int_libcalls`, `04240_80087_clickhouse_local_logs_on_exception`). Move our `COLLATE` test to `04242` so each test still has its own numeric prefix. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
LLVM Coverage Report
Changed lines: 98.78% (81/82) | lost baseline coverage: 1 line(s) · Uncovered code |
This commit reverts the materialization change in src/Processors/Transforms/FinishSortingTransform.cpp added by the previous commit, and keeps only the documentation test under tests/queries/0_stateless/04300_stid_3336_finishsorting_columnreplicated.sql. The reported "Bad cast from type DB::ColumnReplicated to DB::ColumnString" exception captured in CI on PR ClickHouse#102882, PR ClickHouse#104478, PR ClickHouse#105893 does not reproduce deterministically on master HEAD on a debug build, so the preceding source change was speculative and is dropped. The test stays as a name-anchored reproducer for the query shape captured in CI. CI report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=102882&sha=1de47ad4469490e183bd76c5dad2eddd82d61422&name_0=PR&name_1=Stateless%20tests%20%28amd_msan%2C%20WasmEdge%2C%20parallel%2C%201%2F2%29
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Enable the query condition cache for queries that go through TopK dynamic filtering (
ORDER BY ... LIMIT n). The QCC entry is partitioned by the TopK plan parameters so the same query reuses cached WHERE filter results, while a differentLIMIT, sort column, or sort direction produces a fresh entry.Documentation entry for user-facing changes
Why
When
tryOptimizeTopKpicks aReadFromMergeTreefor TopK dynamic filtering, the running threshold can drop granules before they reach the WHERE/PREWHERE filter. Until nowupdateQueryConditionCacheblanket-skipped QCC writes for such reads:The skip was correct but pessimistic — it disables QCC for every query that hits the TopK path, even when a re-run with the same plan and same parts would reuse exactly the same row set.
What changed
TopKFilterInfo::condition_hash— a deterministic hash over the planning-time TopK parameters:(column_name, type_name, num_sort_columns, limit_n, direction). Computed intryOptimizeTopKwhen the read is selected for TopK filtering.updateQueryConditionCache— instead of skipping for TopK reads, combinetop_k_filter_info->condition_hashinto the WHERE filter'scondition_hashviaboost::hash_combinebefore storing it onFilterStep.The QCC writer (
FilterTransform::writeIntoQueryConditionCache) and the read-side lookup both go through the sameFilterStep::condition, so the new combined hash is used consistently on both sides — same plan + same parts → cache hit; differentLIMIT n, different sort column → new key, fresh entry, never reusing a row set computed under different TopK conditions.Origin
Split out from WIP PR #81944 (commit
bec285616a2). The PR's original implementation was on its ownTopNFilterParametersinfrastructure that didn't survive the rebase onto master; this version keeps the same idea but uses master'sTopKFilterInfo/tryOptimizeTopKdirectly.