Add statistics-based part pruning #94140
Conversation
@shankar-iyer FYI: Here is the pruning-by-statistics PR which I mentioned last Thursday.
`use_statistics_for_part_pruning` introduces new query-planning behavior but is enabled by default immediately.
Given that this is a new pruning mechanism in a core read path, could we roll it out behind an experimental gate (or default this setting to 0 first) until more production/CI coverage accumulates? That would reduce rollout risk while keeping opt-in testing straightforward.
```cpp
{
    try
    {
        auto estimates = part.data_part->getEstimates();
```
`part.data_part->getEstimates()` can silently disable statistics pruning for this part after a transient read exception.
`IMergeTreeDataPart::getEstimates` assigns `estimates = Estimates()` before calling `loadStatistics()`. If `loadStatistics()` throws once, the cache is left as an empty map; later calls return that cached empty map without retrying.
In this loop, that means one temporary read failure can permanently degrade pruning quality for the lifetime of the part object. Please consider building the estimates in a local variable and assigning to the cache only after a successful load.
That is a good suggestion, but we don't have to include it in this PR.
LLVM Coverage Report
Changed lines: 94.61% (193/204)
bff3166
Adds a `StatisticsPartPruner` that builds a `KeyCondition` from the filter expression and uses MinMax statistics to construct hyperrectangles for each part. The `KeyCondition`'s `checkInHyperrectangle` method is then reused to determine whether a part should be pruned. Pruning is exposed in EXPLAIN as a new `Statistics` index type. Implementation of the approach discussed in #93318 (comment)
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
ClickHouse is now able to prune entire data parts in SELECT queries based on min/max statistics.
Documentation entry for user-facing changes
Note
Medium Risk
Changes MergeTree part selection logic to prune parts using column statistics by default, which can affect query results/performance if pruning is incorrect (especially around NULLs/Float64/Decimal edge cases). Scope is contained behind `use_statistics_for_part_pruning` and disabled for `FINAL` and on-the-fly mutations.

Overview

Adds statistics-based part pruning for MergeTree reads: after partition pruning, the executor can now drop entire parts using per-part MinMax column statistics derived from the query filter (new `StatisticsPartPruner` + `filterPartsByStatistics`).

Introduces the `use_statistics_for_part_pruning` setting (default enabled) and exposes pruning in `EXPLAIN indexes=1` as a new `Statistics` index entry; `MinMax` in EXPLAIN is renamed to `Partition Min-Max` to clarify scope.

Updates docs with a new "Part Pruning with Statistics" section and adjusts/extends stateless tests (including new suites) to cover pruning behavior and edge cases (nullable, NaN, Float64 precision limits, decimals), often disabling the feature to keep existing EXPLAIN/row-count expectations stable.
Written by Cursor Bugbot for commit 77ef539. This will update automatically on new commits.