
Add statistics-based part pruning #94140

Merged
nickitat merged 68 commits into ClickHouse:master from zoomxi:statistics-part-pruning
Mar 27, 2026

@zoomxi commented Jan 14, 2026

  • Introduces StatisticsPartPruner that builds a KeyCondition from the filter expression and uses MinMax statistics to construct hyperrectangles for each part. The KeyCondition's checkInHyperrectangle method is then reused to determine if a part should be pruned.
  • After partition pruning, each remaining part is checked to see if it can be pruned based on its column statistics.
  • Statistics pruning results are exposed in EXPLAIN output as a new Statistics index type.
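The min/max check described above can be illustrated with a minimal sketch. It is not the `StatisticsPartPruner` or `KeyCondition` code from this PR: the types `MinMax`, `RangePredicate`, `PartStatistics` and the function `canPrunePart` are hypothetical names, and the filter is assumed to be a conjunction of simple closed-range conditions over `double` columns.

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical, simplified model: the closed interval [min, max] of one column in one part.
struct MinMax
{
    double min;
    double max;
};

// Hypothetical predicate form: lo <= column <= hi (one side of a hyperrectangle test).
struct RangePredicate
{
    std::string column;
    double lo;
    double hi;
};

// Per-part statistics: column name -> observed min/max. A column may have no statistics.
using PartStatistics = std::unordered_map<std::string, MinMax>;

// Returns true only when the statistics prove that no row in the part can match
// the conjunction of all predicates.
bool canPrunePart(const PartStatistics & stats, const std::vector<RangePredicate> & conjuncts)
{
    for (const auto & pred : conjuncts)
    {
        auto it = stats.find(pred.column);
        if (it == stats.end())
            continue; // No statistics for this column: the conjunct proves nothing.

        const MinMax & mm = it->second;
        // [mm.min, mm.max] and [pred.lo, pred.hi] do not intersect.
        if (mm.max < pred.lo || mm.min > pred.hi)
            return true; // One unsatisfiable conjunct makes the whole AND unsatisfiable.
    }
    return false; // Emptiness could not be proven, so the part has to be read.
}
```

The sketch only ever answers "prune" when emptiness is proven; missing statistics or overlapping ranges always fall back to reading the part.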

Implementation of the approach discussed in #93318 (comment)

Changelog category (leave one):

  • New Feature

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

ClickHouse is now able to prune entire data parts in SELECT queries based on min/max statistics.

Documentation entry for user-facing changes

  • [*] Documentation is written (mandatory for new features)

Note

Medium Risk
Changes MergeTree part selection logic to prune parts using column statistics by default, which can affect query results/performance if pruning is incorrect (especially around NULLs/Float64/Decimal edge cases). Scope is contained behind use_statistics_for_part_pruning and disabled for FINAL and on-the-fly mutations.

Overview
Adds statistics-based part pruning for MergeTree reads: after partition pruning, the executor can now drop entire parts using per-part MinMax column statistics derived from the query filter (new StatisticsPartPruner + filterPartsByStatistics).

Introduces the use_statistics_for_part_pruning setting (default enabled) and exposes pruning in EXPLAIN indexes=1 as a new Statistics index entry; MinMax in EXPLAIN is renamed to Partition Min-Max to clarify scope.

Updates docs with a new “Part Pruning with Statistics” section and adjusts/extends stateless tests (including new suites) to cover pruning behavior and edge cases (nullable, NaN, Float64 precision limits, decimals), often disabling the feature to keep existing EXPLAIN/row-count expectations stable.

Written by Cursor Bugbot for commit 77ef539. This will update automatically on new commits.


@nikitamikhaylov added the "can be tested" label (Allows running workflows for external contributors) Jan 14, 2026

clickhouse-gh Bot commented Jan 14, 2026

Workflow [PR], commit [aa02ae5]

Summary:

| job_name | test_name | status | info | comment |
|---|---|---|---|---|
| Stateless tests (amd_msan, WasmEdge, parallel, 1/2) | | failure | | |
| | 02967_parallel_replicas_joins_and_analyzer | FAIL | cidb, issue | ISSUE EXISTS |

AI Review

Summary

This PR adds StatisticsPartPruner and wires statistics-based part pruning into MergeTree read planning, with new EXPLAIN index output and docs/settings updates. The approach is promising, but there are still correctness/robustness/performance rollout concerns in the current form (nullable-range handling reduces pruning quality, estimate caching can be poisoned after a transient exception, and the feature is enabled by default without an experimental gate).

Missing context

  • ⚠️ No CI logs/benchmarks were provided for this review run, so hot-path overhead of per-part statistics loading could not be validated quantitatively.

Findings

  • ❌ Blockers

    • [src/Core/Settings.cpp:1669] New behavior in a core read path is enabled by default via use_statistics_for_part_pruning = true, but it is not gated as experimental. Per ClickHouse rollout rules, new features should be under an experimental gate first.
      • Suggested fix: add an allow_experimental_* guard (or default this feature off behind compatibility) and roll out incrementally.
  • ⚠️ Majors

    • [src/Storages/Statistics/StatisticsPartPruner.cpp:45] Nullable columns currently use Range(min, +inf) instead of real [min, max], which is safe but systematically weakens pruning for upper-bound predicates (col > const above real max keeps provably non-matching parts).

      • Suggested fix: keep real [min, max] bounds and rely on nullable-type handling inside KeyCondition::checkInHyperrectangle.
    • [src/Storages/MergeTree/MergeTreeDataSelectExecutor.cpp:731] This path calls IMergeTreeDataPart::getEstimates for each part, and getEstimates loads all column statistics on first access (loadStatistics()), not just columns used by the filter. On wide parts this can add significant metadata I/O and CPU in planning.

      • Suggested fix: load only required columns (e.g., based on statistics_pruner.getUsedColumns()) and cache filtered estimates.
    • [src/Storages/MergeTree/MergeTreeDataSelectExecutor.cpp:731] IMergeTreeDataPart::getEstimates initializes cached estimates before loadStatistics(). If loadStatistics() throws once, cache may stay as empty estimates and later calls won't retry, silently disabling statistics pruning for that part object.

      • Suggested fix: build estimates in a local variable and assign to cache only after successful load.
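The nullable-bounds finding above can be illustrated with a small sketch. It is a simplified model rather than the PR's `Range` or `KeyCondition` code: `Interval`, `exactBounds`, `widenedBounds` and `provesNoMatchForGreaterThan` are hypothetical names, and only the predicate shape `col > threshold` over `double` values is considered.

```cpp
#include <limits>

// Hypothetical closed interval of values a column may take inside one part.
struct Interval
{
    double lo;
    double hi;
};

// Bounds as recorded by the statistics: [min, max].
Interval exactBounds(double real_min, double real_max)
{
    return {real_min, real_max};
}

// Bounds as the review describes them for nullable columns: the upper bound is
// replaced by +inf, so the real maximum is ignored.
Interval widenedBounds(double real_min, double /*real_max*/)
{
    return {real_min, std::numeric_limits<double>::infinity()};
}

// True when no value inside the interval can satisfy `col > threshold`.
bool provesNoMatchForGreaterThan(const Interval & range, double threshold)
{
    return range.hi <= threshold;
}
```

With a real range of [1, 100] and the predicate `col > 500`, the exact bounds prove that the part cannot match, while the widened bounds never can, because +inf is never `<= 500`. That is the pruning loss the finding refers to; the widened form stays safe because it only keeps extra parts.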

ClickHouse Rules

| Item | Status | Notes |
|---|---|---|
| Deletion logging | | |
| Serialization versioning | | |
| Core-area scrutiny | | |
| No test removal | | |
| Experimental gate | | New part-pruning feature is enabled by default without an experimental gate. |
| No magic constants | | |
| Backward compatibility | ⚠️ | New planning behavior is on by default; compatibility/rollout guard should be stricter for a new pruning mechanism. |
| SettingsChangesHistory.cpp | | |
| PR metadata quality | | |
| Safe rollout | ⚠️ | Default-on rollout for a new pruning path increases risk in OSS/Cloud before broader validation. |
| Compilation time | | |

Performance & Safety

  • ⚠️ Per-part statistics loading currently fetches full statistics payloads on first access, even when predicates use a small subset of columns.
  • ⚠️ Estimate-cache write ordering can persist an empty cache after transient load exceptions, degrading pruning effectiveness for the lifetime of part objects.

Final Verdict

  • Status: ⚠️ Request changes
  • Minimum required actions:
    • Add an experimental/controlled rollout gate (or equivalent safe default) for statistics-based part pruning.
    • Preserve real nullable min/max bounds to avoid systematic pruning loss.
    • Avoid full-statistics loading in the per-part pruning loop; load/cache only required columns.
    • Fix estimate cache initialization order so failed loads do not poison cache state.

@clickhouse-gh Bot added the "pr-feature" label (Pull request with new product feature) Jan 14, 2026
@nickitat self-assigned this Jan 14, 2026
@rschu1ze
Member

@shankar-iyer FYI: Here is the pruning-by-statistics PR which I mentioned last Thursday.


clickhouse-gh Bot commented Mar 23, 2026


AI Review

Summary

This PR adds statistics-based part pruning for MergeTree reads via a new Statistics pruning stage (controlled by use_statistics_for_part_pruning) and exposes it in EXPLAIN output. The implementation is generally solid, but there is one blocker: the new pruning path can surface per-part statistics deserialization exceptions to user queries, turning an optimization into a query-availability risk.

Findings

❌ Blockers

  • [src/Storages/MergeTree/MergeTreeDataSelectExecutor.cpp:729] filterPartsByStatistics calls getEstimates for each part without per-part exception handling. IMergeTreeDataPart::getEstimates loads/deserializes statistics, and ColumnStatistics::deserialize can throw ILLEGAL_STATISTICS (e.g. stale format files). This can abort otherwise valid SELECT queries due to a single problematic part.
    • Suggested fix: catch exceptions per part in filterPartsByStatistics, log with part name, and treat that part as non-prunable ({true,true} behavior), matching existing tolerant statistics-loading paths.
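The suggested per-part fallback can be sketched as follows. This is an illustration under stated assumptions, not the code merged in this PR: `Part`, `can_prune` and `selectPartsToRead` are hypothetical names, and the `can_prune` callback stands in for loading a part's statistics and evaluating the filter against them.

```cpp
#include <exception>
#include <functional>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for a data part.
struct Part
{
    std::string name;
    // Returns true when statistics prove the part cannot match.
    // May throw, for example when the statistics file cannot be deserialized.
    std::function<bool()> can_prune;
};

// Returns the names of the parts that still have to be read.
// A part whose statistics fail to load is kept, never dropped, and the failure
// is recorded with the part name instead of aborting the whole query.
std::vector<std::string> selectPartsToRead(const std::vector<Part> & parts, std::vector<std::string> & warnings)
{
    std::vector<std::string> kept;
    for (const auto & part : parts)
    {
        bool prune = false;
        try
        {
            prune = part.can_prune();
        }
        catch (const std::exception & e)
        {
            warnings.push_back(part.name + ": " + e.what());
            prune = false; // Fall back to "non-prunable" for this part only.
        }

        if (!prune)
            kept.push_back(part.name);
    }
    return kept;
}
```

The design point is that the `try` block wraps a single part, so one unreadable statistics file costs only the pruning opportunity for that part.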

ClickHouse Rules

| Item | Status | Notes |
|---|---|---|
| Deletion logging | | |
| Serialization versioning | | |
| Core-area scrutiny | ⚠️ | New core pruning path propagates per-part statistics load exceptions to query execution. |
| No test removal | | |
| Experimental gate | | |
| No magic constants | | |
| Backward compatibility | ⚠️ | Existing parts with stale statistics metadata can now cause query-time exceptions in pruning path. |
| SettingsChangesHistory.cpp | | |
| PR metadata quality | | |
| Safe rollout | ⚠️ | Needs exception-safe fallback in pruning path to keep optimization non-disruptive. |
| Compilation time | | |

Performance & Safety

  • Safety concern: exception propagation from per-part statistics loading in the pruning path can make query availability depend on statistics file health.

Final Verdict

  • Status: ⚠️ Request changes
  • Minimum required actions:
    • Add per-part exception handling around statistics loading/evaluation in filterPartsByStatistics and fallback to non-pruning for the affected part.

Comment thread src/Core/Settings.cpp

Possible values:

- 0 — Disabled.

use_statistics_for_part_pruning introduces new query-planning behavior but is enabled by default immediately.

Given this is a new pruning mechanism in a core read path, could we roll it out behind an experimental gate (or default this setting to 0 first) until more production/CI coverage accumulates? That would reduce rollout risk while keeping opt-in testing straightforward.

Comment thread src/Storages/Statistics/StatisticsPartPruner.cpp
{
    try
    {
        auto estimates = part.data_part->getEstimates();

⚠️ part.data_part->getEstimates() can silently disable statistics pruning for this part after a transient read exception.

IMergeTreeDataPart::getEstimates assigns estimates = Estimates() before loadStatistics(). If loadStatistics() throws once, the cache is left as an empty map; later calls return that cached empty map without retrying.

In this loop that means one temporary read failure can permanently degrade pruning quality for the lifetime of the part object. Please consider building estimates in a local variable and assigning to the cache only after successful load.
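The "build locally, then assign" ordering suggested in this comment can be sketched as below. It is a simplified, single-threaded model and not `IMergeTreeDataPart::getEstimates`: `PartEstimatesCache`, its `loader` callback and the `Estimates` alias are hypothetical, and locking is deliberately left out.

```cpp
#include <functional>
#include <map>
#include <optional>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical estimates type: column name -> some estimate value.
using Estimates = std::map<std::string, double>;

// Hypothetical cache for one part. Not thread-safe; a real implementation
// would need synchronization around `cached`.
class PartEstimatesCache
{
public:
    explicit PartEstimatesCache(std::function<Estimates()> loader_) : loader(std::move(loader_)) {}

    const Estimates & get()
    {
        if (!cached)
        {
            // Load into a local first. If the loader throws, `cached` stays
            // empty and the next call retries instead of returning an empty map.
            Estimates loaded = loader();
            cached = std::move(loaded);
        }
        return *cached;
    }

    bool isCached() const { return cached.has_value(); }

private:
    std::function<Estimates()> loader;
    std::optional<Estimates> cached;
};
```

With this ordering a transient failure propagates to the caller once, and a later call gets a fresh load attempt.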


That is a good suggestion, but we don't have to include it in this PR.


clickhouse-gh Bot commented Mar 27, 2026

LLVM Coverage Report

| Metric | Baseline | Current | Δ |
|---|---|---|---|
| Lines | 84.20% | 84.20% | +0.00% |
| Functions | 24.60% | 24.60% | +0.00% |
| Branches | 76.70% | 76.70% | +0.00% |

Changed lines: 94.61% (193/204) · Uncovered code

Full report · Diff report

@nickitat
Member

Stateless tests (amd_msan, WasmEdge, parallel, 1/2) - #100867

@nickitat enabled auto-merge March 27, 2026 18:59
@nickitat added this pull request to the merge queue Mar 27, 2026
Merged via the queue into ClickHouse:master with commit bff3166 Mar 27, 2026
151 of 153 checks passed
@robot-ch-test-poll2 added the "pr-synced-to-cloud" label (The PR is synced to the cloud repo) Mar 27, 2026
@zoomxi deleted the statistics-part-pruning branch March 30, 2026 01:32

Labels

  • can be tested: Allows running workflows for external contributors
  • manual approve: Manual approve required to run CI
  • pr-feature: Pull request with new product feature
  • pr-synced-to-cloud: The PR is synced to the cloud repo


7 participants