Add statistics-based part pruning by zoomxi · Pull Request #94140 · ClickHouse/ClickHouse

zoomxi · 2026-01-14T09:47:35Z

Introduces StatisticsPartPruner that builds a KeyCondition from the filter expression and uses MinMax statistics to construct hyperrectangles for each part. The KeyCondition's checkInHyperrectangle method is then reused to determine if a part should be pruned.
After partition pruning, each remaining part is checked to see if it can be pruned based on its column statistics.
Statistics pruning results are exposed in EXPLAIN output as a new Statistics index type.
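
To illustrate the idea described above, here is a minimal, self-contained sketch of min/max-based part pruning. It is not the code from this PR: `Range`, `Predicate`, `PartStatistics`, `mayMatch` and `canPrunePart` are hypothetical stand-ins for the KeyCondition and hyperrectangle machinery, and only simple conjunctions of comparisons against constants are modelled.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical, simplified stand-ins; none of these are ClickHouse types.
struct Range
{
    double min;
    double max; // closed interval [min, max]
};

enum class Op { Less, Greater, Equals };

// One conjunct of a filter, e.g. `x > 100`.
struct Predicate
{
    std::string column;
    Op op;
    double value;
};

// Per-part MinMax statistics: column name -> [min, max] observed in that part.
// Taken together, the ranges form the "hyperrectangle" of the part.
using PartStatistics = std::map<std::string, Range>;

// Returns false only when the range proves that no row can satisfy the predicate.
bool mayMatch(const Range & range, const Predicate & pred)
{
    assert(range.min <= range.max);
    switch (pred.op)
    {
        case Op::Less:    return range.min < pred.value;
        case Op::Greater: return range.max > pred.value;
        case Op::Equals:  return range.min <= pred.value && pred.value <= range.max;
    }
    return true; // unknown operator: stay conservative and keep the part
}

// A part can be pruned if any conjunct of the AND-ed filter is provably false for it.
// Columns without statistics are treated as unbounded, so they never cause pruning.
bool canPrunePart(const PartStatistics & stats, const std::vector<Predicate> & conjuncts)
{
    for (const auto & pred : conjuncts)
    {
        auto it = stats.find(pred.column);
        if (it == stats.end())
            continue;
        if (!mayMatch(it->second, pred))
            return true;
    }
    return false;
}
```

With statistics `x in [10, 50]`, the filter `x > 100` lets the part be skipped entirely, while `x > 20` does not. The key property is that a missing or inconclusive statistic always keeps the part.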

Implementation of the approach discussed in #93318 (comment)

Changelog category (leave one):

New Feature

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

ClickHouse is now able to prune entire data parts in SELECT queries based on min/max statistics.

Documentation entry for user-facing changes

[*] Documentation is written (mandatory for new features)

Note

Medium Risk
Changes MergeTree part selection logic to prune parts using column statistics by default, which can affect query results/performance if pruning is incorrect (especially around NULLs/Float64/Decimal edge cases). Scope is contained behind use_statistics_for_part_pruning and disabled for FINAL and on-the-fly mutations.

Overview
Adds statistics-based part pruning for MergeTree reads: after partition pruning, the executor can now drop entire parts using per-part MinMax column statistics derived from the query filter (new StatisticsPartPruner + filterPartsByStatistics).

Introduces the use_statistics_for_part_pruning setting (default enabled) and exposes pruning in EXPLAIN indexes=1 as a new Statistics index entry; MinMax in EXPLAIN is renamed to Partition Min-Max to clarify scope.

Updates docs with a new “Part Pruning with Statistics” section and adjusts/extends stateless tests (including new suites) to cover pruning behavior and edge cases (nullable, NaN, Float64 precision limits, decimals), often disabling the feature to keep existing EXPLAIN/row-count expectations stable.

Written by Cursor Bugbot for commit 77ef539. This will update automatically on new commits.

clickhouse-gh · 2026-01-14T10:23:09Z

Workflow [PR], commit [aa02ae5]

Summary: ❌

job_name	test_name	status	info	comment
Stateless tests (amd_msan, WasmEdge, parallel, 1/2)		failure
	02967_parallel_replicas_joins_and_analyzer	FAIL	cidb, issue	ISSUE EXISTS

AI Review

Summary

This PR adds StatisticsPartPruner and wires statistics-based part pruning into MergeTree read planning, with new EXPLAIN index output and docs/settings updates. The approach is promising, but there are still correctness/robustness/performance rollout concerns in the current form (nullable-range handling reduces pruning quality, estimate caching can be poisoned after a transient exception, and the feature is enabled by default without an experimental gate).

Missing context

⚠️ No CI logs/benchmarks were provided for this review run, so hot-path overhead of per-part statistics loading could not be validated quantitatively.

Findings

❌ Blockers
- [src/Core/Settings.cpp:1669] New behavior in a core read path is enabled by default via use_statistics_for_part_pruning = true, but it is not gated as experimental. Per ClickHouse rollout rules, new features should be under an experimental gate first.
  - Suggested fix: add an allow_experimental_* guard (or default this feature off behind compatibility) and roll out incrementally.
⚠️ Majors
- [src/Storages/Statistics/StatisticsPartPruner.cpp:45] Nullable columns currently use Range(min, +inf) instead of real [min, max], which is safe but systematically weakens pruning for upper-bound predicates (col > const above real max keeps provably non-matching parts).
  - Suggested fix: keep real [min, max] bounds and rely on nullable-type handling inside KeyCondition::checkInHyperrectangle.
- [src/Storages/MergeTree/MergeTreeDataSelectExecutor.cpp:731] This path calls IMergeTreeDataPart::getEstimates for each part, and getEstimates loads all column statistics on first access (loadStatistics()), not just columns used by the filter. On wide parts this can add significant metadata I/O and CPU in planning.
  - Suggested fix: load only required columns (e.g., based on statistics_pruner.getUsedColumns()) and cache filtered estimates.
- [src/Storages/MergeTree/MergeTreeDataSelectExecutor.cpp:731] IMergeTreeDataPart::getEstimates initializes cached estimates before loadStatistics(). If loadStatistics() throws once, cache may stay as empty estimates and later calls won't retry, silently disabling statistics pruning for that part object.
  - Suggested fix: build estimates in a local variable and assign to cache only after successful load.
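
The nullable-range finding above can be made concrete with a small sketch. It only models the effect the review describes (an upper bound widened to +inf versus the real maximum). The names `Range`, `conservativeNullableRange`, `exactRange`, `mayMatchGreater` and `mayMatchLess` are hypothetical and do not correspond to the PR's code or to `KeyCondition`.

```cpp
#include <cassert>
#include <limits>

// Hypothetical closed interval; not the ClickHouse Range class.
struct Range
{
    double left;
    double right;
};

// Behaviour described in the review: the real maximum is discarded for nullable columns.
Range conservativeNullableRange(double min, double /*max*/)
{
    return {min, std::numeric_limits<double>::infinity()};
}

// Alternative suggested in the review: keep the real bounds.
Range exactRange(double min, double max)
{
    assert(min <= max);
    return {min, max};
}

// `col > value` can match only if the upper bound exceeds `value`.
bool mayMatchGreater(const Range & r, double value)
{
    return r.right > value;
}

// `col < value` can match only if the lower bound is below `value`.
bool mayMatchLess(const Range & r, double value)
{
    return r.left < value;
}
```

For a part whose column spans `[10, 50]`, the predicate `col > 100` is provably false with the exact range, but the widened range `[10, +inf)` cannot rule it out, so the part is kept. Lower-bound predicates such as `col < 20` behave identically in both variants.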

ClickHouse Rules

Item	Status	Notes
Deletion logging	➖
Serialization versioning	➖
Core-area scrutiny	✅
No test removal	✅
Experimental gate	❌	New part-pruning feature is enabled by default without an experimental gate.
No magic constants	✅
Backward compatibility	⚠️	New planning behavior is on by default; compatibility/rollout guard should be stricter for a new pruning mechanism.
`SettingsChangesHistory.cpp`	✅
PR metadata quality	✅
Safe rollout	⚠️	Default-on rollout for a new pruning path increases risk in OSS/Cloud before broader validation.
Compilation time	✅

Performance & Safety

⚠️ Per-part statistics loading currently fetches full statistics payloads on first access, even when predicates use a small subset of columns.
⚠️ Estimate-cache write ordering can persist an empty cache after transient load exceptions, degrading pruning effectiveness for the lifetime of part objects.

Final Verdict

Status: ⚠️ Request changes
Minimum required actions:
- Add an experimental/controlled rollout gate (or equivalent safe default) for statistics-based part pruning.
- Preserve real nullable min/max bounds to avoid systematic pruning loss.
- Avoid full-statistics loading in the per-part pruning loop; load/cache only required columns.
- Fix estimate cache initialization order so failed loads do not poison cache state.

rschu1ze · 2026-01-19T10:08:29Z

@shankar-iyer FYI: Here is the pruning-by-statistics PR which I mentioned last Thursday.

clickhouse-gh · 2026-03-23T01:40:37Z

AI Review

Summary

This PR adds statistics-based part pruning for MergeTree reads via a new Statistics pruning stage (controlled by use_statistics_for_part_pruning) and exposes it in EXPLAIN output. The implementation is generally solid, but there is one blocker: the new pruning path can surface per-part statistics deserialization exceptions to user queries, turning an optimization into a query-availability risk.

Findings

❌ Blockers

[src/Storages/MergeTree/MergeTreeDataSelectExecutor.cpp:729] filterPartsByStatistics calls getEstimates for each part without per-part exception handling. IMergeTreeDataPart::getEstimates loads/deserializes statistics, and ColumnStatistics::deserialize can throw ILLEGAL_STATISTICS (e.g. stale format files). This can abort otherwise valid SELECT queries due to a single problematic part.
- Suggested fix: catch exceptions per part in filterPartsByStatistics, log with part name, and treat that part as non-prunable ({true,true} behavior), matching existing tolerant statistics-loading paths.
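
The suggested fix can be sketched as follows. This is an assumption-laden illustration, not the PR's `filterPartsByStatistics`: `Part`, its `can_prune_by_statistics` callback and `filterPartsTolerantly` are hypothetical, and logging goes to `std::cerr` instead of ClickHouse's logger.

```cpp
#include <cassert>
#include <functional>
#include <iostream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical part descriptor; the callback stands in for loading statistics
// and evaluating the filter, and it may throw (e.g. on a stale statistics file).
struct Part
{
    std::string name;
    std::function<bool()> can_prune_by_statistics;
};

// Keeps every part that cannot be proven irrelevant. A failure while evaluating
// statistics for one part is logged and treated as "cannot prune", so the query
// still runs and merely reads that part.
std::vector<Part> filterPartsTolerantly(const std::vector<Part> & parts)
{
    std::vector<Part> kept;
    for (const auto & part : parts)
    {
        bool prune = false;
        try
        {
            prune = part.can_prune_by_statistics();
        }
        catch (const std::exception & e)
        {
            std::cerr << "Cannot use statistics for part " << part.name << ": " << e.what() << '\n';
            prune = false; // fall back to non-pruning for this part only
        }

        if (!prune)
            kept.push_back(part);
    }
    assert(kept.size() <= parts.size());
    return kept;
}
```

The try/catch sits inside the loop on purpose: one unreadable statistics file affects only its own part, and pruning continues to work for the remaining parts.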

ClickHouse Rules

Item	Status	Notes
Deletion logging	➖
Serialization versioning	✅
Core-area scrutiny	⚠️	New core pruning path propagates per-part statistics load exceptions to query execution.
No test removal	✅
Experimental gate	➖
No magic constants	✅
Backward compatibility	⚠️	Existing parts with stale statistics metadata can now cause query-time exceptions in pruning path.
`SettingsChangesHistory.cpp`	✅
PR metadata quality	✅
Safe rollout	⚠️	Needs exception-safe fallback in pruning path to keep optimization non-disruptive.
Compilation time	✅

Performance & Safety

Safety concern: exception propagation from per-part statistics loading in the pruning path can make query availability depend on statistics file health.

Final Verdict

Status: ⚠️ Request changes
Minimum required actions:
- Add per-part exception handling around statistics loading/evaluation in filterPartsByStatistics and fall back to non-pruning for the affected part.

clickhouse-gh · 2026-03-23T03:50:12Z

+
+Possible values:
+
 - 0 — Disabled.


use_statistics_for_part_pruning introduces new query-planning behavior but is enabled by default immediately.

Given this is a new pruning mechanism in a core read path, could we roll it out behind an experimental gate (or default this setting to 0 first) until more production/CI coverage accumulates? That would reduce rollout risk while keeping opt-in testing straightforward.

clickhouse-gh · 2026-03-27T05:56:01Z

+    {
+        try
+        {
+            auto estimates = part.data_part->getEstimates();


⚠️ part.data_part->getEstimates() can silently disable statistics pruning for this part after a transient read exception.

IMergeTreeDataPart::getEstimates assigns estimates = Estimates() before loadStatistics(). If loadStatistics() throws once, the cache is left as an empty map; later calls return that cached empty map without retrying.

In this loop that means one temporary read failure can permanently degrade pruning quality for the lifetime of the part object. Please consider building estimates in a local variable and assigning to the cache only after successful load.

That is a good suggestion, but we don't have to include it in this PR.
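
The cache-ordering concern discussed above can be reproduced in a small model. This sketch is based only on the description in the review comment, not on the actual `IMergeTreeDataPart` source: `PartSketch`, `failures_left`, `getEstimatesEager` and `getEstimatesSafe` are hypothetical names, and `Estimates` is reduced to a plain map.

```cpp
#include <cassert>
#include <map>
#include <optional>
#include <stdexcept>
#include <string>
#include <utility>

using Estimates = std::map<std::string, double>; // simplified stand-in

struct PartSketch
{
    int failures_left = 0;          // number of loads that fail before one succeeds
    std::optional<Estimates> cache; // empty optional means "not loaded yet"

    Estimates loadStatistics()
    {
        if (failures_left > 0)
        {
            --failures_left;
            throw std::runtime_error("transient read failure");
        }
        Estimates result;
        result["x"] = 42.0;
        return result;
    }

    // Ordering described in the review: the cache is initialized before loading.
    const Estimates & getEstimatesEager()
    {
        if (!cache)
        {
            cache = Estimates();       // cache now looks populated
            *cache = loadStatistics(); // if this throws, the empty map stays cached
        }
        return *cache;
    }

    // Suggested ordering: load into a local and publish only on success.
    const Estimates & getEstimatesSafe()
    {
        if (!cache)
        {
            Estimates loaded = loadStatistics(); // may throw; cache is untouched
            cache = std::move(loaded);
        }
        return *cache;
    }
};
```

After a single failed load, `getEstimatesEager` keeps returning an empty map and never retries, whereas `getEstimatesSafe` retries on the next call and returns the loaded statistics.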

clickhouse-gh · 2026-03-27T08:24:36Z

LLVM Coverage Report

Metric	Baseline	Current	Δ
Lines	84.20%	84.20%	+0.00%
Functions	24.60%	24.60%	+0.00%
Branches	76.70%	76.70%	+0.00%

Changed lines: 94.61% (193/204)

nickitat · 2026-03-27T12:54:34Z

Stateless tests (amd_msan, WasmEdge, parallel, 1/2) - #100867

add statistics-based part pruning

5dcae79

nikitamikhaylov added the can be tested Allows running workflows for external contributors label Jan 14, 2026

clickhouse-gh Bot added the pr-feature Pull request with new product feature label Jan 14, 2026

nickitat self-assigned this Jan 14, 2026

nickitat mentioned this pull request Jan 14, 2026

Experimental part pruning by partition ID prefix #91805

Open

1 task

zoomxi added 3 commits January 15, 2026 14:35

fix tests

ac141ee

Merge branch 'master' into statistics-part-pruning

478facc

fix tests

5156e29

zoomxi mentioned this pull request Jan 15, 2026

Fix accurate comparison between Decimal and Float types #94293

Merged

fix tests

5e03770

nickitat reviewed Jan 15, 2026

rschu1ze reviewed Jan 15, 2026

zoomxi added 5 commits January 17, 2026 15:03

Make changes based on code review suggestions

b490a68

update

713803a

fix style

4f32157

fix

1c32d89

fix nullable

ae9b5b5