Fix crash in case of iceberg table with mixed ORC/Parquet files by alesapin · Pull Request #99168 · ClickHouse/ClickHouse

alesapin · 2026-03-10T13:24:01Z

Changelog category (leave one):

Bug Fix (user-visible misbehavior in an official stable release)

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Fix very rare crash when Iceberg table contains files of mixed format (ORC and Parquet). Fixes #88126.

Documentation entry for user-facing changes

Documentation is written (mandatory for new features)

Before we had a single initialization of KeyCondition for the whole read. Now we switch to per-file initialization. I don't expect any performance degradation, because opening/reading file is several orders of magnitude slower than analyzing something in WHERE condition.

Note

Medium Risk
Touches filter initialization and Parquet page-level predicate pushdown, which can affect row/column skipping and is exercised across multiple input formats; changes are localized but impact correctness/performance of filtering paths.

Overview
Fixes a crash when reading Iceberg tables containing mixed ORC and Parquet files by making FormatFilterInfo build key_condition/PREWHERE additional columns with std::call_once semantics via initKeyConditionOnce().

Updates ORC and Parquet input formats to use this new API and removes FormatFilterInfo::opaque and the Parquet V3-specific FilterInfoExt; Parquet page-level filter pushdown now derives per-column conditions directly from key_condition and keeps them alive in Parquet::Reader.

Adds a stateless regression test (04033_iceberg_orc_parquet_v3_crash) to ensure the mixed-format Iceberg read no longer segfaults.

^{Written by Cursor Bugbot for commit 13e3d24. This will update automatically on new commits. Configure here.}

clickhouse-gh · 2026-03-10T13:24:46Z

Workflow [PR], commit [13e3d24]

Summary: ❌

job_name	test_name	status	info	comment
AST fuzzer (amd_ubsan)		error		IGNORED

AI Review

1) Summary

This PR removes cross-format mutable state from FormatFilterInfo and localizes Parquet page-filter condition state into Parquet::Reader, while keeping lazy key-condition initialization via initKeyConditionOnce. It also adds a stateless regression test for mixed ORC/Parquet Iceberg data with input_format_parquet_use_native_reader_v3 = 1. High-level verdict: the fix is correct and I found no Blocker or Major issues.

2) Missing context (if any)

No CI report links or Praktika logs were provided in this review request.
No benchmark artifacts were provided.

3) Findings (by severity)

No Blocker or Major findings.

4) Tests & Evidence

Positive coverage: added tests/queries/0_stateless/04033_iceberg_orc_parquet_v3_crash.sh reproduces the mixed-format Iceberg scenario and verifies stable output.
Negative/error-handling coverage: no explicit error-path assertions were added; this is acceptable for this regression shape.
Additional test guidance (optional follow-up): add one variant where table format is Parquet and one file is ORC (reverse mix) to guard the symmetric path.

5) ClickHouse Compliance Checklist (Yes/No + short note)

Data deletions logged? Yes (N/A) — no data-deletion logic changed.
Serialization formats versioned? Yes (N/A) — no serialization format changes.
Experimental setting gate present? Yes (N/A) — no new feature introduced.
Settings exposed for constants/thresholds? Yes (N/A) — no new constants needing settings.
Backward compatibility preserved? Yes — behavior is a bug fix in reader initialization/state handling.
SettingsHistory.cpp updated for new/changed settings? Yes (N/A) — no setting changes.
Existing tests untouched (only additions)? Yes — existing tests were not relaxed/deleted.
Docs/user-facing notes updated? Yes (N/A) — internal bug fix; test added.
Core-area change got extra scrutiny? Yes — format readers and filtering state handling were reviewed.

6) Performance & Safety Notes

Safety: removing format_filter_info->opaque type-punning across formats reduces invalid-state risk.
Performance: no hot-path algorithmic regression identified; extractSingleColumnConditions remains conditional on page_filter_push_down and key_condition.
Benchmark status: none provided; minimal benchmark suggestion — compare SELECT count() and filtered scans on a mixed-format Iceberg dataset with V3 reader on/off.

7) User-Lens Review

The behavior is intuitive: mixed-format Iceberg reads now avoid the prior exception path and return expected rows. Error/actionability is unchanged, and the new regression test makes the fix robust against reintroduction.

8) Final Verdict

Status: Approve

cursor

Cursor Bugbot has reviewed your changes and found 2 potential issues.

^{Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, have a team admin enable autofix in the Cursor dashboard.}

src/Processors/Formats/Impl/Parquet/Reader.cpp

tests/queries/0_stateless/04033_iceberg_orc_parquet_v3_crash.sh

src/Processors/Formats/Impl/ParquetV3BlockInputFormat.cpp

clickhouse-gh · 2026-03-11T16:04:35Z

LLVM Coverage Report

Metric	Baseline	Current	Δ
Lines	83.80%	83.80%	+0.00%
Functions	23.90%	23.80%	-0.10%
Branches	76.30%	76.30%	+0.00%

PR changed lines: PR changed-lines coverage: 95.31% (61/64)
Diff coverage report
Uncovered code

Fix crash in case of iceberg table with mixed ORC/Parquet files

2e9ff84

clickhouse-gh bot added the pr-bugfix Pull request with bugfix, not backported by default label Mar 10, 2026

cursor bot reviewed Mar 10, 2026

View reviewed changes

src/Processors/Formats/Impl/Parquet/Reader.cpp Show resolved Hide resolved

tests/queries/0_stateless/04033_iceberg_orc_parquet_v3_crash.sh Outdated Show resolved Hide resolved

alesapin added 4 commits March 10, 2026 14:27

Remove hardcode

25c64c4

Fix crash

3da7610

Fix crash

7830a05

Get rid of opaque

0691e62

kssenii reviewed Mar 11, 2026

View reviewed changes

src/Processors/Formats/Impl/ParquetV3BlockInputFormat.cpp Show resolved Hide resolved

kssenii self-assigned this Mar 11, 2026

Get rid of initOnce

13e3d24

kssenii approved these changes Mar 11, 2026

View reviewed changes

alesapin added this pull request to the merge queue Mar 11, 2026

Merged via the queue into master with commit a34a749 Mar 11, 2026
162 of 164 checks passed

alesapin deleted the fix_crash_in_ext_format branch March 11, 2026 20:00

robot-ch-test-poll1 added the pr-synced-to-cloud The PR is synced to the cloud repo label Mar 11, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix crash in case of iceberg table with mixed ORC/Parquet files#99168

Fix crash in case of iceberg table with mixed ORC/Parquet files#99168
alesapin merged 6 commits intomasterfrom
fix_crash_in_ext_format

alesapin commented Mar 10, 2026 •

edited by cursor bot

Loading

Uh oh!

clickhouse-gh bot commented Mar 10, 2026 •

edited by alesapin

Loading

Uh oh!

cursor bot left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

clickhouse-gh bot commented Mar 11, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

alesapin commented Mar 10, 2026 • edited by cursor bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Changelog category (leave one):

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Documentation entry for user-facing changes

Uh oh!

clickhouse-gh bot commented Mar 10, 2026 • edited by alesapin Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

AI Review

1) Summary

2) Missing context (if any)

3) Findings (by severity)

4) Tests & Evidence

5) ClickHouse Compliance Checklist (Yes/No + short note)

6) Performance & Safety Notes

7) User-Lens Review

8) Final Verdict

Uh oh!

cursor bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

clickhouse-gh bot commented Mar 11, 2026

LLVM Coverage Report

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

alesapin commented Mar 10, 2026 •

edited by cursor bot

Loading

clickhouse-gh bot commented Mar 10, 2026 •

edited by alesapin

Loading