Skip to content

Fix crash in case of iceberg table with mixed ORC/Parquet files#99168

Merged
alesapin merged 6 commits intomasterfrom
fix_crash_in_ext_format
Mar 11, 2026
Merged

Fix crash in case of iceberg table with mixed ORC/Parquet files#99168
alesapin merged 6 commits intomasterfrom
fix_crash_in_ext_format

Conversation

@alesapin
Copy link
Copy Markdown
Member

@alesapin alesapin commented Mar 10, 2026

Changelog category (leave one):

  • Bug Fix (user-visible misbehavior in an official stable release)

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Fix very rare crash when Iceberg table contains files of mixed format (ORC and Parquet). Fixes #88126.

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

Before we had a single initialization of KeyCondition for the whole read. Now we switch to per-file initialization. I don't expect any performance degradation, because opening/reading file is several orders of magnitude slower than analyzing something in WHERE condition.


Note

Medium Risk
Touches filter initialization and Parquet page-level predicate pushdown, which can affect row/column skipping and is exercised across multiple input formats; changes are localized but impact correctness/performance of filtering paths.

Overview
Fixes a crash when reading Iceberg tables containing mixed ORC and Parquet files by making FormatFilterInfo build key_condition/PREWHERE additional columns with std::call_once semantics via initKeyConditionOnce().

Updates ORC and Parquet input formats to use this new API and removes FormatFilterInfo::opaque and the Parquet V3-specific FilterInfoExt; Parquet page-level filter pushdown now derives per-column conditions directly from key_condition and keeps them alive in Parquet::Reader.

Adds a stateless regression test (04033_iceberg_orc_parquet_v3_crash) to ensure the mixed-format Iceberg read no longer segfaults.

Written by Cursor Bugbot for commit 13e3d24. This will update automatically on new commits. Configure here.

@clickhouse-gh
Copy link
Copy Markdown
Contributor

clickhouse-gh bot commented Mar 10, 2026

Workflow [PR], commit [13e3d24]

Summary:

job_name test_name status info comment
AST fuzzer (amd_ubsan) error IGNORED

AI Review

1) Summary

This PR removes cross-format mutable state from FormatFilterInfo and localizes Parquet page-filter condition state into Parquet::Reader, while keeping lazy key-condition initialization via initKeyConditionOnce. It also adds a stateless regression test for mixed ORC/Parquet Iceberg data with input_format_parquet_use_native_reader_v3 = 1. High-level verdict: the fix is correct and I found no Blocker or Major issues.

2) Missing context (if any)

  • No CI report links or Praktika logs were provided in this review request.
  • No benchmark artifacts were provided.

3) Findings (by severity)

No Blocker or Major findings.

4) Tests & Evidence

  • Positive coverage: added tests/queries/0_stateless/04033_iceberg_orc_parquet_v3_crash.sh reproduces the mixed-format Iceberg scenario and verifies stable output.
  • Negative/error-handling coverage: no explicit error-path assertions were added; this is acceptable for this regression shape.
  • Additional test guidance (optional follow-up): add one variant where table format is Parquet and one file is ORC (reverse mix) to guard the symmetric path.

5) ClickHouse Compliance Checklist (Yes/No + short note)

  • Data deletions logged? Yes (N/A) — no data-deletion logic changed.
  • Serialization formats versioned? Yes (N/A) — no serialization format changes.
  • Experimental setting gate present? Yes (N/A) — no new feature introduced.
  • Settings exposed for constants/thresholds? Yes (N/A) — no new constants needing settings.
  • Backward compatibility preserved? Yes — behavior is a bug fix in reader initialization/state handling.
  • SettingsHistory.cpp updated for new/changed settings? Yes (N/A) — no setting changes.
  • Existing tests untouched (only additions)? Yes — existing tests were not relaxed/deleted.
  • Docs/user-facing notes updated? Yes (N/A) — internal bug fix; test added.
  • Core-area change got extra scrutiny? Yes — format readers and filtering state handling were reviewed.

6) Performance & Safety Notes

  • Safety: removing format_filter_info->opaque type-punning across formats reduces invalid-state risk.
  • Performance: no hot-path algorithmic regression identified; extractSingleColumnConditions remains conditional on page_filter_push_down and key_condition.
  • Benchmark status: none provided; minimal benchmark suggestion — compare SELECT count() and filtered scans on a mixed-format Iceberg dataset with V3 reader on/off.

7) User-Lens Review

The behavior is intuitive: mixed-format Iceberg reads now avoid the prior exception path and return expected rows. Error/actionability is unchanged, and the new regression test makes the fix robust against reintroduction.

8) Final Verdict

  • Status: Approve

@clickhouse-gh clickhouse-gh bot added the pr-bugfix Pull request with bugfix, not backported by default label Mar 10, 2026
Copy link
Copy Markdown

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 2 potential issues.

Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, have a team admin enable autofix in the Cursor dashboard.

@kssenii kssenii self-assigned this Mar 11, 2026
@clickhouse-gh
Copy link
Copy Markdown
Contributor

clickhouse-gh bot commented Mar 11, 2026

LLVM Coverage Report

Metric Baseline Current Δ
Lines 83.80% 83.80% +0.00%
Functions 23.90% 23.80% -0.10%
Branches 76.30% 76.30% +0.00%

PR changed lines: PR changed-lines coverage: 95.31% (61/64)
Diff coverage report
Uncovered code

@alesapin alesapin added this pull request to the merge queue Mar 11, 2026
Merged via the queue into master with commit a34a749 Mar 11, 2026
162 of 164 checks passed
@alesapin alesapin deleted the fix_crash_in_ext_format branch March 11, 2026 20:00
@robot-ch-test-poll1 robot-ch-test-poll1 added the pr-synced-to-cloud The PR is synced to the cloud repo label Mar 11, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-bugfix Pull request with bugfix, not backported by default pr-synced-to-cloud The PR is synced to the cloud repo

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Iceberg SEGV with Parquet reader on ORC file?

3 participants