Skip to content

Fix Iceberg read optimization returning NULLs for stats-less manifests#1991

Merged
arthurpassos merged 2 commits into
antalya-26.3from
fix/antalya-26.3/stateless-manifests
Jul 1, 2026
Merged

Fix Iceberg read optimization returning NULLs for stats-less manifests#1991
arthurpassos merged 2 commits into
antalya-26.3from
fix/antalya-26.3/stateless-manifests

Conversation

@zvonand

@zvonand zvonand commented Jun 30, 2026

Copy link
Copy Markdown
Member

C++ change similar to #1895 (without unneeded check)

tests added

Closes #1545

Changelog category (leave one):

  • Bug Fix (user-visible misbehavior in an official stable release)

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Fix Iceberg read optimization returning NULL for every column when reading manifests without per-file column statistics (e.g. pyiceberg with default settings). Affects icebergLocal, icebergS3, icebergAzure, icebergHDFS, and all *Cluster variants.

CI/CD Options

Exclude tests:

  • Fast test
  • Integration Tests
  • Stateless tests
  • Stateful tests
  • Performance tests
  • Aarch64 tests
  • All with ASAN
  • All with TSAN
  • All with MSAN
  • All with UBSAN
  • All with Coverage
  • All Regression
  • Disable CI Cache

Regression jobs to run:

  • Fast suites (mostly <1h)
  • Aggregate Functions (2h)
  • Alter (1.5h)
  • Benchmark (30m)
  • ClickHouse Keeper (1h)
  • Iceberg (2h)
  • LDAP (1h)
  • OAuth (5m)
  • Parquet (1.5h)
  • RBAC (1.5h)
  • SSL Server (1h)
  • S3 (2h)
  • S3 Export (2h)
  • Swarms (30m)
  • Tiered Storage (2h)

When an Iceberg manifest carries no per-column statistics, the parsed
`DataFileMetaInfo::columns_info` is empty. The read optimization in
`StorageObjectStorageSource::createReader` misread this as "every requested
column is absent from the file": it replaced each nullable column with a
constant `NULL` and set `need_only_count`, so the reader returned correct row
counts but all-NULL values — silent data loss.

Gate the absent-column-to-NULL loop on a non-empty `columns_info` so that
stats-less manifests fall through to the regular reader, which reads present
columns normally and resolves schema-evolution-absent columns via
`IcebergMetadata::getInitialSchemaByPath`.

Affects `icebergLocal`, `icebergS3`, `icebergAzure`, `icebergHDFS` and their
`*Cluster` variants. Antalya-only, introduced by #1069.

Add stateless test `04302_iceberg_read_optimization_no_column_stats` with a
checked-in stats-less Iceberg fixture and a `generate.py` that reproduces it.

C++ change taken from #1895
Closes #1545

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@github-actions

github-actions Bot commented Jun 30, 2026

Copy link
Copy Markdown

Workflow [PR], commit [df1e4e4]

}
}
for (const auto & column : requested_columns_list)
if (!file_meta_data.value()->columns_info.empty())

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small question on dropping the has_value() check.
My version guarded the loop with file_meta_data.has_value() before calling .value(). This one calls file_meta_data.value()->columns_info.empty() directly, which assumes file_meta_data is always set at this point.

Is this guaranteed on every path that reaches this loop? I couldn't fully convince myself it can't be empty here. Ff there's a case where it is, .value() would throw instead of just skipping the block.
Just want to make sure an empty optional can't slip through.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is a guard up the stream already:

@zvonand

zvonand commented Jul 1, 2026

Copy link
Copy Markdown
Member Author

the c++ code is almost the same as in #1895 (w/o unnecessary .has_value() guard), which was verified already -- thus adding the label here.

@zvonand zvonand added the verified Approved for release label Jul 1, 2026
@zvonand

zvonand commented Jul 1, 2026

Copy link
Copy Markdown
Member Author

OK to merge after CI finishes

@arthurpassos arthurpassos merged commit 2d33dd6 into antalya-26.3 Jul 1, 2026
224 of 229 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Read using IcebergLocal returns Nulls instead of actual rows

4 participants