Fix Iceberg read optimization returning NULLs for stats-less manifests #1991
When an Iceberg manifest carries no per-column statistics, the parsed `DataFileMetaInfo::columns_info` is empty. The read optimization in `StorageObjectStorageSource::createReader` misread this as "every requested column is absent from the file": it replaced each nullable column with a constant `NULL` and set `need_only_count`, so the reader returned correct row counts but all-NULL values, which is silent data loss.

Gate the absent-column-to-NULL loop on a non-empty `columns_info` so that stats-less manifests fall through to the regular reader, which reads present columns normally and resolves schema-evolution-absent columns via `IcebergMetadata::getInitialSchemaByPath`.

Affects `icebergLocal`, `icebergS3`, `icebergAzure`, `icebergHDFS` and their `*Cluster` variants. Antalya-only, introduced by #1069.

Add stateless test `04302_iceberg_read_optimization_no_column_stats` with a checked-in stats-less Iceberg fixture and a `generate.py` that reproduces it.

C++ change taken from #1895
Closes #1545
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
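The gating described above can be sketched in isolation. This is a minimal illustration, not the actual ClickHouse code: `ColumnStatsSketch`, `DataFileMetaInfoSketch`, `ReadPlanSketch`, and `planRead` are hypothetical stand-ins, and only the `columns_info` field name is taken from the description. The explicit `has_value()` check is present only because this standalone sketch has no upstream guard.

```cpp
#include <map>
#include <optional>
#include <set>
#include <string>
#include <vector>

// Hypothetical stand-in for per-column statistics (contents irrelevant here).
struct ColumnStatsSketch {};

// Hypothetical stand-in for the parsed per-file metadata.
struct DataFileMetaInfoSketch
{
    // Empty when the manifest carries no per-column statistics.
    std::map<std::string, ColumnStatsSketch> columns_info;
};

// Hypothetical result type: which columns get replaced by a constant NULL.
struct ReadPlanSketch
{
    std::set<std::string> constant_null_columns;
    bool used_optimization = false;
};

ReadPlanSketch planRead(
    const std::optional<DataFileMetaInfoSketch> & meta,
    const std::vector<std::string> & requested_nullable_columns)
{
    ReadPlanSketch plan;

    // No metadata, or metadata without stats: an empty map says nothing about
    // which columns exist, so fall through to the regular reader.
    if (!meta.has_value() || meta->columns_info.empty())
        return plan;

    plan.used_optimization = true;
    for (const auto & name : requested_nullable_columns)
    {
        // Only with non-empty stats is a missing entry treated as "absent from file".
        if (meta->columns_info.count(name) == 0)
            plan.constant_null_columns.insert(name);
    }
    return plan;
}
```

With an empty `columns_info`, `planRead` returns an empty plan, so no column is turned into a constant `NULL`. Without the `empty()` check, every requested column would land in `constant_null_columns`, which mirrors the bug described above.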
}
}
for (const auto & column : requested_columns_list)
if (!file_meta_data.value()->columns_info.empty())
Small question on dropping the `has_value()` check.
My version guarded the loop with `file_meta_data.has_value()` before calling `.value()`. This one calls `file_meta_data.value()->columns_info.empty()` directly, which assumes `file_meta_data` is always set at this point.
Is this guaranteed on every path that reaches this loop? I couldn't fully convince myself it can't be empty here. If there's a case where it is, `.value()` would throw instead of just skipping the block.
Just want to make sure an empty optional can't slip through.
There is already a guard upstream.
The C++ code is almost the same as in #1895 (w/o unnecessary
OK to merge after CI finishes
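The concern raised in the thread above, that `.value()` on an empty optional throws rather than skipping the block, can be illustrated with a small standalone sketch. `FileMetaSketch`, `hasStatsUnchecked`, and `hasStatsChecked` are hypothetical names used only for illustration; the optional-of-pointer shape is an assumption inferred from the `file_meta_data.value()->columns_info` expression in the diff.

```cpp
#include <memory>
#include <optional>
#include <string>
#include <vector>

// Hypothetical stand-in for the metadata object behind the optional.
struct FileMetaSketch
{
    std::vector<std::string> columns_info;
};

using MaybeMeta = std::optional<std::shared_ptr<FileMetaSketch>>;

// Unchecked access: std::optional::value() throws std::bad_optional_access
// when the optional is empty, so this is only safe behind an earlier guard.
bool hasStatsUnchecked(const MaybeMeta & meta)
{
    return !meta.value()->columns_info.empty();
}

// Checked access: an empty optional simply yields false and the caller
// skips the optimization block.
bool hasStatsChecked(const MaybeMeta & meta)
{
    return meta.has_value() && !(*meta)->columns_info.empty();
}
```

Both functions agree whenever the optional is set. They differ only for an empty optional, which is why the unchecked form is acceptable when an earlier guard already rules that case out.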
C++ change similar to #1895 (without unneeded check)
tests added
Closes #1545
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Fix Iceberg read optimization returning NULL for every column when reading manifests without per-file column statistics (e.g. pyiceberg with default settings). Affects `icebergLocal`, `icebergS3`, `icebergAzure`, `icebergHDFS`, and all `*Cluster` variants.