Fix silent ignore of delta_lake_snapshot_version without DeltaKernel #101489
alexey-milovidov merged 11 commits into ClickHouse:master from
The legacy `DeltaLakeMetadataImpl` (used for Azure and non-kernel S3 paths) threw `NOT_IMPLEMENTED` when encountering schema changes in `_delta_log` metadata entries or checkpoint files. This made `deltaLakeAzure` fail on tables with schema evolution (e.g., added columns). Instead of throwing, adopt the latest schema — matching the behavior of the DeltaKernel path. Delta log entries are processed in version order, so the last `metaData` entry always represents the current table schema. Older data files with fewer columns will return NULLs for missing columns, which is the expected behavior. Closes ClickHouse#100438
…ix/issue-100502
…ernel Previously, `delta_lake_snapshot_version` and CDF settings (`delta_lake_snapshot_start_version` / `delta_lake_snapshot_end_version`) were silently ignored when DeltaKernel was not active (Azure, GCS, or `allow_experimental_delta_kernel_rs = 0`), causing queries to return the latest table state instead of the requested snapshot with no indication of error. Now the legacy metadata reader raises `UNSUPPORTED_METHOD` if any of these settings are set to a non-default value, so users get a clear error instead of wrong data. Closes ClickHouse#100502
Workflow [PR], commit [6a6d55c] Summary: ✅
AI Review
Summary
This PR fixes a real correctness issue in the legacy Delta Lake metadata path: non-default
ClickHouse Rules
Final Verdict
"Reading from files with different schema is not possible "
"({} is different from {})",
file_schema.toString(), current_schema.toString());
LOG_INFO(log, "Schema evolved: {} -> {}", file_schema.toString(), current_schema.toString());
This change silently drops the previous `NOT_IMPLEMENTED` guard and effectively enables schema evolution by replacing `file_schema` with `current_schema`.
That is risky here because the rest of this path still assumes a single stable schema (see the comment above `processMetadataFile`), and this PR does not add coverage for mixed-schema logs/checkpoints. A query may now return inconsistent results instead of raising a clear exception.
Please keep the strict exception in this PR (focused on `delta_lake_snapshot_*` settings), or add dedicated mixed-schema tests and explicit compatibility handling in both JSON log and checkpoint paths.
This reverts commit e39876d.
Delta Lake is not supported under MSan.
The Stress test (arm_msan) failure is fixed by #101239, which should be merged first. After it is merged, please update the branch to include the fix.
…eltaKernel Change guard from `>= 0` to `!= -1` so that invalid negative values like `-2` are also rejected instead of being silently ignored. Add regression test for negative non-default value.
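The difference between the two guards in the commit above can be illustrated with a minimal, self-contained sketch. This is not the ClickHouse implementation: the helper names, the use of `std::runtime_error` in place of a ClickHouse exception carrying `UNSUPPORTED_METHOD`, and the assumption that all three settings share `-1` as their default are hypothetical and chosen only for illustration.

```cpp
#include <cstdint>
#include <stdexcept>

// Assumption: -1 is the "not set" default for all three settings.
// The commit message only implies this for the guard it changes.
constexpr int64_t kUnsetVersion = -1;

// Old guard: only non-negative values count as "set",
// so an invalid value such as -2 is silently ignored.
bool isSetOldGuard(int64_t version) { return version >= 0; }

// New guard: any value other than the default counts as "set",
// so -2 is rejected as well.
bool isSetNewGuard(int64_t version) { return version != kUnsetVersion; }

// Hypothetical helper for the legacy (non-DeltaKernel) metadata path.
// std::runtime_error stands in for an UNSUPPORTED_METHOD exception.
void rejectSnapshotSettingsWithoutKernel(
    int64_t snapshot_version, int64_t start_version, int64_t end_version)
{
    if (isSetNewGuard(snapshot_version)
        || isSetNewGuard(start_version)
        || isSetNewGuard(end_version))
    {
        throw std::runtime_error(
            "UNSUPPORTED_METHOD: snapshot version settings require DeltaKernel");
    }
}
```

With the old guard, a call such as `rejectSnapshotSettingsWithoutKernel(-2, -1, -1)` would pass through unnoticed; with the new guard it throws, which is the case the added regression test is described as covering.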
…ix/issue-100502
…ix/issue-100502
LLVM Coverage Report
Changed lines: 100.00% (20/20)
Hi, this PR may need backporting to Affected code: Why: delta_lake_snapshot_version was introduced in 25.8 (SettingsChangesHistory.cpp line 362), and delta_lake_snapshot_start_version / delta_lake_snapshot_end_version were introduced in 25.12 (lines 187-188). All supported branches contain at least delta_lake_snapshot_version and exhibit the silent-ignore bug. On 25.8, only the delta_lake_snapshot_version check is relevant; on 26.1+ all three settings need the validation. If this should be backported, consider adding
Thanks @alexey-milovidov
Closes #100502
Previously, `delta_lake_snapshot_version` and CDF settings (`delta_lake_snapshot_start_version` / `delta_lake_snapshot_end_version`) were silently ignored when DeltaKernel was not active (Azure, GCS, or `allow_experimental_delta_kernel_rs = 0`), causing queries to return the latest table state instead of the requested snapshot with no indication of error.
Now the legacy metadata reader raises `UNSUPPORTED_METHOD` if any of these settings are set to a non-default value, so users get a clear error instead of wrong data.
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
Throw an error when `delta_lake_snapshot_version` or CDF version settings are used without DeltaKernel enabled, instead of silently returning wrong data.
Documentation entry for user-facing changes
When using DeltaLake time travel (`delta_lake_snapshot_version`) or change data feed (`delta_lake_snapshot_start_version` / `delta_lake_snapshot_end_version`) with storage backends that do not support DeltaKernel (e.g. Azure, GCS), ClickHouse now raises an `UNSUPPORTED_METHOD` error instead of silently ignoring the setting. To use these features, use S3 or Local storage with `allow_experimental_delta_kernel_rs = 1`.