Skip to content

Iceberg complex-type schema evolution: tuple-of-map branch uses wrong variable causing schema corruption #106207

@clickgapai

Description

@clickgapai

Found via ClickGap automated review. Please close or comment if this is incorrect or needs adjustment.

Retrospective finding from a historical scan of PR #69445 (merged 2024-12-05). Confirmed on current codebase — close with a note if already fixed.

Describe what's wrong

ComplexTypeSchemaProcessorFunctions.cpp:258-260 has a copy-paste error in the tuple-of-map subfield evolution branch. tryGet(tmp_node_map) succeeds (populating tmp_node_map), but the next line moves tmp_node_array into current_node — leaving a stale/null Array variable as the evolved type instead of the freshly-populated Map. Result: corrupted schema evolution for Iceberg Array-of-Map columns; potential data loss when reading.

Root cause: src/Storages/ObjectStorage/DataLakes/Iceberg/ComplexTypeSchemaProcessorFunctions.cpp:258-260 — the conditional dispatches on tmp_node_map but moves tmp_node_array. The two earlier branches in the if/else-if chain correctly move the variable they tested with tryGet; this branch breaks the pattern (likely a copy-paste from the Array branch above).

Why we believe this is a bug: src/Storages/ObjectStorage/DataLakes/Iceberg/ComplexTypeSchemaProcessorFunctions.cpp:258-260 — the conditional dispatches on tmp_node_map but moves tmp_node_array. The two earlier branches in the if/else-if chain correctly move the variable they tested with tryGet; this branch breaks the pattern (likely a copy-paste from the Array branch above).

Affected locations:

  • src/Storages/ObjectStorage/DataLakes/Iceberg/ComplexTypeSchemaProcessorFunctions.cpp:259 — current_node = std::move(tmp_node_array) — should be tmp_node_map after tryGet(tmp_node_map) succeeded on line 258

Impact: Iceberg tables with Array-of-Map columns that undergo schema evolution may have their evolved schema corrupted. The processor outputs an Array-typed transformer where a Map-typed transformer is expected, causing wrong-type ActionsDAG construction. Downstream: incorrect column reads (likely empty/null) or type-mismatch errors.

Does it reproduce on most recent release?

Likely yes — see testability note in additional context.

How to reproduce

Create an Iceberg table with a column of type Array(Map(String, X)). Evolve the schema (add/remove fields in the inner Map). Read the table after evolution — expect type mismatch or wrong values. Reading the tryBuildSchemaProcessor path in stateless tests can't reach this (Iceberg requires real Avro/S3); maintainer to validate via integration test or direct code review.

Expected behavior

The code should not exhibit the behavior described in the root cause above.

Error message and/or stacktrace

Code-verified at HEAD.

Additional context

Suggested fix: Line 259: change std::move(tmp_node_array) to std::move(tmp_node_map). Same shape as the matching tuple-of-array branch one block above which correctly uses tmp_node_array.

Analysis details: Confidence HIGH | Severity P1 | Testability: INTEGRATION_TEST

Found during automated review of PR #69445.


ClickGapAI · Confidence: HIGH · Severity: P1 · Finding: h_pr69445_audit_recovery_4f701264

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions