
DeltaLake: fix reading subcolumns with non-default column mapping mode#86064

Merged
kssenii merged 10 commits into master from delta-lake-fix-reading-subcolumns-with-column-mapping-mode on Aug 28, 2025

Conversation

@kssenii
Member

@kssenii kssenii commented Aug 22, 2025

Changelog category (leave one):

  • Bug Fix (user-visible misbehavior in an official stable release)

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Fix reading subcolumns with non-default column mapping mode in storage DeltaLake.

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

@clickhouse-gh
Contributor

clickhouse-gh bot commented Aug 22, 2025

Workflow [PR], commit [d5dfc4d]

Summary:

Stateless tests (amd_ubsan, parallel): failure
  • Server died: FAIL
  • Exception in test runner: FAIL
Integration tests (arm_binary, distributed plan, 2/4): failure
  • test_global_overcommit_tracker/test.py::test_global_overcommit: FAIL
Stress test (amd_tsan): failure
  • Server died: FAIL
  • Hung check failed, possible deadlock found (see hung_check.log): FAIL
  • Killed by signal (in clickhouse-server.log): FAIL
  • Fatal message in clickhouse-server.log (see fatal_messages.txt): FAIL
  • Killed by signal (output files): FAIL
  • Found signal in gdb.log: FAIL
Stress test (arm_asan): failure
  • Server died: FAIL
  • Hung check failed, possible deadlock found (see hung_check.log): FAIL
  • Killed by signal (in clickhouse-server.log): FAIL
  • Fatal message in clickhouse-server.log (see fatal_messages.txt): FAIL
  • Killed by signal (output files): FAIL
Stress test (amd_ubsan): failure
  • Server died: FAIL
  • Hung check failed, possible deadlock found (see hung_check.log): FAIL
  • Killed by signal (in clickhouse-server.log): FAIL
  • Fatal message in clickhouse-server.log (see fatal_messages.txt): FAIL
  • Killed by signal (output files): FAIL
  • Found signal in gdb.log: FAIL

@clickhouse-gh clickhouse-gh bot added the pr-bugfix Pull request with bugfix, not backported by default label Aug 22, 2025
@scanhex12 scanhex12 self-assigned this Aug 26, 2025
Review comment on test context:

    assert (
        "col_x2D1\tNullable(Date32)\t\t\t\t\t\n"
Member

As I remember, the clickhouse columns in this parquet file were named like col-1, weren’t they?

Member

Did you try to create with col-1,... names? Will that work?

Member Author

hm, this file is first read with the ordinary s3 table function, and it returned the column names as you see in the test

    def s3_function(path):
        return f""" s3(
            'http://{started_cluster.minio_ip}:{started_cluster.minio_port}/{bucket}/{path}',
            '{minio_access_key}',
            '{minio_secret_key}')
        """

    func = s3_function(data_file)
    assert (
        "2025-06-04\t('100022','2025-06-04 18:40:56.000000','2025-06-09 21:19:00.364000')\t100022"
        == node.query(f"select * from {func}").strip()
    )
    assert (
        "col_x2D1\tNullable(Date32)\t\t\t\t\t\n"
        "col_x2D2\tTuple(\\n col_x2D3 Nullable(String),\\n col_x2D4 Nullable(DateTime64(6, \\'UTC\\')),\\n col_x2D5 Nullable(DateTime64(6, \\'UTC\\')))\t\t\t\t\t\n"
        "col_x2D6\tNullable(Int64)" == node.query(f"describe table {func}").strip()
    )

So the parquet file has columns named col_x2D1, not col-1.
Though I have now checked your test and added DESCRIBE TABLE icebergS3(s3_conn, filename='field_ids_struct_test', SETTINGS iceberg_metadata_table_uuid = '149ecc15-7afc-4311-86b3-3a4c8d4ec08e'); there, and it indeed has the column names you mentioned. I guess that is because they are named this way in the Iceberg metadata. But this test uses only the parquet file.

Also the point is that even though I used this parquet file to insert the data

    df = spark.read.parquet(os.path.join(SCRIPT_DIR, data_file))
    write_delta_from_df(spark, df, path, mode="overwrite")

the resulting parquet file will not have columns named the same way. Only the Delta Lake metadata will contain col_x2D1, while the resulting parquet file's columns will have the form col-{random-uuid} (I can add a check for this in the test to show it explicitly). This is because of columnMapping.mode = name, which means the parquet file does not store the actual column names but randomly generated ones instead.
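The name-mapping behavior described above can be illustrated with a short sketch. The schema excerpt and the physical name col-9b7c2f1e below are made up for illustration; in real Delta logs this information lives in the metaData action's schemaString:

```python
import json

# Hypothetical excerpt of a Delta Lake schemaString with
# columnMapping.mode = name: the logical name exists only in the
# metadata, while the parquet file stores the generated physicalName.
SCHEMA_STRING = json.dumps({
    "type": "struct",
    "fields": [
        {
            "name": "col_x2D1",
            "type": "date",
            "nullable": True,
            "metadata": {
                "delta.columnMapping.id": 1,
                # made-up physical name for illustration
                "delta.columnMapping.physicalName": "col-9b7c2f1e",
            },
        }
    ],
})

def logical_to_physical(schema_string):
    """Map logical column names to physical parquet column names."""
    schema = json.loads(schema_string)
    return {
        field["name"]: field["metadata"]["delta.columnMapping.physicalName"]
        for field in schema["fields"]
    }

print(logical_to_physical(SCHEMA_STRING))  # {'col_x2D1': 'col-9b7c2f1e'}
```

A reader fixing subcolumn resolution needs exactly this translation step before matching requested columns against the parquet file's physical schema.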

Review comment on source context:

    format_settings.parquet.allow_missing_columns = true;
    }

    static void checkTypesAndNestedTypesEqual(DataTypePtr type1, DataTypePtr type2, const std::string & column_name)
Member

Should we also check complex types recursively?
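For illustration only, here is a minimal Python sketch of what such a recursive check could look like. This is not ClickHouse's DataTypePtr machinery; types are modeled as nested tuples of the form (name, *children):

```python
# Types modeled as nested tuples: ("Int64",), ("Nullable", inner),
# ("Tuple", elem1, elem2, ...). A recursive check compares the outer
# type name and then descends into every nested element type.

def types_equal(t1, t2):
    """Return True if two type trees match name-for-name at every level."""
    if t1[0] != t2[0] or len(t1) != len(t2):
        return False
    return all(types_equal(a, b) for a, b in zip(t1[1:], t2[1:]))

same = types_equal(
    ("Tuple", ("Nullable", ("String",)), ("Nullable", ("Int64",))),
    ("Tuple", ("Nullable", ("String",)), ("Nullable", ("Int64",))),
)  # True

diff = types_equal(
    ("Tuple", ("Nullable", ("String",)), ("Int64",)),
    ("Tuple", ("Nullable", ("Date32",)), ("Int64",)),
)  # False: the mismatch is nested inside the Tuple
```

A top-level-only comparison would accept the second pair, since both outer types are Tuple; the recursion is what catches the nested String vs Date32 mismatch.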

@kssenii
Member Author

kssenii commented Aug 28, 2025

Stress test

#76721
#81144
#84669

@kssenii kssenii enabled auto-merge August 28, 2025 15:48
@kssenii kssenii added this pull request to the merge queue Aug 28, 2025
Merged via the queue into master with commit 86d8f98 Aug 28, 2025
117 of 122 checks passed
@kssenii kssenii deleted the delta-lake-fix-reading-subcolumns-with-column-mapping-mode branch August 28, 2025 16:04
@robot-ch-test-poll4 robot-ch-test-poll4 added pr-backports-created-cloud deprecated label, NOOP pr-must-backport-synced The `*-must-backport` labels are synced into the cloud Sync PR labels Aug 28, 2025
@PedroTadim
Member

Is this the fix for #86204 ?

@robot-clickhouse-ci-2 robot-clickhouse-ci-2 added the pr-synced-to-cloud The PR is synced to the cloud repo label Aug 28, 2025
@kssenii
Member Author

kssenii commented Aug 29, 2025

> Is this the fix for #86204 ?

I was fixing a different issue, but most likely it fixes yours as well; I need to check.

robot-ch-test-poll2 added a commit that referenced this pull request Sep 2, 2025
Cherry pick #86064 to 25.8: DeltaLake: fix reading subcolumns with non-default column mapping mode
robot-clickhouse added a commit that referenced this pull request Sep 2, 2025
@robot-ch-test-poll3 robot-ch-test-poll3 added the pr-backports-created Backport PRs are successfully created, it won't be processed by CI script anymore label Sep 2, 2025
clickhouse-gh bot added a commit that referenced this pull request Sep 9, 2025
Backport #86064 to 25.8: DeltaLake: fix reading subcolumns with non-default column mapping mode

Labels

  • pr-backports-created: Backport PRs are successfully created, it won't be processed by CI script anymore
  • pr-backports-created-cloud: deprecated label, NOOP
  • pr-bugfix: Pull request with bugfix, not backported by default
  • pr-must-backport-synced: The `*-must-backport` labels are synced into the cloud Sync PR
  • pr-synced-to-cloud: The PR is synced to the cloud repo
  • v25.8-must-backport

6 participants