
DeltaLake: fix reading subcolumns with non-default column mapping mode#86064

Merged
kssenii merged 10 commits into master from delta-lake-fix-reading-subcolumns-with-column-mapping-mode on Aug 28, 2025

Conversation

@kssenii
Member

@kssenii kssenii commented Aug 22, 2025

Changelog category (leave one):

  • Bug Fix (user-visible misbehavior in an official stable release)

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Fix reading subcolumns with non-default column mapping mode in storage DeltaLake.

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

@clickhouse-gh
Contributor

clickhouse-gh bot commented Aug 22, 2025

Workflow [PR], commit [d5dfc4d]

Summary:

Stateless tests (amd_ubsan, parallel): failure
  • Server died: FAIL
  • Exception in test runner: FAIL
Integration tests (arm_binary, distributed plan, 2/4): failure
  • test_global_overcommit_tracker/test.py::test_global_overcommit: FAIL
Stress test (amd_tsan): failure
  • Server died: FAIL
  • Hung check failed, possible deadlock found (see hung_check.log): FAIL
  • Killed by signal (in clickhouse-server.log): FAIL
  • Fatal message in clickhouse-server.log (see fatal_messages.txt): FAIL
  • Killed by signal (output files): FAIL
  • Found signal in gdb.log: FAIL
Stress test (arm_asan): failure
  • Server died: FAIL
  • Hung check failed, possible deadlock found (see hung_check.log): FAIL
  • Killed by signal (in clickhouse-server.log): FAIL
  • Fatal message in clickhouse-server.log (see fatal_messages.txt): FAIL
  • Killed by signal (output files): FAIL
Stress test (amd_ubsan): failure
  • Server died: FAIL
  • Hung check failed, possible deadlock found (see hung_check.log): FAIL
  • Killed by signal (in clickhouse-server.log): FAIL
  • Fatal message in clickhouse-server.log (see fatal_messages.txt): FAIL
  • Killed by signal (output files): FAIL
  • Found signal in gdb.log: FAIL

@clickhouse-gh clickhouse-gh bot added the pr-bugfix Pull request with bugfix, not backported by default label Aug 22, 2025
@scanhex12 scanhex12 self-assigned this Aug 26, 2025
Review comment on test context:

    assert (
        "col_x2D1\tNullable(Date32)\t\t\t\t\t\n"
Member

As I remember, the clickhouse columns in this parquet file were named like col-1, weren’t they?

Member

Did you try to create with col-1,... names? Will that work?

Member Author

hm, this file is first read with the ordinary s3 table function, and it returned the column names as you see in the test

    def s3_function(path):
        return f""" s3(
            'http://{started_cluster.minio_ip}:{started_cluster.minio_port}/{bucket}/{path}',
            '{minio_access_key}',
            '{minio_secret_key}')
        """

    func = s3_function(data_file)
    assert (
        "2025-06-04\t('100022','2025-06-04 18:40:56.000000','2025-06-09 21:19:00.364000')\t100022"
        == node.query(f"select * from {func}").strip()
    )
    assert (
        "col_x2D1\tNullable(Date32)\t\t\t\t\t\n"
        "col_x2D2\tTuple(\\n col_x2D3 Nullable(String),\\n col_x2D4 Nullable(DateTime64(6, \\'UTC\\')),\\n col_x2D5 Nullable(DateTime64(6, \\'UTC\\')))\t\t\t\t\t\n"
        "col_x2D6\tNullable(Int64)" == node.query(f"describe table {func}").strip()
    )

So the parquet file has columns named col_x2D1, not col-1.
Though I have now checked your test and added DESCRIBE TABLE icebergS3(s3_conn, filename='field_ids_struct_test', SETTINGS iceberg_metadata_table_uuid = '149ecc15-7afc-4311-86b3-3a4c8d4ec08e'); there, and it indeed has the column names you mentioned. I guess that is because they are named this way in the Iceberg metadata. But this test uses only the parquet file.

Also the point is that even though I used this parquet file to insert the data

    df = spark.read.parquet(os.path.join(SCRIPT_DIR, data_file))
    write_delta_from_df(spark, df, path, mode="overwrite")

the resulting parquet file will not have columns named the same way. Only the Delta Lake metadata will contain col_x2D1, while the resulting parquet file's columns will have the form col-{random-uuid} (I can add a check for this in the test to show it explicitly). This is because of columnMapping.mode = name, which means the parquet file does not store the actual column names but randomly generated ones instead.
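The name-mapping behavior described above can be illustrated with a short sketch. The schema excerpt and the physical name col-9b7c2f1e below are made up for illustration; in real Delta logs this information lives in the metaData action's schemaString:

```python
import json

# Hypothetical excerpt of a Delta Lake schemaString with
# columnMapping.mode = name: the logical name exists only in the
# metadata, while the parquet file stores the generated physicalName.
SCHEMA_STRING = json.dumps({
    "type": "struct",
    "fields": [
        {
            "name": "col_x2D1",
            "type": "date",
            "nullable": True,
            "metadata": {
                "delta.columnMapping.id": 1,
                # made-up physical name for illustration
                "delta.columnMapping.physicalName": "col-9b7c2f1e",
            },
        }
    ],
})

def logical_to_physical(schema_string):
    """Map logical column names to physical parquet column names."""
    schema = json.loads(schema_string)
    return {
        field["name"]: field["metadata"]["delta.columnMapping.physicalName"]
        for field in schema["fields"]
    }

print(logical_to_physical(SCHEMA_STRING))  # {'col_x2D1': 'col-9b7c2f1e'}
```

A reader fixing subcolumn resolution needs exactly this translation step before matching requested columns against the parquet file's physical schema.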

Review comment on source context:

    format_settings.parquet.allow_missing_columns = true;
    }

    static void checkTypesAndNestedTypesEqual(DataTypePtr type1, DataTypePtr type2, const std::string & column_name)
Member

Should we also check complex types recursively?
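For illustration only, here is a minimal Python sketch of what such a recursive check could look like. This is not ClickHouse's DataTypePtr machinery; types are modeled as nested tuples of the form (name, *children):

```python
# Types modeled as nested tuples: ("Int64",), ("Nullable", inner),
# ("Tuple", elem1, elem2, ...). A recursive check compares the outer
# type name and then descends into every nested element type.

def types_equal(t1, t2):
    """Return True if two type trees match name-for-name at every level."""
    if t1[0] != t2[0] or len(t1) != len(t2):
        return False
    return all(types_equal(a, b) for a, b in zip(t1[1:], t2[1:]))

same = types_equal(
    ("Tuple", ("Nullable", ("String",)), ("Nullable", ("Int64",))),
    ("Tuple", ("Nullable", ("String",)), ("Nullable", ("Int64",))),
)  # True

diff = types_equal(
    ("Tuple", ("Nullable", ("String",)), ("Int64",)),
    ("Tuple", ("Nullable", ("Date32",)), ("Int64",)),
)  # False: the mismatch is nested inside the Tuple
```

A top-level-only comparison would accept the second pair, since both outer types are Tuple; the recursion is what catches the nested String vs Date32 mismatch.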

@kssenii
Member Author

kssenii commented Aug 28, 2025

Stress test

#76721
#81144
#84669

@kssenii kssenii enabled auto-merge August 28, 2025 15:48
@kssenii kssenii added this pull request to the merge queue Aug 28, 2025
Merged via the queue into master with commit 86d8f98 Aug 28, 2025
117 of 122 checks passed
@kssenii kssenii deleted the delta-lake-fix-reading-subcolumns-with-column-mapping-mode branch August 28, 2025 16:04
@robot-ch-test-poll4 robot-ch-test-poll4 added pr-backports-created-cloud deprecated label, NOOP pr-must-backport-synced The `*-must-backport` labels are synced into the cloud Sync PR labels Aug 28, 2025
@PedroTadim
Member

Is this the fix for #86204 ?

@robot-clickhouse-ci-2 robot-clickhouse-ci-2 added the pr-synced-to-cloud The PR is synced to the cloud repo label Aug 28, 2025
@kssenii
Member Author

kssenii commented Aug 29, 2025

> Is this the fix for #86204 ?

I was fixing a different issue, but most likely it fixes yours as well; I need to check.

robot-ch-test-poll2 added a commit that referenced this pull request Sep 2, 2025
Cherry pick #86064 to 25.8: DeltaLake: fix reading subcolumns with non-default column mapping mode
robot-clickhouse added a commit that referenced this pull request Sep 2, 2025
@robot-ch-test-poll3 robot-ch-test-poll3 added the pr-backports-created Backport PRs are successfully created, it won't be processed by CI script anymore label Sep 2, 2025
clickhouse-gh bot added a commit that referenced this pull request Sep 9, 2025
Backport #86064 to 25.8: DeltaLake: fix reading subcolumns with non-default column mapping mode

Labels

  • pr-backports-created: Backport PRs are successfully created, it won't be processed by CI script anymore
  • pr-backports-created-cloud: deprecated label, NOOP
  • pr-bugfix: Pull request with bugfix, not backported by default
  • pr-must-backport-synced: The `*-must-backport` labels are synced into the cloud Sync PR
  • pr-synced-to-cloud: The PR is synced to the cloud repo
  • v25.8-must-backport

6 participants