
Conversation

@al13n321
Member

al13n321 commented Oct 21, 2025

Changelog category (leave one):

  • Performance Improvement

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Parquet reader v3 is enabled by default.

al13n321 added the `ci-non-blocking` label (Do not block pipeline on too many failed tests) on Oct 21, 2025
@clickhouse-gh

clickhouse-gh bot commented Oct 21, 2025

Workflow [PR], commit [9fb2c95]

Summary:

| job_name | test_name | status | info | comment |
| --- | --- | --- | --- | --- |
| Upgrade check (amd_tsan) |  | failure |  |  |
|  | Killed by signal (in clickhouse-server.log) | FAIL |  | cidb |
|  | Fatal message in clickhouse-server.log (see fatal_messages.txt) | FAIL |  | cidb |
|  | Killed by signal (output files) | FAIL |  | cidb |
|  | Found signal in gdb.log | FAIL |  | cidb |
| AST fuzzer (amd_debug) |  | failure |  |  |
|  | Logical error: 'RangeReader read 3 rows, but 32 expected.' | FAIL |  | cidb |
| BuzzHouse (amd_debug) |  | failure |  |  |
|  | Buzzing result | failure |  | cidb |

clickhouse-gh bot added the `pr-performance` label (Pull request with some performance improvements) on Oct 21, 2025
@EmeraldShift

I'm curious, where can I find benchmarks for the v1 vs v3 parquet reader?

@al13n321
Member Author

> I'm curious, where can I find benchmarks for the v1 vs v3 parquet reader?

Probably nowhere yet.

@al13n321
Member Author

Disabled prewhere in datalakes for now. To make it work, we'll have to change how ColumnMapper is used and how Delta Lake partition columns are added, @divanik:

    /// TODO: Known problems with datalake prewhere:
    ///  * If the iceberg table went through schema evolution, columns read from file may need to
    ///    be renamed or typecast before applying prewhere. There's already a mechanism for
    ///    telling parquet reader to rename columns: ColumnMapper. And parquet reader already
    ///    automatically does type casts to requested types. But weirdly the iceberg reader uses
    ///    those mechanisms to request the *old* name and type of the column, then has additional
    ///    code to do the renaming and casting as a separate step outside parquet reader.
    ///    We should probably change this and delete that additional code?
    ///  * Delta Lake can have "partition columns", which are columns with constant value specified
    ///    in the metadata, not present in parquet file. Like hive partitioning, but in metadata
    ///    files instead of path. Currently these columns are added to the block outside parquet
    ///    reader. If they appear in prewhere expression, parquet reader gets a "no column in block"
    ///    error. Unlike hive partitioning, we can't (?) just return these columns from
    ///    supportedPrewhereColumns() because at the time of the call the delta lake metadata hasn't
    ///    been read yet. So we should probably pass these columns to the parquet reader instead of
    ///    adding them outside.
    ///  * There's a bug in StorageObjectStorageSource::createReader: it makes a copy of
    ///    FormatFilterInfo, but for some reason unsets prewhere_info and row_level_filter_info.
    ///    There's probably no reason for this, and it should just copy those fields like the others.
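
To make the second bullet concrete, here's a minimal self-contained sketch of the direction proposed there. The types (`Chunk`, `ToyParquetReader`, `setConstantColumns`) are illustrative stand-ins, not the real ClickHouse interfaces: the reader receives the metadata-defined constant columns up front and materializes them before prewhere runs, instead of having them appended outside.

    #include <map>
    #include <string>
    #include <utility>
    #include <vector>

    /// Toy model, not ClickHouse code: a "chunk" is just name -> values.
    struct Chunk
    {
        std::map<std::string, std::vector<std::string>> columns;
    };

    class ToyParquetReader
    {
    public:
        /// Constant columns known only from datalake metadata, absent from the file.
        void setConstantColumns(std::map<std::string, std::string> consts)
        {
            constant_columns = std::move(consts);
        }

        Chunk read()
        {
            Chunk chunk;
            chunk.columns["x"] = {"1", "2", "3"}; /// pretend this came from the parquet file

            /// Materialize metadata-defined columns *before* prewhere would run,
            /// so a filter like `partition_date = '2025-01-01'` doesn't fail with
            /// a "no column in block" error.
            size_t num_rows = chunk.columns.begin()->second.size();
            for (const auto & [name, value] : constant_columns)
                chunk.columns[name] = std::vector<std::string>(num_rows, value);
            return chunk;
        }

    private:
        std::map<std::string, std::string> constant_columns;
    };

    int main()
    {
        ToyParquetReader reader;
        reader.setConstantColumns({{"partition_date", "2025-01-01"}});
        Chunk chunk = reader.read(); /// contains both x and partition_date
        return chunk.columns.size() == 2 ? 0 : 1;
    }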

@al13n321
Member Author

al13n321 commented Nov 3, 2025

The remaining failed tests are flaky.

al13n321 marked this pull request as ready for review on November 3, 2025, 18:38
@divanik
Member

divanik commented Nov 5, 2025

///  * If the iceberg table went through schema evolution, columns read from file may need to
///    be renamed or typecast before applying prewhere. There's already a mechanism for
///    telling parquet reader to rename columns: ColumnMapper. And parquet reader already
///    automatically does type casts to requested types. But weirdly the iceberg reader uses
///    those mechanisms to request the *old* name and type of the column, then has additional
///    code to do the renaming and casting as a separate step outside parquet reader.
///    We should probably change this and delete that additional code?
///  * Delta Lake can have "partition columns", which are columns with constant value specified
///    in the metadata, not present in parquet file. Like hive partitioning, but in metadata
///    files instead of path. Currently these columns are added to the block outside parquet
///    reader. If they appear in prewhere expression, parquet reader gets a "no column in block"
///    error. Unlike hive partitioning, we can't (?) just return these columns from
///    supportedPrewhereColumns() because at the time of the call the delta lake metadata hasn't
///    been read yet. So we should probably pass these columns to the parquet reader instead of
///    adding them outside.

In both these suggestions I see one problem: we delegate complicated behaviour from the source to a format (one layer down). This could be OK if we supported only parquet v3, but I think we want to preserve normal behaviour with other formats and the old parquet reader, in which case the task becomes more tedious.

@al13n321
Member Author

al13n321 commented Nov 5, 2025

> In both these suggestions I see one problem: we delegate complicated behaviour from the source to a format (one layer down).

Yep, the same transform would need to be implemented both inside the reader and as a separate IProcessor (for other formats), and there'd be some awkward code duplication between the two (hopefully reusing code for the important parts), likely with some bugs caused by unintended differences in behavior. I don't have other ideas though. Do you?
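
For illustration, a minimal sketch of the "shared core" approach under discussion, with hypothetical toy types (`Block`, `SchemaMapping`) standing in for the real ones (and a plain struct standing in for IProcessor): the rename/cast logic lives in one free function, and both the reader-internal path and the standalone pipeline step are thin wrappers around it, so behavioral drift between the two is limited to the wrappers.

    #include <string>
    #include <utility>
    #include <vector>

    /// Toy model, not ClickHouse code: a block is a list of (name, type) columns.
    struct Block
    {
        std::vector<std::pair<std::string, std::string>> columns;
    };

    struct SchemaMapping
    {
        std::string old_name, new_name, new_type;
    };

    /// Single source of truth for the schema-evolution transform.
    void applySchemaEvolution(Block & block, const std::vector<SchemaMapping> & mappings)
    {
        for (auto & [name, type] : block.columns)
            for (const auto & m : mappings)
                if (name == m.old_name)
                {
                    name = m.new_name; /// rename
                    type = m.new_type; /// stand-in for a real CAST
                }
    }

    /// Path 1: invoked from inside the parquet reader, before prewhere.
    struct ReaderWithEvolution
    {
        std::vector<SchemaMapping> mappings;
        void onChunkRead(Block & block) { applySchemaEvolution(block, mappings); }
    };

    /// Path 2: a separate pipeline step for formats whose readers can't do it.
    struct SchemaEvolutionStep
    {
        std::vector<SchemaMapping> mappings;
        void transform(Block & block) { applySchemaEvolution(block, mappings); }
    };

    int main()
    {
        Block block{{{"old_col", "Int32"}}};
        SchemaEvolutionStep step{{{"old_col", "new_col", "Int64"}}};
        step.transform(block);
        return block.columns[0].first == "new_col" ? 0 : 1;
    }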

@nikitamikhaylov
Member

The Upgrade check is broken in master:

    Fatal> : Logical error: 'Cannot get position for substream data.arr.Array(JSON(max_dynamic_types=16, max_dynamic_paths=256)).object_shared_data.0.structure_prefix: column data with position 1 doesn't have such substream'.

nikitamikhaylov added this pull request to the merge queue on Nov 14, 2025
nikitamikhaylov self-assigned this on Nov 14, 2025
Merged via the queue into master with commit 827a7ef on Nov 14, 2025
126 of 130 checks passed
nikitamikhaylov deleted the pqe branch on November 14, 2025, 18:05
robot-ch-test-poll added the `pr-synced-to-cloud` label (The PR is synced to the cloud repo) on Nov 14, 2025
zvonand added a commit to Altinity/ClickHouse that referenced this pull request on Nov 26, 2025