
Conversation

@al13n321
Member

al13n321 commented Oct 21, 2025

Changelog category (leave one):

  • Performance Improvement

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Parquet reader v3 is enabled by default.

al13n321 added the `ci-non-blocking` label (Do not block pipeline on too many failed tests) on Oct 21, 2025
@clickhouse-gh

clickhouse-gh bot commented Oct 21, 2025

Workflow [PR], commit [9fb2c95]

Summary:

| job_name | test_name | status | info | comment |
| --- | --- | --- | --- | --- |
| Upgrade check (amd_tsan) |  | failure |  |  |
|  | Killed by signal (in clickhouse-server.log) | FAIL |  | cidb |
|  | Fatal message in clickhouse-server.log (see fatal_messages.txt) | FAIL |  | cidb |
|  | Killed by signal (output files) | FAIL |  | cidb |
|  | Found signal in gdb.log | FAIL |  | cidb |
| AST fuzzer (amd_debug) |  | failure |  |  |
|  | Logical error: 'RangeReader read 3 rows, but 32 expected.' | FAIL |  | cidb |
| BuzzHouse (amd_debug) |  | failure |  |  |
|  | Buzzing result | failure |  | cidb |

clickhouse-gh bot added the `pr-performance` label (Pull request with some performance improvements) on Oct 21, 2025
@EmeraldShift

I'm curious, where can I find benchmarks for the v1 vs v3 parquet reader?

@al13n321
Member Author

> I'm curious, where can I find benchmarks for the v1 vs v3 parquet reader?

Probably nowhere yet.

@al13n321
Member Author

Disabled prewhere in datalakes for now. To make it work, we'll have to change how ColumnMapper is used and how Delta Lake partition columns are added, @divanik:

    /// TODO: Known problems with datalake prewhere:
    ///  * If the iceberg table went through schema evolution, columns read from file may need to
    ///    be renamed or typecast before applying prewhere. There's already a mechanism for
    ///    telling parquet reader to rename columns: ColumnMapper. And parquet reader already
    ///    automatically does type casts to requested types. But weirdly the iceberg reader uses
    ///    those mechanisms to request the *old* name and type of the column, then has additional
    ///    code to do the renaming and casting as a separate step outside parquet reader.
    ///    We should probably change this and delete that additional code?
    ///  * Delta Lake can have "partition columns", which are columns with constant value specified
    ///    in the metadata, not present in parquet file. Like hive partitioning, but in metadata
    ///    files instead of path. Currently these columns are added to the block outside parquet
    ///    reader. If they appear in prewhere expression, parquet reader gets a "no column in block"
    ///    error. Unlike hive partitioning, we can't (?) just return these columns from
    ///    supportedPrewhereColumns() because at the time of the call the delta lake metadata hasn't
    ///    been read yet. So we should probably pass these columns to the parquet reader instead of
    ///    adding them outside.
    ///  * There's a bug in StorageObjectStorageSource::createReader: it makes a copy of
    ///    FormatFilterInfo, but for some reason unsets prewhere_info and row_level_filter_info.
    ///    There's probably no reason for this, and it should just copy those fields like the others.
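
To make the second bullet concrete, here's a minimal self-contained sketch of the direction proposed there. The types (`Chunk`, `ToyParquetReader`, `setConstantColumns`) are illustrative stand-ins, not the real ClickHouse interfaces: the reader receives the metadata-defined constant columns up front and materializes them before prewhere runs, instead of having them appended outside.

    #include <map>
    #include <string>
    #include <utility>
    #include <vector>

    /// Toy model, not ClickHouse code: a "chunk" is just name -> values.
    struct Chunk
    {
        std::map<std::string, std::vector<std::string>> columns;
    };

    class ToyParquetReader
    {
    public:
        /// Constant columns known only from datalake metadata, absent from the file.
        void setConstantColumns(std::map<std::string, std::string> consts)
        {
            constant_columns = std::move(consts);
        }

        Chunk read()
        {
            Chunk chunk;
            chunk.columns["x"] = {"1", "2", "3"}; /// pretend this came from the parquet file

            /// Materialize metadata-defined columns *before* prewhere would run,
            /// so a filter like `partition_date = '2025-01-01'` doesn't fail with
            /// a "no column in block" error.
            size_t num_rows = chunk.columns.begin()->second.size();
            for (const auto & [name, value] : constant_columns)
                chunk.columns[name] = std::vector<std::string>(num_rows, value);
            return chunk;
        }

    private:
        std::map<std::string, std::string> constant_columns;
    };

    int main()
    {
        ToyParquetReader reader;
        reader.setConstantColumns({{"partition_date", "2025-01-01"}});
        Chunk chunk = reader.read(); /// contains both x and partition_date
        return chunk.columns.size() == 2 ? 0 : 1;
    }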

@al13n321
Member Author

al13n321 commented Nov 3, 2025

The remaining failed tests are flaky.

al13n321 marked this pull request as ready for review on November 3, 2025, 18:38
@divanik
Member

divanik commented Nov 5, 2025

///  * If the iceberg table went through schema evolution, columns read from file may need to
///    be renamed or typecast before applying prewhere. There's already a mechanism for
///    telling parquet reader to rename columns: ColumnMapper. And parquet reader already
///    automatically does type casts to requested types. But weirdly the iceberg reader uses
///    those mechanisms to request the *old* name and type of the column, then has additional
///    code to do the renaming and casting as a separate step outside parquet reader.
///    We should probably change this and delete that additional code?
///  * Delta Lake can have "partition columns", which are columns with constant value specified
///    in the metadata, not present in parquet file. Like hive partitioning, but in metadata
///    files instead of path. Currently these columns are added to the block outside parquet
///    reader. If they appear in prewhere expression, parquet reader gets a "no column in block"
///    error. Unlike hive partitioning, we can't (?) just return these columns from
///    supportedPrewhereColumns() because at the time of the call the delta lake metadata hasn't
///    been read yet. So we should probably pass these columns to the parquet reader instead of
///    adding them outside.

In both these suggestions I see one problem: we delegate complicated behaviour from the source to a format (one layer down). This could be OK if we supported only parquet v3, but I think we want to preserve normal behaviour with other formats and the old parquet reader, in which case the task becomes more tedious.

@al13n321
Member Author

al13n321 commented Nov 5, 2025

> In both these suggestions I see one problem: we delegate complicated behaviour from the source to a format (one layer down).

Yep, the same transform would need to be implemented both inside the reader and as a separate IProcessor (for other formats), and there'd be some awkward code duplication between the two (hopefully reusing code for the important parts), likely with some bugs caused by unintended differences in behavior. I don't have other ideas though. Do you?
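
For illustration, a minimal sketch of the "shared core" approach under discussion, with hypothetical toy types (`Block`, `SchemaMapping`) standing in for the real ones (and a plain struct standing in for IProcessor): the rename/cast logic lives in one free function, and both the reader-internal path and the standalone pipeline step are thin wrappers around it, so behavioral drift between the two is limited to the wrappers.

    #include <string>
    #include <utility>
    #include <vector>

    /// Toy model, not ClickHouse code: a block is a list of (name, type) columns.
    struct Block
    {
        std::vector<std::pair<std::string, std::string>> columns;
    };

    struct SchemaMapping
    {
        std::string old_name, new_name, new_type;
    };

    /// Single source of truth for the schema-evolution transform.
    void applySchemaEvolution(Block & block, const std::vector<SchemaMapping> & mappings)
    {
        for (auto & [name, type] : block.columns)
            for (const auto & m : mappings)
                if (name == m.old_name)
                {
                    name = m.new_name; /// rename
                    type = m.new_type; /// stand-in for a real CAST
                }
    }

    /// Path 1: invoked from inside the parquet reader, before prewhere.
    struct ReaderWithEvolution
    {
        std::vector<SchemaMapping> mappings;
        void onChunkRead(Block & block) { applySchemaEvolution(block, mappings); }
    };

    /// Path 2: a separate pipeline step for formats whose readers can't do it.
    struct SchemaEvolutionStep
    {
        std::vector<SchemaMapping> mappings;
        void transform(Block & block) { applySchemaEvolution(block, mappings); }
    };

    int main()
    {
        Block block{{{"old_col", "Int32"}}};
        SchemaEvolutionStep step{{{"old_col", "new_col", "Int64"}}};
        step.transform(block);
        return block.columns[0].first == "new_col" ? 0 : 1;
    }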

@nikitamikhaylov
Member

The Upgrade check is broken in master:

    Fatal> : Logical error: 'Cannot get position for substream data.arr.Array(JSON(max_dynamic_types=16, max_dynamic_paths=256)).object_shared_data.0.structure_prefix: column data with position 1 doesn't have such substream'.

nikitamikhaylov added this pull request to the merge queue on Nov 14, 2025
nikitamikhaylov self-assigned this on Nov 14, 2025
Merged via the queue into master with commit 827a7ef on Nov 14, 2025
126 of 130 checks passed
nikitamikhaylov deleted the pqe branch on November 14, 2025, 18:05
robot-ch-test-poll added the `pr-synced-to-cloud` label (The PR is synced to the cloud repo) on Nov 14, 2025
zvonand added a commit to Altinity/ClickHouse that referenced this pull request on Nov 26, 2025