Added Parquet Shredded VARIANT Support to ParquetReaderv3 #102499

Open
rorylshanks wants to merge 109 commits into ClickHouse:master from rorylshanks:feature/add-parquet-variant-support
Conversation

@rorylshanks rorylshanks commented Apr 12, 2026

This PR introduces the Parquet shredded VARIANT standard to ClickHouse. DuckDB already supports it. The main benefit is that queries read fewer bytes from disk, which is critical for a major Parquet use case: remote reads from S3.
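To illustrate why shredding reduces the bytes read, here is a minimal toy sketch in Python. It is not the Parquet VARIANT binary encoding and not the code in this PR; the helper names `shred` and `bytes_needed` and the sample rows are hypothetical. It only shows the idea that a frequently accessed field is stored in its own typed column, so a query that touches only that field can skip the rest of each value.

```python
import json

# Toy model only: real shredded VARIANT uses Parquet column chunks and a
# binary encoding. Here a "column" is just a Python list.

def shred(rows, field):
    """Split each row into a typed value for `field` and a residual blob."""
    typed_column = []      # values of `field`, stored contiguously
    residual_column = []   # everything else, serialized per row
    for row in rows:
        rest = dict(row)
        typed_column.append(rest.pop(field, None))
        residual_column.append(json.dumps(rest, sort_keys=True))
    return typed_column, residual_column

def bytes_needed(column):
    """Rough size of a column: total UTF-8 length of its values."""
    return sum(len(str(value).encode()) for value in column)

# Hypothetical sample rows, loosely shaped like event records.
rows = [
    {"kind": "commit", "did": "did:plc:aaa", "payload": "x" * 200},
    {"kind": "identity", "did": "did:plc:bbb", "payload": "y" * 200},
    {"kind": "commit", "did": "did:plc:ccc", "payload": "z" * 200},
]

typed, residual = shred(rows, "kind")

# A "count rows per kind" query only needs the typed column.
counts = {}
for kind in typed:
    counts[kind] = counts.get(kind, 0) + 1

shredded_bytes = bytes_needed(typed)
unshredded_bytes = bytes_needed(json.dumps(r, sort_keys=True) for r in rows)
```

In this toy, `shredded_bytes` covers only the three short `kind` strings, while `unshredded_bytes` covers every full row, which mirrors the gap in OSReadBytes between the shredded and unshredded layouts.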

Benchmark results for JSONBench, reading via file() from NVMe with the final binary, are below.

| query | method | wall s | user CPU s | OSReadBytes |
|---|---|---|---|---|
| q1_collection_counts | parquet_variant_shredded | 0.330 | 0.08 | 4.1 MB |
| | parquet_variant_unshredded | 0.620 | 0.69 | 133.9 MB |
| | parquet_string | 0.750 | 1.25 | 133.6 MB |
| | parquet_json | 4.830 | 4.57 | 131.8 MB |
| q2_collection_users | parquet_variant_shredded | 0.420 | 0.22 | 17.4 MB |
| | parquet_variant_unshredded | 0.660 | 0.83 | 133.7 MB |
| | parquet_string | 0.830 | 2.08 | 133.3 MB |
| | parquet_json | 4.910 | 5.26 | 131.2 MB |
| q3_hourly_events | parquet_variant_shredded | 0.370 | 0.13 | 9.4 MB |
| | parquet_variant_unshredded | 0.640 | 0.78 | 133.7 MB |
| | parquet_string | 0.810 | 1.93 | 133.3 MB |
| | parquet_json | 4.730 | 4.64 | 131.2 MB |
| q4_first_posts | parquet_variant_shredded | 0.400 | 0.13 | 19.2 MB |
| | parquet_variant_unshredded | 0.670 | 1.13 | 133.3 MB |
| | parquet_string | 0.800 | 1.93 | 133.0 MB |
| | parquet_json | 4.910 | 5.23 | 130.9 MB |
| q5_activity_span | parquet_variant_shredded | 0.400 | 0.15 | 19.7 MB |
| | parquet_variant_unshredded | 0.670 | 1.06 | 133.6 MB |
| | parquet_string | 0.780 | 1.92 | 133.4 MB |
| | parquet_json | 4.840 | 5.00 | 131.2 MB |
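
For readers who want the relative gains, the ratios can be derived from the numbers in the table above. The short Python sketch below copies the shredded and unshredded rows and computes, per query, how much faster the shredded layout was and how many times fewer bytes it read. The dictionary layout and the `ratios` helper are illustrative assumptions, not part of the PR.

```python
# (wall seconds, OSReadBytes in MB), copied from the benchmark table above.
shredded = {
    "q1": (0.330, 4.1),
    "q2": (0.420, 17.4),
    "q3": (0.370, 9.4),
    "q4": (0.400, 19.2),
    "q5": (0.400, 19.7),
}
unshredded = {
    "q1": (0.620, 133.9),
    "q2": (0.660, 133.7),
    "q3": (0.640, 133.7),
    "q4": (0.670, 133.3),
    "q5": (0.670, 133.6),
}

def ratios(query):
    """Return (wall-time speedup, read-bytes reduction) of shredded vs unshredded."""
    s_wall, s_mb = shredded[query]
    u_wall, u_mb = unshredded[query]
    return u_wall / s_wall, u_mb / s_mb

for query in sorted(shredded):
    speedup, fewer_bytes = ratios(query)
    print(f"{query}: {speedup:.2f}x faster, {fewer_bytes:.1f}x fewer bytes read")
```

On these figures the read reduction is largest for q1 (roughly 33x) and smallest for q5 (roughly 7x), while the wall-time gain is under 2x in every case.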

Changelog category (leave one):

  • New Feature

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Add Parquet shredded VARIANT support, including read/write paths and subcolumn-aware read optimizations for semi-structured data.

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

@alexey-milovidov alexey-milovidov added the can be tested Allows running workflows for external contributors label Apr 12, 2026
@clickhouse-gh
Contributor

clickhouse-gh Bot commented Apr 12, 2026

Workflow [PR], commit [4b822ff]

Summary:

| job_name | test_name | status | info | comment |
|---|---|---|---|---|
| Unit tests (asan_ubsan, function_prop_fuzzer) | | FAIL | | |

AI Review

Summary

This PR adds Parquet VARIANT read/write support (including shredded typed payload handling), integrates it with schema inference/subcolumn planning, and adds broad stateless coverage for parsing, projection, hardening, and roundtrips. After checking current code and prior discussion threads, I did not find a remaining blocker or major issue that is still live at HEAD.

ClickHouse Rules
| Item | Status | Notes |
|---|---|---|
| Deletion logging | | |
| Serialization versioning | | |
| Core-area scrutiny | | |
| No test removal | | |
| Experimental gate | | |
| No magic constants | | |
| Backward compatibility | | |
| SettingsChangesHistory.cpp | | |
| PR metadata quality | | |
| Safe rollout | | |
| Compilation time | | |
| No large/binary files | | |

Final Verdict
  • Status: ✅ Approve

@clickhouse-gh clickhouse-gh Bot added the pr-feature Pull request with new product feature label Apr 12, 2026
Comment thread src/Processors/Formats/Impl/Parquet/VariantBinaryDecoder.cpp Outdated
Comment thread src/Processors/Formats/Impl/Parquet/VariantShreddedConversion.cpp Outdated
Comment thread src/Processors/Formats/Impl/Parquet/VariantReader.cpp Outdated
Comment thread src/Processors/Formats/Impl/Parquet/VariantUtils.cpp Outdated
Comment thread src/Processors/Formats/Impl/Parquet/VariantBinaryDecoder.cpp Outdated
Comment thread src/Columns/ColumnObject.cpp Outdated
Comment thread src/Processors/Formats/Impl/Parquet/VariantBinaryDecoder.cpp
Comment thread src/Processors/Formats/Impl/Parquet/VariantBinaryDecoder.cpp Outdated
@melvynator
Member

Thanks for the contribution. Really cool.

Comment thread src/Processors/Formats/Impl/Parquet/VariantWrite.cpp Outdated
…ng. Also fixed issues in prewhere/header plumbing for typed JSON subcolumns.
Comment thread src/Storages/prepareReadingFromFormat.cpp Outdated
@clickhouse-gh clickhouse-gh Bot added the manual approve Manual approve required to run CI label Apr 26, 2026
Comment thread src/Processors/Formats/Impl/Parquet/VariantBinaryDecoder.cpp
Comment thread src/Processors/Transforms/FilterTransform.cpp Outdated
Comment thread src/Processors/Formats/Impl/Parquet/ReadManager.cpp
@clickhouse-gh
Contributor

clickhouse-gh Bot commented May 10, 2026

LLVM Coverage Report

| Metric | Baseline | Current | Δ |
|---|---|---|---|
| Lines | 84.10% | 83.80% | -0.30% |
| Functions | 91.10% | 91.10% | +0.00% |
| Branches | 76.60% | 76.20% | -0.40% |

Changed lines: 58.25% (3971/6817) | lost baseline coverage: 25 line(s) · Uncovered code

Full report · Diff report

4 participants