A native parquet reader for primitive types #60361
base: master
This is an automatic comment. The PR description does not match the template. Please edit it accordingly. The error is: More than one changelog category specified: 'Improvement', 'Performance Improvement'
This is an automated comment for commit 5166d8b with description of existing statuses. It's updated for the latest CI running ❌
0957b32 to f7b7f7e
Change-Id: I83a8ec8271edefcd96cb5b3bcd12f6b545d9dec0
Change-Id: I38b8368b022263d9a71cb3f3e9fdad5d6ca26753
Change-Id: If79741b7456667a8dde3e355d9dc684c2dd84f4f
This reverts commit 5df94b7.
5012933 to 5166d8b
Well, I don't think these test failures are caused by my commit. UnitTestsAsan and ASTFuzzerTestAsan succeeded in the previous run, while the integration test failed because of a logical error. Any further suggestions would be welcome.
Yes, we need to start the investigation, and then check and fix every failure one by one.
@alexey-milovidov
@copperybean, please resolve conflicts.
std::shared_ptr<::arrow::io::RandomAccessFile> arrow_file)
{
    std::unique_ptr<parquet::ParquetFileReader> res;
    THROW_PARQUET_EXCEPTION(res = parquet::ParquetFileReader::Open(std::move(arrow_file)));
The file metadata should be passed in here; otherwise the file footer will be read a second time.
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
A native parquet reader, which can read parquet binary data into ClickHouse Columns directly. This feature can be activated by setting `input_format_parquet_use_native_reader` to true.

Currently, a parquet file is read by the arrow library: it is first read into an arrow table, and then the arrow table is copied into ClickHouse Columns. This approach has some performance shortcomings.
`int16_t` array first; then the nullability-related `valid_bits_` is generated, whereas the `null_map` and `offsets` (not included this time) can be generated directly while reading definition and repetition levels.

This feature was first implemented in BMR, a Baidu AI Cloud product, where it has been fully tested.
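
To illustrate the point about building `null_map` and `offsets` straight from definition and repetition levels, here is a minimal self-contained sketch. It is not the code from this PR: the helper names `buildNullMap` and `buildOffsets`, the parameter `element_def_level`, and the restriction to a single nesting level are all assumptions made for the example. It follows the ClickHouse convention that 1 in a null map marks NULL and that array offsets are cumulative end positions.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not from the PR): builds a null map for a flat
// nullable column. A value is NULL when its definition level is below
// the column's maximum definition level. 1 marks NULL.
std::vector<uint8_t> buildNullMap(const std::vector<int16_t> & def_levels, int16_t max_def_level)
{
    std::vector<uint8_t> null_map;
    null_map.reserve(def_levels.size());
    for (int16_t level : def_levels)
        null_map.push_back(level < max_def_level ? 1 : 0);
    return null_map;
}

// Hypothetical helper (not from the PR): builds cumulative array offsets for
// a column with one level of repetition. A repetition level of 0 starts a new
// row; an entry counts as an element slot when its definition level is at
// least element_def_level (an assumed parameter). Both level vectors are
// assumed to have the same length.
std::vector<uint64_t> buildOffsets(
    const std::vector<int16_t> & rep_levels,
    const std::vector<int16_t> & def_levels,
    int16_t element_def_level)
{
    std::vector<uint64_t> offsets;
    uint64_t elements = 0;
    for (size_t i = 0; i < rep_levels.size(); ++i)
    {
        // A new row begins, so close the previous one.
        if (rep_levels[i] == 0 && i != 0)
            offsets.push_back(elements);
        if (def_levels[i] >= element_def_level)
            ++elements;
    }
    // Close the last row, if there was any input at all.
    if (!rep_levels.empty())
        offsets.push_back(elements);
    return offsets;
}
```

For example, the rows `[[a, b], [], [c]]` of a non-nullable list with non-nullable elements would arrive as repetition levels `{0, 1, 0, 0}` and definition levels `{1, 1, 0, 1}`; with `element_def_level = 1` the sketch yields the offsets `{2, 2, 3}` in a single pass, without an intermediate validity bitmap.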
Performance Test
As a result, the current implementation is noticeably faster. To generate the test data, a parquet file `src.parquet` of the TPC-DS table store_sales at scale 5000 is used; this file contains 35207247 rows. Next, the test data is generated with the following query:

Then the performance is tested with the following command:
For each field, two kinds of tests are run with different `input_format_parquet_use_native_reader` settings, and a single thread is used. The parquet reading duration is counted as commit log duration while reading parquet. The CPU model used in this test is Intel(R) Xeon(R) Gold 6271C CPU @ 2.60GHz.

The test results are detailed in the following table:
Documentation entry for user-facing changes