
Support loading parquet file in parallel by splitting row group #3945

Merged
chaoyli merged 1 commit into StarRocks:main from meegoo:feature_parquet_parallel_scan on Mar 18, 2022

@meegoo
Contributor

@meegoo meegoo commented Mar 8, 2022

What type of PR is this:

  • bug
  • feature
  • enhancement
  • others

Which issues does this PR fix:

Fixes #3942

Problem Summary(Required) :

StarRocks broker load parallelizes scans at file granularity, so loading one large file uses only a single scan instance.
To address this, this PR splits a Parquet file by row group and scans the resulting splits in parallel.
With the FE parameter load_parallel_instance_num=8, the load performance of a single Parquet file improves by 6x.
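
To illustrate the idea, here is a minimal sketch of how a file might be cut into byte ranges and how row groups might be assigned to them. It is not the code in this PR: `RowGroup`, `split_ranges`, and `row_groups_for_range` are hypothetical names, and assigning a row group to the range that contains its midpoint is one common convention for splitting Parquet files, assumed here for illustration only.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class RowGroup:
    """Hypothetical view of the row group metadata found in a Parquet footer."""
    index: int
    start: int  # byte offset of the row group within the file
    size: int   # compressed size of the row group in bytes


def split_ranges(file_size: int, parallelism: int) -> List[Tuple[int, int]]:
    """Cut [0, file_size) into at most `parallelism` contiguous (start, size) ranges."""
    if file_size <= 0 or parallelism <= 0:
        return []
    step = -(-file_size // parallelism)  # ceiling division
    return [(s, min(step, file_size - s)) for s in range(0, file_size, step)]


def row_groups_for_range(groups: List[RowGroup], start: int, size: int) -> List[int]:
    """Return indexes of row groups whose midpoint lies in [start, start + size).

    Assumption: midpoint ownership, so each row group belongs to exactly one range.
    """
    end = start + size
    return [g.index for g in groups if start <= g.start + g.size // 2 < end]


# Example: an 800-byte file with four row groups, scanned by two instances.
groups = [
    RowGroup(0, 4, 196),
    RowGroup(1, 200, 200),
    RowGroup(2, 400, 200),
    RowGroup(3, 600, 192),
]
for start, size in split_ranges(800, 2):
    print((start, size), row_groups_for_range(groups, start, size))
```

Under this assumption a range can own zero row groups when the parallelism exceeds the number of row groups, which would be consistent with a speedup below the configured instance count.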

@meegoo meegoo force-pushed the feature_parquet_parallel_scan branch from 833839f to dadf3db Compare March 8, 2022 09:09
@meegoo
Contributor Author

meegoo commented Mar 8, 2022

run starrocks_be_unittest

@meegoo meegoo force-pushed the feature_parquet_parallel_scan branch 2 times, most recently from 04e0117 to 2943aa9 Compare March 14, 2022 07:08
@meegoo
Contributor Author

meegoo commented Mar 14, 2022

run starrocks_fe_unittest

@chaoyli chaoyli changed the title support parallel parquet file load by split through row group Support loading parquet file in parallel by split row group Mar 14, 2022
@chaoyli
Contributor

chaoyli commented Mar 14, 2022

Add a more concise message about the performance test in the commit message.

@decster decster changed the title Support loading parquet file in parallel by split row group Support loading parquet file in parallel by splitting row group Mar 16, 2022
rickif
rickif previously approved these changes Mar 16, 2022
decster
decster previously approved these changes Mar 16, 2022
Comment thread be/src/exec/parquet_reader.cpp Outdated
@meegoo meegoo dismissed stale reviews from decster and rickif via 3d63d96 March 17, 2022 03:32
@meegoo meegoo force-pushed the feature_parquet_parallel_scan branch from 2a14611 to 3d63d96 Compare March 17, 2022 03:32
@wanpengfei-git
Collaborator

[FE PR Coverage check]

😍 pass : 0 / 0 (0%)

Contributor

@ABingHuang ABingHuang left a comment

LGTM

Comment thread be/src/exec/parquet_reader.cpp Outdated
@chaoyli chaoyli merged commit ddfd897 into StarRocks:main Mar 18, 2022
wyb added a commit to wyb/starrocks that referenced this pull request Mar 19, 2022
StarRocks#3945
This commit supports loading parquet file in parallel by splitting row group,
and requires that start offset and size must be set.

So set broker range start offset and size in spark load push task.
gengjun-git pushed a commit that referenced this pull request Mar 19, 2022
)

#3945
This commit supports loading parquet file in parallel by splitting row group,
and requires that start offset and size must be set.

So set broker range start offset and size in spark load push task.

* Update BE for smooth upgrade
jaogoy pushed a commit to jaogoy/starrocks that referenced this pull request Nov 15, 2023
* Add std.md

Signed-off-by: Sida Shen <shenstan1@gmail.com>

* Update std.md

* Update std.md

* Update std.md

* Update std.md

Signed-off-by: Sida Shen <shenstan1@gmail.com>
Co-authored-by: Sida Shen <shenstan1@gmail.com>
Co-authored-by: evelyn.zhaojie <98087056+evelynzhaojie@users.noreply.github.com>
(cherry picked from commit 0173ebf)

Co-authored-by: SidaShen <star.0731@163.com>

Development

Successfully merging this pull request may close these issues.

broker load support split parquet file for parallel scan

6 participants