Parquet: read columns in parallel #63581

Open
al13n321 opened this issue May 9, 2024 · 1 comment

Comments

@al13n321
Member

al13n321 commented May 9, 2024

Currently ParquetBlockInputFormat parallelizes reading at row group granularity, i.e. it reads+decodes multiple row groups in parallel, but reading+decoding of each individual row group is single-threaded. It would probably be better to be able to read different columns of the same row group in parallel.

Benefits:

  • Less memory usage because we don't have to keep num_threads * row_group_size of read data in memory. This is pretty important because row groups are typically hundreds of MBs, and we want to read+decode using tens of threads => potentially tens of GB of memory usage.
  • Faster reads if the file has few row groups - because we can have more threads than row groups.
  • Faster reads with small LIMIT - because we'll finish the first row group faster.
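
A rough illustration of the memory argument, as a sketch under simplifying assumptions rather than a description of the actual reader: the helper name `peak_buffered_mib` and all numbers below are hypothetical, and the model assumes that whole row groups stay buffered and that in column-parallel mode each thread takes a distinct column of the same row group before any thread moves on to the next one.

```python
import math

def peak_buffered_mib(num_threads, row_group_mib, num_columns, parallel_columns):
    """Estimate buffered data (MiB) under the simplified model described above."""
    if parallel_columns:
        # Threads share a row group, one column each, so fewer row groups are in flight.
        row_groups_in_flight = math.ceil(num_threads / num_columns)
    else:
        # One row group per thread, as with row-group-granularity parallelism.
        row_groups_in_flight = num_threads
    return row_groups_in_flight * row_group_mib

# Illustrative numbers only: 32 threads, 300 MiB row groups, 16 columns.
per_row_group = peak_buffered_mib(32, 300, 16, parallel_columns=False)
per_column = peak_buffered_mib(32, 300, 16, parallel_columns=True)
print(per_row_group, per_column)
```

With these made-up inputs the model gives 32 row groups in flight in the first case and 2 in the second; real usage depends on how much of each row group is actually held in memory at once.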

Implementation requirements:

  • Small column chunks (column chunk = a column in a row group) should be grouped together to avoid short reads. Even whole row groups should be grouped together if they're small - this is already implemented, and needs to keep working.
  • A row group may be too big to be decoded as one Block, even if its compressed data is small (we've seen extreme compression ratios in practice). So, each column chunk needs to be split into pieces, and then the corresponding pieces from all columns need to be collected into a Block. The column chunk reader should be careful not to run too far ahead and queue up too many pieces.
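
The two requirements above could be sketched roughly as follows. This is a toy Python model, not ClickHouse code: `group_small_chunks`, `read_row_group`, `min_read_bytes`, `rows_per_piece` and `max_queued_pieces` are all hypothetical names, plain lists stand in for decoded column chunks, and a bounded queue per column is just one possible way to keep a fast column from running ahead.

```python
import queue
import threading

def group_small_chunks(chunk_sizes, min_read_bytes):
    """Greedily merge adjacent column chunks until each group reaches min_read_bytes."""
    groups, current, current_bytes = [], [], 0
    for idx, size in enumerate(chunk_sizes):
        current.append(idx)
        current_bytes += size
        if current_bytes >= min_read_bytes:
            groups.append(current)
            current, current_bytes = [], 0
    if current:
        groups.append(current)  # the trailing group may stay below the threshold
    return groups

def read_row_group(columns, rows_per_piece, max_queued_pieces):
    """Split each column into pieces on its own thread and zip the pieces into blocks.

    columns: dict of column name -> list of values (stand-in for a column chunk).
    max_queued_pieces must be >= 1 (a size of 0 would make the queue unbounded).
    """
    if len({len(values) for values in columns.values()}) > 1:
        raise ValueError("all columns of a row group must have the same row count")
    queues = {name: queue.Queue(maxsize=max_queued_pieces) for name in columns}

    def produce(name, values):
        for start in range(0, len(values), rows_per_piece):
            # put() blocks while the queue is full, which bounds the lookahead.
            queues[name].put(values[start:start + rows_per_piece])
        queues[name].put(None)  # end-of-column marker

    threads = [threading.Thread(target=produce, args=(name, values))
               for name, values in columns.items()]
    for t in threads:
        t.start()

    blocks = []
    while True:
        # Take the next piece of every column; together they form one block.
        pieces = {name: q.get() for name, q in queues.items()}
        if all(piece is None for piece in pieces.values()):
            break
        blocks.append(pieces)
    for t in threads:
        t.join()
    return blocks

blocks = read_row_group({"id": list(range(10)), "name": [str(i) for i in range(10)]},
                        rows_per_piece=4, max_queued_pieces=2)
print(len(blocks), blocks[-1])
```

In this sketch each column holds at most `max_queued_pieces` decoded pieces at a time, so buffered data is bounded by the piece size rather than by the decoded size of the whole row group.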
@zhanglistar
Contributor

Looking forward to it.
