Currently ParquetBlockInputFormat parallelizes reading at row group granularity: it reads and decodes multiple row groups in parallel, but each individual row group is read and decoded single-threaded. It would probably be better to also be able to read different columns of the same row group in parallel.
Benefits:
- Less memory usage: we don't have to keep num_threads * row_group_size of decoded data in memory. This matters because row groups are typically hundreds of MBs, and we want to read and decode using tens of threads, which adds up to potentially tens of GB of memory usage.
- Faster reads when the file has few row groups, because we can use more threads than there are row groups.
- Faster reads with a small LIMIT, because we'll finish the first row group sooner.
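The memory argument above can be sketched numerically. All numbers here are hypothetical but typical (32 threads, 256 MiB row groups, 8 MiB pieces, a queue depth of 4); they are assumptions for illustration, not measurements:

```python
MiB = 1 << 20
num_threads = 32
row_group_bytes = 256 * MiB  # decoded size of one row group (assumed)

# Row-group-parallel: each thread keeps a whole row group in flight.
row_group_parallel = num_threads * row_group_bytes

# Column-parallel: threads share a row group split into small pieces,
# with a shallow bounded queue of pieces per thread.
piece_bytes = 8 * MiB   # assumed piece size
queue_depth = 4         # assumed per-thread read-ahead
column_parallel = num_threads * piece_bytes * queue_depth

print(row_group_parallel // (1 << 30), "GiB")  # 8 GiB
print(column_parallel // (1 << 30), "GiB")     # 1 GiB
```

With these numbers the in-flight decoded data drops from 8 GiB to 1 GiB, and the column-parallel figure is controlled by the piece size and queue depth rather than by the row group size.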
Implementation requirements:
- Small column chunks (a column chunk is one column within a row group) should be grouped together to avoid short reads. Even whole row groups should be grouped together if they're small; this is already implemented and needs to keep working.
- A row group may be too big to decode as a single Block, even if its compressed data is small (we've seen extreme compression ratios in practice). So each column chunk needs to be split into pieces, and the corresponding pieces from all columns assembled into a Block. The column chunk reader should be careful not to run too far ahead and queue up too many pieces.
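The second requirement could look roughly like the following sketch: each column chunk is decoded by its own thread and split into fixed-size pieces, a bounded queue per column keeps any one reader from running too far ahead, and corresponding pieces are then zipped across columns into Blocks. The names (`decode_pieces`, the piece size, the queue depth) are illustrative, not the actual ClickHouse API:

```python
import queue
import threading

PIECE_ROWS = 2  # rows per piece; tiny for the example

def decode_pieces(column_values, out_queue):
    """Decode one column chunk into pieces, blocking when the queue is full."""
    for i in range(0, len(column_values), PIECE_ROWS):
        out_queue.put(column_values[i:i + PIECE_ROWS])  # blocks at maxsize
    out_queue.put(None)  # end-of-chunk marker

def read_row_group(columns):
    """Read all column chunks of one row group in parallel, yield Blocks."""
    queues = [queue.Queue(maxsize=3) for _ in columns]  # bounded read-ahead
    threads = [
        threading.Thread(target=decode_pieces, args=(col, q))
        for col, q in zip(columns, queues)
    ]
    for t in threads:
        t.start()
    while True:
        pieces = [q.get() for q in queues]  # corresponding piece of each column
        if pieces[0] is None:
            break
        yield pieces  # a "Block": one piece per column
    for t in threads:
        t.join()

row_group = [[1, 2, 3, 4], ["a", "b", "c", "d"]]  # two column chunks
blocks = list(read_row_group(row_group))
print(blocks)  # [[[1, 2], ['a', 'b']], [[3, 4], ['c', 'd']]]
```

The `maxsize` on each queue is what bounds memory: a decoder that gets ahead of Block assembly simply blocks on `put` until the consumer catches up.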