Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closes #2855: Read batches for Parquet strings #2856

Merged
merged 10 commits into from
Nov 20, 2023

Conversation

bmcdonald3
Copy link
Contributor

In the first implementation of Parquet, strings were the only data type being read one at a time, rather than in batches since we can't read directly into a uint(8) buffer, so we were reading one value at a time and then copying it into the Chapel buffer. Rather than doing that one a time, we can switch to reading batches and then bulk copying to improve performance. Here are some performance numbers collected on 2 nodes of a Cray XC with varying batch sizes:

batchSize exec time
1 9.63s
10 3.66s
256 3.05s
512 3.03s
8192 3.05s

So, we see about a 3x improvement.

Closes #2855

@bmcdonald3 bmcdonald3 marked this pull request as draft November 16, 2023 22:10
@bmcdonald3
Copy link
Contributor Author

I believe the test that is failing is because of the writeMultiCol parquet function was not accounting for null values and thus didn't care about the definition level, which we now have to use to detect null values, since we can no longer just check values_read to give us indices of where there may be a null string.

I think it will require some modifications to the writeMultiCol function in order for this to work.

Copy link
Member

@stress-tess stress-tess left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me!! Thanks @bmcdonald3!

src/ArrowFunctions.cpp Show resolved Hide resolved
src/ArrowFunctions.cpp Outdated Show resolved Hide resolved
@stress-tess stress-tess added this pull request to the merge queue Nov 20, 2023
Merged via the queue into Bears-R-Us:master with commit 8215275 Nov 20, 2023
12 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Read batches for Parquet strings
3 participants