Read multiple row groups in Parquet files correctly #3950
Iterate through the column's row groups while maintaining a count of the total items read, and terminate the loop when the specified number of items have been read.
Signed-off-by: John H. Hartman <jhh67@users.noreply.github.com>
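As a rough illustration of the loop described in this commit message, here is a minimal Python sketch. It is not the Arkouda implementation (the actual reader is not shown in this thread); plain lists stand in for Parquet row groups, and the names `read_column`, `row_groups`, and `num_elems` are hypothetical.

```python
from typing import List, Sequence


def read_column(row_groups: Sequence[Sequence[int]], num_elems: int) -> List[int]:
    """Read up to num_elems values from a column split across row groups.

    Hypothetical model: each inner sequence stands in for one row group.
    """
    out = [0] * num_elems  # preallocated output buffer of the requested size
    total_read = 0         # running count of items read across all row groups

    for group in row_groups:
        if total_read >= num_elems:
            break  # the specified number of items has been read
        # Never take more than the space left in the output buffer.
        take = min(len(group), num_elems - total_read)
        out[total_read:total_read + take] = group[:take]
        total_read += take

    # Return only what was actually read if the column was shorter than asked.
    return out[:total_read]
```

The point of the running count is that the write position in the output always continues from where the previous row group stopped, and the loop can never write past the end of the buffer.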
@jhh67 Thanks for this!
I was able to recreate the error and verify that the PR does prevent the server crash in this example:
ajpotts left a comment
Looks great! Thank you!
Investigating this further, I noticed that the bottom of the output array is filled with zeros. I did check that the first part of the array has correct values.
The variable skipIdx contains the number of values to be skipped in the column prior to reading values. Skipping is done one row group at a time, so this value must be updated as each row group is skipped. Also, readColumnDbFl and readColumnIrregularBitWidth now return the number of values read, so that ReadColumn increments the index into the output array properly.
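The skip handling and returned counts described above can be sketched as follows. This is a simplified Python model under stated assumptions, not the project's code: lists stand in for row groups, `read_row_group` is a hypothetical stand-in for helpers such as readColumnDbFl or readColumnIrregularBitWidth, and `skip_idx` mirrors the role of skipIdx.

```python
from typing import List, Sequence


def read_row_group(group: Sequence[int], skip: int, out: List[int],
                   out_idx: int, remaining: int) -> int:
    """Copy up to `remaining` values from `group` after skipping `skip` values.

    Returns the number of values actually read (hypothetical helper).
    """
    vals = group[skip:skip + remaining]
    out[out_idx:out_idx + len(vals)] = vals
    return len(vals)


def read_column_with_skip(row_groups: Sequence[Sequence[int]],
                          skip_idx: int, num_elems: int) -> List[int]:
    """Skip skip_idx values from the start of the column, then read num_elems."""
    out = [0] * num_elems
    out_idx = 0  # next free slot in the output array

    for group in row_groups:
        if out_idx >= num_elems:
            break
        if skip_idx >= len(group):
            # The whole row group is skipped, so reduce the remaining skip.
            skip_idx -= len(group)
            continue
        n = read_row_group(group, skip_idx, out, out_idx, num_elems - out_idx)
        skip_idx = 0   # any partial skip is used up within this row group
        out_idx += n   # advance by the count the helper reports, not a guess

    return out[:out_idx]
```

For example, with three row groups of three values each, skipping 4 and reading 3 should start one value into the second row group and finish in the third. If the skip count were not reduced per row group, or the output index were advanced by the wrong amount, part of the output would be left at its initial zeros.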
I pushed a fix for the bug that
drculhane left a comment
Confirmed the error, and confirmed the fix. Looks good to me.
jaketrookman left a comment
Looks great
@jhh67 : @drculhane ran the tests for you, so we're going to merge this one in. The unit tests run automatically in the CI with size=100. We also usually try to run them locally. We're always looking for ways to improve our unit tests, so if you have any specific proposals, let us know. Thanks again! We suspect this bug was affecting other users as well.
* Read multiple row groups correctly

Iterate through the column's row groups while maintaining a count of the total items read, and terminate the loop when the specified number of items have been read.

Signed-off-by: John H. Hartman <jhh67@users.noreply.github.com>

* Skip values and count values read properly

The variable skipIdx contains the number of values to be skipped in the column prior to reading values. Skipping is done one row group at a time, so this value must be updated as each row group is skipped. Also, readColumnDbFl and readColumnIrregularBitWidth now return the number of values read, so that ReadColumn increments the index into the output array properly.

---------

Signed-off-by: John H. Hartman <jhh67@users.noreply.github.com>
Co-authored-by: John H. Hartman <jhh67@users.noreply.github.com>
Co-authored-by: ajpotts <amanda.j.potts@gmail.com>
…Us#3950)" (Bears-R-Us#3969) This reverts commit 091b8dd.
Columns with more than one row group were not read correctly, which could lead to server crashes and perhaps memory corruption. This fix iterates through the column's row groups while maintaining a count of the total items read, and terminates the loop when the specified number of items have been read.
To do (see https://github.com/Bears-R-Us/arkouda/blob/master/CONTRIBUTING.md#writing-pull-requests)
Closes #3951