
Conversation

@jhh67 (Contributor) commented Dec 19, 2024

Columns with more than one row group were not read correctly, which could lead to server crashes and perhaps memory corruption. This fix iterates through the column's row groups while maintaining a count of the total items read, and terminates the loop when the specified number of items have been read.
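For illustration, here is a minimal Python sketch of the looping pattern the fix describes, written against pyarrow rather than Arkouda's actual reader code (the function name and structure are assumptions, not the code that was changed): iterate over the row groups, keep a running count of items read, and stop once the requested count is reached.

import pyarrow.parquet as pq

def read_column(path, col, num_items):
    """Read up to num_items values of one column across all row groups."""
    pf = pq.ParquetFile(path)
    out = []
    total_read = 0
    for rg in range(pf.num_row_groups):
        if total_read >= num_items:
            break  # the specified number of items has been read
        chunk = pf.read_row_group(rg, columns=[col]).column(col)
        # Take only as many values from this row group as are still needed.
        take = min(len(chunk), num_items - total_read)
        out.extend(chunk.slice(0, take).to_pylist())
        total_read += take
    return out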

To do (see https://github.com/Bears-R-Us/arkouda/blob/master/CONTRIBUTING.md#writing-pull-requests)

  • make test
  • make mypy
  • flake8 arkouda

Closes #3951

@jhh67 jhh67 changed the title from "Read multiple row groups correctly" to "Read multiple row groups in Parquet files correctly" Dec 19, 2024
Iterate through the column's row groups while maintaining a count of the
total items read, and terminate the loop when the specified number of items
have been read.

Signed-off-by: John H. Hartman <jhh67@users.noreply.github.com>
@ajpotts ajpotts marked this pull request as ready for review December 21, 2024 00:19
@ajpotts (Contributor) commented Dec 26, 2024

@jhh67 Thanks for this!

@ajpotts (Contributor) commented Dec 26, 2024

I was able to recreate the error and verify that the PR does prevent the server crash in this example:


import arkouda as ak
import numpy as np
import pandas as pd

ak.connect()

size = 10**8  # large enough that the Parquet file is written with many row groups

# Two index columns: "first" repeats each value twice, "second" counts up.
arrays = [(np.arange(size) // 2).tolist(), np.arange(size).tolist()]
tuples = list(zip(*arrays))

index = pd.MultiIndex.from_tuples(tuples, names=["first", "second"])
s = pd.Series(np.random.randn(size), index=index)

df = s.to_frame()
df.to_parquet("test_frame.parquet")

ak_df = ak.read_parquet("test_frame.parquet")
ak.DataFrame(ak_df)

@ajpotts (Contributor) left a comment

Looks great! Thank you!

@ajpotts (Contributor) commented Dec 27, 2024

Investigating this further, I noticed the output of ak_df above is:

                 0     first    second
0        -0.848479         0         0
1        -0.759512         0         1
2         0.814430         1         2
3         1.195904         1         3
4         0.848203         2         4
...            ...       ...       ...
99999995  0.000000  49999997  99999995
99999996  0.000000  49999998  99999996
99999997  0.000000  49999998  99999997
99999998  0.000000  49999999  99999998
99999999  0.000000  49999999  99999999

Notice how the bottom of the output is filled with zeros. I checked that the first part of the array has the correct values.
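A quick sanity check on the values (a sketch that assumes ak_df, as returned by ak.read_parquet for a multi-column file, maps column names to Arkouda pdarrays):

import numpy as np

# The index columns follow a known pattern, so the tail should not be
# zero-filled; compare just the last few values to keep the check cheap.
tail = slice(size - 5, size)
assert np.array_equal(ak_df["second"][tail].to_ndarray(), np.arange(size)[tail])
assert np.array_equal(ak_df["first"][tail].to_ndarray(), (np.arange(size) // 2)[tail])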

The variable skipIdx contains the number of values to be skipped in the column
prior to reading values. Skipping is done one row group at a time, so this
value must be updated as each row group is skipped.

Also, readColumnDbFl and readColumnIrregularBitWidth now return the number of
values read, so that ReadColumn increments the index into the output array
properly.
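To make the skip logic concrete, here is an illustrative Python model (the real change is in Arkouda's reader code; readColumnDbFl and readColumnIrregularBitWidth are the actual helpers named above, but everything in this sketch is assumed behavior, not the shipped code):

def read_column_with_skip(row_groups, skip_idx, num_items):
    """Model: skip skip_idx values, then read num_items across row groups."""
    out = []
    total_read = 0
    for rg in row_groups:              # here each row group is a list of values
        if skip_idx >= len(rg):
            skip_idx -= len(rg)        # update the skip count as each row group is skipped
            continue
        start = skip_idx
        skip_idx = 0                   # skipping may end partway through a row group
        take = min(len(rg) - start, num_items - total_read)
        out.extend(rg[start:start + take])
        total_read += take             # returning the count read lets the caller
        if total_read == num_items:    # advance its index into the output array
            break
    return out, total_read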
@jhh67 (Contributor, Author) commented Jan 3, 2025

I pushed a fix for the bug where ak_df is filled with zeros at the bottom. I've done what testing I can, but I would appreciate more thorough testing, or a pointer to how to run more extensive tests myself. I fear I'm playing a game of whack-a-mole with these bugs, and there may be more hiding in the code.

@drculhane (Contributor) left a comment
Confirmed the error, and confirmed the fix. Looks good to me.

@jaketrookman (Contributor) left a comment

Looks great

@ajpotts ajpotts added this pull request to the merge queue Jan 9, 2025
@ajpotts (Contributor) commented Jan 9, 2025

@jhh67: @drculhane ran the tests for you, so we're going to merge this one in. The unit tests run automatically in CI with size=100. We also usually run make test locally with size=10**8, as well as make test with gasnet. Once it is merged, HPE runs the unit tests at scale every night on actual machines, so if we missed anything we should find out within a few days.

We're always looking for ways to improve our unit tests, so if you have any specific proposals let us know.

Thanks again! We suspect this bug was affecting other users as well.

Merged via the queue into Bears-R-Us:master with commit 091b8dd Jan 9, 2025
19 checks passed
ajpotts added a commit that referenced this pull request Jan 10, 2025
jabraham17 pushed a commit to jabraham17/arkouda that referenced this pull request Jan 21, 2025
jabraham17 pushed a commit to jabraham17/arkouda that referenced this pull request Jan 21, 2025
ajpotts added a commit that referenced this pull request Jan 27, 2025
github-merge-queue bot pushed a commit that referenced this pull request Jan 27, 2025
