
Conversation

@jhh67 (Contributor) commented Dec 19, 2024

Columns with more than one row group were not read correctly, which could lead to server crashes and perhaps memory corruption. This fix iterates through the column's row groups while maintaining a count of the total items read, and terminates the loop when the specified number of items have been read.
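For illustration, here is a minimal Python sketch of the looping pattern the fix describes, written against pyarrow rather than Arkouda's actual reader code (the function name and structure are assumptions, not the code that was changed): iterate over the row groups, keep a running count of items read, and stop once the requested count is reached.

import pyarrow.parquet as pq

def read_column(path, col, num_items):
    """Read up to num_items values of one column across all row groups."""
    pf = pq.ParquetFile(path)
    out = []
    total_read = 0
    for rg in range(pf.num_row_groups):
        if total_read >= num_items:
            break  # the specified number of items has been read
        chunk = pf.read_row_group(rg, columns=[col]).column(col)
        # Take only as many values from this row group as are still needed.
        take = min(len(chunk), num_items - total_read)
        out.extend(chunk.slice(0, take).to_pylist())
        total_read += take
    return out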

To do (see https://github.com/Bears-R-Us/arkouda/blob/master/CONTRIBUTING.md#writing-pull-requests)

  • make test
  • make mypy
  • flake8 arkouda

Closes #3951

@jhh67 jhh67 changed the title from "Read multiple row groups correctly" to "Read multiple row groups in Parquet files correctly" Dec 19, 2024
Iterate through the column's row groups while maintaining a count of the
total items read, and terminate the loop when the specified number of items
have been read.

Signed-off-by: John H. Hartman <jhh67@users.noreply.github.com>
@ajpotts ajpotts marked this pull request as ready for review December 21, 2024 00:19
@ajpotts (Contributor) commented Dec 26, 2024

@jhh67 Thanks for this!

@ajpotts (Contributor) commented Dec 26, 2024

I was able to recreate the error and verify that the PR does prevent the server crash in this example:


import arkouda as ak
import numpy as np
import pandas as pd

ak.connect()

size = 10**8  # large enough that the Parquet file is written with many row groups

# Two index columns: "first" repeats each value twice, "second" counts up.
arrays = [(np.arange(size) // 2).tolist(), np.arange(size).tolist()]
tuples = list(zip(*arrays))

index = pd.MultiIndex.from_tuples(tuples, names=["first", "second"])
s = pd.Series(np.random.randn(size), index=index)

df = s.to_frame()
df.to_parquet("test_frame.parquet")

ak_df = ak.read_parquet("test_frame.parquet")
ak.DataFrame(ak_df)

@ajpotts (Contributor) left a comment

Looks great! Thank you!

@ajpotts (Contributor) commented Dec 27, 2024

Investigating this further, I noticed the output of ak_df above is:

                 0     first    second
0        -0.848479         0         0
1        -0.759512         0         1
2         0.814430         1         2
3         1.195904         1         3
4         0.848203         2         4
...            ...       ...       ...
99999995  0.000000  49999997  99999995
99999996  0.000000  49999998  99999996
99999997  0.000000  49999998  99999997
99999998  0.000000  49999999  99999998
99999999  0.000000  49999999  99999999

Notice how the bottom of the output is filled with zeros. I checked that the first part of the array has the correct values.
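A quick sanity check on the values (a sketch that assumes ak_df, as returned by ak.read_parquet for a multi-column file, maps column names to Arkouda pdarrays):

import numpy as np

# The index columns follow a known pattern, so the tail should not be
# zero-filled; compare just the last few values to keep the check cheap.
tail = slice(size - 5, size)
assert np.array_equal(ak_df["second"][tail].to_ndarray(), np.arange(size)[tail])
assert np.array_equal(ak_df["first"][tail].to_ndarray(), (np.arange(size) // 2)[tail])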

The variable skipIdx contains the number of values to be skipped in the column
prior to reading values. Skipping is done one row group at a time, so this
value must be updated as each row group is skipped.

Also, readColumnDbFl and readColumnIrregularBitWidth now return the number of
values read, so that ReadColumn increments the index into the output array
properly.
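To make the skip logic concrete, here is an illustrative Python model (the real change is in Arkouda's reader code; readColumnDbFl and readColumnIrregularBitWidth are the actual helpers named above, but everything in this sketch is assumed behavior, not the shipped code):

def read_column_with_skip(row_groups, skip_idx, num_items):
    """Model: skip skip_idx values, then read num_items across row groups."""
    out = []
    total_read = 0
    for rg in row_groups:              # here each row group is a list of values
        if skip_idx >= len(rg):
            skip_idx -= len(rg)        # update the skip count as each row group is skipped
            continue
        start = skip_idx
        skip_idx = 0                   # skipping may end partway through a row group
        take = min(len(rg) - start, num_items - total_read)
        out.extend(rg[start:start + take])
        total_read += take             # returning the count read lets the caller
        if total_read == num_items:    # advance its index into the output array
            break
    return out, total_read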
@jhh67 (Contributor, Author) commented Jan 3, 2025

I pushed a fix for the bug where ak_df is filled with zeros at the bottom. I've done what testing I can, but I would appreciate more thorough testing, or a pointer to how to run more extensive tests myself. I fear I'm playing a game of whack-a-mole with these bugs, and there may be more hiding in the code.

@drculhane (Contributor) left a comment
Confirmed the error, and confirmed the fix. Looks good to me.

@jaketrookman (Contributor) left a comment

Looks great

@ajpotts ajpotts added this pull request to the merge queue Jan 9, 2025
@ajpotts (Contributor) commented Jan 9, 2025

@jhh67: @drculhane ran the tests for you, so we're going to merge this one in. The unit tests run automatically in CI with size=100. We also usually run make test locally with size=10**8, as well as make test with gasnet. Once it is merged, HPE runs the unit tests at scale every night on actual machines, so if we missed anything we should find out within a few days.

We're always looking for ways to improve our unit tests, so if you have any specific proposals let us know.

Thanks again! We suspect this bug was affecting other users as well.

Merged via the queue into Bears-R-Us:master with commit 091b8dd Jan 9, 2025
19 checks passed
ajpotts added a commit that referenced this pull request Jan 10, 2025
jabraham17 pushed a commit to jabraham17/arkouda that referenced this pull request Jan 21, 2025
jabraham17 pushed a commit to jabraham17/arkouda that referenced this pull request Jan 21, 2025
ajpotts added a commit that referenced this pull request Jan 27, 2025
github-merge-queue bot pushed a commit that referenced this pull request Jan 27, 2025
