
Add support for optionally reading and writing Parquet files #967

Merged
mhmerrill merged 5 commits into Bears-R-Us:master from bmcdonald3:parquet-support on Nov 30, 2021

Conversation

@bmcdonald3 (Contributor) commented Nov 9, 2021

This PR adds the ability to read and write int64 Parquet files in Arkouda through a `pdarray.save_parquet()` function and an `ak.read_parquet()` function. These functions behave in much the same way as the `pdarray.save()` and `ak.load()` functions (see docstrings). This is only the initial support; more dtypes will likely be added as needs arise.
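
As a rough usage sketch (not taken from the PR itself): the function names `pdarray.save_parquet()` and `ak.read_parquet()` come from this PR, but the parameter names (`dataset`, `dsetname`) and the wildcard handling are assumptions by analogy with `pdarray.save()`/`ak.load()` and may not match the actual signatures.

```python
import arkouda as ak

# Requires an arkouda_server built with ARKOUDA_SERVER_PARQUET_SUPPORT set.
ak.connect()

# int64 is the only dtype supported in this initial PR.
a = ak.arange(10**6)

# Write Parquet output under the given prefix (presumably one file per
# locale, following the HDF5 save convention).
# NOTE: the "dataset" keyword is an assumption based on pdarray.save().
a.save_parquet("/tmp/ak_parquet_demo", dataset="int64-col")

# Read the files back into a new pdarray.
# NOTE: the "dsetname" keyword and glob handling are assumptions based on
# ak.load()/ak.read_all().
b = ak.read_parquet("/tmp/ak_parquet_demo*", dsetname="int64-col")

assert (a == b).all()

ak.disconnect()
```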

For reading Parquet files, the Arrow C++ library was chosen for its completeness and portability. The actual functionality is written in C++, with light C wrappers so that Chapel's C interoperability features can be used; this requires the C++ code to be compiled into an object file before the functions are called from Chapel.

On the Chapel side, the functionality is modeled off of the HDF5 code, but currently only supports int64 and int32 Arrow datatypes, which are read into int64 pdarrays.

Performance numbers collected on a single node Cray CS:

| Function | Parquet (GiB/s) | HDF5 (GiB/s) |
|----------|-----------------|--------------|
| Read     | 0.65            | 2.29         |
| Write    | 0.11            | 2.81         |

Potential next steps:

  • Better error handling
  • Improve performance
  • Extend to support reading more Arrow types into pdarrays
  • Extend to support Arkouda Strings
  • Support append mode
  • Auto-detect Arrow/Parquet support

Parquet as an optional dependency:
Running `make` with this PR will build Arkouda as normal, without requiring Arrow. Only when the environment variable `ARKOUDA_SERVER_PARQUET_SUPPORT` is set will the Arrow-requiring files be pulled into the build. This works in much the same way as ZMQ and HDF5: the path to an existing installation can be specified in `Makefile.paths`, and there is a provided `make install-arrow` command to download Arrow and build it from source.

Closes #903

@bmcdonald3 changed the title from "Add support for optionality reading and writing Parquet files" to "Add support for optionally reading and writing Parquet files" on Nov 9, 2021

@glitch (Collaborator) left a comment

Overall this looks really good!

My main questions are about closing Parquet files and readers, whether we should throw an error versus return a `MsgTuple` with `MsgType.Error` from the read/write Parquet Chapel functions, and the use of `try!` versus throwing an error.

Great work!

EDIT:
Oh, one more thing: please add `src/ArrowFunctions.o` to `.gitignore` and add an empty `test/UnitTestParquetCpp.good`, since it also shows up as an untracked file in git.

var high = min(d1.high, d2.high);
if (d1.stride != 1) && (d2.stride != 1) {
  // TODO: change this to throw
  halt("At least one domain must have stride 1");

A collaborator replied:

Yes, I agree, we should throw here.

@ronawho (Contributor) commented Nov 9, 2021

@glitch already found this, but FYI this is based on #966 just because it reduces the amount of code we need to bring in for the parquet dependency check.

Ben also mentioned this a bit in the PR, but the intent is for this to be the first usable draft that we will then build on. The goal was to get something usable in so that we can start getting feedback on it, as well as start running nightly correctness/performance testing. There are a number of known next steps (and we'd be interested in input on prioritizing those). Note that better error handling is one of those next steps. Today there are some halts, and it's worth noting that in the C++ code we currently throw an uncaught exception, which is effectively the same as a halt. Turning that into proper error handling should be straightforward, but it is a fair amount of boilerplate code, and we didn't want to make those changes just yet in case larger changes to the core code were required.

Edit: to be clear, we'd be happy to make any error-handling or other changes you want in this PR; we just wanted to give context on what we were thinking going into this.

@glitch (Collaborator) left a comment

I'd like to get a few more reviews from other team members but overall I think it looks pretty good.

@stress-tess (Member) left a comment

Looks good to me! Nice work!

@reuster986 (Collaborator) left a comment

Great work, @bmcdonald3 ! I like the design here, especially how you handled the optional inclusion of parquet functionality at compile time in the makefile. I think it's a good prototype for how we can think about extending arkouda with other modular content.

I'm on board with the intention expressed by you and @ronawho to get basic functionality working in this PR and then come back and add more robust error handling, etc. later.

I just requested one minor change regarding how wildcards are handled, because I want us to adhere to ak.read_all usage, rather than ak.load_all.

Well done!

@ronawho (Contributor) commented Nov 16, 2021

> I'm on board with the intention expressed by you and @ronawho to get basic functionality working in this PR and then come back and add more robust error handling, etc. later.

@reuster986 do you (or others) have any input on prioritization of next steps? (there's an unordered list of potential next steps in the PR description)

@bmcdonald3 (Contributor, Author) commented:

Thanks @reuster986, I just pushed up some code that addresses your suggestion, assuming I understood it correctly. Also, I would note that the method for optional inclusion was work from @ronawho, so I cannot take credit for that!

@bmcdonald3 force-pushed the parquet-support branch 2 times, most recently from ab43417 to 83fdb00, on November 16, 2021 23:11
@glitch linked an issue on Nov 17, 2021 that may be closed by this pull request

@reuster986 (Collaborator) left a comment

Other than a small change on the test logic, everything looks good.

if arr1[i] != arr2[i]:
    print(arr1[i], 'does not match', arr2[i], 'at index', i)
    return 1
if (arr1 != arr2).all():

I think this should be `if (arr1 != arr2).any():`.

As written, arrays will compare equal unless all of their values differ.
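
A small illustration of the difference (not from the PR; it assumes a running arkouda_server):

```python
import arkouda as ak

ak.connect()

arr1 = ak.array([1, 2, 3])
arr2 = ak.array([1, 2, 4])   # differs from arr1 only at index 2

print((arr1 != arr2).all())  # False -> the original check would report the arrays as matching
print((arr1 != arr2).any())  # True  -> correctly flags the single mismatch
```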

@glitch (Collaborator) commented Nov 30, 2021

If you care about order, you could use NumPy to do the comparison, or just a straight list comparison:

# Numpy
>>> np.array_equal([1,2,3], [1,2,3])
True
>>> np.array_equal([1,2,3], [3,2,1])
False

# Straight list comparison
>>> [1, 2, 3] == [1, 2, 3]
True
>>> [1, 2, 3] == [3, 2, 1]
False
>>> [1, 2, 3] == [1, 2, 3, 4]
False

You should also return the boolean value instead of an integer.

@reuster986 (Collaborator) replied:

@glitch I believe arr1 and arr2 are arkouda arrays. Although you could do np.array_equal(arr1.to_ndarray(), arr2.to_ndarray()), I guess I have a personal preference for staying in arkouda to avoid array transfers (even though those are fast now and these arrays are small, so at this point it's mostly a stylistic choice).

Also, it's been a while since I looked, but I think most of the test suite uses the (a == b).all() pattern, so it might be good to use that for consistency.

@bmcdonald3 (Contributor, Author) replied:

@reuster986, you are correct: `(a == b).all()` is the pattern elsewhere, and it seems that what other tests most often do is just `assertTrue` rather than a `compare_values` function like I was using, so I have switched to that format to be more consistent with the other tests.
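
A minimal sketch of that pattern, reusing the `save_parquet`/`read_parquet` parameter names assumed earlier; the class name, dataset name, and paths are illustrative only and not taken from the PR's actual tests:

```python
import unittest
import arkouda as ak

SIZE = 1000

class ParquetRoundtripTest(unittest.TestCase):
    def setUp(self):
        # Assumes a running arkouda_server built with Parquet support.
        ak.connect()

    def test_int64_roundtrip(self):
        original = ak.randint(0, 100, SIZE, dtype=ak.int64)
        # The "dataset"/"dsetname" keywords are assumptions, as noted above.
        original.save_parquet("/tmp/pq_roundtrip", dataset="test-col")
        read_back = ak.read_parquet("/tmp/pq_roundtrip*", dsetname="test-col")
        self.assertTrue((original == read_back).all())
```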

@mhmerrill (Contributor) commented:

@reuster986 @bmcdonald3 we would like to merge this week. Is this ready to go, or do we need to do more?

@reuster986 (Collaborator) commented:

@mhmerrill I reviewed earlier this morning and requested a small logic change in the tests (see discussion above). After that, I think it's ready to go.

@bmcdonald3 (Contributor, Author) commented:

@mhmerrill This is all that we had planned to add for this initial Parquet support effort. We do have a couple of follow-up PRs nearly ready that improve error handling and performance and support additional Parquet types, among a few other small changes, but we wanted to get a few nights' worth of testing data on this initial work before adding those other changes, if that sounds good to you.

@mhmerrill merged commit d7f3544 into Bears-R-Us:master on Nov 30, 2021

@mhmerrill (Contributor) commented:

@bmcdonald3 @ronawho should we have a separate CI process to check this? or does the CI already check this?

@ronawho (Contributor) commented Dec 1, 2021

> @bmcdonald3 @ronawho should we have a separate CI process to check this? or does the CI already check this?

I think we discussed this offline, but it's tested by default in CI testing -- https://github.com/Bears-R-Us/arkouda/pull/967/files#diff-3ab46ee209a127470fce3c2cf106b1a1dbadbb929a4b5b13656a4bc4ce19c0b8R7

ronawho added a commit to chapel-lang/chapel that referenced this pull request Dec 1, 2021
Enable parquet support for Arkouda testing

Enable the opt-in parquet support for our nightly arkouda testing.

See Bears-R-Us/arkouda#967 for more info.
@bmcdonald3 deleted the parquet-support branch on January 27, 2022 20:29