
Add support for optionally reading and writing Parquet files #967

Merged
mhmerrill merged 5 commits into Bears-R-Us:master from bmcdonald3:parquet-support on Nov 30, 2021

Conversation

@bmcdonald3 (Contributor) commented Nov 9, 2021

This PR adds the ability to read and write int64 Parquet files in Arkouda through a `pdarray.save_parquet()` function and an `ak.read_parquet()` function. These functions behave in much the same way as the `pdarray.save()` and `ak.load()` functions (see docstrings). This is only the initial support; more dtypes will likely be added as needs arise.
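
As a rough usage sketch (not taken from the PR itself): the function names `pdarray.save_parquet()` and `ak.read_parquet()` come from this PR, but the parameter names (`dataset`, `dsetname`) and the wildcard handling are assumptions by analogy with `pdarray.save()`/`ak.load()` and may not match the actual signatures.

```python
import arkouda as ak

# Requires an arkouda_server built with ARKOUDA_SERVER_PARQUET_SUPPORT set.
ak.connect()

# int64 is the only dtype supported in this initial PR.
a = ak.arange(10**6)

# Write Parquet output under the given prefix (presumably one file per
# locale, following the HDF5 save convention).
# NOTE: the "dataset" keyword is an assumption based on pdarray.save().
a.save_parquet("/tmp/ak_parquet_demo", dataset="int64-col")

# Read the files back into a new pdarray.
# NOTE: the "dsetname" keyword and glob handling are assumptions based on
# ak.load()/ak.read_all().
b = ak.read_parquet("/tmp/ak_parquet_demo*", dsetname="int64-col")

assert (a == b).all()

ak.disconnect()
```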

For reading Parquet files, the Arrow C++ library was chosen for its completeness and portability. The actual functionality is written in C++, with light C wrappers so that Chapel's C interoperability features can be used; this requires the C++ code to be compiled into an object file before the functions are called from Chapel.

On the Chapel side, the functionality is modeled off of the HDF5 code, but currently only supports int64 and int32 Arrow datatypes, which are read into int64 pdarrays.

Performance numbers collected on a single node Cray CS:

| Function | Parquet (GiB/s) | HDF5 (GiB/s) |
|----------|-----------------|--------------|
| Read     | 0.65            | 2.29         |
| Write    | 0.11            | 2.81         |

Potential next steps:

  • Better error handling
  • Improve performance
  • Extend to support reading more Arrow types into pdarrays
  • Extend to support Arkouda Strings
  • Support append mode
  • Auto-detect Arrow/Parquet support

Parquet as an optional dependency:
Running `make` with this PR will build Arkouda as normal, without requiring Arrow. Only when the environment variable `ARKOUDA_SERVER_PARQUET_SUPPORT` is set will the Arrow-requiring files be pulled into the build. This works in much the same way as ZMQ and HDF5: the path to an existing installation can be specified in `Makefile.paths`, and there is a provided `make install-arrow` command to download Arrow and build it from source.

Closes #903

@bmcdonald3 changed the title from "Add support for optionality reading and writing Parquet files" to "Add support for optionally reading and writing Parquet files" on Nov 9, 2021

@glitch (Collaborator) left a comment

Overall this looks really good!

My main questions are about closing Parquet files and readers, whether we should throw an error versus return a `MsgTuple` with `MsgType.Error` from the read/write Parquet Chapel functions, and the use of `try!` versus throwing an error.

Great work!

EDIT:
Oh, one more thing: please add `src/ArrowFunctions.o` to `.gitignore` and add an empty `test/UnitTestParquetCpp.good`, since it also shows up as an untracked file in git.

var high = min(d1.high, d2.high);
if (d1.stride != 1) && (d2.stride != 1) {
  // TODO: change this to throw
  halt("At least one domain must have stride 1");

A collaborator replied:

Yes, I agree, we should throw here.

@ronawho (Contributor) commented Nov 9, 2021

@glitch already found this, but FYI this is based on #966 just because it reduces the amount of code we need to bring in for the parquet dependency check.

Ben also mentioned this a bit in the PR, but the intent is for this to be the first usable draft that we will then build on. The goal was to get something usable in so that we can start getting feedback on it, as well as start running nightly correctness/performance testing. There are a number of known next steps (and we'd be interested in input on prioritizing those). Note that better error handling is one of those next steps. Today there are some halts, and it's worth noting that in the C++ code we currently throw an uncaught exception, which is effectively the same as a halt. Turning that into proper error handling should be straightforward, but it is a fair amount of boilerplate code, and we didn't want to make those changes just yet in case larger changes to the core code were required.

Edit: to be clear, we'd be happy to make any error-handling or other changes you want in this PR; we just wanted to give context on what we were thinking going into this.

@glitch (Collaborator) left a comment

I'd like to get a few more reviews from other team members but overall I think it looks pretty good.

@stress-tess (Member) left a comment

Looks good to me! Nice work!

@reuster986 (Collaborator) left a comment

Great work, @bmcdonald3 ! I like the design here, especially how you handled the optional inclusion of parquet functionality at compile time in the makefile. I think it's a good prototype for how we can think about extending arkouda with other modular content.

I'm on board with the intention expressed by you and @ronawho to get basic functionality working in this PR and then come back and add more robust error handling, etc. later.

I just requested one minor change regarding how wildcards are handled, because I want us to adhere to ak.read_all usage, rather than ak.load_all.

Well done!

@ronawho (Contributor) commented Nov 16, 2021

> I'm on board with the intention expressed by you and @ronawho to get basic functionality working in this PR and then come back and add more robust error handling, etc. later.

@reuster986 do you (or others) have any input on prioritization of next steps? (there's an unordered list of potential next steps in the PR description)

@bmcdonald3 (Contributor, Author) commented:

Thanks @reuster986, I just pushed up some code that addresses your suggestion, assuming I understood it correctly. Also, I would note that the method for optional inclusion was work from @ronawho, so I cannot take credit for that!

@bmcdonald3 force-pushed the parquet-support branch 2 times, most recently from ab43417 to 83fdb00, on November 16, 2021 23:11
@glitch linked an issue on Nov 17, 2021 that may be closed by this pull request

@reuster986 (Collaborator) left a comment

Other than a small change on the test logic, everything looks good.

if arr1[i] != arr2[i]:
    print(arr1[i], 'does not match', arr2[i], 'at index', i)
    return 1
if (arr1 != arr2).all():

I think this should be `if (arr1 != arr2).any():`.

As written, arrays will compare equal unless all of their values differ.
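
A small illustration of the difference (not from the PR; it assumes a running arkouda_server):

```python
import arkouda as ak

ak.connect()

arr1 = ak.array([1, 2, 3])
arr2 = ak.array([1, 2, 4])   # differs from arr1 only at index 2

print((arr1 != arr2).all())  # False -> the original check would report the arrays as matching
print((arr1 != arr2).any())  # True  -> correctly flags the single mismatch
```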

@glitch (Collaborator) commented Nov 30, 2021

If you care about order, you could use NumPy to do the comparison, or just a straight list comparison:

# Numpy
>>> np.array_equal([1,2,3], [1,2,3])
True
>>> np.array_equal([1,2,3], [3,2,1])
False

# Straight list comparison
>>> [1, 2, 3] == [1, 2, 3]
True
>>> [1, 2, 3] == [3, 2, 1]
False
>>> [1, 2, 3] == [1, 2, 3, 4]
False

You should also return the boolean value instead of an integer.

@reuster986 (Collaborator) replied:

@glitch I believe arr1 and arr2 are arkouda arrays. Although you could do np.array_equal(arr1.to_ndarray(), arr2.to_ndarray()), I guess I have a personal preference for staying in arkouda to avoid array transfers (even though those are fast now and these arrays are small, so at this point it's mostly a stylistic choice).

Also, it's been a while since I looked, but I think most of the test suite uses the (a == b).all() pattern, so it might be good to use that for consistency.

@bmcdonald3 (Contributor, Author) replied:

@reuster986, you are correct: `(a == b).all()` is the pattern elsewhere, and it seems that what other tests most often do is just `assertTrue` rather than a `compare_values` function like I was using, so I have switched to that format to be more consistent with the other tests.
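
A minimal sketch of that pattern, reusing the `save_parquet`/`read_parquet` parameter names assumed earlier; the class name, dataset name, and paths are illustrative only and not taken from the PR's actual tests:

```python
import unittest
import arkouda as ak

SIZE = 1000

class ParquetRoundtripTest(unittest.TestCase):
    def setUp(self):
        # Assumes a running arkouda_server built with Parquet support.
        ak.connect()

    def test_int64_roundtrip(self):
        original = ak.randint(0, 100, SIZE, dtype=ak.int64)
        # The "dataset"/"dsetname" keywords are assumptions, as noted above.
        original.save_parquet("/tmp/pq_roundtrip", dataset="test-col")
        read_back = ak.read_parquet("/tmp/pq_roundtrip*", dsetname="test-col")
        self.assertTrue((original == read_back).all())
```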

@mhmerrill (Contributor) commented:

@reuster986 @bmcdonald3 we would like to merge this week. Is this ready to go, or do we need to do more?

@reuster986 (Collaborator) commented:

@mhmerrill I reviewed earlier this morning and requested a small logic change in the tests (see discussion above). After that, I think it's ready to go.

@bmcdonald3 (Contributor, Author) commented:

@mhmerrill This is all that we had planned to add for this initial Parquet support effort. We do have a couple of follow-up PRs nearly ready that improve error handling and performance and support additional Parquet types, among a few other small changes, but we wanted to get a few nights' worth of testing data on this initial work before adding those other changes, if that sounds good to you.

@mhmerrill merged commit d7f3544 into Bears-R-Us:master on Nov 30, 2021

@mhmerrill (Contributor) commented:

@bmcdonald3 @ronawho should we have a separate CI process to check this? or does the CI already check this?

@ronawho (Contributor) commented Dec 1, 2021

> @bmcdonald3 @ronawho should we have a separate CI process to check this? or does the CI already check this?

I think we discussed this offline, but it's tested by default in CI testing -- https://github.com/Bears-R-Us/arkouda/pull/967/files#diff-3ab46ee209a127470fce3c2cf106b1a1dbadbb929a4b5b13656a4bc4ce19c0b8R7

ronawho added a commit to chapel-lang/chapel that referenced this pull request Dec 1, 2021
Enable parquet support for Arkouda testing

Enable the opt-in parquet support for our nightly arkouda testing.

See Bears-R-Us/arkouda#967 for more info.
@bmcdonald3 deleted the parquet-support branch on January 27, 2022 20:29