Add support for optionally reading and writing Parquet files #967

mhmerrill merged 5 commits into Bears-R-Us:master from
Conversation
Overall this looks really good!
My main questions are about closing Parquet files and readers, whether we should throw an error versus return a MsgTuple with MsgType.Error from the Chapel read/write Parquet functions, and the use of `try!` versus throwing an error.
Great work!
EDIT:
Oh, one more thing: please add `src/ArrowFunctions.o` to `.gitignore` and add an empty `test/UnitTestParquetCpp.good`, since it also shows up as an untracked file in git.
    var high = min(d1.high, d2.high);
    if (d1.stride != 1) && (d2.stride != 1) {
      //TODO: change this to throw
      halt("At least one domain must have stride 1");
Yes, I agree, we should throw here.
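As a rough sketch of what the throwing version could look like, here is a hypothetical Python analog of the Chapel snippet above (the `intersect` name and the use of `range` objects in place of Chapel domains are illustrative, not the actual Arkouda code):

```python
def intersect(d1: range, d2: range) -> range:
    """Hypothetical Python analog of the Chapel snippet above:
    raise an error the caller can handle instead of halting."""
    if d1.step != 1 and d2.step != 1:
        raise ValueError("At least one domain must have stride 1")
    low = max(d1.start, d2.start)
    high = min(d1.stop, d2.stop)
    return range(low, high)
```

The caller can then catch the error and report it back to the client (e.g. as a MsgTuple with an error type) instead of the whole server going down.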
@glitch already found this, but FYI: this is based on #966 just because it reduces the amount of code we need to bring in for the Parquet dependency check. Ben also mentioned this a bit in the PR, but the intent is for this to be the first usable draft that we will then build on. The goal was to get something usable in so that we can start getting feedback, as well as start running nightly correctness/performance testing. There are a number of known next steps (and we'd be interested in input on prioritization of those). Note that better error-handling is one of those next steps: today there are some halts, and it's worth noting that in the C++ code we currently throw an uncaught exception, which is effectively the same as a halt. Turning that into proper error-handling should be straightforward, but it is a fair amount of boilerplate code, and we didn't want to make those changes just yet in case larger changes to the core code were required.

Edit: and to be clear, we'd be happy to make any error-handling or other changes you want in this PR; we just wanted to give context on what we were thinking going into this.
glitch
left a comment
I'd like to get a few more reviews from other team members but overall I think it looks pretty good.
Force-pushed 987b4fe to 401d4db
stress-tess
left a comment
Looks good to me! Nice work!
reuster986
left a comment
Great work, @bmcdonald3 ! I like the design here, especially how you handled the optional inclusion of parquet functionality at compile time in the makefile. I think it's a good prototype for how we can think about extending arkouda with other modular content.
I'm on board with the intention expressed by you and @ronawho to get basic functionality working in this PR and then come back and add more robust error handling, etc. later.
I just requested one minor change regarding how wildcards are handled, because I want us to adhere to the `ak.read_all` usage, rather than `ak.load_all`.
Well done!
@reuster986 do you (or others) have any input on prioritization of next steps? (There's an unordered list of potential next steps in the PR description.)
Thanks @reuster986, just pushed up some code that addresses your suggestion, assuming that I understand sufficiently. Also, I would note that the method for optional inclusion was work from @ronawho, so I cannot take credit for that!
Force-pushed ab43417 to 83fdb00
reuster986
left a comment
Other than a small change on the test logic, everything looks good.
tests/parquet_test.py (Outdated)

    if arr1[i] != arr2[i]:
        print(arr1[i], 'does not match', arr2[i], 'at index', i)
        return 1

changed to:

    if (arr1 != arr2).all():
I think this should be `if (arr1 != arr2).any():`.
As written, arrays will compare equal unless all of their values differ.
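A quick NumPy illustration of the difference, using small made-up arrays:

```python
import numpy as np

arr1 = np.array([1, 2, 3])
arr2 = np.array([1, 2, 99])   # one mismatched element

# .all() is True only if *every* element differs, so a single
# mismatch slips through the "not equal" check:
print((arr1 != arr2).all())   # False -- would wrongly report a match
# .any() flags the arrays as different if *any* element differs:
print((arr1 != arr2).any())   # True  -- catches the mismatch
```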
If you care about order, you could use NumPy to do the comparison, or just a straight list comparison:
# Numpy
>>> np.array_equal([1,2,3], [1,2,3])
True
>>> np.array_equal([1,2,3], [3,2,1])
False
# Straight list comparison
>>> [1, 2, 3] == [1, 2, 3]
True
>>> [1, 2, 3] == [3, 2, 1]
False
>>> [1, 2, 3] == [1, 2, 3, 4]
False
You should also return the boolean value instead of an integer.
@glitch I believe arr1 and arr2 are arkouda arrays. Although you could do `np.array_equal(arr1.to_ndarray(), arr2.to_ndarray())`, I guess I have a personal preference for staying in arkouda to avoid array transfers (even though those are fast now and these arrays are small, so at this point it's mostly a stylistic choice).
Also, it's been a while since I looked, but I think most of the test suite uses the `(a == b).all()` pattern, so it might be good to use that for consistency.
@reuster986 you are correct, `(a == b).all()` is the pattern elsewhere, and it seems that what other tests do most often is just assertTrue rather than a compare_values function like I was doing, so I have switched the test to that format to be more consistent with the other tests.
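For reference, the assertTrue pattern mentioned above looks roughly like this (a sketch with NumPy standing in for the arkouda arrays; the test class and method names are made up for illustration):

```python
import unittest

import numpy as np


class ParquetTest(unittest.TestCase):
    def test_round_trip(self):
        # NumPy stands in for arkouda arrays here; the pattern is the
        # same either way: assert elementwise equality via (a == b).all()
        written = np.array([1, 2, 3])
        read_back = np.array([1, 2, 3])
        self.assertTrue((written == read_back).all())
```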
@reuster986 @bmcdonald3 we would like to merge this week; is this ready to go, or do we need to do more?
@mhmerrill I reviewed earlier this morning and requested a small logic change in the tests (see discussion above). After that, I think it's ready to go.
@mhmerrill This is all that we had planned to add for this initial Parquet support effort. We do have a couple of follow-up PRs nearly ready that improve the error handling and performance and support additional Parquet types, among a few other small changes, but we wanted to get a few nights' worth of testing data on this initial work before adding those other changes, if that sounds good to you.
Force-pushed b9d12d9 to 81f9e3f
@bmcdonald3 @ronawho should we have a separate CI process to check this, or does the CI already check this?
I think we discussed this offline, but it's tested by default in CI testing -- https://github.com/Bears-R-Us/arkouda/pull/967/files#diff-3ab46ee209a127470fce3c2cf106b1a1dbadbb929a4b5b13656a4bc4ce19c0b8R7
Enable parquet support for Arkouda testing

Enable the opt-in parquet support for our nightly arkouda testing. See Bears-R-Us/arkouda#967 for more info.
This PR adds the ability to read and write int64 Parquet files in Arkouda through a `pdarray.save_parquet()` function and an `ak.read_parquet()` function. These functions behave in much the same way as the `pdarray.save()` and `ak.load()` functions (see docstrings). This is only the initial support, and more dtypes will likely be added as use necessitates.

For reading Parquet files, the Arrow C++ library was chosen due to library completeness and portability. The actual functionality is written in C++, with light C wrappers to allow leveraging of the Chapel C interoperability features, which requires the C++ code to be compiled into an object file prior to calling the functions from Chapel.

On the Chapel side, the functionality is modeled off of the HDF5 code, but currently only supports the int64 and int32 Arrow datatypes, which are read into int64 pdarrays.
Performance numbers collected on a single node Cray CS:
Potential next steps:
Parquet as an optional dependency:
Running a `make` with this PR will build Arkouda as normal, without requiring Arrow. Only when the environment variable `ARKOUDA_SERVER_PARQUET_SUPPORT` is set will the Arrow-requiring files be pulled into the build. This functions in much the same way as ZMQ and HDF5, where the path can be specified in `Makefile.paths`, and there is a provided `make install-arrow` command to download Arrow and build it from source.

Closes #903