
Python Bindings: Numpy Support #29

Merged: 42 commits into LLNL:develop from feature/py-bindings on Apr 18, 2019

Conversation

@SteVwonder (Member) commented Dec 13, 2018

@salasoom and @lindstro: just wanted to post this PR to update you on the progress of the python bindings. Support for compression/decompression from/to numpy arrays has been added, as well as a few tests to verify that a "round-trip" of compression/decompression works. @salasoom, if you have some time before the holidays, it would be great to discuss what other tests would be beneficial.
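
For reference, a round trip with the current bindings looks roughly like the following sketch (module and argument names follow this PR and may still change; the tolerance is illustrative; see the build note below):

import numpy as np
import zfp  # built from this branch

orig = np.random.rand(64, 64)
compressed = zfp.compress_numpy(orig, tolerance=1e-4)  # fixed-accuracy mode
recovered = zfp.decompress_numpy(compressed)
# fixed-accuracy mode bounds the absolute error by the tolerance
assert np.allclose(orig, recovered, atol=1e-4)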

TODO:

  • wrap main zfp functions and testing utility functions
  • add tests and add to Travis CI
  • floating-point (de-)compression for 1->4D matches known-good checksums
  • write_header argument for compress_numpy
  • add support for arbitrary strides rather than rearranging the array into Fortran-contiguous order
  • integer (de-)compression matches the known-good checksums
  • wrap the random strided array functions in test/utils and add to python tests
  • support decompression into a pre-allocated buffer
  • roll back changes to travis.sh
  • release GIL before calls to compress/decompress
  • update Cython flags to embed function signatures and docstrings
  • support for lossless compression
  • documentation
    • dependencies, configuration, compilation, testing, and installation
    • API docs and example usage
  • get tests passing on Travis
    • Linux
    • Mac
  • break up the large unittest method into finer-granularity chunks
  • requirements.txt

@SteVwonder (Member, Author) commented Dec 14, 2018

Oh, it's probably worth mentioning that the bindings aren't built by default. You have to pass -DBUILD_PYTHON=on on the CMake command line to build them.

@lindstro (Member)

@SteVwonder This is fantastic! I'm up against a paper deadline this week but either I and/or @salasoom will take a closer look over the next few days.

@salasoom (Member)

Can you base the PR from the develop branch? We exclude documentation and 99% of tests when making releases on master.

Now that we are finishing bindings in a couple of languages, we may rework the directory structure in the next release. For now, please put the tests in the tests/python directory.

When on develop, you'll see all the tests (and docs). For testing array compression/decompression, we generate ~million-element arrays, pass them into the compressor, and then checksum both the resulting bitstream and the decompressed array. You can find C libraries in tests/utils for generating the smooth random arrays and for checksumming the resulting memory (hash32/64). Calling those from Python would seem to be the way to go. We compare the checksums against those stored in tests/constants (macro-defined constants).

I've been meaning to refactor some of those utility libraries, so your input can help re-shape those APIs to be the least painful when calling from Python (ex. removing double pointers).

As far as benchmarking goes, that's something we've been meaning to get around to: an executable we can run on local machines, since we see a large range of test runtimes on Travis and AppVeyor. For now, we time how long compression and decompression take for those million-element arrays and print the timings as part of the test. At least then we have a history of CI log output with some timings.

@lindstro (Member)

@SteVwonder I've had a chance to skim your PR. I like a lot of what I'm seeing. Great job!

I also have a few comments and concerns:

compress_numpy (perhaps the most important function) supports fixed-accuracy mode only, while decompress_numpy supports all modes. Although we have to start somewhere, this is a significant limitation. We support only fixed-rate compression in the CUDA version, but there are good technical reasons for that, having to do with each thread not knowing where in the compressed stream to start. I know we'll be making changes to the Python API before the release, but I wanted to point out that we'll have to change this function to support the other modes. For now, it might be best to rename the function to make explicit that it handles only the fixed-accuracy mode. On the other hand, only a single line (zfp_stream_set_accuracy) has to change to also support fixed-precision and fixed-rate. Alternatively, we might consider passing in an object that encodes the compression mode, which is essentially what zfp_stream does, and then add three more short functions for setting the compression parameters. I think Markus or I can make this change.
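
For concreteness, the dispatch would amount to something like this sketch (assuming stream, ztype, and ndim are already set up inside compress_numpy, and that unset parameters use a negative sentinel, as later adopted in this thread):

if tolerance >= 0:
    zfp_stream_set_accuracy(stream, tolerance)         # fixed-accuracy
elif precision >= 0:
    zfp_stream_set_precision(stream, precision)        # fixed-precision
elif rate >= 0:
    zfp_stream_set_rate(stream, rate, ztype, ndim, 0)  # fixed-rate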

I don't think we should have a default tolerance. The user should be aware that this is something they need to specify. If they don't know how to set this, then I think it would be safer to use a tolerance of zero.

Another difference wrt. zfp_compress is that only one field can be compressed at a time. That might be fine for the Python interface since one can simply make multiple calls and concatenate the compressed byte strings. However, this does cause an incompatibility with files compressed via the C interface, as zfp_compress does not insert a header between each field. This is something worth documenting at least.

Does Cython expose internal implementation functions like _init_field in the global namespace? That would violate the ANSI standard, which reserves all public symbols starting with _.

class Memory(): Maybe check the return value of malloc?

The return value to zfp_decompress is not checked. It will be zero in case of failure. Maybe throw an exception?
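
Minimal sketches of both checks (Cython; names assume the structure of this PR's zfp.pyx and may differ; malloc cimported from libc.stdlib):

# in Memory's __cinit__
self.data = malloc(size)
if self.data == NULL:
    raise MemoryError("malloc failed")

# zfp_decompress returns the number of compressed bytes consumed, 0 on failure
if zfp_decompress(stream, field) == 0:
    raise RuntimeError("zfp_decompress failed")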

One potential concern: By default, zfp's compressed data is in multiples of 64-bit "words." Are there any concerns that Python might pass an unaligned array of bytes to decompress_numpy? That could cause issues on some platforms.

Regarding the changes to the LICENSE, LLNL has gotten picky about what can go in this file: see https://software.llnl.gov/about/licenses/. We may have to update the license and place this additional information in the python directory.

@SteVwonder force-pushed the feature/py-bindings branch 2 times, most recently from fdeb157 to 37ba369 on December 23, 2018 07:01
@SteVwonder changed the base branch from master to develop on December 23, 2018 07:05
@codecov-io commented Dec 23, 2018

Codecov Report

Merging #29 into develop will decrease coverage by 0.37%.
The diff coverage is 89.26%.


@@             Coverage Diff             @@
##           develop      #29      +/-   ##
===========================================
- Coverage    93.55%   93.18%   -0.38%     
===========================================
  Files           62       64       +2     
  Lines         3058     3477     +419     
===========================================
+ Hits          2861     3240     +379     
- Misses         197      237      +40
Impacted Files Coverage Δ
python/test_utils.pyx 86.29% <86.29%> (ø)
python/zfp.pyx 91.89% <91.89%> (ø)
src/inline/bitstream.c 94.81% <0%> (+3.7%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@lindstro (Member)

@SteVwonder Currently, there's an assumption that the numpy array uses C ordering, which may not hold. I think it would be safer to call zfp_field_set_stride_* in _init_field, translating the byte strides in ndarray.strides to zfp's scalar strides. See this discussion of the issue: zarr-developers/numcodecs#160.
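
The translation itself is mechanical; e.g., for a 2D array (a sketch; assumes field was created with zfp_field_2d; numpy strides are in bytes, zfp strides in scalars):

sy = arr.strides[0] // arr.itemsize  # slower-varying numpy axis
sx = arr.strides[1] // arr.itemsize  # faster-varying axis maps to zfp's x
zfp_field_set_stride_2d(field, sx, sy)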

@SteVwonder (Member, Author)

@lindstro, I looked into the type and exec problems reported in zarr-developers/numcodecs#160. I had no issues referencing the type field in the zfp_field struct (it is already used in this PR's commits), but I did have an issue referencing the exec field in the zfp_stream struct (Cython errors out with "invalid syntax").

I haven't found a general way to reference C fields/variables whose names overlap with reserved keywords in Python/Cython. Ultimately, though, I don't think this is much of an issue for zfp. Cython does not require that every field in a struct be listed, only those fields you need to access directly from Cython code. So we can simply omit exec from the zfp_stream definition in Cython and set it via the zfp_stream_set_execution function, as sketched below. If we ever need to reference the exec field directly from Cython/Python, we might need to change the name in C.
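
Setting it through the API would look something like this (a sketch; assumes zfp_stream_set_execution and zfp_exec_serial are declared in the Cython definitions):

# zfp_stream_set_execution returns nonzero on success
if zfp_stream_set_execution(stream, zfp_exec_serial) == 0:
    raise RuntimeError("execution policy not supported")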

@lindstro (Member)

@SteVwonder I'm fine with that for the current Python implementation. I still think we should take this into consideration and change C struct member names for the next release to avoid issues like this.

With regards to the type field, is it possible that the behavior differs between Python 2 and 3, or did you test both?

@SteVwonder (Member, Author)

Just pushed a commit adding support for fixed-rate and fixed-precision modes. The compress_numpy function now has two additional "optional" arguments. Exactly one of the "optional" arguments (i.e., rate, precision, tolerance) must be provided by the user; otherwise an exception is thrown. This eliminates the need for three separate functions, but it is a bit odd to require an "optional" argument. As a quick sanity check, is -1 a valid value for rate or tolerance? (I'm currently using -1 as the default value for all three optional arguments under the assumption it's an invalid value.)
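
The check itself is short; a sketch of the logic, given the -1 defaults (the exception type is illustrative):

given = [p for p in (rate, precision, tolerance) if p != -1]
if len(given) != 1:
    raise ValueError("exactly one of rate, precision, or tolerance must be specified")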

Hopefully this works for the EOY milestone, and we can come up with a better interface for the March release. Also, same caveat as before that the current python test cases just verify that nothing catastrophic happens when doing a round-trip compression/decompression using the bindings.

I've been meaning to refactor some of those utility libraries, so your input can help re-shape those APIs to be the least painful when calling from Python (ex. removing double pointers).

After peeking at the utility libraries, I think I have a few ideas on how to make them more usable from Python. @salasoom, can we meet in person after the holidays to go over these tests in more depth and brainstorm?

Can you base the PR from the develop branch?
For now, please separate the tests into dir tests/python

👍 Done.


I don't think we should have a default tolerance. The user should be aware that this is something they need to specify. If they don't know how to set this, then I think it would be safer to use a tolerance of zero.

Makes sense. Removed the default tolerance as part of the fixed-rate/precision commit.

Another difference wrt. zfp_compress is that only one field can be compressed at a time.

That's a good point. I can document that as a limitation of the current numpy-specific interface. Would adding a boolean output_header argument that toggles the writing of the header make it compatible with files compressed via the C interface?

Does Cython expose internal implementation functions like _init_field in the global namespace? That would violate the ANSI standard, which reserves all public symbols starting with _.

__init__ doesn't get exported as a symbol in the shared library, but other functions that start with _ are exported, like _PyBytes_Type and ___pyx_module_is_main_zfp. I don't think this should be an issue since only Python will be dlopening the shared library, but maybe I'm missing something.

class Memory(): Maybe check the return value of malloc?
The return value to zfp_decompress is not checked. It will be zero in case of failure. Maybe throw an exception?

Good ideas. Both are done.


I still think we should take this into consideration and change C struct member names for the next release to avoid issues like this.

👍 sounds good to me. Also, I spoke too soon: it appears that Cython does support struct members whose names conflict with Python keywords (see the Cython reference). I believe the following should work (I'll try it out tomorrow).

ctypedef struct zfp_stream:
    zfp_exec _exec "exec"

With regards to the type field, is it possible that the behavior differs between Python 2 and 3, or did you test both?

That is possible. I have only been able to test against python 3 while on my mac. I can test against both py2 & 3 on Linux tomorrow.

@lindstro (Member)

@SteVwonder Well done! I have a few comments below.

Just pushed a commit adding support for fixed-rate and fixed-precision modes. The compress_numpy function now has two additional "optional" arguments. Exactly one of the "optional" arguments (i.e., rate, precision, tolerance) must be provided by the user; otherwise an exception is thrown. This eliminates the need for three separate functions, but it is a bit odd to require an "optional" argument. As a quick sanity check, is -1 a valid value for rate or tolerance? (I'm currently using -1 as the default value for all three optional arguments under the assumption it's an invalid value.)

Thanks--I think this is fine for now. -1 is not valid for any of the three parameters, so that'll work. But I agree that we need a cleaner interface for the release.

Hopefully this works for the EOY milestone, and we can come up with a better interface for the March release. Also, same caveat as before that the current python test cases just verify that nothing catastrophic happens when doing a round-trip compression/decompression using the bindings.

OK. I'm still concerned about the memory layout of the numpy array. Did you get a chance to look into using strides? I believe asfortranarray actually makes a permuted copy of the array, which could be expensive and unnecessary. Instead, I would suggest mapping shape[3] to nx, shape[2] to ny, and so on. This corresponds to the default ndarray layout and presumably the most common case. Then set the zfp strides to guard against other layouts. This ensures that in the most common case we traverse the numpy array in sequential order, and no deep copies are needed.
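
For a 3D C-ordered array, the mapping would look like this sketch (field creation elided; strides in scalar units):

nz, ny, nx = arr.shape  # C order: the last axis varies fastest and maps to zfp's x
sx = arr.strides[2] // arr.itemsize
sy = arr.strides[1] // arr.itemsize
sz = arr.strides[0] // arr.itemsize
zfp_field_set_stride_3d(field, sx, sy, sz)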

I've been meaning to refactor some of those utility libraries, so your input can help re-shape those APIs to be the least painful when calling from Python (ex. removing double pointers).

After peeking at the utility libraries, I think I have a few ideas on how to make them more usable from Python. @salasoom, can we meet in person after the holidays to go over these tests a bit more in-depth and brainstorm?

Can you base the PR from the develop branch?
For now, please separate the tests into dir tests/python

+1 Done.

I don't think we should have a default tolerance. The user should be aware that this is something they need to specify. If they don't know how to set this, then I think it would be safer to use a tolerance of zero.

Makes sense. Removed the default tolerance as part of the fixed-rate/precision commit.

Another difference wrt. zfp_compress is that only one field can be compressed at a time.

That's a good point. I can document that as a limitation of the current numpy-specific interface. Would adding a boolean output_header argument that toggles the writing of the header make it compatible with files compressed via the C interface?

Yes, I think that makes sense.

Does Cython expose internal implementation functions like _init_field in the global namespace? That would violate the ANSI standard, which reserves all public symbols starting with _.

__init__ doesn't get exported as a symbol in the shared library, but other functions that start with _ are exported, like _PyBytes_Type and ___pyx_module_is_main_zfp. I don't think this should be an issue since only Python will be dlopening the shared library, but maybe I'm missing something.

That's fine. I was just concerned about polluting the global namespace with internal functions and doing so in an ANSI non-compliant manner.

class Memory(): Maybe check the return value of malloc?
The return value to zfp_decompress is not checked. It will be zero in case of failure. Maybe throw an exception?

Good ideas. Both are done.

I still think we should take this into consideration and change C struct member names for the next release to avoid issues like this.

+1 sounds good to me. Also, I spoke too soon: it appears that Cython does support struct members whose names conflict with Python keywords (see the Cython reference). I believe the following should work (I'll try it out tomorrow).

Nice.

ctypedef struct zfp_stream:
    zfp_exec _exec "exec"

With regards to the type field, is it possible that the behavior differs between Python 2 and 3, or did you test both?

That is possible. I have only been able to test against python 3 while on my mac. I can test against both py2 & 3 on Linux tomorrow.

Sounds good. I noticed that this PR does not pass on Travis. Do you know why?

@SteVwonder force-pushed the feature/py-bindings branch 2 times, most recently from d2ba3a8 to ebe63fa on December 31, 2018 04:18
@SteVwonder (Member, Author)

Did you get a chance to look into using strides?

I have not, but I agree that strides are definitely the cleaner way to go in the long term. Unfortunately, I will not be able to implement that before the EOY.

I noticed that this PR does not pass on Travis. Do you know why?

I'm not 100% sure of the root cause, but I think it has to do with either the version of python (v3.4) or Cython (v?) in Ubuntu Trusty. After switching to Ubuntu Xenial, everything passes. I will investigate that further.

Regarding the changes to the LICENSE, LLNL has gotten picky about what can go in this file: see software.llnl.gov/about/licenses. We may have to update the license and place this additional information in the python directory.

Forgot to mention earlier that I made this change. The top-level LICENSE is unchanged, and there are two subdirectories under the python directory which contain copied code along with a copy of the original licenses for that code.

That's a good point. I can document that as a limitation of the current numpy-specific interface. Would adding a boolean output_header argument that toggles the writing of the header make it compatible with files compressed via the C interface?

Yes, I think that makes sense.

Is there a file currently generated as part of the testsuite to test this against?

PS - I added a commit that removes references to reserved keywords like bytes and type.

@lindstro (Member)

Did you get a chance to look into using strides?

I have not, but I agree that strides are definitely the cleaner way to go in the long term. Unfortunately, I will not be able to implement that before the EOY.

No worries. Let's tackle that next year. Would be nice to verify that the Python version can decompress a file written by the zfp command-line tool (see more below).

I noticed that this PR does not pass on Travis. Do you know why?

I'm not 100% sure of the root cause, but I think it has to do with either the version of python (v3.4) or Cython (v?) in Ubuntu Trusty. After switching to Ubuntu Xenial, everything passes. I will investigate that further.

Nice!

Regarding the changes to the LICENSE, LLNL has gotten picky about what can go in this file: see software.llnl.gov/about/licenses. We may have to update the license and place this additional information in the python directory.

Forgot to mention earlier that I made this change. The top-level LICENSE is unchanged, and there are two subdirectories under the python directory which contain copied code along with a copy of the original licenses for that code.

Sounds good.

That's a good point. I can document that as a limitation of the current numpy-specific interface. Would adding a boolean output_header argument that toggles the writing of the header make it compatible with files compressed via the C interface?

Yes, I think that makes sense.

Is there a file currently generated as part of the testsuite to test this against?

No, we don't include binaries with zfp. But, what I would suggest is that you generate a small rectangular 2D or 3D floating-point array and compress it with the zfp command-line tool (use -h to write a header). A linear sequence of integers will do: f(x, y, z) = x + nx * (y + ny * z). You want to make sure nx != ny != nz, e.g., nx = 15, ny = 16, nz = 17. Then feed the compressed byte stream to your decompressor and verify that it works correctly. You could use -a 0 to achieve near lossless compression.

Then repeat with zfPy compressing the file and the command-line tool decompressing.
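
For concreteness, the suggested array is just a reshaped arange; a sketch (file names and the exact CLI invocation are illustrative; -d selects double precision):

import numpy as np

nx, ny, nz = 15, 16, 17
# f(x, y, z) = x + nx * (y + ny * z), with x varying fastest
f = np.arange(nx * ny * nz, dtype=np.float64).reshape(nz, ny, nx)
f.tofile("input.raw")
# then, e.g.: zfp -h -a 0 -d -3 15 16 17 -i input.raw -z compressed.zfp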

PS - I added a commit that removes references to reserved keywords like bytes and type.

Can you please elaborate? Surely zfp_field.type needs to be referenced somewhere?

We should be careful to avoid Python keywords in the zfp C implementation. I tried to locate a list of reserved keywords but did not see either type or bytes referenced. Are these provided by Python modules?

@SteVwonder (Member, Author)

No, we don't include binaries with zfp. But, what I would suggest is that you generate a small rectangular 2D or 3D floating-point array and compress it with the zfp command-line tool (use -h to write a header). A linear sequence of integers will do: f(x, y, z) = x + nx * (y + ny * z). You want to make sure nx != ny != nz, e.g., nx = 15, ny = 16, nz = 17. Then feed the compressed byte stream to your decompressor and verify that it works correctly. You could use -a 0 to achieve near lossless compression.
Then repeat with zfPy compressing the file and the command-line tool decompressing.

zfPy: I like that stylization 😄

👍 Sounds like a solid plan. We should be able to write the "compress with zfPy and decompress with CLI" test after adding the write_header argument to compress_numpy. I think we'll need to expose zfp_field as a "first-class citizen" before we can write the reverse test, since that requires passing metadata about the array (i.e., type and dimensionality) to the decompression function.

We should be careful to avoid Python keywords in the zfp C implementation. I tried to locate a list of reserved keywords but did not see either type or bytes referenced. Are these provided by Python modules?

Sorry, I was being sloppy with my terms. type and bytes are technically built-in functions; you can find the full list of built-ins in the Python documentation. As for reserved keywords, the best reference I could find was in the Cython source code.

PS - I added a commit that removes references to reserved keywords like bytes and type.

Can you please elaborate? Surely zfp_field.type needs to be referenced somewhere?

You are absolutely correct; it does need to be referenced. In the case of type, I used a feature of Cython to "rename" the struct member to _type. The struct definition now looks like:

ctypedef struct zfp_field:
    zfp_type _type "type"
    ...

And the references to the member look like: cdef zfp_type ztype = field[0]._type. I didn't actually run into any issues personally with type, other than syntax highlighting oddities, but out of an abundance of caution (paranoia?), I made the switch. Since it wasn't actually an issue, I can revert the change if you want. Since exec is actually a reserved keyword and not just a built-in function, we will need to do something similar to handle that case (at least until the definition in C changes).

In the case of bytes, it was only used in function declarations at the top of the file (i.e., bitstream* stream_open(void* data, size_t bytes);). Cython actually requires only the parameter types, not their names, so I changed it to bitstream* stream_open(void* data, size_t);.

@lindstro (Member) commented Jan 7, 2019

Just pushed a commit adding support for fixed-rate and fixed-precision modes. The compress_numpy function now has two additional "optional" arguments. Exactly one of the "optional" arguments (i.e., rate, precision, tolerance) must be provided by the user; otherwise an exception is thrown. This eliminates the need for three separate functions, but it is a bit odd to require an "optional" argument. As a quick sanity check, is -1 a valid value for rate or tolerance? (I'm currently using -1 as the default value for all three optional arguments under the assumption it's an invalid value.)

@SteVwonder We now have support for reversible (lossless) compression (see feature/lossless branch). I think a quite natural default would be to use reversible compression when the user does not specify a rate, precision, or tolerance. We'll have to work on integrating these branches, but I wanted to mention this as a perhaps better alternative than throwing an exception when no optional argument is passed.

@rabernat commented Jan 9, 2019

Following this work with interest.

I have a question that I couldn't quite answer from reading the code: can the decompress_numpy decompress directly into a pre-allocated buffer? This would be useful for zarr-developers/numcodecs#117.

@lindstro (Member) commented Jan 9, 2019

Following this work with interest.

I have a question that I couldn't quite answer from reading the code: can the decompress_numpy decompress directly into a pre-allocated buffer? This would be useful for zarr-developers/numcodecs#117.

@SteVwonder can correct me if I'm wrong, but the current implementation allocates a buffer (ndarray). The reason for this is that the caller of decompress_numpy does not necessarily know how large a buffer to allocate since the header information that stores array dimensions and scalar type is part of the compressed stream.

If this metadata were available elsewhere, then I believe there's no technical reason why we could not supplement the API with a function that also accepts the buffer to decompress into. We'd want to ensure that the buffer is large enough to hold the decompressed data. The zfp C implementation requires the caller to allocate the buffer anyway, which is made easier by separating the calls for parsing the header and decompressing the array.
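
If we add such a function, the buffer handling could be as simple as the following sketch (a hypothetical helper, not part of this PR):

import numpy as np

def _resolve_dest(dest, shape, dtype):
    # allocate if the caller didn't pass a buffer; otherwise validate its size
    if dest is None:
        return np.empty(shape, dtype=dtype)
    needed = int(np.prod(shape)) * np.dtype(dtype).itemsize
    if dest.nbytes < needed:
        raise ValueError("dest buffer too small for decompressed data")
    return dest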

@rabernat Can you point us to an example of how such functionality is being used in Zarr/numcodecs?

@rabernat commented Jan 9, 2019

@rabernat Can you point us to an example of how such functionality is being used in Zarr/numcodecs?

The Python wrapper for blosc in numcodecs contains an optional dest argument:
https://github.com/zarr-developers/numcodecs/blob/master/numcodecs/blosc.pyx#L330

@SteVwonder (Member, Author)

@SteVwonder can correct me if I'm wrong, but the current implementation allocates a buffer (ndarray). The reason for this is that the caller of decompress_numpy does not necessarily know how large a buffer to allocate since the header information that stores array dimensions and scalar type is part of the compressed stream.

👍

If this metadata were available elsewhere, then I believe there's no technical reason why we could not supplement the API with a function that also accepts the buffer to decompress into.

Agreed!

@lindstro (Member)

@rabernat Can you point us to an example of how such functionality is being used in Zarr/numcodecs?

The Python wrapper for blosc in numcodecs contains an optional dest argument:
https://github.com/zarr-developers/numcodecs/blob/master/numcodecs/blosc.pyx#L330

Seems like a reasonable solution. We should be able to do something similar.

zfp supports parallel compression using OpenMP and CUDA. Is that something that would be useful in Zarr?

@rabernat

zfp supports parallel compression using OpenMP and CUDA. Is that something that would be useful in Zarr?

Definitely!

I'm a bit concerned that we now have two competing implementations of the python bindings for zfp, the one here and the one from @halehawk in zarr-developers/numcodecs#160.

@halehawk (Contributor) commented Jan 10, 2019 via email

@romankarlstetter

Nice to see work on python bindings for zfp.

I would like to test the python bindings for zfp. How would I do this? As far as I can tell, there is no release yet. How can I install and use the current development version of these bindings?

Thanks in advance :)

SteVwonder and others added 27 commits April 17, 2019 15:52
includes validation for the number of dimensions, zfp_type, zfp_mode, and
compression parameter number
remove call to `asfortranarray` in `compress_numpy`, replace with calls
to `zfp_field_set_stride_*d` when creating the `zfp_field`

tests of floating point arrays beyond 1D now work
precision and accuracy compression aren't designed for integer types;
don't use them in the checksum testing
pull input validation out into separate functions
Per @romankarlstetter's suggestion, make sure docstrings and function
signatures are included in the compiled bindings. This greatly improves
python binding usage in interactive scenarios, like Jupyter Notebooks.
enables other Python threads to execute while running the
computationally expensive C functions zfp_compress and zfp_decompress
raise a RuntimeError if the write failed and the return is 0
instead, leverage numpy's itemsize attribute
…ent compilers, plus some 2.7 (osx included)

fix python test utils compile error
flipping the strides in the `compress_numpy` function means we have to
provide numpy with the reverse of how we actually want the data compressed
add unittests for this functionality
The associated metadata (ztype and shape) are now required arguments.
There is no longer an attempt to read the header, and thus there is no
validation of user-provided metadata against header metadata.  Remove
strides since they aren't tested and aren't fully working.
@salasoom merged commit 2d1216d into LLNL:develop on Apr 18, 2019
@SteVwonder deleted the feature/py-bindings branch on April 18, 2019 16:33