Foreign b2nd array compatibility #1072
To avoid confusion with the HDF5 array dataset itself.
Thanks to @martaiborra for the clarification.
Instead of checking that the array's chunk shape matches the dataset's chunk shape, check that the array's whole shape matches the dataset's chunk shape. PyTables stores one Blosc2 chunk per HDF5 chunk, but it should be able to cope with Blosc2 frames containing several chunks (since reading does not really operate at the Blosc2 chunk level), as long as the whole Blosc2 array has the proper shape. This may ease having PyTables read b2nd-compressed arrays where dataset chunks contain several Blosc2 chunks.
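The relaxed check described above can be sketched as follows. This is an illustrative pure-Python sketch, not the actual PyTables internals; the function name and parameters are hypothetical.

```python
# Hypothetical sketch of the relaxed compatibility check: instead of
# requiring the Blosc2 array's chunk shape to equal the HDF5 dataset's
# chunk shape, only require that the whole Blosc2 array has the shape
# of one HDF5 chunk, whatever its inner chunking may be.

def b2nd_chunk_is_compatible(b2nd_shape, hdf5_chunkshape):
    """Accept any inner Blosc2 chunking as long as the whole Blosc2
    array covers exactly one HDF5 dataset chunk."""
    return tuple(b2nd_shape) == tuple(hdf5_chunkshape)

# A 64x64 HDF5 chunk stored as a Blosc2 array that is internally split
# into several smaller Blosc2 chunks is still acceptable:
assert b2nd_chunk_is_compatible((64, 64), (64, 64))
# A Blosc2 array whose overall shape differs from the HDF5 chunk is not:
assert not b2nd_chunk_is_compatible((32, 64), (64, 64))
```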
These checks were already made by the filter, but were still missing here.
This increases compatibility with datasets written by other tools, especially if they also use b2nd for scalar or one-dimensional data, as the chunk rank/shape filter values that PyTables uses for the checks may be missing (and the filter set function would not set them either, since rank < 2).
The issue seems to have vanished on Guix (GCC 12.3.0), let us see what CI says about Ubuntu.
Looks very good. Thanks @ivilata !
Fix broken b2nd optimized slice assembly and tests

This fixes the assembly of slices obtained via Blosc2 ND optimized slicing, which was using `memcpy` from the outer dimension of each chunk slice instead of the inner one. The new code avoids the manual assembly of the slice altogether by leaving the job to `b2nd_copy_buffer`, which was published in C-Blosc2 2.11.0 (thus the dependencies on C-Blosc2 and python-blosc2 are updated too).

A new unit test `tables.test_carray.Blosc2Ndim3MinChunkOptTestCase` has been added that would trigger the error in the presence of the bug, to avoid regressions. Also, this fixes other unit tests that had been added for b2nd optimized slicing but were not enabled.

Finally, `tables.test_carray.Blosc2NDNoChunkshape` has been added to check compatibility with arrays that contain b2nd chunks but do not include the extra filter parameters with the chunk rank and shape (e.g. because they were created with code other than `hdf5-blosc2`, see #1072).
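The nature of the copy bug can be illustrated with a small pure-Python sketch (not the PyTables C code): assembling a multidimensional slice must copy contiguous runs along the *inner* (last) dimension, in the spirit of C-Blosc2's `b2nd_copy_buffer`; copying whole rows keyed off the outer dimension writes the wrong data. The function and variable names below are illustrative only.

```python
# Copy a `shape`-sized 2-D block from `src` into `dst`, run by run
# along the inner dimension. Each contiguous run corresponds to what
# a correct memcpy would transfer; the original bug took its runs
# from the outer dimension of the chunk slice instead.

def copy_block(src, src_start, dst, dst_start, shape):
    rows, cols = shape
    si, sj = src_start
    di, dj = dst_start
    for r in range(rows):  # iterate over the outer dimension
        # each contiguous run lies along the inner (last) dimension
        dst[di + r][dj:dj + cols] = src[si + r][sj:sj + cols]

src = [[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]]
dst = [[0] * 4 for _ in range(4)]
# Copy the 2x2 block starting at (1, 1) of src into the top-left of dst.
copy_block(src, (1, 1), dst, (0, 0), (2, 2))
assert dst[0][:2] == [5, 6]
assert dst[1][:2] == [8, 9]
```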
/* Although blosc2_decompress_ctx ("else" branch) can decompress b2nd-formatted data,
 * there may be padding bytes when the chunkshape is not a multiple of the blockshape,
 * and only b2nd machinery knows how to handle these correctly.
 */
if (blosc2_meta_exists(schunk, "b2nd") >= 0
    || blosc2_meta_exists(schunk, "caterva") >= 0) {
I think we can remove the "caterva" check here, as it never reached production status and its functionality has been included in "b2nd".
This expands the work on b2nd array support with direct chunking (#1056) to better handle such arrays created with other tools, e.g. when the filter values with the chunk rank and shape are missing, or by being more relaxed about the format of the b2nd array used for storing the chunk (e.g. not requiring it to consist of a single inner chunk or to have a specific blocksize). Some missing tests on chunk data shape have been added in the optimized path.
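The relaxed acceptance rules described above can be summarized with a hypothetical sketch. The function names, parameters, and exact conditions are illustrative, not the PyTables implementation; the point is that only the overall shape and item size of the foreign b2nd chunk must agree with the HDF5 dataset chunk, while the inner chunk/block layout is left free.

```python
from math import prod

def foreign_b2nd_chunk_ok(b2nd_shape, b2nd_itemsize,
                          hdf5_chunkshape, hdf5_itemsize):
    """Accept a b2nd chunk written by another tool as long as its
    overall shape and element size match the dataset chunk."""
    if b2nd_itemsize != hdf5_itemsize:
        return False  # element sizes must agree
    # Only the overall shape matters; a single inner chunk or a
    # particular blocksize is no longer required.
    return tuple(b2nd_shape) == tuple(hdf5_chunkshape)

def expected_chunk_bytes(hdf5_chunkshape, itemsize):
    """Size check applied to the decompressed chunk data."""
    return prod(hdf5_chunkshape) * itemsize

assert foreign_b2nd_chunk_ok((16, 16), 8, (16, 16), 8)
assert not foreign_b2nd_chunk_ok((16, 16), 4, (16, 16), 8)
assert expected_chunk_bytes((16, 16), 8) == 2048
```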
A new example script has been added to test writing and reading partial chunks.
Finally, the workaround for stack smashing on some versions of GCC has been removed (it seems not to be needed anymore).