Foreign b2nd array compatibility #1072
To avoid confusion with the HDF5 array dataset itself.
Thanks to @martaiborra for the clarification.
Instead of checking that the array's chunk shape matches the dataset's chunk shape, check that the array's whole shape matches the dataset's chunk shape. PyTables stores one Blosc2 chunk per HDF5 chunk, but it should be able to cope with Blosc2 frames containing several chunks (since reading does not really operate at the Blosc2 chunk level), as long as the whole Blosc2 array has the proper shape. This may ease having PyTables read b2nd-compressed arrays where dataset chunks contain several Blosc2 chunks.
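The relaxed check described above can be sketched as follows. This is an illustrative pure-Python sketch, not the actual PyTables internals; the function name and parameters are hypothetical.

```python
# Hypothetical sketch of the relaxed compatibility check: instead of
# requiring the Blosc2 array's chunk shape to equal the HDF5 dataset's
# chunk shape, only require that the whole Blosc2 array has the shape
# of one HDF5 chunk, whatever its inner chunking may be.

def b2nd_chunk_is_compatible(b2nd_shape, hdf5_chunkshape):
    """Accept any inner Blosc2 chunking as long as the whole Blosc2
    array covers exactly one HDF5 dataset chunk."""
    return tuple(b2nd_shape) == tuple(hdf5_chunkshape)

# A 64x64 HDF5 chunk stored as a Blosc2 array that is internally split
# into several smaller Blosc2 chunks is still acceptable:
assert b2nd_chunk_is_compatible((64, 64), (64, 64))
# A Blosc2 array whose overall shape differs from the HDF5 chunk is not:
assert not b2nd_chunk_is_compatible((32, 64), (64, 64))
```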
These checks were already made by the filter, but were still missing here.
This increases compatibility with datasets written by other tools, especially if they also use b2nd for scalar or one-dimensional data, as the chunk rank/shape filter values that PyTables uses for the checks may be missing (and the filter set function would not set them either, since rank < 2).
The issue seems to have vanished on Guix (GCC 12.3.0), let us see what CI says about Ubuntu.
Looks very good. Thanks @ivilata !
Fix broken b2nd optimized slice assembly and tests

This fixes the assembly of slices obtained via Blosc2 ND optimized slicing, which was using `memcpy` from the outer dimension of each chunk slice instead of the inner one. The new code avoids the manual assembly of the slice altogether by leaving the job to `b2nd_copy_buffer`, which was published in C-Blosc2 2.11.0 (thus the dependencies on C-Blosc2 and python-blosc2 are updated too).

A new unit test `tables.test_carray.Blosc2Ndim3MinChunkOptTestCase` has been added that would trigger the error in the presence of the bug, to avoid regressions. Also, this fixes other unit tests that had been added for b2nd optimized slicing but were not enabled.

Finally, `tables.test_carray.Blosc2NDNoChunkshape` has been added to check compatibility with arrays that contain b2nd chunks but do not include the extra filter parameters with the chunk rank and shape (e.g. because they were created with code other than `hdf5-blosc2`, see #1072).
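The nature of the copy bug can be illustrated with a small pure-Python sketch (not the PyTables C code): assembling a multidimensional slice must copy contiguous runs along the *inner* (last) dimension, in the spirit of C-Blosc2's `b2nd_copy_buffer`; copying whole rows keyed off the outer dimension writes the wrong data. The function and variable names below are illustrative only.

```python
# Copy a `shape`-sized 2-D block from `src` into `dst`, run by run
# along the inner dimension. Each contiguous run corresponds to what
# a correct memcpy would transfer; the original bug took its runs
# from the outer dimension of the chunk slice instead.

def copy_block(src, src_start, dst, dst_start, shape):
    rows, cols = shape
    si, sj = src_start
    di, dj = dst_start
    for r in range(rows):  # iterate over the outer dimension
        # each contiguous run lies along the inner (last) dimension
        dst[di + r][dj:dj + cols] = src[si + r][sj:sj + cols]

src = [[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]]
dst = [[0] * 4 for _ in range(4)]
# Copy the 2x2 block starting at (1, 1) of src into the top-left of dst.
copy_block(src, (1, 1), dst, (0, 0), (2, 2))
assert dst[0][:2] == [5, 6]
assert dst[1][:2] == [8, 9]
```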
/* Although blosc2_decompress_ctx ("else" branch) can decompress b2nd-formatted data,
 * there may be padding bytes when the chunkshape is not a multiple of the blockshape,
 * and only b2nd machinery knows how to handle these correctly.
 */
if (blosc2_meta_exists(schunk, "b2nd") >= 0
    || blosc2_meta_exists(schunk, "caterva") >= 0) {
I think we can remove the "caterva" check here, as it never reached production status and its functionality has been included in "b2nd".
This expands the work on b2nd array support with direct chunking (#1056) to better handle such arrays created with other tools, e.g. when the filter values with the chunk rank and shape are missing, or by being more relaxed about the format of the b2nd array used for storing the chunk (e.g. not requiring it to consist of a single inner chunk or to have a specific blocksize). Some missing tests on chunk data shape have been added in the optimized path.
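The relaxed acceptance rules described above can be summarized with a hypothetical sketch. The function names, parameters, and exact conditions are illustrative, not the PyTables implementation; the point is that only the overall shape and item size of the foreign b2nd chunk must agree with the HDF5 dataset chunk, while the inner chunk/block layout is left free.

```python
from math import prod

def foreign_b2nd_chunk_ok(b2nd_shape, b2nd_itemsize,
                          hdf5_chunkshape, hdf5_itemsize):
    """Accept a b2nd chunk written by another tool as long as its
    overall shape and element size match the dataset chunk."""
    if b2nd_itemsize != hdf5_itemsize:
        return False  # element sizes must agree
    # Only the overall shape matters; a single inner chunk or a
    # particular blocksize is no longer required.
    return tuple(b2nd_shape) == tuple(hdf5_chunkshape)

def expected_chunk_bytes(hdf5_chunkshape, itemsize):
    """Size check applied to the decompressed chunk data."""
    return prod(hdf5_chunkshape) * itemsize

assert foreign_b2nd_chunk_ok((16, 16), 8, (16, 16), 8)
assert not foreign_b2nd_chunk_ok((16, 16), 4, (16, 16), 8)
assert expected_chunk_bytes((16, 16), 8) == 2048
```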
A new example script has been added to test writing and reading partial chunks.
Finally, the workaround for stack smashing on some versions of GCC has been removed (it seems not to be needed anymore).