Use H5Dchunk_iter rather than H5Dget_chunk_info for Blosc2 #991

Closed
mkitti opened this issue Jan 5, 2023 · 7 comments

mkitti commented Jan 5, 2023

HDF5 1.14 introduces H5Dchunk_iter, which provides a more efficient way to iterate over all chunks, avoiding the $O(N^2)$ scaling of calling H5Dget_chunk_info once per chunk, where $N$ is the number of chunks. HDF5 1.14 was recently released, and the function is being backported; it is expected to land in 1.12.3 and 1.10.10.

Docs: https://docs.hdfgroup.org/hdf5/develop/group___h5_d.html#title6
Original pull request: HDFGroup/hdf5#6
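
As a rough sketch of what the two approaches look like from the C API (assuming HDF5 ≥ 1.14 or a release with H5Dchunk_iter backported; the file name, dataset path, and printed fields are placeholders, and error checking is omitted):

```c
/* Sketch: enumerating the written chunks of a chunked dataset two ways.
 * "example.h5" and "/data" are placeholders; error checking is omitted. */
#include <hdf5.h>
#include <stdio.h>

/* Callback invoked once per allocated chunk: a single pass over the index. */
static int
print_chunk(const hsize_t *offset, unsigned filter_mask, haddr_t addr,
            hsize_t size, void *op_data)
{
    (void)filter_mask;
    (void)op_data;
    printf("chunk offset[0]=%llu addr=%llu nbytes=%llu\n",
           (unsigned long long)offset[0], (unsigned long long)addr,
           (unsigned long long)size);
    return H5_ITER_CONT;
}

int
main(void)
{
    hid_t file = H5Fopen("example.h5", H5F_ACC_RDONLY, H5P_DEFAULT);
    hid_t dset = H5Dopen2(file, "/data", H5P_DEFAULT);

    /* HDF5 >= 1.14: one traversal of the chunk index. */
    H5Dchunk_iter(dset, H5P_DEFAULT, print_chunk, NULL);

    /* Older approach: H5Dget_chunk_info looks up the i-th chunk on every
     * call, so enumerating all N chunks costs O(N^2) overall. */
    hid_t   space   = H5Dget_space(dset);
    hsize_t nchunks = 0;
    H5Dget_num_chunks(dset, space, &nchunks);
    for (hsize_t i = 0; i < nchunks; i++) {
        hsize_t  offset[H5S_MAX_RANK];
        hsize_t  nbytes      = 0;
        unsigned filter_mask = 0;
        haddr_t  addr        = HADDR_UNDEF;
        H5Dget_chunk_info(dset, space, i, offset, &filter_mask, &addr, &nbytes);
    }
    H5Sclose(space);

    H5Dclose(dset);
    H5Fclose(file);
    return 0;
}
```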

cc: @oscargm98 @FrancescAlted

@FrancescAlted (Member) commented:

Thanks @mkitti, that's interesting. I think the speed advantage should be more evident when chunks are small, which is not the usual case for Blosc2 chunks, but it is worth considering for future releases anyway.

@avalentino added this to the 3.8.1 milestone on Jan 5, 2023

mkitti commented Jan 5, 2023

More directly, it depends on the number of chunks. Generally, I suppose, if the chunks are small, then you tend to have many chunks. You can still end up with many chunks, though, if the dataset is very large in total.
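
(Purely as an illustration, with made-up round numbers: with ~36 MB chunks like the Blosc2 figures below, a hypothetical 10 TB dataset would still hold roughly $10\,\mathrm{TB} / 36\,\mathrm{MB} \approx 3 \times 10^5$ chunks, so an $O(N^2)$ enumeration via H5Dget_chunk_info implies on the order of $10^{11}$ chunk-index lookups, versus a single pass with H5Dchunk_iter.)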

@FrancescAlted (Member) commented:

For the record, I was curious about how much overhead HDF5 adds in, e.g., the in-kernel query case, and it is really low (< 1%) when using Blosc2:

[profiler screenshot: in-kernel query with Blosc2]

Note that the time is spent mainly on reading from the OS filesystem cache and multithread syncing (38% above), Blosc2 (27%) and finally numexpr (14%).

Also, even when using the old plain Blosc filter (where the chunksizes are 72x smaller, and queries perform ~2x slower too), the HDF5 layer does not add much overhead either:

[profiler screenshot: same query with the plain Blosc filter]

So, IMO we should not be in too much of a hurry to optimize the HDF5 calls. Using a much larger chunksize (as Blosc2 allows) is a far better path to good performance (at least for the usual PyTables use cases).


mkitti commented Jan 6, 2023

How many chunks are in your typical use case with large chunks? This only became an issue for me when I started to scale up to large multi-terabyte datasets.

@FrancescAlted (Member) commented:

In the case of the in-kernel queries above for Blosc2, the number is < 100 (3.1 GB / 36 MB ≈ 86 chunks). The point is that this is typically two orders of magnitude fewer chunks than with the typical HDF5 single-partition approach.

For datasets that do not fit in the OS filesystem cache, it would be interesting to try this out. What is the typical chunksize that you use for your multi-terabyte datasets? What is the access pattern? And what kind of speed-up do you get?

@FrancescAlted (Member) commented:

Finally, I realized the scenario where H5Dchunk_iter would be most useful, so I am having another go at this in #999. Indeed, a case with a 5x speedup is nothing to sneeze at.

@FrancescAlted (Member) commented:

Now that #999 is in, we can close this.
