A cloud-optimized Python package for reading HDF5 data stored in S3
h5coro is a pure Python implementation of a subset of the HDF5 specification that has been optimized for reading data out of S3. The project has its roots in the development of an on-demand science data processing system called SlideRule, where a new C++ implementation of the HDF5 specification was developed for performant read access to Earth science datasets stored in AWS S3. Over time, users of SlideRule began requesting the ability to performantly read HDF5 and NetCDF files out of S3 from their own Python scripts. The result is h5coro: a re-implementation in Python of the core HDF5 reading logic that exists in SlideRule. Since then, h5coro has become its own project, which will continue to grow and diverge in functionality from its parent implementation. For more information on SlideRule and the organization behind h5coro, see https://slideruleearth.io.
h5coro is optimized for reading HDF5 data in high-latency high-throughput environments. It accomplishes this through a few key design decisions:
- All reads are concurrent. Each dataset and/or attribute read by h5coro is performed in its own thread.
- Intelligent range GETs are used to read as many dataset chunks as possible in each read operation. This drastically reduces the number of HTTP requests to S3 and removes the need to re-chunk the data (in fact, smaller chunk sizes work better due to the granularity of the requests); see the conceptual sketch after this list.
- Block caching is used to minimize the number of GET requests made to S3. S3 has a large first-byte latency (we've measured it at ~60ms on our systems), which means there is a large penalty for each read operation performed. h5coro performs all reads to S3 as large block reads and then maintains data in a local cache for access to smaller amounts of data within those blocks.
- The system is serverless and does not depend on any external services to read the data. This means it scales naturally as the user application scales, and it reduces overall system complexity.
- No metadata repository is needed. The structure of the file is cached as it is read, so that successive reads of other datasets in the same file do not have to re-read and re-build the directory structure of the file.
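To illustrate the range-get strategy, here is a minimal conceptual sketch (not h5coro's actual implementation) of coalescing the byte ranges of nearby chunks so they can be fetched with a single ranged GET; the max_gap threshold is a made-up parameter for the illustration:

# Conceptual illustration only: merge (offset, size) pairs whose gaps are
# smaller than max_gap bytes so each merged range becomes one HTTP range GET.
def coalesce_ranges(chunk_ranges, max_gap=1024 * 1024):
    merged = []
    for offset, size in sorted(chunk_ranges):
        if merged and offset - (merged[-1][0] + merged[-1][1]) <= max_gap:
            last_offset, _ = merged[-1]
            merged[-1] = (last_offset, offset + size - last_offset)
        else:
            merged.append((offset, size))
    return merged

# Ten 64 KiB chunks spaced 4 KiB apart collapse into a single read.
chunks = [(i * 68 * 1024, 64 * 1024) for i in range(10)]
print(coalesce_ranges(chunks))  # -> [(0, 692224)]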
For a full list of which parts of the HDF5 specification h5coro implements, see the compatibility section at the end of this readme. The major limitations currently present in the package are:
- The code only implements a subset of the HDF5 specification. h5coro has been shown to work on a number of different datasets, but depending on the version of the HDF5 C library used to write the file, and what options were used during its creation, it is very possible that some part of h5coro will need to be updated to support reading it. Hopefully, over time as more of the spec is implemented, this will become less of a problem.
- It is a read-only library and has no functionality to write HDF5 data.
The simplest way to install h5coro is by using the conda package manager.
conda install -c conda-forge h5coro
Alternatively, you can also install h5coro using pip.
pip install h5coro
To use h5coro as a backend to xarray, simply install both xarray and h5coro in your current environment. h5coro will automatically be recognized by xarray, so you can use it like any other xarray engine:
import xarray as xr
h5ds = xr.open_dataset("file.h5", engine="h5coro")
You can see what backends are available in xarray using:
xr.backends.list_engines()
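The object returned by xr.open_dataset behaves like any other xarray Dataset. As a quick sketch (the file path and variable name below are placeholders for your own data):

import xarray as xr

h5ds = xr.open_dataset("file.h5", engine="h5coro")
print(h5ds)                            # inspect the variables and attributes that were read
values = h5ds["my_variable"].values    # "my_variable" is a placeholder; yields a numpy array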
# (1) import
from h5coro import h5coro, s3driver
# (2) create
h5obj = h5coro.H5Coro(f'{my_bucket}/{path_to_hdf5_file}', s3driver.S3Driver)
# (3) read
datasets = [{'dataset': '/path/to/dataset1', 'hyperslice': []},
            {'dataset': '/path/to/dataset2', 'hyperslice': [324, 374]}]
promise = h5obj.readDatasets(datasets=datasets, block=True)
# (4) display
for variable in promise:
    print(f'{variable}: {promise[variable]}')
- h5coro: the main module implementing the HDF5 reader object
- s3driver: the driver used to read HDF5 data from S3
The call to h5coro.H5Coro creates a reader object that opens up the HDF5 file, reads the start of the file, and is then ready to accept read requests.
The calling application must have credentials to access the object in the specified S3 bucket. h5coro uses boto3, so any credentials supplied via the standard AWS methods will work. If credentials need to be supplied externally, then in the call to h5coro.H5Coro pass in a credentials argument as a dictionary with the following three fields: "aws_access_key_id", "aws_secret_access_key", "aws_session_token".
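For example, a minimal sketch of supplying credentials explicitly (the bucket, object path, and credential values are placeholders):

from h5coro import h5coro, s3driver

# Placeholder credentials; in practice these might come from an STS call or
# your own configuration.
creds = {
    "aws_access_key_id": "AKIA...",
    "aws_secret_access_key": "...",
    "aws_session_token": "...",
}

h5obj = h5coro.H5Coro('my-bucket/path/to/file.h5', s3driver.S3Driver, credentials=creds)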
The H5Coro.readDatasets function takes a list of dictionary objects that describe the datasets to be read in parallel. If the block parameter is set to True, then the code will wait for all of the datasets to be read before returning; otherwise, the call returns immediately and the code only blocks when a dataset within the returned promise is accessed.
The h5coro promise is a dictionary of numpy arrays containing the values of the variables read, along with some additional logic that provides the ability to block while waiting for the data to be populated.
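As a sketch of a non-blocking read (assuming the same h5obj reader object created above; the dataset path is a placeholder):

# Start the reads without waiting for them to finish.
promise = h5obj.readDatasets(datasets=[{'dataset': '/path/to/dataset1', 'hyperslice': []}],
                             block=False)

# ... other work can proceed here while the reads run in background threads ...

# Accessing a variable in the promise blocks until its data has been populated.
values = promise['/path/to/dataset1']
print(values)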
h5coro is licensed under the 3-clause BSD license found in the LICENSE file at the root of this source tree.
We welcome and invite contributions from anyone at any career stage and with any amount of coding experience. We appreciate any and all contributions made towards the development of h5coro, and you will be recognized for your work by being listed as one of the project contributors. Ways to contribute include:
- Fixing typographical or coding errors
- Submitting bug reports or feature requests through the use of GitHub issues
- Improving documentation and testing
- Sharing use cases and examples (such as Jupyter Notebooks)
- Providing code for everyone to use
Check the project issues tab to see if the feature has already been suggested. If not, please submit a new issue describing your requested feature or enhancement. Please give your feature request both a clear title and description. Please let us know in your description if this is something you would like to contribute to the project.
Check the project issues tab to see if the problem has already been reported. If not, please submit a new issue so that we are made aware of the problem. Please provide as much detail as possible when writing the description of your bug report. Providing detailed information and examples will help us resolve issues faster.
We follow a standard Forking Workflow for code changes and additions. Submitted code goes through a review and comment process by the project maintainers.
- Make each pull request as small and simple as possible
- Commit messages should be clear and describe the changes
- Larger changes should be broken down into their basic components and integrated separately
- Bug fixes should be their own pull requests with an associated GitHub issue
- Write a descriptive pull request message with a clear title
- Please be patient as reviews of pull requests can take time
- Fork the repository to your personal GitHub account by clicking the “Fork” button on the project main page. This creates your own server-side copy of the repository.
- Either by cloning to your local system or working in GitHub Codespaces, create a work environment to make your changes.
- Add the original project repository as the upstream remote. While this step isn't necessary, it allows you to keep your fork up to date in the future.
- Create a new branch to do your work.
- Make your changes on the new branch.
- Push your work to GitHub under your fork of the project.
- Submit a Pull Request from your forked branch to the project repository.
Format Element | Supported | Contains | Missing |
---|---|---|---|
Field Sizes | Yes | 1, 2, 4, 8 bytes | |
Superblock | Partial | Version 0, 2 | Version 1, 3 |
Base Address | Yes | ||
B-Tree | Partial | Version 1 | Version 2 |
Group Symbol Table | Yes | Version 1 | |
Local Heap | Yes | Version 0 | |
Global Heap | No | Version 1 | |
Fractal Heap | Yes | Version 0 | |
Shared Object Header Message Table | No | Version 0 | |
Data Object Headers | Yes | Version 1, 2 | |
Shared Message | No | Version 1 | |
NIL Message | Yes | Unversioned | |
Dataspace Message | Yes | Version 1 | |
Link Info Message | Yes | Version 0 | |
Datatype Message | Partial | Version 1 | Version 0, 2, 3 |
Fill Value (Old) Message | No | Unversioned | |
Fill Value Message | Partial | Version 2, 3 | Version 1 |
Link Message | Yes | Version 1 | |
External Data Files Message | No | Version 1 | |
Data Layout Message | Partial | Version 3 | Version 1, 2 |
Bogus Message | No | Unversioned | |
Group Info Message | No | Version 0 | |
Filter Pipeline Message | Yes | Version 1, 2 | |
Attribute Message | Partial | Version 1, 2, 3 | Shared message support for v3 |
Object Comment Message | No | Unversioned | |
Object Modification Time (Old) Message | No | Unversioned | |
Shared Message Table Message | No | Version 0 | |
Object Header Continuation Message | Yes | Version 1, 2 | |
Symbol Table Message | Yes | Unversioned | |
Object Modification Time Message | No | Version 1 | |
B-Tree ‘K’ Value Message | No | Version 0 | |
Driver Info Message | No | Version 0 | |
Attribute Info Message | No | Version 0 | |
Object Reference Count Message | No | Version 0 | |
Compact Storage | Yes | ||
Continuous Storage | Yes | ||
Chunked Storage | Yes | ||
Fixed Point Type | Yes | ||
Floating Point Type | Yes | ||
Time Type | No | ||
String Type | Yes | ||
Bit Field Type | No | ||
Opaque Type | No | ||
Compound Type | No | ||
Reference Type | No | ||
Enumerated Type | No | ||
Variable Length Type | No | ||
Array Type | No | ||
Deflate Filter | Yes | ||
Shuffle Filter | Yes | ||
Fletcher32 Filter | No | ||
Szip Filter | No | ||
Nbit Filter | No | ||
Scale Offset Filter | No |