# Advanced HDF5 with h5py Tutorial

#### Prepared for 2023 HDF5 User Group Meeting


Aleksandar Jelenak (<ajelenak@hdfgroup.org>)<br>
https://orcid.org/0009-0001-2102-0559

# About h5py

* Created by Andrew Collette in 2008
* Andrew also wrote a book about h5py: https://www.oreilly.com/library/view/python-and-hdf5/9781491944981
* The most popular package for using the HDF5 library (libhdf5) in Python
* As of 2023-08-13, the h5py GitHub repository had 1,912 stars

# What makes h5py popular?

* **NumPy array** ↔︎ h5py ↔︎ bytes ↔︎ libhdf5 ↔︎ **HDF5 file**
* API natural to Python users
* Very smart transition to the libhdf5 API

# h5py _High-Level_ (Pythonic) API

* h5py classes for HDF5 entities: Group, Dataset, Committed Datatype, Attribute.
* h5py File also represents the root group (so it is an h5py Group, too).
* h5py Group and Attribute classes are based on the Python dictionary.
* HDF5 links (names of HDF5 objects in a group) are the keys of a group's dictionary; attribute names are the keys of an attribute dictionary.

* Also high-level classes for Virtual Datasets, filters, hyperslab selections, and dimension scales.
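
As a minimal sketch of this dictionary-style API, the example below builds a small in-memory file and walks it like nested dictionaries. The file name `highlevel_sketch.h5` and the object names (`measurements`, `temperature`, `units`) are made up for illustration and are not part of the tutorial file.

```python
import h5py
import numpy as np

# In-memory file (core driver, no backing store): nothing is written to disk.
with h5py.File('highlevel_sketch.h5', 'w', driver='core', backing_store=False) as sf:
    grp = sf.create_group('measurements')          # hypothetical group name
    grp['temperature'] = np.array([20.5, 21.0])    # dict-style assignment creates a dataset
    grp['temperature'].attrs['units'] = 'degC'     # attributes behave like a dict, too

    root_is_group = isinstance(sf, h5py.Group)     # File doubles as the root group
    link_names = list(grp.keys())                  # link names are the group's keys
    attr_names = list(grp['temperature'].attrs)    # attribute names are the keys of .attrs
    has_path = 'measurements/temperature' in sf    # membership test accepts paths

print(root_is_group, link_names, attr_names, has_path)
```

The core driver with `backing_store=False` is used here only so that the sketch leaves no file behind.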

# h5py _Low-Level_ API

* Gateway to the libhdf5 C API
* Easy to find the h5py equivalent of any libhdf5 C function. Example: `H5Dget_storage_size` ➞ `h5py.h5d.DatasetID.get_storage_size`.
* Applies to libhdf5 constants, too: `H5F_LIBVER_EARLIEST` ➞ `h5py.h5f.LIBVER_EARLIEST`.

## How to Access Low-Level API?

* `id` property of the high-level h5py objects (File, Group, Dataset, Datatype).
* For everything else (datatypes, property lists, etc.), there are special h5py classes. Example: `h5py.h5p.PropFCID` class for file creation property lists.
* This API is very useful but is not meant to be used frequently. (Don't do C-style programming in Python.)
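
The sketch below shows, under the assumption of a throwaway in-memory file, how the `id` property leads from high-level objects to the low-level calls named above. The file name `lowlevel_sketch.h5` and the dataset name `d` are hypothetical.

```python
import h5py
import numpy as np

with h5py.File('lowlevel_sketch.h5', 'w', driver='core', backing_store=False,
               libver='earliest') as sf:
    ds = sf.create_dataset('d', data=np.zeros((10, 10)))   # 100 float64 values

    storage = ds.id.get_storage_size()                 # H5Dget_storage_size
    layout = ds.id.get_create_plist().get_layout()     # H5Pget_layout
    is_contiguous = layout == h5py.h5d.CONTIGUOUS      # H5D_CONTIGUOUS

    fcpl = sf.id.get_create_plist()                    # H5Fget_create_plist
    is_fcpl = isinstance(fcpl, h5py.h5p.PropFCID)      # file creation property list class

    low, high = sf.id.get_access_plist().get_libver_bounds()   # H5Pget_libver_bounds
    low_is_earliest = low == h5py.h5f.LIBVER_EARLIEST           # H5F_LIBVER_EARLIEST

print(storage, is_contiguous, is_fcpl, low_is_earliest)
```

Each line maps to one libhdf5 call, which is convenient for a quick check but gets verbose fast; the high-level API remains the better default.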

# h5py Resources

* https://github.com/h5py/h5py
* https://docs.h5py.org
* https://api.h5py.org
* Since 2023 is the Year of Open Science... Properly cite h5py to make your scientific work more [FAIR](https://www.go-fair.org/fair-principles/): https://doi.org/10.5281/zenodo.594310

# h5py Tutorial

We'll start from the file `ou_process.h5` which is in the parent folder and make a copy in the current folder.

In [1]:
from shutil import copy
from pathlib import Path

In [2]:
src_file = Path('../ou_process.h5')
tgt_file = Path('./h5py_tutorial_ou_process.h5')
copy(src_file, tgt_file)
tgt_file.is_file()

True

In [3]:
import h5py
import numpy as np

# h5py File and Group Objects

Open the copied file in the append mode:

In [4]:
f = h5py.File(str(tgt_file), mode='a')
f

<HDF5 file "h5py_tutorial_ou_process.h5" (mode r+)>

In [5]:
isinstance(f, h5py.Group)

True

In [6]:
f.filename

'h5py_tutorial_ou_process.h5'

In [7]:
f.name

'/'

## Low-level API for the File object

These are the libhdf5 functions that take as the first argument a file identifier.

In [8]:
f.id

<h5py.h5f.FileID at 0x10fd10900>

In [9]:
f.id.id

72057594037927936

libhdf5 `H5Fget_mdc_config` function:

In [10]:
h5ac_cache = f.id.get_mdc_config()
h5ac_cache

<h5py.h5ac.CacheConfig at 0x7fd278c03400>

This object holds the file's metadata cache configuration.

Important libhdf5 structs appear as Python extension types that provide property-style access to their struct fields.

In [11]:
h5ac_cache.version

1

In [12]:
h5ac_cache.dirty_bytes_threshold

262144

The members of the root group (or any other HDF5 group) are available through the standard Python dictionary API:

In [13]:
list(f.keys())

['dataset']

HDF5 datasets are accessed from the root group or from their parent group, using the standard Python dictionary "get key" operation.

In [14]:
dset = f['dataset']
dset

<HDF5 dataset "dataset": shape (100, 1000), type "<f8">

HDF5 attributes are available from the `.attrs` property, again with a Python dictionary interface.

In [15]:
list(dset.attrs.keys())

[]

This dataset does not have any attributes.

# h5py Dataset Object

HDF5 datasets hold the file's data so there are many properties and methods to work with them. Reading/writing data supports NumPy slicing syntax and NumPy arrays:

In [16]:
dset[10:12, 500:510]

array([[-0.10165412, -0.0957284 , -0.0889663 , -0.10223986, -0.10099564,
        -0.10982668, -0.10267864, -0.09925343, -0.09020471, -0.08736308],
       [-0.06492745, -0.05901887, -0.05680423, -0.05866818, -0.02616324,
        -0.04909445, -0.05726054, -0.0654222 , -0.0717754 , -0.08202989]])
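
Writing uses the same slicing syntax as reading. The following is a small sketch on a separate in-memory file (the names `slicing_sketch.h5` and `d` are made up for this example), so the tutorial file is not modified.

```python
import h5py
import numpy as np

with h5py.File('slicing_sketch.h5', 'w', driver='core', backing_store=False) as sf:
    ds = sf.create_dataset('d', shape=(4, 6), dtype='f8')   # starts out zero-filled

    ds[1, :] = np.arange(6)      # write a NumPy array into one row
    ds[2:4, 0:2] = 9.0           # a scalar is broadcast over the selection

    row = ds[1, ::2]             # strided read of row 1 -> values 0, 2, 4
    block = ds[2:4, 0:3]         # 2x3 block; last column was never written

print(row)
print(block)
```

Only the selected elements are transferred between the file and memory, which is the main reason to slice the dataset object rather than read everything with `ds[...]` first.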

Most popular dataset settings are available as read-only properties:

In [17]:
dset.dtype

dtype('<f8')

In [18]:
print(dset.chunks)

None


In [19]:
dset.size

100000

In [20]:
dset.nbytes

800000

## Low-level API for the Dataset Object

These are the libhdf5 functions that take as the first argument a dataset identifier.

In [21]:
dset.id

<h5py.h5d.DatasetID at 0x13ff29e90>

In [22]:
dset.id.id

360287970189639680

Let's check if the `dset` is chunked using the low-level API:

In [23]:
(
    dset  # h5py.Dataset
        .id  # libhdf5 H5D API
        .get_create_plist()  # H5Dget_create_plist -> h5py.h5p.PropDCID
        .get_layout()  # H5Pget_layout
    ==
    h5py.h5d.CHUNKED  # H5D_CHUNKED
)

False

We will now add new content to this file for use in the rest of the tutorial.

# Working with Attributes

We will add several attributes to the root group and the dataset:

In [24]:
f.attrs['Conventions'] = 'HUG23'
f.attrs['title'] = 'Content prepared for the HUG23 h5py tutorial'

In [25]:
dset.attrs['description'] = 'Sample dataset with some data'
dset.attrs['version'] = 1
dset.attrs['units'] = 'some_units'

We will flush this new content to the file, just to be sure it is stored, and check these attributes:

In [26]:
f.flush()

In [27]:
for n, v in f.attrs.items():
    print(f'{n} = {v}')

Conventions = HUG23
dt = 0.01
mu = b'0.000000'
sigma = [[[0.1]]]
theta = 1.0
title = Content prepared for the HUG23 h5py tutorial


## h5py and HDF5 Strings

Over the years h5py has interpreted HDF5 strings in several different ways, which caused a fair share of issues. The current approach started with h5py v3.0.
* HDF5 dataset strings:
    * `bytes` objects if variable-length
    * NumPy bytes arrays (`S` dtype) if fixed-length
    * `h5py.Dataset.asstr()` to retrieve them as `str` objects

* HDF5 attribute strings:
    * `str` objects if variable-length
    * NumPy bytes arrays if fixed-length
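
The rules above can be checked with a short sketch. It assumes h5py 3.0 or later and uses a throwaway in-memory file; the file name `strings_sketch.h5` and the object names `vlen`, `fixed`, and `words` are hypothetical.

```python
import h5py
import numpy as np

with h5py.File('strings_sketch.h5', 'w', driver='core', backing_store=False) as sf:
    # Attributes: a Python str is stored as a variable-length string,
    # a NumPy bytes scalar as a fixed-length string.
    sf.attrs['vlen'] = 'HUG23'
    sf.attrs['fixed'] = np.bytes_(b'0.000000')
    vlen_attr = sf.attrs['vlen']      # comes back as str
    fixed_attr = sf.attrs['fixed']    # comes back as a bytes object

    # Dataset with variable-length UTF-8 strings.
    words = sf.create_dataset('words', shape=(2,), dtype=h5py.string_dtype())
    words[0] = 'alpha'
    words[1] = 'beta'
    raw = words[0]                    # bytes when read directly
    decoded = words.asstr()[0]        # str when read through asstr()

print(type(vlen_attr).__name__, type(fixed_attr).__name__)
print(raw, decoded)
```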

Note the difference in the `mu` and the `Conventions` attribute values above. We will now close the file in order to use the `h5dump` tool to show the actual storage settings of these two attributes.

In [28]:
f.close()

In [29]:
!h5dump -a /mu -a /Conventions {str(tgt_file)}

HDF5 "h5py_tutorial_ou_process.h5" {
ATTRIBUTE "mu" {
   DATATYPE  H5T_STRING {
      STRSIZE 8;
      STRPAD H5T_STR_NULLTERM;
      CSET H5T_CSET_UTF8;
      CTYPE H5T_C_S1;
   }
   DATASPACE  SCALAR
   DATA {
   (0): "0.000000"
   }
}
ATTRIBUTE "Conventions" {
   DATATYPE  H5T_STRING {
      STRSIZE H5T_VARIABLE;
      STRPAD H5T_STR_NULLTERM;
      CSET H5T_CSET_UTF8;
      CTYPE H5T_C_S1;
   }
   DATASPACE  SCALAR
   DATA {
   (0): "HUG23"
   }
}
}


What is different?

We're going to reopen the file to continue adding more data:

In [30]:
f = h5py.File(str(tgt_file), mode='a')

# Dataset Filters

The data of chunked datasets can be modified on-the-fly by _filters_. The most popular filters are for data compression. Registered filters are listed at https://portal.hdfgroup.org/display/support/Registered+Filter+Plugins.

A dataset chunk is a fixed-size part of the dataset with the same rank as the dataset. Chunks do not overlap and are also known as _tiles_.
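
To make the tiling concrete, here is a plain-Python sketch of the chunk arithmetic for a (100, 1000) dataset of 8-byte floats split into (20, 250) chunks. The helper name `chunk_index` is made up for this example and is not an h5py function.

```python
import math

shape = (100, 1000)     # dataset shape
chunks = (20, 250)      # chunk shape, same rank as the dataset
itemsize = 8            # bytes per float64 element

# Number of chunks along each dimension, rounding up for partial edge chunks.
chunks_per_dim = [math.ceil(s / c) for s, c in zip(shape, chunks)]
n_chunks = math.prod(chunks_per_dim)
chunk_nbytes = math.prod(chunks) * itemsize   # uncompressed size of one chunk


def chunk_index(element, chunk_shape):
    """Return the grid position of the chunk that holds the given element."""
    return tuple(i // c for i, c in zip(element, chunk_shape))


print(chunks_per_dim, n_chunks, chunk_nbytes)
print(chunk_index((37, 600), chunks))
```

Because chunks never overlap, every element belongs to exactly one chunk, and a filter always processes whole chunks.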

## How to Use HDF5 Filters in h5py

Several filters come with libhdf5 and are referred to as _built-in_. Examples: DEFLATE (also known as "zlib" or "gzip"), shuffle, scaleoffset. The HDF Group also provides pre-built binary packages with a reasonable selection of other compression filters that can be installed.

For Python users, there's the hdf5plugin package.

Missing HDF5 filters make the data stored using them inaccessible, thus posing a data interoperability risk.

The exception below usually means you are missing some filter:

```
OSError: Can't read data (can't open directory: /usr/local/hdf5/lib/plugin)
```

In [31]:
import hdf5plugin

## Creating a Compressed Dataset

We'll create several HDF5 datasets with different compression filters using the data from the already existing dataset:

In [32]:
grp = f.create_group('compressed')

In [33]:
# DEFLATE compression
grp.create_dataset('defl', data=f['dataset'][...],
                   chunks=(20, 250),
                   compression='gzip', compression_opts=6)

<HDF5 dataset "defl": shape (100, 1000), type "<f8">

In [34]:
# ZSTD compression
grp.create_dataset('zstd', data=f['dataset'][...],
                   chunks=(20, 250),
                   **hdf5plugin.Zstd(10))

<HDF5 dataset "zstd": shape (100, 1000), type "<f8">

In [35]:
# ZFP fixed-accuracy lossy compression
grp.create_dataset('zfp-fa', data=f['dataset'][...],
                   chunks=(20, 250),
                   **hdf5plugin.Zfp(accuracy=0.0001))

<HDF5 dataset "zfp-fa": shape (100, 1000), type "<f8">

In [36]:
f.flush()

What does `**hdf5plugin.Zfp(accuracy=0.0001)` do?

In [37]:
hdf5plugin.Zfp(accuracy=0.0001)._kwargs

{'compression': 32013,
 'compression_opts': (3, 0, 3944497965, 1058682594, 0, 0)}
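
The `**` operator simply unpacks such a mapping into the `compression` and `compression_opts` keyword arguments of `create_dataset`. The sketch below shows the same mechanism with a plain dictionary and the built-in gzip filter, so it does not need hdf5plugin. The names `kwargs_sketch.h5` and `gzip_kwargs`, and the all-zeros data, are assumptions made for illustration.

```python
import h5py
import numpy as np

# A plain dict playing the role of the hdf5plugin filter object.
gzip_kwargs = {'compression': 'gzip', 'compression_opts': 6}

with h5py.File('kwargs_sketch.h5', 'w', driver='core', backing_store=False) as sf:
    ds = sf.create_dataset('defl', data=np.zeros((100, 1000)),
                           chunks=(20, 250), **gzip_kwargs)
    sf.flush()   # make sure all chunks have been compressed and allocated

    compression = ds.compression        # filter name reported by h5py
    level = ds.compression_opts         # gzip level
    ratio = ds.nbytes / ds.id.get_storage_size()

print(compression, level, f'{ratio:.1f}')
```

An array of zeros compresses far better than the tutorial data, so the ratio printed here says nothing about real-world performance.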

### What is the compression ratio of these different filters?

In [39]:
orig = f['dataset'].id.get_storage_size()
orig

800000

In [40]:
for dset in grp.values():
    print(f'comp. ratio of {dset.name} = {orig / dset.id.get_storage_size() :.3f}')

comp. ratio of /compressed/defl = 1.049
comp. ratio of /compressed/zfp-fa = 4.319
comp. ratio of /compressed/zstd = 1.052


# Object References

An object reference is a datatype whose value references (points to) another HDF5 group, dataset, or committed datatype in the file. Such a value is independent of the object's current or future path name(s).

The reference of an HDF5 object is obtained from its `.ref` property.
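
A small sketch of that path independence, using a throwaway in-memory file: the reference is taken first, the dataset is then renamed, and the old reference still resolves. The names `ref_sketch.h5`, `old_name`, and `new_name` are hypothetical.

```python
import h5py
import numpy as np

data = np.arange(5)

with h5py.File('ref_sketch.h5', 'w', driver='core', backing_store=False) as sf:
    ds = sf.create_dataset('old_name', data=data)
    ref = ds.ref                          # object reference to the dataset

    sf.move('old_name', 'new_name')       # rename the link; the object is unchanged

    old_gone = 'old_name' not in sf
    via_ref = sf[ref][...]                # dereference with the pre-rename reference
    same_data = np.array_equal(via_ref, data)
    current_name = sf[ref].name           # path as currently reported by the library

print(old_gone, same_data, current_name)
```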

We will store references to the three compressed datasets in a new dataset:

In [41]:
orefs = f.create_dataset(
    'objrefs/refs', 
    data=[grp['zfp-fa'].ref, grp['defl'].ref, grp['zstd'].ref],
    dtype=h5py.ref_dtype)
orefs

<HDF5 dataset "refs": shape (3,), type "|O">

In [42]:
f.flush()

Whenever you see an "O" NumPy dtype, there is something special going on behind the scenes. Let's see what this "O" dtype represents:

In [43]:
orefs.dtype.metadata

mappingproxy({'ref': h5py.h5r.Reference})

Accessing the referenced HDF5 object, _dereferencing_, uses the same syntax as for groups or datasets but with an object reference instead of a path name. Usually, the file object is used for dereferencing:

In [44]:
print(f[orefs[1]])
print(f'{f[orefs[-1]].name} filter id = {f[orefs[-1]].compression}')

<HDF5 dataset "defl": shape (100, 1000), type "<f8">
/compressed/zstd filter id = None


Notice the last line above: the `compression` property reports `None` even though the dataset is Zstandard-compressed. h5py's filter reporting is a bit outdated and can provide misleading information for non-default filters.

# Region References

A value of this datatype references (points to) a region within a dataset defined by one or more hyperslabs. New region references are generated from the dataset's `regionref` property and a hyperslab selection.

We're going to create region references for the compressed datasets and store them in a new dataset:

In [45]:
rrefs = f.create_dataset(
    'regrefs/rrefs', 
    data=[
        grp['zfp-fa'].regionref[:, 512:567],
        grp['defl'].regionref[19:67, 348:412], 
        grp['zstd'].regionref[4:17, 900:]
    ],
    dtype=h5py.regionref_dtype)
f.flush()
rrefs

<HDF5 dataset "rrefs": shape (3,), type "|O">

In [46]:
rrefs.dtype.metadata

mappingproxy({'ref': h5py.h5r.RegionReference})

Region references also serve as object references. When one is used as a key with a file or group object, the result is the dataset it refers to:

In [47]:
f[rrefs[-1]]

<HDF5 dataset "zstd": shape (100, 1000), type "<f8">

To actually retrieve the region reference's hyperslab selection, it needs to be applied to its dataset:

In [48]:
grp['zstd'][rrefs[-1]]

array([[-0.12320103, -0.10488131, -0.10018684, ..., -0.04948539,
        -0.04782969, -0.03800686],
       [-0.02783029, -0.04050565, -0.03843151, ..., -0.02579387,
        -0.04382448, -0.04072392],
       [-0.09068191, -0.0917756 , -0.0928998 , ..., -0.01506572,
        -0.00179956, -0.00295008],
       ...,
       [-0.03684699, -0.04224158, -0.04714038, ...,  0.02249485,
         0.02015074,  0.01677423],
       [ 0.05109927,  0.04079238,  0.03424684, ...,  0.0113786 ,
         0.00130804,  0.01331327],
       [-0.09514859, -0.09555737, -0.09829147, ..., -0.00449951,
        -0.00395626,  0.02206343]])

In [49]:
np.array_equal(f[rrefs[-1]][rrefs[-1]], grp['zstd'][4:17, 900:])

True

In [50]:
np.array_equal(f[rrefs[0]][rrefs[0]], grp['zfp-fa'][:, 512:567])

True

In [51]:
f[rrefs[0]][rrefs[1]]

ValueError: Region reference must point to this dataset

# Dimension Scales

A dimension scale is an HDF5 dataset _attached_ to a dimension of another HDF5 dataset. This is a powerful way to record some kind of relationship between the dimension scale and its "host" dataset. For example, a dimension scale could hold coordinate data representing a physical quantity associated with that dimension of the "host" dataset.

One dimension scale can be attached to several datasets, or to more than one dimension of the same dataset. Dimension scales are implemented with already available HDF5 storage features, so they can be considered a convention.

Working with dimension scales is available via the `h5py.Dataset.dims` property.
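
As a hedged sketch of the "one scale, many attachments" idea, the example below attaches a single scale to both dimensions of one dataset and to the first dimension of another. Everything lives in a throwaway in-memory file, and the names `dims_sketch.h5`, `t`, `a`, and `b` are made up for illustration.

```python
import h5py
import numpy as np

with h5py.File('dims_sketch.h5', 'w', driver='core', backing_store=False) as sf:
    t = sf.create_dataset('t', data=np.arange(4.0))
    t.make_scale('time')                       # turn the dataset into a dimension scale

    a = sf.create_dataset('a', data=np.zeros((4, 4)))
    b = sf.create_dataset('b', data=np.ones((4, 3)))

    a.dims[0].attach_scale(t)                  # same scale on two dimensions of `a`
    a.dims[1].attach_scale(t)
    b.dims[0].attach_scale(t)                  # and on another dataset
    a.dims[0].label = 'time'                   # optional label for the dimension

    attached = [a.dims[0][0].name, a.dims[1][0].name, b.dims[0][0].name]
    n_scales_b1 = len(b.dims[1])               # nothing attached to this dimension
    label = a.dims[0].label
    coords = a.dims[0][0][1:3]                 # read coordinate values through dims

print(attached, n_scales_b1, label, coords)
```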

We are going to: (1) create a dataset with some "coordinate" data; (2) make it a dimension scale; (3) attach it to all the compressed datasets.

In [52]:
x = f.create_dataset('dimscale/x', data=np.arange(0, 50, .5))

In [53]:
x.make_scale('X')
x.is_scale

True

In [54]:
for dset in grp.values():
    dset.dims[0].attach_scale(x)

We can verify that a dimension scale is attached by reading some data from it via the `dims` property:

In [55]:
grp['zstd'].dims[0][0].name

'/dimscale/x'

In [56]:
grp['defl'].dims[0][0][5:11]

array([2.5, 3. , 3.5, 4. , 4.5, 5. ])

In [57]:
f.close()

In [58]:
!h5ls -rv {str(tgt_file)}

Opened "h5py_tutorial_ou_process.h5" with sec2 driver.
/                        Group
    Attribute: Conventions scalar
        Type:      variable-length null-terminated UTF-8 string
    Attribute: dt scalar
        Type:      native double
    Attribute: mu scalar
        Type:      8-byte null-terminated UTF-8 string
    Attribute: sigma {1, 1, 1}
        Type:      native double
    Attribute: theta scalar
        Type:      native float
    Attribute: title scalar
        Type:      variable-length null-terminated UTF-8 string
    Location:  1:96
    Links:     1
/compressed              Group
    Location:  1:808192
    Links:     1
/compressed/defl         Dataset {100/100, 1000/1000}
    Attribute: DIMENSION_LIST {2}
        Type:      variable length of
                   object reference
    Location:  1:808896
    Links:     1
    Chunks:    {20, 250} 40000 bytes
    Storage:   800000 logical bytes, 762743 allocated bytes, 104.88% utilization
    F

# Accessing HDF5 Files in the Cloud with h5py

There are currently two methods:

* Read-only S3 (ROS3) virtual file driver (VFD). It is part of libhdf5, but libhdf5 must be specifically built with it enabled. Good news for conda package users -- conda's libhdf5 package comes with this VFD. However, the PyPI version does not have it.
* Any Python package for cloud object stores that provides a file-like object. (The fsspec package is highly recommended.)
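
The second method relies on h5py accepting a Python file-like object in place of a file name. The offline sketch below uses `io.BytesIO` as a stand-in for the object that a package such as fsspec would return for a cloud store; the dataset name `x` is hypothetical and no network access is involved.

```python
import io

import h5py
import numpy as np

# Write an HDF5 file into an in-memory buffer.
buf = io.BytesIO()
with h5py.File(buf, 'w') as wf:
    wf.create_dataset('x', data=np.arange(5))
file_bytes = buf.getvalue()

# Read it back through a fresh file-like object, the same way a
# cloud file object would be handed to h5py.File.
with h5py.File(io.BytesIO(file_bytes), 'r') as rf:
    names = list(rf.keys())
    values = rf['x'][...]

print(names, values)
```

With a real object store, the `io.BytesIO` object would be replaced by the opened remote file object, and every read h5py makes turns into one or more requests to the store.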

Hopefully soon -- libhdf5 REST virtual object layer (VOL) with Highly Scalable Data Service (HSDS).

We're going to use the ROS3 VFD.

There's a significant performance difference between typical and cloud-optimized HDF5 files when in a cloud object store. We're going to use a cloud-optimized file from a NASA satellite in AWS S3 (just because it's already there).

Opening this file requires a few more arguments:

In [59]:
ros3f = h5py.File(
    's3://hdf5.sample/data/cohdf5/PAGE8MiB_ATL03_20190928165055_00270510_004_01.h5',
    mode='r',
    driver='ros3', aws_region='us-west-2'.encode('ascii'),
    page_buf_size=8 * 1024 * 1024
)
ros3f

<HDF5 file "PAGE8MiB_ATL03_20190928165055_00270510_004_01.h5" (mode r)>

* `driver` keyword must be `ros3`
* `page_buf_size` sets the page buffer cache at 8 MiB (only available to HDF5 files with _PAGE_ file space strategy)

In [60]:
# List the root group's members...
list(ros3f.keys())

['METADATA',
 'ancillary_data',
 'atlas_impulse_response',
 'ds_surf_type',
 'ds_xyz',
 'gt1l',
 'gt1r',
 'gt2l',
 'gt2r',
 'gt3l',
 'gt3r',
 'orbit_info',
 'quality_assessment']

In [61]:
# Store all root group attributes into a dict...
dict(ros3f.attrs.items())

{'Conventions': b'CF-1.6',
 'citation': b'Cite these data in publications as follows: The data used in this study were produced by the ICESat-2 Science Project Office at NASA/GSFC. The data archive site is the NASA National Snow and Ice Data Center Distributed Active Archive Center.',
 'contributor_name': b'Thomas E Neumann (thomas.neumann@nasa.gov), Thorsten Markus (thorsten.markus@nasa.gov), Suneel Bhardwaj (suneel.bhardwaj@nasa.gov) David W Hancock III (david.w.hancock@nasa.gov)',
 'contributor_role': b'Instrument Engineer, Investigator, Principle Investigator, Data Producer, Data Producer',
 'creator_name': b'GSFC I-SIPS > ICESat-2 Science Investigator-led Processing System',
 'date_created': b'2021-02-19T16:53:58.000000Z',
 'date_type': b'UTC',
 'description': b'This data set (ATL03) contains height above the WGS 84 ellipsoid (ITRF2014 reference frame), latitude, longitude, and time for all photons downlinked by the Advanced Topographic Laser Altimeter System (ATLAS) instrument on

In [62]:
# METADATA group's members...
list(ros3f['METADATA'].keys())

['AcquisitionInformation',
 'DataQuality',
 'DatasetIdentification',
 'Extent',
 'Lineage',
 'ProcessStep',
 'ProductSpecificationDocument',
 'QADatasetIdentification',
 'SeriesIdentification']