# HDF5 Users Group 2020 `h5py` Tutorial

This is a very short tutorial about h5py, the Python language interface to the HDF5 library. Familiarity with the HDF5 [data model](https://portal.hdfgroup.org/display/HDF5/HDF5+User+Guides), its [programming API](https://portal.hdfgroup.org/pages/viewpage.action?pageId=50073943), and [NumPy](https://numpy.org/doc/stable/) is assumed.

h5py resources:
* GitHub [repository](https://github.com/h5py/h5py)
* [Documentation](https://docs.h5py.org/en/2.10.0/)
* O'Reilly [book](https://www.oreilly.com/library/view/python-and-hdf5/9781491944981/) written by h5py's creator

This tutorial covers some of the most frequent HDF5 operations:

1. Create a file, its groups and attributes.
1. Create contiguous, chunked, resizable, and compressed HDF5 datasets.
1. Open a file for reading and list group content.
1. Read datasets and attributes.
1. h5py low-level API and its relation to the HDF5 C API.

In [1]:
import sys
import h5py
import numpy as np

## Software Used in this Tutorial

Python version:

In [2]:
print(sys.version)

3.8.6 (default, Sep 28 2020, 04:40:29) 
[Clang 9.1.0 (clang-902.0.39.2)]


Package versions:

In [3]:
for _ in (h5py, np):
    print(f'{_.__name__} -> v{_.__version__}')

h5py -> v2.10.0
numpy -> v1.19.2


_Note on h5py: The version used in this tutorial is the latest release, although it is more than a year old. A new version, tagged 3.0, is being prepared and may be released soon._

HDF5 library version:

In [None]:
h5py.version.hdf5_version

## Creating HDF5 Files, Groups, and Attributes

An HDF5 file is created with the [`h5py.File`](https://docs.h5py.org/en/2.10.0/high/file.html) class:

In [4]:
f = h5py.File('hug-tutorial.h5', mode='w')
f

<HDF5 file "hug-tutorial.h5" (mode r+)>

Use `mode='r+'` to modify an existing HDF5 file. Consult the [documentation](https://docs.h5py.org/en/2.10.0/high/file.html#reference) for all the keyword arguments that control the properties of the created file.
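The difference between the file modes can be sketched as follows. This is a minimal illustration, not part of the original notebook: the scratch file name `modes-demo.h5` and the group names are assumptions, and the `with` statement is used so that each file is closed automatically.

```python
import os
import tempfile

import h5py

# Hypothetical scratch file used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), 'modes-demo.h5')

# 'w' creates the file, truncating it if it already exists.
with h5py.File(path, mode='w') as f:
    f.create_group('first')

# 'r+' opens an existing file for reading and writing.
with h5py.File(path, mode='r+') as f:
    f.create_group('second')

# 'r' opens the file read-only.
with h5py.File(path, mode='r') as f:
    names = sorted(f.keys())

print(names)
```

Both groups are present when the file is reopened, because `'r+'` added to the existing content instead of replacing it.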

Some basic information about the created HDF5 file and the library settings that apply to it:

In [5]:
f.filename

'hug-tutorial.h5'

In [6]:
f.driver

'sec2'

In [7]:
f.libver

('earliest', 'v110')

In [8]:
f.name

'/'

The `name` property reveals the HDF5 [path name](https://portal.hdfgroup.org/display/HDF5/HDF5+Glossary#HDF5Glossary-P) of h5py objects. In the above case, the `name` value is `/`, which means that `h5py.File` objects also represent the HDF5 root group. We can use the file object to add other HDF5 content to the root group.

In [9]:
g1 = f.create_group('g1')
g1

<HDF5 group "/g1" (0 members)>

It is possible to create HDF5 groups at an arbitrary sublevel from the root group directly:

In [10]:
g2 = f.create_group('several/levels/below')
g2

<HDF5 group "/several/levels/below" (0 members)>

HDF5 groups in the file are available as [`h5py.Group`](https://docs.h5py.org/en/2.10.0/high/group.html) objects, and their interface resembles that of Python dictionaries. For example, the `len()` function reports the number of member HDF5 objects in the group:

In [11]:
len(f)

2

The `keys()`, `values()`, and `items()` methods return the appropriate Python iterables over the group's HDF5 objects:

In [12]:
print(list(f.keys()))
print(list(f.values()))
print(list(f.items()))

['g1', 'several']
[<HDF5 group "/g1" (0 members)>, <HDF5 group "/several" (1 members)>]
[('g1', <HDF5 group "/g1" (0 members)>), ('several', <HDF5 group "/several" (1 members)>)]
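A few more dictionary-style operations on groups are sketched below. This is an illustration under assumptions, not part of the original notebook: it uses a hypothetical scratch file, and the path `no/such/group` is deliberately nonexistent.

```python
import os
import tempfile

import h5py

# Hypothetical scratch file used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), 'groups-demo.h5')

with h5py.File(path, mode='w') as f:
    f.create_group('several/levels/below')

    # Membership tests accept multi-level path names.
    has_nested = 'several/levels' in f

    # get() returns a default (None here) instead of raising KeyError.
    missing = f.get('no/such/group')

    # Indexing can be chained one level at a time or done with a full path.
    below_name = f['several']['levels']['below'].name

    # Iterating over a group yields the names of its direct members.
    members = [name for name in f]

print(has_nested, missing, below_name, members)
```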


HDF5 attributes are accessed through the `attrs` property of h5py group and dataset objects. It exposes the [`AttributeManager` class](https://docs.h5py.org/en/2.10.0/high/attr.html#AttributeManager), which also has a dictionary-style interface.

Creating HDF5 attributes is as simple as assigning new dictionary values. Below we create two HDF5 attributes in the root group:

In [13]:
f.attrs['attribute example 1'] = 'Example 1'
f.attrs['attribute example 2'] = 2.0

The `len()` function gives the number of assigned attributes:

In [14]:
len(f.attrs)

2

The `keys()`, `values()`, and `items()` methods provide the attribute names, values, and name-value pairs, respectively:

In [15]:
list(f.attrs.keys())

['attribute example 1', 'attribute example 2']

In [16]:
list(f.attrs.values())

['Example 1', 2.0]

In [17]:
list(f.attrs.items())

[('attribute example 1', 'Example 1'), ('attribute example 2', 2.0)]

Use the [`create()`](https://docs.h5py.org/en/2.10.0/high/attr.html#AttributeManager.create) method for total control over all aspects of attribute creation. By default, h5py deletes the old attribute when a new attribute with the same name is created. Use the [`modify()`](https://docs.h5py.org/en/2.10.0/high/attr.html#AttributeManager.modify) method to change only the attribute's value; the new value must then be compatible with the properties of the old attribute value.

Let's change the `attribute example 2` attribute's value to `15`:

In [18]:
f.attrs.modify('attribute example 2', 15)

In [19]:
list(f.attrs.items())

[('attribute example 1', 'Example 1'), ('attribute example 2', 15.0)]

and try to change its value to a string now:

In [20]:
try:
    print(f"Old attribute value datatype: {f.attrs['attribute example 2'].dtype}")
    f.attrs.modify('attribute example 2', 'string value')
except Exception as e:
    print(f'ERROR: {e}')

Old attribute value datatype: float64
ERROR: No conversion path for dtype: dtype('<U12')
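The interplay of `create()` and `modify()` can be sketched as follows. This is a minimal example under assumptions: the scratch file and the attribute name `counts` are hypothetical and do not appear in the tutorial file.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical scratch file used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), 'attrs-demo.h5')

with h5py.File(path, mode='w') as f:
    # create() lets us choose the stored datatype explicitly.
    f.attrs.create('counts', data=[1, 2, 3], dtype='i2')
    stored = f.attrs['counts']

    # modify() keeps the attribute's datatype and shape and only
    # replaces the value; the integers are converted to 16-bit.
    f.attrs.modify('counts', [4, 5, 6])
    modified = f.attrs['counts']

print(stored.dtype, modified)
```

The attribute keeps its 16-bit integer datatype after `modify()`, whereas a plain assignment would have replaced it with a new attribute whose datatype is derived from the new value.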


## Creating HDF5 Datasets

HDF5 datasets constitute the _data payload_ in HDF5 files and are represented by the [`h5py.Dataset`](https://docs.h5py.org/en/2.10.0/high/dataset.html) class. The [`h5py.Group.create_dataset()`](https://docs.h5py.org/en/2.10.0/high/group.html#Group.create_dataset) method creates a new HDF5 dataset. Check its documentation for all available creation settings.

To create a [contiguous](https://portal.hdfgroup.org/display/HDF5/HDF5+Glossary#HDF5Glossary-C) dataset in the group `g1`:

In [21]:
cont = g1.create_dataset('contiguous', data=np.arange(10), dtype='i4')
cont

<HDF5 dataset "contiguous": shape (10,), type "<i4">

Use the `chunks` argument to create a [chunked](https://portal.hdfgroup.org/display/HDF5/HDF5+Glossary#HDF5Glossary-C) dataset:

In [22]:
chunked = f.create_dataset('chunked', data=np.arange(100*200),
                           shape=(100, 200), chunks=(20, 10))
chunked

<HDF5 dataset "chunked": shape (100, 200), type "<i8">

Finding the "right" chunk size is a perennial issue without a "one size fits all" answer. The HDF Group has developed many resources that explain chunking and its effect on I/O performance; see [this slide deck](https://www.slideshare.net/HDFEOS/hdf5-eosxiiiadvancedchunking) for one.
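The chunk shape of an existing dataset can be inspected through its `chunks` property. The sketch below is an illustration under assumptions: the scratch file and the dataset names are hypothetical, and `chunks=True` asks h5py to guess a chunk shape, whose exact value is not assumed here.

```python
import os
import tempfile

import h5py

# Hypothetical scratch file used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), 'chunks-demo.h5')

with h5py.File(path, mode='w') as f:
    explicit = f.create_dataset('explicit', shape=(100, 200), dtype='i8',
                                chunks=(20, 10))
    guessed = f.create_dataset('guessed', shape=(100, 200), dtype='i8',
                               chunks=True)
    plain = f.create_dataset('plain', shape=(100, 200), dtype='i8')

    explicit_chunks = explicit.chunks  # the shape we asked for
    guessed_chunks = guessed.chunks    # a shape chosen by h5py
    plain_chunks = plain.chunks        # None: the dataset is not chunked

# Number of chunks needed to cover the (100, 200) dataset with (20, 10) chunks.
n_chunks = (100 // 20) * (200 // 10)

print(explicit_chunks, guessed_chunks, plain_chunks, n_chunks)
```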

Chunked datasets can also be resized or have _unlimited_ size (up to the maximum dimension size HDF5 allows). Both can be achieved by using the `maxshape` argument. To create a resizable dataset:

In [23]:
resize = g2.create_dataset('resizeable', shape=(10, 20), maxshape=(200, 150))
resize

<HDF5 dataset "resizeable": shape (10, 20), type "<f4">

Setting any element of the `maxshape` tuple to `None` declares the corresponding dataset dimension unlimited, so its size can grow up to the maximum currently supported by HDF5: $2^{64}$.

In [24]:
unlimited = f['several/levels'].create_dataset('unlimited', shape=(34, 76), maxshape=(None, 500), dtype=np.float64)
unlimited

<HDF5 dataset "unlimited": shape (34, 76), type "<f8">
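Once created with `maxshape`, a dataset can be resized with its `resize()` method. The following is a minimal sketch, assuming a hypothetical scratch file and dataset name; it also shows that growing beyond `maxshape` is rejected.

```python
import os
import tempfile

import h5py

# Hypothetical scratch file used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), 'resize-demo.h5')

with h5py.File(path, mode='w') as f:
    dset = f.create_dataset('resizable', shape=(10, 20),
                            maxshape=(200, 150), dtype='f4')

    # Resize all dimensions at once...
    dset.resize((50, 100))
    shape_after_full = dset.shape

    # ...or only one axis.
    dset.resize(120, axis=0)
    shape_after_axis = dset.shape

    # Growing past maxshape is an error.
    try:
        dset.resize((201, 150))
        error = None
    except Exception as e:
        error = e

print(shape_after_full, shape_after_axis, type(error).__name__)
```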

Assigning attributes to datasets follows the same approach as for groups. To add an attribute named `description` to a few of the datasets created so far:

In [25]:
chunked.attrs['description'] = 'Chunked dataset'
resize.attrs['description'] = 'Resizeable dataset'
unlimited.attrs['description'] = 'Unlimited dataset'

### Data Compression

Data compression can be applied to any chunked HDF5 dataset. The compression method and its settings are set individually for each dataset. Compression is only one of the ways in which dataset data can be [manipulated](https://portal.hdfgroup.org/display/HDF5/Dynamic+Plugins+in+HDF5) when read or written.

The `compression` argument defines the compression method and the `compression_opts` provides specific compression settings:

In [26]:
comp = f['several'].create_dataset('compressed', shape=(250, 300), chunks=(50, 60), 
                                   dtype='f4', compression='gzip')
comp

<HDF5 dataset "compressed": shape (250, 300), type "<f4">

More information about available compression methods in h5py and their settings is available from its [documentation](https://docs.h5py.org/en/2.10.0/high/dataset.html#filter-pipeline).
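The sketch below illustrates the `compression_opts` argument, which the cell above does not use. It is an example under assumptions: the scratch file is hypothetical, the gzip level `9` and the shuffle filter are arbitrary choices, and the all-zero data is chosen only because it compresses well.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical scratch file used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), 'compression-demo.h5')

data = np.zeros((250, 300), dtype='f4')

with h5py.File(path, mode='w') as f:
    dset = f.create_dataset('compressed', data=data, chunks=(50, 60),
                            compression='gzip', compression_opts=9,
                            shuffle=True)
    f.flush()

    # The filter settings can be read back from the dataset object.
    settings = (dset.compression, dset.compression_opts, dset.shuffle)

    # Bytes occupied in the file versus the size of the data in memory.
    stored_bytes = dset.id.get_storage_size()
    raw_bytes = data.nbytes

print(settings, stored_bytes, raw_bytes)
```

Comparing `stored_bytes` with `raw_bytes` gives a quick estimate of how much the chosen filters reduce the storage footprint for a particular dataset.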

Close the HDF5 file to flush all its content to storage:

In [27]:
f.close()

## Reading HDF5 Content

We will now show how to read what was stored in the file we created in this tutorial. First, let's display the file's content using HDF5 command-line tools:

In [28]:
!h5dump -n 1 hug-tutorial.h5

HDF5 "hug-tutorial.h5" {
FILE_CONTENTS {
 group      /
 attribute  /attribute example 1
 attribute  /attribute example 2
 dataset    /chunked
 attribute  /chunked/description
 group      /g1
 dataset    /g1/contiguous
 group      /several
 dataset    /several/compressed
 group      /several/levels
 group      /several/levels/below
 dataset    /several/levels/below/resizeable
 attribute  /several/levels/below/resizeable/description
 dataset    /several/levels/unlimited
 attribute  /several/levels/unlimited/description
 }
}


In [29]:
!h5ls -r hug-tutorial.h5

/                        Group
/chunked                 Dataset {100, 200}
/g1                      Group
/g1/contiguous           Dataset {10}
/several                 Group
/several/compressed      Dataset {250, 300}
/several/levels          Group
/several/levels/below    Group
/several/levels/below/resizeable Dataset {10/200, 20/150}
/several/levels/unlimited Dataset {34/Inf, 76/500}


Open the file for reading only:

In [30]:
f = h5py.File('hug-tutorial.h5', mode='r')
f

<HDF5 file "hug-tutorial.h5" (mode r)>

## Listing Group Content

Showing HDF5 objects in several of the file's groups:

In [31]:
for group in ('/', '/g1', '/several', '/several/levels'):
    grp = f[group]
    print(f'Members of group "{grp.name}":\n\t{list(grp.values())}\n')

Members of group "/":
	[<HDF5 dataset "chunked": shape (100, 200), type "<i8">, <HDF5 group "/g1" (1 members)>, <HDF5 group "/several" (2 members)>]

Members of group "/g1":
	[<HDF5 dataset "contiguous": shape (10,), type "<i4">]

Members of group "/several":
	[<HDF5 dataset "compressed": shape (250, 300), type "<f4">, <HDF5 group "/several/levels" (2 members)>]

Members of group "/several/levels":
	[<HDF5 group "/several/levels/below" (1 members)>, <HDF5 dataset "unlimited": shape (34, 76), type "<f8">]
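Instead of naming every group explicitly, the whole hierarchy can be walked with the `visititems()` method. This is a minimal sketch, assuming a hypothetical scratch file with a small hierarchy similar to the tutorial's; the callback name `record` is also an assumption.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical scratch file used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), 'visit-demo.h5')

found = []


def record(name, obj):
    """Collect the relative path name and kind of every visited object."""
    kind = 'dataset' if isinstance(obj, h5py.Dataset) else 'group'
    found.append((name, kind))
    # Returning None tells h5py to continue with the next object.


with h5py.File(path, mode='w') as f:
    f.create_group('g1').create_dataset('contiguous', data=np.arange(10))
    f.create_group('several/levels/below')
    f.visititems(record)

for name, kind in sorted(found):
    print(f'{kind:8s} {name}')
```

The names passed to the callback are relative to the group on which `visititems()` is called, so they carry no leading `/`.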



## Reading Attributes

The attributes assigned to any group or dataset are accessible via the `.attrs` property:

In [32]:
for path in ('/', '/chunked', '/several/levels/unlimited'):
    obj = f[path]
    print(f'Attributes of {obj}:\n\t{list(obj.attrs.items())}\n')

Attributes of <HDF5 group "/" (3 members)>:
	[('attribute example 1', 'Example 1'), ('attribute example 2', 15.0)]

Attributes of <HDF5 dataset "chunked": shape (100, 200), type "<i8">:
	[('description', 'Chunked dataset')]

Attributes of <HDF5 dataset "unlimited": shape (34, 76), type "<f8">:
	[('description', 'Unlimited dataset')]



Reading an attribute is similar to accessing a dictionary value:

In [33]:
f['/several/levels/unlimited'].attrs['description']

'Unlimited dataset'

## Reading Dataset Data

h5py supports accessing dataset data in a fashion similar to NumPy arrays. (The same also applies to writing dataset data, but this is a very short tutorial!) The familiar NumPy slicing syntax is translated to HDF5 [hyperslab selections](https://portal.hdfgroup.org/display/HDF5/Reading+From+or+Writing+To+a+Subset+of+a+Dataset). Check out h5py's [documentation](https://docs.h5py.org/en/2.10.0/high/dataset.html#reading-writing-data) for more information.

Below are several examples of reading the `chunked` dataset.

In [34]:
chnkd = f['/chunked']
chnkd

<HDF5 dataset "chunked": shape (100, 200), type "<i8">

Note that the `chnkd` object **is not** a NumPy array itself but the result of slicing (reading) it is.

In [35]:
out = chnkd[10:20, 167:177]
out

array([[2167, 2168, 2169, 2170, 2171, 2172, 2173, 2174, 2175, 2176],
       [2367, 2368, 2369, 2370, 2371, 2372, 2373, 2374, 2375, 2376],
       [2567, 2568, 2569, 2570, 2571, 2572, 2573, 2574, 2575, 2576],
       [2767, 2768, 2769, 2770, 2771, 2772, 2773, 2774, 2775, 2776],
       [2967, 2968, 2969, 2970, 2971, 2972, 2973, 2974, 2975, 2976],
       [3167, 3168, 3169, 3170, 3171, 3172, 3173, 3174, 3175, 3176],
       [3367, 3368, 3369, 3370, 3371, 3372, 3373, 3374, 3375, 3376],
       [3567, 3568, 3569, 3570, 3571, 3572, 3573, 3574, 3575, 3576],
       [3767, 3768, 3769, 3770, 3771, 3772, 3773, 3774, 3775, 3776],
       [3967, 3968, 3969, 3970, 3971, 3972, 3973, 3974, 3975, 3976]])

In [36]:
type(out)

numpy.ndarray

In [37]:
chnkd[10:20:2, 167:177:3]

array([[2167, 2170, 2173, 2176],
       [2567, 2570, 2573, 2576],
       [2967, 2970, 2973, 2976],
       [3367, 3370, 3373, 3376],
       [3767, 3770, 3773, 3776]])

In [38]:
chnkd[15:18, [170, 172, 176]]

array([[3170, 3172, 3176],
       [3370, 3372, 3376],
       [3570, 3572, 3576]])

In [39]:
f['/g1/contiguous'][...]

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=int32)
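Writing uses the same slicing syntax on the left-hand side of an assignment. The sketch below is an illustration under assumptions: the scratch file, the dataset name `grid`, and the fill value `-1` are hypothetical choices made so that the unwritten elements are easy to spot.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical scratch file used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), 'write-demo.h5')

with h5py.File(path, mode='w') as f:
    dset = f.create_dataset('grid', shape=(4, 6), dtype='i4', fillvalue=-1)

    # Write a 2x3 block into the middle of the dataset.
    dset[1:3, 2:5] = np.arange(6).reshape(2, 3)

    # A scalar is broadcast over the selected row.
    dset[0, :] = 7

    # Read everything back as a NumPy array.
    result = dset[...]

print(result)
```

Elements that were never written are returned as the dataset's fill value.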

## Low-Level h5py API

So far the tutorial has featured the high-level h5py API. h5py also has a low-level API on which the high-level API is built. The low-level API consists of Python classes and methods that closely follow the HDF5 C API. Complete documentation for the low-level API is at http://api.h5py.org.

The low-level API allows fine-grained control over all aspects of the HDF5 library. Finding an appropriate low-level class or method is very easy. For example, let's consider the [`H5Dset_extent()`](https://portal.hdfgroup.org/display/HDF5/H5D_SET_EXTENT) C function. It belongs to the HDF5 dataset (H5D) API. Its equivalent in the low-level h5py API is the [`h5py.h5d.DatasetID.set_extent()`](http://api.h5py.org/h5d.html#h5py.h5d.DatasetID.set_extent) method.

The low-level API is available from all h5py high-level objects via the `.id` property. Using the `chnkd` dataset:

In [40]:
chnkd.id

<h5py.h5d.DatasetID at 0x111ecb860>

`h5py.h5d.DatasetID` is the class that provides the methods of the HDF5 library's [H5D C API](https://portal.hdfgroup.org/display/HDF5/Datasets). For example, to find out the space allocation status of this dataset ([`H5Dget_space_status`](https://portal.hdfgroup.org/display/HDF5/H5D_GET_SPACE_STATUS)):

In [41]:
status = {h5py.h5d.SPACE_STATUS_NOT_ALLOCATED: 'SPACE_STATUS_NOT_ALLOCATED',
          h5py.h5d.SPACE_STATUS_PART_ALLOCATED: 'SPACE_STATUS_PART_ALLOCATED',
          h5py.h5d.SPACE_STATUS_ALLOCATED: 'SPACE_STATUS_ALLOCATED'}
status[chnkd.id.get_space_status()]

'SPACE_STATUS_ALLOCATED'
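Other low-level methods follow the same pattern. The sketch below, which assumes a hypothetical scratch file and dataset, reads the dataset creation property list (the counterpart of `H5Dget_create_plist`) and changes the dataset's extent with `set_extent()` (the counterpart of `H5Dset_extent`).

```python
import os
import tempfile

import h5py

# Hypothetical scratch file used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), 'lowlevel-demo.h5')

with h5py.File(path, mode='w') as f:
    dset = f.create_dataset('resizable', shape=(10, 20), maxshape=(200, 150),
                            chunks=(5, 10), dtype='f4')

    # Dataset creation property list, as returned by H5Dget_create_plist.
    dcpl = dset.id.get_create_plist()
    is_chunked = dcpl.get_layout() == h5py.h5d.CHUNKED
    chunk_shape = dcpl.get_chunk()

    # Low-level counterpart of the high-level resize() method.
    dset.id.set_extent((30, 40))
    new_shape = dset.id.shape

print(is_chunked, chunk_shape, new_shape)
```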

The HDF5 library's [identifier](https://portal.hdfgroup.org/display/HDF5/Using+Identifiers) for the `chnkd` dataset is available from the `.id` property of the `chnkd.id` object:

In [42]:
chnkd.id.id

360287970189639692
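An identifier is only meaningful while the object it refers to is open. The following sketch, which assumes a hypothetical scratch file and dataset, uses the `valid` property of the low-level object to show that closing the file invalidates the dataset's identifier.

```python
import os
import tempfile

import h5py

# Hypothetical scratch file used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), 'identifier-demo.h5')

f = h5py.File(path, mode='w')
dset = f.create_dataset('x', data=[1, 2, 3])

# The raw HDF5 identifier is a plain Python integer.
raw_id = dset.id.id
valid_before = dset.id.valid

# Closing the file also closes the objects opened from it.
f.close()
valid_after = dset.id.valid

print(type(raw_id).__name__, valid_before, valid_after)
```

The numeric value of the identifier is assigned by the HDF5 library at run time, so it differs between sessions and should not be stored.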

In [43]:
f.close()

# THE END