# H5Py Primer
---
h5py is a Python library for reading and writing HDF5 files. The sections below cover its most commonly used features.

Table of Contents:
1. [Importing h5py](#section1)
2. [Working with files](#section2)
3. [Working with groups](#section3)
4. [Working with datasets](#section4)
5. [Working with attributes](#section5)

REFERENCES:
- [1] Johansson, *Numerical Python: A Practical Techniques Approach for Industry*

## 1. Importing h5py <a id='section1'></a>
Before using the various commands from the h5py module, you first have to load it.

In [1]:
# import the library
import h5py
import numpy as np #numpy will also be needed for some of the examples below
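Some calls later in this primer depend on the installed h5py version (for example, `Dataset.value` was removed in h5py 3.0), so it can help to check the version right after importing. The snippet below is a minimal sketch; the variable names are only illustrative.

```python
import h5py

# Version of the h5py package itself, e.g. '3.x.y'
h5py_version = h5py.__version__

# Version of the underlying HDF5 C library that h5py was built against
hdf5_version = h5py.version.hdf5_version

# 'Dataset.value' is only available on h5py releases before 3.0
major = int(h5py_version.split(".")[0])
has_value_attr = major < 3

print(h5py_version, hdf5_version, has_value_attr)
```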

## 2. Working with files <a id='section2'></a>
This section shows examples of how to create new HDF5 files as well as open and read existing HDF5 files.

In [2]:
# create a new file in write mode
%mkdir -p h5py_files
f = h5py.File('h5py_files/testfile.hdf5', mode='w')

In [3]:
# check the current mode of a file
# NOTE: whatever mode string was used to create/open the file, h5py reports only read-only ('r') or read-write ('r+').
f.mode

'r+'

In [4]:
# flush buffer
f.flush()

In [5]:
# close a file
f.close()

In [6]:
# open an existing file in read-only mode (file must exist)
f = h5py.File('h5py_files/testfile.hdf5', mode='r')
f.mode

'r'

In [7]:
# open an existing file in read-write mode (file must exist)
f.flush()
f.close()
f = h5py.File('h5py_files/testfile.hdf5', mode='r+')
f.mode

'r+'

In [8]:
# try to create a new file, fail if file already exists
f2 = h5py.File('h5py_files/testfile2.hdf5', mode='x')
f2.mode

OSError: Unable to create file (unable to open file: name = 'h5py_files/testfile2.hdf5', errno = 17, error message = 'File exists', flags = 15, o_flags = c2)
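The error above is the expected behaviour of mode `'x'` when the target file already exists. A minimal sketch of handling it is shown below; it writes to a temporary directory (an assumption made here so that the primer's own files are untouched) and catches `OSError`, the exception type shown in the output above.

```python
import os
import tempfile

import h5py

with tempfile.TemporaryDirectory() as tmpdir:
    # Hypothetical file name, used only for this illustration
    path = os.path.join(tmpdir, "exclusive.hdf5")

    # The first exclusive creation succeeds because the file does not exist yet
    h5py.File(path, mode="x").close()

    # The second attempt fails because the file now exists
    try:
        h5py.File(path, mode="x")
        created_twice = True
    except OSError as err:
        created_twice = False
        print("could not create file:", type(err).__name__)
```

Mode `'w-'` is accepted by h5py as an equivalent spelling of `'x'`.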

In [9]:
# open an existing file in read-write mode, create file if it doesn't already exist
f.flush()
f.close()
f = h5py.File('h5py_files/testfile.hdf5', mode='a')
f.mode

'r+'
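The cells above call `flush()` and `close()` by hand. As an alternative, `h5py.File` can be used as a context manager, which closes the file when the block exits. The sketch below uses a temporary directory and a hypothetical file name so that it does not interfere with `testfile.hdf5`.

```python
import os
import tempfile

import h5py

with tempfile.TemporaryDirectory() as tmpdir:
    # Hypothetical scratch file for this illustration
    path = os.path.join(tmpdir, "scratch.hdf5")

    # The file is closed automatically at the end of the 'with' block
    with h5py.File(path, mode="w") as fh:
        mode_inside = fh.mode       # reported as 'r+', as in the cells above
        is_open_inside = bool(fh)   # an open File object is truthy

    # After the block, the File object still exists but is closed (falsy)
    is_open_after = bool(fh)

print(mode_inside, is_open_inside, is_open_after)
```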

## 3. Working with groups <a id='section3'></a>
This section illustrates how to create and explore groups in an hdf5 file.

In [10]:
# first make sure the file is opened in read-write mode
f = h5py.File('h5py_files/testfile.hdf5', mode='a')

In [11]:
# read the name of the root group
f.name

'/'

In [12]:
# create a new subgroup
grp1 = f.create_group("experiment1")
grp1.name

'/experiment1'

In [13]:
# create a group hierarchy (automatically creating parent groups if they don't already exist)
grp2_s1 = f.create_group("experiment2/simulation1")
grp2_s2 = f.create_group("experiment2/simulation2")
grp2_s1.name

'/experiment2/simulation1'

In [14]:
# dictionary-like lookup for a group
f['/'], f['/experiment1'], f['/experiment2']

(<HDF5 group "/" (2 members)>,
 <HDF5 group "/experiment1" (0 members)>,
 <HDF5 group "/experiment2" (2 members)>)

In [15]:
# dictionary-like lookup for a subgroup
exp2 = f['/experiment2']
exp2['simulation2']

<HDF5 group "/experiment2/simulation2" (0 members)>

In [16]:
# get names of subgroups within a group
list(f.keys())

['experiment1', 'experiment2']

In [17]:
# get (name, value) tuples for each item in a group
list(f.items())

[('experiment1', <HDF5 group "/experiment1" (0 members)>),
 ('experiment2', <HDF5 group "/experiment2" (2 members)>)]

In [18]:
# traverse the hierarchy of groups in a file
f.visit(lambda x: print(x))

experiment1
experiment2
experiment2/simulation1
experiment2/simulation2


In [19]:
# traverse the hierarchy of (name, item) tuples in a file
f.visititems(lambda name, item: print(name, item))

experiment1 <HDF5 group "/experiment1" (0 members)>
experiment2 <HDF5 group "/experiment2" (2 members)>
experiment2/simulation1 <HDF5 group "/experiment2/simulation1" (0 members)>
experiment2/simulation2 <HDF5 group "/experiment2/simulation2" (0 members)>
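The callbacks passed to `visit` and `visititems` do not have to be lambdas. The sketch below uses a named function (`collect`, a hypothetical helper) to record what was visited, and also illustrates that a callback returning anything other than `None` stops the traversal, with that value returned by `visit`. It builds its own small file in a temporary directory.

```python
import os
import tempfile

import h5py
import numpy as np

found = []

def collect(name, obj):
    """Record each visited object; returning None keeps the traversal going."""
    kind = "dataset" if isinstance(obj, h5py.Dataset) else "group"
    found.append((name, kind))

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "visit_demo.hdf5")  # hypothetical file name
    with h5py.File(path, "w") as fh:
        fh.create_group("experiment2/simulation1")
        fh.create_dataset("experiment1/data", data=np.zeros(3))

        # Visit every group and dataset below the root group
        fh.visititems(collect)

        # Stop at the first name containing 'simulation' and return it
        first_sim = fh.visit(lambda name: name if "simulation" in name else None)

print(found)
print(first_sim)
```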


In [20]:
# test group membership with the 'in' operator
print('simulation1' in f)
print('simulation1' in f['experiment1'])
print('simulation1' in f['experiment2'])

False
False
True
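Membership tests are often combined with lookups that should not fail when a name is missing. The sketch below, run against a throwaway file in a temporary directory, shows a full-path membership test, `get()` with its default return value, and `require_group()`, which opens a group if it exists and creates it otherwise. The group name `experiment3` is only illustrative.

```python
import os
import tempfile

import h5py

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "groups_demo.hdf5")  # hypothetical file name
    with h5py.File(path, "w") as fh:
        fh.create_group("experiment2/simulation1")

        # 'in' also accepts a path relative to the group being tested
        nested_found = "experiment2/simulation1" in fh

        # get() returns None (or a supplied default) instead of raising KeyError
        missing = fh.get("experiment3")

        # require_group() creates the group on the first call and reopens it afterwards
        grp_a = fh.require_group("experiment3")
        grp_b = fh.require_group("experiment3")
        same_name = grp_a.name == grp_b.name

        names = sorted(fh.keys())

print(nested_found, missing, same_name, names)
```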


In [21]:
# use external hdf5 utilities to explore a file (provided by the package hdf5-tools)
f.flush()
f.close()
!h5ls -r ./h5py_files/testfile.hdf5

/                        Group
/experiment1             Group
/experiment2             Group
/experiment2/simulation1 Group
/experiment2/simulation2 Group


## 4. Working with datasets <a id='section4'></a>
This section illustrates how to create, retrieve, modify, and delete datasets in an hdf5 file.

In [22]:
# first make sure the file is opened in read-write mode
f = h5py.File('h5py_files/testfile.hdf5', mode='a')

In [23]:
# create a dataset using the 'create_dataset' method
data1 = np.random.randn(100, 100)
f.create_dataset('experiment1/simulation1/data1', data=data1)
f.visititems(lambda name, item: print(name, item))

experiment1 <HDF5 group "/experiment1" (1 members)>
experiment1/simulation1 <HDF5 group "/experiment1/simulation1" (1 members)>
experiment1/simulation1/data1 <HDF5 dataset "data1": shape (100, 100), type "<f8">
experiment2 <HDF5 group "/experiment2" (2 members)>
experiment2/simulation1 <HDF5 group "/experiment2/simulation1" (0 members)>
experiment2/simulation2 <HDF5 group "/experiment2/simulation2" (0 members)>


In [24]:
# create a dataset of zeros using the 'fillvalue' keyword argument
f.create_dataset('experiment1/simulation1/data2', shape=(100, 100), fillvalue=0, dtype='float64')
f.visititems(lambda name, item: print(name, item))

experiment1 <HDF5 group "/experiment1" (1 members)>
experiment1/simulation1 <HDF5 group "/experiment1/simulation1" (2 members)>
experiment1/simulation1/data1 <HDF5 dataset "data1": shape (100, 100), type "<f8">
experiment1/simulation1/data2 <HDF5 dataset "data2": shape (100, 100), type "<f8">
experiment2 <HDF5 group "/experiment2" (2 members)>
experiment2/simulation1 <HDF5 group "/experiment2/simulation1" (0 members)>
experiment2/simulation2 <HDF5 group "/experiment2/simulation2" (0 members)>
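To make the effect of `fillvalue` visible, the sketch below uses a non-zero fill value on a small dataset in a temporary file: elements that have never been written read back as the fill value. The dataset name `filled` and the value `-1.0` are arbitrary choices for illustration.

```python
import os
import tempfile

import h5py
import numpy as np

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "fill_demo.hdf5")  # hypothetical file name
    with h5py.File(path, "w") as fh:
        dset = fh.create_dataset("filled", shape=(3, 4), dtype="float64", fillvalue=-1.0)

        # Nothing has been written yet, so every element reads as the fill value
        before = dset[...]

        # Write only the first row; the other rows keep the fill value
        dset[0, :] = 5.0
        after = dset[...]

        stored_fill = dset.fillvalue

print(before)
print(after)
```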


In [25]:
# create a dataset by direct assignment
data3 = np.random.randn(100, 100)
f['/experiment1/simulation1/data3'] = data3
f.visititems(lambda name, item: print(name, item))

experiment1 <HDF5 group "/experiment1" (1 members)>
experiment1/simulation1 <HDF5 group "/experiment1/simulation1" (3 members)>
experiment1/simulation1/data1 <HDF5 dataset "data1": shape (100, 100), type "<f8">
experiment1/simulation1/data2 <HDF5 dataset "data2": shape (100, 100), type "<f8">
experiment1/simulation1/data3 <HDF5 dataset "data3": shape (100, 100), type "<f8">
experiment2 <HDF5 group "/experiment2" (2 members)>
experiment2/simulation1 <HDF5 group "/experiment2/simulation1" (0 members)>
experiment2/simulation2 <HDF5 group "/experiment2/simulation2" (0 members)>


In [26]:
# retrieve a dataset by dictionary-like lookup
dset = f['/experiment1/simulation1/data1']
dset

<HDF5 dataset "data1": shape (100, 100), type "<f8">

In [27]:
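# inspect dataset properties (Python properties of the Dataset object, not HDF5 attributes; see section 5)
print('dataset name is', dset.name)
print('dataset type is', dset.dtype)
print('dataset shape is', dset.shape)
print('dataset length is', dset.len())
print('dataset data are: ', dset[()])  # 'dset.value' is deprecated and was removed in h5py 3.0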

dataset name is /experiment1/simulation1/data1
dataset type is float64
dataset shape is (100, 100)
dataset length is 100
dataset data are:  [[ 0.00759621 -0.79660501 -0.2410277  ... -1.80652704 -1.30000677
  -0.43426946]
 [ 1.15378264  0.64884298 -0.02068101 ... -0.85425229  0.88495712
   0.20576674]
 [-0.06303498  1.31807936 -0.607546   ... -0.14065196 -0.31403767
  -1.60787702]
 ...
 [ 0.84856364  0.08263046 -0.55221824 ... -0.05525498 -0.13053546
  -1.29125411]
 [-1.48165655  0.87319025 -1.50751444 ...  0.81342877 -0.37800457
  -0.58107189]
 [-0.72237232 -0.56761963 -1.16381831 ...  0.8182137   1.89720933
   0.05937715]]




In [28]:
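# extract data from a dataset using the 'value' attribute
### DEPRECATED: 'value' was removed in h5py 3.0, where this cell raises an AttributeError; use dset[...] or dset[()] instead ###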
npdset = dset.value
print(type(npdset))
npdset

<class 'numpy.ndarray'>


array([[ 0.00759621, -0.79660501, -0.2410277 , ..., -1.80652704,
        -1.30000677, -0.43426946],
       [ 1.15378264,  0.64884298, -0.02068101, ..., -0.85425229,
         0.88495712,  0.20576674],
       [-0.06303498,  1.31807936, -0.607546  , ..., -0.14065196,
        -0.31403767, -1.60787702],
       ...,
       [ 0.84856364,  0.08263046, -0.55221824, ..., -0.05525498,
        -0.13053546, -1.29125411],
       [-1.48165655,  0.87319025, -1.50751444, ...,  0.81342877,
        -0.37800457, -0.58107189],
       [-0.72237232, -0.56761963, -1.16381831, ...,  0.8182137 ,
         1.89720933,  0.05937715]])

In [29]:
# extract data from dataset using the '[...]' syntax
npdset = dset[...]
print(type(npdset))
npdset

<class 'numpy.ndarray'>


array([[ 0.00759621, -0.79660501, -0.2410277 , ..., -1.80652704,
        -1.30000677, -0.43426946],
       [ 1.15378264,  0.64884298, -0.02068101, ..., -0.85425229,
         0.88495712,  0.20576674],
       [-0.06303498,  1.31807936, -0.607546  , ..., -0.14065196,
        -0.31403767, -1.60787702],
       ...,
       [ 0.84856364,  0.08263046, -0.55221824, ..., -0.05525498,
        -0.13053546, -1.29125411],
       [-1.48165655,  0.87319025, -1.50751444, ...,  0.81342877,
        -0.37800457, -0.58107189],
       [-0.72237232, -0.56761963, -1.16381831, ...,  0.8182137 ,
         1.89720933,  0.05937715]])

In [30]:
# extract data from dataset using the '[()]' syntax
npdset = dset[()]
print(type(npdset))
npdset

<class 'numpy.ndarray'>


array([[ 0.00759621, -0.79660501, -0.2410277 , ..., -1.80652704,
        -1.30000677, -0.43426946],
       [ 1.15378264,  0.64884298, -0.02068101, ..., -0.85425229,
         0.88495712,  0.20576674],
       [-0.06303498,  1.31807936, -0.607546  , ..., -0.14065196,
        -0.31403767, -1.60787702],
       ...,
       [ 0.84856364,  0.08263046, -0.55221824, ..., -0.05525498,
        -0.13053546, -1.29125411],
       [-1.48165655,  0.87319025, -1.50751444, ...,  0.81342877,
        -0.37800457, -0.58107189],
       [-0.72237232, -0.56761963, -1.16381831, ...,  0.8182137 ,
         1.89720933,  0.05937715]])

In [31]:
# get part of the data from a dataset using numpy-like slicing
# NOTE: the slicing is done by the HDF5 library, not NumPy, so only the selected
# elements are read from the file; the rest of the dataset is never loaded into memory
first_col = dset[:,0]
first_col.shape

(100,)
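A few more selection forms work the same way as in NumPy. The sketch below creates a dataset with predictable values in a temporary file (the name `grid` is illustrative) and reads a strided block, a single row, and a single element; each read returns a NumPy object holding only the selected data.

```python
import os
import tempfile

import h5py
import numpy as np

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "slice_demo.hdf5")  # hypothetical file name
    with h5py.File(path, "w") as fh:
        values = np.arange(100 * 100, dtype="float64").reshape(100, 100)
        dset = fh.create_dataset("grid", data=values)

        # Rows 10..19, every second column
        block = dset[10:20, ::2]

        # A single row
        row = dset[5]

        # A single element
        corner = dset[0, 0]

print(block.shape, row.shape, corner)
```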

In [32]:
# change/fill dataset using numpy-like assignments
dset[:, 0] = np.arange(100)
dset[:, 0]

array([ 0.,  1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9., 10., 11., 12.,
       13., 14., 15., 16., 17., 18., 19., 20., 21., 22., 23., 24., 25.,
       26., 27., 28., 29., 30., 31., 32., 33., 34., 35., 36., 37., 38.,
       39., 40., 41., 42., 43., 44., 45., 46., 47., 48., 49., 50., 51.,
       52., 53., 54., 55., 56., 57., 58., 59., 60., 61., 62., 63., 64.,
       65., 66., 67., 68., 69., 70., 71., 72., 73., 74., 75., 76., 77.,
       78., 79., 80., 81., 82., 83., 84., 85., 86., 87., 88., 89., 90.,
       91., 92., 93., 94., 95., 96., 97., 98., 99.])
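Assignments like the one above are written to the file, not to an in-memory copy. The sketch below checks this on a small scratch dataset by closing the file and reading it back; it also assigns a scalar to a whole column, which is broadcast across the selection. File and dataset names are hypothetical.

```python
import os
import tempfile

import h5py
import numpy as np

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "write_demo.hdf5")  # hypothetical file name

    with h5py.File(path, "w") as fh:
        dset = fh.create_dataset("data", shape=(4, 3), dtype="float64")
        dset[:, 0] = np.arange(4)   # array assignment to the first column
        dset[:, 1] = 7.0            # scalar assignment, broadcast over the column

    # Reopen read-only and confirm the changes were stored in the file
    with h5py.File(path, "r") as fh:
        stored = fh["data"][...]

print(stored)
```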

In [33]:
# delete items from a group
del f['experiment1/simulation1/data3']
f.visititems(lambda name, item: print(name, item))

experiment1 <HDF5 group "/experiment1" (1 members)>
experiment1/simulation1 <HDF5 group "/experiment1/simulation1" (2 members)>
experiment1/simulation1/data1 <HDF5 dataset "data1": shape (100, 100), type "<f8">
experiment1/simulation1/data2 <HDF5 dataset "data2": shape (100, 100), type "<f8">
experiment2 <HDF5 group "/experiment2" (2 members)>
experiment2/simulation1 <HDF5 group "/experiment2/simulation1" (0 members)>
experiment2/simulation2 <HDF5 group "/experiment2/simulation2" (0 members)>
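`del` works on groups as well as datasets. As a minimal sketch on a scratch file, deleting a group makes everything that was reachable only through it inaccessible too, while its parent group stays in place.

```python
import os
import tempfile

import h5py
import numpy as np

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "delete_demo.hdf5")  # hypothetical file name
    with h5py.File(path, "w") as fh:
        fh.create_dataset("experiment1/simulation1/data1", data=np.zeros((2, 2)))
        fh.create_group("experiment2")

        # Remove the whole 'simulation1' group, including the dataset inside it
        del fh["experiment1/simulation1"]

        dataset_gone = "experiment1/simulation1/data1" not in fh
        parent_kept = "experiment1" in fh
        remaining = sorted(fh.keys())

print(dataset_gone, parent_kept, remaining)
```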


## 5. Working with attributes <a id='section5'></a>
This section illustrates how to read, create, and modify attributes (metadata) contained in HDF5 files.

In [34]:
# first make sure the file is opened in read-write mode
f = h5py.File('h5py_files/testfile.hdf5', mode='a')

In [35]:
# access attributes of hdf5 objects using the 'attrs' property
f.attrs

<Attributes of HDF5 object at 139836888465832>

In [36]:
# retrieve the names of the attributes
list(f.attrs.keys())

[]

In [37]:
# create an attribute on the root group
f.attrs['description'] = 'Simulation data for project X'
list(f['/'].attrs.keys())

['description']

In [38]:
# create attributes on a (sub)group
f['experiment1'].attrs['mass ratio'] = '2.5'
f['experiment2'].attrs['mass ratio'] = '1.0'
list(f['experiment1'].attrs.keys()), list(f['experiment2'].attrs.keys())

(['mass ratio'], ['mass ratio'])

In [39]:
# create attributes on a dataset
f['experiment1/simulation1/data1'].attrs['m1'] = 1.0
f['experiment1/simulation1/data1'].attrs['m2'] = 2.5
f['experiment1/simulation1/data1'].attrs['r0'] = 30.0
list(f['experiment1/simulation1/data1'].attrs.keys())

['m1', 'm2', 'r0']
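The cells above only list attribute names. Reading the values back uses the same dictionary-like interface; the sketch below repeats the three attributes on a scratch dataset in a temporary file and collects them into a plain Python dictionary (`meta` is just an illustrative name).

```python
import os
import tempfile

import h5py
import numpy as np

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "attrs_demo.hdf5")  # hypothetical file name
    with h5py.File(path, "w") as fh:
        dset = fh.create_dataset("data1", data=np.zeros(3))
        dset.attrs["m1"] = 1.0
        dset.attrs["m2"] = 2.5
        dset.attrs["r0"] = 30.0

        # Read a single attribute value by name
        m2 = dset.attrs["m2"]

        # Copy all attributes into a regular dict of Python floats
        meta = {key: float(value) for key, value in dset.attrs.items()}

print(m2, meta)
```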

In [40]:
# test for the existence of an attribute using the 'in' operator
print('mass ratio' in f['experiment1'].attrs)
print('mass ratio' in f['experiment2'].attrs)

True
True


In [41]:
# delete existing attributes
del f['experiment1'].attrs['mass ratio']
print('mass ratio' in f['experiment1'].attrs)

False
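Modifying an attribute is done by assigning to an existing name, and reading a name that may be missing is safer with `get()`. The sketch below, using a scratch file and numeric values chosen only for illustration, overwrites an attribute, deletes it, and then shows both the `KeyError` raised by direct lookup and the default returned by `get()`.

```python
import os
import tempfile

import h5py

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "attrs_modify_demo.hdf5")  # hypothetical file name
    with h5py.File(path, "w") as fh:
        grp = fh.create_group("experiment1")

        # Assigning to an existing name replaces the stored value
        grp.attrs["mass ratio"] = 2.5
        grp.attrs["mass ratio"] = 3.0
        current = float(grp.attrs["mass ratio"])

        del grp.attrs["mass ratio"]

        # Direct lookup of a missing attribute raises KeyError
        try:
            grp.attrs["mass ratio"]
            raised = False
        except KeyError:
            raised = True

        # get() returns the supplied default instead
        fallback = grp.attrs.get("mass ratio", 1.0)

print(current, raised, fallback)
```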
