# 1 HDF5 Overview

>Objectives:
>
> * Learn about HDF5 data structure
> * Use HDF5 in Python with h5py and PyTables

## HDF5 Structure

HDF5, or Hierarchical Data Format version 5, is a data format that structures and organises data hierarchically.  An HDF5 file has 3 components, which function much like a directory structure on a computer:

* Groups (directories in a file system)
* Datasets (files in a file system)
* Attributes (metadata for a file or directory)

<img src="../img/hdf5_structure.jpg" style="height:350px">
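To make the three components concrete, here is a minimal sketch using h5py. The file name, group name, dataset name, and attribute keys are all hypothetical and chosen only for illustration.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical file, group, dataset, and attribute names for illustration.
path = os.path.join(tempfile.mkdtemp(), "structure_demo.h5")

with h5py.File(path, "w") as f:
    grp = f.create_group("experiment")  # group: like a directory
    dset = grp.create_dataset("counts", data=np.arange(6).reshape(2, 3))  # dataset: like a file
    grp.attrs["description"] = "toy example"  # attribute: metadata on the group
    dset.attrs["units"] = "reads"             # attribute: metadata on the dataset

with h5py.File(path, "r") as f:
    dset = f["experiment/counts"]  # path-style access, like a file system
    dataset_shape = dset.shape
    dataset_units = dset.attrs["units"]
    group_description = f["experiment"].attrs["description"]

print(dataset_shape, dataset_units, group_description)
```

Attributes live on the object they describe, so they travel with the group or dataset when the file is shared.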

Here's an example of sequencing data stored in HDF5 format:
```
                              group              name       otype  dclass       dim
0                                 /           YAL001C   H5I_GROUP                  
1                          /YAL001C 2016_Weinberg_RPF   H5I_GROUP                  
2        /YAL001C/2016_Weinberg_RPF             reads   H5I_GROUP                  
3  /YAL001C/2016_Weinberg_RPF/reads              data H5I_DATASET INTEGER 36 x 3980
4                                 /           YAL002W   H5I_GROUP                  
5                          /YAL002W 2016_Weinberg_RPF   H5I_GROUP                  
6        /YAL002W/2016_Weinberg_RPF             reads   H5I_GROUP                  
7  /YAL002W/2016_Weinberg_RPF/reads              data H5I_DATASET INTEGER 36 x 4322
8                                 /           YAL003W   H5I_GROUP                  
9                          /YAL003W 2016_Weinberg_RPF   H5I_GROUP                  
10       /YAL003W/2016_Weinberg_RPF             reads   H5I_GROUP                  
11 /YAL003W/2016_Weinberg_RPF/reads              data H5I_DATASET INTEGER 36 x 1118
[...]
```

The listing shows 3 sets of reads, each stored under its own top-level group, with the data itself held as a 2-D array of integers.  Each row represents a read length, and each column represents a nucleotide position.
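As a rough illustration of that row/column layout, the sketch below builds a small random array with 36 rows and summarises it along each axis. The values are synthetic stand-ins, not the real sequencing data, and the column count of 20 is arbitrary.

```python
import numpy as np

# Synthetic stand-in for one "data" dataset:
# 36 rows (read lengths) x 20 columns (nucleotide positions).
rng = np.random.default_rng(seed=0)
reads = rng.integers(low=0, high=5, size=(36, 20))

reads_per_position = reads.sum(axis=0)  # collapse read lengths: one total per position
reads_per_length = reads.sum(axis=1)    # collapse positions: one total per read length

print(reads_per_position.shape, reads_per_length.shape)
```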

We'll cover two different Python packages for working with HDF5 data:

* PyTables
  * Provides a high-level API
  * Adds advanced features to HDF5 like compression and improved indexing
  * "Batteries included"
* h5py
  * Pythonic interface to HDF5
  * Exposes the entire HDF5 library in Python
  * Much more "low-level"

To start, we'll set up our environment for the tutorial:

In [2]:
import numpy as np
import tables
import h5py
import os
import shutil
data_dir = "../demos/data/test_data"
if os.path.exists(data_dir):
    shutil.rmtree(data_dir)
os.mkdir(data_dir)

## PyTables

To create an HDF5 file:

In [3]:
FILENAME = os.path.join(data_dir, "layout.h5")
f = tables.open_file(FILENAME, "w")

Create a group:

In [4]:
group = f.create_group('/', 'a_group')
group

/a_group (Group) ''
  children := []

Create datasets inside this group:

In [5]:
f.create_array(group, "my_array1", np.arange(10))
f.create_array(group, "my_array2", np.ones(100).reshape(10, 10));

In [6]:
# Create another group
f.create_group('/a_group', 'another_group')

/a_group/another_group (Group) ''
  children := []

Inspect the structure of the HDF5 file:

In [7]:
print(f)

../demos/data/test_data/layout.h5 (File) ''
Last modif.: 'Fri Apr 12 00:16:58 2019'
Object Tree: 
/ (RootGroup) ''
/a_group (Group) ''
/a_group/my_array1 (Array(10,)) ''
/a_group/my_array2 (Array(10, 10)) ''
/a_group/another_group (Group) ''



### Natural naming in PyTables

One of the nice features of PyTables is its support for natural naming.  This is the common 'dot' notation you will have seen in object-oriented programming:

In [8]:
f.root.a_group.my_array1

/a_group/my_array1 (Array(10,)) ''
  atom := Int64Atom(shape=(), dflt=0)
  maindim := 0
  flavor := 'numpy'
  byteorder := 'little'
  chunkshape := None

This is equivalent to using PyTables function calls:

In [9]:
f.get_node('/a_group/my_array1')

/a_group/my_array1 (Array(10,)) ''
  atom := Int64Atom(shape=(), dflt=0)
  maindim := 0
  flavor := 'numpy'
  byteorder := 'little'
  chunkshape := None
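The equivalence can be checked directly. The following self-contained sketch writes to a temporary file (the file name is hypothetical) and compares three ways of reaching the same node.

```python
import os
import tempfile

import numpy as np
import tables

# Hypothetical temporary file used only for this sketch.
path = os.path.join(tempfile.mkdtemp(), "naming_demo.h5")

with tables.open_file(path, "w") as h5:
    group = h5.create_group("/", "a_group")
    h5.create_array(group, "my_array1", np.arange(10))

    by_attribute = h5.root.a_group.my_array1          # natural naming
    by_path = h5.get_node("/a_group/my_array1")       # full path
    by_parent = h5.get_node("/a_group", "my_array1")  # parent path plus node name

    # All three handles refer to the same location in the file.
    paths = [node._v_pathname for node in (by_attribute, by_path, by_parent)]
    values = by_path.read()

print(paths)
```

Natural naming is convenient interactively, while `get_node()` is handier when the path is built at run time as a string.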

In [10]:
f.close()

## h5py

In [11]:
import h5py

In [12]:
f = h5py.File(FILENAME, 'a')

In [13]:
list(f)

['a_group']

With h5py, we can access objects as if they were dictionaries:

In [14]:
f['/a_group']

<HDF5 group "/a_group" (3 members)>

We can now access and view members in a familiar Python manner:

In [15]:
grp = f['/a_group']
list(grp.items())

[('another_group', <HDF5 group "/a_group/another_group" (0 members)>),
 ('my_array1', <HDF5 dataset "my_array1": shape (10,), type "<i8">),
 ('my_array2', <HDF5 dataset "my_array2": shape (10, 10), type "<f8">)]

Equivalently, converting the group to a list gives just the member names:

In [16]:
list(grp)

['another_group', 'my_array1', 'my_array2']
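The dict-like interface also covers membership tests, and h5py groups provide a `visit()` method for walking the whole hierarchy. The sketch below rebuilds a small file with a similar layout in a temporary location (the file name is hypothetical).

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical temporary file used only for this sketch.
path = os.path.join(tempfile.mkdtemp(), "visit_demo.h5")

with h5py.File(path, "w") as f:
    grp = f.create_group("a_group")
    grp.create_group("another_group")
    grp["my_array1"] = np.arange(10)

    # Membership tests accept full paths, much like checking a dict key.
    has_array = "a_group/my_array1" in f
    has_missing = "a_group/no_such_node" in f

    # visit() calls the function once per object name below the root;
    # list.append returns None, so the traversal continues to the end.
    all_names = []
    f.visit(all_names.append)

print(has_array, has_missing, sorted(all_names))
```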

## Datatypes in HDF5

One of the features of HDF5 is the ability to mix data sets and datatypes in a single file.  We'll look at how to do this with PyTables and h5py.  First, let's create a homogeneous data set:

In [17]:
arr_to_store = np.arange(10, dtype=np.int8)
arr_to_store

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=int8)

### Using h5py

In [18]:
FILENAME = os.path.join(data_dir, "homogenous_h5py.h5")
f = h5py.File(FILENAME, "w")

We have several options for creating data sets: the `create_dataset` method, or dict-style assignment:

In [19]:
f.create_dataset(data=arr_to_store, name="mydata")

<HDF5 dataset "mydata": shape (10,), type "|i1">

In [20]:
f['/mydata2'] = arr_to_store  

In [21]:
list(f)

['mydata', 'mydata2']

We can read the data set with the `:` slice notation we saw in NumPy earlier:

In [22]:
f['/mydata'][:]

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=int8)

In [23]:
f.close()
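Slicing an h5py dataset returns a NumPy array holding only the selected elements, so a narrower slice can be used in place of `[:]`. A minimal sketch, using a hypothetical temporary file:

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical temporary file used only for this sketch.
path = os.path.join(tempfile.mkdtemp(), "slice_demo.h5")

with h5py.File(path, "w") as f:
    f["mydata"] = np.arange(10, dtype=np.int8)

with h5py.File(path, "r") as f:
    dset = f["mydata"]      # a handle to the dataset; no data is read yet
    middle = dset[2:5]      # reads elements 2, 3, 4
    every_other = dset[::2]  # reads every second element

print(middle, every_other)
```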

We can also use some utilities from the HDF5 library itself to examine datasets, `h5ls` and `h5dump`:

In [24]:
!h5ls {FILENAME}

mydata                   Dataset {10}
mydata2                  Dataset {10}


In [25]:
!h5ls -rv {FILENAME}

Opened "../demos/data/test_data/homogenous_h5py.h5" with sec2 driver.
/                        Group
    Location:  1:96
    Links:     1
/mydata                  Dataset {10/10}
    Location:  1:800
    Links:     1
    Storage:   10 logical bytes, 10 allocated bytes, 100.00% utilization
    Type:      native signed char
/mydata2                 Dataset {10/10}
    Location:  1:1400
    Links:     1
    Storage:   10 logical bytes, 10 allocated bytes, 100.00% utilization
    Type:      native signed char


In [26]:
!h5dump {FILENAME}

HDF5 "../demos/data/test_data/homogenous_h5py.h5" {
GROUP "/" {
   DATASET "mydata" {
      DATATYPE  H5T_STD_I8LE
      DATASPACE  SIMPLE { ( 10 ) / ( 10 ) }
      DATA {
      (0): 0, 1, 2, 3, 4, 5, 6, 7, 8, 9
      }
   }
   DATASET "mydata2" {
      DATATYPE  H5T_STD_I8LE
      DATASPACE  SIMPLE { ( 10 ) / ( 10 ) }
      DATA {
      (0): 0, 1, 2, 3, 4, 5, 6, 7, 8, 9
      }
   }
}
}


### Using PyTables

In [27]:
import tables
FILENAME = os.path.join(data_dir, "homogenous_pytables.h5")
f2 = tables.open_file(FILENAME, "w")

PyTables uses high level objects to wrap datasets:
* **Array** - homogeneous dataset
* **CArray** - chunked homogeneous dataset
* **EArray** - extendable homogeneous dataset
* **Table** - extendable, compound dataset

Chunking is an HDF5 storage layout: rather than storing all of the data in a single, contiguous block in the file, the dataset is split into fixed-size n-dimensional chunks, which can be stored in any order.  This can help with data access patterns and improve performance.  We won't cover chunking in this course.

Extendable simply means we can use an `append()` method to add data to an existing dataset.
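As a brief aside, here is a sketch of what "extendable" looks like in practice with an `EArray`. The file and node names are hypothetical. The array is created with a zero-length first dimension and grown with `append()`.

```python
import os
import tempfile

import numpy as np
import tables

# Hypothetical temporary file used only for this sketch.
path = os.path.join(tempfile.mkdtemp(), "earray_demo.h5")

with tables.open_file(path, "w") as h5:
    # shape=(0,) marks the first dimension as the one that can grow.
    earr = h5.create_earray(h5.root, "growing", atom=tables.Int8Atom(), shape=(0,))

    earr.append(np.arange(5, dtype=np.int8))      # first batch
    earr.append(np.arange(5, 10, dtype=np.int8))  # second batch

    n_rows = int(earr.nrows)
    values = earr.read()
    chunkshape = earr.chunkshape  # extendable arrays are stored chunked

print(n_rows, values, chunkshape)
```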

To begin, let's create an array (homogeneous dataset):

In [28]:
f2.create_array(f2.root, name="mydata", obj=arr_to_store)

/mydata (Array(10,)) ''
  atom := Int8Atom(shape=(), dflt=0)
  maindim := 0
  flavor := 'numpy'
  byteorder := 'irrelevant'
  chunkshape := None

And we can read this data with either the familiar `:` notation:

In [29]:
f2.root.mydata[:]

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=int8)

or with the `read()` method:

In [30]:
f2.root.mydata.read()

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=int8)

In [31]:
f2.close()

## Compound datatypes

Let's generate a data set composed of different datatypes:

In [32]:
dtype = np.dtype([("myfield1", np.int32), ("myfield2", np.float64), ("myfield3", "S5")])
table_to_store = np.fromiter(((i, i**2, "foo_%d"%i) for i in range(10)), dtype=dtype)

In [33]:
table_to_store

array([(0,  0., b'foo_0'), (1,  1., b'foo_1'), (2,  4., b'foo_2'),
       (3,  9., b'foo_3'), (4, 16., b'foo_4'), (5, 25., b'foo_5'),
       (6, 36., b'foo_6'), (7, 49., b'foo_7'), (8, 64., b'foo_8'),
       (9, 81., b'foo_9')],
      dtype=[('myfield1', '<i4'), ('myfield2', '<f8'), ('myfield3', 'S5')])

### Using h5py

In [34]:
FILENAME = os.path.join(data_dir, "compound_h5py.h5")
f = h5py.File(FILENAME, "w")

In [35]:
f['mydata'] = table_to_store

In [36]:
f['mydata']

<HDF5 dataset "mydata": shape (10,), type "|V17">

In [37]:
f['mydata'].dtype

dtype([('myfield1', '<i4'), ('myfield2', '<f8'), ('myfield3', 'S5')])

In [38]:
f['mydata'][:]

array([(0,  0., b'foo_0'), (1,  1., b'foo_1'), (2,  4., b'foo_2'),
       (3,  9., b'foo_3'), (4, 16., b'foo_4'), (5, 25., b'foo_5'),
       (6, 36., b'foo_6'), (7, 49., b'foo_7'), (8, 64., b'foo_8'),
       (9, 81., b'foo_9')],
      dtype=[('myfield1', '<i4'), ('myfield2', '<f8'), ('myfield3', 'S5')])

In [39]:
f.close()

In [40]:
!h5ls -v {FILENAME}

Opened "../demos/data/test_data/compound_h5py.h5" with sec2 driver.
mydata                   Dataset {10/10}
    Location:  1:800
    Links:     1
    Storage:   170 logical bytes, 170 allocated bytes, 100.00% utilization
    Type:      struct {
                   "myfield1"         +0    native int
                   "myfield2"         +4    native double
                   "myfield3"         +12   5-byte null-padded ASCII string
               } 17 bytes
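With a compound dataset, h5py also accepts a field name as an index, which reads just that field. The sketch below rebuilds an equivalent table in a hypothetical temporary file; the records are built with `np.array` rather than `np.fromiter`, purely for this illustration.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical temporary file used only for this sketch.
path = os.path.join(tempfile.mkdtemp(), "compound_demo.h5")

dtype = np.dtype([("myfield1", np.int32), ("myfield2", np.float64), ("myfield3", "S5")])
records = np.array([(i, i**2, f"foo_{i}".encode()) for i in range(10)], dtype=dtype)

with h5py.File(path, "w") as f:
    f["mydata"] = records

with h5py.File(path, "r") as f:
    dset = f["mydata"]
    squares = dset["myfield2"]       # one field, all rows
    first_rows = dset[:3]            # whole records, first three rows
    labels = first_rows["myfield3"]  # field access on the NumPy result

print(squares, labels)
```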


### Using PyTables

Recall that a compound dataset is called a `Table` in PyTables.

In [41]:
FILENAME = os.path.join(data_dir, "compound_pytables1.h5")
f2 = tables.open_file(FILENAME, "w")

We use the `create_table()` method: 

In [42]:
table = f2.create_table(f2.root, name="mydata", description=table_to_store.dtype)
table

/mydata (Table(0,)) ''
  description := {
  "myfield1": Int32Col(shape=(), dflt=0, pos=0),
  "myfield2": Float64Col(shape=(), dflt=0.0, pos=1),
  "myfield3": StringCol(itemsize=5, shape=(), dflt=b'', pos=2)}
  byteorder := 'little'
  chunkshape := (3855,)

PyTables' high-level functions allow us to easily do things like add data with `append()` or remove rows with `remove_row()`:

In [43]:
table.append(table_to_store)

In [44]:
table.read()

array([(0,  0., b'foo_0'), (1,  1., b'foo_1'), (2,  4., b'foo_2'),
       (3,  9., b'foo_3'), (4, 16., b'foo_4'), (5, 25., b'foo_5'),
       (6, 36., b'foo_6'), (7, 49., b'foo_7'), (8, 64., b'foo_8'),
       (9, 81., b'foo_9')],
      dtype=[('myfield1', '<i4'), ('myfield2', '<f8'), ('myfield3', 'S5')])

In [45]:
table.remove_row(5)

In [46]:
table.read()

array([(0,  0., b'foo_0'), (1,  1., b'foo_1'), (2,  4., b'foo_2'),
       (3,  9., b'foo_3'), (4, 16., b'foo_4'), (6, 36., b'foo_6'),
       (7, 49., b'foo_7'), (8, 64., b'foo_8'), (9, 81., b'foo_9')],
      dtype=[('myfield1', '<i4'), ('myfield2', '<f8'), ('myfield3', 'S5')])

In [47]:
f2.close()
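Tables can also be queried in place. The sketch below uses `read_where()` with a condition string to pull out only the matching rows; the file name is hypothetical and the table is rebuilt from scratch so the example stands alone.

```python
import os
import tempfile

import numpy as np
import tables

# Hypothetical temporary file used only for this sketch.
path = os.path.join(tempfile.mkdtemp(), "query_demo.h5")

dtype = np.dtype([("myfield1", np.int32), ("myfield2", np.float64), ("myfield3", "S5")])
records = np.array([(i, i**2, f"foo_{i}".encode()) for i in range(10)], dtype=dtype)

with tables.open_file(path, "w") as h5:
    table = h5.create_table(h5.root, "mydata", description=records.dtype)
    table.append(records)
    table.flush()  # make sure buffered rows are written before querying

    # The condition is a string evaluated against the column names.
    selected = table.read_where("myfield2 > 30")
    selected_ids = selected["myfield1"].tolist()

print(selected_ids)
```

The result is an ordinary NumPy structured array, so it can be processed like `table_to_store` above.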

## Integration with Pandas

Pandas is a Python module that provides convenient data structures and data analysis tools.  One of the central data structures in Pandas is the DataFrame, and we can store DataFrames in HDF5 files from Python.

To start, let's generate a data frame:

In [69]:
import numpy as np
import pandas as pd
import os
import shutil

data_dir = "../demos/data/hdfstore"
if os.path.exists(data_dir):
    shutil.rmtree(data_dir)
os.mkdir(data_dir)

df = pd.DataFrame(np.random.randn(15, 3), columns=['A', 'B', 'C'])

In [70]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15 entries, 0 to 14
Data columns (total 3 columns):
A    15 non-null float64
B    15 non-null float64
C    15 non-null float64
dtypes: float64(3)
memory usage: 440.0 bytes


In [71]:
df.head()

          A         B         C
0  0.434603  0.101105 -1.054922
1  0.301304 -2.718253  0.280248
2  0.672155 -0.477084 -0.159161
3  0.068657 -1.278624 -0.553689
4 -1.661793  0.610631  1.607877


In [72]:
df['A'].mean()

0.15239934298136623

We can easily store a Pandas DataFrame in an HDF5 file with the `HDFStore` class:

In [73]:
# Create an HDF5 file to store the DataFrame
fn = os.path.join(data_dir, 'test.h5')
hdfstore = pd.HDFStore(fn, 'w')

`pandas.HDFStore` acts like a dict, similar to h5py:

In [74]:
hdfstore['my_array'] = df

In [75]:
print(hdfstore)

<class 'pandas.io.pytables.HDFStore'>
File path: ../demos/data/hdfstore/test.h5



In [76]:
hdfstore.put('my_table', df[:5], format='table')
hdfstore

<class 'pandas.io.pytables.HDFStore'>
File path: ../demos/data/hdfstore/test.h5

In [78]:
hdfstore.append('my_table', df[5:])

In [79]:
hdfstore.close()
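Reading the data back is symmetrical. The sketch below is self-contained and uses a hypothetical temporary file: it stores a frame in `table` format, reloads it with `pd.read_hdf()`, and selects a subset of rows with a `where` condition on the index.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical temporary file used only for this sketch.
path = os.path.join(tempfile.mkdtemp(), "roundtrip_demo.h5")

df = pd.DataFrame(np.random.randn(15, 3), columns=["A", "B", "C"])

with pd.HDFStore(path, "w") as store:
    store.put("my_table", df, format="table")  # 'table' format supports queries

restored = pd.read_hdf(path, "my_table")  # load the whole frame back

with pd.HDFStore(path, "r") as store:
    first_five = store.select("my_table", where="index < 5")  # rows 0..4 only

print(restored.shape, first_five.shape)
```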

In [80]:
!ptdump -v {fn}

/ (RootGroup) ''
/my_array (Group) ''
/my_array/axis0 (Array(3,)) ''
  atom := StringAtom(itemsize=1, shape=(), dflt=b'')
  maindim := 0
  flavor := 'numpy'
  byteorder := 'irrelevant'
  chunkshape := None
/my_array/axis1 (Array(15,)) ''
  atom := Int64Atom(shape=(), dflt=0)
  maindim := 0
  flavor := 'numpy'
  byteorder := 'little'
  chunkshape := None
/my_array/block0_items (Array(3,)) ''
  atom := StringAtom(itemsize=1, shape=(), dflt=b'')
  maindim := 0
  flavor := 'numpy'
  byteorder := 'irrelevant'
  chunkshape := None
/my_array/block0_values (Array(15, 3)) ''
  atom := Float64Atom(shape=(), dflt=0.0)
  maindim := 0
  flavor := 'numpy'
  byteorder := 'little'
  chunkshape := None
/my_table (Group) ''
/my_table/table (Table(15,)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "values_block_0": Float64Col(shape=(3,), dflt=0.0, pos=1)}
  byteorder := 'little'
  chunkshape := (2048,)
  autoindex := True
  colindexes := {
    "

## NetCDF

In addition to HDF5, another common I/O library is NetCDF.  Newer versions of NetCDF (NetCDF-4) are built upon HDF5 and have a similar structure (datasets have dimensions, variables, and attributes).

In [81]:
import netCDF4
import numpy as np

In [82]:
f = netCDF4.Dataset('../demos/data/rtofs_glo_3dz_f006_6hrly_reg3.nc')
print(f)

<class 'netCDF4._netCDF4.Dataset'>
root group (NETCDF4_CLASSIC data model, file format HDF5):
    Conventions: CF-1.0
    title: HYCOM ATLb2.00
    institution: National Centers for Environmental Prediction
    source: HYCOM archive file
    experiment: 90.9
    history: archv2ncdf3z
    dimensions(sizes): MT(1), Y(850), X(712), Depth(10)
    variables(dimensions): float64 MT(MT), float64 Date(MT), float32 Depth(Depth), int32 Y(Y), int32 X(X), float32 Latitude(Y,X), float32 Longitude(Y,X), float32 u(MT,Depth,Y,X), float32 v(MT,Depth,Y,X), float32 temperature(MT,Depth,Y,X), float32 salinity(MT,Depth,Y,X)
    groups: 



To begin, let's look at the variables in a netCDF file.  We can access them via a Python `dict`, much like we did with HDF5:

In [83]:
print(f.variables.keys()) # get all variable names
temp = f.variables['temperature']  # temperature variable
print(temp)

odict_keys(['MT', 'Date', 'Depth', 'Y', 'X', 'Latitude', 'Longitude', 'u', 'v', 'temperature', 'salinity'])
<class 'netCDF4._netCDF4.Variable'>
float32 temperature(MT, Depth, Y, X)
    coordinates: Longitude Latitude Date
    standard_name: sea_water_potential_temperature
    units: degC
    _FillValue: 1.2676506e+30
    valid_range: [-5.078603  11.1498995]
    long_name:   temp [90.9H]
unlimited dimensions: MT
current shape = (1, 10, 850, 712)
filling on


All variables in a netCDF file have dimensions associated with them:

In [84]:
for d in f.dimensions.items():
    print(d)

('MT', <class 'netCDF4._netCDF4.Dimension'> (unlimited): name = 'MT', size = 1
)
('Y', <class 'netCDF4._netCDF4.Dimension'>: name = 'Y', size = 850
)
('X', <class 'netCDF4._netCDF4.Dimension'>: name = 'X', size = 712
)
('Depth', <class 'netCDF4._netCDF4.Dimension'>: name = 'Depth', size = 10
)


Each variable has the attributes `dimensions` and `shape`:

In [85]:
temp.dimensions

('MT', 'Depth', 'Y', 'X')

In [86]:
temp.shape

(1, 10, 850, 712)

We can access data in netCDF files much like we do with NumPy arrays:

In [87]:
mt = f.variables['MT']
depth = f.variables['Depth']
x,y = f.variables['X'], f.variables['Y']
print(mt)
print(x)

<class 'netCDF4._netCDF4.Variable'>
float64 MT(MT)
    long_name: time
    units: days since 1900-12-31 00:00:00
    calendar: standard
    axis: T
unlimited dimensions: MT
current shape = (1,)
filling on, default _FillValue of 9.969209968386869e+36 used

<class 'netCDF4._netCDF4.Variable'>
int32 X(X)
    point_spacing: even
    axis: X
unlimited dimensions: 
current shape = (712,)
filling on, default _FillValue of -2147483647 used



In [88]:
time = mt[:]  # Reads the netCDF variable MT, array of one element
print(time)

[41023.25]
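The raw value only makes sense together with the `units` attribute of `MT` ("days since 1900-12-31 00:00:00"). As a rough sketch, the conversion can be done with the standard library alone. This assumes that the file's `standard` calendar agrees with Python's Gregorian calendar for modern dates; it is not the netCDF4 package's own conversion helper.

```python
from datetime import datetime, timedelta

# Origin taken from the units string "days since 1900-12-31 00:00:00".
units_origin = datetime(1900, 12, 31)
days_since_origin = 41023.25  # the MT value printed above

# Assumption: plain Gregorian arithmetic is adequate for this date range.
timestamp = units_origin + timedelta(days=days_since_origin)

print(timestamp.isoformat())
```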


## Final Thoughts

* Binary, parallel file formats like HDF5 and netCDF are vital to performance on HPC systems
* Python has convenient modules for accessing them (you don't need to write a C library to read in your data)
* The libraries also have parallel versions (read/write across 1000s of processors)
* Consider adapting your data to HDF5; it's more flexible and performant in the long run
* HDF5 still allows you to work with things like Pandas and Matplotlib