HDF5 for HSP2 Input & Output Storage
===

[HDF5](https://www.hdfgroup.org/solutions/hdf5/) (Hierarchical Data Format v5) was chosen as a primary data storage file format for HSP2 model inputs and outputs. It is a high-performance, open binary format that has a long [history](https://en.wikipedia.org/wiki/Hierarchical_Data_Format) of use by academia, federal agencies, and industry. For more information on its selection, read [Why HSP2?: The Solution](https://github.com/respec/HSPsquared/wiki/Why-HSP2%3F#the-solution---hsp2).

HDF5 has many benefits, but can nevertheless present challenges to new users. 

**This notebook demonstrates different approaches for opening and interacting with HDF5 files within the Python environment for HSP2** (i.e. the conda environment created by `environment_hsp2_py38.yml`).

The [HDFView](https://www.hdfgroup.org/downloads/hdfview/) desktop utility software provides a visual HDF5 file browser that can support exploration of HDF5 files. We recommend downloading and using it.

## Import Dependencies

HSP2 interacts with HDF5 files primarily via the [PyTables](https://www.pytables.org) library, which offers some high-level capabilities and is tightly integrated with Pandas.

The [h5py](http://www.h5py.org) library offers a full Pythonic interface to the HDF5 format and technology suite.

Note: The [jupyterlab-hdf5](https://github.com/jupyterlab/jupyterlab-hdf5) extension is very promising, but not yet compatible with jupyterlab>=3.  
https://github.com/jupyterlab/jupyterlab-hdf5/issues/42

In [1]:
import tables  # PyTables
import h5py

## Set Paths to HDF5 Files with `pathlib`

Use the [pathlib](https://docs.python.org/3/library/pathlib.html) library (built in to Python 3) to manage paths independently of OS or environment.

This blog post describes `pathlib`'s benefits relative to using the `os` library or manual approaches.
- https://medium.com/@ageitgey/python-3-quick-tip-the-easy-way-to-deal-with-file-paths-on-windows-mac-and-linux-11a072b58d5f
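As a small illustration of the difference, the sketch below builds the same relative path three ways. The `Data/hsp2.h5` location simply mirrors the folder layout used in this notebook, and the `PurePosixPath` / `PureWindowsPath` classes are used only so that both separator styles can be shown on any operating system; in normal use `Path` picks the right flavor automatically.

```python
import os.path
from pathlib import PurePosixPath, PureWindowsPath

# Manual string concatenation hard-codes one separator style.
manual = "Data" + "/" + "hsp2.h5"

# os.path.join uses the right separator, but returns a plain string.
joined = os.path.join("Data", "hsp2.h5")

# pathlib composes paths with the `/` operator and exposes the parts
# of the path as attributes.
posix_path = PurePosixPath("Data") / "hsp2.h5"
windows_path = PureWindowsPath("Data") / "hsp2.h5"

print(posix_path)         # Data/hsp2.h5
print(windows_path)       # Data\hsp2.h5
print(posix_path.name)    # hsp2.h5
print(posix_path.suffix)  # .h5
print(posix_path.parent)  # Data
```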


In [2]:
from pathlib import Path

In [3]:
# get home directory
Path.home()

PosixPath('/Users/aaufdenkampe')

In [4]:
# get current working directory
Path.cwd()

PosixPath('/Users/aaufdenkampe/Documents/Python/limno.HSPsquared/HSP2notebooks')

In [5]:
# use the Path module to construct file paths
project_folder = Path('Documents/Python/limno.HSPsquared/')
data_folder    = Path('Data/')

file_to_open = Path.cwd() / data_folder / "hsp2.h5"
file_to_open

PosixPath('/Users/aaufdenkampe/Documents/Python/limno.HSPsquared/HSP2notebooks/Data/hsp2.h5')

In [6]:
# list test files 
paths = [
    file_to_open,
    Path.home() / project_folder / 'tests/test10b/HSP2results/hsp2.h5',
    Path.home() / project_folder / 'tests/test10b/HSP2results/hspf.h5',
]
paths

[PosixPath('/Users/aaufdenkampe/Documents/Python/limno.HSPsquared/HSP2notebooks/Data/hsp2.h5'),
 PosixPath('/Users/aaufdenkampe/Documents/Python/limno.HSPsquared/tests/test10b/HSP2results/hsp2.h5'),
 PosixPath('/Users/aaufdenkampe/Documents/Python/limno.HSPsquared/tests/test10b/HSP2results/hspf.h5')]

In [7]:
# test if your path points to a valid file
file_to_open.exists()

True

In [8]:
paths[2].exists()

True
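`Path.exists()` only confirms that something is present at that location. A slightly stricter check, sketched below, also asks h5py whether the file carries an HDF5 signature. The helper name `is_readable_hdf5` and the scratch files are hypothetical and exist only for this illustration.

```python
import tempfile
from pathlib import Path

import h5py


def is_readable_hdf5(path):
    """Return True if `path` is a regular file with an HDF5 signature."""
    path = Path(path)
    return path.is_file() and h5py.is_hdf5(str(path))


with tempfile.TemporaryDirectory() as tmp:
    # A genuine (empty) HDF5 file.
    good = Path(tmp) / "example.h5"
    with h5py.File(str(good), "w"):
        pass

    # A text file that merely has an .h5 extension.
    bad = Path(tmp) / "not_hdf5.h5"
    bad.write_text("this is not an HDF5 file\n" * 100)

    # A path that does not exist at all.
    missing = Path(tmp) / "missing.h5"

    results = {p.name: is_readable_hdf5(p) for p in (good, bad, missing)}

print(results)
```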

# Open an HSP2-created HDF5 file with PyTables

Docs: https://www.pytables.org  
Demo: https://nbviewer.jupyter.org/github/jackdbd/hdf5-pydata-munich/blob/master/hdf5_in_python.ipynb

Key points:
- PyTables file objects have different parameters, attributes, and methods than those created with h5py.
- After processing data,
  - "flushing a table (i.e. `table.flush()`) is a very important step as it will not only help to maintain the integrity of your file, but also will free valuable memory resources (i.e. internal buffers)". See https://www.pytables.org/usersguide/tutorials.html#creating-a-new-table
- After opening a file,
  - It is important to close it (i.e. `h5file.close()`); otherwise other programs may be unable to read it or may treat it as corrupted. See https://www.pytables.org/usersguide/tutorials.html#closing-the-file-and-looking-at-its-content
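One way to make the flush and close steps harder to forget is to open the file in a `with` block, which closes it automatically when the block ends. The sketch below writes to a throwaway file rather than an HSP2 file; the `Reading` description and the `readings` table are hypothetical names used only for this example.

```python
import tempfile
from pathlib import Path

import tables


class Reading(tables.IsDescription):
    """Hypothetical two-column table layout for the demonstration."""
    step = tables.Int64Col(pos=0)
    value = tables.Float64Col(pos=1)


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "scratch.h5"

    # The file is closed automatically when the `with` block exits.
    with tables.open_file(str(path), mode="w") as h5file:
        table = h5file.create_table("/", "readings", Reading)
        row = table.row
        for i in range(3):
            row["step"] = i
            row["value"] = i * 0.5
            row.append()
        table.flush()  # write buffered rows to disk and free the buffers
        open_inside_block = bool(h5file.isopen)

    closed_after_block = not h5file.isopen

    # Reopen read-only to confirm that the rows were stored.
    with tables.open_file(str(path), mode="r") as h5file:
        values = h5file.root.readings.col("value").tolist()

print(open_inside_block, closed_after_block, values)
```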


In [9]:
# Open HSP2 HDF5 file with PyTables
h5file = tables.open_file(
   paths[0],
)

In [10]:
# Inspect HSP2 HDF5 object using iPython magic command

h5file?

Type:      File
File:      ~/opt/anaconda3/envs/hsp2_py38_dev/lib/python3.8/site-packages/tables/file.py
Docstring:
The in-memory representation of a PyTables file.

An instance of this class is returned when a PyTables file is
opened with the :func:`tables.open_file` function. It offers methods
to manipulate (create, rename, delete...) nodes and handle their
attributes, as well as methods to traverse the object tree.
The *user entry point* to the object tree attached to the HDF5 file
is represented in the root_uep attribute.
Other attributes are available.

File objects support an *Undo/Redo mechanism* which can be enabled
with the :meth:`File.enable_undo` method. Once the Undo/Redo
mechanism is enabled, explicit *marks* (with an optional unique
name) can be set on the state of the database using the
:meth:`File.mark`
method. There are two implicit marks which are always available:
the initial mark (0) and the final mark (-1).  Both the identifier
of a

In [11]:
# For a PyTables file, a dump of the object tree can be easily displayed
h5file

# Warning: This is quite verbose!

File(filename=/Users/aaufdenkampe/Documents/Python/limno.HSPsquared/HSP2notebooks/Data/hsp2.h5, title='', mode='r', root_uep='/', filters=Filters(complevel=0, shuffle=False, bitshuffle=False, fletcher32=False, least_significant_digit=None))
/ (RootGroup) ''
/CONTROL (Group) ''
/FTABLES (Group) ''
/IMPLND (Group) ''
/PERLND (Group) ''
/RCHRES (Group) ''
/RESULTS (Group) ''
/RUN_INFO (Group) ''
/TIMESERIES (Group) ''
/TIMESERIES/LAPSE_Table (Group) ''
/TIMESERIES/LAPSE_Table/table (Table(24,)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "values": Float64Col(shape=(), dflt=0.0, pos=1)}
  byteorder := 'little'
  chunkshape := (4096,)
  autoindex := True
  colindexes := {
    "index": Index(6, medium, shuffle, zlib(1)).is_csi=False,
    "values": Index(6, medium, shuffle, zlib(1)).is_csi=False}
/TIMESERIES/SEASONS_Table (Group) ''
/TIMESERIES/SEASONS_Table/table (Table(12,)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "values": BoolCol(sha

## PyTables Browsing the Object Tree
https://www.pytables.org/usersguide/tutorials.html#browsing-the-object-tree


In [12]:
# Use file iterator

for node in h5file:
    print(node)

/ (RootGroup) ''
/CONTROL (Group) ''
/FTABLES (Group) ''
/IMPLND (Group) ''
/PERLND (Group) ''
/RCHRES (Group) ''
/RESULTS (Group) ''
/RUN_INFO (Group) ''
/TIMESERIES (Group) ''
/CONTROL/EXT_SOURCES (Group) ''
/CONTROL/GLOBAL (Group) ''
/CONTROL/LINKS (Group) ''
/CONTROL/MASS_LINKS (Group) ''
/CONTROL/OP_SEQUENCE (Group) ''
/FTABLES/FT001 (Group) ''
/FTABLES/FT002 (Group) ''
/FTABLES/FT003 (Group) ''
/FTABLES/FT004 (Group) ''
/FTABLES/FT005 (Group) ''
/IMPLND/GENERAL (Group) ''
/IMPLND/IQUAL (Group) ''
/IMPLND/IWATER (Group) ''
/IMPLND/IWTGAS (Group) ''
/IMPLND/SNOW (Group) ''
/IMPLND/SOLIDS (Group) ''
/PERLND/GENERAL (Group) ''
/PERLND/PSTEMP (Group) ''
/PERLND/PWATER (Group) ''
/PERLND/PWTGAS (Group) ''
/PERLND/SNOW (Group) ''
/RCHRES/CONS (Group) ''
/RCHRES/GENERAL (Group) ''
/RCHRES/GQUAL (Group) ''
/RCHRES/HTRCH (Group) ''
/RCHRES/HYDR (Group) ''
/RCHRES/NUTRX (Group) ''
/RCHRES/OXRX (Group) ''
/RCHRES/PHCARB (Group) ''
/RCHRES/PLANK (Group) ''
/RCHRES/RQUAL (Group) ''
/RCHRES/SED

In [15]:
for group in h5file.walk_groups():
    print(group)

/ (RootGroup) ''
/CONTROL (Group) ''
/FTABLES (Group) ''
/IMPLND (Group) ''
/PERLND (Group) ''
/RCHRES (Group) ''
/RESULTS (Group) ''
/RUN_INFO (Group) ''
/TIMESERIES (Group) ''
/TIMESERIES/LAPSE_Table (Group) ''
/TIMESERIES/SEASONS_Table (Group) ''
/TIMESERIES/SUMMARY (Group) ''
/TIMESERIES/Saturated_Vapor_Pressure_Table (Group) ''
/TIMESERIES/TS039 (Group) ''
/TIMESERIES/TS041 (Group) ''
/TIMESERIES/TS042 (Group) ''
/TIMESERIES/TS046 (Group) ''
/TIMESERIES/TS113 (Group) ''
/TIMESERIES/TS119 (Group) ''
/TIMESERIES/TS121 (Group) ''
/TIMESERIES/TS122 (Group) ''
/TIMESERIES/TS123 (Group) ''
/TIMESERIES/TS124 (Group) ''
/TIMESERIES/TS125 (Group) ''
/TIMESERIES/TS126 (Group) ''
/TIMESERIES/TS127 (Group) ''
/TIMESERIES/TS131 (Group) ''
/TIMESERIES/TS132 (Group) ''
/TIMESERIES/TS134 (Group) ''
/TIMESERIES/TS135 (Group) ''
/TIMESERIES/TS136 (Group) ''
/TIMESERIES/TS140 (Group) ''
/RUN_INFO/LOGFILE (Group) ''
/RUN_INFO/VERSIONS (Group) ''
/RESULTS/IMPLND_I001 (Group) ''
/RESULTS/PERLND_P001 (G
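When only the tables are of interest, `walk_nodes` can filter by node class instead of listing every group. The sketch below builds a small stand-in file whose `/TIMESERIES/TS039` and `/TIMESERIES/TS041` groups imitate the layout printed above; the row counts and column values are invented for the example.

```python
import tempfile
from pathlib import Path

import numpy as np
import tables

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "tree.h5"

    # Build a tiny stand-in file with two groups that each hold one table.
    with tables.open_file(str(path), mode="w") as h5file:
        dtype = np.dtype([("index", "i8"), ("values", "f8")])
        for group_name, n_rows in (("TS039", 4), ("TS041", 2)):
            group = h5file.create_group(
                "/TIMESERIES", group_name, createparents=True
            )
            h5file.create_table(group, "table", obj=np.zeros(n_rows, dtype=dtype))

    # Walk the whole tree, but keep only Table nodes.
    with tables.open_file(str(path), mode="r") as h5file:
        table_rows = {
            node._v_pathname: node.nrows
            for node in h5file.walk_nodes("/", classname="Table")
        }

print(table_rows)
```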

In [16]:
# From https://www.pytables.org/usersguide/tutorials.html#setting-and-getting-user-attributes

h5file.root

/ (RootGroup) ''
  children := ['CONTROL' (Group), 'FTABLES' (Group), 'IMPLND' (Group), 'PERLND' (Group), 'RCHRES' (Group), 'RESULTS' (Group), 'RUN_INFO' (Group), 'TIMESERIES' (Group)]

In [17]:
h5file.root.CONTROL

/CONTROL (Group) ''
  children := ['EXT_SOURCES' (Group), 'GLOBAL' (Group), 'LINKS' (Group), 'MASS_LINKS' (Group), 'OP_SEQUENCE' (Group)]

In [18]:
h5file.root.CONTROL.EXT_SOURCES.table

/CONTROL/EXT_SOURCES/table (Table(50,)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "SVOL": StringCol(itemsize=1, shape=(), dflt=b'', pos=1),
  "SVOLNO": StringCol(itemsize=5, shape=(), dflt=b'', pos=2),
  "SMEMN": StringCol(itemsize=4, shape=(), dflt=b'', pos=3),
  "SMEMSB": StringCol(itemsize=2, shape=(), dflt=b'', pos=4),
  "SSYST": StringCol(itemsize=4, shape=(), dflt=b'', pos=5),
  "SGAPST": StringCol(itemsize=4, shape=(), dflt=b'', pos=6),
  "MFACTOR": Float64Col(shape=(), dflt=0.0, pos=7),
  "TRAN": StringCol(itemsize=4, shape=(), dflt=b'', pos=8),
  "TVOL": StringCol(itemsize=6, shape=(), dflt=b'', pos=9),
  "TGRPN": StringCol(itemsize=5, shape=(), dflt=b'', pos=10),
  "TMEMN": StringCol(itemsize=6, shape=(), dflt=b'', pos=11),
  "TMEMSB": StringCol(itemsize=1, shape=(), dflt=b'', pos=12),
  "TVOLNO": StringCol(itemsize=4, shape=(), dflt=b'', pos=13),
  "COMMENT": StringCol(itemsize=1, shape=(), dflt=b'', pos=14)}
  byteorder := 'little'
  chunkshape :
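Because PyTables is integrated with Pandas, a table like this can often be loaded straight into a DataFrame. The sketch below assumes the table was written by Pandas in its `table` format, which this notebook does not confirm for every HSP2 node; it therefore writes and reads its own small file, and the column values are invented.

```python
import tempfile
from pathlib import Path

import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "pandas_demo.h5"

    # Write a small DataFrame under a key that mimics the HSP2 layout.
    original = pd.DataFrame({"SVOL": ["*", "*"], "MFACTOR": [1.0, 0.5]})
    original.to_hdf(str(path), key="/CONTROL/EXT_SOURCES", format="table")

    # Read it back by key; Pandas opens and closes the file itself.
    loaded = pd.read_hdf(str(path), key="/CONTROL/EXT_SOURCES")

    # HDFStore lists the keys that Pandas can read from the file.
    with pd.HDFStore(str(path), mode="r") as store:
        keys = store.keys()

print(keys)
print(loaded)
```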

## Always close an HDF5 file!

It is important to close the HDF5 file using `h5file.close()`. Otherwise, other programs may be unable to read it or may treat it as corrupted. See https://www.pytables.org/usersguide/tutorials.html#closing-the-file-and-looking-at-its-content

In [19]:
# Check if open (1 = True)
h5file.isopen

1

In [20]:
h5file.close()

In [21]:
h5file

<closed File>

## PyTables `ptdump` command line utility

The PyTables `ptdump` command line utility (located in the `utils/` directory) prints the object tree of a file directly from the command prompt. See https://www.pytables.org/usersguide/tutorials.html#closing-the-file-and-looking-at-its-content

In [22]:
# Display the full path as text: pathlib objects exist only within Python, so the command prompt needs the path as a string.
file_to_open

PosixPath('/Users/aaufdenkampe/Documents/Python/limno.HSPsquared/HSP2notebooks/Data/hsp2.h5')

In [23]:
# The `!` prefix is an IPython shell escape that runs the command at the system command prompt
!ptdump '/Users/aaufdenkampe/Documents/Python/limno.HSPsquared/HSP2notebooks/Data/hsp2.h5'

# You can pass the -v or -d options to ptdump if you want more verbosity

/ (RootGroup) ''
/CONTROL (Group) ''
/FTABLES (Group) ''
/IMPLND (Group) ''
/PERLND (Group) ''
/RCHRES (Group) ''
/RESULTS (Group) ''
/RUN_INFO (Group) ''
/TIMESERIES (Group) ''
/TIMESERIES/LAPSE_Table (Group) ''
/TIMESERIES/LAPSE_Table/table (Table(24,)) ''
/TIMESERIES/SEASONS_Table (Group) ''
/TIMESERIES/SEASONS_Table/table (Table(12,)) ''
/TIMESERIES/SUMMARY (Group) ''
/TIMESERIES/SUMMARY/table (Table(19,)) ''
/TIMESERIES/Saturated_Vapor_Pressure_Table (Group) ''
/TIMESERIES/Saturated_Vapor_Pressure_Table/table (Table(40,)) ''
/TIMESERIES/TS039 (Group) ''
/TIMESERIES/TS039/table (Table(8784,), shuffle, blosc(9)) ''
/TIMESERIES/TS041 (Group) ''
/TIMESERIES/TS041/table (Table(366,), shuffle, blosc(9)) ''
/TIMESERIES/TS042 (Group) ''
/TIMESERIES/TS042/table (Table(366,), shuffle, blosc(9)) ''
/TIMESERIES/TS046 (Group) ''
/TIMESERIES/TS046/table (Table(4392,), shuffle, blosc(9)) ''
/TIMESERIES/TS113 (Group) ''
/TIMESERIES/TS113/table (Table(366,), shuffle, blosc(9)) ''
/TIMESERIES/TS119
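To avoid pasting the path by hand, the `Path` object can be converted with `str()` and handed to the utility through `subprocess`. This is a rough sketch: the helper `build_ptdump_command` is hypothetical, the relative `Data/hsp2.h5` location is assumed from the folder layout above, and the command only runs if `ptdump` is on the system path and the file exists.

```python
import shutil
import subprocess
from pathlib import Path


def build_ptdump_command(h5_path, *options):
    """Return the argument list for ptdump, with the path as a string."""
    return ["ptdump", *options, str(Path(h5_path))]


# Assumed location, relative to the current working directory.
file_to_open = Path("Data") / "hsp2.h5"

# "-v" is the verbosity option mentioned above.
command = build_ptdump_command(file_to_open, "-v")
print(command)

# Run only when the utility and the file are both available.
if shutil.which("ptdump") and file_to_open.exists():
    completed = subprocess.run(
        command, capture_output=True, text=True, check=False
    )
    print(completed.stdout[:500])  # show just the start of the dump
```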

# Open with h5py library

Docs: https://docs.h5py.org  
Demo: https://nbviewer.jupyter.org/github/jackdbd/hdf5-pydata-munich/blob/master/hdf5_in_python.ipynb

In [24]:
# https://docs.h5py.org/en/stable/high/file.html#opening-creating-files

h5py_file = h5py.File(paths[0], mode='r')  # open read-only explicitly

# NOTE: if the file was left open, you might get this error: `OSError: Unable to open file (File signature not found)`
# https://stackoverflow.com/questions/38089950/error-opening-file-in-h5py-file-signature-not-found/43607837

In [25]:
h5py_file?

Type:           File
String form:    <HDF5 file "hsp2.h5" (mode r)>
Length:         8
File:           ~/opt/anaconda3/envs/hsp2_py38_dev/lib/python3.8/site-packages/h5py/_hl/files.py
Docstring:      Represents an HDF5 file.
Init docstring:
Create a new file object.

See the h5py user guide for a detailed explanation of the options.

name
    Name of the file on disk, or file-like object.  Note: for files
    created with the 'core' driver, HDF5 still requires this be
    non-empty.
mode
    r        Readonly, file must exist (default)
    r+       Read/write, file must exist
    w        Create file, truncate if exists
    w- or x  Create file, fail if exists
    a        Read/write if exists, create otherwise
driver
    Name of the driver to use.  Legal values are None (default,
    recommended), 'core', 'sec2', 'stdio', 'mpio'.
libver
    Library version bounds.  Supported values: 'earliest', 'v108',
    'v110', 'v112'

In [26]:
h5py_file.attrs

<Attributes of HDF5 object at 140644074207248>
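The default representation of `attrs` only shows an object address. Since `attrs` behaves like a dictionary, its contents can be copied into a plain `dict` for inspection. The sketch below uses a throwaway file, and the attribute names `TITLE` and `VERSION` are invented for the example rather than taken from an HSP2 file.

```python
import tempfile
from pathlib import Path

import h5py

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "attrs_demo.h5"

    # Attach two example attributes to the root group.
    with h5py.File(str(path), "w") as f:
        f.attrs["TITLE"] = "demo file"
        f.attrs["VERSION"] = 2

    # Copy the attributes into an ordinary dict while the file is open.
    with h5py.File(str(path), "r") as f:
        root_attrs = {name: value for name, value in f.attrs.items()}

print(root_attrs)
```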

In [27]:
h5py_file.keys()

<KeysViewHDF5 ['CONTROL', 'FTABLES', 'IMPLND', 'PERLND', 'RCHRES', 'RESULTS', 'RUN_INFO', 'TIMESERIES']>

In [28]:
for key in h5py_file.keys():
    print(key)

CONTROL
FTABLES
IMPLND
PERLND
RCHRES
RESULTS
RUN_INFO
TIMESERIES


In [29]:
h5py_file['CONTROL'].keys()

<KeysViewHDF5 ['EXT_SOURCES', 'GLOBAL', 'LINKS', 'MASS_LINKS', 'OP_SEQUENCE']>

In [30]:
h5py_file['CONTROL']['GLOBAL'].keys()

<KeysViewHDF5 ['_i_table', 'table']>
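Chained indexing can be shortened to a single slash-separated path, and the resulting dataset can be sliced like a NumPy array. The sketch below creates its own small compound dataset at `CONTROL/GLOBAL/table`; the two fields and their values are invented and do not reflect the real contents of that HSP2 table.

```python
import tempfile
from pathlib import Path

import h5py
import numpy as np

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "dataset_demo.h5"

    # A small structured array standing in for a stored table.
    records = np.array(
        [(0, 1.5), (1, 2.5)],
        dtype=[("index", "i8"), ("values", "f8")],
    )
    with h5py.File(str(path), "w") as f:
        # Intermediate groups are created automatically.
        f.create_dataset("CONTROL/GLOBAL/table", data=records)

    with h5py.File(str(path), "r") as f:
        dataset = f["CONTROL/GLOBAL/table"]   # one path instead of three lookups
        shape = dataset.shape
        field_names = dataset.dtype.names
        all_rows = dataset[:]                 # read every row into memory
        values = dataset["values"].tolist()   # read a single field

print(shape, field_names, values)
```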

In [31]:
key_list = list(h5py_file.keys())
key_list

['CONTROL',
 'FTABLES',
 'IMPLND',
 'PERLND',
 'RCHRES',
 'RESULTS',
 'RUN_INFO',
 'TIMESERIES']

In [32]:
for name in h5py_file:
    for subname in h5py_file[name]:
        print(name+r'\ '+subname)
#         print(r'')
#         for subsubname in h5py_file[name][subname]:
#             print(name+r'\ '+subname+r'\ '+subsubname)

CONTROL\ EXT_SOURCES
CONTROL\ GLOBAL
CONTROL\ LINKS
CONTROL\ MASS_LINKS
CONTROL\ OP_SEQUENCE
FTABLES\ FT001
FTABLES\ FT002
FTABLES\ FT003
FTABLES\ FT004
FTABLES\ FT005
IMPLND\ GENERAL
IMPLND\ IQUAL
IMPLND\ IWATER
IMPLND\ IWTGAS
IMPLND\ SNOW
IMPLND\ SOLIDS
PERLND\ GENERAL
PERLND\ PSTEMP
PERLND\ PWATER
PERLND\ PWTGAS
PERLND\ SNOW
RCHRES\ CONS
RCHRES\ GENERAL
RCHRES\ GQUAL
RCHRES\ HTRCH
RCHRES\ HYDR
RCHRES\ NUTRX
RCHRES\ OXRX
RCHRES\ PHCARB
RCHRES\ PLANK
RCHRES\ RQUAL
RCHRES\ SEDTRN
RESULTS\ IMPLND_I001
RESULTS\ PERLND_P001
RESULTS\ RCHRES_R001
RESULTS\ RCHRES_R002
RESULTS\ RCHRES_R003
RESULTS\ RCHRES_R004
RESULTS\ RCHRES_R005
RUN_INFO\ LOGFILE
RUN_INFO\ VERSIONS
TIMESERIES\ LAPSE_Table
TIMESERIES\ SEASONS_Table
TIMESERIES\ SUMMARY
TIMESERIES\ Saturated_Vapor_Pressure_Table
TIMESERIES\ TS039
TIMESERIES\ TS041
TIMESERIES\ TS042
TIMESERIES\ TS046
TIMESERIES\ TS113
TIMESERIES\ TS119
TIMESERIES\ TS121
TIMESERIES\ TS122
TIMESERIES\ TS123
TIMESERIES\ TS124
TIMESERIES\ TS125
TIMESERIES\ TS126
TI
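Nested loops only reach as deep as the number of loops written. For a tree of unknown depth, h5py's `visititems` calls a function once for every group and dataset. The sketch below collects just the dataset paths from a small stand-in file; the callback name `collect_datasets` is hypothetical, and the `with` block closes the file without an explicit `close()` call.

```python
import tempfile
from pathlib import Path

import h5py

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "walk_demo.h5"

    # Stand-in file with datasets at two different depths of nesting.
    with h5py.File(str(path), "w") as f:
        f.create_dataset("CONTROL/GLOBAL/table", data=[1, 2, 3])
        f.create_dataset("TIMESERIES/TS039/table", data=[4.0, 5.0])

    dataset_paths = []

    def collect_datasets(name, obj):
        """Record the path of every dataset; returning None continues the walk."""
        if isinstance(obj, h5py.Dataset):
            dataset_paths.append(name)

    with h5py.File(str(path), "r") as f:
        f.visititems(collect_datasets)

    # A closed h5py File object evaluates as False.
    file_closed = not f

print(dataset_paths, file_closed)
```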

In [33]:
h5py_file.close()

In [34]:
h5py_file

<Closed HDF5 file>

## HDF5 `h5dump` command line utility

The `h5dump` command line utility, which ships with the HDF5 tools rather than with h5py, can also be useful, but it is very verbose!
https://nbviewer.jupyter.org/github/jackdbd/hdf5-pydata-munich/blob/master/hdf5_in_python.ipynb

In [35]:
# Display the full path as text: pathlib objects exist only within Python, so the command prompt needs the path as a string.
paths[0]

PosixPath('/Users/aaufdenkampe/Documents/Python/limno.HSPsquared/HSP2notebooks/Data/hsp2.h5')

In [None]:

!h5dump '/Users/aaufdenkampe/Documents/Python/limno.HSPsquared/HSP2notebooks/Data/hsp2.h5'

# WARNING: This is extremely verbose!!!!