HDF5 for HSP2 Input & Output Storage
===

[HDF5](https://www.hdfgroup.org/solutions/hdf5/) (Hierarchical Data Format v5) was chosen as a primary data storage file format for HSP2 model inputs and outputs. It is a high-performance, open binary format that has a long [history](https://en.wikipedia.org/wiki/Hierarchical_Data_Format) of use by academia, federal agencies, and industry. For more information on its selection, read [Why HSP2?: The Solution](https://github.com/respec/HSPsquared/wiki/Why-HSP2%3F#the-solution---hsp2).

HDF5 has many benefits, but can nevertheless present challenges to new users. 

**This notebook demonstrates different approaches for opening and interacting with HDF5 files within the Python environment for HSP2** (i.e. the conda environment created by `environment_hsp2_py38.yml`).

The [HDFView](https://www.hdfgroup.org/downloads/hdfview/) desktop utility software provides a visual HDF5 file browser that can support exploration of HDF5 files. We recommend downloading and using it.

In [1]:
# Confirm environments, using '!' magic command to run conda commands as if from a console.
# The active environment has a '*' in front of it.
!conda info --envs

# conda environments:
#
                         /Users/aaufdenkampe/opt/anaconda3
                         /Users/aaufdenkampe/opt/anaconda3/envs/R
                         /Users/aaufdenkampe/opt/anaconda3/envs/hsp2_py38
base                  *  /Users/aaufdenkampe/opt/anaconda3/envs/hsp2_py38_dev
                         /Users/aaufdenkampe/opt/anaconda3/envs/pangeo
                         /Users/aaufdenkampe/opt/anaconda3/envs/pangeo-holoviz
                         /Users/aaufdenkampe/opt/anaconda3/envs/pangeo-holoviz-manual
                         /Users/aaufdenkampe/opt/anaconda3/envs/webinar



In [2]:
# An alternate approach to confirming the environment, using Python
import os
os.environ['CONDA_DEFAULT_ENV']

'hsp2_py38_dev'

## Import Dependencies

HSP2 interacts with HDF5 files primarily via the [PyTables](https://www.pytables.org) library, which offers some high-level capabilities and is tightly integrated with Pandas.

The [h5py](http://www.h5py.org) library offers a full Pythonic interface to the HDF5 format and technology suite.

Note: The [jupyterlab-hdf5](https://github.com/jupyterlab/jupyterlab-hdf5) extension is very promising, but not yet compatible with jupyterlab>=3.  
https://github.com/jupyterlab/jupyterlab-hdf5/issues/42

In [3]:
import tables  # PyTables
import h5py
import pandas as pd

## Set Paths to HDF5 Files with `pathlib`

Use the [pathlib](https://docs.python.org/3/library/pathlib.html) library (built into Python 3) to manage paths independently of OS or environment.

This blog post describes `pathlib`'s benefits relative to using the `os` library or manual approaches.
- https://medium.com/@ageitgey/python-3-quick-tip-the-easy-way-to-deal-with-file-paths-on-windows-mac-and-linux-11a072b58d5f
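
As a minimal sketch of that benefit, the example below builds the same relative location for two operating systems without hand-writing separators. The folder and file names mirror those used later in this notebook; the `.yaml` name and the `h5_files` listing are illustrative assumptions only.

```python
from pathlib import Path, PurePosixPath, PureWindowsPath

# The same relative location, rendered for two different operating systems.
parts = ("HSP2_stash", "hsp2.h5")
posix_path = PurePosixPath("tests").joinpath(*parts)
windows_path = PureWindowsPath("tests").joinpath(*parts)

print(posix_path)    # tests/HSP2_stash/hsp2.h5
print(windows_path)  # tests\HSP2_stash\hsp2.h5

# Attributes that are handy when working with HDF5 file names.
print(posix_path.name)                       # hsp2.h5
print(posix_path.suffix)                     # .h5
print(posix_path.with_suffix(".yaml").name)  # hsp2.yaml (illustrative name)

# On a real filesystem, glob() lists candidate HDF5 files in a folder.
h5_files = sorted(Path.cwd().glob("*.h5"))
```

The "pure" path classes never touch the filesystem, so they are safe for demonstrating path arithmetic; use `Path` itself for anything that must exist on disk.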


In [4]:
from pathlib import Path

In [5]:
# get home directory
Path.home()

PosixPath('/Users/aaufdenkampe')

In [6]:
# get current working directory
Path.cwd()

PosixPath('/Users/aaufdenkampe/Documents/Python/limno.HSPsquared/tests')

In [7]:
# use the Path module to construct file paths
project_folder = Path('Documents/Python/limno.HSPsquared/')
data_folder    = Path('HSP2_stash/')

file_to_open = Path.cwd() / data_folder / "hsp2.h5"
file_to_open

PosixPath('/Users/aaufdenkampe/Documents/Python/limno.HSPsquared/tests/HSP2_stash/hsp2.h5')

In [8]:
# list test files 
paths = [
    Path.cwd() / data_folder / 'test10_hsp2_dev_input_ALL.h5',
#     tests/test10/HSP2results/test10_hsp2_dev2WDM_3.h5
    Path.cwd() / 'test10/HSP2results/test10_hsp2_dev2WDM_input_3.h5',
]
paths

[PosixPath('/Users/aaufdenkampe/Documents/Python/limno.HSPsquared/tests/HSP2_stash/test10_hsp2_dev_input_ALL.h5'),
 PosixPath('/Users/aaufdenkampe/Documents/Python/limno.HSPsquared/tests/test10/HSP2results/test10_hsp2_dev2WDM_input_3.h5')]

In [9]:
# test if your path points to a valid file
paths[0].exists()

True

In [10]:
paths[1].exists()

True

# Open an HSP2-created HDF5 file with Pandas

Docs: 
- https://pandas.pydata.org/docs/user_guide/io.html#io-hdf5
- https://pandas.pydata.org/docs/reference/api/pandas.read_hdf.html


In [11]:
store0 = pd.HDFStore(paths[0])
store1 = pd.HDFStore(paths[1])
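
A store opened this way stays open until `store.close()` is called. As a hedged alternative, the sketch below opens a store read-only inside a `with` block so that it is closed automatically. The file is a tiny stand-in written to a temporary folder, and its values are made up for illustration; in practice you would point `demo_path` at an HSP2 file.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Build a tiny stand-in file (values are made up for illustration).
demo_path = str(Path(tempfile.mkdtemp()) / "demo.h5")
index = pd.date_range("1976-01-01", periods=3, freq="D")
pd.Series([0.0, 1.5, 2.0], index=index).to_hdf(demo_path, key="/TIMESERIES/TS039")

# Open read-only; the file is closed automatically on leaving the block.
with pd.HDFStore(demo_path, mode="r") as store:
    keys = store.keys()
    ts = store["/TIMESERIES/TS039"]

print(keys)           # ['/TIMESERIES/TS039']
print(ts.sum())       # 3.5
print(store.is_open)  # False
```

Opening with `mode="r"` also guards against accidentally modifying a model input file while exploring it.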

In [12]:
store0.keys()

['/TIMESERIES/LAPSE_Table',
 '/TIMESERIES/SEASONS_Table',
 '/TIMESERIES/SUMMARY',
 '/TIMESERIES/Saturated_Vapor_Pressure_Table',
 '/TIMESERIES/TS039',
 '/TIMESERIES/TS041',
 '/TIMESERIES/TS042',
 '/TIMESERIES/TS046',
 '/TIMESERIES/TS113',
 '/TIMESERIES/TS119',
 '/TIMESERIES/TS121',
 '/TIMESERIES/TS122',
 '/TIMESERIES/TS123',
 '/TIMESERIES/TS124',
 '/TIMESERIES/TS125',
 '/TIMESERIES/TS126',
 '/TIMESERIES/TS127',
 '/TIMESERIES/TS131',
 '/TIMESERIES/TS132',
 '/TIMESERIES/TS134',
 '/TIMESERIES/TS135',
 '/TIMESERIES/TS136',
 '/TIMESERIES/TS140',
 '/RCHRES/SEDTRN/CLAY',
 '/RCHRES/SEDTRN/FLAGS',
 '/RCHRES/SEDTRN/PARAMETERS',
 '/RCHRES/SEDTRN/SAVE',
 '/RCHRES/SEDTRN/SILT',
 '/RCHRES/SEDTRN/STATES',
 '/RCHRES/RQUAL/FLAGS',
 '/RCHRES/RQUAL/PARAMETERS',
 '/RCHRES/PLANK/FLAGS',
 '/RCHRES/PLANK/PARAMETERS',
 '/RCHRES/PLANK/SAVE',
 '/RCHRES/PLANK/STATES',
 '/RCHRES/PHCARB/PARAMETERS',
 '/RCHRES/PHCARB/SAVE',
 '/RCHRES/PHCARB/STATES',
 '/RCHRES/OXRX/FLAGS',
 '/RCHRES/OXRX/PARAMETERS',
 '/RCHRES/OXRX/

In [31]:
ds0 = store0.select('/TIMESERIES/TS039')
ds0?

[0;31mType:[0m        Series
[0;31mString form:[0m
1976-01-01 00:00:00    0.0
           1976-01-01 01:00:00    0.0
           1976-01-01 02:00:00    0.0
           1976-01-01 03:00 <...>   0.0
           1976-12-31 22:00:00    0.0
           1976-12-31 23:00:00    0.0
           Freq: H, Length: 8784, dtype: float32
[0;31mLength:[0m      8784
[0;31mFile:[0m        ~/opt/anaconda3/envs/hsp2_py38_dev/lib/python3.8/site-packages/pandas/core/series.py
[0;31mDocstring:[0m  
One-dimensional ndarray with axis labels (including time series).

Labels need not be unique but must be a hashable type. The object
supports both integer- and label-based indexing and provides a host of
methods for performing operations involving the index. Statistical
methods from ndarray have been overridden to automatically exclude
missing data (currently represented as NaN).

Operations between Series (+, -, /, *, **) align values based on their
associated index values-- they need not be the same length. 

In [30]:
ds0

1976-01-01 00:00:00    0.0
1976-01-01 01:00:00    0.0
1976-01-01 02:00:00    0.0
1976-01-01 03:00:00    0.0
1976-01-01 04:00:00    0.0
                      ... 
1976-12-31 19:00:00    0.0
1976-12-31 20:00:00    0.0
1976-12-31 21:00:00    0.0
1976-12-31 22:00:00    0.0
1976-12-31 23:00:00    0.0
Freq: H, Length: 8784, dtype: float32

In [32]:
ds1 = store1.select('/TIMESERIES/TS039')
ds1?

[0;31mType:[0m        Series
[0;31mString form:[0m
1976-01-01 00:00:00    0.0
           1976-01-01 01:00:00    0.0
           1976-01-01 02:00:00    0.0
           1976-01-01 03:00 <...> 1:00:00    0.0
           1976-12-31 22:00:00    0.0
           1976-12-31 23:00:00    0.0
           Length: 8784, dtype: float64
[0;31mLength:[0m      8784
[0;31mFile:[0m        ~/opt/anaconda3/envs/hsp2_py38_dev/lib/python3.8/site-packages/pandas/core/series.py
[0;31mDocstring:[0m  
One-dimensional ndarray with axis labels (including time series).

Labels need not be unique but must be a hashable type. The object
supports both integer- and label-based indexing and provides a host of
methods for performing operations involving the index. Statistical
methods from ndarray have been overridden to automatically exclude
missing data (currently represented as NaN).

Operations between Series (+, -, /, *, **) align values based on their
associated index values-- they need not be the same length. 

In [27]:
ds1

Unnamed: 0,BENRFG
R001,1.0
R004,1.0
R005,1.0
R003,0.0
R002,0.0


In [28]:
compare = ds0 == ds1
# compare = store0.select('/TIMESERIES/SUMMARY') == store1.select('/TIMESERIES/SUMMARY')
compare

ValueError: Can only compare identically-labeled DataFrame objects
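
This `ValueError` is what pandas raises when `==` is applied to two objects whose labels differ. A minimal sketch of a more forgiving comparison follows: it checks the labels first and then compares values within a tolerance, which also copes with the `float32` versus `float64` difference seen for `TS039` above. The helper name `series_match`, the tolerance, and the sample values are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

index = pd.date_range("1976-01-01", periods=4, freq="D")
a = pd.Series([0.0, 0.1, 0.2, 0.3], index=index, dtype="float32")
b = pd.Series([0.0, 0.1, 0.2, 0.3], index=index, dtype="float64")
c = b.iloc[:3]  # shorter series, so its labels differ from those of `a`


def series_match(left, right, rtol=1e-5):
    """Return True when labels are identical and values agree within `rtol`."""
    if not left.index.equals(right.index):
        return False  # `left == right` would raise ValueError in this case
    return bool(
        np.allclose(left.to_numpy(), right.to_numpy(), rtol=rtol, equal_nan=True)
    )


print(a.equals(b))         # False: equals() is strict about dtype
print(series_match(a, b))  # True: same labels, values agree within tolerance
print(series_match(a, c))  # False: labels differ
```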

In [18]:
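# Note: .info() prints a summary and returns None, so this expression
# evaluates None == None and is always True, whatever the data contain.
ds0.info() == ds1.info()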

<class 'pandas.core.frame.DataFrame'>
Index: 19 entries, TS039 to TS140
Data columns (total 8 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Start   19 non-null     object 
 1   Stop    19 non-null     object 
 2   Freq    19 non-null     object 
 3   Length  19 non-null     int64  
 4   TSTYPE  19 non-null     object 
 5   TFILL   19 non-null     float64
 6   STAID   19 non-null     object 
 7   STNAM   19 non-null     object 
dtypes: float64(1), int64(1), object(6)
memory usage: 1.3+ KB
<class 'pandas.core.frame.DataFrame'>
Index: 19 entries, TS039 to TS140
Data columns (total 8 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Start   19 non-null     object 
 1   Stop    19 non-null     object 
 2   Freq    19 non-null     object 
 3   Length  19 non-null     int64  
 4   TSTYPE  19 non-null     object 
 5   TFILL   19 non-null     float64
 6   STAID   19 non-null     object 
 7   STNAM   19 non-null   

True
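
Because `DataFrame.info()` prints its summary and returns `None`, the `True` above does not by itself show that the two objects match. A hedged sketch of a structural check that compares the index, columns, and dtypes directly is given below; the helper name `same_structure` and the sample values are made up for illustration.

```python
import pandas as pd

rows = ["TS039", "TS041"]
df_a = pd.DataFrame({"Length": [8784, 366], "TFILL": [-999.0, -999.0]}, index=rows)
# Same labels, but TFILL is stored as text instead of float64.
df_b = pd.DataFrame({"Length": [8784, 366], "TFILL": ["-999", "-999"]}, index=rows)

result = df_a.info()   # prints a summary as a side effect
print(result is None)  # True


def same_structure(left, right):
    """Return True when index, columns, and column dtypes all match."""
    return bool(
        left.index.equals(right.index)
        and left.columns.equals(right.columns)
        and left.dtypes.equals(right.dtypes)
    )


print(same_structure(df_a, df_a.copy()))  # True
print(same_structure(df_a, df_b))         # False: TFILL dtype differs
```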

In [20]:
key = store1.keys()[105]
key

'/CONTROL/GLOBAL'

In [21]:
# Note: .info() returns None, so this comparison is always True.
store0.select(key).info() == store1.select(key).info()

<class 'pandas.core.frame.DataFrame'>
Index: 3 entries, Comment to Stop
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Info    3 non-null      object
dtypes: object(1)
memory usage: 48.0+ bytes
<class 'pandas.core.frame.DataFrame'>
Index: 3 entries, Comment to Stop
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Info    3 non-null      object
dtypes: object(1)
memory usage: 48.0+ bytes


True

In [22]:
store1.select(key)

Unnamed: 0,Info
Comment,Version 11 test run: PERLND and IMPLND w/ RCHR...
Start,1976-01-01 00:00
Stop,1977-01-01 00:00


In [134]:
store0.select(key).index

Index(['Comment', 'Start', 'Stop'], dtype='object')

In [135]:
store1.select(key).index

Index(['Comment', 'Start', 'Stop'], dtype='object')

In [136]:
store0.select(key).index == store1.select(key).index

array([ True,  True,  True])

In [116]:
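compares = []  # initialize empty list

for key in store0.keys()[5:10]:
    ds0_ = store0.select(key)
#     print(ds0_)
    ds1_ = store1.select(key)
#     print(ds1_)
    # Element-wise comparison; raises ValueError if the labels differ
    compare = ds0_ == ds1_
    # .info() prints and returns None, so compare the index labels instead
    compare_info = ds0_.index.equals(ds1_.index)

    compares.append([key, compare_info, compare])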

In [117]:
compares

[['/TIMESERIES/TS041',
  True,
  1976-01-01    True
  1976-01-02    True
  1976-01-03    True
  1976-01-04    True
  1976-01-05    True
                ... 
  1976-12-27    True
  1976-12-28    True
  1976-12-29    True
  1976-12-30    True
  1976-12-31    True
  Freq: D, Length: 366, dtype: bool],
 ['/TIMESERIES/TS042',
  True,
  1976-01-01    True
  1976-01-02    True
  1976-01-03    True
  1976-01-04    True
  1976-01-05    True
                ... 
  1976-12-27    True
  1976-12-28    True
  1976-12-29    True
  1976-12-30    True
  1976-12-31    True
  Freq: D, Length: 366, dtype: bool],
 ['/TIMESERIES/TS046',
  True,
  1976-01-01 00:00:00    True
  1976-01-01 02:00:00    True
  1976-01-01 04:00:00    True
  1976-01-01 06:00:00    True
  1976-01-01 08:00:00    True
                         ... 
  1976-12-31 14:00:00    True
  1976-12-31 16:00:00    True
  1976-12-31 18:00:00    True
  1976-12-31 20:00:00    True
  1976-12-31 22:00:00    True
  Freq: 2H, Length: 4392, dtype: bool],
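
The element-wise results above are long to read when many keys are involved. One possible way to condense them, sketched here under the assumption that a single `True`/`False` per key is enough, is a small helper that returns a summary table. The name `compare_stores` is hypothetical; it only relies on `keys()`, `in`, and `[]`, which `pd.HDFStore` provides, so plain dictionaries with made-up values stand in for the stores in this demo.

```python
import pandas as pd


def compare_stores(store_a, store_b):
    """Summarize, key by key, whether two stores hold equal pandas objects."""
    rows = []
    for key in sorted(set(store_a.keys()) | set(store_b.keys())):
        in_a, in_b = key in store_a, key in store_b
        # equals() checks labels, dtypes, and values, and treats NaN as equal to NaN
        equal = in_a and in_b and store_a[key].equals(store_b[key])
        rows.append({"key": key, "in_a": in_a, "in_b": in_b, "equal": bool(equal)})
    return pd.DataFrame(rows).set_index("key")


# Stand-in data: dictionaries of Series in place of two open HDFStore objects.
idx = pd.date_range("1976-01-01", periods=3, freq="D")
demo_a = {
    "/TIMESERIES/TS041": pd.Series([1.0, 2.0, 3.0], index=idx),
    "/TIMESERIES/TS042": pd.Series([4.0, 5.0, 6.0], index=idx),
}
demo_b = {
    "/TIMESERIES/TS041": pd.Series([1.0, 2.0, 3.0], index=idx),
    "/TIMESERIES/TS042": pd.Series([4.0, 5.0, 6.5], index=idx),
    "/TIMESERIES/TS046": pd.Series([0.0, 0.0, 0.0], index=idx),
}

summary = compare_stores(demo_a, demo_b)
print(summary)
```

With real files, the call would be `compare_stores(store0, store1)`; filtering the result with `summary[~summary["equal"]]` then lists only the keys that need a closer look.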

In [25]:
store1.select('/CONTROL/GLOBAL')

Unnamed: 0,Info
Comment,Version 11 test run: PERLND and IMPLND w/ RCHR...
Start,1976-01-01 00:00
Stop,1977-01-01 00:00


In [1]:
from HSP2 import get_uci

In [19]:
uci_tuple = get_uci(store1)

In [23]:
uci_tuple

(  OPERATION SEGMENT  INDELT_minutes
 0    PERLND    P001              60
 1    RCHRES    R001              60
 2    RCHRES    R002              60
 3    RCHRES    R003              60
 4    RCHRES    R004              60
 5    IMPLND    I001              60
 6    RCHRES    R005              60,
 defaultdict(list,
             {'R001': [Pandas(Index=0, AFACTR=6000.0, MLNO='ML001', SVOL='PERLND', SVOLNO='P001', TMEMSB1='', TMEMSB2='', TVOL='RCHRES', TVOLNO='R001', MFACTOR='', SGRPN='', SMEMN='', SMEMSB='', TGRPN='', TMEMN='', TMEMSB='', TRAN='', COMMENTS='')],
              'R002': [Pandas(Index=2, AFACTR=1.0, MLNO='ML003', SVOL='RCHRES', SVOLNO='R001', TMEMSB1='', TMEMSB2='', TVOL='RCHRES', TVOLNO='R002', MFACTOR='', SGRPN='', SMEMN='', SMEMSB='', TGRPN='', TMEMN='', TMEMSB='', TRAN='', COMMENTS='')],
              'R003': [Pandas(Index=3, AFACTR=1.0, MLNO='ML004', SVOL='RCHRES', SVOLNO='R001', TMEMSB1='', TMEMSB2='', TVOL='RCHRES', TVOLNO='R003', MFACTOR='', SGRPN='', SMEMN='', SMEMSB

# Open an HSP2-created HDF5 file with PyTables

Docs: https://www.pytables.org  
Demo: https://nbviewer.jupyter.org/github/jackdbd/hdf5-pydata-munich/blob/master/hdf5_in_python.ipynb

Key points:
- Files written with PyTables have different parameters, attributes, and methods than those written with h5py.
- After processing data, 
  - "flushing a table (i.e. `table.flush()`) is a very important step as it will not only help to maintain the integrity of your file, but also will free valuable memory resources (i.e. internal buffers)". See https://www.pytables.org/usersguide/tutorials.html#creating-a-new-table
- After opening a file,
  - It is important to close it (i.e. `h5file.close()`); otherwise, other programs may be unable to open it or may treat it as corrupted. See https://www.pytables.org/usersguide/tutorials.html#closing-the-file-and-looking-at-its-content
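
One way to make sure a file is always closed, sketched below, is to open it in a `with` block, which PyTables `File` objects support. The example writes a tiny stand-in file to a temporary folder; the array name `demo_array` and its values are made up for illustration and are not part of any HSP2 file.

```python
import tempfile
from pathlib import Path

import numpy as np
import tables

demo_path = str(Path(tempfile.mkdtemp()) / "demo_pytables.h5")

# Write a tiny stand-in file, flushing buffers before the file is closed.
with tables.open_file(demo_path, mode="w") as h5:
    group = h5.create_group("/", "CONTROL")
    h5.create_array(group, "demo_array", np.arange(3))
    h5.flush()

# Re-open read-only and walk the group tree.
with tables.open_file(demo_path, mode="r") as h5:
    group_paths = [g._v_pathname for g in h5.walk_groups()]
    values = h5.root.CONTROL.demo_array.read()

print(group_paths)  # ['/', '/CONTROL']
print(values)       # [0 1 2]
print(h5.isopen)    # 0, i.e. closed on leaving the with block
```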


In [32]:
# Open HSP2 HDF5 file with PyTables
h5file0 = tables.open_file(
   paths[0],
)
# For a PyTables file, a dump of the object tree can be easily displayed
# Warning: Output is quite verbose!

# h5file0  # Comment out to avoid display

Copy and paste the printed object tree into a YAML file to compare two files with a split (side-by-side) diff.

In [36]:
h5file0.isopen

0

In [35]:
# Always close an HDF5 file!
h5file0.close()

In [20]:
h5file1 = tables.open_file(
   paths[1],
)
# h5file1

In [23]:
h5file1.close()

## PyTables Browsing the Object Tree
https://www.pytables.org/usersguide/tutorials.html#browsing-the-object-tree


In [None]:
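# Use file iterator
# (h5file0 and h5file1 were closed above, so open a fresh read-only handle)
h5file = tables.open_file(paths[0], mode='r')

for node in h5file:
    print(node)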

In [None]:
for group in h5file.walk_groups():
    print(group)

In [None]:
# From https://www.pytables.org/usersguide/tutorials.html#setting-and-getting-user-attributes

h5file.root

In [None]:
h5file.root.CONTROL

In [None]:
h5file.root.CONTROL.EXT_SOURCES.table

## Always close an HDF5 file!

It is important to close the HDF5 file using `h5file.close()`. Otherwise, other programs may be unable to open it or may treat it as corrupted. See https://www.pytables.org/usersguide/tutorials.html#closing-the-file-and-looking-at-its-content

In [None]:
# Check if open (1 = True)
h5file.isopen

In [None]:
h5file.close()

In [None]:
h5file

## PyTables `ptdump` command line utility

PyTables also provides a command-line utility, `ptdump` (located in its `utils/` directory), that prints a file's contents from the console.
https://www.pytables.org/usersguide/tutorials.html#closing-the-file-and-looking-at-its-content

In [None]:
# get path, as pathlib only works within Python and not at the operating system command prompt.
file_to_open

In [None]:
# This uses the `!` IPython magic command to run from the console.
# IPython expands {file_to_open} to the value of the Python variable.
!ptdump "{file_to_open}"

# You can pass the -v or -d options to ptdump if you want more verbosity

# Open with h5py library

Docs: https://docs.h5py.org  
Demo: https://nbviewer.jupyter.org/github/jackdbd/hdf5-pydata-munich/blob/master/hdf5_in_python.ipynb

In [None]:
# https://docs.h5py.org/en/stable/high/file.html#opening-creating-files

# Open read-only explicitly; the default mode differs between h5py versions
h5py_file = h5py.File(paths[0], 'r')

# NOTE: if the file was left open, you might get this error: `OSError: Unable to open file (File signature not found)`
# https://stackoverflow.com/questions/38089950/error-opening-file-in-h5py-file-signature-not-found/43607837

In [None]:
h5py_file?

In [None]:
h5py_file.attrs

In [None]:
h5py_file.keys()

In [None]:
for key in h5py_file.keys():
  print(key)

In [None]:
h5py_file['CONTROL'].keys()

In [None]:
h5py_file['CONTROL']['GLOBAL'].keys()

In [None]:
key_list = [k for k in h5py_file.keys()]
key_list

In [None]:
# Print the first two levels of the hierarchy, using '/' as HDF5 does
for name in h5py_file:
    for subname in h5py_file[name]:
        print(f'{name}/{subname}')
#         for subsubname in h5py_file[name][subname]:
#             print(f'{name}/{subname}/{subsubname}')

In [None]:
h5py_file.close()

In [None]:
h5py_file
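
The nested loops above descend only two levels. As a hedged alternative, h5py's `visititems` method walks the whole hierarchy and passes every group and dataset to a callback. The sketch below runs against a tiny stand-in file written to a temporary folder; the dataset names `demo_values` and `demo_series`, and the callback name `record`, are made up for illustration.

```python
import tempfile
from pathlib import Path

import h5py
import numpy as np

demo_path = str(Path(tempfile.mkdtemp()) / "demo_h5py.h5")

# Write a tiny stand-in file; intermediate groups are created automatically.
with h5py.File(demo_path, "w") as f:
    f.create_dataset("CONTROL/GLOBAL/demo_values", data=np.arange(3))
    f.create_dataset("TIMESERIES/demo_series", data=np.zeros(4))

found = []


def record(name, obj):
    """Callback for visititems: note each object's path and kind."""
    kind = "dataset" if isinstance(obj, h5py.Dataset) else "group"
    found.append((name, kind))


# The with block closes the file even if the callback raises an error.
with h5py.File(demo_path, "r") as f:
    f.visititems(record)

for name, kind in found:
    print(f"{kind:8s} /{name}")
```

The callback must return `None` for the walk to continue; returning any other value stops the traversal early.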

## HDF5 `h5dump` command line utility

The command line “h5dump” utility can also be useful, but it is very verbose!
https://nbviewer.jupyter.org/github/jackdbd/hdf5-pydata-munich/blob/master/hdf5_in_python.ipynb

In [None]:
# get path, as pathlib only works within Python and not at the operating system command prompt.
paths[0]

In [None]:

# IPython expands {paths[0]} to the value of the Python expression.
!h5dump "{paths[0]}"

# WARNING: This is extremely verbose!!!!