## HDF5 IO

PyHIPP provides high-level wrappers on top of the `h5py` package.

To use them, import the `h5` submodule from `pyhipp.io`:

In [2]:
from pyhipp.io import h5
from pyhipp.stats import Rng
import numpy as np

Create a catalog in the form of an arbitrarily nested dictionary:

In [10]:
rng = Rng(seed=10086)
l_box = 500.0
n_subhalos = 10

catalog = {
    'header': {
        'source': 'ELUCID simulation', 'last_update': '2024-06-06',
    },
    'subhalos': {
        'id': np.arange(n_subhalos),
        'x': rng.uniform(high=l_box, size=(n_subhalos, 3)),
    },
}

Dump it to a file:
- By default, the file flag is exclusive, i.e., it will raise an error if the file already exists.
  The `'w'` flag can be used to truncate the file if it already exists.

In [32]:
file_name = 'catalog.hdf5'
h5.File.dump_to(file_name, catalog, f_flag='w')
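
The difference between the exclusive default and the `'w'` flag can be illustrated with Python's built-in `open()`, whose `'x'` and `'w'` modes follow the same idea. This is only an analogy using a plain text file in a temporary directory. It does not call PyHIPP, and the names `path`, `exclusive_failed`, and `final_content` are made up for the illustration.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'demo.txt')

    # 'x' (exclusive): creation succeeds only if the file does not exist yet.
    with open(path, 'x') as f:
        f.write('first')

    # A second exclusive open on the same path raises FileExistsError.
    try:
        with open(path, 'x') as f:
            f.write('second')
        exclusive_failed = False
    except FileExistsError:
        exclusive_failed = True

    # 'w' (truncate): the existing content is discarded and replaced.
    with open(path, 'w') as f:
        f.write('third')
    with open(path) as f:
        final_content = f.read()

print(exclusive_failed, final_content)
```

The exclusive mode protects an existing file from accidental overwrites, which is presumably why it is a sensible default for dumping data.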

List the contents of the file:
- This is similar to the `h5ls` command-line tool provided by the official HDF5 installation.

In [33]:
h5.File.ls_from(file_name)

/
├─ header/
   ├─ last_update(object)
   └─ source(object)
└─ subhalos/
   ├─ id(int64, (10,))
   └─ x(float64, (10, 3))
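
The listing above mirrors the nesting of the dictionary: each nested dict appears as a data group and every other value as a dataset. The following pure-Python sketch shows that mapping on a toy dictionary. The helper `flatten_paths` is hypothetical (it is not part of PyHIPP), and plain lists stand in for NumPy arrays so the example has no dependencies.

```python
def flatten_paths(tree, prefix=''):
    """Yield (path, value) pairs for every leaf of a nested dict."""
    # Keys are visited in sorted order to match the alphabetical listing.
    for key in sorted(tree):
        value = tree[key]
        path = f'{prefix}/{key}'
        if isinstance(value, dict):
            # A nested dict plays the role of a data group: recurse into it.
            yield from flatten_paths(value, path)
        else:
            # Anything else plays the role of a dataset.
            yield path, value


toy_catalog = {
    'header': {'source': 'ELUCID simulation', 'last_update': '2024-06-06'},
    'subhalos': {'id': [0, 1, 2], 'x': [[0.0, 1.0, 2.0]] * 3},
}

paths = [path for path, _ in flatten_paths(toy_catalog)]
print(paths)
```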


Open the file and modify it:

- mode `'a'`: open an existing file for reading and writing.
- `file.ls()` or `data_group.ls()`: list its contents.
- `dump()`: save contents as datasets, or as a sub data group if the value is a dict-like object.
- `attrs`: the attribute manager.
- `attrs.dump()`: save contents as attributes.

In [34]:
with h5.File(file_name, 'a') as f:
    subhalos = f['subhalos']
    
    print('--- before modifying ---')
    f.ls()
    
    print('\n--- after modifying ---')
    subhalos.dump({
        'v': rng.normal(scale=200, size=(n_subhalos, 3)),
        'mass': rng.uniform(0., 100., size=n_subhalos),
    })
    subhalos.attrs.dump({
        'n_subhalos': n_subhalos,
    })
    f.ls()

--- before modifying ---
/
├─ header/
   ├─ last_update(object)
   └─ source(object)
└─ subhalos/
   ├─ id(int64, (10,))
   └─ x(float64, (10, 3))

--- after modifying ---
/
├─ header/
   ├─ last_update(object)
   └─ source(object)
└─ subhalos/[n_subhalos=10]
   ├─ id(int64, (10,))
   ├─ mass(float64, (10,))
   ├─ v(float64, (10, 3))
   └─ x(float64, (10, 3))


Load back all the data from the file:

In [38]:
catalog = h5.File.load_from(file_name)
catalog

{ 'header': {'last_update': b'2024-06-06', 'source': b'ELUCID simulation'},
  'subhalos': { 'id': array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]),
                'mass': array([45.23633189, 78.24576429, 13.17180825,  6.09529128, 73.67279848,
       23.8991323 , 32.9038259 , 89.40551753, 47.28830156, 88.55313355]),
                'v': array([[-271.43364188,   30.66825076,   -4.72063145],
       [  90.21116365,   -3.08840841,  217.10598144],
       [-172.27645923,   93.1028734 ,   81.17318407],
       [ 136.39478322,  157.84724842,   33.14150887],
       [  17.6955707 ,   76.86272393,  171.13420932],
       [ 209.09596349, -117.06806898,   66.55152947],
       [-246.15164814,  210.46296219,  133.0375081 ],
       [-157.81499475,   54.7480528 , -144.68423047],
       [  54.9370893 ,   -1.68798678,  108.00899249],
       [ 199.56878883,  173.29967731,   58.13932404]]),
                'x': array([[261.24859496, 298.5067223 , 146.16116083],
       [144.53162763, 315.18873326, 367.15953774],
       

To load a subset, pass a key:

In [39]:
subhalos = h5.File.load_from(file_name, key='subhalos')
subhalos

{ 'id': array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]),
  'mass': array([45.23633189, 78.24576429, 13.17180825,  6.09529128, 73.67279848,
       23.8991323 , 32.9038259 , 89.40551753, 47.28830156, 88.55313355]),
  'v': array([[-271.43364188,   30.66825076,   -4.72063145],
       [  90.21116365,   -3.08840841,  217.10598144],
       [-172.27645923,   93.1028734 ,   81.17318407],
       [ 136.39478322,  157.84724842,   33.14150887],
       [  17.6955707 ,   76.86272393,  171.13420932],
       [ 209.09596349, -117.06806898,   66.55152947],
       [-246.15164814,  210.46296219,  133.0375081 ],
       [-157.81499475,   54.7480528 , -144.68423047],
       [  54.9370893 ,   -1.68798678,  108.00899249],
       [ 199.56878883,  173.29967731,   58.13932404]]),
  'x': array([[261.24859496, 298.5067223 , 146.16116083],
       [144.53162763, 315.18873326, 367.15953774],
       [293.71877935,  73.13813842,  24.14471203],
       [376.7850225 , 338.91419166, 297.37809599],
       [401.3352333 ,  55.99929851, 4

It is also possible to load a single dataset:

In [44]:
x = h5.File.load_from(file_name, key='subhalos/x')
x

array([[261.24859496, 298.5067223 , 146.16116083],
       [144.53162763, 315.18873326, 367.15953774],
       [293.71877935,  73.13813842,  24.14471203],
       [376.7850225 , 338.91419166, 297.37809599],
       [401.3352333 ,  55.99929851, 408.51861218],
       [ 39.71583672,  18.74817409, 418.51912984],
       [365.58739017,  64.59553126,  87.43307286],
       [308.77804769, 403.92009538, 345.97113507],
       [189.39080029, 347.04598553, 281.95556375],
       [157.9750106 ,  57.10295362, 413.35000259]])
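
A key such as `'subhalos/x'` is a slash-separated path through the nested groups. As a rough illustration of that lookup, the sketch below resolves such a key against a plain nested dict. The function `resolve_key` and the dictionary `toy_catalog` are hypothetical and only mimic the behavior shown above; they are not PyHIPP code.

```python
def resolve_key(tree, key):
    """Follow a slash-separated key through a nested dict."""
    node = tree
    # Leading or trailing slashes are ignored, so '/a/b' and 'a/b' agree.
    for part in key.strip('/').split('/'):
        node = node[part]
    return node


toy_catalog = {
    'header': {'source': 'ELUCID simulation'},
    'subhalos': {'id': [0, 1, 2], 'x': [[0.0, 1.0, 2.0]]},
}

group = resolve_key(toy_catalog, 'subhalos')      # a whole group (dict)
dataset = resolve_key(toy_catalog, 'subhalos/x')  # a single leaf value
print(sorted(group), dataset)
```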

For finer control, open the file and read its contents:
- `attrs`: the attribute manager. `attrs[key]` returns the value of the attribute.
- `datasets`: the dataset manager. `datasets[key]` returns the value of the dataset.
- Multiple keys can be passed at once; the returned value is a tuple of the corresponding values.
- Attribute access, such as `datasets.key`, is also allowed (thanks to Zhaozhou Li for the idea).

In [53]:
with h5.File(file_name) as f:
    subhalos = f['subhalos']
    n_subhalos = subhalos.attrs['n_subhalos']
    x, v = subhalos.datasets['x', 'v']
    mass = subhalos.datasets.mass
n_subhalos, x, v, mass

(10,
 array([[261.24859496, 298.5067223 , 146.16116083],
        [144.53162763, 315.18873326, 367.15953774],
        [293.71877935,  73.13813842,  24.14471203],
        [376.7850225 , 338.91419166, 297.37809599],
        [401.3352333 ,  55.99929851, 408.51861218],
        [ 39.71583672,  18.74817409, 418.51912984],
        [365.58739017,  64.59553126,  87.43307286],
        [308.77804769, 403.92009538, 345.97113507],
        [189.39080029, 347.04598553, 281.95556375],
        [157.9750106 ,  57.10295362, 413.35000259]]),
 array([[-271.43364188,   30.66825076,   -4.72063145],
        [  90.21116365,   -3.08840841,  217.10598144],
        [-172.27645923,   93.1028734 ,   81.17318407],
        [ 136.39478322,  157.84724842,   33.14150887],
        [  17.6955707 ,   76.86272393,  171.13420932],
        [ 209.09596349, -117.06806898,   66.55152947],
        [-246.15164814,  210.46296219,  133.0375081 ],
        [-157.81499475,   54.7480528 , -144.68423047],
        [  54.9370893 ,   -1.6879
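
The two access styles used in the cell above (a tuple of keys, and attribute access) can be reproduced with a few lines of plain Python. The class `DictManager` below is a hypothetical toy backed by a dict. It sketches one way such an interface could be built and makes no claim about how PyHIPP implements its managers.

```python
class DictManager:
    """Toy stand-in for a dataset or attribute manager backed by a dict."""

    def __init__(self, data):
        self._data = dict(data)

    def __getitem__(self, key):
        # A tuple of keys returns a tuple of values, in the same order.
        if isinstance(key, tuple):
            return tuple(self._data[k] for k in key)
        return self._data[key]

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails.
        # Private names are rejected to avoid recursing on `_data` itself.
        if name.startswith('_'):
            raise AttributeError(name)
        try:
            return self._data[name]
        except KeyError:
            raise AttributeError(name) from None


datasets = DictManager({'x': [1.0, 2.0], 'v': [3.0, 4.0], 'mass': [5.0]})
x, v = datasets['x', 'v']   # multiple keys at once
mass = datasets.mass        # attribute-style access
print(x, v, mass)
```

Attribute access is convenient for interactive use, while item access remains necessary for keys that are not valid Python identifiers.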

To load a data group as a whole, use the `load()` method:
- Internally, `h5.File.load_from(file_name, key='header')` is implemented as the following code.

In [54]:
with h5.File(file_name) as f:
    header = f['header'].load()
header

{'last_update': b'2024-06-06', 'source': b'ELUCID simulation'}