# Dataset creation

Dataset creation works almost exactly as in `h5py`. To further streamline work with HDF5 files, some features are added on top.

In [1]:
import h5rdmtoolbox as h5tbx
import numpy as np
import xarray as xr

h5tbx.use(None)

2023-03-01_11:24:37,256 INFO     [__init__.py:58] Switched to "default"


The obligatory parameters for dataset creation, known from the base package `h5py`, are `name` and either `data` or `shape`. In addition, attributes can be passed directly during dataset creation:

In [2]:
with h5tbx.File() as h5:
    h5.create_dataset('x', shape=(4,),
                      attrs=dict(description='x coordinate'))
    h5.dump()

The name of the dataset is its path within the HDF5 file. A dataset can be created even if the (sub-)groups in that path do not exist yet.

In [3]:
with h5tbx.File() as h5:
    h5.create_dataset('grp/subgrp/x', shape=(4,))
    h5.dump()
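
For comparison, the same behaviour can be checked in plain `h5py`, where missing intermediate groups are created automatically. This is a minimal sketch using an in-memory file; the file name is a placeholder and nothing is written to disk.

```python
import h5py

# In-memory HDF5 file: the "core" driver without a backing store
# keeps everything in RAM, so no file is left behind.
with h5py.File('intermediate_groups.h5', 'w',
               driver='core', backing_store=False) as f:
    # Neither 'grp' nor 'grp/subgrp' exists at this point.
    f.create_dataset('grp/subgrp/x', shape=(4,))

    group_exists = 'grp/subgrp' in f
    is_group = isinstance(f['grp/subgrp'], h5py.Group)
    dataset_path = f['grp/subgrp/x'].name

print(group_exists, is_group, dataset_path)
```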

## Attributes

Attributes also gain flexibility and additional features. One of the most notable is the ability to interpret attribute strings as a quantity (value and unit) using the package `pint`:<br>
Say we store the attribute `length`. It will most probably include the unit, e.g. `1 m`. We could also have saved it as a dataset, but we did not. Calling `.to_pint()` on the returned object (which is a subclass of `str`) yields a `pint.Quantity` (see https://pint.readthedocs.io/en/stable/getting/tutorial.html for more info):

In [4]:
with h5tbx.File() as h5:
    h5.attrs['length'] = '1 m'
    p = h5.attrs.length.to_pint()
p

## Dimension scales

Dimension scales can be defined during dataset creation. Let `time` be the dimension scale and `pressure` be the dataset to which it is attached.<br>
In order to make seamless use of the HDF5 dimension scales, the feature is passed on to the user by returning an `xarray.DataArray` instead of a `np.ndarray` object. See [slicing datasets](./DatasetSlicing.ipynb) for more on this.

In [9]:
fname_dimcales = h5tbx.generate_temporary_filename()
with h5tbx.File(fname_dimcales, 'w') as h5:
    h5.create_dataset('time', data=[0,1,2,3,4,5],
                      make_scale=True,
                      attrs={'units': 's'})
    h5.create_dataset('pressure', data=np.random.rand(6),
                      attach_scale=h5['time'],
                      attrs={'units': 'Pa'})
    h5.dump()

In order to be compliant with `xarray` objects, single-value "dimension scales" are set via the attribute `COORDINATES`. An example is the location of the pressure sensor in our case. Let's first create the datasets and then register them in the `COORDINATES` attribute of "pressure":

In [10]:
with h5tbx.File(fname_dimcales, 'r+') as h5:
    h5.create_dataset('x', data=5.32)
    h5.create_dataset('y', data=-3.1)
    h5['pressure'].attrs['COORDINATES'] = ('x', 'y')
    h5.dump()
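
On the `xarray` side, such single-value coordinates correspond to scalar (zero-dimensional) coordinates of a `DataArray`. The sketch below builds a comparable object directly in `xarray`, without reading from the file. It only illustrates the target structure and is not necessarily identical to what the toolbox returns.

```python
import numpy as np
import xarray as xr

pressure = xr.DataArray(
    np.random.rand(6),
    dims='time',
    coords={
        # dimension coordinate, comparable to the attached scale
        'time': ('time', np.arange(6), {'units': 's'}),
        # scalar coordinates, comparable to the COORDINATES entries
        'x': 5.32,
        'y': -3.1,
    },
    attrs={'units': 'Pa'},
    name='pressure',
)

x_is_scalar = pressure.coords['x'].ndim == 0
y_value = pressure.coords['y'].item()
print(pressure.coords)
```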

### String datasets
String datasets can be created very quickly. None of standard_name, long_name or units *must* be given. Units generally make no sense for strings anyway, but there is still the option to pass the long and standard name via the method parameters.<br>
The dump method displays single strings but not lists of strings.<br>
When sliced, the return value is still an `xarray.DataArray`, as attributes should remain attached to the object. Use `.values` to get the raw string:

In [15]:
with h5tbx.File() as h5:
    h5.create_string_dataset('astr', 'hello_world')
    h5.create_string_dataset('string_list', ['hello', 'world'])
    h5.dump()
    
    print('> ', h5['astr'][()])
    print('> ', h5['astr'].values[()])

    print('> ', h5['string_list'][:])
    print('> ', h5['string_list'].values[:])

>  hello_world
>  b'hello_world'
>  ('hello', 'world')
>  [b'hello' b'world']
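
The `bytes` objects in the output above match what plain `h5py` returns for string datasets. The following sketch, which assumes `h5py` version 3 or later, shows the raw read and the decoded read via `asstr()`. It uses `h5py` only, and the file name is a placeholder for an in-memory file.

```python
import h5py
import numpy as np

with h5py.File('strings.h5', 'w',
               driver='core', backing_store=False) as f:
    # A scalar string and a list of strings, stored as variable-length UTF-8.
    f.create_dataset('astr', data='hello_world')
    f.create_dataset('string_list',
                     data=np.array(['hello', 'world'], dtype=object),
                     dtype=h5py.string_dtype())

    raw = f['astr'][()]                  # bytes in h5py >= 3
    decoded = f['astr'].asstr()[()]      # str
    decoded_list = list(f['string_list'].asstr()[:])

print(raw, decoded, decoded_list)
```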


### Advanced dataset creation

There is more to dataset creation. You can:
- add attributes

In [16]:
with h5tbx.File() as h5:
    h5.create_dataset('ds', shape=(10, ), attrs=dict(long_name='a long name', anothera='another attr'))  # unitless dataset. long_name is passed via parameter attrs

- make and attach scales (note that the output of `dump()` shows the scale "link")

In [17]:
with h5tbx.File() as h5:
    h5.create_dataset('x', data=[1,2,3], attrs=dict(units='m', standard_name='x_coordinate'), make_scale=True)
    h5.create_dataset('t', data=[20.1, 18.5, 24.7], attrs=dict(units='degC', standard_name='temperature'), attach_scale=h5['x'])
    print(h5.t.x)  # note that the dimension scale can be accessed using attribute-style syntax
    h5.dump()

<HDF5 dataset "x": shape (3,), type "<i4", convention "default">


- add `xarray.DataArray` objects

In [18]:
arr = xr.DataArray(
    dims=('y', 'x'),
    data=np.random.rand(3, 2),
    coords={
        'y': xr.DataArray(dims='y', data=[1, 2, 3],
                          attrs={'units': 'm', 'standard_name': 'y_coordinate'}),
        'x': xr.DataArray(dims='x', data=[0, 1],
                          attrs={'standard_name': 'x_coordinate'}),
    },
    attrs={'long_name': 'a long name', 'units': 'm/s'},
)

with h5tbx.File() as h5:
    h5.create_dataset('temperature', data=arr)
    h5.dump()

- add an `xarray.Dataset`

In [19]:
ds = xr.Dataset({'foo': [1,2,3], 'bar': ('x', [1, 2]), 'baz': np.pi})
ds

In [20]:
try:
    with h5tbx.File() as h5:
        h5.create_dataset_from_xarray_dataset(ds)
except h5tbx.errors.UnitsError as e:
    print(e)
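
The `try`/`except` above suggests that writing can fail when variables carry no `units` attribute. Under that assumption, a small check with plain `xarray` can list the affected variables before writing. The helper name `variables_without_units` is hypothetical and not part of the toolbox.

```python
import numpy as np
import xarray as xr


def variables_without_units(dataset):
    """Return the sorted names of all variables lacking a 'units' attribute.

    Hypothetical helper for illustration, not part of h5rdmtoolbox.
    """
    return sorted(str(name) for name, var in dataset.variables.items()
                  if 'units' not in var.attrs)


example = xr.Dataset({'foo': [1, 2, 3], 'bar': ('x', [1, 2]), 'baz': np.pi})
missing = variables_without_units(example)
print(missing)
```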

In [21]:
ds.foo.attrs['units'] = 'm'
ds.foo.attrs['long_name'] = 'foo'

ds.bar.attrs['units'] = 'm'
ds.bar.attrs['long_name'] = 'bar'

ds.baz.attrs['units'] = 'm'
ds.baz.attrs['long_name'] = 'baz'

ds

In [22]:
with h5tbx.File() as h5:
    h5.create_dataset_from_xarray_dataset(ds)

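A dataset can also be created via `__setitem__`, i.e. by item assignment. The assigned value is a tuple of the data and a dictionary of further creation parameters: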

In [23]:
with h5tbx.File() as h5:
    h5['x'] = ([1,2,3], dict(attrs={'hello': 'world'}, compression='gzip'))
    h5.dump()
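
Written out in plain `h5py`, the assignment above presumably corresponds to a `create_dataset` call followed by setting the attribute. That the toolbox unpacks the tuple in exactly this way is an assumption; the sketch uses an in-memory file with a placeholder name.

```python
import h5py

with h5py.File('setitem_equivalent.h5', 'w',
               driver='core', backing_store=False) as f:
    # Data and compression go to create_dataset ...
    ds = f.create_dataset('x', data=[1, 2, 3], compression='gzip')
    # ... and the attributes are set afterwards.
    ds.attrs['hello'] = 'world'

    values = ds[()].tolist()
    compression = ds.compression
    hello = ds.attrs['hello']

print(values, compression, hello)
```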