# Working with hdf5 files

As discussed before, `hdf5` is a good format for storing sensor data, because it can store meta data together with the data in a compressed form (among other advantages over plain-text storage).
However, this also makes reading and working with the data more complicated than just reading a plain `csv` file.
In the following, you can find a short guideline and some code recipes that simplify handling `hdf5` files.

## I have an hdf5 file without further documentation - What should I do?

Because of the nested nature of the `hdf5` file format, you first need to figure out where the relevant data and meta data are stored inside the file's tree structure.  
This can be done with a graphical hdf5 viewer.
We recommend [Panoply](https://www.giss.nasa.gov/tools/panoply/), developed by NASA.
Clicking on a group or dataset shows meta information about it.
Double-clicking on a dataset lets you create a simple plot of the data it contains and view the raw values.


**Example for viewing an hdf5 file using Panoply:**

![Panoply](media/panolpy.png)


**Tip:** Meta data concerning the whole measurement, like the measurement date, subject information, or **sampling rate** can often be found at the root group. In the viewer, these group attributes are listed at the very bottom of the metadata list.

If you do not want to use a GUI, you can also use the `h5py` Python package to load and explore a dataset.
Below, you can find a couple of helpful recipes to quickly scan through the whole file.

**Note:** The examples use the hdf5 file from the first exercise. If you want to run the examples on your own, modify the path to this file in the following cells.

In [1]:
import h5py


In [2]:
example_hdf = h5py.File('data/signal3.h5', 'r')

### Iterating all groups in a file

`h5py` provides two functions that can be used to iterate through the full tree of groups and objects inside an hdf5 file: [`Group.visit`](http://docs.h5py.org/en/latest/high/group.html#Group.visit) and [`Group.visititems`](http://docs.h5py.org/en/latest/high/group.html#Group.visititems).
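Both functions take a callable (i.e. a function), which is executed for every object (group or dataset) in the file: `visit` passes only the object's name, while `visititems` passes the name and the object itself.
Below you can find a couple of useful examples.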

In [3]:
# Printing the names of all groups and datasets

example_hdf.visit(print)

trial1
trial1/date
trial1/muscles
trial1/muscles/gastrocnemius
trial1/muscles/perenous
trial1/muscles/soleus
trial1/muscles/tibialis
trial1/sampling rate
trial1/subject
trial1/subjectnr
trial1/trialnr


In [4]:
# Finding a specific object
# (nothing is printed here, because no name in this file contains 'channel_1')

def find(name, search_key):
    if search_key in name:
        print(name)

example_hdf.visit(lambda x: find(x, 'channel_1')) 
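
Instead of printing, you may want to get the matching path back as a value. The following is a minimal sketch that relies on the documented behaviour of `visit`: iteration stops as soon as the callable returns something other than `None`, and `visit` returns that value. The helper name `find_first` and the small in-memory demo file are hypothetical and only mimic the layout of the example file.

```python
import h5py
import numpy as np


def find_first(h5_group, search_key):
    """Return the path of the first object whose name contains search_key, else None."""
    # Returning a non-None value from the callable stops the iteration,
    # and visit() hands that value back to the caller.
    return h5_group.visit(lambda name: name if search_key in name else None)


# Hypothetical demo file that lives only in memory (nothing is written to disk)
with h5py.File('demo_find.h5', 'w', driver='core', backing_store=False) as demo:
    demo.create_dataset('trial1/muscles/soleus', data=np.zeros(5))
    demo.create_dataset('trial1/muscles/tibialis', data=np.zeros(5))

    first_match = find_first(demo, 'soleus')
    no_match = find_first(demo, 'channel_1')

print(first_match)
print(no_match)
```

On your own file, you would call `find_first(example_hdf, 'soleus')` in the same way.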

In [5]:
# Find all datasets and print their shapes

def check_data(name, obj):
    if isinstance(obj, h5py.Dataset):
        print(name, 'data shape:', obj.shape)
    
example_hdf.visititems(check_data) 

trial1/date data shape: ()
trial1/muscles/gastrocnemius data shape: (1201,)
trial1/muscles/perenous data shape: (1201,)
trial1/muscles/soleus data shape: (1201,)
trial1/muscles/tibialis data shape: (1201,)
trial1/sampling rate data shape: ()
trial1/subject data shape: ()
trial1/subjectnr data shape: ()
trial1/trialnr data shape: ()
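
The same pattern can be used to collect data instead of printing it. As a sketch, the hypothetical helper `load_datasets` below gathers every non-scalar dataset into a dictionary of NumPy arrays, keyed by its path. The in-memory demo file is made up for illustration and is much smaller than the example file.

```python
import h5py
import numpy as np


def load_datasets(h5_file):
    """Collect all non-scalar datasets as {path: numpy array}."""
    loaded = {}

    def collect(name, obj):
        # Scalar datasets have the shape (), which is falsy, so they are skipped
        if isinstance(obj, h5py.Dataset) and obj.shape:
            loaded[name] = obj[()]  # read the full dataset into memory

    h5_file.visititems(collect)
    return loaded


# Hypothetical demo file that lives only in memory (nothing is written to disk)
with h5py.File('demo_load.h5', 'w', driver='core', backing_store=False) as demo:
    demo.create_dataset('trial1/muscles/soleus', data=np.arange(4.0))
    demo.create_dataset('trial1/subjectnr', data=3)  # scalar dataset
    arrays = load_datasets(demo)

print(list(arrays.keys()))
```

Note that this reads all data into memory at once, which may not be what you want for very large files.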


### HDF5 Attributes

Hdf5 files don't just store raw data values, but also additional meta data.
This meta data is stored in the attributes of each group or dataset.
Even if a group or dataset does not contain any actual data, it might contain meta data.
Meta data can be accessed using the `attrs` property of a group or dataset.

In [6]:
# Get all attributes

def print_attributes(name, obj):
    attr_list = list(obj.attrs.items())
    if attr_list:
        print(name, 'Attribute keys:', attr_list)
    
example_hdf.visititems(print_attributes) 

trial1/muscles/gastrocnemius Attribute keys: [('sampling rate', 2000)]
trial1/muscles/perenous Attribute keys: [('sampling rate', 2000)]
trial1/muscles/soleus Attribute keys: [('sampling rate', 2000)]
trial1/muscles/tibialis Attribute keys: [('sampling rate', 2000)]


In [7]:
# Find a specific attribute by name

def find_attribute(name, obj, search_key):
    if search_key in obj.attrs.keys():
        print(name, search_key, obj.attrs[search_key])
    
example_hdf.visititems(lambda n, o: find_attribute(n, o, 'sampling rate')) 

trial1/muscles/gastrocnemius sampling rate 2000
trial1/muscles/perenous sampling rate 2000
trial1/muscles/soleus sampling rate 2000
trial1/muscles/tibialis sampling rate 2000


In [8]:
# Manually access an attribute

# Toplevel
print('Toplevel access - Sampling rate:', example_hdf['trial1/muscles/gastrocnemius'].attrs['sampling rate'])

Toplevel access - Sampling rate: 2000
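
Accessing an attribute that does not exist raises a `KeyError`. Since `attrs` behaves like a dictionary, a possible way around this is its `get` method with a fallback value. The sketch below also copies all attributes into a plain `dict`. The in-memory demo file and the attribute name `'unit'` are hypothetical and not part of the example file.

```python
import h5py
import numpy as np

# Hypothetical demo file that lives only in memory (nothing is written to disk)
with h5py.File('demo_attrs.h5', 'w', driver='core', backing_store=False) as demo:
    dset = demo.create_dataset('trial1/muscles/soleus', data=np.zeros(5))
    dset.attrs['sampling rate'] = 2000

    # Copy all attributes into a plain dict, so they stay usable after the file is closed
    attributes = dict(dset.attrs)

    # get() returns the fallback value instead of raising a KeyError
    sampling_rate = dset.attrs.get('sampling rate', None)
    unit = dset.attrs.get('unit', 'unknown')

print(attributes)
print(sampling_rate, unit)
```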


## I know where my data is and I want to access it

Once you have located your data in the tree structure, you probably want to load it into your Python program for further processing or plotting.

The important thing to understand about the `h5py` library is: **Groups work like dictionaries, and datasets work like NumPy arrays**.
You can read more about this in the [official documentation](http://docs.h5py.org/en/latest/quick.html).

In the following, these principles are used to access some data inside our `hdf5` file.

In [9]:
# Listing subgroups/datasets of a group

print(list(example_hdf['trial1/'].items()))

# Or just the keys
print(list(example_hdf['trial1/'].keys()))

[('date', <HDF5 dataset "date": shape (), type "|O">), ('muscles', <HDF5 group "/trial1/muscles" (4 members)>), ('sampling rate', <HDF5 dataset "sampling rate": shape (), type "<i4">), ('subject', <HDF5 dataset "subject": shape (), type "|O">), ('subjectnr', <HDF5 dataset "subjectnr": shape (), type "<i4">), ('trialnr', <HDF5 dataset "trialnr": shape (), type "<i4">)]
['date', 'muscles', 'sampling rate', 'subject', 'subjectnr', 'trialnr']


In [10]:
# Access a nested group

subgroup = example_hdf['trial1/muscles']
print(subgroup)

# or
nested_group = example_hdf['trial1']['muscles']
print(nested_group)

<HDF5 group "/trial1/muscles" (4 members)>
<HDF5 group "/trial1/muscles" (4 members)>


In [11]:
# Access a dataset (identical to a subgroup, but the return value is different)

dataset = example_hdf['trial1/muscles/gastrocnemius']
print(dataset)

# Get the values as numpy array
values = dataset[()]
print(type(values), values.shape, values)

# Often you need to transpose the output to get it into a usable shape
# (this has no effect on a 1-D dataset like this one)
values = values.T
print(values.shape, values)

<HDF5 dataset "gastrocnemius": shape (1201,), type "<f4">
<class 'numpy.ndarray'> (1201,) [ 0.04074097 -0.00228882 -0.05096436 ... -0.00648499  0.0213623
  0.01426697]
(1201,) [ 0.04074097 -0.00228882 -0.05096436 ... -0.00648499  0.0213623
  0.01426697]
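
Because datasets work like NumPy arrays, you can also slice them to read only part of the data from the file. The sketch below additionally builds a time axis from the sampling rate, which is often the next step before plotting. It assumes that the sampling rate is given in Hz and stored as a dataset attribute, as in the example file. The in-memory demo file and its sine-wave content are made up for illustration.

```python
import h5py
import numpy as np

# Hypothetical demo file that lives only in memory (nothing is written to disk).
# The with-statement closes the file automatically at the end of the block.
with h5py.File('demo_signal.h5', 'w', driver='core', backing_store=False) as demo:
    signal = np.sin(np.linspace(0, 10, 1201)).astype('float32')
    dset = demo.create_dataset('trial1/muscles/soleus', data=signal)
    dset.attrs['sampling rate'] = 2000  # assumed to be in Hz

    # Slicing reads only the requested samples from the file
    first_samples = dset[:100]
    every_tenth = dset[::10]

    sampling_rate = dset.attrs['sampling rate']
    n_samples = dset.shape[0]

# Time axis in seconds: sample index divided by the sampling rate
time = np.arange(n_samples) / sampling_rate

print(first_samples.shape, every_tenth.shape)
print(time[0], time[-1])
```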
