# Precomputing features as HDF5 datasets

Note that this tutorial is using some features that are available only with our Professional plan.

## Introduction

Training a machine learning model is an iterative process, but the input features rarely need to be recomputed at each iteration. It is therefore convenient to preprocess the data once and store the result. In Metavision, we chose [HDF5](http://docs.h5py.org/en/stable/), a high-performance hierarchical data format that is versatile and offers interfaces in many languages, including Python.

In this tutorial, we will learn how to precompute features from event-based data and save them in HDF5 format, with the goal of reducing storage size and avoiding repeated preprocessing.


`Metavision ML` offers two ways to do this:

- Use the `generate_hdf5.py` Python sample script
- Use the ML module directly


## Use `generate_hdf5.py`

The Python script `generate_hdf5.py` converts RAW or DAT files into tensors in HDF5 format, using a predefined preprocessing function. To call it, use the following command:

    python3 <path to generate_hdf5.py> <path to the RAW or DAT file> -o <path to output folder> --preprocess <choice of preprocessing function> 
    

To see the full list of parameters of `generate_hdf5.py`:

    python3 <path to generate_hdf5.py> -h

    
You can process more than one file at a time by using wildcards. For example, if the RAW files are in `dataset/train/`, call:

    python3 <path to generate_hdf5.py> [...]
    
Use `--max-duration-ms` to set the maximum duration, in ms, stored in a single HDF5 file. If the recording is longer than this, it is split across several HDF5 files. Note that if an NPY label file is associated with the RAW or DAT file, it will be processed as well.
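
When the conversion has to be launched from another Python program, the command line can be assembled as a list of arguments. The sketch below is only an illustration: the helper `build_generate_hdf5_command`, the file locations and the duration value are hypothetical, and it assumes that several input files can be passed as positional arguments, as happens when a shell expands a wildcard. Only the options shown above (`-o`, `--preprocess`, `--max-duration-ms`) are used.

```python
import glob
import shlex
import sys


def build_generate_hdf5_command(script_path, input_pattern, output_folder,
                                preprocess, max_duration_ms=None):
    """Return the argument list for one generate_hdf5.py invocation (hypothetical helper)."""
    # Expand the wildcard in Python so the command does not depend on a shell.
    inputs = sorted(glob.glob(input_pattern))
    command = [sys.executable, script_path, *inputs,
               "-o", output_folder, "--preprocess", preprocess]
    if max_duration_ms is not None:
        command += ["--max-duration-ms", str(max_duration_ms)]
    return command


command = build_generate_hdf5_command(
    "generate_hdf5.py",       # hypothetical script location
    "dataset/train/*.raw",    # hypothetical input pattern
    "dataset/train_hdf5",     # hypothetical output folder
    "timesurface",
    max_duration_ms=60000,    # arbitrary example value
)
print(shlex.join(command))
# To actually run the conversion: subprocess.run(command, check=True)
```

Building the list explicitly avoids quoting problems with paths that contain spaces.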


## Use ML module directly

Alternatively, you can create your own script by importing the ML module. 

Here are the general steps that you can follow. 

### 1. Load all necessary libraries and input data


In [None]:
%matplotlib inline
import os

import numpy as np
import h5py
from matplotlib import pyplot as plt

from metavision_ml.preprocessing.viz import filter_outliers
from metavision_ml.preprocessing.hdf5 import generate_hdf5

In [None]:
from metavision_core.utils import get_sample

input_path = "traffic_monitoring.raw"
# if the file doesn't exist, it will be downloaded from Prophesee's public sample server
get_sample(input_path, folder=".")

### 2. Run the `generate_hdf5` function

Call it with the following main parameters:

- input (`paths`)
- output (`output_folder`)
- preprocessing function (`preprocess`)
- sampling period (`delta_t`)


Note that we use `timesurface` here to process the raw events, but you could use any of the available preprocessing methods, or a custom function registered with `register_new_preprocessing()` from the `metavision_ml.preprocessing` module. Take a look at the preprocessing tutorial for more information.


In [None]:
output_folder = "."
# the output file keeps the base name of the input file, with the .h5 extension
output_path = os.path.join(output_folder, os.path.basename(input_path).replace('.raw', '.h5'))
if not os.path.exists(output_path):
    # delta_t is the sampling period in μs
    generate_hdf5(paths=input_path, output_folder=output_folder, preprocess="timesurface", delta_t=250000,
                  height=None, width=None, start_ts=0, max_duration=None)

print('\nOriginal file "{}" is of size: {:.3f}MB'.format(input_path, os.path.getsize(input_path) / 1e6))
print('\nResult file "{}" is of size: {:.3f}MB'.format(output_path, os.path.getsize(output_path) / 1e6))

We can see that the stored data is smaller than the original input. Also, now that the data is preprocessed, we do not need to process it again every time we want to use it.

**Note on the RAW file size and HDF5 file size**

The size of the RAW file is directly linked to the number of events in the file. This number is a consequence of the motion and illumination changes that occurred in front of the sensor. The HDF5 file, on the other hand, is frame-based. For a given preprocessing function and sampling period, recordings of the same duration produce the same number of tensors regardless of their content.

As a result, uncompressed RAW files, with their very high temporal precision, are quite large for scenes rich in motion but much smaller for simpler scenes recorded with a fixed camera. Compressed HDF5 files, with their fixed frame rate, therefore yield large storage gains on rich scenes but smaller gains on simple ones.
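
The following back-of-the-envelope sketch illustrates this difference. All figures are assumptions chosen for illustration (sensor resolution, number of channels, bytes per value, bytes per event, event counts); they are not taken from a real recording, and the dense size is computed before any HDF5 compression.

```python
def dense_tensor_bytes(duration_us, delta_t_us, channels, height, width, bytes_per_value):
    """Uncompressed size of frame-based tensors: it depends only on duration and shape."""
    n_tensors = duration_us // delta_t_us
    return n_tensors * channels * height * width * bytes_per_value


def event_stream_bytes(n_events, bytes_per_event):
    """Size of an event stream: it grows with the number of events."""
    return n_events * bytes_per_event


# Hypothetical 10 s recording sampled every 250000 us on a 640x480 sensor.
dense = dense_tensor_bytes(duration_us=10_000_000, delta_t_us=250_000,
                           channels=2, height=480, width=640, bytes_per_value=1)

# Hypothetical event counts for a quiet scene and a busy scene, 4 bytes per event.
quiet = event_stream_bytes(n_events=1_000_000, bytes_per_event=4)
busy = event_stream_bytes(n_events=100_000_000, bytes_per_event=4)

for name, size in [("dense tensors", dense), ("quiet events", quiet), ("busy events", busy)]:
    print("{:>14s}: {:.1f} MB".format(name, size / 1e6))
```

With these made-up numbers the dense representation is larger than the quiet event stream and much smaller than the busy one, which is the trend described above.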



### A closer look at our HDF5 Datasets 

We store the precomputed event-based tensors in HDF5 [datasets](http://docs.h5py.org/en/stable/high/dataset.html) named "data". The metadata and parameters used during preprocessing are stored as associated attributes of "data".

Let's first have a look at the "data".

In [None]:
f = h5py.File(output_path, 'r')  # open the HDF5 file in read mode
print(f['data'])  # show the 'data' dataset

Similar to NumPy arrays, you can get their `shape` and `dtype` attributes.

In [None]:
hdf5_shape = f['data'].shape
print(hdf5_shape)  
print(f['data'].dtype)


What's more, you can read the attributes associated with the dataset as well.

In [None]:
print("Attributes :\n")
for key in f['data'].attrs:
    print('\t', key, ' : ', f['data'].attrs[key])
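
If the metadata has to be passed around, for example to a data loader, it can be copied into a plain dictionary. This is a minimal sketch on a toy file: the helper `attrs_to_dict` and the file name are hypothetical, and only the two attribute names used in this tutorial (`events_to_tensor` and `delta_t`) are written; files produced by `generate_hdf5` may contain other attributes.

```python
import os
import tempfile

import h5py
import numpy as np


def attrs_to_dict(dataset):
    """Copy the attributes of an HDF5 dataset into a dict, decoding byte strings."""
    return {key: value.decode() if isinstance(value, bytes) else value
            for key, value in dataset.attrs.items()}


with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "toy_attrs.h5")  # hypothetical toy file
    with h5py.File(toy_path, "w") as toy:
        dset = toy.create_dataset("data", data=np.zeros((3, 2, 4, 4), dtype=np.uint8))
        dset.attrs["events_to_tensor"] = np.bytes_("timesurface")
        dset.attrs["delta_t"] = 250000
    with h5py.File(toy_path, "r") as toy:
        metadata = attrs_to_dict(toy["data"])

print(metadata)
```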

Let's now visualize the data stored within this HDF5 file.

In [None]:
for i, timesurface in enumerate(f['data'][:10]):
    plt.imshow(filter_outliers(timesurface[0], 7))  # filter out some noise
    plt.title("{:s} feature computed at time {:d} μs".format(f['data'].attrs['events_to_tensor'],
                                                          f['data'].attrs["delta_t"] * i))
    plt.pause(0.01)

Note that an HDF5 `dataset` is similar to a NumPy `ndarray` but has some unique features. An important difference is that when you read from an HDF5 dataset, the data is actually *read from disk* and loaded into memory as a NumPy array. **If you are handling a large dataset, avoid reading the whole file at once so that you do not saturate the memory.**
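
One way to follow this recommendation is to read the dataset slice by slice. The sketch below uses a small toy file so that it is self-contained; the helper `iter_batches`, the file name and the tensor shape are hypothetical. Slicing an h5py dataset only reads the requested range from disk.

```python
import os
import tempfile

import h5py
import numpy as np


def iter_batches(dataset, batch_size):
    """Yield consecutive slices of an HDF5 dataset as NumPy arrays."""
    for start in range(0, dataset.shape[0], batch_size):
        # Only this slice is read from disk and loaded into memory.
        yield dataset[start:start + batch_size]


with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "toy_batches.h5")  # hypothetical toy file
    with h5py.File(toy_path, "w") as toy:
        toy.create_dataset("data", data=np.zeros((10, 2, 4, 4), dtype=np.uint8))

    batch_shapes = []
    with h5py.File(toy_path, "r") as toy:
        for batch in iter_batches(toy["data"], batch_size=4):
            batch_shapes.append(batch.shape)

print(batch_shapes)
```

The last batch is shorter when the number of tensors is not a multiple of the batch size.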

## Enriching an HDF5 file

HDF5 files are flexible and can be easily edited, which means that you can add `datasets` other than the event tensors. For example, you can include ground truth labels, statistics and so on. Look at the [h5py documentation](http://docs.h5py.org/en/stable/) for more examples.

Now let's add a confidence score label for each tensor of our existing "data" dataset, that is, one score per `delta_t` time bin.

In [None]:
label = np.random.rand(hdf5_shape[0])  # one random score per stored tensor
# we first close the read-only file handle
f.close()

# the `with` statement closes the file automatically, even if an error occurs
with h5py.File(output_path, 'r+') as f:  # 'r+' opens an existing file for reading and writing
    try:
        # we can create a dataset directly from a numpy array:
        label_dset = f.create_dataset("confidence", data=label)
        # and then add some metadata to this dataset
        label_dset.attrs['label_type'] = np.bytes_("random confidence score")
    except ValueError:
        print("confidence dataset exists already")

# reopen in read mode inside a `with` block so that no file handle is left open
with h5py.File(output_path, 'r') as f:
    print("The file now contains the following keys: ", list(f.keys()))

That's it. We have added a new dataset to an existing HDF5 file.
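
To use such labels later, the tensors and their scores can be read back together. The sketch below rebuilds a toy file with the same layout ("data" plus "confidence") so that it runs on its own; the file name, tensor shape and score values are hypothetical.

```python
import os
import tempfile

import h5py
import numpy as np

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "toy_labels.h5")  # hypothetical toy file

    # Build a toy file with tensors and one confidence score per tensor.
    with h5py.File(toy_path, "w") as toy:
        toy.create_dataset("data", data=np.zeros((5, 2, 4, 4), dtype=np.uint8))
        scores = toy.create_dataset("confidence", data=np.linspace(0.0, 1.0, 5))
        scores.attrs["label_type"] = np.bytes_("random confidence score")

    with h5py.File(toy_path, "r") as toy:
        data, confidence = toy["data"], toy["confidence"]
        # Both datasets are indexed by time bin, so their first dimensions must agree.
        assert data.shape[0] == confidence.shape[0]
        label_type = confidence.attrs["label_type"]
        if isinstance(label_type, bytes):  # attribute stored as a byte string
            label_type = label_type.decode()
        pairs = [(data[i].shape, float(confidence[i])) for i in range(data.shape[0])]

print(label_type)
for tensor_shape, score in pairs:
    print(tensor_shape, round(score, 2))
```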

In [None]:
# remove the output file at the end
os.remove(output_path)