# HDF5

This repository tests how to work with the HDF5 format from Python, R and C++.

See the documentation for [h5py] and [rhdf5].

[h5py]: https://docs.h5py.org/en/stable/quick.html
[rhdf5]: https://bioconductor.org/packages/release/bioc/vignettes/rhdf5/inst/doc/rhdf5.html

## Requirements

To use the HighFive library (for reading and writing HDF5 files in C++), install the following packages (via `apt` on Ubuntu):
* libboost-serialization1.71-dev
* libboost-system1.71-dev
* libboost1.71-dev
* libhdf5-dev
* hdf5-helpers
* hdf5-tools


# h5py package (Python)

In [None]:
%load_ext rpy2.ipython
import h5py
import numpy as np

We can read the output of the `use_h5py.py` script as follows:

In [None]:
with h5py.File("Output/h5py_test.h5") as fh:
    print(fh.keys())
    print(fh["array"][:])
    print(fh["int"][()])
    print(fh["string"][()].decode("UTF-8"))
    print([x.decode("UTF-8") for x in fh["strings"][()]])
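For reference, a file with the same dataset names could be produced roughly as follows. This is only a sketch: the `use_h5py.py` script itself is not shown here, so the stored values, the temporary file path and the variable names below are assumptions made for illustration.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical contents and path; the real use_h5py.py may store different values.
path = os.path.join(tempfile.mkdtemp(), "h5py_sketch.h5")

with h5py.File(path, "w") as fh:
    fh.create_dataset("array", data=np.arange(6).reshape(2, 3))
    fh.create_dataset("int", data=42)
    # Variable-length UTF-8 strings are stored with an explicit string dtype.
    str_dtype = h5py.string_dtype(encoding="utf-8")
    fh.create_dataset("string", data="hello", dtype=str_dtype)
    fh.create_dataset("strings", data=["foo", "bar"], dtype=str_dtype)

with h5py.File(path, "r") as fh:
    array = fh["array"][:]
    number = int(fh["int"][()])
    # h5py 3.x returns stored strings as bytes, hence the explicit decode.
    string = fh["string"][()].decode("utf-8")
    strings = [x.decode("utf-8") for x in fh["strings"][()]]

print(array, number, string, strings, sep="\n")
```

The explicit string dtype is used because NumPy's fixed-width Unicode arrays have no direct HDF5 equivalent.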

# rhdf5 package (R)

We can open data saved by the `rhdf5` package within Python like so:

In [None]:
fname = "Output/rhdf5_test1.h5"
with h5py.File(fname) as fh:
    print(fh.keys())
    print(*[(x, fh[x][()]) for x in fh], sep="\n")

In [None]:
fname = "Output/rhdf5_test2.h5"
with h5py.File(fname) as fh:
    print(fh.keys())
    print(fh["foo"].keys())
    print(fh["foo/A"][:])
    print(fh["df"][:])

Or natively in R like so:

In [None]:
%%R
library(rhdf5)

fname <- "Output/rhdf5_test2.h5"
fh <- H5Fopen(fname)
print(fh)
print(fh$"df")
h5closeAll()

**Important**: matrices saved with the `rhdf5` package appear transposed when read by other software, as explained in the `rhdf5` docs [here](https://bioconductor.org/packages/release/bioc/vignettes/rhdf5/inst/doc/rhdf5.html#reading-hdf5-files-with-external-software). For example, if we open matrix `D` from `rhdf5_test1.h5` in Python:

In [None]:
with h5py.File("Output/rhdf5_test1.h5") as fh:
    print(fh["D"][:])

And in R:

In [None]:
%%R
fh <- H5Fopen("Output/rhdf5_test1.h5")
print(fh$"D")
h5closeAll()

When read in Python, the matrix `D` is transposed and its boolean entries are converted to integers. The same behaviour can be observed when opening a file created by the `rhdf5` package in C++. The `rhdf5` documentation explains: "This is due to the fact the fastest changing dimension on C is the last one, but on R it is the first one (as in Fortran)."
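The effect of the two memory orders can be reproduced with NumPy alone. The sketch below does not touch any of the files in `Output/`. The helper `as_seen_in_r` is a hypothetical name introduced here, and casting integers back to booleans assumes the stored values are plain 0/1 (missing values are not handled).

```python
import numpy as np

# Six values stored contiguously, as they would sit in a file.
flat = np.arange(6)

# R is column-major (like Fortran): a 2 x 3 matrix whose first index varies fastest.
r_view = flat.reshape((2, 3), order="F")

# A row-major (C order) reader sees the same values with the
# dimensions reversed: 3 x 2, with the last index varying fastest.
c_view = flat.reshape((3, 2), order="C")

# The two interpretations are transposes of each other.
assert np.array_equal(c_view, r_view.T)


def as_seen_in_r(data, boolean=False):
    """Hypothetical helper: undo the transposition of a matrix written from R.

    If boolean is True, integer entries are cast back to bool
    (assumes the stored values are 0/1 only).
    """
    restored = np.asarray(data).T
    return restored.astype(bool) if boolean else restored


print(r_view)
print(c_view)
print(as_seen_in_r(c_view))
```

In practice, applying `.T` to the array returned by `h5py` should be enough to recover the orientation seen in R.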

# HighFive package (C++)

Below is the C++ code that reads and writes HDF5 files:

In [None]:
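from IPython.display import Markdown as md

# Render the C++ source as a fenced code block in the notebook output.
with open("src/highfive_test.cpp") as fh:
    cpp_file = fh.read()
# The extra newline keeps the closing fence on its own line
# even if the file lacks a trailing newline.
md(f"```C++\n{cpp_file}\n```")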

We can open HDF5 files generated in C++ with the `HighFive` library like so:

In [None]:
with h5py.File("Output/highfive_test.h5") as fh:
    print(fh["path/to"].keys())
    print(fh["path/to/A"][()])
    print([x.decode("UTF-8") for x in fh["path/to/B"]])
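When the group layout of a file is not known in advance, nested paths such as `path/to` can be discovered by walking the file. The following is a minimal sketch using `h5py`'s `visititems`. The helper name `list_datasets` and the throwaway demo file (with made-up contents) are assumptions for illustration and do not correspond to `highfive_test.h5`.

```python
import os
import tempfile

import h5py
import numpy as np


def list_datasets(fh):
    """Return {full_path: (shape, dtype)} for every dataset in an open file."""
    found = {}

    def visitor(name, obj):
        # visititems calls this for every group and dataset; keep datasets only.
        if isinstance(obj, h5py.Dataset):
            found[name] = (obj.shape, obj.dtype)

    fh.visititems(visitor)
    return found


# Demo on a temporary file with hypothetical contents.
path = os.path.join(tempfile.mkdtemp(), "nested_sketch.h5")
with h5py.File(path, "w") as fh:
    # Intermediate groups ("path", "path/to") are created automatically.
    fh.create_dataset("path/to/A", data=np.zeros((2, 3)))
    fh.create_dataset("path/to/B", data=[1, 2, 3])

with h5py.File(path, "r") as fh:
    datasets = list_datasets(fh)
    for name, (shape, dtype) in datasets.items():
        print(name, shape, dtype)
```

The same helper can be pointed at any of the files under `Output/` to see which datasets they contain before indexing into them.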