# HDF5

This repository tests how to work with the HDF5 format from Python, R and C++.

See the documentation for [h5py] and [rhdf5].

[h5py]: https://docs.h5py.org/en/stable/quick.html
[rhdf5]: https://bioconductor.org/packages/release/bioc/vignettes/rhdf5/inst/doc/rhdf5.html

## Requirements

To use the HighFive library (for reading and writing HDF5 files in C++), install the following packages (with `apt` on Ubuntu):
* libboost-serialization1.71-dev
* libboost-system1.71-dev
* libboost1.71-dev
* libhdf5-dev
* hdf5-helpers
* hdf5-tools


# h5py package (Python)

In [None]:
%load_ext rpy2.ipython
import h5py
import numpy as np
from os import path, makedirs
import pandas as pd


# Create the output directory if it does not exist yet
makedirs("Output", exist_ok=True)

Writing to an hdf5 file:

In [None]:
fname = "Output/h5py_test.h5"

with h5py.File(fname, "w") as fh:
    fh.create_dataset("int", data=1)
    fh.create_dataset("string", data="test")
    fh.create_dataset("array", data=[0, 10, 5])
    fh.create_dataset("strings", data=["hello", "world"])

df = pd.DataFrame([["a", 1], ["b", 2]], columns=["letter", "number"])
df.to_hdf(fname, key="df")  # pandas appends to the existing file by default
df

Reading from an hdf5 file:

In [None]:
with h5py.File(fname) as fh:
    print(fh.keys())
    print(fh["int"][()])
    print(fh["string"][()].decode("UTF-8"))
    print(fh["array"][:])
    print([x.decode("UTF-8") for x in fh["strings"][()]])
pd.read_hdf(fname, "df")

**Note**: Pandas dataframes can be saved to and read from an hdf5 file, but only through pandas' own API, not through h5py.
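The explicit `.decode("UTF-8")` calls in the reading cell above are needed because h5py returns stored strings as `bytes`. As a minimal sketch, assuming h5py 3.0 or later, the `asstr()` wrapper on a dataset can do the decoding instead. The file name `asstr_demo.h5` is a placeholder and the file lives only in memory (the `core` driver with `backing_store=False`); the explicit `dtype=h5py.string_dtype()` is a choice made for this sketch, not something the cells above use.

```python
import h5py

# In-memory file: the name is a placeholder and nothing is written to disk.
with h5py.File("asstr_demo.h5", "w", driver="core", backing_store=False) as fh:
    fh.create_dataset("string", data="test")
    # Explicit variable-length string dtype (an assumption of this sketch).
    fh.create_dataset("strings", data=["hello", "world"],
                      dtype=h5py.string_dtype())

    # asstr() wraps the dataset so that reads return str instead of bytes.
    scalar = fh["string"].asstr()[()]
    words = list(fh["strings"].asstr()[:])

print(scalar)
print(words)
```

This keeps the decoding in one place, which may be convenient when many string datasets are read.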

# rhdf5 package (R)

In [None]:
%%R
library(rhdf5)

Writing to an hdf5 file:

In [None]:
%%R

fname <- "Output/rhdf5_test.h5"
if (file.exists(fname)) { file.remove(fname) } 

h5createFile(fname)
h5write(42, fname, "int")
h5write("test", fname, "string")
h5write(c(1.0, 2.0, 5.0), fname, "vector")
strings <- c("hello", "world", "foofoofoo", "barbarbarbar")
h5write(strings, fname, "strings")

d <- rbind(c(FALSE, TRUE),
           c(TRUE, FALSE),
           c(TRUE, FALSE))
h5write(d, fname, "rectangular_matrix")

name <- c("Jon", "Bill", "Maria", "Ben", "Tina")
age <- c(23, 41, 32, 58, 26)
df <- data.frame(name, age)
h5write(df, fname, "df")

h5closeAll()

Reading from an hdf5 file:

In [None]:
%%R

fh = H5Fopen(fname)
print(fh)
print(fh$"int")
print(fh$"string")
print(fh$"vector")
print(fh$"strings")
print(fh$"rectangular_matrix")
print(fh$"df")
h5closeAll()

Data in an hdf5 file created in one language can be read in another, but some care is needed, as shown later.

In [None]:
with h5py.File("Output/rhdf5_test.h5") as fh:
    print(fh.keys())
    print(fh["int"][0])
    print(fh["string"][0].decode("UTF-8"))
    print(fh["vector"][:])
    print([x.decode("UTF-8") for x in fh["strings"]])
    df = pd.DataFrame(np.array(fh["df"]))
    df["name"] = df["name"].apply(lambda x : x.decode("UTF-8"))
    print(df)
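The cell above decodes the `name` column by hand. When a table has several text columns, the same step could be wrapped in a small helper. The sketch below is hypothetical: `decode_bytes_columns` is not part of pandas or h5py, and the structured NumPy array only stands in for what `np.array(fh["df"])` might return, assuming the text column arrives as `bytes`.

```python
import numpy as np
import pandas as pd


def decode_bytes_columns(df, encoding="utf-8"):
    """Return a copy of df in which every all-bytes column is decoded to str."""
    out = df.copy()
    for col in out.columns:
        if out[col].map(lambda v: isinstance(v, bytes)).all():
            out[col] = out[col].map(lambda v: v.decode(encoding))
    return out


# Stand-in for the compound dataset read from the file (assumed layout).
records = np.array([(b"Jon", 23.0), (b"Bill", 41.0)],
                   dtype=[("name", "S5"), ("age", "f8")])
decoded = decode_bytes_columns(pd.DataFrame(records))
print(decoded)
```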

**Important**: the `rhdf5` package saves matrices transposed, as explained in the `rhdf5` docs [here](https://bioconductor.org/packages/release/bioc/vignettes/rhdf5/inst/doc/rhdf5.html#reading-hdf5-files-with-external-software). If we open the matrix `rectangular_matrix` from `rhdf5_test.h5` in Python:

In [None]:
with h5py.File("Output/rhdf5_test.h5") as fh:
    print(fh["rectangular_matrix"][:])

When read in Python, the matrix `rectangular_matrix` is transposed and its boolean entries have been changed to integers. We can observe the same behaviour when opening a file created by the `rhdf5` package in C++. The `rhdf5` documentation explains the transposition: "This is due to the fact the fastest changing dimension on C is the last one, but on R it is the first one (as in Fortran)."
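The memory-layout argument can be illustrated with NumPy alone. This is only a sketch of the ordering effect, not of what `rhdf5` does internally: it takes the values of the 3x2 matrix from the R cell (as 0/1 integers), lays them out column-major as R would, and reinterprets the same buffer row-major with the dimensions reversed.

```python
import numpy as np

# The 3x2 logical matrix from the R cell, written as 0/1 integers.
r_matrix = np.array([[0, 1],
                     [1, 0],
                     [1, 0]])

# Column-major (Fortran/R) order: the first index varies fastest.
buffer = r_matrix.flatten(order="F")

# A row-major (C) reader sees the same buffer with the dimensions reversed.
as_read_in_python = buffer.reshape(r_matrix.shape[::-1], order="C")

# Transposing restores the orientation used in R.
restored = as_read_in_python.T

print(as_read_in_python)
print(restored)
```

In practice this suggests applying `.T` to matrices written by `rhdf5` after reading them with `h5py`.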

In [None]:
%%R
fh = H5Fopen("Output/rhdf5_test.h5")
print(fh$"rectangular_matrix")
h5closeAll()

Saving R lists and vectors to an hdf5 file:

In [None]:
%%R 

fname <- "Output/rhdf5_test2.h5"
if (file.exists(fname)) { file.remove(fname) } 

h5createFile(fname)

num <- 10
a <- rep(1, num)
names(a) <- paste0("M", 1:num)
print(a)  # names in a vector are not saved into hdf5 format (see below)
h5write(a, fname, "a")

b <- list(M1=1, M2=1)
h5write(b, fname, "b")

num <- 10
c <- as.list(rep(1, num))
names(c) <- paste0("M", 1:num)
h5write(c, fname, "c")

In [None]:
%%R

fh = H5Fopen(fname)
print("a = ")
print(fh$"a")
print("b = ")
print(fh$"b")
print("c = ")
print(fh$"c")
h5closeAll()

We can also read R's lists and vectors in Python with the `h5py` module:

In [None]:
with h5py.File("Output/rhdf5_test2.h5") as fh:
    print(fh["a"][:])
    print({x: fh["b"][x][0] for x in fh["b"].keys()})
    print({x: fh["c"][x][0] for x in fh["c"].keys()})

# HighFive package (C++)

Below is the C++ code for reading and writing hdf5 files:

In [None]:
from IPython.display import Markdown as md

with open("src/highfive_test.cpp") as fh:
    cpp_file = fh.read()
md(f"```C++\n{cpp_file}```")

Files generated in C++ with the `HighFive` library can be opened with `h5py` like so:

In [None]:
with h5py.File("Output/highfive_test.h5") as fh:
    print(fh["path/to"].keys())
    print(fh["path/to/A"][()])
    print([x.decode("UTF-8") for x in fh["path/to/B"]])
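When the group layout of a file is not known in advance, the whole hierarchy can be walked instead of addressing `path/to` directly. The sketch below uses `visititems` from h5py on an in-memory stand-in file. The contents of `A` and `B` are assumptions made for illustration (the real ones come from `src/highfive_test.cpp`), and `record_dataset` and `walk_demo.h5` are hypothetical names.

```python
import h5py

found = {}


def record_dataset(name, obj):
    # Called for every group and dataset below the starting point;
    # only datasets are recorded, keyed by their path.
    if isinstance(obj, h5py.Dataset):
        found[name] = obj.shape


# In-memory stand-in file; intermediate groups are created automatically.
with h5py.File("walk_demo.h5", "w", driver="core", backing_store=False) as fh:
    fh.create_dataset("path/to/A", data=[[1, 2], [3, 4]])
    fh.create_dataset("path/to/B", data=["hello", "world"],
                      dtype=h5py.string_dtype())
    fh.visititems(record_dataset)

print(found)
```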