Support variable length data types #369

Open
sabasehrish opened this issue Aug 31, 2020 · 20 comments
Labels
enhancement v3 Anything that needs to be resolved before `v3`.

Comments

@sabasehrish

For an application I am working with, I have data represented as std::vector<std::vector<char>>, where each inner std::vector has a different length. Below are the three approaches I have tried so far. I would like to know whether there is a better or more efficient approach using the HighFive API, perhaps using variable-length datatypes, and whether there are any examples.

  • Flatten the structure into a single std::vector<char> stored as one 1D HDF5 dataset, plus a second dataset holding the size of each inner vector, so that the original vectors can be reconstructed on reading.

  • Keep the vector-of-vectors structure and use a 2D HDF5 dataset of chars, resizing the second dimension before each write. This seems very inefficient, since I call resize and write for every inner vector.

  • Use std::vector<std::string>, which seemed the most straightforward, but I cannot make it work: after writing, I see only a single char per element in the file instead of the complete string.
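
The first approach (flatten plus a sizes array) can be sketched without any HDF5 calls. This is only an illustration under assumptions: the struct `FlatRagged` and the functions `flatten` and `unflatten` are hypothetical names, and writing the two resulting 1D arrays to datasets is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical holder for the two 1D arrays that would become two datasets.
struct FlatRagged {
    std::vector<char> values;        // all inner vectors, concatenated
    std::vector<std::size_t> sizes;  // length of each inner vector
};

// Concatenate the rows and remember how long each one was.
FlatRagged flatten(const std::vector<std::vector<char>>& rows) {
    FlatRagged out;
    out.sizes.reserve(rows.size());
    std::size_t total = 0;
    for (const auto& row : rows) total += row.size();
    out.values.reserve(total);
    for (const auto& row : rows) {
        out.sizes.push_back(row.size());
        out.values.insert(out.values.end(), row.begin(), row.end());
    }
    return out;
}

// Rebuild the rows; each row is copied in one range construction.
std::vector<std::vector<char>> unflatten(const FlatRagged& flat) {
    // The sizes must account for every stored value.
    assert(std::accumulate(flat.sizes.begin(), flat.sizes.end(),
                           std::size_t{0}) == flat.values.size());
    std::vector<std::vector<char>> rows;
    rows.reserve(flat.sizes.size());
    auto first = flat.values.begin();
    for (std::size_t n : flat.sizes) {
        const auto step = static_cast<std::ptrdiff_t>(n);
        rows.emplace_back(first, first + step);
        first += step;
    }
    return rows;
}
```

Reassembly is a single pass over the flat buffer with one allocation per row, so its cost is linear in the total number of elements.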

@tdegeus
Collaborator

tdegeus commented Sep 1, 2020

I don't think the API is the restriction here; it is an intrinsic property of HDF5, which stores ND-arrays (with a fixed number of columns per row, and likewise for higher dimensions).

Personally, I would store the data as a sparse matrix (e.g. compressed sparse row, see wiki). In particular, you could store the data (v) as the dataset and add the row pointers (row_ptr) and column indices (col_ind) as attributes.
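
A minimal sketch of that layout for ragged rows, assuming the rows are dense so that col_ind is simply the position within the row and can be omitted. The names `RaggedCsr`, `to_csr` and `get_row` are hypothetical; only `v` and `row_ptr` come from the suggestion above, and no HDF5 I/O is shown.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Row i occupies v[row_ptr[i]] up to (not including) v[row_ptr[i + 1]].
struct RaggedCsr {
    std::vector<double> v;             // concatenated row data
    std::vector<std::size_t> row_ptr;  // n_rows + 1 offsets, row_ptr[0] == 0
};

// Build the concatenated data and the offsets in one pass.
RaggedCsr to_csr(const std::vector<std::vector<double>>& rows) {
    RaggedCsr out;
    out.row_ptr.reserve(rows.size() + 1);
    out.row_ptr.push_back(0);
    for (const auto& row : rows) {
        out.v.insert(out.v.end(), row.begin(), row.end());
        out.row_ptr.push_back(out.v.size());
    }
    return out;
}

// Copy out a single row without touching the others.
std::vector<double> get_row(const RaggedCsr& csr, std::size_t i) {
    assert(i + 1 < csr.row_ptr.size());
    const auto begin = static_cast<std::ptrdiff_t>(csr.row_ptr[i]);
    const auto end = static_cast<std::ptrdiff_t>(csr.row_ptr[i + 1]);
    return std::vector<double>(csr.v.begin() + begin, csr.v.begin() + end);
}
```

Compared with storing per-row sizes, storing offsets makes random access to row i possible without summing the sizes of all preceding rows.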

I would not convert to strings: it will cost you space, you might lose accuracy, and it will surely be much slower.

@sabasehrish
Author

I will look into sparse matrices. I am also fine with the first approach I am using, provided that assembling the vectors after reading back is not too costly.

@maxnoe
Contributor

maxnoe commented Nov 28, 2020

HDF5 supports variable length arrays. It should be no problem to store a vector of vectors of variable length.

@maxnoe
Contributor

maxnoe commented Nov 29, 2020

Here is how you do it with h5py (which uses Cython to wrap the C API); something similar is possible with PyTables (which also uses Cython to work with the C API).

import h5py
import numpy as np


dt = h5py.vlen_dtype(np.dtype('float64'))

N_ROWS = 100

with h5py.File('vlen.hdf5', 'w') as f:
    dset = f.create_dataset('test', (N_ROWS, ), dtype=dt)

    for i in range(N_ROWS):
        N = np.random.poisson(25)
        dset[i] = np.random.normal(size=N)

One limitation of both PyTables and h5py is that they do not support variable-length arrays in compound datatypes, because they rely heavily on NumPy for handling those.

I am interested in lifting those limitations and am currently exploring the best way to do that. Since HighFive offers an API that is considerably easier to wrap (using pybind11) than the official C++ API, it would be great if HighFive supported variable-length dtypes everywhere.

This is how you write variable length using the official C++ API:

#include <iostream>
#include <string>
#include <H5Cpp.h>
#include <vector>
#include <random>


const hsize_t n_dims = 1;
const hsize_t n_rows = 100;
const std::string dataset_name = "test";


int main () {
    H5::H5File file("vlen_cpp.hdf5", H5F_ACC_TRUNC);

    H5::DataSpace dataspace(n_dims, &n_rows);

    // target dtype for the file
    H5::FloatType item_type(H5::PredType::IEEE_F64LE);
    H5::VarLenType file_type(item_type);

    // dtype of the generated data
    H5::FloatType mem_item_type(H5::PredType::NATIVE_DOUBLE);
    H5::VarLenType mem_type(mem_item_type);

    H5::DataSet dataset = file.createDataSet(dataset_name, file_type, dataspace);

    std::vector<std::vector<double>> data;
    data.reserve(n_rows);

    // this structure stores length of each varlen row and a pointer to
    // the actual data
    std::vector<hvl_t> varlen_spec(n_rows);

    std::mt19937 gen;
    std::normal_distribution<double> normal(0.0, 1.0);
    std::poisson_distribution<hsize_t> poisson(20);


    for (hsize_t idx = 0; idx < n_rows; idx++) {
        data.emplace_back();

        hsize_t size = poisson(gen);
        data.at(idx).reserve(size);

        for (hsize_t i = 0; i < size; i++) {
            data.at(idx).push_back(normal(gen));
        }

        // set the descriptor only after the row is filled, so the pointer
        // refers to existing elements (data() is valid even for an empty row)
        varlen_spec.at(idx).len = size;
        varlen_spec.at(idx).p = data.at(idx).data();
    }

    dataset.write(&varlen_spec.front(), mem_type);

    return 0;
}

and here is how you'd write using a compound data type that has a variable length part:

#include <iostream>
#include <string>
#include <H5Cpp.h>
#include <vector>
#include <random>


const hsize_t n_dims = 1;
const hsize_t n_rows = 100;
const std::string dataset_name = "test";

struct CompoundData {
    double x;
    double y;
    hvl_t  values;

    CompoundData(double x, double y) : x(x), y(y) {};
};


int main () {
    H5::H5File file("vlen_compound.hdf5", H5F_ACC_TRUNC);

    H5::DataSpace dataspace(n_dims, &n_rows);

    // target dtype for the file
    H5::CompType data_type(sizeof(CompoundData));
    data_type.insertMember("x", HOFFSET(CompoundData, x), H5::PredType::NATIVE_DOUBLE);
    data_type.insertMember("y", HOFFSET(CompoundData, y), H5::PredType::NATIVE_DOUBLE);
    data_type.insertMember("values", HOFFSET(CompoundData, values), H5::VarLenType(H5::PredType::NATIVE_DOUBLE));

    H5::DataSet dataset = file.createDataSet(dataset_name, data_type, dataspace);

    // one vector holding the actual data
    std::vector<std::vector<double>> values;
    values.reserve(n_rows);

    // and one holding the hdf5 description and the "simple" columns
    std::vector<CompoundData> data;
    data.reserve(n_rows);

    std::mt19937 gen;
    std::normal_distribution<double> normal(0.0, 1.0);
    std::poisson_distribution<hsize_t> poisson(20);

    for (hsize_t idx = 0; idx < n_rows; idx++) {
        hsize_t size = poisson(gen);
        values.emplace_back();
        values.at(idx).reserve(size);


        for (hsize_t i = 0; i < size; i++) {
            values.at(idx).push_back(normal(gen));
        }

        // set len and pointer for the variable length descriptor;
        // data() is used because front() is undefined for an empty row
        data.emplace_back(normal(gen), normal(gen));
        data.at(idx).values.len = size;
        data.at(idx).values.p = values.at(idx).data();
    }

    dataset.write(&data.front(), data_type);

    return 0;
}

@WanXinTao

@maxnoe
What is your HDF5 version?
I want to read H5T_VLEN data from an HDF5 file, but I can't find H5Cpp.h when I try your demo.
Alternatively, if you have a demo that uses HighFive to read H5T_VLEN data, that would be even better.

@maxnoe
Contributor

maxnoe commented Apr 12, 2021

@WanXinTao This issue is about adding support for vlen to HighFive, so I cannot give you a solution using HighFive.

H5Cpp.h is the header of the official C++ interface, which might need an additional package to be installed.

@WanXinTao

I understand, thanks for your answer

@maxnoe
Contributor

maxnoe commented Apr 12, 2021

@WanXinTao If you are on Ubuntu and have libhdf5-dev installed, the header is here: /usr/include/hdf5/serial/H5Cpp.h.

Which is also advertised when using pkg-config:

$ pkg-config hdf5 --cflags --libs
-I/usr/include/hdf5/serial -L/usr/lib/x86_64-linux-gnu/hdf5/serial -lhdf5

This is what I had to do to get my example to compile on Ubuntu:

$ g++ write_compound_varlen.cxx -o write_compound_varlen  `pkg-config hdf5 --cflags --libs` -lhdf5_cpp

@maxnoe
Contributor

maxnoe commented Jun 14, 2021

@tdegeus Could you maybe change the tag from "question" to "enhancement" and adjust the title? Maybe to "Support variable length data types"?

Or should I open a new issue to track this feature request?

@tdegeus changed the title from "Writing HDF5 dataset using HighFive API representing a vector of vector of variable length" to "Support variable length data types" Jun 14, 2021
@tdegeus
Collaborator

tdegeus commented Jun 14, 2021

Done @maxnoe !

@tdegeus closed this as completed Jun 14, 2021
@tdegeus reopened this Jun 14, 2021
@maxnoe
Contributor

maxnoe commented Jun 15, 2021

A first step would be to support simple variable-length arrays. This is how you write one using h5py:

import h5py
import numpy as np

data = np.array([
    np.array(row, dtype=np.int32)
    for row in [[1, 2, 3], [1, 2], [1, 2, 3, 4], [1, 2, 3, 4, 5]]
], dtype=object)


with h5py.File('varlen_array.h5', 'w') as f:
    f.create_dataset(
        'test',
        dtype=h5py.vlen_dtype(np.int32),
        data=data,
    )

and this is how I'd like to read it using HighFive:

#include <highfive/H5File.hpp>
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>


int main() {
    HighFive::File file("varlen_array.h5", HighFive::File::ReadOnly);

    std::vector<std::vector<int32_t>> data;
    HighFive::DataSet dataset = file.getDataSet("test");
    dataset.read(data);

    for (const auto& row: data) {
        for (auto val: row) {
            std::cout << val << " ";
        }
    }
    std::cout << '\n';
    return 0;
}

This compiles but then fails at runtime with:

HighFive WARNING: data and hdf5 dataset have different types: Integer32 -> Varlen128
terminate called after throwing an instance of 'HighFive::DataSpaceException'
  what():  Impossible to read DataSet of dimensions 1 into arrays of dimensions 2
[3]    80485 abort (core dumped)  ./Debug/read_varlen_array
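
The second line of that error suggests the target container is treated as two-dimensional because of its nesting depth, while the dataspace of a vlen dataset has rank 1. A minimal sketch of that idea follows. It is not HighFive's actual implementation, and `vector_depth` and `rank_matches` are hypothetical names.

```cpp
#include <cstddef>
#include <cstdint>
#include <type_traits>
#include <vector>

// Nesting depth of std::vector: 0 for a scalar, 1 for vector<T>, and so on.
template <typename T>
struct vector_depth : std::integral_constant<std::size_t, 0> {};

template <typename T, typename A>
struct vector_depth<std::vector<T, A>>
    : std::integral_constant<std::size_t, 1 + vector_depth<T>::value> {};

// Hypothetical check mirroring the error message: a container is accepted
// only if its nesting depth equals the rank of the dataset's dataspace.
template <typename Container>
constexpr bool rank_matches(std::size_t dataset_rank) {
    return vector_depth<Container>::value == dataset_rank;
}
```

Under this check, `rank_matches<std::vector<std::vector<std::int32_t>>>(1)` is false, which corresponds to the failure above: the extra level of nesting would have to come from the vlen element type rather than from the dataspace.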

@maxnoe
Contributor

maxnoe commented Jun 15, 2021

I am a bit lost as to where I could start looking into this. If you could give me a couple of pointers to the parts of the code that would need to be adapted, I'd be happy to give it a try.

@ferdonline
Contributor

Hi @maxnoe. The change you propose is significant, both in potential and in amount of work :) Varlen arrays are a special kind of type which, like strings, basically stores a pointer to the actual data arrays. Therefore we can't simply read the data; we have to take care of the indirection and subtract 1 from the dimensionality (hence the error you get!).
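
To illustrate the indirection, here is a minimal sketch that needs no HDF5. `VlenRecord` is a stand-in defined only for this example, with the same two fields (`len`, `p`) as HDF5's `hvl_t`, and `follow_indirection` is a hypothetical helper, not part of HighFive.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-in with the same two fields as HDF5's hvl_t (assumption: defined
// here only so the sketch compiles without HDF5).
struct VlenRecord {
    std::size_t len;  // number of elements in this row
    void* p;          // pointer to the row's elements
};

// A rank-1 array of records becomes a depth-2 std::vector: the second
// "dimension" comes from following each pointer, not from the dataspace.
template <typename T>
std::vector<std::vector<T>> follow_indirection(
        const std::vector<VlenRecord>& records) {
    std::vector<std::vector<T>> rows;
    rows.reserve(records.size());
    for (const auto& rec : records) {
        // A non-empty row must point at real storage.
        assert(rec.len == 0 || rec.p != nullptr);
        const T* first = static_cast<const T*>(rec.p);
        rows.emplace_back(first, first + rec.len);
    }
    return rows;
}
```

With real HDF5 data, the buffers behind each pointer are allocated by the library during the read and would additionally have to be released afterwards.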

At BlueBrain we haven't really had use cases for VarLen arrays, so we can't dedicate much time to this. However, if you are feeling brave enough to implement the change, I will be happy to review it.
As mentioned, varlen (traditional) strings are a good source of inspiration. See how they are handled in include/highfive/bits/H5Converter_misc.hpp :L349
Cheers

@maxnoe
Contributor

maxnoe commented Jun 15, 2021

> The change you propose is significant, both in potential and amount of work :)

I suspected as much. Still, I expect it will be less work for me to contribute here than to duplicate all of this work to get a C++ interface that is easy to wrap with pybind11 and supports variable-length arrays and compound data containing variable-length arrays.

So I'll try with the simple variable length arrays first and see how far I can go.

@motiv-ncb

motiv-ncb commented Oct 30, 2021

Hello @maxnoe,

Thank you very much for your example of how to write variable-length data using the C++ API. It works great, and I've used it to create some datasets I need for work. Would it be possible for you to provide a similar example of how to read that data back into a std::vector using the C++ API? In Python it's very simple, but I'm having a lot of trouble finding an example of how to do it in C++.

Specifically, in Python, if I want to access one particular array in the dataset, say the sixth one, I would do

import h5py
vlenData = h5py.File("vlen_cpp.hdf5", "r")
sixthArray = vlenData["test"][5]

But I don't know how this works in C++. Any advice would be very much appreciated.

Thank you,
Nate

@motiv-ncb

motiv-ncb commented Nov 1, 2021

Hello again @maxnoe,

To follow up a bit, here is what I tried. After writing your "vlen_cpp.hdf5", which works just fine, I want to read the HDF5 file and load the data back into some kind of container (it doesn't really matter which for now). I tried reading the first row into various containers (a plain array, arma::vec, Eigen::VectorXd), none of which works. The program below happily executes, but what is read into the containers is just garbage.

If you have any ideas, that would be wonderful.

Thank you!
Nate

#include <H5Cpp.h>
#include <string>
#include <vector>
#include <stdio.h>
#include <Eigen/Dense>
#include <Eigen/Core>
#include <armadillo>


int main(int argc, char **argv) {

	std::string filename = argv[1];

	// memtype of the file
    auto itemType = H5::PredType::NATIVE_DOUBLE;
    auto memType = H5::VarLenType(&itemType);

    // get dataspace
	H5::H5File file(filename, H5F_ACC_RDONLY);
	H5::DataSet dataset = file.openDataSet("test");
	H5::DataSpace dataspace = dataset.getSpace();

    // get the size of the dataset
    hsize_t rank;
    hsize_t dims[1];
    rank = dataspace.getSimpleExtentDims(dims); // rank = 1
    std::cout << "Data size: "<< dims[0] << std::endl; // this is the correct number of values
    std::cout << "Data rank: "<< rank << std::endl; // this is the correct rank

    // create memspace
    hsize_t memDims[1] = {1};
    H5::DataSpace memspace(rank, memDims);

    // Select hyperslabs
    hsize_t dataCount[1] = {1};
    hsize_t dataOffset[1] = {0};  // this would be i if reading in a loop
    hsize_t memCount[1] = {1};
    hsize_t memOffset[1] = {0};

    dataspace.selectHyperslab(H5S_SELECT_SET, dataCount, dataOffset);
    memspace.selectHyperslab(H5S_SELECT_SET, memCount, memOffset);

    // Create storage to hold read data
    int i;
    int NX = 20;
    double data_out[NX];
    for (i = 0; i < NX; i++)
        data_out[i] = 0;
    arma::vec temp(20);
    Eigen::VectorXd temp2(20);

    // Read data into data_out (array)
    dataset.read(data_out, memType, memspace, dataspace);

    std::cout << "data_out: " << "\n";
    for (i = 0; i < NX; i++)
        std::cout << data_out[i] << " ";
    std::cout << std::endl;

    // Read data into temp (arma vec)
    dataset.read(temp.memptr(), memType, memspace, dataspace);

    std::cout << "arma vec: " << "\n";
    std::cout << temp << std::endl;

    // Read data into temp (eigen vec)
    dataset.read(temp2.data(), memType, memspace, dataspace);

    std::cout << "eigen vec: " << "\n";
    std::cout << temp2 << std::endl;

	return 0;
}

Oddly, in Python this gives me no issues whatsoever:

import h5py
import numpy as np

data = h5py.File("vlen_cpp.hdf5", "r")
i = 0  # This is the row I would want to read
arr = data["test"][i]  # <-- This is the simplest way.    

# Now trying to mimic something closer to C++
did = data["test"].id
dataspace = did.get_space()
dataspace.select_hyperslab(start=(i, ), count=(1, ))
memspace = h5py.h5s.create_simple(dims_tpl=(1, ))
memspace.select_hyperslab(start=(0, ), count=(1, ))
arr = np.zeros((1, ), dtype=object)
did.read(memspace, dataspace, arr)
print(arr)  # This gives back the correct data

@motiv-ncb

motiv-ncb commented Nov 2, 2021

Hello again @maxnoe,

I finally figured out a way to do it:

#include <H5Cpp.h>
#include <iostream>
#include <string>
#include <vector>


int main(int argc, char **argv) {

	std::string filename = argv[1];

	// memtype of the file
    auto itemType = H5::PredType::NATIVE_DOUBLE;
    auto memType = H5::VarLenType(&itemType);

    // get dataspace
	H5::H5File file(filename, H5F_ACC_RDONLY);
	H5::DataSet dataset = file.openDataSet("test");
	H5::DataSpace dataspace = dataset.getSpace();

    // get the size of the dataset
    hsize_t rank;
    hsize_t dims[1];
    rank = dataspace.getSimpleExtentDims(dims); // rank = 1
    std::cout << "Data size: "<< dims[0] << std::endl; // this is the correct number of values
    std::cout << "Data rank: "<< rank << std::endl; // this is the correct rank

    // create memspace
    hsize_t memDims[1] = {1};
    H5::DataSpace memspace(rank, memDims);

    // Initialize hyperslabs
    hsize_t dataCount[1];
    hsize_t dataOffset[1];
    hsize_t memCount[1];
    hsize_t memOffset[1];

    // Create storage to hold read data
    std::vector<std::vector<double>> dataOut;
    
    for (hsize_t i = 0; i < dims[0]; i++) {

        // Select hyperslabs
        dataCount[0] = 1;
        dataOffset[0] = i;
        memCount[0] = 1;
        memOffset[0] = 0;

        dataspace.selectHyperslab(H5S_SELECT_SET, dataCount, dataOffset);
        memspace.selectHyperslab(H5S_SELECT_SET, memCount, memOffset);
        
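        // a single descriptor on the stack; HDF5 allocates the row buffer
        hvl_t rdata[1];
        dataset.read(rdata, memType, memspace, dataspace);

        double* ptr = static_cast<double*>(rdata[0].p);
        std::vector<double> thisRow;
        thisRow.reserve(rdata[0].len);

        for (size_t j = 0; j < rdata[0].len; j++) {
            thisRow.push_back(ptr[j]);
        }

        // release the buffer HDF5 allocated for this row
        H5free_memory(rdata[0].p);

        dataOut.push_back(thisRow);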
    }

    for (int i = 0; i < dataOut.size(); i++) {
        std::cout << "Row " << i << ":\n";
        for (int j = 0; j < dataOut[i].size(); j++) {
            std::cout << dataOut[i][j] << " ";
        }
        std::cout << "\n";
    }

	return 0;
}

If you know of a more efficient way that can pull out an entire row in one go (as you can see I'm looping over the elements), that would be so helpful. But I'm happy enough with this for now.

Thanks,
Nate

@maxnoe
Contributor

maxnoe commented Nov 4, 2021

Hi @motiv-ncb, if you want to just read everything into a std::vector<std::vector<double>>, I came up with this:

#include <iostream>
#include <stdexcept>
#include <string>
#include <H5Cpp.h>
#include <vector>




int main () {
    H5::H5File file("vlen_cpp.hdf5", H5F_ACC_RDONLY);
    H5::DataSet dataset {file.openDataSet("test")};


    H5::DataSpace dataspace = dataset.getSpace();
    const int n_dims = dataspace.getSimpleExtentNdims();
    std::vector<hsize_t> dims(n_dims);
    dataspace.getSimpleExtentDims(dims.data());

    std::cout << "n_dims: " << dims.size() << '\n';

    std::cout << "shape: (";
    for (hsize_t dim: dims) {
        std::cout << dim << ", ";
    }
    std::cout << ")\n";

    if (dims.size() != 1) {
        throw std::runtime_error("Unexpected dimensions");
    }

    const hsize_t n_rows = dims[0];
    std::vector<hvl_t> varlen_specs(n_rows);
    std::vector<std::vector<double>> data;
    data.reserve(n_rows);

    H5::FloatType mem_item_type(H5::PredType::NATIVE_DOUBLE);
    H5::VarLenType mem_type(mem_item_type);
    dataset.read(varlen_specs.data(), mem_type);

    for (const auto& varlen_spec: varlen_specs) {
        auto data_ptr = static_cast<double*>(varlen_spec.p);
        data.emplace_back(data_ptr, data_ptr + varlen_spec.len);
        H5free_memory(varlen_spec.p);
    }

    return 0;
}

@motiv-ncb

Thank you very much @maxnoe!

@alkino
Member

alkino commented Jun 28, 2022

Hey, due to the new type system, I'm looking at this issue again.

What should the HighFive public API for writing such a vlen vector look like?

Regards

@BlueBrain deleted a comment from WanXinTao Jun 28, 2022
@alkino added the v3 (Anything that needs to be resolved before `v3`.) label Feb 15, 2024