Writing fixed-length string with Ascii encoding #678

Closed
jdumas opened this issue Jan 24, 2023 · 14 comments
Labels
String Anything related to handling strings, sequences of chars.

jdumas commented Jan 24, 2023

Is your feature request related to a problem? Please describe.

Hi. I'm trying to write a VTK HDF file using HighFive. This requires writing the "Type" attribute as a fixed-length, null-padded string with ASCII encoding:

ATTRIBUTE "Type" {
   DATATYPE  H5T_STRING {
      STRSIZE 16;
      STRPAD H5T_STR_NULLPAD;
      CSET H5T_CSET_ASCII;
      CTYPE H5T_C_S1;
   }
   DATASPACE  SCALAR
   DATA {
   (0): "UnstructuredGrid"
   }
}

Describe the solution you'd like
I'd appreciate some help writing an attribute with the same properties as above (fixed-length string, ASCII encoding, null padding).

Describe alternatives you've considered
I've tried to write the attribute in a few different ways:

group.createAttribute("Type", std::string("UnstructuredGrid"));
group.createAttribute("Type2", "UnstructuredGrid");
group.createAttribute("Type3", std::to_array("UnstructuredGrid")); // using the std::to_array implementation from https://en.cppreference.com/w/cpp/container/array/to_array
  1. The first version produces the following:

    ATTRIBUTE "Type" {
       DATATYPE  H5T_STRING {
          STRSIZE H5T_VARIABLE;
          STRPAD H5T_STR_NULLTERM;
          CSET H5T_CSET_UTF8;
          CTYPE H5T_C_S1;
       }
       DATASPACE  SCALAR
       DATA {
       (0): "UnstructuredGrid"
       }
    }
    
  2. The second version crashes with the following error:

    HDF5-DIAG: Error detected in HDF5 (1.14.0) thread 0:
    #000: _deps/hdf5-src/src/H5Tcset.c line 57 in H5Tget_cset(): operation not defined for data type class
      major: Datatype
      minor: Feature is unsupported
    
  3. The third version produces the following output:

    ATTRIBUTE "Type3" {
       DATATYPE  H5T_STD_I8LE
       DATASPACE  SIMPLE { ( 17 ) / ( 17 ) }
       DATA {
       (0): 85, 110, 115, 116, 114, 117, 99, 116, 117, 114, 101, 100, 71,
       (13): 114, 105, 100, 0
       }
    }
    

Additional context

  • HighFive version: v2.6.2
  • HDF5 version: hdf5-1_14_0
Member

alkino commented Jan 24, 2023

Hello,
This is a good question.
You've hit a bit of a corner case in our API (fixed-length strings).

Here is a working example doing what you want:

#include <highfive/H5File.hpp>

using namespace HighFive;

/// First define a new fixed-size string type of length 16, space-padded and ASCII-encoded
struct MyString: public DataType {
    MyString() {
        _hid = H5Tcopy(H5T_C_S1);
        if (H5Tset_size(_hid, 16) < 0) {
            HDF5ErrMapper::ToException<DataTypeException>("Unable to define datatype size to 16");
        }
        // define encoding to ASCII
        H5Tset_cset(_hid, H5T_CSET_ASCII);
        H5Tset_strpad(_hid, H5T_STR_SPACEPAD);
    }
} myStringType;

int main() {
    File file("test_jdumas.h5", File::ReadWrite | File::Create | File::Truncate);
    auto group = file.createGroup("/a");
    // Create the attribute explicitly with the custom datatype,
    // rather than letting HighFive deduce a type from the value
    auto attr = group.createAttribute("Type", DataSpace{1}, myStringType);
    // write it
    attr.write("UnstructuredGrid");

    return 0;
}

Giving:

HDF5 "test_jdumas.h5" {
GROUP "/" {
   GROUP "a" {
      ATTRIBUTE "Type" {
         DATATYPE  H5T_STRING {
            STRSIZE 16;
            STRPAD H5T_STR_SPACEPAD;
            CSET H5T_CSET_ASCII;
            CTYPE H5T_C_S1;
         }
         DATASPACE  SIMPLE { ( 1 ) / ( 1 ) }
         DATA {
         (0): "UnstructuredGrid"
         }
      }
   }
}
}
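
A side note on padding: the workaround above requests H5T_STR_SPACEPAD, while the dump in the original question shows H5T_STR_NULLPAD. For this particular value the stored bytes should come out the same either way, since "UnstructuredGrid" is exactly 16 characters long and leaves no room for padding in a 16-byte field; only the datatype metadata differs. Below is a minimal sketch of that reasoning in plain C++, with no HDF5 calls involved. `pad_fixed` and `padding_is_irrelevant_for_type_attribute` are hypothetical helpers written for this illustration, not HighFive or HDF5 functions.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper (not part of HighFive or HDF5): lay out `value` in a
// fixed-size field of `size` bytes, filling unused trailing bytes with `fill`.
// Pass '\0' to mimic null padding and ' ' to mimic space padding.
inline std::string pad_fixed(const std::string& value, std::size_t size, char fill) {
    if (value.size() > size) {
        throw std::length_error("value does not fit in the fixed-length field");
    }
    std::string field(size, fill);          // start with a field made only of padding
    field.replace(0, value.size(), value);  // overwrite the front with the payload
    assert(field.size() == size);           // the field length never changes
    return field;
}

// A 16-character value in a 16-byte field has no padding bytes at all,
// so both padding modes produce identical field contents.
inline bool padding_is_irrelevant_for_type_attribute() {
    const std::string type = "UnstructuredGrid";
    return pad_fixed(type, 16, '\0') == pad_fixed(type, 16, ' ');
}
```

For shorter values the two modes do differ, e.g. `pad_fixed("bar", 5, ' ')` and `pad_fixed("bar", 5, '\0')` are not equal. If the STRPAD entry has to match the original dump exactly, passing H5T_STR_NULLPAD to H5Tset_strpad instead of H5T_STR_SPACEPAD should be the only change needed in the workaround.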

Member

alkino commented Jan 24, 2023

About your 3 examples:

group.createAttribute("Type", std::string("UnstructuredGrid"));
group.createAttribute("Type2", "UnstructuredGrid");
group.createAttribute("Type3", std::to_array("UnstructuredGrid")); // using the std::to_array implementation from https://en.cppreference.com/w/cpp/container/array/to_array

The first one doesn't do what you want because, on the HighFive side, std::string maps to a variable-length string.
The second one hits a bug in HighFive (one that is hard to fix).
The third one is a std::array<char, N>, which HighFive does not treat as a string.

In HighFive, FixedLenStringArray, char[N] and char* are fixed-length strings.
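
To make the third case concrete: an array built from a string literal holds one element per character plus the terminating null, which is consistent with the 17 int8 values (ending in 0) in the "Type3" dump. The sketch below reproduces that in plain C++ without HighFive. `literal_to_array`, `as_integers` and `demo_type3_layout` are hypothetical names for this illustration; the first is a stand-in for what std::to_array does with a char array, written so that it also compiles before C++20. The numeric values assume an ASCII-compatible execution character set.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Copy a string literal, terminator included, into a std::array<char, N>.
template <std::size_t N>
std::array<char, N> literal_to_array(const char (&literal)[N]) {
    std::array<char, N> out{};
    for (std::size_t i = 0; i < N; ++i) {
        out[i] = literal[i];
    }
    return out;
}

// View the elements the way a library without string semantics would:
// as one small integer per element.
template <std::size_t N>
std::vector<int> as_integers(const std::array<char, N>& chars) {
    return std::vector<int>(chars.begin(), chars.end());
}

inline void demo_type3_layout() {
    const auto chars = literal_to_array("UnstructuredGrid");
    assert(chars.size() == 17);  // 16 characters plus the terminating null

    const std::vector<int> values = as_integers(chars);
    assert(values.front() == 85);  // 'U' in ASCII
    assert(values.back() == 0);    // the terminator is stored as data
}
```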

Author

jdumas commented Jan 24, 2023

Thanks! It looks like I can open my exported VTK HDF with Paraview now. It crashes when I try to visualize it, but that may be an issue with my exported data -- I'll investigate more.

In HighFive, FixedLenStringArray, char[N] and char* are fixed-length strings.

Yeah, I figured as much. But the first is an array of strings, not a single string, and the last two are a bit bugged, as you mentioned.

Author

jdumas commented Jan 24, 2023

Ok after debugging my mesh export I got the VTK HDF thing working, using your workaround to write a fixed-length string to the HDF file. Feel free to close this issue or leave it open!

Collaborator

tdegeus commented Jan 27, 2023

A bit of a side step: Have you found http://xdmf.org/index.php/XDMF_Model_and_Format ? I made some tools https://github.com/tdegeus/XDMFWrite_h5py and https://github.com/tdegeus/XDMFWrite_HighFive (with the latter potentially needing some TLC).

Author

jdumas commented Jan 27, 2023

Having to lug around a .xml file separate from the main .hdf is a chore. That's why the Paraview folks introduced VTK HDF with Paraview 5.11. Imho that's a much better solution, and I was able to write it fairly easily with HighFive once I used the workaround for fixed strings.

1uc added the String label (Anything related to handling strings, sequences of chars.) on Mar 17, 2023
Contributor

maxnoe commented Apr 12, 2023

/// First define a new fixed-size string type of length 16, space-padded and ASCII-encoded
struct MyString: public DataType {
    MyString() {
        _hid = H5Tcopy(H5T_C_S1);
        if (H5Tset_size(_hid, 16) < 0) {
            HDF5ErrMapper::ToException<DataTypeException>("Unable to define datatype size to 16");
        }
        // define encoding to ASCII
        H5Tset_cset(_hid, H5T_CSET_ASCII);
        H5Tset_strpad(_hid, H5T_STR_SPACEPAD);
    }
} myStringType;

This looks like it could be added to HighFive as a template over the size?

template<size_t N> 
struct FixedLengthString {...}; 

?

Contributor

maxnoe commented Apr 13, 2023

I gave that a go, but the data isn't actually written. This program:

#include <string>

#include <highfive/H5File.hpp>

using HighFive::File;
using HighFive::DataSpace;
using HighFive::DataType;
using HighFive::DataTypeException;
using HighFive::HDF5ErrMapper;
using DataspaceType = DataSpace::DataspaceType;

template<
    size_t N,
    H5T_cset_t cset = H5T_CSET_UTF8,
    H5T_str_t strpad = H5T_STR_NULLTERM
>
struct FixedLengthString: public DataType {
    static const H5T_cset_t character_set = cset;
    static const H5T_str_t padding = strpad;
    static const size_t size = N;

    FixedLengthString() {
        _hid = H5Tcopy(H5T_C_S1);

        if (H5Tset_size(_hid, size) < 0) {
            HDF5ErrMapper::ToException<DataTypeException>("Unable to define datatype size");
        }
        H5Tset_cset(_hid, character_set);
        H5Tset_strpad(_hid, padding);
    }
};

int main (int argc, const char* argv[]) {
    const std::string bar{"bar"};

    File file("test.h5", File::ReadWrite | File::Create | File::Truncate);
    auto group = file.createGroup("/foo");

    DataSpace scalar{DataspaceType::dataspace_scalar};
    FixedLengthString<20> dtype;
    auto attr = group.createAttribute("bar", scalar, dtype);
    attr.write(bar);

    return 0;
}

Runs, but results in an empty string (no data in the attribute):

❯ h5dump test.h5
HDF5 "test.h5" {
GROUP "/" {
   GROUP "foo" {
      ATTRIBUTE "bar" {
         DATATYPE  H5T_STRING {
            STRSIZE 20;
            STRPAD H5T_STR_NULLTERM;
            CSET H5T_CSET_UTF8;
            CTYPE H5T_C_S1;
         }
         DATASPACE  SIMPLE { ( 0 ) / ( 0 ) }
         DATA {
         }
      }
   }
}
}

Collaborator

1uc commented Apr 13, 2023

Very interesting! Try write_raw, i.e. something close to attr.write_raw(bar.c_str(), dtype). See:
https://github.com/BlueBrain/HighFive/blob/master/include/highfive/bits/H5Attribute_misc.hpp#L140

I think I should have time to look at something HighFive-related tomorrow. Strings are definitely on the TODO list.

Contributor

maxnoe commented Apr 13, 2023

@1uc nope, same result

Collaborator

1uc commented Apr 13, 2023

@maxnoe Thank you, that narrowed down the places the issue could be considerably. If you look at the output you see:

         DATASPACE  SIMPLE { ( 0 ) / ( 0 ) }

but you're asking for a scalar dataspace. To fix this, construct the DataSpace with parentheses rather than braces:

    DataSpace scalar(DataspaceType::dataspace_scalar);

Even with that fix, I'm not able to use write. However, write_raw does work in the following variation:

const std::string bar{"blablabla"};

File file("test.h5", File::Truncate);
auto group = file.createGroup("/foo");

// Parentheses, not braces, so that the DataspaceType constructor is selected:
DataSpace scalar(DataspaceType::dataspace_scalar);
FixedLengthString<20> dtype;
auto attr = group.createAttribute("bar", scalar, dtype);

// Note, it's important to repeat the datatype here:
attr.write_raw(bar.c_str(), dtype);

which results in:

HDF5 "test.h5" {
GROUP "/" {
   GROUP "foo" {
      ATTRIBUTE "bar" {
         DATATYPE  H5T_STRING {
            STRSIZE 20;
            STRPAD H5T_STR_NULLTERM;
            CSET H5T_CSET_UTF8;
            CTYPE H5T_C_S1;
         }
         DATASPACE  SCALAR
         DATA {
         (0): "blablabla"
         }
      }
   }
}
}

(Note the DATASPACE SCALAR.)

Contributor

maxnoe commented Apr 13, 2023

Thanks, so the problem is that with {} the initializer-list ctor is chosen? And that compiles, but silently yields the wrong dataspace?

What is the reason for not allowing a std::string or char* to be written into a fixed-length string?

Collaborator

1uc commented Apr 14, 2023

I think it's picking up the initializer_list overload by implicitly converting the enum to size_t:

DataSpace(const std::initializer_list<size_t>& items);
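
The overload selection described above can be reproduced outside HighFive. The sketch below uses hypothetical stand-ins (`MockKind`, `MockSpace`, `demo_brace_vs_paren`) rather than HighFive's actual classes, and assumes the enum is unscoped, as the implicit conversion to size_t suggests. With braces, list-initialization prefers the initializer_list constructor, so the enum value 0 ends up as a single dimension of extent 0, which would be consistent with the `SIMPLE { ( 0 ) / ( 0 ) }` dataspace in the dump. With parentheses, the enum constructor is chosen.

```cpp
#include <cassert>
#include <cstddef>
#include <initializer_list>
#include <string>
#include <vector>

// Hypothetical stand-ins; these are not HighFive's actual types.
enum MockKind { mock_scalar, mock_null };

struct MockSpace {
    std::string chosen;             // records which constructor ran
    std::vector<std::size_t> dims;  // dimensions, if any were given

    explicit MockSpace(MockKind) : chosen("kind") {}

    MockSpace(const std::initializer_list<std::size_t>& items)
        : chosen("initializer_list"), dims(items) {}
};

inline void demo_brace_vs_paren() {
    // Braces: the initializer_list overload wins, and mock_scalar (value 0)
    // is converted to a size_t and treated as a dimension.
    MockSpace braces{mock_scalar};
    assert(braces.chosen == "initializer_list");
    assert(braces.dims.size() == 1 && braces.dims[0] == 0);

    // Parentheses: an enum cannot become an initializer_list, so the
    // enum overload is the only viable candidate.
    MockSpace parens(mock_scalar);
    assert(parens.chosen == "kind");
    assert(parens.dims.empty());
}
```

One way a library could guard against this kind of surprise is a scoped enum (`enum class`), which does not convert implicitly to size_t, so the braced form could no longer match the initializer_list overload.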

The reason why strings don't work is rather mundane: nobody has made sure they work (hence inevitably they broke or have always been broken) and at BBP we simply don't seem to have any need for them beyond what's implemented. Hence, we never urgently needed to make sure they work. That said, it would be nice if strings worked a lot better than they do now.

Collaborator

1uc commented Oct 31, 2023

With #744 merged, std::string and containers thereof (e.g. std::vector<std::string>) can be read from and written to fixed- or variable-length HDF5 strings. There's an example here:
https://github.com/BlueBrain/HighFive/blob/master/src/examples/read_write_std_strings.cpp
