H5DataSet read and write H5std_string functions are likely to get called accidentally #2501

calvertdw · 2023-02-24T21:59:48Z

The H5DataSet read and write functions have options for doing so with either a void* or a H5std_string.

We ran into an issue with a Java binding for HDF5, documented here.

The issue was that in trying to read and write bytes to an H5DataSet, we were accidentally calling the H5std_string versions without knowing it and the read function in that case will terminate the read on any 0 bytes.

It took a while to figure out why we couldn't read and write bytes when other types (ints, floats, etc) worked fine. Our team spent several days scratching our heads.

I was wondering if the same issue might apply to all users of HDF5, who might be wishing to read char* with 0s in it and getting unexpected results.

Once possible solution would be to rename the H5std_string versions to readString and writeString. What do you guys think?

Thanks!

The text was updated successfully, but these errors were encountered:

mkitti · 2023-02-25T16:19:27Z

Just to clarify, you are using the JavaCpp org.bytedeco bindings which are automated from the C++ bindings rather than the hdf.hdf5lib bindings, right?

https://docs.hdfgroup.org/hdf5/develop/group___j_h5_d.html

bmribler · 2023-02-27T17:24:34Z

I'll check about the H5DataSet read and write

calvertdw · 2023-02-27T17:32:04Z

Yes that is correct @mkitti, I am referring to and JavaCPP binds to the C++ API.

bmribler · 2023-04-25T20:49:46Z

Just to clarify, you are using the JavaCpp org.bytedeco bindings which are automated from the C++ bindings rather than the hdf.hdf5lib bindings, right?

https://docs.hdfgroup.org/hdf5/develop/group___j_h5_d.html

Thank you!

mkitti · 2023-04-25T22:33:42Z

Yes that is correct @mkitti, I am referring to and JavaCPP binds to the C++ API.

Isn't this a bug for JavaCPP then?

calvertdw · 2023-05-26T15:44:48Z

@mkitti Well I originally reported this issue there and @saudet recommended I pursue the name change of the read/write string functions upstream. The 0 termination behavior is very different and I think would warrant the name change. Even better if the functions were documented as such.

bytedeco/javacpp-presets#1311

mkitti · 2023-05-26T18:03:32Z

I agree the 0 termination issue is strange if you were using a Java API such as the one that HDF Group provides.

However, you are using an automatically generated Java API from a C/C++ API where null terminated strings are normal. Your requested fix is to change the C/C++ API.

Wouldn't it make sense to use the canonical Java API instead to resolve this?

mkitti · 2023-05-26T18:09:30Z

@calvertdw , I think you what you want is hdf.hdf5lib.H5.H5Dread_string:
https://docs.hdfgroup.org/hdf5/develop/group___j_h5_d.html#gaca7b980e81fc5d46ae316c0b3a1fa31e
https://mkitti.github.io/janelia/javacpp-hdf5/1.14.0-1.5.9-SNAPSHOT/javadoc/hdf/hdf5lib/H5.html#H5Dread_string(long,long,long,long,long,java.lang.String%5B%5D)

mkitti · 2023-05-26T18:15:52Z

Specifcally the JNI function NewStringUTF

hdf5/java/src/jni/h5dImp.c

Line 1033 in 248045e

if (NULL == (jstr = ENVPTR->NewStringUTF(ENVONLY, cstr))) {

would use the modified UTF-8 encoding documented here:
https://docs.oracle.com/en/java/javase/19/docs/api/java.base/java/io/DataInput.html#modified-utf-8
avoiding the internal null bytes.

calvertdw · 2023-05-26T18:30:15Z

Wouldn't it make sense to use the canonical Java API instead to resolve this?

The JNI that HDF5 provides isn't sufficient for our large team and codebase. The Bytedeco JavaCPP library provides many crucial benefits like portability and seamless use of many AI related C/C++ libraries. It is available on Maven Central with prebuilt binaries for virtually every platform included in the artifacts.

I would consider HDF5 to be a very "enterprise" level library, such that the biggest users are likely to be organizations which would value the features that Bytedeco provides.

With that said, is there some reason for not changing the names of the string functions to readString and writeString? It seems if you wanted to read/write char* with HDF5 which includes 0s, users are going to struggle to find why their data is terminated early on the read. HDF5 is for storing data, and an array of chars including 0 is just data. Currently, users would have to cast to void* to use to correct functions and HDF5 leaves every C++ user to fend for themselves on that one.

mkitti · 2023-05-26T22:33:47Z

The JNI that HDF5 provides isn't sufficient for our large team and codebase. The Bytedeco JavaCPP library provides many crucial benefits like portability and seamless use of many AI related C/C++ libraries. It is available on Maven Central with prebuilt binaries for virtually every platform included in the artifacts.

I would consider HDF5 to be a very "enterprise" level library, such that the biggest users are likely to be organizations which would value the features that Bytedeco provides.

As of bytedeco/javacpp-presets#1327 you can use hdf.hdf5lib.* via the JavaCPP distribution as well. This will probably be released along with JavaCPP 1.5.9.

For now you can use the instructions at http://bytedeco.org/builds/ to pull the snapshots via maven or ask saudet when the release will occur:
https://oss.sonatype.org/content/repositories/snapshots/org/bytedeco/hdf5/1.14.0-1.5.9-SNAPSHOT/

With that said, is there some reason for not changing the names of the string functions to readString and writeString? It seems if you wanted to read/write char* with HDF5 which includes 0s, users are going to struggle to find why their data is terminated early on the read. HDF5 is for storing data, and an array of chars including 0 is just data. Currently, users would have to cast to void* to use to correct functions and HDF5 leaves every C++ user to fend for themselves on that one.

It probably would have be an additional method rather than a name change in the near future as to not break existing applications.
The main focus of HDF Group is the C API (e.g. H5Dread which uses void *)
There are other actively developed C++ bindings such as @steven-varga's https://github.com/steven-varga/h5cpp or https://bluebrain.github.io/HighFive/
It's already been marked as low priority

There probably would be better traction in clearly stating at that reading a fixed length string into std::string should not occur via operator= (const char* s) but rather the two argument string (const char* s, size_t n) constructor with the second argument indicating the length explicitly.

hdf5/c++/src/H5DataSet.cpp

Line 732 in 79bb60c

strg = strg_C;

calvertdw · 2023-06-01T21:08:22Z

Thanks for the detailed reply.

The statements "It probably would have be [...] as to not break existing applications" and "The main focus of HDF Group is the C API" seem to conflict. If you aren't focusing on the C++ API anyway, why spend extra effort maintaining backwards compatibility?

an additional method rather than a name change

Well this wouldn't work, because the old string method having the same signature is the actual issue here.

Additionally, the fixed length string reader doesn't help in the situation when you don't know beforehand how long it is. I guess you could find that out using other methods, but that's laborious for no reason.

mkitti · 2023-06-02T01:35:23Z

To be clear, I am not HDF Group.

Well this wouldn't work, because the old string method having the same signature is the actual issue here.

I disagree. The problem is that the C++ API always makes the assumption that it is reading a null terminated C string even when the true length is known. Rather the C++ API should create a C++ std::string of the known length. In your case, JavaCPP can then translate the std::string of the correct length to a java.lang.String, properly. This is a bug in the C++ API and we should fix the C++ API. Then H5DataSet::read will just work correctly for everyone.

There are three ways to encode a string in HDF5 as detailed in following URL.
https://docs.hdfgroup.org/hdf5/v1_14/_h5_t__u_g.html#subsubsec_datatype_other_strings

An array of characters of known length
A fixed length string of known length
A variable length null terminated string

Do you have a variable length string with one or more embedded nulls? If so, how do you know the actual length of the string?

mkitti · 2023-06-03T02:53:40Z

I filed #3034 in relation to this post. If we resolved that, then we should be able to read embedded null bytes into C++'s std::string. Then I presume JavaCPP should be able to read that into Java.

steven-varga · 2023-06-05T15:27:03Z

An interesting conversation: std::string is a template and indeed allows storing any data, HDF5 C API is a storage system with some constraints and it's fixed/variable length string datatype AFAIK can be represented within the std::basic_string C++ datatype, a superset of HDF5 H5_STRING datatype.
The latter consists of a null terminated sequence of H5T_NATIVE_CHAR and the C definition of char is: A C-style string is an array of characters that is terminated by a null character ('\0').

Lib HDF5 defines strings as a sequence of ASCII or utf8 characters, neither of the standards allow null in the middle. Is there anything wrong with this interpretation?

<string> ::= H5T_STRING {
                 STRSIZE <strsize>;
                 STRPAD <strpad>;
                 CSET <cset>;
                 CTYPE <ctype>;
             }
<strsize> ::= <int_value>
<strpad> ::= H5T_STR_NULLTERM | H5T_STR_NULLPAD | H5T_STR_SPACEPAD
<cset> ::= H5T_CSET_ASCII | H5T_CSET_UTF8
<ctype> ::= H5T_C_S1 | H5T_FORTRAN_S1

mkitti · 2023-06-05T21:33:16Z

I believe it may be possible to have a fixed length H5T_FORTRAN_S1 that contains null bytes. I believe that should be able to be represented in C++ as std::string, a specialization of the std::basic_string template.

steven-varga · 2023-06-06T03:12:34Z

Some implementations allow non-conforming output, if I understood you right Fortran can insert a null in the middle of a sequence of bytes (which are not UTF8 or ASCII strings); but other conforming libraries will ignore the data after the /0 so this feature will result in hard to detect non-conforming behaviour -- which of course could be a use case for hiding some information.

Yes, you are correct to say that std::basic_string<T> is not limited to uft8 or ascii, but the inverse doesn't work ie: an arbitrary byte sequence is not a valid H5_STRING type -- given that my interpretation is correct (which is AFAIK matching with h5py ).

OTOH: Opaque is a good candidate to save binary content, and personally this is what I do. Binary to string encoding such as base64 and variants are alternatives, but will look unusual.

Probably this could make a good topic for @gheber Call the Doctor ?

steven-varga · 2023-06-06T20:57:51Z

@mkitti convinced me to reconsider, and from then on I can't un-see this: <ctype> ::= H5T_C_S1 | H5T_FORTRAN_S1 The second term H5T_FORTRAN_S1 indeed allows 0-255 values as characters, how they should behave printed on the terminal is not the concern here, but how the sequence is read in from storage: must be the entire length as describe by the Fortran spec.

Since C++ std::basic_string<T> supports this, when the H5T_FORTRAN_S1 set, the entire length should be read in, and stored within the string starting from: std::basic_string<T>::data()

calvertdw added the bug label Feb 24, 2023

bmribler self-assigned this Feb 27, 2023

derobins removed the bug label Mar 3, 2023

mattjala added the Type - Improvement Improvements that don't add a new feature or functionality label Mar 7, 2023

bmribler assigned byrnHDF and unassigned bmribler Apr 25, 2023

derobins changed the title ~~[BUG] H5DataSet read and write H5std_string functions are likely to get called accidentally~~ H5DataSet read and write H5std_string functions are likely to get called accidentally May 4, 2023

derobins added Priority - 3. Low 🔽 Code cleanup, small feature change requests, etc. Component - C++ C++ wrappers labels May 4, 2023

calvertdw mentioned this issue Jun 1, 2023

[hdf5] Reading with BytePointer stops on 0 values (zeros) bytedeco/javacpp-presets#1311

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

H5DataSet read and write H5std_string functions are likely to get called accidentally #2501

H5DataSet read and write H5std_string functions are likely to get called accidentally #2501

calvertdw commented Feb 24, 2023 •

edited

mkitti commented Feb 25, 2023 •

edited

bmribler commented Feb 27, 2023

calvertdw commented Feb 27, 2023

bmribler commented Apr 25, 2023

mkitti commented Apr 25, 2023

calvertdw commented May 26, 2023 •

edited

mkitti commented May 26, 2023

mkitti commented May 26, 2023

mkitti commented May 26, 2023 •

edited

calvertdw commented May 26, 2023 •

edited

mkitti commented May 26, 2023

calvertdw commented Jun 1, 2023

mkitti commented Jun 2, 2023

mkitti commented Jun 3, 2023

steven-varga commented Jun 5, 2023

mkitti commented Jun 5, 2023 •

edited

steven-varga commented Jun 6, 2023

steven-varga commented Jun 6, 2023

H5DataSet read and write H5std_string functions are likely to get called accidentally #2501

H5DataSet read and write H5std_string functions are likely to get called accidentally #2501

Comments

calvertdw commented Feb 24, 2023 • edited

mkitti commented Feb 25, 2023 • edited

bmribler commented Feb 27, 2023

calvertdw commented Feb 27, 2023

bmribler commented Apr 25, 2023

mkitti commented Apr 25, 2023

calvertdw commented May 26, 2023 • edited

mkitti commented May 26, 2023

mkitti commented May 26, 2023

mkitti commented May 26, 2023 • edited

calvertdw commented May 26, 2023 • edited

mkitti commented May 26, 2023

calvertdw commented Jun 1, 2023

mkitti commented Jun 2, 2023

mkitti commented Jun 3, 2023

steven-varga commented Jun 5, 2023

mkitti commented Jun 5, 2023 • edited

steven-varga commented Jun 6, 2023

steven-varga commented Jun 6, 2023

calvertdw commented Feb 24, 2023 •

edited

mkitti commented Feb 25, 2023 •

edited

calvertdw commented May 26, 2023 •

edited

mkitti commented May 26, 2023 •

edited

calvertdw commented May 26, 2023 •

edited

mkitti commented Jun 5, 2023 •

edited