Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

H5DataSet read and write H5std_string functions are likely to get called accidentally #2501

Open
calvertdw opened this issue Feb 24, 2023 · 18 comments
Assignees
Labels
Component - C++ C++ wrappers Priority - 3. Low 🔽 Code cleanup, small feature change requests, etc. Type - Improvement Improvements that don't add a new feature or functionality

Comments

@calvertdw
Copy link

calvertdw commented Feb 24, 2023

The H5DataSet read and write functions have options for doing so with either a void* or a H5std_string.

We ran into an issue with a Java binding for HDF5, documented here.

The issue was that in trying to read and write bytes to an H5DataSet, we were accidentally calling the H5std_string versions without knowing it and the read function in that case will terminate the read on any 0 bytes.

It took a while to figure out why we couldn't read and write bytes when other types (ints, floats, etc) worked fine. Our team spent several days scratching our heads.

I was wondering if the same issue might apply to all users of HDF5, who might be wishing to read char* with 0s in it and getting unexpected results.

Once possible solution would be to rename the H5std_string versions to readString and writeString. What do you guys think?

Thanks!

@calvertdw calvertdw added the bug label Feb 24, 2023
@mkitti
Copy link
Contributor

mkitti commented Feb 25, 2023

Just to clarify, you are using the JavaCpp org.bytedeco bindings which are automated from the C++ bindings rather than the hdf.hdf5lib bindings, right?

https://docs.hdfgroup.org/hdf5/develop/group___j_h5_d.html

@bmribler bmribler self-assigned this Feb 27, 2023
@bmribler
Copy link
Contributor

I'll check about the H5DataSet read and write

@calvertdw
Copy link
Author

Yes that is correct @mkitti, I am referring to and JavaCPP binds to the C++ API.

@derobins derobins removed the bug label Mar 3, 2023
@mattjala mattjala added the Type - Improvement Improvements that don't add a new feature or functionality label Mar 7, 2023
@bmribler bmribler assigned byrnHDF and unassigned bmribler Apr 25, 2023
@bmribler
Copy link
Contributor

Just to clarify, you are using the JavaCpp org.bytedeco bindings which are automated from the C++ bindings rather than the hdf.hdf5lib bindings, right?

https://docs.hdfgroup.org/hdf5/develop/group___j_h5_d.html

Thank you!

@mkitti
Copy link
Contributor

mkitti commented Apr 25, 2023

Yes that is correct @mkitti, I am referring to and JavaCPP binds to the C++ API.

Isn't this a bug for JavaCPP then?

@derobins derobins changed the title [BUG] H5DataSet read and write H5std_string functions are likely to get called accidentally H5DataSet read and write H5std_string functions are likely to get called accidentally May 4, 2023
@derobins derobins added Priority - 3. Low 🔽 Code cleanup, small feature change requests, etc. Component - C++ C++ wrappers labels May 4, 2023
@calvertdw
Copy link
Author

calvertdw commented May 26, 2023

@mkitti Well I originally reported this issue there and @saudet recommended I pursue the name change of the read/write string functions upstream. The 0 termination behavior is very different and I think would warrant the name change. Even better if the functions were documented as such.

bytedeco/javacpp-presets#1311

@mkitti
Copy link
Contributor

mkitti commented May 26, 2023

I agree the 0 termination issue is strange if you were using a Java API such as the one that HDF Group provides.

However, you are using an automatically generated Java API from a C/C++ API where null terminated strings are normal. Your requested fix is to change the C/C++ API.

Wouldn't it make sense to use the canonical Java API instead to resolve this?

@mkitti
Copy link
Contributor

mkitti commented May 26, 2023

Specifcally the JNI function NewStringUTF

if (NULL == (jstr = ENVPTR->NewStringUTF(ENVONLY, cstr))) {

would use the modified UTF-8 encoding documented here:
https://docs.oracle.com/en/java/javase/19/docs/api/java.base/java/io/DataInput.html#modified-utf-8
avoiding the internal null bytes.

@calvertdw
Copy link
Author

calvertdw commented May 26, 2023

Wouldn't it make sense to use the canonical Java API instead to resolve this?

The JNI that HDF5 provides isn't sufficient for our large team and codebase. The Bytedeco JavaCPP library provides many crucial benefits like portability and seamless use of many AI related C/C++ libraries. It is available on Maven Central with prebuilt binaries for virtually every platform included in the artifacts.

I would consider HDF5 to be a very "enterprise" level library, such that the biggest users are likely to be organizations which would value the features that Bytedeco provides.

With that said, is there some reason for not changing the names of the string functions to readString and writeString? It seems if you wanted to read/write char* with HDF5 which includes 0s, users are going to struggle to find why their data is terminated early on the read. HDF5 is for storing data, and an array of chars including 0 is just data. Currently, users would have to cast to void* to use to correct functions and HDF5 leaves every C++ user to fend for themselves on that one.

@mkitti
Copy link
Contributor

mkitti commented May 26, 2023

The JNI that HDF5 provides isn't sufficient for our large team and codebase. The Bytedeco JavaCPP library provides many crucial benefits like portability and seamless use of many AI related C/C++ libraries. It is available on Maven Central with prebuilt binaries for virtually every platform included in the artifacts.

I would consider HDF5 to be a very "enterprise" level library, such that the biggest users are likely to be organizations which would value the features that Bytedeco provides.

As of bytedeco/javacpp-presets#1327 you can use hdf.hdf5lib.* via the JavaCPP distribution as well. This will probably be released along with JavaCPP 1.5.9.

For now you can use the instructions at http://bytedeco.org/builds/ to pull the snapshots via maven or ask saudet when the release will occur:
https://oss.sonatype.org/content/repositories/snapshots/org/bytedeco/hdf5/1.14.0-1.5.9-SNAPSHOT/

With that said, is there some reason for not changing the names of the string functions to readString and writeString? It seems if you wanted to read/write char* with HDF5 which includes 0s, users are going to struggle to find why their data is terminated early on the read. HDF5 is for storing data, and an array of chars including 0 is just data. Currently, users would have to cast to void* to use to correct functions and HDF5 leaves every C++ user to fend for themselves on that one.

  1. It probably would have be an additional method rather than a name change in the near future as to not break existing applications.
  2. The main focus of HDF Group is the C API (e.g. H5Dread which uses void *)
  3. There are other actively developed C++ bindings such as @steven-varga's https://github.com/steven-varga/h5cpp or https://bluebrain.github.io/HighFive/
  4. It's already been marked as low priority

There probably would be better traction in clearly stating at that reading a fixed length string into std::string should not occur via operator= (const char* s) but rather the two argument string (const char* s, size_t n) constructor with the second argument indicating the length explicitly.

strg = strg_C;

@calvertdw
Copy link
Author

Thanks for the detailed reply.

The statements "It probably would have be [...] as to not break existing applications" and "The main focus of HDF Group is the C API" seem to conflict. If you aren't focusing on the C++ API anyway, why spend extra effort maintaining backwards compatibility?

an additional method rather than a name change

Well this wouldn't work, because the old string method having the same signature is the actual issue here.

Additionally, the fixed length string reader doesn't help in the situation when you don't know beforehand how long it is. I guess you could find that out using other methods, but that's laborious for no reason.

@mkitti
Copy link
Contributor

mkitti commented Jun 2, 2023

To be clear, I am not HDF Group.

Well this wouldn't work, because the old string method having the same signature is the actual issue here.

I disagree. The problem is that the C++ API always makes the assumption that it is reading a null terminated C string even when the true length is known. Rather the C++ API should create a C++ std::string of the known length. In your case, JavaCPP can then translate the std::string of the correct length to a java.lang.String, properly. This is a bug in the C++ API and we should fix the C++ API. Then H5DataSet::read will just work correctly for everyone.

There are three ways to encode a string in HDF5 as detailed in following URL.
https://docs.hdfgroup.org/hdf5/v1_14/_h5_t__u_g.html#subsubsec_datatype_other_strings

  1. An array of characters of known length
  2. A fixed length string of known length
  3. A variable length null terminated string

Do you have a variable length string with one or more embedded nulls? If so, how do you know the actual length of the string?

@mkitti
Copy link
Contributor

mkitti commented Jun 3, 2023

I filed #3034 in relation to this post. If we resolved that, then we should be able to read embedded null bytes into C++'s std::string. Then I presume JavaCPP should be able to read that into Java.

@steven-varga
Copy link

An interesting conversation: std::string is a template and indeed allows storing any data, HDF5 C API is a storage system with some constraints and it's fixed/variable length string datatype AFAIK can be represented within the std::basic_string C++ datatype, a superset of HDF5 H5_STRING datatype.
The latter consists of a null terminated sequence of H5T_NATIVE_CHAR and the C definition of char is: A C-style string is an array of characters that is terminated by a null character ('\0').

Lib HDF5 defines strings as a sequence of ASCII or utf8 characters, neither of the standards allow null in the middle. Is there anything wrong with this interpretation?

<string> ::= H5T_STRING {
                 STRSIZE <strsize>;
                 STRPAD <strpad>;
                 CSET <cset>;
                 CTYPE <ctype>;
             }
<strsize> ::= <int_value>
<strpad> ::= H5T_STR_NULLTERM | H5T_STR_NULLPAD | H5T_STR_SPACEPAD
<cset> ::= H5T_CSET_ASCII | H5T_CSET_UTF8
<ctype> ::= H5T_C_S1 | H5T_FORTRAN_S1

@mkitti
Copy link
Contributor

mkitti commented Jun 5, 2023

I believe it may be possible to have a fixed length H5T_FORTRAN_S1 that contains null bytes. I believe that should be able to be represented in C++ as std::string, a specialization of the std::basic_string template.

@steven-varga
Copy link

Some implementations allow non-conforming output, if I understood you right Fortran can insert a null in the middle of a sequence of bytes (which are not UTF8 or ASCII strings); but other conforming libraries will ignore the data after the /0 so this feature will result in hard to detect non-conforming behaviour -- which of course could be a use case for hiding some information.

Yes, you are correct to say that std::basic_string<T> is not limited to uft8 or ascii, but the inverse doesn't work ie: an arbitrary byte sequence is not a valid H5_STRING type -- given that my interpretation is correct (which is AFAIK matching with h5py ).

OTOH: Opaque is a good candidate to save binary content, and personally this is what I do. Binary to string encoding such as base64 and variants are alternatives, but will look unusual.

Probably this could make a good topic for @gheber Call the Doctor ?

@steven-varga
Copy link

@mkitti convinced me to reconsider, and from then on I can't un-see this: <ctype> ::= H5T_C_S1 | H5T_FORTRAN_S1 The second term H5T_FORTRAN_S1 indeed allows 0-255 values as characters, how they should behave printed on the terminal is not the concern here, but how the sequence is read in from storage: must be the entire length as describe by the Fortran spec.

Since C++ std::basic_string<T> supports this, when the H5T_FORTRAN_S1 set, the entire length should be read in, and stored within the string starting from: std::basic_string<T>::data()

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Component - C++ C++ wrappers Priority - 3. Low 🔽 Code cleanup, small feature change requests, etc. Type - Improvement Improvements that don't add a new feature or functionality
Projects
None yet
Development

No branches or pull requests

7 participants