Support fixed-length strings with UTF-8 character set #270

Closed
mattjala opened this issue Oct 10, 2023 · 7 comments

Comments

@mattjala
Contributor

HSDS currently does not support these (see hdf5dtype.py:617)

@jreadey
Member

jreadey commented Oct 10, 2023

Are they supported in the library?

@ajelenak
Contributor

Yes, string encoding and how many bytes are reserved for its storage are decoupled.

@mattjala
Contributor Author

> Are they supported in the library?

Yep, see here for an example of fixed-length unicode strings being used in datasets/attributes - the native VOL passes both of these tests.

@ajelenak
Contributor

The question may have more to do with how h5py treats HDF5 strings, where this combination is not really supported. Any fixed-length string is treated as a bytes object, not a Unicode string.

@jreadey
Member

jreadey commented Oct 10, 2023

A fixed width unicode would be utf-32, but like @ajelenak says, it's not explicitly supported by the library. (or HSDS).

@mattjala changed the title from "Support fixed-width unicode strings" to "Support fixed-length strings with UTF-8 character set" on Oct 10, 2023
@mattjala
Contributor Author

mattjala commented Oct 10, 2023

> A fixed width unicode would be utf-32, but like @ajelenak says, it's not explicitly supported by the library. (or HSDS).

I think there's some confusion in terminology here. The request is not for a Unicode encoding in which every character has a fixed width in bytes (e.g. UTF-32). It is for string datatypes that have a fixed total length in bytes (fixed-length strings) and use the UTF-8 character set, in which a character does not occupy a fixed number of bytes.

I've updated the title of this issue to make it clearer. The library does support fixed-length strings in UTF-8 (see the tests I linked above).
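
To make the distinction concrete, here is a minimal pure-Python sketch. It is not HSDS or HDF5 library code: the helper names `pack_fixed_utf8` and `unpack_fixed_utf8` are hypothetical, and null padding is an assumption chosen for illustration. It only shows that a field can have a fixed size in bytes while the UTF-8 characters inside it vary in width.

```python
def pack_fixed_utf8(text: str, nbytes: int) -> bytes:
    """Encode text as UTF-8 and null-pad it to exactly nbytes bytes.

    Hypothetical helper for illustration; null padding is an assumption.
    """
    raw = text.encode("utf-8")
    if len(raw) > nbytes:
        # The limit is on encoded bytes, not on the number of characters.
        raise ValueError(f"{text!r} needs {len(raw)} bytes, field holds {nbytes}")
    return raw.ljust(nbytes, b"\x00")


def unpack_fixed_utf8(field: bytes) -> str:
    """Strip trailing null padding and decode the remaining bytes as UTF-8."""
    return field.rstrip(b"\x00").decode("utf-8")


# Bytes per character: UTF-8 varies, UTF-32 is always 4.
widths = {ch: (len(ch.encode("utf-8")), len(ch.encode("utf-32-le"))) for ch in "aé€"}

# "héllo" is 5 characters but 6 UTF-8 bytes, so it fits in an 8-byte field.
field = pack_fixed_utf8("héllo", 8)
roundtrip = unpack_fixed_utf8(field)
```

Note that the number of characters a field can hold depends on its contents: an 8-byte field holds eight ASCII characters but only two full `€` characters (3 bytes each).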

jreadey added a commit that referenced this issue Nov 3, 2023
* add support for fixed width UTF8 strings - #270

* add support for binary request of utf8 fixed width strings

* updates for fixed utf8 attribute values
@mattjala
Contributor Author

mattjala commented Nov 6, 2023

Implemented in #278

@mattjala closed this as completed on Nov 6, 2023