character encoding #514
Comments
This is a tricky point. Filesystems and object stores do not necessarily use any particular character encoding, and some support multiple encodings. Related is the fact that upper- and lowercase mappings are only uniquely defined for 7-bit ASCII, even in Unicode, and not all systems are case sensitive or case preserving.
Note: unless I'm missing something, we (the OCFL community) can completely control what encoding is used for the contents of inventory.json (although, as @pwinckles noted above, if an OCFL repo is not considered a closed ecosystem, UTF-8 is required for it to be valid JSON for exchange), and what encoding of the object ID is used for hashing to generate the path tuples in the storage layout extensions. As far as process goes, though, we can't change the spec now, right? 1.0 is out. Is our only option for now to specify in extensions what encoding is required?
I'm struggling to see a downside to saying that the IMO, that was the intention of the comments in the old ticket; we just unfortunately wrote UTF instead of UTF-8.
2020-11-17 Editors agree that
@bcail raised an issue with an extension spec regarding which character encoding is used for the OCFL object ID.
As noted in the linked discussion, I had assumed that the character encoding was UTF-8 because it is stored as JSON. I looked through the OCFL spec, and discovered that it never addresses character encoding, and then dug up this old ticket where it was decided not to add any language to the spec.
The issue I have is that, while I assume UTF-8 is intended, the encoding is still ambiguous. The relevant portion of the JSON spec reads:
My understanding of that text is that UTF-8 is only required if a) JSON is "exchanged" and b) the "exchange" is "not part of a closed ecosystem". In the case of OCFL, I think you could make a reasonable argument for either of those points being true or false based on how OCFL is being used, and therefore OCFL inventories do not necessarily need to be encoded as UTF-8.
If you look at the previous JSON spec, it states:
Based on that, it sounds like UTF-8, UTF-16, and UTF-32 are all valid choices, and only encodings like UTF-16BE and UTF-16LE are forbidden.
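To illustrate why the bytes on disk are ambiguous, here is a minimal sketch in Python. The record is a hypothetical fragment, not a valid OCFL inventory, and the list of encodings is my own choice for illustration. It serializes the same JSON text under several Unicode encodings and shows that Python's `json.loads`, which accepts bytes and detects UTF-8, UTF-16, or UTF-32, parses all of them to the same value even though the byte sequences differ.

```python
import json

# Hypothetical, minimal fragment for illustration; not a complete OCFL inventory.
record = {"id": "testing"}
text = json.dumps(record)

# Encodings chosen for illustration; explicit byte orders, so no BOM is written.
ENCODINGS = ("utf-8", "utf-16-le", "utf-16-be", "utf-32-le", "utf-32-be")

# Serialize the same JSON text to bytes under each encoding.
encoded = {enc: text.encode(enc) for enc in ENCODINGS}

# json.loads accepts bytes and detects UTF-8/16/32 from the leading bytes.
parsed = {enc: json.loads(raw) for enc, raw in encoded.items()}

for enc in ENCODINGS:
    print(f"{enc:10} {len(encoded[enc]):3d} bytes -> {parsed[enc]}")
```

Other JSON parsers may be stricter than this one, so this only shows that a lenient reader cannot tell you which encoding the writer intended.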
Back to the original issue, the character encoding for OCFL object IDs. Let's say that you have the object ID `testing`. The linked extension needs to hash this id to produce a path to the object's root directory. Hashing the id with sha256 produces the following results based on the encoding:

So, is OCFL assuming UTF-8 or is it intended to support any valid JSON encoding?
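For anyone who wants to reproduce the hashing difference locally, here is a minimal sketch in Python. The helper name `digest_by_encoding` and the set of encodings are assumptions of mine for illustration, not something defined by OCFL or the extension. It hashes the same object ID with sha256 after encoding it to bytes in several ways.

```python
import hashlib

# Encodings chosen for illustration; explicit byte orders avoid a BOM,
# so the output does not depend on the platform.
ENCODINGS = ("utf-8", "utf-16-le", "utf-16-be", "utf-32-le", "utf-32-be")


def digest_by_encoding(object_id, encodings=ENCODINGS):
    """Return {encoding: sha256 hex digest} for the encoded object ID."""
    return {
        enc: hashlib.sha256(object_id.encode(enc)).hexdigest()
        for enc in encodings
    }


digests = digest_by_encoding("testing")
for enc, digest in digests.items():
    print(f"{enc:10} {digest}")
```

Each encoding yields a different byte sequence and therefore a different digest, so two implementations that disagree on the encoding would map the same ID to different object root paths.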