
character encoding #514

Closed
pwinckles opened this issue Oct 22, 2020 · 4 comments · Fixed by #523

@pwinckles

@bcail raised an issue with an extension spec regarding which character encoding is used for the OCFL object ID.

As noted in the linked discussion, I had assumed that the character encoding was UTF-8 because it is stored as JSON. I looked through the OCFL spec, and discovered that it never addresses character encoding, and then dug up this old ticket where it was decided not to add any language to the spec.

The issue I have is that, while I assume UTF-8 is intended, the encoding is still ambiguous. The relevant portion of the JSON spec reads:

JSON text exchanged between systems that are not part of a closed ecosystem MUST be encoded using UTF-8 [RFC3629].

Previous specifications of JSON have not required the use of UTF-8 when transmitting JSON text. However, the vast majority of JSON-based software implementations have chosen to use the UTF-8 encoding, to the extent that it is the only encoding that achieves interoperability.

Implementations MUST NOT add a byte order mark (U+FEFF) to the beginning of a networked-transmitted JSON text. In the interests of interoperability, implementations that parse JSON texts MAY ignore the presence of a byte order mark rather than treating it as an error.

My understanding of that text is that UTF-8 is only required if a) the JSON is "exchanged" and b) the "exchange" is "not part of a closed ecosystem". In the case of OCFL, I think you could make a reasonable argument either way on both points, depending on how OCFL is being used, and therefore OCFL inventories do not necessarily need to be encoded as UTF-8.

If you look at the previous JSON spec, it states:

JSON text SHALL be encoded in UTF-8, UTF-16, or UTF-32. The default encoding is UTF-8, and JSON texts that are encoded in UTF-8 are interoperable in the sense that they will be read successfully by the maximum number of implementations; there are many implementations that cannot successfully read texts in other encodings (such as UTF-16 and UTF-32).

Implementations MUST NOT add a byte order mark to the beginning of a JSON text. In the interests of interoperability, implementations that parse JSON texts MAY ignore the presence of a byte order mark rather than treating it as an error.
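Both versions of the spec say parsers MAY ignore a leading byte order mark rather than treating it as an error. As an illustration only (this shows the behavior of CPython's standard `json` module; other parsers vary), a BOM in byte input is tolerated, while a BOM in already-decoded text is rejected:

```python
import json

# CPython's json module accepts byte input with a UTF-8 BOM
# (it decodes as utf-8-sig, stripping the mark)...
doc_with_bom = b'\xef\xbb\xbf{"id": "testing"}'
print(json.loads(doc_with_bom))  # {'id': 'testing'}

# ...but rejects a U+FEFF at the start of a str, since a BOM has no
# business in text that has already been decoded.
try:
    json.loads('\ufeff{"id": "testing"}')
except json.JSONDecodeError as e:
    print("rejected:", e.msg)
```

This is one parser's reading of the "MAY ignore" clause; an implementation that raised an error on the byte-level BOM would also be spec-conformant.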

Based on that, it sounds like UTF-8, UTF-16, and UTF-32 are all valid choices, and only non-Unicode encodings, such as Latin-1, are forbidden.

Back to the original issue: the character encoding for OCFL object IDs. Let's say that you have the object ID `testing`. The linked extension needs to hash this ID to produce a path to the object's root directory. Hashing the ID with SHA-256 produces different results depending on the encoding:

  • UTF-8: cf80cd8aed482d5d1527d7dc72fceff84e6326592848447d2dc0b0e87dfc9a90
  • UTF-16: 04364fb7e1f0fbc49ac9228952c61cc7af3e831a171daf5f1d8a6db16fe6204c
  • UTF-32: c87dea511cca5cb8d51e5613046e432d3d68c6df1a316d6a7814f7572d45eb10
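To make the ambiguity concrete, here is a minimal sketch of the digest computation. `object_root_digest` is a hypothetical helper, not part of the extension; note also that Python's `"utf-16"` and `"utf-32"` codecs prepend a BOM, so their output differs yet again from BOM-less UTF-16BE/UTF-32BE input:

```python
import hashlib

def object_root_digest(object_id: str, encoding: str = "utf-8") -> str:
    """SHA-256 hex digest of an object ID under a given encoding.

    Hypothetical helper to show that the digest (and therefore the
    storage path derived from it) depends entirely on the encoding
    chosen before hashing.
    """
    return hashlib.sha256(object_id.encode(encoding)).hexdigest()

# The UTF-8 digest matches the first value listed above.
print(object_root_digest("testing"))
# cf80cd8aed482d5d1527d7dc72fceff84e6326592848447d2dc0b0e87dfc9a90

# Any other encoding yields a different digest, hence a different
# object root path for the same object ID.
assert object_root_digest("testing", "utf-16-be") != object_root_digest("testing")
```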

So, is OCFL assuming UTF-8 or is it intended to support any valid JSON encoding?

@neilsjefferies
Member

This is a tricky point. Filesystems and object stores do not necessarily use any particular character encoding, and may support multiple encodings. Relatedly, upper- and lowercase mappings are uniquely defined only for 7-bit ASCII, even in Unicode, and not all systems are case sensitive or case preserving.

@bcail
Contributor

bcail commented Nov 9, 2020

Note: unless I'm missing something, we (the OCFL community) can completely control both the encoding used for the contents of inventory.json and the encoding of the object ID used for hashing to generate the path tuples in the storage layout extensions (although, as @pwinckles noted above, if an OCFL repo is not considered a closed ecosystem, UTF-8 is required for the inventory to be valid JSON for exchange).

As far as process goes, though, we can't change the spec now, right? 1.0 is out. Is our only option for now to specify in the extensions what encoding is required?

@zimeon
Contributor

zimeon commented Nov 9, 2020

I'm struggling to see a downside to saying that inventory.json MUST be UTF-8, which seems quite separate from filesystem filename issues. This could be understood for now and clarified in a 1.1, I think.

IMO, that was the intention of the comments in the old ticket; unfortunately, we wrote UTF instead of UTF-8.

@zimeon
Contributor

zimeon commented Nov 17, 2020

2020-11-17: Editors agree that inventory.json MUST be UTF-8. Therefore, within inventory.json the object id is always expressed in UTF-8. If an extension depends upon an operation that is affected by the encoding of the object id, then the extension should clarify the expected encoding (likely UTF-8); this is outside of the core OCFL specification.
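The editors' decision is easy to enforce mechanically. A minimal validator sketch (hypothetical code, not part of any OCFL implementation; it assumes "strict UTF-8" also means no byte order mark):

```python
import codecs
import json

def load_inventory(raw: bytes) -> dict:
    """Parse inventory.json bytes as strict, BOM-less UTF-8.

    Hypothetical validator sketch: rejects a leading BOM outright, and
    lets decode() raise UnicodeDecodeError for any non-UTF-8 bytes.
    """
    if raw.startswith(codecs.BOM_UTF8):
        raise ValueError("inventory.json must not begin with a byte order mark")
    text = raw.decode("utf-8")  # UnicodeDecodeError if not valid UTF-8
    return json.loads(text)

inv = load_inventory(b'{"id": "testing"}')
print(inv["id"])  # testing
```

Whether a conforming validator should reject or silently strip a BOM is exactly the kind of detail a 1.1 clarification could pin down.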

@rosy1280 rosy1280 added this to the 1.1 milestone May 18, 2021