@bcail raised an issue with an extension spec regarding which character encoding is used for the OCFL object ID.
As noted in the linked discussion, I had assumed that the character encoding was UTF-8 because it is stored as JSON. I looked through the OCFL spec, and discovered that it never addresses character encoding, and then dug up this old ticket where it was decided not to add any language to the spec.
The issue I have is that, while I assume UTF-8 is intended, the encoding is still ambiguous. The relevant portion of the JSON spec reads:
> JSON text exchanged between systems that are not part of a closed ecosystem MUST be encoded using UTF-8 [RFC3629].
>
> Previous specifications of JSON have not required the use of UTF-8 when transmitting JSON text. However, the vast majority of JSON-based software implementations have chosen to use the UTF-8 encoding, to the extent that it is the only encoding that achieves interoperability.
>
> Implementations MUST NOT add a byte order mark (U+FEFF) to the beginning of a networked-transmitted JSON text. In the interests of interoperability, implementations that parse JSON texts MAY ignore the presence of a byte order mark rather than treating it as an error.
My understanding of that text is that UTF-8 is only required if a) JSON is "exchanged" and b) the "exchange" is "not part of a closed ecosystem". In the case of OCFL, I think you could make a reasonable argument for either of those points being true or false based on how OCFL is being used, and therefore OCFL inventories do not necessarily need to be encoded as UTF-8.
If you look at the previous JSON spec, it states:
> JSON text SHALL be encoded in UTF-8, UTF-16, or UTF-32. The default encoding is UTF-8, and JSON texts that are encoded in UTF-8 are interoperable in the sense that they will be read successfully by the maximum number of implementations; there are many implementations that cannot successfully read texts in other encodings (such as UTF-16 and UTF-32).
>
> Implementations MUST NOT add a byte order mark to the beginning of a JSON text. In the interests of interoperability, implementations that parse JSON texts MAY ignore the presence of a byte order mark rather than treating it as an error.
Based on that, it sounds like UTF-8, UTF-16, and UTF-32 are all valid choices, and only encodings like UTF-16BE and UTF-16LE are forbidden.
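As a rough illustration that a parser may accept more than UTF-8, here is a minimal sketch using Python's standard `json` module, which accepts `bytes` input encoded as UTF-8, UTF-16, or UTF-32 (Python 3.6 and later). The inventory fragment and the `parse_in` helper are made up for this example; a real OCFL inventory has many more fields.

```python
import json

# Hypothetical, minimal inventory fragment (assumption for illustration only).
inventory_text = '{"id": "testing"}'


def parse_in(encoding: str) -> str:
    """Encode the inventory text with the given codec, then parse the raw bytes."""
    raw = inventory_text.encode(encoding)
    # json.loads detects UTF-8/16/32 from the byte pattern of the input.
    return json.loads(raw)["id"]


# BOM-less encodings only, since the JSON specs say a BOM must not be added.
encodings = ("utf-8", "utf-16-le", "utf-16-be", "utf-32-le", "utf-32-be")
ids = {enc: parse_in(enc) for enc in encodings}

for enc, object_id in ids.items():
    print(f"{enc}: {object_id!r}")
```

With this parser, the decoded ID is the same string in every case, so the on-disk encoding of the inventory does not by itself tell you which bytes should be fed to a hash function.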
Back to the original issue: the character encoding for OCFL object IDs. Let's say that you have the object ID "testing". The linked extension needs to hash this ID to produce a path to the object's root directory. Hashing the ID with SHA-256 produces the following results, depending on the encoding:
- UTF-8: cf80cd8aed482d5d1527d7dc72fceff84e6326592848447d2dc0b0e87dfc9a90
- UTF-16: 04364fb7e1f0fbc49ac9228952c61cc7af3e831a171daf5f1d8a6db16fe6204c
- UTF-32: c87dea511cca5cb8d51e5613046e432d3d68c6df1a316d6a7814f7572d45eb10
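For anyone who wants to reproduce this kind of comparison, here is a minimal Python sketch. It hashes the same ID under several encodings, including variants with and without a byte order mark. The variant labels are my own, and I have not checked which UTF-16 or UTF-32 variant (byte order, BOM or no BOM) corresponds to the digests listed above, since that depends on the tool that produced them.

```python
import codecs
import hashlib

object_id = "testing"


def sha256_hex(data: bytes) -> str:
    """Return the hex-encoded SHA-256 digest of the given bytes."""
    return hashlib.sha256(data).hexdigest()


# Byte order and BOM are spelled out explicitly so the result does not
# depend on the platform's native endianness.
candidates = {
    "UTF-8": object_id.encode("utf-8"),
    "UTF-16BE, no BOM": object_id.encode("utf-16-be"),
    "UTF-16LE, no BOM": object_id.encode("utf-16-le"),
    "UTF-16BE, with BOM": codecs.BOM_UTF16_BE + object_id.encode("utf-16-be"),
    "UTF-32BE, no BOM": object_id.encode("utf-32-be"),
    "UTF-32BE, with BOM": codecs.BOM_UTF32_BE + object_id.encode("utf-32-be"),
}

digests = {label: sha256_hex(data) for label, data in candidates.items()}

for label, digest in digests.items():
    print(f"{label}: {digest}")
```

Every variant produces a different digest, so two implementations only agree on the storage path if they agree on the exact byte encoding of the ID.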
So, is OCFL assuming UTF-8 or is it intended to support any valid JSON encoding?