Allow ICU 'precision 1' fallbacks to be used during encoding conversion #3

Open
mbeckerle opened this issue Nov 23, 2020 · 1 comment
Labels
DFDL 2.0 For issues associated with DFDL v2.0 (next major revision)

Comments

@mbeckerle
Collaborator

(moved from https://redmine.ogf.org/issues/301)

Today the DFDL 1.0 spec has the property dfdl:encodingErrorPolicy to control what happens when an unmappable or malformed character is encountered: 'error' or 'replace'. When the value is 'replace', the appropriate substitution character is used.

There is also the orthogonal question of fallback mappings, which are mappings specified by an encoding that are not normal round-trip mappings. DFDL does not currently provide a way to switch on fallback mappings. Here's what ICU says about this at http://userguide.icu-project.org/conversion/data:
_
In the CHARMAP section of a .ucm file, each line contains a Unicode code point (like <U(1-6 hexadecimal digits for the code point)>), a codepage character byte sequence (each byte like \xhh (2 hexadecimal digits)), and an optional "precision" or "fallback" indicator. The precision indicator either must be present in all mappings or in none of them. The indicator is a pipe symbol ‘|’ followed by a 0, 1, 2, 3, or 4 that has the following meaning:

|0 - A "normal", roundtrip mapping from a Unicode code point and back.
|1 - A "fallback" mapping only from Unicode to the codepage, but not back.
|2 - A subchar1 mapping. The code point is unmappable, and if a substitution is performed, then the subchar1 should be used rather than the subchar. Otherwise, such mappings are ignored.
|3 - A "reverse fallback" mapping only from the codepage to Unicode, but not back to the codepage.
|4 - A "good one-way" mapping only from Unicode to the codepage, but not back.

Fallback mappings from Unicode typically do not map codes for the same character, but for "similar" ones. This mapping is sometimes done if a character exists in Unicode but not in the codepage. To replace it, ICU maps a codepage code to a similar-looking code for human-readable output. This mapping feature is not useful for text data transmission especially in markup languages where a Unicode code point can be escaped with its code point value. The ICU application programming interface (API) ucnv_setFallback() controls this fallback behavior.

"Reverse fallbacks" are technically similar, but the same Unicode character can be encoded twice in the codepage. ICU always uses reverse fallbacks at runtime.

A subset of the fallback mappings from Unicode is always used at runtime: Those that map private-use Unicode code points. Fallbacks from private-use code points are often introduced as replacements for previous roundtrip mappings for the same pair of codes. These replacements are used when a Unicode version assigns a new character that was previously mapped to that private-use code point. The mapping table is then changed to map the same codepage byte sequence to the new Unicode code point (as a new roundtrip) and the mapping from the old private-use code point to the same codepage code is preserved as a fallback.

A "good one-way" mapping is like a fallback, but ICU always uses "good one-way" mappings at runtime, regardless of the fallback API flag.

The idea is that fallbacks normally lose information, such as mapping from a compatibility variant of a letter to the ASCII version; however, fallbacks from PUA and reverse fallbacks are assumed to be for "the same character", just an older code for it._
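To make the quoted CHARMAP line format concrete, here is a minimal Python sketch of parsing one such line. It assumes the simplest shape only (one code point, one byte sequence, an optional indicator); real .ucm files also contain comments and other constructs that this does not handle. The names `CharmapEntry` and `parse_charmap_line` are made up for illustration and are not part of ICU.

```python
import re
from typing import NamedTuple, Optional


class CharmapEntry(NamedTuple):
    code_point: int            # Unicode code point from <Uxxxx>
    codepage_bytes: bytes      # bytes from the \xhh sequence
    precision: Optional[int]   # 0-4, or None if the line has no indicator


# Simplified pattern: <Uxxxx> \xhh[\xhh...] [|N]
_LINE = re.compile(
    r"^<U([0-9A-Fa-f]{1,6})>\s+((?:\\x[0-9A-Fa-f]{2})+)(?:\s*\|([0-4]))?\s*$"
)


def parse_charmap_line(line: str) -> CharmapEntry:
    """Parse one simplified CHARMAP line (hypothetical helper)."""
    m = _LINE.match(line.strip())
    if m is None:
        raise ValueError(f"not a recognised CHARMAP line: {line!r}")
    code_point = int(m.group(1), 16)
    raw = bytes(int(h, 16) for h in re.findall(r"\\x([0-9A-Fa-f]{2})", m.group(2)))
    precision = int(m.group(3)) if m.group(3) is not None else None
    return CharmapEntry(code_point, raw, precision)
```

For example, `parse_charmap_line(r"<U00E9> \xE9 |0")` yields code point 0xE9, the single byte 0xE9, and precision 0.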

So the default behaviour for ICU is to use "good one-way" mappings, "reverse fallback" mappings, and "fallback" mappings from private-use-area code points, but only to use normal "fallback" mappings if the setFallback API has been used.
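The default behaviour summarised above can be written down as a small decision function. This is a Python sketch of the rules as described in the quoted ICU text, not ICU code; the function names are hypothetical, and `fallback_enabled` stands in for whatever the setFallback API has been set to.

```python
def is_private_use(code_point: int) -> bool:
    """True for the Unicode private-use ranges (BMP PUA and planes 15/16)."""
    return (
        0xE000 <= code_point <= 0xF8FF
        or 0xF0000 <= code_point <= 0xFFFFD
        or 0x100000 <= code_point <= 0x10FFFD
    )


def used_when_encoding(precision: int, code_point: int,
                       fallback_enabled: bool = False) -> bool:
    """Is a mapping with this precision used going from Unicode to the codepage?"""
    if precision in (0, 4):
        # Roundtrip and "good one-way" mappings are always used.
        return True
    if precision == 1:
        # Ordinary fallbacks need the flag; private-use fallbacks are always used.
        return fallback_enabled or is_private_use(code_point)
    # |2 marks an unmappable code point; |3 only applies codepage-to-Unicode.
    return False


def used_when_decoding(precision: int) -> bool:
    """Is a mapping with this precision used going from the codepage to Unicode?"""
    # Roundtrip mappings and reverse fallbacks are always used.
    return precision in (0, 3)
```

Under this model, a |1 mapping for U+00E9 is skipped by default and used only when `fallback_enabled=True`, whereas a |1 mapping for U+E000 is used either way.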

IBM customers have requested the ability to use normal "fallback" mappings. At the current time, the only solution open to them is to change the .ucm file (or create a variant) and change the "|1" mappings to "|4" so that "fallback" mappings become "good one-way" mappings.
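For illustration, the workaround of turning "|1" mappings into "|4" could be scripted roughly as follows. This is a Python sketch that assumes every mapping line starts with `<U...>` and ends with the indicator, optionally followed by a `#` comment; it has not been run against real ICU .ucm files, and `promote_fallbacks` is a made-up name.

```python
import re

# A mapping line that carries a |1 indicator, optionally followed by a comment.
_FALLBACK = re.compile(r"^(<U[0-9A-Fa-f]{1,6}>.*?)\|1(\s*(?:#.*)?)$")


def promote_fallbacks(ucm_text: str) -> str:
    """Return ucm_text with "|1" indicators on mapping lines changed to "|4"."""
    out = []
    for line in ucm_text.splitlines(keepends=True):
        # Lines that are not |1 mapping lines pass through unchanged.
        out.append(_FALLBACK.sub(r"\1|4\2", line))
    return "".join(out)
```

Note that this changes every ordinary fallback in the table, which is coarser than a switch that can be turned on per DFDL schema.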

@mbeckerle mbeckerle added the DFDL 2.0 For issues associated with DFDL v2.0 (next major revision) label Nov 23, 2020
@mbeckerle
Collaborator Author

(comment by Steve Hanson, moved from redmine)
WG agreed that precision 1 fallback mappings should be able to be switched on. Accordingly dfdl:encodingErrorPolicy will be extended as follows:

  1. Error unmappable characters; fallbacks not required => "error"
  2. Replace unmappable characters; fallbacks not required => "replace"
  3. Error unmappable characters; fallbacks required => "fallbackOrError"
  4. Replace unmappable characters; fallbacks required => "fallbackOrReplace"
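The four values listed above combine two independent choices: whether precision 1 fallbacks are used, and whether an unmappable character is an error or is replaced. A Python sketch of that decomposition follows; the enum and property names are hypothetical and only the four string values come from this comment.

```python
from enum import Enum


class EncodingErrorPolicy(Enum):
    """Proposed dfdl:encodingErrorPolicy values (class name is illustrative)."""
    ERROR = "error"
    REPLACE = "replace"
    FALLBACK_OR_ERROR = "fallbackOrError"
    FALLBACK_OR_REPLACE = "fallbackOrReplace"

    @property
    def uses_fallbacks(self) -> bool:
        # True when precision 1 fallback mappings are switched on.
        return self in (EncodingErrorPolicy.FALLBACK_OR_ERROR,
                        EncodingErrorPolicy.FALLBACK_OR_REPLACE)

    @property
    def replaces_unmappable(self) -> bool:
        # True when a still-unmappable character is replaced rather than an error.
        return self in (EncodingErrorPolicy.REPLACE,
                        EncodingErrorPolicy.FALLBACK_OR_REPLACE)
```

For example, `EncodingErrorPolicy("fallbackOrError")` reports `uses_fallbacks` as true and `replaces_unmappable` as false.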

Also noted: The wording for dfdl:encodingErrorPolicy 'replace' says: "If 'replace' then any error when decoding characters results in the insertion of the Unicode Replacement Character (U+FFFD) as the replacement for that error." That is not strictly true, as ICU behaviour is:
1. If the input sequence is of length 1 and a subchar1 byte is specified for the codepage [in the .ucm file], output U+001A
2. Otherwise output U+FFFD
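Stated as code, the two rules above amount to the following Python sketch. It models only the rule as described in this comment; `replacement_for` and its parameters are hypothetical names, not an ICU API.

```python
def replacement_for(bad_input: bytes, codepage_has_subchar1: bool) -> str:
    """Pick the replacement character for an undecodable input sequence."""
    if len(bad_input) == 1 and codepage_has_subchar1:
        # Single-byte input and the codepage defines subchar1: U+001A (SUB).
        return "\u001A"
    # Everything else: the Unicode Replacement Character.
    return "\uFFFD"
```

So a single bad byte in a codepage that defines subchar1 decodes to U+001A, while a two-byte bad sequence, or any bad sequence in a codepage without subchar1, decodes to U+FFFD.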
