This is a somewhat controversial choice, but I have made a decision:
identifiers should be normalized to fold visual ambiguity, and the
normalization form should be NFKC.
1. Compatibility decomposition is favored over the canonical one because
it provides useful folding for letter ligatures, fullwidth forms,
certain CJK ideographs, etc.
2. Compatibility decomposition is favored over the canonical one because
it provides more protection from visual spoofing.
3. A standard Unicode transformation should be favored over anything
ad hoc because it is predictable and more mature.
4. Normalization is a compromise between freedom of expression and
ease of implementation. Source code is not prose; there are rules.
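To make the difference between the canonical and compatibility forms concrete, here is a minimal sketch using Python's standard `unicodedata` module. The helper name `normalize_identifier` is hypothetical and only illustrates the idea; it is not the actual reader code.

```python
import unicodedata


def normalize_identifier(name: str) -> str:
    """Hypothetical helper: fold an identifier to NFKC."""
    return unicodedata.normalize("NFKC", name)


samples = [
    "\ufb01le",                  # starts with U+FB01 LATIN SMALL LIGATURE FI
    "\uff4c\uff49\uff53\uff54",  # "list" spelled with fullwidth letters
    "\u2f00",                    # U+2F00 KANGXI RADICAL ONE
]

for s in samples:
    nfc = unicodedata.normalize("NFC", s)
    nfkc = normalize_identifier(s)
    # NFC keeps these spellings distinct; NFKC folds them together.
    print(ascii(s), "NFC:", ascii(nfc), "NFKC:", ascii(nfkc))
```

With NFC all three samples stay as they are, so the ligature spelling and plain `file` would be different identifiers. With NFKC they fold to `file`, `list`, and U+4E00 respectively.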
Here are some references to other languages:
SRFI 52: http://srfi-email.schemers.org/srfi-52/
Unfortunately, there aren't very many precedents and open discussions
about Unicode usage in programming languages, especially in languages
with very permissive identifier syntax (like Scheme).
Aside from identifiers there are more places where Unicode can be used:
* Characters are not normalized, not even to NFC. Normalization might have
been useful, for example, to recompose combining marks, but unfortunately
NFC may do more transformations than that (it replaces U+212B ANGSTROM
SIGN with U+00C5, for instance), so it is a no-go. We preserve the exact
Unicode character.
* Character names, on the other hand, are case-sensitive identifiers,
so they are normalized as such.
* Strings and escaped identifiers are left untouched in order to preserve
the exact spelling as in the source code.
* Directives are case-insensitive identifiers and are normalized as such.
* Numbers should be composed from ASCII only so they are not normalized.
Sometimes this produces weird parses because characters that look like
signs are not treated as such. However, these characters are invalid in
numbers, so it's somewhat justified.
* Peculiar identifiers are shit. I'm sorry. Because of NFKC it is possible
to write a plain, unescaped identifier that will parse as a number after
going through NFKC. It may even look exactly like a number without being
one. There is not much we can do about this, so we produce a warning
just in case.
* Datum labels are mostly numbers, so they are not normalized either.
Note that sometimes they can be treated as numbers with an invalid prefix.
* Comments are ignored.
* Delimiters should be ASCII-only. No discussion on this. Unicode has
various fancy whitespaces and line separators, but this is source
code, not a rich text document in a word processor.
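The number and peculiar-identifier rules above can be sketched as follows. This is an illustration under stated assumptions, not the actual reader: `ASCII_NUMBER` is a deliberately simplified, hypothetical number syntax (real Scheme numbers are much richer), and `classify` is a made-up helper name.

```python
import re
import unicodedata

# Hypothetical, simplified number syntax: optional ASCII sign, ASCII digits,
# optional fraction. Only ASCII characters can match.
ASCII_NUMBER = re.compile(r"[+-]?[0-9]+(?:\.[0-9]+)?\Z")


def classify(token: str) -> tuple:
    """Return (kind, warn) for a plain, unescaped token.

    Numbers are recognized on the raw token, without normalization.
    Anything else is an identifier; warn is True when the identifier
    turns into something number-like after NFKC.
    """
    if ASCII_NUMBER.match(token):
        return ("number", False)
    folded = unicodedata.normalize("NFKC", token)
    return ("identifier", bool(ASCII_NUMBER.match(folded)))


print(classify("123"))                 # plain ASCII number
print(classify("\uff11\uff12\uff13"))  # fullwidth digits: identifier, warn
print(classify("\uff0b1"))             # fullwidth plus sign: identifier, warn
print(classify("\u22121"))             # U+2212 MINUS SIGN: identifier, no warning

# Why character literals are left alone: NFC rewrites some single
# characters, e.g. U+212B ANGSTROM SIGN becomes U+00C5.
print(unicodedata.normalize("NFC", "\u212b") == "\u00c5")
```

The U+2212 case shows the "weird parse" mentioned for numbers: the character looks like a sign, but NFKC does not map it to ASCII `-`, so the token is an ordinary identifier and no warning is raised.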
Also, case folding is currently performed only for the ASCII range.
Identifiers should use the NFKC_Casefold transformation. It will be