Normalize all unicode identifiers to NFC #5462
This addresses issue #5434. As per the apparent consensus in that issue, all identifiers are normalized to NFC, which canonicalizes composed and decomposed forms of the same character but does not unify easily-confused characters ("compatibility equivalents") such as µ (micro sign) and μ (Greek mu).
This patch adds the utf8proc library to
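To make the NFC/NFKC distinction concrete, here is an illustrative sketch in Python (using the standard `unicodedata` module rather than utf8proc, which the patch itself adds):

```python
import unicodedata

# NFC composes "e" + COMBINING ACUTE ACCENT (U+0301) into the single
# code point é (U+00E9), so both spellings name the same identifier.
decomposed = "e\u0301"
assert unicodedata.normalize("NFC", decomposed) == "\u00e9"

# NFC leaves compatibility equivalents alone: the micro sign µ (U+00B5)
# and Greek mu μ (U+03BC) remain distinct identifiers.
micro, mu = "\u00b5", "\u03bc"
assert unicodedata.normalize("NFC", micro) == micro

# NFKC, by contrast, folds µ into μ -- the more aggressive unification
# this patch deliberately does not perform.
assert unicodedata.normalize("NFKC", micro) == mu
```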
Yes, thanks for moving on this – much needed after all the talking :-)
The NFC normalization is completely uncontroversial and clearly a good idea. On the other hand, I think this approach to NFKC collision avoidance is pretty broken. How about separating the two so that we can get the uncontroversial part merged and figure out how to do the harder collision avoidance bit separately?
This is good, but we might need to do the normalization even earlier to handle the case where different forms of an identifier appear in the same scope (the front end does some identifier matching).
I agree with Stefan about NFKC collisions: they will not necessarily manifest as undefined variables.
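The failure mode here can be illustrated with a sketch (Python for illustration; the actual normalization happens in Julia's parser). Under NFC-only normalization, µ and μ create two separate bindings, so referencing the wrong one resolves silently to a different value instead of raising an undefined-variable error:

```python
import unicodedata

# Two visually near-identical identifiers: micro sign and Greek mu.
names = ["\u00b5", "\u03bc"]

# Under NFC both survive as distinct bindings in the same scope...
nfc_scope = {unicodedata.normalize("NFC", n): i for i, n in enumerate(names)}
assert len(nfc_scope) == 2

# ...so looking up either one succeeds -- no "undefined variable" error
# fires even when the user typed the one they did not mean.
assert "\u03bc" in nfc_scope and "\u00b5" in nfc_scope

# NFKC would have merged the two names into a single binding.
nfkc_names = {unicodedata.normalize("NFKC", n) for n in names}
assert len(nfkc_names) == 1
```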
Whoops, I accidentally deleted my comment about
I'm not performing normalization on the argument of this function, on the principle that we currently allow the programmer to call
The counter-argument is that it is hard to imagine a circumstance in which a non-NFC symbol is actually desired, and that this may lead to unexpected results if the user calls