canonicalize unicode identifiers #5434

Closed
stevengj opened this Issue Jan 17, 2014 · 104 comments

@stevengj (Member) commented Jan 17, 2014

As discussed on the mailing list, it is very confusing that

const μ = 3
µ + 1

throws a µ not defined exception (because Unicode codepoints 0x00b5 and 0x03bc are rendered almost identically). This could easily be encountered in real usage because option-m on a Mac produces 0x00b5 ("micro sign"), which is different from 0x03bc ("Greek small letter mu").

It would be good if Julia internally stored a table of easily confused Unicode codepoints, i.e. homoglyphs, and used them to help prevent these sorts of confusions. Three possibilities are:

  • foo not defined exceptions could check whether a homograph of foo is defined and let the user know if so.
  • Julia could issue a warning if a non-canonical homoglyph is used in an identifier.
  • Simply canonicalize all homoglyphs in identifiers (so the users can type them any way they want, but they are treated as equivalent identifiers).

My preference would be for the third option. I don't see any useful purpose being served by treating μ and µ as distinct identifiers.
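For illustration, a minimal sketch (runnable in a Julia session; the variable names are invented here) showing that the two characters are distinct codepoints even though they render almost identically:

# Two visually near-identical characters with distinct codepoints:
micro = 'µ'    # U+00B5 MICRO SIGN (what option-m produces on a Mac)
mu    = 'μ'    # U+03BC GREEK SMALL LETTER MU
micro == mu    # false, which is why `const μ = 3; µ + 1` reports that µ is not defined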

@johnmyleswhite (Member) commented Jan 17, 2014

+100 for this. Any strategy for ensuring that homoglyphs are merged seems like a big improvement to me.

@toivoh (Member) commented Jan 17, 2014

+1 for canonicalizing everything.

@stevengj (Member) commented Jan 17, 2014

(We should probably also normalize the Unicode identifiers, in addition to canonicalizing homoglyphs.)

@stevengj (Member) commented Jan 17, 2014

One possible software package that we could adapt for this might be utf8proc, which is MIT-licensed and fairly compact (600 lines of code plus a 1M data file). It looks like it does Unicode normalization, but not homograph canonicalization (except for a small number of special cases?). On closer inspection, it looks like it handles homoglyphs for us.

@jiahao (Member) commented Jan 17, 2014

+1 for canonicalization and normalization.

We certainly don't want the same disambiguation issues with combining diacritics and nonprinting control characters (like the right to left specifier). The Unicode list contains quite a few characters with combining diacritics already; not sure if it's exhaustive though.

@stevengj (Member) commented Jan 17, 2014

Actually, it looks like the utf8proc library completely solves this problem, because it implements (among other things) the standard "KC" Unicode normalization which canonicalizes homoglyphs.

I just compiled the utf8proc library and called it from Julia via:

function snorm(s::ByteString, options=0)
    # utf8proc_map fills r[1] with a pointer to a newly allocated, transformed copy of s
    r = Ptr{Uint8}[C_NULL]
    # the return value is the length of the result, or a negative utf8proc error code
    e = ccall((:utf8proc_map,:libutf8proc), Int, (Ptr{Uint8},Csize_t,Ptr{Ptr{Uint8}},Cint), s, sizeof(s), r, options)
    e < 0 && error(bytestring(ccall((:utf8proc_errmsg,:libutf8proc), Ptr{Uint8}, (Int,), e)))
    return bytestring(r[1])
end

and then

julia> s = "µ"
julia> uint16(snorm(s)[1])
0x00b5
julia> uint16(snorm(s, (1<<1) | (1<<2) | (1<<3) | (1<<5) | (1<<12))[1])
0x03bc

works (the second argument is various canonicalization flags copied from the utf8proc.h header file).

Moreover, the utf8proc canonicalization functions (including Unicode-aware case-folding and diacritical-stripping) would be useful to have in Julia anyway. I vote that we just put the whole utf8proc into deps and export some version of this functionality in Base, in addition to canonicalizing identifiers.
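For readers puzzling over the magic number: assuming the option-bit assignments in the utf8proc.h of that era (worth double-checking against the header you actually build with), the flags used above could be spelled out as named constants:

# Assumed bit values copied from utf8proc.h; verify against your copy of the header.
const UTF8PROC_STABLE  = (1<<1)   # request a stable normalization
const UTF8PROC_COMPAT  = (1<<2)   # compatibility decomposition (the "K" in NFKC)
const UTF8PROC_COMPOSE = (1<<3)   # recompose characters (the "C" in NFC/NFKC)
const UTF8PROC_IGNORE  = (1<<5)   # strip "default ignorable" characters such as soft hyphens
const UTF8PROC_LUMP    = (1<<12)  # lump certain look-alike characters together

snorm(s, UTF8PROC_STABLE | UTF8PROC_COMPAT | UTF8PROC_COMPOSE | UTF8PROC_IGNORE | UTF8PROC_LUMP)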

@jiahao (Member) commented Jan 17, 2014

Awesome, thanks for doing the legwork on this.

@JeffBezanson (Member) commented Jan 17, 2014

That sounds like a really good idea to me.

@pao (Member) commented Jan 17, 2014

KC has one case that we probably don't care about but seems worth mentioning: superscript numerals will be normalized to normal numerals. (We probably don't care because why would you have superscript numerals in a numeric literal, but this seems like the sort of thing to be abused in a future International Obfuscated Julia Coding Contest.)

@JeffBezanson (Member) commented Jan 17, 2014

That's not totally ideal; x² is a cute variable name :)

@jiahao (Member) commented Jan 17, 2014

I've actually used χ² somewhere.

@JeffBezanson (Member) commented Jan 17, 2014

We also have to avoid normalizing out different styled letters that represent different symbols in mathematics.

@nalimilan (Contributor) commented Jan 17, 2014

The problem with ² is that it seems to mean ^2, so maybe it's better not to encourage it.

@jiahao (Member) commented Jan 17, 2014

@JeffBezanson may be referring to what UAX #15 calls font variants (see Fig. 2). They give as an example \mathfrak H vs \bbold H, but I suspect regular \phi vs script \varphi is the one that would come up fairly often. (Ironically, Github won't let me enter the characters...)

So it seems that we are leaning toward canonical equivalence, as opposed to full compatibility equivalence, in which case NFD may be sufficient rather than NFKC.

@pao (Member) commented Jan 17, 2014

For variable names, I don't see the superscript/subscript being as much of a problem, other than that, e.g., χ² will be the same identifier as χ2; if you are distinguishing these I might think you were mad.

@JeffBezanson (Member) commented Jan 17, 2014

Our use case is very different from something like a text formatter, which wants to know that superscript 2 is a 2. In a programming language any characters that look different should be considered different. We can perhaps be flexible about superscripts, but font variants of letters have to be supported.

@mathpup (Contributor) commented Jan 18, 2014

The initial issue raised involved confusion over U+00B5 MICRO SIGN and U+03BC GREEK SMALL LETTER MU. Normalization type NFD would not fix this problem since U+00B5 has only a compatibility decomposition to U+03BC and not a canonical decomposition. NFKC will fix that issue. The utility at http://unicode.org/cldr/utility/transform.jsp?a=Any-NFKC%0D%0A&b=µ is useful for this.
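A quick check of this point, using the Unicode stdlib of much later Julia versions (an after-the-fact illustration; nothing like it shipped with Julia in 2014):

using Unicode
Unicode.normalize("µ", :NFD)  == "μ"   # false: NFD leaves U+00B5 untouched
Unicode.normalize("µ", :NFKC) == "μ"   # true: the compatibility mapping folds it to U+03BC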

@stevengj (Member) commented Jan 18, 2014

@JeffBezanson, I'm not convinced that "characters that look different should be considered different." One problem is that, unlike LaTeX, we cannot rely on a particular font/glyph being used to render particular codepoints. U+00B5 and U+03BC look distinct in some fonts (one is rendered italic) and not in others, for example. Moreover, even when codepoints are rendered distinctly, the difference will often be subtle (χ² versus χ2) and hence an invitation for bugs and confusion. (That's why these variants work for phishing scams, after all.)

I would prefer to simply state that identifiers are canonicalized to NFKC, so that only characters that look entirely distinct (as opposed to potentially slight font variations) are treated as distinct identifiers. It's useful to have variables named µ and π, but Julia shouldn't pretend that it is LaTeX.

@StefanKarpinski (Member) commented Jan 18, 2014

There are several different levels of distinction being discussed:

  1. "Indistinguishables". Different unnormalized but strongly equivalent forms – i.e. byte sequences that mean the same things but are represented different, such as precomposed characters like U+0065, U+0301 vs. U+00E9.
  2. "Strong confusables". Characters like μ vs. µ and other things listed here that are semantically distinct but will often cause confusion and frustration due to very similar rendering.
  3. "Weak confusables." Character sequences that are normally easy to distinguish but might end up looking similar in some renderings, e.g. χ² vs. χ2.

These call for different approaches. To deal with "indistinguishables" it's pretty clear that we should just normalize them. At the other end of the spectrum, this is a pretty lousy way to deal with "weak confusables" – imagine using both χ² and χ2 in some code and being really confused when they are silently treated as the same identifier! For weak confusables, I suspect the best behavior is to treat them as distinct but emit a warning if two weakly confusable identifiers are used in the same file (or scope). In the middle, strong confusables are a tougher call – both automatically normalizing them to be the same (like with indistinguishables) and warning if they appear in the same file/scope (like weak confusables) are reasonable approaches. However, I tend to favor the warning.

I've intentionally avoided Unicode terms here to keep the problem statement separate from the solution. I suspect that we should first normalize source to NFD, which takes care of collapsing "indistinguishables". Then we should warn if two identifiers are the same modulo "compatibles" and "confusables". That means that using composed and uncomposed versions of è in the same source file would just silently work – they mean the same thing – but using both χ² and χ2 or ﬃ and ffi in the same file would produce a warning and then proceed to treat them as distinct.
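To make the levels concrete, here is a small sketch using the Unicode stdlib of later Julia versions (Unicode.normalize did not exist when this comment was written, so this is an after-the-fact illustration):

using Unicode

e_decomposed  = "e\u0301"   # 'e' followed by COMBINING ACUTE ACCENT
e_precomposed = "\u00e9"    # precomposed 'é'

e_decomposed == e_precomposed                           # false: distinct byte sequences ("indistinguishables")
Unicode.normalize(e_decomposed, :NFC) == e_precomposed  # true: NFC collapses them silently

Unicode.normalize("χ²", :NFKC) == "χ2"   # true: NFKC also folds the "weak confusables" discussed above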

@ivarne (Contributor) commented Jan 18, 2014

@StefanKarpinski Good summary, but I think you have the wrong conclusion.

I was once challenged to find out why 10l would compare unequal to 101 in a C program (the real case was more elaborate), but because of the font I could not find the bug.

My preference would definitely be to make Julia consider all possibly ambiguous characters equal, and give a warning/error if someone uses identifiers that are considered equal because of rules 2 and 3. I do not read Unicode codepoints, and I do not have a different word for ﬃ and ffi, and I can't even see the difference when I am focused on logic. To me programming is about expressing ideas, and using both ﬃ and ffi as different variables in the same scope would be the worst offence to any code style guide.

@StefanKarpinski (Member) commented Jan 18, 2014

Well, that's why it should warn. Whether it considers them the same or different is somewhat irrelevant when it causes a warning. I guess one benefit of considering such things the same rather than keeping them different is ease of implementation: if the analysis is done at the file level, you can canonicalize an entire source file and warn if two "confusable" identifiers are used in the same source file and then hand the canonicalized program off to the rest of the parsing process without worrying any further. Then again, you can do the same without considering them the same by doing the confusion warning at the same step but leaving confusable identifiers different.

@stevengj (Member) commented Jan 18, 2014

As a practical matter, it is far easier to implement and explain canonicalization to NFKC, taking advantage of the existing standard and utf8proc, than it would be to implement and document our own nonstandard normalization. (There are a lot of codepoints we'd have to argue over.)

We can also certainly issue a warning whenever a file contains identifiers that are distinct from their canonicalized versions. (But I think it would be an unfriendly practice to issue a warning instead of canonicalizing.)

@JeffBezanson (Member) commented Jan 18, 2014

It seems unfortunate to me to canonicalize distinct characters that Unicode provides specifically for their use in mathematics.

Should we use a different normalization, maybe NFD, for string literals?

@stevengj (Member) commented Jan 18, 2014

I don't think string literals should be normalized at all by default, although we should provide functions to do normalization if that is desired. The user should be able to enter any Unicode string they want.

@jiahao (Member) commented Jan 18, 2014

+1 for what @stevengj said. There's something to be said for preserving user input as much as possible. (What if the user wants to implement a custom normalization, for example...)

@nolta (Member) commented Jan 19, 2014

Just to be perverse, let's say we normalize to NFKC, and Quaternions.jl gets renamed ℍ.jl. Then using ℍ would look for .julia/H/src/H.jl?

@StefanKarpinski (Member) commented Jan 19, 2014

I've actually rampantly made the assumption that package names are ASCII largely because I think it's opening a whole can of worms to use non-ASCII characters in package names.

@JeffBezanson (Member) commented Jan 19, 2014

I'm much more concerned about identifier names. I don't think merging ℍ and H makes sense for us.

@StefanKarpinski (Member) commented Jan 19, 2014

@stevengj – what about the χ² vs. χ2 issue? Your proposal silently treats them as the same, which strikes me as almost as bad as the (thus far hypothetical) problems we're trying to avoid here.

@StefanKarpinski (Member) commented Jan 19, 2014

Actually, no, it's worse – at least you can look at the contents of your source file and discover that two similar looking identifiers are actually the same. If χ² and χ2 are treated as the same identifier, there's no way to figure it out short of finding the obscure appendix of the Julia manual that explains this behavior. I find that unacceptable.

@jiahao (Member) commented Feb 24, 2014

The semantic argument is even more nefarious: there are multiple code points that are semantically equivalent in ways that go far beyond the mu vs. micro problem. Unihan (UAX 38) defines an additional layer of equivalence for the Han code points to deal with so-called semantic variants. It turns out that Unihan defines two such equivalences, for partial semantic overlap (e.g. 井, water well, vs. 丼, food bowl, or also well) and complete semantic overlap (e.g. 兎 and 兔, both meaning rabbit), where the different characters would be interpreted by many native Chinese speakers to be equivalent written alternatives. These characters are not NFKC equivalent; should we then also be in the business of canonicalization by semantic equivalence as well? Where does the madness end?

@wlbksy (Member) commented Feb 25, 2014

I think people from CJK countries are used to using English letters as variable names; CJK characters are mainly used in strings. This is because it's not convenient to switch between English and CJK input methods. The only habit-changing thing from Julia would be that Greek letters can be used as names, which makes maths easier to read.
I agree with @stevengj that similar letters should be dealt with when they are used as variable/function names, and so should fullwidth math signs, letters, and punctuation. Just leaving them as they are when they appear in strings would be sufficient for me.

@vtjnash (Member) commented Feb 25, 2014

> If we do this it will have to be very early, at parse time.

Of course. But I think, for sanity, it needs to be a universal property of symbols. Being able to make symbols through symbol() that can't be made directly is different from passing symbol() exactly the same sequence of bytes and getting back a different variable than if it went through the parser directly (and vice versa). If symbols aren't always in canonical form, then it seems that getfield and function argument splatting could also be problematic.

I think all symbols should be normalized the same way regardless of how they are entered. While it makes sense to me to treat different ways of writing the same Unicode character as equivalent, it doesn't make sense to me to treat different Unicode characters as sometimes equivalent.

> To agree with Steven, I think we should be optimizing for the least distinguishable font that might plausibly be used.

I disagree. The beauty of having Unicode identifiers is in being able to use them freely, even if it means you need to upgrade your tools.

Normalizing symbols won't fix #5903, since by the time the parser has decided it is a symbol, it is too late to redefine it as a separate operator. Instead, I think it is more akin to the question of whether arbitrary expressions can be used as infix operators. Since they can't, it is a limited subset of operators that would be affected by allowing full-width alternatives to the half-width punctuation. Therefore, I believe that it is reasonable to make that modification without resorting to full NFKC for all symbols.

Somewhat unrelated, but I would require that all code be normalized to the standard ASCII half-width operators for pull requests to any of my repositories. Even if they are defined to work identically and differ only slightly visually, it poses a maintenance hazard if find/replace doesn't see them that way.
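A hypothetical sketch of the symbol() concern, in 2014-era syntax (`symbol` later became `Symbol`); the behavior shown is what parser-only normalization would produce, not a description of any released Julia:

# Under parser-only normalization of identifiers:
x = :µ              # the parser would canonicalize U+00B5 to U+03BC, so this is really :μ
y = symbol("µ")     # constructed at run time, the byte sequence stays U+00B5
x == y              # false, even though the source text looks identical
getfield(Main, y)   # would then fail to find a variable that was defined as μ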

@simonbyrne (Contributor) commented Feb 25, 2014

As far as I can tell, no one is opposed to NFC normalization, so we should probably do that. For anything beyond that, perhaps we should wait until we have more input from users in languages that utilise non-latin character sets, since these are the parties most affected. As a monoglot, I have no real opinion as to what would be best, but I suspect the answer could be different for different languages.

I think @vtjnash may have hit upon a good solution, at least in the interim: provide recommended guidelines, along with a script for testing whether code satisfies those guidelines, which could be used as an appropriate git hook or incorporated into travis tests.

This could be enforced for Base and other JuliaLang repos, but if people really want to use two different mus in their own code, then they can. Moreover these guidelines could be later amended based on feedback without breaking existing code.
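A hypothetical sketch of the kind of check script described above, written against the Unicode stdlib of much later Julia versions (not available in 2014); the function name and the crude identifier regex are invented for illustration:

using Unicode

function confusable_identifiers(path::AbstractString)
    src = read(path, String)
    # crude identifier pattern, purely for illustration
    idents = unique(String[m.match for m in eachmatch(r"[\p{L}_][\p{L}\p{N}_!]*", src)])
    seen = Dict{String,String}()
    for id in idents
        key = Unicode.normalize(id, :NFKC)
        if haskey(seen, key) && seen[key] != id
            @warn "identifiers '$(seen[key])' and '$id' are NFKC-equivalent"
        else
            seen[key] = id
        end
    end
end

confusable_identifiers("src/myfile.jl")   # hypothetical usage, e.g. from a git hook or CI step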

@lindahua (Member) commented Feb 25, 2014

I concur with @simonbyrne that we keep NFC and be cautious about going beyond.

A guideline about choosing names, to me, is better than forcing a controversial behavior (i.e. quietly tying together identifiers that look noticeably different).

In terms of Asian full-width characters, I think it might be better to raise an error (or at least a warning) when people use ＝ (full-width) instead of = (normal). I believe that a better approach is to detect such cases and give warnings (in this way we also encourage programmers to use symbols consistently) than to quietly do the job (not necessarily in a correct way) by guessing the author's intention.

@stevengj (Member) commented Feb 25, 2014

I don't claim that NFKC is perfect; it is certainly possible to write obfuscated code even in ASCII. Just that it will cause far fewer problems than the alternative of NFC. The fact that there is no perfect solution is not an argument that we should do nothing. NFKC is a widely accepted, standardized, and continually updated way of normalizing strings so that different input methods generally (if not always) produce the same codepoints and that many (even if not all) codepoints with slightly different renderings but similar meanings are identified with one another. Losing the ability to use ＝ and = as distinct identifiers seems like a small price to pay for this benefit. Why is NFC a better choice?

@vtjnash, the question of whether symbol(foo) should be normalized too is orthogonal to this discussion, since the same question applies to NFC.

@JeffBezanson (Member) commented Mar 4, 2014

Ok, "do nothing" might not be the best solution, but it does have the nice property of being very transparent. You can see what's going on just by looking at code points and seeing that they are different. Similarly, erring on the side of treating identifiers as different will tend to produce not-defined errors, while silently equating identifiers will tend to produce subtle bugs.

Probably almost nobody has a good intuitive grasp of what NFKC does. If it were really true that it specifically targeted differences due to input method, that might be valuable, but instead it strikes me as a giant random list of equated code sequences.

@JeffBezanson (Member) commented Mar 4, 2014

Moving, as too contentious to block 0.3.

@JeffBezanson modified the milestones: 0.4, 0.3 on Mar 4, 2014

@StefanKarpinski (Member) commented Mar 4, 2014

This is why I've been arguing for an error. Our general philosophy is that if there's no obvious one right interpretation of something, raise an error. NFC is fine-grained enough that we can be sure that NFC-equivalent identifiers are meant to be the same. NFKC is coarse-grained enough that we can be sure that NFKC-distinct identifiers are clearly meant to be different. Everything between is no man's land. So we should throw an error. Otherwise, we are implicitly guessing what the user really meant. Not canonicalizing to NFKC is guessing that distinct identifiers are actually meant to be different. Canonicalizing to NFKC is guessing that distinct but NFKC-equivalent identifiers are meant to be the same. Either strategy will inevitably be wrong some of the time.

@jiahao (Member) commented Mar 4, 2014

> NFKC-distinct identifiers are clearly meant to be different

Only if you exclude Chinese (Unihan) characters; I've already provided counterexamples.

@StefanKarpinski (Member) commented Mar 4, 2014

I'm willing to say that if Unihan has decided to ignore the standards on this matter, that is not our problem.

@jiahao (Member) commented Mar 4, 2014

That sentence is illogical; Unihan is part of the Unicode standard. You can say that the standard is inconsistent. All I'm saying is that none of the arguments I have heard in favor of NFKC are actually sufficient to cover the corner cases in Unihan.

@StefanKarpinski (Member) commented Mar 4, 2014

Unless there's some even coarser equivalence standard that works for Unihan as well, NFKC is the best we've got and we're not going to get into the business of deciding what Unicode characters should or shouldn't be considered equivalent. If there isn't such a standard, then the mismatch between Unihan and NFKC is the Unicode consortium's problem as I said, not ours.

@nalimilan (Contributor) commented Mar 5, 2014

I agree with @StefanKarpinski: there's not much to win by silently normalizing identifiers using NFKC. If we report an error/warning, people will notice the problem early and avoid much trouble. Julia IDEs will be made smart enough to detect cases where two identifiers are equal after NFKC normalization, and will offer to adapt them automatically as you type. On the other hand, if the parser does the normalization, you will never be able to trust grep to find an identifier because of the many possible variants.

@toivoh (Member) commented Mar 5, 2014

I'm just concerned that the ambiguity detection might be silent while you develop your own code, and then throw an error when someone tries to use that code together with something else. I think it would be much better if we can find some (more restrictive) way that reports all the ambiguities independent of what the code is combined with.


@stevengj (Member) commented Jun 6, 2014

Seems like this can be closed.

@hayd (Member) commented Jul 14, 2016

Perhaps Lint is the right place to catch this.

ilammy added a commit to ilammy/sabre that referenced this issue Nov 19, 2016

Normalize identifiers to NFKC
This is a somewhat controversial thing, but I have made a decision:
identifiers should be normalized to fold visual ambiguity, and the
normalization form should be NFKC.

Rationale:

1. Compatibility decomposition is favored over canonical one because
   it provides useful folding for letter ligatures, fullwidth forms,
   certain CJK ideographs, etc.

2. Compatibility decomposition is favored over canonical one because
   it provides more protection from visual spoofing.

3. Standard Unicode transformation should be favored over anything
   ad-hoc because it's predictable and more mature.

4. Normalization is a compromise between freedom of expression and
   ease of implementation. Source code is not prose, there are rules.

Here are some references to other languages:

    SRFI 52: http://srfi-email.schemers.org/srfi-52/
    Julia:   JuliaLang/julia#5434
    Python:  http://bugs.python.org/issue10952
    Rust:    rust-lang/rust#2253

Unfortunately, there aren't very many precedents and open discussions
about Unicode usage in programming languages, especially in languages
with very permissive identifier syntax (like Scheme).

Aside from identifiers there are more places where Unicode can be used:

* Characters are not normalized, not even to NFC. This may have been
  useful, for example, to recompose combining marks, but unfortunately
  NFC may do more transformations than that, so it is no go. We preserve
  the exact Unicode character.

* Character names, on the other hand, are case-sensitive identifiers,
  so they are normalized as such.

* Strings and escaped identifiers are left untouched in order to preserve
  the exact spelling as in the source code.

* Directives are case-insensitive identifiers and are normalized as such.

* Numbers should be composed from ASCII only so they are not normalized.
  Sometimes this produces weird parses because characters that look like
  signs are not treated as such. However, these characters are invalid in
  numbers, so it's somewhat justified.

* Peculiar identifiers are shit. I'm sorry. Because of NFKC it is possible
  to write a plain, unescaped identifier that will parse as a number after
  going through NFKC. It may even look exactly like a number without being
  one. There is not much we can do about this, so we produce a warning
  just in case.

* Datum labels are mostly numbers, so they are not normalized either.
  Note that sometimes they can be treated as numbers with invalid prefix.

* Comments are ignored.

* Delimiters should be ASCII-only. No discussion on this. Unicode has
  various fancy whitespaces and line separators, but this is source
  code, not a rich text document in a word processor.

Also, case-folding is currently performed only for the ASCII range.
Identifiers should use the NFKC_casefold transformation. It will be
implemented later.