# canonicalize unicode identifiers #5434

Closed
opened this Issue Jan 17, 2014 · 104 comments

Member
 As discussed on the mailing list, it is very confusing that `const μ = 3` followed by `µ + 1` throws a `µ not defined` exception (because Unicode codepoints 0x00b5 and 0x03bc are rendered almost identically). This could easily be encountered in real usage because option-m on a Mac produces 0x00b5 ("micro sign"), which is different from 0x03bc ("Greek small letter mu"). It would be good if Julia internally stored a table of easily confused Unicode codepoints, i.e. homoglyphs, and used them to help prevent these sorts of confusions. Three possibilities are:

* `foo not defined` exceptions could check whether a homograph of foo is defined and let the user know if so.
* Julia could issue a warning if a non-canonical homoglyph is used in an identifier.
* Simply canonicalize all homoglyphs in identifiers (so that users can type them any way they want, but they are treated as equivalent identifiers).

My preference would be for the third option. I don't see any useful purpose being served by treating μ and µ as distinct identifiers.
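A minimal sketch of the underlying problem, assuming a current Julia (1.x). The two characters are written as escapes so that they cannot be mixed up; the variable names are illustrative only.

```julia
# The two characters from the example above, written as escapes.
micro = '\u00b5'   # MICRO SIGN (what option-m produces on a Mac)
mu    = '\u03bc'   # GREEK SMALL LETTER MU

# They render almost identically but are different code points,
# so names spelled with them are different strings.
same_char = micro == mu
same_name = string(micro) == string(mu)

println("micro = U+", uppercase(string(UInt32(micro), base=16, pad=4)))
println("mu    = U+", uppercase(string(UInt32(mu), base=16, pad=4)))
println("equal? ", same_char)
```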
Member
 +100 for this. Any strategy for ensuring that homoglyphs are merged seems like a big improvement to me.
Member
commented Jan 17, 2014
 +1 for canonicalizing everything.
Member
 (We should probably also normalize the Unicode identifiers, in addition to canonicalizing homoglyphs.)
Member
 One possible software package that we could adapt for this might be utf8proc, which is MIT-licensed and fairly compact (600 lines of code plus a 1M data file). It looks like it does Unicode normalization, but not homograph canonicalization (except for a small number of special cases?). Update: it looks like it handles homoglyphs for us.
Member
commented Jan 17, 2014
 +1 for canonicalization and normalization. We certainly don't want the same disambiguation issues with combining diacritics and nonprinting control characters (like the right to left specifier). The Unicode list contains quite a few characters with combining diacritics already; not sure if it's exhaustive though.
Member
 Actually, it looks like the utf8proc library completely solves this problem, because it implements (among other things) the standard "KC" Unicode normalization which canonicalizes homoglyphs. I just compiled the utf8proc library and called it from Julia via:

```julia
function snorm(s::ByteString, options=0)
    r = Ptr{Uint8}[C_NULL]
    e = ccall((:utf8proc_map,:libutf8proc), Int,
              (Ptr{Uint8},Csize_t,Ptr{Ptr{Uint8}},Cint),
              s, sizeof(s), r, options)
    e < 0 && error(bytestring(ccall((:utf8proc_errmsg,:libutf8proc), Ptr{Uint8},
                                    (Int,), e)))
    return bytestring(r[1])
end
```

and then

```
julia> s = "µ"
julia> uint16(snorm(s)[1])
0x00b5
julia> uint16(snorm(s, (1<<1) | (1<<2) | (1<<3) | (1<<5) | (1<<12))[1])
0x03bc
```

works (the second argument is various canonicalization flags copied from the utf8proc.h header file). Moreover, the utf8proc canonicalization functions (including Unicode-aware case-folding and diacritical-stripping) would be useful to have in Julia anyway. I vote that we just put the whole utf8proc into deps and export some version of this functionality in Base, in addition to canonicalizing identifiers.
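For readers on a current Julia (1.x), the same check can be sketched with the `Unicode` standard library, which wraps utf8proc. This is an illustration under that assumption, not the code that was used in this thread.

```julia
using Unicode

micro = "\u00b5"   # MICRO SIGN
mu    = "\u03bc"   # GREEK SMALL LETTER MU

nfc  = Unicode.normalize(micro, :NFC)    # canonical composition only
nfkc = Unicode.normalize(micro, :NFKC)   # compatibility decomposition + composition

println(nfc == micro)   # does NFC leave the micro sign alone?
println(nfkc == mu)     # does NFKC map it to Greek mu?
```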
Member
commented Jan 17, 2014
 Awesome, thanks for doing the legwork on this.
Member
 That sounds like a really good idea to me.
Member
commented Jan 17, 2014
 KC has one case that we probably don't care about but seems worth mentioning: superscript numerals will be normalized to normal numerals. (We probably don't care because why would you have superscript numerals in a numeric literal, but this seems like the sort of thing to be abused in a future International Obfuscated Julia Coding Contest.)
Member
 That's not totally ideal; N² is a cute variable name :)
Member
commented Jan 17, 2014
 I've actually used χ² somewhere.
Member
 We also have to avoid normalizing out different styled letters that represent different symbols in mathematics.
Contributor
 The problem with ² is that it seems to mean ^2, so maybe it's better not to encourage it.
Member
commented Jan 17, 2014
 @JeffBezanson may be referring to what UAX #15 calls font variants (see Fig. 2). They give as an example \mathfrak H vs \bbold H, but I suspect regular \phi vs script \varphi is the one that would come up fairly often. (Ironically, Github won't let me enter the characters...) So it seems that we are leaning toward canonical equivalence, as opposed to full compatibility equivalence, in which case NFD may be sufficient rather than NFKC.
Member
commented Jan 17, 2014
 For variable names, I don't see the superscript/subscript being as much of a problem, other than that, e.g., χ² will be the same identifier as χ2; if you are distinguishing these I might think you were mad.
Member
 Our use case is very different from something like a text formatter, which wants to know that superscript 2 is a 2. In a programming language any characters that look different should be considered different. We can perhaps be flexible about superscripts, but font variants of letters have to be supported.
Contributor
commented Jan 18, 2014
 The initial issue raised involved confusion over U+00B5 MICRO SIGN and U+03BC GREEK SMALL LETTER MU. Normalization type NFD would not fix this problem since U+00B5 has only a compatibility decomposition to U+03BC and not a canonical decomposition. NFKC will fix that issue. The utility at http://unicode.org/cldr/utility/transform.jsp?a=Any-NFKC%0D%0A&b=µ is useful for this.
Member
 @JeffBezanson, I'm not convinced that "characters that look different should be considered different." One problem is that, unlike LaTeX, we cannot rely on a particular font/glyph being used to render particular codepoints. U+00B5 and U+03BC look distinct in some fonts (one is rendered italic) and not in others, for example. Moreover, even when codepoints are rendered distinctly, the difference will often be subtle (χ² versus χ2) and hence an invitation for bugs and confusion. (That's why these variants work for phishing scams, after all.) I would prefer to simply state that identifiers are canonicalized to NFKC, so that only characters that look entirely distinct (as opposed to potentially slight font variations) are treated as distinct identifiers. It's useful to have variables named µ and π, but Julia shouldn't pretend that it is LaTeX.
Member
 There are several different levels of distinction being discussed:

1. "Indistinguishables". Different unnormalized but strongly equivalent forms, i.e. byte sequences that mean the same thing but are represented differently, such as the decomposed sequence U+0065, U+0301 vs. the precomposed character U+00E9.
2. "Strong confusables". Characters like μ vs. µ and other things listed here that are semantically distinct but will often cause confusion and frustration due to very similar rendering.
3. "Weak confusables". Character sequences that are normally easy to distinguish but might end up looking similar in some renderings, e.g. χ² vs. χ2.

These call for different approaches. To deal with "indistinguishables" it's pretty clear that we should just normalize them. At the other end of the spectrum, this is a pretty lousy way to deal with "weak confusables": imagine using both χ² and χ2 in some code and being really confused when they are silently treated as the same identifier! For weak confusables, I suspect the best behavior is to treat them as distinct but emit a warning if two weakly confusable identifiers are used in the same file (or scope). In the middle, strong confusables are a tougher call: both automatically normalizing them to be the same (like with indistinguishables) and warning if they appear in the same file/scope (like weak confusables) are reasonable approaches. However, I tend to favor the warning.

I've intentionally avoided Unicode terms here to keep the problem statement separate from the solution. I suspect that we should first normalize source to NFD, which takes care of collapsing "indistinguishables". Then we should warn if two identifiers are the same modulo "compatibles" and "confusables". That means that using composed and uncomposed versions of è in the same source file would just silently work, since they mean the same thing, but using both χ² vs. χ2 or ﬃ and ffi in the same file would produce a warning and then proceed to treat them as distinct.
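One way to picture these levels is a small classifier built on the standard normalization forms. This is only a sketch assuming a current Julia (1.x); `confusability` is a hypothetical helper, not an existing Julia function. Note that the standard forms cannot separate "strong" from "weak" confusables: both the µ/μ pair and the χ²/χ2 pair come out as compatibility-equivalent.

```julia
using Unicode

# Hypothetical helper: classify a pair of identifier spellings by the
# weakest normalization under which they become equal.
function confusability(a::AbstractString, b::AbstractString)
    a == b && return :identical
    Unicode.normalize(a, :NFC) == Unicode.normalize(b, :NFC) &&
        return :indistinguishable
    Unicode.normalize(a, :NFKC) == Unicode.normalize(b, :NFKC) &&
        return :compatibility_equivalent
    return :distinct
end

accent = confusability("e\u0301", "\u00e9")            # combining accent vs. precomposed
mus    = confusability("\u00b5", "\u03bc")             # micro sign vs. Greek mu
chis   = confusability("\u03c7\u00b2", "\u03c7" * "2") # chi-superscript-2 vs. chi-2
other  = confusability("I", "l")                       # merely similar-looking ASCII

println((accent, mus, chis, other))
```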
Contributor
commented Jan 18, 2014
 @StefanKarpinski Good summary! But I think you have drawn the wrong conclusion. I was once challenged to find out why 10l would compare unequal to 101 in a C program (it was more elaborate than that), and because of the font I could not find the bug. My preference would definitely be to make Julia consider all possibly ambiguous characters equal, and give a warning/error if someone uses identifiers that are considered equal because of rules 2 and 3. I do not read Unicode codepoints, I do not have a different word for ﬃ and ffi, and I can't even see the difference when I am focused on logic. To me programming is about expressing ideas, and using both ﬃ and ffi as different variables in the same scope would be the worst offence against any code style guide.
Member
 Well, that's why it should warn. Whether it considers them the same or different is somewhat irrelevant when it causes a warning. I guess one benefit of considering such things the same rather than keeping them different is ease of implementation: if the analysis is done at the file level, you can canonicalize an entire source file and warn if two "confusable" identifiers are used in the same source file and then hand the canonicalized program off to the rest of the parsing process without worrying any further. Then again, you can do the same without considering them the same by doing the confusion warning at the same step but leaving confusable identifiers different.
Member
 As a practical matter, it is far easier to implement and explain canonicalization to NFKC, taking advantage of the existing standard and utf8proc, than it would be to implement and document our own nonstandard normalization. (There are a lot of codepoints we'd have to argue over.) We can also certainly issue a warning whenever a file contains identifiers that are distinct from their canonicalized versions. (But I think it would be an unfriendly practice to issue a warning instead of canonicalizing.)
Member
 It seems unfortunate to me to canonicalize distinct characters that unicode provides specifically for their use in mathematics. Should we use a different normalization, maybe NFD, for string literals?
Member
 I don't think string literals should be normalized at all by default, although we should provide functions to do normalization if that is desired. The user should be able to enter any Unicode string they want.
Member
commented Jan 18, 2014
 +1 for what @stevengj said. There's something to be said for preserving user input as much as possible. (What if the user wants to implement a custom normalization, for example...)
Member
commented Jan 19, 2014
 Just to be perverse, let's say we normalize to NFKC, and Quaternions.jl gets renamed ℍ.jl. Then using ℍ would look for .julia/H/src/H.jl?
Member
 I've actually rampantly made the assumption that package names are ASCII largely because I think it's opening a whole can of worms to use non-ASCII characters in package names.
Member
 I'm much more concerned about identifier names. I don't think merging ℍ and H makes sense for us.
Member
 @stevengj – what about the χ² vs. χ2 issue? Your proposal silently treats them as the same, which strikes me as almost as bad as the (thus far hypothetical) problems we're trying to avoid here.
Member
 Actually, no, it's worse – at least you can look at the contents of your source file and discover that two similar looking identifiers are actually the same. If χ² and χ2 are treated as the same identifier, there's no way to figure it out short of finding the obscure appendix of the Julia manual that explains this behavior. I find that unacceptable.
Member
commented Jan 19, 2014
 I would like to point out that (on my Mac), even the strongly confusable symbols render noticeably differently. Swapping one for the other would maintain meaning, but loses a significant amount of typographic readability. I agree that this normalization should only apply to symbols (variable names), and I think it should only apply to Indistinguishables. Hopefully nobody tries to use X2, χ² and χ2 in their code, in much the same way as avoiding similar-looking names (like I vs l) is a good idea.
Member
 Everyone agrees that you shouldn't use both Ill1I1 and Il1IlI as variable names, but nobody thinks a language should silently canonicalize them to the same thing.
Member
 That seems to be what @stevengj is arguing for.
Member
 Yes, I think Julia should canonicalize ℍ to H internally. You are free to use ℍ as a variable name if you want, you just aren't free to use it as a distinct variable from H. Why is this such a loss for the language? Conceptually, this is quite a familiar thing. If I use a syntax-highlighting text editor, it might change the font of certain variables. No one thinks that this changes the meaning of the identifiers. To ordinary programmers (as opposed to Unicode geeks), a µ is a μ. I shudder to think of trying to explain this distinction to my students. (In contrast, everyone understands that I and l and 1 are distinct characters even though they look similar.)
Member
 That's not the part that's problematic. The problem is doing it silently. If you happen to have an editing environment where ℍ and H are obviously quite different, then it is completely surprising – in a way that's impossible to discover the cause of – that they are treated as the same identifier. That is not ok.
Member
 I don't know that it's surprising. My reaction would be Oh, it treats different fonts as the same identifier. I guess that makes sense. Because to ordinary people, ℍ and H are the "same character" in different fonts. (And if you're a Unicode nerd, you know about normalizations. But the vast majority of scientific programmers are not Unicode nerds.)
Contributor
 @stevengj I don't think that would be your reaction. You wouldn't even have considered that ℍ and H had anything to do with one another. Without a warning, you wouldn't even notice that two different identifiers are considered identical. See this potential example:

```
julia> ℍ = 2
[many complex lines of code]
julia> H = 1
julia> ℍ
1 # WTF?!
```
Member
 Yes, exactly. That's really not ok. If there's a warning, then you know something bad is going on.
Member
 In an editor/IDE/whatever you write your code in, you use the same font for all the code in the same window (you might change the font of course, but your changed font applies to every character in your working area). I would never expect the editor to use font A for this variable, while using font B for another. Therefore, I would expect the same name to appear exactly the same in my editor -- when they look different, they are different.
Member
 Here are my two cents: I've never encountered such problems in real coding practice, but I understand that this may become a concern in particular contexts. For such cases, I think a better way might be to provide tools to detect identifiers that might look strikingly similar and modify them with the code author's approval. Blindly treating two identifiers as the same thing just because they may look similar (e.g. H and ℍ) is, to me, a recipe for disastrous confusion. That being said, if two characters always look the same and there are virtually no ways to distinguish them visually, it might be safe to canonicalize them. But we should be conservative about this.
Member
commented Jan 19, 2014
 I wonder if Julia is really the first programming language to face such issues? I guess that many languages still stick to ascii identifiers to be safe. I know that Java has unicode identifiers, but my quick googling only turned up heated debates on whether to use unicode identifiers at all.
Member
commented Jan 19, 2014
 The Fortress programming language uses Unicode extensively, but even they have had absolutely nothing to say about normalization issues in the language specification. (pdf) From what I can tell, one usually codes the symbols as ASCII identifiers rather than inputting them directly.
Contributor
commented Jan 19, 2014
 The bold, italic, and sans-serif attributes in the mathematical variants of mu do not represent different fonts. In fact, each has a unique Unicode code point. On the other hand, the editor may substitute a character from a different font if the requested character is not available. To be specific, in Xcode I use Monaco in the editor. If I insert a Greek letter mu U+03BC, the editor actually uses the mu from Lucida Grande because that character is not available in Monaco.
Member
 Yes, that's right. Unicode doesn't care about fonts; it provides differently-styled letters precisely because they are used as distinct symbols in mathematics. If it weren't for that use case (which is our use case), those characters wouldn't exist. Many in the lisp/scheme world argue for case-insensitive identifiers because to them letter case is just a personal style choice, with the same character underneath. For example some people like to name functions in all-uppercase where they are defined and otherwise use lowercase. However, those people are wrong.
Contributor
commented Jan 19, 2014
 Just to be clear, the mathematical variants of mu (bold, italic, sans-serif) are distinct Unicode code points and can be present in the same font. On the other hand, a code editor might borrow a character from another font if it is not available in the requested font. Xcode does this. By the way, I looked more carefully at micro versus mu, and in Xcode's default font Menlo, they appear to be identical. I don't mean similar. I mean that at 288 point on the screen, overlaid on top of each other, they look identical.
Member
 I understand that different codepoints have nothing to do with choosing different fonts. I just think most people will perceive them as different fonts of the "same character".
Member
 Whatever various standard normalizations might say, I think there is a real distinction between characters that are truly identical (like the two mus), and characters that are the same abstract letter but intended to look quite different, like H vs. double-struck H. "Same character" is of course subjective and depends on the application, but in math double-struck letters are decidedly different symbols with different meanings.
Contributor
commented Jan 19, 2014
 One reasonable solution would be to restrict the set of characters in identifiers to a documented subset of Unicode. Allowing arbitrary characters in identifiers seems to be inviting problems.
Member
commented Jan 19, 2014
 That would need to be a fairly large subset though - whats the point of Unicode identifiers if you don't support the various languages of the world?
Contributor
commented Jan 20, 2014
 I would prefer restricting identifiers to ASCII characters.
Member
 For maximum portability, better just limit it to uppercase. No, to be really safe, letters A-E only.
Member
commented Jan 20, 2014
 > For maximum portability, better just limit it to uppercase

Well, one to six letters ought to be enough to name every variable you could possibly want.
Member
 For the greatest portability, we should treat all unicode letters the same and distinguish variables only by their length.
Member
commented Jan 20, 2014
 More seriously, though, we are not in a good place right now. The majority opinion (or maybe just mine) is that neither NFC nor NFKC is entirely suitable. The former will not normalize Greek mu μ and micro µ, while the latter would normalize ℍ and H, and χ² and χ2. At this point, I would suggest NFD/NFC by default, because I'm pretty sure we don't want to mess with combining diacritics regardless, and print warnings if NFKD-equivalent identifiers exist in scope. (For the default, NFD may be sufficient since we don't necessarily need to recompose the Unicode string for an identifier name, although introspection would be less pretty.) The other choice is a custom canonicalization...
Contributor
 Normalizing ℍ and H, and χ² and χ2 may not be a real issue if 1) the user does not have to see nor use the canonical form, and 2) a warning is printed when both are used in the same context.
Member
 That's exactly what I proposed. A good first-order approximation of my proposal is:

1. NFC/D normalize source code silently.
2. Warn if two NFKC/D-equivalent identifiers appear in the same file.

There may be additional character equivalences that should trigger warnings, but we can add those as they come up.
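The two steps of this proposal can be sketched as follows, assuming a current Julia (1.x) with the `Unicode` standard library. `check_identifiers` is a hypothetical helper written for illustration and is not how the Julia parser implements anything; it takes a plain list of identifier spellings rather than parsing a file.

```julia
using Unicode

# Hypothetical helper: given the identifier spellings found in one file,
# return the NFC-normalized names and the groups that collide under NFKC.
function check_identifiers(names::Vector{String})
    # Step 1: normalize to NFC silently and merge identical results.
    nfc = unique([Unicode.normalize(n, :NFC) for n in names])

    # Step 2: group the NFC names by their NFKC form.
    groups = Dict{String,Vector{String}}()
    for name in nfc
        key = Unicode.normalize(name, :NFKC)
        push!(get!(groups, key, String[]), name)
    end

    # Any group with more than one member is a potential confusion.
    collisions = [g for g in values(groups) if length(g) > 1]
    for g in collisions
        @warn "identifiers are equivalent under NFKC" identifiers = g
    end
    return nfc, collisions
end

# Two spellings of e-acute, chi-superscript-2, chi-2, and a plain name.
names = ["e\u0301", "\u00e9", "\u03c7\u00b2", "\u03c7" * "2", "x"]
nfc, collisions = check_identifiers(names)
```

In this example the two spellings of é are merged silently in step 1, while χ² and χ2 stay distinct and only trigger the warning in step 2.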
Contributor
commented Jan 20, 2014
 I would imagine that it would be easier to raise an error if different symbols "canonicalize" to the same thing (Stefan's point 2), rather than give a warning and continue. There also does not seem to be a unanimous opinion on whether we should merge the variables or keep them separate, and making it an error solves that problem. If you get a warning you should fix it anyway, and I would say sooner is better than later for that kind of thing.
Member
 I agree with stefan's step (1), but I truly don't understand the problem of ℍ vs. H. I don't think anybody uses a font that renders these the same. For program source the default should be to keep different things different. Attempting to apply all sorts of knowledge about what is and isn't "the same letter" puts a programming language in the linguistics business, where it does not belong.
Member
 I think we should do (1) and wait and see if there are actually ever any issues that necessitate (2).
added a commit to stevengj/julia that referenced this issue Jan 20, 2014
 stevengj added utf8proc to deps for #5434 2caa7e7
added a commit to stevengj/julia that referenced this issue Jan 21, 2014
 stevengj added utf8proc to deps for #5434 73c3647
added a commit to stevengj/julia that referenced this issue Jan 21, 2014
 stevengj added utf8proc to deps for #5434 b1585ac
added a commit to stevengj/julia that referenced this issue Jan 21, 2014
 stevengj canonicalize identifiers to NFC, warn if undefined symbol is not equal to its NFKC normalization (fix #5434) 8a7b777
referenced this issue Jan 21, 2014
Merged

#### Normalize all unicode identifiers to NFC #5462

added a commit to stevengj/julia that referenced this issue Jan 21, 2014
 stevengj normalize all flisp symbols to NFC (fix #5434) 7f8dd12
added a commit that closed this issue Jan 22, 2014
 stevengj normalize all flisp symbols to NFC (fix #5434) 7f8dd12
closed this in 7f8dd12 Jan 22, 2014
added a commit to stevengj/julia that referenced this issue Jan 27, 2014
 stevengj export utf8proc functionality in Julia (followup to #5462 and #5434) 5799e46
added a commit to stevengj/julia that referenced this issue Jan 27, 2014
 stevengj + stevengj export utf8proc functionality in Julia (followup to #5462 and #5434) bc7cf20
referenced this issue Jan 27, 2014
Merged

#### RFC: export utf8proc Unicode transformation functionality in Julia #5576

added a commit to stevengj/julia that referenced this issue Jan 27, 2014
 stevengj + stevengj export utf8proc functionality in Julia (followup to #5462 and #5434) 59b0f18
added a commit to stevengj/julia that referenced this issue Jan 29, 2014
 stevengj + stevengj export utf8proc functionality in Julia (followup to #5462 and #5434) bb70a9a
added a commit that referenced this issue Feb 1, 2014
 stevengj export utf8proc functionality in Julia (followup to #5462 and #5434) 9e5ce63
added a commit to stevengj/julia that referenced this issue Feb 1, 2014
 stevengj export utf8proc functionality in Julia (followup to #5462 and #5434) 6039a46
referenced this issue Feb 12, 2014
Closed

#### Profile.print() throws exception if function name has Greek characters #5769

added the unicode label Feb 22, 2014
referenced this issue Feb 22, 2014
Open

#### Parse a minimal set of fullwidth punctuation as synonyms #5903

Member
 @IainNZ's blog post, for future reference: http://iaindunning.com/2014/julia-unicode.html. In particular, I think that it will be useful to consider Python's corresponding discussion: http://legacy.python.org/dev/peps/pep-3131/.
Closed

#### add custom JULIA normalization? JuliaLang/utf8proc#11

Member
commented Jul 14, 2016
 Perhaps Lint is the right place to catch this.
added a commit to ilammy/sabre that referenced this issue Nov 19, 2016
 ilammy Normalize identifiers to NFKC

This is a somewhat controversial thing, but I have made a decision: identifiers should be normalized to fold visual ambiguity, and the normalization form should be NFKC.

Rationale:

1. Compatibility decomposition is favored over canonical one because it provides useful folding for letter ligatures, fullwidth forms, certain CJK ideographs, etc.
2. Compatibility decomposition is favored over canonical one because it provides more protection from visual spoofing.
3. Standard Unicode transformation should be favored over anything ad-hoc because it's predictable and more mature.
4. Normalization is a compromise between freedom of expression and ease of implementation. Source code is not prose, there are rules.

Here are some references to other languages:

* SRFI 52: http://srfi-email.schemers.org/srfi-52/
* Julia: JuliaLang/julia#5434
* Python: http://bugs.python.org/issue10952
* Rust: rust-lang/rust#2253

Unfortunately, there aren't very many precedents and open discussions about Unicode usage in programming languages, especially in languages with very permissive identifier syntax (like Scheme).

Aside from identifiers there are more places where Unicode can be used:

* Characters are not normalized, not even to NFC. This may have been useful, for example, to recompose combining marks, but unfortunately NFC may do more transformations than that, so it is no go. We preserve the exact Unicode character.
* Character names, on the other hand, are case-sensitive identifiers, so they are normalized as such.
* Strings and escaped identifiers are left untouched in order to preserve the exact spelling as in the source code.
* Directives are case-insensitive identifiers and are normalized as such.
* Numbers should be composed from ASCII only so they are not normalized. Sometimes this produces weird parses because characters that look like signs are not treated as such. However, these characters are invalid in numbers, so it's somewhat justified.
* Peculiar identifiers are shit. I'm sorry. Because of NFKC it is possible to write a plain, unescaped identifier that will parse as a number after going through NFKC. It may even look exactly like a number without being one. There is not much we can do about this, so we produce a warning just in case.
* Datum labels are mostly numbers, so they are not normalized as well. Note that sometimes they can be treated as numbers with invalid prefix.
* Comments are ignored.
* Delimiters should be ASCII-only. No discussion on this. Unicode has various fancy whitespaces and line separators, but this is source code, not a rich text document in a word processor.

Also, currently case-folding is performed only for ASCII range. Identifiers should use NFKC_casefold transformation. It will be implemented later. 505faba
This was referenced Nov 30, 2016
Merged

Closed