add custom JULIA normalization? #11

Closed
stevengj opened this issue Jul 18, 2014 · 17 comments · Fixed by #89

Comments

@stevengj
Member

For JuliaLang/julia#5903. If utf8proc can have LUMP, then libmojibake can have JULIA. Unless we want to keep this separate from libmojibake.

@tonyhffong

+1

@StefanKarpinski
Member

Yeah, it seems like we really need this. Unfortunately, the standardized normalizations just don't cut it.

@tonyhffong

A JULIA mode in base/utf8proc.jl or in libmojibake? It seems utf8proc is better, though we would probably need to apply it before feeding code to julia-parser.scm.

@StefanKarpinski
Member

How realistic is it to actually upstream all the changes we've made to utf8proc? I would guess that a new normalization mode would be fairly easy to keep separate from other changes.

@tonyhffong

What about this example (referring to the proposed new brackets in Julia):

a1 = ⟨c,d⟩ # canonical \langle and \rangle
a2 = ⟪c,d⟫ # using \lAngle and \rAngle (legibility preference)
a3 = ⟪c,d⟩ # unmatched brackets should throw parse error
b1 = "⟪c,d⟫"
b2 = "❰c,d❱" # dingbat angular brackets
b3 = "〈c,d〉" # full-width angular brackets U+3008, U+3009

I'd prefer normalizing the angular brackets for a1 and a2 so they parse, and leave the chars in the literal strings untouched.

This means the lexer/parser needs to control the normalization, at least for syntactically important symbols. Or is there a hook for that already?

@StefanKarpinski
Member

This could be handled without the parser needing to know about it by having a mapping from each bracket to its partner and raising an error if the parser finds two brackets that aren't really a pair. If the Unicode code points of a pair are always near each other, the check could just be for that. Of course, this still implies that normalization has to happen after that check, and thus after lexing at least. So the sequence would be: lex, check, normalize, parse. That seems like a lot of trouble to prevent people from using unpaired Unicode brackets that happen to look similar. Maybe not worth it.
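
A minimal Python sketch of the "mapping from brackets to their pair" check described above. The table contents and the names `BRACKET_PAIRS`, `check_brackets`, and `adjacent_codepoints` are hypothetical, chosen for illustration; this is not code from the PR or from utf8proc.

```python
# Hypothetical table: each opening bracket maps to its one valid closer.
BRACKET_PAIRS = {
    "\u27e8": "\u27e9",  # MATHEMATICAL LEFT / RIGHT ANGLE BRACKET
    "\u27ea": "\u27eb",  # MATHEMATICAL LEFT / RIGHT DOUBLE ANGLE BRACKET
    "(": ")",
    "[": "]",
}
CLOSERS = set(BRACKET_PAIRS.values())


def check_brackets(tokens):
    """Raise SyntaxError if an opener is closed by a closer that is not its pair."""
    stack = []
    for tok in tokens:
        if tok in BRACKET_PAIRS:
            stack.append(tok)
        elif tok in CLOSERS:
            if not stack or BRACKET_PAIRS[stack.pop()] != tok:
                raise SyntaxError(f"unmatched bracket {tok!r}")
    if stack:
        raise SyntaxError(f"unclosed bracket {stack[-1]!r}")


def adjacent_codepoints(opener, closer):
    """The cheaper heuristic: the closer's code point directly follows the opener's."""
    return ord(closer) - ord(opener) == 1
```

With this table, `check_brackets(list("⟨c,d⟩"))` passes and `check_brackets(list("⟪c,d⟩"))` raises. The adjacency heuristic holds for the Unicode pairs in the example, but it would need a looser bound for ASCII `[` and `]`, which are two code points apart.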

@tonyhffong

It isn't so bad, actually. The (lex, check) part of that is already in place in my PR, albeit done manually and most likely not exhaustively. I came here wondering whether some of that work could be off-loaded to utf8proc, but it probably requires way too much finessing. So perhaps just an incremental change like this would work:

  • lex, check (with hand-rolled UTF-8 normalization for brackets and perhaps some critical symbols, like '=' and ':')
  • normalize (using utf8proc) any identifier tokens to impose our view of confusable symbols
  • parse, as usual
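
A rough Python sketch of how those steps could fit together, so that brackets in code are folded while string literals stay untouched. Everything here is an assumption for illustration: the toy lexer, the `BRACKET_FOLD` and `IDENT_FOLD` tables, and the choice of which characters to fold are hypothetical, and the character-level identifier fold merely stands in for whatever utf8proc would do. The bracket-pair check would run between `lex` and `normalize_tokens`; it is omitted here to keep the sketch short.

```python
import re

# Hypothetical fold for syntactic symbols: look-alike brackets to the canonical pair.
BRACKET_FOLD = {
    "\u27ea": "\u27e8",  # left double angle bracket -> left angle bracket
    "\u27eb": "\u27e9",  # right double angle bracket -> right angle bracket
}

# Hypothetical confusables fold for identifiers: MICRO SIGN -> GREEK SMALL LETTER MU.
IDENT_FOLD = str.maketrans({"\u00b5": "\u03bc"})

# Toy lexer: a double-quoted string literal, a word, a whitespace run,
# or any other single character.
TOKEN_RE = re.compile(r'"(?:[^"\\]|\\.)*"|\w+|\s+|.', re.S)


def lex(source):
    return TOKEN_RE.findall(source)


def normalize_tokens(tokens):
    out = []
    for tok in tokens:
        if tok.startswith('"'):
            out.append(tok)                        # string literal: left untouched
        elif tok in BRACKET_FOLD:
            out.append(BRACKET_FOLD[tok])          # syntactic symbol: hand-rolled fold
        elif tok[0].isalpha() or tok[0] == "_":
            out.append(tok.translate(IDENT_FOLD))  # identifier: confusables fold
        else:
            out.append(tok)
    return out


def normalize_source(source):
    # A real pipeline would hand the tokens to the parser instead of rejoining them.
    return "".join(normalize_tokens(lex(source)))
```

Applied to the earlier example, `a2 = ⟪c,d⟫` would come out as `a2 = ⟨c,d⟩`, while `b1 = "⟪c,d⟫"` would pass through unchanged.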

@stevengj
Member Author

stevengj commented Nov 5, 2014

@StefanKarpinski, the changes so far aren't too radical. The first obstacle is that we need to get copyright assignments from all of the contributors in order for upstream to consider a patch. After that, I don't know what their patch-review process will be like, but I'm guessing it will be a bit on the slow side based on past interactions.

@nalimilan
Member

@stevengj Have you been able to get a reply from them? I didn't get any. I can help ask for the copyright assignments if that would be useful; I'd rather not have to package libmojibake in addition to utf8proc in Fedora. :-)

@StefanKarpinski
Member

I think that copyright assignment is not a good idea; hopefully a contributor license agreement is all they actually require. Copyright assignment isn't even legally valid in many countries, e.g. Germany.

@stevengj
Member Author

stevengj commented Nov 5, 2014

You're right, Stefan, it actually seems to be just a contributor license.

@stevengj
Member Author

(Update: the current changes in libmojibake, mainly Unicode-7 support, have been submitted upstream with CLAs.)

@jiahao
Collaborator

jiahao commented Jul 1, 2015

A quick note that Unicode provides a list of confusable characters as part of UAX 39, which also provides a list of recommendations for characters in identifier names given security concerns.

@stevengj
Member Author

stevengj commented Jul 1, 2015

@jiahao, I think we explicitly decided to reject these recommendations, along with NFKC normalization, in JuliaLang/julia#5434, in order to distinguish a wider array of mathematical symbols (e.g. 𝐇 vs. H) and to allow things like x⁽²⁾ as identifiers. So, we are on our own in deciding whether to normalize e.g. fullwidth Latin letters.
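
To make the trade-off concrete, here is a small Python check using the standard library's `unicodedata` module (not utf8proc). It shows that NFKC collapses exactly the distinctions mentioned above, while NFC leaves them alone. The helper name `collapsed_by_nfkc` is hypothetical.

```python
import unicodedata

SAMPLES = {
    "bold H": "\U0001d407",                 # MATHEMATICAL BOLD CAPITAL H
    "fullwidth H": "\uff28",                # FULLWIDTH LATIN CAPITAL LETTER H
    "superscripts": "x\u207d\u00b2\u207e",  # x, superscript "(", "2", ")"
}


def collapsed_by_nfkc(text):
    """True if NFKC changes text that NFC leaves as it is."""
    nfc = unicodedata.normalize("NFC", text)
    nfkc = unicodedata.normalize("NFKC", text)
    return nfc == text and nfkc != text


for label, text in SAMPLES.items():
    nfkc = unicodedata.normalize("NFKC", text)
    print(f"{label}: collapsed by NFKC = {collapsed_by_nfkc(text)}, NFKC form = {nfkc!r}")
```

All three samples are changed by NFKC: the bold and fullwidth letters both become a plain `H`, and the superscript identifier becomes `x(2)`, which is no longer a single identifier.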

@stevengj
Member Author

Upon reflection, I think the best thing would be to make this pluggable, by allowing the caller to supply a custom mapping function that is applied to the codepoints after normalization.
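
One way to picture the pluggable design, as a Python sketch rather than the library's actual C interface: the caller passes a function from code point to code point, and it is applied to each code point after the standard normalization. The names `normalize_with` and `fold_fullwidth` are hypothetical, and folding fullwidth forms is just one example of a policy a caller might choose.

```python
import unicodedata


def normalize_with(text, custom_map=None, form="NFC"):
    """Normalize text, then apply the caller's code point mapping, if any."""
    normalized = unicodedata.normalize(form, text)
    if custom_map is None:
        return normalized
    return "".join(chr(custom_map(ord(ch))) for ch in normalized)


def fold_fullwidth(codepoint):
    """Example caller policy: map fullwidth ASCII variants (U+FF01..U+FF5E) to ASCII."""
    if 0xFF01 <= codepoint <= 0xFF5E:
        return codepoint - 0xFEE0
    return codepoint
```

For example, `normalize_with("Ｈello", fold_fullwidth)` would give `"Hello"`, while `𝐇` stays distinct because the mapping leaves it alone. The library itself then needs no opinion on which confusables to merge.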

@StefanKarpinski
Member

Providing a "reasonable" set of confusable mathematical characters won't be too crazy though.

@stevengj
Member Author

Closed by #89.

5 participants