support many more Unicode infix operators #6929

Merged
merged 3 commits into from
May 27, 2014


stevengj
Member

This patch adds support for a much larger number of infix operators, based on my comment in #6582: I went through category Sm manually and pulled out a list of code points that seemed (a) unambiguously infix and (b) had a clear analogy to existing operators so that a reasonable precedence choice could be made.

I don't actually provide definitions for any of the new operators, but now they are available for the user to add methods if she wants to.

It also adds the synonym ∛ for cbrt, in analogy to √ for sqrt.

cc: @jiahao, @JeffBezanson

@JeffBezanson
Sponsor Member

Feels kind of...profligate.

You have some duplicates in the operator lists, where we already had some unicode operators. Won't really cause a problem, but untidy.

@JeffBezanson
Sponsor Member

  • We might want full-width operators to be normalized, based on usage in Asian scripts.
  • I don't know if ∣ should be banned outright, but the confusion with | is certainly enough to keep it out of Base.
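One possible way to normalize full-width operators is Unicode compatibility normalization (NFKC), sketched here in Python. This is an illustration only: the thread does not say which normalization the parser would use, and `normalize_operator` is a hypothetical helper name.

```python
import unicodedata

def normalize_operator(token):
    """Fold compatibility variants (e.g. full-width forms) to their base characters.

    Assumption: NFKC is the chosen normalization; the thread does not specify one.
    """
    return unicodedata.normalize("NFKC", token)

# Full-width plus, less-than and equals fold to their ASCII counterparts.
fullwidth_plus = "\uFF0B"
fullwidth_less = "\uFF1C"
fullwidth_equals = "\uFF1D"
folded = [normalize_operator(c) for c in (fullwidth_plus, fullwidth_less, fullwidth_equals)]

# DIVIDES (U+2223) has no compatibility mapping, so it is NOT folded to "|".
divides = "\u2223"
still_divides = normalize_operator(divides)
```

Normalization of this kind only handles characters that Unicode itself marks as compatibility variants; visually confusable characters such as ∣ and | would still need a separate policy.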

@JeffBezanson
Sponsor Member

The n-ary big operators like ⋂ were recently added as identifier characters. I imagine them being used somewhat like ⋂(f(x) for x in y) and not as infix operators, in which case they can just be identifiers.

@stevengj
Member Author

Sure, I can try to remove some of the most similar entries if we don't want to support easily confusable infix operators (although we don't bother to do this for other identifiers like µ and μ). I guess that since this is a whitelist, it makes more sense to be choosy here.

(And yes, there are a few harmless duplications that I was too lazy to remove; I wanted to get a general sense of whether we wanted to do this first.)

My feeling is that it is nice to make a rich set of operators available for users to add methods to, even though we won't use most of them in Base, and once you decide to do this it's hard to make a sensible criterion for a "non-profligate" set of operators.

@jiahao
Member

jiahao commented May 23, 2014

Syntax like {x+1 ∣ x ∈ S} would be sweet though.

@JeffBezanson
Sponsor Member

I agree with the overall idea. These characters are definitely infix operators, and the only things you can do with them are disallow them or parse them sensibly.

@stevengj
Member Author

I've updated the patch to remove duplicates and near-duplicates (e.g. operators that differ only in size). I also changed the left/right arrows to prec-arrow rather than prec-assignment.

@stevengj
Member Author

We should also really modify bin_ops_by_prec in show.jl to somehow get its list from julia-parser.scm, but I feel like that should be a separate PR.

(Not sure what the best way to do that would be, maybe define bin_ops_by_prec in C?)

@JeffBezanson
Sponsor Member

We can add a C API call to fetch the operator table via scm_to_julia.

@stevengj
Member Author

Hmm, random Travis error with clang but not gcc. Looks unrelated?

@JeffBezanson
Sponsor Member

Yes, unrelated but very troubling :)

@StefanKarpinski
Sponsor Member

I'm curious, @stevengj, how you decided which operators got plus-like precedence versus times-like precedence? Some are obvious – ± and ∓ – but many are not. Since subsequent changes to precedence are likely to break code, these seem like they shouldn't be chosen too cavalierly.

@stevengj
Member Author

@StefanKarpinski, when it wasn't obvious from the shape, I just went with their documented meaning in the Unicode standard: any operator documented as a product, conjunction, intersection, or division of some kind (e.g. ⋋ is left semidirect product) got times precedence, while any operator documented as an addition/subtraction, logical-or, or union of some kind was given plus-like precedence.
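The rule described above can be approximated mechanically from the official character names. The following Python sketch is only a rough illustration: the keyword sets and the `precedence_class` helper are assumptions made for this example, whereas the actual list in the patch was assembled by hand.

```python
import re
import unicodedata

# Hypothetical keyword sets, loosely following the rule described in the comment above.
TIMES_WORDS = {"PRODUCT", "TIMES", "MULTIPLICATION", "INTERSECTION", "DIVISION"}
PLUS_WORDS = {"PLUS", "MINUS", "UNION"}

def precedence_class(ch):
    """Return "times", "plus", or None when the character name gives no clear answer."""
    if unicodedata.category(ch) != "Sm":
        return None  # only math symbols (category Sm) are considered
    name = unicodedata.name(ch, "")
    words = set(re.split(r"[ -]", name))
    # "AND"/"OR" only count in the phrase "LOGICAL AND"/"LOGICAL OR", so that
    # names like "LESS-THAN OR EQUAL TO" are not misread as a logical-or.
    is_times = bool(words & TIMES_WORDS) or "LOGICAL AND" in name
    is_plus = bool(words & PLUS_WORDS) or "LOGICAL OR" in name
    if is_times == is_plus:
        return None  # no match, or conflicting matches: leave the operator out
    return "times" if is_times else "plus"

examples = {ch: precedence_class(ch) for ch in "⋋⊕∪∩∨∧≪≤"}
```

Characters for which the name matches neither group (or both) come back as `None`, mirroring the choice to leave out operators whose precedence seemed unclear.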

I used this list of category-Sm code points, which helpfully gives the name of each code point.

Operators whose precedence seemed unclear I left out. Did I include any operators whose precedence you found unclear?

@StefanKarpinski
Sponsor Member

Oh, no – they just weren't all obvious to me, but that seems like a very sane way to do it.

@stevengj
Member Author

Note that this basically fixes #552.

…hould have * precedence despite looking like a union; change NEWS table to only explicitly list operators that are predefined
@stevengj
Member Author

Another thing that I was thinking of implementing, possibly in a separate PR, is:

  • Allow every operator (except for a small blacklist) to have a variant starting with a dot, e.g. allowing ≪ automatically gives you .≪.
  • Allow every single-character operator to take suffixes consisting of combining characters (categories Me and Mn), primes, and possibly a few other characters (sub/superscripts?), e.g. allowing ⊗ automatically gives you ⊗′ and ⊗̃.

Similar to Jeff's remark above, there is no question that e.g. +̂ is an infix operator, so the only things you can do are either to disallow it or to parse it sensibly, and there is no reason that I can see not to parse it sensibly (e.g. the precedence is obvious). Similarly, if we are going to allow < and .<, then it doesn't make sense to me to allow ≪ but not .≪ etcetera.

This should be pretty easy to implement: you simply strip off any . prefix and any allowed suffix before checking whether the operator is in the allowed Set. It still obeys the rule that every prefix of an operator is also an operator, and will simplify the operator list because we no longer need to list .==, .* etcetera explicitly.
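The strip-then-look-up idea can be sketched as follows, in Python rather than the parser's own Scheme. Everything named here (`BASE_OPERATORS`, `EXTRA_SUFFIX_CHARS`, `base_operator`, `is_operator`) is hypothetical, the operator set is a tiny stand-in for the real list, and the dot-prefix blacklist mentioned above is omitted for brevity.

```python
import unicodedata

# Tiny stand-in for the parser's allowed-operator set (assumption, not the real list).
BASE_OPERATORS = frozenset({"+", "-", "*", "/", "<", "==", "≪", "⊗"})
# Non-combining characters also accepted as suffixes; here just PRIME (U+2032).
EXTRA_SUFFIX_CHARS = frozenset({"\u2032"})

def is_suffix_char(ch):
    """Combining marks (categories Mn and Me) and a few extra characters such as primes."""
    return unicodedata.category(ch) in ("Mn", "Me") or ch in EXTRA_SUFFIX_CHARS

def base_operator(token):
    """Strip a leading dot and trailing suffix characters; return the base operator or None."""
    core = token[1:] if token.startswith(".") else token
    end = len(core)
    while end > 1 and is_suffix_char(core[end - 1]):
        end -= 1
    base = core[:end]
    # Suffixes are only allowed on single-character operators.
    if end < len(core) and len(base) != 1:
        return None
    return base if base in BASE_OPERATORS else None

def is_operator(token):
    return base_operator(token) is not None
```

With this shape, dotted forms such as `.==` no longer need their own entries in the set: `is_operator(".==")` succeeds because `==` is listed, and `is_operator("⊗\u2032")` succeeds because `⊗` is.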

I took a stab at implementing this in the parser, but I ran into trouble because of an apparent oddity in flisp's string processing. Never mind: I see that string.char takes a byte index, not a character index, and I'm supposed to step through the string with string.inc.
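The byte-index versus character-index distinction is easy to trip over with multi-byte operators. As a rough analogy in Python (not flisp; `next_index` and `chars_by_byte_index` are hypothetical helpers that only mimic the stepping pattern), walking a UTF-8 string by byte offset looks like this:

```python
def next_index(data, i):
    """Advance from byte offset i to the start of the next UTF-8 encoded character."""
    i += 1
    # Continuation bytes have the bit pattern 10xxxxxx; skip over them.
    while i < len(data) and (data[i] & 0xC0) == 0x80:
        i += 1
    return i

def chars_by_byte_index(text):
    """Return (byte_offset, character) pairs for a string encoded as UTF-8."""
    data = text.encode("utf-8")
    pairs = []
    i = 0
    while i < len(data):
        j = next_index(data, i)
        pairs.append((i, data[i:j].decode("utf-8")))
        i = j
    return pairs

# "≪" (U+226A) occupies three bytes in UTF-8, so the "." after it starts at offset 3.
offsets = chars_by_byte_index("≪.")
```

Incrementing the index by one instead of stepping to the next character boundary would land in the middle of `≪`, which is the kind of confusion described above.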

@stevengj
Member Author

Okay, I was able to put together a sample implementation of the above suggestion for operators+combining characters. It required a few more changes to the parser, though, so I'll leave it for a separate PR when(?) this one is merged.

@JeffBezanson
Sponsor Member

I think the "big" N-ary operators should not be infix.

@stevengj
Member Author

@JeffBezanson, I thought I got rid of the big N-ary operators; which ones did I miss?

@JeffBezanson
Sponsor Member

U+2A00 (⨀), the big circled operators.

@stevengj
Member Author

⨀ is not in the list. Looks like I left in though; will fix.

@JeffBezanson
Sponsor Member

Ah, I confused it with U+29BF CIRCLED BULLET. Gotta love unicode...

@stevengj
Member Author

Does the operator? predicate need to be replaced with a hash table?

@JeffBezanson
Sponsor Member

Yes, that's a very good idea.
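As a rough illustration of the hash-table idea (in Python, not the parser's Scheme), a single table can be built once from the per-precedence lists and then serve both membership tests and precedence lookups. The lists below are small made-up samples and `build_operator_table` is a hypothetical helper; it also reports duplicate entries of the kind mentioned earlier in the thread.

```python
def build_operator_table(ops_by_prec):
    """Map each operator to (level, name); collect operators listed more than once."""
    table = {}
    duplicates = []
    for level, (name, ops) in enumerate(ops_by_prec):
        for op in ops:
            if op in table:
                duplicates.append(op)  # keep the first entry, report the repeat
            else:
                table[op] = (level, name)
    return table, duplicates

# Made-up sample lists, ordered from lowest to highest precedence.
OPS_BY_PREC = [
    ("comparison", ["<", "==", "≪"]),
    ("plus", ["+", "-", "⊕", "±"]),
    ("times", ["*", "/", "⊗", "⊕"]),  # "⊕" repeated on purpose to show detection
]

TABLE, DUPLICATES = build_operator_table(OPS_BY_PREC)

def is_operator(token):
    """Constant-time membership test instead of scanning every list."""
    return token in TABLE
```

Because the table is derived from one source, anything else that needs the same information (for example a printer that must know precedences) could read it rather than keep a second hand-maintained copy.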

JeffBezanson added a commit that referenced this pull request May 27, 2014
support many more Unicode infix operators
@JeffBezanson JeffBezanson merged commit b78d9b4 into JuliaLang:master May 27, 2014
stevengj referenced this pull request in mbauman/julia May 30, 2014
If there is no whitespace between the nearest `\` and the cursor, try to complete a latex symbol or its name *instead* of a Julian name.  This allows for interactive discovery of latex names, but whitespace is required for completion of a Julia name. Note that if these completions were instead *appended* to the Julia options, they have to display without the leading \. I found that to be confusing when mixed in with the Julian names.

If the word matches a latex name exactly, it replaces it with the symbol. Otherwise, it attempts to complete the latex names. While there are some names that are prefixes to other names, I don't find this to be too jarring. It does effectively "shadow" the longer names, making them harder to discover.
@stevengj stevengj mentioned this pull request May 9, 2016
5 tasks
@stevengj stevengj deleted the uni_ops branch October 6, 2017 17:07
Labels
domain:unicode Related to unicode characters and encodings