Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC: allow operator suffixes — combining characters and primes #22089

Merged
merged 10 commits into from
Sep 20, 2017

Conversation

stevengj
Copy link
Member

@stevengj stevengj commented May 26, 2017

This PR implements something that I had been hoping to do for a while (see #6929 (comment)): custom operators can be defined by appending Unicode combining characters, primes, and sub/superscripts to other operators.

For example, and +″ are now parsed as binary operators with the same precedence as +.

Rationale: this allows you to define an operator that is clearly a "modified +" (etc.) without having to dig through Unicode for some vaguely appropriate symbol, and without overriding + itself. Also, it is pretty inconceivable that could be anything other than an infix operator, so it is a choice between supporting it or giving an error, and I don't see why an error would be useful.

(Note: combining characters with operators, e.g. , don't show up properly in some fonts. It should look like
image
)

@stevengj stevengj added parser Language parsing and surface syntax domain:unicode Related to unicode characters and encodings labels May 26, 2017
@stevengj
Copy link
Member Author

stevengj commented May 26, 2017

It would also be possible to support a whitelist of superscripts and subscripts. For example, if we wanted +⁽¹⁾ or *ₐ to parse as operators.

Unicode already has a couple of oddball examples of such operators, e.g. U+2a27 is , which we already parse as a +-like operator.

@dpsanders
Copy link
Contributor

Great! ⁻¹ would be very nice.

NEWS.md Outdated
@@ -4,6 +4,14 @@ Julia v0.7.0 Release Notes
New language features
---------------------

* `getpeername` on a `TCPSocket` returns the address and port of the remote
endpoint of the TCP connection ([#21825]).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bad merge resolution

@stevengj
Copy link
Member Author

(Still working on some fixes to this PR. In particular, I'm updating it to blacklist many syntactic "operators" like ? and ' and : from suffixes, so that basically only ordinary binary operators are allowed to have suffixes.)

@stevengj
Copy link
Member Author

stevengj commented Jun 1, 2017

(The annoyance with supporting superscripts and subscripts is that these codepoints are scattered all over unicode; the only way to detect them is to make a manual table.)

@stevengj
Copy link
Member Author

stevengj commented Jun 1, 2017

This seems to be the list of the 93 Latin/Greek/math super/subscripts in Unicode, sorted by codepoint: ²³¹ʰʲʳʷʸˡˢˣᴬᴮᴰᴱᴳᴴᴵᴶᴷᴸᴹᴺᴼᴾᴿᵀᵁᵂᵃᵇᵈᵉᵍᵏᵐᵒᵖᵗᵘᵛᵝᵞᵟᵠᵡᵢᵣᵤᵥᵦᵧᵨᵩᵪᶜᶠᶥᶦᶫᶰᶸᶻᶿ ⁰ⁱ⁴⁵⁶⁷⁸⁹⁺⁻⁼⁽⁾ⁿₐₑₒₓₕₖₗₘₙₚₛₜⱼⱽ. Correction: whoops, forgot the numeric subscripts.

@stevengj
Copy link
Member Author

stevengj commented Jun 1, 2017

Rebased. Tests were green before the rebase, so it should be good to squash+merge once others approve.

@stevengj
Copy link
Member Author

stevengj commented Jun 5, 2017

@StefanKarpinski, any chance of a decision on this?

@stevengj
Copy link
Member Author

Seems like everyone agrees with this change in principle, and it is just a matter of approving the implementation. @JeffBezanson?

@stevengj
Copy link
Member Author

Rebased and fixed whitespace problem introduced by last merge.

@@ -54,9 +58,25 @@
(lambda (x)
(has? t x))))))

; only allow/strip suffixes for some operators
(define no-suffix? (Set (append prec-assignment prec-conditional prec-lazy-or prec-lazy-and
Copy link
Sponsor Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this be a whitelist instead?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was a bit easier as a blacklist because otherwise I'd have to split e.g. prec-arrow into a couple of separate lists rather than just explicitly listing -- --> -> here.

@@ -68,7 +88,9 @@
(pushprec (cdr L) (+ prec 1)))))
(pushprec (map eval prec-names) 1)
t))
(define (operator-precedence op) (get prec-table op 0))
(define (operator-precedence op) (get prec-table
(maybe-strip-op-suffix op)
Copy link
Sponsor Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Testing for operators can be pretty important for performance. It would be nice to only call maybe-strip-op-suffix for precedence levels that support it.

Copy link
Member Author

@stevengj stevengj Jun 13, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was hoping that maybe-strip-op-suffix would be fast enough, because in the common case strip-op-suffix (implemented in C) is a no-op and no-suffix? isn't even called.

Is there any way to benchmark the impact of this?

Copy link
Sponsor Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Try parsing e.g. string(:[$((:(a+b) for i=1:10000)...)])

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried benchmarking parse(s) for s = string(:[$((:(a+b) for i=1:10000)...)]), and removing the maybe-strip-op-suffix call from operator-precedence makes no detectable difference on my machine, so operator-precedence doesn't seem to be a problem.

However, there seems to be about an 8% slowdown overall in that benchmark compared to before this PR, so there must be something else in this PR that is the culprit.

test/parse.jl Outdated
@test parse("3 +⁽¹⁾ 4") == Expr(:call, :+⁽¹⁾, 3, 4)
@test parse("3 +₍₀₎ 4") == Expr(:call, :+₍₀₎, 3, 4)
@test Base.operator_precedence(:+̂) == Base.operator_precedence(:+)

Copy link
Sponsor Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should add some cases of suffixes on operators that don't allow them.

Copy link
Member Author

@stevengj stevengj Jun 13, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do. Done.

@stevengj stevengj added the kind:potential benchmark Could make a good benchmark in BaseBenchmarks label Jun 15, 2017
@stevengj
Copy link
Member Author

This PR causes a slight slowdown in the parser. In particular, replacing Set with SuffSet for the various is-prec-foo? functions causes about an 8% slowdown in parse Jeff's artificial benchmark from above, and about a 5% slowdown for a more realistic benchmark (parsing about 20000 lines from base).

Is this a concern? Any suggestions?

@JeffBezanson
Copy link
Sponsor Member

Yes, I think we should try using SuffSet only for precedence levels that need it.

@JeffBezanson
Copy link
Sponsor Member

Ah, just saw your optimization. That looks good. How much does it help?

@stevengj
Copy link
Member Author

stevengj commented Jun 15, 2017

@JeffBezanson, unfortunately, that optimization hardly makes a difference (7% slowdown instead of 8%).

@stevengj
Copy link
Member Author

I don't really have a good handle on performance optimization for flisp.

@stevengj
Copy link
Member Author

stevengj commented Jul 6, 2017

Any thoughts on how I can further speed up parsing? Or whether we should just swallow the 5% parsing slowdown on realistic code and worry about parser optimization later?

@StefanKarpinski
Copy link
Sponsor Member

I'm all for swallowing the 5% slowdown for this.

@stevengj
Copy link
Member Author

Should be ready to merge if we decide we want it.

(define (maybe-strip-op-suffix op)
(if (symbol? op)
(let ((op_ (strip-op-suffix op)))
(if (or (eqv? op op_) (no-suffix? op_))
Copy link
Sponsor Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use eq? here?

Copy link
Member Author

@stevengj stevengj Jul 19, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. (Makes < 1% difference in the benchmark.)

(let ((S (Set l)))
(if (every no-suffix? l)
S ; suffixes not allowed for anything in l
(lambda (op) (S (maybe-strip-op-suffix op))))))
Copy link
Sponsor Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe try splitting l into operators that do/don't support suffixes, and testing (or (no-suff-set op) (suff-set (maybe-strip-op-suffix op))) (depending on which of the sets are non-empty of course).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tried this; it makes < 1% difference in the benchmark.

@StefanKarpinski
Copy link
Sponsor Member

Whoever merges, please remember to squash!

@stevengj
Copy link
Member Author

Looks like an unrelated stalled build on Travis.

@@ -96,7 +96,7 @@ Operators like `+` are also valid identifiers, but are parsed specially. In some
can be used just like variables; for example `(+)` refers to the addition function, and `(+) = f`
will reassign it. Most of the Unicode infix operators (in category Sm), such as `⊕`, are parsed
as infix operators and are available for user-defined methods (e.g. you can use `const ⊗ = kron`
to define `⊗` as an infix Kronecker product).
to define `⊗` as an infix Kronecker product). Operators can also be suffixed with modifying marks, primes, and sub/superscripts, e.g. `+̂ₐ″` is parsed as an infix operator with the same precedence as `+`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would be good to conform to the line length convention of the rest of the file

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

@stevengj
Copy link
Member Author

stevengj commented Sep 20, 2017

Bump. Okay to squash/merge?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
domain:unicode Related to unicode characters and encodings kind:potential benchmark Could make a good benchmark in BaseBenchmarks parser Language parsing and surface syntax
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants