EOF_CHAR isn't the common case, moving those first #365

Open
wants to merge 6 commits into main

Conversation

PallHaraldsson

No description provided.

@PallHaraldsson
Author

I was looking into the parser, wondering whether LittleDict might be faster there than Dict. Maybe not, since it doesn't seem speed-critical.

I know the parser is hardly speed-critical compared to the compiler/optimizer, or at least it used to be; now more code will be precompiled.

It doesn't seem to hurt to rearrange here; this is likely compiled to branches (mispredicted branches are slow). A LittleDict of some common ASCII letters might be even faster. Note, this isn't tested or benchmarked, but I tried to be very careful not to screw up the copy-pasting.
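
Roughly the kind of reordering meant above, as a minimal self-contained sketch (not the actual _next_token code from src/tokenize.jl, and using Base.is_id_start_char as a stand-in predicate): common character classes are tested first and the rare EOF sentinel last, since a long if/elseif chain is typically compiled to sequential branches.

const EOF_CHAR = typemax(Char)  # sentinel; assumption that the lexer uses something like this

function classify(c::Char)
    if Base.is_id_start_char(c)               # common: identifiers and keywords
        :identifier
    elseif isdigit(c)                         # common: numeric literals
        :number
    elseif c == ' ' || c == '\t' || c == '\n' # common: whitespace
        :whitespace
    elseif c == EOF_CHAR                      # rare: hit once per input, so checked last
        :endmarker
    else
        :other                                # operators, punctuation, ...
    end
end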

@codecov

codecov bot commented Oct 9, 2023

Codecov Report

Merging #365 (6fe2f58) into main (a57f093) will not change coverage.
Report is 1 commit behind head on main.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##             main     #365   +/-   ##
=======================================
  Coverage   96.56%   96.56%           
=======================================
  Files          14       14           
  Lines        4161     4161           
=======================================
  Hits         4018     4018           
  Misses        143      143           
Files            Coverage           Δ
src/tokenize.jl  99.07% <100.00%>   (ø)

@PallHaraldsson PallHaraldsson marked this pull request as draft October 9, 2023 19:40
@PallHaraldsson
Author

I'm a bit confused about these lines:

elseif c == '−' # \minus '−' treated as hyphen '-'
        return emit(l, accept(l, '=') ? K"-=" : K"-")

Or rather, I think they must work; it's just unclear why something similar isn't needed for the hyphen-minus lines above. But this must have worked...
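
One quick way to check what the lexer actually does with both characters (assuming JuliaSyntax.tokenize and kind behave as in recent versions) is to tokenize both spellings and compare the token kinds:

using JuliaSyntax

for src in ("a - b", "a − b")   # ASCII hyphen-minus vs Unicode minus U+2212
    kinds = JuliaSyntax.kind.(JuliaSyntax.tokenize(src))
    @show src kinds             # both should contain a K"-" token
end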

@PallHaraldsson
Author

Does the order matter for correctness rather than speed? That's the only idea I have for why my last commit worked around a bug. If you trust this you could merge.

@PallHaraldsson PallHaraldsson marked this pull request as ready for review October 9, 2023 21:42
@c42f
Member

c42f commented Oct 14, 2023

There's a very basic benchmarking script in test/benchmark.jl - does it show any improvement before vs. after this change? If you're working on performance improvements, please do prove to yourself (and everyone else :-) ) that they actually make a difference. For example, the compiler may be able to recognize very simple if-else chains and do something more efficient with them than comparing everything in order. I don't know if that happens in this case, but it could make any actual rearrangement here have no effect on the final runtime.

The order could matter if there's any overlap between categories. I think there's not, but it's worth checking whether any of the explicit characters might be matched by one of the predicates.
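
For reference, a rough before/after measurement could look something like this (a sketch only; test/benchmark.jl in the repo is the canonical script, and the corpus used here is just an example):

using JuliaSyntax, BenchmarkTools

src = read(joinpath(dirname(pathof(JuliaSyntax)), "tokenize.jl"), String)  # any large Julia file will do
@btime JuliaSyntax.tokenize($src)
@btime JuliaSyntax.parseall(JuliaSyntax.GreenNode, $src)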

@PallHaraldsson
Author

At first I just meant to move the check for EOF_CHAR, as it's obviously not the most common case. I believe the checks are now in order of likelihood; the compiler can't have any idea of the true distribution. I'm making a best guess, but I haven't benchmarked.

very simple if-else chains

If some of the checks are not as simple, maybe the compiler takes that into account. I've also thought about that possibility, which is why I had emit at the top.

@c42f
Member

c42f commented Oct 15, 2023

Please do benchmark it; all performance work needs benchmarking. Without that, it's easy to make code changes which don't matter - needlessly churning the code or introducing complexity.

My intuition says it would be best to focus on optimizing is_identifier_start_char() - that calls into Base right now, but we should move it into JuliaSyntax and add an ASCII fast path which can be inlined. In fact, we could probably split the whole block into ASCII and non-ASCII branches, and that may well speed things up enormously - this would be my pick of the optimizations to try first here.
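
A sketch of what that ASCII fast path could look like (hypothetical name, not the actual change that was eventually made):

function fast_is_identifier_start_char(c::Char)
    if isascii(c)
        # cheap range checks that inline well
        return ('a' <= c <= 'z') || ('A' <= c <= 'Z') || c == '_'
    end
    return Base.is_id_start_char(c)   # slow path: full Unicode classification in Base
end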

You could also work out an approximation for the true distribution of characters if that matters - for example just compute a histogram of character frequencies.
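
For example, something along these lines gives a rough histogram over a directory of Julia sources (the directory used at the end is just an assumption about where the Base sources live on a typical install):

function char_histogram(dir)
    counts = Dict{Char,Int}()
    for (root, _, files) in walkdir(dir), f in files
        endswith(f, ".jl") || continue
        for c in read(joinpath(root, f), String)
            counts[c] = get(counts, c, 0) + 1
        end
    end
    sort(collect(counts); by = last, rev = true)   # most frequent characters first
end

char_histogram(joinpath(Sys.BINDIR, "..", "share", "julia", "base"))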

@c42f
Member

c42f commented Nov 1, 2023

My intuition says it would be best to focus on optimizing is_identifier_start_char

A good part of this was effectively done in #372, and it does have some measurable performance benefit.

@PallHaraldsson
Author

PallHaraldsson commented Nov 10, 2023

No, shouldn't _next_token be optimized, i.e. have a fast path? It calls lex_identifier, which was optimized, so that isn't the main bottleneck?

is_identifier_start_char isn't the most critical; did you mean is_identifier_char, which is in the new fast path there?

It's interesting to see the table:
const ascii_is_identifier_char = Bool[is_identifier_char(Char(b)) for b=0x00:0x7f]

I would have thought checking for the most likely character (and maybe the 2nd or 3rd most likely) first would be faster, and possibly only then a lookup table. Lookups aren't that fast, though this one is only 16 bytes if packed as bits. A 16-entry table can be made fast with SIMD instructions, I understand, but otherwise I'm not convinced; bit manipulation isn't the fastest either?
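
To make the "16 bytes" idea concrete, the 128 ASCII answers can be packed into a single 128-bit mask and queried with shift/and bit manipulation (a sketch only, using Base.is_id_char as the reference predicate):

const ID_CHAR_MASK = let m = UInt128(0)
    for b in 0x00:0x7f
        Base.is_id_char(Char(b)) && (m |= UInt128(1) << b)
    end
    m
end

is_ascii_id_char(b::UInt8) = ((ID_CHAR_MASK >> (b & 0x7f)) & 1) == 1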

[There are tests for BigInt literals, i.e. the big-number macro isn't used, it seems, so there's also no need to test for BigFloat? I think this is correct: the parser just parses a string and hands it to the macro. I'm just wondering how much work it would be to get rid of BigFloat in Julia; it seems to be no problem for the parser, but BigInt can't be gotten rid of because of it.]

@KristofferC
Member

KristofferC commented Nov 10, 2023

While some optimisation based solely on "mental work" can be useful, I find it better to work with real data from profiling and benchmarks. So if you want to work on the performance of this package (or any package, for that matter), it would be a good idea to come with concrete data (in the form of profiles) showing that certain functions take up a significant amount of time, and that the changes you propose improve those times (in the form of benchmarks). Computers are complicated, and trying to guess the effect of certain code changes is very hard. As it is right now, it feels like we are just guessing, which is not a good way to do incremental performance improvements.
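
As a starting point, a minimal profiling run might look like this (a sketch assuming the JuliaSyntax API names below; any reasonably large corpus works as input):

using JuliaSyntax, Profile

src = read(joinpath(dirname(pathof(JuliaSyntax)), "tokenize.jl"), String)
JuliaSyntax.parseall(JuliaSyntax.GreenNode, src)        # warm up / compile first
@profile for _ in 1:200
    JuliaSyntax.parseall(JuliaSyntax.GreenNode, src)
end
Profile.print()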
