
perf(lexer): low hanging fruits #151

Closed
Boshen opened this issue Mar 8, 2023 · 22 comments
Labels
C-enhancement Category - New feature or request E-Help Wanted Experience level - For the experienced collaborators

Comments

@Boshen
Member

Boshen commented Mar 8, 2023

From the V8 scanner blog, I still see some low-hanging fruit. Let's play with these and make our lexer even faster!

@Boshen
Member Author

Boshen commented Mar 8, 2023

Nerd sniping @YoniFeng, since you've already looked at the lexer a bit 😉

@magic-akari
Collaborator

Performance issues are bugs for oxc. 👀

@Boshen
Member Author

Boshen commented Mar 8, 2023

Performance issues are bugs for oxc. 👀

Performance is a feature.

@YoniFeng
Contributor

YoniFeng commented Mar 8, 2023

What low hanging fruit do you see? We can make a list of things to look at / try.
I read through it but didn't follow everything (for example the whitespace part where they saved a branch), so would need to take a better look.
But quick notes and thoughts:

  1. Streaming vs. reading the whole file into memory: for Chromium it makes complete sense, since it lets parsing start in parallel while data is still coming over the network.
    For reading from disk, my guess is that the benefit is negligible at best, and probably a net loss.
    Might be worth trying for fun/learning, but I'd give it super low priority.
    The code for the whitespace/comments SIMD would need to be altered so that it doesn't need all the bytes up front; anything else like that?

  2. I didn't get the Whitespace section just from reading; I'll have to dig into the code. I got that they managed to avoid a loop branch, but didn't fully comprehend the before/after. I also don't know if it's relevant for the lexer as it currently is (given that it skips all whitespace and doesn't return a whitespace token).
    One assumption difference worth noting: V8 expects/optimizes for minified code, but oxc is for source code (where whitespace is more prevalent).

  3. Identifier scanning: looks like the code in unicode-id-start is already optimal?

  4. Internalizing minified identifiers - this is already done by string_cache, right? I didn't get what "Since JavaScript code is often minified, V8 uses a simple lookup table for single ASCII character strings" means.

  5. Keywords and hashing - we already discussed this with regards to phf being optimized for binary space, not lookup speed.

  6. Surrogate pairs seems like a small thing, possibly a hassle given we only work with a char iterator. But based on what's written, doesn't that mean read_string_literal is technically wrong since it doesn't allow a lone surrogate?

Overall, I don't see many actionable experiments here:

  1. Stream rather than read the whole file into memory
  2. Keyword matching with perfect/faster hashing (known)

What other ideas do you see that we can try?

@Boshen
Member Author

Boshen commented Mar 8, 2023

Identifier scanning:

  • I have a tiny micro optimization to make, I'll update unicode-id-start.

Keywords and hashing:

  • I'd like to try gperf but it seems Rust doesn't have this
  • so the next option would be to fork phf and add https://github.com/rust-phf/rust-phf/pull/236

Internalizing minified identifiers:

  • For oxc, I removed string-cache and went for `CompactString`; this is better for cache locality, I think.

Things I don't like so far:

  • this JSX testing branch; it's sitting on top of every read_next_token

https://github.com/Boshen/oxc/blob/b76ffb4826a4fbf871d020b35530add16760fbf6/crates/oxc_parser/src/lexer/mod.rs#L342-L344

  • this memmove here, and the drop that comes with it. I'm not sure if there's any trick to keep this on the stack. This shows up in my profiler, btw.

https://github.com/Boshen/oxc/blob/b76ffb4826a4fbf871d020b35530add16760fbf6/crates/oxc_parser/src/lexer/mod.rs#L180

  • this lookahead here and the code that comes with it (doing all operations on the heap). Lookahead is always <= 4, so it should fit on the stack; see the sketch after the link below.

https://github.com/Boshen/oxc/blob/b76ffb4826a4fbf871d020b35530add16760fbf6/crates/oxc_parser/src/lexer/mod.rs#L95
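
A minimal sketch of one way to keep that bounded lookahead on the stack, assuming a maximum of 4 buffered tokens (the Token type and ring-buffer layout here are illustrative, not oxc's actual structures):

    #[derive(Clone, Copy, Debug, Default)]
    struct Token; // stand-in for the lexer's real token type

    /// Fixed-capacity ring buffer: no heap allocation, no memmove on pop.
    struct Lookahead {
        buf: [Token; 4],
        head: usize,
        len: usize,
    }

    impl Lookahead {
        fn new() -> Self {
            Lookahead { buf: [Token::default(); 4], head: 0, len: 0 }
        }

        fn push(&mut self, t: Token) {
            debug_assert!(self.len < 4, "lookahead never exceeds 4");
            self.buf[(self.head + self.len) % 4] = t;
            self.len += 1;
        }

        fn pop(&mut self) -> Option<Token> {
            if self.len == 0 {
                return None;
            }
            let t = self.buf[self.head];
            self.head = (self.head + 1) % 4;
            self.len -= 1;
            Some(t)
        }
    }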

@Boshen
Member Author

Boshen commented Mar 8, 2023

I just looked at the code of V8's keyword matching to better understand what this is talking about:

Since the list of keywords is static, we can compute a perfect hash function that for each identifier gives us at most one candidate keyword. V8 uses gperf to compute this function. The result computes a hash from the length and first two identifier characters to find the single candidate keyword. We only compare the identifier with the keyword if the length of that keyword matches the input identifier length.

I think we should be able to do something similar, for example use phf with a faster hasher, and then just copy the generated table into our code?
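
For reference, a minimal sketch of what a phf-based keyword lookup could look like (the Kind enum and the three keywords are illustrative; the real table would cover all keywords and use whatever hasher wins the benchmarks):

    use phf::phf_map;

    #[derive(Clone, Copy, Debug, PartialEq, Eq)]
    enum Kind {
        If,
        Else,
        Return,
        Ident,
    }

    // The table is generated at compile time; a lookup is one hash plus a
    // single string comparison against the candidate slot.
    static KEYWORDS: phf::Map<&'static str, Kind> = phf_map! {
        "if" => Kind::If,
        "else" => Kind::Else,
        "return" => Kind::Return,
    };

    fn match_keyword(s: &str) -> Kind {
        KEYWORDS.get(s).copied().unwrap_or(Kind::Ident)
    }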

@Boshen Boshen added this to the AST / Lexer / Parser milestone Mar 8, 2023
@Boshen Boshen added C-enhancement Category - New feature or request E-Help Wanted Experience level - For the experienced collaborators labels Mar 8, 2023
@YangchenYe323
Contributor

I was quite interested in your discussions in #140, so I wrote a quick prototype of quick-lint-js's approach to matching keywords in Rust: https://github.com/YangchenYe323/keyword-match-experiment

The idea is simple:

  1. Generate an efficient collision-free hash table at build time:
    • Concatenate the first two characters (two u8s) and the last two characters (two u8s) of the string into a u32 as the hashed value. This gives us the "selection" of a key (see the sketch below).
    • Hash the selection with some magic computation involving a random seed, giving hash = f(selection, seed). Different values of the seed are tested at build time to choose one without collisions.
    • Write the chosen seed and the corresponding hash table into a generated source file.
  2. SIMD string equality test: we can load the candidate str and the keyword str into two Simd<u8, 16>, since we can filter out longer candidates early on. But if the candidate token is actually less than 16 bytes, we have to read 16 bytes regardless, which makes this function unsafe. We then compare the two Simd<u8, 16> and use a mask to extract the final result.
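
A hedged sketch of the selection/hash step described above (the mixing function, shift, and table size are illustrative, not necessarily the prototype's exact constants):

    /// Pack the first two and last two bytes of the identifier into a u32.
    /// Only valid for inputs of length >= 2; the caller filters by the
    /// min/max keyword length first.
    fn selection(s: &[u8]) -> u32 {
        let n = s.len();
        u32::from_le_bytes([s[0], s[1], s[n - 2], s[n - 1]])
    }

    /// Seeded mix; the seed is searched at build time until every keyword
    /// lands in a distinct slot of (here) a 64-entry table.
    fn hash(selection: u32, seed: u32) -> u32 {
        selection.wrapping_mul(seed) >> 26
    }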

The benchmark result is quite promising:
The benchmark generates 8 random strings of random lengths between 2 and 11, then matches each of them against the JS keywords using either a baseline match statement or hashing + SIMD. The hashing runs 20% to 100% faster:
[screenshot: benchmark results]

I ran this a couple of times on an AWS Ubuntu instance. The performance of the hashing approach is stable at 32-34ns, while the matching approach's run time varies from 33ns to as high as 100ns. Probably because the matching approach is more sensitive to the actual length of the string? That part is beyond me.

@Boshen It'd be great to get some insight from you. I could look into using this in oxc, too

@YoniFeng
Contributor

YoniFeng commented Mar 10, 2023

Looks great!
You should definitely submit a PR imo

Few things I noticed:

  1. I would test getting rid of

    if clen < keyword_list::MIN_JS_KEYWORD_LENGTH || clen > keyword_list::MAX_JS_KEYWORD_LENGTH {
        return false;
    }

In practice, with the hashing approach it's probably not worth the branch misprediction?
I might be wrong though.

    if len != candidate.len() {
        return false;
    }

This check is repeated in is_keyword as well as simd_slice_match; only one is needed.

In quick-lint-js, this check is after the string comparison.
I assume @strager's due diligence found that it's overall faster to just do the string comparison than introduce branch misprediction.
Worth testing doing the same.

  2. The SIMD comparison is currently possibly reading memory it shouldn't be reading, right? (Because the slice length can be <16.)
    quick-lint-js pads the file contents right when it loads from disk, so reading 16 bytes is always safe.
    I looked at this previously, but we decided not to go through with it; see refactor: pad raw content to parse with 16 empty bytes #90.

So, I think the SIMD comparison has to be dropped.
Will be interesting to see how significant it is in practice.
I'd guess it varies, with TS codebases/files (which will have longer keywords more often) being most affected.

edit: this might not be relevant for the keyword lookup

  3. Should probably sprinkle #[inline(always)]s.

  4. Maybe look at using https://docs.rs/cmov/latest/cmov/ to replace ifs?

@YoniFeng
Contributor

YoniFeng commented Mar 10, 2023

re: the match approach fluctuating in benchmarks

Probably due to string length having a large effect, like you said.
I think a fairer/more realistic benchmark would increase the sample size of tested strings and comprise a 50/50 mix of keywords and random strings. (I'm guessing; I don't know the actual distribution of keywords vs. identifiers in the wild.)

@strager
Contributor

strager commented Mar 10, 2023

@YoniFeng Thanks for tagging me!

  1. I would test getting rid of [the early length check]

This would cause select to read out of bounds for 1-character variable names. https://github.com/YangchenYe323/keyword-match-experiment/blob/8032a59ff040406afc6eead92c8985a9dd241b30/src/hash.rs#L6-L18

I assume @strager's due diligence found that it's overall faster to just do the string comparison than introduce branch misprediction [from a length check].

This is correct.

Will be interesting to see how significant [SIMD] is in practice.

SIMD (or SWAR) provides a perf benefit on its own, but SIMD unlocks another optimization: branchless comparisons. SIMD+branchless improved performance significantly (at least 2x). (Going branchless for the whole algorithm, not just the string comparison, is even better.)

(Branchless string comparisons are not implemented in quick-lint-js right now. I didn't want to microoptimize quick-lint-js' current perfect hash table because I am going to throw it away eventually anyway. 😉 I mostly made the perfect hash table for a video (which I'll publish as soon as I make a YouTube thumbnail...).)

However, it seems that @YangchenYe323's experiment returns a boolean. I assume calling code is going to branch anyway, so making the string comparison branchless might not be worth it. (In quick-lint-js' case, keyword lookup returns an integer so things are more complicated.)
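
To make the branchless idea concrete, here is a minimal sketch (not quick-lint-js's actual code): the comparison result is treated as data, and the token kind is selected arithmetically instead of with an if:

    /// Branchless select: returns slot_kind if the padded candidate matches
    /// the slot's keyword, else 0 (taken here to mean "identifier").
    /// Both buffers are assumed to be zero-padded to 16 bytes.
    fn lookup_kind(candidate: &[u8; 16], slot_keyword: &[u8; 16], slot_kind: u8) -> u8 {
        let equal = u128::from_le_bytes(*candidate) == u128::from_le_bytes(*slot_keyword);
        // Multiply by 0 or 1 instead of branching on `equal`.
        slot_kind * (equal as u8)
    }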

Maybe look at using https://docs.rs/cmov/latest/cmov/ to replace ifs?

How would you use cmov in the string comparison while avoiding out of bounds slice access?

and comprise a 50/50 mix of keywords and random strings. (I'm guessing; I don't know the actual distribution of keywords vs. identifiers in the wild.)

What I did for most of my benchmarking/profiling/optimizing is take word-looking things from jQuery and feed that into my lookup function: https://github.com/strager/perfect-hash-tables/blob/3938fb8f7ddb65d0a83c12bc4dcfeabaecc7c4fc/mixed.txt

This data set is about 20% keywords, 80% variable names (or words in comments). I didn't measure the ratio if you exclude comments.

@strager
Contributor

strager commented Mar 10, 2023

And I wrote a quick prototype of quick-lint-js's approach of matching keywords in rust: https://github.com/YangchenYe323/keyword-match-experiment

Note that I own the copyright to quick-lint-js' code, and quick-lint-js is released with a GPL-3.0-or-newer license. If your project is effectively a copy-paste of quick-lint-js' code, please respect the software license.


@YangchenYe323 I think the following code is UB in Rust. You cannot grow a slice. I think Miri can catch this.
https://github.com/YangchenYe323/keyword-match-experiment/blob/8032a59ff040406afc6eead92c8985a9dd241b30/src/lib.rs#L57-L58

@YoniFeng
Contributor

YoniFeng commented Mar 10, 2023

This would cause select to read out of bounds for 1-character variable names. https://github.com/YangchenYe323/keyword-match-experiment/blob/8032a59ff040406afc6eead92c8985a9dd241b30/src/hash.rs#L6-L18

I wasn't keeping track; I had your implementation in mind (AND with a mask to avoid out-of-bounds reads).

I didn't want to microoptimize quick-lint-js' current perfect hash table because I am going to throw it away eventually anyway. 😉

Uhh - throw it away for what?

I mostly made the perfect hash table for a video (which I'll publish as soon as I make a YouTube thumbnail...).

Wow https://github.com/strager/perfect-hash-tables seems fairly extensive/thorough.
Looking forward to it. (A video is enough? I'd think you need a blog series at least)

However, it seems that @YangchenYe323's experiment returns a boolean. I assume calling code is going to branch anyway, so making the string comparison branchless might not be worth it. (In quick-lint-js' case, keyword lookup returns an integer so things are more complicated.)

How would you use cmov in the string comparison while avoiding out of bounds slice access?

I didn't have the string comparison in mind, only the outermost layer.
But actually nvm, because if the method returns a boolean it's irrelevant.

@YangchenYe323
Contributor

Note that I own the copyright to quick-lint-js' code, and quick-lint-js is released with a GPL-3.0-or-newer license. If your project is effectively a copy-paste of quick-lint-js' code, please respect the software license.

Thanks for the reminder. I've added the same license to my repo.

@YangchenYe323 I think the following code is UB in Rust. You cannot grow a slice. I think Miri can catch this.
https://github.com/YangchenYe323/keyword-match-experiment/blob/8032a59ff040406afc6eead92c8985a9dd241b30/src/lib.rs#L57-L58

Yes, it is UB, thanks for pointing this out! We'll need to drop that part anyway, since we won't go for the SIMD.

@YangchenYe323
Contributor

I wasn't keeping track; I had your implementation in mind (AND with a mask to avoid out-of-bounds reads).

I suppose we couldn't use a static mask either, because we don't pad the input so that 16 bytes are always readable. We could try masking against the actual length, though; see the sketch below.
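
One bounds-safe alternative, as a hedged sketch (names illustrative): instead of an overlong load plus mask, copy the candidate into a zero-padded 16-byte buffer and do a fixed-width comparison:

    /// Compare a candidate identifier against a keyword stored as a
    /// zero-padded 16-byte row. The copy into `buf` never reads past the
    /// candidate slice, so no unsafe overlong load is needed.
    fn eq16_padded(candidate: &[u8], keyword_row: &[u8; 16], keyword_len: usize) -> bool {
        if candidate.len() != keyword_len || candidate.len() > 16 {
            return false;
        }
        let mut buf = [0u8; 16];
        buf[..candidate.len()].copy_from_slice(candidate);
        buf == *keyword_row
    }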

@Boshen
Member Author

Boshen commented Mar 10, 2023

Let's respect @strager's work and quick-lint-js's license.

@YangchenYe323 can you PR under quick-lint-js's license restrictions? I'll only merge when @strager says it's ok.

@Boshen
Member Author

Boshen commented Mar 10, 2023

Let me check with @strager, I asked him in private.

@Boshen
Member Author

Boshen commented Mar 11, 2023

@YangchenYe323 can you draft a PR with an implementation using rust-phf and without SIMD? I know it is not perfect, but something slightly faster than the current string-match implementation will be good enough for us right now.

We can't use the implementation from quick-lint-js's code given its license restrictions. And @strager will teach us a clean-room implementation soon ;-)

@YangchenYe323
Contributor

@YangchenYe323 can you draft a PR with an implementation using rust-phf and without SIMD? I know it is not perfect, but something slightly faster than the current string-match implementation will be good enough for us right now.

Sure, I'm on it. The tricky thing is that the baseline matching approach is hard to beat :). I ran the benchmark using @strager's jQuery input on (a) Rust HashMap, (b) matching, (c) perfect hash:
[screenshot: benchmark results]

Perfect hash is about 4x faster than using a Rust HashMap (roughly consistent with the stats mentioned in #140), but only about 25% faster than just matching. I still need to see whether we get some speed-up when this is migrated into oxc's lexer structure.

@YangchenYe323
Contributor

I did some more digging into the LLVM IR of the match_keyword function, and found that it's not simply comparison after comparison: the input string is only compared against the match arms known to have equal length. The IR looks like this:
[screenshot: LLVM IR of match_keyword]

The big switch in bb5 branches on the length of the input string, so on each invocation at most ~15 memory comparisons are done; all 84 keywords are never compared at once. If we assume identifiers are much more common than keywords, each of these branches is taken with relatively low probability, and branch prediction could actually do pretty well in this situation.

I suppose this might explain why #140 doesn't give the desired performance boost.
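
For context, the baseline being discussed is just a plain match on the identifier text; a trimmed sketch (only a handful of the 84 keywords shown):

    /// LLVM compiles this into a switch on the input length first, then a
    /// memcmp only against the keywords of that length.
    fn is_keyword(s: &str) -> bool {
        matches!(
            s,
            "await" | "break" | "case" | "catch" | "class" | "const"
                | "if" | "in" | "new" | "return" | "typeof" | "while"
            // ... remaining keywords elided
        )
    }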

@YoniFeng
Contributor

I did some more digging into the LLVM IR of the match_keyword function, and found that it's not simply comparison after comparison: the input string is only compared against the match arms known to have equal length. The IR looks like this: [screenshot: LLVM IR of match_keyword]

The big switch in bb5 branches on the length of the input string, so on each invocation at most ~15 memory comparisons are done; all 84 keywords are never compared at once. If we assume identifiers are much more common than keywords, each of these branches is taken with relatively low probability, and branch prediction could actually do pretty well in this situation.

I suppose this might explain why #140 doesn't give the desired performance boost.

When I looked at the assembly, I was surprised to "find" it was doing comparison after comparison without first branching on the length (I know this optimization exists in .NET, and would blindly assume LLVM has it as well), so I attributed it to "some reason I don't know about".

Nice to see you found this. I probably just wasn't thorough enough looking at the ASM.. (too lazy to go look again).
Makes complete sense.

@strager
Contributor

strager commented Mar 11, 2023

Video on my perfect hash table implementation: https://www.youtube.com/watch?v=DMQ_HcNSOAI

Feel free to adopt ideas from this video.

@Boshen
Member Author

Boshen commented Mar 12, 2023

After trying out a bunch of micro-optimizations and looking at the profiler at a larger scale within the linter, I conclude that we don't need to look at the lexer anymore (for a long time).

The only remaining micro-optimization is the phf keyword lookup method, which we'll leave as an exercise after the video, unless #171 gets an amazing result.
