wasmparser: improve visit_operator performance #2228

Merged
merged 8 commits into bytecodealliance:main from shared-frame-stack on Jun 23, 2025

Conversation

keithw
Contributor

@keithw keithw commented Jun 9, 2025

This PR addresses some of the performance issues discussed in #2180 and #2202.

  • When validating, it eliminates the need for the reader to have its own stack of frames or a depth count. The BinaryReader has a visit_operator function again, where the visitor implements the new FrameStack trait so that visit_operator can get the current frame's kind. Validation uses this to avoid duplicating the FrameStack logic between the reader and validator.

  • Consumers who don't want to implement the FrameStack trait can use an OperatorsReader, which keeps an internal FrameStack; it can now be created from its allocations (and turned back into them), avoiding the need to allocate a fresh stack of frames each time an OperatorsReader is constructed.

For validating lots of tiny modules and one big module, I'm seeing these absolute timings on an Intel Core i7-9700T @ 4.3 GHz, running this command: `cargo bench -p wasmparser --bench benchmark -- --baseline pre-change 'validate/(tests|spidermonkey)'`. Percentages are relative to the first row.

| commit | validate/tests | validate/spidermonkey |
|--------|----------------|-----------------------|
| 90c156f (just before #2134) | 5.1770 ms | 39.707 ms |
| 0354dde (PR #2134) | 5.2136 ms (+0.7%) | 39.929 ms (+0.6%) |
| 351f153 (v1.229.0, #2149) | 5.1867 ms (+0.2%) | 39.474 ms (-0.6%) |
| 2dbe8b9 (v1.230.0, #2171) | 5.2155 ms (+0.7%) | 40.088 ms (+1.0%) |
| 315252f (v1.231.0, #2206) | 5.2113 ms (+0.7%) | 41.558 ms (+4.7%) |
| 93610b8 (v1.232.0, #2209) | 5.4440 ms (+5.2%) | 42.099 ms (+6.0%) |
| 14096f0 (v1.233.0, #2218) | 5.8129 ms (+12.3%) | 41.191 ms (+3.7%) |
| a05eaa5 (current main branch) | 5.7693 ms (+11.4%) | 41.529 ms (+4.6%) |
| e3c880c (this PR) | 5.7307 ms (+10.7%) | 40.691 ms (+2.4%) |

These measurements seem to suggest that #2134 (and the broader issue of the BinaryReader or OperatorsReader keeping a stack of frames) may not be the major culprit in the slowdown discussed in #2180, but if this is a point where we want to improve things (e.g. in #2202), this PR is about as minimal as I've been able to get it. It occurs to me that the validate/tests benchmark might not be a reliable indicator of regressions if the testsuite submodule is updated in between (e.g. to add more tests)...?

@keithw keithw requested a review from a team as a code owner June 9, 2025 07:09
@keithw keithw requested review from pchickey and removed request for a team June 9, 2025 07:09
@pchickey pchickey requested review from alexcrichton and removed request for pchickey June 9, 2025 16:25
@pchickey
Contributor

pchickey commented Jun 9, 2025

Alex has the context here so I'm swapping reviewer to him.

@alexcrichton
Member

Wanted to drop in and say I haven't forgotten about this; sorry, it's been a very busy week! I hope to be able to get to this tomorrow or Monday at the latest.

@alexcrichton
Member

Ok I've now gotten a chance to sit down with this and review it more closely, thanks for the patience. Also cc @Robbepop in case you hadn't already seen this.

I'd agree though that tracking the validate/tests benchmark over time isn't too useful, since that is the whole test suite, which includes new tests written, new spec tests, etc., so the set of what's being tested can fluctuate. Perhaps it should just be removed entirely from the benchmark suite...

Otherwise I'm curious to dig a bit more into the spidermonkey numbers. My overall stance is that I'm not a fan of making things less ergonomic in the name of performance, but if it needs to be done it needs to be done. So if this is a clear and unambiguous win then I think something like it should land, but I'm effectively double-checking that. I'm also wary of the absolute numbers here: Criterion is a wall-time-based benchmarking system, which is notoriously unreliable when it comes to small timings, and 40ms is IMO a pretty small amount of time (making it difficult for Criterion to measure changes).

For example, the numbers you collected show that the change in this PR is a 2.2% improvement over the current main branch. However, the numbers also show that the 1.233.0 release was a 2.3% improvement over 1.232.0, even though practically nothing changed in that release related to core wasm parsing/validation (some component bits, but those are unrelated). This aligns with what I'd roughly expect: in the millisecond range, Criterion can show +/- 5% swings (ish) on the same code.

Interestingly, the release with the biggest regression you've measured, 3.7% going from v1.230.0 to v1.231.0, actually landed a few optimizations such as #2197. I'm not sure how best to explain that...

Basically I'm overall a bit skeptical of the numbers collected here, but at the same time I can also naively see how this would probably improve performance given that there's fewer data structures in play. I'd prefer to not have to land this in the sense that it regresses ergonomics and makes things a bit more difficult to use, but that's also not the end of the world and something that can be figured out regardless. I'm otherwise basically lamenting how the benchmarking infrastructure for wasmparser is not great and I at least personally feel I still don't have a great handle on what the regressions are...

@Robbepop
Collaborator

Robbepop commented Jun 16, 2025

@alexcrichton thanks for the cc!

This might not really have anything to do with performance, but while trying to use the new wasmparser version in Wasmi, I noticed that the newly introduced TypedSelectMulti operator is weird in two ways:

  1. It is the only Operator that is not actually a valid Wasm operator.
  2. It is the only Operator that has a non-Copy field. (This gave me some minor headache.)

Do we really want to have such a "weird" operator? To me this indicates that wasmparser's parser and validator have an unclear border or grey area in between them since recent changes.

I'm otherwise basically lamenting how the benchmarking infrastructure for wasmparser is not great and I at least personally feel I still don't have a great handle on what the regressions are...

This sounds as if we should not attempt to optimize anything until we have better data (or benchmarks).

I'd prefer to not have to land this in the sense that it regresses ergonomics and makes things a bit more difficult to use [..]

I generally agree. The wasmparser crate is quite complex already. It would be great if we could find a solution that rather simplifies wasmparser instead of adding more complexity. Then again, stripping complexity away is among the hardest work in programming.

@keithw
Contributor Author

keithw commented Jun 17, 2025

@alexcrichton Thanks, I appreciate these thoughtful comments. I'm not particularly attached to this PR; if there is a concern about a performance hit from the state introduced in #2134, this is my best shot to eliminate that state, without affecting conformance to chapter 5. If there turns out not to be a significant performance concern, I'm fine sticking with the status quo.

From looking at a perf report, one contributor to performance seems to be that if we move the visit_operator implementation back into binary_reader.rs (as this PR does), calls to things like read_var_u32 are more likely to get inlined than when visit_operator is implemented in crates/wasmparser/src/readers/core/operators.rs. This is maybe a <2% effect, but I think it is real. (These calls do get inlined if we mark read_var_u32 #[inline(always)].)

The rest might just be the play of chance with regard to any small change going through LLVM on a non-PGO build. But obviously you should run the numbers yourself and see what you think.

One idea that you might like would be adding some more "intensive" benchmark workloads that might be more statistically robust, e.g. a validation workload that alternates validating spidermonkey.wasm and bz2.wasm 100 times in a row. If you want this, happy to send a PR but you probably know what you want here.

@keithw
Contributor Author

keithw commented Jun 17, 2025

I noticed that the newly introduced TypedSelectMulti operator is weird in two ways:

1. It is the only `Operator` that is not actually a valid Wasm operator.

Multi-value select is similar to a multi-value expression before the multi-value feature. E.g. in Wasm 1.0, this module was well-formed, and roundtrippable between binary and text, but invalid:

```wat
(func (result i32 i32) unreachable)
```

The multi-value feature made it valid.

Similarly, this multi-value select is well-formed and roundtrippable between binary and text, but currently invalid: https://github.com/WebAssembly/spec/blob/wasm-3.0/test/core/select.wast#L372-L378 . A future feature extension could also make it valid. wasmparser has a representation for it so it can parse testcases like that one and so wasmprinter can print it back to text.

To me this indicates that wasmparser's parser and validator have an unclear border or grey area in between them since recent changes.

I think the border is basically what's in the spec -- the readers/Parser parse the syntax defined in chapter 5, and the validator performs the "Validation Algorithm" described in the appendix (enforcing the rules of chapter 3).

@keithw keithw force-pushed the shared-frame-stack branch from 94d7dc0 to 7f05e07 on June 17, 2025 at 23:59
- The BinaryReader can visit_operator by itself again,
  with no need for OperatorsReader or extra state, when
  the visitor implements the FrameStack trait. Validation
  uses this to avoid duplicating the frame stack.

- OperatorsReader keeps an internal FrameStack and can be created
  from its allocations (and turned back into them), avoiding
  the need to allocate a stack of frames.
@alexcrichton
Member

Do we really want to have such a "weird" operator? To me this indicates that wasmparser's parser and validator have an unclear border or grey area in between them since recent changes.

Personally I agree with @keithw here in that there's no gray area spec-wise. The spec is quite clear that this operator is syntactically valid; it's just semantically always invalid. The parser answers "is this the spec-defined AST?" (more or less) and the validator answers "is this AST valid or not?".

Now, to counter myself, I do still think there's gray area, but I think it's different than what others are thinking. I wrote up my thoughts on it here, with a tl;dr of: I think there's gray area right now in precisely how implementations are expected to follow the spec. My impression is that implementations like browsers only care about "is this module invalid" vs. "is this malformed or validation-invalid". That leaves a vacuum (in my opinion) around shaping the spec to prioritize implementations which do want to distinguish between malformed and invalid.

I'm otherwise basically lamenting how the benchmarking infrastructure for wasmparser is not great and I at least personally feel I still don't have a great handle on what the regressions are...

This sounds as if we should not attempt to optimize anything until we have better data (or benchmarks).

While I do think better benchmarks would be useful I don't want to be so drastic as to say nothing can happen until they're improved. @keithw's idea might be good to have where a module is just validated N times to make the benchmark less noisy (in theory).

To me, though, the biggest discrepancy is that you're measuring 30%+ regressions in wasmi, but neither Keith nor I have been able to reproduce this, and I at least haven't been able to understand it for sure. What would be most helpful, ideally, is if you're able to dig in and understand where this 30% is going. It's easy enough to run a command and see "30%" pop out, but understanding the structure of what led to that conclusion is crucial as well.


In any case I'm going to give a more close review to this now as this seems like an "obvious win" perf-wise (why have two stacks when you can have one?) and the numbers show a clear improvement no matter what.

Member

@alexcrichton alexcrichton left a comment


Ok this all looks reasonable at a high level to me code-wise. Thank you again @keithw for putting work into this!

My main thinking on this is that I feel it puts the *Allocations a bit too front-and-center for the relatively niche case of optimizing this. I'd prefer to instead make it possible to reuse allocations without necessarily making that the default path. For example, I think get_operators_reader() is a reasonable signature to keep in a few locations, but additionally having a get_binary_reader_for_operators (or similar), with a constructor on OperatorsReader that takes a BinaryReader plus allocations, would be sufficient. I think that'd reduce the impact in-repo and additionally reduce the impact for external consumers too.

Is that something you'd like me to push up changes here for though? I'm happy to make the edits myself, but I'm also happy to review if you'd prefer to make the changes yourself too.

The rest might just be the play of chance with regard to any small change going through LLVM on a non-PGO build.

I'll say this is something I've been worried about as well. I find giant loops/matches (as is the case with validation/reading here) to be pretty finicky, and unrelated changes can have outsized effects when they basically shouldn't (e.g. code motion). I'm not sure how best to thread that needle, but moving everything back to binary_reader.rs seems fine either way.

@keithw
Contributor Author

keithw commented Jun 23, 2025

Thank you! It's fine with me if you want to push changes directly, and I agree it sounds good to keep the get_operators_reader() path available for most consumers.

@alexcrichton
Member

Ok I've pushed up some changes (sorry for the noise on a few of them...)

I'm going to double-check this can be integrated well into Wasmtime and then I plan to merge.

@alexcrichton
Member

I'm going to double-check this can be integrated well into Wasmtime and then I plan to merge.

Ok, well, that was easy: with 0 changes it passes all tests, yay!

@alexcrichton alexcrichton added this pull request to the merge queue Jun 23, 2025
Merged via the queue into bytecodealliance:main with commit 40fe970 Jun 23, 2025
32 checks passed
@keithw keithw deleted the shared-frame-stack branch June 23, 2025 21:19