wasmparser: improve visit_operator performance #2228
Conversation
Alex has the context here, so I'm swapping reviewer to him.
Wanted to drop in and say I haven't forgotten about this; sorry, it's been a very busy week! I hope to be able to get to this tomorrow or Monday at the latest.
Ok I've now gotten a chance to sit down with this and review it more closely, thanks for the patience. Also cc @Robbepop in case you hadn't already seen this. I'd agree though that tracking the

Otherwise though I'm curious to dig in a bit more to the spidermonkey numbers. My overall stance on this is that I'm not a fan of making things less ergonomic in the name of performance, but if it needs to be done it needs to be done. In that sense, if this is a clear and unambiguous win then I think something like it should land, but I'm effectively double-checking that.

In that sense I'm wary of the absolute numbers here in that Criterion is a wall-time-based benchmarking system, which is generally notoriously unreliable when it comes to small timings, and 40ms here is IMO a pretty small amount of time (making it difficult for Criterion to measure changes). For example the numbers you collected show that your change in this PR is a 2.2% improvement over the current

Interestingly, the biggest regression you've measured, 3.7% in 1.230 -> 1.231, actually landed a few optimizations such as #2197. I'm not sure how best to explain that...

Basically I'm overall a bit skeptical of the numbers collected here, but at the same time I can also naively see how this would probably improve performance given that there's fewer data structures in play. I'd prefer not to have to land this in the sense that it regresses ergonomics and makes things a bit more difficult to use, but that's also not the end of the world and something that can be figured out regardless. I'm otherwise basically lamenting how the benchmarking infrastructure for wasmparser is not great, and I at least personally feel I still don't have a great handle on what the regressions are...
@alexcrichton thanks for the cc! This might not really have anything to do with performance, but while trying to use the new
Do we really want to have such a "weird" operator? To me this indicates that
This sounds as if we should not attempt to optimize anything until we have better data (or benchmarks).
I generally agree. The
@alexcrichton Thanks, I appreciate these thoughtful comments. I'm not particularly attached to this PR; if there is a concern about a performance hit from the state introduced in #2134, this is my best shot at eliminating that state without affecting conformance to chapter 5. If there turns out not to be a significant performance concern, I'm fine sticking with the status quo.

From looking at a perf report, one of the contributors to performance seems to be that if we move the

The rest might just be the play of chance with regard to any small change going through LLVM on a non-PGO build. But obviously you should run the numbers yourself and see what you think.

One idea that you might like would be adding some more "intensive" benchmark workloads that might be more statistically robust, e.g. a validation workload that alternates validating spidermonkey.wasm and bz2.wasm 100 times in a row. If you want this, happy to send a PR, but you probably know what you want here.
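A minimal sketch of what such an "intensive" Criterion workload could look like; the file paths, benchmark name, and iteration count here are placeholder assumptions, not existing benchmark code:

```rust
use criterion::{criterion_group, criterion_main, Criterion};
use wasmparser::Validator;

// Hypothetical "intensive" workload: validate two large modules back to back,
// many times per measured iteration, so small per-module differences add up
// to something Criterion can measure reliably.
fn intensive_validate(c: &mut Criterion) {
    // Placeholder paths; the real benchmark would load its own corpus.
    let spidermonkey = std::fs::read("benches/spidermonkey.wasm").unwrap();
    let bz2 = std::fs::read("benches/bz2.wasm").unwrap();

    c.bench_function("validate/intensive", |b| {
        b.iter(|| {
            for _ in 0..100 {
                for module in [&spidermonkey, &bz2] {
                    // A fresh Validator per module, as in normal usage.
                    let mut validator = Validator::new();
                    validator.validate_all(module).unwrap();
                }
            }
        })
    });
}

criterion_group!(benches, intensive_validate);
criterion_main!(benches);
```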
Multi-value select is similar to a multi-value expression before the multi-value feature. E.g. in Wasm 1.0, this module was well-formed, and roundtrippable between binary and text, but invalid: `(func (result i32 i32) unreachable)`. The multi-value feature made it valid. Similarly, this multi-value select is well-formed and roundtrippable between binary and text, but currently invalid: https://github.com/WebAssembly/spec/blob/wasm-3.0/test/core/select.wast#L372-L378. A future feature extension could also make it valid. wasmparser has a representation for it so it can parse testcases like that one and so wasmprinter can print it back to text.
I think the boundary is basically what's in the spec -- the readers/Parser parse the syntax defined in chapter 5, and the validator performs the "Validation Algorithm" described in the appendix (enforcing the rules of chapter 3).
Force-pushed from 94d7dc0 to 7f05e07
- The `BinaryReader` can `visit_operator` by itself again, with no need for `OperatorsReader` or extra state, when the visitor implements the `FrameStack` trait. Validation uses this to avoid duplicating the frame stack.
- `OperatorsReader` keeps an internal `FrameStack` and can be created from its allocations (and turned back into them), avoiding the need to allocate a stack of frames.
Force-pushed from 7f05e07 to 4b449e0
Personally I agree with @keithw here in that there's no gray area spec-wise. The spec is quite clear that syntactically this operator is valid; it's just semantically always invalid. The parser is "this is the spec-defined AST" (more or less) and the validator is "is this AST valid or not?".

Now to counter myself, I do still think there's gray area, but I think it's different than what others are thinking. I wrote up my thoughts on it here, with a tl;dr of: I think there's gray area right now in precisely how implementations are expected to follow the spec. My impression is that implementations like browsers only care about "is this module invalid", not the finer "is this malformed or validation-invalid" distinction. That leaves a vacuum (in my opinion) around shaping the spec to prioritize implementations which do want to distinguish between malformed and invalid.
While I do think better benchmarks would be useful, I don't want to be so drastic as to say nothing can happen until they're improved. @keithw's idea of just validating a module N times to make the benchmark less noisy (in theory) might be good to have.

To me though the biggest discrepancy is that you're measuring 30%+ regressions in wasmi but neither Keith nor I have been able to reproduce this. I at least haven't been able to understand it for sure. What would be most helpful, ideally, is if you're able to dig in and understand where this 30% is going. It's easy enough to run a command and see "30%" pop out, but understanding the structure of what led to that conclusion is crucial as well.

In any case I'm going to give a closer review to this now, as this seems like an "obvious win" perf-wise (why have two stacks when you can have one?) and the numbers show a clear improvement no matter what.
Ok this all looks reasonable at a high level to me code-wise. Thank you again @keithw for putting work into this!
My main thinking on this is that I feel it puts the `*Allocations` a bit too front-and-center for a relatively niche case of optimizing this. I'd prefer to instead enable it to be possible to reuse allocations but not necessarily make it the default path. For example `get_operators_reader()` I think is a reasonable signature to keep in a few locations, but additionally having a `get_binary_reader_for_operators` (or similar) with a constructor on `OperatorsReader` that takes `BinaryReader` + allocations would be sufficient. I think that'd reduce the impact in-repo and additionally reduce the impact for external consumers too.
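A rough sketch of the kind of opt-in usage this could enable; the method names (`get_binary_reader_for_operators`, `new_with_allocations`, `into_allocations`) and the `OperatorsReaderAllocations` type are guesses at the shape being proposed, not the final API:

```rust
// Hypothetical usage of the opt-in allocation-reuse path; the names below are
// illustrative guesses, not the API that actually landed.
use wasmparser::{FunctionBody, OperatorsReader, OperatorsReaderAllocations, Result};

fn process_functions(bodies: &[FunctionBody<'_>]) -> Result<()> {
    // One set of allocations (the internal frame stack, etc.) reused across
    // every function body instead of being re-allocated each time.
    let mut allocs = OperatorsReaderAllocations::default();
    for body in bodies {
        let reader = body.get_binary_reader_for_operators()?;
        let mut ops = OperatorsReader::new_with_allocations(reader, allocs);
        while !ops.eof() {
            let _op = ops.read()?;
            // ... inspect or translate the operator here ...
        }
        // Hand the allocations back for the next iteration.
        allocs = ops.into_allocations();
    }
    Ok(())
}
```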
Is that something you'd like me to push up changes here for though? I'm happy to make the edits myself, but I'm also happy to review if you'd prefer to make the changes yourself too.
> The rest might just be the play of chance with regard to any small change going through LLVM on a non-PGO build.

I'll say this is something I've been worried about as well. I find giant loops/matches (as is the case with validation/reading here) to be pretty finicky, and unrelated changes can have outsized effects when they basically shouldn't (e.g. code motion). I'm not sure how best to thread that needle, but moving everything back to `binary_reader.rs` seems fine either way.
Thank you! It's fine with me if you want to push changes directly -- and I agree it sounds good to keep the
Make it a non-default API that other various locations can optionally use but aren't required to do so.
Ok I've pushed up some changes (sorry for the noise on a few of them...). I'm going to double-check this can be integrated well into Wasmtime, and then I plan to merge.
Ok, well, that was easy: with 0 changes it passes all tests, yay!
This PR addresses some of the performance issues discussed in #2180 and #2202.

When validating, it eliminates the need for the reader to have its own stack of frames or a depth count. The `BinaryReader` has a `visit_operator` function again, where the visitor implements the new `FrameStack` trait so that `visit_operator` can get the current frame's kind. Validation uses this to avoid duplicating the `FrameStack` logic between the reader and validator.

For consumers who don't want to implement the `FrameStack` trait, they can use an `OperatorsReader`, which keeps an internal `FrameStack` but can now be created from its allocations (and turned back into them), avoiding the need to allocate a stack of frames each time an `OperatorsReader` is constructed.

For validating lots of tiny modules and one big module, I'm seeing these absolute timings on an Intel Core i7-9700T @ 4.3 GHz, running this command:
cargo bench -p wasmparser --bench benchmark -- --baseline pre-change 'validate/(tests|spidermonkey)'
Percentages are relative to the first row.

These measurements seem to suggest that #2134 (and the broader issue of the `BinaryReader` or `OperatorsReader` keeping a stack of frames) may not be the major culprit with regard to the slowdown discussed in #2180, but if that is a point where we want to improve things (e.g. in #2202), this PR is about as minimal as I've been able to get it. It occurs to me that the `validate/tests` benchmark might not be a reliable indicator of regression if the testsuite submodule is updated in between (e.g. to add more tests)...?
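To illustrate the `FrameStack` idea in the description above, here is a self-contained toy sketch; none of these types are wasmparser's actual definitions, it only shows the division of labor where the visitor owns the control-frame stack and the reader consults it instead of keeping its own:

```rust
// Toy sketch of the FrameStack design: the visitor (e.g. a validator) already
// tracks control frames for its own purposes, so the reader can ask it for the
// current frame's kind instead of maintaining a duplicate stack. These types
// are illustrative only, not wasmparser's real definitions.

#[derive(Clone, Copy, Debug, PartialEq)]
enum FrameKind {
    Block,
    Loop,
    If,
    Else,
}

/// The reader only needs to know what kind of frame it is currently in.
trait FrameStack {
    fn current_frame(&self) -> Option<FrameKind>;
}

/// A validator-like visitor that maintains its own control-frame stack anyway,
/// so exposing it to the reader costs nothing extra.
struct Validator {
    frames: Vec<FrameKind>,
}

impl FrameStack for Validator {
    fn current_frame(&self) -> Option<FrameKind> {
        self.frames.last().copied()
    }
}

/// A reader with no frame state of its own: when decoding an operator whose
/// meaning depends on the enclosing frame (e.g. `else`), it asks the visitor.
struct Reader;

impl Reader {
    fn visit_else<V: FrameStack>(&mut self, visitor: &V) -> Result<(), String> {
        match visitor.current_frame() {
            Some(FrameKind::If) => Ok(()),
            other => Err(format!("unexpected `else` in frame {:?}", other)),
        }
    }
}

fn main() {
    let validator = Validator {
        frames: vec![FrameKind::Block, FrameKind::If],
    };
    let mut reader = Reader;
    assert!(reader.visit_else(&validator).is_ok());
}
```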