Redesign frames and unify across AST and bytecode interpreters, and specialize calls #18
Conversation
PyCharm is a bit strange; it fails to run the debugger... Signed-off-by: Stefan Marr <git@stefan-marr.de>
- this breaks the printing of stack traces
- this changes the receiver of the #escapedBlock:
- it may fix a bug with block semantics and clearing the previous frame (for non-local returns)

Signed-off-by: Stefan Marr <git@stefan-marr.de>
This is mostly to see the performance impact of this. I am currently thinking about moving away from a Frame class to a structure based on lists, but I don't think I can mark them as immutable. So, let's see whether there's any notable impact. Signed-off-by: Stefan Marr <git@stefan-marr.de>
…head

- locals and arguments accessed by inner scopes are in a second, optional list
- to be able to store the inner list in the frame, I need to use type erasure, which makes things a little more tricky
- the blocks now simply keep a reference to the inner list
- because we split off the args/vars that are accessed from an inner scope, we don't know the access idx in a single pass. Thus, we need to resolve it on first execution. Though, it also allows us to unify arg and var nodes

Signed-off-by: Stefan Marr <git@stefan-marr.de>
Since the Inner only stores AbstractObjects, we don't really need erasure here. Signed-off-by: Stefan Marr <git@stefan-marr.de>
Part of adopting new frame design for bytecode interpreter. Signed-off-by: Stefan Marr <git@stefan-marr.de>
This shouldn't change behavior Signed-off-by: Stefan Marr <git@stefan-marr.de>
- have two types of bytecode methods: with handling of non-local returns, and without
- change parser find_var to return an object
- introduce new bytecodes push/pop frame/inner to access the new frame structure. These bytecodes are put into place during a pass over the bytecode to determine stack usage, because only then do we know the indexes for accessing frame/inner. The old push/pop local/argument bytecodes are still around during parsing, but are not supported in the interpreter

Signed-off-by: Stefan Marr <git@stefan-marr.de>
The problem is that only once the outer method is completely parsed can we be sure what each single access index is going to be. And we really don't want to make method activation more complicated by having even more complex maps there. So, let's just rewrite on first execution. In the interpreter, we simply re-execute the same bytecode index after patching the bytecodes. Remove the unneeded FrameDetails and early prepare_frame calls, which couldn't work this way (only with more complex index mapping for argument handling when doing a call). Signed-off-by: Stefan Marr <git@stefan-marr.de>
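The rewrite-on-first-execution idea can be illustrated with a minimal sketch. All names here (`PUSH_LOCAL`, `PUSH_FRAME`, the dict-based `scope`) are assumptions for illustration, not the actual opcodes or data structures of the interpreter; the point is only that an unresolved bytecode is patched in place and the same bytecode index is then re-executed.

```python
# Hypothetical opcodes: the parser emits PUSH_LOCAL with a variable id,
# which is quickened to PUSH_FRAME with a resolved frame index.
PUSH_LOCAL = 0
PUSH_FRAME = 1

def interpret(bytecodes, operands, frame, scope):
    """Tiny loop sketch: patch unresolved bytecodes on first execution,
    then re-execute the same bytecode index with the patched opcode."""
    stack = []
    i = 0
    while i < len(bytecodes):
        op = bytecodes[i]
        if op == PUSH_LOCAL:
            # Only now, after the whole method is known, resolve the
            # final access index and rewrite the bytecode in place.
            operands[i] = scope[operands[i]]
            bytecodes[i] = PUSH_FRAME
            continue  # re-execute index i with the quickened opcode
        if op == PUSH_FRAME:
            stack.append(frame[operands[i]])
        i += 1
    return stack
```

On the second execution of the same method, the quickened opcode is hit directly and the resolution cost is gone.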
…able Signed-off-by: Stefan Marr <git@stefan-marr.de>
…r contexts Signed-off-by: Stefan Marr <git@stefan-marr.de>
… multiple indexes in create_frame Signed-off-by: Stefan Marr <git@stefan-marr.de>
This marks the mgenc, setting the bit that indicates it is accessed from outer scopes. It's not explicitly marking "self", but since we handle self specially, that's not needed. Signed-off-by: Stefan Marr <git@stefan-marr.de>
With the latest changes, which add specialized calling conventions for 1, 2, and 3 arguments, as well as specific bytecodes, we now see major speedups for the bytecode interpreter. Executing SomSom on the bytecode interpreter now requires 5-7% less time overall, which is a good indication of the improved interpreter speed. The next step is to apply the calling specialization in the AST interpreter as well.
This possibly saves stack space. There is also the simplification of the guard, and a reduction in the number of times the rcvr_class needs to be determined. The generic dispatch is now also encoded by the expected_class, as a simple guard after the dispatch loop. This avoids having to use a polymorphic guard-checking method, though it's not necessarily obvious. Signed-off-by: Stefan Marr <git@stefan-marr.de>
…r the JIT, too Signed-off-by: Stefan Marr <git@stefan-marr.de>
- remove unused universe from Invokable Signed-off-by: Stefan Marr <git@stefan-marr.de>
I think it may be better to start the trace before creating a frame, because then we see the frame creation inside the trace. Signed-off-by: Stefan Marr <git@stefan-marr.de>
- increase code sharing a little between AST and bytecode interpreters Signed-off-by: Stefan Marr <git@stefan-marr.de>
We can't fully get rid of the object, because RPython typing doesn't support having a node-node parent. But we avoid the extra call. This change also uses `node` as the name for self, to avoid issues with `self` as a name on the JitDriver. Signed-off-by: Stefan Marr <git@stefan-marr.de>
This design is very similar to the bytecode interpreter in TruffleSOM. We don't need to have the stack in the frame; it's just complexity there. However, having it in the interpreter is a little awkward, too, because of all the code. Though it's trivial code, so hopefully not much of an issue. For n-ary calls, we pass the stack and stack pointer, and the callee can take what it needs. Signed-off-by: Stefan Marr <git@stefan-marr.de>
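The n-ary calling convention described here can be sketched as follows. This is a hypothetical illustration (the function names and stack-pointer convention are assumptions): the operand stack lives in the interpreter loop, and a callee receives the stack plus the stack pointer and slices out the arguments it needs.

```python
def invoke_nary(method_fn, num_args, stack, sp):
    """Sketch of an n-ary call: the callee takes `num_args` values
    (receiver included) off the interpreter-local operand stack.
    `sp` points at the top-of-stack slot."""
    # Receiver and arguments are the top num_args stack slots.
    args = stack[sp - num_args + 1 : sp + 1]
    result = method_fn(args)
    sp -= num_args   # pop receiver and arguments
    sp += 1
    stack[sp] = result  # push the result
    return sp
```

This avoids allocating a frame-resident stack per activation; only the slice for the callee's arguments is materialized.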
There's more, but let's do at least the easy stuff. Also added some more tests. Signed-off-by: Stefan Marr <git@stefan-marr.de>
Overall, the results are great, except for AST JIT performance, which dropped and should be investigated at some point. Richards, Bounce, BubbleSort, List, Queens, and Towers all suffered; see: https://rebench.stefan-marr.de/compare/RPySOM/bc14175415c59f1cfeb9b295fc8ddc103590d4ae/3eb4d2de135b4cf0b0bf9c4029b419d77ad00d4b But interpreter performance improved very nicely. Time to merge.
With this PR, the new frame design sketched in #16 and #17 is mostly implemented. Check those issues for design details; this PR only reiterates divergences and changes from those notes.
The key implication is that frames are no longer represented by objects but by 1 or 2 lists/arrays.
This also means we won't be able to use PyPy's virtualizable mechanism, and can't really use the immutability annotation either.
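A minimal sketch of the two-list design, with assumed names (`create_frame`, `Block`, the layout of the lists are illustrations, not the actual implementation): the frame itself is a plain list, and values accessed from inner scopes live in a second, optional "inner" list that block closures can hold on to directly.

```python
def create_frame(arguments, n_frame_slots, n_inner_slots):
    """Build a frame as a plain list; if any slots are accessed from
    inner scopes, they go into a second list stored in the frame."""
    frame = list(arguments) + [None] * n_frame_slots
    if n_inner_slots > 0:
        frame.append([None] * n_inner_slots)  # frame[-1] is the inner list
    return frame

class Block:
    """A block only needs a reference to the inner list of its outer
    method, not to the whole frame."""
    def __init__(self, inner):
        self.inner = inner
```

Because the outer frame and the block share the same inner list, writes through either side are visible to the other, which is what non-local variable access needs.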
LexicalScopes
The PR introduces the notion of `LexicalScopes`. It's a basic version of what we have in TruffleSOM. It's needed to be able to look up the variables when quickening the frame accesses in the bytecode interpreter.
It's easier in the AST interpreter, because we have the variable objects directly accessible.
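As a rough sketch of what such a scope could look like (the class shape, the `lookup` result as a `(context_level, index)` pair, and all names are assumptions for illustration): each scope knows its arguments, its locals, and its outer scope, which is enough to resolve a variable when quickening.

```python
class LexicalScope:
    """Hypothetical minimal lexical scope: resolves a variable name to
    (context_level, slot_index), walking outward through enclosing scopes."""
    def __init__(self, outer, arguments, local_names):
        self.outer = outer
        self.arguments = arguments
        self.locals = local_names

    def lookup(self, name, level=0):
        if name in self.arguments:
            return level, self.arguments.index(name)
        if name in self.locals:
            # Locals are indexed after the arguments in this sketch.
            return level, len(self.arguments) + self.locals.index(name)
        if self.outer is not None:
            return self.outer.lookup(name, level + 1)
        return None
```

A block's scope chains to its enclosing method's scope, so a lookup from inside the block transparently finds outer arguments and locals at a higher context level.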
AST Dispatch Made Iterative
Before this PR, dispatch used a dispatch chain and essentially recursive methods.
In this PR, this was changed to be iterative and to also return the "node" with the lookup result, where a polymorphic method does the actual dispatch operation.
This helps to get rid of the special case and missing lookup cache when tracing.
It also avoids stack use in the dispatch chain and, who knows, perhaps yields faster code in the interpreter.
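The iterative lookup can be sketched like this (class and field names are assumptions, not the actual API): instead of each cache entry recursively invoking the next, a loop walks the chain and returns the matching node; the caller then performs the actual invocation on the returned node, falling back to a generic lookup when no entry matches.

```python
class CachedDispatchNode:
    """Hypothetical polymorphic-inline-cache entry: a guard class,
    the cached method, and a link to the next entry."""
    def __init__(self, expected_class, method, next_entry):
        self.expected_class = expected_class
        self.method = method
        self.next_entry = next_entry

def lookup_dispatch_node(first_entry, rcvr_class):
    """Iterative walk over the chain; no recursion, no stack growth.
    Returns the matching node, or None for the generic fallback."""
    node = first_entry
    while node is not None:
        if node.expected_class is rcvr_class:
            return node
        node = node.next_entry
    return None
```

Determining the receiver class once, before the loop, also matches the commit above about reducing how often `rcvr_class` needs to be computed.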
Specializing Function Calls for 1, 2, and 3 Arguments (incl. Receiver)
To further reduce the cost of calls, this PR also introduces specialization of function calls for 1, 2, or 3 arguments (incl. the receiver in the count). The majority of calls have these sizes, as do all but one primitive implementation.
This avoids allocating the arg array in the AST interpreter, and avoids passing the stack for most cases in the BC interpreter.
For the AST interpreter, this means message send nodes for unary, binary, ternary, and n-ary sends (normal as well as super sends). Similarly, the bytecode interpreter has bytecodes for these cases.
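The shape of this specialization can be sketched as follows (method and function names are assumptions for illustration): fixed-arity entry points take the receiver and arguments as plain parameters, so the unary/binary/ternary fast paths never allocate an argument array; only the n-ary path still builds one.

```python
class Method:
    """Hypothetical invokable with fixed-arity entry points."""
    def __init__(self, fn):
        self.fn = fn

    def invoke_1(self, rcvr):            # unary: receiver only
        return self.fn(rcvr)

    def invoke_2(self, rcvr, arg):       # binary: receiver + 1 argument
        return self.fn(rcvr, arg)

    def invoke_3(self, rcvr, a1, a2):    # ternary: receiver + 2 arguments
        return self.fn(rcvr, a1, a2)

    def invoke_n(self, args):            # n-ary: args[0] is the receiver
        return self.fn(*args)

def send_binary(method, rcvr, arg):
    # Fast path: no list allocation for the arguments.
    return method.invoke_2(rcvr, arg)
```

In the bytecode interpreter, the analogous effect is that the 1-3 argument call bytecodes can hand values over directly instead of passing the stack.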
For the AST interpreter, this gives about an 18-32% speedup, and for the bytecode interpreter 20-25%: https://rebench.stefan-marr.de/compare/RPySOM/bb03d10c20a27698c242fe9847b0aa2c1949c249/bf229f264078074eb474b5902cdc98868911a8d8#micro-somsom-SomSom-ast-interp
Stack Local to BC Interpreter
With the specialization of function calls in place, it's time to remove the stack from the frame in the bytecode interpreter. From TruffleSOM, we know it works well to just keep it in the interpreter loop. However, here we may need to pass it on to frame creation when doing n-ary calls. But besides that, it still works very well.
The bytecode interpreter gains another 7-12% in performance:
https://rebench.stefan-marr.de/compare/RPySOM/bf229f264078074eb474b5902cdc98868911a8d8/8aa86e7a1469db3dbbe4e344b8cc3127a7a135cd
Initial Performance when PR was opened (not all optimizations done at that point)
For the AST interpreter, this seems to be a small win.
On the SomSom interpreter benchmarks, I see a 3-5% win. On the JIT compiled performance, there's a good gain of 7-19% on the recursive benchmarks. However, Queens and BubbleSort don't necessarily seem to like it with 7-10% regressions.
For the bytecode interpreter, the change is a major regression, and I think the next step needs to be to remove the stack from the frame, and pass arguments in a way that no additional array needs to be allocated.
I'd hope this would also benefit the AST interpreter. Though, it may lead to more code for 1, 2, 3, 4 - n argument message sends, for which I likely need new node classes. And possibly better support in the interpreter, too.
Minor Maintenance
- `main_basic`, to make it possible to debug basic interpreter tests reliably from PyCharm