Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Redesign frames and unify across AST and bytecode interpreters, and specialize calls #18

Merged
merged 53 commits into from
Jul 11, 2021

Conversation

smarr
Copy link
Member

@smarr smarr commented Jul 5, 2021

With opening this PR, the new frame design sketched in #16 and #17 is mostly implemented. Check these issues for design details. This PR will only reiterate divergence/changes from those notes.

Key implications are that frames are not longer represented by objects but 1 or 2 lists/arrays.
This also means, we won't be able to use the virtualizable mechanism of PyPy, and can't really use the immutability annotation either.

LexicalScopes

The PR introduces the notion of LexicalScopes. It's a basic version of what we have in TruffleSOM.
It's needed to be able to lookup the variables when quickening the frame accesses in the bytecode interpreter.
It's easier in the AST interpreter, because we have the variable objects directly accessible.

AST Dispatch Made Iterative

Before this PR, the dispatched used a dispatch chain and essentially recursive methods.
In this PR, this was changed to be iterative and also return the "node" with the lookup result, where a polymorphic method is doing the actual dispatch operation.
This helps to get rid of the special case and missing lookup cache when tracing.
And avoids stack use in the dispatch chain, and who knows, perhaps faster code in the interpreter.

Specializing Function Calls for 1, 2, and 3 Arguments (incl. Receiver)

To further reduce the cost of calls, this PR also introduces specialization of function calls for 1, 2, or 3 arguments (incl. the receiver in the count). The majority of calls are for these sizes, and all but one primitive implementation.
This avoids allocating the arg array in the AST interpreter, and avoids passing the stack for most cases in the BC interpreter.

For the AST interpreter this means message send nodes or unary, binary, ternary, and n-ary sends (normal as well as super sends). Similarly, the bytecode interpreter has bytecodes for these cases.
For the AST interpreter, this gives about 18-32% and for the bytecode interpreter 20-25% https://rebench.stefan-marr.de/compare/RPySOM/bb03d10c20a27698c242fe9847b0aa2c1949c249/bf229f264078074eb474b5902cdc98868911a8d8#micro-somsom-SomSom-ast-interp

Stack Local to BC Interpreter

With the specialization of function calls in place, it's time to remove the stack from the frame in the bytecode interpreter. From TruffleSOM, we know it works well just keeping it in the interpreter loop. However, here we may need to pass it on to the frame creation when doing n-ary calls. But, beside that, it works still very well.

The bytecode interpreter gains another 7-12% in performance:
https://rebench.stefan-marr.de/compare/RPySOM/bf229f264078074eb474b5902cdc98868911a8d8/8aa86e7a1469db3dbbe4e344b8cc3127a7a135cd

Initial Performance when PR was opened (not all optimizations done at that point)

For the AST interpreter, this seems to be a small win.
On the SomSom interpreter benchmarks, I see a 3-5% win. On the JIT compiled performance, there's a good gain of 7-19% on the recursive benchmarks. However, Queens and BubbleSort don't necessarily seem to like it with 7-10% regressions.

For the bytecode interpreter, the change is a major regression, and I think the next step needs to be to remove the stack from the frame, and pass arguments in a way that no additional array needs to be allocated.
I'd hope this should also benefit the AST interpreter. Though, it may lead to more code for 1, 2, 3, 4 - n argument message sends, for which I likely need new node classes. And possibly better support in the interpreter, too.

Minor Maintenance

  • added main_basic to make it possible to debug basic interpreter tests reliably from PyCharm
  • added some unit tests for the stack behavior of the bytecode interpreter frame
  • reduce code duplication in primitives

smarr added 30 commits June 28, 2021 11:37
PyCharm is a bit strange, it fails to run the debugger...

Signed-off-by: Stefan Marr <git@stefan-marr.de>
Signed-off-by: Stefan Marr <git@stefan-marr.de>
- this breaks the printing of stack traces
- this changes the receiver of the #escapedBlock:
- it may fix a bug with block semantics and clearing the previous frame (for non-local returns)

Signed-off-by: Stefan Marr <git@stefan-marr.de>
Signed-off-by: Stefan Marr <git@stefan-marr.de>
Signed-off-by: Stefan Marr <git@stefan-marr.de>
This is mostly to see the performance impact of this.
I am currently thinking about moving away from a Frame class to a structure based on lists, but I don't think I can mark them as immutable.
So, let's see whether there's any notable impact.

Signed-off-by: Stefan Marr <git@stefan-marr.de>
Signed-off-by: Stefan Marr <git@stefan-marr.de>
…head

- locals and arguments accessed by inner scopes is in a second, optional list
- to be able to store the inner list into the frame, I need to use type erasure, which makes things a little more tricky
- the blocks now simply keep a reference to the inner list
- because we split the args/vars that are accessed from an inner scope, we don't know in a single pass the access idx. Thus, we need to resolve it on first execution. Though, it also allows us to unify arg and var nodes

Signed-off-by: Stefan Marr <git@stefan-marr.de>
Signed-off-by: Stefan Marr <git@stefan-marr.de>
Signed-off-by: Stefan Marr <git@stefan-marr.de>
Signed-off-by: Stefan Marr <git@stefan-marr.de>
Signed-off-by: Stefan Marr <git@stefan-marr.de>
Signed-off-by: Stefan Marr <git@stefan-marr.de>
Since the Inner only stores AbstractObjects, we don't really need erasure here.

Signed-off-by: Stefan Marr <git@stefan-marr.de>
Signed-off-by: Stefan Marr <git@stefan-marr.de>
Signed-off-by: Stefan Marr <git@stefan-marr.de>
Part of adopting new frame design for bytecode interpreter.

Signed-off-by: Stefan Marr <git@stefan-marr.de>
This shouldn't change behavior

Signed-off-by: Stefan Marr <git@stefan-marr.de>
- have two types of bytecode methods: with handling of non-local returns, and without
- change parser find_var to return object
- introduce new bytecodes push/pop frame/inner to access the new frame structure. These bytecodes are put into place when doing a pass over the bytecode to determine stack usage. Because only then we know what the indexes are for accessing frame/inner. The old push/pop local/argument bytecode are still around during parsing, but are not supported in the interpreter

Signed-off-by: Stefan Marr <git@stefan-marr.de>
Signed-off-by: Stefan Marr <git@stefan-marr.de>
Signed-off-by: Stefan Marr <git@stefan-marr.de>
The problem is that only once the outer method is completely parsed, we can be sure what each single access index is going to be.
And, we really don't want to make method activation more complicated by having even more complex maps there.
So, let's just rewrite on first execution.

In the interpreter, simply re-execute the same bytecode index after patching the bytecodes.

Remove unneeded FrameDetails and early prepare_frame calls, which couldn't work this way (only with more complex index mapping for argument handling when doing a call).

Signed-off-by: Stefan Marr <git@stefan-marr.de>
…able

Signed-off-by: Stefan Marr <git@stefan-marr.de>
Signed-off-by: Stefan Marr <git@stefan-marr.de>
Signed-off-by: Stefan Marr <git@stefan-marr.de>
…r contexts

Signed-off-by: Stefan Marr <git@stefan-marr.de>
… multiple indexes in create_frame

Signed-off-by: Stefan Marr <git@stefan-marr.de>
This marks the mgenc so to set the bit that we access it from outer scopes.
It's not explicitly marking "self", but since we handle self specially, that's not needed.

Signed-off-by: Stefan Marr <git@stefan-marr.de>
Signed-off-by: Stefan Marr <git@stefan-marr.de>
Signed-off-by: Stefan Marr <git@stefan-marr.de>
smarr added 2 commits July 7, 2021 10:28
Signed-off-by: Stefan Marr <git@stefan-marr.de>
Signed-off-by: Stefan Marr <git@stefan-marr.de>
@smarr
Copy link
Member Author

smarr commented Jul 7, 2021

With the latest changes, which adds specialized calling conventions for 1, 2, and 3 arguments, as well as specific bytecodes, we see now major speedups on for the bytecode interpreter.
The run time it takes to run in compiled code is reduced by about 7-82%. IntegerLoop, and WhileLoop don't show benefits.
Even the macro benchmarks see 11-60% reduction in run time.

The execution of SomSom on the bytecode interpreter requires now 5-7% less time overall, which is a good indication for the improved interpreter speed.

Next step is to apply the calling specialization also in the AST interpreter.

smarr added 4 commits July 7, 2021 17:29
Signed-off-by: Stefan Marr <git@stefan-marr.de>
This possibly saves stack space.

There is also the simplification of the guard, and reducing number of times the rcvr_class needs to be determined.

The Generic Dispatch is now encoded also by the expected_class, as a simple guard after the dispatch loop. This avoids having to use a polymorphic guard checking method. Though, it's not necessarily obvious.

Signed-off-by: Stefan Marr <git@stefan-marr.de>
Signed-off-by: Stefan Marr <git@stefan-marr.de>
…r the JIT, too

Signed-off-by: Stefan Marr <git@stefan-marr.de>
smarr added 11 commits July 8, 2021 00:16
- remove unused universe from Invokable

Signed-off-by: Stefan Marr <git@stefan-marr.de>
I think it may be better to start the trace before creating a frame, because then we see the frame creation inside the trace.

Signed-off-by: Stefan Marr <git@stefan-marr.de>
- increase code sharing a little between AST and bytecode interpreters

Signed-off-by: Stefan Marr <git@stefan-marr.de>
Signed-off-by: Stefan Marr <git@stefan-marr.de>
Signed-off-by: Stefan Marr <git@stefan-marr.de>
Signed-off-by: Stefan Marr <git@stefan-marr.de>
We can't fully get rid of the object, because RPython typing doesn't support having a node-node parent.
But we avoid the extra call.

This change also uses `node` as name for the self to avoid issues with `self` as name on the JitDriver
Signed-off-by: Stefan Marr <git@stefan-marr.de>
Signed-off-by: Stefan Marr <git@stefan-marr.de>
Signed-off-by: Stefan Marr <git@stefan-marr.de>
Signed-off-by: Stefan Marr <git@stefan-marr.de>
This design is very similar to the bytecode interpreter in TruffleSOM.
We don't need to have the stack in the frame, it's just complexity there.

However, having it in the interpreter is a little ackward, too, because of all the code.
Though, it's trivial code, so, hopefully not much an issue.

For n-ary calls, we pass the stack, and stack pointer, and then callee can take what it needs.

Signed-off-by: Stefan Marr <git@stefan-marr.de>
Signed-off-by: Stefan Marr <git@stefan-marr.de>
Signed-off-by: Stefan Marr <git@stefan-marr.de>
Signed-off-by: Stefan Marr <git@stefan-marr.de>
There's more, but let's do at least the easy stuff.
Added some more tests

Signed-off-by: Stefan Marr <git@stefan-marr.de>
@smarr smarr changed the title Redesign frames and unify across AST and bytecode interpreters Redesign frames and unify across AST and bytecode interpreters, and specialize calls Jul 11, 2021
@smarr
Copy link
Member Author

smarr commented Jul 11, 2021

Overall, the results are great, except for AST JIT performance, which dropped. Should be investigated at some point. Richards, Bounce, BubbleSort, List, Queens, Towers all suffered see: https://rebench.stefan-marr.de/compare/RPySOM/bc14175415c59f1cfeb9b295fc8ddc103590d4ae/3eb4d2de135b4cf0b0bf9c4029b419d77ad00d4b

But interpreter performance improved very nicely.
SomSom results should 14-16% for the bytecode interpreter and 27-29% for the AST interpreter.

Time to merge.

@smarr smarr merged commit 8e16a07 into SOM-st:master Jul 11, 2021
@smarr smarr deleted the uniform-frame branch July 11, 2021 22:01
@smarr smarr linked an issue Jul 15, 2021 that may be closed by this pull request
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Make AST and BC interpreter use the same frame format
1 participant