Redesign frames and unify across AST and bytecode interpreters, and specialize calls #18

smarr · 2021-07-05T21:25:17Z

This PR implements most of the new frame design sketched in #16 and #17. See those issues for the design details; this description only covers where the implementation diverges from or changes those notes.

Key implications are that frames are no longer represented by objects but by 1 or 2 lists/arrays.
This also means we won't be able to use the virtualizable mechanism of PyPy, and can't really use the immutability annotation either.
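
To make the "1 or 2 lists" idea concrete, here is a minimal sketch. The layout, the index constants, and the helper names are assumptions for illustration, not the layout this PR uses: the frame is a plain list, and a second "inner" list is allocated only when some variables are captured by nested blocks.

```python
# Hypothetical layout (an assumption, not the PR's actual one):
#   frame = [inner, receiver, arg1, ..., local1, ...]
#   inner = [captured1, ...] or None if nothing is captured by a block
FRAME_INNER_IDX = 0
FRAME_RCVR_IDX = 1


def create_frame(receiver, args, num_locals, num_captured):
    # The second list is only allocated when it is needed.
    inner = [None] * num_captured if num_captured > 0 else None
    return [inner, receiver] + list(args) + [None] * num_locals


def read_inner(frame, idx):
    return frame[FRAME_INNER_IDX][idx]


def write_inner(frame, idx, value):
    frame[FRAME_INNER_IDX][idx] = value
```

Since a frame is just a list, a nested block only needs a reference to the inner list, not to the whole frame of its enclosing method.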

LexicalScopes

The PR introduces the notion of LexicalScopes. It's a basic version of what we have in TruffleSOM.
It's needed to be able to look up the variables when quickening the frame accesses in the bytecode interpreter.
It's easier in the AST interpreter, because we have the variable objects directly accessible.
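
A rough sketch of what such a scope lookup could look like. The class shape and the returned triple are assumptions for illustration, not the code in this PR: a scope knows its arguments, locals, and outer scope, and a lookup by name yields the context level and slot index that a quickened bytecode would need.

```python
class LexicalScope(object):
    """Hypothetical, simplified scope: names only, no variable objects."""

    def __init__(self, outer, arguments, local_names):
        self.outer = outer
        self.arguments = arguments
        self.local_names = local_names

    def lookup(self, name):
        # Walk outwards, counting how many scopes we had to leave.
        scope = self
        level = 0
        while scope is not None:
            if name in scope.arguments:
                return ("argument", level, scope.arguments.index(name))
            if name in scope.local_names:
                return ("local", level, scope.local_names.index(name))
            scope = scope.outer
            level += 1
        return None
```

For example, a block nested in a method would resolve a method-level temporary at context level 1:

```python
method_scope = LexicalScope(None, ["self", "x"], ["tmp"])
block_scope = LexicalScope(method_scope, ["each"], [])
kind, level, idx = block_scope.lookup("tmp")
```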

AST Dispatch Made Iterative

Before this PR, the dispatch used a dispatch chain and essentially recursive methods.
In this PR, this was changed to be iterative and to return the "node" with the lookup result, on which a polymorphic method does the actual dispatch operation.
This helps to get rid of the special case and the missing lookup cache when tracing.
It also avoids stack use in the dispatch chain and, who knows, perhaps gives faster code in the interpreter.
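
The shape of this change can be sketched as follows. This is a toy with hypothetical class names, and Python classes with `getattr` stand in for SOM classes and method lookup: the cache is walked in a loop, the matching node is returned, and the node's polymorphic `dispatch` method does the call.

```python
class CachedMethodNode(object):
    def __init__(self, expected_class, method, next_node):
        self.expected_class = expected_class
        self.method = method
        self.next_node = next_node

    def dispatch(self, receiver, args):
        return self.method(receiver, *args)


class CachedDnuNode(object):
    def __init__(self, expected_class, selector, next_node):
        self.expected_class = expected_class
        self.selector = selector
        self.next_node = next_node

    def dispatch(self, receiver, args):
        # Stand-in for sending #doesNotUnderstand:arguments:
        return ("doesNotUnderstand", self.selector)


class SendNode(object):
    def __init__(self, selector):
        self.selector = selector
        self.cache = None  # head of the linked list of cache nodes

    def lookup(self, receiver):
        cls = type(receiver)
        node = self.cache
        while node is not None:  # iterative, no recursion through the chain
            if node.expected_class is cls:
                return node
            node = node.next_node
        method = getattr(cls, self.selector, None)
        if method is None:
            node = CachedDnuNode(cls, self.selector, self.cache)
        else:
            node = CachedMethodNode(cls, method, self.cache)
        self.cache = node
        return node

    def execute(self, receiver, args):
        return self.lookup(receiver).dispatch(receiver, args)
```

Because `lookup` always returns a node, the caller has a single code path, whether the entry was already cached or had to be created.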

Specializing Function Calls for 1, 2, and 3 Arguments (incl. Receiver)

To further reduce the cost of calls, this PR also introduces specialization of function calls for 1, 2, or 3 arguments (incl. the receiver in the count). The majority of calls are of these sizes, as are all but one of the primitive implementations.
This avoids allocating the arg array in the AST interpreter, and avoids passing the stack for most cases in the BC interpreter.

For the AST interpreter, this means message send nodes for unary, binary, ternary, and n-ary sends (normal as well as super sends). Similarly, the bytecode interpreter has bytecodes for these cases.
For the AST interpreter, this gives an improvement of about 18-32%, and for the bytecode interpreter 20-25%: https://rebench.stefan-marr.de/compare/RPySOM/bb03d10c20a27698c242fe9847b0aa2c1949c249/bf229f264078074eb474b5902cdc98868911a8d8#micro-somsom-SomSom-ast-interp
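
As an illustration of the arity specialization, here is a simplified sketch. The `Method` class, the `invoke_*` names, and the frame layout `[receiver, args..., locals...]` are assumptions, not necessarily what the PR uses: the fixed-arity entry points take their arguments as plain parameters, so the caller never builds an argument array, and only the n-ary fallback receives a list.

```python
class Method(object):
    """Hypothetical method object; body is a callable taking the frame."""

    def __init__(self, num_locals, body):
        self.num_locals = num_locals
        self.body = body

    def invoke_1(self, rcvr):
        return self.body([rcvr] + [None] * self.num_locals)

    def invoke_2(self, rcvr, arg1):
        return self.body([rcvr, arg1] + [None] * self.num_locals)

    def invoke_3(self, rcvr, arg1, arg2):
        return self.body([rcvr, arg1, arg2] + [None] * self.num_locals)

    def invoke_n(self, rcvr, args):
        # Only this fallback needs the arguments packed into a list.
        return self.body([rcvr] + args + [None] * self.num_locals)
```

A ternary send would then call `method.invoke_3(rcvr, arg1, arg2)` directly. The frame itself is still allocated; what is saved is the intermediate argument array.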

Stack Local to BC Interpreter

With the specialization of function calls in place, it's time to remove the stack from the frame in the bytecode interpreter. From TruffleSOM, we know it works well just keeping it in the interpreter loop. However, here we may need to pass it on to the frame creation when doing n-ary calls. Besides that, it still works very well.

The bytecode interpreter gains another 7-12% in performance:
https://rebench.stefan-marr.de/compare/RPySOM/bf229f264078074eb474b5902cdc98868911a8d8/8aa86e7a1469db3dbbe4e344b8cc3127a7a135cd
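
A toy interpreter loop shows the idea. The bytecode set and encoding here are made up for illustration and are not the PR's: the operand stack and stack pointer are locals of the loop, while the frame only holds the receiver, arguments, and locals.

```python
PUSH_CONST, PUSH_ARG, ADD, RETURN = range(4)


def interpret(bytecodes, constants, frame, max_stack):
    # The stack lives in the interpreter loop, not in the frame.
    stack = [None] * max_stack
    stack_ptr = -1
    pc = 0
    while True:
        bc = bytecodes[pc]
        if bc == PUSH_CONST:
            stack_ptr += 1
            stack[stack_ptr] = constants[bytecodes[pc + 1]]
            pc += 2
        elif bc == PUSH_ARG:
            stack_ptr += 1
            stack[stack_ptr] = frame[bytecodes[pc + 1]]
            pc += 2
        elif bc == ADD:
            right = stack[stack_ptr]
            stack[stack_ptr] = None
            stack_ptr -= 1
            stack[stack_ptr] = stack[stack_ptr] + right
            pc += 1
        elif bc == RETURN:
            return stack[stack_ptr]
        else:
            raise ValueError("unknown bytecode: %d" % bc)
```

In this sketch, an n-ary send would be the one place where `stack` and `stack_ptr` have to be handed to frame creation so that the arguments can be copied out of it.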

Initial Performance when PR was opened (not all optimizations done at that point)

For the AST interpreter, this seems to be a small win.
On the SomSom interpreter benchmarks, I see a 3-5% win. On the JIT compiled performance, there's a good gain of 7-19% on the recursive benchmarks. However, Queens and BubbleSort don't necessarily seem to like it with 7-10% regressions.

For the bytecode interpreter, the change is a major regression, and I think the next step needs to be to remove the stack from the frame, and pass arguments in a way that no additional array needs to be allocated.
I'd hope this should also benefit the AST interpreter. Though, it may lead to more code for message sends with 1, 2, 3, and 4 to n arguments, for which I likely need new node classes. And possibly better support in the interpreter, too.

Minor Maintenance

added main_basic to make it possible to debug basic interpreter tests reliably from PyCharm
added some unit tests for the stack behavior of the bytecode interpreter frame
reduced code duplication in primitives

PyCharm is a bit strange, it fails to run the debugger... Signed-off-by: Stefan Marr <git@stefan-marr.de>