(opt): Inline caches for method lookups when sending messages #13
Conversation
During a discussion with @OctaveLarose, he pointed out that my initial implementation of inline caches using hashmaps was likely suboptimal, and that a bigger performance win could potentially be obtained by using plain old arrays for lookups. This implementation and layout of the caches are similar to how PySOM does it. Performance measurements on this have yet to be done, but I'll try to get to them soon.
```rust
    inline_cache_receiver.get_unchecked_mut(bytecode_idx)
};
let maybe_found_invocable = unsafe {
    inline_cache_invocable.get_unchecked_mut(bytecode_idx)
};
```
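For context, here is a minimal self-contained sketch of the array-based layout this diff is moving to: two flat arrays indexed by bytecode offset, with a check-and-fill fast path. The `InlineCache` struct, the stand-in `Class`/`Invocable` types, and the `slow_lookup` callback are assumptions made for this example, not som-rs's actual types:

```rust
use std::rc::Rc;

// Hypothetical stand-ins for the real som-rs class and invocable types.
struct Class;
struct Invocable;

/// One pair of flat arrays per method body, indexed by bytecode offset:
/// slot i holds the receiver class and resolved method last seen at
/// bytecode i, or None if that send site has not executed yet.
struct InlineCache {
    receivers: Vec<Option<Rc<Class>>>,
    invocables: Vec<Option<Rc<Invocable>>>,
}

impl InlineCache {
    fn new(bytecode_count: usize) -> Self {
        InlineCache {
            receivers: vec![None; bytecode_count],
            invocables: vec![None; bytecode_count],
        }
    }

    /// Fast path: if the cached class matches the current receiver's
    /// class, reuse the cached invocable; otherwise fall back to a full
    /// lookup and refill the slot.
    fn lookup(
        &mut self,
        bytecode_idx: usize,
        receiver_class: &Rc<Class>,
        slow_lookup: impl FnOnce() -> Rc<Invocable>,
    ) -> Rc<Invocable> {
        if let Some(cached) = &self.receivers[bytecode_idx] {
            if Rc::ptr_eq(cached, receiver_class) {
                return self.invocables[bytecode_idx].clone().unwrap();
            }
        }
        let invocable = slow_lookup();
        self.receivers[bytecode_idx] = Some(receiver_class.clone());
        self.invocables[bytecode_idx] = Some(invocable.clone());
        invocable
    }
}
```

The win over a hashmap is that the hit path is a plain indexed load plus one pointer comparison, with no hashing at all.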
The max inline cache size is 1, right? If you used PySOM as a reference, that's what it does indeed, but you might want to investigate a max cache size above that. Or not; maybe 1 is already good enough (most call sites will only ever call a single unique method, but ones that make heavy use of polymorphism won't be happy with a cache that small).
If it's fast as is, then it's fast as is, though! PySOM is doing OK with just 1.
PySOM should have 2 possible cache entries, `cached_layout1` and `cached_layout2`:
https://github.com/SOM-st/PySOM/blob/d85ed9d957c2210bee10a836dac4454432e6c965/src/som/interpreter/bc/interpreter.py#L704
Right, it's 2, my bad... I forgot you used the free bytecode (BC) slot after the send like this.
I didn't quite notice this during my exploration of PySOM; thanks for pointing it out.
I'll try to implement something like this, and report on this PR how it changes the performance.
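For illustration, here is a rough sketch of what a two-entry cache per call site, in the spirit of PySOM's `cached_layout1`/`cached_layout2`, might look like; every name below is invented for the example rather than taken from som-rs or PySOM:

```rust
use std::rc::Rc;

// Hypothetical stand-ins for the real class and invocable types.
struct Class;
struct Invocable;

/// Two cache slots per send site, probed in order; a miss on both
/// falls through to the full lookup and fills a slot.
#[derive(Default)]
struct TwoEntryCache {
    entries: [Option<(Rc<Class>, Rc<Invocable>)>; 2],
}

impl TwoEntryCache {
    fn lookup(
        &mut self,
        receiver_class: &Rc<Class>,
        slow_lookup: impl FnOnce() -> Rc<Invocable>,
    ) -> Rc<Invocable> {
        for entry in &self.entries {
            if let Some((class, invocable)) = entry {
                if Rc::ptr_eq(class, receiver_class) {
                    return invocable.clone();
                }
            }
        }
        let invocable = slow_lookup();
        // Simple fill policy: take the first empty slot, else evict slot 2.
        let slot = if self.entries[0].is_none() { 0 } else { 1 };
        self.entries[slot] = Some((receiver_class.clone(), invocable.clone()));
        invocable
    }
}
```

PySOM gets its second entry by reusing the free bytecode slot after the send, as noted above; an explicit two-slot array is just another way to express the same idea.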
```rust
            }
        }
    }
    FrameKind::Method { method, .. } => {
```
This is something I've noticed in general when working on som-rs: there's a lot of duplication between Methods and Blocks, but they're almost the same thing, right? Could they be unified somehow? (This is outside the scope of these inline caching changes, but seeing it here reminded me again.)
They are unified in all other SOMs: a block refers to a method. And then everything follows from there.
Yeah, it is true that there are plenty of places where the code could be a lot cleaner than how it currently is.
I think this is because I originally wrote this code when the interpreter wasn't working yet, and I just wanted to get it working before refactoring, but I ended up never doing that.
These two branches can definitely be unified; the only difference between them is the location of the inline cache storage.
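As a rough illustration of that unification (all names below are invented for the sketch, not som-rs's actual definitions), the two match arms could collapse into one shared send path that only differs in which cache storage it is handed:

```rust
// Hypothetical stand-in: both frame kinds ultimately own the same kind
// of cache storage, so the send logic itself can be written once.
struct InlineCacheStorage;

enum FrameKind {
    Method { cache: InlineCacheStorage },
    Block { cache: InlineCacheStorage },
}

/// Shared send logic, independent of the frame kind.
fn perform_send(cache: &mut InlineCacheStorage, bytecode_idx: usize) {
    // ... cache lookup and dispatch would go here ...
    let _ = (cache, bytecode_idx);
}

fn dispatch(frame: &mut FrameKind, bytecode_idx: usize) {
    // The only per-kind code left is fetching the cache location.
    let cache = match frame {
        FrameKind::Method { cache, .. } => cache,
        FrameKind::Block { cache, .. } => cache,
    };
    perform_send(cache, bytecode_idx);
}
```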
@smarr @OctaveLarose Thanks for the code reviews, and sorry for the delay in my replies. I finally went ahead and ran the benchmarks on the different implementations of inline caches up to this point, and here is a compilation of the results. I used ReBench to run all the SOM core-lib benchmarks, with the same configuration as the current…
These results seem to indicate that:
The remaining things I want to do before merging this PR are cleaning up some of the code and maybe trying out a cache size above 1 per call site, though perhaps the latter could be done in its own PR; I am not sure yet.
The CI benchmarks are failing due to…
Apparently @OctaveLarose also has issues running the latest ReBench. But he was using the Docker image. If you run things manually, the database likely needs the migration.*.sql files applied.
During the upgrade a few days ago, I did encounter some database-related errors and managed to fix them by applying the migrations.
Yeah, sorry... I am slowly working on getting rid of R completely. But that's very slow work, unfortunately...
Nice, congrats!
If relevant, I use commit 79dbe5a66a73d2ed05112956575b1a07077f8c2a, but yeah, with the Docker image. I've never run into R issues, and I dread the day I will.
The ReBenchDB instance is now back online and the benchmarking CI runs have been re-run.
I've cleaned up the code a bit to remove the duplication between the handling of `FrameKind::Method` and `FrameKind::Block`.

I currently have an implementation of inline caches with more than one slot per call site, pushed to a separate branch.

I did run the benchmarks with inline caches of size 2 and 3, and here are the results: they seem to not make much of a difference in actual performance, so I am thinking that we can merge this PR (with single-slot inline caches) and keep the multiple-slot inline caches as a branch (or open PR) for exploring whether they can be improved.
The branch has been rebased on the latest master branch, prior to merging. @smarr @OctaveLarose Thanks for the code reviews and the help with improving the implementation!
This PR adds caches to speed up method resolution when the same call site repeatedly sees the same receiver class, as a potential optimization (yet to be measured to see whether it's an actual win).
This PR only affects the bytecode interpreter.
Depends on #11.