
(opt): Inline caches for method lookups when sending messages #13

Merged: 6 commits merged into master from opt/inline-caches on Jan 27, 2023

Conversation

@Hirevo (Owner) commented Aug 11, 2020

This PR adds caches to speed up method resolution when the receiver class at a given call site is the same as on the previous send, as a potential optimization (yet to be measured to confirm it is an actual win).

This PR only affects the bytecode interpreter.

Depends on #11.
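In essence, each send site remembers the last receiver class and the method it resolved to; a minimal sketch of the idea follows (hypothetical types, not the PR's actual code):

```rust
// Hypothetical stand-ins for som-rs internals (`ClassId`, `MethodId`,
// and the slow-path closure are assumptions, not the PR's actual code).
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
struct ClassId(usize);

#[derive(Clone, Copy, Debug)]
struct MethodId(usize);

/// One cache entry per call site: the receiver class seen on the last
/// send, and the method the full lookup resolved to for it.
#[derive(Default)]
struct CallSiteCache {
    entry: Option<(ClassId, MethodId)>,
}

impl CallSiteCache {
    /// Resolve the method for `receiver_class`, consulting the cache
    /// first and falling back to the full (slow) lookup on a miss.
    fn resolve(
        &mut self,
        receiver_class: ClassId,
        slow_lookup: impl FnOnce(ClassId) -> MethodId,
    ) -> MethodId {
        match self.entry {
            // Hit: same receiver class as last time, reuse the cached method.
            Some((class, method)) if class == receiver_class => method,
            // Miss: do the full lookup and remember the result.
            _ => {
                let method = slow_lookup(receiver_class);
                self.entry = Some((receiver_class, method));
                method
            }
        }
    }
}

fn main() {
    let mut cache = CallSiteCache::default();
    let first = cache.resolve(ClassId(1), |_| MethodId(42)); // miss: slow path runs
    let second = cache.resolve(ClassId(1), |_| unreachable!()); // hit: cache answers
    println!("{first:?} {second:?}");
}
```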

@Hirevo added the M-interpreter (Module: Interpreter), P-medium (Priority: Medium), and C-performance (Category: Performance improvements) labels on Aug 11, 2020
@Hirevo self-assigned this on Aug 11, 2020
@Hirevo (Owner, Author) commented Jan 18, 2023

During a discussion with @OctaveLarose, he pointed out that my initial implementation of inline caches using hashmaps was likely suboptimal, and that a bigger performance win could potentially be obtained by using plain old arrays for lookups.
That is exactly what this new set of changes does.

This implementation and layout of the caches are similar to how PySOM does it.

Performance measurements on this have yet to be done, but I'll try to get to it soon enough.
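Concretely, the two-Vec layout under review here can be sketched as follows (simplified, with placeholder types; the field names come from the excerpt below, but the real code holds GC'd references and uses unchecked indexing):

```rust
use std::rc::Rc;

// Placeholder types; som-rs has its own Class and Invocable.
struct Class;
struct Invocable;

/// Per-method caches: one slot per bytecode offset, so a send at
/// `bytecode_idx` can index its cache entry directly, with no hashing.
struct InlineCaches {
    inline_cache_receiver: Vec<Option<*const Class>>,
    inline_cache_invocable: Vec<Option<Rc<Invocable>>>,
}

impl InlineCaches {
    /// Both vectors are pre-sized to the method's bytecode length.
    fn new(bytecode_len: usize) -> Self {
        Self {
            inline_cache_receiver: vec![None; bytecode_len],
            inline_cache_invocable: vec![None; bytecode_len],
        }
    }

    /// A hit requires the cached receiver class pointer to match the
    /// class of the current receiver; anything else is a miss.
    fn lookup(&self, bytecode_idx: usize, receiver_class: *const Class) -> Option<Rc<Invocable>> {
        match self.inline_cache_receiver[bytecode_idx] {
            Some(cached) if std::ptr::eq(cached, receiver_class) => {
                self.inline_cache_invocable[bytecode_idx].clone()
            }
            _ => None,
        }
    }
}
```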

Review thread on this excerpt from the interpreter:

```rust
    inline_cache_receiver.get_unchecked_mut(bytecode_idx)
};
let maybe_found_invocable = unsafe {
    inline_cache_invocable.get_unchecked_mut(bytecode_idx)
};
```
Contributor:

The max inline cache size is 1, right? If you used PySOM as a reference, that's indeed what it does, but you might want to investigate a max cache size above that. Or not: maybe 1 is already good enough (most call sites will only ever call a single unique method, but ones that make heavy use of polymorphism won't be happy with a cache that small).
If it's fast as is, then it's fast as is, though! PySOM is doing OK with just 1.

Contributor:

PySOM should have 2 possible cache entries, cached_layout1 and cached_layout2:
https://github.com/SOM-st/PySOM/blob/d85ed9d957c2210bee10a836dac4454432e6c965/src/som/interpreter/bc/interpreter.py#L704

Contributor:

Right, it's 2, my bad... I forgot you used the free BC after send like this

@Hirevo (Owner, Author):

I didn't quite notice this during my exploration of PySOM, thanks for pointing it out.
I'll try to implement something like this and report on this PR how it changes the performance.
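For illustration, a two-entry cache along those lines might look like this in Rust (the names and the fill policy are assumptions, not the PR's eventual code):

```rust
// Hypothetical stand-ins for som-rs internals, as in the earlier sketch.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
struct ClassId(usize);

#[derive(Clone, Copy, Debug)]
struct MethodId(usize);

/// Two cache entries per call site, checked in order, in the spirit of
/// PySOM's cached_layout1 / cached_layout2.
#[derive(Default)]
struct TwoEntryCache {
    entries: [Option<(ClassId, MethodId)>; 2],
}

impl TwoEntryCache {
    /// Return the cached method if either entry matches the receiver class.
    fn lookup(&self, receiver_class: ClassId) -> Option<MethodId> {
        self.entries
            .iter()
            .flatten()
            .find_map(|&(class, method)| (class == receiver_class).then_some(method))
    }

    /// On a miss, fill the first empty slot; if both slots are taken,
    /// leave the cache alone (one simple policy among several possible).
    fn fill(&mut self, receiver_class: ClassId, method: MethodId) {
        if let Some(slot) = self.entries.iter_mut().find(|slot| slot.is_none()) {
            *slot = Some((receiver_class, method));
        }
    }
}
```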

Review thread on this excerpt:

```rust
                }
            }
        }
        FrameKind::Method { method, .. } => {
```
Contributor:

This is something I've noticed in general when working on som-rs: there's a lot of duplication between Methods and Blocks, but they're almost the same thing, right? Could they be unified somehow? (This is outside the scope of these inline caching changes, but seeing it here reminded me again.)

Contributor:

They are unified in all other SOMs: a block refers to a method. And then everything follows from there.
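In type terms, that unification could be as simple as having the block own (a reference to) a method (a sketch with placeholder fields, not som-rs's actual definitions):

```rust
use std::rc::Rc;

struct Method {
    bytecode: Vec<u8>,
    // Literals, inline caches, etc. would also live here, once,
    // for methods and blocks alike.
}

/// A block is then just a method plus its captured environment,
/// so the interpreter's dispatch loop only ever sees `Method`.
struct Block {
    method: Rc<Method>,
    // ... the captured outer frame would go here
}
```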

@Hirevo (Owner, Author):

Yeah, it is true that there are plenty of places where the code could be a lot cleaner than it currently is.
I think this is because I originally wrote this code when the interpreter wasn't working yet: I just wanted to get it working before refactoring, but I ended up never doing that refactoring.
These two branches can definitely be unified; the only difference between them is the location of the inline cache storage.

@Hirevo (Owner, Author) commented Jan 24, 2023

@smarr @OctaveLarose Thanks for the code reviews, and sorry for the delay in my replies.

I finally went ahead and ran the benchmarks on the different implementations of inline caches up to this point, and here is a compilation of the results:

[Image: benchmark results comparing the four inline-cache implementations]

I used ReBench to run all the SOM core lib benchmarks, with the same configuration as the current rebench.conf, each time with the relevant binary compiled in release mode:

| Binary name | Corresponding commit |
| --- | --- |
| without-inline-caches (baseline) | b1a0c82 (current master) |
| hashmap-inline-caches | 77fd7e7 |
| two-vec-inline-caches | 4cee8fb |
| one-vec-inline-caches | c7e8160 |

These results seem to indicate that:

  • The original implementation using HashMap was indeed suboptimal (granting rather small gains) and somewhat inconsistent.
  • The latest single-Vec technique is a guaranteed performance improvement (between 4 and 20% faster), as sketched below.
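For reference, a sketch of what that single-Vec layout amounts to (placeholder types; the real code uses som-rs's GC'd references):

```rust
use std::rc::Rc;

// Placeholder types, as in the earlier two-Vec sketch.
struct Class;
struct Invocable;

/// Single-Vec variant: one entry per bytecode offset, pairing the last
/// receiver class seen at that send with the invocable it resolved to.
type InlineCache = Vec<Option<(*const Class, Rc<Invocable>)>>;

/// The cache is pre-sized to the method's bytecode length, so a send
/// indexes its entry directly, as in the two-Vec version.
fn new_inline_cache(bytecode_len: usize) -> InlineCache {
    vec![None; bytecode_len]
}
```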

The remaining things I want to do before merging this PR are cleaning up some of the code and maybe trying out a per-call-site cache size greater than 1, though the latter could perhaps be done in its own PR; I am not sure yet.

@Hirevo (Owner, Author) commented Jan 24, 2023

The CI benchmarks are failing because rebench.polomack.eu is currently not set up properly (I encountered some issues when upgrading ReBenchDB, but nothing too bad).
I'll re-run them once that is resolved.

@smarr (Contributor) commented Jan 24, 2023

Apparently @OctaveLarose also has issues running the latest ReBench, but he was using the Docker image. If you run things manually, the database likely needs the migration.*.sql files applied.

@Hirevo (Owner, Author) commented Jan 24, 2023

During the upgrade a few days ago, I did encounter some database-related errors and managed to fix them by applying the migrations.
The issue that currently causes the instance to be down is that I made a rather sad mistake when trying to install Rscript 4.1 (now needed due to the use of |>) on a Debian machine using the unstable packages (because the stable package of Rscript is too old).
I ended up with the machine no longer having a working libc.so (due to an attempt to upgrade it as well), which meant no program could run properly.
Thankfully, no data was lost, since the disks are still perfectly readable (and I had some backups), but it still means I have to reinstall everything.
Since I have the old configurations and whatnot, it is going pretty smoothly; it just takes quite a bit of time.

@smarr (Contributor) commented Jan 24, 2023

Yeah, sorry... I am slowly working on getting rid of R completely. But that's very slow work unfortunately...

@OctaveLarose (Contributor) commented

> The latest single-Vec technique is a guaranteed performance improvement (between 4 and 20% faster)

Nice, congrats!

> Apparently @OctaveLarose also has issues running the latest ReBench, but he was using the Docker image. If you run things manually, the database likely needs the migration.*.sql files applied.

If relevant, I use commit 79dbe5a66a73d2ed05112956575b1a07077f8c2a, but yeah, with the Docker image. I've never run into R issues, and I dread the day I will.

@Hirevo (Owner, Author) commented Jan 25, 2023

The ReBenchDB instance is now back online, and the benchmarking CI runs have been re-run.
The reports there seem to confirm the performance wins I observed on my machine.

@Hirevo (Owner, Author) commented Jan 27, 2023

I've cleaned up the code a bit to remove the duplication between the handling of Send and SuperSend, which did pretty much the same thing, just with different variables.
I think that removing the duplication between blocks and methods will require deeper changes that seem out of scope for this PR.

I currently have an implementation of inline caches with more than one slot per call-site that I have pushed to a separate branch (opt/bigger-inline-caches).

I did run the benchmarks with inline caches of size 2 and 3 and here are the results:

[Image: benchmark results for inline caches of size 2 and 3]

They don't seem to make much of a difference in actual performance, so I am thinking we can merge this PR (with single-slot inline caches) and keep the multiple-slot inline caches as a branch (or an open PR) for exploring whether they can be improved.

@Hirevo (Owner, Author) commented Jan 27, 2023

The branch has been rebased on the latest master branch, prior to merging.

@smarr @OctaveLarose Thanks for the code reviews and the help improving the implementation!

@Hirevo Hirevo merged commit 8e2c00e into master Jan 27, 2023
@Hirevo Hirevo deleted the opt/inline-caches branch January 27, 2023 16:03