feat: implement RPO hash using SVE instructions #190
fbb0fd8 to 701a187
Awesome work! Thank you! I'll do a full review over the next couple of days - but I did run the benchmarks on Graviton 3 and can confirm that I'm getting the same improvement. Very cool!
Looks great! Thank you again! I left a few small nits inline and also one comment about adding a couple of Rust-based tests.
One other thing I'm thinking about is how to fix the CI (we run tests with `all-features` enabled). As far as I can tell, Graviton3 does not expose a feature flag to identify the SVE feature. One way to get around this is to assume that if the architecture is `arm64` and OS is `linux`, then we are on Graviton3 - but it feels a bit hacky. Another way is to modify how we run CI tests - but I'm not sure how yet.
Any thoughts on this?
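As a point of reference for the CI question above, Rust's standard library offers runtime CPU feature detection on AArch64. The sketch below is only one possible way to check for SVE before running SVE-only tests. The helper name `sve_available` is hypothetical and not part of this PR, and whether this fits the project's CI setup is an open question.

```rust
/// Hypothetical helper: reports whether the CPU running this binary
/// advertises SVE support. Uses the standard library's runtime detection.
#[cfg(target_arch = "aarch64")]
pub fn sve_available() -> bool {
    std::arch::is_aarch64_feature_detected!("sve")
}

/// On every other architecture SVE cannot be present, so report `false`.
#[cfg(not(target_arch = "aarch64"))]
pub fn sve_available() -> bool {
    false
}
```

A test that exercises the SVE code path could return early when `sve_available()` is `false`, so the same test suite can run on machines without SVE.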
I would suggest replacing |
4ef3d1b to 01be4d6
All looks good! Thank you!
This PR addresses #158. It improves the performance of the RPO hash function by leveraging SVE instructions on compatible ARMv8+SVE hardware. On the current generation of Amazon Graviton3 processors, I'm measuring a 37% improvement against the baseline commit 90dd3ac (AWS c7g.medium instance).
To leverage the SVE implementation, code needs to be compiled with `--features arch-arm64-sve`. Due to the high latency of vector operations, only 2/3 of array elements are processed using SIMD. The rest can be processed for "free" using scalar instructions and masking the latency.
Below: `cargo bench --features arch-arm64-sve` improvement against the baseline `cargo bench`.
Checklist before requesting a review
`next` according to naming convention.
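To illustrate the 2/3 SIMD, 1/3 scalar split mentioned in the description, here is a minimal Rust sketch. It assumes a 12-element state (the RPO state width), so 8 elements take the "vector" path and 4 take the scalar path. The element-wise operation is a placeholder rather than the actual field arithmetic, the vector path is emulated with plain Rust rather than SVE instructions, and all names are hypothetical. It shows only the partitioning and that the result matches an all-scalar baseline; it does not reproduce the latency hiding itself.

```rust
/// Assumed state width for illustration (RPO operates on a 12-element state).
pub const STATE_WIDTH: usize = 12;
/// Two thirds of the elements go through the emulated vector path.
pub const VECTOR_LANES: usize = STATE_WIDTH * 2 / 3;

/// Placeholder element-wise operation; NOT the real RPO field arithmetic.
fn placeholder_op(x: u64) -> u64 {
    x.wrapping_mul(x).wrapping_add(1)
}

/// Baseline: every element is processed by the scalar path.
pub fn apply_scalar(state: &mut [u64; STATE_WIDTH]) {
    for x in state.iter_mut() {
        *x = placeholder_op(*x);
    }
}

/// Split version: the first 8 elements are handled in groups of 4,
/// standing in for a hypothetical 4-lane vector, and the remaining
/// 4 elements are handled one at a time by scalar code.
pub fn apply_split(state: &mut [u64; STATE_WIDTH]) {
    let (vector_part, scalar_part) = state.split_at_mut(VECTOR_LANES);

    // Emulated vector path: one iteration per hypothetical vector register.
    for chunk in vector_part.chunks_exact_mut(4) {
        for x in chunk.iter_mut() {
            *x = placeholder_op(*x);
        }
    }

    // Scalar path: on real hardware these independent operations could be
    // scheduled while the vector instructions are still in flight.
    for x in scalar_part.iter_mut() {
        *x = placeholder_op(*x);
    }
}
```

Because both paths apply the same operation to disjoint parts of the state, `apply_split` and `apply_scalar` produce identical results, which is the kind of equivalence a Rust-based test could check.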