chacha20: Use criterion for benchmarking #62
@srijs this should unblock using other |
...with the `criterion-cycles-per-byte` plugin

This unfortunately means we can no longer run tests for chacha20 against Rust 1.27.0, since it adds 2018 edition `dev-dependencies`. However, we can still confirm that release builds work against this version.

As we're starting to reach a microoptimization stage, I think criterion will be extremely helpful in determining whether our microoptimizations actually improve performance.

Currently I'm getting between 5.1 and 5.8 cpb across the tests on a Kaby Lake i7. Curiously, `-Ctarget-cpu=native` seems to negatively impact performance. I'm not seeing much difference between the software and SSE2 backends: +5% on the `chacha20/apply_keystream/1024` benchmark, negligible on the others (i.e. +1-2%)
2c55331 to 9562b98
It's weird that you see so little difference here... On my machine, across all input sizes, criterion shows around a 25% throughput improvement from the scalar version to sse2, with scalar at ~4.3 cpb and sse2 at ~3.4 cpb.
@srijs are you using something other than |
I'm not using any additional Let me see what happens if I add this flag explicitly (although afaik |
How are you switching between the two implementations then? Setting |
Applying the flag does not change the outcome for me. To switch between implementations, I've just been reverting my commit. Perhaps we could add an on-by-default feature flag that can be turned off to opt out of SIMD, which may come in useful for these kinds of comparative benchmarks. If you're using the flag to switch, are you sure you're not benchmarking the same sse2 implementation both times? Afaik |
That seems like the most likely explanation, thanks
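For reference, the on-by-default feature flag suggested above could look roughly like the sketch below. The feature name `simd` and the `backend_name` helper are hypothetical and not part of the crate; the sketch assumes `Cargo.toml` declares `default = ["simd"]` and `simd = []` under `[features]`.

```rust
// Hypothetical sketch: `simd` is an assumed Cargo feature, enabled by default via
//   [features]
//   default = ["simd"]
//   simd = []
// It is not an existing feature of the chacha20 crate.

/// Chosen when the (assumed) `simd` feature is on and the target is x86_64,
/// where SSE2 is part of the baseline instruction set.
#[cfg(all(feature = "simd", target_arch = "x86_64"))]
fn backend_name() -> &'static str {
    "sse2"
}

/// Fallback: the portable software implementation.
#[cfg(not(all(feature = "simd", target_arch = "x86_64")))]
fn backend_name() -> &'static str {
    "scalar"
}

fn main() {
    // Printing the selected backend makes it obvious which code path
    // a given benchmark run actually measured.
    println!("selected backend: {}", backend_name());
}
```

With a setup like this, `cargo bench` would measure the SSE2 path and `cargo bench --no-default-features` the scalar one, so the two runs cannot silently hit the same implementation.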
Just tried reverting the commit and I'm seeing similar numbers: ~6.7 cpb without, 5.3 cpb with, for a ~25% speedup
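As a sanity check on the percentages quoted in this thread, here is a small sketch of the arithmetic. Since throughput is inversely proportional to cycles per byte, the throughput gain is `before / after - 1`, which is not the same number as the reduction in cycles. The function names are illustrative only.

```rust
/// Relative throughput gain when cost drops from `before_cpb` to `after_cpb`.
/// Throughput is proportional to 1 / cpb, so the gain is before / after - 1.
fn throughput_gain(before_cpb: f64, after_cpb: f64) -> f64 {
    before_cpb / after_cpb - 1.0
}

/// Relative reduction in cycles spent per byte.
fn cycle_reduction(before_cpb: f64, after_cpb: f64) -> f64 {
    (before_cpb - after_cpb) / before_cpb
}

fn main() {
    // (scalar cpb, sse2 cpb) pairs as quoted in the comments above.
    let runs = [(4.3_f64, 3.4_f64), (6.7, 5.3)];
    for &(scalar, sse2) in &runs {
        println!(
            "{:.1} -> {:.1} cpb: throughput +{:.1}%, cycles -{:.1}%",
            scalar,
            sse2,
            throughput_gain(scalar, sse2) * 100.0,
            cycle_reduction(scalar, sse2) * 100.0
        );
    }
}
```

Both pairs of measurements come out at roughly a 26% throughput gain, or about a 21% reduction in cycles per byte, which is consistent with the "~25%" figures reported on the two machines.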
Yeah, I've observed the same effect several times on different crates as well. Initially I thought the reason was AVX2 frequency scaling, but now I think that LLVM does not work that well with |