chacha20: Use criterion for benchmarking #62

Merged
merged 1 commit into from Oct 19, 2019

tarcieri
Member

...with the `criterion-cycles-per-byte` plugin

This unfortunately means we can no longer run tests for chacha20 against Rust 1.27.0, since it adds 2018 edition `dev-dependencies`. However, we can still confirm that release builds work against this version.

As we're starting to reach a micro-optimization stage, I think criterion will be extremely helpful in determining whether our micro-optimizations actually improve performance.

Currently I'm getting between 5.1 and 5.8 cpb across the tests on a Kaby Lake i7. Curiously, `-Ctarget-cpu=native` seems to negatively impact performance. I'm not seeing much difference between the software and SSE2 backends: +5% on the `chacha20/apply_keystream/1024` benchmark, negligible on the others (i.e. +1-2%).
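For readers unfamiliar with the metric, cycles per byte (cpb) is the number of CPU timestamp ticks spent per byte processed. The sketch below is not how criterion or its plugin is implemented; it is a minimal std-only illustration that assumes an x86_64 target and reads the timestamp counter directly. The names `read_cycles`, `cycles_per_byte`, and `xor_stand_in` are hypothetical, and `xor_stand_in` is only a placeholder for the real keystream routine.

```rust
use std::hint::black_box;

/// Read the CPU timestamp counter where one is available.
/// Note: the TSC counts reference ticks, which only approximate core cycles.
#[cfg(target_arch = "x86_64")]
#[allow(unused_unsafe)]
fn read_cycles() -> Option<u64> {
    // SAFETY: RDTSC is part of the x86_64 baseline instruction set.
    Some(unsafe { core::arch::x86_64::_rdtsc() })
}

/// Fallback for other targets: std has no portable cycle counter.
#[cfg(not(target_arch = "x86_64"))]
fn read_cycles() -> Option<u64> {
    None
}

/// Run `f` over `buf` `iters` times and return timestamp ticks per byte.
/// Returns `None` when no bytes are processed or no counter is available.
fn cycles_per_byte<F: FnMut(&mut [u8])>(buf: &mut [u8], iters: u32, mut f: F) -> Option<f64> {
    let bytes = buf.len() as f64 * f64::from(iters);
    if bytes == 0.0 {
        return None;
    }
    let start = read_cycles()?;
    for _ in 0..iters {
        // black_box keeps the optimizer from discarding the work.
        f(black_box(&mut *buf));
    }
    let end = read_cycles()?;
    Some(end.wrapping_sub(start) as f64 / bytes)
}

/// Hypothetical stand-in workload: XOR every byte with a constant.
fn xor_stand_in(buf: &mut [u8]) {
    for b in buf.iter_mut() {
        *b ^= 0x5a;
    }
}
```

Calling `cycles_per_byte(&mut buf, 64, xor_stand_in)` on a 1024-byte buffer gives a rough single-shot figure. A single reading like this is noisy, which is the reason to let criterion handle warm-up, sampling, and statistics instead.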

@tarcieri
Member Author

tarcieri commented Oct 19, 2019

@srijs this should unblock using other dev-dependencies which require a recent Rust, I hope

@tarcieri tarcieri merged commit 033af9f into master Oct 19, 2019
@tarcieri tarcieri deleted the chacha20/criterion branch October 19, 2019 17:43
@srijs
Contributor

srijs commented Oct 20, 2019

It's weird that you see so little difference here...

On my machine, across all input sizes, criterion shows around a +25% improvement in throughput between sse2 and the scalar version, with scalar at ~4.3 cpb and sse2 at ~3.4 cpb.

@tarcieri
Member Author

@srijs are you using something other than `RUSTFLAGS="-Ctarget-feature=+sse2"`?

@srijs
Contributor

srijs commented Oct 20, 2019

I'm not using any additional `RUSTFLAGS` right now, just plain `cargo +nightly bench`.

Let me see what happens if I add this flag explicitly (although afaik rustc should enable the sse2 feature for my target by default).

@tarcieri
Member Author

How are you switching between the two implementations then? Setting target-cpu? Something else?

@srijs
Contributor

srijs commented Oct 20, 2019

Applying the flag does not change the outcome for me.

To switch between implementations, I've just been reverting my commit. Perhaps we could add an on-by-default feature flag that can be turned off to opt out of SIMD, which may be useful for these types of comparative benchmarks.

If you're using the flag to switch, are you sure you're not benchmarking the same sse2 implementation both times? Afaik sse2 is enabled by default on a few targets (mine included).
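The point about defaults can be checked directly. The sketch below is an illustration only, not the crate's actual selection code: `Backend`, `select_backend`, and `sse2_enabled_at_compile_time` are hypothetical names, and the `simd_feature` argument stands in for the proposed on-by-default Cargo feature (in a real crate it would presumably come from something like `cfg!(feature = "simd")`, where the feature name is also an assumption).

```rust
/// Which block-function implementation a build would use.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Backend {
    Software,
    Sse2,
}

/// Pure selection rule: use SSE2 only when the target supports it and the
/// hypothetical opt-out feature has not been disabled.
fn select_backend(sse2_available: bool, simd_feature: bool) -> Backend {
    if sse2_available && simd_feature {
        Backend::Sse2
    } else {
        Backend::Software
    }
}

/// True when rustc compiled this code with the `sse2` target feature on.
/// On typical x86_64 targets this holds without any extra RUSTFLAGS, so
/// adding `-Ctarget-feature=+sse2` there changes nothing.
fn sse2_enabled_at_compile_time() -> bool {
    cfg!(target_feature = "sse2")
}
```

With a rule like this, a comparative benchmark run would flip only the feature (for example via `--no-default-features`) rather than relying on a target flag that may already be on.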

@tarcieri
Member Author

> If you're using the flag to switch, are you sure you're not benchmarking the same sse2 implementation both times?

That seems like the most likely explanation, thanks

@tarcieri
Member Author

> On my machine, across all input sizes, criterion shows around a +25% improvement in throughput between sse2 and the scalar version, with scalar at ~4.3 cpb and sse2 at ~3.4 cpb.

Just tried reverting the commit and I'm seeing similar numbers: ~6.7 cpb without, 5.3 cpb with, for a ~25% speedup
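As a quick check on the arithmetic, both sets of numbers in this thread work out to roughly the same throughput gain. The helper names below are hypothetical; the formulas are just the standard conversions between cycles per byte and relative throughput.

```rust
/// Relative throughput gain when cost drops from `before_cpb` to `after_cpb`.
/// Throughput is proportional to 1 / cpb, so the gain is before / after - 1.
fn throughput_gain(before_cpb: f64, after_cpb: f64) -> f64 {
    before_cpb / after_cpb - 1.0
}

/// Relative reduction in cycles per byte for the same change.
fn cycle_reduction(before_cpb: f64, after_cpb: f64) -> f64 {
    1.0 - after_cpb / before_cpb
}
```

For 6.7 cpb down to 5.3 cpb, `throughput_gain` gives about 0.26, and for 4.3 cpb down to 3.4 cpb it also gives about 0.26, matching the "~25%" quoted. The corresponding reduction in cycles per byte is smaller, about 0.21 in both cases, so it matters which of the two figures is being reported.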

@newpavlov
Member

> Curiously -Ctarget-cpu=native seems to negatively impact performance.

Yeah, I've observed the same effect several times on different crates as well. Initially I thought the cause was AVX2 frequency scaling, but now I think that LLVM simply does not work that well with `target-cpu=native` for some reason.
