chacha20: add SSE2 accelerated variant #61

Merged: 1 commit into RustCrypto:master on Oct 19, 2019

Conversation

@srijs (Contributor) commented Oct 18, 2019

This implements an SSE2-based variant of ChaCha20.

I'd also be keen to help add support for AVX2 and AVX-512, but it looks like this may need larger changes (possibly in the salsa20_core crate) to allow for processing multiple blocks simultaneously?

On my machine (early 2016 MacBook, 1.1 GHz Core m3), the SSE2 variant shows a good improvement over the scalar code, speeding up the benchmarks from 260 MB/s to 327 MB/s.

Aside from the core algorithm, here are some of the things I could use feedback on:

  • Gating: I've only implemented compile-time feature detection for now, as that seemed easier (see the sketch after this list). I could potentially add runtime detection as well, if you think that is useful? Additionally, I was considering pulling in cfg-if to simplify the detection logic, but wasn't sure if you wanted an additional dependency. Lastly, do we need to stay backward compatible with Rust versions from before core::arch was stabilized?

  • Facade: I'm currently switching between the two implementations inside the cipher module, but I wonder if there is a better way? Potentially, the block module could become a facade for block/scalar and block/sse2, which would abstract away the fact that there are multiple implementations?

  • Testing: I've opted to pregenerate random inputs that I use in the tests, which seemed like an okay approach. I was considering pulling in proptest to essentially randomly test the new variant against the scalar implementation, but I wasn't quite sure if you wanted to have it as a dependency.
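
For the gating point above, this is roughly the shape of the compile-time selection I have in mind; the scalar/sse2 submodule split matches what I'm proposing for the block module, but the exact names are illustrative:

```rust
// Compile-time selection between the scalar and SSE2 block implementations.
// SSE2 is part of the x86_64 baseline, so the target_feature check is
// normally satisfied there; everywhere else the scalar code is used.
#[cfg(all(any(target_arch = "x86", target_arch = "x86_64"), target_feature = "sse2"))]
mod sse2;
#[cfg(all(any(target_arch = "x86", target_arch = "x86_64"), target_feature = "sse2"))]
pub(crate) use self::sse2::Block;

#[cfg(not(all(any(target_arch = "x86", target_arch = "x86_64"), target_feature = "sse2")))]
mod scalar;
#[cfg(not(all(any(target_arch = "x86", target_arch = "x86_64"), target_feature = "sse2")))]
pub(crate) use self::scalar::Block;
```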


Before:

test bench1_10     ... bench:          48 ns/iter (+/- 22) = 208 MB/s
test bench2_100    ... bench:         392 ns/iter (+/- 15) = 255 MB/s
test bench3_1000   ... bench:       4,043 ns/iter (+/- 642) = 247 MB/s
test bench4_10000  ... bench:      38,370 ns/iter (+/- 1,619) = 260 MB/s
test bench5_100000 ... bench:     383,558 ns/iter (+/- 9,745) = 260 MB/s

After:

test bench1_10     ... bench:          40 ns/iter (+/- 3) = 250 MB/s
test bench2_100    ... bench:         317 ns/iter (+/- 20) = 315 MB/s
test bench3_1000   ... bench:       3,082 ns/iter (+/- 117) = 324 MB/s
test bench4_10000  ... bench:      30,681 ns/iter (+/- 808) = 325 MB/s
test bench5_100000 ... bench:     305,280 ns/iter (+/- 14,035) = 327 MB/s

@srijs force-pushed the chacha20-sse2 branch 3 times, most recently from f890007 to ff26e7c on October 18, 2019 at 12:36
@tarcieri (Member) commented Oct 18, 2019

SSE2 variant shows a good improvement over the scalar code, speeding up the benchmarks from 260 MB/s to 327 MB/s.

Awesome, this looks good at first glance.

I'd also be keen to help add support for AVX2 and AVX-512, but it looks like this may need larger changes (possibly in the salsa20_core crate) to allow for processing multiple blocks simultaneously?

Yes. It might make sense to look at ways to improve/simplify the salsa20-core crate as well, as its implementation is rather complex and not particularly optimal. Perhaps it would benefit from taking a 32-bit counter and outputting 512-bit blocks as [u8; 64], rather than the current approach which operates directly on the [u32; 8] output array.
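
Roughly the kind of interface I have in mind (entirely hypothetical, not the current salsa20-core API):

```rust
/// Hypothetical byte-oriented core: take a 32-bit block counter and write
/// one 512-bit (64-byte) keystream block.
pub trait BlockCore {
    /// Generate the keystream block for `counter` into `block`.
    fn generate(&self, counter: u32, block: &mut [u8; 64]);

    /// Generate several consecutive blocks starting at `counter`; a SIMD
    /// backend (e.g. AVX2) could override this to compute them in parallel.
    fn generate_blocks(&self, counter: u32, blocks: &mut [[u8; 64]]) {
        for (i, block) in blocks.iter_mut().enumerate() {
            self.generate(counter.wrapping_add(i as u32), block);
        }
    }
}
```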

Additionally, note we haven't implemented the optimization where the ChaCha (or Salsa) input array / block state is computed once and then copied, with only the counter updated, to avoid setting it up over and over for each block, as described in RFC 8439 Section 3:

Each block of ChaCha20 involves 16 move operations and one increment operation for loading the state, 80 each of XOR, addition and roll operations for the rounds, 16 more add operations and 16 XOR operations for protecting the plaintext. Section 2.3 describes the ChaCha block function as "adding the original input words". This implies that before starting the rounds on the ChaCha state, we copy it aside, only to add it in later. This is correct, but we can save a few operations if we instead copy the state and do the work on the copy. This way, for the next block you don't need to recreate the state, but only to increment the block counter. This saves approximately 5.5% of the cycles.

...so in addition to computing multiple blocks in parallel, they can all be initialized by copying them from the same pre-initialized block, and then setting their counters accordingly.
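
As a rough sketch of that (word 12 is the block counter in the RFC 8439 state layout; the names here are just illustrative):

```rust
/// The full 16-word initial state (constants, key, nonce), computed once,
/// with the counter word left at zero.
struct PrecomputedState {
    state: [u32; 16],
}

impl PrecomputedState {
    /// Input state for block `counter`: copy the precomputed state and patch
    /// in the counter, instead of rebuilding the whole array for every block.
    fn for_block(&self, counter: u32) -> [u32; 16] {
        let mut state = self.state; // [u32; 16] is Copy: a plain 64-byte copy
        state[12] = counter; // word 12 is the block counter (RFC 8439 layout)
        state
    }
}
```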

AVX2 sounds good. As for AVX-512, I had suspected that in most situations the performance tradeoffs for AVX-512 aren't worth it, and that was later confirmed by Cloudflare engineers. An AVX-512 implementation of ChaCha20 is really only useful as something like a CSPRNG feeding another SIMD workload. As soon as you try to combine it into the ChaCha20Poly1305 AEAD, performance suffers, so I'd recommend just stopping with AVX2.

Let me get back to your questions on gating / facade / testing. I had lots of fun with these in the polyval crate, but don't have time to write them up right now.

@tarcieri (Member)

Answering your remaining questions...

Gating

I've only implemented compile-time feature detection for now, as that seemed easier. I could potentially add runtime detection as well, if you think that is useful?

I'd suggest getting everything working with compile-time gating only and then circling back on runtime gating. I ended up abandoning runtime feature detection for PCLMULQDQ in the polyval crate, but may end up circling back on it eventually.

Additionally, I was considering pulling in cfg-if to simplify the detection logic, but wasn't sure if you wanted an additional dependency. Lastly, do we need to stay backward compatible with Rust versions from before core::arch was stabilized?

I'm fine with pulling in cfg-if if you find it helpful. However, I tried to use it to simplify the gating in the polyval PR linked above, only to discover it was too inexpressive to really help there.
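
For reference, the cfg-if form of that gating would look roughly like this (assuming the same hypothetical scalar/sse2 module split from the PR description):

```rust
#[macro_use]
extern crate cfg_if; // pre-2018-edition import style

cfg_if! {
    if #[cfg(all(target_arch = "x86_64", target_feature = "sse2"))] {
        mod sse2;
        pub(crate) use self::sse2::Block;
    } else {
        mod scalar;
        pub(crate) use self::scalar::Block;
    }
}
```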

Facade

I'm currently switching between the two implementations inside the cipher module, but I wonder if there is a better way? Potentially, the block module could become a facade for block/scalar and block/sse2, which would abstract away the fact that there are multiple implementations?

In polyval there's a field::Element facade/newtype, but I've been wondering if that might make sense to replace with a trait or thereabouts.

I think you could potentially have an SseCipher type which impls the same API, which the ChaCha20 type consumes directly.
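
Loosely something like this, where the type names and method signature are purely illustrative rather than anything the crate currently exposes:

```rust
// Hypothetical facade: both backends expose the same inherent method, and a
// cfg'd re-export picks one of them under a single `Cipher` name that the
// public ChaCha20 type consumes directly.
pub(crate) struct ScalarCipher { /* key / nonce words */ }
pub(crate) struct SseCipher { /* key / nonce as SSE lanes */ }

impl ScalarCipher {
    pub(crate) fn generate(&self, counter: u32, block: &mut [u8; 64]) {
        // scalar double-rounds would go here
        let _ = (counter, block);
    }
}

impl SseCipher {
    pub(crate) fn generate(&self, counter: u32, block: &mut [u8; 64]) {
        // SSE2 double-rounds would go here
        let _ = (counter, block);
    }
}

#[cfg(target_feature = "sse2")]
pub(crate) use self::SseCipher as Cipher;
#[cfg(not(target_feature = "sse2"))]
pub(crate) use self::ScalarCipher as Cipher;
```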

Testing

I've opted to pregenerate random inputs that I use in the tests, which seemed like an okay approach. I was considering pulling in proptest to essentially randomly test the new variant against the scalar implementation, but I wasn't quite sure if you wanted to have it as a dependency.

Add whatever dev-dependencies you like. That sounds interesting.

That said, I originally did something similar in the polyval crate, only to abandon it after changing the conditional gating so that only one implementation ends up getting compiled into the crate. I tried to support multiple implementations in the runtime feature detection PR, but dropped that as well.

tl;dr: I've gone back and forth on a lot of this stuff, and don't have good answers or strong preferences either way.

@srijs (Contributor, Author) commented Oct 19, 2019

Thanks for the feedback! After trying a few different approaches, I've now settled on the following choices:

  • Moved the SSE2 implementation to be a submodule of the block module.
  • Added a comparison test to check the SIMD implementation against the scalar one (roughly along the lines of the sketch after this list).
  • Kept the feature detection static, because making the #![no_std] attribute optional via a feature flag is probably a bit more work.
  • Kept the feature branching inside cipher, because the module seemed simple enough and I'm not quite sure it really needs more of a facade.
  • Did not pull in cfg-if, because it is incompatible with Rust 1.27 (CI was failing).
  • Did not pull in proptest, since it needs std to work properly (or at least alloc on nightly).
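
The comparison test mentioned above boils down to roughly the following; the module paths and constructor signatures are placeholders rather than the exact ones in the diff:

```rust
#[cfg(all(test, target_feature = "sse2"))]
mod compare {
    // Placeholder paths: stand-ins for the crate's scalar and SSE2 block types.
    use super::block::{scalar, sse2};

    #[test]
    fn sse2_block_matches_scalar_block() {
        // The real test uses pregenerated random inputs baked into the source;
        // fixed patterns are used here only to keep the sketch short.
        let key = [0x42u8; 32];
        let nonce = [0x24u8; 12];

        for counter in 0..4u32 {
            let mut expected = [0u8; 64];
            let mut actual = [0u8; 64];
            scalar::Block::new(&key, &nonce).generate(counter, &mut expected);
            sse2::Block::new(&key, &nonce).generate(counter, &mut actual);
            assert_eq!(expected[..], actual[..]);
        }
    }
}
```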

I think I'm quite happy with the changes for now, and would defer to you for any change requests you may have :)

@tarcieri (Member)

Looks good to me and I'll go ahead and merge. Some notes:

Kept the feature detection static, because making the #![no_std] attribute optional via a feature flag is probably a bit more work.

Yeah, this was a big pain in the ass for my (abandoned) polyval runtime feature detection PR for PCLMULQDQ as well. I had the std feature off by default before that.

Did not pull in cfg-if, because it is incompatible with Rust 1.27 (CI was failing).

Oof, yeah at some point I think we can look into doing a 2018 edition upgrade for select crates. It was easier to keep them consistent at first, but it's certainly getting annoying to deal with such old Rust versions.

@tarcieri merged commit 890823a into RustCrypto:master on Oct 19, 2019
@tarcieri (Member)

@srijs do you want to do an AVX2 implementation before I do another release?

I'm going to throw criterion with the cycles-per-byte plugin on this to get some better benchmarks.
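
Roughly this kind of setup, using the criterion-cycles-per-byte measurement plugin; the actual cipher call is left as a placeholder rather than tied to the crate's exact trait API:

```rust
use criterion::{criterion_group, criterion_main, BenchmarkId, Criterion, Throughput};
use criterion_cycles_per_byte::CyclesPerByte;

fn bench(c: &mut Criterion<CyclesPerByte>) {
    let mut group = c.benchmark_group("chacha20");
    for &size in &[1_000usize, 10_000, 100_000] {
        group.throughput(Throughput::Bytes(size as u64));
        group.bench_with_input(BenchmarkId::from_parameter(size), &size, |b, &size| {
            let mut buf = vec![0u8; size];
            b.iter(|| {
                // placeholder: apply the ChaCha20 keystream to `buf` here
                criterion::black_box(&mut buf);
            });
        });
    }
    group.finish();
}

criterion_group!(
    name = benches;
    config = Criterion::default().with_measurement(CyclesPerByte);
    targets = bench
);
criterion_main!(benches);
```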

@tarcieri (Member)

Also, looking at the optimization described in RFC 8439 Section 3, it'd probably be good to do that before adding any additional backends, since it impacts how they're structured.

@srijs deleted the chacha20-sse2 branch on October 20, 2019 at 01:05
@srijs (Contributor, Author) commented Oct 20, 2019

I don't have a strong opinion on the release question. Presumably releasing the sse2 changes would just be a patch version bump, which comes "for free" for any downstream consumer?

Agreed that it probably makes sense to tackle the state-copying optimization first, so I'm happy to give that a go. I'm not entirely sure I fully understand the intent behind it though... is the difference really just between copying the full state vs. copying consts, key, iv and counter individually?

@tarcieri (Member)

Looking over it again, I think we already compute part of what it's talking about in advance, which is here:

https://github.com/RustCrypto/stream-ciphers/blob/master/chacha20/src/lib.rs#L104

For the full effect, it seems like we might compute the initial state array sans counter, then clone that.

@srijs (Contributor, Author) commented Oct 20, 2019

Right, so I've prototyped this idea, but according to criterion there is no statistically significant difference between the two approaches.

My guess is that the part of the optimization that's already in place (which you linked to) accounts for most of the speedup, to the point where implementing the remaining part only saves one or two loads and stores, which isn't noticeable compared to the time spent in the rest of the block function.

@srijs (Contributor, Author) commented Oct 20, 2019

FWIW I've also looked into salsa20-core a bit, to try and understand what it would take to allow processing two blocks simultaneously using AVX2, and I think it may take me a little bit longer to figure out.

So it may be useful to release the sse2 improvements in the meantime, and then ship the AVX2 implementation once it's ready.

@tarcieri (Member)

FWIW I've also looked into salsa20-core a bit, to try and understand what it would take to allow processing two blocks simultaneously using AVX2, and I think it may take me a little bit longer to figure out.

Yeah, honestly I'd probably just rewrite it before beginning to add an optimization like that. It feels like it's written at the wrong level of abstraction to me since it caches things in terms of the u32 output array, rather than a byte-oriented interface.

Something more like one of these implementations would make it a bit more straightforward:

https://github.com/RustCrypto/AEADs/blob/master/aes-gcm-siv/src/ctr32.rs
https://github.com/RustCrypto/stream-ciphers/blob/master/ctr/src/lib.rs

@tarcieri (Member)

So it may be useful to release the sse2 improvements in the meantime, and then ship the AVX2 implementation once it's ready.

Sure. I was thinking about just bumping it to v0.3.0 so the SSE-enabled version has a nice round number (the main downstream consumer is the chacha20poly1305 crate, where we can do a minor bump).
