chacha20: Process 4 blocks at a time in AVX2 backend #267

Merged: 7 commits merged into RustCrypto:master from str4d:chacha20-avx2-wide on Aug 29, 2021
Conversation

@str4d (Contributor) commented Aug 28, 2021

We switch to a 4-block buffer for the combined SSE2 / AVX2 backend, which allows the AVX2 backend to process them together, while the SSE2 backend continues to process one block at a time.

The AVX2 backend is refactored to enable interleaving the instructions per pair of blocks, for better ILP.

Closes #262.
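
As a rough sketch of the buffering strategy described above (the names, signatures, and dispatch function below are assumptions for exposition, not the crate's actual internals):

// Four 64-byte ChaCha blocks are generated per buffer refill. The AVX2
// path produces all four together (two blocks per set of 256-bit
// registers, with the rounds of the two sets interleaved for ILP), while
// the SSE2 path keeps producing one block at a time.
const BLOCK_SIZE: usize = 64;
const BUFFER_BLOCKS: usize = 4;
const BUFFER_SIZE: usize = BLOCK_SIZE * BUFFER_BLOCKS; // 256 bytes

fn refill(buffer: &mut [u8; BUFFER_SIZE], counter: u64, have_avx2: bool) {
    if have_avx2 {
        four_blocks_avx2(buffer, counter);
    } else {
        for (i, chunk) in buffer.chunks_exact_mut(BLOCK_SIZE).enumerate() {
            one_block_sse2(chunk, counter + i as u64);
        }
    }
}

// Stubs standing in for the real SIMD backends.
fn four_blocks_avx2(_out: &mut [u8; BUFFER_SIZE], _counter: u64) {}
fn one_block_sse2(_out: &mut [u8], _counter: u64) {}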

Comment on lines +57 to +62
unsafe fn shuffle_epi32<const MASK: i32>(&mut self) {
self.avx = [
_mm256_shuffle_epi32(self.avx[0], MASK),
_mm256_shuffle_epi32(self.avx[1], MASK),
];
}
str4d (Contributor, author) commented:

Bleh, it turns out _mm256_shuffle_epi32 is a pseudo-function generated such that the second argument is required to be const, but there's no way to pass a const as a function argument in our own code. This approach therefore requires const generics, which requires an MSRV of 1.51.

A Member replied:

Yeah, I ended up using const generics like that here:

https://github.com/RustCrypto/universal-hashes/blob/master/polyval/src/backend/pmull.rs#L99-L106

Personally I'd be fine with bumping MSRV to 1.51. It would also let us do a zeroize 1.4 bump.

Comment on lines +66 to +70
unsafe fn rol<const BY: i32, const REST: i32>(&mut self) {
self.avx = [
_mm256_xor_si256(
_mm256_slli_epi32(self.avx[0], BY),
_mm256_srli_epi32(self.avx[0], REST),
str4d (Contributor, author) commented:

Similarly here for _mm256_slli_epi32 and _mm256_srli_epi32.
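
For reference, here is a self-contained sketch of both helpers under the layout visible in the excerpts above; the struct definition, attributes, and exact method bodies are illustrative assumptions rather than the crate's verbatim code, and REST is expected to equal 32 - BY so the shift/xor pair forms a left rotation:

#[cfg(target_arch = "x86_64")]
use core::arch::x86_64::*;

// Mirrors the `avx: [__m256i; 2]` field visible in the excerpts above.
#[cfg(target_arch = "x86_64")]
#[derive(Clone, Copy)]
struct StateWord {
    avx: [__m256i; 2],
}

#[cfg(target_arch = "x86_64")]
impl StateWord {
    /// Shuffle the 32-bit words of both registers by the compile-time MASK.
    #[target_feature(enable = "avx2")]
    unsafe fn shuffle_epi32<const MASK: i32>(&mut self) {
        self.avx = [
            _mm256_shuffle_epi32(self.avx[0], MASK),
            _mm256_shuffle_epi32(self.avx[1], MASK),
        ];
    }

    /// Rotate every 32-bit word left by BY bits (callers pass REST = 32 - BY).
    #[target_feature(enable = "avx2")]
    unsafe fn rol<const BY: i32, const REST: i32>(&mut self) {
        self.avx = [
            _mm256_xor_si256(
                _mm256_slli_epi32(self.avx[0], BY),
                _mm256_srli_epi32(self.avx[0], REST),
            ),
            _mm256_xor_si256(
                _mm256_slli_epi32(self.avx[1], BY),
                _mm256_srli_epi32(self.avx[1], REST),
            ),
        ];
    }
}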

@str4d (Contributor, author) commented Aug 28, 2021

I originally tried interleaving the AVX2 operations for pairs of blocks with a macro, but the macro was getting rather complex, so I switched to the approach in this PR (adding methods to StateWord). Turns out that approach requires an MSRV bump (I didn't realise chacha20 wasn't on 1.51 yet). I can switch back to the macro-based approach if desired (for which I'd probably instead move to using one of the array macro crate dependencies, if there isn't something else I can use in the RustCrypto crate ecosystem for this).

@tarcieri (Member) commented:

I'd say it's fine to bump MSRV. 1.51 is now 6 months old, which is plenty of time IMO.

Commit message: For a 4-block buffer, we need to be able to represent the past-the-end buffer position of 256, which is too large for a `u8`.

@str4d (Contributor, author) commented Aug 29, 2021

This PR eliminates the performance difference between chacha20 and c2-chacha for a 2GB test when compiled with +avx2, and reduces the gap significantly for autodetect mode: str4d/rage#57 (comment)

The remaining issue is that for the rng feature, the doubled buffer size means the Results type is now `[u32; 64]`, which doesn't impl Default 😢

@tarcieri (Member) commented Aug 29, 2021

> This PR eliminates the performance difference...

Awesome! 🎉

> The remaining issue is that for the rng feature...

Wow, that's really annoying. I was curious what rand_chacha did here and it looks like they completely abandoned the AVX2 backend?

Relevant PR: rust-random/rand#931

cc @dhardy

Edit: never mind, it still uses ppv-lite86 and its AVX2 backend

@str4d (Contributor, author) commented Aug 29, 2021

> I was curious what rand_chacha did here

It uses a #[repr(transparent)] wrapper type: https://github.com/rust-random/rand/blob/ee1aacd257d0e0bdbf27342c07e04270465e09c5/rand_chacha/src/chacha.rs#L26-L44

Commit message: When the non-soft backend is being used, its 4-block buffer size results in a `BlockRngCore::Results` type of `[u32; 64]`, which doesn't implement `Default`. We replace it with a wrapper type on which we implement the necessary traits.
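
A minimal sketch of such a wrapper, assuming the `Results` associated type needs roughly `AsRef<[u32]> + AsMut<[u32]> + Default` (as `rand_core`'s `BlockRngCore` requires); the name `BlockRngResults` and these exact impls are illustrative, not necessarily the crate's code:

// Transparent newtype over the 4-block results buffer; `[u32; 64]` itself
// does not implement `Default`, so the wrapper provides it along with the
// slice conversions a block RNG needs.
#[derive(Clone)]
#[repr(transparent)]
pub struct BlockRngResults([u32; 64]);

impl Default for BlockRngResults {
    fn default() -> Self {
        BlockRngResults([0u32; 64])
    }
}

impl AsRef<[u32]> for BlockRngResults {
    fn as_ref(&self) -> &[u32] {
        &self.0
    }
}

impl AsMut<[u32]> for BlockRngResults {
    fn as_mut(&mut self) -> &mut [u32] {
        &mut self.0
    }
}
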
str4d marked this pull request as ready for review on August 29, 2021 01:33
@str4d (Contributor, author) commented Aug 29, 2021

Tests are now passing.

Ran benchmarks on my machine (i7-8700K overclocked to 4.8 GHz):

$ cargo +nightly --version
cargo 1.56.0-nightly (b51439fd8 2021-08-09)

Current master (0.7.3):

     Running unittests (target/release/deps/chacha12-b35bb60eb267f274)
test bench1_10     ... bench:           9 ns/iter (+/- 0) = 1111 MB/s
test bench2_100    ... bench:          50 ns/iter (+/- 2) = 2000 MB/s
test bench3_1000   ... bench:         397 ns/iter (+/- 9) = 2518 MB/s
test bench4_10000  ... bench:       3,889 ns/iter (+/- 168) = 2571 MB/s
test bench5_100000 ... bench:      38,739 ns/iter (+/- 1,431) = 2581 MB/s

     Running unittests (target/release/deps/chacha20-c3750fdb2e6a6143)
test bench1_10     ... bench:          12 ns/iter (+/- 0) = 833 MB/s
test bench2_100    ... bench:          72 ns/iter (+/- 3) = 1388 MB/s
test bench3_1000   ... bench:         614 ns/iter (+/- 30) = 1628 MB/s
test bench4_10000  ... bench:       5,959 ns/iter (+/- 244) = 1678 MB/s
test bench5_100000 ... bench:      59,545 ns/iter (+/- 1,724) = 1679 MB/s

     Running unittests (target/release/deps/chacha8-f8e1d7fb0cf442ec)
test bench1_10     ... bench:           8 ns/iter (+/- 0) = 1250 MB/s
test bench2_100    ... bench:          38 ns/iter (+/- 1) = 2631 MB/s
test bench3_1000   ... bench:         295 ns/iter (+/- 8) = 3389 MB/s
test bench4_10000  ... bench:       2,844 ns/iter (+/- 108) = 3516 MB/s
test bench5_100000 ... bench:      28,393 ns/iter (+/- 2,068) = 3521 MB/s

This PR:

     Running unittests (target/release/deps/chacha12-b35bb60eb267f274)
test bench1_10     ... bench:           8 ns/iter (+/- 0) = 1250 MB/s
test bench2_100    ... bench:          37 ns/iter (+/- 2) = 2702 MB/s
test bench3_1000   ... bench:         285 ns/iter (+/- 12) = 3508 MB/s
test bench4_10000  ... bench:       2,691 ns/iter (+/- 137) = 3716 MB/s
test bench5_100000 ... bench:      26,804 ns/iter (+/- 1,187) = 3730 MB/s

     Running unittests (target/release/deps/chacha20-c3750fdb2e6a6143)
test bench1_10     ... bench:           9 ns/iter (+/- 0) = 1111 MB/s
test bench2_100    ... bench:          52 ns/iter (+/- 3) = 1923 MB/s
test bench3_1000   ... bench:         432 ns/iter (+/- 23) = 2314 MB/s
test bench4_10000  ... bench:       4,126 ns/iter (+/- 133) = 2423 MB/s
test bench5_100000 ... bench:      41,191 ns/iter (+/- 1,258) = 2427 MB/s

     Running unittests (target/release/deps/chacha8-f8e1d7fb0cf442ec)
test bench1_10     ... bench:           8 ns/iter (+/- 0) = 1250 MB/s
test bench2_100    ... bench:          31 ns/iter (+/- 1) = 3225 MB/s
test bench3_1000   ... bench:         211 ns/iter (+/- 12) = 4739 MB/s
test bench4_10000  ... bench:       1,978 ns/iter (+/- 101) = 5055 MB/s
test bench5_100000 ... bench:      19,835 ns/iter (+/- 759) = 5041 MB/s

tarcieri merged commit 818c4ac into RustCrypto:master on Aug 29, 2021
str4d deleted the chacha20-avx2-wide branch on August 29, 2021 03:02
tarcieri mentioned this pull request on Aug 29, 2021
@dhardy (Contributor) commented Aug 30, 2021

Looks like you've answered your question, but you'd better ask @kazcw.

Linked issue: chacha20: Add wide (4-block) AVX2 impl