chacha20: Process 4 blocks at a time in AVX2 backend #267

Merged: 7 commits merged into RustCrypto:master from str4d:chacha20-avx2-wide on Aug 29, 2021
Conversation

@str4d (Contributor) commented Aug 28, 2021

We switch to a 4-block buffer for the combined SSE2 / AVX2 backend, which allows the AVX2 backend to process them together, while the SSE2 backend continues to process one block at a time.

The AVX2 backend is refactored to enable interleaving the instructions per pair of blocks, for better ILP.

Closes #262.
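
As a rough sketch of the buffering strategy described above (the names, signatures, and dispatch function below are assumptions for exposition, not the crate's actual internals):

// Four 64-byte ChaCha blocks are generated per buffer refill. The AVX2
// path produces all four together (two blocks per set of 256-bit
// registers, with the rounds of the two sets interleaved for ILP), while
// the SSE2 path keeps producing one block at a time.
const BLOCK_SIZE: usize = 64;
const BUFFER_BLOCKS: usize = 4;
const BUFFER_SIZE: usize = BLOCK_SIZE * BUFFER_BLOCKS; // 256 bytes

fn refill(buffer: &mut [u8; BUFFER_SIZE], counter: u64, have_avx2: bool) {
    if have_avx2 {
        four_blocks_avx2(buffer, counter);
    } else {
        for (i, chunk) in buffer.chunks_exact_mut(BLOCK_SIZE).enumerate() {
            one_block_sse2(chunk, counter + i as u64);
        }
    }
}

// Stubs standing in for the real SIMD backends.
fn four_blocks_avx2(_out: &mut [u8; BUFFER_SIZE], _counter: u64) {}
fn one_block_sse2(_out: &mut [u8], _counter: u64) {}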

Comment on lines +57 to +62
unsafe fn shuffle_epi32<const MASK: i32>(&mut self) {
self.avx = [
_mm256_shuffle_epi32(self.avx[0], MASK),
_mm256_shuffle_epi32(self.avx[1], MASK),
];
}
str4d (Contributor, author) commented:

Bleh, it turns out _mm256_shuffle_epi32 is a pseudo-function generated such that the second argument is required to be const, but there's no way to pass a const as a function argument in our own code. This approach therefore requires const generics, which requires an MSRV of 1.51.

A Member replied:

Yeah, I ended up using const generics like that here:

https://github.com/RustCrypto/universal-hashes/blob/master/polyval/src/backend/pmull.rs#L99-L106

Personally I'd be fine with bumping MSRV to 1.51. It would also let us do a zeroize 1.4 bump.

Comment on lines +66 to +70
unsafe fn rol<const BY: i32, const REST: i32>(&mut self) {
self.avx = [
_mm256_xor_si256(
_mm256_slli_epi32(self.avx[0], BY),
_mm256_srli_epi32(self.avx[0], REST),
str4d (Contributor, author) commented:

Similarly here for _mm256_slli_epi32 and _mm256_srli_epi32.
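
For reference, here is a self-contained sketch of both helpers under the layout visible in the excerpts above; the struct definition, attributes, and exact method bodies are illustrative assumptions rather than the crate's verbatim code, and REST is expected to equal 32 - BY so the shift/xor pair forms a left rotation:

#[cfg(target_arch = "x86_64")]
use core::arch::x86_64::*;

// Mirrors the `avx: [__m256i; 2]` field visible in the excerpts above.
#[cfg(target_arch = "x86_64")]
#[derive(Clone, Copy)]
struct StateWord {
    avx: [__m256i; 2],
}

#[cfg(target_arch = "x86_64")]
impl StateWord {
    /// Shuffle the 32-bit words of both registers by the compile-time MASK.
    #[target_feature(enable = "avx2")]
    unsafe fn shuffle_epi32<const MASK: i32>(&mut self) {
        self.avx = [
            _mm256_shuffle_epi32(self.avx[0], MASK),
            _mm256_shuffle_epi32(self.avx[1], MASK),
        ];
    }

    /// Rotate every 32-bit word left by BY bits (callers pass REST = 32 - BY).
    #[target_feature(enable = "avx2")]
    unsafe fn rol<const BY: i32, const REST: i32>(&mut self) {
        self.avx = [
            _mm256_xor_si256(
                _mm256_slli_epi32(self.avx[0], BY),
                _mm256_srli_epi32(self.avx[0], REST),
            ),
            _mm256_xor_si256(
                _mm256_slli_epi32(self.avx[1], BY),
                _mm256_srli_epi32(self.avx[1], REST),
            ),
        ];
    }
}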

@str4d (Contributor, author) commented Aug 28, 2021

I originally tried interleaving the AVX2 operations for pairs of blocks with a macro, but the macro was getting rather complex, so I switched to the approach in this PR (adding methods to StateWord). Turns out that approach requires an MSRV bump (I didn't realise chacha20 wasn't on 1.51 yet). I can switch back to the macro-based approach if desired (for which I'd probably instead move to using one of the array macro crate dependencies, if there isn't something else I can use in the RustCrypto crate ecosystem for this).

@tarcieri (Member) commented:

I'd say it's fine to bump MSRV. 1.51 is now 6 months old, which is plenty of time IMO.

Commit message: For a 4-block buffer, we need to be able to represent the past-the-end buffer position of 256, which is too large for a `u8`.

@str4d (Contributor, author) commented Aug 29, 2021

This PR eliminates the performance difference between chacha20 and c2-chacha for a 2GB test when compiled with +avx2, and reduces the gap significantly for autodetect mode: str4d/rage#57 (comment)

The remaining issue is that for the rng feature, the doubled buffer size means the Results type is now `[u32; 64]`, which doesn't impl Default 😢

@tarcieri (Member) commented Aug 29, 2021

> This PR eliminates the performance difference...

Awesome! 🎉

> The remaining issue is that for the rng feature...

Wow, that's really annoying. I was curious what rand_chacha did here and it looks like they completely abandoned the AVX2 backend?

Relevant PR: rust-random/rand#931

cc @dhardy

Edit: never mind, it still uses ppv-lite86 and its AVX2 backend

@str4d (Contributor, author) commented Aug 29, 2021

> I was curious what rand_chacha did here

It uses a #[repr(transparent)] wrapper type: https://github.com/rust-random/rand/blob/ee1aacd257d0e0bdbf27342c07e04270465e09c5/rand_chacha/src/chacha.rs#L26-L44

Commit message: When the non-soft backend is being used, its 4-block buffer size results in a `BlockRngCore::Results` type of `[u32; 64]`, which doesn't implement `Default`. We replace it with a wrapper type on which we implement the necessary traits.
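
A minimal sketch of such a wrapper, assuming the `Results` associated type needs roughly `AsRef<[u32]> + AsMut<[u32]> + Default` (as `rand_core`'s `BlockRngCore` requires); the name `BlockRngResults` and these exact impls are illustrative, not necessarily the crate's code:

// Transparent newtype over the 4-block results buffer; `[u32; 64]` itself
// does not implement `Default`, so the wrapper provides it along with the
// slice conversions a block RNG needs.
#[derive(Clone)]
#[repr(transparent)]
pub struct BlockRngResults([u32; 64]);

impl Default for BlockRngResults {
    fn default() -> Self {
        BlockRngResults([0u32; 64])
    }
}

impl AsRef<[u32]> for BlockRngResults {
    fn as_ref(&self) -> &[u32] {
        &self.0
    }
}

impl AsMut<[u32]> for BlockRngResults {
    fn as_mut(&mut self) -> &mut [u32] {
        &mut self.0
    }
}
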
str4d marked this pull request as ready for review on August 29, 2021 01:33
@str4d (Contributor, author) commented Aug 29, 2021

Tests are now passing.

Ran benchmarks on my machine (i7-8700K overclocked to 4.8 GHz):

$ cargo +nightly --version
cargo 1.56.0-nightly (b51439fd8 2021-08-09)

Current master (0.7.3):

     Running unittests (target/release/deps/chacha12-b35bb60eb267f274)
test bench1_10     ... bench:           9 ns/iter (+/- 0) = 1111 MB/s
test bench2_100    ... bench:          50 ns/iter (+/- 2) = 2000 MB/s
test bench3_1000   ... bench:         397 ns/iter (+/- 9) = 2518 MB/s
test bench4_10000  ... bench:       3,889 ns/iter (+/- 168) = 2571 MB/s
test bench5_100000 ... bench:      38,739 ns/iter (+/- 1,431) = 2581 MB/s

     Running unittests (target/release/deps/chacha20-c3750fdb2e6a6143)
test bench1_10     ... bench:          12 ns/iter (+/- 0) = 833 MB/s
test bench2_100    ... bench:          72 ns/iter (+/- 3) = 1388 MB/s
test bench3_1000   ... bench:         614 ns/iter (+/- 30) = 1628 MB/s
test bench4_10000  ... bench:       5,959 ns/iter (+/- 244) = 1678 MB/s
test bench5_100000 ... bench:      59,545 ns/iter (+/- 1,724) = 1679 MB/s

     Running unittests (target/release/deps/chacha8-f8e1d7fb0cf442ec)
test bench1_10     ... bench:           8 ns/iter (+/- 0) = 1250 MB/s
test bench2_100    ... bench:          38 ns/iter (+/- 1) = 2631 MB/s
test bench3_1000   ... bench:         295 ns/iter (+/- 8) = 3389 MB/s
test bench4_10000  ... bench:       2,844 ns/iter (+/- 108) = 3516 MB/s
test bench5_100000 ... bench:      28,393 ns/iter (+/- 2,068) = 3521 MB/s

This PR:

     Running unittests (target/release/deps/chacha12-b35bb60eb267f274)
test bench1_10     ... bench:           8 ns/iter (+/- 0) = 1250 MB/s
test bench2_100    ... bench:          37 ns/iter (+/- 2) = 2702 MB/s
test bench3_1000   ... bench:         285 ns/iter (+/- 12) = 3508 MB/s
test bench4_10000  ... bench:       2,691 ns/iter (+/- 137) = 3716 MB/s
test bench5_100000 ... bench:      26,804 ns/iter (+/- 1,187) = 3730 MB/s

     Running unittests (target/release/deps/chacha20-c3750fdb2e6a6143)
test bench1_10     ... bench:           9 ns/iter (+/- 0) = 1111 MB/s
test bench2_100    ... bench:          52 ns/iter (+/- 3) = 1923 MB/s
test bench3_1000   ... bench:         432 ns/iter (+/- 23) = 2314 MB/s
test bench4_10000  ... bench:       4,126 ns/iter (+/- 133) = 2423 MB/s
test bench5_100000 ... bench:      41,191 ns/iter (+/- 1,258) = 2427 MB/s

     Running unittests (target/release/deps/chacha8-f8e1d7fb0cf442ec)
test bench1_10     ... bench:           8 ns/iter (+/- 0) = 1250 MB/s
test bench2_100    ... bench:          31 ns/iter (+/- 1) = 3225 MB/s
test bench3_1000   ... bench:         211 ns/iter (+/- 12) = 4739 MB/s
test bench4_10000  ... bench:       1,978 ns/iter (+/- 101) = 5055 MB/s
test bench5_100000 ... bench:      19,835 ns/iter (+/- 759) = 5041 MB/s

tarcieri merged commit 818c4ac into RustCrypto:master on Aug 29, 2021
str4d deleted the chacha20-avx2-wide branch on August 29, 2021 03:02
tarcieri mentioned this pull request on Aug 29, 2021
@dhardy (Contributor) commented Aug 30, 2021

Looks like you've answered your question, but you'd better ask @kazcw.

Linked issue: chacha20: Add wide (4-block) AVX2 impl