chacha20: Add wide (4-block) AVX2 impl #262

str4d · 2021-08-09T12:38:16Z

I've been digging into c2-chacha to figure out why its performance is consistently better than chacha20. I think I understand the main difference now: c2-chacha has a "wide" mode where it processes four ChaCha blocks at a time. This requires two sets of registers per state word, since a single 256-bit register only has room for two blocks in parallel. I guess modern CPUs have enough registers to handle the increased number of temporaries, and the resulting interleaved instructions seem to parallelise well.
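To make the "wide" idea concrete, here is a minimal scalar sketch of it. This is not the c2-chacha or chacha20 AVX2 code: it uses plain `u32` arithmetic, and the names `State`, `init_state`, `quarter_round` and `blocks` are hypothetical, chosen only for illustration. It assumes the RFC 8439 state layout (32-bit block counter, 96-bit nonce). Each quarter-round step is applied to all `N` blocks before moving on to the next step, which is roughly the interleaving a 4-block AVX2 backend would express with two 256-bit registers per state row.

```rust
/// One ChaCha block state: sixteen 32-bit words.
type State = [u32; 16];

/// Build the initial state (assumed RFC 8439 layout: constants, key, counter, nonce).
fn init_state(key: &[u8; 32], counter: u32, nonce: &[u8; 12]) -> State {
    let mut s = [0u32; 16];
    s[..4].copy_from_slice(&[0x6170_7865, 0x3320_646e, 0x7962_2d32, 0x6b20_6574]);
    for i in 0..8 {
        s[4 + i] = u32::from_le_bytes([key[4 * i], key[4 * i + 1], key[4 * i + 2], key[4 * i + 3]]);
    }
    s[12] = counter;
    for i in 0..3 {
        s[13 + i] =
            u32::from_le_bytes([nonce[4 * i], nonce[4 * i + 1], nonce[4 * i + 2], nonce[4 * i + 3]]);
    }
    s
}

/// Quarter round on the same word positions of N independent blocks.
/// Every step loops over all blocks, so the N dependency chains are interleaved.
fn quarter_round<const N: usize>(x: &mut [State; N], a: usize, b: usize, c: usize, d: usize) {
    for s in x.iter_mut() {
        s[a] = s[a].wrapping_add(s[b]);
        s[d] = (s[d] ^ s[a]).rotate_left(16);
    }
    for s in x.iter_mut() {
        s[c] = s[c].wrapping_add(s[d]);
        s[b] = (s[b] ^ s[c]).rotate_left(12);
    }
    for s in x.iter_mut() {
        s[a] = s[a].wrapping_add(s[b]);
        s[d] = (s[d] ^ s[a]).rotate_left(8);
    }
    for s in x.iter_mut() {
        s[c] = s[c].wrapping_add(s[d]);
        s[b] = (s[b] ^ s[c]).rotate_left(7);
    }
}

/// Compute N consecutive ChaCha20 blocks, starting at block counter `counter`.
fn blocks<const N: usize>(key: &[u8; 32], counter: u32, nonce: &[u8; 12]) -> [State; N] {
    let mut init = [[0u32; 16]; N];
    for (i, s) in init.iter_mut().enumerate() {
        *s = init_state(key, counter.wrapping_add(i as u32), nonce);
    }
    let mut x = init; // working copy; `init` is kept for the final addition
    for _ in 0..10 {
        // Column round.
        quarter_round(&mut x, 0, 4, 8, 12);
        quarter_round(&mut x, 1, 5, 9, 13);
        quarter_round(&mut x, 2, 6, 10, 14);
        quarter_round(&mut x, 3, 7, 11, 15);
        // Diagonal round.
        quarter_round(&mut x, 0, 5, 10, 15);
        quarter_round(&mut x, 1, 6, 11, 12);
        quarter_round(&mut x, 2, 7, 8, 13);
        quarter_round(&mut x, 3, 4, 9, 14);
    }
    // Feed-forward: add the initial state to the permuted state, block by block.
    for (out, start) in x.iter_mut().zip(init.iter()) {
        for (w, w0) in out.iter_mut().zip(start.iter()) {
            *w = w.wrapping_add(*w0);
        }
    }
    x
}
```

Calling `blocks::<4>(&key, counter, &nonce)` yields the same four states as four separate `blocks::<1>` calls with counters `counter` through `counter + 3`. The only difference is the order of the operations, which is what gives the CPU independent work to overlap.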


This was referenced Aug 9, 2021

Rage is 38% slower at encrypting than Go implementation str4d/rage#57

Open

Replacing rand_chacha with chacha20 rust-random/rand#934

Open

str4d mentioned this issue Aug 28, 2021

chacha20: Process 4 blocks at a time in AVX2 backend #267

Merged

tarcieri closed this as completed in #267 Aug 29, 2021
