chacha20: Add wide (4-block) AVX2 impl #262

Closed
str4d opened this issue Aug 9, 2021 · 0 comments · Fixed by #267


str4d commented Aug 9, 2021

I've been digging into c2-chacha to figure out why its performance is consistently better than chacha20. I think I understand the main difference now: c2-chacha has a "wide" mode where it processes four ChaCha blocks at a time. This requires two sets of registers per state word (since a single 256-bit register only has room for two blocks in parallel), but I guess on modern CPUs there are enough registers that it can handle the increased number of temporaries, and the resulting interleaved instructions seem to parallelise well.
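To make the register layout concrete, here is a rough, portable sketch of the idea. It is not the code from c2-chacha or chacha20. It assumes each 256-bit register holds one 128-bit state row for two blocks, and it models such a register as a plain `[u32; 8]` array instead of using AVX2 intrinsics. The names `V`, `shuffle`, `quarter_rounds` and `chacha20_wide` are hypothetical and exist only for this illustration.

```rust
/// Portable stand-in for one 256-bit register: the same state row (four
/// 32-bit words) of two consecutive blocks, one block per 128-bit half.
type V = [u32; 8];

fn add(a: V, b: V) -> V {
    std::array::from_fn(|i| a[i].wrapping_add(b[i]))
}

fn xor_rotl(a: V, b: V, n: u32) -> V {
    std::array::from_fn(|i| (a[i] ^ b[i]).rotate_left(n))
}

/// Rotate the four words inside each 128-bit half left by `n` positions.
fn shuffle(a: V, n: usize) -> V {
    std::array::from_fn(|i| a[(i & 4) | ((i + n) & 3)])
}

/// Quarter-rounds on all four columns of both register sets. Every step is
/// issued for set 0 and then set 1, so there are two independent dependency
/// chains that a CPU could overlap.
fn quarter_rounds(s: &mut [[V; 2]; 4]) {
    for (rot_d, rot_b) in [(16, 12), (8, 7)] {
        for k in 0..2 { s[0][k] = add(s[0][k], s[1][k]); }
        for k in 0..2 { s[3][k] = xor_rotl(s[3][k], s[0][k], rot_d); }
        for k in 0..2 { s[2][k] = add(s[2][k], s[3][k]); }
        for k in 0..2 { s[1][k] = xor_rotl(s[1][k], s[2][k], rot_b); }
    }
}

/// Hypothetical helper: four ChaCha20 blocks for counters `ctr..ctr + 4`.
/// `s[row][k]` holds `row` of blocks `2k` and `2k + 1`.
pub fn chacha20_wide(key: &[u32; 8], ctr: u32, nonce: &[u32; 3]) -> [[u32; 16]; 4] {
    // Copy one 4-word row into both halves of a register.
    let dup = |w: [u32; 4]| -> V { std::array::from_fn(|i| w[i & 3]) };
    // Row 3 differs per block because it carries the block counter.
    let row3 = |k: u32| -> V {
        let mut v = dup([0, nonce[0], nonce[1], nonce[2]]);
        v[0] = ctr.wrapping_add(2 * k);
        v[4] = ctr.wrapping_add(2 * k + 1);
        v
    };
    let init: [[V; 2]; 4] = [
        [dup([0x6170_7865, 0x3320_646e, 0x7962_2d32, 0x6b20_6574]); 2],
        [dup([key[0], key[1], key[2], key[3]]); 2],
        [dup([key[4], key[5], key[6], key[7]]); 2],
        [row3(0), row3(1)],
    ];
    let mut s = init;
    for _ in 0..10 {
        quarter_rounds(&mut s); // column round
        for k in 0..2 {
            // Line the diagonals up as columns.
            s[1][k] = shuffle(s[1][k], 1);
            s[2][k] = shuffle(s[2][k], 2);
            s[3][k] = shuffle(s[3][k], 3);
        }
        quarter_rounds(&mut s); // diagonal round
        for k in 0..2 {
            // Undo the shuffle.
            s[1][k] = shuffle(s[1][k], 3);
            s[2][k] = shuffle(s[2][k], 2);
            s[3][k] = shuffle(s[3][k], 1);
        }
    }
    // Add the initial state and split the halves back into separate blocks.
    let mut out = [[0u32; 16]; 4];
    for row in 0..4 {
        for k in 0..2 {
            let v = add(s[row][k], init[row][k]);
            for lane in 0..8 {
                out[2 * k + lane / 4][4 * row + lane % 4] = v[lane];
            }
        }
    }
    out
}
```

In a real AVX2 backend each `V` would presumably be a `__m256i`, and the `for k in 0..2` loops would be unrolled into adjacent instructions. That is where the doubled register count and the interleaving come from.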
