sha2: Add aarch64 backends for SHA2. #490

Merged
merged 10 commits into from Jun 15, 2023


codahale
Contributor

Adds NEON-enabled backends for SHA2 on aarch64.

Eliminates the need for the asm feature on aarch64 to get fast SHA-{224, 256}, and speeds up SHA-512 by roughly 2.6x in the benchmarks below. SHA-512 didn’t benefit from the asm feature.

Before:

test sha256_10    ... bench:          27 ns/iter (+/- 0) = 370 MB/s
test sha256_100   ... bench:         278 ns/iter (+/- 3) = 359 MB/s
test sha256_1000  ... bench:       2,747 ns/iter (+/- 24) = 364 MB/s
test sha256_10000 ... bench:      27,392 ns/iter (+/- 293) = 365 MB/s
test sha512_10    ... bench:          17 ns/iter (+/- 0) = 588 MB/s
test sha512_100   ... bench:         164 ns/iter (+/- 7) = 609 MB/s
test sha512_1000  ... bench:       1,650 ns/iter (+/- 28) = 606 MB/s
test sha512_10000 ... bench:      16,533 ns/iter (+/- 1,540) = 604 MB/s

After:

test sha256_10    ... bench:           4 ns/iter (+/- 0) = 2500 MB/s
test sha256_100   ... bench:          46 ns/iter (+/- 0) = 2173 MB/s
test sha256_1000  ... bench:         424 ns/iter (+/- 6) = 2358 MB/s
test sha256_10000 ... bench:       4,190 ns/iter (+/- 31) = 2386 MB/s
test sha512_10    ... bench:           6 ns/iter (+/- 0) = 1666 MB/s
test sha512_100   ... bench:          65 ns/iter (+/- 0) = 1538 MB/s
test sha512_1000  ... bench:         636 ns/iter (+/- 5) = 1572 MB/s
test sha512_10000 ... bench:       6,311 ns/iter (+/- 68) = 1584 MB/s

(Benchmarks run on my M2 Air laptop, unplugged, on my kitchen table.)
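
For readers comparing the two tables, here is a minimal sketch of the arithmetic behind them. It assumes the MB/s column is bytes per iteration times 1000 divided by ns/iter, truncated to an integer, which is consistent with every row above. The function names `throughput_mb_s` and `speedup` are hypothetical helpers, not part of the crate or the bench harness.

```rust
/// Throughput in MB/s, assuming it is derived as
/// bytes per iteration * 1000 / nanoseconds per iteration, truncated.
/// (Hypothetical helper, written to match the figures in the tables.)
fn throughput_mb_s(bytes_per_iter: u64, ns_per_iter: u64) -> u64 {
    bytes_per_iter * 1000 / ns_per_iter
}

/// Speedup factor of the "after" timing relative to the "before" timing,
/// both given in ns/iter. (Hypothetical helper.)
fn speedup(before_ns: u64, after_ns: u64) -> f64 {
    before_ns as f64 / after_ns as f64
}
```

With the sha256_10000 rows, `throughput_mb_s(10_000, 27_392)` gives 365 and `throughput_mb_s(10_000, 4_190)` gives 2386, matching the tables. `speedup(27_392, 4_190)` is about 6.5 for SHA-256 and `speedup(16_533, 6_311)` is about 2.6 for SHA-512.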

@tarcieri
Member

Neat! I hadn't thought about using inline ASM as a sort of "polyfill" for using unstable intrinsics on stable Rust before.

This is something we should consider doing wherever else we use unstable aarch64 intrinsics, such as in the aes and polyval crates. Then, when the intrinsics are stabilized, we can delete the inline ASM and switch to the intrinsics.
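
The "polyfill" idea can be sketched as a function with an intrinsic-like signature whose body is a single instruction emitted through inline asm (stable since Rust 1.59). This is an illustration of the pattern only, not the code from this PR. `add_u32x4` is a hypothetical name, a plain NEON vector add stands in for the SHA-2 instructions so that no extra target feature is needed, and the `options(...)` list is an assumption about what suits a pure register-to-register instruction.

```rust
// Sketch of the polyfill pattern. `add_u32x4` is a hypothetical name and the
// NEON `add` instruction is a stand-in for an instruction whose intrinsic is
// not yet stable.

#[cfg(target_arch = "aarch64")]
fn add_u32x4(a: [u32; 4], b: [u32; 4]) -> [u32; 4] {
    use core::arch::aarch64::{uint32x4_t, vld1q_u32, vst1q_u32};
    use core::arch::asm;

    let mut out = [0u32; 4];
    // SAFETY: NEON is part of the baseline for Rust's standard aarch64
    // targets, and each pointer refers to a live array of four `u32`s.
    unsafe {
        let mut va: uint32x4_t = vld1q_u32(a.as_ptr());
        let vb: uint32x4_t = vld1q_u32(b.as_ptr());
        // One instruction, operating on vector registers only. The options
        // tell the compiler it has no side effects and touches no memory.
        asm!(
            "add {0:v}.4s, {0:v}.4s, {1:v}.4s",
            inout(vreg) va,
            in(vreg) vb,
            options(pure, nomem, nostack, preserves_flags),
        );
        vst1q_u32(out.as_mut_ptr(), va);
    }
    out
}

#[cfg(not(target_arch = "aarch64"))]
fn add_u32x4(a: [u32; 4], b: [u32; 4]) -> [u32; 4] {
    // Portable fallback so the sketch also builds on other targets.
    core::array::from_fn(|i| a[i].wrapping_add(b[i]))
}
```

Because callers only see the function signature, the asm body could later be swapped for the real intrinsic without touching any call site.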

@newpavlov
Member

Update the cross version to fix the CI failures.

@codahale
Contributor Author

> Update the cross version to fix the CI failures.

What should I bump it to?

@newpavlov
Member

> What should I bump it to?

1.59

@newpavlov
Member

Can you compare performance after addition of the options? The resulting assembly is somewhat different. Number of instructions is the same, so hopefully it's only reordering which may even improve performance.

@codahale
Contributor Author

No real difference.

Before:

test sha256_10    ... bench:           4 ns/iter (+/- 0) = 2500 MB/s
test sha256_100   ... bench:          46 ns/iter (+/- 2) = 2173 MB/s
test sha256_1000  ... bench:         421 ns/iter (+/- 5) = 2375 MB/s
test sha256_10000 ... bench:       4,155 ns/iter (+/- 44) = 2406 MB/s
test sha512_10    ... bench:           6 ns/iter (+/- 0) = 1666 MB/s
test sha512_100   ... bench:          65 ns/iter (+/- 0) = 1538 MB/s
test sha512_1000  ... bench:         636 ns/iter (+/- 15) = 1572 MB/s
test sha512_10000 ... bench:       6,317 ns/iter (+/- 54) = 1583 MB/s

After:

test sha256_10    ... bench:           4 ns/iter (+/- 0) = 2500 MB/s
test sha256_100   ... bench:          46 ns/iter (+/- 4) = 2173 MB/s
test sha256_1000  ... bench:         423 ns/iter (+/- 10) = 2364 MB/s
test sha256_10000 ... bench:       4,179 ns/iter (+/- 63) = 2392 MB/s
test sha512_10    ... bench:           6 ns/iter (+/- 0) = 1666 MB/s
test sha512_100   ... bench:          65 ns/iter (+/- 0) = 1538 MB/s
test sha512_1000  ... bench:         636 ns/iter (+/- 12) = 1572 MB/s
test sha512_10000 ... bench:       6,324 ns/iter (+/- 400) = 1581 MB/s

@newpavlov
Member

Thank you!

@newpavlov newpavlov merged commit 3fa561e into RustCrypto:master Jun 15, 2023
22 checks passed
@codahale codahale deleted the sha2-neon branch June 15, 2023 16:14
@newpavlov newpavlov mentioned this pull request Jun 15, 2023
@codahale
Contributor Author

Now that this is merged, what’s the remaining work to drop the sha2_asm dependency?

@newpavlov
Member

We would need to migrate the x86-64 implementation from it to inline asm. IIRC it's still a bit faster than our software fallback.

@codahale
Contributor Author

I ask b/c I was running some related benchmarks on a GCE n2-standard-4 with Ice Lake and noticed there wasn’t much of a difference:

With asm:

$ RUSTFLAGS="-C target-cpu=native" cargo +nightly bench --features=asm

test sha256_10    ... bench:           9 ns/iter (+/- 0) = 1111 MB/s
test sha256_100   ... bench:          81 ns/iter (+/- 0) = 1234 MB/s
test sha256_1000  ... bench:         719 ns/iter (+/- 5) = 1390 MB/s
test sha256_10000 ... bench:       7,126 ns/iter (+/- 64) = 1403 MB/s
test sha512_10    ... bench:          21 ns/iter (+/- 0) = 476 MB/s
test sha512_100   ... bench:         201 ns/iter (+/- 1) = 497 MB/s
test sha512_1000  ... bench:       1,836 ns/iter (+/- 8) = 544 MB/s
test sha512_10000 ... bench:      17,769 ns/iter (+/- 61) = 562 MB/s

Without asm:

$ RUSTFLAGS="-C target-cpu=native" cargo +nightly bench
test sha256_10    ... bench:           9 ns/iter (+/- 0) = 1111 MB/s
test sha256_100   ... bench:          81 ns/iter (+/- 1) = 1234 MB/s
test sha256_1000  ... bench:         718 ns/iter (+/- 4) = 1392 MB/s
test sha256_10000 ... bench:       7,116 ns/iter (+/- 20) = 1405 MB/s
test sha512_10    ... bench:          21 ns/iter (+/- 0) = 476 MB/s
test sha512_100   ... bench:         201 ns/iter (+/- 1) = 497 MB/s
test sha512_1000  ... bench:       1,836 ns/iter (+/- 8) = 544 MB/s
test sha512_10000 ... bench:      17,819 ns/iter (+/- 100) = 561 MB/s

Definitely within the margin of error.

Maybe on a different CPU?

@newpavlov
Member

newpavlov commented Jun 15, 2023

You are getting results for the SHA-NI and AVX2 backends (the asm backend is treated as a replacement for the software backend, thus it has lower priority). On my laptop after I disabled them I get:

// without asm:
test sha256_10    ... bench:          33 ns/iter (+/- 0) = 303 MB/s
test sha256_100   ... bench:         321 ns/iter (+/- 2) = 311 MB/s
test sha256_1000  ... bench:       3,131 ns/iter (+/- 6) = 319 MB/s
test sha256_10000 ... bench:      31,227 ns/iter (+/- 69) = 320 MB/s
test sha512_10    ... bench:          21 ns/iter (+/- 0) = 476 MB/s
test sha512_100   ... bench:         207 ns/iter (+/- 0) = 483 MB/s
test sha512_1000  ... bench:       2,017 ns/iter (+/- 5) = 495 MB/s
test sha512_10000 ... bench:      20,077 ns/iter (+/- 64) = 498 MB/s

// with asm:
test sha256_10    ... bench:          28 ns/iter (+/- 3) = 357 MB/s
test sha256_100   ... bench:         274 ns/iter (+/- 7) = 364 MB/s
test sha256_1000  ... bench:       2,671 ns/iter (+/- 23) = 374 MB/s
test sha256_10000 ... bench:      26,693 ns/iter (+/- 348) = 374 MB/s
test sha512_10    ... bench:          20 ns/iter (+/- 0) = 500 MB/s
test sha512_100   ... bench:         184 ns/iter (+/- 2) = 543 MB/s
test sha512_1000  ... bench:       1,809 ns/iter (+/- 14) = 552 MB/s
test sha512_10000 ... bench:      18,032 ns/iter (+/- 177) = 554 MB/s
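
The priority order described above explains why `--features=asm` made no difference on the Ice Lake machine. A minimal sketch, assuming a simple three-way choice: the `Backend` enum and `select_backend` function are hypothetical and do not reflect the crate's actual dispatch code.

```rust
/// Which compression backend ends up being used (illustrative only).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Backend {
    /// CPU-specific backend chosen by feature detection (e.g. SHA-NI, AVX2).
    Hardware,
    /// Assembly implementation enabled by the `asm` feature.
    Asm,
    /// Portable software fallback.
    Soft,
}

/// `cpu_backend_available`: the CPU feature the hardware backend needs was detected.
/// `asm_feature`: the crate was built with `--features=asm`.
fn select_backend(cpu_backend_available: bool, asm_feature: bool) -> Backend {
    if cpu_backend_available {
        // The hardware backend wins regardless of the `asm` feature.
        Backend::Hardware
    } else if asm_feature {
        // The asm backend only replaces the software fallback.
        Backend::Asm
    } else {
        Backend::Soft
    }
}
```

Under this model, `select_backend(true, true)` and `select_backend(true, false)` both return `Backend::Hardware`, so the asm feature only shows up in benchmarks once the hardware backends are disabled or unavailable.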

@codahale
Contributor Author

Ah, gotcha. Thanks for clearing that up.
