sha2: Add aarch64 backends for SHA2. #490

codahale · 2023-06-15T03:57:57Z

Adds NEON-enabled backends for SHA2 on aarch64.

Eliminates the need for the asm feature on aarch64 for SHA-{224, 256} performance and provides a big performance boost for SHA-512, which didn’t benefit from the asm feature.

Before:

test sha256_10    ... bench:          27 ns/iter (+/- 0) = 370 MB/s
test sha256_100   ... bench:         278 ns/iter (+/- 3) = 359 MB/s
test sha256_1000  ... bench:       2,747 ns/iter (+/- 24) = 364 MB/s
test sha256_10000 ... bench:      27,392 ns/iter (+/- 293) = 365 MB/s
test sha512_10    ... bench:          17 ns/iter (+/- 0) = 588 MB/s
test sha512_100   ... bench:         164 ns/iter (+/- 7) = 609 MB/s
test sha512_1000  ... bench:       1,650 ns/iter (+/- 28) = 606 MB/s
test sha512_10000 ... bench:      16,533 ns/iter (+/- 1,540) = 604 MB/s

After:

test sha256_10    ... bench:           4 ns/iter (+/- 0) = 2500 MB/s
test sha256_100   ... bench:          46 ns/iter (+/- 0) = 2173 MB/s
test sha256_1000  ... bench:         424 ns/iter (+/- 6) = 2358 MB/s
test sha256_10000 ... bench:       4,190 ns/iter (+/- 31) = 2386 MB/s
test sha512_10    ... bench:           6 ns/iter (+/- 0) = 1666 MB/s
test sha512_100   ... bench:          65 ns/iter (+/- 0) = 1538 MB/s
test sha512_1000  ... bench:         636 ns/iter (+/- 5) = 1572 MB/s
test sha512_10000 ... bench:       6,311 ns/iter (+/- 68) = 1584 MB/s

(Benchmarks run on my M2 Air laptop, unplugged, on my kitchen table.)
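For anyone reading the MB/s column: the figures above are consistent with bytes × 1000 / (ns per iteration), truncated to an integer. Below is a minimal sketch of that arithmetic. The helper names `throughput_mb_s` and `speedup` are hypothetical and are not part of the bench harness.

```rust
/// Throughput in MB/s from bytes hashed per iteration and ns per iteration.
/// Integer division truncates, which matches the figures in the tables above.
fn throughput_mb_s(bytes: u64, ns_per_iter: u64) -> u64 {
    bytes * 1000 / ns_per_iter
}

/// Ratio of the old time to the new time (larger means a bigger win).
fn speedup(before_ns: u64, after_ns: u64) -> f64 {
    before_ns as f64 / after_ns as f64
}
```

With the 10,000-byte rows, `speedup(27_392, 4_190)` comes out at roughly 6.5 for SHA-256 and `speedup(16_533, 6_311)` at roughly 2.6 for SHA-512.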

tarcieri · 2023-06-15T13:10:15Z

Neat! I hadn't thought about using inline ASM as a sort of "polyfill" for using unstable intrinsics on stable Rust before.

This is something we should consider doing elsewhere we use unstable aarch64 intrinsics, such as in the aes and polyval crates. Then, when the intrinsics are stabilized, we can delete the inline ASM and switch to the intrinsics.
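A minimal sketch of the polyfill idea, using the SHA-512 schedule-update instruction SHA512SU0 as the example. The wrapper borrows the name and shape of the `vsha512su0q_u64` intrinsic. The `polyfill` module, `sha512su0`, and `sha512su0_soft` are hypothetical names for illustration, and the software version reflects one reading of the instruction's documented behaviour rather than code from this PR.

```rust
/// SHA-512 small sigma0 function.
fn sigma0(x: u64) -> u64 {
    x.rotate_right(1) ^ x.rotate_right(8) ^ (x >> 7)
}

/// Software version of what SHA512SU0 is assumed to compute:
/// each lane of `w0_1` plus sigma0 of the following schedule word.
pub fn sha512su0_soft(w0_1: [u64; 2], w2_3: [u64; 2]) -> [u64; 2] {
    [
        w0_1[0].wrapping_add(sigma0(w0_1[1])),
        w0_1[1].wrapping_add(sigma0(w2_3[0])),
    ]
}

#[cfg(target_arch = "aarch64")]
mod polyfill {
    use core::arch::aarch64::{uint64x2_t, vld1q_u64, vst1q_u64};
    use core::arch::asm;

    /// Stand-in with the same shape as the `vsha512su0q_u64` intrinsic,
    /// implemented with inline assembly instead of the intrinsic itself.
    #[target_feature(enable = "sha3")]
    pub unsafe fn vsha512su0q_u64(mut w0_1: uint64x2_t, w2_3: uint64x2_t) -> uint64x2_t {
        unsafe {
            asm!(
                "sha512su0 {0:v}.2d, {1:v}.2d",
                inout(vreg) w0_1,
                in(vreg) w2_3,
                // No memory access, no stack use, no flag changes, no side effects.
                options(pure, nomem, nostack, preserves_flags)
            );
        }
        w0_1
    }

    /// Array-based convenience wrapper around the polyfilled intrinsic.
    #[target_feature(enable = "sha3")]
    pub unsafe fn sha512su0(w0_1: [u64; 2], w2_3: [u64; 2]) -> [u64; 2] {
        unsafe {
            let r = vsha512su0q_u64(vld1q_u64(w0_1.as_ptr()), vld1q_u64(w2_3.as_ptr()));
            let mut out = [0u64; 2];
            vst1q_u64(out.as_mut_ptr(), r);
            out
        }
    }
}

/// Uses the inline-asm path when the CPU reports the `sha3` feature,
/// and the software version everywhere else.
pub fn sha512su0(w0_1: [u64; 2], w2_3: [u64; 2]) -> [u64; 2] {
    #[cfg(target_arch = "aarch64")]
    {
        if std::arch::is_aarch64_feature_detected!("sha3") {
            // SAFETY: the required target feature was just detected at runtime.
            return unsafe { polyfill::sha512su0(w0_1, w2_3) };
        }
    }
    sha512su0_soft(w0_1, w2_3)
}
```

Because the wrapper keeps the intrinsic's name and signature, the `asm!` body could later be swapped for a call to the real intrinsic without touching callers.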

sha2/src/sha256/aarch64.rs

newpavlov · 2023-06-15T13:16:52Z

Update the cross version to fix the CI failures.

codahale · 2023-06-15T14:34:03Z

> Update the cross version to fix the CI failures.

What should I bump it to?

newpavlov · 2023-06-15T14:35:42Z

> What should I bump it to?

1.59

sha2/src/sha256/aarch64.rs

newpavlov · 2023-06-15T15:43:09Z

Can you compare performance after adding the options? The resulting assembly is somewhat different. The number of instructions is the same, so hopefully it's only reordering, which may even improve performance.

codahale · 2023-06-15T16:08:39Z

No real difference.

Before:

test sha256_10    ... bench:           4 ns/iter (+/- 0) = 2500 MB/s
test sha256_100   ... bench:          46 ns/iter (+/- 2) = 2173 MB/s
test sha256_1000  ... bench:         421 ns/iter (+/- 5) = 2375 MB/s
test sha256_10000 ... bench:       4,155 ns/iter (+/- 44) = 2406 MB/s
test sha512_10    ... bench:           6 ns/iter (+/- 0) = 1666 MB/s
test sha512_100   ... bench:          65 ns/iter (+/- 0) = 1538 MB/s
test sha512_1000  ... bench:         636 ns/iter (+/- 15) = 1572 MB/s
test sha512_10000 ... bench:       6,317 ns/iter (+/- 54) = 1583 MB/s

After:

test sha256_10    ... bench:           4 ns/iter (+/- 0) = 2500 MB/s
test sha256_100   ... bench:          46 ns/iter (+/- 4) = 2173 MB/s
test sha256_1000  ... bench:         423 ns/iter (+/- 10) = 2364 MB/s
test sha256_10000 ... bench:       4,179 ns/iter (+/- 63) = 2392 MB/s
test sha512_10    ... bench:           6 ns/iter (+/- 0) = 1666 MB/s
test sha512_100   ... bench:          65 ns/iter (+/- 0) = 1538 MB/s
test sha512_1000  ... bench:         636 ns/iter (+/- 12) = 1572 MB/s
test sha512_10000 ... bench:       6,324 ns/iter (+/- 400) = 1581 MB/s

newpavlov · 2023-06-15T16:12:59Z

Thank you!

codahale · 2023-06-15T17:15:09Z

Now that this is merged, what’s the remaining work to drop the sha2_asm dependency?

newpavlov · 2023-06-15T17:28:20Z

We would need to migrate the x86-64 implementation from it to inline asm. IIRC it's still a bit faster than our software fallback.

codahale · 2023-06-15T17:36:04Z

I ask b/c I was running some related benchmarks on a GCE n2-standard-4 with Ice Lake and noticed there wasn’t much of a difference:

With asm:

$ RUSTFLAGS="-C target-cpu=native" cargo +nightly bench --features=asm

test sha256_10    ... bench:           9 ns/iter (+/- 0) = 1111 MB/s
test sha256_100   ... bench:          81 ns/iter (+/- 0) = 1234 MB/s
test sha256_1000  ... bench:         719 ns/iter (+/- 5) = 1390 MB/s
test sha256_10000 ... bench:       7,126 ns/iter (+/- 64) = 1403 MB/s
test sha512_10    ... bench:          21 ns/iter (+/- 0) = 476 MB/s
test sha512_100   ... bench:         201 ns/iter (+/- 1) = 497 MB/s
test sha512_1000  ... bench:       1,836 ns/iter (+/- 8) = 544 MB/s
test sha512_10000 ... bench:      17,769 ns/iter (+/- 61) = 562 MB/s

Without asm:

$ RUSTFLAGS="-C target-cpu=native" cargo +nightly bench
test sha256_10    ... bench:           9 ns/iter (+/- 0) = 1111 MB/s
test sha256_100   ... bench:          81 ns/iter (+/- 1) = 1234 MB/s
test sha256_1000  ... bench:         718 ns/iter (+/- 4) = 1392 MB/s
test sha256_10000 ... bench:       7,116 ns/iter (+/- 20) = 1405 MB/s
test sha512_10    ... bench:          21 ns/iter (+/- 0) = 476 MB/s
test sha512_100   ... bench:         201 ns/iter (+/- 1) = 497 MB/s
test sha512_1000  ... bench:       1,836 ns/iter (+/- 8) = 544 MB/s
test sha512_10000 ... bench:      17,819 ns/iter (+/- 100) = 561 MB/s

Definitely within the margin of error.

Maybe on a different CPU?

newpavlov · 2023-06-15T17:48:54Z

You are getting results for the SHA-NI and AVX2 backends (the asm backend is treated as a replacement for the software backend, thus it has lower priority). On my laptop after I disabled them I get:

// without asm:
test sha256_10    ... bench:          33 ns/iter (+/- 0) = 303 MB/s
test sha256_100   ... bench:         321 ns/iter (+/- 2) = 311 MB/s
test sha256_1000  ... bench:       3,131 ns/iter (+/- 6) = 319 MB/s
test sha256_10000 ... bench:      31,227 ns/iter (+/- 69) = 320 MB/s
test sha512_10    ... bench:          21 ns/iter (+/- 0) = 476 MB/s
test sha512_100   ... bench:         207 ns/iter (+/- 0) = 483 MB/s
test sha512_1000  ... bench:       2,017 ns/iter (+/- 5) = 495 MB/s
test sha512_10000 ... bench:      20,077 ns/iter (+/- 64) = 498 MB/s

// with asm:
test sha256_10    ... bench:          28 ns/iter (+/- 3) = 357 MB/s
test sha256_100   ... bench:         274 ns/iter (+/- 7) = 364 MB/s
test sha256_1000  ... bench:       2,671 ns/iter (+/- 23) = 374 MB/s
test sha256_10000 ... bench:      26,693 ns/iter (+/- 348) = 374 MB/s
test sha512_10    ... bench:          20 ns/iter (+/- 0) = 500 MB/s
test sha512_100   ... bench:         184 ns/iter (+/- 2) = 543 MB/s
test sha512_1000  ... bench:       1,809 ns/iter (+/- 14) = 552 MB/s
test sha512_10000 ... bench:      18,032 ns/iter (+/- 177) = 554 MB/s
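The priority order described above can be sketched as a small selection function. This is an illustration only: the `Backend` enum and `select_backend` are hypothetical names, not the crate's actual dispatch code, and the two booleans stand in for runtime CPU detection and the `asm` Cargo feature.

```rust
/// Hypothetical labels for the three kinds of backend discussed in this thread.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum Backend {
    /// Hardware-accelerated path (e.g. SHA-NI for SHA-256, AVX2 for SHA-512).
    Hardware,
    /// Assembly implementation pulled in by the `asm` feature.
    Asm,
    /// Pure-Rust software fallback.
    Software,
}

/// Picks a backend: hardware first, then asm as a replacement for software.
fn select_backend(hw_available: bool, asm_enabled: bool) -> Backend {
    if hw_available {
        Backend::Hardware
    } else if asm_enabled {
        Backend::Asm
    } else {
        Backend::Software
    }
}
```

On a CPU where the hardware path is available, `select_backend(true, true)` and `select_backend(true, false)` both return `Backend::Hardware`, which is consistent with `--features=asm` making no difference in the Ice Lake numbers.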

codahale · 2023-06-15T17:55:41Z

Ah, gotcha. Thanks for clearing that up.

sha2: Add aarch64 backends for SHA2.

14093d5

newpavlov reviewed Jun 15, 2023

sha2/src/sha256/aarch64.rs

newpavlov reviewed Jun 15, 2023

sha2/src/sha512/aarch64.rs

tarcieri reviewed Jun 15, 2023

sha2/src/sha256/aarch64.rs

codahale added 3 commits June 15, 2023 08:12

sha2: Reduce state loads/stores for NEON backend.

5ed2894

sha2: Fix target feature requirements for SHA-512/NEON.

f669a9b

sha2: Fix target feature requirements for SHA-256/NEON.

40571e9

newpavlov reviewed Jun 15, 2023

sha2/src/sha256.rs

codahale added 2 commits June 15, 2023 08:36

sha2: Gate NEON backend behind asm feature.

ad9890f

sha2: Bump cross version to 1.59.

77c1de2

newpavlov reviewed Jun 15, 2023

sha2/src/sha256/aarch64.rs

sha2: Convert SHA-512 NEON intrinsic macros to fns.

eec3d51

newpavlov reviewed Jun 15, 2023

sha2/src/sha256/aarch64.rs

codahale added 3 commits June 15, 2023 09:22

sha2: Convert SHA-256 NEON intrinsic macros to fns.

cb3bfce

sha2: use nested imports

4e959eb

sha2: Add appropriate options to NEON assembly.

1693d00

newpavlov approved these changes Jun 15, 2023

newpavlov merged commit 3fa561e into RustCrypto:master Jun 15, 2023
22 checks passed

codahale deleted the sha2-neon branch June 15, 2023 16:14

newpavlov mentioned this pull request Jun 15, 2023

Release sha2 v0.10.7 #491

Merged

tarcieri mentioned this pull request Jun 26, 2023

sha2: explore addition of SSE and AVX2 backends for SHA-256 #327

Open

romanz mentioned this pull request Nov 2, 2023

Remove sha2 dep once bitcoin_hashes provide hardware accelerated sha256 RCasatta/bitcoin_slices#28

Open
