icp: Port AVX2 implementation of aes-gcm from BoringSSL #17058

lowjoel · 2025-02-15T07:34:13Z

Motivation and Context

Zen 3 CPUs support the VAES and VPCLMULQDQ instructions, which extend the width of the AES and carry-less multiply instructions from 128 bits to 256 bits. BoringSSL has recently added an AES-GCM implementation that uses them, and it provides up to an 80% speedup. See google/boringssl@3b6e1be.

Description

I've backported the implementation from BoringSSL, adapting code from google/boringssl@3b6e1be (but picking the tip of master), as well as from google/boringssl@62f9751 which changed the primitive signature from 6 arguments to 7 (by not implicitly relying on the address offset of the ghash structure.)

Adaptations for icp (akin to #9749) as well as to use the RET macro for kernel code are in the third commit.

The fifth to seventh commits combine the use_avx/use_avx2 flags into an enum, allowing toggling of the different implementations that are available. Also, define different values of CAN_USE_GCM_ASM to indicate various levels of compiler support.
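
A minimal sketch of the idea described above, assuming a single enum in place of the two boolean flags and a numeric CAN_USE_GCM_ASM level. The names gcm_simd_impl_t, gcm_simd_impl_compiled and gcm_simd_pick are hypothetical and not taken from the patch; only the `CAN_USE_GCM_ASM >= 2` gate for AVX2 appears in the PR itself.

```c
#include <stdbool.h>

/*
 * Assumed meaning of the levels for this sketch:
 * 0 = no assembly, 1 = AVX only, 2 = AVX and AVX2/VAES.
 */
#ifndef CAN_USE_GCM_ASM
#define	CAN_USE_GCM_ASM 2
#endif

/* One selector replaces separate use_avx/use_avx2 booleans. */
typedef enum gcm_simd_impl {
	GCM_SIMD_NONE = 0,	/* portable C code path */
	GCM_SIMD_AVX,		/* 128-bit AES-NI + PCLMULQDQ path */
	GCM_SIMD_AVX2		/* 256-bit VAES + VPCLMULQDQ path */
} gcm_simd_impl_t;

/* Was this implementation compiled in at the current level? */
static bool
gcm_simd_impl_compiled(gcm_simd_impl_t impl)
{
	switch (impl) {
	case GCM_SIMD_NONE:
		return (true);
	case GCM_SIMD_AVX:
		return (CAN_USE_GCM_ASM >= 1);
	case GCM_SIMD_AVX2:
		return (CAN_USE_GCM_ASM >= 2);
	}
	return (false);
}

/* Pick the widest implementation that is both compiled in and supported. */
static gcm_simd_impl_t
gcm_simd_pick(bool cpu_has_avx, bool cpu_has_avx2_vaes)
{
	if (cpu_has_avx2_vaes && gcm_simd_impl_compiled(GCM_SIMD_AVX2))
		return (GCM_SIMD_AVX2);
	if (cpu_has_avx && gcm_simd_impl_compiled(GCM_SIMD_AVX))
		return (GCM_SIMD_AVX);
	return (GCM_SIMD_NONE);
}
```

With an enum, adding a further implementation later means adding one enumerator and one case, rather than another boolean whose combinations with the existing flags need to be checked.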

How Has This Been Tested?

Compile tested.

I'm now running it on my ZFS-on-root main machine.

@robn has a Wycheproof test set (#17089) that has been merged; these changes pass.

Types of changes

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Performance enhancement (non-breaking change which improves efficiency)
Code cleanup (non-breaking change which makes code smaller or more readable)
Breaking change (fix or feature that would cause existing functionality to change)
Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
Documentation (a change to man pages or other documentation)

Checklist:

My code follows the OpenZFS code style requirements.
I have updated the documentation accordingly.
I have read the contributing document.
I have added tests to cover my changes.
I have run the ZFS Test Suite with this change applied.
All commit messages are properly formatted and contain Signed-off-by.

lowjoel · 2025-02-15T08:21:15Z

Hey @AttilaFueloep I tried to port your changes in #9749 and I can't figure out what you mean by the change with the comment // ICP does not zero key schedule. in the modified assembly sources. I've ported the other two changes: the round offset in the AES_KEY struct, and the OpenSSL vs ICP representation of the number of AES rounds. I've also fixed gcm_init_vpclmulqdq_avx2 to account for ICP storing H in network order, as you mentioned.
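
For readers unfamiliar with the byte-order point, here is a small sketch of what "H in network order" means for a 128-bit value: the 16 bytes are big-endian, so the first byte is the most significant. The type gcm_h_t and both function names are hypothetical, and this is not the conversion the assembly performs; it only illustrates the convention.

```c
#include <stdint.h>

/* Read 8 bytes in network (big-endian) order into a host integer. */
static uint64_t
load_be64(const uint8_t *p)
{
	uint64_t v = 0;

	for (int i = 0; i < 8; i++)
		v = (v << 8) | p[i];
	return (v);
}

/* Hypothetical host-order view of the 128-bit hash subkey H. */
typedef struct gcm_h {
	uint64_t hi;	/* most significant 64 bits */
	uint64_t lo;	/* least significant 64 bits */
} gcm_h_t;

/* Interpret 16 network-order bytes as two host-order 64-bit halves. */
static gcm_h_t
gcm_h_from_network(const uint8_t h[16])
{
	gcm_h_t out;

	out.hi = load_be64(h);
	out.lo = load_be64(h + 8);
	return (out);
}
```

The byte-at-a-time loop is endian-neutral, so the same source gives the same result on little-endian and big-endian hosts.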

robn · 2025-02-15T09:30:14Z

Strong opening move! I've had a little play with it, here's what I got.

Before the final thing, you should run make checkstyle, and fix the errors it throws out.

I added this patch to let me see what it thinks is happening on my test machines:

diff --git module/zcommon/simd_stat.c module/zcommon/simd_stat.c
index d82a88ca9..da557bbb0 100644
--- module/zcommon/simd_stat.c
+++ module/zcommon/simd_stat.c
@@ -117,6 +117,10 @@ simd_stat_kstat_data(char *buf, size_t size, void *data)
 		    "pclmulqdq", zfs_pclmulqdq_available());
 		off += SIMD_STAT_PRINT(simd_stat_kstat_payload,
 		    "movbe", zfs_movbe_available());
+		off += SIMD_STAT_PRINT(simd_stat_kstat_payload,
+		    "vaes", zfs_vaes_available());
+		off += SIMD_STAT_PRINT(simd_stat_kstat_payload,
+		    "vpclmulqdq", zfs_vpclmulqdq_available());
 
 		off += SIMD_STAT_PRINT(simd_stat_kstat_payload,
 		    "osxsave", boot_cpu_has(X86_FEATURE_OSXSAVE));

With that, my old Intel 2019 junker laptop says:

# grep -E 'vaes|vpclmulqdq' /proc/spl/kstat/zfs/simd
vaes                    0
vpclmulqdq              0

My much nicer Ryzen 5 from last year says:

# grep -E 'vaes|vpclmulqdq' /proc/spl/kstat/zfs/simd
vaes                    1
vpclmulqdq              1
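
For context, the two flags shown above correspond to CPUID feature bits. The following userspace sketch decodes them from CPUID leaf 7, subleaf 0 (AVX2 is EBX bit 5, VAES is ECX bit 9, VPCLMULQDQ is ECX bit 10). The function names are hypothetical, and this is not how zfs_vaes_available() is implemented; the kernel has its own feature-detection helpers. The sketch also does not check that the OS has enabled YMM state, which a complete check would need.

```c
#include <stdbool.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>	/* GCC/Clang helper for the CPUID instruction */
#endif

/* CPUID leaf 7, subleaf 0 feature bits. */
#define	CPUID7_EBX_AVX2		(1u << 5)
#define	CPUID7_ECX_VAES		(1u << 9)
#define	CPUID7_ECX_VPCLMULQDQ	(1u << 10)

/* Pure decode step: do the register values advertise all three features? */
static bool
cpuid7_has_vaes_avx2(uint32_t ebx, uint32_t ecx)
{
	return ((ebx & CPUID7_EBX_AVX2) != 0 &&
	    (ecx & CPUID7_ECX_VAES) != 0 &&
	    (ecx & CPUID7_ECX_VPCLMULQDQ) != 0);
}

/* Query the running CPU; always false on non-x86 builds. */
static bool
cpu_reports_vaes_avx2(void)
{
#if defined(__x86_64__) || defined(__i386__)
	unsigned int eax, ebx, ecx, edx;

	if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
		return (false);
	return (cpuid7_has_vaes_avx2(ebx, ecx));
#else
	return (false);
#endif
}
```

Keeping the bit decoding separate from the CPUID query makes the decode step testable on any machine, including ones without these instructions.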

Unfortunately we don't have visibility on the ICP microbenchmarks like we do for checksums and raidz, but we can at least see the options available:

# cat /sys/module/zfs/parameters/icp_gcm_impl
cycle [fastest] avx avx2 generic pclmulqdq

So I'd say it's all wired up right, which is half the fun.

I set it to avx and created a dataset, then unmounted and exported. Then I set it to avx2 and tried to zfs load-key, but that didn't work:

# zfs load-key -a
Enter passphrase for 'tank/enc':
Key load error: Incorrect key provided for 'tank/enc'.

Trying to create the pool and dataset with avx2 selected gets a crash:

# zpool create tank -O encryption=aes-256-gcm -O keyformat=passphrase /home/robn/blk
Enter new passphrase:
Re-enter new passphrase:
Killed

And the kernel has a nice complaint:

[ 2769.438918] BUG: unable to handle page fault for address: ffff9c23c8307e10
[ 2769.439133] #PF: supervisor read access in kernel mode
[ 2769.439290] #PF: error_code(0x0000) - not-present page
[ 2769.439447] PGD 110e01067 P4D 110e01067 PUD 0
[ 2769.439588] Oops: 0000 [#1] PREEMPT SMP NOPTI
[ 2769.439733] CPU: 0 PID: 73114 Comm: zpool Tainted: P           OE      6.1.0-25-amd64 #1  Debian 6.1.106-3
[ 2769.440035] Hardware name: FreeBSD BHYVE/BHYVE, BIOS 14.0 10/17/2021
[ 2769.440236] RIP: 0010:aes_gcm_dec_update_vaes_avx2+0x3e/0x5e0 [zfs]
[ 2769.440596] Code: 0c 24 c4 e2 71 00 c8 c4 42 7d 5a 18 c4 62 25 00 d8 44 8b 91 78 01 00 00 46 8d 14 95 f0 ff ff ff 4e 8d 5c 91 60 c4 62 7d 5a 09 <c4> 42 7d 5a 13 c5 25 fe 1d d5 ed 0b 00 48 83 fa 7f 0f 86 31 03 00
[ 2769.441171] RSP: 0018:ffffb1b6c37df358 EFLAGS: 00010086
[ 2769.441339] RAX: ffffffffc089b140 RBX: ffffb1b6c37df43c RCX: ffff9c214b6eae00
[ 2769.441566] RDX: 0000000000000060 RSI: ffff9c208e918500 RDI: ffff9c208e918500
[ 2769.441793] RBP: 0000000000000006 R08: ffffb1b6c37df430 R09: ffff9c20e8eba3c0
[ 2769.442019] R10: 000000009f3073ec R11: ffff9c23c8307e10 R12: ffffb1b6c37df498
[ 2769.442102] R13: ffffffffc089b140 R14: ffffb1b6c37df420 R15: ffffb1b6c37df488
[ 2769.442102] FS:  00007f2d7ead5840(0000) GS:ffff9c23afc00000(0000) knlGS:0000000000000000
[ 2769.442102] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2769.442102] CR2: ffff9c23c8307e10 CR3: 00000001cccaa000 CR4: 00000000003506f0
[ 2769.442102] Call Trace:
[ 2769.442102]  <TASK>
[ 2769.442102]  ? __die_body.cold+0x1a/0x1f
[ 2769.442102]  ? page_fault_oops+0xd2/0x2b0
[ 2769.442102]  ? srso_alias_return_thunk+0x5/0x7f
[ 2769.442102]  ? search_bpf_extables+0x5b/0x80
[ 2769.442102]  ? exc_page_fault+0xca/0x170
[ 2769.442102]  ? asm_exc_page_fault+0x22/0x30
[ 2769.442102]  ? aesni_gcm_decrypt_avx+0x10/0x10 [zfs]
[ 2769.442102]  ? aesni_gcm_decrypt_avx+0x10/0x10 [zfs]
[ 2769.442102]  ? aes_gcm_dec_update_vaes_avx2+0x3e/0x5e0 [zfs]
[ 2769.442102]  ? aesni_gcm_decrypt_avx2+0x2a/0x50 [zfs]
[ 2769.442102]  ? gcm_decrypt_final_avx+0x167/0x460 [zfs]
[ 2769.442102]  ? crypto_update_uio+0xc0/0x120 [zfs]
[ 2769.442102]  ? aesni_gcm_decrypt_avx+0x10/0x10 [zfs]
[ 2769.442102]  ? aes_decrypt_atomic+0x1ca/0x310 [zfs]
[ 2769.442102]  ? crypto_decrypt+0x78/0x1c0 [zfs]
[ 2769.442102]  ? zio_do_crypt_uio+0x2ce/0x400 [zfs]
[ 2769.442102]  ? zio_crypt_key_unwrap+0x254/0x480 [zfs]
[ 2769.442102]  ? dsl_crypto_key_open.constprop.0+0x2d1/0x350 [zfs]
[ 2769.442102]  ? dsl_crypto_key_open.constprop.0+0x2d1/0x350 [zfs]
[ 2769.442102]  ? spa_keystore_dsl_key_hold_dd+0x12e/0x280 [zfs]
[ 2769.442102]  ? __kmalloc_node+0x4c/0x150
[ 2769.442102]  ? spa_keystore_create_mapping+0x79/0x200 [zfs]
[ 2769.442102]  ? dsl_dataset_hold_obj_flags+0x48/0x90 [zfs]
[ 2769.442102]  ? dsl_pool_create+0x1bc/0x490 [zfs]
[ 2769.442102]  ? spa_create+0x83d/0xdb0 [zfs]
[ 2769.442102]  ? zfs_ioc_pool_create+0xab/0x310 [zfs]
[ 2769.442102]  ? zfsdev_ioctl_common+0x6a0/0x7c0 [zfs]
[ 2769.442102]  ? __kmalloc_node+0xbf/0x150
[ 2769.442102]  ? srso_alias_return_thunk+0x5/0x7f
[ 2769.442102]  ? zfsdev_ioctl+0x4f/0xd0 [zfs]
[ 2769.442102]  ? __x64_sys_ioctl+0x90/0xd0
[ 2769.442102]  ? do_syscall_64+0x55/0xb0

That's all I have time for tonight. This is a good start!

robn · 2025-02-15T09:33:27Z

Oh the other thing I forgot to add, after loading the module it says fastest, and fails as above. So it is selecting avx2 as fastest, which is good! (we need a stat for it, for sure).

lowjoel · 2025-02-16T08:02:28Z

module/icp/algs/modes/gcm.c

 	} else {
 		/*
 		 * Handle the "cycle" implementation by creating avx and
 		 * non-avx contexts alternately.
 		 */
 		gcm_ctx->gcm_use_avx = gcm_toggle_avx();
+		gcm_ctx->gcm_use_avx2 = gcm_toggle_avx2();


Pretty sure the cycle behaviour here isn't correct.

Should be fixed in #17061.

So does this PR have a dependency on #17061?

I think cycle will still work, it just won't toggle the same way (or the assumed way) - is that behaviour documented somewhere? Under what circumstances would people be using cycle?

When this code runs on an AVX2 processor, cycle will only toggle between AVX2 and generic, ignoring AVX - I don't know if that is a deal breaker enough to consider it a "dependency"

AVX only processors should still cycle the same way.

(at least that's the intention - might have to think a bit deeper about this)

@tonyhutter I've had another go at the cycle code - since movbe only applies to AVX I've modified the toggling for that to not apply to AVX2. Furthermore, if bswap is needed neither accelerated method will be usable.
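
To make the question about cycle concrete, here is a sketch of one possible behaviour: a round-robin that visits every available implementation in turn. This is not the toggle logic in the PR (which uses gcm_toggle_avx() and gcm_toggle_avx2()); it is an alternative shown only for illustration, and every name in it is hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical implementation identifiers for this sketch only. */
typedef enum impl_sketch {
	IMPL_SKETCH_GENERIC = 0,
	IMPL_SKETCH_AVX,
	IMPL_SKETCH_AVX2,
	IMPL_SKETCH_COUNT
} impl_sketch_t;

/*
 * Return the next available implementation after cur, wrapping around.
 * Falls back to generic if nothing is marked available.
 */
static impl_sketch_t
cycle_next(impl_sketch_t cur, const bool avail[IMPL_SKETCH_COUNT])
{
	for (size_t step = 1; step <= IMPL_SKETCH_COUNT; step++) {
		impl_sketch_t cand =
		    (impl_sketch_t)((cur + step) % IMPL_SKETCH_COUNT);

		if (avail[cand])
			return (cand);
	}
	return (IMPL_SKETCH_GENERIC);
}
```

On a machine where only AVX is available, this round-robin degenerates to alternating between generic and AVX, which matches the stated intention that AVX-only processors keep cycling the same way. Kernel code would additionally need the current position to be updated atomically, which this sketch leaves out.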

lowjoel · 2025-03-08T10:15:42Z

@AttilaFueloep renamed to avx2-vaes in 59fb58a; identifiers in code still remain just avx2 for symmetry with the assembly.

AttilaFueloep · 2025-03-10T15:31:07Z

@lowjoel Sorry for not being clear enough, was in a hurry. My main concern was the user visible module parameter that is shown in e.g. cat /sys/module/zfs/parameters/icp_gcm_impl. I think it would be good enough to just change the string, and keep the #define as is.

gcm.c L851

static const struct {
	const char *name;
	uint32_t sel;
} gcm_impl_opts[] = {
		{ "cycle",	IMPL_CYCLE },
		{ "fastest",	IMPL_FASTEST },
#ifdef CAN_USE_GCM_ASM
		{ "avx",	IMPL_AVX },
=>		{ "avx2",	IMPL_AVX2 },
#endif
};
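
A sketch of the suggestion above, assuming only the user-visible string changes while the selector keeps its internal name. The sk_ prefixed names and the numeric selector values are hypothetical; the real IMPL_* constants and the real parameter parsing in gcm.c may differ.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical selector values, for this sketch only. */
enum {
	SK_IMPL_CYCLE = 100,
	SK_IMPL_FASTEST,
	SK_IMPL_AVX,
	SK_IMPL_AVX2	/* internal name unchanged */
};

static const struct {
	const char *name;
	uint32_t sel;
} sk_impl_opts[] = {
	{ "cycle",	SK_IMPL_CYCLE },
	{ "fastest",	SK_IMPL_FASTEST },
	{ "avx",	SK_IMPL_AVX },
	{ "avx2-vaes",	SK_IMPL_AVX2 },	/* only the string is renamed */
};

/* Map a user-supplied name to a selector; 0 on success, -1 if unknown. */
static int
sk_impl_lookup(const char *val, uint32_t *sel)
{
	size_t n = sizeof (sk_impl_opts) / sizeof (sk_impl_opts[0]);

	for (size_t i = 0; i < n; i++) {
		if (strcmp(val, sk_impl_opts[i].name) == 0) {
			*sel = sk_impl_opts[i].sel;
			return (0);
		}
	}
	return (-1);
}
```

Because the table is the only place the string appears in this sketch, renaming the user-visible option does not touch the code that dispatches on the selector.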

AttilaFueloep · 2025-03-10T15:37:42Z

@lowjoel Regarding the Htab size: I'd have passed the GCM context to gcm_simd_get_htab_size() and returned an appropriate value there, but your change works as well.

tonyhutter · 2025-03-25T17:37:43Z

@AttilaFueloep @lowjoel just checking in - are you guys pretty happy with where this PR is right now? Are you just waiting on more approvals?

AttilaFueloep · 2025-03-25T21:38:30Z

@tonyhutter Well, I'm waiting for feedback from @lowjoel regarding my suggestion to rename the module parameter to avx2-vaes (#17058 (comment)). Besides that, I've no objections.

lowjoel · 2025-03-25T23:19:18Z

> @tonyhutter Well, I'm waiting for feedback from @lowjoel regarding my suggestion to rename the module parameter to avx2-vaes (#17058 (comment)). Besides that, I've no objections.

Crap. I think I made a change in my branch locally and the push got rejected and I got distracted with something else... umm. Let me go figure it out later today, or if you'd want to just push the change to my branch, that works too. Or if you'd rather merge and squash the change in, that also works for me.

AttilaFueloep · 2025-03-25T23:41:31Z

No worries, take your time.

lowjoel · 2025-03-26T00:14:57Z

OK it seems like I did do the work but just not the string shown to the user. Just rebased and pushed. See 42dfe38

lowjoel · 2025-03-26T00:33:05Z

@AttilaFueloep we've got a few new commits when I was checking the license - https://github.com/google/boringssl/commits/main/crypto/fipsmodule/aes/asm/aes-gcm-avx2-x86_64.pl - I will backport the vzeroupper change after this is merged, or if it's still open later today (about to head out for work)

lowjoel · 2025-04-03T13:28:55Z

module/icp/algs/modes/gcm.c

@@ -767,6 +754,9 @@ gcm_impl_get_ops(void)
 		break;
 #ifdef CAN_USE_GCM_ASM
 	case IMPL_AVX:
+#if CAN_USE_GCM_ASM >= 2
+	case IMPL_AVX2:
+#endif


I realise that by specifying icp_gcm_impl = avx2 I'm seeing 50% of my CPU time spent in gcm_generic_mul, I believe because of this branch here. If I specify fastest this doesn't happen. Why do we override the generic implementation rather than the fastest when we explicitly set AVX/AVX2?

I don't think so. If you end up in gcm_generic_mul you're in a non avx2 code path. I'd suspect either setting or using ctx->impl or activating the implementation fails. I'd run this through gdb but I've no VAES box currently. (That may change soon though.)

The code above just makes sure that a valid gcm_impl_ops is returned for still-open non-avx contexts; see the comment.

I see what you mean now - if this path is called it's because it's an in flight operation while the global changed. Hmm, then I need to dig further.

Exactly. I'll try to have a look over the weekend.

After rebuilding my patchset and reinstalling I can't reproduce this any more. Maybe it was a bug in my patches when I was building the DKMS module.

This uses the AVX2 versions of the AESENC and PCLMULQDQ instructions; on Zen 3 this provides an up to 80% performance improvement.

Original source: https://github.com/google/boringssl/blob/13840dd094f9e9c1b00a7368aa25e656554221f1/gen/bcm/aes-gcm-avx2-x86_64-linux.S

See the original BoringSSL commit at google/boringssl@3b6e1be.

Signed-off-by: Joel Low <joel@joelsplace.sg>

Signed-off-by: Joel Low <joel@joelsplace.sg>

- Accept GCM H variable in network endianness (ICP convention)
- Fix round count offset in AES_KEY struct (ICP convention)
- Use RET macro for kernel code
- Explicitly use .balign directive
- Use ENTRY_ALIGN and SET_SIZE macros
- Use ENDBR macro

Signed-off-by: Joel Low <joel@joelsplace.sg>

This is a cherry-pick from the following commits:
- google/boringssl@d8beaa3
- google/boringssl@14d05a3
- google/boringssl@d5440dd

Signed-off-by: Joel Low <joel@joelsplace.sg>

Signed-off-by: Joel Low <joel@joelsplace.sg>

The BoringSSL AVX2 implementation allocates uint128_t[16] but only uses the first 12 elements. Be explicit in the code.

Signed-off-by: Joel Low <joel@joelsplace.sg>

Signed-off-by: Joel Low <joel@joelsplace.sg>

Signed-off-by: Rob Norris <robn@despairlabs.com>

lowjoel · 2025-05-31T23:12:32Z

@tonyhutter I think this patchset is good to review; I see that the qemu test failures only affect almalinux8 and aren't specific to my PR. Is this broken in master too?

behlendorf · 2025-06-03T19:07:33Z

@lowjoel it looks like a spurious failure, I've submitted the almalinux8 build again.

lowjoel marked this pull request as draft February 15, 2025 07:34

github-actions bot added the Status: Work in Progress Not yet ready for general review label Feb 15, 2025

lowjoel force-pushed the aes-gcm-avx2 branch 2 times, most recently from 940863a to fbfc371 Compare February 15, 2025 08:18

lowjoel force-pushed the aes-gcm-avx2 branch 3 times, most recently from 0ce5f7b to 1b8867f Compare February 15, 2025 09:00

lowjoel force-pushed the aes-gcm-avx2 branch 14 times, most recently from ae64131 to c0a1e8e Compare February 16, 2025 07:12


lowjoel force-pushed the aes-gcm-avx2 branch from c0a1e8e to f0c81fa Compare February 16, 2025 13:26

lowjoel marked this pull request as ready for review February 16, 2025 23:10

github-actions bot added Status: Code Review Needed Ready for review and testing and removed Status: Work in Progress Not yet ready for general review labels Feb 16, 2025

lowjoel mentioned this pull request Feb 16, 2025

icp: Clean up AES-GCM AVX/AVX2 implementation selection #17061

Closed

13 tasks

lowjoel force-pushed the aes-gcm-avx2 branch from a6b8673 to 3fc6ac8 Compare March 8, 2025 10:14

lowjoel force-pushed the aes-gcm-avx2 branch 2 times, most recently from 7f1fb8c to 1f02def Compare March 9, 2025 03:28

lowjoel force-pushed the aes-gcm-avx2 branch from 1f02def to 2fdcc8c Compare March 26, 2025 00:14

lowjoel force-pushed the aes-gcm-avx2 branch from 2fdcc8c to 4e46877 Compare March 26, 2025 00:30


lowjoel and others added 9 commits May 31, 2025 07:22

Rename to the avx2-vaes implementation instead

4329512

Signed-off-by: Joel Low <joel@joelsplace.sg>

Backport AVX2 AES-GCM changes from BoringSSL

f55f14a

This is a cherry-pick from the following commits:
- google/boringssl@d8beaa3
- google/boringssl@14d05a3
- google/boringssl@d5440dd

Signed-off-by: Joel Low <joel@joelsplace.sg>

Refactor use_avx and use_avx2 implementation flags

ce61071

Signed-off-by: Joel Low <joel@joelsplace.sg>

Create htab according to implementation

974af44

The BoringSSL AVX2 implementation allocates uint128_t[16] but only uses the first 12 elements. Be explicit in the code.

Signed-off-by: Joel Low <joel@joelsplace.sg>

Conditionally build AVX2 code

a15ce87

Signed-off-by: Joel Low <joel@joelsplace.sg>

crypto_test: include AVX2 GCM implementation in tests

15e7c17

Signed-off-by: Rob Norris <robn@despairlabs.com>

simd_stat: expose availability of VAES and VPCLMULQDQ

be3f4d8

Signed-off-by: Rob Norris <robn@despairlabs.com>

lowjoel force-pushed the aes-gcm-avx2 branch 2 times, most recently from 49cfaf5 to be3f4d8 Compare May 31, 2025 03:46

lzsaver mentioned this pull request Jun 3, 2025

add aegis aead support #17414

Open

behlendorf self-requested a review June 3, 2025 19:02
