-
Notifications
You must be signed in to change notification settings - Fork 1.9k
icp: Port AVX2 implementation of aes-gcm from BoringSSL #17058
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
940863a
to
fbfc371
Compare
Hey @AttilaFueloep I tried to port your changes in #9749 and I can't figure out what you mean by the change with the comment |
0ce5f7b
to
1b8867f
Compare
Strong opening move! I've had a little play with it, here's what I got. Before the final thing, you should run I added this patch to let me see what it thinks is happening on my test machines: diff --git module/zcommon/simd_stat.c module/zcommon/simd_stat.c
index d82a88ca9..da557bbb0 100644
--- module/zcommon/simd_stat.c
+++ module/zcommon/simd_stat.c
@@ -117,6 +117,10 @@ simd_stat_kstat_data(char *buf, size_t size, void *data)
"pclmulqdq", zfs_pclmulqdq_available());
off += SIMD_STAT_PRINT(simd_stat_kstat_payload,
"movbe", zfs_movbe_available());
+ off += SIMD_STAT_PRINT(simd_stat_kstat_payload,
+ "vaes", zfs_vaes_available());
+ off += SIMD_STAT_PRINT(simd_stat_kstat_payload,
+ "vpclmulqdq", zfs_vpclmulqdq_available());
off += SIMD_STAT_PRINT(simd_stat_kstat_payload,
"osxsave", boot_cpu_has(X86_FEATURE_OSXSAVE)); With that, my old Intel 2019 junker laptop says:
My much nicer Ryzen 5 from last year says:
Unfortunately we don't have visibility on the ICP microbenchmarks like we do for checksums and raidz, but we can at least see the options available:
So I'd say it's all wired up right, which is half the fun. I set it to
Trying to create the pool and dataset with
And the kernel has a nice complaint:
That's all I have time for tonight. This is a good start! |
Oh the other thing I forgot to add, after loading the module it says |
ae64131
to
c0a1e8e
Compare
module/icp/algs/modes/gcm.c
Outdated
} else { | ||
/* | ||
* Handle the "cycle" implementation by creating avx and | ||
* non-avx contexts alternately. | ||
*/ | ||
gcm_ctx->gcm_use_avx = gcm_toggle_avx(); | ||
gcm_ctx->gcm_use_avx2 = gcm_toggle_avx2(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pretty sure the cycle behaviour here isn't correct.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be fixed in #17061.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So does this PR have a dependency on #17061?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think cycle will still work, it just won't toggle the same way (or the assumed way) - is that behaviour documented somewhere? Under what circumstances would people be using cycle?
When this code runs on an AVX2 processor, cycle will only toggle between AVX2 and generic, ignoring AVX - I don't know if that is a deal breaker enough to consider it a "dependency"
AVX only processors should still cycle the same way.
(at least that's the intention - might have to think a bit deeper about this)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tonyhutter I've had another go at the cycle code - since movbe
only applies to AVX I've modified the toggling for that to not apply to AVX2. Furthermore, if bswap is needed neither accelerated method will be usable.
c0a1e8e
to
f0c81fa
Compare
@AttilaFueloep renamed to avx2-vaes in 59fb58a; identifiers in code still remains just |
7f1fb8c
to
1f02def
Compare
@lowjoel Sorry for not being clear enough, was in a hurry. My main concern was the user visible module parameter that is shown in e.g. gcm.c L851 static const struct {
const char *name;
uint32_t sel;
} gcm_impl_opts[] = {
{ "cycle", IMPL_CYCLE },
{ "fastest", IMPL_FASTEST },
#ifdef CAN_USE_GCM_ASM
{ "avx", IMPL_AVX },
=> { "avx2", IMPL_AVX2 },
#endif
}; |
@lowjoel Regarding the Htab size: I'd have passed the GCM context to |
@AttilaFueloep @lowjoel just checking in - are you guys pretty happy with where this PR is right now? Are you just waiting on more approvals? |
@tonyhutter Well, I'm waiting for feedback from @lowjoel regarding my suggestion to rename the module parameter to |
Crap. I think I made a change in my branch locally and the push got rejected and I got distracted with something else... umm. Let me go figure it out later today, or if you'd want to just push the change to my branch that works too. Or if you'd want to merge and squash the change up into also works for me. |
No worries, take your time. |
OK it seems like I did do the work but just not the string shown to the user. Just rebased and pushed. See 42dfe38 |
@AttilaFueloep we've got a few new commits when I was checking the license - https://github.com/google/boringssl/commits/main/crypto/fipsmodule/aes/asm/aes-gcm-avx2-x86_64.pl - I will backport the vzeroupper change after this is merged, or if it's still open later today (about to head out for work) |
@@ -767,6 +754,9 @@ gcm_impl_get_ops(void) | |||
break; | |||
#ifdef CAN_USE_GCM_ASM | |||
case IMPL_AVX: | |||
#if CAN_USE_GCM_ASM >= 2 | |||
case IMPL_AVX2: | |||
#endif |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I realise that by specifying icp_gcm_impl
= avx2
I'm seeing 50% of my CPU time spent in gcm_generic_mul
, I believe because of this branch here. If I specify fastest
this doesn't happen. Why do we override the generic implementation rather than the fastest when we explicitly set AVX/AVX2?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think so. If you end up in gcm_generic_mul
you're in a non avx2 code path. I'd suspect either setting or using ctx->impl
or activating the implementation fails. I'd run this through gdb but I've no VAES box currently. (That may change soon though.)
The code above just make sure that a valid gcm_impl_ops
is returned for still open non-avx contexts , see the comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see what you mean now - if this path is called it's because it's an in flight operation while the global changed. Hmm, then I need to dig further.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Exactly. I'll try to have a look over the weekend.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After rebuilding my patchset and reinstalling I can't reproduce this any more. Maybe it was a bug in my patches when I was building the DKMS module.
This uses the AVX2 versions of the AESENC and PCLMULQDQ instructions; on Zen 3 this provides an up to 80% performance improvement. Original source: https://github.com/google/boringssl/blob/13840dd094f9e9c1b00a7368aa25e656554221f1/gen/bcm/aes-gcm-avx2-x86_64-linux.S See the original BoringSSL commit at google/boringssl@3b6e1be. Signed-off-by: Joel Low <joel@joelsplace.sg>
Signed-off-by: Joel Low <joel@joelsplace.sg>
- Accept GCM H variable in network endianness (ICP convention) - Fix round count offset in AES_KEY struct (ICP convention) - Use RET macro for kernel code - Explicitly use .balign directive - Use ENTRY_ALIGN and SET_SIZE macros - Use ENDBR macro Signed-off-by: Joel Low <joel@joelsplace.sg>
This is a cherry-pick from the following commits: - google/boringssl@d8beaa3 - google/boringssl@14d05a3 - google/boringssl@d5440dd Signed-off-by: Joel Low <joel@joelsplace.sg>
Signed-off-by: Joel Low <joel@joelsplace.sg>
The BoringSSL AVX2 implementation allocates uint128_t[16] but only uses the first 12 elements. Be explicit in the code. Signed-off-by: Joel Low <joel@joelsplace.sg>
Signed-off-by: Joel Low <joel@joelsplace.sg>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Signed-off-by: Rob Norris <robn@despairlabs.com>
49cfaf5
to
be3f4d8
Compare
@tonyhutter I think this patchset is good to review; I see that the qemu tests are just affecting almalinux8 which isn't just my PR; is this broken in master too? |
@lowjoel it looks like a spurious failure, I've submitted the almalinux8 build again. |
Motivation and Context
Zen 3 CPUs support the VAES and VPCLMULDQ instructions which extend the width of each instruction from 128-bits to 256-bits. BoringSSL has recently implemented this version for AES-GCM and it provides up to a 80% speedup. See google/boringssl@3b6e1be.
Description
I've backported the implementation from BoringSSL, adapting code from google/boringssl@3b6e1be (but picking the tip of master), as well as from google/boringssl@62f9751 which changed the primitive signature from 6 arguments to 7 (by not implicitly relying on the address offset of the ghash structure.)
Adaptations for icp (akin to #9749) as well as to use the RET macro for kernel code are in the third commit.
The fifth to seventh commits combine the use_avx/use_avx2 flags into an enum, allowing toggling of the different implementations that are available. Also, define different values of CAN_USE_GCM_ASM to indicate various levels of compiler support.
How Has This Been Tested?
Compile tested.
I'm now running it on my ZFS-on-root main machine.
@robn has a Wycheproof test set (#17089) that has been merged; these changes pass.
Types of changes
Checklist:
Signed-off-by
.