Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LibCrypto: Add x86 specific versions of SHA1 and SHA2 #24522

Merged
merged 9 commits into from
Jul 4, 2024

Conversation

Hendiadyoin1
Copy link
Contributor

@Hendiadyoin1 Hendiadyoin1 commented Jun 2, 2024

CC @MarekKnapek
I revived your PR, I hope you dont mind me giving my self co-author


Changes since #23958:

  • We now use a custom ifunc resolver, as the sha target seems to be silently ignored at best by all compilers

@github-actions github-actions bot added the 👀 pr-needs-review PR needs review from a maintainer or community member label Jun 2, 2024
@Hendiadyoin1 Hendiadyoin1 force-pushed the SIMD-SHA branch 3 times, most recently from 1e7e422 to d1cc231 Compare June 2, 2024 19:14
@MarekKnapek
Copy link
Contributor

Yes, no problem, you have done lot of work on this and helped me much, I thank you for this. Now, towards AES... should be easier than this.

Tests/AK/TestSIMDExtras.cpp Show resolved Hide resolved
// FIXME: Use __builtin_cpu_supports("sha") when compilers support it
constexpr u32 cpuid_sha_ebx = 1 << 29;
u32 eax, ebx, ecx, edx;
__cpuid_count(7, 0, eax, ebx, ecx, edx);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can hide __cpuid_count in some AK::cpu_supports(Capability::SHA) which would also cache the result. (This will be similar to how __builtin_cpu_supports works)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well copilot said the same, but sadly we dont have that API yet

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's never late to introduce a new helper!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem is with clang only, GCC 11.1 supports both the function attribute and the query.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope, we later figured on Discord that neither Clang nor GCC did the correct thing. Compare assembly for use_f and use_transform_impl here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I was wrong. GCC does support the __builtin_cpu_supports("sha") test since 11.1 but clang does not, not even the latest one version 18. https://godbolt.org/z/cP7WGYqGq Both compilers do not support the automatic dispatch thing (Function Multiversioning).

Copy link
Contributor

@MarekKnapek MarekKnapek Jun 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add test whether the CPUID has at least 7 leaves before reading from the 7th leaf. This is done by setting the eax register to zero and invoking the cpuid instruction, or, from the C language, by calling the __get_cpuid_max(0, NULL) function provided by the <cpuid.h> header. https://godbolt.org/z/v6q89rqhn

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add test whether the CPUID has at least 7 leaves before reading from the 7th leaf. This is done by setting the eax register to zero and invoking the cpuid instruction, or, from the C language, by calling the __get_cpuid_max(0, NULL) function provided by the <cpuid.h> header. https://godbolt.org/z/v6q89rqhn

For now I'll elide that, as memset also avoids that, and I am pretty sure that cpus without the Extentions leaf are missing other features we rely on

@Hendiadyoin1
Copy link
Contributor Author

Hendiadyoin1 commented Jun 10, 2024

Before (Clang):

Benchmarking sha1...
Running benchmark for sha1 with size 16 for ~3000ms...3001ms, 8391481 ops, 14.3 MiB/s
Running benchmark for sha1 with size 1024 for ~3000ms...3001ms, 907621 ops, 240.3 MiB/s
Running benchmark for sha1 with size 16384 for ~3000ms...3001ms, 62750 ops, 324.2 MiB/s
Running benchmark for sha1 with size 262144 for ~3000ms...3001ms, 3931 ops, 327.1 MiB/s
Running benchmark for sha1 with size 1048576 for ~3000ms...3002ms, 960 ops, 319.4 MiB/s
Running benchmark for sha1 with size 16777216 for ~3000ms...3036ms, 62 ops, 326.1 MiB/s
Algorithm            Size       Min us/op  Max us/op  Avg us/op  Throughput
sha1                 16 B       1          112        1          14.3 MiB  /s
sha1                 1.0 KiB    4          57         4          240.3 MiB /s
sha1                 16.0 KiB   46         121        48         324.2 MiB /s
sha1                 256.0 KiB  732        1029       763        327.1 MiB /s
sha1                 1.0 MiB    3003       5511       3126       319.4 MiB /s
sha1                 16.0 MiB   48587      49801      48965      326.1 MiB /s

After:

Benchmarking sha1...
Running benchmark for sha1 with size 16 for ~3000ms...3001ms, 14939406 ops, 14.3 MiB/s
Running benchmark for sha1 with size 1024 for ~3000ms...3001ms, 4136306 ops, 965.1 MiB/s
Running benchmark for sha1 with size 16384 for ~3000ms...3001ms, 336116 ops, 1.6 GiB/s
Running benchmark for sha1 with size 262144 for ~3000ms...3001ms, 21397 ops, 1.7 GiB/s
Running benchmark for sha1 with size 1048576 for ~3000ms...3001ms, 5331 ops, 1.7 GiB/s
Running benchmark for sha1 with size 16777216 for ~3000ms...3003ms, 331 ops, 1.7 GiB/s
Algorithm            Size       Min us/op  Max us/op  Avg us/op  Throughput
sha1                 16 B       1          62         1          14.3 MiB  /s
sha1                 1.0 KiB    1          121        1          965.1 MiB /s
sha1                 16.0 KiB   9          104        9          1.6 GiB   /s
sha1                 256.0 KiB  135        310        140        1.7 GiB   /s
sha1                 1.0 MiB    536        759        563        1.7 GiB   /s
sha1                 16.0 MiB   8899       9256       9070       1.7 GiB   /s

So a 5.3x speedup for sha1 on my end

Copy link
Contributor

@DanShaders DanShaders left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hope I'm not too annoying with asking to use C++ standard syntax for attributes wherever possible.

Comment on lines +74 to +75
static_assert(IsSame<ElementOf<i8x4>, i8>);
static_assert(IsSame<ElementOf<f32x4>, float>);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we move these into tests?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO static_asserts should stay in the header, not sure if we have many of those in the test files
for AK/Concepts.h maybe?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are countless instances of static_assert in Tests/AK, while the only thing that is being tested inline in a header in AK is explode_byte.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll leave them here for now
(Side note the Kernel likes inline static_asserts)

AK/SIMD.h Show resolved Hide resolved
AK/SIMD.h Show resolved Hide resolved
AK/SIMDExtras.h Outdated Show resolved Hide resolved
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about (untested)

template<typename ElementType, size_t n, auto function>
void test(ReadonlySpan<u64> values, ReadonlySpan<u64> expected)
{
    VERIFY(values.size() == n);
    VERIFY(expected.size() == n);
    using Type = ElementType __attribute__((vector_size(n * sizeof(ElementType))));
    auto v = AK::SIMD::load_unaligned<Type>(&values[0]);
    v = function(v);
    ElementType result[n];
    AK::SIMD::store_unaligned(result, v);
    for (size_t i = 0; i < n; ++i)
        EXPECT_EQ(result[i], expected[i]);
}

and expressing all 1000 lines of tests with a nice and short array of structs?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well this wont work on GCC, but will look into it

Userland/Libraries/LibCrypto/Hash/SHA2.cpp Outdated Show resolved Hide resolved
Userland/Libraries/LibELF/DynamicLoader.cpp Show resolved Hide resolved
Userland/Libraries/LibELF/DynamicLoader.cpp Outdated Show resolved Hide resolved
We currently pretend that these return an ElfAddr, while in actuality
they return a function pointer.
UBSAN may check this and complain about it, so let's just disable it
for that line.
@Hendiadyoin1 Hendiadyoin1 force-pushed the SIMD-SHA branch 2 times, most recently from 1993c31 to aa6d286 Compare July 3, 2024 18:45
@Hendiadyoin1 Hendiadyoin1 force-pushed the SIMD-SHA branch 2 times, most recently from 3ec6983 to a0a7ff2 Compare July 3, 2024 18:56
Hendiadyoin1 and others added 5 commits July 3, 2024 21:09
Otherwise we'd hit a VERIFY in AK::SIMD::shuffle() when that operand
contains an out-of-range value, the spec tests indicate that a swizzle
with an out-of-range index should return 0.
Co-Authored-By: Hendiadyoin1 <leon.a@serenityos.org>
Co-Authored-By: Hendiadyoin1 <leon.a@serenityos.org>
@nico
Copy link
Contributor

nico commented Jul 3, 2024

@gmta, do the first few SIMD commits look good to you?

(I think this is fine for merging, but you know the SIMD bits much better.)

@gmta gmta merged commit 37c66c1 into SerenityOS:master Jul 4, 2024
14 checks passed
@github-actions github-actions bot removed the 👀 pr-needs-review PR needs review from a maintainer or community member label Jul 4, 2024
@Hendiadyoin1 Hendiadyoin1 deleted the SIMD-SHA branch July 5, 2024 11:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants