Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

clz size/perf optimisation #7462

Merged
merged 8 commits into from
Apr 26, 2023

Conversation

daverodgman
Copy link
Contributor

@daverodgman daverodgman commented Apr 19, 2023

Description

For clang and gcc, this saves a few instructions (with gcc -Os on aarch64, the size of libmbedcrypto is reduced by 88b). It also covers a test gap (this function was not directly tested before). Performance is not very different according to programs/test/benchmark.

With armclang v6.19 compiling for target arm-arm-none-eabi, cortex-m33+nodsp (code, data; negative is a reduction, so an improvement):

Old Size New Size Change .o impacted
2490, 0 2460, 0 -30+0 bignum_core.o

Gatekeeper checklist

  • changelog not required - non functional change
  • backport not required - not a bugfix
  • tests provided

Signed-off-by: Dave Rodgman <dave.rodgman@arm.com>
@daverodgman daverodgman added needs-review Every commit must be reviewed by at least two team members, component-crypto Crypto primitives and low-level interfaces needs-reviewer This PR needs someone to pick it up for review priority-high High priority - will be reviewed soon size-optimisation size-xs Estimated task size: extra small (a few hours at most) labels Apr 19, 2023
@daverodgman daverodgman added the needs-ci Needs to pass CI tests label Apr 19, 2023
Signed-off-by: Dave Rodgman <dave.rodgman@arm.com>
@daverodgman daverodgman removed the needs-ci Needs to pass CI tests label Apr 19, 2023
@daverodgman
Copy link
Contributor Author

daverodgman commented Apr 20, 2023

It seems that on Arm, the clz instruction is guaranteed to return the answer we want for an input of 0. So the special case for 0 could possibly be omitted on Arm to save a few more instructions. Probably not worth it IMO.

Alternatively, we can re-arrange the only function which calls this in order to never call it for an input of 0 (so we safely avoid the undefined case). This ends up being a small additional net win, so I've added this to the PR (last two commits).

@daverodgman daverodgman added the needs-ci Needs to pass CI tests label Apr 20, 2023
Signed-off-by: Dave Rodgman <dave.rodgman@arm.com>
Signed-off-by: Dave Rodgman <dave.rodgman@arm.com>
@daverodgman daverodgman removed the needs-ci Needs to pass CI tests label Apr 20, 2023
Copy link
Contributor

@gilles-peskine-arm gilles-peskine-arm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Blocker: undefined behavior must be clearly documented in the function's documentation.

And I'm not convinced that the complexity is worth it. Can't we keep things simple?

*/
#if defined(__has_builtin)
#if __has_builtin(__builtin_clz)
if (sizeof(mbedtls_mpi_uint) == sizeof(unsigned int)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit surprised that GCC and Clang aren't complaining about dead code. We know the size of mbedtls_mpi_uint: it's either 32-bit or 64-bit (and with no padding bits, although that doesn't actually matter here), and we know that through the preprocessor symbol MBEDTLS_HAVE_INT32 or MBEDTLS_HAVE_INT64. So we can select the clz function entirely via the preprocessor.

#if MBEDTLS_HAVE_INT32 && __has_builtin(__builtin_clz)
    return __builtin_clz(a);
#elif ((MBEDTLS_HAVE_INT64 && LONG_MAX >= 0xffffffffffffffff) || MBEDTLS_HAVE_INT32) && __has_builtin(__builtin_clzl)
    return __builtin_clzl(a);
#elif __has_builtin(__builtin_clzll)
    return __builtin_clzll(a);
#else
  // software implementation
#endif

Or, for simplicity's sake, are there compilers we care about that either don't have __builtin_clzll or generate inefficient code when __builtin_clzll is passed a value of a smaller type? If not we could just write

#if __has_builtin(__builtin_clzll)
    return __builtin_clzll(a);
#else
  // software implementation
#endif

Copy link
Contributor Author

@daverodgman daverodgman Apr 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately that doesn't work: calling __builtin_clzll on a 32-bit mbedtls_mpi_uint means that 32 leading zeros get added. So we do need separate options.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like the first suggestion because it essentially relies on accurately recreating the logic from bignum.h to select the correct option, which is brittle (e.g., would break if we decided this type should always be 64 bits, or if we added an option for the user to specify the size, etc). Comparing via sizeof is much more robust (and still a compile-time decision).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess we could define MBEDTLS_MPI_BITS to 32 or 64 in bignum.h. Then, here we could do:

#if defined(MBEDTLS_CLZ32) && (MBEDTLS_MPI_BITS == 32)
    return MBEDTLS_CLZ32(a);
#elif defined(MBEDTLS_CLZ64) && (MBEDTLS_MPI_BITS == 64)
    return MBEDTLS_CLZ64(a);
#else
    // plain C impl
#endif

with some suitable #defines for MBEDTLS_CLZ32 outside of the function. Maybe that's a bit cleaner? I don't have a strong perference.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like the first suggestion because it essentially relies on accurately recreating the logic from bignum.h to select the correct option

I don't understand: what logic would we be recreating? bignum.h guarantees that MBEDTLS_HAVE_INT32 is defined if mbedtls_mpi_uint is a 32-bit type, and similar for MBEDTLS_HAVE_INT64. If we added support for a different size, then neither symbol would be defined and instead we'd define e.g. MBEDTLS_HAVE_INT16. defined(MBEDTLS_HAVE_INTxx) vs MBEDTLS_MPI_BITS == xx is just a cosmetic difference since we aren't doing anything with xx other than an equality comparison.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bignum.h guarantees that MBEDTLS_HAVE_INT32 is defined (…)

Mind you, this is not clearly documented (and somewhat surprising given the name of the symbols) — I only realized that while doing bignum recently. So it isn't a promise we make to our users as part of our API stability guarantee. But it's a property that we are already relying on, and there is a precedent (I forget where) of relying on this property to simplify some bignum code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is not clearly documented (and somewhat surprising given the name of the symbols)

Yes, that was not obvious to me. In that case I'm happy to switch on that instead of MBEDTLS_MPI_BITS, but no strong preference vs. the current arrangement.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: I agree with Gilles here: MBEDTLS_HAVE_INTxx means precisely that mbedtls_mpi_uint is xx bits. Yes, the naming is a bit off

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

However, while we'd normally use MBEDTLS_HAVE_INTxx, using sizeof() as done here is perfectly natural and leads to a quite readable implementation, so I think this is fine. If we need 32- and/or 64-bit clz() functions, we can create those then then

{
/* Note: the result is undefined for a == 0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Blocker: This limitation must be documented.

That's assuming we really must have this limitation. How much code size does it save?

Experimentally, with arm-none-eabi-gcc -mthumb -mcpu=cortex-m0plus -Os (GCC 10.3-2021.07), more than I'd thought: the rewritten mbedtls_mpi_core_bitlen is one instruction longer, but adding if (a == 0) return biL to mbedtls_mpi_core_clz is a whooping 5 instructions longer. Still, I think 8 bytes is worth it to have an easy-to-understand semantics. Plus I'm not convinced GCC's code is good there:

   0:   0003            movs    r3, r0
   2:   b510            push    {r4, lr}
   4:   2020            movs    r0, #32
   6:   2b00            cmp     r3, #0
   8:   d002            beq.n   10 <mbedtls_mpi_core_clz+0x10>
   a:   0018            movs    r0, r3
   c:   f7ff fffe       bl      0 <__clzsi2>
  10:   bd10            pop     {r4, pc}

What's with the register shuffling there?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've documented the limitation. The limitation is annoying but I think acceptable given the local scope of this function.

tests/suites/test_suite_bignum_core.function Outdated Show resolved Hide resolved
Signed-off-by: Dave Rodgman <dave.rodgman@arm.com>
Copy link
Contributor

@gilles-peskine-arm gilles-peskine-arm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The documentation in bignum_core.h still needs to be updated. And I still think the sizeof comparison is more brittle than it needs to be.

@@ -33,11 +33,18 @@
#include "bn_mul.h"
#include "constant_time_internal.h"

/**
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not a static function, so its documentation is in a header file (in this case library/bignum_core.h since it's not part of the public API). Please update the public documentation, and don't add a copy here (copies are bad because they get out of sync — this one is already out of sync!).

*/
#if defined(__has_builtin)
#if __has_builtin(__builtin_clz)
if (sizeof(mbedtls_mpi_uint) == sizeof(unsigned int)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like the first suggestion because it essentially relies on accurately recreating the logic from bignum.h to select the correct option

I don't understand: what logic would we be recreating? bignum.h guarantees that MBEDTLS_HAVE_INT32 is defined if mbedtls_mpi_uint is a 32-bit type, and similar for MBEDTLS_HAVE_INT64. If we added support for a different size, then neither symbol would be defined and instead we'd define e.g. MBEDTLS_HAVE_INT16. defined(MBEDTLS_HAVE_INTxx) vs MBEDTLS_MPI_BITS == xx is just a cosmetic difference since we aren't doing anything with xx other than an equality comparison.

*/
#if defined(__has_builtin)
#if __has_builtin(__builtin_clz)
if (sizeof(mbedtls_mpi_uint) == sizeof(unsigned int)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bignum.h guarantees that MBEDTLS_HAVE_INT32 is defined (…)

Mind you, this is not clearly documented (and somewhat surprising given the name of the symbols) — I only realized that while doing bignum recently. So it isn't a promise we make to our users as part of our API stability guarantee. But it's a property that we are already relying on, and there is a precedent (I forget where) of relying on this property to simplify some bignum code.

tests/suites/test_suite_bignum_core.function Outdated Show resolved Hide resolved
Signed-off-by: Dave Rodgman <dave.rodgman@arm.com>
Copy link
Contributor

@tom-cosgrove-arm tom-cosgrove-arm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@tom-cosgrove-arm tom-cosgrove-arm removed the needs-reviewer This PR needs someone to pick it up for review label Apr 25, 2023
@tom-cosgrove-arm
Copy link
Contributor

@gilles-peskine-arm Do you still have objections/concerns? And if not, are you reviewing this, or should someone else take a look?

Signed-off-by: Dave Rodgman <dave.rodgman@arm.com>
Copy link
Contributor

@gilles-peskine-arm gilles-peskine-arm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me except the new test which tests something that happens to work, but we don't actually want to guarantee.

@@ -155,6 +155,9 @@ mpi_core_bitlen:"10":5
Test mbedtls_mpi_core_bitlen 0x0a
mpi_core_bitlen:"a":4

Test mbedtls_mpi_core_bitlen: 0 limbs
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Core functions don't support 0-limb bignums (see the top of bignum_core.h). (Legacy bignum functions do, it's up to them to do the right thing without calling core functions.) So we shouldn't have such a test case. It might happen to work for this particular function, but we wouldn't be bothered if it doesn't.

Most core functions fall into one of two categories: they do some kind of loop over the limb count and a 0-limb array just works, or they do some kind of memcpy over the limbs and a 0-limb array technically has undefined behavior which is caught by UBSan but tends to work when not building with UBSan.

The current implementation of core_bitlen has an explicit check limbs == 0. I think it's a leftover from the very early days of the core module, before we'd decided that it didn't need to support 0-limb. So this check could be removed, which would save a couple of words of code size.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, thanks, I was unsure on this point. I've removed the test changes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before we forget: #7491

Signed-off-by: Dave Rodgman <dave.rodgman@arm.com>
Copy link
Contributor

@gilles-peskine-arm gilles-peskine-arm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Contributor

@tom-cosgrove-arm tom-cosgrove-arm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@tom-cosgrove-arm tom-cosgrove-arm added approved Design and code approved - may be waiting for CI or backports and removed needs-review Every commit must be reviewed by at least two team members, labels Apr 26, 2023
@tom-cosgrove-arm tom-cosgrove-arm merged commit 10f4091 into Mbed-TLS:development Apr 26, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Design and code approved - may be waiting for CI or backports component-crypto Crypto primitives and low-level interfaces priority-high High priority - will be reviewed soon size-optimisation size-xs Estimated task size: extra small (a few hours at most)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants