-
Notifications
You must be signed in to change notification settings - Fork 2.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
clz size/perf optimisation #7462
clz size/perf optimisation #7462
Conversation
Signed-off-by: Dave Rodgman <dave.rodgman@arm.com>
45500c6
to
d54cb83
Compare
It seems that on Arm, the clz instruction is guaranteed to return the answer we want for an input of 0. So the special case for 0 could possibly be omitted on Arm to save a few more instructions. Probably not worth it IMO. Alternatively, we can re-arrange the only function which calls this in order to never call it for an input of 0 (so we safely avoid the undefined case). This ends up being a small additional net win, so I've added this to the PR (last two commits). |
Signed-off-by: Dave Rodgman <dave.rodgman@arm.com>
c15af9a
to
880a6b3
Compare
Signed-off-by: Dave Rodgman <dave.rodgman@arm.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Blocker: undefined behavior must be clearly documented in the function's documentation.
And I'm not convinced that the complexity is worth it. Can't we keep things simple?
*/ | ||
#if defined(__has_builtin) | ||
#if __has_builtin(__builtin_clz) | ||
if (sizeof(mbedtls_mpi_uint) == sizeof(unsigned int)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a bit surprised that GCC and Clang aren't complaining about dead code. We know the size of mbedtls_mpi_uint
: it's either 32-bit or 64-bit (and with no padding bits, although that doesn't actually matter here), and we know that through the preprocessor symbol MBEDTLS_HAVE_INT32
or MBEDTLS_HAVE_INT64
. So we can select the clz
function entirely via the preprocessor.
#if MBEDTLS_HAVE_INT32 && __has_builtin(__builtin_clz)
return __builtin_clz(a);
#elif ((MBEDTLS_HAVE_INT64 && LONG_MAX >= 0xffffffffffffffff) || MBEDTLS_HAVE_INT32) && __has_builtin(__builtin_clzl)
return __builtin_clzl(a);
#elif __has_builtin(__builtin_clzll)
return __builtin_clzll(a);
#else
// software implementation
#endif
Or, for simplicity's sake, are there compilers we care about that either don't have __builtin_clzll
or generate inefficient code when __builtin_clzll
is passed a value of a smaller type? If not we could just write
#if __has_builtin(__builtin_clzll)
return __builtin_clzll(a);
#else
// software implementation
#endif
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately that doesn't work: calling __builtin_clzll
on a 32-bit mbedtls_mpi_uint
means that 32 leading zeros get added. So we do need separate options.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't like the first suggestion because it essentially relies on accurately recreating the logic from bignum.h
to select the correct option, which is brittle (e.g., would break if we decided this type should always be 64 bits, or if we added an option for the user to specify the size, etc). Comparing via sizeof is much more robust (and still a compile-time decision).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess we could define MBEDTLS_MPI_BITS
to 32 or 64 in bignum.h
. Then, here we could do:
#if defined(MBEDTLS_CLZ32) && (MBEDTLS_MPI_BITS == 32)
return MBEDTLS_CLZ32(a);
#elif defined(MBEDTLS_CLZ64) && (MBEDTLS_MPI_BITS == 64)
return MBEDTLS_CLZ64(a);
#else
// plain C impl
#endif
with some suitable #defines for MBEDTLS_CLZ32 outside of the function. Maybe that's a bit cleaner? I don't have a strong perference.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't like the first suggestion because it essentially relies on accurately recreating the logic from bignum.h to select the correct option
I don't understand: what logic would we be recreating? bignum.h
guarantees that MBEDTLS_HAVE_INT32
is defined if mbedtls_mpi_uint
is a 32-bit type, and similar for MBEDTLS_HAVE_INT64
. If we added support for a different size, then neither symbol would be defined and instead we'd define e.g. MBEDTLS_HAVE_INT16
. defined(MBEDTLS_HAVE_INTxx)
vs MBEDTLS_MPI_BITS == xx
is just a cosmetic difference since we aren't doing anything with xx
other than an equality comparison.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
bignum.h guarantees that MBEDTLS_HAVE_INT32 is defined (…)
Mind you, this is not clearly documented (and somewhat surprising given the name of the symbols) — I only realized that while doing bignum recently. So it isn't a promise we make to our users as part of our API stability guarantee. But it's a property that we are already relying on, and there is a precedent (I forget where) of relying on this property to simplify some bignum code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is not clearly documented (and somewhat surprising given the name of the symbols)
Yes, that was not obvious to me. In that case I'm happy to switch on that instead of MBEDTLS_MPI_BITS
, but no strong preference vs. the current arrangement.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: I agree with Gilles here: MBEDTLS_HAVE_INTxx
means precisely that mbedtls_mpi_uint
is xx bits. Yes, the naming is a bit off
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
However, while we'd normally use MBEDTLS_HAVE_INTxx
, using sizeof()
as done here is perfectly natural and leads to a quite readable implementation, so I think this is fine. If we need 32- and/or 64-bit clz()
functions, we can create those then then
library/bignum_core.c
Outdated
{ | ||
/* Note: the result is undefined for a == 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Blocker: This limitation must be documented.
That's assuming we really must have this limitation. How much code size does it save?
Experimentally, with arm-none-eabi-gcc -mthumb -mcpu=cortex-m0plus -Os
(GCC 10.3-2021.07), more than I'd thought: the rewritten mbedtls_mpi_core_bitlen
is one instruction longer, but adding if (a == 0) return biL
to mbedtls_mpi_core_clz
is a whooping 5 instructions longer. Still, I think 8 bytes is worth it to have an easy-to-understand semantics. Plus I'm not convinced GCC's code is good there:
0: 0003 movs r3, r0
2: b510 push {r4, lr}
4: 2020 movs r0, #32
6: 2b00 cmp r3, #0
8: d002 beq.n 10 <mbedtls_mpi_core_clz+0x10>
a: 0018 movs r0, r3
c: f7ff fffe bl 0 <__clzsi2>
10: bd10 pop {r4, pc}
What's with the register shuffling there?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've documented the limitation. The limitation is annoying but I think acceptable given the local scope of this function.
Signed-off-by: Dave Rodgman <dave.rodgman@arm.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The documentation in bignum_core.h
still needs to be updated. And I still think the sizeof comparison is more brittle than it needs to be.
library/bignum_core.c
Outdated
@@ -33,11 +33,18 @@ | |||
#include "bn_mul.h" | |||
#include "constant_time_internal.h" | |||
|
|||
/** |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not a static
function, so its documentation is in a header file (in this case library/bignum_core.h
since it's not part of the public API). Please update the public documentation, and don't add a copy here (copies are bad because they get out of sync — this one is already out of sync!).
*/ | ||
#if defined(__has_builtin) | ||
#if __has_builtin(__builtin_clz) | ||
if (sizeof(mbedtls_mpi_uint) == sizeof(unsigned int)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't like the first suggestion because it essentially relies on accurately recreating the logic from bignum.h to select the correct option
I don't understand: what logic would we be recreating? bignum.h
guarantees that MBEDTLS_HAVE_INT32
is defined if mbedtls_mpi_uint
is a 32-bit type, and similar for MBEDTLS_HAVE_INT64
. If we added support for a different size, then neither symbol would be defined and instead we'd define e.g. MBEDTLS_HAVE_INT16
. defined(MBEDTLS_HAVE_INTxx)
vs MBEDTLS_MPI_BITS == xx
is just a cosmetic difference since we aren't doing anything with xx
other than an equality comparison.
*/ | ||
#if defined(__has_builtin) | ||
#if __has_builtin(__builtin_clz) | ||
if (sizeof(mbedtls_mpi_uint) == sizeof(unsigned int)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
bignum.h guarantees that MBEDTLS_HAVE_INT32 is defined (…)
Mind you, this is not clearly documented (and somewhat surprising given the name of the symbols) — I only realized that while doing bignum recently. So it isn't a promise we make to our users as part of our API stability guarantee. But it's a property that we are already relying on, and there is a precedent (I forget where) of relying on this property to simplify some bignum code.
Signed-off-by: Dave Rodgman <dave.rodgman@arm.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
@gilles-peskine-arm Do you still have objections/concerns? And if not, are you reviewing this, or should someone else take a look? |
Signed-off-by: Dave Rodgman <dave.rodgman@arm.com>
4d4c181
to
2e863ec
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me except the new test which tests something that happens to work, but we don't actually want to guarantee.
@@ -155,6 +155,9 @@ mpi_core_bitlen:"10":5 | |||
Test mbedtls_mpi_core_bitlen 0x0a | |||
mpi_core_bitlen:"a":4 | |||
|
|||
Test mbedtls_mpi_core_bitlen: 0 limbs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Core functions don't support 0-limb bignums (see the top of bignum_core.h
). (Legacy bignum functions do, it's up to them to do the right thing without calling core functions.) So we shouldn't have such a test case. It might happen to work for this particular function, but we wouldn't be bothered if it doesn't.
Most core functions fall into one of two categories: they do some kind of loop over the limb count and a 0-limb array just works, or they do some kind of memcpy over the limbs and a 0-limb array technically has undefined behavior which is caught by UBSan but tends to work when not building with UBSan.
The current implementation of core_bitlen
has an explicit check limbs == 0
. I think it's a leftover from the very early days of the core module, before we'd decided that it didn't need to support 0-limb. So this check could be removed, which would save a couple of words of code size.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, thanks, I was unsure on this point. I've removed the test changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Before we forget: #7491
Signed-off-by: Dave Rodgman <dave.rodgman@arm.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Description
For clang and gcc, this saves a few instructions (with gcc -Os on aarch64, the size of libmbedcrypto is reduced by 88b). It also covers a test gap (this function was not directly tested before). Performance is not very different according to programs/test/benchmark.
With armclang v6.19 compiling for target
arm-arm-none-eabi
,cortex-m33+nodsp
(code, data; negative is a reduction, so an improvement):bignum_core.o
Gatekeeper checklist