clz size/perf optimisation #7462

daverodgman · 2023-04-19T17:01:05Z

Description

For clang and gcc, this saves a few instructions (with gcc -Os on aarch64, the size of libmbedcrypto is reduced by 88 bytes). It also covers a test gap (this function was not directly tested before). Performance is not significantly different according to programs/test/benchmark.

With armclang v6.19 compiling for target arm-arm-none-eabi, cortex-m33+nodsp (code, data; negative is a reduction, so an improvement):

| Old Size | New Size | Change | .o impacted |
|----------|----------|--------|-------------|
| 2490, 0 | 2460, 0 | -30+0 | `bignum_core.o` |

Gatekeeper checklist

changelog not required - non functional change
backport not required - not a bugfix
tests provided

Signed-off-by: Dave Rodgman <dave.rodgman@arm.com>

daverodgman · 2023-04-20T08:28:56Z

It seems that on Arm, the clz instruction is guaranteed to return the answer we want for an input of 0. So the special case for 0 could possibly be omitted on Arm to save a few more instructions. Probably not worth it IMO.

Alternatively, we can re-arrange the only function which calls this in order to never call it for an input of 0 (so we safely avoid the undefined case). This ends up being a small additional net win, so I've added this to the PR (last two commits).
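The rearrangement described here can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: it assumes a 32-bit limb, and `limb_t`, `LIMB_BITS`, `sketch_clz` and `sketch_bitlen` are hypothetical names. The caller skips zero limbs, so the clz helper is never reached with an input of 0.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t limb_t;            /* hypothetical stand-in for mbedtls_mpi_uint */
#define LIMB_BITS 32u               /* assumed limb width */

/* Count leading zero bits. Precondition: a != 0. */
static size_t sketch_clz(limb_t a)
{
    assert(a != 0);                 /* documents the undefined case */
    size_t count = 0;
    limb_t mask = (limb_t) 1 << (LIMB_BITS - 1);
    while ((a & mask) == 0) {
        mask >>= 1;
        count++;
    }
    return count;
}

/* Bit length of a little-endian limb array. */
static size_t sketch_bitlen(const limb_t *A, size_t limbs)
{
    /* Walk down from the most significant limb and stop at the first
     * non-zero one, so sketch_clz() only ever sees a non-zero input. */
    for (size_t i = limbs; i > 0; i--) {
        if (A[i - 1] != 0) {
            return (i - 1) * LIMB_BITS + (LIMB_BITS - sketch_clz(A[i - 1]));
        }
    }
    return 0;                       /* every limb is zero */
}
```

For example, a single limb holding 0x0a has bit length 4 under this sketch, and an all-zero array yields 0 without the clz helper being called.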

gilles-peskine-arm

Blocker: undefined behavior must be clearly documented in the function's documentation.

And I'm not convinced that the complexity is worth it. Can't we keep things simple?

gilles-peskine-arm · 2023-04-20T20:26:04Z

library/bignum_core.c

+     */
+#if defined(__has_builtin)
+#if __has_builtin(__builtin_clz)
+    if (sizeof(mbedtls_mpi_uint) == sizeof(unsigned int)) {


I'm a bit surprised that GCC and Clang aren't complaining about dead code. We know the size of mbedtls_mpi_uint: it's either 32-bit or 64-bit (and with no padding bits, although that doesn't actually matter here), and we know that through the preprocessor symbol MBEDTLS_HAVE_INT32 or MBEDTLS_HAVE_INT64. So we can select the clz function entirely via the preprocessor.

    #if MBEDTLS_HAVE_INT32 && __has_builtin(__builtin_clz)
        return __builtin_clz(a);
    #elif ((MBEDTLS_HAVE_INT64 && LONG_MAX >= 0xffffffffffffffff) || MBEDTLS_HAVE_INT32) && __has_builtin(__builtin_clzl)
        return __builtin_clzl(a);
    #elif __has_builtin(__builtin_clzll)
        return __builtin_clzll(a);
    #else
        // software implementation
    #endif

Or, for simplicity's sake, are there compilers we care about that either don't have __builtin_clzll or generate inefficient code when __builtin_clzll is passed a value of a smaller type? If not we could just write

    #if __has_builtin(__builtin_clzll)
        return __builtin_clzll(a);
    #else
        // software implementation
    #endif

Unfortunately that doesn't work: calling __builtin_clzll on a 32-bit mbedtls_mpi_uint means that 32 leading zeros get added. So we do need separate options.
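The extra leading zeros are easy to see in a small experiment. The sketch below assumes GCC or Clang, a 32-bit `unsigned int` and a 64-bit `unsigned long long`; the function names are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Passing a 32-bit value to the long long builtin: the argument is widened
 * to 64 bits first, so the 32 new high bits are counted as leading zeros. */
static int clz32_via_clzll(uint32_t a)
{
    assert(a != 0);                 /* the builtin is undefined for 0 */
    return __builtin_clzll(a);
}

/* Using the builtin whose operand width matches the value. */
static int clz32_via_clz(uint32_t a)
{
    assert(a != 0);
    return __builtin_clz(a);
}
```

Under those assumptions, `clz32_via_clzll(1)` gives 63 whereas `clz32_via_clz(1)` gives 31, which is why the builtin has to be chosen to match the limb width.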

I don't like the first suggestion because it essentially relies on accurately recreating the logic from bignum.h to select the correct option, which is brittle (e.g., would break if we decided this type should always be 64 bits, or if we added an option for the user to specify the size, etc). Comparing via sizeof is much more robust (and still a compile-time decision).

I guess we could define MBEDTLS_MPI_BITS to 32 or 64 in bignum.h. Then, here we could do:

    #if defined(MBEDTLS_CLZ32) && (MBEDTLS_MPI_BITS == 32)
        return MBEDTLS_CLZ32(a);
    #elif defined(MBEDTLS_CLZ64) && (MBEDTLS_MPI_BITS == 64)
        return MBEDTLS_CLZ64(a);
    #else
        // plain C impl
    #endif

with some suitable #defines for MBEDTLS_CLZ32 outside of the function. Maybe that's a bit cleaner? I don't have a strong preference.
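One way the "suitable #defines" could look is sketched below. Everything here is an assumption made for illustration: `SKETCH_MPI_BITS`, `SKETCH_CLZ32`, `SKETCH_CLZ64`, `sketch_limb_t` and `sketch_clz_pp` are hypothetical names, the limb width is fixed at 32 bits, and the width checks on `UINT_MAX` and `ULLONG_MAX` are just one possible way to decide which builtin matches.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical configuration: the limb width is stated in one place. */
#define SKETCH_MPI_BITS 32
typedef uint32_t sketch_limb_t;

/* Hypothetical helper macros, defined only when a matching builtin exists. */
#if defined(__has_builtin)
#if __has_builtin(__builtin_clz) && (UINT_MAX == 0xffffffff)
#define SKETCH_CLZ32(a) ((size_t) __builtin_clz(a))
#endif
#if __has_builtin(__builtin_clzll) && (ULLONG_MAX == 0xffffffffffffffff)
#define SKETCH_CLZ64(a) ((size_t) __builtin_clzll(a))
#endif
#endif

/* Count leading zero bits. The result is undefined for a == 0. */
static size_t sketch_clz_pp(sketch_limb_t a)
{
    assert(a != 0);
#if defined(SKETCH_CLZ32) && (SKETCH_MPI_BITS == 32)
    return SKETCH_CLZ32(a);
#elif defined(SKETCH_CLZ64) && (SKETCH_MPI_BITS == 64)
    return SKETCH_CLZ64(a);
#else
    /* Plain C fallback: scan down from the top bit. */
    size_t count = 0;
    sketch_limb_t mask = (sketch_limb_t) 1 << (SKETCH_MPI_BITS - 1);
    while ((a & mask) == 0) {
        mask >>= 1;
        count++;
    }
    return count;
#endif
}
```

With this layout the selection happens entirely in the preprocessor, at the cost of keeping the macro definitions in step with the limb type.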

> I don't like the first suggestion because it essentially relies on accurately recreating the logic from bignum.h to select the correct option

I don't understand: what logic would we be recreating? bignum.h guarantees that MBEDTLS_HAVE_INT32 is defined if mbedtls_mpi_uint is a 32-bit type, and similar for MBEDTLS_HAVE_INT64. If we added support for a different size, then neither symbol would be defined and instead we'd define e.g. MBEDTLS_HAVE_INT16. defined(MBEDTLS_HAVE_INTxx) vs MBEDTLS_MPI_BITS == xx is just a cosmetic difference since we aren't doing anything with xx other than an equality comparison.

> bignum.h guarantees that MBEDTLS_HAVE_INT32 is defined (…)

Mind you, this is not clearly documented (and somewhat surprising given the name of the symbols) — I only realized that while doing bignum recently. So it isn't a promise we make to our users as part of our API stability guarantee. But it's a property that we are already relying on, and there is a precedent (I forget where) of relying on this property to simplify some bignum code.

> this is not clearly documented (and somewhat surprising given the name of the symbols)

Yes, that was not obvious to me. In that case I'm happy to switch on that instead of MBEDTLS_MPI_BITS, but no strong preference vs. the current arrangement.

Note: I agree with Gilles here: MBEDTLS_HAVE_INTxx means precisely that mbedtls_mpi_uint is xx bits. Yes, the naming is a bit off.

However, while we'd normally use MBEDTLS_HAVE_INTxx, using sizeof() as done here is perfectly natural and leads to a quite readable implementation, so I think this is fine. If we need 32- and/or 64-bit clz() functions, we can create those then.
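For comparison, the sizeof-based selection discussed in this thread can be sketched as below. This is written in the spirit of the diff hunk quoted above rather than copied from the merged code; it assumes a 64-bit limb with no padding bits, and `sketch_limb64_t` and `sketch_clz_sizeof` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t sketch_limb64_t;   /* hypothetical stand-in for mbedtls_mpi_uint */

/* Count leading zero bits. The result is undefined for a == 0. */
static size_t sketch_clz_sizeof(sketch_limb64_t a)
{
    assert(a != 0);
#if defined(__has_builtin)
    /* Each sizeof comparison is a compile-time constant, so the compiler
     * can discard the branches that do not match the limb type. */
#if __has_builtin(__builtin_clz)
    if (sizeof(sketch_limb64_t) == sizeof(unsigned int)) {
        return (size_t) __builtin_clz((unsigned int) a);
    }
#endif
#if __has_builtin(__builtin_clzl)
    if (sizeof(sketch_limb64_t) == sizeof(unsigned long)) {
        return (size_t) __builtin_clzl((unsigned long) a);
    }
#endif
#if __has_builtin(__builtin_clzll)
    if (sizeof(sketch_limb64_t) == sizeof(unsigned long long)) {
        return (size_t) __builtin_clzll((unsigned long long) a);
    }
#endif
#endif
    /* Portable fallback when no builtin of the right width is available. */
    size_t count = 0;
    sketch_limb64_t mask = (sketch_limb64_t) 1 << 63;
    while ((a & mask) == 0) {
        mask >>= 1;
        count++;
    }
    return count;
}
```

The function body never needs to know how the limb type was chosen, which is the robustness argument made for sizeof; the trade-off is that the unused branches exist in the source and rely on the optimiser to remove them.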

gilles-peskine-arm · 2023-04-20T20:46:40Z

library/bignum_core.c

 {
+    /* Note: the result is undefined for a == 0


Blocker: This limitation must be documented.

That's assuming we really must have this limitation. How much code size does it save?

Experimentally, with arm-none-eabi-gcc -mthumb -mcpu=cortex-m0plus -Os (GCC 10.3-2021.07), more than I'd thought: the rewritten mbedtls_mpi_core_bitlen is one instruction longer, but adding `if (a == 0) return biL` to mbedtls_mpi_core_clz is a whopping 5 instructions longer. Still, I think 8 bytes is worth it to have easy-to-understand semantics. Plus I'm not convinced GCC's code is good there:

     0: 0003       movs  r3, r0
     2: b510       push  {r4, lr}
     4: 2020       movs  r0, #32
     6: 2b00       cmp   r3, #0
     8: d002       beq.n 10 <mbedtls_mpi_core_clz+0x10>
     a: 0018       movs  r0, r3
     c: f7ff fffe  bl    0 <__clzsi2>
    10: bd10       pop   {r4, pc}

What's with the register shuffling there?

I've documented the limitation. The limitation is annoying but I think acceptable given the local scope of this function.

gilles-peskine-arm

The documentation in bignum_core.h still needs to be updated. And I still think the sizeof comparison is more brittle than it needs to be.

gilles-peskine-arm · 2023-04-21T21:20:36Z

library/bignum_core.c

@@ -33,11 +33,18 @@
 #include "bn_mul.h"
 #include "constant_time_internal.h"

+/**


This is not a static function, so its documentation is in a header file (in this case library/bignum_core.h since it's not part of the public API). Please update the public documentation, and don't add a copy here (copies are bad because they get out of sync — this one is already out of sync!).

gilles-peskine-arm · 2023-04-21T21:23:53Z

library/bignum_core.c

+     */
+#if defined(__has_builtin)
+#if __has_builtin(__builtin_clz)
+    if (sizeof(mbedtls_mpi_uint) == sizeof(unsigned int)) {


> I don't like the first suggestion because it essentially relies on accurately recreating the logic from bignum.h to select the correct option

I don't understand: what logic would we be recreating? bignum.h guarantees that MBEDTLS_HAVE_INT32 is defined if mbedtls_mpi_uint is a 32-bit type, and similar for MBEDTLS_HAVE_INT64. If we added support for a different size, then neither symbol would be defined and instead we'd define e.g. MBEDTLS_HAVE_INT16. defined(MBEDTLS_HAVE_INTxx) vs MBEDTLS_MPI_BITS == xx is just a cosmetic difference since we aren't doing anything with xx other than an equality comparison.

gilles-peskine-arm · 2023-04-21T21:27:18Z

library/bignum_core.c

+     */
+#if defined(__has_builtin)
+#if __has_builtin(__builtin_clz)
+    if (sizeof(mbedtls_mpi_uint) == sizeof(unsigned int)) {


> bignum.h guarantees that MBEDTLS_HAVE_INT32 is defined (…)

Mind you, this is not clearly documented (and somewhat surprising given the name of the symbols) — I only realized that while doing bignum recently. So it isn't a promise we make to our users as part of our API stability guarantee. But it's a property that we are already relying on, and there is a precedent (I forget where) of relying on this property to simplify some bignum code.

tom-cosgrove-arm

LGTM

tom-cosgrove-arm · 2023-04-25T14:56:27Z

@gilles-peskine-arm Do you still have objections/concerns? And if not, are you reviewing this, or should someone else take a look?

gilles-peskine-arm

Looks good to me except for the new test, which tests something that happens to work but that we don't actually want to guarantee.

gilles-peskine-arm · 2023-04-25T16:49:13Z

tests/suites/test_suite_bignum_core.misc.data

@@ -155,6 +155,9 @@ mpi_core_bitlen:"10":5
 Test mbedtls_mpi_core_bitlen 0x0a
 mpi_core_bitlen:"a":4

+Test mbedtls_mpi_core_bitlen: 0 limbs


Core functions don't support 0-limb bignums (see the top of bignum_core.h). (Legacy bignum functions do, it's up to them to do the right thing without calling core functions.) So we shouldn't have such a test case. It might happen to work for this particular function, but we wouldn't be bothered if it doesn't.

Most core functions fall into one of two categories: they do some kind of loop over the limb count and a 0-limb array just works, or they do some kind of memcpy over the limbs and a 0-limb array technically has undefined behavior which is caught by UBSan but tends to work when not building with UBSan.

The current implementation of core_bitlen has an explicit check limbs == 0. I think it's a leftover from the very early days of the core module, before we'd decided that it didn't need to support 0-limb. So this check could be removed, which would save a couple of words of code size.

OK, thanks, I was unsure on this point. I've removed the test changes.

Before we forget: #7491

gilles-peskine-arm

LGTM

library/bignum_core.c

tom-cosgrove-arm

LGTM

Size/perf optimisation for mbedtls_mpi_core_clz

fe8a8cd

Signed-off-by: Dave Rodgman <dave.rodgman@arm.com>

daverodgman added the needs-ci Needs to pass CI tests label Apr 19, 2023

Fix tests

d54cb83

Signed-off-by: Dave Rodgman <dave.rodgman@arm.com>

daverodgman force-pushed the clz_size_opt branch from 45500c6 to d54cb83 Compare April 19, 2023 17:46

daverodgman removed the needs-ci Needs to pass CI tests label Apr 19, 2023

daverodgman added the needs-ci Needs to pass CI tests label Apr 20, 2023

Further size optimisation

880a6b3

Signed-off-by: Dave Rodgman <dave.rodgman@arm.com>

daverodgman force-pushed the clz_size_opt branch from c15af9a to 880a6b3 Compare April 20, 2023 10:53

Remove test-case for all-zero

678e630

Signed-off-by: Dave Rodgman <dave.rodgman@arm.com>

daverodgman removed the needs-ci Needs to pass CI tests label Apr 20, 2023

gilles-peskine-arm requested changes Apr 20, 2023

Document undefined case. Clarify test code.

bbf8810

Signed-off-by: Dave Rodgman <dave.rodgman@arm.com>

gilles-peskine-arm requested changes Apr 21, 2023

Fix documentation

0f16d56

Signed-off-by: Dave Rodgman <dave.rodgman@arm.com>

tom-cosgrove-arm previously approved these changes Apr 25, 2023

tom-cosgrove-arm removed the needs-reviewer This PR needs someone to pick it up for review label Apr 25, 2023

daverodgman dismissed tom-cosgrove-arm’s stale review via 4d4c181 April 25, 2023 16:37

Remove unnecessary if to save 16 bytes

2e863ec

Signed-off-by: Dave Rodgman <dave.rodgman@arm.com>

daverodgman force-pushed the clz_size_opt branch from 4d4c181 to 2e863ec Compare April 25, 2023 16:41

gilles-peskine-arm requested changes Apr 25, 2023

Remove undesirable test

4f30a6a

Signed-off-by: Dave Rodgman <dave.rodgman@arm.com>

daverodgman requested a review from gilles-peskine-arm April 25, 2023 17:10

gilles-peskine-arm approved these changes Apr 25, 2023

tom-cosgrove-arm reviewed Apr 25, 2023

daverodgman mentioned this pull request Apr 25, 2023

Remove traces of 0-limb support from bignum core #7491

Closed

3 tasks

tom-cosgrove-arm approved these changes Apr 26, 2023

tom-cosgrove-arm added approved Design and code approved - may be waiting for CI or backports and removed needs-review Every commit must be reviewed by at least two team members, labels Apr 26, 2023

tom-cosgrove-arm merged commit 10f4091 into Mbed-TLS:development Apr 26, 2023

gilles-peskine-arm mentioned this pull request Apr 27, 2023

Unify Windows testing between PR and release jobs Mbed-TLS/mbedtls-test#27

Open
