Introduce a new feature to use dynamic dispatch to select between ADX and portable implementations at runtime #174

Open · wants to merge 4 commits into master

Conversation

@andreacorbellini commented Jul 8, 2023

Summary

If blst is compiled with the new -D__BLST_DYNAMIC__ flag, then blst will use dynamic dispatch to run either optimized ADX code or portable code. Internal functions like mul_mont_384 that have an ADX-optimized counterpart (mulx_mont_384) are bound to the appropriate implementation when blst is loaded by the dynamic linker.

This is an alternative to -D__BLST_PORTABLE__ with better performance characteristics.
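
For reference, the flag would presumably be passed the same way as the existing portable switch, e.g. `./build.sh -D__BLST_DYNAMIC__` (assuming build.sh forwards the define to the compiler as it does for `-D__BLST_PORTABLE__`).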

How this is implemented

This code uses different dispatch mechanisms depending on the platform it's being compiled for (a sketch of all three variants follows the list):

  • On GCC/clang with ELF as the target format, ifunc is used. This is the same approach glibc uses for functions like memcpy, which have different implementations depending on CPU capabilities. The way ifunc works is similar to how symbols from a dynamic library are loaded.
  • On GCC/clang without ELF: function pointers initialized by a constructor function. This is not as efficient as ifunc because every call becomes an indirect call through a pointer.
  • On MSVC: function pointers initialized via the .CRT$XCU section. Same remarks as above.
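
As a concrete illustration, here is a minimal sketch of the three variants. It is not the PR's actual code: signatures are simplified to void(void) (the real mul_mont_384 takes limb-vector and modulus arguments), and the meaning of bit 0 of __blst_platform_cap is an assumption made for illustration only.

```c
/* Minimal sketch of the three dispatch variants; not the PR's code.
 * Assumes bit 0 of __blst_platform_cap signals ADX support. */
extern unsigned int __blst_platform_cap;    /* populated by __blst_cpuid() */

void mul_mont_384_portable(void);           /* portable implementation   */
void mulx_mont_384(void);                   /* ADX-optimized counterpart */

#if (defined(__GNUC__) || defined(__clang__)) && defined(__ELF__)
/* ELF: the resolver runs once at load time; the dynamic linker patches
 * the GOT entry with its return value, so subsequent calls cost the same
 * as any other call through the PLT. (In practice a resolver may need to
 * probe CPUID itself, since it can run before ordinary constructors.) */
static void (*resolve_mul_mont_384(void))(void)
{
    return (__blst_platform_cap & 1) ? mulx_mont_384
                                     : mul_mont_384_portable;
}
void mul_mont_384(void) __attribute__((ifunc("resolve_mul_mont_384")));

#elif defined(__GNUC__) || defined(__clang__)
/* Non-ELF GCC/clang: a function pointer set by a constructor; every call
 * becomes an indirect call through the pointer. */
static void (*mul_mont_384_ptr)(void) = mul_mont_384_portable;

__attribute__((constructor))
static void mul_mont_384_init(void)
{
    if (__blst_platform_cap & 1)
        mul_mont_384_ptr = mulx_mont_384;
}
void mul_mont_384(void) { mul_mont_384_ptr(); }

#elif defined(_MSC_VER)
/* MSVC: place an initializer in the .CRT$XCU section so the CRT runs it
 * before main(), mirroring the constructor attribute above. */
static void (*mul_mont_384_ptr)(void) = mul_mont_384_portable;

static void __cdecl mul_mont_384_init(void)
{
    if (__blst_platform_cap & 1)
        mul_mont_384_ptr = mulx_mont_384;
}
#pragma section(".CRT$XCU", read)
__declspec(allocate(".CRT$XCU"))
static void (__cdecl *mul_mont_384_init_)(void) = mul_mont_384_init;

void mul_mont_384(void) { mul_mont_384_ptr(); }
#endif
```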

Future improvements

This code dynamically binds low-level internal functions like mul_mont_384, but it may be more efficient to apply the same strategy to public functions like blst_fp_inverse: that would result in fewer relocations and fewer indirect calls.

Commit notes (from the 4 commits):

  • Make the `__blst_cpuid` and `__blst_platform_cap` symbols accessible to other C files.
  • This feature triggers compilation with `-D__ADX__ -D__BLST_DYNAMIC__`.

@dot-asm (Collaborator) commented Jul 9, 2023

But the portable build comes with a run-time switch to ADX code path now[/already]...

@andreacorbellini (Author) commented Jul 10, 2023

But the portable build comes with a run-time switch to ADX code path now[/already]...

Yes, I'm aware. That's implemented using conditional jumps (test+jne), while this PR uses indirect calls, which seem to perform better according to benchmarks. In particular, benchmarks of a real-world application show a consistent speed-up ranging from 6% to 14% on a Linux/ELF target (I'm not sure about other targets; performance there may match the portable build, I can't tell right now).
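
Roughly, the difference between the two styles looks like this (a sketch with simplified void(void) signatures; the capability bit is illustrative):

```c
/* Hypothetical sketch contrasting the two dispatch styles; names and
 * the capability bit are illustrative, not blst's actual code. */
extern unsigned int __blst_platform_cap;
void mul_mont_384_portable(void);
void mulx_mont_384(void);

/* Portable build today: a conditional branch at every dispatch point,
 * compiling to test + jne. */
void mul_mont_384_branch(void)
{
    if (__blst_platform_cap & 1)
        mulx_mont_384();
    else
        mul_mont_384_portable();
}

/* This PR: bind once at load time; afterwards each call is an indirect
 * call through the pointer (or an ordinary PLT call with ifunc on ELF). */
static void (*mul_mont_384_ptr)(void) = mul_mont_384_portable;
void mul_mont_384_indirect(void)
{
    mul_mont_384_ptr();
}
```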

I would also argue that indirect calls simplify the code compared to conditional jumps; however, my PR adds a new feature rather than changing the existing portable feature, so this is a moot point.

Last but not least, if the technique used in this PR were applied to the top-level functions (blst_*), then I believe the performance difference between portable and force-adx would disappear (on Linux/ELF) when blst is used as a dynamic library, because the number of indirect calls would be essentially the same. Again, my PR does not do that, but I would like to hear your opinion on the approach before I invest time in a larger PR. I remember reading concerns about branch-target injection in a past GitHub issue; do those concerns still exist?

@dot-asm (Collaborator) commented Jul 10, 2023

while this PR uses indirect calls, which seem to perform better according to benchmarks. Especially benchmarks performed on a real-world application seem to show a consistent speed-up ranging from 6x to 12x,

This arguably qualifies as an extraordinary claim. Because, all things being equal, there is no mechanism for an indirect jump to deliver better performance, let alone that much. It's either a misinterpretation of results or an indication of an anomaly that you hit accidentally. The latter means that the suggested approach also runs a risk of a comparable penalty. Which, again, is essentially inconceivable. So how do I replicate the results? Provide concrete instructions.

if the technique used in this PR was applied to the top-level functions (blst_*)

There is an established mechanism to achieve this as is: compile a pair of shared libraries and choose between the two at application start-up time.
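
For concreteness, a minimal sketch of that mechanism (the library names are hypothetical; the ADX probe checks CPUID leaf 7 the way a dispatcher would; link with -ldl on Linux):

```c
/* Sketch of choosing between two blst builds at start-up. */
#include <cpuid.h>   /* GCC/clang x86 intrinsics header */
#include <dlfcn.h>
#include <stdio.h>

static int cpu_has_adx(void)
{
    unsigned int a, b, c, d;
    if (!__get_cpuid_count(7, 0, &a, &b, &c, &d))
        return 0;
    return (b >> 19) & 1;   /* CPUID.(EAX=7,ECX=0):EBX bit 19 = ADX */
}

int main(void)
{
    const char *lib = cpu_has_adx() ? "libblst-adx.so"
                                    : "libblst-portable.so";
    void *h = dlopen(lib, RTLD_NOW | RTLD_GLOBAL);
    if (h == NULL) {
        fprintf(stderr, "dlopen: %s\n", dlerror());
        return 1;
    }
    /* Resolve entry points once, e.g. a real blst symbol: */
    void *sym = dlsym(h, "blst_fp_inverse");
    printf("resolved blst_fp_inverse at %p from %s\n", sym, lib);
    return 0;
}
```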

@dot-asm (Collaborator) commented Jul 10, 2023

I would like to hear your opinion on the approach before I invest time in a larger PR.

My personal [opinion] is that replicating established mechanisms, a.k.a. shared libraries, falls far beyond the focus[/scope] of this project.

@dot-asm (Collaborator) commented Jul 10, 2023

So how do I replicate the results? Provide concrete instructions.

Just in case, this implies that you also share your results... [And the expectation is that switching from ADX to portable would result in a significant performance drop.]

@mratsim (Contributor) commented Jul 10, 2023

But the portable build comes with a run-time switch to ADX code path now[/already]...

Yes, I'm aware. That's implemented using jumps (test+jne), while this PR uses indirect calls, which seem to perform better according to benchmarks. Especially benchmarks performed on a real-world application seem to show a consistent speed-up ranging from 6% to 14%, when run on a Linux/ELF target (not sure about others; maybe performance would be the same as portable, I can't tell right now).

In my testing against BLST, test+jump dispatch has no measurable overhead even compared to hardcoded function calls (i.e. builds from before the runtime selection was introduced).

The same holds versus MCL's JIT with indirect calls (https://github.com/herumi/mcl/).

Tests covered everything from low-level field multiplications to high-level BLS signature verification, including the extension-field and elliptic-curve arithmetic in between.

I've detailed the current hardware architecture, in particular regarding branch prediction, in this comment: #10 (comment). Note that the branches in this case are 100% predictable.

@dot-asm (Collaborator) commented Jul 10, 2023

from 6% to 14%,

Oh, so the results were misinterpreted! I mean, originally it read 6x to 12x. Well, small percentages are more imaginable. As already mentioned, any particular approach is prone to hitting some anomalies[/quirks] in hardware, and variations don't actually reflect inherent advantages of any particular approach, but rather the circumstantial nature of the performance of complex execution flows. Well, arguably 14% is hard to tolerate, but it still doesn't speak in favour of indirect branches' superiority. So do still tell me how to reproduce it :-)

Note that branches in this case are 100% predictable.

Branch prediction logic is akin to a cache, in the sense that predictions can be evicted and, as a result, branches rendered non-predicted. As already implied in #10, "100%" applies in a controlled environment like benchmarks, but in real life it's more complicated. Heck, you can run into anomalies even in controlled environments; see https://github.com/supranational/blst/blob/master/src/asm/ct_inverse_mod_256-x86_64.pl#L746 for an even more local example...

@dot-asm (Collaborator) commented Jul 10, 2023

"100%"

Just in case, it's not like branch predictability is the only factor that can affect performance. In other words, it's not a given that the variations in question are the direct[/sole] effect of interference with branch prediction logic, so let's not hang on to it... [The remark was rather a reaction to "100%" itself. I mean, real life is never "100%." ;-)]

@dot-asm (Collaborator) commented Jul 10, 2023

variations don't actually reflect inherent advantages of any particular approach

To clarify: with respect to performance. The thing is that there are other criteria that affect the choice. For example, in my book direct jumps are preferable because of the security implications of indirect ones...

@mratsim (Contributor) commented Jul 11, 2023

The remark was rather a reaction to "100%" itself. I mean real life is never "100%."

No worries, don't deal in absolutes ;)

Heck, you can run into anomalies even in controlled environments, see https://github.com/supranational/blst/blob/master/src/asm/ct_inverse_mod_256-x86_64.pl#L746 for an even local example...

https://easyperf.net/blog/2018/01/18/Code_alignment_issues


and also https://lkml.org/lkml/2015/5/21/443

and, for anomalies in benchmarks in general (not just code alignment):

https://web.archive.org/web/20170929071306/https://www.cis.upenn.edu/~cis501/papers/producing-wrong-data.pdf

@dot-asm (Collaborator) commented Jul 11, 2023

Code_alignment_issues

Come on! As if I don't know the stuff. My remark was actually more nuanced than that. Well, of course the nuance is not obvious, but it shouldn't grant the assumption that it was a trivial matter of the alignment of the subroutine's entry point ;-) The hint was the explicit reference to Coffee Lake, because it's a problem specifically on Coffee Lake. And 40% is totally disproportionate in the context: a loop with >16 iterations and strong data dependencies between iterations. Oh well...

producing-wrong-data.pdf

This kind of aligns with what I'm talking about. One data point is just one data point; one can't use it to reach an overarching conclusion, as the OP suggests. The correct course of action is to reproduce, identify the root cause, then see where that leads you...

@dot-asm (Collaborator) commented Jul 28, 2023

Correct course of action is reproduce, ...
