
Lower inlining cost of floating point div #50428

Merged: 1 commit merged into master on Jul 6, 2023

Conversation

@Keno (Member) commented Jul 5, 2023

Our inlining cost model is extremely primitive, though surprisingly functional given its limitations. The basic idea was to give every intrinsic an approximate cost in cycles, such that for sufficiently large functions (>100 cycles), the cost of the extra call would be dwarfed by the cost of the function itself. However, there are a few problems with this. For one, the real issue is usually not the extra overhead of the call (which is small and well-predicted), but rather the inhibition of optimizations that inlining would have allowed. Additionally, the relevant cost comparison is generally not latency but the size of the resulting binary. Lastly, the latency metric is misleading on modern superscalar architectures, because the core performs other work while the operation is executing. In fact, somewhat counter-intuitively, this means it is *more* important to inline high-latency instructions, so the compiler can hide their latency by spreading them out.
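For reference, the per-statement costs this model assigns can be inspected with Julia's internal, unexported `Base.print_statement_costs` helper. A minimal sketch, assuming a recent Julia build (the function `f` is a made-up example, and the exact output format varies between versions):

```julia
# `Base.print_statement_costs` is an internal compiler tool, so its output
# and behavior may differ across Julia versions.
f(x, y) = x / y + x * y  # made-up example containing both div_float and mul_float

# Prints each lowered statement alongside the cycle cost the inliner charges it;
# before this PR, div_float was charged considerably more than mul_float.
Base.print_statement_costs(stdout, f, (Float64, Float64))
```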

We probably need a full-on rethink of the inlining model at some point, but for the time being, this fixes a problem I ran into in real code by reducing the inlining cost of floating point division to match that of floating point multiplication. The particular case where I saw this was the batched forward AD rule for division, which had 6 calls to div_float; inlining these gave substantially better performance.
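To see why a batched rule accumulates several divisions (the `Batched` type below is a hypothetical stand-in, not the actual AD code from this report): for f = x/y, each partial derivative is (ẋ - f·ẏ)/y, so a dual number carrying N tangent directions performs one division for the primal plus one per partial.

```julia
# Hypothetical minimal batched dual number; names and layout are illustrative,
# not the actual forward-AD implementation referenced above.
struct Batched{N}
    val::Float64                 # primal value
    partials::NTuple{N,Float64}  # N tangent directions propagated together
end

function Base.:/(a::Batched{N}, b::Batched{N}) where {N}
    q = a.val / b.val  # one div_float for the primal
    # d(x/y) = (dx - q*dy)/y, so one more div_float per tangent direction
    parts = ntuple(i -> (a.partials[i] - q * b.partials[i]) / b.val, N)
    return Batched{N}(q, parts)
end
```

With five tangent directions this sketch performs six divisions, consistent with the six div_float calls mentioned above.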

@timholy (Sponsor Member) commented Jul 5, 2023

Agreed with all your points about the limitations. Rather than estimating the cost of the callee in cycles, it might be better to look at the cycles of the caller with and without inlining the callee; if inlining makes a big difference, the callee should be inlined. But there are obvious problems with that: first, inlinability becomes a caller-by-caller issue, so the decision can't be made just once; and second, we get a combinatorial explosion of 2^n combinations of inlining decisions across n callees. For those reasons, the primitive algorithm seems a lot safer.
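Purely as an illustration of that blow-up (this is not how Julia's inliner works): an exhaustive search over n independent inline/don't-inline decisions evaluates 2^n candidate plans.

```julia
# Illustrative only: brute-force search over 2^n inlining plans, where bit i of
# `mask` says whether call site i gets inlined and `cost` is a hypothetical
# oracle for the caller's cost under that plan.
function best_inlining_plan(cost::Function, n::Int)
    best, bestmask = Inf, 0
    for mask in 0:(2^n - 1)
        c = cost(mask)
        if c < best
            best, bestmask = c, mask
        end
    end
    return bestmask, best
end
```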

> reducing the inlining cost for floating point division to be the same as that of floating point multiplication

Hmm, that seems a little drastic? I mean, in general that's not even close to being true, is it?

@Keno (Member, Author) commented Jul 5, 2023

> Hmm, that seems a little drastic? I mean, in general that's not even close to being true, is it?

It's the same cost in terms of instruction count. Latency-wise it's about 4x that of a float multiply on modern architectures. Throughput is about 8-20x worse, but that's precisely why I want to inline it: it gives the compiler a better chance to interleave it with other instructions for latency hiding.
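A hedged microbenchmark sketch of that latency-hiding effect (the function names and the use of BenchmarkTools are illustrative, not the measurement from the PR): the divisions across loop iterations are independent, so once the compiler can see them it can overlap their latency.

```julia
using BenchmarkTools

@noinline div_opaque(x, y) = x / y   # call boundary hides the division
@inline   div_visible(x, y) = x / y  # division visible to the caller's optimizer

function sumdiv(f, xs, ys)
    s = 0.0
    @inbounds for i in eachindex(xs, ys)
        s += f(xs[i], ys[i])  # independent divisions across iterations
    end
    return s
end

xs, ys = rand(1000), rand(1000) .+ 1.0
@btime sumdiv(div_opaque, $xs, $ys)
@btime sumdiv(div_visible, $xs, $ys)
```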

@gbaraldi (Member) commented Jul 5, 2023

It does share a port with other floating point operations, but I'm not against inlining more code anyway.

@vtjnash (Sponsor Member) commented Jul 5, 2023

Curiously, while the throughput of VMULSD is much higher (reciprocal throughput 0.5 vs 4-14 cycles), the latency of VDIVSD is within the same order of magnitude (5 vs 14 cycles), per https://uops.info/table.html. And the calling conventions on most platforms (especially Linux x86-64, which was being measured here) can heavily penalize any failure to inline floating point operations, unfortunately.
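A quick, hedged way to check whether a given division survives as an inlined instruction in its caller (`g` is a made-up example):

```julia
using InteractiveUtils

g(x, y) = x / y + 1.0

# If inlining succeeded, the IR shows a bare `fdiv double` rather than a
# `call` through the ABI boundary mentioned above.
code_llvm(stdout, g, (Float64, Float64))
```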

@staticfloat merged commit 2360140 into master on Jul 6, 2023 (6 checks passed)
@staticfloat deleted the kf/divinlinecost branch on Jul 6, 2023 at 00:08