
Lower inlining cost of floating point div #50428

Merged: 1 commit merged into master on Jul 6, 2023

Conversation

@Keno (Member) commented Jul 5, 2023

Our inlining cost model is extremely primitive, though surprisingly functional given its limitations. The basic idea was to give every intrinsic an approximate cost in cycles, such that for sufficiently large functions (>100 cycles), the cost of the extra call would be dwarfed by the cost of the function itself. However, there are a few problems with this. For one, the real issue is usually not the extra overhead of the call (which is small and well-predicted), but rather the inhibition of optimizations that inlining would have allowed. Additionally, the relevant cost comparison is generally not latency but the size of the resulting binary. Lastly, the latency metric is misleading on modern superscalar architectures, because the core performs other work while the operation is executing. In fact, somewhat counter-intuitively, this means it is *more* important to inline high-latency instructions, so the compiler can hide their latency by spreading them out.
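For reference, the per-statement costs this model assigns can be inspected with Julia's internal, unexported `Base.print_statement_costs` helper. A minimal sketch, assuming a recent Julia build (the function `f` is a made-up example, and the exact output format varies between versions):

```julia
# `Base.print_statement_costs` is an internal compiler tool, so its output
# and behavior may differ across Julia versions.
f(x, y) = x / y + x * y  # made-up example containing both div_float and mul_float

# Prints each lowered statement alongside the cycle cost the inliner charges it;
# before this PR, div_float was charged considerably more than mul_float.
Base.print_statement_costs(stdout, f, (Float64, Float64))
```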

We probably need a full-on rethink of the inlining model at some point, but for the time being, this fixes a problem I ran into in real code by reducing the inlining cost of floating point division to match that of floating point multiplication. The particular case where I saw this was the batched forward AD rule for division, which had 6 calls to div_float; inlining these gave substantially better performance.
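To see why a batched rule accumulates several divisions (the `Batched` type below is a hypothetical stand-in, not the actual AD code from this report): for f = x/y, each partial derivative is (ẋ - f·ẏ)/y, so a dual number carrying N tangent directions performs one division for the primal plus one per partial.

```julia
# Hypothetical minimal batched dual number; names and layout are illustrative,
# not the actual forward-AD implementation referenced above.
struct Batched{N}
    val::Float64                 # primal value
    partials::NTuple{N,Float64}  # N tangent directions propagated together
end

function Base.:/(a::Batched{N}, b::Batched{N}) where {N}
    q = a.val / b.val  # one div_float for the primal
    # d(x/y) = (dx - q*dy)/y, so one more div_float per tangent direction
    parts = ntuple(i -> (a.partials[i] - q * b.partials[i]) / b.val, N)
    return Batched{N}(q, parts)
end
```

With five tangent directions this sketch performs six divisions, consistent with the six div_float calls mentioned above.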

@timholy (Sponsor Member) commented Jul 5, 2023

Agreed with all your points about the limitations. Rather than estimating the cost of the callee in cycles, it might be better to look at the cycles of the caller with and without inlining the callee; if inlining makes a big difference, the callee should be inlined. But there are obvious problems with that: first, inlinability becomes a caller-by-caller issue, so the decision can't be made just once; and second, we get a combinatorial explosion of 2^n combinations of inlining decisions across n callees. For those reasons, the primitive algorithm seems a lot safer.
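Purely as an illustration of that blow-up (this is not how Julia's inliner works): an exhaustive search over n independent inline/don't-inline decisions evaluates 2^n candidate plans.

```julia
# Illustrative only: brute-force search over 2^n inlining plans, where bit i of
# `mask` says whether call site i gets inlined and `cost` is a hypothetical
# oracle for the caller's cost under that plan.
function best_inlining_plan(cost::Function, n::Int)
    best, bestmask = Inf, 0
    for mask in 0:(2^n - 1)
        c = cost(mask)
        if c < best
            best, bestmask = c, mask
        end
    end
    return bestmask, best
end
```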

> reducing the inlining cost for floating point division to be the same as that of floating point multiplication

Hmm, that seems a little drastic? I mean, in general that's not even close to being true, is it?

@Keno (Member, Author) commented Jul 5, 2023

> Hmm, that seems a little drastic? I mean, in general that's not even close to being true, is it?

It's the same cost in terms of instruction count. Latency-wise it's about 4x that of a float multiply on modern architectures. Throughput is about 8-20x worse, but that's precisely why I want to inline it: it gives the compiler a better chance to interleave it with other instructions for latency hiding.
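A hedged microbenchmark sketch of that latency-hiding effect (the function names and the use of BenchmarkTools are illustrative, not the measurement from the PR): the divisions across loop iterations are independent, so once the compiler can see them it can overlap their latency.

```julia
using BenchmarkTools

@noinline div_opaque(x, y) = x / y   # call boundary hides the division
@inline   div_visible(x, y) = x / y  # division visible to the caller's optimizer

function sumdiv(f, xs, ys)
    s = 0.0
    @inbounds for i in eachindex(xs, ys)
        s += f(xs[i], ys[i])  # independent divisions across iterations
    end
    return s
end

xs, ys = rand(1000), rand(1000) .+ 1.0
@btime sumdiv(div_opaque, $xs, $ys)
@btime sumdiv(div_visible, $xs, $ys)
```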

@gbaraldi (Member) commented Jul 5, 2023

It does share a port with other floating point operations, but I'm not against inlining more code anyway.

@vtjnash (Sponsor Member) commented Jul 5, 2023

Curiously, while the throughput of VMULSD is much higher (reciprocal throughput 0.5 vs 4-14 cycles), the latency of VDIVSD is within the same order of magnitude (5 vs 14 cycles), per https://uops.info/table.html. And the calling conventions on most platforms (especially Linux x86-64, which was being measured here) can heavily penalize any failure to inline floating point operations, unfortunately.
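A quick, hedged way to check whether a given division survives as an inlined instruction in its caller (`g` is a made-up example):

```julia
using InteractiveUtils

g(x, y) = x / y + 1.0

# If inlining succeeded, the IR shows a bare `fdiv double` rather than a
# `call` through the ABI boundary mentioned above.
code_llvm(stdout, g, (Float64, Float64))
```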

@staticfloat merged commit 2360140 into master on Jul 6, 2023 (6 checks passed)
@staticfloat deleted the kf/divinlinecost branch on Jul 6, 2023 at 00:08