Lower inlining cost of floating point div #50428

Our inlining cost model is extremely primitive, though surprisingly functional given its limitations. The basic idea was to give every intrinsic its approximate cost in cycles, so that for sufficiently large functions (>100 cycles), the cost of the extra call would be dwarfed by the cost of the function itself.

However, there are a few problems with this. For one, the real issue is usually not the overhead of the call (which is small and well-predicted), but rather the inhibition of optimizations that inlining might have allowed. Additionally, the relevant cost comparison is generally not latency, but rather the size of the resulting binary. Lastly, the latency metric is misleading on modern superscalar architectures, because the core will perform other work while the operation is executing. In fact, somewhat counter-intuitively, this means that it is *more* important to inline high-latency instructions, since inlining lets the compiler spread them out and hide their latency better.

We probably need a full-on rethink of the inlining model at some point, but for the time being, this fixes a problem that I ran into in real code by reducing the inlining cost of floating point division to match that of floating point multiplication. The particular case where I saw this was the batched forward AD rule for division, which had 6 calls to `div_float`. Inlining these provided substantially better performance.
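To make the effect concrete, here is a toy sketch of a threshold-based cost model like the one described above. It is not the compiler's implementation: the cost tables, the function names `inline_cost` and `should_inline`, the specific per-intrinsic numbers, and the example function body are all assumptions chosen for illustration. Only the threshold of 100 and the six `div_float` calls come from the description.

```python
# Toy model of a threshold-based inlining decision.
# All names and per-intrinsic costs are illustrative assumptions,
# not the values used by the actual compiler.

INLINE_THRESHOLD = 100  # the "sufficiently large" cutoff from the description


def inline_cost(ops, cost_table):
    """Sum the assumed per-intrinsic costs over a function body."""
    return sum(cost_table[op] for op in ops)


def should_inline(ops, cost_table, threshold=INLINE_THRESHOLD):
    """Inline only if the summed cost does not exceed the threshold."""
    return inline_cost(ops, cost_table) <= threshold


# Hypothetical costs before the change: division is priced by its latency.
OLD_COSTS = {"add_float": 1, "sub_float": 1, "mul_float": 4, "div_float": 20}

# After the change: division is priced the same as multiplication.
NEW_COSTS = {**OLD_COSTS, "div_float": OLD_COSTS["mul_float"]}

# Hypothetical body with six divisions plus some other arithmetic,
# loosely modelled on a batched forward-mode rule for division.
body = ["div_float"] * 6 + ["mul_float"] * 4 + ["sub_float"] * 2

if __name__ == "__main__":
    for label, table in (("old", OLD_COSTS), ("new", NEW_COSTS)):
        print(label, inline_cost(body, table), should_inline(body, table))
```

Under these assumed numbers, the six divisions alone push the old cost past the threshold, while pricing them like multiplications leaves the function comfortably inlinable.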

Commits on Jul 5, 2023