Lower inlining cost of floating point div #50428

Merged 1 commit on Jul 6, 2023

Commits on Jul 5, 2023

  1. Lower inlining cost of floating point div

    Our inlining cost model is extremely primitive, though surprisingly
    functional given its limitations. The basic idea for it was just that
    we'd give every intrinsic the approximate cost in cycles, such that
    for sufficiently large functions (>100 cycles), the cost of the extra
    call would be dwarfed by the cost of the function. However, there are
    a few problems with this. For one, the real issue is usually not
    the extra overhead of the call (which is small and well-predicted),
    but rather the inhibition of optimizations that inlining might have
    allowed. Additionally, the relevant cost comparison is not generally
    latency, but rather the size of the resulting binary.
    Lastly, the latency metric is misleading on modern superscalar
    architectures, because the core will perform other tasks while
    the operation is executing. In fact, somewhat counter-intuitively,
    this means that it is *more* important to inline high-latency
    instructions to allow the compiler to perform better latency hiding
    by spreading out the high-latency instructions.
    
    We probably need a full-on rethink of the inlining model at some
    point, but for the time being, this fixes a problem that I ran
    into in real code by reducing the inlining cost for floating point
    division to be the same as that of floating point multiplication.
    The particular case where I saw this was the batched forward AD rule for
    division, which had 6 calls to div_float. Inlining these provided
    substantially better performance.
    Keno committed Jul 5, 2023
    Commit 8d62b40
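
For readers unfamiliar with the cost model the commit message describes, here is a rough sketch of the idea in Python: each intrinsic gets an approximate cycle cost, the costs in a function body are summed, and the function is inlined only if the total stays at or below a threshold. The threshold of 100 comes from the commit message. The per-intrinsic numbers, the `should_inline` helper, and the shape of the example body (beyond its 6 `div_float` calls) are assumptions made for illustration, not values or APIs taken from the Julia source.

```python
# Sketch of a cycle-count inlining cost model (illustrative only).
# The threshold matches the ">100 cycles" figure in the commit message;
# the per-intrinsic costs below are assumed values, not Julia's actual table.
INLINE_COST_THRESHOLD = 100

COSTS_BEFORE = {"add_float": 1, "mul_float": 4, "div_float": 20}

# The change described above: div_float is costed the same as mul_float.
COSTS_AFTER = {**COSTS_BEFORE, "div_float": COSTS_BEFORE["mul_float"]}


def inline_cost(body, costs):
    """Sum the assumed cost of every intrinsic call in a function body."""
    return sum(costs[op] for op in body)


def should_inline(body, costs, threshold=INLINE_COST_THRESHOLD):
    """Inline only if the summed cost does not exceed the threshold."""
    return inline_cost(body, costs) <= threshold


# Hypothetical stand-in for the batched forward AD rule for division:
# 6 div_float calls (from the commit message) plus a few assumed mul/add ops.
BATCHED_DIV_RULE = ["div_float"] * 6 + ["mul_float"] * 4 + ["add_float"] * 2

print(inline_cost(BATCHED_DIV_RULE, COSTS_BEFORE),
      should_inline(BATCHED_DIV_RULE, COSTS_BEFORE))  # 138 False
print(inline_cost(BATCHED_DIV_RULE, COSTS_AFTER),
      should_inline(BATCHED_DIV_RULE, COSTS_AFTER))   # 42 True
```

Under these assumed numbers, the six divisions alone push the body past the threshold before the change, and the same body falls well under it afterwards. That matches the behavior the commit message reports, though the real model has more inputs than this sketch.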