
Decide inline-worthiness based on a more nuanced cost model #22210

Merged: 5 commits merged into master from teh/inlining_cost_model on Jul 8, 2017

Conversation

timholy
Sponsor Member

@timholy timholy commented Jun 4, 2017

In #22202 I noted a disadvantage of our current inlining heuristics: that we penalize verbosity even when two bits of code do the same thing. I decided to explore the feasibility of replacing our cost model with something based a little more closely on CPU cycles, with the idea that we should inline things that are obviously cheaper than a function call. I decided to be a bit generous and set a default threshold of 200 "cycles", and assigned costs to operations loosely inspired by this chart.
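To make "assigned costs to operations" concrete, here is a minimal sketch of the idea; the operation names, numbers, and helper functions below are made up for illustration and are not the actual table or API in this PR:

```julia
# Illustrative sketch only; not the PR's actual cost table or functions.
const INLINE_COST_THRESHOLD = 200   # the default "cycles" budget mentioned above

const OP_COST = Dict{Symbol,Int}(
    :add_int   => 1,    # integer add: ~1 cycle
    :mul_int   => 3,    # integer multiply
    :add_float => 4,    # float add
    :div_float => 20,   # float divide is much pricier
    :call      => 20,   # penalty for a non-intrinsic call
)

# Sum the estimated cost of each operation; unknown ops default to 1.
estimate_cost(ops::Vector{Symbol}) = sum(op -> get(OP_COST, op, 1), ops)

# Inline whenever the estimated runtime fits in the budget.
inline_worthy(ops::Vector{Symbol}) = estimate_cost(ops) <= INLINE_COST_THRESHOLD

inline_worthy([:add_int, :mul_int, :call])   # true: 24 <= 200
```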

This is a horribly rough approximation: for example, there's a very large difference between success and failure when it comes to branch prediction, and it might be reasonable to guess that the branch in

```julia
x < zero(x) && throw(DomainError())
```

is much less likely to be hit than

```julia
if x < rand()
    # option 1
else
    # option 2
end
```

Such subtleties (not to mention cache prefetch, etc) are far beyond the simple approach here. In general I tried to aim for an optimistic model, but worked in some amount of penalty for the possibility that this wouldn't hold, with the amount depending entirely on how I was feeling at that particular moment 😄.

It's also worth pointing out that I've done nothing whatsoever to test how well this works. My entire goal was to get julia to build.

This still has debugging statements in the submitted code because I can't imagine that this isn't going to need at least a teensy bit of revision 😉.

@timholy
Sponsor Member Author

timholy commented Jun 4, 2017

Just realized there's a really big hole in this: a `for` or `while` loop only gets counted as if its body executed once. It would probably be best just not to inline functions containing such loops. I don't see a "loop detector" at present, but perhaps I'm just missing it. Or can one detect a loop simply by a goto to an earlier line? See the sketch below.
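For illustration, here is roughly what such a backwards-branch scan might look like against the 0.6-era lowered IR; a sketch under those assumptions, not the code this PR ended up using:

```julia
# Sketch: treat any branch whose target label occurs earlier in the
# statement list as evidence of a loop (0.6-era lowered IR).
function has_loop(code::Vector{Any})
    labelpos = Dict{Int,Int}()          # label number => statement index
    for (i, stmt) in enumerate(code)
        isa(stmt, LabelNode) && (labelpos[stmt.label] = i)
    end
    for (i, stmt) in enumerate(code)
        # A loop shows up as GotoNode or Expr(:gotoifnot, cond, label)
        # targeting a label defined at an earlier statement.
        target = isa(stmt, GotoNode) ? stmt.label :
                 isa(stmt, Expr) && stmt.head === :gotoifnot ? stmt.args[2] : nothing
        target !== nothing && get(labelpos, target, typemax(Int)) < i && return true
    end
    return false
end
```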

@yuyichao
Contributor

yuyichao commented Jun 6, 2017

Not an expert on inlining heuristics, but I think using the number of operations in a function to decide inline-worthiness should be OK. It's perfectly fine to inline a small loop: as long as the inlining doesn't blow up the size of the parent function, it shouldn't hurt much. Obviously it might not help much either, but that's true in general, and you'd probably need a much more involved cost model, or to actually try the inlining, to tell.

yuyichao added a commit that referenced this pull request Jun 6, 2017
The changes to `isassigned(::SimpleVector, ...)` and `isassigned(::BitArray, ...)` are currently no-ops since they are unlikely to be inlined. The `SimpleVector` version is also currently unsafe to inline. These might change with a better inlining heuristic (#22210) and a smarter GC frame allocation pass (#21888).
@timholy timholy changed the title from "WIP/RFC: Decide inline-worthiness based on a more nuanced cost model" to "Decide inline-worthiness based on a more nuanced cost model" on Jun 6, 2017
@timholy timholy changed the title from "Decide inline-worthiness based on a more nuanced cost model" to "(Do not merge) Decide inline-worthiness based on a more nuanced cost model" on Jun 6, 2017
@timholy
Sponsor Member Author

timholy commented Jun 6, 2017

OK, I've taken this further; a modified version of this (that had some debugging code) locally passed all tests but one. For demonstration purposes, I've also pushed a branch, teh/disable_inline_macros, that gets rid of all forced-inlining in Base; the point of this is to force the inliner to "stand on its own" and see what happens. I then rebased this PR on that branch. My original intention was to then run nanosoldier to compare the differences, but from my local testing there are so many differences that it's a bit overwhelming.

The bottom line seems pretty simple, though. Behold:

[Figure inline_perf: scatter plot comparing per-test benchmark times]

So while there are a few regressions, they are both outnumbered and outweighed by the improvements. (EDIT: each dot is one test in BaseBenchmarks.jl, >2000 tests overall.)

As one example, `map(abs, t)` where `t = ntuple(identity, n)` inlines (on its own) up to n=3 on master but up to n=14 on this branch; with a more expensive function, `map(sqrt, t)` inlines up to n=5.
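(For anyone wanting to reproduce this kind of check, a hypothetical recipe using only public tooling:)

```julia
# If `map` was inlined, the typed IR for `f` shows per-element `abs`
# calls on the tuple rather than a call to `map` itself.
t = ntuple(identity, 14)
f(t) = map(abs, t)
@code_typed f(t)
```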

Now, one concern might be that this simply inlines everything, and that's cheating. What's the best way to assess such concerns? I checked the "resident size" in top, and compared to 0.6 (I didn't feel like rebuilding master) the size was pretty comparable.

@yuyichao
Contributor

yuyichao commented Jun 6, 2017

Am I reading it correctly that there's one case that shows a 10^6 speedup?

Now, one concern might be that this simply inlines everything, and that's cheating.

Well, then all optimizing compilers have been cheating all along... I'm pretty sure most C/C++ compilers do much more aggressive inlining than we do, so I wouldn't call it cheating. The important thing is to measure the overhead: maybe the time to compile the sysimg, or some other standard precompilation test case. You can also measure the total size of the inferred IR by recording a count of non-void expressions in inference after finishing each function, and if you want more low-level info you can check `size -A` of the sysimg and record the JIT code size by adding a counter in `RTDyldMemoryManagerJL::allocateCodeSection` in `cgmemmgr.cpp`.

@JeffBezanson
Sponsor Member

Wow, that's really cool. If you re-enable manual inlining declarations, do the regressions go away?

@JeffBezanson JeffBezanson added the performance Must go faster label Jun 6, 2017
@timholy
Sponsor Member Author

timholy commented Jun 6, 2017

Am I reading it correctly that there's one case that shows 10^6 speed up?

Yes, that's basically right; several other cases get better than a 10^4 speedup. Or rather, "not slowed down": I am sure that our performance annotations take care of all the egregious cases. I very much doubt we'll see much improvement compared to "real" master from this, but of course it's worth testing.

Maybe measuring the time to compile sysimg

Great idea.

| task | master | "master" (disabled `@inline` annotations) | PR |
| --- | --- | --- | --- |
| build from scratch (incl. `inference.jl`) | 2min 48s | 2min 46s | 3min 6s |
| build `base/precompile.jl` | 1min 38s | 1min 36s | 1min 50s |

So there's definitely a slowdown with this. Would be interested in people's thoughts on how bad this is.

If you re-enable manual inlining declarations, do the regressions go away?

Will test. But right now I'm testing how this fares against "real" master, with all of its performance annotations; I'm curious to know if we could scale back on our usage of @inline.

@yuyichao
Contributor

yuyichao commented Jun 6, 2017

I believe we also have an `ENABLE_TIMING` option in `options.h` which can give you a breakdown of the timing. That timing difference doesn't sound too bad to me, though compile time is also not the thing I care about the most...

@timholy
Sponsor Member Author

timholy commented Jun 7, 2017

Here's a comparison against master with all of its @inline annotations:

[Figure inline_vs_master: per-test benchmark comparison against master with its @inline annotations]

Definitely worse on balance, but better than I expected: to my eye, this branch without performance annotations is closer to master with annotations than it is to master with the annotations stripped out.

Obviously some of the regressions are pretty severe, but I've done literally no performance tuning here. It's probably worth looking into some of the regressions by hand and seeing what I can come up with; in particular, the ones with ~10-100ns of runtime on master seem likely to be cases where I assigned the wrong cost, causing a pretty simple function to fail to inline.

@quinnj
Member

quinnj commented Jun 7, 2017

I think it'd be really cool to expose the internal cost model here via profiling; it'd be a neat alternative to time/allocations/samples and another good signal for identifying performance bottlenecks and the relative "cost" of various code paths. Sometimes you don't realize the relative costs of various parts of a workflow, or you have some kind of type instability, and all of that seems like it would stick out in this kind of data.

@timholy
Sponsor Member Author

timholy commented Jun 8, 2017

@quinnj, the cost model is almost certainly not accurate enough to expose to users; it doesn't take into account looping, and the cost of non-inferrability is handled crudely.

However, it can be useful to understand why a particular function didn't get inlined. I'll document how to call statement_cost for a function body.
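In the meantime, a cruder way to see an individual inlining decision with public tooling (`g` and `h` here are made-up examples, not anything in Base):

```julia
# Made-up example: if `g` is judged inline-worthy, the typed IR for `h`
# contains `g`'s body directly, with no remaining call/invoke of `g`.
g(x) = x < zero(x) ? -x : x
h(x) = g(x) + one(x)
@code_typed h(1.0)
```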

@timholy
Sponsor Member Author

timholy commented Jun 9, 2017

So, I'm thinking of putting this on the back burner until #21888 merges; I'm seeing quite a few issues that I hope get resolved by better GC marking. The last commit here (c981dd1) is just one example; I'm also seeing a few big performance regressions that seem to occur because we allocate when we don't need to. A good example is `mean(rand(10^3))`, where I'm getting allocations due to unnecessary boxing of the `blksize` argument. I'm pretty sure this is not caused by my changes here, but it does get triggered by them.

Of course the question is whether the benefits of #21888 on LLVM 3.9 will fix these issues, or whether that will have to wait until we move to LLVM 5.

@timholy
Sponsor Member Author

timholy commented Jun 9, 2017

Speaking of bugs triggered by this, https://travis-ci.org/JuliaLang/julia/builds/240808331 hit an assertion in `gf.c`. (It was the `Int` version that was a subtype of the `Int...` version of that function.) I worked around it by declaring that argument as `Vararg{Int,N}`, but it suggests a bug elsewhere. See the sketch below.
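For illustration, the workaround looks like the following (hypothetical signatures; the actual function in Base differs):

```julia
# Hypothetical signatures illustrating the workaround described above.
f(a, xs::Int...) = sum(xs)                    # the form that hit the gf.c assertion
g(a, xs::Vararg{Int,N}) where {N} = sum(xs)   # workaround: bind the vararg count N
```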

@timholy timholy changed the title from "(Do not merge) Decide inline-worthiness based on a more nuanced cost model" to "Decide inline-worthiness based on a more nuanced cost model" on Jun 18, 2017
@timholy
Sponsor Member Author

timholy commented Jun 18, 2017

@nanosoldier runbenchmarks(ALL, vs = ":master")

@nanosoldier
Collaborator

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @jrevels

@KristofferC
Sponsor Member

Cool stuff!

timholy added commits:
- This greatly reduces the cost of failure-to-inline.
- This switches to a model in which we estimate the runtime of the function; if it's fast, then we should inline it. The estimate of runtime is extremely crude, and doesn't even take loops into account.
@timholy
Sponsor Member Author

timholy commented Jul 7, 2017

If this makes it through CI, I'll merge in the morning.

@vchuravy
Sponsor Member

vchuravy commented Jul 8, 2017

Appveyor is failing due to libgit2-online. Do all the intermediate commits pass tests as well?

@timholy
Sponsor Member Author

timholy commented Jul 8, 2017

Yes, they do. (EDIT: only the last commit addresses the real topic of this PR. The earlier ones compensate for issues I detected at various points along the way while developing this. As with the `@noinline` around the throws, it's quite likely that they aren't strictly necessary, because the final version of the inference change solved the problems more systematically. But they aren't problematic either, and in some ways represent improvements on their own.)

@timholy timholy merged commit df63db6 into master Jul 8, 2017
@timholy timholy deleted the teh/inlining_cost_model branch July 8, 2017 08:18
@KristofferC
Sponsor Member

Yay!

@martinholters
Member

Hm, looks like this makes the ACME.jl tests crash. Not sure if the problem was introduced here or just uncovered by different inlining. I haven't really gotten anywhere investigating the issue, but somewhat up the call stack of a slightly reduced case, I found this:

```
#16 0x00007ffff7574ed4 in jl_call_fptr_internal (fptr=0x7fffffffb940, meth=0x7fffe999ec90, args=0x7fffffffbb48, nargs=2)
    at ./src/julia_internal.h:369
369	        return fptr->fptr1(args[0], &args[1], nargs-1);
(gdb) call jl_(args[0])
getfield(Base, Symbol("##494#495")){Base.MethodError}(ex=Base.MethodError(f=typeof(Base.:(*))(), args=<?#0x7fffe906f230::Core.Inference.InferenceParams(world=0x0000000000005677, inlining=true, inline_cost_threshold=100, inline_nonleaf_penalty=1000, inline_tupleret_bonus=400, MAX_METHODS=4, MAX_TUPLETYPE_LEN=15, MAX_TUPLE_DEPTH=4, MAX_TUPLE_SPLAT=16, MAX_UNION_SPLITTING=4, MAX_APPLY_UNION_ENUM=8)>, world=0x0000000000005677))
```

Note that the (inner) MethodError's args is an instance of InferenceParams. Of course, this could also just be a random artifact of some memory corruption. Does anyone have any ideas, or has anyone seen anything similar?

@timholy
Sponsor Member Author

timholy commented Jul 11, 2017

As long as it doesn't crash in inline_worthy or statement_cost, there's a sense in which this can't introduce new bugs: the only thing it changes is the value of a single boolean flag, the decision to inline or not. It doesn't change anything about how that inlining is performed.

However, it's perfectly capable of exposing old bugs. Indeed it's already done so: at least #22516, and (at an intermediate stage) what is probably the same bug as #22592.

I've not seen that one before, though.

@martinholters
Member

Seems indeed to be a missing GC root again, ref. #22770 (comment).

@tkelman
Contributor

tkelman commented Jul 12, 2017

I'll write up a devdocs on this once it's been reviewed.

Sometime later, then, if you have time?

vtjnash added a commit that referenced this pull request Jul 9, 2020 (rebased versions followed through Sep 2020, with c42f pushing the final one on Sep 23, 2020):
Added in #22210 (and earlier begun in #20853), this is no longer necessary to avoid heap allocations, and thus serves little purpose now.