Report per-MethodInstance inference timings during a single inference call, via SnoopCompile #37749
Conversation
Since Ref{} isn't available yet from inside Core.Compiler :)
- I _think_ this should get us all of the MethodInstances that are being inferred during inference of a caller function.
- The old approach was missing some edges.
- Also, this now presents that info via a dependency graph, which can be used to reverse-engineer which methods are responsible for what time.
…owing substituting `typeinf()`
This complicates the code a bit, but I think it should reduce overhead slightly.
…ide generated functions
Seems like another option would be some
Thanks @timholy - I had exactly the same thought! That's what I've just finished trying now, and I think it's working. I'll push that up here so you can see what I did. I wasn't sure if a
The problem, then, of course, is that I had to implement a very basic version of TimerOutputs in the Core.Compiler bootstrap, which can keep track of the nested timings, but I think I've done that. It wasn't too hard!
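For a concrete picture of what a "very basic version of TimerOutputs" can look like under bootstrap constraints, here is a minimal sketch; all names and fields are illustrative assumptions, not the PR's actual definitions:

```julia
# A tree of timing nodes plus a stack of currently-open nodes. Exclusive
# time is accumulated directly by pausing the parent whenever a child opens.
mutable struct TimingNode
    label::Any                    # e.g. a MethodInstance
    time_ns::UInt64               # accumulated *exclusive* time
    start_ns::UInt64              # start of the current exclusive stretch
    children::Vector{TimingNode}
end
TimingNode(label) = TimingNode(label, UInt64(0), time_ns(), TimingNode[])

const timing_stack = TimingNode[TimingNode(:ROOT)]

function enter_timer!(label)
    parent = timing_stack[end]
    parent.time_ns += time_ns() - parent.start_ns   # pause the parent
    child = TimingNode(label)                       # child's clock starts now
    push!(parent.children, child)
    push!(timing_stack, child)
    return child
end

function exit_timer!()
    child = pop!(timing_stack)
    child.time_ns += time_ns() - child.start_ns     # close the child
    timing_stack[end].start_ns = time_ns()          # resume the parent
    return child
end
```

The key property, echoed in the commits below, is that a node's clock only runs while no child is open, so each node accumulates exclusive time directly.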
This allows us to measure the _nested_ per-method-instance inference timings, which allows us both to produce a profile view over the inference data and to compute _exclusive_ timings per method instance.
Manually stop and restart the timers before and after calling child functions, so that we directly compute _exclusive_ timings for each node in the call tree.
`@inbounds` and rewriting to be more efficient. Moved the timings as close to the `invoke _typeinf` call as possible, to reduce the overhead included in the measurement. But it didn't work: the measured timing still covers the whole process.
…) to simplify code.
Okay! :) I believe this analysis is working! We've been using it internally at RelationalAI, and it's pretty cool, and has already proven informative. 😁 We'd love to get this merged this week. I'll try to focus on it to make sure it happens! Here's the remaining work I see:
After a good amount of profiling, I have convinced myself that we already have the overhead during snooping quite reasonably low. For example, to snoop:

```julia
julia> @profile timing = SnoopCompileCore.@snoopi_deep begin
           @code_typed peakflops()
       end;
```

Then from PProf: (profile graph screenshot)
timholy left a comment
To me this mostly seems good; of your list of remaining tasks, only the first needs to be resolved. The performance seems unlikely to be an issue (and your measurements prove it), and your timing code seems reasonable to me. Maybe @KristofferC (as author of TimerOutputs) would have more to say, but I'd say you can safely convert this from draft to "real PR" form, and I expect we can merge this quite soon.
Adds more comments throughout the Core.Compiler.Timings module.
Thanks Tim! :) Okay, then I guess this is mostly complete! The only thing I'd like to address is maybe also adding some information to
Sacha0 left a comment
Fantastic! Thanks @NHDaly! :)
NHDaly left a comment
Thank you @Sacha0 for the careful review! :) I've applied your suggested changes. 👍 thanks!
Co-Authored-By: Sacha Verweij <sacha@stanford.edu>
force-pushed from 076dd44 to 663f47f
timholy left a comment
Sorry to ask for more stuff, but I do think this is the last before we merge.
I strongly approve of the idea of #37136, and it looks pretty straightforward, but I'd be lying if I said I was the right person to review it. I've barely glanced at the LLVM side of compilation. If you've gotten approvals from others I think it should be good to merge.
Simplify ROOT MethodInfo construction
Sacha0 left a comment
Looks great! Thanks Nathan! :)
Thanks for sticking with it, Nathan! Looking forward to this.
:) Thanks @timholy! I appreciate your help guiding it to the finish! I'm looking forward to this too!
I've played with this a bit. Could we add a flag so that only new inferences are tracked? I piped the output to a file via AbstractTrees and I'm getting lots of repeated entries; I count a total of 75 instances of one of them.

Alternatively, it might be better to record whether this was a cache-lookup or a new inference. Then perhaps FlameGraphs could process the list and exclude cache-lookups, much in the way that it (by default) handles C calls.
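A hypothetical sketch of that second suggestion: if each node recorded whether it was a cache lookup, post-processing could prune those subtrees much as FlameGraphs folds away C calls by default. The `from_cache`, `label`, `time`, and `children` field names are all invented for illustration:

```julia
# Drop cache-lookup nodes (and their subtrees), keeping only fresh inference.
function without_cache_lookups(node)
    kept = [without_cache_lookups(c) for c in node.children if !c.from_cache]
    return (label = node.label, time = node.time, children = kept)
end
```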
Perhaps like you, I am also a bit puzzled about the results. Why do we have a bazillion entries like

```
│  ├─ 0.072064: MethodInstance for (::Colon)(::Int64, ::Int64)
│  │  └─ 0.110615: MethodInstance for UnitRange{Int64}(::Int64, ::Int64)
│  │     ├─ 0.046747: MethodInstance for convert(::Type{Int64}, ::Int64)
│  │     └─ 0.134257: MethodInstance for unitrange_last(::Int64, ::Int64)
│  │        ├─ 0.036451: MethodInstance for -(::Int64, ::Int64)
│  │        └─ 0.019563: MethodInstance for convert(::Type{Int64}, ::Int64)
```

where the times are printed in milliseconds? This suggests it's spending 100μs inferring (presumably looking up from cache) these trivial calls. Yet:

```julia
julia> f2(i, j) = i:j
f2 (generic function with 1 method)

julia> @code_typed f2(1, 3)
CodeInfo(
1 ─ %1 = Base.sle_int(i, j)::Bool
│   %2 = Base.sub_int(i, 1)::Int64
│   %3 = Base.ifelse(%1, j, %2)::Int64
│   %4 = %new(UnitRange{Int64}, i, %3)::UnitRange{Int64}
└── return %4
) => UnitRange{Int64}
```

which makes me think there's definitely no reason to re-infer this. I seem to remember that you opened an issue about this, but I am not finding it.
@timholy I am 60% sure that indeed only new inferences are being tracked. That is, from what I understand, we only enter this function if we need to infer something and it's not already in the cache. At least, that's what I intended. Did I not do that right?

My impression is that the results you're seeing are indeed from actual duplicated inference, likely because of inferring constants! I did raise a question about this here, but unfortunately we ended up discussing it offline, so the results aren't recorded. My understanding is that what you're seeing can be explained by the following:
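Roughly: with constant propagation, inference may run again for the same MethodInstance, specialized on particular constant argument values, so one signature can show up many times in the trace. A small hypothetical illustration (my sketch, not from the original discussion):

```julia
add1(x::Int) = x + 1

# Inferring each caller can re-run inference of add1 specialized on the
# constant argument (Const(1) vs. Const(2)), even though both callers hit
# the same MethodInstance signature, add1(::Int64).
call1() = add1(1)
call2() = add1(2)
```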
I think we should augment the AbstractTrees view to include the
See the modified
Ah yeah, perfect!
… call, via SnoopCompile (JuliaLang#37749) This allows us to measure the _nested_ per-method-instance inference timings, which allows us both to produce a profile view over the inference data and to compute _exclusive_ timings per method instance.
This is a second attempt, after #37535 proved far too complicated.
This PR adds a small bit of code to Core.Compiler that provides nestable timers, much like TimerOutputs.jl, but rewritten from scratch to get it to work inside bootstrap.
This then optionally enables these timers inside `Core.Compiler.typeinf()` whenever a global const boolean Ref, `__measure_typeinf__`, is set to `true`. The timers construct a nested trace of the time to run type inference for each individual invocation of `typeinf()`. These timers are returned as a tree structure, where each node contains the exclusive time for that invocation. For example:
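A sketch of driving the toggle by hand (the helper names `reset_timings` and `_timings` are assumptions about the `Core.Compiler.Timings` module; the Ref itself is described above):

```julia
# Enable the timers, run something that triggers inference, inspect the tree.
Core.Compiler.Timings.reset_timings()         # assumed helper: clear old data
Core.Compiler.__measure_typeinf__[] = true    # the global Ref described above

f(i, j) = i:j
code_typed(f, (Int, Int))                     # trigger fresh inference

Core.Compiler.__measure_typeinf__[] = false
root = Core.Compiler.Timings._timings[1]      # assumed: the ROOT Timing node

# Printed via AbstractTrees (times in ms), the tree looks like the snippets
# shown in the review discussion above, e.g.:
#   ROOT
#   └─ 0.45: MethodInstance for f(::Int64, ::Int64)
#      └─ 0.07: MethodInstance for (::Colon)(::Int64, ::Int64)
#         └─ ...
```

This enable/run/disable pattern is essentially what `SnoopCompileCore.@snoopi_deep`, used earlier in this thread, wraps up for you.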
Great care is taken in this implementation to record only the exclusive time for each invocation, in order to not include any of the overhead of the measurement itself in the timings for any individual node. This is important since we don't want to disproportionately report overly large exclusive times for method instances that simply fan out to many children, since this isn't actually reflective of how long it took to infer that method itself.

We ultimately want to be able to produce both accurate inclusive times and accurate exclusive times, so we record the exclusive times, taking care to not include the overhead, and then reconstruct the inclusive times in post-processing (this code is in SnoopCompile.jl; see the sketch after the algorithm below). This algorithm is basically:
```
typeinf():
    enter: pause the parent node's timer; push a new Timing node and start its timer
    _typeinf()   # may recurse into typeinf()
    exit:  stop this node's timer, accumulating its exclusive time; pop it and resume the parent's timer
```

Each Timing node contains its payload information about the invocation (currently just the MethodInstance -- see comment below where I'm asking for suggestions on what else to include), the cumulative exclusive duration for that node, the absolute start time for that node (for ProfileView), and the children from that node.
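Reconstructing inclusive times from these recorded exclusive times is then a single recursion over the tree. A minimal sketch of that post-processing step, assuming each node exposes a numeric `time` (exclusive) and a `children` vector as described above:

```julia
# Inclusive time of a node = its own exclusive time plus the inclusive
# times of all of its children.
function inclusive_time(node)
    return node.time + sum(inclusive_time, node.children; init = zero(node.time))
end
```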
I was trying to follow the same mechanism as is done in `@snoopi` - to allow swapping out the `typeinf()` function for a timed version - but it turns out that the overhead of calling `invokelatest` is probably too high.
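For contrast, a hedged sketch of that rejected approach: route every inference entry through a swappable slot and call it via `invokelatest`, so a wrapper defined after bootstrap is visible. The names `typeinf_hook` and `original_typeinf` are hypothetical:

```julia
const typeinf_hook = Ref{Any}(nothing)    # hypothetical swappable slot

function maybe_timed_typeinf(interp, frame)
    hook = typeinf_hook[]
    hook === nothing && return original_typeinf(interp, frame)  # hypothetical
    # invokelatest makes a later-defined hook callable from compiled code,
    # but adds dispatch overhead on every inference entry -- the cost that
    # made this approach too expensive.
    return Base.invokelatest(hook, interp, frame)
end
```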