
Conversation

@NHDaly (Member) commented Sep 25, 2020

This is a second attempt, after #37535 proved far too complicated.

This PR adds a small amount of code to Core.Compiler that provides Timers, which can be used to build nested timers, much like TimerOutputs.jl, but rewritten from scratch so that it works inside bootstrap.

This then optionally enables these timers inside Core.Compiler.typeinf() whenever a global const boolean Ref, __measure_typeinf__, is set to true.

The timers construct a nested trace of the time to run type inference for each individual invocation of typeinf(). These timers are returned as a tree structure, where each node contains the exclusive time for that invocation.

For example:

julia> using SnoopCompileCore

julia> module A
           foo() = 1
           bar() = foo() + 1
       end
Main.A

julia> timing = SnoopCompileCore.@snoopi_deep begin
           A.bar()
       end
Core.Compiler.Timings.Timing("root", 0x00037e9353825b46, 0x00037e9353879a8b, 0x000000000000be6c, Core.Compiler.Timings.Timing[Core.Compiler.Timings.Timing(Tuple{typeof(Main.A.bar)}, 0x00037e9353831cc8, 0x00037e9353864e03, 0x000000000002274a, Core.Compiler.Timings.Timing[Core.Compiler.Timings.Timing(Tuple{typeof(Main.A.foo)}, 0x00037e9353837468, 0x00037e9353837468, 0x0000000000013457, Core.Compiler.Timings.Timing[]), Core.Compiler.Timings.Timing(Tuple{typeof(+),Int64,Int64}, 0x00037e93538541c8, 0x00037e93538541c8, 0x0000000000010b44, Core.Compiler.Timings.Timing[])])])

Great care is taken in this implementation to record only the exclusive time for each invocation, so that none of the measurement overhead is included in the timing of any individual node. This matters because we don't want to report disproportionately large exclusive times for method instances that simply fan out to many children - that wouldn't reflect how long it took to infer the method itself. We ultimately want to produce both accurate inclusive times and accurate exclusive times, so we record the exclusive times, take care to exclude the overhead, and then reconstruct the inclusive times in post-processing (that code lives in SnoopCompile.jl). The algorithm is basically as follows (a minimal sketch in Julia appears after the list):

  • First, record start time of "root"
  • typeinf:
    • enter:
      • record stop time of parent node
      • push new node for new invocation
      • record start time of new node
    • invoke _typeinf # May recurse into typeinf()
    • exit:
      • record stop time of new node
      • pop new node
      • record (continuation) start time of parent node
  • At very end, record stop time of "root"
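
A minimal sketch of this bookkeeping in plain Julia (names simplified; the actual code lives in Core.Compiler.Timings and differs in detail):

# Each node accumulates *exclusive* time; children keep their own.
mutable struct Timing
    name::Any                  # payload identifying the invocation
    start_time::UInt64         # absolute start time (useful for ProfileView)
    cur_start_time::UInt64     # start of the current exclusive interval
    time::UInt64               # accumulated exclusive time
    children::Vector{Timing}
end
Timing(name, t::UInt64) = Timing(name, t, t, UInt64(0), Timing[])

const _timings = Timing[Timing("root", time_ns())]   # stack of open timers

function enter_new_timer(name)
    now = time_ns()
    parent = _timings[end]
    parent.time += now - parent.cur_start_time   # close parent's exclusive interval
    child = Timing(name, now)
    push!(parent.children, child)
    push!(_timings, child)
    child.cur_start_time = time_ns()             # start child *after* the bookkeeping
    return nothing
end

function exit_current_timer(name)
    now = time_ns()
    child = pop!(_timings)
    @assert child.name === name                  # sanity check: timers are well-nested
    child.time += now - child.cur_start_time     # close child's exclusive interval
    _timings[end].cur_start_time = time_ns()     # resume parent *after* the bookkeeping
    return nothing
end

typeinf would call enter_new_timer(frame.linfo.specTypes) on entry and exit_current_timer(frame.linfo.specTypes) after _typeinf returns, as shown in the profile listing later in this thread.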

Each Timing node contains its payload information about the invocation (currently just the MethodInstance -- see comment below where I'm asking for suggestions on what else to include), the cumulative exclusive duration for that node, the absolute start time for that node (for ProfileView), and the children from that node.

I originally tried to follow the same mechanism as @snoopi - swapping out the typeinf() function for a timed version - but it turns out that the overhead of calling invokelatest is probably too high.
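
For context, the @snoopi mechanism being referred to looks roughly like this - a hedged sketch of SnoopCompileCore's hook, since exact signatures vary across Julia versions:

# Route inference through a timed wrapper via the runtime hook.
function typeinf_ext_timed(linfo::Core.MethodInstance, world::UInt)
    tstart = time()
    ret = Core.Compiler.typeinf_ext(linfo, world)
    push!(__inf_timing__, (time() - tstart, linfo))   # hypothetical global log
    return ret
end
ccall(:jl_set_typeinf_func, Cvoid, (Any,), typeinf_ext_timed)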

Commit messages:

Since Ref{} isn't available yet from inside Core.Compiler :)
- I _think_ this should get us all of the MethodInstances that are being
  inferred during inference of a caller function.
- The old approach was missing some edges.
- Also, this now presents that info via a dependency graph, which can be
  used to reverse-engineer which methods are responsible for what time.

This complicates the code a bit, but I think it should reduce overhead
slightly.
@timholy (Member) commented Sep 25, 2020

Seems like another option would be some const are_we_snooping = Ref(false) and modify the inference code to check it, and if true then push the MethodInstance onto your list.

@NHDaly (Member, Author) commented Sep 25, 2020

Thanks @timholy - I had exactly the same thought! That's what I've just finished trying now, and I think it's working. I'll push that up here so you can see what I did.

I wasn't sure whether a Ref{Bool} or an Array{Bool,0} is preferable - do you know the recommended way to have a global, mutable, const bool in Core.Compiler?
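
For reference, a quick sketch of the two candidates in plain Julia; both support the same x[] read/write syntax, and the open question is just which one Core.Compiler can construct during bootstrap (per the commit note above, Ref isn't available that early):

const flag_ref = Ref(false)     # Base.RefValue{Bool}
const flag_arr = fill(false)    # zero-dimensional Array{Bool,0}

flag_ref[] = true; flag_ref[]   # true
flag_arr[] = true; flag_arr[]   # true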

@NHDaly (Member, Author) commented Sep 25, 2020

The problem, then, of course, is that I had to implement a very basic version of TimerOutputs in the Core.Compiler bootstrap, which can keep track of the nested timings - but I think I've done that. It wasn't too hard!

Commit messages:

This allows us to measure the _nested_ per-method-instance inference
timings, which allows us to produce both a profile view over the
inference data, as well as compute _exclusive_ timings
per-method-instance.

Manually stop and restart the timers before and after calling child
functions, so that we directly compute _exclusive_ timings for each node
in the call tree.

`@inbounds` and rewriting to be more efficient.

Moved the timings as close to the `invoke _typeinf` as possible, to
reduce measurement of overhead. But it didn't work: the measured timing
still covers the whole process.
@NHDaly (Member, Author) commented Oct 5, 2020

Okay! :) I believe this analysis is working! We've been using it internally at RelationalAI, and it's pretty cool, and has already proven informative. 😁

We'd love to get this merged this week. I'll try to focus on it to make sure it happens!

Here's the remaining work I see:

  • Settle on exactly what information to record for each MethodInstance.
    • UPDATE: For now, I'm recording this tuple: (frame.linfo, frame.world, copy(frame.sptypes), frame.slottypes[2:end]) - see the sketch after this list.
    • The linfo / MethodInstance object itself?
      • Good: contains relevant info about which method and which argument types, plus lots of other stuff.
      • Good: cheap to copy (just a pointer).
      • Bad: Cannot be serialized.
      • Bad: Somewhat opaque.
      • Bad: doesn't capture recursive call info / constant propagation info
    • A TypeTuple (Tuple{typeof(f), Int, Int})?
      • Good: can be serialized.
      • Good: Contains function type and arg types.
      • Bad: Ugly.
      • Bad: Loses the connection to the method; you have to recover it via methods(), which is expensive and annoying.
      • Bad: doesn't capture recursive call info / constant propagation info (same as linfo)
    • A two-tuple of (f, tt): ((+, (Int, Int))).
      • I like this approach; it's the best of both the above worlds, I think.
      • EDIT: Duh - as above, this doesn't make sense: in the compiler we don't have access to the function instance itself.
    • Maybe something that contains more info about the invocation? Like, when we compile the same MethodInstance multiple times, why are we doing that? Likely we're specializing it for different constants? Or there's some recursion stuff involved? This would be good to record.
    • I guess this info is in the InferenceState. What info should we pull out?
  • Performance improvements
    • Figure out the best kind of "global const boolean" -- Array{Bool,0} or Ref{Bool}?
    • Try to get the overhead from snooping as low as possible when snooping is enabled.
  • Clean up the inner "Timings" module
    • Is there any way this can be done without adding so much code? If not, what should we name that module?
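
A sketch of the payload mentioned in the UPDATE above, pulled from the InferenceState frame (this mirrors the tuple recorded at this point in the PR; the helper name is illustrative):

# Record, for one typeinf invocation: the MethodInstance, the world age,
# and the (possibly Const-annotated) static-parameter and argument types.
# slottypes[1] is the function itself, so it is skipped.
mi_info(frame) = (frame.linfo,                # MethodInstance being inferred
                  frame.world,                # world age of the invocation
                  copy(frame.sptypes),        # static parameter types
                  frame.slottypes[2:end])     # argument types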

@NHDaly (Member, Author) commented Oct 6, 2020

After a good amount of profiling, I have convinced myself that the overhead during snooping is already quite reasonably low.

For example, to snoop @code_typed peakflops(), we see only 2 / 3806 samples in this measurement, both within the call to time_ns():

julia> @profile timing = SnoopCompileCore.@snoopi_deep begin
           @code_typed peakflops()
       end;

Then from PProf:

  Total:           2       3806 (flat, cum) 99.69%
    126            .          .           function typeinf(interp::AbstractInterpreter, frame::InferenceState) 
    127            .          .               if __measure_typeinf__[] 
    128            .          1                   Timings.enter_new_timer(frame.linfo.specTypes) 
    129            .          .            
    130            .       3799                   v = _typeinf(interp, frame) 
    131            .          .            
    132            1          1                   Timings.exit_current_timer(frame.linfo.specTypes) 
    133            .          .            
    134            1          1                   return v 
    135            .          .               else 
    136            .          4                   return _typeinf(interp, frame) 
    137            .          .               end 
    138            .          .           end 
    139            .          .            
    140            .          .           function _typeinf(interp::AbstractInterpreter, frame::InferenceState) 
    141            .          .               typeinf_nocycle(interp, frame) || return false # frame is now part of a higher cycle 
Active filters:
   focus=enter_new_timer
Showing nodes accounting for 1, 100% of 1 total
----------------------------------------------------------+-------------
      flat  flat%   sum%        cum   cum%   calls calls% + context 	 	 
----------------------------------------------------------+-------------
                                                 1   100% |   Core.Compiler.typeinf /Users/nathandaly/src/julia/usr/share/julia/base/compiler/typeinfer.jl:128 (inline)
         0     0%     0%          1   100%                | enter_new_timer /Users/nathandaly/src/julia/usr/share/julia/base/compiler/typeinfer.jl:43
                                                 1   100% |   time_ns /Users/nathandaly/src/julia/usr/share/julia/base/compiler/typeinfer.jl:25 (inline)
----------------------------------------------------------+-------------
Active filters:
   focus=exit_current_timer
Showing nodes accounting for 1, 100% of 1 total
----------------------------------------------------------+-------------
      flat  flat%   sum%        cum   cum%   calls calls% + context 	 	 
----------------------------------------------------------+-------------
                                                 1   100% |   Core.Compiler.typeinf /Users/nathandaly/src/julia/usr/share/julia/base/compiler/typeinfer.jl:132 (inline)
         0     0%     0%          1   100%                | exit_current_timer /Users/nathandaly/src/julia/usr/share/julia/base/compiler/typeinfer.jl:84
                                                 1   100% |   time_ns /Users/nathandaly/src/julia/usr/share/julia/base/compiler/typeinfer.jl:25 (inline)
----------------------------------------------------------+-------------

@timholy (Member) left a comment

To me this mostly seems good; of your list of remaining tasks, only the first needs to be resolved. The performance seems unlikely to be an issue (your measurements prove it), and your timing code seems reasonable. Maybe @KristofferC (as author of TimerOutputs) would have more to say, but I'd say you can safely convert this from draft to "real PR" form, and I expect we can merge this quite soon.

@NHDaly (Member, Author) commented Oct 7, 2020

Thanks Tim! :) Okay, then I guess this is mostly complete!

The only thing I'd like to address is maybe also adding some information to mi_info that identifies whether a function is being specialized for constants, and which ones, since that might help explain a function that's being inferred repeatedly. I think that's in the slottypes? I'll add that now. And maybe the sptypes? 😬 I'm worried this is just gonna be collecting half the info in the frame!
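
For illustration - hedged, since the exact representation isn't pinned down here: constant-propagated arguments show up in inference's slot types as Const wrappers rather than plain types, so something like this hypothetical helper could flag them:

# Hypothetical helper: detect constant-specialized arguments in the
# recorded slot types (Const is Core.Compiler's constant lattice element).
has_const_args(slottypes) = any(st -> isa(st, Core.Compiler.Const), slottypes)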

@Sacha0 (Member) left a comment

Fantastic! Thanks @NHDaly! :)

@NHDaly (Member, Author) left a comment

Thank you @Sacha0 for the careful review! :) I've applied your suggested changes. 👍 thanks!

Co-Authored-By: Sacha Verweij <sacha@stanford.edu>
@NHDaly force-pushed the nhd-snoopi-inferrable-set--TimerOutputs-profile branch from 076dd44 to 663f47f on October 10, 2020 03:00
NHDaly added a commit to NHDaly/SnoopCompile.jl that referenced this pull request Oct 10, 2020
@timholy (Member) left a comment

Sorry to ask for more stuff, but I do think this is the last before we merge.

@timholy (Member) commented Oct 10, 2020

> would you mind also reviewing #37136, I want to ensure they're both merged before the 1.6 freeze

I strongly approve of the idea of #37136, and it looks pretty straightforward, but I'd be lying if I said I was the right person to review it. I've barely glanced at the LLVM side of compilation. If you've gotten approvals from others I think it should be good to merge.

@vchuravy requested a review from vtjnash October 10, 2020 20:56
@Sacha0 (Member) left a comment

Looks great! Thanks Nathan! :)

@Sacha0 requested a review from timholy October 12, 2020 20:51
@timholy merged commit ebc63b9 into JuliaLang:master Oct 12, 2020
@timholy (Member) commented Oct 12, 2020

Thanks for sticking with it, Nathan! Looking forward to this.

@NHDaly deleted the nhd-snoopi-inferrable-set--TimerOutputs-profile branch October 12, 2020 22:28
@NHDaly (Member, Author) commented Oct 12, 2020

:) Thanks @timholy! I appreciate your help guiding it to the finish! I'm looking forward to this too!

@timholy (Member) commented Oct 19, 2020

I've played with this a bit. Could we add a flag so that only new inferences are tracked? I piped the output to a file via AbstractTrees and I'm getting lots of stuff like

│     └─ MethodInstance for Attributes(::Base.Iterators.Pairs{Symbol, Any, Tuple{Symbol, Symbol, Symbol}, NamedTuple{(:color, :strokewidth, :raw), Tuple{Observable{Any}, Int64, Bool}}})
│        └─ MethodInstance for collect(::Base.Iterators.Pairs{Symbol, Any, Tuple{Symbol, Symbol, Symbol}, NamedTuple{(:color, :strokewidth, :raw), Tuple{Observable{Any}, Int64, Bool}}})
│           ├─ MethodInstance for (::Colon)(::Int64, ::Int64)
│           │  └─ MethodInstance for UnitRange{Int64}(::Int64, ::Int64)
│           │     ├─ MethodInstance for convert(::Type{Int64}, ::Int64)
│           │     └─ MethodInstance for unitrange_last(::Int64, ::Int64)
│           │        ├─ MethodInstance for >=(::Int64, ::Int64)
│           │        │  └─ MethodInstance for <=(::Int64, ::Int64)
│           │        ├─ MethodInstance for -(::Int64, ::Int64)
│           │        └─ MethodInstance for convert(::Type{Int64}, ::Int64)

I count a total of 75 instances of (::Colon)(::Int64, ::Int64) just up to depth 10. If folks want to browse a text-dump they will definitely appreciate a shorter list.

Alternatively, it might be better to record whether this was a cache-lookup or a new inference. Then perhaps FlameGraphs could process the list and exclude cache-lookups much in the way that it (by default) handles C calls.

@timholy (Member) commented Oct 19, 2020

Perhaps like you, I am also a bit puzzled about the results. Why do we have a bazillion

│  ├─ 0.072064: MethodInstance for (::Colon)(::Int64, ::Int64)
│  │  └─ 0.110615: MethodInstance for UnitRange{Int64}(::Int64, ::Int64)
│  │     ├─ 0.046747: MethodInstance for convert(::Type{Int64}, ::Int64)
│  │     └─ 0.134257: MethodInstance for unitrange_last(::Int64, ::Int64)
│  │        ├─ 0.036451: MethodInstance for -(::Int64, ::Int64)
│  │        └─ 0.019563: MethodInstance for convert(::Type{Int64}, ::Int64)

where the times are printed in milliseconds? This suggests it's spending 100μs inferring (presumably looking up from cache) the UnitRange{Int64} constructor as called by (::Colon)(::Int64, ::Int64). But if you've already inferred the return type of that call, why do you have to repeatedly dive into the callees? The UnitRange constructor doesn't even get inlined:

julia> f2(i, j) = i:j
f2 (generic function with 1 method)

julia> @code_typed f2(1, 3)
CodeInfo(
1 ─ %1 = Base.sle_int(i, j)::Bool
│   %2 = Base.sub_int(i, 1)::Int64
│   %3 = Base.ifelse(%1, j, %2)::Int64
│   %4 = %new(UnitRange{Int64}, i, %3)::UnitRange{Int64}
└──      return %4
) => UnitRange{Int64}

which makes me think there's definitely no reason to reinfer this.

I seem to remember that you opened an issue about this, but I am not finding it.

@NHDaly (Member, Author) commented Oct 19, 2020

> Could we add a flag so that only new inferences are tracked?

@timholy I am 60% sure that indeed only new inferences are being tracked.

That is, from what I understand, we are only entering this function if we need to infer something and it's not already in the cache. At least, that's what I intended. Did I not do that right?

My impression is that indeed the results you're seeing are from actual duplicated inference; likely because of inferring constants! I did raise a question about this here, but unfortunately we ended up discussing it offline so the results aren't recorded:
#37535 (comment)

My understanding is that what you're seeing can be explained by the following:

  • We do cache MethodInstances specialized for a constant within a given inferable set. But we do not cache MethodInstances specialized for a constant across inferable sets.
  • And, we sometimes have to infer the same instance multiple times within recursion, though I still do not understand this case well.

@NHDaly (Member, Author) commented Oct 19, 2020

I think we should augment the AbstractTrees view to include the Const() info in the printed message for each node; maybe by replacing that argument with the Const?
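
A hypothetical sketch of that augmentation, assuming the (mi, world, sptypes, slottypes) payload recorded earlier in this thread (field names illustrative, not the final API):

using AbstractTrees

# Hypothetical: assumes t.mi_info is the (mi, world, sptypes, slottypes)
# tuple recorded above, and that AbstractTrees.children is defined to
# return t.children.
function AbstractTrees.printnode(io::IO, t::Core.Compiler.Timings.Timing)
    mi, world, sptypes, slottypes = t.mi_info
    print(io, "MethodInstance for ", mi)
    # Surface any constant-specialized arguments explicitly:
    consts = [st for st in slottypes if isa(st, Core.Compiler.Const)]
    isempty(consts) || print(io, " -- constants: ", consts)
end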

@timholy (Member) commented Oct 19, 2020

See the modified printnode method here.

@NHDaly (Member, Author) commented Oct 20, 2020

Ah yeah, perfect!

Sacha0 pushed a commit to Sacha0/julia that referenced this pull request Nov 9, 2020
… call, via SnoopCompile (JuliaLang#37749)

NHDaly added a commit to NHDaly/julia that referenced this pull request Nov 13, 2020
… call, via SnoopCompile (JuliaLang#37749)