
Reducing latency: acceptable strategies discussion #792

@timholy

I'm opening this to ask about what changes the devs here are willing to accept in order to reduce latency. As a case study, let's consider AbstractPlotting.draw_axis2d, which has a mere 43 arguments. If I run the test suite and then do

using MethodAnalysis
mis = methodinstances(AbstractPlotting.draw_axis2d)

it gives me 8 inferred MethodInstances. I've edited the output and stashed it in this gist, turning it into a compile-time benchmark. Here's what I get:

julia> include("/tmp/draw_axis2d_demo.jl")
  0.000012 seconds (5 allocations: 976 bytes)   # the all-Any MethodInstance
  4.711192 seconds (11.49 M allocations: 681.750 MiB, 7.89% gc time, 100.00% compilation time)
  0.000014 seconds (5 allocations: 976 bytes)   # the mostly-Any MethodInstance
  0.142401 seconds (232.42 k allocations: 12.758 MiB, 99.99% compilation time)
  0.111768 seconds (138.38 k allocations: 6.993 MiB, 99.98% compilation time)
  0.181831 seconds (349.69 k allocations: 19.633 MiB, 99.99% compilation time)
  0.094713 seconds (176.63 k allocations: 9.740 MiB, 99.98% compilation time)
  0.084915 seconds (105.29 k allocations: 5.205 MiB, 99.97% compilation time)
true

Obviously, the first "real" one (not the all-Any) takes a lot of time because it's also compiling a bunch of dependent functions that can be partially reused on future compiles. But the noteworthy thing here is that later "real" instances each take ~100ms. And this is just one method (and its callees), one that is not horrifically huge by Makie/AbstractPlotting/Layout standards.
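If you'd like to reproduce this kind of measurement without the gist, the basic idea can be sketched with `precompile`, which compiles a method for a given argument-type signature without calling it. (`slow_kernel` below is a made-up stand-in, not AbstractPlotting code; the gist's actual benchmark differs.)

```julia
# Time compilation itself for specific signatures.
# `slow_kernel` is an invented stand-in for a big method like draw_axis2d.
slow_kernel(x, y) = sum(x .+ 2 .* y)

@time precompile(slow_kernel, (Vector{Float64}, Vector{Float64}))  # pays the compile cost
@time precompile(slow_kernel, (Vector{Float64}, Vector{Float64}))  # already compiled: ~free
@time precompile(slow_kernel, (Vector{Int}, Vector{Int}))          # new signature: compiles again
```

Each distinct argument-type combination triggers a fresh round of inference and codegen, which is exactly the cost the timings above are measuring.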

Let's try to get a more systematic view. This is complex, so let me just show you how to investigate this yourself. First, you need to be running the master branch of SnoopCompile. We're going to try this on a demo that matches the user experience, i.e., the first plot in a new session:

julia> using SnoopCompile, AbstractPlotting

julia> tinf = @snoopi_deep scatter(rand(10^4), rand(10^4))
Core.Compiler.Timings.Timing(InferenceFrameInfo for Core.Compiler.Timings.ROOT()) with 658 children

julia> tminfo = flatten_times(tinf);

julia> count_time = Dict{Method,Tuple{Int,Float64}}();

julia> for (t, info) in tminfo
           mi = info.mi
           if isa(mi.def, Method)
               m = mi.def::Method
               n, tm = get(count_time, m, (0, 0.0))
               count_time[m] = (n + 1, tm + t)
           end
       end

julia> sort!(collect(count_time); by = pr -> pr.second[2])
2131-element Vector{Pair{Method, Tuple{Int64, Float64}}}:
< lots of output >

This is a list of method => (number of instances, total time to compile *just* this method and not its callees) pairs. (To emphasize, that's different from what we measured above with draw_axis2d, where the time was inclusive of the compile time for the callees.) You'll see one object, ROOT, which measures all time outside of inference (including codegen). But more than half the time is inference.
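The (count, total-time) aggregation idiom in that loop is self-contained and can be tried on plain data with no SnoopCompile dependency (the keys and timing values here are made up for illustration):

```julia
# Aggregate a (count, total) pair per key, mirroring the count_time loop above.
samples = [("inferA", 0.1), ("inferB", 0.5), ("inferA", 0.2)]

count_time = Dict{String,Tuple{Int,Float64}}()
for (key, t) in samples
    n, tm = get(count_time, key, (0, 0.0))   # default (0, 0.0) for unseen keys
    count_time[key] = (n + 1, tm + t)
end

# Sorting by total time puts the worst offender last, as in the session above:
sorted = sort!(collect(count_time); by = pr -> pr.second[2])
```

The `get` with a default tuple avoids a separate `haskey` branch, and sorting the collected pairs by `pr.second[2]` orders methods by their cumulative self-inference time.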

I encourage you to run this yourself; there are some interesting takeaways. Here's a log plot of the time of each of these 2131 methods (measuring the self-inference time, in seconds):

[Figure inference_time: log-scale plot of per-method self-inference time, in seconds, for each of the 2131 methods]

You can see it's really dominated by a "few" bad players: 171 methods each account for more than 0.01s of total inference time across all of their argument-type combinations. Some of these, like setproperty!, have hundreds of instances (and are expected to), but quite a few of the dominant methods have relatively few MethodInstances (here I'm showing just the last, most dominant 171 methods):

[Figure count_end: number of MethodInstances for each of the 171 most dominant methods]

Moreover, many of the methods that have too many instances to precompile directly might still get precompiled indirectly, via callers that have fewer instances (that works for as many calls as inference succeeds on, so YMMV).

There is a pretty natural solution, one I've discussed before: https://docs.julialang.org/en/v1/manual/style-guide/#Handle-excess-argument-diversity-in-the-caller. The idea is if you have

foo(x, y, z, ...)

you can design it like this:

foo(x::TX, y::TY, z::TZ, ...) = # the "real" version, big and slow to compile
foo(x, y, z, ...) = foo(convert(TX, x)::TX, convert(TY, y)::TY, convert(TZ, z)::TZ, ...)  # tiny and fast to compile

Obviously that's a pretty big change from the code base as it is right now. Certainly, the user's data might come in any number of variants, but a lot of AbstractPlotting's internals seem to consist of passing around data computed from it: things like axis limits, fonts, etc. These seem much more standardizable.
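For concreteness, here is a minimal sketch of that pattern (`Point2f` and `draw_marker` are invented for illustration; they are not AbstractPlotting's actual types or API):

```julia
# A concrete "standardized" type that internal code agrees on.
struct Point2f
    x::Float32
    y::Float32
end
Base.convert(::Type{Point2f}, t::Tuple{Real,Real}) = Point2f(t[1], t[2])

# The "real" method: concretely typed, so it compiles exactly once.
function draw_marker(p::Point2f, size::Float32)
    # ... imagine lots of expensive-to-compile work here ...
    (p.x * size, p.y * size)
end

# The thin front end: accepts anything, converts, and forwards. Only this
# tiny method gets re-specialized for each new input-type combination.
draw_marker(p, size) = draw_marker(convert(Point2f, p)::Point2f, Float32(size))
```

Now `draw_marker((1, 2), 3)` and `draw_marker(Point2f(1, 2), 3.0f0)` both funnel into the same expensive specialization; the diversity of user-facing argument types only costs recompiles of the cheap forwarding method.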

Anyway, I've decided this is more than I could tackle solo, so I'm not going to implement this without buy-in from Makie's developers. But if there is interest I'm happy to help, especially to teach the tools so that you can run these diagnostics yourself. I'm slowly getting towards a big new release of SnoopCompile, but I've realized I need a real-world test bed (beyond what I got in JuliaLang/julia#38906, since getting changes made to Base is probably off the table for many packages) and Makie seems like a good candidate.
