Precompilation of DataFrames #1502

Closed
bkamins opened this issue Sep 5, 2018 · 23 comments
Labels: non-breaking, performance

@bkamins (Member) commented Sep 5, 2018

I think it would be good to put some of the most common cases of functions used in DataFrames.jl into the module to force their precompilation, as this would improve the user experience (and DataFrames.jl is probably one of the packages that new users start with). Any opinions on whether we should do it and how? (In the worst case I could manually list the cases I think are relevant and code it by hand.)

@nalimilan (Member):

Good idea. We should also do this for dependencies, in particular CategoricalArrays. A reasonable way of doing that would be to run the full test suite under SnoopCompile.jl.

@quinnj (Member) commented Sep 5, 2018

Honestly, I haven't found precompile statements to be very helpful (or at least worth the effort to generate them). The problem is that precompile still doesn't generate machine code, so it's really just the lowering and type-inference cost that you're avoiding. What has been more useful is to include a couple of simple calls in the top-level __init__ function that get called every time the module is loaded. Yes, it adds a little extra time to the initial module loading, but IMO it can be worth it to pay the penalty there instead of the first time a user calls a function.
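For illustration, a minimal sketch of that approach, assuming a hypothetical wrapper module (the module name and the warm-up calls are illustrative, not actual DataFrames.jl code):

module MyTablesApp

using DataFrames

# Run a tiny, representative workload so the corresponding methods are
# compiled when the module is loaded rather than at the first user call.
function _warmup()
    df = DataFrame(a=[1, 2], b=[3.0, 4.0])
    first(df, 1)
    describe(df)
    return nothing
end

__init__() = _warmup()

end # module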

@nalimilan (Member):

Given how big the DataFrames API is, it doesn't really sound possible to call all functions from __init__. At least setting up precompilation will be useful in preparation for a future Julia version where native code will be stored.

@KristofferC (Contributor):

+1 for precompile statements not helping much unless they are paired with sysimg building (and then they help tremendously, cf. the Julia REPL).

@nalimilan (Member):

While discussing the h2oai benchmarks, it appeared that it would be useful to provide a function or script to compile functions for all common types. That would allow reporting times net of the initial compilation cost, which is generally only paid once, and in particular would allow separating it from the cost of specializing some functions on the particular operation at hand. See discussion at h2oai/db-benchmark#69.

I guess an easy way to do that would be to copy the output of SnoopCompile to an unexported function, which could be called by the h2oai benchmark, and possibly by users who want to put that in their .juliarc.
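A sketch of what that unexported function might look like (the two precompile signatures here are illustrative placeholders, not actual SnoopCompile output):

# Hypothetical unexported helper; users or benchmark scripts would call
# DataFrames._precompile_() explicitly, e.g. from startup.jl.
function _precompile_()
    precompile(Tuple{typeof(groupby), DataFrame, Symbol})
    precompile(Tuple{typeof(combine), GroupedDataFrame{DataFrame}, Pair{Symbol, typeof(sum)}})
    return nothing
end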

@bkamins (Member, Author) commented Jan 23, 2019

I think it would be worth doing, at least to understand what should be compiled and how long that compilation takes. Then people who want to build a custom sysimg instead of using .juliarc (which is now ~/.julia/config/startup.jl, right?) would also have an easier life.

@nalimilan (Member):

Right. Let's try using SnoopCompile while running the full test suite.

@bkamins (Member, Author) commented Mar 23, 2020

+1 for precompile statements not helping much unless they are paired with sysimg building

@KristofferC Is this still true for Julia 1.4?

Here is a comparison (I am annotating the by function with @nospecialize; not sure if this matters).

Without precompile:

julia> @time using DataFrames
  0.924471 seconds (1.39 M allocations: 88.058 MiB)

julia> df = DataFrame(rand(10,4));

julia> @time by(df, :x1, [:x2, :x3] => +);
  4.523086 seconds (15.01 M allocations: 759.657 MiB, 8.18% gc time)

With precompile:

julia> @time using DataFrames
  1.311571 seconds (1.60 M allocations: 103.156 MiB)

julia> df = DataFrame(rand(10,4));

julia> @time by(df, :x1, [:x2, :x3] => +);
  3.059164 seconds (8.85 M allocations: 458.088 MiB, 2.41% gc time)

So using precompile increases load time by ~0.4 s but decreases the time to first result by ~1.4 s, saving about 1 second in total.

@KristofferC (Contributor):

It might have changed or I didn't have a correct assessment earlier. I know Plots.jl has added some and it seems to help there. So if the numbers say it helps, I say go for it.

@bkamins (Member, Author) commented Mar 24, 2020

@nalimilan - so I will put all the new @nospecialize changes in #2158 and will think about what can be proposed for combining @nospecialize and precompile in this issue (I am not yet sure how they interact in practice and have to investigate).

@bkamins (Member, Author) commented May 13, 2020

@nalimilan - I have checked what we have now. We could handle most of the compilation latency after the package is loaded if we included the following in __init__:

let
    df = DataFrame(a=1, b=2)
    g = groupby(df, :a)
    combine(g, :a => sum)                      # grouping/aggregation path
    innerjoin(df, df, on=:a, makeunique=true)  # join path
    unstack(stack(df, :a))                     # reshaping path
end;

It would make the package startup time grow from 0.9 s to 5.5 s. So for now this is certainly prohibitive, but maybe once we separate out DataFramesBase.jl it could actually be acceptable to have such code in DataFrames.jl? It is not unusual for other packages to load in 5-10 seconds (not that users like this, of course).

(We could alternatively use precompile statements, but I think it is simplest just to run the functions.)

@KristofferC (Contributor) commented May 13, 2020

I really dislike packages running unnecessary code at startup. What if I only use the DataFrames functionality sporadically in my session? Why add 5 extra seconds of completely avoidable latency for me? It can also introduce invalidations that cause other things to be recompiled.

Unconditionally shifting eventual compilation cost to package load time seems like a very bad trade-off.

Ref: JuliaData/CSV.jl#325.

@bkamins (Member, Author) commented May 13, 2020

@KristofferC - sure.

BTW, are you aware of the current state of the work towards better caching of compiled code between Julia sessions?

@nalimilan (Member):

I agree it doesn't sound like a good trade-off to impose 5 s on every load to make things faster afterwards. And we never know which functions users will call, so we'd have to do that for all functions...

Would adding these to precompilation give some speedup? With SnoopCompile it should be easy to transform a script into a series of precompile statements.
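For reference, that workflow would look roughly like this (a sketch based on the SnoopCompile 1.x API; the script name and output path are placeholders):

using SnoopCompile

# Record inference times for everything compiled while the script runs;
# tmin filters out methods whose inference is cheap.
inf_timing = @snoopi tmin=0.01 include("typical_workload.jl")

# Group the results by package and emit precompile() statements that can
# be included from the package sources.
pc = SnoopCompile.parcel(inf_timing)
SnoopCompile.write("/tmp/precompile", pc)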

@bkamins (Member, Author) commented Sep 23, 2020

As discussed with @nalimilan - for now we can add an unexported DataFrames.compile() (or something similar) to DataFrames.jl so that users can opt in to compiling the most common methods (it would be used in the H2O benchmarks, for example).

Also, this function could be used by users in combination with PackageCompiler.jl.

@nalimilan - I am assigning you to this issue just to make sure it is kept track of, as this is an old issue. I hope it is OK with you.
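As a sketch, such an opt-in function could simply run a small representative workload; the name compile and the body below are hypothetical, mirroring the __init__ example discussed above:

# Hypothetical opt-in entry point; nothing runs unless the user calls it.
function compile()
    df = DataFrame(a=1, b=2)
    g = groupby(df, :a)
    combine(g, :a => sum)
    innerjoin(df, df, on=:a, makeunique=true)
    unstack(stack(df, :a))
    return nothing
end

A benchmark script would then call DataFrames.compile() once before the timed runs.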

bkamins added the non-breaking label on Sep 23, 2020
@nalimilan (Member):

For reference, taking the radical approach of precompiling everything that is used by the tests (#2456, which takes more than 3 minutes!), here are the timings I get in a fresh session:

julia> @time using DataFrames
  3.241466 seconds (2.29 M allocations: 187.625 MiB, 1.72% gc time)

julia> df = DataFrame(x=[1, 1, 2, 2], y=rand(4));

julia> @time combine(groupby(df, :x), :y => sum);
  3.899736 seconds (4.45 M allocations: 236.128 MiB, 8.39% gc time)

As opposed to current master:

julia> @time using DataFrames
  0.981128 seconds (1.23 M allocations: 81.321 MiB)

julia> df = DataFrame(x=[1, 1, 2, 2], y=rand(4));

julia> @time combine(groupby(df, :x), :y => sum);
  6.365692 seconds (16.59 M allocations: 850.439 MiB, 7.69% gc time)

So it looks like precompilation can make a difference, but at the cost of increasing the load time. Though clearly we don't want to precompile so many things. Maybe we could identify a few priorities which would give a reasonable tradeoff.

@bkamins (Member, Author) commented Sep 27, 2020

Thank you for working on it. Clearly there is too much in the precompilation (e.g. there are many very specific precompile statements that refer to concrete column names like :a, :b, etc.). How would you prefer to approach pruning the proposed list?
In general I think we should precompile the methods related to core functionality: constructors, groupby, combine, sort, etc.

There is also the issue of maintaining the precompile statements: when we change functionality, the precompile body should change too. Should we regenerate it every time we make a release?

@nalimilan (Member):

If we keep track of the code needed to generate the precompile statements, we can run it for each new major release without too much work. It's not the end of the world if it's not completely up to date though.

One way to compile only the major functions is to use @snoopi tmin=0.1 include("test/runtests.jl"). With that threshold I get about 200 precompile statements, which make precompilation take 30s at the first using DataFrames (versus 6s on master). This gives interesting speedups already (for some reason this one is even a bit faster than with the full precompilation):

julia> @time using DataFrames
  1.832080 seconds (1.73 M allocations: 126.133 MiB)

julia> df = DataFrame(x=[1, 1, 2, 2], y=rand(4));

julia> @time combine(groupby(df, :x), :y => sum);
  3.611778 seconds (5.33 M allocations: 281.904 MiB, 7.17% gc time)

We could use a higher threshold -- but with tmin=0.5 I get only ~5 statements.

@bkamins (Member, Author) commented Sep 27, 2020

Actually, for the H2O benchmark I think it would even make sense to build a custom Julia image (I guess this is OK under the H2O rules, as a user would normally do this).
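A sketch of what that could look like with PackageCompiler.jl (the sysimage path and workload file name are placeholders; kwargs per the PackageCompiler 1.x API):

using PackageCompiler

# Bake DataFrames (plus whatever the workload file exercises) into a custom
# system image, then start Julia with `julia --sysimage df_sysimage.so`.
create_sysimage([:DataFrames];
                sysimage_path="df_sysimage.so",
                precompile_execution_file="h2o_workload.jl")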

@nalimilan (Member):

Actually, for the H2O benchmark I think it would even make sense to build a custom Julia image (I guess this is OK under the H2O rules, as a user would normally do this).

That would be more complex for a limited gain, wouldn't it?

@bkamins (Member, Author) commented Sep 27, 2020

It would, but it would "kind of" show the real recommended deployment pattern.

@KristofferC (Contributor):

Often it isn't too bad to write the statements manually. Having a few well targeted statements usually gets you quite a long way.

@nalimilan (Member):

Fixed by #2456.
