precompilation for 1.4 release #3182

bkamins · 2022-09-26T14:49:34Z

Timing comparisons are below. The conclusion (pretty obvious) is that what we precompile is faster, and what we do not precompile is comparable or slower (note that DataFrames.jl 1.3.6 also did precompilation, but using a different mechanism).

So:

- We should include in precompilation everything that requires a lot of compilation.
- However, we should limit ourselves to common methods only (otherwise we will increase package load time by having many precompile statements that are usually not needed).

Conclusion: we need to think carefully about the list of things we put in precompilation. Please comment on what you think we should put there.
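As a rough illustration of the idea (not the content of `src/other/precompile.jl`), a precompilation workload can be sketched as a function that calls the common entry points with typical argument types while the package cache is being generated. The module `ToyFrames`, the function `colsum`, and the helper `_precompile_workload` are hypothetical names, and the `jl_generating_output` guard is a commonly used idiom that relies on a Julia internal.

```julia
module ToyFrames  # hypothetical package, not DataFrames.jl

export colsum

# A small generic function standing in for a package's public API.
colsum(cols::Dict{Symbol,<:AbstractVector}) =
    Dict(name => sum(col) for (name, col) in cols)

# Workload: call the common entry points with the argument types users
# are most likely to pass, so that compilation work for them is done
# while the package is being precompiled.
function _precompile_workload()
    colsum(Dict(:a => [1, 2, 3]))
    colsum(Dict(:a => [1.0, 2.0, 3.0]))
    return nothing
end

# Run the workload only while Julia is generating the package cache,
# so a normal load of the module does not execute it again.
if ccall(:jl_generating_output, Cint, ()) == 1
    _precompile_workload()
end

end # module

using .ToyFrames
result = colsum(Dict(:a => [1, 2, 3]))
```

Every call added to such a workload makes the cache larger, which is the load-time cost mentioned above, so the list is best kept to calls that many users actually make.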

I have chosen operations that are not included in the precompilation statements for the 1.4 release:

DataFrames.jl 1.3.6

julia> @time using DataFrames
  1.887191 seconds (3.37 M allocations: 232.080 MiB, 4.77% gc time)

julia> df = DataFrame(rand(10, 10), :auto);

julia> @time using DataFrames
  1.637969 seconds (3.37 M allocations: 232.055 MiB, 5.40% gc time)

julia> @time df = DataFrame(rand(10, 10), :auto);
  0.008176 seconds (2.24 k allocations: 131.895 KiB, 98.99% compilation time)

julia> @time select(df, :x2, Not(:x2));
  0.341841 seconds (182.34 k allocations: 9.533 MiB, 99.75% compilation time)

julia> @time combine(df, identity);
  0.189355 seconds (743.18 k allocations: 39.859 MiB, 4.95% gc time, 99.82% compilation time)

julia> @time leftjoin(df, df, on=:x3, makeunique=true);
  2.166899 seconds (2.12 M allocations: 103.064 MiB, 0.94% gc time, 99.96% compilation time)

julia> @time outerjoin(df, df, on=:x3, makeunique=true);
  0.040364 seconds (70.89 k allocations: 3.688 MiB, 99.30% compilation time)

julia> @time transform(df, :x1 => sum);
  0.186119 seconds (531.35 k allocations: 30.123 MiB, 99.68% compilation time)

julia> @time combine(groupby(df, :x4), :x1 => sum);
  1.430754 seconds (3.33 M allocations: 179.667 MiB, 3.18% gc time, 99.90% compilation time)

julia> @time select!(df, Not(r"x"));
  0.037020 seconds (92.96 k allocations: 4.885 MiB, 99.79% compilation time)

DataFrames.jl `main`

julia> @time using DataFrames
  1.997317 seconds (3.66 M allocations: 232.418 MiB, 3.50% gc time, 32.39% compilation time: 100% of which was recompilation)

julia> df = DataFrame(rand(10, 10), :auto);

julia> @time select(df, :x2, Not(:x2));
  0.479201 seconds (107.14 k allocations: 5.558 MiB, 99.92% compilation time)

julia> @time combine(df, identity);
  0.109145 seconds (294.37 k allocations: 15.965 MiB, 99.74% compilation time)

julia> @time leftjoin(df, df, on=:x3, makeunique=true);
  2.263144 seconds (1.75 M allocations: 83.771 MiB, 7.21% gc time, 99.96% compilation time)

julia> @time outerjoin(df, df, on=:x3, makeunique=true);
  0.053238 seconds (26.19 k allocations: 1.419 MiB, 99.40% compilation time: 20% of which was recompilation)

julia> @time transform(df, :x1 => sum);
  0.102747 seconds (93.63 k allocations: 5.107 MiB, 99.68% compilation time)

julia> @time combine(groupby(df, :x4), :x1 => sum);
  1.187194 seconds (1.20 M allocations: 64.261 MiB, 1.52% gc time, 99.94% compilation time)

julia> @time select!(df, Not(r"x"));
  0.028480 seconds (69.57 k allocations: 3.642 MiB, 99.80% compilation time)
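To read the two sessions above side by side, the first-call timings can be turned into ratios. This is only a sketch: the numbers are copied from the single runs shown above, so small differences may be noise, and the names `timings` and `ratios` are made up for this illustration.

```julia
# First-call timings in seconds: (operation, 1.3.6, main).
timings = [
    ("select",          0.341841, 0.479201),
    ("combine",         0.189355, 0.109145),
    ("leftjoin",        2.166899, 2.263144),
    ("outerjoin",       0.040364, 0.053238),
    ("transform",       0.186119, 0.102747),
    ("groupby+combine", 1.430754, 1.187194),
    ("select!",         0.037020, 0.028480),
]

# A ratio above 1 means `main` was slower than 1.3.6 on the first call.
ratios = [(name, round(new / old; digits=2)) for (name, old, new) in timings]

for (name, r) in ratios
    println(rpad(name, 16), r)
end
```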

I have chosen operations that are included in the precompilation statements for the 1.4 release:

DataFrames.jl 1.3.6

julia> using DataFrames, PooledArrays, Statistics

julia> @time begin
           df = DataFrame(a=[2, 5, 3, 1, 0], b=["a", "b", "c", "a", "b"], c=1:5,
                          p=PooledArray(["a", "b", "c", "a", "b"]),
                          q=[true, false, true, false, true],
                          f=Float64[2, 5, 3, 1, 0])
           describe(df)
           names(df[1, 1:2])
           sort(df, :a)
           combine(df, :c, [:c :f] .=> [sum, mean, std], :c => :d, [:a, :c] => cor)
           transform(df, :c, [:c :f] .=> [sum, mean, std], :c => :d, [:a, :c] => cor)
           groupby(df, :a)
           groupby(df, :q)
           groupby(df, :p)
           gdf = groupby(df, :b)
           combine(gdf, :c, [:c :f] .=> [sum, mean, std], :c => :d, [:a, :c] => cor)
           transform(gdf, :c, [:c :f] .=> [sum, mean, std], :c => :d, [:a, :c] => cor)
           innerjoin(df, df, on=:a, makeunique=true)
           innerjoin(df, df, on=:b, makeunique=true)
           innerjoin(df, df, on=:c, makeunique=true)
           outerjoin(df, df, on=:a, makeunique=true)
           outerjoin(df, df, on=:b, makeunique=true)
           outerjoin(df, df, on=:c, makeunique=true)
           semijoin(df, df, on=:a)
           semijoin(df, df, on=:b)
           semijoin(df, df, on=:c)
           leftjoin!(df, DataFrame(a=[2, 5, 3, 1, 0]), on=:a)
           leftjoin!(df, DataFrame(b=["a", "b", "c", "d", "e"]), on=:b)
           leftjoin!(df, DataFrame(c=1:5), on=:c)
           reduce(vcat, [df, df])
           show(IOBuffer(), df)
           subset(df, :q)
           @view df[1:3, :]
           @view df[:, 1:2]
           select!(df, :c, [:c :f] .=> [sum, mean, std], :c => :d, [:a, :c] => cor)
       end;
 13.120280 seconds (28.63 M allocations: 1.457 GiB, 2.40% gc time, 99.84% compilation time)

DataFrames.jl `main`

julia> using DataFrames, PooledArrays, Statistics

julia> @time begin
           df = DataFrame(a=[2, 5, 3, 1, 0], b=["a", "b", "c", "a", "b"], c=1:5,
                          p=PooledArray(["a", "b", "c", "a", "b"]),
                          q=[true, false, true, false, true],
                          f=Float64[2, 5, 3, 1, 0])
           describe(df)
           names(df[1, 1:2])
           sort(df, :a)
           combine(df, :c, [:c :f] .=> [sum, mean, std], :c => :d, [:a, :c] => cor)
           transform(df, :c, [:c :f] .=> [sum, mean, std], :c => :d, [:a, :c] => cor)
           groupby(df, :a)
           groupby(df, :q)
           groupby(df, :p)
           gdf = groupby(df, :b)
           combine(gdf, :c, [:c :f] .=> [sum, mean, std], :c => :d, [:a, :c] => cor)
           transform(gdf, :c, [:c :f] .=> [sum, mean, std], :c => :d, [:a, :c] => cor)
           innerjoin(df, df, on=:a, makeunique=true)
           innerjoin(df, df, on=:b, makeunique=true)
           innerjoin(df, df, on=:c, makeunique=true)
           outerjoin(df, df, on=:a, makeunique=true)
           outerjoin(df, df, on=:b, makeunique=true)
           outerjoin(df, df, on=:c, makeunique=true)
           semijoin(df, df, on=:a)
           semijoin(df, df, on=:b)
           semijoin(df, df, on=:c)
           leftjoin!(df, DataFrame(a=[2, 5, 3, 1, 0]), on=:a)
           leftjoin!(df, DataFrame(b=["a", "b", "c", "d", "e"]), on=:b)
           leftjoin!(df, DataFrame(c=1:5), on=:c)
           reduce(vcat, [df, df])
           show(IOBuffer(), df)
           subset(df, :q)
           @view df[1:3, :]
           @view df[:, 1:2]
           select!(df, :c, [:c :f] .=> [sum, mean, std], :c => :d, [:a, :c] => cor)
       end;
  9.799926 seconds (2.06 M allocations: 96.399 MiB, 0.25% gc time, 99.82% compilation time: 15% of which was recompilation)

nalimilan

That's a much simpler way of handling precompilation!

How much does this add to the loading time compared with not precompiling anything?

nalimilan · 2022-09-26T17:22:09Z

src/other/precompile.jl

+    subset(df, :q)
+    @view df[1:3, :]
+    @view df[:, 1:2]
+    select!(df, :c, [:c :f] .=> [sum, mean, std], :c => :d, [:a, :c] => cor)


Why `select!` and not `select` or `transform!`?

bkamins · 2022-09-26

`transform` was called above, and it calls `select` internally. I could use `transform!` instead; it should not be that different. I will change it.
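For readers less familiar with the two functions, here is a minimal sketch of how they relate at the user level, assuming DataFrames.jl is installed. The data frame `df` and its columns are invented for the example; it shows only the documented behaviour, not the internal call path.

```julia
using DataFrames

df = DataFrame(a=1:3, b=4:6)

# `select` keeps only the columns that are requested or produced.
s = select(df, :a => sum)

# `transform` keeps all existing columns and appends the new one.
t = transform(df, :a => sum)

# Asking `select` for all columns plus the new one gives the same result.
same = t == select(df, :, :a => sum)
```

The `!` variants (`select!`, `transform!`) follow the same pattern but modify `df` in place.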

bkamins · 2022-09-26T17:53:51Z

No precompilation

First call

julia> @time using DataFrames
[ Info: Precompiling DataFrames [a93c6f00-e57d-5684-b7b6-d8193f3e46c0]
  4.837268 seconds (2.82 M allocations: 163.831 MiB, 0.71% gc time, 18.74% compilation time: 95% of which was recompilation)

Next calls

julia> @time using DataFrames
  1.218950 seconds (2.18 M allocations: 131.630 MiB, 1.96% gc time, 57.11% compilation time: 100% of which was recompilation)

With precompilation

First call

julia> @time using DataFrames
[ Info: Precompiling DataFrames [a93c6f00-e57d-5684-b7b6-d8193f3e46c0]
 18.193496 seconds (4.29 M allocations: 264.117 MiB, 0.32% gc time, 4.30% compilation time: 95% of which was recompilation)

Next calls

julia> @time using DataFrames
  1.991196 seconds (3.65 M allocations: 231.916 MiB, 3.51% gc time, 32.72% compilation time: 100% of which was recompilation)
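The trade-off in these numbers can be made explicit with some quick arithmetic. This is a rough sketch: the values are single measurements copied from this thread, the variable names are made up, and the workload comparison is against 1.3.6 (which had its own precompilation), so the saving is only indicative.

```julia
# Timings in seconds, copied from the measurements in this thread.
first_without = 4.837268    # first `using DataFrames`, no precompilation
first_with    = 18.193496   # first `using DataFrames`, with precompilation
load_without  = 1.218950    # later loads, no precompilation
load_with     = 1.991196    # later loads, with precompilation
work_136      = 13.120280   # precompiled workload on 1.3.6
work_main     = 9.799926    # same workload on `main`

first_cost  = first_with - first_without   # paid once, when the cache is built
load_cost   = load_with - load_without     # paid in every session
work_saving = work_136 - work_main         # saved when the workload is run

println("one-time precompile cost: ", round(first_cost; digits=3), " s")
println("extra load time:          ", round(load_cost; digits=3), " s")
println("workload saving:          ", round(work_saving; digits=3), " s")
```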

bkamins · 2022-09-26T18:12:04Z

@nalimilan - I have also checked that adding more statements to the precompilation code does not add much to the first load time, but it does improve things later.
So, if you have anything more in mind to add, please comment, and I will benchmark it and add it if it is beneficial.

bkamins · 2022-09-28T20:30:14Z

@nalimilan - given no suggestions I would merge this. We can always change the precompiled method list since it is non-breaking.

bkamins · 2022-09-29T07:56:06Z

Thank you!

bkamins added 2 commits September 26, 2022 16:37

precompilation for 1.4 release

c2737c0

remove @time

b17b703

bkamins added the ecosystem Issues in DataFrames.jl ecosystem label Sep 26, 2022

bkamins added this to the 1.4 milestone Sep 26, 2022

bkamins requested a review from nalimilan September 26, 2022 14:49

nalimilan reviewed Sep 26, 2022

change select! to transform!

54df021

bkamins added 2 commits September 26, 2022 21:17

Merge branch 'main' into bk/precompilation

a1ef4b1

Merge branch 'main' into bk/precompilation

9ae0d78

nalimilan approved these changes Sep 29, 2022

bkamins merged commit 7c1a888 into main Sep 29, 2022

bkamins deleted the bk/precompilation branch September 29, 2022 07:56
