[BREAKING] sync combine with select #2158

bkamins · 2020-03-21T12:13:01Z

I have made an initial implementation of the functionality discussed in #2156 and added a keepkeys kwarg. Here is an example:

julia> df = DataFrame(rand(3, 4))
3×4 DataFrame
│ Row │ x1       │ x2       │ x3       │ x4       │
│     │ Float64  │ Float64  │ Float64  │ Float64  │
├─────┼──────────┼──────────┼──────────┼──────────┤
│ 1   │ 0.181501 │ 0.794075 │ 0.86532  │ 0.417367 │
│ 2   │ 0.743854 │ 0.257419 │ 0.350772 │ 0.293578 │
│ 3   │ 0.431316 │ 0.922369 │ 0.862359 │ 0.362968 │

julia> @time by(df, [:x1, :x2], :x3 => ByRow(cos))
  6.172474 seconds (20.32 M allocations: 1.041 GiB, 6.61% gc time)
3×3 DataFrame
│ Row │ x1       │ x2       │ x3_cos   │
│     │ Float64  │ Float64  │ Float64  │
├─────┼──────────┼──────────┼──────────┤
│ 1   │ 0.181501 │ 0.794075 │ 0.648397 │
│ 2   │ 0.743854 │ 0.257419 │ 0.939108 │
│ 3   │ 0.431316 │ 0.922369 │ 0.650648 │

julia> @time by(df, [:x1, :x2], :x3 => ByRow(cos), [:x1, :x2] => (x1, x2) -> x1 ./ x2)
  0.907711 seconds (2.40 M allocations: 122.147 MiB, 2.84% gc time)
3×4 DataFrame
│ Row │ x1       │ x2       │ x3_cos   │ x1_x2_function │
│     │ Float64  │ Float64  │ Float64  │ Float64        │
├─────┼──────────┼──────────┼──────────┼────────────────┤
│ 1   │ 0.181501 │ 0.794075 │ 0.648397 │ 0.228569       │
│ 2   │ 0.743854 │ 0.257419 │ 0.939108 │ 2.88967        │
│ 3   │ 0.431316 │ 0.922369 │ 0.650648 │ 0.467618       │

julia> @time by(df, [:x1, :x2], :x3 => ByRow(cos), [:x1, :x2] => (x1, x2) -> x1 ./ x2, keepkeys=false)
  0.369644 seconds (775.51 k allocations: 40.180 MiB, 3.00% gc time)
3×2 DataFrame
│ Row │ x3_cos   │ x1_x2_function │
│     │ Float64  │ Float64        │
├─────┼──────────┼────────────────┤
│ 1   │ 0.648397 │ 0.228569       │
│ 2   │ 0.939108 │ 2.88967        │
│ 3   │ 0.650648 │ 0.467618       │

I have not updated the documentation or the tests, so CI should fail.

Also I have left most of the internals unchanged because:

- this minimizes the risk of error
- this allows an easy deprecation of the kwarg form of by
- this makes it possible in the future to opt in to passing a NamedTuple to a function instead of auto-splatting

The major problem I will have to look into is that the current implementation unfortunately puts significant strain on the compiler. Probably some more @nospecialize calls should be added (@nalimilan, you might have some experience with this, so could you please comment if you have tips to share?).
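For readers unfamiliar with the macro, here is a minimal sketch of how @nospecialize is applied to an argument. The helper name `describe_transform` is hypothetical and is not taken from DataFrames.jl; whether this reduces compilation cost in a given code path has to be measured.

```julia
# Hypothetical helper for illustration; not taken from DataFrames.jl.
# `@nospecialize` asks the compiler not to generate a separate
# specialization of this method for each concrete type of `fun`.
function describe_transform(@nospecialize(fun), colname::Symbol)
    return string(colname, "_", nameof(fun))
end

describe_transform(cos, :x3)   # builds a name in the style of the `x3_cos` column above
```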

bkamins · 2020-03-21T15:48:53Z

Just as a reference in 0.20.2 we have:

julia> @time by(df, [:x1, :x2], :x3 => x -> cos.(x))
  3.254445 seconds (11.90 M allocations: 587.842 MiB, 8.50% gc time)
3×3 DataFrame
│ Row │ x1       │ x2        │ x3_function │
│     │ Float64  │ Float64   │ Float64     │
├─────┼──────────┼───────────┼─────────────┤
│ 1   │ 0.797467 │ 0.205025  │ 0.867137    │
│ 2   │ 0.703998 │ 0.0357978 │ 0.691523    │
│ 3   │ 0.473485 │ 0.35953   │ 0.704904    │

(which is not super fast, but still takes 2x less time)

bkamins · 2020-03-21T17:03:10Z

I have made some optimizations to reduce compilation cost (in particular dropped NamedTuple in favor of Tuple internally which is enough for auto-splatting). The timings have improved:

julia> using DataFrames

julia> df = DataFrame(rand(10, 4));

julia> @time by(df, [:x1, :x2], :x3 => x -> cos.(x));
  3.969474 seconds (13.04 M allocations: 659.181 MiB, 7.89% gc time)

bkamins · 2020-03-21T18:46:45Z

Last commit is performance related only, after it:

julia> df = DataFrame(rand(10^7, 4));

julia> @time by(df, [:x1, :x2], :x3 => x -> cos.(x));
 16.423705 seconds (174.08 M allocations: 7.480 GiB, 5.93% gc time)

and on master

julia> df = DataFrame(rand(10^7, 4));

julia> @time by(df, [:x1, :x2], :x3 => x -> cos.(x));
 47.086273 seconds (293.70 M allocations: 12.963 GiB, 3.66% gc time)

The problem was that the check 20c9c88#diff-23657e51a9cc9e627fc153ba1e6e04c1L982 was very expensive (I earlier thought it should be optimized out by the compiler).

This signals type instability risk, but it seems we do not have it, so this is strange:

julia> df = DataFrame(rand(10^7, 4));

julia> f(x) = cos.(x)
f (generic function with 1 method)

julia> g(x)::Vector{Float64} = cos.(x) # make sure Julia should know the return type
g (generic function with 1 method)

julia> @time by(df, [:x1, :x2], :x3 => f); # more expensive due to compilation
 17.630734 seconds (174.08 M allocations: 7.480 GiB, 6.21% gc time)

julia> @time by(df, [:x1, :x2], :x3 => g);
 13.528831 seconds (160.43 M allocations: 6.842 GiB, 6.97% gc time)

julia> @time by(df, [:x1, :x2], :x3 => f); # here speed is the same
 13.110972 seconds (160.00 M allocations: 6.820 GiB, 7.13% gc time)

julia> @time by(df, [:x1, :x2], :x3 => g);
 13.037767 seconds (160.00 M allocations: 6.820 GiB, 7.09% gc time)

nalimilan · 2020-03-23T11:51:40Z

So now the PR is faster than master? That type instability is indeed unexpected. Have you checked with @code_warntype whether all types are correctly inferred?

(BTW, I'd rather use combine directly for performance checks, as by includes groupby which is itself a complex beast.)

bkamins · 2020-03-23T15:22:04Z

> So now the PR is faster than master? That type instability is indeed unexpected. Have you checked with @code_warntype whether all types are correctly inferred?

PR is faster than master and is ~ as fast as 0.20.2. I will check @code_warntype when we stabilize the design.
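A minimal sketch of the kind of inference check being discussed, using only the standard library. The function `unstable` is a hypothetical example of a type-unstable function and is unrelated to the PR code.

```julia
using Test               # provides @inferred
using InteractiveUtils   # provides @code_warntype outside the REPL

f(x) = cos.(x)
unstable(x) = x > 0 ? 1 : 1.0   # returns an Int or a Float64 depending on the value

# Throws if the actual return type does not match the inferred one.
result = @inferred f(rand(3))

# Prints the typed code; a Union return type is highlighted as a warning.
@code_warntype unstable(1)
```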

> (BTW, I'd rather use combine directly for performance checks, as by includes groupby which is itself a complex beast.)

Right (I checked both but in general I will report combine in the future)
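A sketch of timing combine on its own, with the grouping done beforehand. It assumes a DataFrames.jl release where the `DataFrame(matrix, :auto)` constructor is available, which is newer than the version discussed in this thread.

```julia
using DataFrames

df = DataFrame(rand(10, 4), :auto)   # `:auto` generates the names x1..x4 (assumes a newer release)
gdf = groupby(df, [:x1, :x2])        # grouping happens once, outside the timed call

# Only the combine step is timed here.
res = @time combine(gdf, :x3 => (x -> cos.(x)) => :x3_cos)
```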

bkamins · 2020-03-24T06:43:34Z

Do you think we should allow using selectors and renaming in combine (like we allow in select now), so that something like:

combine(gdf, :, :x1, :x2 => :rename_x2, :x3 => fun => :x4)

would be allowed?
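For comparison, this is a small sketch of the select syntax the question refers to (plain column, rename, and transformation with an explicit output name). The data and column names are made up for illustration.

```julia
using DataFrames

df = DataFrame(x1 = 1:3, x2 = 4:6, x3 = [0.0, 1.0, 2.0])

# Keep :x1, rename :x2, and apply cos row-wise to :x3, storing the result as :x4.
out = select(df, :x1, :x2 => :rename_x2, :x3 => ByRow(cos) => :x4)
```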

nalimilan · 2020-03-24T08:38:07Z

That would be quite natural with select (and that would allow transform to work automatically). Probably also for combine, as I see no reason not to support it. Though it really requires recycling of scalars, otherwise it will be quite limited.

pdeffebach

Thanks for this. I went over and think I understand the logic but don't have any substantive comments about the implementation. This looks good to me.

pdeffebach · 2020-03-25T17:04:53Z

A method to add would be

by(df, :a, :, cols => fun)

which would transform something by group but keep all the columns and return a DataFrame.

bkamins · 2020-03-25T17:47:37Z

> A method to add would be

I plan it for a separate PR as this is already big enough I think.

bkamins · 2020-04-01T13:15:35Z

We have a very strange error at https://travis-ci.org/github/JuliaData/DataFrames.jl/jobs/669620787?utm_medium=notification&utm_source=github_status. It seems totally unrelated and theoretically impossible to occur. I am on x64 Linux and never had it.

I will re-run the tests to see what happens.

bkamins · 2020-04-01T14:02:53Z

The error did not repeat - @nalimilan - do you think we should worry about this?

nalimilan

Cool.

The error is annoying but without a way to reproduce it I guess we can just hope we'll be able to identify it at some point. I remember I bumped into Julia bugs before in this code, you probably hit a rare path that wasn't exercised before.

bkamins · 2020-04-02T12:50:10Z

I have committed your map suggestion as it is better than ntuple, but manual unrolling should still be kept, as the timings below show:

Timing (with unrolling, but before map change)

julia> @btime combine(gdf, [:x] => x -> x[1]);
  251.465 ms (8998690 allocations: 316.60 MiB)

julia> @btime combine(gdf, [:x, :x] => (x1,x2) -> x1[1]+x2[1]);
  335.512 ms (10998591 allocations: 347.12 MiB)

julia> @btime combine(gdf, [:x, :x, :x] => (x1,x2,x3) -> x1[1]+x2[1]+x3[1]);
  384.069 ms (12998491 allocations: 377.63 MiB)

julia> @btime combine(gdf, [:x, :x, :x, :x] => (x1,x2,x3,x4) -> x1[1]+x2[1]+x3[1]+x4[1]);
  409.963 ms (14998391 allocations: 408.15 MiB)

julia> @btime combine(gdf, [:x, :x, :x, :x, :x] => (x1,x2,x3,x4,x5) -> x1[1]+x2[1]+x3[1]+x4[1]+x5[1]);
  1.137 s (25997841 allocations: 789.60 MiB)

Timing of ntuple without unrolling:

julia> @btime combine(gdf, [:x] => x -> x[1]);
  554.949 ms (13998511 allocations: 453.93 MiB)

julia> @btime combine(gdf, [:x, :x] => (x1,x2) -> x1[1]+x2[1]);
  827.347 ms (16998375 allocations: 545.48 MiB)

julia> @btime combine(gdf, [:x, :x, :x] => (x1,x2,x3) -> x1[1]+x2[1]+x3[1]);
  936.301 ms (19998241 allocations: 621.77 MiB)

julia> @btime combine(gdf, [:x, :x, :x, :x] => (x1,x2,x3,x4) -> x1[1]+x2[1]+x3[1]+x4[1]);
  1.003 s (22998106 allocations: 713.32 MiB)

julia> @btime combine(gdf, [:x, :x, :x, :x, :x] => (x1,x2,x3,x4,x5) -> x1[1]+x2[1]+x3[1]+x4[1]+x5[1]);
  1.188 s (25997971 allocations: 789.61 MiB)

Timing of map without unrolling:

julia> @btime combine(gdf, [:x] => x -> x[1]);
  464.953 ms (12998452 allocations: 423.40 MiB)

julia> @btime combine(gdf, [:x, :x] => (x1,x2) -> x1[1]+x2[1]);
  551.284 ms (15998292 allocations: 499.69 MiB)

julia> @btime combine(gdf, [:x, :x, :x] => (x1,x2,x3) -> x1[1]+x2[1]+x3[1]);
  672.049 ms (18998134 allocations: 575.98 MiB)

julia> @btime combine(gdf, [:x, :x, :x, :x] => (x1,x2,x3,x4) -> x1[1]+x2[1]+x3[1]+x4[1]);
  731.410 ms (21997975 allocations: 652.27 MiB)

julia> @btime combine(gdf, [:x, :x, :x, :x, :x] => (x1,x2,x3,x4,x5) -> x1[1]+x2[1]+x3[1]+x4[1]+x5[1]);
  855.288 ms (24997816 allocations: 728.56 MiB)

As you can see, map and ntuple allocate more than manual unrolling for up to four arguments.
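To make the comparison concrete, here is a minimal sketch of the three strategies timed above. The helper names are hypothetical and the actual implementation in the PR may differ; this only illustrates the general technique of passing a view of each column to the user function.

```julia
# Hypothetical helpers; the names are not taken from DataFrames.jl.

# Manual unrolling: one method per small number of columns.
call_unrolled(f, cols::Tuple{Any}, idx) = f(view(cols[1], idx))
call_unrolled(f, cols::Tuple{Any,Any}, idx) =
    f(view(cols[1], idx), view(cols[2], idx))

# Generic alternatives that work for any number of columns.
call_map(f, cols::Tuple, idx) = f(map(c -> view(c, idx), cols)...)
call_ntuple(f, cols::Tuple, idx) =
    f(ntuple(k -> view(cols[k], idx), length(cols))...)

x = [1.0, 2.0, 3.0, 4.0]
idx = 2:3                      # rows belonging to one group
g = (a, b) -> a[1] + b[1]

call_unrolled(g, (x, x), idx)  # 4.0
call_map(g, (x, x), idx)       # 4.0
call_ntuple(g, (x, x), idx)    # 4.0
```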

bkamins · 2020-04-02T14:31:27Z

OK, we will go for @generated - indeed this seems cleanest (@nalimilan - now it is your task then to check hygiene 😄).

Generated timings:

julia> df = DataFrame(g=rand(1:10^6, 10^7), x=rand(10^7));

julia> gdf = groupby(df, :g);

julia> @btime combine(gdf, [:x] => x -> x[1]);
  261.770 ms (8998771 allocations: 316.60 MiB)

julia> @btime combine(gdf, [:x, :x] => (x1,x2) -> x1[1]+x2[1]);
  357.037 ms (10998689 allocations: 347.12 MiB)

julia> @btime combine(gdf, [:x, :x, :x] => (x1,x2,x3) -> x1[1]+x2[1]+x3[1]);
  380.764 ms (12998608 allocations: 377.64 MiB)

julia> @btime combine(gdf, [:x, :x, :x, :x] => (x1,x2,x3,x4) -> x1[1]+x2[1]+x3[1]+x4[1]);
  414.389 ms (14998526 allocations: 408.15 MiB)

julia> @btime combine(gdf, [:x, :x, :x, :x, :x] => (x1,x2,x3,x4,x5) -> x1[1]+x2[1]+x3[1]+x4[1]+x5[1]);
  430.944 ms (16998445 allocations: 438.67 MiB)

The only drawback is when we would pass hundreds of variables, but this is a different issue - I will add such a note to the docs.

bkamins · 2020-04-02T15:22:25Z

I am leaving here the @generated code:

@generated function do_call(f::Any, idx::AbstractVector{<:Integer},
                 starts::AbstractVector{<:Integer}, ends::AbstractVector{<:Integer},
                 gd::GroupedDataFrame, incols::Tuple, i::Integer)
    ex = :(f())
    incols === Tuple{} && return ex
    for idx in 1:length(incols.parameters)
        push!(ex.args, :(view(incols[$idx], idx)))
    end
    return :(idx = idx[starts[i]:ends[i]]; $ex)
end

However, the tests show that for some reason it is better to do manual unrolling for a low number of arguments (and I think that for more than 4 arguments map is sufficiently good).
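The same code-generation idea can be tried in isolation with a simplified, self-contained variant that takes a tuple of vectors and an index range instead of a GroupedDataFrame. The name `call_views` is hypothetical and the sketch is not the code from the reverted commit.

```julia
# Simplified sketch; `call_views` is a hypothetical name.
# Inside the generator, `incols` is the *type* of the tuple, so the loop
# emits one `view(incols[k], idx)` argument per column.
@generated function call_views(f, incols::Tuple, idx::AbstractVector{<:Integer})
    ex = :(f())
    for k in 1:length(incols.parameters)
        push!(ex.args, :(view(incols[$k], idx)))
    end
    return ex
end

# Expands to f(view(incols[1], idx), view(incols[2], idx)).
call_views((a, b) -> sum(a) + sum(b), ([1, 2, 3], [10, 20, 30]), 2:3)   # 55
```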

I will leave the @generated code in reverted commit so that you can have a look.

bkamins · 2020-04-02T15:53:25Z

For _combine(f::AbstractVector{<:Pair}, gd::GroupedDataFrame, nms::AbstractVector{Symbol}) I have not changed the signature, but added assertions as the signature is problematic (source column can be Int or AbstractVector{Int} so it would be a very complex signature).
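As an illustration of checking argument shapes with assertions instead of encoding them in the method signature, here is a hedged sketch. The function name `check_sources` and the exact check are hypothetical; only the allowed source types (Int or AbstractVector{Int}) come from the comment above.

```julia
# Hypothetical sketch; the function name and message are illustrative only.
function check_sources(f::AbstractVector{<:Pair})
    for p in f
        src = first(p)
        # The source column may be a single index or a vector of indices.
        @assert src isa Union{Int, AbstractVector{Int}} "unexpected source column: $src"
    end
    return true
end

check_sources(Pair[1 => sum, [1, 2] => +])   # true
```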

nalimilan

OK, let's keep the manual unrolling then. Can you just add a comment explaining that @generated is slower for a low number of columns? I guess this is including compilation times, right? (That's indeed something we care about, but that isn't always considered when writing package code.)

bkamins · 2020-04-03T10:24:39Z

I will add a comment.

Actually it was worse: it allocated more and was slightly slower even after the first compilation. Also, for some reason compilation was invoked twice for the same signature (I put @show in the @generated function and it printed twice).

In summary - this is a strange situation (indicating that @generated has some minimal overhead over manual unrolling), but rather related to how Base works than this PR.

I will rebase it against "transform" PR and add some more tests before merging.

bkamins · 2020-04-03T11:50:55Z

Well - a last part is updates to the manual. Just pushed.

bkamins · 2020-04-03T20:48:10Z

I will merge this PR when CI passes. Thank you for all the discussions!

bkamins · 2020-04-04T06:24:05Z

Anything that was broken here now should be fixed by a separate PR. Thank you for a fantastic joint effort.

sync combine with select

f7bf852

redesign non performance critical code to reduce compilation cost

e6290b8

improve performance when getting tables

20c9c88

bkamins added the breaking and grouping labels Mar 21, 2020

bkamins added this to the 1.0 milestone Mar 21, 2020

nalimilan reviewed Mar 23, 2020

Update src/groupeddataframe/splitapplycombine.jl

3c3cd46

Co-Authored-By: Milan Bouchet-Valat <nalimilan@club.fr>

bkamins and others added 4 commits March 23, 2020 16:46

Update src/groupeddataframe/splitapplycombine.jl

54a0329

Co-Authored-By: Milan Bouchet-Valat <nalimilan@club.fr>

updates after code review

5eab189

Merge remote-tracking branch 'origin/update_combine' into update_combine

faf8091

move @nospecialize

a76a249

nalimilan reviewed Mar 24, 2020

bkamins mentioned this pull request Mar 24, 2020

Precompilation of DataFrames #1502

Closed

bkamins added 2 commits March 24, 2020 19:05

Merge branch 'master' into update_combine

77e5a13

remove @nospecialize and clean up code

176e646

pdeffebach reviewed Mar 25, 2020

fix wrap_table and make combine ready for broadcasting

97d3449

bkamins commented Mar 25, 2020

nalimilan reviewed Mar 26, 2020

Update src/groupeddataframe/splitapplycombine.jl

bc8ad83

Co-Authored-By: Milan Bouchet-Valat <nalimilan@club.fr>

update docs for nrow

ce594c4

bkamins closed this Apr 1, 2020

bkamins reopened this Apr 1, 2020

nalimilan reviewed Apr 2, 2020

bkamins and others added 2 commits April 2, 2020 13:29

Apply suggestions from code review

20fd019

Co-Authored-By: Milan Bouchet-Valat <nalimilan@club.fr>

Update src/groupeddataframe/splitapplycombine.jl

066ffe0

Co-Authored-By: Milan Bouchet-Valat <nalimilan@club.fr>

bkamins added 2 commits April 2, 2020 16:52

move to @generated

8f0caf7

Revert "move to @generated"

85d298a

This reverts commit 8f0caf7.

corrections after code revieew

c33512c

nalimilan approved these changes Apr 3, 2020

bkamins added 2 commits April 3, 2020 12:28

Merge branch 'master' into update_combine

0b6f651

small updates after merging transpose

4ee405e

bkamins changed the title ~~sync combine with select~~ [BREAKING] sync combine with select Apr 3, 2020

bkamins added 2 commits April 3, 2020 13:44

update documentation

2027bd6

more docs updates

3e72841

nalimilan reviewed Apr 3, 2020

bkamins and others added 2 commits April 3, 2020 22:44

Apply suggestions from code review

1d3a238

Co-Authored-By: Milan Bouchet-Valat <nalimilan@club.fr>

update documentation

464277c

bkamins merged commit 894a012 into JuliaData:master Apr 4, 2020

bkamins deleted the update_combine branch April 4, 2020 06:23
