[BREAKING] sync combine with select #2158

Merged (44 commits) on Apr 4, 2020

Conversation

bkamins
Member

@bkamins bkamins commented Mar 21, 2020

Fixes #2156.

I have made an initial implementation of the functionality discussed in #2156 and added a keepkeys kwarg. Here is an example:

julia> df = DataFrame(rand(3, 4))
3×4 DataFrame
│ Row │ x1       │ x2       │ x3       │ x4       │
│     │ Float64  │ Float64  │ Float64  │ Float64  │
├─────┼──────────┼──────────┼──────────┼──────────┤
│ 1   │ 0.181501 │ 0.794075 │ 0.86532  │ 0.417367 │
│ 2   │ 0.743854 │ 0.257419 │ 0.350772 │ 0.293578 │
│ 3   │ 0.431316 │ 0.922369 │ 0.862359 │ 0.362968 │

julia> @time by(df, [:x1, :x2], :x3 => ByRow(cos))
  6.172474 seconds (20.32 M allocations: 1.041 GiB, 6.61% gc time)
3×3 DataFrame
│ Row │ x1       │ x2       │ x3_cos   │
│     │ Float64  │ Float64  │ Float64  │
├─────┼──────────┼──────────┼──────────┤
│ 1   │ 0.181501 │ 0.794075 │ 0.648397 │
│ 2   │ 0.743854 │ 0.257419 │ 0.939108 │
│ 3   │ 0.431316 │ 0.922369 │ 0.650648 │

julia> @time by(df, [:x1, :x2], :x3 => ByRow(cos), [:x1, :x2] => (x1, x2) -> x1 ./ x2)
  0.907711 seconds (2.40 M allocations: 122.147 MiB, 2.84% gc time)
3×4 DataFrame
│ Row │ x1       │ x2       │ x3_cos   │ x1_x2_function │
│     │ Float64  │ Float64  │ Float64  │ Float64        │
├─────┼──────────┼──────────┼──────────┼────────────────┤
│ 1   │ 0.181501 │ 0.794075 │ 0.648397 │ 0.228569       │
│ 2   │ 0.743854 │ 0.257419 │ 0.939108 │ 2.88967        │
│ 3   │ 0.431316 │ 0.922369 │ 0.650648 │ 0.467618       │

julia> @time by(df, [:x1, :x2], :x3 => ByRow(cos), [:x1, :x2] => (x1, x2) -> x1 ./ x2, keepkeys=false)
  0.369644 seconds (775.51 k allocations: 40.180 MiB, 3.00% gc time)
3×2 DataFrame
│ Row │ x3_cos   │ x1_x2_function │
│     │ Float64  │ Float64        │
├─────┼──────────┼────────────────┤
│ 1   │ 0.648397 │ 0.228569       │
│ 2   │ 0.939108 │ 2.88967        │
│ 3   │ 0.650648 │ 0.467618       │
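
A deterministic sketch of the keepkeys behaviour shown above, for readers on a recent DataFrames.jl release where by has been removed in favor of combine(groupby(...)). The toy data and the names df, gdf, with_keys and without_keys are illustrative assumptions, not taken from the PR.

```julia
using DataFrames

# Toy data (illustrative assumption): two groups keyed by :g.
df = DataFrame(g = [1, 1, 2], x = [1.0, 2.0, 3.0])
gdf = groupby(df, :g)

# Default behaviour: the grouping column :g is kept in the result.
with_keys = combine(gdf, :x => sum => :x_sum)

# keepkeys=false drops the grouping column and returns only the computed column.
without_keys = combine(gdf, :x => sum => :x_sum, keepkeys=false)
```

Using fixed data instead of rand makes it easy to compare the two results side by side.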

I have not updated the documentation or the tests, so CI should fail.

Also, I have left most of the internals unchanged because:

  1. this minimizes the risk of error
  2. this allows an easy deprecation of the kwarg form of by
  3. this leaves room to add, in the future, an opt-in for passing a NamedTuple to a function instead of auto-splatting

The major problem I will have to look into is that the current implementation unfortunately puts a significant strain on the compiler. Probably some more @nospecialize annotations should be added (@nalimilan, you might have some experience with this, so could you please comment if you have tips to share?).
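
As a rough illustration of what @nospecialize does, here is a minimal sketch that depends only on Base. The helper apply_to_groups and its data are hypothetical and are not part of DataFrames.jl; the point is only that the annotated argument does not trigger a new compiled specialization for every function passed in.

```julia
# Hypothetical helper (not from the PR): `f` is marked @nospecialize, which hints
# to the compiler that it should not compile a separate specialization of
# `apply_to_groups` for each distinct function type passed as `f`.
function apply_to_groups(@nospecialize(f), groups::Vector{Vector{Float64}})
    return [f(g) for g in groups]
end

groups = [[1.0, 2.0], [3.0]]

# Both calls can share one compiled method instance for the outer function.
sums = apply_to_groups(sum, groups)
maxs = apply_to_groups(maximum, groups)
```

The trade-off is that calls to f inside the function are dispatched dynamically, so this reduces compilation work at the possible cost of some run-time speed.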

@bkamins
Member Author

bkamins commented Mar 21, 2020

Just for reference, in 0.20.2 we have:

julia> @time by(df, [:x1, :x2], :x3 => x -> cos.(x))
  3.254445 seconds (11.90 M allocations: 587.842 MiB, 8.50% gc time)
3×3 DataFrame
│ Row │ x1       │ x2        │ x3_function │
│     │ Float64  │ Float64   │ Float64     │
├─────┼──────────┼───────────┼─────────────┤
│ 1   │ 0.797467 │ 0.205025  │ 0.867137    │
│ 2   │ 0.703998 │ 0.0357978 │ 0.691523    │
│ 3   │ 0.473485 │ 0.35953   │ 0.704904    │

(which is not super fast, but still takes about half the time)

@bkamins
Member Author

bkamins commented Mar 21, 2020

I have made some optimizations to reduce compilation cost (in particular dropped NamedTuple in favor of Tuple internally which is enough for auto-splatting). The timings have improved:

julia> using DataFrames

julia> df = DataFrame(rand(10, 4));

julia> @time by(df, [:x1, :x2], :x3 => x -> cos.(x));
  3.969474 seconds (13.04 M allocations: 659.181 MiB, 7.89% gc time)

@bkamins
Member Author

bkamins commented Mar 21, 2020

The last commit is performance-related only. After it:

julia> df = DataFrame(rand(10^7, 4));

julia> @time by(df, [:x1, :x2], :x3 => x -> cos.(x));
 16.423705 seconds (174.08 M allocations: 7.480 GiB, 5.93% gc time)

and on master:

julia> df = DataFrame(rand(10^7, 4));

julia> @time by(df, [:x1, :x2], :x3 => x -> cos.(x));
 47.086273 seconds (293.70 M allocations: 12.963 GiB, 3.66% gc time)

The problem was that the check at 20c9c88#diff-23657e51a9cc9e627fc153ba1e6e04c1L982 was very expensive (I had earlier assumed the compiler would optimize it out).

This signals a risk of type instability, but we do not seem to have one, which is strange:

julia> df = DataFrame(rand(10^7, 4));

julia> f(x) = cos.(x)
f (generic function with 1 method)

julia> g(x)::Vector{Float64} = cos.(x) # make sure Julia should know the return type
g (generic function with 1 method)

julia> @time by(df, [:x1, :x2], :x3 => f); # more expensive due to compilation
 17.630734 seconds (174.08 M allocations: 7.480 GiB, 6.21% gc time)

julia> @time by(df, [:x1, :x2], :x3 => g);
 13.528831 seconds (160.43 M allocations: 6.842 GiB, 6.97% gc time)

julia> @time by(df, [:x1, :x2], :x3 => f); # here speed is the same
 13.110972 seconds (160.00 M allocations: 6.820 GiB, 7.13% gc time)

julia> @time by(df, [:x1, :x2], :x3 => g);
 13.037767 seconds (160.00 M allocations: 6.820 GiB, 7.09% gc time)

@bkamins added the breaking (The proposed change is breaking.) and grouping labels Mar 21, 2020
@bkamins added this to the 1.0 milestone Mar 21, 2020
@nalimilan
Member

So now the PR is faster than master? That type instability is indeed unexpected. Have you checked with @code_warntype whether all types are correctly inferred?

(BTW, I'd rather use combine directly for performance checks, as by includes groupby which is itself a complex beast.)
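
The kind of inference check suggested above can be sketched on two toy functions. The functions unstable and stable and the helper inferred_type are hypothetical. For the PR itself one would presumably run @code_warntype on a combine call instead, which is not shown here.

```julia
using InteractiveUtils  # provides @code_warntype outside the REPL

# Toy functions (illustrative assumptions, not from the PR).
unstable(x) = x > 0 ? 1 : 1.0     # returns an Int or a Float64 depending on the value
stable(x) = x > 0 ? 1.0 : -1.0    # always returns a Float64

# Hypothetical helper: ask inference for the return type given argument types.
inferred_type(f, argtypes) = first(Base.return_types(f, argtypes))

unstable_type = inferred_type(unstable, (Int,))
stable_type = inferred_type(stable, (Int,))

# Prints the lowered, type-annotated code; non-concrete types are highlighted.
@code_warntype unstable(1)
```

A non-concrete inferred return type, such as a Union, is the usual sign of the instability being asked about.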

Co-Authored-By: Milan Bouchet-Valat <nalimilan@club.fr>
@bkamins
Member Author

bkamins commented Mar 23, 2020

> So now the PR is faster than master? That type instability is indeed unexpected. Have you checked with @code_warntype whether all types are correctly inferred?

The PR is faster than master and about as fast as 0.20.2. I will check @code_warntype once we stabilize the design.

> (BTW, I'd rather use combine directly for performance checks, as by includes groupby which is itself a complex beast.)

Right. I checked both, but in general I will report combine timings in the future.

@bkamins
Member Author

bkamins commented Mar 24, 2020

Do you think we should allow using selectors and renaming in combine (like we allow in select now), so that something like:

combine(gdf, :, :x1, :x2 => :rename_x2, :x3 => fun => :x4)

would be allowed?

@nalimilan
Member

That would be quite natural with select (and that would allow transform to work automatically). Probably also for combine, as I see no reason not to support it. Though it really requires recycling of scalars; otherwise it will be quite limited.
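
For illustration, here is how renaming combined with recycling of scalars looks in a recent DataFrames.jl release. This is a sketch of later released behaviour, not of the state of this PR at the time of the comment; the toy data and column names are assumptions.

```julia
using DataFrames

# Toy data (illustrative assumption): two groups keyed by :g.
df = DataFrame(g = [1, 1, 2], x = [1.0, 2.0, 3.0])
gdf = groupby(df, :g)

# :x => :x_renamed keeps every row of :x under a new name, while
# :x => sum => :x_sum returns one scalar per group. The scalar is recycled
# to match the number of rows that the group produces for the other column.
res = combine(gdf, :x => :x_renamed, :x => sum => :x_sum)
```

Without recycling, the two specifications would produce different numbers of rows per group and could not be placed in the same result.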

Contributor

@pdeffebach pdeffebach left a comment

Thanks for this. I went over it and think I understand the logic, but I don't have any substantive comments about the implementation. This looks good to me.

@pdeffebach
Contributor

A method to add would be

by(df, :a, :, cols => fun)

which would transform something by group but keep all the columns and return a DataFrame.

@bkamins
Member Author

bkamins commented Mar 25, 2020

> A method to add would be

I plan to do it in a separate PR, as this one is already big enough, I think.

Co-Authored-By: Milan Bouchet-Valat <nalimilan@club.fr>
@bkamins
Member Author

bkamins commented Apr 1, 2020

We have a very strange error at https://travis-ci.org/github/JuliaData/DataFrames.jl/jobs/669620787?utm_medium=notification&utm_source=github_status. It seems totally unrelated and theoretically impossible to occur. I am on x64 Linux and never had it.

I will re-run the tests to see what happens.

@bkamins bkamins closed this Apr 1, 2020
@bkamins bkamins reopened this Apr 1, 2020
@bkamins
Member Author

bkamins commented Apr 1, 2020

The error did not repeat - @nalimilan - do you think we should worry about this?

Member

@nalimilan nalimilan left a comment

Cool.

The error is annoying, but without a way to reproduce it I guess we can just hope we'll be able to identify it at some point. I remember bumping into Julia bugs in this code before; you probably hit a rare path that wasn't exercised before.

bkamins and others added 2 commits April 2, 2020 13:29
Co-Authored-By: Milan Bouchet-Valat <nalimilan@club.fr>
Co-Authored-By: Milan Bouchet-Valat <nalimilan@club.fr>
@bkamins
Member Author

bkamins commented Apr 2, 2020

I have committed your map suggestion as it is better than ntuple, but manual unrolling should still be kept, as the timings below show.

Timing (with unrolling, but before the map change):

julia> @btime combine(gdf, [:x] => x -> x[1]);
  251.465 ms (8998690 allocations: 316.60 MiB)

julia> @btime combine(gdf, [:x, :x] => (x1,x2) -> x1[1]+x2[1]);
  335.512 ms (10998591 allocations: 347.12 MiB)

julia> @btime combine(gdf, [:x, :x, :x] => (x1,x2,x3) -> x1[1]+x2[1]+x3[1]);
  384.069 ms (12998491 allocations: 377.63 MiB)

julia> @btime combine(gdf, [:x, :x, :x, :x] => (x1,x2,x3,x4) -> x1[1]+x2[1]+x3[1]+x4[1]);
  409.963 ms (14998391 allocations: 408.15 MiB)

julia> @btime combine(gdf, [:x, :x, :x, :x, :x] => (x1,x2,x3,x4,x5) -> x1[1]+x2[1]+x3[1]+x4[1]+x5[1]);
  1.137 s (25997841 allocations: 789.60 MiB)

Timing of ntuple without unrolling:

julia> @btime combine(gdf, [:x] => x -> x[1]);
  554.949 ms (13998511 allocations: 453.93 MiB)

julia> @btime combine(gdf, [:x, :x] => (x1,x2) -> x1[1]+x2[1]);
  827.347 ms (16998375 allocations: 545.48 MiB)

julia> @btime combine(gdf, [:x, :x, :x] => (x1,x2,x3) -> x1[1]+x2[1]+x3[1]);
  936.301 ms (19998241 allocations: 621.77 MiB)

julia> @btime combine(gdf, [:x, :x, :x, :x] => (x1,x2,x3,x4) -> x1[1]+x2[1]+x3[1]+x4[1]);
  1.003 s (22998106 allocations: 713.32 MiB)

julia> @btime combine(gdf, [:x, :x, :x, :x, :x] => (x1,x2,x3,x4,x5) -> x1[1]+x2[1]+x3[1]+x4[1]+x5[1]);
  1.188 s (25997971 allocations: 789.61 MiB)

Timing of map without unrolling:

julia> @btime combine(gdf, [:x] => x -> x[1]);
  464.953 ms (12998452 allocations: 423.40 MiB)

julia> @btime combine(gdf, [:x, :x] => (x1,x2) -> x1[1]+x2[1]);
  551.284 ms (15998292 allocations: 499.69 MiB)

julia> @btime combine(gdf, [:x, :x, :x] => (x1,x2,x3) -> x1[1]+x2[1]+x3[1]);
  672.049 ms (18998134 allocations: 575.98 MiB)

julia> @btime combine(gdf, [:x, :x, :x, :x] => (x1,x2,x3,x4) -> x1[1]+x2[1]+x3[1]+x4[1]);
  731.410 ms (21997975 allocations: 652.27 MiB)

julia> @btime combine(gdf, [:x, :x, :x, :x, :x] => (x1,x2,x3,x4,x5) -> x1[1]+x2[1]+x3[1]+x4[1]+x5[1]);
  855.288 ms (24997816 allocations: 728.56 MiB)

As you can see, map and ntuple allocate more than the manually unrolled code in the cases with up to four columns.
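
The three strategies compared above can be sketched with Base only. The function names call_unrolled, call_ntuple and call_map are hypothetical stand-ins and not the PR's internals; each one builds a view per input column and passes the views to f as separate arguments. The sketch shows the shape of the code only and does not reproduce the timings or allocation counts.

```julia
# 1. Manual unrolling: one hand-written method per small, fixed number of columns.
call_unrolled(f, rows, cols::Tuple{Any}) = f(view(cols[1], rows))
call_unrolled(f, rows, cols::Tuple{Any,Any}) =
    f(view(cols[1], rows), view(cols[2], rows))

# 2. ntuple: build the tuple of views from an index-based closure, then splat it.
call_ntuple(f, rows, cols::Tuple) =
    f(ntuple(j -> view(cols[j], rows), length(cols))...)

# 3. map: map over the tuple of columns directly, then splat the result.
call_map(f, rows, cols::Tuple) = f(map(c -> view(c, rows), cols)...)

# Toy inputs (illustrative assumptions): two columns and the rows of one group.
cols = ([1.0, 2.0, 3.0], [10.0, 20.0, 30.0])
rows = 1:2
first_sum = (a, b) -> a[1] + b[1]

r_unrolled = call_unrolled(first_sum, rows, cols)
r_ntuple = call_ntuple(first_sum, rows, cols)
r_map = call_map(first_sum, rows, cols)
```

All three return the same value; they differ only in how much work is left to the compiler, which is what the benchmarks above measure.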

@bkamins
Member Author

bkamins commented Apr 2, 2020

OK, we will go for @generated - indeed this seems cleanest (@nalimilan - now it is your task to check hygiene 😄).

Generated timings:

julia> df = DataFrame(g=rand(1:10^6, 10^7), x=rand(10^7));

julia> gdf = groupby(df, :g);

julia> @btime combine(gdf, [:x] => x -> x[1]);
  261.770 ms (8998771 allocations: 316.60 MiB)

julia> @btime combine(gdf, [:x, :x] => (x1,x2) -> x1[1]+x2[1]);
  357.037 ms (10998689 allocations: 347.12 MiB)

julia> @btime combine(gdf, [:x, :x, :x] => (x1,x2,x3) -> x1[1]+x2[1]+x3[1]);
  380.764 ms (12998608 allocations: 377.64 MiB)

julia> @btime combine(gdf, [:x, :x, :x, :x] => (x1,x2,x3,x4) -> x1[1]+x2[1]+x3[1]+x4[1]);
  414.389 ms (14998526 allocations: 408.15 MiB)

julia> @btime combine(gdf, [:x, :x, :x, :x, :x] => (x1,x2,x3,x4,x5) -> x1[1]+x2[1]+x3[1]+x4[1]+x5[1]);
  430.944 ms (16998445 allocations: 438.67 MiB)

The only drawback arises when hundreds of variables are passed, but this is a separate issue - I will add a note about it to the docs.

@bkamins
Member Author

bkamins commented Apr 2, 2020

I am leaving here the @generated code:

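@generated function do_call(f::Any, idx::AbstractVector{<:Integer},
                 starts::AbstractVector{<:Integer}, ends::AbstractVector{<:Integer},
                 gd::GroupedDataFrame, incols::Tuple, i::Integer)
    # inside the generator, incols is bound to the tuple type, not to its value
    ex = :(f())
    incols === Tuple{} && return ex
    # add one view(incols[j], idx) argument per input column
    for j in 1:length(incols.parameters)
        push!(ex.args, :(view(incols[$j], idx)))
    end
    # at run time: select the rows of group i, then call f on the views
    return :(idx = idx[starts[i]:ends[i]]; $ex)
end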

However, the tests show that, for some reason, manual unrolling is better for a low number of arguments (and I think that for more than 4 arguments map is sufficiently good).

I will leave the @generated code in a reverted commit so that you can have a look.
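
To try the same technique without DataFrames, here is a standalone sketch. The function call_on_views is a hypothetical simplification of do_call above: it drops the GroupedDataFrame, starts and ends arguments and takes the row indices directly.

```julia
# Hypothetical simplification of do_call; depends only on Base.
@generated function call_on_views(f, rows::AbstractVector{<:Integer}, incols::Tuple)
    # Inside the generator, `incols` is the tuple *type*, so the number of
    # columns is known at compile time.
    ex = :(f())
    for j in 1:length(incols.parameters)
        # Append one argument per column: view(incols[j], rows).
        push!(ex.args, :(view(incols[$j], rows)))
    end
    # For a 2-tuple this yields f(view(incols[1], rows), view(incols[2], rows)).
    return ex
end

# Toy inputs (illustrative assumptions).
cols = ([1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0])
rows = 2:3

result = call_on_views((a, b) -> sum(a) + sum(b), rows, cols)
no_cols = call_on_views(() -> 0, rows, ())
```

The generated body is the same call one would write by hand for each column count, which is why it was expected to match manual unrolling.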

@bkamins
Member Author

bkamins commented Apr 2, 2020

For _combine(f::AbstractVector{<:Pair}, gd::GroupedDataFrame, nms::AbstractVector{Symbol}) I have not changed the signature but have added assertions instead, as a precise signature is problematic (a source column can be an Int or an AbstractVector{Int}, so the signature would be very complex).
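
The general pattern of keeping a loose signature and checking the finer constraint inside the function might look like the sketch below. The helper check_source_columns is hypothetical; the actual assertions added in the PR are not shown in this thread.

```julia
# Hypothetical validation helper illustrating "loose signature + assertion".
function check_source_columns(pairs::AbstractVector{<:Pair})
    for p in pairs
        src = first(p)
        # The constraint that would be awkward to encode in the method signature:
        # each source must be a single Int or a vector of Ints.
        @assert src isa Union{Int, AbstractVector{Int}} "unsupported source column: $src"
    end
    return true
end

ok = check_source_columns([1 => sum, [1, 2] => +])

# A Symbol source violates the assumed constraint and raises an AssertionError.
rejected = try
    check_source_columns([:x => sum])
    false
catch err
    err isa AssertionError
end
```

Note that the Julia documentation warns that @assert may be disabled at some optimization levels, so it suits internal invariants rather than validation of user input.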

Member

@nalimilan nalimilan left a comment

OK, let's keep the manual unrolling then. Can you just add a comment explaining that @generated is slower for a low number of columns? I guess this is including compilation times, right? (That's indeed something we care about, but that isn't always considered when writing package code.)

@bkamins
Member Author

bkamins commented Apr 3, 2020

I will add a comment.

Actually it was worse: it allocated more and was slightly slower even after the first compilation. Also, for some reason compilation was invoked twice for the same signature (I put @show in the @generated function and it printed twice).

In summary, this is a strange situation (indicating that @generated has some minimal overhead over manual unrolling), but it is related to how Base works rather than to this PR.

I will rebase it against the "transform" PR and add some more tests before merging.

@bkamins changed the title from "sync combine with select" to "[BREAKING] sync combine with select" Apr 3, 2020
@bkamins
Member Author

bkamins commented Apr 3, 2020

Well, the last part is the updates to the manual. Just pushed.

bkamins and others added 2 commits April 3, 2020 22:44
Co-Authored-By: Milan Bouchet-Valat <nalimilan@club.fr>
@bkamins
Member Author

bkamins commented Apr 3, 2020

I will merge this PR when CI passes. Thank you for all the discussions!

@bkamins bkamins merged commit 894a012 into JuliaData:master Apr 4, 2020
@bkamins bkamins deleted the update_combine branch April 4, 2020 06:23
@bkamins
Member Author

bkamins commented Apr 4, 2020

Anything that was broken here should now be fixed in a separate PR. Thank you for a fantastic joint effort.

Labels: breaking (The proposed change is breaking.), grouping
Issues this pull request may close: Redesign of combine; Column naming in combine
3 participants