[BREAKING] new design of select, transform and combine #2214

Merged
merged 32 commits into from
May 5, 2020

bkamins
Member

@bkamins bkamins commented Apr 27, 2020

Fixes #2206.

This is still a draft. So far I have implemented only the AbstractDataFrame part (GroupedDataFrame is pending) and have not updated the documentation or tests. But you can have a look at the code to see the changes we essentially agreed on.

@bkamins bkamins changed the title implement AbstractDataFrame functionality [BREAKING] implement AbstractDataFrame functionality Apr 27, 2020
@bkamins bkamins added breaking The proposed change is breaking. feature grouping labels Apr 27, 2020
@bkamins bkamins added this to the 1.0 milestone Apr 27, 2020
@bkamins bkamins changed the title [BREAKING] implement AbstractDataFrame functionality [BREAKING] new design of select, transform and combine Apr 27, 2020
@bkamins
Member Author

bkamins commented Apr 28, 2020

@matthieugomez, @pdeffebach, @nalimilan - the PR should be ready for a quick look (and for testing - maybe some incorrect corner cases will be caught).

All functions should be implemented as we discussed.

I have updated neither tests nor documentation, so there might be some holes, but here is how it works:

julia> df = DataFrame(g=[2,3,1,1,2,2,3,1,2,1], x=1:10)
10×2 DataFrame
│ Row │ g     │ x     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 2     │ 1     │
│ 2   │ 3     │ 2     │
│ 3   │ 1     │ 3     │
│ 4   │ 1     │ 4     │
│ 5   │ 2     │ 5     │
│ 6   │ 2     │ 6     │
│ 7   │ 3     │ 7     │
│ 8   │ 1     │ 8     │
│ 9   │ 2     │ 9     │
│ 10  │ 1     │ 10    │

julia> gdf = groupby(df, :g);

julia> select(gdf, :g, :g => :g1, :g => (x->x) => :g2, :g => mean, :x, :x => :x1, :x => (x->x) => :x2, :x => mean)
10×8 DataFrame
│ Row │ g     │ g1    │ g2    │ g_mean  │ x     │ x1    │ x2    │ x_mean  │
│     │ Int64 │ Int64 │ Int64 │ Float64 │ Int64 │ Int64 │ Int64 │ Float64 │
├─────┼───────┼───────┼───────┼─────────┼───────┼───────┼───────┼─────────┤
│ 1   │ 2     │ 2     │ 2     │ 2.0     │ 1     │ 1     │ 1     │ 5.25    │
│ 2   │ 3     │ 3     │ 3     │ 3.0     │ 2     │ 2     │ 2     │ 4.5     │
│ 3   │ 1     │ 1     │ 1     │ 1.0     │ 3     │ 3     │ 3     │ 6.25    │
│ 4   │ 1     │ 1     │ 1     │ 1.0     │ 4     │ 4     │ 4     │ 6.25    │
│ 5   │ 2     │ 2     │ 2     │ 2.0     │ 5     │ 5     │ 5     │ 5.25    │
│ 6   │ 2     │ 2     │ 2     │ 2.0     │ 6     │ 6     │ 6     │ 5.25    │
│ 7   │ 3     │ 3     │ 3     │ 3.0     │ 7     │ 7     │ 7     │ 4.5     │
│ 8   │ 1     │ 1     │ 1     │ 1.0     │ 8     │ 8     │ 8     │ 6.25    │
│ 9   │ 2     │ 2     │ 2     │ 2.0     │ 9     │ 9     │ 9     │ 5.25    │
│ 10  │ 1     │ 1     │ 1     │ 1.0     │ 10    │ 10    │ 10    │ 6.25    │
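
For comparison, here is a minimal sketch of how the three verbs named in the PR title differ on the same grouped data. It assumes a DataFrames.jl release that includes this PR, and the printed output will depend on the version, so none is shown. The idea is that combine returns one row per group, while select and transform return one row per row of the parent data frame.

```julia
using DataFrames, Statistics

df = DataFrame(g=[2,3,1,1,2,2,3,1,2,1], x=1:10)
gdf = groupby(df, :g)

# combine: one row per group
agg = combine(gdf, :x => mean)

# select: one row per row of the parent; keeps the grouping column
# and the requested columns only
sel = select(gdf, :x => mean)

# transform: like select, but keeps all columns of the parent
tr = transform(gdf, :x => mean)
```

With the data above, agg should have 3 rows, while sel and tr should each have 10.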

@matthieugomez
Contributor

matthieugomez commented Apr 28, 2020

I’ve tried it a little bit and it feels very good from a user perspective!

@bkamins
Member Author

bkamins commented Apr 28, 2020

@nalimilan - there is one thing on top of what was my idea in #2206 (and something you said you wanted to allow anyway).

We should treat GroupedDataFrame as canonical irrespective of its ordering, as long as it contains all groups. I guess no one will oppose this 😄, as it was only me who wanted to be strict. The reason for this is that groupby automatically sorts GroupedDataFrame - even if not asked for - if all grouping columns are CategoricalVector. I will try to make the changes needed to reflect this fact.
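
A small sketch of what this means in practice, assuming the behavior described in this comment is what ended up in the released package: a GroupedDataFrame whose groups are reordered, but which still contains every group, should give the same select result as the original, because the rows of the result follow the parent data frame.

```julia
using DataFrames, Statistics

df = DataFrame(g=[2,3,1,1,2,2,3,1,2,1], x=1:10)
gdf = groupby(df, :g)

# Same groups in a different order; no group is dropped.
reordered = gdf[[3, 1, 2]]

a = select(gdf, :x => mean)
b = select(reordered, :x => mean)

# Rows follow the parent, so the group order should not matter here.
same = a == b
```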

@bkamins bkamins marked this pull request as ready for review April 29, 2020 11:17
@bkamins
Member Author

bkamins commented Apr 29, 2020

I have finished the development and tests of the new functionality.
In the process I also optimized some of the old code to reduce copying of data.

Documentation is half-finished (i.e. I tried to update it everywhere, but it is surely not perfect). Maybe we should even agree to make a separate PR for polishing it later (but if you have good suggestions now, I am of course open to them).

Contributor

@matthieugomez matthieugomez left a comment

That looks great! I will open a pull request with a dplyr/Stata comparison, following the tables we made in the issue.

docs/src/man/split_apply_combine.md Show resolved Hide resolved
docs/src/man/split_apply_combine.md Outdated Show resolved Hide resolved
@bkamins
Member Author

bkamins commented May 1, 2020

OK, I was surprised that the gain was so small. Now I have fixed the performance (barrier functions came in handy; @nalimilan - as a side note the old map code was not only buggy but also inefficient due to type instability); the benchmark after the update is:

julia> @time df = DataFrame(g=rand(10^8));
  0.790909 seconds (28 allocations: 1.490 GiB, 22.32% gc time)

julia> @time gdf = groupby(df, :g);
 14.090509 seconds (50 allocations: 2.490 GiB, 1.26% gc time)

julia> @time df2 = combine(gdf, :g => length);
  1.498985 seconds (179 allocations: 2.328 GiB, 16.99% gc time)

julia> @time gdf2 = groupby(df2, :g);
 14.069919 seconds (34 allocations: 2.490 GiB, 0.82% gc time)

julia> @time gdf3 = combine(gdf, :g => length, regroup=true);
  1.847234 seconds (186 allocations: 3.073 GiB, 12.57% gc time)

(as you can see, regroup costs almost nothing)

The same with select:

julia> gdf.idx;

julia> @time select(gdf, :g => length);
  6.707589 seconds (200.00 M allocations: 8.196 GiB, 6.05% gc time)

julia> @time select(gdf, :g => length, regroup=true);
  6.758230 seconds (200.00 M allocations: 8.196 GiB, 6.17% gc time)
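
The barrier functions mentioned in this comment are a general Julia technique for containing type instability. The following is a generic sketch, not the code from this PR; `sum_column` and `_sum_barrier` are hypothetical names used only for illustration.

```julia
# Columns held in a container without a concrete element type, so the
# compiler cannot infer the type of `cols[i]`.
cols = Any[[1, 2, 3], [1.5, 2.5]]

# Inner kernel: compiled separately for each concrete vector type,
# so the loop itself is type-stable.
function _sum_barrier(v::AbstractVector{T}) where {T}
    s = zero(T)
    for x in v
        s += x
    end
    return s
end

# Outer function: the type-unstable lookup happens once, then dynamic
# dispatch hands the column to the specialized kernel.
sum_column(cols, i) = _sum_barrier(cols[i])
```

The cost of the unknown column type is paid once per call of `sum_column` instead of once per loop iteration.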

bkamins and others added 2 commits May 1, 2020 17:36
@matthieugomez
Contributor

ungroup=true (the default) would mean that it returns a DataFrame.

@bkamins
Member Author

bkamins commented May 2, 2020

OK - I have pushed a commit changing regroup to ungroup.
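
A minimal sketch of the renamed keyword, assuming a DataFrames.jl release that includes this change: with the default ungroup=true the result is a DataFrame, and with ungroup=false the result stays grouped by the original grouping column.

```julia
using DataFrames

df = DataFrame(g=[2,3,1,1,2,2,3,1,2,1], x=1:10)
gdf = groupby(df, :g)

# Default (ungroup=true): a plain DataFrame with one row per group
counts_df = combine(gdf, :x => length)

# ungroup=false: the result is a GroupedDataFrame grouped by :g
counts_gdf = combine(gdf, :x => length, ungroup=false)
```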

@bkamins
Member Author

bkamins commented May 3, 2020

Only coverage fails

Member

@nalimilan nalimilan left a comment

ungroup sounds good (that's the term used by dplyr).

Just a few small remarks.

docs/src/man/split_apply_combine.md Outdated Show resolved Hide resolved
src/dataframe/dataframe.jl Outdated Show resolved Hide resolved
src/dataframe/dataframe.jl Outdated Show resolved Hide resolved
return GroupedDataFrame(newparent, collect(1:length(gd.cols)), groups,
collect(1:length(idx)), starts, ends, j, nothing)
nothing, nothing, nothing, groups[end], nothing)
Member

Wouldn't it be more efficient to compute starts and ends now (since we know that rows belonging to the same group are consecutive)?

Member Author

I was thinking about it. The reason to remove it is the following:

  1. we would have to compute them anyway (they are not guaranteed to be given like in the branch above)
  2. if we compute them now then we have to store them (so it takes memory), and it is likely that they might be even not needed
  3. the cost of computing indices with compute_indices is higher than computing them here, but not by much (it does 3 loops instead of the 2 loops that could be used in an optimal case, so the saving is at most ~33%)

For 10^8 groups we have:

julia> x = [1:10^8;];

julia> @time DataFrames.compute_indices(x, 10^8);
  1.419073 seconds (8 allocations: 2.235 GiB, 11.12% gc time)

and with an optimal implementation the timing would be a bit over 1 second (at the cost of allocating these 3 vectors; and we may not have to pay this cost at all if this GroupedDataFrame is later used only with an aggregation function like sum).

In summary:

  • if we have to pay this cost we will be slow anyway (as it means we had to use non-aggregating operations which will be more expensive than compute_indices anyway)
  • if we do not have to pay it it means we are doing aggregations only which are much faster than compute_indices (and this is probably the most likely use case)
  • additionally we save some memory (this "some" can be a lot if there are a lot of small groups)
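
To make the trade-off concrete, here is a sketch of what eagerly computing starts and ends could look like when rows of the same group are known to be consecutive. This is not the package's compute_indices and not the code from this PR; `consecutive_bounds` is a hypothetical helper, and it assumes group numbers run from 1 to `ngroups`.

```julia
# `groups[i]` is the group number of row i. Rows of one group are assumed
# to be consecutive, and group numbers are assumed to be 1, 2, ..., ngroups.
function consecutive_bounds(groups::AbstractVector{<:Integer}, ngroups::Integer)
    starts = zeros(Int, ngroups)
    ends = zeros(Int, ngroups)
    for (i, g) in enumerate(groups)
        starts[g] == 0 && (starts[g] = i)  # first row seen for group g
        ends[g] = i                        # last row seen so far for group g
    end
    return starts, ends
end

starts, ends = consecutive_bounds([1, 1, 2, 3, 3, 3], 3)
```

Both vectors have one entry per group, which is the memory that the lazy approach avoids when only aggregations are run afterwards.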

src/groupeddataframe/splitapplycombine.jl Outdated Show resolved Hide resolved
src/groupeddataframe/splitapplycombine.jl Show resolved Hide resolved
src/groupeddataframe/splitapplycombine.jl Outdated Show resolved Hide resolved
src/groupeddataframe/splitapplycombine.jl Outdated Show resolved Hide resolved
@bkamins
Member Author

bkamins commented May 5, 2020

Thank you for the comments. The PR is updated.

The decision to store or not to store idx, starts and ends is indeed double-edged. I have opted to be maximally fast and space-efficient in the "fast case", i.e. when we later use fast aggregation functions on a GroupedDataFrame, which is the case where performance matters most.


Transform a [`GroupedDataFrame`](@ref) into a `DataFrame`.
Apply operations to each group in a [`GroupedDataFrame`](@ref) and return
the combined result as a `DataFrame`.
Member

I guess ungroup should be mentioned here directly, or the type of the result shouldn't be mentioned here?

Member Author

Ah - OK. I have updated both combine and select.

@bkamins
Member Author

bkamins commented May 5, 2020

OK - so I am merging this, since we know CI should pass, and I will open a new PR that just bumps the version; there we will check that everything passes on a clean PR.

@nalimilan - could you now update CSV.jl to fix CategoricalArrays.jl compatibility, so that we can synchronize the releases? Thank you!

@bkamins bkamins merged commit 954a246 into JuliaData:master May 5, 2020
@bkamins bkamins deleted the improve_selection branch May 5, 2020 10:48
@matthieugomez
Contributor

That’s awesome! Thanks a lot for this!

Labels
breaking The proposed change is breaking. feature grouping
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Cleaner syntax
4 participants