[BREAKING] new design of select, transform and combine #2214

bkamins · 2020-04-27T13:51:23Z

This is still a draft. So far I have implemented only the AbstractDataFrame part (GroupedDataFrame is pending) and have not yet updated the documentation or tests. You can still look at the code to see the changes we essentially agreed on.

bkamins · 2020-04-28T13:05:23Z

@matthieugomez, @pdeffebach, @nalimilan - the PR should be ready for a quick look (and for some testing; maybe some incorrect corner cases will be caught).

All functions should be implemented as we discussed.

I have not updated the tests or the documentation, so there might be some holes, but here is how it works:

julia> df = DataFrame(g=[2,3,1,1,2,2,3,1,2,1], x=1:10)
10×2 DataFrame
│ Row │ g     │ x     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 2     │ 1     │
│ 2   │ 3     │ 2     │
│ 3   │ 1     │ 3     │
│ 4   │ 1     │ 4     │
│ 5   │ 2     │ 5     │
│ 6   │ 2     │ 6     │
│ 7   │ 3     │ 7     │
│ 8   │ 1     │ 8     │
│ 9   │ 2     │ 9     │
│ 10  │ 1     │ 10    │

julia> gdf = groupby(df, :g);

julia> select(gdf, :g, :g => :g1, :g => (x->x) => :g2, :g => mean, :x, :x => :x1, :x => (x->x) => :x2, :x => mean)
10×8 DataFrame
│ Row │ g     │ g1    │ g2    │ g_mean  │ x     │ x1    │ x2    │ x_mean  │
│     │ Int64 │ Int64 │ Int64 │ Float64 │ Int64 │ Int64 │ Int64 │ Float64 │
├─────┼───────┼───────┼───────┼─────────┼───────┼───────┼───────┼─────────┤
│ 1   │ 2     │ 2     │ 2     │ 2.0     │ 1     │ 1     │ 1     │ 5.25    │
│ 2   │ 3     │ 3     │ 3     │ 3.0     │ 2     │ 2     │ 2     │ 4.5     │
│ 3   │ 1     │ 1     │ 1     │ 1.0     │ 3     │ 3     │ 3     │ 6.25    │
│ 4   │ 1     │ 1     │ 1     │ 1.0     │ 4     │ 4     │ 4     │ 6.25    │
│ 5   │ 2     │ 2     │ 2     │ 2.0     │ 5     │ 5     │ 5     │ 5.25    │
│ 6   │ 2     │ 2     │ 2     │ 2.0     │ 6     │ 6     │ 6     │ 5.25    │
│ 7   │ 3     │ 3     │ 3     │ 3.0     │ 7     │ 7     │ 7     │ 4.5     │
│ 8   │ 1     │ 1     │ 1     │ 1.0     │ 8     │ 8     │ 8     │ 6.25    │
│ 9   │ 2     │ 2     │ 2     │ 2.0     │ 9     │ 9     │ 9     │ 5.25    │
│ 10  │ 1     │ 1     │ 1     │ 1.0     │ 10    │ 10    │ 10    │ 6.25    │

matthieugomez · 2020-04-28T18:10:50Z

I’ve tried it a little bit and it feels very good from a user perspective!

bkamins · 2020-04-28T21:31:17Z

@nalimilan - there is one thing on top of what was my idea in #2206 (and something you said you wanted to allow anyway).

We should treat GroupedDataFrame as canonical irrespective of its ordering, as long as it contains all groups. I guess no one will oppose this 😄, as it was only me who wanted to be strict. The reason for this is that groupby automatically sorts GroupedDataFrame - even if not asked for - if all grouping columns are CategoricalVector. I will try to make the changes needed to reflect this fact.
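
As a rough illustration of an order-independent check of this kind, here is a minimal sketch in plain Julia. The helper name `covers_all_groups` and its arguments are hypothetical and are not taken from the DataFrames.jl source; it only shows the idea of accepting any ordering as long as every group is present exactly once.

```julia
# Hypothetical helper, not part of DataFrames.jl.
# `selected` lists the group numbers held by a grouped object;
# `ngroups` is the number of groups in the parent.
function covers_all_groups(selected::AbstractVector{<:Integer}, ngroups::Integer)
    # Having exactly `ngroups` entries that form a permutation of 1:ngroups
    # means every group appears exactly once, in any order.
    return length(selected) == ngroups && isperm(selected)
end

covers_all_groups([3, 1, 2], 3)  # all groups present, reordered
covers_all_groups([1, 2], 3)     # one group missing
covers_all_groups([1, 1, 2], 3)  # a group repeated
```

With this sketch, only the first call returns `true`; a strict check that also required the order `1:ngroups` would reject it.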

bkamins · 2020-04-29T22:32:04Z

I have finished the development and tests of the new functionality.
In the process, some optimizations of the old code were made to reduce copying of data.

The documentation is half-finished (i.e. I tried to update it everywhere, but it is surely not perfect). Maybe we should even agree to polish it in a separate PR later (but if you have good suggestions now, I am of course open to them).

matthieugomez

That looks great! I will do a pull request with a dplyr/Stata comparison, following the tables we did in the issue.

bkamins · 2020-05-01T14:51:08Z

OK, I was surprised that the gain was so small. Now I have fixed the performance (barrier functions came in handy; @nalimilan - as a side note the old map code was not only buggy but also inefficient due to type instability); the benchmark after the update is:

julia> @time df = DataFrame(g=rand(10^8));
  0.790909 seconds (28 allocations: 1.490 GiB, 22.32% gc time)

julia> @time gdf = groupby(df, :g);
 14.090509 seconds (50 allocations: 2.490 GiB, 1.26% gc time)

julia> @time df2 = combine(gdf, :g => length);
  1.498985 seconds (179 allocations: 2.328 GiB, 16.99% gc time)

julia> @time gdf2 = groupby(df2, :g);
 14.069919 seconds (34 allocations: 2.490 GiB, 0.82% gc time)

julia> @time gdf3 = combine(gdf, :g => length, regroup=true);
  1.847234 seconds (186 allocations: 3.073 GiB, 12.57% gc time)

(and you see that regroup is almost 0 cost)

the same with select:

julia> gdf.idx;

julia> @time select(gdf, :g => length);
  6.707589 seconds (200.00 M allocations: 8.196 GiB, 6.05% gc time)

julia> @time select(gdf, :g => length, regroup=true);
  6.758230 seconds (200.00 M allocations: 8.196 GiB, 6.17% gc time)
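
The barrier functions mentioned in the comment above are a general Julia technique for containing type instability. A minimal sketch in plain Julia follows; the function names are hypothetical and this is not the code used in the PR.

```julia
# Illustrative only; these names do not come from DataFrames.jl.

# Inner kernel: compiled separately for each concrete vector type it receives.
function sum_kernel(v::AbstractVector{T}) where {T}
    s = zero(T)
    for x in v
        s += x
    end
    return s
end

# Outer function: the element type of each column is unknown to the compiler
# because the container is a Vector{Any}. Calling `sum_kernel` acts as the
# barrier: dynamic dispatch happens once per column, and the loop inside the
# kernel runs on a concretely typed vector.
function column_sums(cols::Vector{Any})
    return [sum_kernel(c) for c in cols]
end

cols = Any[[1, 2, 3], [1.5, 2.5]]
column_sums(cols)
```

The design point is that the cost of not knowing the column type is paid once per call to the kernel rather than once per element.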

matthieugomez · 2020-05-01T17:11:30Z

ungroup=true (the default) would mean that it returns a DataFrame.

bkamins · 2020-05-02T05:09:19Z

OK - I have pushed a commit changing regroup to ungroup

bkamins · 2020-05-03T08:06:22Z

Only coverage fails

nalimilan

ungroup sounds good (that's the term used by dplyr).

Just a few small remarks.

nalimilan · 2020-05-04T16:49:43Z

src/groupeddataframe/splitapplycombine.jl

            return GroupedDataFrame(newparent, collect(1:length(gd.cols)), groups,
-                                    collect(1:length(idx)), starts, ends, j, nothing)
+                                    nothing, nothing, nothing, groups[end], nothing)


Wasn't it more efficient to compute starts and ends now (since we know rows belonging to the same group are consecutive)?

I was thinking about it. The reason to remove it is the following:

we would have to compute them anyway (they are not guaranteed to be given like in the branch above)

if we compute them now then we have to store them (so it takes memory), and it is likely that they might be even not needed

the cost of computing indices with compute_indices is higher than computing them here, but not much higher (it does 3 loops instead of the 2 loops that could be used in the optimal case, so the saving is at most ~33%)

For 10^8 groups we have:

julia> x = [1:10^8;];

julia> @time DataFrames.compute_indices(x, 10^8);
  1.419073 seconds (8 allocations: 2.235 GiB, 11.12% gc time)

and with an optimal implementation the timing would be a bit over 1 second (at the cost of allocating these 3 vectors; and we may not have to pay this cost at all if this GroupedDataFrame is later used only with an aggregation function like sum).

In summary:

if we have to pay this cost we will be slow anyway (as it means we had to use non-aggregating operations which will be more expensive than compute_indices anyway)

if we do not have to pay it it means we are doing aggregations only which are much faster than compute_indices (and this is probably the most likely use case)

additionally we save some memory (this "some" can be a lot if there are a lot of small groups)
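
To illustrate the trade-off discussed in this thread, here is a rough sketch of two ways of obtaining group boundaries. It is not the actual `DataFrames.compute_indices` code: the function names are hypothetical, and the number of passes in the sketch is an assumption made for illustration and need not match the real implementation.

```julia
# Illustrative sketch; NOT the DataFrames.compute_indices implementation.
# `groups[i]` is the group number (in 1:ngroups) of row i.

# Case 1: rows of each group are known to be consecutive.
# The boundaries can be read off directly while scanning the rows.
function consecutive_bounds(groups::Vector{Int}, ngroups::Int)
    starts = zeros(Int, ngroups)
    ends = zeros(Int, ngroups)
    for (i, g) in enumerate(groups)
        starts[g] == 0 && (starts[g] = i)  # first row seen for group g
        ends[g] = i                        # last row seen so far for group g
    end
    return starts, ends
end

# Case 2: rows are in arbitrary order (counting-sort style, three passes).
function general_indices(groups::Vector{Int}, ngroups::Int)
    counts = zeros(Int, ngroups)
    for g in groups                    # pass 1: count rows per group
        counts[g] += 1
    end
    starts = Vector{Int}(undef, ngroups)
    ends = Vector{Int}(undef, ngroups)
    pos = 1
    for g in 1:ngroups                 # pass 2: turn counts into ranges
        starts[g] = pos
        pos += counts[g]
        ends[g] = pos - 1
    end
    idx = Vector{Int}(undef, length(groups))
    cursor = copy(starts)
    for (i, g) in enumerate(groups)    # pass 3: place each row in its group's range
        idx[cursor[g]] = i
        cursor[g] += 1
    end
    return idx, starts, ends
end

# Grouping column `g` from the example at the top of this PR.
idx, starts, ends = general_indices([2, 3, 1, 1, 2, 2, 3, 1, 2, 1], 3)
```

Either way, three vectors of integers are allocated and kept, which is the memory cost that deferring the computation avoids when only aggregations follow.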

bkamins · 2020-05-05T09:19:07Z

Thank you for the comments. The PR is updated.

The decision whether to store idx, starts and ends is indeed double-edged. I have opted to be maximally fast and space-efficient in the "fast case", i.e. when fast aggregation functions are later used on a GroupedDataFrame, which is the case where we fight for performance.

nalimilan · 2020-05-05T09:28:12Z

src/groupeddataframe/splitapplycombine.jl


-Transform a [`GroupedDataFrame`](@ref) into a `DataFrame`.
+Apply operations to each group in a [`GroupedDataFrame`](@ref) and return
+the combined result as a `DataFrame`.


I guess ungroup should be mentioned here directly, or the type of the result shouldn't be mentioned here?

Ah - OK. I have updated both combine and select.

bkamins · 2020-05-05T10:47:44Z

OK - so I am merging this, as we know CI should pass. I will open a new PR that just bumps the version, and there we will check that everything passes on a clean PR.

@nalimilan - will you now make an update to CSV.jl that fixes CategoricalArrays.jl compatibility, so that we can synchronize the releases? Thank you!

matthieugomez · 2020-05-05T11:47:10Z

That’s awesome! Thanks a lot for this!

implement AbstractDataFrame functionality

55031d7

bkamins changed the title ~~implement AbstractDataFrame functionality~~ [BREAKING] implement AbstractDataFrame functionality Apr 27, 2020

bkamins added the breaking, feature and grouping labels Apr 27, 2020

bkamins added this to the 1.0 milestone Apr 27, 2020

bkamins changed the title ~~[BREAKING] implement AbstractDataFrame functionality~~ [BREAKING] new design of select, transform and combine Apr 27, 2020

bkamins added 6 commits April 27, 2020 23:41

preparation in grouping, rename to _mutate in non-grouping

55bde12

tentative rework of _combine that should be able to support select and transform efficiently

2f81c63

continue grouping

fd951c5

implement select, transform, select! and transform! for GroupedDataFrame, fix bug in map

eb9ace9

update DataFrame constructor

6908ee8

fix handling of aggregates

7b644dd

code cleanup

2753235

bkamins mentioned this pull request Apr 28, 2020

Port to CategoricalArrays 0.8 JuliaData/CSV.jl#602

Merged

improve canonical check + start rewriting tests

2a03190

matthieugomez mentioned this pull request Apr 28, 2020

Creating new columns on a view should fill in missings everywhere else. #2211

Closed

bkamins added 2 commits April 28, 2020 23:52

allow changing sort order of groups in cannonical test

7b86eb8

make old tests pass

cb94903

bkamins marked this pull request as ready for review April 29, 2020 11:17

bkamins added 4 commits April 29, 2020 13:19

Merge branch 'master' into improve_selection

908d489

change error thrown on Julia 1.0

384c0b1

done tests of combine

ea574c4

finish tests and documentation

8977017

matthieugomez reviewed Apr 29, 2020

src/abstractdataframe/selection.jl (resolved)

matthieugomez reviewed Apr 29, 2020

docs/src/man/split_apply_combine.md (resolved)

docs/src/man/split_apply_combine.md (outdated, resolved)

avoid computing idx, starts and ends in combine if regroup=true

0f3d309

bkamins added 2 commits May 1, 2020 16:51

performance improvements

1d69fa3

@simd did not improve the performance here

5713194

pdeffebach reviewed May 1, 2020

docs/src/man/split_apply_combine.md (outdated, resolved)

docs/src/man/split_apply_combine.md (resolved)

bkamins and others added 2 commits May 1, 2020 17:36

Update docs/src/man/split_apply_combine.md

1f34d55

Co-authored-by: pdeffebach <23196228+pdeffebach@users.noreply.github.com>

add an example of passing function as a first argument to combine

2201789

change regroup to ungroup

2aa9170

bkamins mentioned this pull request May 3, 2020

possible test failure in upcoming Julia version 1.5 #2221

Closed

nalimilan reviewed May 4, 2020

bkamins and others added 4 commits May 5, 2020 10:44

Merge branch 'master' into improve_selection

cf4736c

Apply suggestions from code review

333cca2

Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>

Merge remote-tracking branch 'origin/improve_selection' into improve_selection

334aba0

update docs

10b9474

nalimilan reviewed May 5, 2020

improve description of what gets returned in combine and select

792b57d

nalimilan reviewed May 5, 2020

src/groupeddataframe/splitapplycombine.jl (outdated, resolved)

fix repeated code

f34873c

nalimilan approved these changes May 5, 2020

bkamins merged commit 954a246 into JuliaData:master May 5, 2020

bkamins deleted the improve_selection branch May 5, 2020 10:48

This was referenced May 5, 2020

Add select, select!, transform and transform! for GroupedDataFrame #2172

Closed

In-place by/combine #2127

Closed

danielolsen mentioned this pull request Jul 23, 2020

Refactor for DataFrames v0.21 compatibility Breakthrough-Energy/REISE.jl#63

Merged
