
Modify aggregate for efficiency and decide its future #1246

Closed
wants to merge 1 commit into master from cjp/aggregate

Conversation

@cjprybol (Contributor) commented Oct 9, 2017:

Bypassing map and combine for aggregate speeds up code and reduces
memory allocations by ~1-2 orders of magnitude. The By function
still uses map and combine to allow more complex anonymous functions
that can return DataFrames and do blocks. Functions for naming new
columns were inlined, and their Julia 0.4 Compat removed. Added tests.

Replaces JuliaData/DataTables.jl#65. Should be re-benchmarked, and the benchmarks should be made permanent with PkgBenchmark.jl. We should try using master

@coveralls commented:
Coverage Status

Coverage decreased (-0.07%) to 72.484% when pulling 2b42e2d on cjp/aggregate into eb3d10a on master.

(1 similar comment from @coveralls.)


# Applies aggregate to non-key cols of each SubDataFrame of a GroupedDataFrame
aggregate(gd::GroupedDataFrame, f::Function; sort::Bool=false) = aggregate(gd, [f], sort=sort)
function aggregate(gd::GroupedDataFrame, fs::Vector{T}; sort::Bool=false) where T<:Function
headers = _makeheaders(fs, setdiff(_names(gd), gd.cols))
res = combine(map(x -> _aggregate(without(x, gd.cols), fs, headers), gd))
Member:
Actually, I missed this in the first review, but the current code indeed allows returning a vector for each group, and uses combine to turn the result into a single vector. What's annoying is that this slows everything down, but maybe we can make combine efficient when the returned value is a scalar (perhaps via inference?).

We could also imagine having a different function for non-scalar operations. I guess we should check what Pandas and dplyr do.

@cjprybol (Contributor, Author) commented Oct 9, 2017:
You caught it in the first review too, but your proposed solution was simpler last time :). Checking how this is handled elsewhere is a good idea.

Member:
Yeah, but I think I was wrong in my previous review, since I hadn't noticed that call to combine (AFAICT).

@nalimilan (Member) commented Oct 9, 2017:
dplyr offers summarize, which only allows functions to return a single value, and errors otherwise. Maybe we should provide the same function for simple cases like that. Currently people use by or aggregate, which are powerful but slow (slow because powerful?).

Or, as an interesting Julian challenge, we could try using inference and see whether it can allow us to find out whether a function is going to return a scalar. In that case we could use a fast path.

Member:

Maybe we should drop support for returning vectors. That's what Pandas does, see https://discourse.julialang.org/t/stack-overflow-in-dataframes-group-by/6357/8.

Contributor (Author):

Thanks for looking into this and discussing it with ExpandingMan! Offering multiple functions with varying levels of capability/efficiency sounds like the most straightforward way to support all use cases and keep everyone happy in terms of performance. I'd be happy to clarify the distinctions between the functions as part of Doctoberfest, to make sure users understand how to use them effectively and which use case each applies to.

Member:

Nevertheless, I think we should investigate whether inference could allow supporting vector results efficiently. The fewer functions we need, the simpler our API will be, and it would be nice to be more flexible than Pandas while still choosing the most efficient approach automatically.

Since aggregate works column-wise, we can use inference, so in theory it would be possible to take a fast path when we detect the function returns a scalar for all columns. We only need to check this once, since the type of the columns is the same across groups. If inference fails it's fine to go back to the current slow approach.
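The inference-based fast path discussed above could be sketched roughly as follows. This is a hypothetical illustration, not the PR's actual code: `returns_scalar` is an invented helper name, and it uses `Base.promote_op` to ask inference for the return type of `f` applied to a column's type. If inference fails (the result is abstract, e.g. `Any`), the check returns `false`, which matches the suggestion of falling back to the current slow approach.

```julia
# Hypothetical sketch: decide via inference whether `f` returns a scalar
# for a column of this type, so a fast path can be taken. Since the
# column types are the same across groups, this check only needs to run
# once per column.
function returns_scalar(f, col::AbstractVector)
    # `Base.promote_op` queries inference for the return type of
    # `f(::typeof(col))` without calling `f`.
    T = Base.promote_op(f, typeof(col))
    # Only trust a concrete, non-vector inferred type.
    return isconcretetype(T) && !(T <: AbstractVector)
end

returns_scalar(sum, [1.0, 2.0, 3.0])      # scalar result (Float64)
returns_scalar(x -> sin.(x), [1.0, 2.0])  # vector result, take slow path
```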

@quinnj (Member) commented May 7, 2019:

I don't imagine this is relevant anymore, given all the work that has gone into performance in other efforts. Good to close @nalimilan?

@nalimilan (Member):

aggregate doesn't take advantage of the new efficient methods AFAICT since it uses the type-unstable method. The approach from this PR might still be faster anyway since you can just iterate over columns. (BTW, I'd also like to deprecate aggregate in favor of something like by(df, :key, AllCols() .=> f).)

@bkamins (Member) commented Sep 3, 2019:

@nalimilan - do you have an opinion what we should do about this PR?

@nalimilan (Member):

I think we should see how we can deprecate aggregate in favor of by(df, :key, AllCols() .=> f), and then we'll see whether a separate implementation from combine is needed.

@bkamins (Member) commented Sep 3, 2019:

Working on this, I encountered the following error, which makes me uneasy:

julia> using DataFrames

julia> df = DataFrame(rand(3,4))
3×4 DataFrame
│ Row │ x1       │ x2       │ x3       │ x4       │
│     │ Float64  │ Float64  │ Float64  │ Float64  │
├─────┼──────────┼──────────┼──────────┼──────────┤
│ 1   │ 0.651508 │ 0.231298 │ 0.574306 │ 0.941665 │
│ 2   │ 0.168641 │ 0.25154  │ 0.452554 │ 0.811189 │
│ 3   │ 0.198076 │ 0.637567 │ 0.973212 │ 0.484695 │

julia> by(df, :x1, names(df) .=> sin)
Internal error: encountered unexpected error in runtime:
MethodError(f=typeof(Base.string)(), args=(Expr(:<:, :t, :r),), world=0x0000000000000eec)

There seems to be some problem with the internal design of by (I understand the reason for the error, but the stack trace indicates that something bad is going on there).

If we pass a function that accepts vectors all is OK:

julia> by(df, :x1, names(df) .=> sum)
3×5 DataFrame
│ Row │ x1       │ x1_sum   │ x2_sum   │ x3_sum   │ x4_sum   │
│     │ Float64  │ Float64  │ Float64  │ Float64  │ Float64  │
├─────┼──────────┼──────────┼──────────┼──────────┼──────────┤
│ 1   │ 0.651508 │ 0.651508 │ 0.231298 │ 0.574306 │ 0.941665 │
│ 2   │ 0.168641 │ 0.168641 │ 0.25154  │ 0.452554 │ 0.811189 │
│ 3   │ 0.198076 │ 0.198076 │ 0.637567 │ 0.973212 │ 0.484695 │

julia> by(df, :x1, names(df) .=> x -> sin.(x))
3×5 DataFrame
│ Row │ x1       │ x1_function │ x2_function │ x3_function │ x4_function │
│     │ Float64  │ Float64     │ Float64     │ Float64     │ Float64     │
├─────┼──────────┼─────────────┼─────────────┼─────────────┼─────────────┤
│ 1   │ 0.651508 │ 0.606386    │ 0.229241    │ 0.543252    │ 0.808539    │
│ 2   │ 0.168641 │ 0.167843    │ 0.248896    │ 0.437264    │ 0.725107    │
│ 3   │ 0.198076 │ 0.196783    │ 0.595242    │ 0.826697    │ 0.465938    │

(so it seems that we essentially have it already)

@nalimilan (Member):

Interesting. With a recent Julia master, I get a proper MethodError, so it looks like it's been fixed recently.

Regarding the design, you're right that using names(df) we could already deprecate aggregate without even changing by. Then we could add support for All later. Let's do that?

@bkamins (Member) commented Sep 3, 2019:

Yes - adding All support can come later.
And yes - we can drop aggregate (let us just make sure that we cover all the ways aggregate can be used), since we are in clean-up mode.

In general (this is an opinion, not a recommendation 😄) - I do not find aggregate very useful anyway, as it does not give you control over column names, so it is mostly a function for interactive use.
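The column-naming limitation mentioned above is what the `src => fun => dest` pair syntax (available in later DataFrames releases; an assumption here, not something this PR implements) addresses. A minimal sketch, with hypothetical column names:

```julia
using DataFrames, Statistics

df = DataFrame(key = [1, 1, 2], x = [1.0, 2.0, 3.0])

# aggregate(df, :key, mean) would auto-name the result column (x_mean);
# the pair form lets the caller pick the output name explicitly:
res = combine(groupby(df, :key), :x => mean => :x_avg)
# res has columns :key and :x_avg, one row per group
```

This is essentially the `by(df, :key, names(df) .=> f)` pattern from the discussion, with the extra `=> :x_avg` giving the name control that aggregate lacks.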

@bkamins (Member) commented Dec 1, 2019:

@nalimilan - what is your current thinking about the future of aggregate? 😄
I think we should make this decision before 1.0.

@nalimilan (Member):

I'll have a look. All is waiting on JuliaData/DataAPI.jl#10, but we could use names(df) until then.

@bkamins (Member) commented Dec 1, 2019:

Thank you.

With JuliaData/DataAPI.jl#10, I think everyone was on board with the change (is something stopping that PR?).

@nalimilan (Member):

I just wanted to get Matt's input, as the InvertedIndices author and broadcasting expert.

@bkamins bkamins added the breaking The proposed change is breaking. label Feb 12, 2020
@bkamins bkamins added this to the 1.0 milestone Feb 12, 2020
@bkamins bkamins changed the title Modify aggregate for efficiency Modify aggregate for efficiency and decide its future Feb 12, 2020
@bkamins (Member) commented Feb 12, 2020:

We should decide what to do with aggregate before the 1.0 release.
For this, I understand the crucial thing is `Not` broadcasting.
@nalimilan - I understand you do not have any feedback on it yet?

@nalimilan (Member):

Unfortunately not. Maybe we could go ahead anyway, in the worst case people could do All(Not(...)) .=>.

@bkamins (Member) commented Feb 13, 2020:

I have opened JuliaData/InvertedIndices.jl#15.

Then let us go with `All(Not(...))`, and we can simplify things later if `Not` starts supporting broadcasting.

@bkamins (Member) commented Apr 6, 2020:

I am closing this, as even if we want to work on aggregate in the future this PR would have to be completely rewritten.

@bkamins bkamins closed this Apr 6, 2020
@bkamins (Member) commented Apr 6, 2020:

Of course please reopen if you feel I am wrong.

@nalimilan nalimilan deleted the cjp/aggregate branch April 6, 2020 09:56
Labels
breaking The proposed change is breaking.

5 participants