Modify aggregate for efficiency and decide its future #1246
Conversation
Bypassing map and combine for aggregate speeds up code and reduces memory allocations by ~1-2 orders of magnitude. The `By` function still uses map and combine to allow more complex anonymous functions that can return DataFrames and do blocks. Functions for naming new columns were inlined, and their Julia 0.4 Compat removed. Added tests.
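For context on the semantics being optimized: `aggregate` applies each supplied function to every non-key column of every group. A rough pandas analogue (hypothetical data, shown only for comparison since this PR targets DataFrames.jl) would be:

```python
import pandas as pd

# Hypothetical example data; column names are made up for illustration.
df = pd.DataFrame({"key": ["a", "a", "b"],
                   "x": [1, 2, 3],
                   "y": [4.0, 5.0, 6.0]})

# Apply each reduction to every non-key column of every group,
# roughly what aggregate(groupby(df, :key), [sum, length]) computes.
out = df.groupby("key").agg(["sum", "count"])
```

The key performance question in this PR is how cheaply each of those per-group, per-column reductions can be computed and collected.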
```julia
# Applies aggregate to non-key cols of each SubDataFrame of a GroupedDataFrame
aggregate(gd::GroupedDataFrame, f::Function; sort::Bool=false) = aggregate(gd, [f], sort=sort)
function aggregate(gd::GroupedDataFrame, fs::Vector{T}; sort::Bool=false) where T<:Function
    headers = _makeheaders(fs, setdiff(_names(gd), gd.cols))
    res = combine(map(x -> _aggregate(without(x, gd.cols), fs, headers), gd))
```
Actually, I missed this in the first review, but the current code indeed allows returning a vector for each group, and uses `combine` to turn the result into a single vector. What's annoying is that it's going to slow down everything, but maybe we can make `combine` efficient when the returned value is a scalar (maybe via inference)?
We could also imagine having a different function for non-scalar operations. I guess we should check what Pandas and dplyr do.
You caught it in the first review too, but your proposed solution was simpler last time :). Checking how this is handled elsewhere is a good idea
Yeah, but I think I was wrong in my previous review, since I hadn't noticed that call to `combine` (AFAICT).
dplyr offers `summarize`, which only allows functions to return a single value, and errors otherwise. Maybe we should provide the same function for simple cases like that. Currently people use `by` or `aggregate`, which are powerful but slow (slow because powerful?).
Or, as an interesting Julian challenge, we could try using inference and see whether it can allow us to find out whether a function is going to return a scalar. In that case we could use a fast path.
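The split described above can be mimicked in pandas, which the thread compares against: `agg` plays the role of a scalar-only `summarize`, while `apply` is the general (and slower) escape hatch. A small sketch with illustrative data, not part of this PR:

```python
import pandas as pd

df = pd.DataFrame({"key": ["a", "a", "b"], "v": [1.0, 2.0, 3.0]})

# Scalar-only path, analogous to dplyr's summarize: one value per group.
per_group_mean = df.groupby("key")["v"].agg("mean")

# General path, analogous to by/aggregate + combine: the function may
# return several values per group, which is flexible but harder to optimize.
demeaned = df.groupby("key")["v"].apply(lambda s: s - s.mean())
```

Because `agg` promises one scalar per group, its output size is known in advance, which is exactly what makes a fast path possible.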
Maybe we should drop support for returning vectors. That's what Pandas does, see https://discourse.julialang.org/t/stack-overflow-in-dataframes-group-by/6357/8.
Thanks for looking into this and discussing it with ExpandingMan! Offering multiple functions of varying levels of capability/efficiency sounds like the most straightforward way to support all use cases and keep everyone happy in terms of performance. I'd be happy to clarify the distinctions between the functions as part of Doctoberfest to make sure users understand how to use them effectively and to which use-case each applies.
Nevertheless, I think we should investigate whether inference could allow supporting vector results efficiently. The fewer different functions we need, the easier our API will be, and it would be nice to be more flexible than Pandas while still choosing the most efficient approach automatically.
Since `aggregate` works column-wise, we can use inference, so in theory it would be possible to take a fast path when we detect the function returns a scalar for all columns. We only need to check this once, since the type of the columns is the same across groups. If inference fails it's fine to fall back to the current slow approach.
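The proposed fast path could look roughly like the sketch below. Python is used only for illustration; in Julia the scalar check would come from compiler inference (e.g. something like `Base.return_types` on the column's element type), whereas here it is simulated by probing the function's result on the first group. All names are hypothetical.

```python
import numpy as np

def aggregate_column(groups, f):
    # Probe f once: in Julia this check would be done via inference on the
    # column's element type, and only once, since column types are shared
    # across groups. Here we simply evaluate f on the first group.
    first = f(groups[0])
    if np.isscalar(first):
        # Fast path: one scalar per group, so preallocate the output.
        out = np.empty(len(groups))
        out[0] = first
        for i, g in enumerate(groups[1:], start=1):
            out[i] = f(g)
        return out
    # Slow path: results may be vectors; concatenate them, as combine does.
    parts = [np.atleast_1d(first)] + [np.atleast_1d(f(g)) for g in groups[1:]]
    return np.concatenate(parts)

groups = [np.array([1.0, 2.0]), np.array([3.0])]
means = aggregate_column(groups, np.mean)                     # scalar fast path
centered = aggregate_column(groups, lambda g: g - g.mean())   # vector slow path
```

The runtime probe is where this sketch diverges from the proposal: inference would make the decision statically, without evaluating the function at all.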
I don't imagine this is relevant anymore, given all the work that has gone into performance in other efforts. Good to close @nalimilan?

@nalimilan - do you have an opinion what we should do about this PR?

I think we should see how we can deprecate [...]
Working on this I have encountered the following error, which makes me uneasy:
there seems to be some problem with the internal design of [...]. If we pass a function that accepts vectors, all is OK:
(so it seems that we essentially have it already)
Interesting. With a recent Julia master, I get a proper [...]. Regarding the design, you're right that using [...]

Yes - adding [...]. In general (now this is opinion - not a recommendation 😄) - I do not find [...]

@nalimilan - what is your state of thinking about the future of [...]?

I'll have a look.

Thank you. With JuliaData/DataAPI.jl#10 I think everyone was on board with the change (is there something stopping that PR?)

I just wanted to get Matt's input, as the InvertedIndices author and broadcasting expert.

We should decide what to do with [...]

Unfortunately not. Maybe we could go ahead anyway, in the worst case people could do [...]

I have opened JuliaData/InvertedIndices.jl#15. I think then let us go with [...]

I am closing this, as even if we want to work on [...]

Of course please reopen if you feel I am wrong.
Replaces JuliaData/DataTables.jl#65. Should be re-benchmarked, and the benchmarks should be made permanent with PkgBenchmark.jl. We should try using master