
Modify aggregate for efficiency and decide its future #1246

Closed
wants to merge 1 commit into master from cjp/aggregate

Conversation

@cjprybol (Contributor) commented Oct 9, 2017:

Bypassing map and combine for aggregate speeds up code and reduces
memory allocations by ~1-2 orders of magnitude. The By function
still uses map and combine to allow more complex anonymous functions
that can return DataFrames and do blocks. Functions for naming new
columns were inlined, and their Julia 0.4 Compat removed. Added tests.

Replaces JuliaData/DataTables.jl#65. Should be re-benchmarked, and the benchmarks should be made permanent with PkgBenchmark.jl. We should try using master

@coveralls commented:
Coverage Status

Coverage decreased (-0.07%) to 72.484% when pulling 2b42e2d on cjp/aggregate into eb3d10a on master.

(1 similar comment from @coveralls.)


# Applies aggregate to non-key cols of each SubDataFrame of a GroupedDataFrame
aggregate(gd::GroupedDataFrame, f::Function; sort::Bool=false) = aggregate(gd, [f], sort=sort)
function aggregate(gd::GroupedDataFrame, fs::Vector{T}; sort::Bool=false) where T<:Function
headers = _makeheaders(fs, setdiff(_names(gd), gd.cols))
res = combine(map(x -> _aggregate(without(x, gd.cols), fs, headers), gd))
Member:
Actually, I missed this in the first review, but the current code indeed allows returning a vector for each group, and uses combine to turn the result into a single vector. What's annoying is that this slows everything down, but maybe we can make combine efficient when the returned value is a scalar (perhaps via inference?).

We could also imagine having a different function for non-scalar operations. I guess we should check what Pandas and dplyr do.

@cjprybol (Contributor, Author) commented Oct 9, 2017:
You caught it in the first review too, but your proposed solution was simpler last time :). Checking how this is handled elsewhere is a good idea.

Member:
Yeah, but I think I was wrong in my previous review, since I hadn't noticed that call to combine (AFAICT).

@nalimilan (Member) commented Oct 9, 2017:
dplyr offers summarize, which only allows functions to return a single value, and errors otherwise. Maybe we should provide the same function for simple cases like that. Currently people use by or aggregate, which are powerful but slow (slow because powerful?).

Or, as an interesting Julian challenge, we could try using inference and see whether it can allow us to find out whether a function is going to return a scalar. In that case we could use a fast path.

Member:

Maybe we should drop support for returning vectors. That's what Pandas does, see https://discourse.julialang.org/t/stack-overflow-in-dataframes-group-by/6357/8.

Contributor (Author):

Thanks for looking into this and discussing it with ExpandingMan! Offering multiple functions with varying levels of capability/efficiency sounds like the most straightforward way to support all use cases and keep everyone happy in terms of performance. I'd be happy to clarify the distinctions between the functions as part of Doctoberfest, to make sure users understand how to use them effectively and which use case each applies to.

Member:

Nevertheless, I think we should investigate whether inference could allow supporting vector results efficiently. The fewer functions we need, the simpler our API will be, and it would be nice to be more flexible than Pandas while still choosing the most efficient approach automatically.

Since aggregate works column-wise, we can use inference, so in theory it would be possible to take a fast path when we detect the function returns a scalar for all columns. We only need to check this once, since the type of the columns is the same across groups. If inference fails it's fine to go back to the current slow approach.
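The inference-based fast path discussed above could be sketched roughly as follows. This is a hypothetical illustration, not the PR's actual code: `returns_scalar` is an invented helper name, and it uses `Base.promote_op` to ask inference for the return type of `f` applied to a column's type. If inference fails (the result is abstract, e.g. `Any`), the check returns `false`, which matches the suggestion of falling back to the current slow approach.

```julia
# Hypothetical sketch: decide via inference whether `f` returns a scalar
# for a column of this type, so a fast path can be taken. Since the
# column types are the same across groups, this check only needs to run
# once per column.
function returns_scalar(f, col::AbstractVector)
    # `Base.promote_op` queries inference for the return type of
    # `f(::typeof(col))` without calling `f`.
    T = Base.promote_op(f, typeof(col))
    # Only trust a concrete, non-vector inferred type.
    return isconcretetype(T) && !(T <: AbstractVector)
end

returns_scalar(sum, [1.0, 2.0, 3.0])      # scalar result (Float64)
returns_scalar(x -> sin.(x), [1.0, 2.0])  # vector result, take slow path
```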

@quinnj (Member) commented May 7, 2019:

I don't imagine this is relevant anymore, given all the work that has gone into performance in other efforts. Good to close @nalimilan?

@nalimilan (Member):

aggregate doesn't take advantage of the new efficient methods AFAICT since it uses the type-unstable method. The approach from this PR might still be faster anyway since you can just iterate over columns. (BTW, I'd also like to deprecate aggregate in favor of something like by(df, :key, AllCols() .=> f).)

@bkamins (Member) commented Sep 3, 2019:

@nalimilan - do you have an opinion what we should do about this PR?

@nalimilan (Member):

I think we should see how we can deprecate aggregate in favor of by(df, :key, AllCols() .=> f), and then we'll see whether a separate implementation from combine is needed.

@bkamins (Member) commented Sep 3, 2019:

Working on this, I encountered the following error, which makes me uneasy:

julia> using DataFrames

julia> df = DataFrame(rand(3,4))
3×4 DataFrame
│ Row │ x1       │ x2       │ x3       │ x4       │
│     │ Float64  │ Float64  │ Float64  │ Float64  │
├─────┼──────────┼──────────┼──────────┼──────────┤
│ 1   │ 0.651508 │ 0.231298 │ 0.574306 │ 0.941665 │
│ 2   │ 0.168641 │ 0.25154  │ 0.452554 │ 0.811189 │
│ 3   │ 0.198076 │ 0.637567 │ 0.973212 │ 0.484695 │

julia> by(df, :x1, names(df) .=> sin)
Internal error: encountered unexpected error in runtime:
MethodError(f=typeof(Base.string)(), args=(Expr(:<:, :t, :r),), world=0x0000000000000eec)

There seems to be some problem with the internal design of by (I understand the reason for the error, but the stack trace indicates that something bad is going on there).

If we pass a function that accepts vectors all is OK:

julia> by(df, :x1, names(df) .=> sum)
3×5 DataFrame
│ Row │ x1       │ x1_sum   │ x2_sum   │ x3_sum   │ x4_sum   │
│     │ Float64  │ Float64  │ Float64  │ Float64  │ Float64  │
├─────┼──────────┼──────────┼──────────┼──────────┼──────────┤
│ 1   │ 0.651508 │ 0.651508 │ 0.231298 │ 0.574306 │ 0.941665 │
│ 2   │ 0.168641 │ 0.168641 │ 0.25154  │ 0.452554 │ 0.811189 │
│ 3   │ 0.198076 │ 0.198076 │ 0.637567 │ 0.973212 │ 0.484695 │

julia> by(df, :x1, names(df) .=> x -> sin.(x))
3×5 DataFrame
│ Row │ x1       │ x1_function │ x2_function │ x3_function │ x4_function │
│     │ Float64  │ Float64     │ Float64     │ Float64     │ Float64     │
├─────┼──────────┼─────────────┼─────────────┼─────────────┼─────────────┤
│ 1   │ 0.651508 │ 0.606386    │ 0.229241    │ 0.543252    │ 0.808539    │
│ 2   │ 0.168641 │ 0.167843    │ 0.248896    │ 0.437264    │ 0.725107    │
│ 3   │ 0.198076 │ 0.196783    │ 0.595242    │ 0.826697    │ 0.465938    │

(so it seems that we essentially have it already)

@nalimilan (Member):

Interesting. With a recent Julia master, I get a proper MethodError, so it looks like it's been fixed recently.

Regarding the design, you're right that using names(df) we could already deprecate aggregate without even changing by. Then we could add support for All later. Let's do that?

@bkamins (Member) commented Sep 3, 2019:

Yes - adding All support can come later.
And yes - we can drop aggregate (let us just make sure that we cover all the ways aggregate can be used), since we are in clean-up mode.

In general (this is an opinion, not a recommendation 😄) - I do not find aggregate very useful anyway, as it does not give you control over column names, so it is mostly a function for interactive use.
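The column-naming limitation mentioned above is what the `src => fun => dest` pair syntax (available in later DataFrames releases; an assumption here, not something this PR implements) addresses. A minimal sketch, with hypothetical column names:

```julia
using DataFrames, Statistics

df = DataFrame(key = [1, 1, 2], x = [1.0, 2.0, 3.0])

# aggregate(df, :key, mean) would auto-name the result column (x_mean);
# the pair form lets the caller pick the output name explicitly:
res = combine(groupby(df, :key), :x => mean => :x_avg)
# res has columns :key and :x_avg, one row per group
```

This is essentially the `by(df, :key, names(df) .=> f)` pattern from the discussion, with the extra `=> :x_avg` giving the name control that aggregate lacks.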

@bkamins (Member) commented Dec 1, 2019:

@nalimilan - what is your current thinking about the future of aggregate? 😄
I think we should make this decision before 1.0.

@nalimilan (Member):

I'll have a look. All is waiting on JuliaData/DataAPI.jl#10, but we could use names(df) until then.

@bkamins (Member) commented Dec 1, 2019:

Thank you.

With JuliaData/DataAPI.jl#10, I think everyone was on board with the change (is something stopping that PR?).

@nalimilan (Member):

I just wanted to get Matt's input, as the InvertedIndices author and broadcasting expert.

@bkamins bkamins added the breaking The proposed change is breaking. label Feb 12, 2020
@bkamins bkamins added this to the 1.0 milestone Feb 12, 2020
@bkamins bkamins changed the title Modify aggregate for efficiency Modify aggregate for efficiency and decide its future Feb 12, 2020
@bkamins (Member) commented Feb 12, 2020:

We should decide what to do with aggregate before the 1.0 release.
For this, I understand the crucial thing is `Not` broadcasting.
@nalimilan - I understand you do not have any feedback on it yet?

@nalimilan (Member):

Unfortunately not. Maybe we could go ahead anyway, in the worst case people could do All(Not(...)) .=>.

@bkamins (Member) commented Feb 13, 2020:

I have opened JuliaData/InvertedIndices.jl#15.

Then let us go with `All(Not(...))`, and we can simplify things later if `Not` starts supporting broadcasting.

@bkamins (Member) commented Apr 6, 2020:

I am closing this, as even if we want to work on aggregate in the future this PR would have to be completely rewritten.

@bkamins bkamins closed this Apr 6, 2020
@bkamins (Member) commented Apr 6, 2020:

Of course please reopen if you feel I am wrong.

@nalimilan nalimilan deleted the cjp/aggregate branch April 6, 2020 09:56
Labels
breaking The proposed change is breaking.

5 participants