Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Modify aggregate for efficiency and decide its future #1246

Closed
wants to merge 1 commit into from
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
31 changes: 12 additions & 19 deletions src/groupeddataframe/grouping.jl
Original file line number Diff line number Diff line change
Expand Up @@ -320,9 +320,6 @@ aggregate(gd::GroupedDataFrame, fs)
* `fs` : a function or vector of functions to be applied to vectors
within groups; expects each argument to be a column vector

Each `fs` should return a value or vector. All returns must be the
same length.

### Returns

* `::DataFrame`
Expand All @@ -342,16 +339,23 @@ df |> groupby(:a) |> [sum, x->mean(Nulls.skip(x))] # equivalent
"""
aggregate(d::AbstractDataFrame, fs::Function; sort::Bool=false) = aggregate(d, [fs], sort=sort)
function aggregate(d::AbstractDataFrame, fs::Vector{T}; sort::Bool=false) where T<:Function
headers = _makeheaders(fs, _names(d))
_aggregate(d, fs, headers, sort)
headers = [Symbol(c, "_", f) for f in fs for c in names(d)]
res = DataFrame(Any[f(d[c]) for f in fs for c in names(d)], headers)
sort && sort!(res)
res
end

# Applies aggregate to non-key cols of each SubDataFrame of a GroupedDataFrame
aggregate(gd::GroupedDataFrame, f::Function; sort::Bool=false) = aggregate(gd, [f], sort=sort)
function aggregate(gd::GroupedDataFrame, fs::Vector{T}; sort::Bool=false) where T<:Function
headers = _makeheaders(fs, setdiff(_names(gd), gd.cols))
res = combine(map(x -> _aggregate(without(x, gd.cols), fs, headers), gd))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I missed this in the first review, but the current code indeed allows returning a vector for each group, and uses combine to turn the result into a single vector. What's annoying is that it's going to slow down everything, but maybe we can make combine efficient when the returned value is a scalar (maybe via inference?)?

We could also imagine having a different function for non-scalar operations. I guess we should check what Pandas and dplyr do.

Copy link
Contributor Author

@cjprybol cjprybol Oct 9, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You caught it in the first review too, but your proposed solution was simpler last time :). Checking how this is handled elsewhere is a good idea

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, but I think I was wrong in my previous review, since I hadn't noticed that call to combine. (AFAICT).

Copy link
Member

@nalimilan nalimilan Oct 9, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dplyr offers summarize, which only allows functions to return a single value, and errors otherwise. Maybe we should provide the same function for simple cases like that. Currently people use by or aggregate, which are powerful but slow (slow because powerful?).

Or, as an interesting Julian challenge, we could try using inference and see whether it can allow us to find out whether a function is going to return a scalar. In that case we could use a fast path.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should drop support for returning vectors. That's what Pandas does, see https://discourse.julialang.org/t/stack-overflow-in-dataframes-group-by/6357/8.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for looking into this and discussing it with ExpandingMan! Offering multiple functions of varying levels of capability/efficiency sounds like the most straightforward way to support all use cases and keep everyone happy in terms of performance. I'd be happy to clarify the distinctions between the functions as part of Doctoberfest to make sure users understand how to use them effectively and to which use-case each applies.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nevertheless, I think we should investigate whether inference could allow supporting vector results efficiently. The less different functions we need, the easier our API will be, and it would be nice to be more flexible than Pandas while still choosing the most efficient approach automatically.

Since aggregate works column-wise, we can use inference, so in theory it would be possible to take a fast path when we detect the function returns a scalar for all columns. We only need to check this once, since the type of the columns is the same across groups. If inference fails it's fine to go back to the current slow approach.

sort && sort!(res, cols=headers)
res = gd.parent[gd.idx[gd.starts], gd.cols]
cols = setdiff(names(gd.parent), gd.cols)
for f in fs
for c in cols
res[Symbol(c, "_", f)] = [f(g[c]) for g in gd]
end
end
sort && sort!(res)
res
end

Expand All @@ -365,14 +369,3 @@ function aggregate(d::AbstractDataFrame,
sort::Bool=false) where {S<:ColumnIndex, T <:Function}
aggregate(groupby(d, cols, sort=sort), fs)
end

function _makeheaders(fs::Vector{T}, cn::Vector{Symbol}) where T<:Function
fnames = _fnames(fs) # see other/utils.jl
[Symbol(colname,'_',fname) for fname in fnames for colname in cn]
end

function _aggregate(d::AbstractDataFrame, fs::Vector{T}, headers::Vector{Symbol}, sort::Bool=false) where T<:Function
res = DataFrame(Any[vcat(f(d[i])) for f in fs for i in 1:size(d, 2)], headers)
sort && sort!(res, cols=headers)
res
end
15 changes: 0 additions & 15 deletions src/other/utils.jl
Original file line number Diff line number Diff line change
Expand Up @@ -141,18 +141,3 @@ function countnull(a::CategoricalArray)
end
return res
end

# Gets the name of a function. Used in groupeDataFrame/grouping.jl
function _fnames(fs::Vector{T}) where T<:Function
λcounter = 0
names = map(fs) do f
name = string(f)
if name == "(anonymous function)" # Anonymous functions with Julia < 0.5
λcounter += 1
name = "λ$(λcounter)"
end
name
end
names
end

36 changes: 36 additions & 0 deletions test/grouping.jl
Original file line number Diff line number Diff line change
Expand Up @@ -181,4 +181,40 @@ module TestGrouping
@test gd[2] == DataFrame(Key1="A", Key2="B", Value=2)
@test gd[3] == DataFrame(Key1="B", Key2="A", Value=3)
@test gd[4] == DataFrame(Key1="B", Key2="B", Value=4)

@testset "aggregate" begin
# test converting functions to valid column names
@test Symbol.([mean, sum]) == [:mean, :sum]
@test ismatch(r"#\d+", string(Symbol(x -> reduce(^, x))))

dt = DataFrame(group = repeat('c':-1:'a', inner = 4), x = 12:-1:1)

@test aggregate(groupby(dt, :group), sum) ==
aggregate(dt, :group, sum) ==
DataFrame(group = 'c':-1:'a', x_sum = [42, 26, 10])

@test aggregate(groupby(dt, :group), sum, sort = true) ==
aggregate(dt, :group, sum, sort = true) ==
DataFrame(group = 'a':'c', x_sum = [10, 26, 42])

@test aggregate(dt, length) == DataFrame(group_length = 12, x_length = 12)
anonymous = x -> length(x)
@test aggregate(dt, anonymous) == DataFrame(Symbol("group_$anonymous") => 12,
Symbol("x_$anonymous") => 12)

dt = DataFrame(year = repeat(1:4, inner = 12), month = repeat(1:12, outer = 4),
a = 1:48, b = fill(24.5, 48))
@test aggregate(dt, [sum, length]) ==
DataFrame(year_sum = 120, month_sum = 312, a_sum = 1176, b_sum = 1176,
year_length = 48, month_length = 48, a_length = 48, b_length = 48)
@test aggregate(dt, [:year], [sum, length]) ==
DataFrame(year = 1:4, month_sum = fill(78, 4), a_sum = [78, 222, 366, 510],
b_sum = fill(294, 4), month_length = fill(12, 4),
a_length = fill(12, 4), b_length = fill(12, 4))

@test aggregate(dt, [:month], [sum, length]) ==
DataFrame(month = 1:12, year_sum = fill(10, 12), a_sum = collect(76:4:120),
b_sum = fill(98, 12), year_length = fill(4, 12),
a_length = fill(4, 12), b_length = fill(4, 12))
end
end
7 changes: 0 additions & 7 deletions test/utils.jl
Original file line number Diff line number Diff line change
Expand Up @@ -49,13 +49,6 @@ module TestUtils
pdata[1:end] = null
@test DataFrames.countnull(pdata) == 20

funs = [mean, sum, var, x -> sum(x)]
if string(funs[end]) == "(anonymous function)" # Julia < 0.5
@test DataFrames._fnames(funs) == ["mean", "sum", "var", "λ1"]
else
@test DataFrames._fnames(funs) == ["mean", "sum", "var", string(funs[end])]
end

@testset "describe" begin
io = IOBuffer()
df = DataFrame(Any[collect(1:4), Vector{Union{Int, Null}}(2:5),
Expand Down