Grouped describe fails or "clashes" with StatsBase #2952

Closed
jonas-schulze opened this issue Nov 30, 2021 · 7 comments

@jonas-schulze
Contributor

How do I use groupby and describe together properly? I was trying to describe duration values based on some category. Having only DataFrames loaded, it fails due to a missing method for describe:

julia> n = 10;

julia> using DataFrames

julia> df = DataFrame(a=rand(Bool, n), b=rand(1:3, n), duration=randn(n));

julia> select(groupby(df, [:a, :b]), :duration => describe)
ERROR: MethodError: no method matching describe(::SubArray{Float64, 1, Vector{Float64}, Tuple{SubArray{Int64, 1, Vector{Int64}, Tuple{UnitRange{Int64}}, true}}, false})
Closest candidates are:
  describe(::AbstractDataFrame; cols) at /Users/jonas/.julia/packages/DataFrames/vuMM8/src/abstractdataframe/abstractdataframe.jl:573
  describe(::AbstractDataFrame, ::Union{Symbol, Pair{var"#s25", var"#s24"} where {var"#s25"<:Union{Function, Type}, var"#s24"<:Union{AbstractString, Symbol}}}...; cols) at /Users/jonas/.julia/packages/DataFrames/vuMM8/src/abstractdataframe/abstractdataframe.jl:570
Stacktrace:
 [1] _combine(gd::GroupedDataFrame{DataFrame}, cs_norm::Vector{Any}, optional_transform::Vector{Bool}, copycols::Bool, keeprows::Bool, renamecols::Bool)
   @ DataFrames ~/.julia/packages/DataFrames/vuMM8/src/groupeddataframe/splitapplycombine.jl:601
 [2] _combine_prepare_norm(gd::GroupedDataFrame{DataFrame}, cs_vec::Vector{Any}, keepkeys::Bool, ungroup::Bool, copycols::Bool, keeprows::Bool, renamecols::Bool)
   @ DataFrames ~/.julia/packages/DataFrames/vuMM8/src/groupeddataframe/splitapplycombine.jl:81
 [3] _combine_prepare(gd::GroupedDataFrame{DataFrame}, ::Base.RefValue{Any}; keepkeys::Bool, ungroup::Bool, copycols::Bool, keeprows::Bool, renamecols::Bool)
   @ DataFrames ~/.julia/packages/DataFrames/vuMM8/src/groupeddataframe/splitapplycombine.jl:47
 [4] select(gd::GroupedDataFrame{DataFrame}, args::Union{Regex, AbstractString, Function, Signed, Symbol, Unsigned, Pair, AbstractVector{T} where T, Type, All, Between, Cols, InvertedIndex, AbstractVecOrMat{var"#s429"} where var"#s429"<:Pair}; copycols::Bool, keepkeys::Bool, ungroup::Bool, renamecols::Bool)
   @ DataFrames ~/.julia/packages/DataFrames/vuMM8/src/groupeddataframe/splitapplycombine.jl:722
 [5] select(gd::GroupedDataFrame{DataFrame}, args::Union{Regex, AbstractString, Function, Signed, Symbol, Unsigned, Pair, AbstractVector{T} where T, Type, All, Between, Cols, InvertedIndex, AbstractVecOrMat{var"#s428"} where var"#s428"<:Pair})
   @ DataFrames ~/.julia/packages/DataFrames/vuMM8/src/groupeddataframe/splitapplycombine.jl:722
 [6] top-level scope
   @ REPL[4]:1

caused by: TaskFailedException
Stacktrace:
 [1] wait
   @ ./task.jl:322 [inlined]
 [2] _combine(gd::GroupedDataFrame{DataFrame}, cs_norm::Vector{Any}, optional_transform::Vector{Bool}, copycols::Bool, keeprows::Bool, renamecols::Bool)
   @ DataFrames ~/.julia/packages/DataFrames/vuMM8/src/groupeddataframe/splitapplycombine.jl:597
 [3] _combine_prepare_norm(gd::GroupedDataFrame{DataFrame}, cs_vec::Vector{Any}, keepkeys::Bool, ungroup::Bool, copycols::Bool, keeprows::Bool, renamecols::Bool)
   @ DataFrames ~/.julia/packages/DataFrames/vuMM8/src/groupeddataframe/splitapplycombine.jl:81
 [4] _combine_prepare(gd::GroupedDataFrame{DataFrame}, ::Base.RefValue{Any}; keepkeys::Bool, ungroup::Bool, copycols::Bool, keeprows::Bool, renamecols::Bool)
   @ DataFrames ~/.julia/packages/DataFrames/vuMM8/src/groupeddataframe/splitapplycombine.jl:47
 [5] select(gd::GroupedDataFrame{DataFrame}, args::Union{Regex, AbstractString, Function, Signed, Symbol, Unsigned, Pair, AbstractVector{T} where T, Type, All, Between, Cols, InvertedIndex, AbstractVecOrMat{var"#s429"} where var"#s429"<:Pair}; copycols::Bool, keepkeys::Bool, ungroup::Bool, renamecols::Bool)
   @ DataFrames ~/.julia/packages/DataFrames/vuMM8/src/groupeddataframe/splitapplycombine.jl:722
 [6] select(gd::GroupedDataFrame{DataFrame}, args::Union{Regex, AbstractString, Function, Signed, Symbol, Unsigned, Pair, AbstractVector{T} where T, Type, All, Between, Cols, InvertedIndex, AbstractVecOrMat{var"#s428"} where var"#s428"<:Pair})
   @ DataFrames ~/.julia/packages/DataFrames/vuMM8/src/groupeddataframe/splitapplycombine.jl:722
 [7] top-level scope
   @ REPL[4]:1

    nested task error: MethodError: no method matching describe(::SubArray{Float64, 1, Vector{Float64}, Tuple{SubArray{Int64, 1, Vector{Int64}, Tuple{UnitRange{Int64}}, true}}, false})
    Closest candidates are:
      describe(::AbstractDataFrame; cols) at /Users/jonas/.julia/packages/DataFrames/vuMM8/src/abstractdataframe/abstractdataframe.jl:573
      describe(::AbstractDataFrame, ::Union{Symbol, Pair{var"#s25", var"#s24"} where {var"#s25"<:Union{Function, Type}, var"#s24"<:Union{AbstractString, Symbol}}}...; cols) at /Users/jonas/.julia/packages/DataFrames/vuMM8/src/abstractdataframe/abstractdataframe.jl:570
    Stacktrace:
     [1] do_call(f::typeof(describe), idx::Vector{Int64}, starts::Vector{Int64}, ends::Vector{Int64}, gd::GroupedDataFrame{DataFrame}, incols::Tuple{Vector{Float64}}, i::Int64)
       @ DataFrames ~/.julia/packages/DataFrames/vuMM8/src/groupeddataframe/callprocessing.jl:94
     [2] _combine_process_pair(::Base.RefValue{Any}, optional_i::Bool, parentdf::DataFrame, gd::GroupedDataFrame{DataFrame}, seen_cols::Dict{Symbol, Tuple{Bool, Int64}}, trans_res::Vector{DataFrames.TransformationResult}, idx_agg::Base.RefValue{Vector{Int64}})
       @ DataFrames ~/.julia/packages/DataFrames/vuMM8/src/groupeddataframe/splitapplycombine.jl:492
     [3] macro expansion
       @ ~/.julia/packages/DataFrames/vuMM8/src/groupeddataframe/splitapplycombine.jl:589 [inlined]
     [4] (::DataFrames.var"#614#620"{GroupedDataFrame{DataFrame}, Bool, Bool, DataFrame, Dict{Symbol, Tuple{Bool, Int64}}, Vector{DataFrames.TransformationResult}, Base.RefValue{Vector{Int64}}, Bool, Pair{Int64, Pair{typeof(describe), Symbol}}})()
       @ DataFrames ./threadingconstructs.jl:169

StatsBase (which was loaded by CairoMakie in my case) does define such a method, for example, but it is incompatible with my intended use case:

julia> using StatsBase

julia> select(groupby(df, [:a, :b]), :duration => describe)
Summary Stats:
Length:         4
Missing Count:  0
Mean:           -0.180701
Minimum:        -1.262080
1st Quartile:   -0.986103
Median:         -0.364364
3rd Quartile:   0.441038
Maximum:        1.268003
Type:           Float64
Summary Stats:
Length:         3
Missing Count:  0
Mean:           0.096027
Minimum:        -0.405362
1st Quartile:   -0.185873
Median:         0.033616
3rd Quartile:   0.346721
Maximum:        0.659826
Type:           Float64
Summary Stats:
Length:         1
Missing Count:  0
Mean:           -0.480075
Minimum:        -0.480075
1st Quartile:   -0.480075
Median:         -0.480075
3rd Quartile:   -0.480075
Maximum:        -0.480075
Type:           Float64
Summary Stats:
Length:         2
Missing Count:  0
Mean:           -0.023849
Minimum:        -0.025362
1st Quartile:   -0.024605
Median:         -0.023849
3rd Quartile:   -0.023092
Maximum:        -0.022335
Type:           Float64
10×3 DataFrame
 Row │ a      b      duration_describe
     │ Bool   Int64  Nothing
─────┼─────────────────────────────────
   1 │  true      3
   2 │  true      1
   3 │  true      2
   4 │ false      1
   5 │  true      1
   6 │  true      2
   7 │  true      1
   8 │ false      1
   9 │ false      1
  10 │ false      1

I tried variations with AsTable(:duration), but I couldn't get it working. How would you guys approach this? 🙂

Maybe one could define a method for describe(::NamedTuple) to get this working with AsTable, but I'm afraid that would conflict with the use case of StatsBase.
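
The `Nothing` element type of `duration_describe` in the output above suggests that the StatsBase method prints its summary as a side effect and returns `nothing`. A minimal sketch to check this, assuming StatsBase is installed (the vector `x` is made up for illustration):

```julia
using StatsBase

x = [1.0, 2.0, 3.0]          # illustrative data, not taken from the issue

# Prints "Summary Stats: ..." to stdout as a side effect.
result = describe(x)

# The return value carries no statistics, which would explain why the
# grouped result ends up with a column of element type Nothing.
println(result === nothing)
```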

@jonas-schulze
Contributor Author

I should have used combine, but that fails in the same way. 🙈

@bkamins
Member

bkamins commented Nov 30, 2021

describe only prints values. You are probably looking for summarystats. However, even this function is not integrated well with the ecosystem, so for the time being the thing that I assume you want to get (i.e. statistics in consecutive columns) can be achieved via:

julia> combine(groupby(df, [:a, :b]), :duration => (x -> DataFrame([summarystats(x)])) => AsTable)
4×10 DataFrame
 Row │ a      b      mean       min        q25        median     q75        max       nobs   nmiss
     │ Bool   Int64  Float64    Float64    Float64    Float64    Float64    Float64   Int64  Int64
─────┼─────────────────────────────────────────────────────────────────────────────────────────────
   1 │ false      2  -0.5431    -1.27231   -0.86726   -0.462214  -0.178496  0.105222      3      0
   2 │ false      3   0.743789   0.175907   0.485023   0.794139   1.02773   1.26132       3      0
   3 │  true      1   1.3331     1.3331     1.3331     1.3331     1.3331    1.3331        1      0
   4 │  true      2  -0.212513  -0.718772  -0.693783  -0.668795   0.040616  0.750027      3      0
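
If only a few statistics are needed, a function that returns a `NamedTuple` can be combined with `=> AsTable` in the same way, without going through `summarystats`. The following is a sketch under that assumption; `basic_stats` is a hypothetical helper (not part of DataFrames or StatsBase) and the data are made up for illustration:

```julia
using DataFrames, Statistics

# Hypothetical helper: statistics of one group as a NamedTuple.
# Each field becomes one output column when the target is AsTable.
basic_stats(x) = (mean = mean(x), min = minimum(x), median = median(x),
                  max = maximum(x), nobs = length(x))

# Illustrative data with two groups: (true, 1) and (false, 2).
df = DataFrame(a = [true, true, false, false, false],
               b = [1, 1, 2, 2, 2],
               duration = [1.0, 3.0, 2.0, 4.0, 6.0])

# One row per group, one column per field of the returned NamedTuple.
res = combine(groupby(df, [:a, :b]), :duration => basic_stats => AsTable)
println(res)
```

The column names come from the field names of the `NamedTuple`, so a prefix such as `duration_` would have to be written into those field names by hand.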

@bkamins bkamins closed this as completed Nov 30, 2021
@bkamins
Member

bkamins commented Nov 30, 2021

Of course, if you want to keep the stats in a single column without expanding them, you can do:

julia> combine(groupby(df, [:a, :b]), :duration => summarystats)
4×3 DataFrame
 Row │ a      b      duration_summarystats
     │ Bool   Int64  SummaryStats…
─────┼─────────────────────────────────────────────────
   1 │ false      2  Summary Stats:\nLength:         …
   2 │ false      3  Summary Stats:\nLength:         …
   3 │  true      1  Summary Stats:\nLength:         …
   4 │  true      2  Summary Stats:\nLength:         …
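
Individual statistics can still be pulled out of such a column afterwards. A minimal sketch, assuming the `SummaryStats` objects expose a `mean` field (as the column names in the expanded output earlier in this thread suggest); the data and the column name `duration_mean` are made up for illustration:

```julia
using DataFrames, StatsBase

# Illustrative data with two groups: (true, 1) and (false, 2).
df = DataFrame(a = [true, true, false, false, false],
               b = [1, 1, 2, 2, 2],
               duration = [1.0, 3.0, 2.0, 4.0, 6.0])

# One SummaryStats object per group, stored in a single column.
res = combine(groupby(df, [:a, :b]), :duration => summarystats)

# Read one field from each stored object, row by row.
res = transform(res, :duration_summarystats => ByRow(s -> s.mean) => :duration_mean)
println(select(res, :a, :b, :duration_mean))
```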

@jonas-schulze
Contributor Author

Awesome, and thanks for the quick response!

I suppose this is not feasible, but maybe we could deprecate describe in favor of summarystats for DataFrames? That way the same function would work for grouped and non-grouped data.

@bkamins
Member

bkamins commented Nov 30, 2021

Is #1443 what you would want?

I do not think we will deprecate describe in favor of summarystats. I would rather change what describe does in StatsBase.jl, as currently it is not very useful I think (we tend to avoid functions that just do printing without returning a meaningful value in other places). @nalimilan - what do you think?

@jonas-schulze
Contributor Author

jonas-schulze commented Nov 30, 2021

I'm not sure that I understand the direction #1443 ended up aiming for. Interesting that map(describe, gd) is not allowed anymore while combine(describe, gd) and combine(gd, describe) are. I was actually expecting the result to look like (pseudo code ahead)

julia> combine(groupby(df, [:a, :b]), :duration => describe => AsTable)
4×10 DataFrame
 Row │ a      b      duration_mean  duration_min  duration_q25 ...
     │ Bool   Int64  Float64        Float64       Float64 ...
─────┼───────────────────────────────────────────────────────────
   1 │ false      2  -0.5431        -1.27231      -0.86726   -0.462214 ...
   2 │ false      3   0.743789       0.175907      0.485023 ...
   3 │  true      1   1.3331         1.3331        1.3331 ...
   4 │  true      2  -0.212513      -0.718772     -0.693783 ...

similar to :duration => maximum producing the column duration_maximum.

I'm just now noticing that I was also expecting :duration => [minimum, maximum] => AsTable to work and to produce two columns with the respective suffixes, which it does not. In my head (which might have a completely wrong idea of how combine works or should work), describe is the same as the collection of functions computing the individual stats.
Similarly, I was expecting :duration => extrema => [:d_min, :d_max] to work, which it does not (but :duration => Ref∘extrema => [:d_min, :d_max] does).

Edit: I guess I'm confusing these concepts with the broadcasted :duration .=> [minimum, maximum], which behaves as I would expect.
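
The broadcasted form mentioned in the edit can be sketched as follows. Broadcasting `.=>` over the vector of functions builds one `column => function` pair per function, and `combine` accepts the resulting vector of pairs. The data and the variable name `ops` are made up for illustration:

```julia
using DataFrames

# Illustrative data with two groups: (true, 1) and (false, 2).
df = DataFrame(a = [true, true, false, false, false],
               b = [1, 1, 2, 2, 2],
               duration = [1.0, 3.0, 2.0, 4.0, 6.0])
gd = groupby(df, [:a, :b])

# Expands to [:duration => minimum, :duration => maximum].
ops = :duration .=> [minimum, maximum]

# Each pair yields its own column, named with the function as suffix.
res = combine(gd, ops)
println(res)
```

Writing the pairs out by hand, as in `combine(gd, :duration => minimum, :duration => maximum)`, should give the same table.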

@nalimilan
Member

> I do not think we will deprecate describe in favor of summarystats. I would rather change what describe does in StatsBase.jl, as currently it is not very useful I think (we tend to avoid functions that just do printing without returning a meaningful value in other places). @nalimilan - what do you think?

Yes that's what I'd like to do. We should check whether that can be done in a completely non-breaking way or at least with minor breakage.
