Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make describe work for a grouped dataframe #1443

Open
pdeffebach opened this issue Jul 8, 2018 · 25 comments
Open

Make describe work for a grouped dataframe #1443

pdeffebach opened this issue Jul 8, 2018 · 25 comments
Labels
non-breaking The proposed change is not breaking

Comments

@pdeffebach
Copy link
Contributor

pdeffebach commented Jul 8, 2018

Reporting this to remind myself to make this PR.

Since GroupedDataFraemes seem to support broadcasting, this should be easy.

@pdeffebach
Copy link
Contributor Author

df = DataFrame(rand(5,5))
df[:gender] = ["m", "m", "m", "f", "f"]
gd = groupby(df, :gender)

I don't get how broadcasting works with grouped data frames. I was under the impression that describe would just work with a broadcast.

println.(gd) # works as expected
describe.(gd)
> ERROR: MethodError: no method matching describe(::DataFrames.GroupedDataFrame)

but describe(gd[1]) works, since gd[1] is a subdataframe.

@pdeffebach
Copy link
Contributor Author

Using by is a bit weird too

> by(df, :gender, describe)
│ Row │ gender │ variable │ mean     │ min       │ median    │ max      │ nunique │ nmissing │ eltype  │
├─────┼────────┼──────────┼──────────┼───────────┼───────────┼──────────┼─────────┼──────────┼─────────┤
│ 1   │ m      │ x1       │ 0.334892 │ 0.0531118 │ 0.0600073 │ 0.891557 │ nothing │ nothing  │ Float64 │
│ 2   │ m      │ x2       │ 0.484557 │ 0.351354  │ 0.424645  │ 0.677673 │ nothing │ nothing  │ Float64 │
│ 3   │ m      │ x3       │ 0.561228 │ 0.0260729 │ 0.775023  │ 0.882588 │ nothing │ nothing  │ Float64 │
│ 4   │ m      │ x4       │ 0.41416  │ 0.0112837 │ 0.605188  │ 0.626008 │ nothing │ nothing  │ Float64 │
│ 5   │ m      │ x5       │ 0.596377 │ 0.454716  │ 0.573004  │ 0.761409 │ nothing │ nothing  │ Float64 │
│ 6   │ m      │ gender   │ nothing  │ m         │ nothing   │ m        │ 1       │ nothing  │ String  │
│ 7   │ f      │ x1       │ 0.328554 │ 0.266763  │ 0.328554  │ 0.390345 │ nothing │ nothing  │ Float64 │
│ 8   │ f      │ x2       │ 0.712208 │ 0.524521  │ 0.712208  │ 0.899895 │ nothing │ nothing  │ Float64 │
│ 9   │ f      │ x3       │ 0.305259 │ 0.253505  │ 0.305259  │ 0.357012 │ nothing │ nothing  │ Float64 │
│ 10  │ f      │ x4       │ 0.614933 │ 0.381118  │ 0.614933  │ 0.848749 │ nothing │ nothing  │ Float64 │
│ 11  │ f      │ x5       │ 0.206904 │ 0.145303  │ 0.206904  │ 0.268505 │ nothing │ nothing  │ Float64 │
│ 12  │ f      │ gender   │ nothing  │ f         │ nothing   │ f        │ 1       │ nothing  │ String  │

@pdeffebach
Copy link
Contributor Author

map(describe, df) works fine, but for some reason it does some extra printing of each subdataframe. not sure why it does this.

@nalimilan
Copy link
Member

That's because map returns a GroupApplied object, which can then be combined back to a single data frame. The result of by(df, :gender, describe) is indeed a bit weird, because the grouping variable is passed to the function. I think there have been repeated complaints about this, maybe it should be optional. Not sure which default is best, we should compare with dplyr and Pandas.

@pdeffebach
Copy link
Contributor Author

R

The only solution I could find for R is to use a do command. Maybe this is better with purrr but I've actually never taken the time to learn purrr.

Simply calling summarize on a grouped dataframe doesn't alter behavior.

t = df %>% group_by(Species) %>% do(summary = summary(.))
> Source: local data frame [3 x 2]
Groups: <by row>

# A tibble: 3 x 2
  Species    summary    
* <fct>      <list>     
1 setosa     <S3: table>
2 versicolor <S3: table>
3 virginica  <S3: table>

t$summary
> [[1]]
  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width          Species  
 Min.   :4.300   Min.   :2.300   Min.   :1.000   Min.   :0.100   setosa    :50  
 1st Qu.:4.800   1st Qu.:3.200   1st Qu.:1.400   1st Qu.:0.200   versicolor: 0  
 Median :5.000   Median :3.400   Median :1.500   Median :0.200   virginica : 0  
 Mean   :5.006   Mean   :3.428   Mean   :1.462   Mean   :0.246                  
 3rd Qu.:5.200   3rd Qu.:3.675   3rd Qu.:1.575   3rd Qu.:0.300                  
 Max.   :5.800   Max.   :4.400   Max.   :1.900   Max.   :0.600                  

[[2]]
  Sepal.Length    Sepal.Width     Petal.Length   Petal.Width          Species  
 Min.   :4.900   Min.   :2.000   Min.   :3.00   Min.   :1.000   setosa    : 0  
 1st Qu.:5.600   1st Qu.:2.525   1st Qu.:4.00   1st Qu.:1.200   versicolor:50  
 Median :5.900   Median :2.800   Median :4.35   Median :1.300   virginica : 0  
 Mean   :5.936   Mean   :2.770   Mean   :4.26   Mean   :1.326                  
 3rd Qu.:6.300   3rd Qu.:3.000   3rd Qu.:4.60   3rd Qu.:1.500                  
 Max.   :7.000   Max.   :3.400   Max.   :5.10   Max.   :1.800                  

[[3]]
  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width          Species  
 Min.   :4.900   Min.   :2.200   Min.   :4.500   Min.   :1.400   setosa    : 0  
 1st Qu.:6.225   1st Qu.:2.800   1st Qu.:5.100   1st Qu.:1.800   versicolor: 0  
 Median :6.500   Median :3.000   Median :5.550   Median :2.000   virginica :50  
 Mean   :6.588   Mean   :2.974   Mean   :5.552   Mean   :2.026                  
 3rd Qu.:6.900   3rd Qu.:3.175   3rd Qu.:5.875   3rd Qu.:2.300                  
 Max.   :7.900   Max.   :3.800   Max.   :6.900   Max.   :2.500      

Man R loves lists.

Pandas

Pandas is weird. iris.describe() looks nice:

>>> iris.describe()
       Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
count    150.000000   150.000000    150.000000   150.000000
mean       5.843333     3.057333      3.758000     1.199333
std        0.828066     0.435866      1.765298     0.762238
min        4.300000     2.000000      1.000000     0.100000
25%        5.100000     2.800000      1.600000     0.300000
50%        5.800000     3.000000      4.350000     1.300000
75%        6.400000     3.300000      5.100000     1.800000
max        7.900000     4.400000      6.900000     2.500000

But when you group it, the output becomes unreadable. It prints each variable horizontally.

 t = iris.groupby('Species')
>>> t.describe()
           Petal.Length                                              Petal.Width        ...  Sepal.Length      Sepal.Width
                  count   mean       std  min  25%   50%    75%  max       count   mean ...           75%  max       count   mean       std  min    25%  50%    75%  max
Species                                                                                 ...
setosa             50.0  1.462  0.173664  1.0  1.4  1.50  1.575  1.9        50.0  0.246 ...           5.2  5.8        50.0  3.428  0.379064  2.3  3.200  3.4  3.675  4.4
versicolor         50.0  4.260  0.469911  3.0  4.0  4.35  4.600  5.1        50.0  1.326 ...           6.3  7.0        50.0  2.770  0.313798  2.0  2.525  2.8  3.000  3.4
virginica          50.0  5.552  0.551895  4.5  5.1  5.55  5.875  6.9        50.0  2.026 ...           6.9  7.9        50.0  2.974  0.322497  2.2  2.800  3.0  3.175  3.8

Thoughts

Returning a grouped dataframe:

│ Row │ gender |variable │ mean     │
├─────┼────────|─────────┼──────────┤
│ 1   │ M      |x1       │ 0.456352 │
│ 2   │ M      |x2       │ 0.441564 │
│ 3   │ M      |x3       │ 0.448726 │
│ 4   │ M      |x4       │ 0.456411 │
│ 5   │ M      |x5       │ 0.522709 │
│ 6   │ M      |sex      │     ?    │

│ Row │ gender |variable │ mean     │
├─────┼────────|─────────┼──────────┤
│ 1   │ F      |x1       │ 0.456352 │
│ 2   │ F      |x2       │ 0.441564 │
│ 3   │ F      |x3       │ 0.448726 │
│ 4   │ F      |x4       │ 0.456411 │
│ 5   │ F      |x5       │ 0.522709 │
│ 6   │ F      |sex      │     ?    │

This is weird because the :gender column's value isn't a property of the column :x1 so grouping the summary statistics doesn't really make sense.

We could just print 3 DataFrames with the annotation where :gender == "M" or something. But then we have different outputs for describe since describe returns a DataFrame while this new behavior would return Nothing.

@nalimilan
Copy link
Member

Thanks for checking! dplyr is indeed a bit different from us, since the grouping information can be present without necessarily changing the behavior of many operations. Pandas' behavior makes sense, but the printing is terrible.

How about returning a GroupedDataFrame with the same columns as the standard describe, plus the grouping column? Basically something similar to what by(df, :gender, describe) currently returns, but without the weird additional row for the grouping column.

@pdeffebach
Copy link
Contributor Author

This is a good idea. And something I will work to implement.

I think that there are some changes I want to make before this, since grouped operations can be expensive, see my comment in #1256.

Without trying to re-optimize grouped operations just for this, it's worth changing summarize so that it only does the calculations it needs to, rather than calculating everything and just the statistics specified in stats.

@pdeffebach
Copy link
Contributor Author

I'm going to implement this:

function describe(g::GroupedDataFrame; kwargs...)
       des = combine(map(d -> describe(d, kwargs...), g)) # a dataframe
       groupby(des, g.cols) # group it again
end

Though it isn't much use if Grouped DataFrames don't print more information. They currently cutoff pretty quickly. However I am under the impression we will be changing grouped dataframes printing in the future.

This behavior also might change with #1520

@nalimilan
Copy link
Member

Thanks. Actually, I wonder whether we shouldn't make this the standard behavior of map on GroupedDataFrame. At #1520 I suggested replacing combine with map, but it would also make sense to keep them as different operations, with map preserving the grouping information. Then instead of describe(gd), one would call map(describe, gd).

@pdeffebach
Copy link
Contributor Author

I think thats the right move for sure. Two things.

  1. Does 1520 change the awkward printing of each subdataframe?
  2. Would there be any way to print something like
Group 1: :country == "United States" & :sex == "Female"

as the default behavior for any map operation?

@nalimilan
Copy link
Member

  1. No, not at all. But can you develop what you find awkward in particular?
  2. Yes, that sounds like a good idea. Feel free to experiment with that (the text should probably be cut if it's too long to fit on screen).

@nalimilan
Copy link
Member

Regarding this issue, I'll update #1520 to keep combine, and we can come back to it after that.

@pdeffebach
Copy link
Contributor Author

This is the current output of map(describe, g::GroupedDataFrame)

julia> map(describe, g)
GroupApplied{DataFrame}(GroupedDataFrame with 3 groups based on keys: :A, :B
First Group: 2 rows
│ Row │ a     │ b        │
│     │ Int64 │ Float64  │
├─────┼───────┼──────────┤
│ 1   │ 1     │ 0.121894 │
│ 2   │ 1     │ 0.427624 │
⋮
Last Group: 2 rows
│ Row │ a     │ b        │
│     │ Int64 │ Float64  │
├─────┼───────┼──────────┤
│ 1   │ 3     │ 0.280238 │
│ 2   │ 3     │ 0.746962 │, DataFrame[2×8 DataFrame. Omitted printing of 2 columns
│ Row │ variable │ mean     │ min      │ median   │ max      │ nunique │
│     │ Symbol   │ Float64  │ Real     │ Float64  │ Real     │ Nothing │
├─────┼──────────┼──────────┼──────────┼──────────┼──────────┼─────────┤
│ 1   │ a        │ 1.0      │ 1        │ 1.0      │ 1        │         │
│ 2   │ b        │ 0.274759 │ 0.121894 │ 0.274759 │ 0.427624 │         │, 2×8 DataFrame. Omitted printing of 2 columns
│ Row │ variable │ mean     │ min      │ median   │ max      │ nunique │
│     │ Symbol   │ Float64  │ Real     │ Float64  │ Real     │ Nothing │
├─────┼──────────┼──────────┼──────────┼──────────┼──────────┼─────────┤
│ 1   │ a        │ 2.0      │ 2        │ 2.0      │ 2        │         │
│ 2   │ b        │ 0.401019 │ 0.199386 │ 0.401019 │ 0.602652 │         │, 2×8 DataFrame. Omitted printing of 2 columns
│ Row │ variable │ mean    │ min      │ median  │ max      │ nunique │
│     │ Symbol   │ Float64 │ Real     │ Float64 │ Real     │ Nothing │
├─────┼──────────┼─────────┼──────────┼─────────┼──────────┼─────────┤
│ 1   │ a        │ 3.0     │ 3        │ 3.0     │ 3        │         │
│ 2   │ b        │ 0.5136  │ 0.280238 │ 0.5136  │ 0.746962 │         │])

I think it could omit the printing of g itself. and add some line breaks in between output (probably linebreaks and some info about the group.)

I think that printing info about the group might be tough though. The only info that a GroupedDataFrame stores is a hash and row numbers. So it would need some exploratory work done.

@nalimilan
Copy link
Member

I think that printing info about the group might be tough though. The only info that a GroupedDataFrame stores is a hash and row numbers. So it would need some exploratory work done.

Yes, but you know that all values for grouping columns are by definition equal for a given group, so you can just take the first one.

The result of map is indeed nonsense, but that's a problem with GroupApplied, which should be removed.

@pdeffebach
Copy link
Contributor Author

Cool. Do you think you could bundle the printing PR with #1520? Or should we wait for that to be merged and see what to do.

fwiw, describe.(g) actually works quite well, other than being memory intensive. The only improvement we would conceivably make would be to have map be an iterator and print out the grouping better.

@nalimilan
Copy link
Member

I'd rather keep #1520 self-contained, printing is totally orthogonal to it.

@pdeffebach
Copy link
Contributor Author

Sounds good when that is merged I will make a PR for printing of map and then we can tell users to use map for describe.

@pdeffebach
Copy link
Contributor Author

After #1520 and others, map doesn't have that awkward printing.

df = DataFrame(a = repeat(1:4, outer = 5), b = randn(20), c = randn(20) .+ 1)
g = groupby(df, :a)
julia> map(describe, g)
GroupedDataFrame{DataFrame} with 4 groups based on key: :a
First Group: 3 rows
│ Row │ a     │ variable │ mean       │ min       │ median   │ max     │ nunique │ nmissing │ eltype   │
│     │ Int64 │ Symbol   │ Float64    │ Real      │ Float64  │ Real    │ Nothing │ Nothing  │ DataType │
├─────┼───────┼──────────┼────────────┼───────────┼──────────┼─────────┼─────────┼──────────┼──────────┤
│ 1   │ 1     │ a        │ 1.0        │ 1         │ 1.0      │ 1       │         │          │ Int64    │
│ 2   │ 1     │ b        │ 0.00779658 │ -1.4079   │ 0.365603 │ 1.78441 │         │          │ Float64  │
│ 3   │ 1     │ c        │ 0.702246   │ -0.164928 │ 0.783903 │ 1.67254 │         │          │ Float64  │
⋮
Last Group: 3 rows
│ Row │ a     │ variable │ mean     │ min       │ median   │ max     │ nunique │ nmissing │ eltype   │
│     │ Int64 │ Symbol   │ Float64  │ Real      │ Float64  │ Real    │ Nothing │ Nothing  │ DataType │
├─────┼───────┼──────────┼──────────┼───────────┼──────────┼─────────┼─────────┼──────────┼──────────┤
│ 1   │ 4     │ a        │ 4.0      │ 4         │ 4.0      │ 4       │         │          │ Int64    │
│ 2   │ 4     │ b        │ 0.309454 │ -0.668419 │ 0.244776 │ 1.49806 │         │          │ Float64  │
│ 3   │ 4     │ c        │ 0.949839 │ 0.286902  │ 1.19766  │ 1.52525 │         │          │ Float64  │

This is pretty good. Having the result be a grouped dataframe feels intuitive. However

  1. It isn't clear which sub-dataframe corresponds to which values in our grouping columns. Though this is a more general problem for grouped dataframes
  2. Default grouped dataframe printing only shows the first and last groups. This isn't very useful for looking at summary statistics.

@nalimilan
Copy link
Member

We could probably print the values of the grouping columns next to the group number. We could also try to print as many groups as possible on screen, but that wouldn't be enough in most cases I guess.

Note that you can also do combine(describe, g) to get a single data frame.

@pdeffebach
Copy link
Contributor Author

#1632 Addresses the general grouped printing function.

showall(map(describe, g)) has the correct behavior. The questions is whether it's too verbose for recommended use.

@pdeffebach
Copy link
Contributor Author

Now that #1632 has been merged, I wonder if we should change the way grouped DataFrames are printed in general.

Should we default to showing 10 rows of however many groups we can? I'm not sure I see the benefit of showing the first and last groups only.

@nalimilan
Copy link
Member

Ideally I think we should print the header only once, using a fixed width across groups for a given column. That would free a lot of space, and then it would make sense to print as many groups as possible.

@pdeffebach
Copy link
Contributor Author

That's what TexTables.jl does here with summarize_by.

@bkamins bkamins mentioned this issue Jan 15, 2019
31 tasks
@bkamins bkamins added the non-breaking The proposed change is not breaking label Feb 12, 2020
@pdeffebach
Copy link
Contributor Author

I've been playing around with describe for a grouped data frame and this is the best one-liner

It's still not pretty, maybe when metadata is added we can have something that controls printing to make this better?

julia> df = DataFrame(a = rand(1:5, 1000), b = rand(1000));

julia> combine(describe, groupby(df, "a"), ungroup = false) |> (t -> show(t, allrows = true, allgroups = true))
GroupedDataFrame with 5 groups based on key: a
Group 1 (2 rows): a = 2
│ Row │ a     │ variable │ mean     │ min       │ median  │ max      │ nunique │ nmissing │ eltype   │
│     │ Int64 │ Symbol   │ Float64  │ Real      │ Float64 │ Real     │ Nothing │ Nothing  │ DataType │
├─────┼───────┼──────────┼──────────┼───────────┼─────────┼──────────┼─────────┼──────────┼──────────┤
│ 1   │ 2     │ a        │ 2.0      │ 2         │ 2.0     │ 2        │         │          │ Int64    │
│ 2   │ 2     │ b        │ 0.511089 │ 0.0032423 │ 0.51914 │ 0.994888 │         │          │ Float64  │
Group 2 (2 rows): a = 3
│ Row │ a     │ variable │ mean     │ min        │ median   │ max      │ nunique │ nmissing │ eltype   │
│     │ Int64 │ Symbol   │ Float64  │ Real       │ Float64  │ Real     │ Nothing │ Nothing  │ DataType │
├─────┼───────┼──────────┼──────────┼────────────┼──────────┼──────────┼─────────┼──────────┼──────────┤
│ 1   │ 3     │ a        │ 3.0      │ 3          │ 3.0      │ 3        │         │          │ Int64    │
│ 2   │ 3     │ b        │ 0.508742 │ 0.00174766 │ 0.541555 │ 0.997982 │         │          │ Float64  │
Group 3 (2 rows): a = 1
│ Row │ a     │ variable │ mean     │ min       │ median   │ max      │ nunique │ nmissing │ eltype   │
│     │ Int64 │ Symbol   │ Float64  │ Real      │ Float64  │ Real     │ Nothing │ Nothing  │ DataType │
├─────┼───────┼──────────┼──────────┼───────────┼──────────┼──────────┼─────────┼──────────┼──────────┤
│ 1   │ 1     │ a        │ 1.0      │ 1         │ 1.0      │ 1        │         │          │ Int64    │
│ 2   │ 1     │ b        │ 0.532574 │ 0.0107663 │ 0.560414 │ 0.994937 │         │          │ Float64  │
Group 4 (2 rows): a = 5
│ Row │ a     │ variable │ mean     │ min        │ median   │ max      │ nunique │ nmissing │ eltype   │
│     │ Int64 │ Symbol   │ Float64  │ Real       │ Float64  │ Real     │ Nothing │ Nothing  │ DataType │
├─────┼───────┼──────────┼──────────┼────────────┼──────────┼──────────┼─────────┼──────────┼──────────┤
│ 1   │ 5     │ a        │ 5.0      │ 5          │ 5.0      │ 5        │         │          │ Int64    │
│ 2   │ 5     │ b        │ 0.496365 │ 4.74044e-5 │ 0.495809 │ 0.990004 │         │          │ Float64  │
Group 5 (2 rows): a = 4
│ Row │ a     │ variable │ mean     │ min       │ median   │ max     │ nunique │ nmissing │ eltype   │
│     │ Int64 │ Symbol   │ Float64  │ Real      │ Float64  │ Real    │ Nothing │ Nothing  │ DataType │
├─────┼───────┼──────────┼──────────┼───────────┼──────────┼─────────┼─────────┼──────────┼──────────┤
│ 1   │ 4     │ a        │ 4.0      │ 4         │ 4.0      │ 4       │         │          │ Int64    │
│ 2   │ 4     │ b        │ 0.506765 │ 0.0109599 │ 0.504049 │ 0.99936 │         │          │ Float64  │

@nalimilan
Copy link
Member

We could improve GroupDataFrame printing to avoid repeating columns and use the same column width for all groups.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
non-breaking The proposed change is not breaking
Projects
None yet
Development

No branches or pull requests

3 participants