
Support type-stable map and combine on GroupedDataFrame #1601

Merged: 16 commits, Dec 3, 2018

Conversation

@nalimilan (Member) commented Nov 17, 2018

This is inspired by the JuliaDB API. When columns are specified via select, the user-provided function is passed column vectors (as SubArrays) rather than a SubDataFrame. This gives fully type-stable code and dramatically improves performance.

This relatively limited change allows taking full advantage of the recent refactoring (#1520). The performance gain is really incredible: about 200× for a simple sum after compilation.

using DataFrames, BenchmarkTools
df = DataFrame(a = repeat(1:40000, outer=[20]),
               b = randn(800000))
gd = groupby(df, :a)

julia> @time combine(d -> sum(d.b), gd);
  4.657727 seconds (7.39 M allocations: 640.692 MiB, 2.92% gc time)

julia> @btime combine(d -> sum(d.b), gd);
  3.129 s (4399077 allocations: 494.68 MiB)

julia> @time combine(:b => sum, gd);
  0.545264 seconds (1.65 M allocations: 80.211 MiB, 3.91% gc time)

julia> @btime combine(:b => sum, gd);
  10.395 ms (399620 allocations: 16.78 MiB)

I still need to write tests, but otherwise the main question is the API. The discussion has already been started on Discourse. To sum up: while the select API illustrated above is reasonable, the slow variant is a bit shorter and more intuitive, in particular because it avoids repeating the column names in two different places. Yet we really don't want people to use the slow variant unless they really need to (for complex operations or when the number of columns isn't known in advance), since it's terribly slow.

The solution I proposed to that problem is to detect whether the user-provided function expects a SubDataFrame or columns, assuming the argument names are those of columns. This would allow writing the example above as simply combine(b -> sum(b), gd). Implementation-wise, it is actually quite easy to extract the argument names and match them to column names. The main problem is that when the function takes a single argument, we don't know whether it's supposed to be a SubDataFrame or a column. So far I haven't found a good solution to that, except writing combine((b, args...) -> sum(b), gd) or something silly like this.
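
For illustration, here is a rough sketch (with invented helper names) of how the argument names of a user-provided function could be matched against column names; Base.method_argnames is an internal helper, so this is only indicative of the idea, not of the actual implementation:

# Hypothetical sketch: match the argument names of f against the columns of df.
function candidate_columns(f, df)
    m = first(methods(f))
    argnames = Base.method_argnames(m)[2:end]  # drop the slot for the function itself
    cols = propertynames(df)
    return [n for n in argnames if n in cols]
end

# candidate_columns(b -> sum(b), df) would return [:b] if df has a column named :b.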

Anyway, we don't necessarily need to decide this right now, as the select approach could be complementary to another more convenient approach: select can be useful to choose programmatically which column(s) to operate on, and a convenience syntax could be based on that for its implementation.

Cc: @bkamins, @piever, @pdeffebach

@nalimilan changed the title to "Support type-stable map and combine on GroupedDataFrame via select argument" on Nov 17, 2018
elseif select isa Symbol || select isa Integer
    incols = (gd.parent[select],)
else
    incols = Tuple(columns(gd.parent[collect(select)]))
Review comment (Member):

This is not optimal in general, but I guess it is good enough: I would assume this is not a costly operation relative to the whole work that has to be done, and select has few elements.

@bkamins (Member) commented Nov 17, 2018

A general quick comment: the proposed solution is good as long as select does not contain a large number of columns (which is conceivable, e.g. in ML applications). But I think it is OK to assume that.
The reason for this is twofold:

  • we use a temporary data frame and later splat the columns;
  • the user must define the function to accept positional arguments (which is cumbersome if there are many of them; e.g. it might be simpler to accept a single NamedTuple of vectors);

I will have a more detailed look at the PR later.

@bkamins (Member) commented Nov 17, 2018

I think that keeping select even in the future is desirable, as it reduces the burden on the compiler with wide data frames when we need only a few columns.

When reading the code I realized that we can do yet another optimization. Probably for another PR, but related. We could check whether a data frame is sorted on the columns on which we group and, if so, use an implementation that takes advantage of that fact. I do not know if it was discussed, but even calling issorted should have relatively small overhead (optionally the user could pass a keyword argument set to true (guaranteed sorted), false (guaranteed not sorted), or nothing (check if sorted)).
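
A minimal sketch of the sorted-groups idea, assuming a hypothetical sorted keyword (the name and signature below are made up, not part of this PR):

using DataFrames

# When the frame is already sorted on the grouping column, each group is a
# contiguous block of rows, so group indices can be stored as cheap ranges.
function group_ranges(df::AbstractDataFrame, col; sorted::Union{Bool,Nothing}=nothing)
    is_sorted = sorted === nothing ? issorted(df, col) : sorted
    is_sorted || error("the unsorted fallback is not shown in this sketch")
    v = df[!, col]
    ranges = UnitRange{Int}[]
    isempty(v) && return ranges
    start = 1
    for i in 2:length(v)
        if !isequal(v[i], v[i-1])
            push!(ranges, start:i-1)
            start = i
        end
    end
    push!(ranges, start:length(v))
    return ranges
end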

@nalimilan (Member Author):

Yes, select will only be efficient for a relatively small number of columns, but that's the typical case AFAICT. And indeed operations could be quite a bit faster when the data is sorted; we could just store the indices as a range in GroupedDataFrame.

Regarding the API, I realize JuliaDB supports two different syntaxes: combine(sum, gd; select=:b) and combine(:b => sum, gd). I actually prefer the latter, since the position of arguments sounds more natural, and it's also more flexible: you can do combine(gd, resvar = :b => sum) to specify the name of the returned column, and combine(gd, resvar = :b => sum, resvar2 = :c => mean) to compute two summary statistics in one run. The equivalent syntax with select is combine((b, c) -> (resvar1=sum(b), resvar2=mean(c)), gd, select=(:b, :c)), which is longer and repetitive. I'm not sure why JuliaDB supports both syntaxes since they completely overlap AFAICT. @piever Any ideas about that?

@bkamins (Member) commented Nov 17, 2018

And indeed operations could be quite faster when data is sorted, we could just store the indices as a range

This is exactly what I meant, and I suppose (I have not tested it) that it even justifies running issorted to check for it.

The Pairs syntax looks nice - I have not seen it before, but it seems to cover all use cases, so the question is what JuliaDB's experience with it has been 😄. Actually it would also solve the ambiguity problem, as we would then clearly see whether we got a Pair or a callable as an argument (even with the keyword-argument approach they could be mixed).

@pdeffebach (Contributor):

Wow, this is truly excellent. I'm excited to look this over more closely when I try to port this logic to DataFramesMeta. Thanks for the work on this!

One note about syntax: I think a lot of the DataFrames syntax is motivated by the fact that you don't ever want to do f(:x), because that's dishonest: :x is a symbol, not a column. This is the reason the API is more or less of the form d -> f(d.x). While this is a noble goal, I am not concerned about the Pairs syntax. It maps closely to the

collapse (mean) `mean_vars' (sum) `sum_vars'

pattern that I like in Stata.

I have in fact recently needed to collapse a large number of columns, and I will test how #1520 and this PR help with that issue, which caused memory overloads even on a Slurm cluster in Stata and caused Julia to hang. It's definitely a use case I have, so I'd be interested to benchmark it.

@piever commented Nov 18, 2018

Really nice job!

Concerning the JuliaDB API, as far as I can tell, both of these approaches work in JuliaDB:

julia> IndexedTables.groupby(:mean => :SepalWidth => mean, t, :Species);

julia> IndexedTables.groupby(mean, t, :Species, select = :SepalWidth);

I think there was an API rewrite at some point, and it was decided that most operations (groupby, join, summarize, unstack) should have a by and a select (with by defaulting to the primary columns and select to the non-primary columns). One motivation for the sugar in JuliaDBMeta is, as you pointed out, that this syntax is a bit clumsy when selecting more than one column.

The "pair" syntax is a consequence of the selection API, so basically instead of passing a function you are passing a combination of a selection and a function.

In terms of syntactic sugar, JuliaDBMeta provides both a @groupby that would detect the select condition and the anonymous function:

@groupby t :Species (m = mean(:SepalLength), s = std(:SepalWidth))

and a @=> macro to generate the pair and the anonymous function from an expression with symbols (it can be used in combination with groupby, select, etc.). Note also that summarize (DataFrames' aggregate) is just syntactic sugar to generate the "pair" expression to feed to groupby: so an interesting idea for DataFrames would be to focus on this pair object and create some utility functions to easily generate commonly useful patterns.

There is a difference, however, between JuliaDB and DataFrames here that I am only realizing now: passing a select with a tuple of symbols in JuliaDB is just an optimization, i.e. both of the following work:

julia> IndexedTables.groupby(v -> (m = mean(column(v, :SepalLength)), s = std(column(v, :SepalWidth))), t, :Species)

julia> IndexedTables.groupby(v -> (m = mean(column(v, :SepalLength)), s = std(column(v, :SepalWidth))), t, :Species, select = (:SepalLength, :SepalWidth))

Whereas from what I understand, here the anonymous function would take as many arguments as you pass with select? I guess this is somewhat inevitable for technical reasons (unlike in IndexedTables, here creating the DataFrame would reintroduce the type instability). If that's the case, I find it somewhat awkward that a keyword argument should specify the number of arguments of your function, and I would prefer:

by((:SepalLength, :SepalWidth) => (x, y) -> (m = mean(x), s = std(y)), t, :Species)

in combination with a macro such that @=> (m = mean(:SepalLength), y = std(:SepalWidth)) expands to the right thing, as well as a @by that would call @=> on the first argument and then by.

A final remark is that the pair selection syntax is probably more general, in that it could also be used on the variable you are grouping by, to allow grouping by something that is not a column, i.e.

by(:SepalLength => _ -> (m = mean(_),), t, :Species => _ -> _ == :viridis)

@bkamins (Member) commented Nov 18, 2018

Just to expand on what the function should accept: after thinking about it, I have come to prefer a NamedTuple of vectors over positional arguments even more. The reason is that you can easily alter this NamedTuple, e.g. with merge, and then return a new NamedTuple (e.g. with an additional column or a replaced column) to get an efficient return value to work with. And inside the function a NamedTuple is easy enough to work with, I think. We would have to weigh the performance cost of creating a NamedTuple against splatting a Tuple (which would be relevant if we have many small groups).
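
A small illustration of that point (the column names here are made up): a NamedTuple of vectors is easy to extend or alter with merge before returning it.

cols = (a = [1, 2, 3], b = [4.0, 5.0, 6.0])
withsum = merge(cols, (total = cols.a .+ cols.b,))  # add a derived column
replaced = merge(cols, (b = abs.(cols.b),))         # replace an existing column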

@piever commented Nov 18, 2018

Just to expand on what the function should accept. After thinking of it I get to prefer NamedTuple of vectors over positional arguments even more.

I'm also a bit sceptical of the "many arguments" approach. The NamedTuple of vectors is actually nice enough in that it can easily be converted to a DataFrame by the user; one can still extract columns with . syntax, or rows using Tables.rows(t).
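
A tiny illustration of those access patterns (made-up data; assumes Tables.jl is available):

using DataFrames, Tables

nt = (x = [1, 2, 3], y = [4.0, 5.0, 6.0])
df = DataFrame(nt)                # convert the NamedTuple of vectors to a DataFrame
s = sum(nt.x)                     # column access with dot syntax
rows = collect(Tables.rows(nt))   # row iteration via Tables.jl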

In an ideal world I imagine one could have two versions of DataFrame, one fully typed (as IndexedTable is now) and one untyped but with a consistent API, and the "inner function" of a by would use the typed version. But this is not feasible right now, so a NamedTuple of vectors is probably a good compromise.

@bkamins (Member) commented Nov 18, 2018

In an ideal world I imagine one could have two versions of DataFrame

Actually this is what I want to put on the table very soon (but first I have to get getindex and setindex! right, as it is crucial to have a stable API before adding this type). My current thinking is that a minimal such structure can just be a wrapper around a NamedTuple of vectors (in this wrapper we could also keep e.g. metadata, which is another big thing that is in the works but has not been merged yet).

@nalimilan (Member Author):

OK. Let's use the pair approach rather than select then. But I must say I'm not a fan of the :newcol => :oldcol => fun syntax, as it uses the same operator for two very different meanings. This API was designed before named tuples existed, but we can now do (newcol = :oldcol => fun,). Or we could use keyword arguments (which are now fast) for that, and put these pairs as the last arguments. That way one could easily write by(df, :key, mean1 = :col1 => mean, sum2 = :col2 => sum), or just by(df, :key, :col1 => mean, :col2 => sum) to get the default names (which would probably be something like :mean_col1 and :sum_col2).

Regarding what the function should be passed, I'm not sure NamedTuple is the best approach:

  • It's not very convenient when a single column is selected, which is the most common case: by(df, :key, :col1 => mean) becomes by(df, :key, :col1 => x -> mean(x.col1)), and we lose the ability to generate a nice column name from the function name in simple cases like this.
  • It forces recompiling _combine! for each new variable name, which is a waste, since calling the same function on different columns of the same type should be frequent (see the small illustration after this list).
  • Finally, it's not too hard to create a named tuple manually if needed, and if that's not convenient enough we could add select and pass a NamedTuple rather than a SubDataFrame to the user-provided function. This would be more consistent with what JuliaDB does.
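
A small illustration of the recompilation concern: the column names are part of the NamedTuple type, so identical element types under different names are distinct types, each forcing a fresh specialization of whatever function receives them.

nt_a = (a = [1.0, 2.0],)
nt_b = (b = [1.0, 2.0],)
typeof(nt_a)                  # the name :a is part of the type
typeof(nt_a) == typeof(nt_b)  # false, even though both wrap a Vector{Float64}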

@bkamins (Member) commented Nov 19, 2018

OK - let us move forward with what you propose. When it is implemented I will check out the PR and test it to see how it goes 😄. Thanks!

@piever commented Nov 19, 2018

Regarding what the function should be passed, I'm not sure NamedTuple is the best approach.

I think I understand the design a bit better now, and I believe there really are two separate concerns. There is the concern of a "typed selection" API, i.e. a way to select a fully typed subset of columns, which, to be consistent with JuliaDB, could be done as follows:

select(x, :SepalLength) # select with a symbol -> get a vector
select(x, (:SepalLength, :SepalWidth)) # select with a Tuple -> get the NamedTuple of columns, or "type stable DataFrame"

And ideally many functions could accept select as a keyword argument to turn the argument into a type-stable object (even filter or map), for example:

map(f, df, select = ...) = map(f, DataFrames.select(df, select))

And then there is the issue of describing functions that operate on a few columns, which is orthogonal to this. So one could have a default:

apply(f, df) = f(df)
used_cols(f, df) = names(df)

and it could have many useful specializations. For example, f could be :SepalLength => mean, or :SepalLength => (mean, std), or :SepalLength => (new_name = mean,), or (s_m = :SepalLength => mean, p_t = :PetalLength => std), and so on (it's important to decide exactly what is supported). In this scenario it's true that it's probably handier to use the multi-argument approach.

Then, for each of these overloads of apply(f, df), one could potentially also overload used_cols(f, df) if it's clear from f which columns it will use, in which case by can be optimized.
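
For instance, a possible specialization along those lines (apply and used_cols are the hypothetical helpers sketched above, not an existing API):

apply(p::Pair{Symbol,<:Base.Callable}, df) = last(p)(df[!, first(p)])    # :col => f applies f to that column
used_cols(p::Pair{Symbol,<:Base.Callable}, df) = [first(p)]              # only that column is needed

# apply(:SepalLength => mean, df) would compute the mean of that single column, and
# used_cols(:SepalLength => mean, df) would report [:SepalLength], letting by restrict itself to it.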

EDIT: it has just occurred to me that if we decide to pass a NamedTuple (or something that can iterate columns) instead of many positional arguments in the pair syntax, one can go back to using positional arguments via tuple destructuring, say:

by((:col1, :col2) => ((x,y),) -> mean(x) + mean(y), ...)

so actually this option is slightly more general than the other.

@pdeffebach (Contributor) commented Nov 19, 2018

way one could write easily by(df, :key, mean1 = :col1 => mean, sum2 = :col2 => sum), or just by(df, :key, :col1 => mean, :col2 => sum) to get the default names (which would probably be something like :mean_col1 and :sum_col2).

I'm not a fan of this because mean1 isn't a symbol, and thus it's difficult to put these operations in functions. DataFrames has done well so far to avoid non-standard evaluation, and it'd be nice to keep it that way.

I also want to note that I find it very common to perform the same action on a variety of columns, something like

by(df, :id, [:x1, :x2, :x3] => mean, [:x1, :x2, :x4] => sum)

Would be ideal. (though this doesn't address the issue of column names)

@bkamins (Member) commented Nov 19, 2018

A possibility would be to mimic the behavior of getindex on a data frame here. If we write e.g. :a => mean, we pass a vector to mean (as with df.a getindex; the LHS could be a Symbol or an Integer other than Bool); if we write e.g. 1:2 => fun, then we pass a NamedTuple (in general, anything other than a Symbol or an Integer other than Bool would produce a NamedTuple) - similarly to how df[1:2] works. This would be unambiguous, I think.
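
A hedged illustration of that convention (the argument order and selector spellings below are only indicative, not necessarily what gets merged):

combine(:b => mean, gd)                         # Symbol selector: the function receives a vector
combine(1:2 => nt -> sum(nt[1] .+ nt[2]), gd)   # any other selector: the function receives a NamedTuple of columns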

@nalimilan how do you see this?

@bkamins (Member) commented Nov 19, 2018

Now I see that my proposal does not solve the recompilation issue introduced by NamedTuple that you raised. But I think (of course benchmarks would be welcome here) that:

  • the majority of simple cases would be handled by one-argument calls (that pass a vector);
  • if we pass multiple columns, then most probably we would have to recompile the function we call anyway (as we probably should not assume that it would be precompiled for several positional arguments).

But I am not 100% sure here. Maybe typical use cases are 1 or 2 columns, in which case using positional arguments makes more sense - NamedTuple probably makes sense if we want to pass several columns (I will keep leaving my thoughts here so that you have the possible arguments on the table to decide what is best 😃).

@bkamins (Member) commented Nov 19, 2018

Also, we should provide users with the ability to know the source column names (in case they want to dynamically generate column names in the return value); without a named tuple (or some equivalent, e.g. splatted keyword arguments), how do you plan to allow for this?

@nalimilan (Member Author):

Clever. Actually it looks like that's what JuliaDB does. I need to think about it, but that's appealing, and it would replace select completely.

I'm not a fan of this because mean1 isn't a symbol, and thus it's difficult to put these operations in functions. DataFrames has done well so far to avoid non-standard evaluation, and it'd be nice to keep it that way.

You can create a keyword argument call from a symbol very easily, e.g. x = :a; f(; x=>1).
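
A minimal demonstration (g is a made-up function) of supplying a keyword whose name is stored in a Symbol:

g(; kwargs...) = keys(kwargs)
colname = :mean1
g(; colname => 1)   # equivalent to g(mean1 = 1); returns (:mean1,)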

I also want to note that I find it very common to perform the same action on a variety of columns, something like

by(df, :id, [:x1, :x2, :x3] => mean, [:x1, :x2, :x4] => sum)

Would be ideal. (though this doesn't address the issue of column names)

Hmm, that interpretation of this syntax would conflict with other proposals and would prevent specifying functions which take several arguments. I think you can do that with aggregate, right? Or you can also define a broadcasting function like [:x1, :x2, :x3] => x -> mean.(x).

@pdeffebach (Contributor):

I think you can do that with aggregate, right?

It wouldn't work if you have different subsets of columns you want to act on. Maybe a column has mean defined for it but not some other function. However, [:x1, :x2, :x3] => x -> mean.(x) seems pretty solid.

You can create a keyword argument call from a symbol very easily, e.g. x = :a; f(; x=>1)

This is really interesting! I will have to look into how this could be done with DataFramesMeta as well.

@nalimilan (Member Author) commented Nov 22, 2018

OK, I've pushed a commit to switch to the Pair approach. It still needs tests.

The pair needs to be passed as the first argument, even though the do-block syntax doesn't apply here. This prevents using keyword arguments to specify the output column names. One could use a named tuple for that (currently not supported), or we could support both syntaxes (which by already does). The alternative is to require the pair to be at the end.

EDIT: actually, this doesn't pass a NamedTuple to the function when several columns are specified; need to adjust that.

@nalimilan (Member Author):

I think this is getting ready for a review. In particular, please have a look at the API.

@nalimilan changed the title from "Support type-stable map and combine on GroupedDataFrame via select argument" to "Support type-stable map and combine on GroupedDataFrame" on Nov 23, 2018
last(p)(NamedTuple{incols}(map(c -> x[c], incols))) :
last(p)(x[incols])
if res isa Union{AbstractDataFrame, NamedTuple, DataFrameRow, AbstractVector, AbstractMatrix}
throw(ArgumentError("a single value result is required when passing a vector or tuple " *
Review comment (Member):

Why do we not allow returning an AbstractVector?

I can agree that we do not allow other forms (and this answers my question above).

Review comment (Member):

I see this as an important problematic case - why don't you want to accept returning an AbstractVector?

Reply (Member Author):

Just to make my life simpler. :-)

But now that everything has settled, I've added that feature.

@bkamins (Member) commented Dec 1, 2018

I have gone through this. A massive amount of work - thank you @nalimilan. The major concern I had is the API inconsistency between by and combine (map could be considered different, but it could also be consistent IMHO).

@bkamins (Member) commented Dec 2, 2018

There are a few unresolved things, but they are not very problematic, I guess. I agree that this PR is complex enough, and I think we should merge it once the small issues are discussed, to move forward.

The things to do in the future would be:

  • decide on the behavior of map (but let us discuss it in a separate thread)
  • add handling of the dropmissing kwarg
  • handle the check for a preserved index column during combine #1460 (if I read the code correctly, it does not address this issue)
  • handle mixing scalars and vectors
  • write test cases for hard scenarios - especially for mixing different things in return values and duplicate column names (I will do it when the PR is merged, and if something fails we can then decide)

In general, this means the question is whether we write somewhere in the documentation that by & friends will still evolve a bit after the coming release. I am not sure what would be the best policy here.

@nalimilan (Member Author):

I agree with the agenda. But I don't think we need to write anywhere that the functions will continue to evolve; that's OK as long as we don't plan on breaking them. We can always mention it in the release post. Also, I think I've covered several corner cases of mixing return values of different types, shapes or lengths.

I've pushed the commit with the fixes.

@bkamins (Member) left a review comment:

OK - let's get rolling. I have left two questions relating to what f can be when it is the first argument to combine and by.

In general we might probably be mildly breaking in the future, as the subject is very complex (as in the case of combining levels of categorical arrays, JuliaData/CategoricalArrays.jl#172).
E.g. we currently support map(cols=>f, gdf), and it should probably be dropped if we decide to introduce transform (if we decide to add all the methods to map, then the question is whether by and combine accept a Pair as the first argument - if yes, then we will not be breaking).
