Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow DF() as a selector in select and combine #2220

Closed
bkamins opened this issue May 2, 2020 · 7 comments
Closed

Allow DF() as a selector in select and combine #2220

bkamins opened this issue May 2, 2020 · 7 comments
Labels
feature grouping non-breaking The proposed change is not breaking
Milestone

Comments

@bkamins
Copy link
Member

bkamins commented May 2, 2020

Allow in selector => fun => outcol for selector to be a DF() in which case an AbstractDataFrame would be passed to fun.

Alternatively we could just allow fun => outcol and fun (as we currently do in combine with the fun form).

To be discussed what is better. Using DF() is explicit. Just using fun is shorter and it is not ambiguous so both have pros and cons.

CC @pdeffebach @matthieugomez

@bkamins bkamins added feature grouping non-breaking The proposed change is not breaking labels May 2, 2020
@bkamins bkamins added this to the 1.x milestone May 2, 2020
@matthieugomez
Copy link
Contributor

matthieugomez commented May 2, 2020

I think DF() is safer. Otherwise, in the macro-based DataFrames, syntax such as transform!(df, mean(:x)) may end up being ambiguous. If it ends up being too cumbersome, it can always be removed later.

@nalimilan
Copy link
Member

I'm not totally convinced DF() => fun => outcol is better than just passing fun => outcol. I don't see why you would like to use it in DataFramesMeta (since it detects which columns you want automatically, that's the whole point), so the ambiguity problem isn't clear to me.

@bkamins
Copy link
Member Author

bkamins commented May 18, 2020

So just to give my preference I also prefer fun => outcol.

@pdeffebach
Copy link
Contributor

We don't really want people to use the fun => outcol method, though, right? Its less performant. Is there an argument for not allowing less performant, but easier to type, syntax? Or is that user-hostile?

@bkamins
Copy link
Member Author

bkamins commented May 18, 2020

Its less performant

If it is slower or not depends on the use case, but often it is faster (even if we exclude compulation cost), e.g.:

julia> df = DataFrame(rand(10^6, 2));

julia> gdf = groupby(df, :x1);

julia> @time combine((x -> (n=nrow(x),)), gdf);
  0.394009 seconds (7.20 M allocations: 261.568 MiB, 5.49% gc time)

julia> @time combine(AsTable(:) => (x -> (n=length(x[1]),)), gdf);
  0.960185 seconds (17.69 M allocations: 522.495 MiB, 20.14% gc time)

The more columns you have in your df the more expensive it becomes not to use SubDataFrames:

julia> df = DataFrame(rand(10^6, 100));

julia> gdf = groupby(df, :x1);

julia> @time combine((x -> (n=nrow(x),)), gdf);
  0.404221 seconds (7.20 M allocations: 261.569 MiB)

julia> @time combine(AsTable(:) => (x -> (n=length(x[1]),)), gdf);
 11.342024 seconds (313.67 M allocations: 10.886 GiB, 11.85% gc time)

@bkamins
Copy link
Member Author

bkamins commented Sep 29, 2020

When #2410 will be implemented this should be closed.

@bkamins
Copy link
Member Author

bkamins commented Oct 15, 2022

I would close it. The operation specification syntax is already complex enough. AsTable does almost the same (except that it produces a NamedTuple), so the only limitation is for very wide tables. But then user can just pass a function and then a data frame is passed to it and user can name output columns inside this function.

If you feel this issue should be re-opened please comment.

@bkamins bkamins closed this as completed Oct 15, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature grouping non-breaking The proposed change is not breaking
Projects
None yet
Development

No branches or pull requests

4 participants