[BREAKING] new design of select, transform and combine #2214
Conversation
@matthieugomez, @pdeffebach, @nalimilan - the PR should be good for a quick look (and for testing - maybe some incorrect corner cases will be caught). All functions should be implemented as we discussed. I have not updated tests or documentation, so there might be some holes, but here is how it works:
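For readers following along, here is a small sketch of the `source => function => target` minilanguage this PR introduces for `select`, `transform` and `combine` (the data and column names here are invented for illustration):

```julia
using DataFrames, Statistics

df = DataFrame(g = [1, 1, 2], x = [10, 20, 30])

# select keeps the number of rows of the source data frame
select(df, :x => (v -> v .+ 1) => :x1)

# transform is like select but also keeps all original columns;
# a scalar result is broadcast and gets an auto-generated name (:x_maximum)
transform(df, :x => maximum)

# combine allows aggregations that change the number of rows
combine(groupby(df, :g), :x => mean => :xmean)
```

The same pair syntax works uniformly on a `DataFrame` and a `GroupedDataFrame`, which is the core of the design being discussed.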
I’ve tried it a little bit and it feels very good from a user perspective!
@nalimilan - there is one thing on top of what was my idea in #2206 (and something you said you wanted to allow anyway). We should treat
I have finished the development and tests of the new functionality. Documentation is half-finished (i.e. I tried to update it everywhere, but for sure it is not perfect). Maybe we should even agree to make a separate PR for polishing it later (but if you have good suggestions now, of course I am open to them).
That looks great! I will do a pull request with a dplyr/Stata comparison, following the tables we did in the issue.
OK, I was surprised that the gain was so small. Now I have fixed the performance (barrier functions came in handy; @nalimilan - as a side note the old
(and you see that the same with
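For context, the function-barrier technique mentioned above can be illustrated as follows (this is a generic sketch, not the actual DataFrames.jl code): when a container holds abstractly typed columns, routing each column through a small inner function lets the compiler specialize the hot loop on the concrete column type.

```julia
# Without a barrier: col has no concrete type inside the loop,
# so every `s += x` goes through dynamic dispatch.
function sum_slow(cols::Vector{Any})
    s = 0.0
    for col in cols
        for x in col
            s += x
        end
    end
    return s
end

# The barrier: a separate function call specializes on typeof(col),
# so the inner summation compiles to a tight, type-stable loop.
inner_sum(col) = sum(col)

function sum_fast(cols::Vector{Any})
    s = 0.0
    for col in cols
        s += inner_sum(col)
    end
    return s
end
```

Both functions return the same result; the barrier version only differs in where the type-unstable-to-stable transition happens.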
Co-authored-by: pdeffebach <23196228+pdeffebach@users.noreply.github.com>
`ungroup=true` (the default) would mean that it returns a `DataFrame`.
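A sketch of how the `ungroup` keyword being discussed behaves on a `GroupedDataFrame` (assuming the `select`-on-`GroupedDataFrame` form this PR adds; the data is invented):

```julia
using DataFrames

gd = groupby(DataFrame(g = [1, 1, 2], x = [1, 2, 3]), :g)

# ungroup = true (the default): the combined result is a plain DataFrame
select(gd, :x => cumsum; ungroup = true)

# ungroup = false: the result keeps the grouping structure
select(gd, :x => cumsum; ungroup = false)
```

This gives the caller a choice between getting back a flat table or continuing to work group-wise without re-running `groupby`.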
OK - I have pushed a commit changing
Only coverage fails.
`ungroup` sounds good (that's the term used by dplyr).
Just a few small remarks.
return GroupedDataFrame(newparent, collect(1:length(gd.cols)), groups,
                        collect(1:length(idx)), starts, ends, j, nothing)
                        nothing, nothing, nothing, groups[end], nothing)
Wasn't it more efficient to compute `starts` and `ends` now (since we know rows belonging to the same group are consecutive)?
I was thinking about it. The reason to remove it is the following:
- we would have to compute them anyway (they are not guaranteed to be given, unlike in the branch above)
- if we compute them now then we have to store them (so it takes memory), and they may well never be needed
- the cost of computing indices with `compute_indices` is higher than computing them here, but not that much higher (it does 3 loops instead of the 2 loops that could be used in the optimal case, so the saving is at most ~33%)
For 10^8 groups we have:

julia> x = [1:10^8;];

julia> @time DataFrames.compute_indices(x, 10^8);
  1.419073 seconds (8 allocations: 2.235 GiB, 11.12% gc time)
With optimal timing we would have it at a bit over 1 second, at the cost of allocating these 3 vectors; and we may not have to pay this cost at all if this `GroupedDataFrame` is later used only with an aggregation function like `sum`.
In summary:
- if we have to pay this cost we will be slow anyway (it means we had to use non-aggregating operations, which are more expensive than `compute_indices`)
- if we do not have to pay it, it means we are doing only aggregations, which are much faster than `compute_indices` (and this is probably the most likely use case)
- additionally we save some memory (and this "some" can be a lot if there are a lot of small groups)
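For intuition, the cheap consecutive-groups case referred to above can be done in a single pass. This hypothetical helper (not the actual `compute_indices`) computes `starts` and `ends` assuming rows of the same group are stored consecutively and group ids run from 1 to `ngroups`:

```julia
# Compute the first and last row index of each group, assuming the
# group-id vector is "blocked": equal ids are adjacent.
function group_bounds(groups::Vector{Int}, ngroups::Int)
    starts = Vector{Int}(undef, ngroups)
    ends = Vector{Int}(undef, ngroups)
    for i in eachindex(groups)
        g = groups[i]
        # a group starts where the id differs from the previous row
        if i == 1 || groups[i - 1] != g
            starts[g] = i
        end
        # and ends where the id differs from the next row
        if i == length(groups) || groups[i + 1] != g
            ends[g] = i
        end
    end
    return starts, ends
end

group_bounds([1, 1, 2, 2, 2, 3], 3)
```

This is the "2 loops" baseline being compared against: `compute_indices` does one more pass because it cannot assume the consecutive layout.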
Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
Thank you for the comments. The PR is updated. The decision to store or not to store
Transform a [`GroupedDataFrame`](@ref) into a `DataFrame`.
Apply operations to each group in a [`GroupedDataFrame`](@ref) and return
the combined result as a `DataFrame`.
I guess `ungroup` should be mentioned here directly, or the type of the result shouldn't be mentioned here?
Ah - OK. I have updated both `combine` and `select`.
OK - so I am merging this, as we know CI shall pass, and I will open a new PR just bumping the version; there we will check that everything passes on a clean PR. @nalimilan - will you now make an update to CSV.jl that fixes CategoricalArrays.jl compatibility, so that we can synchronize the releases? Thank you!
That’s awesome! Thanks a lot for this!
Fixes #2206.
This is still a draft. I have only implemented the `AbstractDataFrame` part now (`GroupedDataFrame` is pending), and have not updated the documentation and tests. But you can have a look at the code to see what changes we essentially agreed to do.