-
Notifications
You must be signed in to change notification settings - Fork 11
Conversation
Looks nice. Can you explain (here and in the commit message) what the change consists in? Why would bypassing |
I'm not sure how much of the efficiency comes from bypassing I don't want to touch
|
Bypassing map and combine for aggregate speeds up code and reduces memory allocations by ~1-2 orders of magnitude. By still uses map and combine to allow more complex anonymous functions that can return DataTables and do blocks. 2 accessory functions with Julia 0.4 handling branches were simplified and inlined within aggregate. Tests were added to extend aggregate test coverage.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've dived into the code (I wasn't very familiar with this part), and I now see it makes sense to optimize aggregate
since the passed function is supposed to return values or vectors, but not DataTable
objects, so it's a bit silly to build them and discard them afterwards.
Though the example you give in the description doesn't sound really appropriate since it computes the length for each column, which by definition is identical for all of them (use FreqTables for this kind of thing). mean
would be a better benchmark.
@@ -355,16 +355,23 @@ dt |> groupby(:a) |> [sum, x->mean(dropnull(x))] # equivalent | |||
""" | |||
aggregate(d::AbstractDataTable, fs::Function; sort::Bool=false) = aggregate(d, [fs], sort=sort) | |||
function aggregate{T<:Function}(d::AbstractDataTable, fs::Vector{T}; sort::Bool=false) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just noting here before we forget it that sort
needs documenting. By the way, how did you choose which columns to sort on?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've been meaning to write a docstring for sort! I always try and use the by
keyword and get reminded by the error message to use (the undocumented) cols
julia> sort(DataTable(x = rand(10), y = rand(10)), by = :y)
ERROR: ArgumentError: 'by' must be a Function or a vector of Functions. Perhaps you wanted 'cols'.
For sorting, these changes should be keeping the current behavior. When a user specifies sort = true
, groupby
will sort on the grouped columns and then aggregate
only bothers to sort the remaining columns that weren't sorted during the groupby.
Current code:
headers = _makeheaders(fs, _setdiff(_names(gd), gd.cols)) # only makes headers for new columns
...
sort && sort!(res, cols=headers) # only sorts new columns
I would feel a little more comfortable if we just sorted on all columns at the end rather than two stages of sorting, although I haven't come up with a test case to show that the two-stage sorting might result in the 2nd sort in aggregate improperly re-ordering the 1st sort in groupby
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When a user specifies sort = true, groupby will sort on the grouped columns and then aggregate only bothers to sort the remaining columns that weren't sorted during the groupby.
I must be missing something. If aggregate
does not sort again on the grouped columns (which is a no-op when they are already sorted), the result won't be sorted on these columns.
Anyway if that works it's fine to keep the existing strategy (though if it can be improved in another PR, why not).
res = gd.parent[gd.idx[gd.starts], gd.cols] | ||
cols = setdiff(names(gd.parent), gd.cols) | ||
for f in fs | ||
for c in cols |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can use the , c in cols
syntax to merge this line with the previous one.
cols = setdiff(names(gd.parent), gd.cols) | ||
for f in fs | ||
for c in cols | ||
res[Symbol(c, "_", f)] = [f(g[c]) for g in gd] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't seem to support the case where f
returns a vector. Though that was already the case of the previous implementation (which stored the returned vector in a cell). I guess we should just remove the mention of vectors from the docstring.
|
||
@test aggregate(dt, length) == DataTable(group_length = 12, x_length = 12) | ||
anonfuncdt = aggregate(dt, x -> length(x)) | ||
@test anonfuncdt[1, 1] == 12 && anonfuncdt[1, 2] == 12 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not test the whole contents of the result? Cannot hurt.
anonfuncdt = aggregate(dt, x -> length(x)) | ||
@test anonfuncdt[1, 1] == 12 && anonfuncdt[1, 2] == 12 | ||
|
||
dt = DataTable(year = repeat(1:4, inner = 12), month = repeat(1:12, outer = 4), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Double space.
@@ -175,19 +175,5 @@ function _setdiff{T}(a::AbstractVector{T}, b::T) | |||
diff | |||
end | |||
|
|||
# Gets the name of a function. Used in groupedatatable/grouping.jl |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good to see this go!
Bump! |
@cjprybol I suppose this should be backported to DataFrames? |
Yes, thanks for the reminder! I should also put these benchmarks into the beginnings of a PkgBenchmark suite. I think there are benchmarks we have scattered around other PRs in DataTables too. I couldn't figure out how to use PkgBenchmark the last time I tried, but this comment covers most of the process |
Cool. Note that PkgBenchmark has been largely refactored on git master, so probably better use that (see this post). |
Example code
^ The above loads this file into a datatable that looks like this
I wanted to ask how many transcripts there are for each gene, which can be accomplished by grouping by gene and taking the length of each group.
The 199324 transcripts group down to 58219 genes, but the current master code chokes up. I remember letting it run for something like 15 minutes before giving up. This time I got a segfault when trying to group on the first ~50% of the datatable.
Results
I was satisfied with that improvement for my particular case, although I'd be happy to run this on other cases if anyone would like to propose other benchmarks. Consider this an RFC until I write tests.
this file =
ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_26/gencode.v26.transcripts.fa.gz
(for some reason markdown hyperlinks aren't working with that url, maybe because it's ftp?)