This repository has been archived by the owner on May 5, 2019. It is now read-only.

WIP: Modify aggregate for efficiency #65

Closed
wants to merge 1 commit into from

Conversation

cjprybol
Contributor

@cjprybol commented May 19, 2017

Example code

using DataTables
using CSV
using Bio.Seq
using BenchmarkTools

# Collect transcript IDs, gene IDs, and sequences from the GENCODE FASTA.
transcriptids = Vector{String}()
geneids = Vector{String}()
sequences = Vector{DNASequence}()
for record in open(FASTAReader, joinpath("Desktop", "gencode.v26.transcripts.fa"))
    # GENCODE headers are "|"-delimited; fields 1 and 2 are the
    # transcript ID and the gene ID.
    transcriptid, geneid = String.(split(record.name, "|")[[1, 2]])
    push!(transcriptids, transcriptid)
    push!(geneids, geneid)
    push!(sequences, record.seq)
end

dt = DataTable(transcriptid = transcriptids, geneid = geneids, sequence = sequences)

The code above loads this file into a DataTable that looks like this:

julia> head(dt)
6×3 DataTables.DataTable
│ Row │ transcriptid      │ geneid            │ sequence                                                                          │
├─────┼───────────────────┼───────────────────┼───────────────────────────────────────────────────────────────────────────────────┤
│ 1   │ ENST00000456328.2 │ ENSG00000223972.5 │ GTTAACTTGCCGTCAGCCTTTTCTTTGACCTCTTCTTTCGCTAGAGATCCTTTATTAAAAGCACACTGTTGGTTTCTG │
│ 2   │ ENST00000450305.2 │ ENSG00000223972.5 │ GTGTCTGACTTCCAGCAACTGCTGGCCTGTGCCAGGGTGGGCATTAGTGCCCGTTGGAGAAAACAGGGGAATCCCGAA │
│ 3   │ ENST00000488147.1 │ ENSG00000227232.5 │ ATGGGAGCCGTGTGCACGTCGGGAGCTCGGAGTGAGCGCAGAAAACGGCACACCAATCAATAAAGAACTGAGCAGAAA │
│ 4   │ ENST00000619216.1 │ ENSG00000278267.1 │ TGTGGGAGAGGAACATGGGCTCAGGACAGCGGGTGTCAGCTTGCCTGACCCCCATGTCGCCTCTGTAG              │
│ 5   │ ENST00000473358.1 │ ENSG00000243485.5 │ GTGCACACGGCTCCCATGCGTTGTCTTCCGAGCGTCAGGGGACTTCCAAGCCTCCAGAACTGTGAGGGATAAATGTAT │
│ 6   │ ENST00000469289.1 │ ENSG00000243485.5 │ TCATCAGTCCAAAGTCCAGCAGTTGTCCCTCCTGGAATCCTCCAGAACTGTGAGGGATAAATGTATGATTTTAAAGTC │

I wanted to ask how many transcripts there are for each gene, which can be accomplished by grouping by gene and taking the length of each group.

aggregate(dt, :geneid, length)

The 199324 transcripts group down to 58219 genes, but current master chokes: I remember letting it run for something like 15 minutes before giving up. This time I got a segfault when trying to group on just the first ~50% of the datatable.
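As a standalone illustration of what this grouping computes, counting rows per key can be sketched in plain Julia with a Dict (a hypothetical helper for exposition, not DataTables code):

```julia
# Count how many times each key occurs, i.e. transcripts per gene.
# Plain-Julia sketch of what aggregate(dt, :geneid, length) computes.
function counts_by_key(keys::Vector{String})
    counts = Dict{String, Int}()
    for k in keys
        counts[k] = get(counts, k, 0) + 1
    end
    return counts
end

counts = counts_by_key(["ENSG1", "ENSG1", "ENSG2"])
# counts["ENSG1"] == 2, counts["ENSG2"] == 1
```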

Results

julia> @benchmark aggregate(dt[1:100, :], :geneid, length)
BenchmarkTools.Trial:
# Master
  memory estimate:  317.81 KiB
  allocs estimate:  4696
  --------------
  minimum time:     466.817 μs (0.00% GC)
  median time:      498.773 μs (0.00% GC)
  mean time:        584.415 μs (10.32% GC)
  maximum time:     6.870 ms (91.14% GC)
  --------------
  samples:          8531
  evals/sample:     1
# PR
  memory estimate:  37.78 KiB
  allocs estimate:  552
  --------------
  minimum time:     40.375 μs (0.00% GC)
  median time:      45.639 μs (0.00% GC)
  mean time:        55.888 μs (12.03% GC)
  maximum time:     5.162 ms (96.60% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark aggregate(dt[1:1000, :], :geneid, length)
BenchmarkTools.Trial:
# Master
  memory estimate:  1.41 MiB
  allocs estimate:  22333
  --------------
  minimum time:     2.005 ms (0.00% GC)
  median time:      2.223 ms (0.00% GC)
  mean time:        2.598 ms (11.27% GC)
  maximum time:     9.734 ms (66.11% GC)
  --------------
  samples:          1923
  evals/sample:     1
# PR
  memory estimate:  185.30 KiB
  allocs estimate:  2748
  --------------
  minimum time:     178.382 μs (0.00% GC)
  median time:      209.590 μs (0.00% GC)
  mean time:        253.332 μs (11.14% GC)
  maximum time:     5.663 ms (91.36% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark aggregate(dt[1:10000, :], :geneid, length)
BenchmarkTools.Trial:
# Master
  memory estimate:  18.09 MiB
  allocs estimate:  304407
  --------------
  minimum time:     28.898 ms (0.00% GC)
  median time:      36.982 ms (15.40% GC)
  mean time:        37.070 ms (13.57% GC)
  maximum time:     55.705 ms (18.42% GC)
  --------------
  samples:          135
  evals/sample:     1
# PR
  memory estimate:  2.09 MiB
  allocs estimate:  36746
  --------------
  minimum time:     1.751 ms (0.00% GC)
  median time:      2.060 ms (0.00% GC)
  mean time:        2.457 ms (12.42% GC)
  maximum time:     9.495 ms (53.24% GC)
  --------------
  samples:          2030
  evals/sample:     1

julia> @benchmark aggregate(dt[1:20000, :], :geneid, length)
BenchmarkTools.Trial:
# Master
  memory estimate:  37.05 MiB
  allocs estimate:  627207
  --------------
  minimum time:     66.411 ms (12.64% GC)
  median time:      77.100 ms (19.23% GC)
  mean time:        77.238 ms (18.45% GC)
  maximum time:     92.809 ms (17.32% GC)
  --------------
  samples:          65
  evals/sample:     1
# PR
  memory estimate:  4.23 MiB
  allocs estimate:  74914
  --------------
  minimum time:     3.641 ms (0.00% GC)
  median time:      4.569 ms (0.00% GC)
  mean time:        5.233 ms (11.95% GC)
  maximum time:     14.632 ms (36.55% GC)
  --------------
  samples:          955
  evals/sample:     1

julia> @benchmark aggregate(dt[1:40000, :], :geneid, length)
BenchmarkTools.Trial:
# Master
  memory estimate:  70.27 MiB
  allocs estimate:  1192807
  --------------
  minimum time:     156.096 ms (26.46% GC)
  median time:      187.659 ms (33.71% GC)
  mean time:        250.791 ms (50.89% GC)
  maximum time:     528.856 ms (76.93% GC)
  --------------
  samples:          20
  evals/sample:     1
# PR
  memory estimate:  8.23 MiB
  allocs estimate:  146450
  --------------
  minimum time:     7.260 ms (0.00% GC)
  median time:      9.189 ms (0.00% GC)
  mean time:        10.199 ms (13.79% GC)
  maximum time:     21.267 ms (27.32% GC)
  --------------
  samples:          491
  evals/sample:     1

julia> @benchmark aggregate(dt[1:60000, :], :geneid, length)
BenchmarkTools.Trial:
# Master
  memory estimate:  108.61 MiB
  allocs estimate:  1849707
  --------------
  minimum time:     265.886 ms (22.88% GC)
  median time:      317.234 ms (34.45% GC)
  mean time:        509.907 ms (61.89% GC)
  maximum time:     923.916 ms (79.20% GC)
  --------------
  samples:          11
  evals/sample:     1
# PR
  memory estimate:  12.30 MiB
  allocs estimate:  223464
  --------------
  minimum time:     11.227 ms (0.00% GC)
  median time:      16.902 ms (23.29% GC)
  mean time:        16.598 ms (15.10% GC)
  maximum time:     31.129 ms (22.10% GC)
  --------------
  samples:          302
  evals/sample:     1

julia> @benchmark aggregate(dt[1:80000, :], :geneid, length)
BenchmarkTools.Trial:
# Master
  memory estimate:  148.58 MiB
  allocs estimate:  2525707
  --------------
  minimum time:     340.715 ms (26.50% GC)
  median time:      889.232 ms (71.00% GC)
  mean time:        904.309 ms (71.14% GC)
  maximum time:     1.500 s (81.75% GC)
  --------------
  samples:          6
  evals/sample:     1
# PR
  memory estimate:  16.95 MiB
  allocs estimate:  301624
  --------------
  minimum time:     14.818 ms (0.00% GC)
  median time:      21.627 ms (21.36% GC)
  mean time:        21.506 ms (17.40% GC)
  maximum time:     36.470 ms (27.80% GC)
  --------------
  samples:          233
  evals/sample:     1

julia> @benchmark aggregate(dt[1:100000, :], :geneid, length)
# Master
  Segmentation fault: 11
# PR
  memory estimate:  21.10 MiB
  allocs estimate:  379982
  --------------
  minimum time:     18.975 ms (0.00% GC)
  median time:      27.029 ms (19.48% GC)
  mean time:        26.982 ms (18.94% GC)
  maximum time:     32.547 ms (18.59% GC)
  --------------
  samples:          186
  evals/sample:     1

julia> @benchmark aggregate(dt, :geneid, length)
BenchmarkTools.Trial:
# Master
  not run
# PR
  memory estimate:  36.87 MiB
  allocs estimate:  747336
  --------------
  minimum time:     41.041 ms (7.41% GC)
  median time:      50.146 ms (13.60% GC)
  mean time:        53.885 ms (15.05% GC)
  maximum time:     191.491 ms (76.34% GC)
  --------------
  samples:          93
  evals/sample:     1

I was satisfied with that improvement for my particular case, although I'd be happy to run this on other cases if anyone would like to propose other benchmarks. Consider this an RFC until I write tests.

this file = ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_26/gencode.v26.transcripts.fa.gz (for some reason Markdown hyperlinks aren't working with that URL, maybe because it's FTP?)

@nalimilan
Member

Looks nice. Can you explain (here and in the commit message) what the change consists of? Why would bypassing combine improve performance so much? Should combine be changed too?

@nalimilan changed the title Modify aggregate for efficiency → WIP: Modify aggregate for efficiency May 19, 2017
@cjprybol
Contributor Author

cjprybol commented May 24, 2017

I'm not sure how much of the efficiency comes from bypassing map and how much comes from bypassing combine. map constructs a DataTable for each group in the GroupApplied, which incurs the DataTable constructor's overhead once per group. combine then vcats and hcats all of those DataTables together, adding further checks, assertions, and overhead.

I don't want to touch combine here because I don't know how to support anonymous functions that return DataTables, as shown in the examples for by, without it. If I can figure out how to support that with less overhead, we might be able to simplify more of this code. I think it's safe to remove the calls to map and combine in aggregate because only one of the two aggregate methods (the one that grouped by column beforehand) used them. With these changes, users still have by for more complex split-apply-combine queries, while aggregate becomes a simpler and more efficient alternative.
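To make the difference concrete, here is a hypothetical toy comparison (not the PR's actual code) of the two strategies on plain vectors, in current Julia syntax:

```julia
# Toy groups standing in for a grouped table.
groups = [[1, 2, 3], [4, 5], [6]]

# map/combine-style: allocate one temporary container per group,
# then merge them all in a second pass.
per_group = map(g -> [length(g)], groups)
combined = reduce(vcat, per_group)

# direct-style (what bypassing map/combine amounts to):
# preallocate the result once and fill it in place.
direct = Vector{Int}(undef, length(groups))
for (i, g) in enumerate(groups)
    direct[i] = length(g)
end

# combined == direct == [3, 2, 1]
```

The direct version does one allocation for the result vector instead of one per group plus a concatenation pass, which is where most of the allocation savings in the benchmarks would come from.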

  • Add info to the docstring for aggregate

Bypassing map and combine in aggregate speeds up the code and reduces
memory allocations by ~1-2 orders of magnitude. The by function still
uses map and combine to allow more complex anonymous functions that can
return DataTables, as well as do blocks. Two accessory functions with
Julia 0.4 handling branches were simplified and inlined within
aggregate. Tests were added to extend aggregate test coverage.
Member

@nalimilan left a comment


I've dived into the code (I wasn't very familiar with this part), and I now see that it makes sense to optimize aggregate: the passed function is supposed to return values or vectors, not DataTable objects, so it's a bit silly to build them only to discard them afterwards.

Though the example you give in the description isn't really appropriate, since it computes the length of each column, which by definition is identical for all of them (use FreqTables for this kind of thing). mean would be a better benchmark.

@@ -355,16 +355,23 @@ dt |> groupby(:a) |> [sum, x->mean(dropnull(x))] # equivalent
"""
aggregate(d::AbstractDataTable, fs::Function; sort::Bool=false) = aggregate(d, [fs], sort=sort)
function aggregate{T<:Function}(d::AbstractDataTable, fs::Vector{T}; sort::Bool=false)
Member


Just noting here before we forget it that sort needs documenting. By the way, how did you choose which columns to sort on?

Contributor Author


I've been meaning to write a docstring for sort! I always try to use the by keyword and get reminded by the error message to use the (undocumented) cols:

julia> sort(DataTable(x = rand(10), y = rand(10)), by = :y)
ERROR: ArgumentError: 'by' must be a Function or a vector of Functions. Perhaps you wanted 'cols'.

For sorting, these changes keep the current behavior: when a user specifies sort = true, groupby sorts on the grouping columns, and aggregate then only bothers to sort the remaining columns that weren't sorted during the groupby.

Current code:

headers = _makeheaders(fs, _setdiff(_names(gd), gd.cols)) # only makes headers for new columns
...
sort && sort!(res, cols=headers) # only sorts new columns

I would feel a little more comfortable if we just sorted on all columns at the end, rather than sorting in two stages, although I haven't come up with a test case showing that the second sort in aggregate can improperly re-order the first sort from groupby.
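For reference, the single-stage alternative is just a lexicographic sort by row tuples. A hypothetical sketch on plain vectors (toy column names, not the PR's code):

```julia
# Toy columns standing in for the aggregate result.
geneid   = ["g2", "g1", "g2", "g1"]
x_length = [5, 3, 1, 4]

# Single-stage sort: order rows lexicographically on all columns at once.
perm = sortperm(collect(zip(geneid, x_length)))

geneid[perm]    # ["g1", "g1", "g2", "g2"]
x_length[perm]  # [3, 4, 1, 5]
```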

Member


When a user specifies sort = true, groupby will sort on the grouped columns and then aggregate only bothers to sort the remaining columns that weren't sorted during the groupby.

I must be missing something. If aggregate does not sort again on the grouped columns (which is a no-op when they are already sorted), the result won't be sorted on these columns.

Anyway if that works it's fine to keep the existing strategy (though if it can be improved in another PR, why not).

res = gd.parent[gd.idx[gd.starts], gd.cols]
cols = setdiff(names(gd.parent), gd.cols)
for f in fs
for c in cols
Member


You can use the "for f in fs, c in cols" syntax to merge this line with the previous one.
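As a standalone sketch of the suggested syntax (hypothetical names), the merged form iterates in the same order as the two nested loops:

```julia
fs = [length, sum]
cols = [:x, :y]

combos = Tuple{Function, Symbol}[]
for f in fs, c in cols   # same iteration order as nesting "for c" inside "for f"
    push!(combos, (f, c))
end
# combos == [(length, :x), (length, :y), (sum, :x), (sum, :y)]
```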

cols = setdiff(names(gd.parent), gd.cols)
for f in fs
for c in cols
res[Symbol(c, "_", f)] = [f(g[c]) for g in gd]
Member


This doesn't seem to support the case where f returns a vector. Though that was already true of the previous implementation (which stored the returned vector in a single cell). I guess we should just remove the mention of vectors from the docstring.


@test aggregate(dt, length) == DataTable(group_length = 12, x_length = 12)
anonfuncdt = aggregate(dt, x -> length(x))
@test anonfuncdt[1, 1] == 12 && anonfuncdt[1, 2] == 12
Member


Why not test the whole contents of the result? Cannot hurt.

anonfuncdt = aggregate(dt, x -> length(x))
@test anonfuncdt[1, 1] == 12 && anonfuncdt[1, 2] == 12

dt = DataTable(year = repeat(1:4, inner = 12), month = repeat(1:12, outer = 4),
Member


Double space.

@@ -175,19 +175,5 @@ function _setdiff{T}(a::AbstractVector{T}, b::T)
diff
end

# Gets the name of a function. Used in groupedatatable/grouping.jl
Member


Good to see this go!

@nalimilan
Member

Bump!

@nalimilan
Member

@cjprybol I suppose this should be backported to DataFrames?

@cjprybol
Contributor Author

cjprybol commented Oct 9, 2017

Yes, thanks for the reminder! I should also turn these benchmarks into the beginnings of a PkgBenchmark suite; I think we have benchmarks scattered around other DataTables PRs too. I couldn't figure out how to use PkgBenchmark the last time I tried, but this comment covers most of the process.

@cjprybol
Contributor Author

cjprybol commented Oct 9, 2017

See JuliaData/DataFrames.jl#1246

@cjprybol closed this Oct 9, 2017
@cjprybol deleted the cjp/aggregate branch October 9, 2017 05:06
@nalimilan
Member

Cool. Note that PkgBenchmark has been largely refactored on git master, so it's probably better to use that (see this post).
