This repository has been archived by the owner on May 5, 2019. It is now read-only.

WIP: Modify aggregate for efficiency #65

Closed
wants to merge 1 commit into from

Conversation

cjprybol
Contributor

@cjprybol commented May 19, 2017

Example code

using DataTables
using CSV
using Bio.Seq
using BenchmarkTools

# Collect transcript IDs, gene IDs, and sequences from the GENCODE FASTA.
transcriptids = Vector{String}()
geneids = Vector{String}()
sequences = Vector{DNASequence}()
for record in open(FASTAReader, joinpath("Desktop", "gencode.v26.transcripts.fa"))
    # GENCODE headers are "|"-delimited; fields 1 and 2 are the
    # transcript ID and the gene ID.
    transcriptid, geneid = String.(split(record.name, "|")[[1, 2]])
    push!(transcriptids, transcriptid)
    push!(geneids, geneid)
    push!(sequences, record.seq)
end

dt = DataTable(transcriptid = transcriptids, geneid = geneids, sequence = sequences)

The code above loads this file into a DataTable that looks like this:

julia> head(dt)
6×3 DataTables.DataTable
│ Row │ transcriptid      │ geneid            │ sequence                                                                          │
├─────┼───────────────────┼───────────────────┼───────────────────────────────────────────────────────────────────────────────────┤
│ 1   │ ENST00000456328.2 │ ENSG00000223972.5 │ GTTAACTTGCCGTCAGCCTTTTCTTTGACCTCTTCTTTCGCTAGAGATCCTTTATTAAAAGCACACTGTTGGTTTCTG │
│ 2   │ ENST00000450305.2 │ ENSG00000223972.5 │ GTGTCTGACTTCCAGCAACTGCTGGCCTGTGCCAGGGTGGGCATTAGTGCCCGTTGGAGAAAACAGGGGAATCCCGAA │
│ 3   │ ENST00000488147.1 │ ENSG00000227232.5 │ ATGGGAGCCGTGTGCACGTCGGGAGCTCGGAGTGAGCGCAGAAAACGGCACACCAATCAATAAAGAACTGAGCAGAAA │
│ 4   │ ENST00000619216.1 │ ENSG00000278267.1 │ TGTGGGAGAGGAACATGGGCTCAGGACAGCGGGTGTCAGCTTGCCTGACCCCCATGTCGCCTCTGTAG              │
│ 5   │ ENST00000473358.1 │ ENSG00000243485.5 │ GTGCACACGGCTCCCATGCGTTGTCTTCCGAGCGTCAGGGGACTTCCAAGCCTCCAGAACTGTGAGGGATAAATGTAT │
│ 6   │ ENST00000469289.1 │ ENSG00000243485.5 │ TCATCAGTCCAAAGTCCAGCAGTTGTCCCTCCTGGAATCCTCCAGAACTGTGAGGGATAAATGTATGATTTTAAAGTC │

I wanted to ask how many transcripts there are for each gene, which can be accomplished by grouping by gene and taking the length of each group.

aggregate(dt, :geneid, length)

The 199324 transcripts group down to 58219 genes, but current master chokes: I remember letting it run for something like 15 minutes before giving up. This time I got a segfault when trying to group on just the first ~50% of the datatable.
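As a standalone illustration of what this grouping computes, counting rows per key can be sketched in plain Julia with a Dict (a hypothetical helper for exposition, not DataTables code):

```julia
# Count how many times each key occurs, i.e. transcripts per gene.
# Plain-Julia sketch of what aggregate(dt, :geneid, length) computes.
function counts_by_key(keys::Vector{String})
    counts = Dict{String, Int}()
    for k in keys
        counts[k] = get(counts, k, 0) + 1
    end
    return counts
end

counts = counts_by_key(["ENSG1", "ENSG1", "ENSG2"])
# counts["ENSG1"] == 2, counts["ENSG2"] == 1
```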

Results

julia> @benchmark aggregate(dt[1:100, :], :geneid, length)
BenchmarkTools.Trial:
# Master
  memory estimate:  317.81 KiB
  allocs estimate:  4696
  --------------
  minimum time:     466.817 μs (0.00% GC)
  median time:      498.773 μs (0.00% GC)
  mean time:        584.415 μs (10.32% GC)
  maximum time:     6.870 ms (91.14% GC)
  --------------
  samples:          8531
  evals/sample:     1
# PR
  memory estimate:  37.78 KiB
  allocs estimate:  552
  --------------
  minimum time:     40.375 μs (0.00% GC)
  median time:      45.639 μs (0.00% GC)
  mean time:        55.888 μs (12.03% GC)
  maximum time:     5.162 ms (96.60% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark aggregate(dt[1:1000, :], :geneid, length)
BenchmarkTools.Trial:
# Master
  memory estimate:  1.41 MiB
  allocs estimate:  22333
  --------------
  minimum time:     2.005 ms (0.00% GC)
  median time:      2.223 ms (0.00% GC)
  mean time:        2.598 ms (11.27% GC)
  maximum time:     9.734 ms (66.11% GC)
  --------------
  samples:          1923
  evals/sample:     1
# PR
  memory estimate:  185.30 KiB
  allocs estimate:  2748
  --------------
  minimum time:     178.382 μs (0.00% GC)
  median time:      209.590 μs (0.00% GC)
  mean time:        253.332 μs (11.14% GC)
  maximum time:     5.663 ms (91.36% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark aggregate(dt[1:10000, :], :geneid, length)
BenchmarkTools.Trial:
# Master
  memory estimate:  18.09 MiB
  allocs estimate:  304407
  --------------
  minimum time:     28.898 ms (0.00% GC)
  median time:      36.982 ms (15.40% GC)
  mean time:        37.070 ms (13.57% GC)
  maximum time:     55.705 ms (18.42% GC)
  --------------
  samples:          135
  evals/sample:     1
# PR
  memory estimate:  2.09 MiB
  allocs estimate:  36746
  --------------
  minimum time:     1.751 ms (0.00% GC)
  median time:      2.060 ms (0.00% GC)
  mean time:        2.457 ms (12.42% GC)
  maximum time:     9.495 ms (53.24% GC)
  --------------
  samples:          2030
  evals/sample:     1

julia> @benchmark aggregate(dt[1:20000, :], :geneid, length)
BenchmarkTools.Trial:
# Master
  memory estimate:  37.05 MiB
  allocs estimate:  627207
  --------------
  minimum time:     66.411 ms (12.64% GC)
  median time:      77.100 ms (19.23% GC)
  mean time:        77.238 ms (18.45% GC)
  maximum time:     92.809 ms (17.32% GC)
  --------------
  samples:          65
  evals/sample:     1
# PR
  memory estimate:  4.23 MiB
  allocs estimate:  74914
  --------------
  minimum time:     3.641 ms (0.00% GC)
  median time:      4.569 ms (0.00% GC)
  mean time:        5.233 ms (11.95% GC)
  maximum time:     14.632 ms (36.55% GC)
  --------------
  samples:          955
  evals/sample:     1

julia> @benchmark aggregate(dt[1:40000, :], :geneid, length)
BenchmarkTools.Trial:
# Master
  memory estimate:  70.27 MiB
  allocs estimate:  1192807
  --------------
  minimum time:     156.096 ms (26.46% GC)
  median time:      187.659 ms (33.71% GC)
  mean time:        250.791 ms (50.89% GC)
  maximum time:     528.856 ms (76.93% GC)
  --------------
  samples:          20
  evals/sample:     1
# PR
  memory estimate:  8.23 MiB
  allocs estimate:  146450
  --------------
  minimum time:     7.260 ms (0.00% GC)
  median time:      9.189 ms (0.00% GC)
  mean time:        10.199 ms (13.79% GC)
  maximum time:     21.267 ms (27.32% GC)
  --------------
  samples:          491
  evals/sample:     1

julia> @benchmark aggregate(dt[1:60000, :], :geneid, length)
BenchmarkTools.Trial:
# Master
  memory estimate:  108.61 MiB
  allocs estimate:  1849707
  --------------
  minimum time:     265.886 ms (22.88% GC)
  median time:      317.234 ms (34.45% GC)
  mean time:        509.907 ms (61.89% GC)
  maximum time:     923.916 ms (79.20% GC)
  --------------
  samples:          11
  evals/sample:     1
# PR
  memory estimate:  12.30 MiB
  allocs estimate:  223464
  --------------
  minimum time:     11.227 ms (0.00% GC)
  median time:      16.902 ms (23.29% GC)
  mean time:        16.598 ms (15.10% GC)
  maximum time:     31.129 ms (22.10% GC)
  --------------
  samples:          302
  evals/sample:     1

julia> @benchmark aggregate(dt[1:80000, :], :geneid, length)
BenchmarkTools.Trial:
# Master
  memory estimate:  148.58 MiB
  allocs estimate:  2525707
  --------------
  minimum time:     340.715 ms (26.50% GC)
  median time:      889.232 ms (71.00% GC)
  mean time:        904.309 ms (71.14% GC)
  maximum time:     1.500 s (81.75% GC)
  --------------
  samples:          6
  evals/sample:     1
# PR
  memory estimate:  16.95 MiB
  allocs estimate:  301624
  --------------
  minimum time:     14.818 ms (0.00% GC)
  median time:      21.627 ms (21.36% GC)
  mean time:        21.506 ms (17.40% GC)
  maximum time:     36.470 ms (27.80% GC)
  --------------
  samples:          233
  evals/sample:     1

julia> @benchmark aggregate(dt[1:100000, :], :geneid, length)
# Master
  Segmentation fault: 11
# PR
  memory estimate:  21.10 MiB
  allocs estimate:  379982
  --------------
  minimum time:     18.975 ms (0.00% GC)
  median time:      27.029 ms (19.48% GC)
  mean time:        26.982 ms (18.94% GC)
  maximum time:     32.547 ms (18.59% GC)
  --------------
  samples:          186
  evals/sample:     1

julia> @benchmark aggregate(dt, :geneid, length)
BenchmarkTools.Trial:
# Master
  not run
# PR
  memory estimate:  36.87 MiB
  allocs estimate:  747336
  --------------
  minimum time:     41.041 ms (7.41% GC)
  median time:      50.146 ms (13.60% GC)
  mean time:        53.885 ms (15.05% GC)
  maximum time:     191.491 ms (76.34% GC)
  --------------
  samples:          93
  evals/sample:     1

I was satisfied with that improvement for my particular case, although I'd be happy to run this on other cases if anyone would like to propose other benchmarks. Consider this an RFC until I write tests.

this file = ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_26/gencode.v26.transcripts.fa.gz (for some reason Markdown hyperlinks aren't working with that URL, maybe because it's FTP?)

@nalimilan
Member

Looks nice. Can you explain (here and in the commit message) what the change consists of? Why would bypassing combine improve performance so much? Should combine be changed too?

@nalimilan changed the title Modify aggregate for efficiency → WIP: Modify aggregate for efficiency May 19, 2017
@cjprybol
Contributor Author

cjprybol commented May 24, 2017

I'm not sure how much of the efficiency comes from bypassing map and how much comes from bypassing combine. map constructs a DataTable for each group in the GroupApplied, which incurs the DataTable constructor's overhead once per group. combine then vcats and hcats all of those DataTables together, adding further checks, assertions, and overhead.

I don't want to touch combine here because I don't know how to support anonymous functions that return DataTables, as shown in the examples for by, without it. If I can figure out how to support that with less overhead, we might be able to simplify more of this code. I think it's safe to remove the calls to map and combine in aggregate because only one of the two aggregate methods (the one that grouped by column beforehand) used them. With these changes, users still have by for more complex split-apply-combine queries, while aggregate becomes a simpler and more efficient alternative.
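To make the difference concrete, here is a hypothetical toy comparison (not the PR's actual code) of the two strategies on plain vectors, in current Julia syntax:

```julia
# Toy groups standing in for a grouped table.
groups = [[1, 2, 3], [4, 5], [6]]

# map/combine-style: allocate one temporary container per group,
# then merge them all in a second pass.
per_group = map(g -> [length(g)], groups)
combined = reduce(vcat, per_group)

# direct-style (what bypassing map/combine amounts to):
# preallocate the result once and fill it in place.
direct = Vector{Int}(undef, length(groups))
for (i, g) in enumerate(groups)
    direct[i] = length(g)
end

# combined == direct == [3, 2, 1]
```

The direct version does one allocation for the result vector instead of one per group plus a concatenation pass, which is where most of the allocation savings in the benchmarks would come from.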

  • Add info to the docstring for aggregate

Bypassing map and combine in aggregate speeds up the code and reduces
memory allocations by ~1-2 orders of magnitude. The by function still
uses map and combine to allow more complex anonymous functions that can
return DataTables, as well as do blocks. Two accessory functions with
Julia 0.4 handling branches were simplified and inlined within
aggregate. Tests were added to extend aggregate test coverage.
Member

@nalimilan left a comment


I've dived into the code (I wasn't very familiar with this part), and I now see that it makes sense to optimize aggregate: the passed function is supposed to return values or vectors, not DataTable objects, so it's a bit silly to build them only to discard them afterwards.

Though the example you give in the description isn't really appropriate, since it computes the length of each column, which by definition is identical for all of them (use FreqTables for this kind of thing). mean would be a better benchmark.

@@ -355,16 +355,23 @@ dt |> groupby(:a) |> [sum, x->mean(dropnull(x))] # equivalent
"""
aggregate(d::AbstractDataTable, fs::Function; sort::Bool=false) = aggregate(d, [fs], sort=sort)
function aggregate{T<:Function}(d::AbstractDataTable, fs::Vector{T}; sort::Bool=false)
Member


Just noting here before we forget it that sort needs documenting. By the way, how did you choose which columns to sort on?

Contributor Author


I've been meaning to write a docstring for sort! I always try to use the by keyword and get reminded by the error message to use the (undocumented) cols:

julia> sort(DataTable(x = rand(10), y = rand(10)), by = :y)
ERROR: ArgumentError: 'by' must be a Function or a vector of Functions. Perhaps you wanted 'cols'.

For sorting, these changes keep the current behavior: when a user specifies sort = true, groupby sorts on the grouping columns, and aggregate then only bothers to sort the remaining columns that weren't sorted during the groupby.

Current code:

headers = _makeheaders(fs, _setdiff(_names(gd), gd.cols)) # only makes headers for new columns
...
sort && sort!(res, cols=headers) # only sorts new columns

I would feel a little more comfortable if we just sorted on all columns at the end, rather than sorting in two stages, although I haven't come up with a test case showing that the second sort in aggregate can improperly re-order the first sort from groupby.
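For reference, the single-stage alternative is just a lexicographic sort by row tuples. A hypothetical sketch on plain vectors (toy column names, not the PR's code):

```julia
# Toy columns standing in for the aggregate result.
geneid   = ["g2", "g1", "g2", "g1"]
x_length = [5, 3, 1, 4]

# Single-stage sort: order rows lexicographically on all columns at once.
perm = sortperm(collect(zip(geneid, x_length)))

geneid[perm]    # ["g1", "g1", "g2", "g2"]
x_length[perm]  # [3, 4, 1, 5]
```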

Member


When a user specifies sort = true, groupby will sort on the grouped columns and then aggregate only bothers to sort the remaining columns that weren't sorted during the groupby.

I must be missing something. If aggregate does not sort again on the grouped columns (which is a no-op when they are already sorted), the result won't be sorted on these columns.

Anyway if that works it's fine to keep the existing strategy (though if it can be improved in another PR, why not).

res = gd.parent[gd.idx[gd.starts], gd.cols]
cols = setdiff(names(gd.parent), gd.cols)
for f in fs
for c in cols
Member


You can use the "for f in fs, c in cols" syntax to merge this line with the previous one.
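As a standalone sketch of the suggested syntax (hypothetical names), the merged form iterates in the same order as the two nested loops:

```julia
fs = [length, sum]
cols = [:x, :y]

combos = Tuple{Function, Symbol}[]
for f in fs, c in cols   # same iteration order as nesting "for c" inside "for f"
    push!(combos, (f, c))
end
# combos == [(length, :x), (length, :y), (sum, :x), (sum, :y)]
```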

cols = setdiff(names(gd.parent), gd.cols)
for f in fs
for c in cols
res[Symbol(c, "_", f)] = [f(g[c]) for g in gd]
Member


This doesn't seem to support the case where f returns a vector. Though that was already true of the previous implementation (which stored the returned vector in a single cell). I guess we should just remove the mention of vectors from the docstring.


@test aggregate(dt, length) == DataTable(group_length = 12, x_length = 12)
anonfuncdt = aggregate(dt, x -> length(x))
@test anonfuncdt[1, 1] == 12 && anonfuncdt[1, 2] == 12
Member


Why not test the whole contents of the result? Cannot hurt.

anonfuncdt = aggregate(dt, x -> length(x))
@test anonfuncdt[1, 1] == 12 && anonfuncdt[1, 2] == 12

dt = DataTable(year = repeat(1:4, inner = 12), month = repeat(1:12, outer = 4),
Member


Double space.

@@ -175,19 +175,5 @@ function _setdiff{T}(a::AbstractVector{T}, b::T)
diff
end

# Gets the name of a function. Used in groupedatatable/grouping.jl
Member


Good to see this go!

@nalimilan
Member

Bump!

@nalimilan
Member

@cjprybol I suppose this should be backported to DataFrames?

@cjprybol
Contributor Author

cjprybol commented Oct 9, 2017

Yes, thanks for the reminder! I should also turn these benchmarks into the beginnings of a PkgBenchmark suite; I think we have benchmarks scattered around other DataTables PRs too. I couldn't figure out how to use PkgBenchmark the last time I tried, but this comment covers most of the process.

@cjprybol
Contributor Author

cjprybol commented Oct 9, 2017

See JuliaData/DataFrames.jl#1246

@cjprybol closed this Oct 9, 2017
@cjprybol deleted the cjp/aggregate branch October 9, 2017 05:06
@nalimilan
Member

Cool. Note that PkgBenchmark has been largely refactored on git master, so it's probably better to use that (see this post).
