
data.table should be smarter about compound logical subsetting #2472

Closed
MichaelChirico opened this issue Nov 9, 2017 · 14 comments

@MichaelChirico (Member) commented Nov 9, 2017

I don't see any reason for the following two to have substantially different runtimes:

library(data.table)
library(microbenchmark)  # for get_nanotime()

set.seed(210349)

NN = 1e6
DT = data.table(l1 = sample(letters, NN, TRUE),
                l2 = sample(letters, NN, TRUE))

times = matrix(nrow = 500, ncol = 2)

for (ii in seq_len(nrow(times))) {
  # copy each iteration so auto-created indices don't carry over between runs
  DT_copy = copy(DT)
  t0 = get_nanotime()
  DT_copy[l1 == 'm' & l2 == 'd']
  t1 = get_nanotime()
  times[ii, 1L] = t1 - t0

  DT_copy = copy(DT)
  t0 = get_nanotime()
  DT_copy[l1 == 'm'][l2 == 'd']
  t1 = get_nanotime()
  times[ii, 2L] = t1 - t0
}

median(times[ , 1L])
# [1] 17620043
median(times[ , 2L])
# [1] 12605714
mean(times[ , 1L]/times[ , 2L])
# [1] 1.888558

What surprised me even more is that this remains true when DT has these columns as an index, or even as a key:

# INDEXING
# benchmarking code same as above, but with setindex(DT, l1, l2) before the loop
mean(times[ , 1L]/times[ , 2L])
# [1] 1.47573

# KEYING
# same, but with setkey(DT, l1, l2) before the loop
mean(times[ , 1L]/times[ , 2L])
# [1] 7.293718

EDIT: This previously reported a large difference when pre-declaring the l1 and l2 comparisons as logical vectors, but that went away once I fixed the benchmarks to remove auto-indexing's influence on the timings. Why it made a difference at all is worth a separate issue (for posterity, the mean ratio in the logical benchmark was 1.55).
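
As an aside (this data.table option exists, though it wasn't used above), auto-indexing can instead be disabled globally while benchmarking, which avoids the copy() on each iteration:

# by default, subsets like DT[l1 == 'm'] create an index as a side effect;
# disabling that keeps repeated timings comparable without copying
options(datatable.auto.index = FALSE)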

@franknarf1 (Contributor) commented Nov 9, 2017

Related: #1453?

@MichaelChirico (Member, Author) commented Nov 9, 2017

Possibly. I thought so too, but the fact that keying/indexing didn't make a difference suggests it's not to blame.

@franknarf1 (Contributor) commented Nov 9, 2017

Hm, okay, I guess it is the same. If you look at the verbose=TRUE output, you'll see that the direct/one-call approach does not use the index, which is what that other issue is about (using indices for i queries that filter multiple columns by == or %in% connected by & ... at least that's my reading of it). If you instead use a join, the index is of course used, and it's faster in this case:

library(data.table)
set.seed(210349)
NN = 1e6
DT = data.table(l1 = sample(letters, NN, TRUE),
                l2 = sample(letters, NN, TRUE))
setindex(DT, l1, l2); setindex(DT, l1); setindex(DT, l2) 

library(microbenchmark)
microbenchmark(
  join      = res1 <- DT[.("m", "d"), on = .(l1, l2), nomatch = 0],
  w         = res2 <- DT[intersect(DT[l1 == "m", which = TRUE], DT[l2 == "d", which = TRUE])],
  multicall = res3 <- DT[l1 == "m"][l2 == "d"],
  direct    = res4 <- DT[l1 == "m" & l2 == "d"]
)

# check
fsetequal(res1, res2); fsetequal(res1, res3); fsetequal(res1, res4)

which gives me

Unit: microseconds
      expr      min        lq      mean    median        uq       max neval
      join  606.643   635.966   663.257   651.016   687.476   882.191   100
         w 2367.921  2406.708  2562.720  2439.601  2506.936  4455.327   100
 multicall 1993.075  2029.070  2444.687  2068.324  2115.334 24022.710   100
    direct 9677.103 10494.284 12007.870 11272.835 11673.124 33061.519   100

@jangorecki (Member) commented Nov 9, 2017

The Reduce approach I mentioned in the other issue I implemented here with a for loop instead, so that it can break early if there are already 0 rows after a partial filter.
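
For illustration, a minimal sketch of that idea (hypothetical code, not the actual implementation):

# apply each single-column filter in turn, stopping once no rows survive
subset_with_break <- function(DT, filters) {
  idx <- seq_len(nrow(DT))
  for (f in filters) {
    keep <- eval(f, DT[idx])        # evaluate one filter on the surviving rows
    idx <- idx[keep]
    if (length(idx) == 0L) break    # already 0 rows: skip the remaining filters
  }
  DT[idx]
}

# e.g. subset_with_break(DT, list(quote(l1 == "m"), quote(l2 == "d")))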

@MichaelChirico (Member, Author) commented Nov 10, 2017

I was convinced it would be crazy if R weren't handling vectorized & the right way, and it seems it is:

https://github.com/wch/r-source/blob/a968c791fcde95d574cf7262f77797d968bfafc0/src/main/logic.c#L341-L352

So I think the difference is that data.table is optimizing the evaluation only when there's one call to ==:

DT[l1 == 'm' & l2 == 'd', .N, verbose = TRUE]
# Detected that j uses these columns: <none> 
# [1] 1456

#vs
DT[l1 == 'm', verbose = TRUE][l2 == 'd', .N, verbose = TRUE]
# Using existing index 'l1'
# Starting bmerge ...done in 0 secs
# Creating new index 'l2'
# Starting bmerge ...done in 0 secs
# Detected that j uses these columns: <none> 
# [1] 1456

Updated the main post to reflect a fairer benchmark.

@MarkusBonsch (Contributor) commented Nov 16, 2017

I am currently working on a fix. It will make data.table use indices for all of the following: DT[x == 1], DT[x == 1 & y == 2], DT[x == 1 & y %in% c(1,2) & z %chin% c("a", "b")].

I hope to have a first version ready in a few days. I will create a branch for this.
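
To illustrate the idea (a hypothetical sketch, not the actual code of the fix): a compound i expression can be split at each top-level & into single-column filters, each of which can then be matched against an existing key or index.

# recursively split an expression at top-level `&` calls
split_and <- function(e) {
  if (is.call(e) && identical(e[[1L]], as.name("&"))) {
    c(split_and(e[[2L]]), split_and(e[[3L]]))
  } else {
    list(e)
  }
}

split_and(quote(x == 1 & y %in% c(1, 2) & z %chin% c("a", "b")))
# [[1]] x == 1   [[2]] y %in% c(1, 2)   [[3]] z %chin% c("a", "b")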

@MarkusBonsch (Contributor) commented Nov 19, 2017

I have created a branch "speedySubset". It passes all existing tests.
The benchmark for your example with a key on DT is:

> median(times[ , 1L])
[1] 2815236
> median(times[ , 2L])
[1] 7264336
> mean(times[ , 1L]/times[ , 2L])
[1] 0.386735

I will review my code, add more tests and create a PR soon.

Cheers,
Markus

@jangorecki (Member) commented Nov 19, 2017

Thanks Markus. The timings look impressive. Be aware that the performance improvement will vary a lot depending on the cardinality of the data and the number of columns in the subset.

@MichaelChirico (Member, Author) commented Nov 20, 2017

Awesome @MarkusBonsch

@jangorecki (Member) commented Nov 20, 2017

@MarkusBonsch I see some questions in the comments in your branch; if you could make a PR, I can address the questions right next to them.

@MarkusBonsch (Contributor) commented Nov 21, 2017

@jangorecki I will do that asap, thanks!

I did some further benchmarks with three versions:

  1. master
  2. speedySubset (enhanced support for subsetting with keys and indices)
  3. speedyNoReorder: as 2., but additionally skipping the expensive reorder (see #2366)

The code can be found at the end of this post.

The main messages are:

  - speedySubset does what it is expected to
  - on keyed DT, we always get the speed benefit
  - on indexed DT, the speed benefit is often negated by the reordering after bmerge, especially if the result set has many rows (see the sketch below)
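
For intuition on that reorder cost (a toy illustration of my own, not code from the branch): bmerge against an index returns row numbers in index order, and they must then be sorted back to the table's original row order, which can dominate when the result is large.

# sorting millions of row numbers, as the post-bmerge reorder must do
irows <- sample(1e7, 5e6)   # row numbers as bmerge might return them
system.time(sort(irows))    # the extra reorder step, O(n log n) in result size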

Here is a table with the results and comments:

| query | master (median ms) | speedy (median ms) | speedyNoReorder (median ms) | comment |
| --- | --- | --- | --- | --- |
| DT[intCol == 2L] (with key) | 9 | 9 | 8 | as expected, no difference |
| DT[intCol == 2L] (with index) | 12 | 13 | 8 | as expected, master and speedy identical |
| DT[doubleCol %in% 1:5] (with key) | 37 | 37 | 33 | as expected, master and speedy identical |
| DT[doubleCol %in% 1:5] (with index) | 114 | 115 | 29 | same, but shows how terrible the reorder can be time-wise |
| DT[!intCol == 3L] (with key) | 77 | 81 | 80 | as expected, master and speedy identical |
| DT[!intCol == 3L] (with index) | 74 | 77 | 72 | as expected, master and speedy identical |
| DT[charCol %chin% c("A", "B", "Y", "Z")] (with key) | 33 | 13 | 11 | as expected, speedy faster than master |
| DT[charCol %chin% c("A", "B", "Y", "Z")] (with index) | 38 | 33 | 10 | speed gain of speedy is eaten up by the reorder |
| DT[intCol == 2L & doubleCol %in% 1:5 & charCol %chin% c("A", "B", "Y", "Z")] (with key) | 74 | 4 | 4 | as expected, speedy faster than master |
| DT[intCol == 2L & doubleCol %in% 1:5 & charCol %chin% c("A", "B", "Y", "Z")] (with index) | 91 | 5 | 4 | as expected, reorder not important because the result has few rows |

Here is the benchmark code:

library(data.table)
library(microbenchmark)

n <- 1e6
DT <- data.table(intCol    = sample(1L:10L, n, replace = TRUE),
                 doubleCol = sample(1:10, n, replace = TRUE),
                 charCol   = sample(LETTERS, n, replace = TRUE))

DTequal <- copy(DT)
DTin <- copy(DT)
DTchin <- copy(DT)
DTcombined <- copy(DT)
DTnotjoin <- copy(DT)
setkey(DTequal, intCol, doubleCol, charCol)
setkey(DTin, doubleCol, charCol)
setkey(DTchin, charCol)
setkey(DTcombined, intCol, doubleCol, charCol)
setkey(DTnotjoin, intCol)

## original data.table will be used for indexed subsets
## one call to auto create relevant indices
DT[intCol == 2L]
DT[doubleCol %in% 1:5]
DT[charCol %chin% c("A", "B", "Y","Z")]
DT[intCol == 2L & doubleCol %in% 1:5 & charCol %chin% c("A", "B", "Y","Z")]
DT[!intCol == 3L]

test <- microbenchmark(equalKey      = o <- DTequal[intCol == 2L],
                       equalIndex    = o <- DT[intCol == 2L],
                       inKey         = o <- DTin[doubleCol %in% 1:5],
                       inIndex       = o <- DT[doubleCol %in% 1:5],
                       notjoinKey    = o <- DTnotjoin[!intCol == 3L],
                       notjoinIndex  = o <- DT[!intCol == 3L],
                       chinKey       = o <- DTchin[charCol %chin% c("A", "B", "Y","Z")],
                       chinIndex     = o <- DT[charCol %chin% c("A", "B", "Y","Z")],
                       combinedKey   = o <- DTcombined[intCol == 2L & doubleCol %in% 1:5 & charCol %chin% c("A", "B", "Y","Z")],
                       combinedIndex = o <- DT[intCol == 2L & doubleCol %in% 1:5 & charCol %chin% c("A", "B", "Y","Z")]
                       )

@mattdowle (Member) commented Nov 28, 2017

I'm very wary of benchmarks measured in anything under 1 second; I much prefer 10 seconds or more for a single run, achieved by increasing the data size. A repetition count of 500 sets off alarm bells: 3-5 runs should be enough to convince on larger data. Call overhead and time to GC affect inferences at this very small scale.

That said, given the nature of this improvement combined with the results shown, I believe it. Awesome! Further, it should use significantly less RAM, which will mean more large queries will fit in RAM and succeed rather than fail. I approved the PR. Great!

Can the data size be scaled up and the time improvement shown to be, say, 1 minute? Or perhaps a size that shows failing with out-of-memory vs working fine. For presentation purposes, and so the news item can convey the magnitude of the improvement for one example in one sentence.
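
For example, along those lines (a sketch only; the row count is an illustrative assumption and needs roughly 12 GB of RAM for the table alone, so reduce n to fit):

library(data.table)
n <- 1e9  # scale up until a single run takes tens of seconds on master
DT <- data.table(intCol  = sample(1L:10L, n, replace = TRUE),
                 charCol = sample(LETTERS, n, replace = TRUE))
setindex(DT, intCol, charCol)
# one long run at a time, repeated 3-5 times rather than 500
system.time(DT[intCol == 2L & charCol %chin% c("A", "B")])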

@mattdowle mattdowle added this to the v1.10.6 milestone Nov 28, 2017
@mattdowle mattdowle removed this from the v1.10.6 milestone Nov 28, 2017
@mattdowle mattdowle added this to the Candidate milestone Nov 28, 2017
@mattdowle mattdowle removed this from the Candidate milestone Jan 31, 2018
@mattdowle mattdowle added this to the v1.10.6 milestone Jan 31, 2018
@MarkusBonsch (Contributor) commented Apr 14, 2018

Closed by PR #2494.
