data.table should be smarter about compound logical subsetting #2472
Comments
Related: #1453?
Possibly. I thought so, but the fact that keying/indexing didn't make a difference suggests it's not to blame.
Hm, okay, I guess it is the same. If you look at the verbose=TRUE output, you'll see that the direct/one-call approach does not use the index, which is what that other issue is about (using indices for i queries over multiple columns filtered by == or %in% and connected by & ... at least that's my reading of it). If you instead use a join, of course the index is used, and it's faster in this case:
library(data.table)
set.seed(210349)
NN = 1e6
DT = data.table(l1 = sample(letters, NN, TRUE),
                l2 = sample(letters, NN, TRUE))
setindex(DT, l1, l2); setindex(DT, l1); setindex(DT, l2)

library(microbenchmark)
microbenchmark(
  join      = res1 <- DT[.("m", "d"), on = .(l1, l2), nomatch = 0],
  w         = res2 <- DT[intersect(DT[l1 == "m", which = TRUE], DT[l2 == "d", which = TRUE])],
  multicall = res3 <- DT[l1 == "m"][l2 == "d"],
  direct    = res4 <- DT[l1 == "m" & l2 == "d"]
)
# check
fsetequal(res1, res2); fsetequal(res1, res3); fsetequal(res1, res4)
which gives me
Unit: microseconds
      expr      min        lq      mean    median        uq       max neval
      join  606.643   635.966   663.257   651.016   687.476   882.191   100
         w 2367.921  2406.708  2562.720  2439.601  2506.936  4455.327   100
 multicall 1993.075  2029.070  2444.687  2068.324  2115.334 24022.710   100
    direct 9677.103 10494.284 12007.870 11272.835 11673.124 33061.519   100
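A quick way to check the index usage described above, assuming the DT from the benchmark (the exact verbose messages vary across data.table versions):
# sketch: run the two styles of query with verbose output and compare the messages
DT[l1 == "m" & l2 == "d", verbose = TRUE]                     # direct: at the time, no index was used
DT[.("m", "d"), on = .(l1, l2), nomatch = 0, verbose = TRUE]  # join: reports using the (l1, l2) index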
Right, and I'd expect the keyed/sorted approach to work even faster than what I have in mind. But on the fly it shouldn't matter -- in A & B, B only needs to be evaluated when A is TRUE (this is forced by the second approach).
I guess it's a consequence of lazy evaluation, but I think it can be overcome.
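A rough sketch of that idea (my own illustration, not data.table internals), reusing the DT from the benchmark above -- evaluate the second condition only on the rows that already satisfy the first, which is effectively what the chained subset does:
# vectorised &: both conditions are computed over all rows before being combined
res_direct <- DT[l1 == "m" & l2 == "d"]
# short-circuit style: check l2 only on the rows where l1 already matched (~1/26 of them)
idx1 <- which(DT$l1 == "m")        # rows satisfying the first condition
idx  <- idx1[DT$l2[idx1] == "d"]   # second condition evaluated only on those rows
res_sc <- DT[idx]
fsetequal(res_direct, res_sc)      # TRUE: same result with far less work on l2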
The approach I mentioned in the other issue using ...
I was convinced it would be crazy if R wasn't doing vector ..., so I think the difference is that ...
Updated the main post to reflect a fairer benchmark.
I am currently working on a fix. Hope to have a first version ready in a few days... I will create a branch for this.
I have created a branch "speedySubset". It passes all existing tests.
I will review my code, add more tests, and create a PR soon. Cheers,
Thanks Markus. Timing looks impressive. Be aware that the performance improvement will vary a lot depending on the cardinality of the data and the number of columns to subset.
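A small illustration of the cardinality point (my own sketch, separate from the PR): the lower the cardinality, the larger the share of rows that survives the first condition, so the less work an optimised subset can skip.
# fraction of rows a single == condition leaves over, at two different cardinalities
NN   <- 1e6
low  <- sample(2L,   NN, TRUE)   # 2 distinct values: ~50% of rows match any one value
high <- sample(1e5L, NN, TRUE)   # 1e5 distinct values: ~0.001% of rows match
mean(low == 1L)                  # ~0.5
mean(high == 1L)                 # ~1e-5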
Awesome @MarkusBonsch
@MarkusBonsch I see some questions in comments in your branch; if you could make a PR, I can address the questions right next to them.
@jangorecki I will do that asap, thanks! I did some further benchmarks with three versions:
The code can be found at the end of this post. The main messages are:
Here is a table with the results and comments:
Here is the benchmark code:
I'm very wary of benchmarks measured in anything under 1 second. Much prefer 10 seconds or more for a single run, achieved by increasing data size. A repetition count of 500 is setting off alarm bells; 3-5 runs should be enough to convince on larger data. Call overhead and time to GC affect inferences at this very small scale. That said, given the nature of this improvement combined with the results shown, I believe it. Awesome! Further, it should use significantly less RAM, which will mean more large queries will work and fit in RAM rather than fail. I approved the PR. Great! Can the data size be scaled up and the time improvement shown to be, say, 1 minute, or perhaps a size that shows failing with out-of-memory vs working fine? For presentation purposes, and for the news item, to convey the magnitude of improvement for one example in one sentence.
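A sketch of scaling the earlier benchmark up along those lines; the size is a placeholder and should be chosen to suit available RAM:
# one large run timed directly, rather than many sub-second repetitions
library(data.table)
set.seed(210349)
NN <- 5e8                          # hypothetical size; reduce if memory is tight
DT <- data.table(l1 = sample(letters, NN, TRUE),
                 l2 = sample(letters, NN, TRUE))
system.time(res <- DT[l1 == "m" & l2 == "d"])   # aim for ~10 seconds or more per run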
Closed by PR #2494.
I don't see any reason for the following two to have substantially different runtimes:
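Presumably, judging from the direct and multicall entries benchmarked above, the two calls were along these lines (a hypothetical reconstruction):
DT[l1 == "m" & l2 == "d"]    # one compound logical filter
DT[l1 == "m"][l2 == "d"]     # chained filters, one condition per call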
It surprised me all the more that it continues to be true when DT has these columns as an index or even a key (see the sketch below).
EDIT: Previously stated a large difference with pre-declaring l1 & l2 as logical... but this went away once I fixed the benchmarks to overcome auto-indexing's influence on the timings, though there's something to be said about why this made a difference -- in another issue (for posterity, mean ratio of the logical benchmark: 1.55).
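A rough sketch of that keyed setup, with auto-indexing switched off so it cannot influence the timings (this reuses the DT from the benchmark above and is an illustration, not the original code):
# key the table, then compare the two subset styles; disabling auto-indexing keeps
# automatically created indices from skewing the comparison
library(data.table); library(microbenchmark)
options(datatable.auto.index = FALSE)
setkey(DT, l1, l2)
microbenchmark(
  direct    = DT[l1 == "m" & l2 == "d"],
  multicall = DT[l1 == "m"][l2 == "d"],
  times = 100
)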