in some cases setkey could save some more sort operations #1321

jan-glx · 2015-09-10T13:49:28Z

When setkey is called on a DT with sorted attribute, existing keys should be reused if possible.
Most importantly in the setkey(setkey(dt, x, y), x) case but also in a little more advanced cases like setkey(setkey(dt, x, y), y, x). If you per se do not want to trust the sorted attribute at least check for the column being sorted before sorting if it is marked as sorted.
-best, Jan
example:

dtu = data.table(x = sample(1E6, 1E8, replace=T), 
                 y = sample(1E6, 1E8, replace=T))
dts = setkey(copy(dtu),x,y)
onUnsorted <- function() setkey(dtu,y,x)
onSorted <- function() setkey(dts,y,x)
onSortedSmart <- function() setattr(setkey(dts,y),"sorted",c("y","x"))

identical(onUnsorted(), onSorted())
#[1] TRUE
identical(onSorted(), onSortedSmart())
#[1] TRUE

system.time(onUnsorted())
#   user  system elapsed 
#   0.47    0.10    0.56 
system.time(onSorted())
#   user  system elapsed 
#   0.50    0.07    0.58 
system.time(onSortedSmart())
#   user  system elapsed 
#   0.24    0.06    0.29

The text was updated successfully, but these errors were encountered:

arunsrinivasan · 2015-09-10T16:28:08Z

I'm sorry I don't follow this FR.

franknarf1 · 2015-09-10T16:38:05Z

@arunsrinivasan I think the idea is to skip some steps in setkey based on the current value of the sorted attribute (aka the key). So if doing setkeyv(DT,cols) and...

cols takes the form c(first_cols, head(key(DT),n)) for some n, then just do the sorting stuff for first_cols, knowing/expecting that the rest are fine.
cols takes the form [something something], then [similarly skip some sorting].

(Not knowing the internals of setkey, the OP instead infers that there are skippable steps from benchmarks.)

arunsrinivasan · 2015-09-10T16:58:36Z

I'm guessing his suggestion is to use the "smart" method? But how does one know the data is still sorted by x in general?

require(data.table)
set.seed(1L); 
dt = data.table(x=sample(3,10,TRUE), y=sample(2,10,TRUE))
setkey(dt, y, x) # sorted by 'y,x'

# perform some operations
dt = dt[, .(x=sample(x)), by=y]
# x isn't sorted within y anymore..
key(dt) # [1] y

setattr(setkey(dt, y), 'sorted', c('y', 'x')) # <~~~ wrong!

# setkey() does a first pass to see if columns are already sorted by the columns specified 
# and if so, avoids computing order and reordering. It also warns if key is set on those 
# columns but data isn't sorted.
setkey(dt, y,x) 
# Warning message:
# In setkeyv(x, cols, verbose = verbose, physical = physical) :
#   Already keyed by this key but had invalid row order, key rebuilt. If you didn't go under the hood please let datatable-help know so the root cause can be fixed.

eantonya · 2015-09-10T17:02:53Z

@arunsrinivasan by checking the key. Your operation destroyed the key.

arunsrinivasan · 2015-09-10T17:06:52Z

Okay, I've:

setkey(dt, y,x)

How do I check that that's the key when doing:

setkey(setkey(dt, y), y,x)

?

eantonya · 2015-09-10T17:08:15Z

I don't know how far this can be pushed, but what OP's example points out is the following - the first column (of a multicolumn) key never needs to be resorted - it will always be automatically correctly sorted.

So if your current key is x,y and you set the key again to something that involves x - e.g. y,x or y,z,x,w - the x column doesn't need to be sorted - it already will be, so (going in reverse to achieve the required key) - you'll sort w, skip the x sort, then sort z and y.

Hmm.. Correction - I'm not sure if the above is true - it's certainly true in the case where x is the last column of the new key - by definition, but probably not if it's in the middle. So not as general as I initially thought.

arunsrinivasan · 2015-09-10T17:12:33Z

Yes but that's already taken care of internally. The order routine is only called on unsorted data.

Okay, just saw the timings.. woah!?!? sorting (and reordering) thro' 100 million elements in 0.6s? What machine is that?

These are the timings on my laptop.

system.time(onUnsorted())
#    user  system elapsed 
#  11.521   0.386  11.967 
system.time(onSorted())
#    user  system elapsed 
#   8.939   0.366   9.331 
system.time(onSortedSmart())
#    user  system elapsed 
#   0.287   0.090   0.378

Note that the 2nd timing is slightly quicker (only first column needs to be ordered, the 2nd column's check for sorted-ness returns TRUE and therefore never reordered).

Also note that the 3rd timing doesn't do any ordering as both columns are already sorted.

franknarf1 · 2015-09-10T17:19:11Z

Okay, the OP's example is pretty confusing and seems to reach the wrong conclusion.

DT1 <- copy(dtu)
DT2 <- setkey(copy(dtu),x)

system.time(setkey(DT1,x,y)) # 14 seconds
system.time(setkey(DT2,x,y)) # 6  seconds

DT1 <- copy(dtu)
DT2 <- setkey(copy(dtu),y,x)

system.time(setkey(DT1,x,y)) # 15 seconds
system.time(setkey(DT2,x,y)) # 10 seconds

So setkey is taking advantage of the existing key in both of these cases; and presumably in similar cases that the OP can think of (though we'll see if they have other examples).

arunsrinivasan · 2015-09-10T17:22:18Z

@franknarf1 thanks for confirming.

eantonya · 2015-09-10T17:23:48Z

@franknarf1 I don't think those are the right timings to looks at. This is my understanding of the relevant timings:

dts = setkey(copy(dtu),x,y)
setkey(dts, y, x, verbose = T)
# forder took 9.52 sec
# reorder took 6.64 sec

dts = setkey(copy(dtu),x,y)
setkey(dts, y, verbose = T)
# forder took 5.42 sec
# reorder took 7.11 sec

dts = setkey(copy(dtu),x,y)
system.time(data.table:::is.sorted(dts$x))
#    user  system elapsed 
#    0.29    0.00    0.30

You can see that setkey(dts, y) is faster by more than the is.sorted check.

franknarf1 · 2015-09-10T17:44:36Z

@eantonya Oh ok, I see what you mean. Whether that test is sufficient depends on knowledge of forderv that I don't have. I'll take your word on it, and it does make sense to me based on my vague understanding.

It's clear to me that the test data.table:::is.sorted(dts[, cols, with=FALSE]) is sufficient, where cols = c("y","x"); and with that slower check I do still see one second of savings (from 9 to 8).

eantonya · 2015-09-10T18:26:30Z

The order algorithm is fairly trivial - iterate from last column of the key to the first and sort all of them. The result will be a key-sorted table (assuming the sorting algo keeps existing order in equivalent cases, which it does for us).

The OP's observation is fairly simple - if that last column is already sorted - don't sort it again. This happens when e.g. old key = x,y,z,... and new key is ...,x. Slightly more generally there is no need to sort the last n columns when going from old key = x1,x2,...,xn,y1,y2,... to new key = ...,x1,x2,...,xn.

jan-glx · 2015-09-10T20:54:54Z

Wow. 😄 You are waaay quicker than my 1yr old Lenovo y510p ( @arunsrinivasan - maybe not a mac book next time (just guessing)) :D
I am very sorry for the confusion I caused - wanted to finish the report before dinner - not a good idea...
Indeed I did wrongly conclude a much broader scope of the issue just from these timings.

@eantonya hit the nail on the head. Through I think it is minimally more general than as I understood him: when setting a key the data does not not need to be sorted for the last new key columns (i.e. the first sorts can be omitted) as long as they have been part of the old key in the same order (allowing arbitrary gaps; columns removed from the key count as if they were in their order at the end of the key) See code below I case you are interested.

Thanks for the discussion to everyone! I had a fundamentally wrong Idea of how data.table works - I thought it had indices on all key columns - now I see why joins on a secondary key are not that easy.

I'll edit the title to match what's left of the issue.
best, jan

"Proof":

DT = data.table(a = sample(3, 3E4, replace=T), 
                b = sample(3, 3E4, replace=T), 
                c = sample(3, 3E4, replace=T), 
                d = sample(3, 3E4, replace=T), 
                e = sample(1E6, 3E4, replace=T), key="a,b,c,d")
# unreadable code producing all permutations of the 4 keys:
permutations <- function( x, prefix = c() ){if(length(x) == 0 ) return(prefix);do.call(rbind,sapply(1:length(x), FUN = function(idx) permutations( x[-idx], c( prefix, x[idx])), simplify = FALSE))}
perms <- unlist(apply(permutations(c("a","b","c","d")),1,list),recursive = FALSE)
all(sapply(perms,function(p) all(setkeyv(copy(DT),p)==setkeyv(copy(DT),p[0:last(which(c(T,p[-4]>p[-1])))]))))
# returns TRUE

Example:

library(data.table)
set.seed(1)
m <- 1e2; n <- 1E8;
DT <- data.table(a = sample(m, n, replace=T), 
                 b = sample(m, n, replace=T), 
                 c = sample(m, n, replace=T), 
                 d = sample(m, n, replace=T), 
                 e = sample(m, n, replace=T), 
                 key="a,b,c,d")
invisible(gc())
DT1 <- setattr(setkey(copy(DT),b, verbose=T), "sorted", c("b","a","c","d"))
## forder took 0.59 sec
## reorder took 0.88 sec
invisible(gc())
DT2 <- setkey(copy(DT),b,a,c,d, verbose=T)
## forder took 1.65 sec
## reorder took 0.87 sec
identical(DT1,DT2)
## [1] TRUE

arunsrinivasan · 2015-09-10T21:02:01Z

@eantonya we perform MSD radix sorting (cache efficient + benefits from partial ordering). See slide 19 here from Matt's UseR'15 talk.

I think this does it:

# handle identical case separately first
function(old, new) { 
    len = min(length(old), length(new))
    ans = vapply(seq_len(len), function(i) identical(head(old, i), tail(new, i)), TRUE)
    new_filtered = if (any(ans)) head(new, -which.max(ans)) else new
}

Note that we'll have to verify that the sorted attribute is indeed right. So, we've to run a "is sorted" test first before even beginning this. And doesn't seem that common occurrence to me. Feel free to make a PR.

arunsrinivasan · 2015-09-10T21:10:55Z

@jan-glx just checked the configurations -- here and here. Still can't figure out the reason for speed difference to my Macbook Pro 13' with 4MB L3 cache, 256KB L2 cache, 16GB RAM 1600MHz DDR3, i7 3GHz.

The only thing that stands out for the processor compared to mine is 6MB (smart?) cache... unless I'm missing something..

20x faster seems very strange. Eddi's and Frank's timings are more or less identical to mine.

eantonya · 2015-09-10T21:29:41Z

@jan-glx you seem to be right, omitting also works and I don't understand why, since sorting by a,c,d (assuming new key is ...,a,c,d) is different from a,b,c,d. I think an analytic proof is necessary if this ever gets implemented.

@arunsrinivasan I didn't realize the algo was changed to MSD, thanks for pointing that out (I don't think that changes too much for this FR - the timings above are all vs that algo). I'm not sure why you'd need to check is.sorted for a keyed table - seems a bit over-cautious.

arunsrinivasan · 2015-09-10T21:49:33Z

@jan-glx hm, I'm stumped as to why that works as well.

@eantonya

I'm not sure why you'd need to check is.sorted for a keyed table - seems a bit over-cautious.

Not really. We already check when the columns to key by are identical to existing keys. It has helped catch bugs in places where we wrongly retain keys. Also setattr(dt, 'sorted', cols) really has no checks.

eantonya · 2015-09-10T22:20:05Z

@arunsrinivasan I'm still unconvinced. Sure it's nice for catching data.table bugs, but why a regular user should lose performance because of that? Maybe a strict vs fast mode should be introduced for these kinds of performance vs potential data.table bugs redundancies (maybe the existing verbose flag can be adapted for that).

And as far as direct setting the attribute goes - I see no problems with complete failure if user decides to tinker with internals and does so incorrectly.

eantonya · 2015-09-10T22:27:04Z

@jan-glx Here's a counter-example that shows that omitting doesn't work:

> dt = data.table(a = 1, b= c(1,1,2,2), c = c(3,4,1,2), e = c(4,1,4,1), key = c('a','b','c'))
> dt
   a b c e
1: 1 1 3 4
2: 1 1 4 1
3: 1 2 1 4
4: 1 2 2 1
> setkey(copy(dt), e)[]
   a b c e
1: 1 1 4 1
2: 1 2 2 1
3: 1 1 3 4
4: 1 2 1 4
> setkey(copy(dt), e, a, c)[]
   a b c e
1: 1 2 2 1
2: 1 1 4 1
3: 1 2 1 4
4: 1 1 3 4

arunsrinivasan · 2015-09-10T22:29:41Z

setkey() ensures that the data is sorted. To do that, we need it, as we'll rely on the attribute of another keying operation. It's not about the user, rather a property of setkey() we'd like to stick to.

jan-glx · 2015-09-10T23:15:15Z

@arunsrinivasan Unfortunately I have to admit that your macbook is not worse than a cheap lenove - it was just a consequence of the stupid mistakes I made when trying to bring the observation into a presentable form...your initial response was absolute suitable... here is my correct timing for the initial data:(

library(data.table)
set.seed(1)
dtu = data.table(x = sample(1E6, 1E8, replace=T), 
                 y = sample(1E6, 1E8, replace=T))
options(datatable.verbose = T)
dts = setkey(copy(dtu),x,y)
## forder took 8.53 sec
## reorder took 4.11 sec
system.time(setkey(copy(dtu),y,x))
## forder took 8.52 sec
## reorder took 4.12 sec
##    user  system elapsed 
##   12.56    0.36   12.94
system.time(setkey(copy(dts),y,x))
## forder took 4.61 sec
## reorder took 4.08 sec
##    user  system elapsed 
##    8.56    0.38    8.94
system.time(setattr(setkey(copy(dts),y),"sorted",c("y","x")))
## forder took 2.39 sec
## reorder took 4.08 sec
##    user  system elapsed 
##    6.45    0.30    6.75

@eantonya well that was what I meant with

columns removed from the key count as if they were in their order at the end of the key
Thus in your example you would need to specify all columns as c was not before b(now not part of the key anymore, but still needs to be considered) in the old key. So that is a case where this optimization does not bring anything; But consider the similar:
dt = data.table(a = 1, b= c(1,1,2,2), c = c(3,4,1,2), e = c(4,1,4,1), key = c('a','b','c'))
setkey(copy(dt), e, c)[]
setkey(copy(dt), e, c, a)[]
[I aggree with @arunsrinivasan on the is-sorted check- if you are already sure you could as well just setattr, issorted is so cheap compared with any real operation and having a debug flag change what happens can make debugging very difficult...]

But please stop tinkering about it there is not much to gain here and there are really more import thinks to do for this package. E.g. a nice documentation for this wonderful forderv you just told me about. I once unsuccessfully searched an hour for a R (C) implementation of "bucket sort" just to read now randomly that it is already in base as "sort.list"...sometimes one underestimates base...

Thanks again for all that nice information!
-Jan

jangorecki · 2020-08-25T07:55:29Z

This should be nicely handled by #4386

eantonya added feature request Low labels Sep 10, 2015

jan-glx changed the title ~~setkey should use existing keys if possible~~ in some cases setkey could save some more sort operations Sep 10, 2015

franknarf1 mentioned this issue May 20, 2019

Make .SD inherit partial key from parent data table in cases where grouping is performed by a subset of the key #1736

Open

jangorecki mentioned this issue Aug 25, 2020

Teach forder to re-use existing key and index attributes instead of sorting from scratch #4386

Merged

3 tasks

MichaelChirico closed this as completed in #4386 Jul 28, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

in some cases setkey could save some more sort operations #1321

in some cases setkey could save some more sort operations #1321

jan-glx commented Sep 10, 2015

arunsrinivasan commented Sep 10, 2015

franknarf1 commented Sep 10, 2015

arunsrinivasan commented Sep 10, 2015

eantonya commented Sep 10, 2015

arunsrinivasan commented Sep 10, 2015

eantonya commented Sep 10, 2015

arunsrinivasan commented Sep 10, 2015

franknarf1 commented Sep 10, 2015

arunsrinivasan commented Sep 10, 2015

eantonya commented Sep 10, 2015

franknarf1 commented Sep 10, 2015

eantonya commented Sep 10, 2015

jan-glx commented Sep 10, 2015

arunsrinivasan commented Sep 10, 2015

arunsrinivasan commented Sep 10, 2015

eantonya commented Sep 10, 2015

arunsrinivasan commented Sep 10, 2015

eantonya commented Sep 10, 2015

eantonya commented Sep 10, 2015

arunsrinivasan commented Sep 10, 2015

jan-glx commented Sep 10, 2015

jangorecki commented Aug 25, 2020

in some cases setkey could save some more sort operations #1321

in some cases setkey could save some more sort operations #1321

Comments

jan-glx commented Sep 10, 2015

arunsrinivasan commented Sep 10, 2015

franknarf1 commented Sep 10, 2015

arunsrinivasan commented Sep 10, 2015

eantonya commented Sep 10, 2015

arunsrinivasan commented Sep 10, 2015

eantonya commented Sep 10, 2015

arunsrinivasan commented Sep 10, 2015

franknarf1 commented Sep 10, 2015

arunsrinivasan commented Sep 10, 2015

eantonya commented Sep 10, 2015

franknarf1 commented Sep 10, 2015

eantonya commented Sep 10, 2015

jan-glx commented Sep 10, 2015

arunsrinivasan commented Sep 10, 2015

arunsrinivasan commented Sep 10, 2015

eantonya commented Sep 10, 2015

arunsrinivasan commented Sep 10, 2015

eantonya commented Sep 10, 2015

eantonya commented Sep 10, 2015

arunsrinivasan commented Sep 10, 2015

jan-glx commented Sep 10, 2015

jangorecki commented Aug 25, 2020