Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

in some cases setkey could save some more sort operations #1321

Closed
jan-glx opened this issue Sep 10, 2015 · 22 comments · Fixed by #4386
Closed

in some cases setkey could save some more sort operations #1321

jan-glx opened this issue Sep 10, 2015 · 22 comments · Fixed by #4386

Comments

@jan-glx
Copy link
Contributor

jan-glx commented Sep 10, 2015

When setkey is called on a DT with sorted attribute, existing keys should be reused if possible.
Most importantly in the setkey(setkey(dt, x, y), x) case but also in a little more advanced cases like setkey(setkey(dt, x, y), y, x). If you per se do not want to trust the sorted attribute at least check for the column being sorted before sorting if it is marked as sorted.
-best, Jan
example:

dtu = data.table(x = sample(1E6, 1E8, replace=T), 
                 y = sample(1E6, 1E8, replace=T))
dts = setkey(copy(dtu),x,y)
onUnsorted <- function() setkey(dtu,y,x)
onSorted <- function() setkey(dts,y,x)
onSortedSmart <- function() setattr(setkey(dts,y),"sorted",c("y","x"))

identical(onUnsorted(), onSorted())
#[1] TRUE
identical(onSorted(), onSortedSmart())
#[1] TRUE

system.time(onUnsorted())
#   user  system elapsed 
#   0.47    0.10    0.56 
system.time(onSorted())
#   user  system elapsed 
#   0.50    0.07    0.58 
system.time(onSortedSmart())
#   user  system elapsed 
#   0.24    0.06    0.29 
@arunsrinivasan
Copy link
Member

I'm sorry I don't follow this FR.

@franknarf1
Copy link
Contributor

@arunsrinivasan I think the idea is to skip some steps in setkey based on the current value of the sorted attribute (aka the key). So if doing setkeyv(DT,cols) and...

  • cols takes the form c(first_cols, head(key(DT),n)) for some n, then just do the sorting stuff for first_cols, knowing/expecting that the rest are fine.
  • cols takes the form [something something], then [similarly skip some sorting].

(Not knowing the internals of setkey, the OP instead infers that there are skippable steps from benchmarks.)

@arunsrinivasan
Copy link
Member

I'm guessing his suggestion is to use the "smart" method? But how does one know the data is still sorted by x in general?

require(data.table)
set.seed(1L); 
dt = data.table(x=sample(3,10,TRUE), y=sample(2,10,TRUE))
setkey(dt, y, x) # sorted by 'y,x'

# perform some operations
dt = dt[, .(x=sample(x)), by=y]
# x isn't sorted within y anymore..
key(dt) # [1] y

setattr(setkey(dt, y), 'sorted', c('y', 'x')) # <~~~ wrong!

# setkey() does a first pass to see if columns are already sorted by the columns specified 
# and if so, avoids computing order and reordering. It also warns if key is set on those 
# columns but data isn't sorted.
setkey(dt, y,x) 
# Warning message:
# In setkeyv(x, cols, verbose = verbose, physical = physical) :
#   Already keyed by this key but had invalid row order, key rebuilt. If you didn't go under the hood please let datatable-help know so the root cause can be fixed.

@eantonya
Copy link
Contributor

@arunsrinivasan by checking the key. Your operation destroyed the key.

@arunsrinivasan
Copy link
Member

Okay, I've:

setkey(dt, y,x)

How do I check that that's the key when doing:

setkey(setkey(dt, y), y,x)

?

@eantonya
Copy link
Contributor

I don't know how far this can be pushed, but what OP's example points out is the following - the first column (of a multicolumn) key never needs to be resorted - it will always be automatically correctly sorted.

So if your current key is x,y and you set the key again to something that involves x - e.g. y,x or y,z,x,w - the x column doesn't need to be sorted - it already will be, so (going in reverse to achieve the required key) - you'll sort w, skip the x sort, then sort z and y.

Hmm.. Correction - I'm not sure if the above is true - it's certainly true in the case where x is the last column of the new key - by definition, but probably not if it's in the middle. So not as general as I initially thought.

@arunsrinivasan
Copy link
Member

Yes but that's already taken care of internally. The order routine is only called on unsorted data.

Okay, just saw the timings.. woah!?!? sorting (and reordering) thro' 100 million elements in 0.6s? What machine is that?

These are the timings on my laptop.

system.time(onUnsorted())
#    user  system elapsed 
#  11.521   0.386  11.967 
system.time(onSorted())
#    user  system elapsed 
#   8.939   0.366   9.331 
system.time(onSortedSmart())
#    user  system elapsed 
#   0.287   0.090   0.378 

Note that the 2nd timing is slightly quicker (only first column needs to be ordered, the 2nd column's check for sorted-ness returns TRUE and therefore never reordered).

Also note that the 3rd timing doesn't do any ordering as both columns are already sorted.

@franknarf1
Copy link
Contributor

Okay, the OP's example is pretty confusing and seems to reach the wrong conclusion.

DT1 <- copy(dtu)
DT2 <- setkey(copy(dtu),x)

system.time(setkey(DT1,x,y)) # 14 seconds
system.time(setkey(DT2,x,y)) # 6  seconds

DT1 <- copy(dtu)
DT2 <- setkey(copy(dtu),y,x)

system.time(setkey(DT1,x,y)) # 15 seconds
system.time(setkey(DT2,x,y)) # 10 seconds

So setkey is taking advantage of the existing key in both of these cases; and presumably in similar cases that the OP can think of (though we'll see if they have other examples).

@arunsrinivasan
Copy link
Member

@franknarf1 thanks for confirming.

@eantonya
Copy link
Contributor

@franknarf1 I don't think those are the right timings to looks at. This is my understanding of the relevant timings:

dts = setkey(copy(dtu),x,y)
setkey(dts, y, x, verbose = T)
# forder took 9.52 sec
# reorder took 6.64 sec

dts = setkey(copy(dtu),x,y)
setkey(dts, y, verbose = T)
# forder took 5.42 sec
# reorder took 7.11 sec

dts = setkey(copy(dtu),x,y)
system.time(data.table:::is.sorted(dts$x))
#    user  system elapsed 
#    0.29    0.00    0.30 

You can see that setkey(dts, y) is faster by more than the is.sorted check.

@franknarf1
Copy link
Contributor

@eantonya Oh ok, I see what you mean. Whether that test is sufficient depends on knowledge of forderv that I don't have. I'll take your word on it, and it does make sense to me based on my vague understanding.

It's clear to me that the test data.table:::is.sorted(dts[, cols, with=FALSE]) is sufficient, where cols = c("y","x"); and with that slower check I do still see one second of savings (from 9 to 8).

@eantonya
Copy link
Contributor

The order algorithm is fairly trivial - iterate from last column of the key to the first and sort all of them. The result will be a key-sorted table (assuming the sorting algo keeps existing order in equivalent cases, which it does for us).

The OP's observation is fairly simple - if that last column is already sorted - don't sort it again. This happens when e.g. old key = x,y,z,... and new key is ...,x. Slightly more generally there is no need to sort the last n columns when going from old key = x1,x2,...,xn,y1,y2,... to new key = ...,x1,x2,...,xn.

@jan-glx
Copy link
Contributor Author

jan-glx commented Sep 10, 2015

Wow. 😄 You are waaay quicker than my 1yr old Lenovo y510p ( @arunsrinivasan - maybe not a mac book next time (just guessing)) :D
I am very sorry for the confusion I caused - wanted to finish the report before dinner - not a good idea...
Indeed I did wrongly conclude a much broader scope of the issue just from these timings.

@eantonya hit the nail on the head. Through I think it is minimally more general than as I understood him: when setting a key the data does not not need to be sorted for the last new key columns (i.e. the first sorts can be omitted) as long as they have been part of the old key in the same order (allowing arbitrary gaps; columns removed from the key count as if they were in their order at the end of the key) See code below I case you are interested.

Thanks for the discussion to everyone! I had a fundamentally wrong Idea of how data.table works - I thought it had indices on all key columns - now I see why joins on a secondary key are not that easy.

I'll edit the title to match what's left of the issue.
best, jan

"Proof":

DT = data.table(a = sample(3, 3E4, replace=T), 
                b = sample(3, 3E4, replace=T), 
                c = sample(3, 3E4, replace=T), 
                d = sample(3, 3E4, replace=T), 
                e = sample(1E6, 3E4, replace=T), key="a,b,c,d")
# unreadable code producing all permutations of the 4 keys:
permutations <- function( x, prefix = c() ){if(length(x) == 0 ) return(prefix);do.call(rbind,sapply(1:length(x), FUN = function(idx) permutations( x[-idx], c( prefix, x[idx])), simplify = FALSE))}
perms <- unlist(apply(permutations(c("a","b","c","d")),1,list),recursive = FALSE)
all(sapply(perms,function(p) all(setkeyv(copy(DT),p)==setkeyv(copy(DT),p[0:last(which(c(T,p[-4]>p[-1])))]))))
# returns TRUE

Example:

library(data.table)
set.seed(1)
m <- 1e2; n <- 1E8;
DT <- data.table(a = sample(m, n, replace=T), 
                 b = sample(m, n, replace=T), 
                 c = sample(m, n, replace=T), 
                 d = sample(m, n, replace=T), 
                 e = sample(m, n, replace=T), 
                 key="a,b,c,d")
invisible(gc())
DT1 <- setattr(setkey(copy(DT),b, verbose=T), "sorted", c("b","a","c","d"))
## forder took 0.59 sec
## reorder took 0.88 sec
invisible(gc())
DT2 <- setkey(copy(DT),b,a,c,d, verbose=T)
## forder took 1.65 sec
## reorder took 0.87 sec
identical(DT1,DT2)
## [1] TRUE

@jan-glx jan-glx changed the title setkey should use existing keys if possible in some cases setkey could save some more sort operations Sep 10, 2015
@arunsrinivasan
Copy link
Member

@eantonya we perform MSD radix sorting (cache efficient + benefits from partial ordering). See slide 19 here from Matt's UseR'15 talk.

I think this does it:

# handle identical case separately first
function(old, new) { 
    len = min(length(old), length(new))
    ans = vapply(seq_len(len), function(i) identical(head(old, i), tail(new, i)), TRUE)
    new_filtered = if (any(ans)) head(new, -which.max(ans)) else new
}

Note that we'll have to verify that the sorted attribute is indeed right. So, we've to run a "is sorted" test first before even beginning this. And doesn't seem that common occurrence to me. Feel free to make a PR.

@arunsrinivasan
Copy link
Member

@jan-glx just checked the configurations -- here and here. Still can't figure out the reason for speed difference to my Macbook Pro 13' with 4MB L3 cache, 256KB L2 cache, 16GB RAM 1600MHz DDR3, i7 3GHz.

The only thing that stands out for the processor compared to mine is 6MB (smart?) cache... unless I'm missing something..

20x faster seems very strange. Eddi's and Frank's timings are more or less identical to mine.

@eantonya
Copy link
Contributor

@jan-glx you seem to be right, omitting also works and I don't understand why, since sorting by a,c,d (assuming new key is ...,a,c,d) is different from a,b,c,d. I think an analytic proof is necessary if this ever gets implemented.

@arunsrinivasan I didn't realize the algo was changed to MSD, thanks for pointing that out (I don't think that changes too much for this FR - the timings above are all vs that algo). I'm not sure why you'd need to check is.sorted for a keyed table - seems a bit over-cautious.

@arunsrinivasan
Copy link
Member

@jan-glx hm, I'm stumped as to why that works as well.

@eantonya

I'm not sure why you'd need to check is.sorted for a keyed table - seems a bit over-cautious.

Not really. We already check when the columns to key by are identical to existing keys. It has helped catch bugs in places where we wrongly retain keys. Also setattr(dt, 'sorted', cols) really has no checks.

@eantonya
Copy link
Contributor

@arunsrinivasan I'm still unconvinced. Sure it's nice for catching data.table bugs, but why a regular user should lose performance because of that? Maybe a strict vs fast mode should be introduced for these kinds of performance vs potential data.table bugs redundancies (maybe the existing verbose flag can be adapted for that).

And as far as direct setting the attribute goes - I see no problems with complete failure if user decides to tinker with internals and does so incorrectly.

@eantonya
Copy link
Contributor

@jan-glx Here's a counter-example that shows that omitting doesn't work:

> dt = data.table(a = 1, b= c(1,1,2,2), c = c(3,4,1,2), e = c(4,1,4,1), key = c('a','b','c'))
> dt
   a b c e
1: 1 1 3 4
2: 1 1 4 1
3: 1 2 1 4
4: 1 2 2 1
> setkey(copy(dt), e)[]
   a b c e
1: 1 1 4 1
2: 1 2 2 1
3: 1 1 3 4
4: 1 2 1 4
> setkey(copy(dt), e, a, c)[]
   a b c e
1: 1 2 2 1
2: 1 1 4 1
3: 1 2 1 4
4: 1 1 3 4

@arunsrinivasan
Copy link
Member

setkey() ensures that the data is sorted. To do that, we need it, as we'll rely on the attribute of another keying operation. It's not about the user, rather a property of setkey() we'd like to stick to.

@jan-glx
Copy link
Contributor Author

jan-glx commented Sep 10, 2015

@arunsrinivasan Unfortunately I have to admit that your macbook is not worse than a cheap lenove - it was just a consequence of the stupid mistakes I made when trying to bring the observation into a presentable form...your initial response was absolute suitable... here is my correct timing for the initial data:(

library(data.table)
set.seed(1)
dtu = data.table(x = sample(1E6, 1E8, replace=T), 
                 y = sample(1E6, 1E8, replace=T))
options(datatable.verbose = T)
dts = setkey(copy(dtu),x,y)
## forder took 8.53 sec
## reorder took 4.11 sec
system.time(setkey(copy(dtu),y,x))
## forder took 8.52 sec
## reorder took 4.12 sec
##    user  system elapsed 
##   12.56    0.36   12.94
system.time(setkey(copy(dts),y,x))
## forder took 4.61 sec
## reorder took 4.08 sec
##    user  system elapsed 
##    8.56    0.38    8.94
system.time(setattr(setkey(copy(dts),y),"sorted",c("y","x")))
## forder took 2.39 sec
## reorder took 4.08 sec
##    user  system elapsed 
##    6.45    0.30    6.75

@eantonya well that was what I meant with

columns removed from the key count as if they were in their order at the end of the key
Thus in your example you would need to specify all columns as c was not before b(now not part of the key anymore, but still needs to be considered) in the old key. So that is a case where this optimization does not bring anything; But consider the similar:

dt = data.table(a = 1, b= c(1,1,2,2), c = c(3,4,1,2), e = c(4,1,4,1), key = c('a','b','c'))
setkey(copy(dt), e, c)[]
setkey(copy(dt), e, c, a)[]

[I aggree with @arunsrinivasan on the is-sorted check- if you are already sure you could as well just setattr, issorted is so cheap compared with any real operation and having a debug flag change what happens can make debugging very difficult...]

But please stop tinkering about it there is not much to gain here and there are really more import thinks to do for this package. E.g. a nice documentation for this wonderful forderv you just told me about. I once unsuccessfully searched an hour for a R (C) implementation of "bucket sort" just to read now randomly that it is already in base as "sort.list"...sometimes one underestimates base...

Thanks again for all that nice information!
-Jan

@jangorecki
Copy link
Member

This should be nicely handled by #4386

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
5 participants