Implement 'rank' #760

arunsrinivasan · 2014-08-06T23:47:19Z

arunsrinivasan · 2014-08-15T22:44:58Z

Hm, this is almost done. However, there's something I hadn't anticipated - NAs. base::rank never considers them as equal. For example:

x = c(0,1,NA,1,0,NA)
rank(x, ties="average")
# [1] 1.5 3.5 5.0 3.5 1.5 6.0

However, frankv (for fast rank) implementation considers all NAs to be equal at the moment.

frankv(x, ties="average")
# [1] 1.5 3.5 5.5 3.5 1.5 5.5

It takes the average of 5,6 to give 5.5 instead of considering them as unique values. This comes directly from forder which groups NAs together for aggregation purposes. Since ties.method = average, max, min depends on the attribute starts from forderv, this issue exists for these methods. Since the logic is much simpler for the remaining two - random and first, it's not a problem there.

This can of course be corrected by checking if x[i] is NA or not and assigning the right value.

But it does make extending rank() to list input a little more tedious (and maybe slower) because for each unique group, we've to check if any of the values in x over all it's elements are NA and apply relevant logic accordingly.

Maybe it's necessary to first implement an efficient which.na(DT) (which is anyway useful) and then use it here.

arunsrinivasan · 2014-08-15T22:50:42Z

But the speed is already impressive :):

set.seed(45L)
x = sample(1e4, 1e7, TRUE)

system.time(ans1 <- rank(x))
#    user  system elapsed 
#  18.106   0.165  19.460 
system.time(ans2 <- frankv(x))
#    user  system elapsed 
#   0.539   0.001   0.545 
identical(ans1, ans2) # [1] TRUE

system.time(ans1 <- rank(x, ties="first"))
#    user  system elapsed 
#  23.124   0.252  23.584 
system.time(ans2 <- frankv(x, ties="first"))
#    user  system elapsed 
#   0.719   0.002   0.725 
identical(ans1, ans2) # [1] TRUE

A common usage would be to use rank while grouping. So here's a small comparison on that as well:

set.seed(45L)
x = sample(1e2, 1e4, TRUE)

system.time(for (i in 1:1e3) rank(x))
#    user  system elapsed 
#   2.911   0.305   3.294 
system.time(for (i in 1:1e3) frankv(x))
#    user  system elapsed 
#   0.279   0.084   0.364 
identical(rank(x), frankv(x)) # [1] TRUE

arunsrinivasan · 2014-08-18T20:04:47Z

Implemented fast rank of all types. NA and NaN behaviour is different from base::rank. This is to be consistent with data.table's grouping of NAs together (and NaNs as well). One other reason for this difference is that this rank function is extended to work on lists a well so that we can do for example:

DT[, rank(.SD, na.last=TRUE, ties="max") < 4L, by=...]

~~Also includes an additional argument order which is a vector with possible values of 1L and -1L for increasing and decreasing order respectively, as in the case of setorderv/setorder.~~

TODO:

1) Maybe implement by= similar to forderv so that we can do:

DT[, rank(.SD, by=<cols>, ties="min", na.last=NA), by=...]

for example. Even though it's possible to use .SDcols, it might be sometimes necessary to get the rank of just a few columns from .SD, and using .SD[, cols, with=FALSE] subsets resulting in a copy, which is avoided this way.

2) Think of how to best use this (discuss with Matt). Internally optimise in j or export as a separate function? Internally optimising has it's disadvantages: a) no documentation explaining differences, and b) can't explain the advantages (works with lists, much faster etc..).
3) ~~If to be exported, what name? fast rank?~~. Named it frankv.

arunsrinivasan added the feature request label Aug 6, 2014

arunsrinivasan added a commit that referenced this issue Aug 18, 2014

Implements #760. Fast rank for vectors & lists.

1ff5e1e

arunsrinivasan added this to the v1.9.6 milestone Sep 25, 2014

arunsrinivasan self-assigned this Sep 25, 2014

arunsrinivasan added the Medium label Sep 25, 2014

arunsrinivasan modified the milestones: v1.9.6, v1.9.8 Oct 10, 2014

arunsrinivasan modified the milestones: v1.9.6, v1.9.8 Jan 6, 2015

arunsrinivasan closed this as completed in 368b356 Jan 7, 2015

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implement 'rank' #760

Implement 'rank' #760

arunsrinivasan commented Aug 6, 2014

arunsrinivasan commented Aug 15, 2014

arunsrinivasan commented Aug 15, 2014

arunsrinivasan commented Aug 18, 2014

Implement 'rank' #760

Implement 'rank' #760

Comments

arunsrinivasan commented Aug 6, 2014

na.last =

order

arunsrinivasan commented Aug 15, 2014

arunsrinivasan commented Aug 15, 2014

arunsrinivasan commented Aug 18, 2014