Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement 'rank' #760

Closed
12 tasks done
arunsrinivasan opened this issue Aug 6, 2014 · 3 comments
Closed
12 tasks done

Implement 'rank' #760

arunsrinivasan opened this issue Aug 6, 2014 · 3 comments
Assignees
Milestone

Comments

@arunsrinivasan
Copy link
Member

This is really straightforward from forder. Ranking types (from base::rank):

  • min
  • max
  • first
  • average
  • random

plus extra...

  • dense
  • runlength - will have to provide this as another function, not really ranking method.

na.last =

  • TRUE
  • FALSE
  • NA
  • "keep"

order

  • decreasing order

Note that this will/should work on list input as well. Not just vectors like base.

Here is an Q from SO that could greatly benefit from this.

@arunsrinivasan
Copy link
Member Author

Hm, this is almost done. However, there's something I hadn't anticipated - NAs. base::rank never considers them as equal. For example:

x = c(0,1,NA,1,0,NA)
rank(x, ties="average")
# [1] 1.5 3.5 5.0 3.5 1.5 6.0

However, frankv (for fast rank) implementation considers all NAs to be equal at the moment.

frankv(x, ties="average")
# [1] 1.5 3.5 5.5 3.5 1.5 5.5

It takes the average of 5,6 to give 5.5 instead of considering them as unique values. This comes directly from forder which groups NAs together for aggregation purposes. Since ties.method = average, max, min depends on the attribute starts from forderv, this issue exists for these methods. Since the logic is much simpler for the remaining two - random and first, it's not a problem there.

This can of course be corrected by checking if x[i] is NA or not and assigning the right value.

But it does make extending rank() to list input a little more tedious (and maybe slower) because for each unique group, we've to check if any of the values in x over all it's elements are NA and apply relevant logic accordingly.

Maybe it's necessary to first implement an efficient which.na(DT) (which is anyway useful) and then use it here.

@arunsrinivasan
Copy link
Member Author

But the speed is already impressive :):

set.seed(45L)
x = sample(1e4, 1e7, TRUE)

system.time(ans1 <- rank(x))
#    user  system elapsed 
#  18.106   0.165  19.460 
system.time(ans2 <- frankv(x))
#    user  system elapsed 
#   0.539   0.001   0.545 
identical(ans1, ans2) # [1] TRUE

system.time(ans1 <- rank(x, ties="first"))
#    user  system elapsed 
#  23.124   0.252  23.584 
system.time(ans2 <- frankv(x, ties="first"))
#    user  system elapsed 
#   0.719   0.002   0.725 
identical(ans1, ans2) # [1] TRUE

A common usage would be to use rank while grouping. So here's a small comparison on that as well:

set.seed(45L)
x = sample(1e2, 1e4, TRUE)

system.time(for (i in 1:1e3) rank(x))
#    user  system elapsed 
#   2.911   0.305   3.294 
system.time(for (i in 1:1e3) frankv(x))
#    user  system elapsed 
#   0.279   0.084   0.364 
identical(rank(x), frankv(x)) # [1] TRUE

@arunsrinivasan
Copy link
Member Author

Implemented fast rank of all types. NA and NaN behaviour is different from base::rank. This is to be consistent with data.table's grouping of NAs together (and NaNs as well). One other reason for this difference is that this rank function is extended to work on lists a well so that we can do for example:

DT[, rank(.SD, na.last=TRUE, ties="max") < 4L, by=...]

Also includes an additional argument order which is a vector with possible values of 1L and -1L for increasing and decreasing order respectively, as in the case of setorderv/setorder.

TODO:

  • 1) Maybe implement by= similar to forderv so that we can do:
DT[, rank(.SD, by=<cols>, ties="min", na.last=NA), by=...]

for example. Even though it's possible to use .SDcols, it might be sometimes necessary to get the rank of just a few columns from .SD, and using .SD[, cols, with=FALSE] subsets resulting in a copy, which is avoided this way.

  • 2) Think of how to best use this (discuss with Matt). Internally optimise in j or export as a separate function? Internally optimising has it's disadvantages: a) no documentation explaining differences, and b) can't explain the advantages (works with lists, much faster etc..).
  • 3) If to be exported, what name? fast rank?. Named it frankv.

@arunsrinivasan arunsrinivasan added this to the v1.9.6 milestone Sep 25, 2014
@arunsrinivasan arunsrinivasan self-assigned this Sep 25, 2014
@arunsrinivasan arunsrinivasan modified the milestones: v1.9.6, v1.9.8 Oct 10, 2014
@arunsrinivasan arunsrinivasan modified the milestones: v1.9.6, v1.9.8 Jan 6, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant