frank by group is much slower than rank #3988

Open
skanskan opened this issue Oct 22, 2019 · 3 comments
Labels
GForce (issues relating to optimized grouping calculations)

Comments

@skanskan

skanskan commented Oct 22, 2019

I asked a question on Stack Overflow about speeding up some simple code that computes the ranks of dates by group.

library(data.table)
library(lubridate)
library(microbenchmark)
set.seed(1)
NN <- 1000000
EE <- 10   
# Just an example.
todo <- data.table(id=paste0("ID",rep(1:NN, each=EE)),
          val=dmy("1/1/1980") + sample(1:14000,NN*EE,replace=TRUE))
# I want to benchmark this:
todo[,ord := frank(val, ties.method="first"), by=id]  

https://stackoverflow.com/questions/58503115/how-to-compute-the-ranking-of-dates-by-groups-faster-with-data-table-and-lubri/58505724#58505724

Someone (sindri_baldur) posted a simple alternative using rank(unclass(...)) that is almost 10 times faster:

todo[, rank(unclass(val), ties.method = "first"), by = id]

In the end, it seems the slowness is not because the values are dates, but because frank takes a long time when it is called over many groups.
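The comparison can be reproduced on a smaller scale with the sketch below. It is only an illustration under stated assumptions: the group count is reduced from the 1e6 in the report, `as.Date()` stands in for `lubridate::dmy()`, base `system.time()` replaces microbenchmark, and the names `small`, `n_groups` and `per_group` are made up for the example. Timings will vary by machine and data.table version.

```r
library(data.table)

set.seed(1)
n_groups  <- 1e4   # assumption: far fewer groups than in the original report
per_group <- 10

# as.Date() is used instead of lubridate::dmy() to avoid an extra dependency
small <- data.table(
  id  = paste0("ID", rep(seq_len(n_groups), each = per_group)),
  val = as.Date("1980-01-01") +
    sample(1:14000, n_groups * per_group, replace = TRUE)
)

# Time each approach once; both are evaluated per group
t_frank <- system.time(
  r_frank <- small[, frank(val, ties.method = "first"), by = id]
)
t_rank <- system.time(
  r_rank <- small[, rank(unclass(val), ties.method = "first"), by = id]
)

# The two approaches should assign the same ranks
stopifnot(all(r_frank$V1 == r_rank$V1))

# Elapsed seconds for each approach
print(c(frank = t_frank[["elapsed"]], rank = t_rank[["elapsed"]]))
```

Raising `n_groups` while keeping `per_group` fixed should show whether the gap grows with the number of groups rather than with the number of rows per group.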

I'm using R Open 3.5.3 on Windows 10. I don't know what the other poster is using.
data.table 1.12.3

@shrektan
Member

I think it's similar to #3739. Some utility functions in data.table execute very fast on relatively large inputs, e.g., a vector with more than 1000 elements. However, this efficiency comes with per-call overhead, which can cause serious performance problems when the functions are called millions of times on relatively small inputs. I'm not sure we can really solve this systematically, but the least we can do (in my opinion) is document it.
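A rough way to look at this per-call overhead outside of any grouping is sketched below. The vector sizes, the call count and the variable names are assumptions chosen for illustration, and the timings depend on the machine and the data.table version, so no particular result is implied.

```r
library(data.table)

set.seed(1)
x_small <- sample(14000, 10, replace = TRUE)    # size of one group in the report
x_large <- sample(14000, 1e6, replace = TRUE)   # one big vector
n_calls <- 1e4                                  # assumption: number of repeated calls

# Many calls on a tiny input: dominated by per-call overhead
t_small <- c(
  frank = system.time(
    for (i in seq_len(n_calls)) frank(x_small, ties.method = "first")
  )[["elapsed"]],
  rank = system.time(
    for (i in seq_len(n_calls)) rank(x_small, ties.method = "first")
  )[["elapsed"]]
)

# One call on a large input: dominated by the actual ranking work
t_large <- c(
  frank = system.time(frank(x_large, ties.method = "first"))[["elapsed"]],
  rank  = system.time(rank(x_large, ties.method = "first"))[["elapsed"]]
)

# Elapsed seconds, one row per scenario
print(rbind(many_small_calls = t_small, one_large_call = t_large))
```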

@jangorecki
Member

Using verbose=TRUE would give us a hint. It might be a matter of writing grank, a GForce-optimized version that runs once for all groups rather than once per group.
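As a minimal sketch of both ideas: the first call passes `verbose = TRUE` to `[.data.table` so that data.table prints its notes on how `j` was evaluated. The second call is not the proposed grank (which does not exist here); it is a swapped-in workaround in the same spirit, sorting the whole table once and numbering rows within each `id` with `rowid()`. The table `small` and the column names `ord_frank` and `ord_once` are assumptions for the example.

```r
library(data.table)

set.seed(1)
small <- data.table(
  id  = paste0("ID", rep(1:1000, each = 10)),
  val = as.Date("1980-01-01") + sample(1:14000, 1000 * 10, replace = TRUE)
)

# verbose = TRUE prints data.table's diagnostics for this query
small[, ord_frank := frank(val, ties.method = "first"), by = id, verbose = TRUE]

# One ordering pass over the whole table, then a running counter within each id.
# The sort is stable, so tied values keep their original row order,
# which matches ties.method = "first".
small[order(val), ord_once := rowid(id)]

# Both columns should hold the same ranks
stopifnot(identical(small$ord_once, small$ord_frank))
```

The second form calls the ordering routine once instead of once per group, which is the kind of saving a GForce-style implementation would aim for.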

jangorecki added the GForce label (issues relating to optimized grouping calculations) on Nov 29, 2019
@MichaelChirico
Member

Is this a duplicate of #1197?

4 participants