I have asked a question on Stack Overflow about speeding up a simple piece of code that computes the ranks of dates by group.
library(data.table)
library(lubridate)
library(microbenchmark)
set.seed(1)
NN <- 1000000
EE <- 10
# Just an example.
todo <- data.table(id=paste0("ID",rep(1:NN, each=EE)),
                   val=dmy("1/1/1980") + sample(1:14000, NN*EE, replace=TRUE))
# I want to benchmark this:
todo[,ord := frank(val, ties.method="first"), by=id]
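A minimal, self-contained sketch of how this could be benchmarked with microbenchmark (already loaded above). NN is shrunk here so the run finishes quickly, and as.Date replaces lubridate's dmy purely to keep the example dependency-free:

```r
library(data.table)
library(microbenchmark)

set.seed(1)
NN <- 10000  # much smaller than the original 1e6, just for a quick run
EE <- 10
todo <- data.table(id  = paste0("ID", rep(1:NN, each = EE)),
                   val = as.Date("1980-01-01") + sample(1:14000, NN * EE, replace = TRUE))

# Time the per-group frank() call that the issue is about
microbenchmark(
  frank_by_group = todo[, ord := frank(val, ties.method = "first"), by = id],
  times = 5
)
```

With ties.method = "first", ord within each group should be a permutation of 1:EE, which makes the result easy to sanity-check.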
I think it's similar to #3739. Some utility functions in data.table execute very fast on a relatively large input, e.g., a vector with more than 1000 elements. However, this efficiency comes with overhead, which can cause serious performance issues when the function is executed millions of times and each input is relatively small. Not sure if we can really solve this issue systematically... but the least we can do (in my opinion) is to document it.
Using verbose=TRUE would give us a hint; it might be a matter of writing grank, a GForce-optimized version that runs once across all groups rather than once per group.
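One way to avoid calling frank() once per group today is to order the rows by val once and then number the rows within each group: seq_len(.N) by group is GForce-optimized, so no per-group R function call is made. This is just a workaround sketch, not necessarily how a future grank would be implemented:

```r
library(data.table)

set.seed(1)
todo <- data.table(id  = rep(c("ID1", "ID2"), each = 5),
                   val = as.Date("1980-01-01") + sample(1:100, 10, replace = TRUE))

# Reference result: frank() called once per group
todo[, ord_frank := frank(val, ties.method = "first"), by = id]

# Workaround: order all rows by val once (order() is stable, so ties keep
# original row order, matching ties.method = "first"), then assign row
# numbers within each group. seq_len(.N) by group is GForce-optimized.
todo[order(val), ord_fast := seq_len(.N), by = id]

identical(todo$ord_frank, todo$ord_fast)
```

Because order() is stable, ties are broken by original row position, which is exactly what ties.method = "first" does, so both columns should agree.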
The Stack Overflow question is here:
https://stackoverflow.com/questions/58503115/how-to-compute-the-ranking-of-dates-by-groups-faster-with-data-table-and-lubri/58505724#58505724
and someone (sindri_baldur) has posted a simple alternative using rank(unclass(...)) that is almost 10 times faster:
todo[, rank(unclass(val), ties.method = "first"), by = id]
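A small sketch checking that the rank(unclass(...)) alternative gives the same ranks as frank(); the data here is a made-up toy example, not the benchmark data from the issue:

```r
library(data.table)

set.seed(1)
todo <- data.table(id  = rep(c("ID1", "ID2"), each = 4),
                   val = as.Date("1980-01-01") + sample(1:50, 8, replace = TRUE))

# unclass() strips the Date class, so rank() operates on the underlying
# numeric values and skips S3 dispatch overhead inside each group.
r_frank <- todo[, frank(val, ties.method = "first"), by = id]$V1
r_base  <- todo[, rank(unclass(val), ties.method = "first"), by = id]$V1

identical(as.integer(r_frank), as.integer(r_base))
```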
In the end it seems the slowness is not due to the values being dates, but to frank having significant per-call overhead when it is invoked separately for many groups.
I'm using Microsoft R Open 3.5.3 on Windows 10. I don't know what setup the answerer used.
data.table 1.12.3