frank by group is much slower than rank #3988

skanskan · 2019-10-22T16:29:46Z

I asked a question on Stack Overflow about speeding up a simple piece of code that computes the ranks of dates by group.

library(data.table)
library(lubridate)
library(microbenchmark)
set.seed(1)
NN <- 1000000
EE <- 10
# Just an example.
todo <- data.table(id = paste0("ID", rep(1:NN, each = EE)),
                   val = dmy("1/1/1980") + sample(1:14000, NN * EE, replace = TRUE))
# I want to benchmark this:
todo[, ord := frank(val, ties.method = "first"), by = id]

https://stackoverflow.com/questions/58503115/how-to-compute-the-ranking-of-dates-by-groups-faster-with-data-table-and-lubri/58505724#58505724

Someone (sindri_baldur) posted a simple alternative using rank(unclass(...)) that is almost 10 times faster:

todo[, rank(unclass(val), ties.method = "first"), by = id]

In the end, the slowness does not seem to come from the values being dates; rather, frank takes a long time when it is called on many groups.
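A scaled-down version of the comparison is sketched below for anyone who wants to reproduce it quickly. It assumes only data.table is installed, uses base as.Date() in place of lubridate::dmy(), and uses 10,000 groups rather than the 1,000,000 in the report; the names n_groups, per_group, and dt are illustrative. Absolute timings will differ by machine and data.table version.

```r
library(data.table)

set.seed(1)
n_groups  <- 10000L  # assumption: far fewer groups than in the original report
per_group <- 10L

dt <- data.table(
  id  = rep(seq_len(n_groups), each = per_group),
  val = as.Date("1980-01-01") +
    sample(1:14000, n_groups * per_group, replace = TRUE)
)

# Rank within each group using data.table's frank()
t_frank <- system.time(
  r_frank <- dt[, frank(val, ties.method = "first"), by = id]
)

# Same ranking using base rank() on the underlying numeric values
t_rank <- system.time(
  r_rank <- dt[, rank(unclass(val), ties.method = "first"), by = id]
)

# Both approaches should give the same ranks; only the elapsed time differs
same_result <- isTRUE(all.equal(r_frank$V1, r_rank$V1))
print(same_result)
print(c(frank = t_frank[["elapsed"]], rank = t_rank[["elapsed"]]))
```

Swapping the Date column for a plain integer column is a simple way to check whether the date class matters at all.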

I'm using R 3.5.3 Open on Windows 10 with data.table 1.12.3. I don't know what the other poster is using.


shrektan · 2019-10-23T14:45:16Z

I think it's similar to #3739. Some utility functions in data.table execute very fast on a relatively large input, e.g., a vector with more than 1000 elements. However, this efficiency comes with a fixed per-call overhead, which can cause serious performance issues when the functions are called millions of times on relatively small inputs. I'm not sure we can really solve this issue systematically, but the least we can do (in my opinion) is to document it.
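The per-call overhead can be looked at in isolation, without any grouping machinery, by calling each function repeatedly on a tiny vector and once on a large one. This is only a rough sketch: the sizes (10 elements, 2,000 calls, 1e6 elements) are arbitrary choices, and the timings it prints are not claimed to match anyone's machine.

```r
library(data.table)

set.seed(1)
small   <- sample(10L)   # a tiny input, like one small group
big     <- sample(1e6L)  # one large input
n_calls <- 2000L         # assumption: enough repeats to make the overhead visible

# Many calls on a tiny vector: dominated by fixed per-call cost
t_small_frank <- system.time(
  for (i in seq_len(n_calls)) frank(small, ties.method = "first")
)
t_small_rank <- system.time(
  for (i in seq_len(n_calls)) rank(small, ties.method = "first")
)

# A single call on a large vector: dominated by the actual ranking work
t_big_frank <- system.time(big_frank <- frank(big, ties.method = "first"))
t_big_rank  <- system.time(big_rank  <- rank(big, ties.method = "first"))

print(c(
  small_frank = t_small_frank[["elapsed"]],
  small_rank  = t_small_rank[["elapsed"]],
  big_frank   = t_big_frank[["elapsed"]],
  big_rank    = t_big_rank[["elapsed"]]
))
```

If the small-input loop is much slower for frank while the single large call is not, that points at fixed overhead inside frank rather than at the cost of `by`.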

jangorecki · 2019-11-29T15:38:56Z

Running with verbose = TRUE would give us a hint. It might be a matter of writing grank, a GForce-optimized version that runs once for all groups rather than once per group.
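Until something like grank exists, the "run once for all groups" idea can be approximated in user code for ties.method = "first": sort once by (id, val) and number the rows within each id. The sketch below is a workaround under that assumption, not the proposed GForce implementation; it relies on the ordering being stable so that ties keep their original order. It also shows the verbose = TRUE call on a small table; the exact messages printed depend on the data.table version.

```r
library(data.table)

set.seed(1)
dt <- data.table(
  id  = rep(seq_len(1000L), each = 10L),
  val = as.Date("1980-01-01") + sample(1:14000, 10000L, replace = TRUE)
)

# Per-group frank(), with verbose output to see which optimizations apply
dt[, ord_frank := frank(val, ties.method = "first"), by = id, verbose = TRUE]

# Single pass: order all rows by (id, val) once, then count rows within id.
# rowid(id) yields 1, 2, ... within each id in the ordered subset, and the
# result is assigned back to the original row positions.
dt[order(id, val), ord_once := rowid(id)]

# The two columns should agree for ties.method = "first"
ranks_match <- isTRUE(all.equal(dt$ord_frank, dt$ord_once))
print(ranks_match)
```

This trick does not carry over directly to other ties.method values such as "average" or "min", which need extra handling of tied values.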

MichaelChirico · 2020-02-13T12:00:02Z

Is this a duplicate of #1197?

jangorecki added the GForce label (issues relating to optimized grouping calculations) Nov 29, 2019

mgirlich mentioned this issue Mar 5, 2021

group_by() + slice_max() quite slow tidyverse/dtplyr#216

Closed
