Proposal for a function to handle many small groups in some situations #4284

chinsoon12 · 2020-03-07T00:04:28Z

When using by with many small groups, runtime can be slow. For some situations like cumsum and frank, we could perform it on the whole vector first and use the last value from the previous group to thus avoiding the use of by. Hence, proposing to add a function that carries the value from previous group to the next group (similar in spirit to rleid and rowid). I suspect it might help with some recursive algo.

R implementation:

#fill group using last value from previous group
#to be placed into a function, similarly like rleid and rowid
#not defensive yet...major improvements required...ideas appreciated
laggrpv <- function(g, v, N) {
    ix <- which(diff(g)!=0L) + 1L
    f <- replace(rep(NA_integer_, N), ix, v[ix-1L])
    k <- nafill(nafill(f, "locf"), fill=0)
}

Timing code for cumsum:

library(data.table) #Win x64 R-3.6.1 data.table_1.12.8
set.seed(0L)
nr <- 1e6L#8
DT <- data.table(ID=rep(1L:(nr/2L), each=4L), VAL=1L:nr)
DT0 <- copy(DT)
DT1 <- copy(DT)

f <-  cumsum
mtd0 <- function() {
    DT0[, CS := f(VAL), ID]
}

#fill group using last value from previous group
#to be placed into a function, similarly like rleid and rowid
laggrpv <- function(g, v, N) {
    ix <- which(diff(g)!=0L) + 1L
    f <- replace(rep(NA_integer_, N), ix, v[ix-1L])
    k <- nafill(nafill(f, "locf"), fill=0)
}

mtd1 <- function() {
    DT1[, CS := {
        cs <- f(as.double(VAL))

        k <- laggrpv(ID, cs, .N)

        cs - k
    }]
}
microbenchmark::microbenchmark(times=3L, mtd0(), mtd1())
fsetequal(DT0, DT1)
#[1] TRUE

timings:

Unit: milliseconds
   expr      min       lq     mean   median       uq      max neval
 mtd0() 176.7456 196.7197 203.8157 216.6937 217.3507 218.0077     3
 mtd1() 125.2726 127.1587 131.0547 129.0448 133.9458 138.8468     3

Timing code for frank(..., ties.method="dense")

f <- function(x) frank(x, ties.method="dense")
mtd0 <- function() {
    DT0[, RNK := f(VAL), ID]
}

#fill group using last value from previous group
#to be placed into a function, similarly like rleid and rowid
laggrpv <- function(g, v, N) {
    ix <- which(diff(g)!=0L) + 1L
    f <- replace(rep(NA_integer_, N), ix, v[ix-1L])
    k <- nafill(nafill(f, "locf"), fill=0)
}

mtd1 <- function() {
    DT1[, RNK := {
        rnk <- rleid(ID, VAL)

        k <- laggrpv(ID, rnk, .N)

        rnk - k
    }]
}
microbenchmark::microbenchmark(times=1L, mtd0(), mtd1())
fsetequal(DT0, DT1)

timing:

Unit: milliseconds
   expr         min          lq        mean      median          uq         max neval
 mtd0() 149920.6706 149920.6706 149920.6706 149920.6706 149920.6706 149920.6706     1
 mtd1()     73.5902     73.5902     73.5902     73.5902     73.5902     73.5902     1

The text was updated successfully, but these errors were encountered:

jangorecki · 2020-03-07T03:52:20Z

frank is probably slow for many groups because of parallelism, do you get same timings when using single thread?

shrektan · 2020-03-07T06:49:20Z

Similar to #3739

ColeMiller1 · 2020-03-07T13:23:04Z

This is a neat idea although it begs the question: does set assignment need equivalent to GForce? Maybe SForce.

Here's a similar version for cumsum where the internals of data.table are used:

mtd2 <- function() {
  DT2[, CS :=  {
    cs <- f(as.double(VAL))
    o__ <- data.table:::forderv(ID, sort = FALSE, retGrp = TRUE)
    f__ = attr(o__, "starts", exact=TRUE)
    len__ = data.table:::uniqlengths(f__, .N)
    
    if (length(o__)) {    
      firstofeachgroup = o__[f__]
      if (length(origorder <- data.table:::forderv(firstofeachgroup))) {
        f__ = f__[origorder]
        len__ = len__[origorder]
      }
    }
    cs - c(rep(0L, len__[1L]), cs[rep(f__[-1L], len__[-1L]) - 1L]) 
  }]
}

I am on the development version with 1.12.9 and it has a hit in performance for groupings like this. But using the internal forderv is quicker and with SForce, we could make it quicker because we wouldn't need to repeat the cs subtraction vector. Instead it would be a scalar in the C code initialized with 0.

microbenchmark::microbenchmark(times=3L, dt_by = mtd0(), laggrpv = mtd1(),dt_forder = mtd2())

Unit: milliseconds
      expr      min       lq      mean   median        uq      max neval
     dt_by 472.3768 491.6481 504.91553 510.9193 521.18490 531.4505     3
   laggrpv 123.4735 124.5407 163.92253 125.6078 184.14705 242.6863     3
 dt_forder  70.6577  77.6985  80.94673  84.7393  86.09125  87.4432     3

fsetequal(DT0, DT1)
##[1] TRUE
fsetequal(DT1, DT2)
##[1] TRUE

The current implementation of laggrpv appears as though it depends on the data being sorted which would limit some effectiveness of it.

chinsoon12 · 2020-03-07T13:51:32Z

@jangorecki, as requested, here are the timings with setDTthreads(1L); microbenchmark::microbenchmark(times=1L, mtd0(), mtd1()):

Unit: milliseconds
   expr        min         lq       mean     median         uq        max neval
 mtd0() 78635.5183 78635.5183 78635.5183 78635.5183 78635.5183 78635.5183     1
 mtd1()   166.7546   166.7546   166.7546   166.7546   166.7546   166.7546     1

@shrektan for uniqueN, i would suggest the below in mtd2:

DT2 <- copy(DT)
mtd0 <- function() DT0[, UNIQN := length(unique(VAL)), ID]
mtd1 <- function() DT1[, UNIQN := uniqueN(VAL), ID]
mtd2 <- function() DT2[, UNIQN := {
        ri <- rleid(ID, VAL)

        k <- laggrpv(ID, ri, .N)

        #the below is in essence, DT[, ri := rleid(ID, VAL)][, kb := ri[.N], ID]
        #which is hard to beat in terms of runtime and readability, so no point making it functional
        ix <- c(which(diff(ID)!=0L), .N)
        v <- replace(rep(NA_integer_, .N), ix, ri[ix])
        kb <- nafill(v, "nocb")

        kb - k
    }]
microbenchmark::microbenchmark(times=1L, mtd0(), mtd1(), mtd2())
fsetequal(DT0, DT1)
#[1] TRUE
fsetequal(DT0, DT2)
#[1] TRUE

timings:

Unit: milliseconds
   expr        min         lq       mean     median         uq        max neval
 mtd0()  2210.0728  2210.0728  2210.0728  2210.0728  2210.0728  2210.0728     1
 mtd1() 68158.5642 68158.5642 68158.5642 68158.5642 68158.5642 68158.5642     1
 mtd2()   106.7376   106.7376   106.7376   106.7376   106.7376   106.7376     1

@ColeMiller1, we can order quickly using DT[order(ID, VAL), NEWCOL := .....] Your demo is also fast as it does not use by. Fine with any implementation in R/C as long as its wrapped in a function

ColeMiller1 · 2020-03-14T18:43:19Z

@chinsoon12 I have a hackish method working - right now it only works without assignment and uses Rcpp. I will work on getting it more robust and try to do a PR. My C is not that good, though...

library(data.table) #Dev Version
set.seed(0L)
nr <- 1e6L#8
DT <- data.table(ID=rep(1L:(nr/2L), each=4L), VAL=1L:nr)
laggrpv <- function(g, v, N) {
  ix <- which(diff(g)!=0L) + 1L
  f <- replace(rep(NA_integer_, N), ix, v[ix-1L])
  k <- nafill(nafill(f, "locf"), fill=0)
}

mtd1 <- function() {
  DT[, .(CS =  {
    cs <- cumsum(as.double(VAL))
    k <- laggrpv(ID, cs, .N)
    cs - k
  })]
}

bench::mark(
  DT[, cumsum(VAL), ID], ##using hack method
  mtd1(),
  DT[, (cumsum(VAL)), ID], ##hack way doesn't see past ()
  check = FALSE
)
### A tibble: 3 x 13
##  expression                  min  median `itr/sec` mem_alloc
##  <bch:expr>              <bch:t> <bch:t>     <dbl> <bch:byt>
##1 DT[, cumsum(VAL), ID]    45.3ms  53.5ms     19.0     28.6MB
##2 mtd1()                   97.2ms 123.8ms      8.34   200.3MB
##3 DT[, (cumsum(VAL)), ID] 435.8ms 449.1ms      2.23    26.7MB

all.equal(DT[, cumsum(VAL), ID]$V1, mtd1()[[1L]])
##[1] TRUE

nhirschey mentioned this issue Mar 9, 2020

Calculations with many groups are much slower by default than with setDTthreads(1) #4294

Open

chinsoon12 linked a pull request Mar 13, 2020 that will close this issue

Lagby #4302

Open

mattdowle added this to the 1.12.11 milestone Jun 18, 2020

mattdowle modified the milestones: 1.13.1, 1.13.3 Oct 17, 2020

mattdowle removed this from the 1.14.1 milestone Aug 28, 2021

bschilder mentioned this issue Mar 21, 2023

Parallelising in R efficiently neurogenomics/MSTExplorer#8

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Proposal for a function to handle many small groups in some situations #4284

Proposal for a function to handle many small groups in some situations #4284

chinsoon12 commented Mar 7, 2020 •

edited

Loading

jangorecki commented Mar 7, 2020

shrektan commented Mar 7, 2020

ColeMiller1 commented Mar 7, 2020 •

edited

Loading

chinsoon12 commented Mar 7, 2020 •

edited

Loading

ColeMiller1 commented Mar 14, 2020

Proposal for a function to handle many small groups in some situations #4284

Proposal for a function to handle many small groups in some situations #4284

Comments

chinsoon12 commented Mar 7, 2020 • edited Loading

jangorecki commented Mar 7, 2020

shrektan commented Mar 7, 2020

ColeMiller1 commented Mar 7, 2020 • edited Loading

chinsoon12 commented Mar 7, 2020 • edited Loading

ColeMiller1 commented Mar 14, 2020

chinsoon12 commented Mar 7, 2020 •

edited

Loading

ColeMiller1 commented Mar 7, 2020 •

edited

Loading

chinsoon12 commented Mar 7, 2020 •

edited

Loading