Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proposal for a function to handle many small groups in some situations #4284

Open
chinsoon12 opened this issue Mar 7, 2020 · 5 comments · May be fixed by #4302
Open

Proposal for a function to handle many small groups in some situations #4284

chinsoon12 opened this issue Mar 7, 2020 · 5 comments · May be fixed by #4302

Comments

@chinsoon12
Copy link

chinsoon12 commented Mar 7, 2020

When using by with many small groups, runtime can be slow. For some situations like cumsum and frank, we could perform it on the whole vector first and use the last value from the previous group to thus avoiding the use of by. Hence, proposing to add a function that carries the value from previous group to the next group (similar in spirit to rleid and rowid). I suspect it might help with some recursive algo.

R implementation:

#fill group using last value from previous group
#to be placed into a function, similarly like rleid and rowid
#not defensive yet...major improvements required...ideas appreciated
laggrpv <- function(g, v, N) {
    ix <- which(diff(g)!=0L) + 1L
    f <- replace(rep(NA_integer_, N), ix, v[ix-1L])
    k <- nafill(nafill(f, "locf"), fill=0)
}

Timing code for cumsum:

library(data.table) #Win x64 R-3.6.1 data.table_1.12.8
set.seed(0L)
nr <- 1e6L#8
DT <- data.table(ID=rep(1L:(nr/2L), each=4L), VAL=1L:nr)
DT0 <- copy(DT)
DT1 <- copy(DT)

f <-  cumsum
mtd0 <- function() {
    DT0[, CS := f(VAL), ID]
}

#fill group using last value from previous group
#to be placed into a function, similarly like rleid and rowid
laggrpv <- function(g, v, N) {
    ix <- which(diff(g)!=0L) + 1L
    f <- replace(rep(NA_integer_, N), ix, v[ix-1L])
    k <- nafill(nafill(f, "locf"), fill=0)
}

mtd1 <- function() {
    DT1[, CS := {
        cs <- f(as.double(VAL))

        k <- laggrpv(ID, cs, .N)

        cs - k
    }]
}
microbenchmark::microbenchmark(times=3L, mtd0(), mtd1())
fsetequal(DT0, DT1)
#[1] TRUE

timings:

Unit: milliseconds
   expr      min       lq     mean   median       uq      max neval
 mtd0() 176.7456 196.7197 203.8157 216.6937 217.3507 218.0077     3
 mtd1() 125.2726 127.1587 131.0547 129.0448 133.9458 138.8468     3

Timing code for frank(..., ties.method="dense")

f <- function(x) frank(x, ties.method="dense")
mtd0 <- function() {
    DT0[, RNK := f(VAL), ID]
}

#fill group using last value from previous group
#to be placed into a function, similarly like rleid and rowid
laggrpv <- function(g, v, N) {
    ix <- which(diff(g)!=0L) + 1L
    f <- replace(rep(NA_integer_, N), ix, v[ix-1L])
    k <- nafill(nafill(f, "locf"), fill=0)
}

mtd1 <- function() {
    DT1[, RNK := {
        rnk <- rleid(ID, VAL)

        k <- laggrpv(ID, rnk, .N)

        rnk - k
    }]
}
microbenchmark::microbenchmark(times=1L, mtd0(), mtd1())
fsetequal(DT0, DT1)

timing:

Unit: milliseconds
   expr         min          lq        mean      median          uq         max neval
 mtd0() 149920.6706 149920.6706 149920.6706 149920.6706 149920.6706 149920.6706     1
 mtd1()     73.5902     73.5902     73.5902     73.5902     73.5902     73.5902     1
@jangorecki
Copy link
Member

frank is probably slow for many groups because of parallelism, do you get same timings when using single thread?

@shrektan
Copy link
Member

shrektan commented Mar 7, 2020

Similar to #3739

@ColeMiller1
Copy link
Contributor

ColeMiller1 commented Mar 7, 2020

This is a neat idea although it begs the question: does set assignment need equivalent to GForce? Maybe SForce.

Here's a similar version for cumsum where the internals of data.table are used:

mtd2 <- function() {
  DT2[, CS :=  {
    cs <- f(as.double(VAL))
    o__ <- data.table:::forderv(ID, sort = FALSE, retGrp = TRUE)
    f__ = attr(o__, "starts", exact=TRUE)
    len__ = data.table:::uniqlengths(f__, .N)
    
    if (length(o__)) {    
      firstofeachgroup = o__[f__]
      if (length(origorder <- data.table:::forderv(firstofeachgroup))) {
        f__ = f__[origorder]
        len__ = len__[origorder]
      }
    }
    cs - c(rep(0L, len__[1L]), cs[rep(f__[-1L], len__[-1L]) - 1L]) 
  }]
}

I am on the development version with 1.12.9 and it has a hit in performance for groupings like this. But using the internal forderv is quicker and with SForce, we could make it quicker because we wouldn't need to repeat the cs subtraction vector. Instead it would be a scalar in the C code initialized with 0.

microbenchmark::microbenchmark(times=3L, dt_by = mtd0(), laggrpv = mtd1(),dt_forder = mtd2())

Unit: milliseconds
      expr      min       lq      mean   median        uq      max neval
     dt_by 472.3768 491.6481 504.91553 510.9193 521.18490 531.4505     3
   laggrpv 123.4735 124.5407 163.92253 125.6078 184.14705 242.6863     3
 dt_forder  70.6577  77.6985  80.94673  84.7393  86.09125  87.4432     3

fsetequal(DT0, DT1)
##[1] TRUE
fsetequal(DT1, DT2)
##[1] TRUE

The current implementation of laggrpv appears as though it depends on the data being sorted which would limit some effectiveness of it.

@chinsoon12
Copy link
Author

chinsoon12 commented Mar 7, 2020

@jangorecki, as requested, here are the timings with setDTthreads(1L); microbenchmark::microbenchmark(times=1L, mtd0(), mtd1()):

Unit: milliseconds
   expr        min         lq       mean     median         uq        max neval
 mtd0() 78635.5183 78635.5183 78635.5183 78635.5183 78635.5183 78635.5183     1
 mtd1()   166.7546   166.7546   166.7546   166.7546   166.7546   166.7546     1

@shrektan for uniqueN, i would suggest the below in mtd2:

DT2 <- copy(DT)
mtd0 <- function() DT0[, UNIQN := length(unique(VAL)), ID]
mtd1 <- function() DT1[, UNIQN := uniqueN(VAL), ID]
mtd2 <- function() DT2[, UNIQN := {
        ri <- rleid(ID, VAL)

        k <- laggrpv(ID, ri, .N)

        #the below is in essence, DT[, ri := rleid(ID, VAL)][, kb := ri[.N], ID]
        #which is hard to beat in terms of runtime and readability, so no point making it functional
        ix <- c(which(diff(ID)!=0L), .N)
        v <- replace(rep(NA_integer_, .N), ix, ri[ix])
        kb <- nafill(v, "nocb")

        kb - k
    }]
microbenchmark::microbenchmark(times=1L, mtd0(), mtd1(), mtd2())
fsetequal(DT0, DT1)
#[1] TRUE
fsetequal(DT0, DT2)
#[1] TRUE

timings:

Unit: milliseconds
   expr        min         lq       mean     median         uq        max neval
 mtd0()  2210.0728  2210.0728  2210.0728  2210.0728  2210.0728  2210.0728     1
 mtd1() 68158.5642 68158.5642 68158.5642 68158.5642 68158.5642 68158.5642     1
 mtd2()   106.7376   106.7376   106.7376   106.7376   106.7376   106.7376     1

@ColeMiller1, we can order quickly using DT[order(ID, VAL), NEWCOL := .....] Your demo is also fast as it does not use by. Fine with any implementation in R/C as long as its wrapped in a function

@ColeMiller1
Copy link
Contributor

@chinsoon12 I have a hackish method working - right now it only works without assignment and uses Rcpp. I will work on getting it more robust and try to do a PR. My C is not that good, though...

library(data.table) #Dev Version
set.seed(0L)
nr <- 1e6L#8
DT <- data.table(ID=rep(1L:(nr/2L), each=4L), VAL=1L:nr)
laggrpv <- function(g, v, N) {
  ix <- which(diff(g)!=0L) + 1L
  f <- replace(rep(NA_integer_, N), ix, v[ix-1L])
  k <- nafill(nafill(f, "locf"), fill=0)
}

mtd1 <- function() {
  DT[, .(CS =  {
    cs <- cumsum(as.double(VAL))
    k <- laggrpv(ID, cs, .N)
    cs - k
  })]
}

bench::mark(
  DT[, cumsum(VAL), ID], ##using hack method
  mtd1(),
  DT[, (cumsum(VAL)), ID], ##hack way doesn't see past ()
  check = FALSE
)
### A tibble: 3 x 13
##  expression                  min  median `itr/sec` mem_alloc
##  <bch:expr>              <bch:t> <bch:t>     <dbl> <bch:byt>
##1 DT[, cumsum(VAL), ID]    45.3ms  53.5ms     19.0     28.6MB
##2 mtd1()                   97.2ms 123.8ms      8.34   200.3MB
##3 DT[, (cumsum(VAL)), ID] 435.8ms 449.1ms      2.23    26.7MB

all.equal(DT[, cumsum(VAL), ID]$V1, mtd1()[[1L]])
##[1] TRUE

@mattdowle mattdowle added this to the 1.12.11 milestone Jun 18, 2020
@mattdowle mattdowle modified the milestones: 1.13.1, 1.13.3 Oct 17, 2020
@mattdowle mattdowle removed this from the 1.14.1 milestone Aug 28, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants