
# Proposal for a function to handle many small groups in some situations #4284

Open · opened this issue Mar 7, 2020 · 5 comments · May be fixed by #4302

### chinsoon12 commented Mar 7, 2020 • edited

When using `by` with many small groups, runtime can be slow. For some situations like `cumsum` and `frank`, we could perform the operation on the whole vector first and subtract the last value from the previous group, thus avoiding the use of `by`. Hence, I am proposing to add a function that carries the value from the previous group into the next group (similar in spirit to `rleid` and `rowid`). I suspect it might also help with some recursive algorithms.

R implementation:

```r
# fill group using last value from previous group
# to be placed into a function, similar to rleid and rowid
# not defensive yet... major improvements required... ideas appreciated
laggrpv <- function(g, v, N) {
  ix <- which(diff(g) != 0L) + 1L
  f <- replace(rep(NA_integer_, N), ix, v[ix - 1L])
  k <- nafill(nafill(f, "locf"), fill = 0)
}
```

Timing code for `cumsum`:

```r
library(data.table)  # Win x64, R-3.6.1, data.table_1.12.8
set.seed(0L)
nr <- 1e6L  #8
DT <- data.table(ID = rep(1L:(nr/2L), each = 4L), VAL = 1L:nr)
DT0 <- copy(DT)
DT1 <- copy(DT)

f <- cumsum
mtd0 <- function() {
  DT0[, CS := f(VAL), ID]
}

# fill group using last value from previous group
# to be placed into a function, similar to rleid and rowid
laggrpv <- function(g, v, N) {
  ix <- which(diff(g) != 0L) + 1L
  f <- replace(rep(NA_integer_, N), ix, v[ix - 1L])
  k <- nafill(nafill(f, "locf"), fill = 0)
}
mtd1 <- function() {
  DT1[, CS := {
    cs <- f(as.double(VAL))
    k <- laggrpv(ID, cs, .N)
    cs - k
  }]
}

microbenchmark::microbenchmark(times = 3L, mtd0(), mtd1())
fsetequal(DT0, DT1) # TRUE
```

timings:

```
Unit: milliseconds
   expr      min       lq     mean   median       uq      max neval
 mtd0() 176.7456 196.7197 203.8157 216.6937 217.3507 218.0077     3
 mtd1() 125.2726 127.1587 131.0547 129.0448 133.9458 138.8468     3
```

Timing code for `frank(..., ties.method="dense")`:

```r
f <- function(x) frank(x, ties.method = "dense")
mtd0 <- function() {
  DT0[, RNK := f(VAL), ID]
}

# fill group using last value from previous group
# to be placed into a function, similar to rleid and rowid
laggrpv <- function(g, v, N) {
  ix <- which(diff(g) != 0L) + 1L
  f <- replace(rep(NA_integer_, N), ix, v[ix - 1L])
  k <- nafill(nafill(f, "locf"), fill = 0)
}
mtd1 <- function() {
  DT1[, RNK := {
    rnk <- rleid(ID, VAL)
    k <- laggrpv(ID, rnk, .N)
    rnk - k
  }]
}

microbenchmark::microbenchmark(times = 1L, mtd0(), mtd1())
fsetequal(DT0, DT1)
```

timing:

```
Unit: milliseconds
   expr         min          lq        mean      median          uq         max neval
 mtd0() 149920.6706 149920.6706 149920.6706 149920.6706 149920.6706 149920.6706     1
 mtd1()     73.5902     73.5902     73.5902     73.5902     73.5902     73.5902     1
```
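To make the carry-forward trick concrete, here is a minimal sketch on a toy table (data invented for illustration): subtracting the carried-over value from a whole-vector `cumsum` reproduces the per-group `cumsum` computed with `by`.

```r
library(data.table)

# carry the last value of the previous group into every row of the next group
laggrpv <- function(g, v, N) {
  ix <- which(diff(g) != 0L) + 1L               # first row of each new group
  f <- replace(rep(NA_integer_, N), ix, v[ix - 1L])
  nafill(nafill(f, "locf"), fill = 0)           # propagate forward; 0 before first group
}

DT <- data.table(ID = c(1L, 1L, 2L, 2L, 3L), VAL = c(1L, 2L, 3L, 4L, 5L))

cs <- cumsum(DT$VAL)                            # one pass over the whole vector
DT[, CS := cs - laggrpv(ID, cs, .N)]            # subtract each group's starting offset

DT[, CS2 := cumsum(VAL), by = ID]               # reference result via `by`
identical(DT$CS, DT$CS2)                        # TRUE
```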

### jangorecki commented Mar 7, 2020

 `frank` is probably slow for many groups because of parallelism. Do you get the same timings when using a single thread?
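For anyone reproducing this, the thread count can be pinned and restored like so (a sketch; `setDTthreads` returns the previous setting, so it can be undone afterwards):

```r
library(data.table)

old <- setDTthreads(1L)   # force single-threaded execution; returns previous setting
getDTthreads()            # 1
# ... run microbenchmark::microbenchmark(...) here ...
setDTthreads(old)         # restore the original thread count
```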

### shrektan commented Mar 7, 2020

 Similar to #3739

### ColeMiller1 commented Mar 7, 2020 • edited

 This is a neat idea, although it raises the question: does set assignment need an equivalent to GForce? Maybe SForce. Here's a similar version for `cumsum` that uses the internals of `data.table`:

```r
mtd2 <- function() {
  DT2[, CS := {
    cs <- f(as.double(VAL))
    o__ <- data.table:::forderv(ID, sort = FALSE, retGrp = TRUE)
    f__ <- attr(o__, "starts", exact = TRUE)
    len__ <- data.table:::uniqlengths(f__, .N)
    if (length(o__)) {
      firstofeachgroup <- o__[f__]
      if (length(origorder <- data.table:::forderv(firstofeachgroup))) {
        f__ <- f__[origorder]
        len__ <- len__[origorder]
      }
    }
    cs - c(rep(0L, len__[1L]), cs[rep(f__[-1L], len__[-1L]) - 1L])
  }]
}
```

I am on the development version, 1.12.9, and it takes a performance hit for groupings like this. But using the internal `forderv` is quicker, and with SForce we could make it quicker still, because we wouldn't need to materialize the repeated `cs` subtraction vector; instead it would be a scalar in the C code initialized to 0.

```r
microbenchmark::microbenchmark(times = 3L,
                               dt_by = mtd0(), laggrpv = mtd1(), dt_forder = mtd2())
```

```
Unit: milliseconds
      expr      min       lq      mean   median        uq      max neval
     dt_by 472.3768 491.6481 504.91553 510.9193 521.18490 531.4505     3
   laggrpv 123.4735 124.5407 163.92253 125.6078 184.14705 242.6863     3
 dt_forder  70.6577  77.6985  80.94673  84.7393  86.09125  87.4432     3
```

```r
fsetequal(DT0, DT1) ## TRUE
fsetequal(DT1, DT2) ## TRUE
```

The current implementation of `laggrpv` appears to depend on the data being sorted, which would limit some of its effectiveness.
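That sortedness caveat can be seen on a tiny interleaved example (toy data, invented for illustration): when a group's rows reappear after another group, `laggrpv` treats the reappearance as a brand-new group and the result diverges from `by`.

```r
library(data.table)

laggrpv <- function(g, v, N) {
  ix <- which(diff(g) != 0L) + 1L
  f <- replace(rep(NA_integer_, N), ix, v[ix - 1L])
  nafill(nafill(f, "locf"), fill = 0)
}

# interleaved (unsorted) groups: ID 1 reappears after ID 2
DT <- data.table(ID = c(1L, 2L, 1L), VAL = c(1L, 10L, 2L))

cs <- cumsum(DT$VAL)
DT[, CS := cs - laggrpv(ID, cs, .N)]   # treats row 3 as a fresh group
DT[, CS2 := cumsum(VAL), by = ID]      # `by` correctly resumes group 1

identical(DT$CS, DT$CS2)               # FALSE: row 3 gets 2 instead of 3
```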

### chinsoon12 commented Mar 7, 2020 • edited

 @jangorecki, as requested, here are the timings with `setDTthreads(1L); microbenchmark::microbenchmark(times=1L, mtd0(), mtd1())`:

```
Unit: milliseconds
   expr        min         lq       mean     median         uq        max neval
 mtd0() 78635.5183 78635.5183 78635.5183 78635.5183 78635.5183 78635.5183     1
 mtd1()   166.7546   166.7546   166.7546   166.7546   166.7546   166.7546     1
```

@shrektan, for `uniqueN` I would suggest the below as `mtd2`:

```r
DT2 <- copy(DT)
mtd0 <- function() DT0[, UNIQN := length(unique(VAL)), ID]
mtd1 <- function() DT1[, UNIQN := uniqueN(VAL), ID]
mtd2 <- function() DT2[, UNIQN := {
  ri <- rleid(ID, VAL)
  k <- laggrpv(ID, ri, .N)
  # the below is, in essence, DT[, ri := rleid(ID, VAL)][, kb := ri[.N], ID],
  # which is hard to beat in terms of runtime and readability,
  # so no point making it functional
  ix <- c(which(diff(ID) != 0L), .N)
  v <- replace(rep(NA_integer_, .N), ix, ri[ix])
  kb <- nafill(v, "nocb")
  kb - k
}]

microbenchmark::microbenchmark(times = 1L, mtd0(), mtd1(), mtd2())
fsetequal(DT0, DT1) # TRUE
fsetequal(DT0, DT2) # TRUE
```

timings:

```
Unit: milliseconds
   expr        min         lq       mean     median         uq        max neval
 mtd0()  2210.0728  2210.0728  2210.0728  2210.0728  2210.0728  2210.0728     1
 mtd1() 68158.5642 68158.5642 68158.5642 68158.5642 68158.5642 68158.5642     1
 mtd2()   106.7376   106.7376   106.7376   106.7376   106.7376   106.7376     1
```

@ColeMiller1, we can order quickly using `DT[order(ID, VAL), NEWCOL := .....]`. Your demo is also fast because it does not use `by`. I am fine with any implementation in R/C as long as it is wrapped in a function.
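To see why the `rleid` trick counts distinct values, here is a minimal sketch (toy data, invented for illustration; it assumes `VAL` is sorted within `ID`, so each distinct value forms exactly one run): the number of distinct values in a group is the difference between its last run id and the previous group's last run id.

```r
library(data.table)

DT <- data.table(ID  = c(1L, 1L, 1L, 2L, 2L, 3L),
                 VAL = c(5L, 5L, 7L, 9L, 9L, 4L))  # VAL sorted within ID

DT[, ri := rleid(ID, VAL)]          # run ids over the whole table: 1 1 2 3 3 4
last <- DT[, ri[.N], by = ID]$V1    # last run id per group: 2 3 4
cnt  <- diff(c(0L, last))           # distinct values per group: 2 1 1

ref <- DT[, uniqueN(VAL), by = ID]$V1
identical(cnt, ref)                 # TRUE
```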

### ColeMiller1 commented Mar 14, 2020

 @chinsoon12 I have a hackish method working. Right now it only works without assignment and uses Rcpp. I will work on making it more robust and try to do a PR. My C is not that good, though...

```r
library(data.table)  # dev version
set.seed(0L)
nr <- 1e6L  #8
DT <- data.table(ID = rep(1L:(nr/2L), each = 4L), VAL = 1L:nr)

laggrpv <- function(g, v, N) {
  ix <- which(diff(g) != 0L) + 1L
  f <- replace(rep(NA_integer_, N), ix, v[ix - 1L])
  k <- nafill(nafill(f, "locf"), fill = 0)
}
mtd1 <- function() {
  DT[, .(CS = {
    cs <- cumsum(as.double(VAL))
    k <- laggrpv(ID, cs, .N)
    cs - k
  })]
}

bench::mark(
  DT[, cumsum(VAL), ID],   ## using hack method
  mtd1(),
  DT[, (cumsum(VAL)), ID], ## hack way doesn't see past ()
  check = FALSE
)
## # A tibble: 3 x 13
##   expression                   min  median `itr/sec` mem_alloc
## 1 DT[, cumsum(VAL), ID]     45.3ms  53.5ms     19.0     28.6MB
## 2 mtd1()                    97.2ms 123.8ms      8.34   200.3MB
## 3 DT[, (cumsum(VAL)), ID]  435.8ms 449.1ms      2.23    26.7MB

all.equal(DT[, cumsum(VAL), ID]$V1, mtd1()[[1L]]) ## TRUE
```
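As an aside on the `()` observation: `data.table`'s existing GForce optimization is detected syntactically in much the same way, so wrapping a call in parentheses routes `j` through the ordinary eval path. The results still match; only the speed differs. A small sketch (toy data, invented for illustration):

```r
library(data.table)

DT <- data.table(ID = c(1L, 1L, 2L), VAL = 1:3)

a <- DT[, sum(VAL), ID]     # eligible for the internally optimized group sum
b <- DT[, (sum(VAL)), ID]   # parentheses hide the pattern from the optimizer
identical(a$V1, b$V1)       # TRUE: same answer either way
# options(datatable.verbose = TRUE) reports whether GForce was applied
```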