[R-Forge #5754] GForce functions and row- + col-wise operations on .SD #523

opened this issue Jun 8, 2014 · 13 comments
arunsrinivasan commented Jun 8, 2014

Submitted by: Arun ; Assigned to: Nobody; R-Forge link

#### For GForce

• gsum, gmean
• .N
• gmin, gmax
• median
• head(.SD, 1), tail(.SD, 1), last(x)
• `[` for length-1 subsets
• gvar
• gsd
• gprod
• .SD[which.min()], .SD[which.max()]
• guniqueN
• gpaste??
• quantile
• covariance
• correlation
• kurtosis
• skewness

When `GForce` is upgraded to work with `:=`:

• cumulative functions
• rolling / window functions

#### Utility function

A `lag` (and `lead`) utility: when given several offsets, it should return a list. That is,

```
x <- 1:5
lag(x, 1:2)
# [[1]]
# [1] NA  1  2  3  4
#
# [[2]]
# [1] NA NA  1  2  3
```
gmin and gmax done. Partially address #5754 (git #523)
arunsrinivasan commented Jun 18, 2014

Benchmarks for `gmin` and `gmax` on data just big enough to highlight the difference.

## Data:

```
require(data.table)
set.seed(2L)
k = 1e4
n = 1e6
is_na = TRUE
dt <- setDT(lapply(1:100, function(x) sample(c(1:k, if(is_na) NA_integer_), n, TRUE)))
```

## min, no na.rm

```
# with GForce (default)
options(datatable.optimize=2L)
system.time(ans1 <- dt[, lapply(.SD, min), by=V1])
#   user  system elapsed
#  0.533   0.012   0.547

# without
options(datatable.optimize=1L)
system.time(ans2 <- dt[, lapply(.SD, min), by=V1])
#   user  system elapsed
#  4.698   0.025   4.761

identical(ans1, ans2) # [1] TRUE
```

## min, with na.rm

```
# with GForce (default)
options(datatable.optimize=2L)
system.time(ans1 <- dt[, lapply(.SD, min, na.rm=TRUE), by=V1])
#   user  system elapsed
#  0.481   0.016   0.568

# without
options(datatable.optimize=1L)
system.time(ans2 <- dt[, lapply(.SD, function(x) min(x, na.rm=TRUE)), by=V1])
#   user  system elapsed
#  5.623   0.023   5.791

identical(ans1, ans2) # [1] TRUE
```

## max, no na.rm

```
# with GForce (default)
options(datatable.optimize=2L)
system.time(ans1 <- dt[, lapply(.SD, max), by=V1])
#   user  system elapsed
#  0.536   0.014   0.585

# without
options(datatable.optimize=1L)
system.time(ans2 <- dt[, lapply(.SD, max), by=V1])
#   user  system elapsed
#  5.069   0.029   5.351

identical(ans1, ans2) # [1] TRUE
```

## max, with na.rm

```
# with GForce (default)
options(datatable.optimize=2L)
system.time(ans1 <- dt[, lapply(.SD, max, na.rm=TRUE), by=V1])
#   user  system elapsed
#  0.517   0.011   0.546

# without
options(datatable.optimize=1L)
system.time(ans2 <- dt[, lapply(.SD, function(x) max(x, na.rm=TRUE)), by=V1])
#   user  system elapsed
#  5.862   0.025   6.064

identical(ans1, ans2) # [1] TRUE
```

And here's a comparison putting everything together:

```
options(datatable.optimize=2L)
system.time(ans1 <- dt[, c(lapply(.SD, sum), lapply(.SD, mean),
                           lapply(.SD, min), lapply(.SD, max), .N), by=V1])
#   user  system elapsed
#  2.463   0.018   2.575

options(datatable.optimize=1L)
system.time(ans2 <- dt[, c(lapply(.SD, sum), lapply(.SD, mean),
                           lapply(.SD, min), lapply(.SD, max), .N), by=V1])
#   user  system elapsed
# 11.840   0.034  11.987

identical(ans1, ans2) # [1] TRUE
```
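To try the same comparison on a small scale, and to check whether a given `j` expression is being GForce-optimised, something along these lines may help. This is only a sketch: the table `small` and its columns are made up for illustration, and the wording of the messages printed by `verbose = TRUE` differs between data.table versions.

```r
library(data.table)

# Hypothetical toy data (not the benchmark data above)
small <- data.table(g = c(1L, 1L, 2L, 2L, 2L),
                    v = c(5L, 3L, 9L, NA, 7L))

# GForce on; verbose = TRUE prints which optimisations were applied to j
old <- options(datatable.optimize = 2L)
res_gforce <- small[, lapply(.SD, min, na.rm = TRUE), by = g, verbose = TRUE]

# lapply optimisation only (no GForce)
options(datatable.optimize = 1L)
res_plain <- small[, lapply(.SD, min, na.rm = TRUE), by = g]

# restore the previous setting
options(old)

all.equal(res_gforce, res_plain)
```

Both calls should give the same answer; only the route taken to compute it differs.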

matthieugomez commented Nov 12, 2014

 Support for `quantile`, `cov` and `cor` would be great as well.

arunsrinivasan commented Jan 7, 2015

 `lead/lag` implemented as `shift()`. See #965.
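For reference, a minimal sketch of how `shift()` covers the `lag(x, 1:2)` request from the top of this issue. The table `DT` and its columns `g` and `v` are hypothetical names used only for illustration.

```r
library(data.table)

x <- 1:5

# A vector of offsets gives one shifted vector per offset, returned as a list
lagged <- shift(x, n = 1:2)

# type = "lead" shifts in the other direction
led <- shift(x, n = 1L, type = "lead")

# Typical grouped use together with := (DT, g and v are made-up names)
DT <- data.table(g = c("a", "a", "a", "b", "b"), v = 1:5)
DT[, c("lag1", "lag2") := shift(v, n = 1:2), by = g]
```

Because the result is a list when `n` has more than one element, it can be assigned to several columns in a single `:=` call.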

GForce min/max for characters, #523.
gforce now optimises 'median' as well, #523.
arunsrinivasan commented Oct 30, 2015

`gmedian` always returns a `numeric` result, so there is no need to wrap `median()` in the annoying `as.numeric()`, and it is very fast. Using the same data as above:

## without `na.rm = TRUE`

```
system.time(ans1 <- dt[, lapply(.SD, median), by=V1])
#    user  system elapsed
#   1.562   0.007   1.574
system.time(ans2 <- dt[, lapply(.SD, function(x) as.numeric(median(x))), by=V1])
#    user  system elapsed
#  23.013   0.336  23.638
identical(ans1, ans2)
# [1] TRUE
```

## with `na.rm = TRUE`

```
system.time(ans1 <- dt[, lapply(.SD, median, na.rm=TRUE), by=V1])
#    user  system elapsed
#   1.739   0.014   1.787
system.time(ans2 <- dt[, lapply(.SD, function(x) as.numeric(median(x, na.rm=TRUE))), by=V1])
#    user  system elapsed
#  24.201   0.749  25.217
identical(ans1, ans2)
# [1] TRUE
```
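The reason `as.numeric()` was needed in the first place can be seen on a tiny example. This is a sketch on made-up data (`DT`, `g` and `v` are hypothetical), assuming the default optimisation level so that `median` is handled by GForce.

```r
library(data.table)

# Hypothetical toy data: group 1 has an odd number of integers, group 2 an even number
DT <- data.table(g = c(1L, 1L, 1L, 2L, 2L),
                 v = c(1L, 2L, 4L, 3L, 4L))

# Base R: the type of the result depends on the group size
class(median(c(1L, 2L, 4L)))  # "integer"
class(median(c(3L, 4L)))      # "numeric"

# GForce median: one consistent numeric column across all groups
res <- DT[, .(med = median(v)), by = g]
class(res$med)
```

With plain `median()` the per-group results would mix integer and double values, which is what the `as.numeric()` wrapper used to paper over.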

Minor: s/max/median in error messages, #523.
arunsrinivasan commented Nov 8, 2015

Benchmarks for `head` and `tail`:

```
options(datatable.optimize=Inf)
system.time(ans1 <- dt[, head(.SD, 1), by=V1]) # gforce optimised
# 0.03 seconds

options(datatable.optimize=1)
system.time(ans2 <- dt[, head(.SD, 1), by=V1]) # level-1 optimisation
# 10 seconds

options(datatable.optimize=0)
system.time(ans3 <- dt[, head(.SD, 1), by=V1]) # no optimisation
# 45 seconds

# restore optimisation
options(datatable.optimize=Inf)
```

Works with subsets in `i` as well.

arunsrinivasan commented Nov 8, 2015

Benchmark for `[`:

```
options(datatable.optimize=Inf)
system.time(ans1 <- dt[, .SD[2], by=V1]) # gforce optimised
# 0.03 seconds

options(datatable.optimize=1L)
system.time(ans2 <- dt[, .SD[2], by=V1]) # level-1 optimisation
# 1.75 seconds

options(datatable.optimize=0L)
system.time(ans3 <- dt[, .SD[2], by=V1]) # no optimisation
# 41 seconds

# restore optimisation
options(datatable.optimize=Inf)
```

Works with subsets in `i` as well.

head(.SD, 1) and tail(.SD,1) are gforce optimised, #523.
.SD[val] and col[val] optimised with GForce, #523.
jangorecki commented Dec 8, 2015

 Any plans to optimise `head(.SD, 2)` or `.SD[1:2]`? IMO there are tons of cases that could be optimised, so it may be better to work on making data.table more modular first, so that any future optimisation is cleaner and easier to contribute.

arunsrinivasan commented Feb 4, 2016

`var`, `sd` and `prod` are now GForce optimised as well.

### var

```
# with GForce (default)
system.time(ans1 <- dt[, lapply(.SD, var, na.rm=TRUE), by=V1])
#    user  system elapsed
#   1.273   0.010   1.294

# without
system.time(ans2 <- dt[, lapply(.SD, stats::var, na.rm=TRUE), by=V1])
#    user  system elapsed
#  27.106   0.369  27.635

all.equal(ans1, ans2) # [1] TRUE
```

### sd

```
# with GForce (default)
system.time(ans1 <- dt[, lapply(.SD, sd, na.rm=TRUE), by=V1])
#    user  system elapsed
#   1.227   0.007   1.242

# without
system.time(ans2 <- dt[, lapply(.SD, stats::sd, na.rm=TRUE), by=V1])
#    user  system elapsed
#  28.428   0.406  29.172

all.equal(ans1, ans2) # [1] TRUE
```

var, sd and prod functions are all GForce optimised for speed/memory, #523.
MichaelChirico commented Jan 13, 2017

 Could `is.na` perhaps be added to the list? I've been seeing a few examples of this recently.

franknarf1 commented Jun 21, 2017

 This may be a bad fit for GForce, but it would be nice to have an optimized version of `:`, perhaps. It's common that people want to use that with single-row groups, like: https://stackoverflow.com/a/44664086

franknarf1 commented Feb 20, 2018

 How about `any` and `all`? Example from SO:

```
library(data.table)
household <- c(1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3,3,3)
trip <- c(1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4,5,5,5,5,6,6,6,6,7,7,7,7,8,8,8,8,9,9,9,9)
brand <- c(1,2,3,4,5,6,7,5,1,6,8,9,9,2,8,1,3,4,5,6,7,8,9,1,1,2,3,4,1,5,6,7,1,8,9,2)
DT <- data.table(household,trip,brand)
DT[, loyal_brand :=
  .SD[.(household = household, trip = trip - 1L, brand = brand), on=.(household, trip, brand), .N, by=.EACHI]$N > 0L
]
DT[, .(loyal = any(loyal_brand)), by=.(household, trip)]
```

Seems `any` is analogous to `max`; and `all` to `min`.
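The analogy can be sketched on a 0/1 integer column, where `max` and `min` are already GForce functions. This is an illustration on made-up data (`flags`, `g` and `flag` are hypothetical names), and it assumes the flag has no `NA`s, since `any`/`all` and `max`/`min` treat missing values differently.

```r
library(data.table)

# Hypothetical toy data; the flag is stored as 0/1 integers and has no NAs
flags <- data.table(g = c("a", "a", "b", "b", "c"),
                    flag = c(1L, 0L, 0L, 0L, 1L))

# max/min on the 0/1 column stand in for any/all
via_minmax <- flags[, .(any_flag = max(flag), all_flag = min(flag)), by = g]
via_minmax[, `:=`(any_flag = any_flag == 1L, all_flag = all_flag == 1L)]

# Reference result using any()/all() directly
via_anyall <- flags[, .(any_flag = any(flag == 1L), all_flag = all(flag == 1L)), by = g]

all.equal(via_minmax, via_anyall)
```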

kdkavanagh commented Sep 29, 2020

 Any idea if `seq` would be possible to convert to a GForce function? Common use case would be:

```
df[,list(positionInGroup = 1:.N), by=list(grp)]
```

franknarf1 commented Sep 29, 2020

 @kdkavanagh FYI, for that you can create the column with `df[, v := rowid(grp)]`
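A small sketch comparing the two approaches, on made-up data where the groups are deliberately not contiguous (the column names `v` and `v_by` are only for illustration):

```r
library(data.table)

# Hypothetical toy data
df <- data.table(grp = c("a", "b", "a", "a", "b"))

df[, v := rowid(grp)]                # no by= needed
df[, v_by := seq_len(.N), by = grp]  # grouped equivalent of 1:.N

identical(df$v, df$v_by)
```

`rowid()` works on the whole column at once, so it avoids evaluating `j` once per group.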

