# [R-Forge #5754] GForce functions and row- + col-wise operations on .SD #523

Open
opened this issue Jun 8, 2014 · 13 comments
Labels
feature request GForce

### arunsrinivasan commented Jun 8, 2014

Submitted by: Arun ; Assigned to: Nobody; R-Forge link

#### For GForce

• gsum, gmean
• .N
• gmin, gmax
• median
• head(.SD, 1), tail(.SD, 1), last(x)
• `[` for length-1 subsets
• gvar
• gsd
• gprod
• .SD[which.min()], .SD[which.max()]
• guniqueN
• gpaste??
• quantile
• covariance
• correlation
• kurtosis
• skewness

When `GForce` is upgraded to work with `:=`:

• cumulative functions
• rolling / window functions

#### Utility function

`lag()` called with a vector of lags should return a list, one element per lag. That is,

```
x <- 1:5
lag(x, 1:2)
# [[1]]
# [1] NA  1  2  3  4
#
# [[2]]
# [1] NA NA  1  2  3
```
changed the title from "[R-Forge #5754] Implement functions for row-wise and col-wise operations on `.SD`" to "[R-Forge #5754] GForce functions and row- + col-wise operations on .SD" Jun 8, 2014
added a commit that referenced this issue Jun 18, 2014
``` gmin and gmax done. Partially address #5754 (git #523) ```
``` 0af4511 ```

### arunsrinivasan commented Jun 18, 2014

Benchmarks for `gmin` and `gmax` on data just big enough to highlight the difference.

## Data:

```
require(data.table)
set.seed(2L)
k = 1e4
n = 1e6
is_na = TRUE
dt <- setDT(lapply(1:100, function(x) sample(c(1:k, if(is_na) NA_integer_), n, TRUE)))
```

## min, no na.rm

```
# with GForce (default)
options(datatable.optimize=2L)
system.time(ans1 <- dt[, lapply(.SD, min), by=V1])
#   user  system elapsed
#  0.533   0.012   0.547

# without
options(datatable.optimize=1L)
system.time(ans2 <- dt[, lapply(.SD, min), by=V1])
#   user  system elapsed
#  4.698   0.025   4.761

identical(ans1, ans2) # [1] TRUE
```

## min, with na.rm

```
# with GForce (default)
options(datatable.optimize=2L)
system.time(ans1 <- dt[, lapply(.SD, min, na.rm=TRUE), by=V1])
#   user  system elapsed
#  0.481   0.016   0.568

# without
options(datatable.optimize=1L)
system.time(ans2 <- dt[, lapply(.SD, function(x) min(x, na.rm=TRUE)), by=V1])
#   user  system elapsed
#  5.623   0.023   5.791

identical(ans1, ans2) # [1] TRUE
```

## max, no na.rm

```
# with GForce (default)
options(datatable.optimize=2L)
system.time(ans1 <- dt[, lapply(.SD, max), by=V1])
#   user  system elapsed
#  0.536   0.014   0.585

# without
options(datatable.optimize=1L)
system.time(ans2 <- dt[, lapply(.SD, max), by=V1])
#   user  system elapsed
#  5.069   0.029   5.351

identical(ans1, ans2) # [1] TRUE
```

## max, with na.rm

```
# with GForce (default)
options(datatable.optimize=2L)
system.time(ans1 <- dt[, lapply(.SD, max, na.rm=TRUE), by=V1])
#   user  system elapsed
#  0.517   0.011   0.546

# without
options(datatable.optimize=1L)
system.time(ans2 <- dt[, lapply(.SD, function(x) max(x, na.rm=TRUE)), by=V1])
#   user  system elapsed
#  5.862   0.025   6.064

identical(ans1, ans2) # [1] TRUE
```

And here's a comparison putting everything together:

```
options(datatable.optimize=2L)
system.time(ans1 <- dt[, c(lapply(.SD, sum), lapply(.SD, mean),
                           lapply(.SD, min), lapply(.SD, max), .N), by=V1])
#   user  system elapsed
#  2.463   0.018   2.575

options(datatable.optimize=1L)
system.time(ans2 <- dt[, c(lapply(.SD, sum), lapply(.SD, mean),
                           lapply(.SD, min), lapply(.SD, max), .N), by=V1])
#   user  system elapsed
# 11.840   0.034  11.987

identical(ans1, ans2) # [1] TRUE
```
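
To check whether a given `j` is actually being routed through GForce, one option is to run the query with `verbose=TRUE` and look at the optimisation messages. This is a minimal sketch on a small made-up table (`small`, `out2` and the column names are illustrative, not part of the benchmark). It assumes the verbose log mentions GForce when the optimisation applies; the exact wording can differ between data.table versions.

```r
library(data.table)

# Small illustrative table; names are hypothetical, not from the benchmark above
set.seed(1L)
small <- data.table(g = sample(1:5, 100, TRUE), x = sample(1:50, 100, TRUE))

old <- options(datatable.optimize = 2L)   # level 2 (or higher) allows GForce
out2 <- capture.output(
  ans2 <- small[, lapply(.SD, min), by = g, verbose = TRUE]
)

options(datatable.optimize = 1L)          # level 1, as in the "without" timings
ans1 <- small[, lapply(.SD, min), by = g]

options(old)                              # restore the previous setting

# Lines of the verbose log that mention GForce
print(grep("GForce", out2, value = TRUE))

# Both optimisation levels should agree on the result
identical(ans1, ans2)
```

Restoring the option afterwards matters in a long session, since the "without" runs above otherwise leave optimisation at level 1.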

added this to the v1.9.6 milestone Sep 24, 2014
removed this from the v1.9.6 milestone Oct 10, 2014
added this to the v1.9.8 milestone Oct 10, 2014

### matthieugomez commented Nov 12, 2014

 Ideally, quantile, cov & corr would be great.

### arunsrinivasan commented Jan 7, 2015

 `lead/lag` implemented as `shift()`. See #965.
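
A short sketch of how `shift()` covers the list-returning behaviour requested at the top of this issue. The table `DT` and the column names `lag1` / `lag2` are made up for illustration.

```r
library(data.table)

x <- 1:5

# A vector of lags returns a list with one element per lag
lags <- shift(x, n = 1:2, type = "lag")
str(lags)

# type = "lead" shifts in the other direction; a single n returns a vector
lead1 <- shift(x, n = 1L, type = "lead")

# Inside a grouped query the list maps naturally onto new columns
DT <- data.table(g = c("a", "a", "a", "b", "b"), v = 1:5)
DT[, c("lag1", "lag2") := shift(v, 1:2), by = g]
print(DT)
```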

added a commit that referenced this issue Jan 26, 2015
``` GForce min/max for characters, #523. ```
``` a18d624 ```
added a commit that referenced this issue Oct 30, 2015
``` gforce now optimises 'median' as well, #523. ```
``` d2f7d63 ```

### arunsrinivasan commented Oct 30, 2015

`gmedian` always returns `numeric` type, so we no longer have to wrap `median()` in the annoying `as.numeric()`, and it is very fast. Using the same data as above:

## without `na.rm = TRUE`

```
system.time(ans1 <- dt[, lapply(.SD, median), by=V1])
#    user  system elapsed
#   1.562   0.007   1.574
system.time(ans2 <- dt[, lapply(.SD, function(x) as.numeric(median(x))), by=V1])
#    user  system elapsed
#  23.013   0.336  23.638
identical(ans1, ans2)
# [1] TRUE
```

## with `na.rm = TRUE`

```
system.time(ans1 <- dt[, lapply(.SD, median, na.rm=TRUE), by=V1])
#    user  system elapsed
#   1.739   0.014   1.787
system.time(ans2 <- dt[, lapply(.SD, function(x) as.numeric(median(x, na.rm=TRUE))), by=V1])
#    user  system elapsed
#  24.201   0.749  25.217
identical(ans1, ans2)
# [1] TRUE
```
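
For context on why the `as.numeric()` wrapper was needed at all: base `median()` on an integer vector returns an integer for odd-length input and a double for even-length input, so groups of different sizes can disagree on the column type. A tiny sketch with a made-up table (`DT`, `res` and `med` are illustrative names), assuming the default optimisation level so that GForce is in effect:

```r
library(data.table)

# Base R: the result type depends on whether the length is odd or even
typeof(median(1:3))   # "integer"
typeof(median(1:4))   # "double"

# One group of size 3 and one of size 4, integer values
DT <- data.table(g = rep(c("a", "b"), times = c(3L, 4L)), x = 1:7)

# With GForce (the default), the median column comes back as double for every group
res <- DT[, .(med = median(x)), by = g]
str(res)
```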

added a commit that referenced this issue Oct 30, 2015
``` Minor: s/max/median in error messages, #523. ```
``` a6950a2 ```

### arunsrinivasan commented Nov 8, 2015

Benchmarks for `head` and `tail`:

```
options(datatable.optimize=Inf)
system.time(ans1 <- dt[, head(.SD, 1), by=V1]) # gforce optimised
# 0.03 seconds

options(datatable.optimize=1)
system.time(ans2 <- dt[, head(.SD, 1), by=V1]) # level-1 optimisation
# 10 seconds

options(datatable.optimize=0)
system.time(ans3 <- dt[, head(.SD, 1), by=V1]) # no optimisation
# 45 seconds

# restore optimisation
options(datatable.optimize=Inf)
```

Works with subsets in `i` as well.

### arunsrinivasan commented Nov 8, 2015

Benchmark for `[`:

```
options(datatable.optimize=Inf)
system.time(ans1 <- dt[, .SD[2], by=V1]) # gforce optimised
# 0.03 seconds

options(datatable.optimize=1L)
system.time(ans2 <- dt[, .SD[2], by=V1]) # level-1 optimisation
# 1.75 seconds

options(datatable.optimize=0L)
system.time(ans3 <- dt[, .SD[2], by=V1]) # no optimisation
# 41 seconds

# restore optimisation
options(datatable.optimize=Inf)
```

Works with subsets in `i` as well.
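
A small sanity check along the same lines, on a made-up three-group table rather than the benchmark data (all names below are illustrative). It only shows that the optimised and unoptimised paths should give the same rows for `head(.SD, 1)` and `.SD[2]`; it is not a timing comparison.

```r
library(data.table)

DT <- data.table(g = rep(c("a", "b", "c"), each = 3L),
                 x = 1:9,
                 y = letters[1:9])

old <- options(datatable.optimize = Inf)   # all optimisations on
first_fast  <- DT[, head(.SD, 1), by = g]
second_fast <- DT[, .SD[2], by = g]

options(datatable.optimize = 0L)           # no optimisation
first_slow  <- DT[, head(.SD, 1), by = g]
second_slow <- DT[, .SD[2], by = g]

options(old)                               # restore the previous setting

identical(first_fast, first_slow)
identical(second_fast, second_slow)
```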

added a commit that referenced this issue Nov 8, 2015
``` head(.SD, 1) and tail(.SD,1) are gforce optimised, #523. ```
``` e615532 ```
added a commit that referenced this issue Nov 8, 2015
``` .SD[val] and col[val] optimised with GForce, #523. ```
``` 751baff ```

### jangorecki commented Dec 8, 2015

 Any plans to optimise `head(.SD, 2)` or `.SD[1:2]`? IMO there could be tons of cases worth optimising, so it may be better to work on data.table's modularity first, so that any future optimisation is cleaner and easier to contribute.

### arunsrinivasan commented Feb 4, 2016

`var`, `sd` and `prod` are now GForce optimised as well.

### var

```
# with GForce (default)
system.time(ans1 <- dt[, lapply(.SD, var, na.rm=TRUE), by=V1])
#    user  system elapsed
#   1.273   0.010   1.294

# without
system.time(ans2 <- dt[, lapply(.SD, stats::var, na.rm=TRUE), by=V1])
#    user  system elapsed
#  27.106   0.369  27.635

all.equal(ans1, ans2) # [1] TRUE
```

### sd

```
# with GForce (default)
system.time(ans1 <- dt[, lapply(.SD, sd, na.rm=TRUE), by=V1])
#    user  system elapsed
#   1.227   0.007   1.242

# without
system.time(ans2 <- dt[, lapply(.SD, stats::sd, na.rm=TRUE), by=V1])
#    user  system elapsed
#  28.428   0.406  29.172

all.equal(ans1, ans2) # [1] TRUE
```
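
The same comparison can be reproduced on a small table to confirm that the plain and namespace-qualified calls agree up to floating-point tolerance. This is only a sketch with made-up data (`DT`, `fast`, `ref` are illustrative names). Whether a `stats::`-qualified call bypasses GForce may depend on the data.table version, so treat that part as an assumption taken from the timings above.

```r
library(data.table)

set.seed(42L)
DT <- data.table(g = sample(letters[1:4], 200, TRUE), x = rnorm(200))
DT[sample(.N, 10L), x := NA_real_]   # sprinkle in some NAs

# Plain calls, as in the "with GForce" timings above
fast <- DT[, .(v = var(x, na.rm = TRUE), s = sd(x, na.rm = TRUE)), by = g]

# Namespace-qualified calls, as in the "without" timings above
ref <- DT[, .(v = stats::var(x, na.rm = TRUE),
              s = stats::sd(x, na.rm = TRUE)), by = g]

all.equal(fast, ref)
all.equal(fast$s, sqrt(fast$v))      # sd is the square root of var
```

`all.equal()` rather than `identical()` is used here, as in the benchmarks above, because the two code paths may accumulate floating-point error differently.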

added a commit that referenced this issue Feb 4, 2016
``` var, sd and prod functions are all GForce optimised for speed/memory, #523. ```
``` adc139f ```
added this to the v2.0.0 milestone Mar 20, 2016
removed this from the v1.9.8 milestone Mar 20, 2016

### MichaelChirico commented Jan 13, 2017

 Could `is.na` perhaps be added to the list? I've been seeing a few examples of this recently.

### franknarf1 commented Jun 21, 2017

 This may be a bad fit for GForce, but it would be nice to have an optimized version of `:`, perhaps. It's common that people want to use that with single-row groups, like: https://stackoverflow.com/a/44664086

### franknarf1 commented Feb 20, 2018

 How about `any` and `all`? Example from SO:

```
library(data.table)
household <- c(1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3,3,3)
trip <- c(1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4,5,5,5,5,6,6,6,6,7,7,7,7,8,8,8,8,9,9,9,9)
brand <- c(1,2,3,4,5,6,7,5,1,6,8,9,9,2,8,1,3,4,5,6,7,8,9,1,1,2,3,4,1,5,6,7,1,8,9,2)
DT <- data.table(household,trip,brand)
DT[, loyal_brand :=
  .SD[.(household = household, trip = trip - 1L, brand = brand), on=.(household, trip, brand), .N, by=.EACHI]$N > 0L
]
DT[, .(loyal = any(loyal_brand)), by=.(household, trip)]
```

Seems `any` is analogous to `max`; and `all` to `min`.
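
To make the analogy concrete, here is a small sketch on a made-up table (`DT`, `flag`, `flag_int` and the result names are illustrative). It compares `any()` / `all()` with `max()` / `min()` on a 0/1 integer copy of the logical column. It assumes there are no `NA`s in the flag; the two approaches treat missing values differently.

```r
library(data.table)

DT <- data.table(
  g    = c(1L, 1L, 2L, 2L, 3L, 3L),
  flag = c(TRUE, FALSE, FALSE, FALSE, TRUE, TRUE)
)
DT[, flag_int := as.integer(flag)]   # 0/1 copy of the logical column

# Reference result using any() and all() directly
ref <- DT[, .(any_flag = any(flag), all_flag = all(flag)), by = g]

# Same answer via max() and min() on the 0/1 column
mm  <- DT[, .(mx = max(flag_int), mn = min(flag_int)), by = g]
alt <- mm[, .(g, any_flag = mx == 1L, all_flag = mn == 1L)]

all.equal(ref, alt)
```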

removed this from the Candidate milestone May 10, 2018
added the GForce label Feb 25, 2019
removed the Medium label Apr 2, 2020

### kdkavanagh commented Sep 29, 2020

 Any idea whether `seq` could be converted to a GForce function? A common use case would be:

```
df[,list(positionInGroup = 1:.N), by=list(grp)]
```

### franknarf1 commented Sep 29, 2020

 @kdkavanagh Fyi, for that you can create a column with `df[, v := rowid(grp)]`
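
A quick sketch showing that the two approaches give the same within-group positions; `df` and its contents are made up for illustration.

```r
library(data.table)

df <- data.table(grp = c("a", "b", "a", "b", "a"), val = 11:15)

df[, v := rowid(grp)]                     # no `by` needed
df[, positionInGroup := 1:.N, by = grp]   # grouped form from the question

identical(df$v, df$positionInGroup)
print(df)
```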
