[R-Forge #5754] GForce functions and row- + col-wise operations on .SD #523

Open
10 of 20 tasks
arunsrinivasan opened this issue Jun 8, 2014 · 13 comments
Labels
feature request GForce

Comments

@arunsrinivasan (Member) commented Jun 8, 2014

Submitted by: Arun ; Assigned to: Nobody; R-Forge link

For GForce

  • gsum, gmean
  • .N
  • gmin, gmax
  • median
  • head(.SD, 1), tail(.SD, 1), last(x)
  • [ for length-1 subsets
  • gvar
  • gsd
  • gprod
  • .SD[which.min()], .SD[which.max()]
  • guniqueN
  • gpaste??
  • quantile
  • covariance
  • correlation
  • kurtosis
  • skewness

When GForce is upgraded to work with :=:

  • cumulative functions
  • rolling / window functions

Utility function

  • lead, lag

It should return a list. That is,

x <- 1:5
lag(x, 1:2)
# [[1]]
# [1] NA  1  2  3  4
# 
# [[2]]
# [1] NA NA  1  2  3
@arunsrinivasan changed the title from "[R-Forge #5754] Implement functions for row-wise and col-wise operations on .SD" to "[R-Forge #5754] GForce functions and row- + col-wise operations on .SD" on Jun 8, 2014
@arunsrinivasan arunsrinivasan mentioned this issue Jun 15, 2014
17 tasks
@arunsrinivasan (Member, Author) commented Jun 18, 2014

Benchmarks for gmin and gmax on data just big enough to highlight the difference.

Data:

require(data.table)
set.seed(2L)
k = 1e4
n = 1e6
is_na = TRUE
dt <- setDT(lapply(1:100, function(x) sample(c(1:k, if(is_na) NA_integer_), n, TRUE)))

min, no na.rm

# with GForce (default)
options(datatable.optimize=2L)
system.time(ans1 <- dt[, lapply(.SD, min), by=V1])
#  user  system elapsed 
#  0.533   0.012   0.547 

# without
options(datatable.optimize=1L)
system.time(ans2 <- dt[, lapply(.SD, min), by=V1])
#   user  system elapsed 
#  4.698   0.025   4.761 

identical(ans1, ans2) # [1] TRUE

min, with na.rm

# with GForce (default)
options(datatable.optimize=2L)
system.time(ans1 <- dt[, lapply(.SD, min, na.rm=TRUE), by=V1])
#   user  system elapsed 
#  0.481   0.016   0.568 

# without
options(datatable.optimize=1L) 
system.time(ans2 <- dt[, lapply(.SD, function(x) min(x, na.rm=TRUE)), by=V1])
#   user  system elapsed 
#  5.623   0.023   5.791 

identical(ans1, ans2) # [1] TRUE

max, no na.rm

# with GForce (default)
options(datatable.optimize=2L)
system.time(ans1 <- dt[, lapply(.SD, max), by=V1])
#   user  system elapsed 
#  0.536   0.014   0.585 

# without 
options(datatable.optimize=1L)
system.time(ans2 <- dt[, lapply(.SD, max), by=V1])
#   user  system elapsed 
#  5.069   0.029   5.351 

identical(ans1, ans2) # [1] TRUE

max, with na.rm

# with GForce (default)
options(datatable.optimize=2L)
system.time(ans1 <- dt[, lapply(.SD, max, na.rm=TRUE), by=V1])
#   user  system elapsed 
#  0.517   0.011   0.546 

# without
options(datatable.optimize=1L)
system.time(ans2 <- dt[, lapply(.SD, function(x) max(x, na.rm=TRUE)), by=V1])
#   user  system elapsed 
#   5.862   0.025   6.064 
identical(ans1, ans2) # [1] TRUE

And here's a comparison putting everything together:

options(datatable.optimize=2L)
system.time(ans1 <- dt[, c(lapply(.SD, sum), lapply(.SD, mean), 
                             lapply(.SD, min), lapply(.SD, max), .N), by=V1])
#   user  system elapsed 
#  2.463   0.018   2.575 

options(datatable.optimize=1L)
system.time(ans2 <- dt[, c(lapply(.SD, sum), lapply(.SD, mean), 
                    lapply(.SD, min), lapply(.SD, max), .N), by=V1])
#   user  system elapsed 
#  11.840   0.034  11.987 

identical(ans1, ans2) # [1] TRUE
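
To check on a small table that a query is actually taking the GForce path, rather than inferring it from timings, one option is to run it with verbose = TRUE and compare the result against the same query at optimisation level 1. This is only a minimal sketch: the table DT and its columns are made up for illustration, and the exact wording of the verbose message may differ between data.table versions.

```r
library(data.table)

# Small illustrative table (hypothetical; not the benchmark data above)
DT <- data.table(g = c("a", "a", "b", "b"), x = c(3L, 1L, 4L, 2L))

old <- options(datatable.optimize = 2L)   # level 2 enables GForce
# verbose = TRUE reports how j was optimised; look for a line mentioning GForce
with_gforce <- DT[, lapply(.SD, min), by = g, verbose = TRUE]

options(datatable.optimize = 1L)          # GForce switched off
without_gforce <- DT[, lapply(.SD, min), by = g]

options(old)                              # restore the previous setting
identical(with_gforce, without_gforce)
```

Saving and restoring the option avoids leaving the session at a lower optimisation level by accident.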

@arunsrinivasan arunsrinivasan added this to the v1.9.6 milestone Sep 24, 2014
@arunsrinivasan arunsrinivasan removed this from the v1.9.6 milestone Oct 10, 2014
@arunsrinivasan arunsrinivasan added this to the v1.9.8 milestone Oct 10, 2014
@matthieugomez (Contributor) commented Nov 12, 2014

Ideally, quantile, cov & corr would be great.

@arunsrinivasan (Member, Author) commented Jan 7, 2015

lead/lag implemented as shift(). See #965.
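
For reference, a minimal sketch of how the list-returning behaviour requested at the top of this issue looks with shift(). It assumes a data.table version that includes shift(); the variable names are illustrative only.

```r
library(data.table)

x <- 1:5

# A vector of offsets gives one list element per offset; type = "lag" is the default
lags <- shift(x, n = 1:2)

# The same call with type = "lead" shifts in the other direction
leads <- shift(x, n = 1:2, type = "lead")

lags
leads
```

Positions with no value to shift in are filled with NA by default; the fill argument changes that.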

@arunsrinivasan (Member, Author) commented Oct 30, 2015

gmedian always returns a numeric (double) result, so there is no need to wrap median() in the annoying as.numeric(), and it is very fast. Using the same data as above:

without na.rm = TRUE

system.time(ans1 <- dt[, lapply(.SD, median), by=V1])
#    user  system elapsed 
#   1.562   0.007   1.574 
system.time(ans2 <- dt[, lapply(.SD, function(x) as.numeric(median(x))), by=V1])
#    user  system elapsed 
#  23.013   0.336  23.638 
identical(ans1, ans2)
# [1] TRUE

with na.rm = TRUE

system.time(ans1 <- dt[, lapply(.SD, median, na.rm=TRUE), by=V1])
#    user  system elapsed 
#   1.739   0.014   1.787 
system.time(ans2 <- dt[, lapply(.SD, function(x) as.numeric(median(x, na.rm=TRUE))), by=V1])
#    user  system elapsed 
#   24.201   0.749  25.217 
identical(ans1, ans2)
# [1] TRUE
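
The type point can be seen on a tiny table. In base R, median() of an integer vector returns an integer for odd-length input and a double for even-length input, which is why the as.numeric() wrapper was needed when group sizes vary. The sketch below uses a made-up table DT and assumes GForce is enabled (the default).

```r
library(data.table)

# Group "a" has an odd number of rows, group "b" an even number
DT <- data.table(g = c("a", "a", "a", "b", "b"),
                 v = c(1L, 2L, 3L, 4L, 5L))

# GForce path: median() is called directly on the column
res_gforce <- DT[, .(med = median(v)), by = g]

# Fallback path: the as.numeric() wrapper keeps the type consistent across groups
res_wrapped <- DT[, .(med = as.numeric(median(v))), by = g]

class(res_gforce$med)
identical(res_gforce, res_wrapped)
```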

@arunsrinivasan (Member, Author) commented Nov 8, 2015

Benchmarks for head and tail:

options(datatable.optimize=Inf)
system.time(ans1 <- dt[, head(.SD, 1), by=V1]) # gforce optimised
# 0.03 seconds

options(datatable.optimize=1)
system.time(ans2 <- dt[, head(.SD, 1), by=V1]) # level-1 optimisation
# 10 seconds

options(datatable.optimize=0)
system.time(ans3 <- dt[, head(.SD, 1), by=V1]) # no optimisation
# 45 seconds

# restore optimisation
options(datatable.optimize=Inf)

This works with subsets in i as well.
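
A small sketch of what a subset in i combined with head(.SD, 1) looks like. The table and column names here are hypothetical, and the example only illustrates the result, not the speed-up.

```r
library(data.table)

DT <- data.table(g = c("a", "a", "b", "b", "b"),
                 v = 1:5,
                 w = letters[1:5])

# Filter rows in i first, then take the first remaining row of each group
first_rows <- DT[v > 1L, head(.SD, 1), by = g]
first_rows
```

The row with v == 1 is dropped by the filter, so group "a" contributes its second original row.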

@arunsrinivasan (Member, Author) commented Nov 8, 2015

Benchmark for [ (single-row subsets such as .SD[2]):

options(datatable.optimize=Inf)
system.time(ans1 <- dt[, .SD[2], by=V1]) # gforce optimised
# 0.03 seconds

options(datatable.optimize=1L)
system.time(ans2 <- dt[, .SD[2], by=V1]) # level-1 optimisation
# 1.75 seconds

options(datatable.optimize=0L)
system.time(ans3 <- dt[, .SD[2], by=V1]) # no optimisation
# 41 seconds

# restore optimisation
options(datatable.optimize=Inf)

This works with subsets in i as well.

@jangorecki (Member) commented Dec 8, 2015

Any plans to optimize head(.SD, 2) or .SD[1:2]?
IMO there could be tons of cases worth optimizing, so it may be better to work on making data.table more modular first, so that future optimizations are cleaner and easier to contribute.

@arunsrinivasan (Member, Author) commented Feb 4, 2016

var, sd and prod are now GForce optimised as well.

var

# with GForce (default)
system.time(ans1 <- dt[, lapply(.SD, var, na.rm=TRUE), by=V1])
#    user  system elapsed 
#   1.273   0.010   1.294 

# without
system.time(ans2 <- dt[, lapply(.SD, stats::var, na.rm=TRUE), by=V1])
#    user  system elapsed 
#  27.106   0.369  27.635 

all.equal(ans1, ans2) # [1] TRUE

sd

# with GForce (default)
system.time(ans1 <- dt[, lapply(.SD, sd, na.rm=TRUE), by=V1])
#    user  system elapsed 
#   1.227   0.007   1.242 

# without
system.time(ans2 <- dt[, lapply(.SD, stats::sd, na.rm=TRUE), by=V1])
#    user  system elapsed 
#  28.428   0.406  29.172 

all.equal(ans1, ans2) # [1] TRUE

@arunsrinivasan arunsrinivasan added this to the v2.0.0 milestone Mar 20, 2016
@arunsrinivasan arunsrinivasan removed this from the v1.9.8 milestone Mar 20, 2016
@MichaelChirico (Member) commented Jan 13, 2017

Could is.na perhaps be added to the list? I've been seeing a few examples of this recently.

@franknarf1 (Contributor) commented Jun 21, 2017

This may be a bad fit for GForce, but it would be nice to have an optimized version of :, perhaps. It's common that people want to use that with single-row groups, like: https://stackoverflow.com/a/44664086

@franknarf1 (Contributor) commented Feb 20, 2018

How about any and all? Example from SO:

library(data.table)
household <-  c(1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3,3,3)
trip      <- c(1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4,5,5,5,5,6,6,6,6,7,7,7,7,8,8,8,8,9,9,9,9)
brand     <- c(1,2,3,4,5,6,7,5,1,6,8,9,9,2,8,1,3,4,5,6,7,8,9,1,1,2,3,4,1,5,6,7,1,8,9,2)
DT <- data.table(household,trip,brand)

DT[, loyal_brand := 
  .SD[.(household = household, trip = trip - 1L, brand = brand), on=.(household, trip, brand), .N, by=.EACHI]$N > 0L
]

DT[, .(loyal = any(loyal_brand)), by=.(household, trip)]

It seems any is analogous to max, and all to min.
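
The analogy can be sketched by storing the flag as 0/1 integers and taking max and min per group. This is only an illustration on a made-up table, it assumes the flag column contains no NA values, and it makes no claim about how a GForce version of any or all would be implemented.

```r
library(data.table)

DT <- data.table(g    = c(1L, 1L, 2L, 2L, 3L, 3L),
                 flag = c(TRUE, FALSE, FALSE, FALSE, TRUE, TRUE))

# Store the flag as 0/1 so that plain max()/min() can be used in j
DT[, flag_int := as.integer(flag)]

# max of 0/1 is 1 if any value is TRUE; min of 0/1 is 1 only if all are TRUE
via_minmax <- DT[, .(any_flag = max(flag_int), all_flag = min(flag_int)), by = g]
via_minmax[, `:=`(any_flag = any_flag == 1L, all_flag = all_flag == 1L)]

# Direct computation for comparison
direct <- DT[, .(any_flag = any(flag), all_flag = all(flag)), by = g]

all.equal(via_minmax, direct)
```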

@MichaelChirico MichaelChirico added the GForce label Feb 25, 2019
@jangorecki jangorecki removed the Medium label Apr 2, 2020
@kdkavanagh commented Sep 29, 2020

Any idea whether seq could be converted to a GForce function? A common use case would be:

df[,list(positionInGroup = 1:.N), by=list(grp)]

@franknarf1 (Contributor) commented Sep 29, 2020

@kdkavanagh FYI, for that you can create the column with df[, v := rowid(grp)]
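
A quick sketch showing that rowid() gives the same within-group position as a grouped sequence. The table df and the column names pos_seq and pos_rowid are made up for illustration.

```r
library(data.table)

df <- data.table(grp = c("a", "b", "a", "a", "b"))

# Grouped sequence, assigned by reference so the original row order is kept
df[, pos_seq := seq_len(.N), by = grp]

# rowid() computes the same positions without a by= step
df[, pos_rowid := rowid(grp)]

identical(df$pos_seq, df$pos_rowid)
```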

7 participants