[R-Forge #5754] GForce functions and row- + col-wise operations on .SD #523

Open
arunsrinivasan opened this Issue Jun 8, 2014 · 11 comments

arunsrinivasan commented Jun 8, 2014

Submitted by: Arun; Assigned to: Nobody; R-Forge link

For GForce

  • gsum, gmean
  • .N
  • gmin, gmax
  • median
  • head(.SD, 1), tail(.SD, 1), last(x)
  • [ for length-1 subsets
  • gvar
  • gsd
  • gprod
  • .SD[which.min()], .SD[which.max()]
  • guniqueN
  • gpaste??
  • quantile
  • covariance
  • correlation
  • kurtosis
  • skewness

When GForce is upgraded to work with :=:

  • cumulative functions
  • rolling / window functions

Utility function

  • lead, lag

lead and lag should return a list with one element per shift amount. That is,

x <- 1:5
lag(x, 1:2)
# [[1]]
# [1] NA  1  2  3  4
# 
# [[2]]
# [1] NA NA  1  2  3

@arunsrinivasan arunsrinivasan changed the title from [R-Forge #5754] Implement functions for row-wise and col-wise operations on `.SD` to [R-Forge #5754] GForce functions and row- + col-wise operations on .SD Jun 8, 2014

@arunsrinivasan arunsrinivasan referenced this issue Jun 15, 2014: Homepage #695 (Closed, 3 of 17 tasks complete)

arunsrinivasan Jun 18, 2014

Member

Benchmarks for gmin and gmax on data just big enough to highlight the difference.

Data:

require(data.table)
set.seed(2L)
k = 1e4
n = 1e6
is_na = TRUE
dt <- setDT(lapply(1:100, function(x) sample(c(1:k, if(is_na) NA_integer_), n, TRUE)))

min, no na.rm

# with GForce (default)
options(datatable.optimize=2L)
system.time(ans1 <- dt[, lapply(.SD, min), by=V1])
#  user  system elapsed 
#  0.533   0.012   0.547 

# without
options(datatable.optimize=1L)
system.time(ans2 <- dt[, lapply(.SD, min), by=V1])
#   user  system elapsed 
#  4.698   0.025   4.761 

identical(ans1, ans2) # [1] TRUE

min, with na.rm

# with GForce (default)
options(datatable.optimize=2L)
system.time(ans1 <- dt[, lapply(.SD, min, na.rm=TRUE), by=V1])
#   user  system elapsed 
#  0.481   0.016   0.568 

# without
options(datatable.optimize=1L) 
system.time(ans2 <- dt[, lapply(.SD, function(x) min(x, na.rm=TRUE)), by=V1])
#   user  system elapsed 
#  5.623   0.023   5.791 

identical(ans1, ans2) # [1] TRUE

max, no na.rm

# with GForce (default)
options(datatable.optimize=2L)
system.time(ans1 <- dt[, lapply(.SD, max), by=V1])
#   user  system elapsed 
#  0.536   0.014   0.585 

# without 
options(datatable.optimize=1L)
system.time(ans2 <- dt[, lapply(.SD, max), by=V1])
#   user  system elapsed 
#  5.069   0.029   5.351 

identical(ans1, ans2) # [1] TRUE

max, with na.rm

# with GForce (default)
options(datatable.optimize=2L)
system.time(ans1 <- dt[, lapply(.SD, max, na.rm=TRUE), by=V1])
#   user  system elapsed 
#  0.517   0.011   0.546 

# without
options(datatable.optimize=1L)
system.time(ans2 <- dt[, lapply(.SD, function(x) max(x, na.rm=TRUE)), by=V1])
#   user  system elapsed 
#  5.862   0.025   6.064 

identical(ans1, ans2) # [1] TRUE

And here's a comparison putting everything together:

options(datatable.optimize=2L)
system.time(ans1 <- dt[, c(lapply(.SD, sum), lapply(.SD, mean), 
                             lapply(.SD, min), lapply(.SD, max), .N), by=V1])
#   user  system elapsed 
#  2.463   0.018   2.575 

options(datatable.optimize=1L)
system.time(ans2 <- dt[, c(lapply(.SD, sum), lapply(.SD, mean), 
                    lapply(.SD, min), lapply(.SD, max), .N), by=V1])
#   user  system elapsed 
# 11.840   0.034  11.987 

identical(ans1, ans2) # [1] TRUE
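
A reduced version of the comparison above can be run in well under a second. This is only a sketch: the table `small` and its column names are made up for illustration, and the sizes are far too small to show a timing difference. It only shows how the `datatable.optimize` option is switched and that both code paths agree.

```r
library(data.table)

set.seed(2L)
# Small illustrative table (names and sizes are assumptions, not the benchmark data)
small <- data.table(g = sample(5L, 1000L, TRUE),
                    v = sample(c(1:100, NA_integer_), 1000L, TRUE))

# Remember the current setting so it can be restored afterwards
old <- options(datatable.optimize = 2L)   # GForce allowed
res_gforce <- small[, .(mn = min(v, na.rm = TRUE),
                        mx = max(v, na.rm = TRUE)), by = g]

options(datatable.optimize = 1L)          # GForce switched off
res_plain <- small[, .(mn = min(v, na.rm = TRUE),
                       mx = max(v, na.rm = TRUE)), by = g]

options(old)                              # restore the previous setting

# Both code paths should give the same answer
stopifnot(identical(res_gforce, res_plain))
```

Adding `verbose = TRUE` to the grouped call prints diagnostic messages, which should indicate whether `j` was GForce optimised.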

@arunsrinivasan arunsrinivasan added this to the v1.9.6 milestone Sep 24, 2014

@arunsrinivasan arunsrinivasan modified the milestones: v1.9.6, v1.9.8 Oct 10, 2014

matthieugomez Nov 12, 2014

Contributor

Ideally, quantile, cov & corr would be great.

arunsrinivasan Jan 7, 2015

Member

lead/lag implemented as shift(). See #965.
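
A minimal sketch of how `shift()` covers the lead/lag request from the top of this issue, returning a list with one element per shift amount. The table `DT` and its column names are illustrative only.

```r
library(data.table)

x <- 1:5

# Lag by 1 and by 2: one list element per shift amount
lags <- shift(x, n = 1:2, type = "lag")
stopifnot(identical(lags, list(c(NA, 1L, 2L, 3L, 4L),
                               c(NA, NA, 1L, 2L, 3L))))

# Lead by 1 and by 2
leads <- shift(x, n = 1:2, type = "lead")
stopifnot(identical(leads, list(c(2L, 3L, 4L, 5L, NA),
                                c(3L, 4L, 5L, NA, NA))))

# Grouped use with := ; the list result maps onto several new columns
DT <- data.table(g = c("a", "a", "a", "b", "b"), v = 1:5)
DT[, c("v_lag1", "v_lag2") := shift(v, 1:2), by = g]
```

Because the result is a list, it can be assigned to several columns at once, and the shift restarts within each group.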

arunsrinivasan Oct 30, 2015

Member

gmedian always returns a numeric (double) result, so there is no need to wrap median() in the annoying as.numeric(), and it is very fast. Using the same data as above:

without na.rm = TRUE

system.time(ans1 <- dt[, lapply(.SD, median), by=V1])
#    user  system elapsed 
#   1.562   0.007   1.574 
system.time(ans2 <- dt[, lapply(.SD, function(x) as.numeric(median(x))), by=V1])
#    user  system elapsed 
#  23.013   0.336  23.638 
identical(ans1, ans2)
# [1] TRUE

with na.rm = TRUE

system.time(ans1 <- dt[, lapply(.SD, median, na.rm=TRUE), by=V1])
#    user  system elapsed 
#   1.739   0.014   1.787 
system.time(ans2 <- dt[, lapply(.SD, function(x) as.numeric(median(x, na.rm=TRUE))), by=V1])
#    user  system elapsed 
#   24.201   0.749  25.217 
identical(ans1, ans2)
# [1] TRUE
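
Why the `as.numeric()` wrapper was needed in the first place: in base R, `median()` of an integer vector returns an integer for odd-length input and a double for even-length input, so the result type can differ between groups. A small sketch follows. The table `DT` is illustrative, and the grouped result assumes the default optimisation level, where GForce is active.

```r
library(data.table)

# Base R: the result type depends on the length of the input
stopifnot(is.integer(median(c(1L, 2L, 3L))))  # odd length: middle element
stopifnot(is.double(median(c(1L, 2L))))       # even length: mean of two elements

# Group 1 has three rows, group 2 has two rows
DT <- data.table(g = c(1L, 1L, 1L, 2L, 2L), v = c(1L, 2L, 3L, 1L, 2L))

# With GForce (default), the median column is double for every group
res <- DT[, .(med = median(v)), by = g]
stopifnot(is.double(res$med))
```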

arunsrinivasan Nov 8, 2015

Member

Benchmarks for head and tail:

options(datatable.optimize=Inf)
system.time(ans1 <- dt[, head(.SD, 1), by=V1]) # gforce optimised
# 0.03 seconds

options(datatable.optimize=1)
system.time(ans2 <- dt[, head(.SD, 1), by=V1]) # level-1 optimisation
# 10 seconds

options(datatable.optimize=0)
system.time(ans3 <- dt[, head(.SD, 1), by=V1]) # no optimisation
# 45 seconds

# restore optimisation
options(datatable.optimize=Inf)

This works with subsets in i as well.

arunsrinivasan Nov 8, 2015

Member

Benchmark for [

options(datatable.optimize=Inf)
system.time(ans1 <- dt[, .SD[2], by=V1]) # gforce optimised
# 0.03 seconds

options(datatable.optimize=1L)
system.time(ans2 <- dt[, .SD[2], by=V1]) # level-1 optimisation
# 1.75 seconds

options(datatable.optimize=0L)
system.time(ans3 <- dt[, .SD[2], by=V1]) # no optimisation
# 41 seconds

# restore optimisation
options(datatable.optimize=Inf)

This works with subsets in i as well.
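
To make the forms covered by these two benchmarks concrete, here is a small sketch on a toy table (`DT` and its columns are made up for illustration). It checks that `head(.SD, 1)` and `.SD[1L]` select the same rows, and shows the combination with a subset in `i`.

```r
library(data.table)

DT <- data.table(g = c("a", "a", "b", "b", "b"),
                 x = 1:5,
                 y = letters[1:5])

# First row of each group, written two ways
first_head <- DT[, head(.SD, 1), by = g]
first_sub  <- DT[, .SD[1L], by = g]
stopifnot(isTRUE(all.equal(first_head, first_sub)))

# Last row of each group
last_tail <- DT[, tail(.SD, 1), by = g]

# Combined with a subset in i: first row per group among rows with x > 1
first_filtered <- DT[x > 1L, head(.SD, 1), by = g]
```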

jangorecki Dec 8, 2015

Member

Any plans to optimize head(.SD, 2) or .SD[1:2]?
IMO there could be tons of cases worth optimizing, so it may be better to work on making data.table more modular first, so that any future optimization is cleaner and easier to contribute.

arunsrinivasan Feb 4, 2016

Member

var, sd and prod are now GForce optimised as well.

var

# with GForce (default)
system.time(ans1 <- dt[, lapply(.SD, var, na.rm=TRUE), by=V1])
#    user  system elapsed 
#   1.273   0.010   1.294 

# without
system.time(ans2 <- dt[, lapply(.SD, stats::var, na.rm=TRUE), by=V1])
#    user  system elapsed 
#  27.106   0.369  27.635 

all.equal(ans1, ans2) # [1] TRUE

sd

# with GForce (default)
system.time(ans1 <- dt[, lapply(.SD, sd, na.rm=TRUE), by=V1])
#    user  system elapsed 
#   1.227   0.007   1.242 

# without
system.time(ans2 <- dt[, lapply(.SD, stats::sd, na.rm=TRUE), by=V1])
#    user  system elapsed 
#  28.428   0.406  29.172 

all.equal(ans1, ans2) # [1] TRUE
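
Note that the checks above use `all.equal()` rather than `identical()`, presumably because two floating-point implementations of the same statistic may differ in the last bits. A small sketch of the same kind of check on toy data (the table `small` and the reference computation via `split()` are illustrative assumptions, not part of the benchmark):

```r
library(data.table)

set.seed(42L)
small <- data.table(g = rep(1:3, each = 4L), v = rnorm(12L))

# Grouped variance and standard deviation through data.table
grouped <- small[, .(vr = var(v), s = sd(v)), by = g]

# Reference values computed group by group with stats::var / stats::sd
ref_var <- vapply(split(small$v, small$g), stats::var, numeric(1))
ref_sd  <- vapply(split(small$v, small$g), stats::sd,  numeric(1))

# Compare with a numerical tolerance rather than bit-for-bit
stopifnot(isTRUE(all.equal(grouped$vr, unname(ref_var))),
          isTRUE(all.equal(grouped$s,  unname(ref_sd))))
```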

MichaelChirico Jan 13, 2017

Contributor

Could is.na perhaps be added to the list? I've been seeing a few examples of this recently.

franknarf1 Jun 21, 2017

This may be a bad fit for GForce, but it would be nice to have an optimized version of :, perhaps. It's common that people want to use that with single-row groups, like: https://stackoverflow.com/a/44664086
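
For illustration, a sketch of the single-row-group pattern in question and one vectorised way to get the same result without calling `:` once per group. The table `DT`, its columns `id`, `start` and `end`, and the assumption that `start <= end` on every row are all hypothetical. This is not necessarily the approach used in the linked answer.

```r
library(data.table)

DT <- data.table(id    = c("a", "b", "c"),
                 start = c(1L, 5L, 10L),
                 end   = c(3L, 6L, 10L))

# Per-group pattern: one call to `:` for every (single-row) group
by_group <- DT[, .(val = start:end), by = id]

# Vectorised alternative: build all sequences in one go
len <- DT$end - DT$start + 1L                 # length of each sequence
vectorised <- data.table(
  id  = rep(DT$id, len),
  val = sequence(len) + rep(DT$start, len) - 1L
)

stopifnot(isTRUE(all.equal(by_group, vectorised)))
```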

franknarf1 Feb 20, 2018

How about any and all? Example from SO:

library(data.table)
household <-  c(1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3,3,3)
trip      <- c(1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4,5,5,5,5,6,6,6,6,7,7,7,7,8,8,8,8,9,9,9,9)
brand     <- c(1,2,3,4,5,6,7,5,1,6,8,9,9,2,8,1,3,4,5,6,7,8,9,1,1,2,3,4,1,5,6,7,1,8,9,2)
DT <- data.table(household,trip,brand)

DT[, loyal_brand := 
  .SD[.(household = household, trip = trip - 1L, brand = brand), on=.(household, trip, brand), .N, by=.EACHI]$N > 0L
]

DT[, .(loyal = any(loyal_brand)), by=.(household, trip)]

It seems any is analogous to max, and all to min.
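
A small sketch of that analogy, assuming a logical column without NAs. The table `DT` and the helper column `flag_int` are made up for illustration. The flag is stored as an integer first, so that the grouped calls are plain `max()` and `min()` on a column.

```r
library(data.table)

DT <- data.table(g    = c(1L, 1L, 2L, 2L, 3L, 3L),
                 flag = c(TRUE, FALSE, FALSE, FALSE, TRUE, TRUE))

# Direct version with any() and all()
direct <- DT[, .(any_flag = any(flag), all_flag = all(flag)), by = g]

# Same information through max() and min() on an integer copy of the flag
DT[, flag_int := as.integer(flag)]
via_minmax <- DT[, .(mx = max(flag_int), mn = min(flag_int)), by = g]
via_minmax[, `:=`(any_flag = mx == 1L,   # max is 1 if at least one TRUE
                  all_flag = mn == 1L)]  # min is 1 only if every value is TRUE

stopifnot(identical(direct$any_flag, via_minmax$any_flag),
          identical(direct$all_flag, via_minmax$all_flag))
```

With NAs in the column the two versions can diverge, so `na.rm` handling would need separate thought.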

@mattdowle mattdowle removed this from the Candidate milestone May 10, 2018
