
Further optimisation of .SD in j #735

Open
7 of 19 tasks
arunsrinivasan opened this issue Jul 15, 2014 · 4 comments
Labels: GForce (issues relating to optimized grouping calculations), performance

arunsrinivasan (Member) commented Jul 15, 2014

In #370 .SD was optimised internally for cases like:

require(data.table)
DT = data.table(id=c(1,1,1,2,2,2), x=1:6, y=7:12, z=13:18)
DT[, c(sum(x), lapply(.SD, mean)), by=id]
#    id V1 x  y  z
#1:  1  6 2  8 14
#2:  2 15 5 11 17

You can see that it's optimised by turning verbose on:

options(datatable.verbose=TRUE)
DT[, c(sum(x), lapply(.SD, mean)), by=id]
# Finding groups (bysameorder=FALSE) ... done in 0secs. bysameorder=TRUE and o__ is length 0
# lapply optimization changed j from 'c(sum(x), lapply(.SD, mean))' to 'list(sum(x), mean(x), mean(y), mean(z))'
# GForce optimized j to 'list(gsum(x), gmean(x), gmean(y), gmean(z))'
options(datatable.verbose=FALSE)

However, such expressions are not always optimised. For example:

options(datatable.verbose=TRUE)
DT[, c(.SD[1], lapply(.SD, mean)), by=id]
options(datatable.verbose=FALSE)
#    id x  y  z x  y  z
#1:  1 1  7 13 2  8 14
#2:  2 4 10 16 5 11 17

# Finding groups (bysameorder=FALSE) ... done in 0.001secs. bysameorder=TRUE and o__ is length 0
# lapply optimization is on, j unchanged as 'c(.SD[1], lapply(.SD, mean))'
# GForce is on, left j unchanged
# Old mean optimization is on, left j unchanged.
# ...

This is because .SD cases are a little trickier to optimise. To begin with, if the .SD subset has a j as well, then it can't be optimised:

DT[, c(xx=.SD[1, x], lapply(.SD, mean)), by=id]
#    id xx x  y  z
#1:  1  1 2  8 14
#2:  2  4 5 11 17

The above expression cannot be changed to list(..) (in my understanding).

And even when there's no j, the i argument to .SD can be an integer, numeric or logical vector, an expression, or even a data.table. For example:

DT[, c(.SD[x > 1 & y > 9][1], lapply(.SD, mean)), by=id]
#    id  x  y  z x  y  z
#1:  1 NA NA NA 2  8 14
#2:  2  4 10 16 5 11 17

If we optimised this naively, it would turn into:

DT[, list(x=x[x>1 & y > 9][1], y=y[x>1 & y>9][1], z=z[x>1 & y>9][1], x=mean(x), y=mean(y), z=mean(z)), by=id]
#    id  x  y  z x  y  z
#1:  1 NA NA NA 2  8 14
#2:  2  4 10 16 5 11 17

which is not really efficient, as it evaluates the subsetting expression (a vector scan) once per column, which gets quite slow as the number of columns grows. A better way to do it would be:

DT[, {tmp = x > 1 & y > 9; list(x=x[tmp][1], y=y[tmp][1], z=z[tmp][1], x=mean(x), y=mean(y), z=mean(z))}, by=id]
#    id  x  y  z x  y  z
#1:  1 NA NA NA 2  8 14
#2:  2  4 10 16 5 11 17

which is a little tricky to implement.

And if i is a join, then it must not be optimised either, etc.

Basically, .SD and .SD[...] should be optimised case-by-case, handling each scenario separately (a rough manual sketch of one such rewrite follows these lists):

Optimise (for possible cases):

  • .SD
  • DT[, c(.SD, lapply(.SD, ...)), by=.]
  • DT[, c(.SD[1], lapply(.SD, ...)), by=.]
  • .SD[1L] # no j
  • .SD[1]
  • .SD[logical]
  • .SD[a] # where a is integer
  • .SD[a] # where a is numeric
  • all of the above, but with a trailing comma, e.g. .SD[1, ]
  • .SD[x > 1 & y > 9]
  • .SD[data.table] # shouldn't / can't be optimised, IMO
  • .SD[character] # shouldn't / can't be optimised, IMO
  • .SD[eval(.)] # might be possible in some cases
  • .SD[i, j] # shouldn't / can't be optimised, IMO
  • DT[, c(list(.), lapply(.SD, ...)), by=.]

All of these throw an error at the moment:

  • DT[, c(data.table(.), lapply(.SD, ...)), by=.]
  • DT[, c(as.data.table(.), lapply(.SD, ...)), by=.]
  • DT[, c(data.frame(.), lapply(.SD, ...)), by=.]
  • DT[, c(as.data.frame(.), lapply(.SD, ...)), by=.]

Note that all these can occur on the right side of lapply(.SD, ...) as well.
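
As a rough illustration of the kind of rewrite the first list asks for (my own sketch; this is not how the internal optimisation is implemented, and whether GForce then kicks in depends on every call having a GForce counterpart):

require(data.table)
DT = data.table(id=c(1,1,1,2,2,2), x=1:6, y=7:12, z=13:18)

# not optimised at the time of writing:
DT[, c(.SD[1L], lapply(.SD, mean)), by=id]

# the hand-written list(..) form the optimiser would ideally produce:
DT[, list(x=x[1L], y=y[1L], z=z[1L], x=mean(x), y=mean(y), z=mean(z)), by=id]
#    id x  y  z x  y  z
#1:  1 1  7 13 2  8 14
#2:  2 4 10 16 5 11 17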

arunsrinivasan added a commit that referenced this issue Aug 5, 2014
.SD[1], .SD[1L], head(.SD, 1) in `j` alone or along with c(..) are now optimised for speed internally.
arunsrinivasan (Member, Author) commented:

Fixed #861.

arunsrinivasan (Member, Author) commented:

Refer to #952 for example from @mgahan where .SD optimisation using .I is faster.
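
For reference, the .I pattern referred to there looks roughly like this (a sketch of the common idiom; the exact code and timings in #952 may differ):

require(data.table)
DT = data.table(id=c(1,1,1,2,2,2), x=1:6, y=7:12, z=13:18)

# per-group first row via .SD:
DT[, .SD[1L], by=id]

# same result via .I: collect the wanted row numbers per group first,
# then subset DT once -- typically much faster when there are many groups
DT[DT[, .I[1L], by=id]$V1]
#    id x  y  z
#1:  1 1  7 13
#2:  2 4 10 16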

eantonya (Contributor) commented May 4, 2015

Some .SD[i, j] expressions can also be optimized (not sure how worthwhile they are, though). E.g. I think this works:

d[a, .SD[i, j], b] is equivalent to d[d[a, .I[i], b]$V1, j, b]
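
A small illustration of that equivalence, specialised to the case with no outer i subset (my own sketch; I haven't checked corner cases of the full d[a, ...] form):

require(data.table)
DT = data.table(id=c(1,1,1,2,2,2), x=1:6, y=7:12, z=13:18)

# .SD[i, j] within each group:
DT[, .SD[x > 1, list(sy=sum(y))], by=id]

# the suggested rewrite: gather the matching row numbers per group with .I,
# then run j once, grouped, on that subset
DT[DT[, .I[x > 1], by=id]$V1, list(sy=sum(y)), by=id]
#    id sy
#1:  1 17
#2:  2 33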

franknarf1 (Contributor) commented Jan 25, 2017

A further idea: .SD[, ..cols] could be treated in the same way as .SD for purposes of applying GForce..?

I ran into this on SO:

library(data.table)
set.seed(1)
DT <- data.table(C1=c("a","b","b"),
                 C2=round(rnorm(4),4),
                 C3=1:12,
                 C4=9:12)

sum_cols <- c("C2","C3")
mean_cols <- c("C3","C4")

# this gets optimized:
DT[, c(
  .N, 
  sum = lapply(.SD, sum)
), by=C1, .SDcols=sum_cols, verbose = TRUE]

# but this does not:
DT[, c(
  .N, 
  sum = lapply(.SD[, ..sum_cols], sum), 
  mean = lapply(.SD[, ..mean_cols], mean)
), by=C1, verbose = TRUE]

Hm, just noticed that the "lapply optimization" strips my sum = prefixes for the output columns in the first case above. It would be nice to have those prefixes put back in after-the-fact. Not sure if that's a worthwhile feature request or not...
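
Until something like that is optimised, one possible workaround (my own sketch, continuing from the DT, sum_cols and mean_cols above) is to run the two .SDcols aggregations separately, so each stays GForce-eligible, and then join the per-group results; renaming also restores the sum./mean. prefixes mentioned above:

sums  <- DT[, lapply(.SD, sum),  by=C1, .SDcols=sum_cols]
means <- DT[, lapply(.SD, mean), by=C1, .SDcols=mean_cols]
setnames(sums,  sum_cols,  paste0("sum.",  sum_cols))
setnames(means, mean_cols, paste0("mean.", mean_cols))
sums[means, on="C1"]

The .N count could be merged in the same way.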
