
data.table new column := slower than base R (?) #921

Closed
szilard opened this issue Oct 28, 2014 · 13 comments

@szilard

szilard commented Oct 28, 2014

While data.table's new column := used to be >100x faster than base R, base R (>= 3.1) now updates data.frames in place and has caught up. I wonder why data.table is slower, for example in this case:

library(data.table)

dt <- data.table(x = runif(100e6))
df <- as.data.frame(dt)

system.time( df$y <- 2*df$x )
system.time( dt[,y := 2*x] )

I get:

base R:      0.272  0.300  0.572   (user, system, elapsed)
data.table:  0.696  0.744  1.444

R 3.1.1, data.table 1.9.4.

@arunsrinivasan
Member

@szilard thanks! R 3.1 doesn't update in-place. It shallow copies the other columns and adds the new column.
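
A quick way to see that (a minimal sketch, assuming an R >= 3.1.0 build with memory profiling enabled, as the CRAN binaries have): mark the existing column with tracemem() and check that adding a new column does not report a duplication of it.

df <- data.frame(x = runif(1e6))
tracemem(df$x)        # mark the underlying x vector
df$y <- 2*df$x        # add a new column; no "copied" message for x is expected
untracemem(df$x)      # stop tracing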

As to why this happens in data.table, I just ran debugonce() on [.data.table. The issue comes down to the same for-loop shown in #727. The change seems to have happened during the by=.EACHI implementation.

@szilard
Author

szilard commented Oct 28, 2014

Thanks @arunsrinivasan!

Yes, I meant that base R >= 3.1 is not copying the existing columns. So df$y <- 2*df$x now takes approximately the same time as y <- 2*df$x.
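
(A minimal check of that claim, assuming R >= 3.1; the two timings should be close if the existing column is not copied:)

df <- data.frame(x = runif(100e6))
system.time( y <- 2*df$x )       # the computation alone
system.time( df$y <- 2*df$x )    # computation plus adding the column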

I wonder if data.table can be made faster. First, I increase the data size a bit:

library(data.table)

dt <- data.table(x = runif(500e6))
df <- as.data.frame(dt)

system.time( df$y <- 2*df$x )
system.time( dt[,y := 2*x] )

I now get ~2 sec for base R and ~5 sec for data.table.

Then I clean up (dt[, y := NULL], which surprisingly takes ~3 sec), and finally I run this:

Rprof()
invisible(dt[,y := 2*x])
Rprof(NULL)
summaryRprof()

and get this:

$by.self
          self.time self.pct total.time total.pct
".Call"        2.10    40.70       2.10     40.70
"*"            2.02    39.15       2.02     39.15
"seq_len"      1.04    20.16       1.04     20.16

$by.total
               total.time total.pct self.time self.pct
"["                  5.16    100.00      0.00     0.00
"[.data.table"       5.16    100.00      0.00     0.00
".Call"              2.10     40.70      2.10    40.70
"copy"               2.10     40.70      0.00     0.00
"*"                  2.02     39.15      2.02    39.15
"eval"               2.02     39.15      0.00     0.00
"seq_len"            1.04     20.16      1.04    20.16

$sample.interval
[1] 0.02

$sampling.time
[1] 5.16
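
(For reference, a rough sketch that reproduces the two big contributors above in isolation; the timings are only indicative:)

library(data.table)
dt <- data.table(x = runif(500e6))
system.time( copy(dt) )            # a full copy of the table, ~2 sec here
system.time( seq_len(nrow(dt)) )   # the seq_len(.N) allocation seen in the profile, ~1 sec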

Any potential sources for speed-up?

@arunsrinivasan
Member

I've already linked to the other issue, where I've listed what I think is the cause: there's an unwanted copy in this case. There may be something else in addition as well, judging from the timings you show. It's very much fixable. Thanks again.

@szilard
Author

szilard commented Oct 28, 2014

Awesome, thanks (I've seen the other issue; it just wasn't obvious to me that it's an easy fix). I'm also wondering about the expected speedup, but I guess it's ~2 sec in this case (see the profile above).

What about seq_len? That's another ~1 sec; there must be a seq_len(500e6) call somewhere, which takes just about that long (~1 sec). Is that avoidable?

(With those two changes it would then be on par with base R, ~2 sec total.)

@arunsrinivasan
Member

Right. I don't really understand why there's a seq_len call there either.

@szilard
Author

szilard commented Oct 28, 2014

I think the seq_len call that takes ~1 sec is this one: SDenv$.I = seq_len(SDenv$.N), here: https://github.com/Rdatatable/data.table/blob/master/R/data.table.R#L1075
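
One hypothetical way to avoid paying for it when the j expression never uses .I (just a sketch of the idea, not how data.table actually implements it) would be to bind .I lazily:

SDenv <- new.env()
SDenv$.N <- 500e6
# promise: seq_len(.N) only runs if .I is actually looked up by the j expression
delayedAssign(".I", seq_len(SDenv$.N), assign.env = SDenv)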

@mattdowle
Member

Thanks a lot. Will take a look.
We could really do with a timing regression framework! Like Python's vbench.
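
Even something minimal would catch regressions like this one; a hypothetical sketch that appends one timing row per run, per installed data.table version (the file name and benchmark label are made up):

library(data.table)
dt <- data.table(x = runif(100e6))
elapsed <- system.time( dt[, y := 2*x] )["elapsed"]
row <- data.frame(date = Sys.Date(),
                  version = as.character(packageVersion("data.table")),
                  bench = "assign_new_column",
                  elapsed = unname(elapsed))
first <- !file.exists("timings.csv")
write.table(row, "timings.csv", append = !first, sep = ",",
            col.names = first, row.names = FALSE)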

@mattdowle mattdowle added this to the v1.9.6 milestone Oct 29, 2014
@alexcpsec

I can also confirm a significant performance impact after migrating to 1.9.4.

Is there a proposed timeframe for this bugfix? I suppose I could revert to 1.9.2 for the time being.

@arunsrinivasan
Member

@alexcpsec performance impact with respect to just :=? If it's something else, it'd be great to have a new post with an example.

@alexcpsec

It's a very convoluted function of mine that makes heavy use of :=, and that's where I saw a significant hold-up.

I could try to run Rprof on the whole thing with both 1.9.2 and 1.9.4 so I can give you a better idea of how the two versions compare on a larger piece of code, something like the sketch below. Would that be useful for you?
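
(A sketch of what I have in mind; my_workload() is just a stand-in for the real convoluted function:)

library(data.table)
my_workload <- function() {            # stand-in for the real code
  dt <- data.table(x = runif(1e7))
  dt[, y := 2*x]
}
out <- sprintf("dt_%s.out", packageVersion("data.table"))
Rprof(out)                             # repeat once per installed version
invisible(my_workload())
Rprof(NULL)
summaryRprof(out)$by.total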

@arunsrinivasan
Member

@alexcpsec, yes, that would be incredibly useful, in the absence of a reproducible example. Thanks.

@arunsrinivasan
Member

Now I get this:

# clean session
library(data.table)
dt = data.table(x = runif(100e6))
system.time(dt[,y := 2*x])
#  user  system elapsed
#  0.384   0.563   0.956

df = data.frame(x = runif(100e6))
system.time( df$y <- 2*df$x )
#  user  system elapsed
#  0.376   0.554   0.933

The timings are more or less the same, and we can't avoid a copy here. But there are cases where we can delay the copy like R v3.1.0+ does, by shallow copying. That'll be taken care of in #617.
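
The idea, roughly, illustrated in base R (a hypothetical sketch of the mechanism, not data.table's internals): a shallow copy duplicates only the list of column pointers, so the expensive copy of the column data is deferred until a shared column actually has to change.

x <- runif(1e6)
lst1 <- list(x = x)
lst2 <- lst1            # both names point at the same list for now
lst2$y <- 2*lst2$x      # the list is duplicated, but the x vector is still shared (R >= 3.1)
tracemem(x)
lst2$x[1] <- 0          # modifying the shared column is what finally copies it
untracemem(x)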

@mattdowle
Member

@arunsrinivasan Awesome!! Great fix.
