Bug report - RSession Hangs #1470

Closed
Jorges1000 opened this issue Dec 17, 2015 · 5 comments · Fixed by #3777
@Jorges1000

Rsession hangs

When a data.table with a large number of columns is queried using .SD, the query first takes much longer than creating the DT itself (anywhere from about a minute to nearly 10 minutes). After a while, R then keeps running in the background for a long period (5-10 minutes) even though no command has been issued. Activity Monitor shows the rsession process at 100% and RStudio is unresponsive. Note that the R library is in a custom folder, and the problem occurs more often when many queries are run on DT. Setting options(datatable.auto.index=FALSE) did not help.
Using the latest versions of RStudio (0.99.489), R (3.2.3), and data.table (1.9.6) under OS X 10.9.5 (Mavericks) on x86_64-apple-darwin13.4.0 (64-bit).
attached base packages: [1] stats graphics grDevices utils datasets methods base
other attached packages: [1] data.table_1.9.6 microbenchmark_1.4-2.1
loaded via a namespace (and not attached): Rcpp_0.12.2 digest_0.6.8 MASS_7.3-45 chron_2.3-47 grid_3.2.3 plyr_1.8.3 gtable_0.1.2 magrittr_1.5 scales_0.3.0 ggplot2_1.0.1 stringi_1.0-1 reshape2_1.4.1 proto_0.3-10 tools_3.2.3 stringr_1.0.0 munsell_0.4.2 colorspace_1.2-6

.libPaths('~/Dropbox/R_packages/library/')
require(data.table)
require(microbenchmark)
mt = rep(rownames(mtcars)[1:25], 20)
st = rep(state.name, 10)
system.time(DT <- data.table(mt=mt, st=st, matrix(sample(1:(30000L*500), 30000*500, replace=TRUE),
                                                  nrow=500, ncol=30000), key='mt')) # 4-5 secs
system.time(DT[, .SD, by=st][, .(mt, st, V2, V3, V4)]) # 67 to 497 secs - slow, because copying every column to .SD
microbenchmark(DT[, .SD, by=st, .SDcols=c('V2','V3','V4')]) # 12.9 ns median - fast, because .SD contains only .SDcols
boxplot(res <- microbenchmark(DT[, lapply(.SD, median), by=.(st,mt), .SDcols=c('V2','V3','V4')], # 19.7 median
                              DT[, lapply(.(V2,V3,V4), median), by=.(st,mt)]), notch=TRUE) # 21.4 median, significantly higher
setkey(DT, mt, st) # this command can cause the hang; worse when many other queries are done on DT
@jangorecki
Member

I cannot reproduce the hang on Ubuntu with R 3.2.3 and data.table 1.9.7.
Regarding your code, the second expression (the slower one) can also be written as DT[,list(V2=median(V2),V3=median(V3),V4=median(V4)),by=.(st,mt)], which will be faster.
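
To make the suggestion concrete, here is a minimal sketch that checks the explicit list(...) form against the lapply(.SD, ...) form with .SDcols. It uses a much smaller stand-in table (10 value columns instead of 30000), so it only illustrates that the two forms should agree, not the timings reported above. The names `via_sd` and `via_list` are made up for this illustration.

```r
library(data.table)

# Small stand-in for the table in the report: same grouping columns, far fewer value columns.
DT <- data.table(
  st = rep(state.name, 10),
  mt = rep(rownames(mtcars)[1:25], 20),
  matrix(sample(500L * 10L), nrow = 500, ncol = 10)
)

# Form 1: restrict .SD to the columns of interest with .SDcols.
via_sd <- DT[, lapply(.SD, median), by = .(st, mt), .SDcols = c("V2", "V3", "V4")]

# Form 2: name each column explicitly, as suggested in the comment above.
via_list <- DT[, list(V2 = median(V2), V3 = median(V3), V4 = median(V4)), by = .(st, mt)]

# Compare the two results.
print(all.equal(via_sd, via_list))
```

Either form avoids materialising every column in .SD, which is the cost the original report attributes to the slow query.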

@arunsrinivasan
Member

Hm, the performance hit seems to be due to optimising .SD -- see #735.

Seems like having a lot of columns in j as list(col1, col2, ...) takes a big hit. @mattdowle, thoughts?

options(datatable.optimize=0L) # without optimisation
system.time(DT[,.SD,by=st])
#    user  system elapsed 
#   0.481   0.012   0.502
options(datatable.optimize=Inf) # with optimisation
system.time(DT[,.SD,by=st])
#    user  system elapsed 
#  53.125   8.002  61.784 

Can't reproduce the session hang.
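
For anyone wanting to repeat this comparison, the sketch below wraps the option switch in a helper that restores the previous setting afterwards. It is an illustration under assumptions: `time_at_level` is a hypothetical helper name, the table is a small stand-in (200 columns), and since this issue was later closed by a fix, recent data.table versions may show no meaningful difference between the two levels.

```r
library(data.table)

# Small stand-in table; the original report used 30000 value columns.
DT <- data.table(
  st = rep(state.name, 10),
  matrix(sample(500L * 200L), nrow = 500, ncol = 200)
)

# Hypothetical helper: time DT[, .SD, by = st] at a given optimisation level,
# then restore whatever datatable.optimize was set to before the call.
time_at_level <- function(level) {
  old <- options(datatable.optimize = level)
  on.exit(options(old), add = TRUE)
  system.time(DT[, .SD, by = st])[["elapsed"]]
}

timings <- c(unoptimised = time_at_level(0L), optimised = time_at_level(Inf))
print(timings)
```

Scaling the number of columns up towards the figure in the report is what makes any gap visible; at 200 columns both timings are likely to be small.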

@MichaelChirico
Member

@Jorges1000 is this still a problem in the latest releases?

@MichaelChirico
Member

Not sure this line, used when verbose=TRUE, is working as expected:

deparse(jsub,width.cutoff=200L)

width.cutoff applies per line, but there are hundreds of lines for this jsub. Should we add nlines=1L as well?
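
The difference between the two arguments can be seen with base R alone. In this sketch `big_call` is a hypothetical stand-in for jsub: a list(V1, V2, ...) call with 500 arguments, built only for illustration.

```r
# Hypothetical stand-in for jsub: list(V1, V2, ..., V500).
big_call <- as.call(c(as.name("list"), lapply(paste0("V", 1:500), as.name)))

# width.cutoff only controls where each line is broken, so a long call
# still deparses to many lines.
all_lines <- deparse(big_call, width.cutoff = 200L)

# nlines caps the number of lines that deparse() produces.
first_line <- deparse(big_call, width.cutoff = 200L, nlines = 1L)

print(length(all_lines))
print(length(first_line))
```

With nlines = 1L the returned character vector has a single element, which is what a verbose message would typically want.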

@mattdowle
Member

mattdowle commented Aug 28, 2019

Looks like it might be dotN() :

> mt = rep(rownames(mtcars)[1:25],20)
> st = rep(state.name,10)
> DT = data.table(mt=mt, st=st, matrix(sample(1:(30000L*500),30000*500,replace=T),
       nrow=500,ncol=30000), key='mt')
> options(datatable.optimize=0L)
> system.time(DT[,.SD,by=st])
   user  system elapsed 
  0.512   0.012   0.367 
> options(datatable.optimize=Inf)
> system.time(DT[,.SD,by=st])
   user  system elapsed 
 25.083   3.157  28.107 
> Rprof()
> system.time(DT[,.SD,by=st])
   user  system elapsed 
 24.321   2.708  26.897 
> Rprof(NULL)
> summaryRprof()
$by.self
               self.time self.pct total.time total.pct
"[.data.table"     13.88    51.26      27.02     99.78
"dotN"             13.12    48.45      13.12     48.45
"gc"                0.06     0.22       0.06      0.22
"c"                 0.02     0.07       0.02      0.07

As @MichaelChirico commented here, that call to dotN() is redundant.
