Bug report - RSession Hangs #1470

Jorges1000 · 2015-12-17T21:29:52Z

Rsession hangs

When a data.table with a large number of columns is queried using .SD, two things happen. First, the query takes much longer than creating the DT itself (from about a minute to nearly 10 minutes). Then, after a while, R keeps running in the background for a long period (5-10 minutes) even though no command has been issued. Activity Monitor shows the rsession process running at 100%, and RStudio is unresponsive. Note that the R library is in a custom folder, and the hang happens more often when many queries are run on DT. Setting options(datatable.auto.index=FALSE) did not help.
Using the latest versions of RStudio (0.99.489), R (3.2.3), and data.table (1.9.6) under OS X 10.9.5 (Mavericks) on x86_64-apple-darwin13.4.0 (64-bit).
attached base packages: [1] stats graphics grDevices utils datasets methods base
other attached packages: [1] data.table_1.9.6 microbenchmark_1.4-2.1
loaded via a namespace (and not attached): Rcpp_0.12.2 digest_0.6.8 MASS_7.3-45 chron_2.3-47 grid_3.2.3 plyr_1.8.3 gtable_0.1.2 magrittr_1.5 scales_0.3.0 ggplot2_1.0.1 stringi_1.0-1 reshape2_1.4.1 proto_0.3-10 tools_3.2.3 stringr_1.0.0 munsell_0.4.2 colorspace_1.2-6

.libPaths('~/Dropbox/R_packages/library/') # custom library folder
library(data.table)
library(microbenchmark)
mt <- rep(rownames(mtcars)[1:25], 20)
st <- rep(state.name, 10)
system.time(DT <- data.table(mt = mt, st = st,
                             matrix(sample(1:(30000L*500), 30000*500, replace = TRUE),
                                    nrow = 500, ncol = 30000),
                             key = 'mt')) # 4-5 secs
system.time(DT[, .SD, by = st][, .(mt, st, V2, V3, V4)]) # 67 to 497 secs - slow, because copying every column to .SD
microbenchmark(DT[, .SD, by = st, .SDcols = c('V2', 'V3', 'V4')]) # 12.9 ns median - fast, because .SD contains only .SDcols
boxplot(res <- microbenchmark(
            DT[, lapply(.SD, median), by = .(st, mt), .SDcols = c('V2', 'V3', 'V4')], # 19.7 median
            DT[, lapply(.(V2, V3, V4), median), by = .(st, mt)]), # 21.4 median, significantly higher
        notch = TRUE)
setkey(DT, mt, st) # this command can cause the hang; worse when many other queries are done on DT

jangorecki · 2016-03-16T17:49:40Z

I cannot reproduce the hang on Ubuntu with R 3.2.3 and data.table 1.9.7.
Regarding your code: the second expression, the slower one, can also be written as DT[,list(V2=median(V2),V3=median(V3),V4=median(V4)),by=.(st,mt)], which will be faster.
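
For readers who want to check that the explicit-list form gives the same answer as the `lapply(.SD, ...)` form, here is a minimal sketch on a small made-up table. The column names mirror the report, but the data and the names `res_sd` and `res_explicit` are assumptions for illustration only. It compares results, not timings.

```r
library(data.table)

set.seed(42)
# Small illustrative table: 6 groups of (st, mt), two rows each.
DT <- data.table(
  st = rep(c("a", "b"), each = 6),
  mt = rep(c("x", "y", "z"), times = 4),
  V2 = sample.int(100, 12),
  V3 = sample.int(100, 12),
  V4 = sample.int(100, 12)
)

# Form used in the report: lapply over .SD restricted by .SDcols.
res_sd <- DT[, lapply(.SD, median), by = .(st, mt),
             .SDcols = c("V2", "V3", "V4")]

# Form suggested above: one named median() call per column.
res_explicit <- DT[, list(V2 = median(V2), V3 = median(V3), V4 = median(V4)),
                   by = .(st, mt)]

# Both should hold the same per-group medians.
print(all.equal(res_sd, res_explicit))
```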

arunsrinivasan · 2016-03-18T22:21:49Z

Hm, the performance hit seems to be due to optimising .SD -- see #735.

Seems like having a lot of columns in j as list(col1, col2, ...) takes a big hit. @mattdowle, thoughts?

options(datatable.optimize=0L) # without optimisation
system.time(DT[,.SD,by=st])
#    user  system elapsed 
#   0.481   0.012   0.502
options(datatable.optimize=Inf) # with optimisation
system.time(DT[,.SD,by=st])
#    user  system elapsed 
#  53.125   8.002  61.784

Can't reproduce the session hang.
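
A scaled-down version of the comparison above can be run without the 30000-column table. This is only a sketch: the table size (2000 columns) and the names `t_off`, `t_on`, `r_off`, `r_on` are assumptions chosen so it finishes quickly, and the timings will differ by data.table version (the slowdown discussed here may not show up on newer releases).

```r
library(data.table)

set.seed(1)
nr <- 500L
nc <- 2000L  # assumed size, far smaller than the 30000 columns in the report
DT <- data.table(st = rep(state.name, 10),
                 matrix(sample.int(nr * nc), nrow = nr, ncol = nc))

# Remember the current setting so it can be restored afterwards.
old <- options(datatable.optimize = 0L)   # without optimisation
t_off <- system.time(r_off <- DT[, .SD, by = st])

options(datatable.optimize = Inf)         # with optimisation
t_on <- system.time(r_on <- DT[, .SD, by = st])

options(old)                              # restore the previous setting

print(rbind(optimize_0 = t_off[1:3], optimize_Inf = t_on[1:3]))
print(dim(r_on))
```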

MichaelChirico · 2017-05-24T18:46:48Z

@Jorges1000 is this still a problem in the latest releases?

MichaelChirico · 2019-08-19T03:03:39Z

Not sure this line, which runs when verbose=TRUE, is working as expected:

deparse(jsub,width.cutoff=200L)

width.cutoff is per-line, but there are hundreds of lines for this jsub. Should we add nlines=1L as well?
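
The per-line behaviour of `width.cutoff` is easy to see with base R alone. In the sketch below, `jsub` is a hypothetical stand-in built by hand (a `list()` call over 500 column names), not the object data.table constructs internally; `nlines` is the argument of base `deparse()` that caps the number of lines returned.

```r
# Build a long call, list(V1, V2, ..., V500), as a stand-in for jsub.
cols <- lapply(paste0("V", 1:500), as.name)
jsub <- as.call(c(as.name("list"), cols))

# width.cutoff only limits the width of each line, so a long call
# still comes back as a character vector with many elements.
full <- deparse(jsub, width.cutoff = 200L)

# nlines = 1L truncates the output to the first line only.
one <- deparse(jsub, width.cutoff = 200L, nlines = 1L)

print(length(full))
print(length(one))
```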

mattdowle · 2019-08-28T22:01:53Z

Looks like it might be dotN() :

> mt = rep(rownames(mtcars)[1:25],20)
> st = rep(state.name,10)
> DT = data.table(mt=mt, st=st, matrix(sample(1:(30000L*500),30000*500,replace=T),
       nrow=500,ncol=30000), key='mt')
> options(datatable.optimize=0L)
> system.time(DT[,.SD,by=st])
   user  system elapsed 
  0.512   0.012   0.367 
> options(datatable.optimize=Inf)
> system.time(DT[,.SD,by=st])
   user  system elapsed 
 25.083   3.157  28.107 
> Rprof()
> system.time(DT[,.SD,by=st])
   user  system elapsed 
 24.321   2.708  26.897 
> Rprof(NULL)
> summaryRprof()
$by.self
               self.time self.pct total.time total.pct
"[.data.table"     13.88    51.26      27.02     99.78
"dotN"             13.12    48.45      13.12     48.45
"gc"                0.06     0.22       0.06      0.22
"c"                 0.02     0.07       0.02      0.07

As @MichaelChirico commented here, that call to dotN() is redundant.

arunsrinivasan added the performance label Mar 18, 2016

MichaelChirico pushed a commit that referenced this issue Aug 19, 2019

Closes #1470 -- streamline loop in GForce j optimization

08b4b44

MichaelChirico mentioned this issue Aug 19, 2019

streamline loop in GForce j optimization #3777

Merged

mattdowle added this to the 1.12.4 milestone Aug 28, 2019

mattdowle closed this as completed in #3777 Aug 29, 2019
