Bug report - RSession Hangs #1470

Jorges1000 opened this issue Dec 17, 2015 · 5 comments · Fixed by #3777

@Jorges1000 Jorges1000 commented Dec 17, 2015

Rsession hangs

When a data.table with a large number of columns is queried using .SD, the query first takes much longer than just creating the DT (from about a minute to nearly 10 minutes). Then, after a while, R starts running in the background for a long period (5-10 minutes) even when no command has been issued. Activity Monitor shows the rsession process at 100% CPU, and RStudio is unresponsive. Note that the R library is in a custom folder, and the hang happens more often when many queries are run on DT. Tried turning off options( to no avail.
Using the latest versions of RStudio (0.99.489), R (3.2.3), and data.table (1.9.6) under OS X 10.9.5 (Mavericks) on x86_64-apple-darwin13.4.0 (64-bit).
attached base packages: [1] stats graphics grDevices utils datasets methods base
other attached packages: [1] data.table_1.9.6 microbenchmark_1.4-2.1
loaded via a namespace (and not attached): Rcpp_0.12.2 digest_0.6.8 MASS_7.3-45 chron_2.3-47 grid_3.2.3 plyr_1.8.3 gtable_0.1.2 magrittr_1.5 scales_0.3.0 ggplot2_1.0.1 stringi_1.0-1 reshape2_1.4.1 proto_0.3-10 tools_3.2.3 stringr_1.0.0 munsell_0.4.2 colorspace_1.2-6

                                      nrow = 500,ncol = 30000),key='mt')->DT) # 4-5 secs
system.time(DT[,.SD,by=st][,.(mt,st,V2,V3,V4)]) # 67 to 497 secs - slow, because copying every column to .SD
microbenchmark(DT[,.SD,by=st,.SDcols= c('V2','V3','V4')]) # 12.9 ns median - fast, because .SD contains only .SDcols
boxplot(microbenchmark(DT[,lapply(.SD,median),by=.(st,mt),.SDcols= c('V2','V3','V4')], # 19.7 median
           DT[,lapply(.(V2,V3,V4),median),by=.(st,mt)])->res,notch=T) # 21.4 median, significantly higher
setkey(DT,mt,st) # this command can cause the hang; worse when many other queries are done on DT

@jangorecki jangorecki commented Mar 16, 2016

I cannot reproduce the hang on Ubuntu with R 3.2.3 and data.table 1.9.7.
Regarding your code: the second expression, the slower one, can also be written as DT[,list(V2=median(V2),V3=median(V3),V4=median(V4)),by=.(st,mt)], which will be faster.

@arunsrinivasan arunsrinivasan commented Mar 18, 2016

Hm, the performance hit seems to be due to optimising .SD -- see #735.

Seems like having a lot of columns in j as list(col1, col2, ...) takes a big hit. @mattdowle, thoughts?

options(datatable.optimize=0L) # without optimisation
#    user  system elapsed 
#   0.481   0.012   0.502
options(datatable.optimize=Inf) # with optimisation
#    user  system elapsed 
#  53.125   8.002  61.784 

Can't reproduce the session hang.

@MichaelChirico MichaelChirico commented May 24, 2017

@Jorges1000 is this still a problem in the latest releases?

@MichaelChirico MichaelChirico commented Aug 19, 2019

not sure this line with verbose=TRUE is working as expected:


width.cutoff is per-line but there are hundreds of lines for this jsub. Should add nline=1L as well?

@mattdowle mattdowle commented Aug 28, 2019

Looks like it might be dotN() :

> mt = rep(rownames(mtcars)[1:25],20)
> st = rep(,10)
> DT = data.table(mt=mt, st=st, matrix(sample(1:(30000L*500),30000*500,replace=T),
       nrow=500,ncol=30000), key='mt')
> options(datatable.optimize=0L)
> system.time(DT[,.SD,by=st])
   user  system elapsed 
  0.512   0.012   0.367 
> options(datatable.optimize=Inf)
> system.time(DT[,.SD,by=st])
   user  system elapsed 
 25.083   3.157  28.107 
> Rprof()
> system.time(DT[,.SD,by=st])
   user  system elapsed 
 24.321   2.708  26.897 
> Rprof(NULL)
> summaryRprof()
               self.time self.pct total.time total.pct
"[.data.table"     13.88    51.26      27.02     99.78
"dotN"             13.12    48.45      13.12     48.45
"gc"                0.06     0.22       0.06      0.22
"c"                 0.02     0.07       0.02      0.07

As @MichaelChirico commented here, that call to dotN() is redundant.
