
Question about multithreaded tables within multithreaded code #2031

Closed
fruce-ki opened this issue Feb 17, 2017 · 5 comments
@fruce-ki

I've noticed since my last R update that loading data.table now pops up a message about multithreading and OpenMP. So I'm curious how the parallelised data.table behaves within code that explicitly forks.

Specifically, I want to avoid excessive forking. I use mclapply to parallelise certain calculations on my data.tables. I want to make sure no additional threads are spawned from within the child processes, as that would mess up resource management on our computing cluster.
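For concreteness, here is a minimal sketch of the pattern I mean (the data and column names are made up purely for illustration):

```r
library(data.table)
library(parallel)

# Hypothetical example data, purely to illustrate the pattern.
dt <- data.table(grp = rep(1:4, each = 250), x = rnorm(1000))

# Each forked child runs data.table operations; the question is whether
# those children spawn additional OpenMP threads of their own.
res <- mclapply(1:4, function(g) {
  dt[grp == g, .(mean_x = mean(x), n = .N)]
}, mc.cores = 4)
```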

@MichaelChirico
Member

See the discussion in #1660; I'm not sure whether it covers everything you have in mind.

@fruce-ki
Author

Hi, thanks! Yes, that seems to be what I am referring to; searching didn't turn up that thread.

It is unclear from that thread what the conclusion was with regard to the default behaviour. I will have to read up on the newer data.table and OpenMP documentation. If the default is anything other than 1, my own package is going to have to jump through a lot of hoops to keep working safely and to maintain interoperability with older versions of data.table.

@fruce-ki
Author

I've been catching up with the documentation and the NEWS, and I found that there is a setDTthreads() function, which makes my life a lot easier than I anticipated.
I also saw a mention that data.table automatically switches to single-threaded operation inside mclapply calls. Does that mean I don't have to modify my code at all? What is the recommended practice?
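For example, this is roughly how I imagine using it (a sketch, assuming data.table >= 1.9.8 where setDTthreads()/getDTthreads() exist, and that setDTthreads() returns the previous setting):

```r
library(data.table)

getDTthreads()            # how many threads data.table would currently use
old <- setDTthreads(1)    # restrict data.table to a single thread;
                          # the previous setting is returned (invisibly)
# ... run the explicitly parallelised code here ...
setDTthreads(old)         # restore the previous setting
```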

Thank you!

@SteveBronder

Can we get an update for this?

For mlr and caret, I would worry that they use explicit parallelism together with data.table in some operations. So if data.table does not respect explicit parallelism, this could create a reasonably sized mess.

Furthermore, I can see huge problems with setting the default number of threads to be greater than one. If users are unaware of this, data.table could take up huge amounts of computing resources without their explicit permission or knowledge.

@mattdowle
Member

mattdowle commented Mar 14, 2017

@fruce-ki Correct: you shouldn't need to change your code, if I've understood correctly. As you saw in the NEWS item, data.table automatically switches down to single-threaded operation when used inside explicit parallelism. That was hard to detect, but I persevered; in fact, the crash issue with fork and OpenMP was described as having no known solution in the articles I saw. The NEWS item from v1.9.8 (Nov 2016) was as follows:

  1. Added setDTthreads() and getDTthreads() to control the threads used in data.table functions that are now parallelized with OpenMP on all architectures including Windows (fwrite(), fsort() and subsetting). Extra code was required internally to ensure these control data.table only and not other packages using OpenMP. When data.table is used from the parallel package (e.g. mclapply as done by 3 CRAN and Bioconductor packages) data.table automatically switches down to one thread to avoid a deadlock/hang when OpenMP is used with fork(); #1745 and #1727. Thanks to Kontstantinos Tsardounis, Ramon Diaz-Uriarte and Jan Gorecki for testing before release and providing reproducible examples. After parallel::mclapply has finished, data.table reverts to the prior getDTthreads() state. Tests added which will therefore run every day thanks to CRAN (limited to 2 threads on CRAN which is enough to test).
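A minimal way to check that behaviour (a sketch, not from the NEWS item itself; forked mclapply requires a Unix-alike, and the exact thread counts depend on the machine):

```r
library(data.table)
library(parallel)

getDTthreads()   # before: typically all available cores

# Inside each forked child, data.table is expected to report a single thread
# if the automatic switch-down described above applies.
unlist(mclapply(1:2, function(i) getDTthreads(), mc.cores = 2))

getDTthreads()   # after mclapply finishes: reverts to the prior setting
```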

@Stevo15025 Yes, the default is to use all cores. I think you're being a little dramatic in your choice of words: reasonably sized mess, and huge problems. By far the most common case is users on their own laptop/desktop, or with a server to themselves, who just want to benefit from parallelism. If anyone needs to limit the resources, they can do so very easily with setDTthreads(1). The people running on shared resources or using explicit parallelism are relatively few, but they are certainly capable of limiting data.table if they need to (e.g. a package maintainer using explicit parallelism is definitely capable of calling setDTthreads(1)). But even so, I put in the work to automatically limit data.table when it is called from explicit parallelism, so mlr and caret shouldn't need to.

To put it another way, if the default were 1 core we'd be deluged with users saying "hey, it's supposed to be parallel but it isn't", needing the reply "you have to setDTthreads(n)", then the reply "ok, what do I set n to?", and needing the reply "well, it depends". Etc. This way we just leave it to OpenMP by default and they don't need to know one more thing: setDTthreads().

Take, for example, the 340 CRAN and Bioconductor packages using data.table. Any of them using setkey() are already benefiting from setkey's parallelism. Neither the maintainers nor the users of those packages needed to call anything first to turn on parallelism. No problems have been reported yet, as far as I know.

On multi-user servers, you can limit CPU resources using OS commands, e.g. cpulimit at the system level. Surely that is a better way to manage/limit shared resources than forcing all users to manage it themselves somehow at the R level, which is unnecessarily inconvenient for desktop, laptop and sole-use servers. The goal is to 'just work' for the vast majority of users. Or, if it must be done in R, setDTthreads(1) can be placed in your .Rprofile, or in Rprofile.site on the shared-resource server.
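For instance, a sketch of what that could look like in Rprofile.site (or a user's .Rprofile) on the shared server:

```r
# In Rprofile.site or a user's .Rprofile on the shared server:
# pin data.table to a single thread for every session, if it is installed.
if (requireNamespace("data.table", quietly = TRUE)) {
  data.table::setDTthreads(1)
}
```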

Over time, I hope that explicit parallelism calling data.table will no longer be needed. The best place to do the parallelism is inside data.table automatically, not outside manually.

If I've misunderstood or missed something, please reopen. If the predicted problems occur please let me know and I'll definitely think again.
