OpenBLAS with Pthreads can cause CPU contention for MPI programs with Pthreads #4033
Thank you very much for the detailed writeup. I sort-of agree that this looks like a pathological case, but I find the apparent restriction to just 2 cores puzzling. (And I wonder if some middle ground exists, or if threads go crazy the moment you go from OPENBLAS_NUM_THREADS=1 to OPENBLAS_NUM_THREADS=2.)
I didn't actually try OPENBLAS_NUM_THREADS=2: I would guess that this would be fine, provided that I only use 18 threads otherwise (so that the total "in use" core count sums to 20). FWIW: I think the restriction to 2 cores is arbitrary. I test using 2 MPI processes that each use a single thread for doing MPI communication. If I use (say) …
I've run into a similar issue, but only on an AMD Ryzen 7 5800H CPU.
The result is:
Additionally, if I pin each thread to a CPU:
then the program will not exit, no matter what size the thread pool is. If I run the same program on an Intel CPU (both an i5 and an i7), the results are normal regardless of whether the threads are pinned to specific CPUs:
I think there are some thread-contention bugs between OpenMP and pthreads.
OpenMP on Linux relies on pthreads itself, but if OpenBLAS is not built with USE_OPENMP=1, there is no chance of either knowing about the thread usage of the other. I do not think it likely that there is an actual difference between Intel and AMD CPUs in this regard; perhaps the compiler and/or library versions were different in your tests as well?
It seems openblas_set_num_threads() had no effect in the "AMD" case. Please show what is happening (attach the captured output, or extract the significant-seeming pieces of it).
Revisiting this, I see no possibility for improvement on OpenBLAS' side, as there is no way (to my knowledge) for the pthreads pool to obtain any information about the size (or even just the presence) of the MPI environment it is running in, and to limit its own size accordingly. Using OPENBLAS_NUM_THREADS or the openblas_set_num_threads() function interface would appear to be the best one can do in this context, and I notice that guides like https://enccs.github.io/intermediate-mpi/ stress that mixing MPI with any other threading model adds overhead and potential for contention.
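For reference, the environment-variable workaround can be applied at launch time. A sketch with Open MPI (the program name and rank count are illustrative; `-x` is Open MPI's flag for forwarding an environment variable to every rank):

```shell
# Cap OpenBLAS's pthread pool before the MPI launch, so each rank's
# BLAS calls stay single-threaded and do not fight the program's own
# pthreads for cores.
export OPENBLAS_NUM_THREADS=1

# With Open MPI, -x forwards the variable to every rank, e.g.:
#   mpirun -np 2 -x OPENBLAS_NUM_THREADS ./my_program
echo "OPENBLAS_NUM_THREADS=$OPENBLAS_NUM_THREADS"
```

The programmatic equivalent is calling `openblas_set_num_threads(1)` early in the program, though as noted above that reportedly had no effect in the AMD case.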
I think this issue is broadly similar to #2543, but I was asked to provide a bug report for this.
TL;DR: Running MPI programs with pthreads and OpenBLAS can cause CPU contention. This is fixed by setting OPENBLAS_NUM_THREADS=1.
I have a program that uses OpenBLAS and is somewhat pathological in its setup, so this may not apply to all use cases.
Specifically, my program looks like this (it's based on G6K): the worker threads synchronise through notify_all, which is implemented as pthread_cond_broadcast on my machine. This tanks performance unless I set OPENBLAS_NUM_THREADS=1. In particular, it appears to restrict my program to running exclusively on 2 cores (on a 20-core machine), regardless of how many threads I start. Moreover, the program spends around 60% of its time, across all threads, simply synchronising. I find this surprising: my view of how notify_all works is that it shouldn't wake threads that aren't waiting on that particular condition variable.
I think the issue is (cf. #2543) that the condition variables become substantially more expensive because the cores are over-subscribed, leading to extra context switches.
I'd like to point out that this is likely also exacerbated by Open MPI, which issues memory fences whenever certain requests are checked; that makes all of this far more expensive.
LMK if anything is unclear, or if I can help with this in any way. I suspect the issue is unsolvable in general, outside of capping the thread count as described above: indeed, in my case I suspect that OpenBLAS starts its threads before my program does, so any sort of runtime checking is likely to be difficult.