OpenBLAS : Program will terminate because you tried to start too many threads. #1735

Closed
logidelic opened this issue Aug 13, 2018 · 24 comments

@logidelic

logidelic commented Aug 13, 2018

I recently built OpenBLAS 0.3.2 on Ubuntu 16 and am running into this error.

OpenBLAS : Program will terminate because you tried to start too many threads.

My program is using a library that allocates many threads for various reasons. Most of the threads are sleeping most of the time, so there are no performance issues, but I guess that the large number of threads is too much...

Does linking to OpenBLAS really limit the max number of threads that the program creates, even if those threads are not making blas calls? Is there a way around this?

FWIW, I have done a lot of searching and reading on the topic, but still can't quite figure out how to solve this... Some notes:

  • I was not getting this error with a previous version of OpenBLAS (whatever is in the Ubuntu repo)
  • I also tried building with USE_OPENMP=1 but that doesn't seem to fix the problem.
  • I tried with OPENBLAS_NUM_THREADS=1 environment variable which seemed to make no difference
  • I tried calling openblas_set_num_threads(1) at runtime, which seemed to make some difference, but eventually I got the same error
  • I tried building with USE_THREAD=0. I think this prevents the error but I haven't found clear documentation as to what the implications are. Can I still call blas functions in multiple threads safely (on unrelated data of course)? Does blas do everything in a single thread in this case or in whatever thread it's called from?

Thanks much.

@brada4
Contributor

brada4 commented Aug 13, 2018

If you call OpenBLAS from multiple threads you need the OpenMP or the single-threaded version. Check Makefile.rule for more options.

@brada4
Contributor

brada4 commented Aug 13, 2018

If you call multi-threaded blas from many threads at once you lose cache efficiency

@martin-frbg
Collaborator

This appears to be a (still poorly understood) bug introduced with the rewrite (and speedup) of the thread initialization code in 0.3.1 - see #1704 and #1641. A workaround is simply to increase the value of MAX_ALLOCATING_THREADS, but to achieve some kind of final solution it would be very useful to get an idea how many (blas-calling) threads an affected program is/was trying to start.

@brada4
Contributor

brada4 commented Aug 14, 2018

Didn't #1704 more or less end up as a resource leak, where a calling thread exits and a new one takes its place all the time?
@logidelic can you add some cout() where your code creates / exits a thread?

@martin-frbg
Collaborator

Not sure I'd call that a resource leak, it just so happens that there is a fixed size array for thread pointers in the new code where old entries apparently never get reused. #1641 already had musings on making this thing dynamically allocated.
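
The behaviour described here can be illustrated with a toy model. This is a minimal Python sketch, not the OpenBLAS code: `SlotTable`, `run_short_lived_threads`, the capacity of 8 and the error text are all assumptions made up for illustration. It shows how a fixed-size table whose entries are never released fills up even though only one worker thread is alive at any moment.

```python
import threading


class SlotTable:
    """Toy model of a fixed-size per-thread table whose slots are never reused."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.used = 0
        self._lock = threading.Lock()

    def register(self):
        # Each new thread claims a fresh slot; nothing ever gives one back.
        with self._lock:
            if self.used >= self.capacity:
                raise RuntimeError("too many threads")
            self.used += 1


def run_short_lived_threads(table, count):
    """Start `count` threads one after another; return how many failed to register."""
    failures = []

    def worker():
        try:
            table.register()
        except RuntimeError:
            failures.append(1)

    for _ in range(count):
        t = threading.Thread(target=worker)
        t.start()
        t.join()  # only one worker thread is ever alive at a time
    return len(failures)


table = SlotTable(capacity=8)  # hypothetical limit, not the real OpenBLAS value
failed = run_short_lived_threads(table, 10)
print(failed)
```

In this model the limit is hit by the total number of threads ever created, not by the number running concurrently, which is why a program with many short-lived threads could trip it.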

@susilehtola
Contributor

The same issue has been reported in Fedora
https://bugzilla.redhat.com/show_bug.cgi?id=1615803

so I'm also interested in a speedy fix to this.

@martin-frbg
Collaborator

Unfortunately that bugzilla entry is mostly useless, as the person who wrote it does not appear to know what the code he uses actually does internally (or how many threads he will eventually need). For a crude "fix", you could try replacing

#  define MAX_ALLOCATING_THREADS MAX_CPU_NUMBER * 2 * MAX_PARALLEL_NUMBER * 2

in line 512 of driver/others/memory.c with some big constant, such as

#define MAX_ALLOCATING_THREADS 4096

Another choice would be to revert the entire file to its 0.3.0 state...

@PorcelainMouse

If one isn't using the OpenBLAS API directly, how can one troubleshoot this error message? I looked, but didn't see any runtime env-based debug options I could try. I cannot even tell where the crash occurs since I don't get a coredump.

I see in the FAQ that "If your application is already multi-threaded, it will conflict with OpenBLAS multi-threading." Is this a new conflict? If not, that is not my problem because it broke immediately after update to 0.3.1-2. But, using OPENBLAS_NUM_THREADS=1 prevents the crash, so, can you tell if that is related or not?

Thanks for your help.

@martin-frbg
Collaborator

@PorcelainMouse the problem is an unexpected and unintended consequence of recent changes in OpenBLAS that were made to speed up the thread initialization. There appears to be a certain kind of workload, or application behaviour, where the assumptions about the maximum number of threads to expect do not hold. Unfortunately there is no short-term solution via environment options other than setting OPENBLAS_NUM_THREADS=1.
To fix this, it would be very useful to get an understanding of what your program does, and a rough estimate of how many threads it creates. (It seems the assumptions were that at most four times the number of cpu cores on the machine where OpenBLAS was compiled would get used, and that threads would typically persist for the lifetime of the program. From #1704 we now know that some "deep learning" contexts may use many more threads over the lifetime of a program, which used to work with 0.3.0 as that used a global memory pool rather than thread-local storage.) If all else fails, 0.3.3 will switch back to the old, slower method. @oon3m0oo
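
For Python programs that only reach OpenBLAS through other modules, the environment variable generally has to be in place before the library is loaded. A minimal sketch, assuming NumPy is the module that pulls in OpenBLAS (the import is guarded because NumPy may not be installed, and `blas_threads` is just an illustrative name):

```python
import os

# Set this before the first import that loads OpenBLAS; the library reads
# the variable when it initializes, so setting it later is generally too late.
os.environ["OPENBLAS_NUM_THREADS"] = "1"

try:
    import numpy as np  # assumption: NumPy is what links against OpenBLAS here
except ImportError:
    np = None  # the sketch still runs without NumPy

blas_threads = os.environ["OPENBLAS_NUM_THREADS"]
print(blas_threads)
```

Exporting the variable in the shell before starting the interpreter has the same effect and avoids any dependence on import order.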

@oon3m0oo
Contributor

The right way to fix this is to do what I suggested before, which is to have the allocation tracking be dynamic, per thread, rather than globally allocated and assuming a certain number of threads. I might have some time tomorrow to make that happen (since I'm the one who originally caused this).

@PorcelainMouse

Thanks much! I think I understand. I think I can monitor threads, somehow, though I don't remember. I'm worried I will not be able to catch every single thread if I have to poll, for example. But, I'll see.

My code is Python. So, while I'm intimately familiar with every line, I'm also not using OpenBLAS directly; some module I'm using is using it. Numpy, pandas, and matplotlib could all be using it, since all three depend on the OpenBLAS package.

This is an odd situation. It's pretty clear to me that Python's use of OpenBLAS assumes this old behavior. I cannot imagine they think it's okay for everyone who uses matplotlib to set this OpenBLAS environment variable. I wonder if it is better to build OpenBLAS with OpenMP and make that a dependency for distribution packages? The FAQ seems to be saying that is safer for distro packages since their users cannot change their code, in general. (Hmm, I suppose I could srpm it and try that.)

One more thing: I'm suspicious about the algorithm you describe since it uses the build machine core count, which is likely to be one or two (i.e. a virtual machine) but that is atypical for any modern device, even phone. I know you know that, so maybe I misunderstand, but I'm just thinking out loud and still a bit confused.

I'm more than happy to troubleshoot! I can reproduce the crash at will--well, the code runs for 4 minutes before crashing, so I hardly have an MWE; that's why I don't know exactly where it crashes--so that's a resource for you. Let me know if you can think of some instrumentation I can use from high up in the stack. I have no idea how to probe this.
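
One possible way to get a rough thread count from high up in a Python stack is to wrap `threading.Thread.start`. This is only a sketch: `install_thread_counter` and `threads_created` are hypothetical helper names, and the wrapper sees only threads started through Python's `threading` module, not native threads created inside C extensions.

```python
import threading

_created = 0
_count_lock = threading.Lock()
_original_start = threading.Thread.start


def _counting_start(self, *args, **kwargs):
    """Count every thread started through Python's threading module."""
    global _created
    with _count_lock:
        _created += 1
    return _original_start(self, *args, **kwargs)


def install_thread_counter():
    # Replace Thread.start for the whole process; call this as early as possible.
    threading.Thread.start = _counting_start


def threads_created():
    return _created


install_thread_counter()
workers = [threading.Thread(target=lambda: None) for _ in range(3)]
for w in workers:
    w.start()
for w in workers:
    w.join()
print(threads_created())
```

On Linux, the `Threads:` line in `/proc/self/status` gives the number of threads currently alive in the process, including native ones, though polling it can miss short-lived threads.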

@martin-frbg
Collaborator

The new behaviour is just an oversight, but the old code it replaced was not without problems either. The reliance on the core count of the build system is actually just a fallback for when NUM_THREADS is not set at build time. (Actually with the old code you would get a very similar message and subsequent crash if you happened to exceed that limit, just the circumstances that triggered it were different.)
What worries me most is just how many programs have come to depend on OpenBLAS, and will be affected if a distro habitually updates to the latest release as soon as it becomes available. There is no sizable organisation or permanent developer community behind OpenBLAS at the moment, so not all regressions will be caught in time before they make it into a release.

@martin-frbg
Collaborator

Perhaps matplotlib is creating a new thread which does a single OpenBLAS call for every point (line or whatever object) it is drawing? I wonder if it would be possible to come up with a very small python/matplotlib script that shows this behaviour.

@logidelic
Author

I'll try my best to get a clean repro with debug information, but it might be a while before I have time for it. :(

What I will say (which I think jibes with some other comments here) is that I was running into another (more mysterious) issue with the older version of OpenBLAS, and I highly suspect that it was due to a similar underlying limitation. I disagree with the design philosophy that puts a limit on the number of user threads in the program that links to OpenBLAS, but it is at least better to error with a clear message than fail silently / unpredictably.

@sscherfke

You should be very careful when setting NUM_THREADS=4096, b/c it can result in a huge memory “leak” (depending on your code).

@martin-frbg
Collaborator

You should certainly not set NUM_THREADS that high unless you have a very big computer. The workaround suggested above was to change the value of MAX_ALLOCATING_THREADS in file memory.c which is just an array of pointers.

@martin-frbg
Collaborator

Anyway I have now committed a change to develop (and "soon" 0.3.3) that reverts to the old version of memory.c unless OpenBLAS is built with -DUSE_TLS. (And in the latter case, the arbitrary limitation on the number of threads should be gone - I just do not want to make this code the default yet, as I only hacked around bugs in oon3m0oo's latest PR. Hopefully these remaining issues can be resolved soon.)

@martin-frbg
Collaborator

0.3.3 is released now, reverting to the old code until we get to the bottom of this.

@martin-frbg
Collaborator

From #1761 it appears I made a mistake though, as the new USE_TLS option is on by default in 0.3.3 when you build with plain make - it needs to be commented out in Makefile.rule to actually get the 0.3.0 version of memory.c

@martin-frbg
Collaborator

martin-frbg commented Sep 19, 2018

@susilehtola the situation in 0.3.3 is as follows - with the latest iteration of the TLS code active (USE_TLS set to 1 at compile time), the "too many threads" problem is probably solved, but the fix is a mix of oon3m0oo's PR #1739 and my attempts at fixing new bugs in that PR. With cmake, the TLS version of memory.c is not used by default - this was my intention to again provide a stable basis after so many and diverse projects were affected by the decision to include this new code. Unfortunately I merged the wrong version of Makefile.rule, which made USE_TLS=1 the default for pure make builds (where 0.3.3 should still perform much better than 0.3.2).

With #1765 merged, the situation is now cleaned up on the develop branch, in that USE_TLS is not set by default no matter which build system is used, leaving it up to the decision of the user or package maintainer to activate it based on their own testing. The same PR also ensures that the old code is always used for non-threaded builds, which takes care of the recent issue #1761.
There will be a 0.3.4 once I have time to do more than damage control again.

@susilehtola
Contributor

@martin-frbg right. I guess I should then add USE_TLS=0 to the Fedora builds of 0.3.3?

@martin-frbg
Collaborator

@susilehtola with stock 0.3.3 you would actually need to remove or comment out the USE_TLS=1 line in Makefile.rule as I had continued the bad tradition in OpenBLAS of making a variable look like a Boolean while only checking if it is defined at all. ☹️
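
The pitfall mentioned here, a setting that looks Boolean but is only tested for being defined, can be sketched as follows. This is a Python analogy rather than the actual Makefile logic; `enabled_if_defined`, `enabled_if_true` and the `settings` dictionary are names invented for the illustration.

```python
def enabled_if_defined(settings, name):
    """Mimics an 'is it defined at all?' check: any value, even "0", enables it."""
    return name in settings


def enabled_if_true(settings, name):
    """Treats the setting as a real Boolean: only "1" enables it."""
    return settings.get(name) == "1"


# What a packager might write when expecting to switch the feature off.
settings = {"USE_TLS": "0"}

print(enabled_if_defined(settings, "USE_TLS"))
print(enabled_if_true(settings, "USE_TLS"))
```

Under the defined-only check, the only way to switch the feature off is to leave the name out entirely, which matches the advice to remove or comment out the line.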

@martin-frbg
Collaborator

Closing as 0.3.4 is released with both this fix for my USE_TLS blunder in 0.3.3 and brada4's bumping the default number of buffers to 50.

@linuxl7

linuxl7 commented Apr 28, 2023

Maybe there are too many CPUs. Changing \site-packages\joblib\externals\loky\backend\context.py as follows can solve it:

os_cpu_count = min(os.cpu_count() or 1,12)

cpu_count_user = min(_cpu_count_user(os_cpu_count),12)
