Segfault when aborting matrix multiplication #1468
I get this:
And no, I do not believe this should be happening. I seem to remember Stefan saying "Any segfault is automatically a bug". :) |
During that call, the control is actually with openblas. Typically, you would intercept the signal, and do a cleanup in the handler. However, many of the native libraries are not designed with this kind of interactive usage in mind. It would still be nice to not have a segfault, if possible. |
Short answer: just use Linux :-P We can look into it, but it's unlikely we'll be able to make Ctrl-C work perfectly in every case. Once an async signal happens the process is technically in an ill-defined state. |
Does OpenBLAS reset signal handlers on function calls? If not, shouldn't Julia's signal handlers be getting called no matter where execution is at the time of the signal? Also, the segfault doesn't always happen immediately for me, for instance in the example above, Julia was able to print out the stack trace, and ask for another prompt before segfaulting. My guess is this is a memory freeing bug triggered by the gc. Some further digging reveals:
|
I managed to capture this in
It paused at the julia prompt for a couple seconds (I was going to type |
I have to push a little harder on my Ubuntu but it can do similar tricks
It is not a big deal for me, but I did not recall such behaviour when working with R and thought there might be a fix. |
In MATLAB, you don't typically get segfaults when you do such stuff. Usually, MATLAB will not let you do a Ctrl-C when it is running native code. Not sure how R handles all this gracefully. |
We should definitely take a look at what R does. They may have some good tricks to copy. |
I guess I don't understand why CTRL^C'ing in OpenBLAS code should give us problems vs. being in Julia code when we CTRL^C.
MATLAB queues the CTRL^C until you're back in MATLAB code, and then breaks. |
We have the ability to defer SIGINT around blocks of code, and it is used in some places in the run time system. We could expose this as a macro. It would be silly to use it around every |
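To make the deferral idea concrete, here is a minimal sketch in Python. It is not Julia's runtime mechanism; `defer_sigint` is a hypothetical name used only for illustration. A Ctrl-C that arrives inside the block is remembered and only turned into an exception once the block has finished.

```python
import signal
from contextlib import contextmanager


@contextmanager
def defer_sigint():
    """Hold any SIGINT that arrives inside the block; raise once it exits.

    Hypothetical helper: must be used from the main thread, because
    Python only allows signal handlers to be installed there.
    """
    pending = []

    def remember(signum, frame):
        # Do not unwind here; just note that an interrupt was requested.
        pending.append(signum)

    previous = signal.signal(signal.SIGINT, remember)
    try:
        yield pending
    finally:
        # Restore whatever handler was active before the block.
        restored = previous if previous is not None else signal.SIG_DFL
        signal.signal(signal.SIGINT, restored)
    if pending:
        # Deliver the deferred interrupt now that the block is complete.
        raise KeyboardInterrupt
```

Wrapping a call as `with defer_sigint(): native_call()` (where `native_call` stands for any foreign function) keeps the call atomic with respect to Ctrl-C, at the cost of not being able to abort it midway.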
I thought it was pretty standard procedure to have signal handlers set up at all times, so that no matter what happens signal-wise you aren't either interpreting signals as SIGTERM when you don't want to, or even worse, ignoring signals that should be having an effect. Having to wrap something around every |
We do have handlers set up, we just defer the handling in some cases so that certain actions are atomic. But in any case it is probably still possible for SIGINT to leave the process in an inconsistent state. I just tried this example on linux, and using openblas threads I got a segfault. After setting |
I see the same thing on my Mac. Single threaded works. |
From your comments, Jeff, it sounds like the openblas child threads are continuing happily along on whatever they were doing, unaware that the SIGINT has jumped execution out of openblas control (and that its temporary arrays will soon be GC'd). Unless there is a way to gracefully (or not) tell openblas to abort everything, perhaps calls to blas functions should be atomic to interruption by signals? |
Cc: @xianyi |
Hi, I just tested multi-threaded DGEMM in C. The Ctrl-C works fine. I think we should build the Julia & OpenBLAS debug version to narrow down this issue. Xianyi |
I tried Ctrl-C with Julia linked to MKL instead of OpenBLAS and the result is the same. Segfault for multithreading and not for single threading so it does not seem to be specific to OpenBLAS. |
I tried this with Accelerate, and got the same result as well, and it seems my suspicions about a memory-freeing problem have a little more substance to them. Nothing bad happened until I manually called
|
Makes sense. We need a way to interrupt the work threads, or have them see and appropriately respond to SIGINT. |
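One way to let worker threads "see and respond" to an interrupt is cooperative cancellation: the workers poll a shared flag, and the interrupted caller waits for them to stop before letting go of any buffers. The sketch below shows that pattern in Python. It is an illustration only; `CancellableWorkers` is a hypothetical class, and nothing here reflects how OpenBLAS, MKL, or Accelerate manage their threads.

```python
import threading


class CancellableWorkers:
    """Run work(item) over chunks on one thread per chunk (hypothetical helper)."""

    def __init__(self, chunks, work):
        self.cancel = threading.Event()
        self.results = [None] * len(chunks)
        self.threads = [
            threading.Thread(target=self._run, args=(i, chunk, work))
            for i, chunk in enumerate(chunks)
        ]

    def _run(self, i, chunk, work):
        acc = []
        for item in chunk:
            if self.cancel.is_set():
                return  # abandon this chunk; results[i] stays None
            acc.append(work(item))
        self.results[i] = acc

    def stop(self):
        """Ask every worker to stop and wait until all of them have."""
        self.cancel.set()
        for t in self.threads:
            t.join()

    def run(self):
        for t in self.threads:
            t.start()
        try:
            for t in self.threads:
                t.join()
        except KeyboardInterrupt:
            # Do not return (and let buffers be freed) while workers still run.
            self.stop()
            raise
        return self.results
```

The important property is that `run` never returns while a worker is still touching shared state, which is the situation suspected of causing the crash above.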
Would setting a SIGINT handler in the openblas threads be sufficient to fix this? Alternatively, when the REPL gets interrupted, can it just cancel all pthreads launched by any julia processes, given that Julia itself does not use any threads? It could end up leaking memory, but not dying would be nice. Cc: @tanmaykm @amitmurthy |
BUMP, this is really annoying. |
I'm pretty sure the openblas pthreads are supposed to sleep, not die, after finishing |
I'm not sure what to do here. On the one hand you might actually want to interrupt the matrix multiply, which we could probably do for OpenBLAS, but there's also Accelerate, MKL, etc. for which we would not be able to do that. Perhaps the best thing to do is to defer the signal until after the BLAS call and deliver it then. |
I'd be happy if it worked well with openblas. Deferring the signal is not such a great option since you'd really like to be able to interrupt the call. It also entails extra complexity and overhead. |
Often the reason you want to deliver it in the middle of the blas call is that you have started something that you realize later will take too long or too much memory. Even if the OpenBLAS case is well behaved, that is worthwhile. In general we want to be able to interrupt any ccall. |
The other problem is that OpenBLAS actually allocates memory in some functions which we would leak if we interrupted it. |
I think that would be OK. Perhaps we can print a warning. |
I'm not sure people would like 40MB of memory leaked every time they press ^C |
Hmm, I have a devious idea. Let's see how it works out. |
For other ccalls it could be significantly more than 40 MB. There would also be an issue with other system resources that might be more scarce than memory. |
It is quite clear that there is no way to do this generally for all C libraries. However, I think I can do the special case of OpenBLAS with a little system-level hackery. |
See also #2622 |
I filed issue OpenMathLib/OpenBLAS#378 against OpenBLAS. As mentioned in the issue I tried a very ugly hack but I don't really like it so I outlined some possible courses of action in that issue. |
As I argued in #2622, and Jeff eventually agreed, the only sane default is to defer sigint in every ccall. (We can possibly hack specific things like openblas to either re-enable sigint for those calls or to periodically check manually for deferred sigints at safe places.) |
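The "check manually for deferred sigints at safe places" option can be sketched as follows. This is a Python illustration under assumed names (`DeferredInterrupt`, `matmul_with_checkpoints`), not code from Julia or OpenBLAS: the handler only records the request, and the long-running routine polls for it between rows, where no partial result is in flight.

```python
import signal


class DeferredInterrupt:
    """Records SIGINT instead of raising; callers poll at safe points."""

    def __init__(self):
        self.requested = False

    def handler(self, signum, frame):
        self.requested = True  # remember the request, do not unwind here

    def install(self):
        # Must be called from the main thread.
        signal.signal(signal.SIGINT, self.handler)

    def checkpoint(self):
        """Raise the deferred interrupt, if any, at a point chosen by the caller."""
        if self.requested:
            self.requested = False
            raise KeyboardInterrupt


def matmul_with_checkpoints(a, b, interrupt):
    """Multiply list-of-lists matrices, polling for an interrupt after each row."""
    cols = list(zip(*b))
    out = []
    for row in a:
        out.append([sum(x * y for x, y in zip(row, col)) for col in cols])
        interrupt.checkpoint()  # safe point: the row above is complete
    return out
```

Compared with deferring until the whole call returns, polling keeps the call interruptible, but it requires cooperation from the library being called.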
* Remove unnecessary sigatomic
* Make flisp calls sigatomic
* Make type inference calls sigatomic
* Refactor interthread communication through signal
* Make sure `sleep` is aborted on `SIGINT` on Linux to deliver the exception faster
* Implement force signal throwing when `SIGINT` arrives too frequently
* Hack to abort io syscall on `SIGINT`

Fix #1468; Fix #2622; Towards #14675
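The commit message mentions forcing delivery when `SIGINT` arrives too frequently. A rough Python sketch of that idea follows; the class name, the threshold of three signals, and the one-second window are all assumptions made for illustration and are not taken from the actual change. A single Ctrl-C is only recorded as pending, while repeated presses within a short window raise immediately.

```python
import signal
import time


class EscalatingSigint:
    """Defer SIGINT normally, but force it if the user keeps pressing Ctrl-C."""

    def __init__(self, limit=3, window=1.0, clock=time.monotonic):
        self.limit = limit    # assumed: signals needed to force delivery
        self.window = window  # assumed: seconds within which they must arrive
        self.clock = clock
        self.recent = []
        self.pending = False

    def __call__(self, signum, frame):
        now = self.clock()
        # Keep only the signals that arrived inside the current window.
        self.recent = [t for t in self.recent if now - t < self.window]
        self.recent.append(now)
        if len(self.recent) >= self.limit:
            self.recent.clear()
            self.pending = False
            raise KeyboardInterrupt  # forced delivery, even mid-call
        self.pending = True  # otherwise leave it for a later safe point

    def install(self):
        # Must be called from the main thread.
        signal.signal(signal.SIGINT, self)
```

This gives the user an escape hatch when deferred delivery would otherwise leave them waiting on a call that never reaches a safe point.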
I get this systematically
when I try to abort the computation. Julia runs in the terminal on a MacBook Pro with Mountain Lion. Shouldn't it be okay to abort a matrix multiplication like this or are crashes expected when aborting?