Segfault when aborting matrix multiplication #1468

andreasnoack · 2012-10-29T15:33:16Z

I get this systematically

julia> mA=randn(5000,5000);mB=randn(5000,5000);

julia> mA*mB
^CSegmentation fault: 11

when I try to abort the computation. Julia runs in the terminal on a MacBook Pro with Mountain Lion. Shouldn't it be okay to abort a matrix multiplication like this or are crashes expected when aborting?

staticfloat · 2012-10-29T15:55:13Z

I get this:

julia> mA*mB
^C
 in gemm! at blas.jl:267
 in gemm_wrapper at matmul.jl:276
 in gemm_wrapper at matmul.jl:265
 in * at matmul.jl:84

julia> Segmentation fault: 11

And no, I do not believe this should be happening. I seem to remember Stefan saying "Any segfault is automatically a bug". :)

ViralBShah · 2012-10-29T19:42:42Z

During that call, the control is actually with openblas. Typically, you would intercept the signal, and do a cleanup in the handler. However, many of the native libraries are not designed with this kind of interactive usage in mind. It would still be nice to not have a segfault, if possible.

JeffBezanson · 2012-10-29T19:50:01Z

Short answer: just use linux :-P

We can look into it, but it's unlikely we'll be able to make ctl-C work perfectly in every case. Once an async signal happens the process is technically in an ill-defined state.

staticfloat · 2012-10-29T19:50:31Z

Does OpenBLAS reset signal handlers on function calls? If not, shouldn't Julia's signal handlers be getting called no matter where execution is at the time of the signal?

Also, the segfault doesn't always happen immediately for me. In the example above, for instance, Julia was able to print the stack trace and show another prompt before segfaulting. My guess is that this is a memory-freeing bug triggered by the gc. Some further digging reveals:

julia> mA=randn(5000,5000);mB=randn(5000,5000);

julia> mA*mB
^C

julia> 

julia> 2+2
4

julia> gc()
Segmentation fault: 11

staticfloat · 2012-10-29T19:57:26Z

I managed to capture this in gdb:

julia> mA=randn(5000,5000);mB=randn(5000,5000);

julia> mA*mB
^C
Program received signal SIGINT, Interrupt.
0x0000000102e925f3 in .L12 ()
(gdb) signal SIGINT
Continuing with signal SIGINT.


julia> 
Program received signal EXC_BAD_ACCESS, Could not access memory.
Reason: KERN_INVALID_ADDRESS at address: 0x000000012d43e5e0
[Switching to process 38681 thread 0x1203]
0x0000000102e92895 in .L18 ()
(gdb) bt
#0  0x0000000102e92895 in .L18 ()
#1  0x00000000000004e2 in ?? ()
(gdb) info threads
  3                         0x00007fff8e5ac0fa in __psynch_cvwait ()
* 2                         0x0000000102e92895 in .L18 ()
  1 "com.apple.main-thread" 0x00007fff8e5ac322 in select$DARWIN_EXTSN ()
(gdb) thread 3
[Switching to thread 3 (process 38681)]
0x00007fff8e5ac0fa in __psynch_cvwait ()
(gdb) bt
#0  0x00007fff8e5ac0fa in __psynch_cvwait ()
#1  0x00007fff92de7f89 in _pthread_cond_wait ()
#2  0x0000000100082d7b in run_io_thr ()
#3  0x00007fff92de3742 in _pthread_start ()
#4  0x00007fff92dd0181 in thread_start ()
(gdb) thread 1
[Switching to thread 1 (process 38681), "com.apple.main-thread"]
0x00007fff8e5ac322 in select$DARWIN_EXTSN ()
(gdb) bt
#0  0x00007fff8e5ac322 in select$DARWIN_EXTSN ()
#1  0x00007fff8e5ab20f in select ()
#2  0x0000000101d260eb in ?? ()
#3  0x0000000101d25987 in ?? ()
#4  0x0000000100049c75 in jl_apply_generic ()
#5  0x0000000101d25255 in ?? ()
#6  0x000000010008480e in start_task ()
#7  0x00000001000839ca in julia_trampoline ()
#8  0x0000000100001b7b in main ()
(gdb)

It paused at the julia prompt for a couple seconds (I was going to type gc() but it crashed before I could), so hopefully these stack traces are more focused on the issue at hand.

andreasnoack · 2012-10-29T20:17:38Z

I have to push a little harder on my Ubuntu but it can do similar tricks

julia> mA=randn(5000,5000);mB=randn(5000,5000);

julia> mA*mB
^C

julia> mA*mB
^C

julia> mA*mB
BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
Segmentation fault (core dumped)

It is not a big deal for me, but I do not recall such behaviour when working with R and thought there might be a fix.

ViralBShah · 2012-10-29T21:53:10Z

In Matlab, you don't typically get segfaults when you do such stuff. Usually, matlab will not let you do a ctrl-c when it is running native code. Not sure how R handles all this gracefully.

StefanKarpinski · 2012-10-29T22:00:03Z

We should definitely take a look at what R does. They may have some good tricks to copy.

staticfloat · 2012-10-29T23:23:11Z

I guess I don't understand why CTRL^C'ing in OpenBLAS code should give us problems vs. being in Julia code when we CTRL^C.

> Usually, matlab will not let you do a ctrl-c when it is running native code

MATLAB queues the CTRL^C until you're back in MATLAB code, and then breaks.

JeffBezanson · 2012-10-30T06:22:01Z

We have the ability to defer SIGINT around blocks of code, and it is used in some places in the run time system. We could expose this as a macro. It would be silly to use it around every ccall though.
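To make the idea of deferring SIGINT around a block concrete, here is a minimal sketch in Python rather than in Julia's C runtime. It is not Julia's sigatomic implementation, and the `defer_sigint` name is made up for illustration. It assumes a POSIX system and blocks SIGINT with `signal.pthread_sigmask`, so a Ctrl-C that arrives inside the block stays pending and is only delivered once the previous mask is restored.

```python
import signal
from contextlib import contextmanager


@contextmanager
def defer_sigint():
    """Hypothetical helper: hold back SIGINT until the block has finished."""
    # Block SIGINT in the calling thread and remember the previous mask.
    old_mask = signal.pthread_sigmask(signal.SIG_BLOCK, {signal.SIGINT})
    try:
        yield
    finally:
        # Restoring the mask delivers any SIGINT that arrived in the meantime,
        # so the interrupt surfaces here, after the block is complete.
        signal.pthread_sigmask(signal.SIG_SETMASK, old_mask)


# Usage: the body runs to completion even if Ctrl-C is pressed inside it.
# with defer_sigint():
#     update_shared_state()   # placeholder for the code that must be atomic
```

The mask only covers the calling thread, so this says nothing about worker threads started by a native library.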

staticfloat · 2012-10-30T06:30:32Z

I thought it was pretty standard procedure to have signal handlers set up at all times, so that no matter what happens signal-wise you aren't either interpreting signals as SIGTERM when you don't want to or, even worse, ignoring signals that should be having an effect. Having to wrap something around every ccall sounds like the wrong way to do things, but shouldn't Julia provide a signal handler that at the very least gracefully exits on signals? (Again, I don't think I'm understanding why you would segfault if the SIGINT happens during OpenBLAS execution vs. not segfaulting during Julia execution.)

JeffBezanson · 2012-10-30T06:46:42Z

We do have handlers set up, we just defer the handling in some cases so that certain actions are atomic. But in any case it is probably still possible for SIGINT to leave the process in an inconsistent state.

I just tried this example on linux, and using openblas threads I got a segfault. After setting OPENBLAS_NUM_THREADS=1 the segfault went away. This is 100% reproducible for me --- threads = segfault, 1 thread = works. So something in how the child threads are handled seems to be implicated. Do others see the same thing?

andreasnoack · 2012-10-30T07:31:56Z

I see the same thing on my Mac. Single threaded works.

vtjnash · 2012-10-30T08:23:55Z

From your comments, Jeff, it sounds like the openblas child threads are continuing happily along on whatever they were doing, unaware that the SIGINT has jumped execution out of openblas control (and that its temporary arrays will soon be GC'd). Unless there is a way to gracefully (or not) tell openblas to abort everything, perhaps calls to blas functions should be atomic to interruption by signals?

ViralBShah · 2012-10-30T16:33:08Z

Cc: @xianyi

xianyi · 2012-11-01T15:18:09Z

Hi,

I just tested multi-threaded DGEMM in C. The Ctrl-C works fine.

I think we should build the Julia & OpenBLAS debug version to narrow down this issue.

Xianyi

andreasnoack · 2012-11-01T15:38:25Z

I tried Ctrl-C with Julia linked to MKL instead of OpenBLAS and the result is the same. Segfault for multithreading and not for single threading so it does not seem to be specific to OpenBLAS.

staticfloat · 2012-11-01T17:55:10Z

I tried this with Accelerate, and got the same result as well, and it seems my suspicions about a memory-freeing problem have a little more substance to them. Nothing bad happened until I manually called gc(), and check out the backtrace on thread #1:

julia> mA=randn(5000,5000);mB=randn(5000,5000);

julia> mA*mB
^C
Program received signal SIGINT, Interrupt.
0x0000000104e7cd72 in dgebpAlignedC_4M4N_SSE ()
(gdb) signal SIGINT
Continuing with signal SIGINT.

 in gemm! at blas.jl:267
 in gemm_wrapper at matmul.jl:276
 in gemm_wrapper at matmul.jl:265
 in * at matmul.jl:84

julia> 

julia> 2+2
4

julia> gc()

Program received signal EXC_BAD_ACCESS, Could not access memory.
Reason: KERN_INVALID_ADDRESS at address: 0x0000000121820cd0
[Switching to process 55160 thread 0x1303]
0x0000000104e7ce35 in dgebpAlignedC_4M4N_SSE ()
(gdb) info threads
* 3 "com.apple.root.default-priorit" 0x0000000104e7ce35 in dgebpAlignedC_4M4N_SSE ()
  2                                  0x00000001011190fa in __psynch_cvwait ()
  1 "com.apple.main-thread"          0x0000000101118fc6 in munmap ()
(gdb) bt
#0  0x0000000104e7ce35 in dgebpAlignedC_4M4N_SSE ()
#1  0x0000000104e7d82d in dgemm_repack_Aligned ()
#2  0x0000000104e7b431 in __APL_dgemm_block_invoke_0 ()
#3  0x0000000100f34f01 in _dispatch_call_block_and_release ()
#4  0x0000000100f310b6 in _dispatch_client_callout ()
#5  0x0000000100f321fa in _dispatch_worker_thread2 ()
#6  0x0000000100fb2cab in _pthread_wqthread ()
#7  0x0000000100f9d171 in start_wqthread ()
(gdb) thread 2
[Switching to thread 2 (process 55160)]
0x00000001011190fa in __psynch_cvwait ()
(gdb) bt
#0  0x00000001011190fa in __psynch_cvwait ()
#1  0x0000000100fb4f89 in _pthread_cond_wait ()
#2  0x00000001000e687b in run_io_thr ()
#3  0x0000000100fb0742 in _pthread_start ()
#4  0x0000000100f9d181 in thread_start ()
(gdb) thread 1
[Switching to thread 1 (process 55160), "com.apple.main-thread"]
0x0000000101118fc6 in munmap ()
(gdb) bt
#0  0x0000000101118fc6 in munmap ()
#1  0x0000000100fd15de in deallocate_pages ()
#2  0x0000000100fd2209 in free_large ()
#3  0x0000000100fc9898 in free ()
#4  0x00000001000ef55e in jl_gc_collect ()
#5  0x00000001057dcab0 in ?? ()
#6  0x00000001000ad775 in jl_apply_generic ()
#7  0x00000001000e3624 in do_call ()
#8  0x00000001000e25b9 in eval ()
#9  0x00000001000edb64 in jl_toplevel_eval_flex ()
#10 0x00000001000b2407 in jl_f_top_eval ()
#11 0x00000001057d60ec in ?? ()
#12 0x00000001000ad775 in jl_apply_generic ()
#13 0x000000010578dafd in ?? ()
#14 0x000000010578d76d in ?? ()
#15 0x00000001000ad775 in jl_apply_generic ()
#16 0x0000000100001e97 in true_main ()
#17 0x00000001000e74d3 in julia_trampoline ()
#18 0x0000000100002279 in main ()

JeffBezanson · 2012-11-01T18:25:22Z

Makes sense. We need a way to interrupt the work threads, or have them see and appropriately respond to SIGINT.
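One way to picture "have the work threads see and respond to SIGINT" is cooperative cancellation. The sketch below is in Python and is only an analogy for what a BLAS library would have to do internally; `run_interruptible`, `process_block` and `cancel` are hypothetical names. The SIGINT handler sets a flag, each worker checks the flag between blocks, and the caller waits for every worker to stop before it raises, so no buffer is released while a worker is still using it.

```python
import signal
import threading

_DONE = object()  # sentinel marking an exhausted work queue


def run_interruptible(work_items, process_block, num_workers=4, cancel=None):
    """Run process_block over work_items on threads; stop cleanly on SIGINT.

    Must be called from the main thread, because Python only allows
    signal handlers to be installed there.
    """
    if cancel is None:
        cancel = threading.Event()
    items = iter(work_items)
    lock = threading.Lock()
    results = []

    def worker():
        # Check the cancellation flag between blocks, never in the middle of one.
        while not cancel.is_set():
            with lock:
                item = next(items, _DONE)
            if item is _DONE:
                return
            out = process_block(item)
            with lock:
                results.append(out)

    previous = signal.signal(signal.SIGINT, lambda signum, frame: cancel.set())
    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    try:
        for t in threads:
            t.start()
        for t in threads:
            # Join with a timeout so the main thread keeps running handlers.
            while t.is_alive():
                t.join(timeout=0.05)
    finally:
        if previous is not None:
            signal.signal(signal.SIGINT, previous)
    # Every worker has stopped; only now is it safe to unwind the caller.
    if cancel.is_set():
        raise KeyboardInterrupt
    return results
```

The important part is the ordering: the interrupt is turned into an exception only after all workers have been joined.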

ViralBShah · 2014-04-27T07:05:26Z

Would setting a SIGINT handler in the openblas threads be sufficient to fix this?

Alternatively, when the REPL gets interrupted, can it just cancel all pthreads launched by any julia processes, given that Julia itself does not use any threads? It could end up leaking memory, but not dying would be nice.

Cc: @tanmaykm @amitmurthy

Keno · 2014-05-15T01:02:12Z

BUMP, this is really annoying.

vtjnash · 2014-05-15T03:03:04Z

I'm pretty sure the openblas pthreads are supposed to sleep, not die, after finishing.

Keno · 2014-05-20T19:05:37Z

I'm not sure what to do here. On the one hand you might actually want to interrupt the matrix multiply, which we could probably do for OpenBLAS, but there's also Accelerate, MKL, etc. for which we would not be able to do that. Perhaps the best thing to do is to defer the signal until after the BLAS call and deliver it then.

JeffBezanson · 2014-05-20T19:21:44Z

I'd be happy if it worked well with openblas. Deferring the signal is not such a great option since you'd really like to be able to interrupt the call. It also entails extra complexity and overhead.

ViralBShah · 2014-05-20T19:30:06Z

Often the reason you want to deliver it in the middle of the blas call is that you have started something that you realize later will take too long or too much memory. Even if the OpenBLAS case is well behaved, that is worthwhile. In general we want to be able to interrupt any ccall.

Keno · 2014-05-21T09:08:09Z

The other problem is that OpenBLAS actually allocates memory in some functions which we would leak if we interrupted it.

ViralBShah · 2014-05-21T09:22:27Z

I think that would be OK. Perhaps we can print a warning.

Keno · 2014-05-21T09:38:49Z

I'm not sure people would like 40MB of memory leaked every time they press ^C

Keno · 2014-05-21T09:43:57Z

Hmm, I have a devious idea. Let's see how it works out.

ivarne · 2014-05-21T09:45:30Z

For other ccalls it could be significantly more than 40 MB. There would also be an issue with other system resources that might be more scarce than memory.

Keno · 2014-05-21T09:49:44Z

It is quite clear that there is no way to do this generally for all C libraries. However, I think I can do the special case of OpenBLAS with a little system-level hackery.

ivarne · 2014-05-21T10:02:35Z

See also #2622

Keno · 2014-05-26T10:00:59Z

I filed issue OpenMathLib/OpenBLAS#378 against OpenBLAS. As mentioned in the issue I tried a very ugly hack but I don't really like it so I outlined some possible courses of action in that issue.

stevengj · 2014-12-30T23:06:17Z

As I argued in #2622, and Jeff eventually agreed, the only sane default is to defer sigint in every ccall. (We can possibly hack specific things like openblas to either re-enable sigint for those calls or to periodically check manually for deferred sigints at safe places.)
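The "defer by default, then check manually at safe places" pattern can be sketched as follows. This is a Python illustration under stated assumptions, not the Julia runtime's mechanism: `DeferredSigint`, `check` and `sum_blocks` are hypothetical names, and the loop body stands in for one chunk of native work that must not be torn apart.

```python
import signal


class DeferredSigint:
    """Record SIGINT instead of raising; raise only when check() is called."""

    def __init__(self):
        self.pending = False
        self._previous = None

    def _record(self, signum, frame):
        # Handler: remember that an interrupt arrived, do nothing else.
        self.pending = True

    def __enter__(self):
        self._previous = signal.signal(signal.SIGINT, self._record)
        return self

    def check(self):
        """Call at a safe place: deliver a deferred interrupt, if any."""
        if self.pending:
            self.pending = False
            raise KeyboardInterrupt

    def __exit__(self, exc_type, exc, tb):
        # Restore the old handler when Python knows it (None means it was
        # installed outside Python and cannot be restored from here).
        if self._previous is not None:
            signal.signal(signal.SIGINT, self._previous)
        # A deferred interrupt is delivered at the latest when the block ends.
        if exc_type is None:
            self.check()
        return False


def sum_blocks(blocks, guard):
    """Process blocks one at a time, polling for a deferred interrupt between them."""
    total = 0
    for block in blocks:
        total += sum(block)  # stands in for one uninterruptible chunk of work
        guard.check()        # safe place: state is consistent here
    return total
```

How often to poll is a trade-off between responsiveness to Ctrl-C and the cost of the checks.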

* Remove unnecessary sigatomic
* Make flisp calls sigatomic
* Make type inference calls sigatomic
* Refactor interthread communication through signal
* Make sure `sleep` is aborted on `SIGINT` on Linux to deliver the exception faster
* Implement force signal throwing when `SIGINT` arrives too frequently
* Hack to abort io syscall on `SIGINT`

Fix #1468; Fix #2622; Towards #14675
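The item about forcing the signal when `SIGINT` arrives too frequently can be illustrated with a small Python sketch. The commit message gives no numbers, so the threshold of three interrupts within one second is purely an assumption, as are the `ForcingSigintHandler` name and its parameters. A single Ctrl-C is recorded as pending for later delivery, while repeated presses within the window raise immediately.

```python
import signal
import time


class ForcingSigintHandler:
    """Defer a single SIGINT, but raise at once if they arrive in a burst."""

    def __init__(self, threshold=3, window=1.0, clock=time.monotonic):
        self.threshold = threshold  # assumed value, not taken from the commit
        self.window = window        # seconds; also an assumption
        self.clock = clock
        self.pending = False
        self._stamps = []

    def __call__(self, signum, frame):
        now = self.clock()
        # Keep only the interrupts that fall inside the sliding window.
        self._stamps = [t for t in self._stamps if now - t < self.window]
        self._stamps.append(now)
        if len(self._stamps) >= self.threshold:
            # The user is insisting: give up on deferral and throw right away.
            self._stamps.clear()
            self.pending = False
            raise KeyboardInterrupt("forced after repeated SIGINT")
        # Otherwise just note the interrupt for delivery at a safe place.
        self.pending = True


# Usage (main thread only):
# signal.signal(signal.SIGINT, ForcingSigintHandler())
```

The injectable `clock` is there only so the behaviour can be exercised without real timing.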

JeffBezanson mentioned this issue Mar 20, 2013

make ccall sigatomic (defer SIGINT handling) #2622

Closed

JeffBezanson mentioned this issue Oct 20, 2013

REPL segfaults when interrupted many times #4591

Closed

ViralBShah added the mac label Apr 27, 2014

Keno self-assigned this May 20, 2014

Keno removed their assignment Jun 12, 2014

jiahao mentioned this issue Feb 1, 2015

Segmentation fault pressing Ctrl-C during Pkg.update() #9362

Closed

yuyichao mentioned this issue May 21, 2015

Segfault on break #11382

Closed

yuyichao mentioned this issue May 3, 2016

Use safepoint to deliver SIGINT #16174

Merged

vtjnash closed this as completed in #16174 May 6, 2016
