Segfault when aborting matrix multiplication #1468

Closed
andreasnoack opened this issue Oct 29, 2012 · 34 comments
Labels
bug Indicates an unexpected problem or unintended behavior system:mac Affects only macOS

Comments

@andreasnoack
Member

I get this systematically

julia> mA=randn(5000,5000);mB=randn(5000,5000);

julia> mA*mB
^CSegmentation fault: 11

when I try to abort the computation. Julia runs in the terminal on a MacBook Pro with Mountain Lion. Shouldn't it be okay to abort a matrix multiplication like this or are crashes expected when aborting?

@staticfloat
Member

I get this:

julia> mA*mB
^C
 in gemm! at blas.jl:267
 in gemm_wrapper at matmul.jl:276
 in gemm_wrapper at matmul.jl:265
 in * at matmul.jl:84

julia> Segmentation fault: 11

And no, I do not believe this should be happening. I seem to remember Stefan saying "Any segfault is automatically a bug". :)

@ViralBShah
Member

During that call, control is actually with OpenBLAS. Typically, you would intercept the signal and do cleanup in the handler. However, many native libraries are not designed with this kind of interactive usage in mind. It would still be nice not to have a segfault, if possible.

@JeffBezanson
Member

Short answer: just use linux :-P

We can look into it, but it's unlikely we'll be able to make Ctrl-C work perfectly in every case. Once an async signal happens, the process is technically in an ill-defined state.

@staticfloat
Member

Does OpenBLAS reset signal handlers on function calls? If not, shouldn't Julia's signal handlers be getting called no matter where execution is at the time of the signal?

Also, the segfault doesn't always happen immediately for me. For instance, in the example above, Julia was able to print the stack trace and show another prompt before segfaulting. My guess is this is a memory-freeing bug triggered by the gc. Some further digging reveals:

julia> mA=randn(5000,5000);mB=randn(5000,5000);

julia> mA*mB
^C

julia> 

julia> 2+2
4

julia> gc()
Segmentation fault: 11

@staticfloat
Member

I managed to capture this in gdb:

julia> mA=randn(5000,5000);mB=randn(5000,5000);

julia> mA*mB
^C
Program received signal SIGINT, Interrupt.
0x0000000102e925f3 in .L12 ()
(gdb) signal SIGINT
Continuing with signal SIGINT.


julia> 
Program received signal EXC_BAD_ACCESS, Could not access memory.
Reason: KERN_INVALID_ADDRESS at address: 0x000000012d43e5e0
[Switching to process 38681 thread 0x1203]
0x0000000102e92895 in .L18 ()
(gdb) bt
#0  0x0000000102e92895 in .L18 ()
#1  0x00000000000004e2 in ?? ()
(gdb) info threads
  3                         0x00007fff8e5ac0fa in __psynch_cvwait ()
* 2                         0x0000000102e92895 in .L18 ()
  1 "com.apple.main-thread" 0x00007fff8e5ac322 in select$DARWIN_EXTSN ()
(gdb) thread 3
[Switching to thread 3 (process 38681)]
0x00007fff8e5ac0fa in __psynch_cvwait ()
(gdb) bt
#0  0x00007fff8e5ac0fa in __psynch_cvwait ()
#1  0x00007fff92de7f89 in _pthread_cond_wait ()
#2  0x0000000100082d7b in run_io_thr ()
#3  0x00007fff92de3742 in _pthread_start ()
#4  0x00007fff92dd0181 in thread_start ()
(gdb) thread 1
[Switching to thread 1 (process 38681), "com.apple.main-thread"]
0x00007fff8e5ac322 in select$DARWIN_EXTSN ()
(gdb) bt
#0  0x00007fff8e5ac322 in select$DARWIN_EXTSN ()
#1  0x00007fff8e5ab20f in select ()
#2  0x0000000101d260eb in ?? ()
#3  0x0000000101d25987 in ?? ()
#4  0x0000000100049c75 in jl_apply_generic ()
#5  0x0000000101d25255 in ?? ()
#6  0x000000010008480e in start_task ()
#7  0x00000001000839ca in julia_trampoline ()
#8  0x0000000100001b7b in main ()
(gdb) 

It paused at the julia prompt for a couple seconds (I was going to type gc() but it crashed before I could), so hopefully these stack traces are more focused on the issue at hand.

@andreasnoack
Member Author

I have to push a little harder on my Ubuntu machine, but it can do similar tricks:

julia> mA=randn(5000,5000);mB=randn(5000,5000);

julia> mA*mB
^C

julia> mA*mB
^C

julia> mA*mB
BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
Segmentation fault (core dumped)

It is not a big deal for me, but I did not recall such behaviour when working with R and thought there might be a fix.

@ViralBShah
Member

In MATLAB, you don't typically get segfaults when you do such stuff. Usually, MATLAB will not let you do a Ctrl-C when it is running native code. Not sure how R handles all this gracefully.

@StefanKarpinski
Member

We should definitely take a look at what R does. They may have some good tricks to copy.

@staticfloat
Member

I guess I don't understand why CTRL^C'ing in OpenBLAS code should give us problems vs. being in Julia code when we CTRL^C.

> Usually, matlab will not let you do a ctrl-c when it is in running native code

MATLAB queues the CTRL^C until you're back in MATLAB code, and then breaks.

@JeffBezanson
Member

We have the ability to defer SIGINT around blocks of code, and it is used in some places in the run time system. We could expose this as a macro. It would be silly to use it around every ccall though.
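
For readers on a current Julia release, a facility along these lines is available as `disable_sigint` in Base. The following is a minimal sketch under that assumption, not the mechanism that existed when this comment was written. The helper name `uninterruptible_matmul` is hypothetical. As far as I understand it, an interrupt that arrives inside the block is held back and raised once the block exits.

```julia
using LinearAlgebra

# Hypothetical helper: run a BLAS-backed product with Ctrl-C deferred
# on the current task, so the interrupt is not thrown in the middle of the call.
function uninterruptible_matmul(A::Matrix{Float64}, B::Matrix{Float64})
    C = similar(A, size(A, 1), size(B, 2))
    disable_sigint() do
        mul!(C, A, B)   # in-place product; uses BLAS gemm for dense Float64 matrices
    end
    return C
end

A = randn(200, 200); B = randn(200, 200)
C = uninterruptible_matmul(A, B)
```

The trade-off is the one raised in this thread: the call can no longer be aborted part-way, so a long multiplication runs to completion before the prompt comes back.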

@staticfloat
Member

I thought it was pretty standard procedure to have signal handlers set up at all times, so that no matter what happens signal-wise you aren't either interpreting signals as SIGTERM when you don't want to, or even worse, ignoring signals that should be having an effect. Having to wrap something around every ccall sounds like the wrong way to do things, but shouldn't Julia provide a signal handler that at the very least gracefully exits on signals? (Again, I don't think I'm understanding why you would segfault if the SIGINT happens during OpenBLAS execution vs. not segfaulting during Julia execution.)

@JeffBezanson
Member

We do have handlers set up, we just defer the handling in some cases so that certain actions are atomic. But in any case it is probably still possible for SIGINT to leave the process in an inconsistent state.

I just tried this example on Linux, and with OpenBLAS threads I got a segfault. After setting OPENBLAS_NUM_THREADS=1 the segfault went away. This is 100% reproducible for me: threads = segfault, 1 thread = works. So something in how the child threads are handled seems to be implicated. Do others see the same thing?
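
The single-thread setting can also be applied from inside a session instead of through the environment variable. This is a minimal sketch, assuming a current Julia release where `BLAS.set_num_threads` lives in the `LinearAlgebra` standard library (the module layout was different at the time of this thread). The matrix size is arbitrary.

```julia
using LinearAlgebra

# Restrict the BLAS library to a single thread for the rest of this session.
BLAS.set_num_threads(1)

A = randn(300, 300); B = randn(300, 300)
C = A * B   # this product now runs without BLAS worker threads
```

This only sidesteps the multithreaded case reported here. It says nothing about whether an interrupt during the call is safe in general.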

@andreasnoack
Member Author

I see the same thing on my Mac. Single threaded works.

@vtjnash
Member

vtjnash commented Oct 30, 2012

From your comments, Jeff, it sounds like the openblas child threads are continuing happily along on whatever they were doing, unaware that the SIGINT has jumped execution out of openblas control (and that its temporary arrays will soon be GC'd). Unless there is a way to gracefully (or not) tell openblas to abort everything, perhaps calls to blas functions should be atomic to interruption by signals?

@ViralBShah
Member

Cc: @xianyi

@xianyi

xianyi commented Nov 1, 2012

Hi,

I just tested multi-threaded DGEMM in C. The Ctrl-C works fine.

I think we should build debug versions of Julia and OpenBLAS to narrow down this issue.

Xianyi

@andreasnoack
Member Author

I tried Ctrl-C with Julia linked to MKL instead of OpenBLAS and the result is the same: a segfault with multithreading and none with a single thread, so it does not seem to be specific to OpenBLAS.

@staticfloat
Member

I tried this with Accelerate, and got the same result as well, and it seems my suspicions about a memory-freeing problem have a little more substance to them. Nothing bad happened until I manually called gc(), and check out the backtrace on thread #1:

julia> mA=randn(5000,5000);mB=randn(5000,5000);

julia> mA*mB
^C
Program received signal SIGINT, Interrupt.
0x0000000104e7cd72 in dgebpAlignedC_4M4N_SSE ()
(gdb) signal SIGINT
Continuing with signal SIGINT.

 in gemm! at blas.jl:267
 in gemm_wrapper at matmul.jl:276
 in gemm_wrapper at matmul.jl:265
 in * at matmul.jl:84

julia> 

julia> 2+2
4

julia> gc()

Program received signal EXC_BAD_ACCESS, Could not access memory.
Reason: KERN_INVALID_ADDRESS at address: 0x0000000121820cd0
[Switching to process 55160 thread 0x1303]
0x0000000104e7ce35 in dgebpAlignedC_4M4N_SSE ()
(gdb) info threads
* 3 "com.apple.root.default-priorit" 0x0000000104e7ce35 in dgebpAlignedC_4M4N_SSE ()
  2                                  0x00000001011190fa in __psynch_cvwait ()
  1 "com.apple.main-thread"          0x0000000101118fc6 in munmap ()
(gdb) bt
#0  0x0000000104e7ce35 in dgebpAlignedC_4M4N_SSE ()
#1  0x0000000104e7d82d in dgemm_repack_Aligned ()
#2  0x0000000104e7b431 in __APL_dgemm_block_invoke_0 ()
#3  0x0000000100f34f01 in _dispatch_call_block_and_release ()
#4  0x0000000100f310b6 in _dispatch_client_callout ()
#5  0x0000000100f321fa in _dispatch_worker_thread2 ()
#6  0x0000000100fb2cab in _pthread_wqthread ()
#7  0x0000000100f9d171 in start_wqthread ()
(gdb) thread 2
[Switching to thread 2 (process 55160)]
0x00000001011190fa in __psynch_cvwait ()
(gdb) bt
#0  0x00000001011190fa in __psynch_cvwait ()
#1  0x0000000100fb4f89 in _pthread_cond_wait ()
#2  0x00000001000e687b in run_io_thr ()
#3  0x0000000100fb0742 in _pthread_start ()
#4  0x0000000100f9d181 in thread_start ()
(gdb) thread 1
[Switching to thread 1 (process 55160), "com.apple.main-thread"]
0x0000000101118fc6 in munmap ()
(gdb) bt
#0  0x0000000101118fc6 in munmap ()
#1  0x0000000100fd15de in deallocate_pages ()
#2  0x0000000100fd2209 in free_large ()
#3  0x0000000100fc9898 in free ()
#4  0x00000001000ef55e in jl_gc_collect ()
#5  0x00000001057dcab0 in ?? ()
#6  0x00000001000ad775 in jl_apply_generic ()
#7  0x00000001000e3624 in do_call ()
#8  0x00000001000e25b9 in eval ()
#9  0x00000001000edb64 in jl_toplevel_eval_flex ()
#10 0x00000001000b2407 in jl_f_top_eval ()
#11 0x00000001057d60ec in ?? ()
#12 0x00000001000ad775 in jl_apply_generic ()
#13 0x000000010578dafd in ?? ()
#14 0x000000010578d76d in ?? ()
#15 0x00000001000ad775 in jl_apply_generic ()
#16 0x0000000100001e97 in true_main ()
#17 0x00000001000e74d3 in julia_trampoline ()
#18 0x0000000100002279 in main ()

@JeffBezanson
Member

Makes sense. We need a way to interrupt the work threads, or have them see and appropriately respond to SIGINT.

@ViralBShah
Member

Would setting a SIGINT handler in the openblas threads be sufficient to fix this?

Alternatively, when the REPL gets interrupted, can it just cancel all pthreads launched by any julia processes, given that Julia itself does not use any threads? It could end up leaking memory, but not dying would be nice.

Cc: @tanmaykm @amitmurthy

@Keno
Member

Keno commented May 15, 2014

BUMP, this is really annoying.

@vtjnash
Member

vtjnash commented May 15, 2014

I'm pretty sure the OpenBLAS pthreads are supposed to sleep, not die, after finishing.

Keno self-assigned this May 20, 2014
@Keno
Member

Keno commented May 20, 2014

I'm not sure what to do here. On the one hand you might actually want to interrupt the matrix multiply, which we could probably do for OpenBLAS, but there's also Accelerate, MKL, etc. for which we would not be able to do that. Perhaps the best thing to do is to defer the signal until after the BLAS call and deliver it then.

@JeffBezanson
Member

I'd be happy if it worked well with openblas. Deferring the signal is not such a great option since you'd really like to be able to interrupt the call. It also entails extra complexity and overhead.

@ViralBShah
Member

Often the reason you want to deliver it in the middle of the blas call is that you have started something that you realize later will take too long or too much memory. Even if the OpenBLAS case is well behaved, that is worthwhile. In general we want to be able to interrupt any ccall.

@Keno
Member

Keno commented May 21, 2014

The other problem is that OpenBLAS actually allocates memory in some functions which we would leak if we interrupted it.

@ViralBShah
Member

I think that would be OK. Perhaps we can print a warning.

@Keno
Member

Keno commented May 21, 2014

I'm not sure people would like 40MB of memory leaked every time they press ^C

@Keno
Member

Keno commented May 21, 2014

Hmm, I have a devious idea. Let's see how it works out.

@ivarne
Member

ivarne commented May 21, 2014

For other ccalls it could be significantly more than 40 MB. There would also be an issue with other system resources that might be more scarce than memory.

@Keno
Member

Keno commented May 21, 2014

It is quite clear that there is no way to do this generally for all C libraries. However, I think I can do the special case of OpenBLAS with a little system-level hackery.

@ivarne
Member

ivarne commented May 21, 2014

See also #2622

@Keno
Member

Keno commented May 26, 2014

I filed issue OpenMathLib/OpenBLAS#378 against OpenBLAS. As mentioned in that issue, I tried a very ugly hack, but I don't really like it, so I outlined some possible courses of action there.

Keno removed their assignment Jun 12, 2014
@stevengj
Member

As I argued in #2622, and Jeff eventually agreed, the only sane default is to defer sigint in every ccall. (We can possibly hack specific things like openblas to either re-enable sigint for those calls or to periodically check manually for deferred sigints at safe places.)
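
The idea of checking for a deferred interrupt at safe places can be illustrated at the Julia level. This is only a sketch of the pattern, not how OpenBLAS or the Julia runtime implements it. The flag `cancel_requested` and the function `blocked_matmul` are hypothetical names, and a real handler would be responsible for setting the flag.

```julia
using LinearAlgebra

# Hypothetical flag that an interrupt handler (or another task) would set.
const cancel_requested = Threads.Atomic{Bool}(false)

# Multiply in column blocks and poll the flag between BLAS calls, so that
# cancellation only ever happens at a point where no BLAS call is running.
function blocked_matmul(A::Matrix{Float64}, B::Matrix{Float64}; blocksize::Int = 64)
    C = zeros(size(A, 1), size(B, 2))
    for j in 1:blocksize:size(B, 2)
        # Safe point: no native call is in flight here.
        cancel_requested[] && throw(InterruptException())
        cols = j:min(j + blocksize - 1, size(B, 2))
        mul!(view(C, :, cols), A, view(B, :, cols))
    end
    return C
end

A = randn(100, 150); B = randn(150, 130)
C = blocked_matmul(A, B)
```

Smaller blocks give faster response to a cancel request at the cost of more, smaller BLAS calls.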

yuyichao added a commit that referenced this issue May 3, 2016
* Remove unnecessary sigatomic
* Make flisp calls sigatomic
* Make type inference calls sigatomic
* Refactor interthread communication through signal
* Make sure `sleep` is aborted on `SIGINT` on Linux to deliver the exception faster
* Implement force signal throwing when `SIGINT` arrives too frequently
* Hack to abort io syscall on `SIGINT`

Fix #1468; Fix #2622; Towards #14675
yuyichao added 7 more commits with the same message that referenced this issue between May 4 and May 6, 2016