
Thread blocked indefinitely #48

Closed
simonmar opened this issue May 15, 2012 · 17 comments
Assignees
Labels
cuda backend [deprecated]

Comments

@simonmar

When using the CUDA backend, but not the interpreter, it is easy to get a "thread blocked indefinitely on MVar" exception by having one GPU computation depend on another. I presume this is due to the use of withMVar in run, so I worked around it with some seqs. Is that right?

This seems like a bug, since if something works with the interpreter we would expect it to work with the CUDA backend. I can imagine it would be difficult to fix though. Could it be documented somewhere?
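A minimal sketch of the kind of dependency that triggers it (placeholder types and operations, not the actual program), together with the seq workaround:

```haskell
import Data.Array.Accelerate        as A
import Data.Array.Accelerate.CUDA   as CUDA

-- Two dependent GPU computations: the result of the first 'run' is fed
-- back in via 'use'.  With the CUDA backend the second 'run' takes the
-- context MVar and only then forces the (still unevaluated) first result,
-- which needs the same MVar: "thread blocked indefinitely on an MVar".
deadlocks :: Vector Float -> Vector Float
deadlocks xs =
  let ys = CUDA.run (A.map (+1) (A.use xs))
  in  CUDA.run (A.map (*2) (A.use ys))

-- Workaround: force the intermediate result first, so the first 'run'
-- has finished and released the context before the second one starts.
works :: Vector Float -> Vector Float
works xs =
  let ys = CUDA.run (A.map (+1) (A.use xs))
  in  ys `seq` CUDA.run (A.map (*2) (A.use ys))
```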

@rrnewton
Member

Actually, I have a related question on this same code that I might as well ask here. The thread that puts the MVar after the GPU results are available is forked with forkIO here:

https://github.com/AccelerateHS/accelerate-cuda/blob/2320705ee4e1bda2565117810ef6b4e5e702ca48/Data/Array/Accelerate/CUDA.hs#L224

But doesn't that need to be a forkOS so that the GHC capability isn't stalled by the blocking CUDA call, interfering with other IO threads?

I think Accelerate is not passing cuCtxCreate any flags right now, correct? That means we get the default blocking/spinning behaviour, CU_CTX_SCHED_AUTO:

http://developer.download.nvidia.com/compute/cuda/4_2/rel/toolkit/docs/online/group__CUDA__CTX_g65dc0012348bc84810e2103a40d8e2cf.html#g65dc0012348bc84810e2103a40d8e2cf

In practice that means spinning in most cases. However, I think spinning in a foreign function is just as bad as blocking as far as the GHC RTS is concerned, right?

Personally, I was hoping we could make CU_CTX_SCHED_BLOCKING_SYNC the Accelerate default so as to be gentle on CPU resources. It's a bit solipsistic for NVidia to make spinning the default -- waste power and screw up whatever the CPU is trying to do because, obviously, the GPU computation is the only thing you should care about ;-).
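Roughly what I have in mind, as a sketch only (I'm assuming the Haskell cuda bindings expose the driver flags more or less like this; the exact names are a guess):

```haskell
import qualified Foreign.CUDA.Driver as CUDA

-- Sketch: create a context that blocks rather than spins while waiting
-- for the GPU.  'SchedBlockingSync' is assumed to mirror the driver API
-- flag CU_CTX_SCHED_BLOCKING_SYNC.
withBlockingContext :: (CUDA.Context -> IO a) -> IO a
withBlockingContext action = do
  CUDA.initialise []
  dev <- CUDA.device 0
  ctx <- CUDA.create dev [CUDA.SchedBlockingSync]
  r   <- action ctx
  CUDA.destroy ctx
  return r
```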

@simonmar
Author

forkIO vs. forkOS makes no difference to blocking; it only affects which OS thread the foreign call is made in. As long as the foreign call is marked "safe", it won't block other Haskell threads (provided you use -threaded).
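For example (a sketch, not Accelerate's actual bindings; link against libcuda):

```haskell
{-# LANGUAGE ForeignFunctionInterface #-}

import Control.Concurrent (forkIO)
import Foreign.C.Types    (CInt(..))

-- A potentially long-blocking driver call imported as "safe": with
-- -threaded, the RTS hands the capability to another OS thread while
-- this call blocks, so other Haskell threads keep running.
foreign import ccall safe "cuCtxSynchronize"
  cuCtxSynchronize :: IO CInt

example :: IO ()
example = do
  _ <- forkIO $ do
         _ <- cuCtxSynchronize     -- blocks its OS thread, not the RTS
         return ()
  putStrLn "other Haskell threads still make progress"
```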

@rrnewton
Member

Ok, just to make sure I'm following --

If we compile with -threaded, and run with +RTS -N1, and then forkIO ten threads, one of which does a blocking CUDA call that blocks the hosting pthread for, say, a week, the other nine IO threads will have a chance to run in the intervening week, right?
A helpful passage appears here:

"although during the course of running the program more OS threads might be 
 created in order to continue running Haskell code while foreign calls execute"

Ah, so extra OS threads are forked on demand! Very nice. But from that Wiki I don't yet understand when these are forked. It would seem every "safe" foreign call from an IO thread can result in an OS thread being created? For example if we forkIO 10K threads, and do 10K foreign calls, we can end up with 10K OS threads, irrespective of +RTS -N, right?

Is it fair to say that blocking foreign calls should be marked as "safe"? This wiki makes it sound as if safe/unsafe is just a question of whether the foreign function calls back into Haskell. But if, again, a foreign call blocks for a week on CUDA, even if it doesn't call back into Haskell, we want it on its own OS thread...

It looks like "waitForReturnCapability" is for returning safe FFI calls to get back in:
https://github.com/ghc/ghc/blob/1dbe6d59b621ab9bd836241d633b3a8d99812cb3/rts/Capability.c#L579
I didn't see where OS threads get forked.. is that in the compiler generated code sequence for a safe foreign call?

@rrnewton
Member

P.S. It looks like 100% of the foreign decls in the cuda package are marked as "unsafe".

This would need to be changed in a couple places to get the behavior Simon describes, right?

@tmcdonell
Member

@simonmar okay, thanks! I have added some notes to the documentation about why this happens. 'seq' is definitely one way to avoid it.

@tmcdonell
Member

@rrnewton the default context does not pass any context creation flags, so yes, it would just pick up the CU_CTX_SCHED_AUTO spinning behaviour. Changing to CU_CTX_SCHED_BLOCKING_SYNC would be good, although I think some explicit synchronisation points will need to be added in the execute phase --- I'm pretty sure I know how to do that now, however.

From memory, 100% of the foreign calls in the cuda package are marked as "unsafe". The CUDA documentation is a bit confusing; I may have been under the impression that functions such as launchKernel and (more obviously) memcpyAsync always return immediately, in which case it wouldn't matter that the foreign calls are unsafe. But maybe that is only valid for the runtime API and not the driver API, since we can give the latter different behaviour flags at context creation. I'm no longer sure.

I've no objection to changing the foreign calls to "safe".

@simonmar
Author

On 16/05/2012 03:28, Ryan Newton wrote:

Ok, just to make sure I'm following --

If we compile with -threaded, and run with +RTS -N1, and then forkIO
ten threads, one of which does a blocking CUDA call which blocks the
hosting pthread for, say, a week, the other nine IO threads will
have a chance to run in the intervening week, right?

Correct. The best docs for this are in the GHC users guide:

http://www.haskell.org/ghc/docs/latest/html/users_guide/ffi-ghc.html#ffi-threads

Ah, so extra OS threads are forked on demand! Very nice. But from
that Wiki I don't yet understand when these are forked. It would
seem every "safe" foreign call from an IO thread can result in an OS
thread being created? For example if we forkIO 10K threads, and do
10K foreign calls, we can end up with 10K OS threads, irrespective of
+RTS -N, right?

Also correct. Every blocked FFI call needs a separate OS thread, and we
also need at least one OS thread per capability in the RTS.

More background here:
http://community.haskell.org/~simonmar/papers/conc-ffi.pdf

This is why we have the IO manager: it handles the majority of blocking
I/O with a single OS thread.

Is it fair to say that blocking foreign calls should be marked as
"safe"?

Absolutely.

It looks like "waitForReturnCapability" is for returning safe FFI
calls to get back in:
https://github.com/ghc/ghc/blob/1dbe6d59b621ab9bd836241d633b3a8d99812cb3/rts/Capability.c#L579

I didn't see where OS threads get forked.. is that in the compiler
generated code sequence for a safe foreign call?

rts/Schedule.c:suspendThread() is called by the Haskell code right
before a safe foreign call; it calls releaseCapability_(), which spawns a
new OS thread if necessary.

Cheers,
Simon
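A small, self-contained demonstration of the on-demand behaviour (POSIX sleep standing in for a blocking CUDA call; compile with -threaded and run with +RTS -N1):

```haskell
{-# LANGUAGE ForeignFunctionInterface #-}

import Control.Concurrent
import Control.Monad      (forM_)
import Foreign.C.Types    (CUInt(..))

-- sleep(3) as a stand-in for a blocking foreign call.
foreign import ccall safe "unistd.h sleep"
  c_sleep :: CUInt -> IO CUInt

-- Each blocked safe call gets its own OS thread, so even with a single
-- capability all ten workers finish after about one second, not ten.
main :: IO ()
main = do
  done <- newMVar (0 :: Int)
  forM_ [1 .. 10 :: Int] $ \_ -> forkIO $ do
    _ <- c_sleep 1
    modifyMVar_ done (return . (+1))
  threadDelay 1500000               -- 1.5 seconds
  readMVar done >>= print           -- expect 10
```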

@simonmar
Author

On 16/05/2012 04:23, Trevor L. McDonell wrote:

I've no objection to changing the foreign calls to "safe".

Best would be to mark just the long-running ones as safe and leave the
short-running ones as unsafe, since a safe call has quite a high
overhead.

Cheers,
Simon
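Concretely, the split would look something like this (the particular driver functions here are only illustrative):

```haskell
{-# LANGUAGE ForeignFunctionInterface #-}

import Foreign.C.Types (CInt(..))
import Foreign.Ptr     (Ptr)

-- Long-running, potentially blocking calls: "safe", so the RTS can keep
-- running other Haskell threads while they block.
foreign import ccall safe   "cuCtxSynchronize"
  cuCtxSynchronize :: IO CInt

-- Short, non-blocking queries: keep "unsafe" to avoid paying the
-- safe-call overhead on every invocation.
foreign import ccall unsafe "cuDeviceGetCount"
  cuDeviceGetCount :: Ptr CInt -> IO CInt
```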

@rrnewton
Member

I'm finding it difficult to dig up information on exactly how the CUDA driver interacts with the OS (e.g. how does it block and wake pthreads?).

http://forums.nvidia.com/index.php?showtopic=177417

@tmcdonell
Member

@rrnewton notes:

For reference, here's a link to the new bit of doc:

https://github.com/AccelerateHS/accelerate-cuda/blob/4575f474fa85d15bf48a88a4abb59ec62992d1cf/Data/Array/Accelerate/CUDA.hs#L85

Would it be a solution to walk the AST (during or before convertAcc) and force all the input arrays, before doing anything with the context?

Wouldn't that be a built-in equivalent of the seq Simon uses above? The Accelerate computation is strict in all the arrays that get used, so I don't think this approach would force anything that isn't due to be forced anyway.

Shall we keep this issue open until a fix for the underlying issue is found? As Simon pointed out, behavior should not differ from the interpreter.

If we make the right things strict in the library, the user shouldn't have to deal with adding seq themselves.
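For instance, a strict variant of 'use' along these lines (hypothetical helper, assuming the Arrays-polymorphic type of 'use') would complete any pending 'run' that produced the argument before the next computation takes the context:

```haskell
import Data.Array.Accelerate (Acc, Arrays, use)

-- Hypothetical strict wrapper: forcing the argument to WHNF completes
-- any pending 'run' that produced it, before the next computation grabs
-- the CUDA context MVar.
useStrict :: Arrays a => a -> Acc a
useStrict arr = arr `seq` use arr
```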

@rrnewton
Member

This was an issue that cropped up despite single-threaded use of the API.

What are the issues with multi-threaded use of the API? Won't two threads calling "run" both grab the default context (the same context), leading to the same issue? Or is there something I'm forgetting?

@tmcdonell
Member

If two threads call run with independent computations, then yes, they will be serialised when they try to grab the default context. But that would have happened anyway because the GPU can only process work from a single thread at a time (although it looks like the GK110, released yesterday, lifts this restriction).

So if you have a multi-threaded application, you probably want to use the runIn form and have each thread supply its own context referring to a distinct device (possible lurking bug: driver contexts hold thread-local state?).

The issue here was that the computations were dependent, and the second had already taken the context before realising the first was still to be evaluated --- deadlock.

It might be enough to make 'use' in the front-end library strict in its argument. But then again, I'm not certain the context needs an MVar around it anyway.

Aside: it always calls forkIO on the computation; otherwise, I found, the finaliser threads that free device memory didn't get a chance to run.
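The multi-device pattern would be roughly this; treat it as pseudocode, since the exact context type that runIn expects (and the creation flags) are assumptions here:

```haskell
import Data.Array.Accelerate         (Acc, Arrays)
import Data.Array.Accelerate.CUDA    (runIn)
import qualified Foreign.CUDA.Driver as CUDA

-- Sketch: run a computation in its own context on a given device.
-- Each application thread would call this with a distinct device
-- number.  Assumes CUDA.initialise [] has already been called.
runOnDevice :: Arrays a => Int -> Acc a -> IO a
runOnDevice n acc = do
  dev <- CUDA.device n
  ctx <- CUDA.create dev []
  let r = runIn ctx acc
  r `seq` CUDA.destroy ctx    -- force the result before tearing down the context
  return r
```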

@mchakravarty
Member

@tmcdonell Yes, why have an MVar around the context? Without the MVar, all this wouldn't be an issue, or would it?

@ghost ghost assigned tmcdonell May 23, 2012
@tmcdonell
Member

I mangled the commit tag, but here it is: AccelerateHS/accelerate-cuda@9f019ce

Thinking about why I did this: what I really wanted was exclusive access to the device, to avoid context switching (I recall trying that once and it was much slower than executing the two programs sequentially). But actually, using a single context from different points is fine.

@mchakravarty
Member

Ok, great, but can't we close the issue in this case? Or is there an outstanding problem?

@tmcdonell
Member

The 'thread blocked' issue is resolved. I've created a new issue for what Ryan mentions above regarding context synchronisation behaviour.

@mchakravarty
Member

Thanks.
