
Thread blocked indefinitely #48

Closed
simonmar opened this issue May 15, 2012 · 17 comments
Assignees
Labels
cuda backend [deprecated]

Comments

@simonmar

When using the CUDA backend, but not the interpreter, it is easy to get a "thread blocked indefinitely on MVar" exception by having one GPU computation depend on another. I presume this is due to the use of withMVar in run, so I worked around it with some seqs. Is that right?

This seems like a bug, since if something works with the interpreter we would expect it to work with the CUDA backend. I can imagine it would be difficult to fix though. Could it be documented somewhere?
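A minimal sketch of the kind of dependency that triggers it (placeholder types and operations, not the actual program), together with the seq workaround:

```haskell
import Data.Array.Accelerate        as A
import Data.Array.Accelerate.CUDA   as CUDA

-- Two dependent GPU computations: the result of the first 'run' is fed
-- back in via 'use'.  With the CUDA backend the second 'run' takes the
-- context MVar and only then forces the (still unevaluated) first result,
-- which needs the same MVar: "thread blocked indefinitely on an MVar".
deadlocks :: Vector Float -> Vector Float
deadlocks xs =
  let ys = CUDA.run (A.map (+1) (A.use xs))
  in  CUDA.run (A.map (*2) (A.use ys))

-- Workaround: force the intermediate result first, so the first 'run'
-- has finished and released the context before the second one starts.
works :: Vector Float -> Vector Float
works xs =
  let ys = CUDA.run (A.map (+1) (A.use xs))
  in  ys `seq` CUDA.run (A.map (*2) (A.use ys))
```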

@rrnewton
Member

Actually, I have a related question on this same code that I might as well ask here. The thread that puts the MVar after the GPU results are available is forked with forkIO here:

https://github.com/AccelerateHS/accelerate-cuda/blob/2320705ee4e1bda2565117810ef6b4e5e702ca48/Data/Array/Accelerate/CUDA.hs#L224

But doesn't that need to be a forkOS so that the GHC capability isn't stalled by the blocking CUDA call, interfering with other IO threads?

I think Accelerate is not passing cuCtxCreate any flags right now, correct? That means we get the default blocking/spinning behaviour, CU_CTX_SCHED_AUTO:

http://developer.download.nvidia.com/compute/cuda/4_2/rel/toolkit/docs/online/group__CUDA__CTX_g65dc0012348bc84810e2103a40d8e2cf.html#g65dc0012348bc84810e2103a40d8e2cf

In practice that means spinning in most cases. However, I think spinning in a foreign function is just as bad as blocking as far as the GHC RTS is concerned, right?

Personally, I was hoping we could make CU_CTX_SCHED_BLOCKING_SYNC the Accelerate default so as to be gentle on CPU resources. It's a bit solipsistic for NVidia to make spinning the default -- waste power and screw up whatever the CPU is trying to do because, obviously, the GPU computation is the only thing you should care about ;-).
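Roughly what I have in mind, as a sketch only (I'm assuming the Haskell cuda bindings expose the driver flags more or less like this; the exact names are a guess):

```haskell
import qualified Foreign.CUDA.Driver as CUDA

-- Sketch: create a context that blocks rather than spins while waiting
-- for the GPU.  'SchedBlockingSync' is assumed to mirror the driver API
-- flag CU_CTX_SCHED_BLOCKING_SYNC.
withBlockingContext :: (CUDA.Context -> IO a) -> IO a
withBlockingContext action = do
  CUDA.initialise []
  dev <- CUDA.device 0
  ctx <- CUDA.create dev [CUDA.SchedBlockingSync]
  r   <- action ctx
  CUDA.destroy ctx
  return r
```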

@simonmar
Author

forkIO vs. forkOS makes no difference to blocking; it only affects which OS thread the foreign call is made in. As long as the foreign call is marked "safe", it won't block other Haskell threads (provided you use -threaded).
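For example (a sketch, not Accelerate's actual bindings; link against libcuda):

```haskell
{-# LANGUAGE ForeignFunctionInterface #-}

import Control.Concurrent (forkIO)
import Foreign.C.Types    (CInt(..))

-- A potentially long-blocking driver call imported as "safe": with
-- -threaded, the RTS hands the capability to another OS thread while
-- this call blocks, so other Haskell threads keep running.
foreign import ccall safe "cuCtxSynchronize"
  cuCtxSynchronize :: IO CInt

example :: IO ()
example = do
  _ <- forkIO $ do
         _ <- cuCtxSynchronize     -- blocks its OS thread, not the RTS
         return ()
  putStrLn "other Haskell threads still make progress"
```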

@rrnewton
Member

Ok, just to make sure I'm following --

If we compile with -threaded, and run with +RTS -N1, and then forkIO ten threads, one of which does a blocking CUDA call that blocks the hosting pthread for, say, a week, the other nine IO threads will have a chance to run in the intervening week, right?
A helpful passage appears here:

"although during the course of running the program more OS threads might be 
 created in order to continue running Haskell code while foreign calls execute"

Ah, so extra OS threads are forked on demand! Very nice. But from that Wiki I don't yet understand when these are forked. It would seem every "safe" foreign call from an IO thread can result in an OS thread being created? For example if we forkIO 10K threads, and do 10K foreign calls, we can end up with 10K OS threads, irrespective of +RTS -N, right?

Is it fair to say that blocking foreign calls should be marked as "safe"? This wiki makes it sound as if safe/unsafe is just a question of whether the foreign function calls back into Haskell. But if, again, a foreign call blocks for a week on CUDA, even if it doesn't call back into Haskell, we want it on its own OS thread...

It looks like "waitForReturnCapability" is for returning safe FFI calls to get back in:
https://github.com/ghc/ghc/blob/1dbe6d59b621ab9bd836241d633b3a8d99812cb3/rts/Capability.c#L579
I didn't see where OS threads get forked.. is that in the compiler generated code sequence for a safe foreign call?

@rrnewton
Member

P.S. It looks like 100% of the foreign decls in the cuda package are marked as "unsafe".

This would need to be changed in a couple places to get the behavior Simon describes, right?

@tmcdonell
Member

@simonmar okay, thanks! I have added some notes to the documentation about why this happens. 'seq' is definitely one way to avoid it.

@tmcdonell
Member

@rrnewton the default context does not pass any context creation flags, so yes, it would just pick up the CU_CTX_SCHED_AUTO spinning behaviour. Changing to CU_CTX_SCHED_BLOCKING_SYNC would be good, although I think some explicit synchronisation points will need to be added in the execute phase --- I'm pretty sure I know how to do that now, however.

From memory, 100% of the foreign calls in the cuda package are marked as "unsafe". The CUDA documentation is a bit confusing; I may have been under the impression that functions such as launchKernel and (more obviously) memcpyAsync always return immediately, in which case it wouldn't matter that the foreign calls are unsafe. But maybe that is only valid for the runtime API and not the driver API, since we can give the latter different behaviour flags at context creation. I'm no longer sure.

I've no objection to changing the foreign calls to "safe".

@simonmar
Author

On 16/05/2012 03:28, Ryan Newton wrote:

Ok, just to make sure I'm following --

If we compile with -threaded, and run with +RTS -N1, and then forkIO
ten threads, one of which does a blocking CUDA call which blocks the
hosting pthread for, say, a week, the other nine IO threads will
have a chance to run in the intervening week, right?

Correct. The best docs for this are in the GHC users guide:

http://www.haskell.org/ghc/docs/latest/html/users_guide/ffi-ghc.html#ffi-threads

Ah, so extra OS threads are forked on demand! Very nice. But from
that Wiki I don't yet understand when these are forked. It would
seem every "safe" foreign call from an IO thread can result in an OS
thread being created? For example if we forkIO 10K threads, and do
10K foreign calls, we can end up with 10K OS threads, irrespective of
+RTS -N, right?

Also correct. Every blocked FFI call needs a separate OS thread, and we
also need at least one OS thread per capability in the RTS.

More background here:
http://community.haskell.org/~simonmar/papers/conc-ffi.pdf

This is why we have the IO manager: it handles the majority of blocking
I/O with a single OS thread.

Is it fair to say that blocking foreign calls should be marked as
"safe"?

Absolutely.

It looks like "waitForReturnCapability" is for returning safe FFI
calls to get back in:
https://github.com/ghc/ghc/blob/1dbe6d59b621ab9bd836241d633b3a8d99812cb3/rts/Capability.c#L579

I didn't see where OS threads get forked.. is that in the compiler
generated code sequence for a safe foreign call?

rts/Schedule.c:suspendThread() is called by the Haskell code right
before a safe foreign call; it calls releaseCapability_(), which spawns a
new OS thread if necessary.

Cheers,
Simon
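A small, self-contained demonstration of the on-demand behaviour (POSIX sleep standing in for a blocking CUDA call; compile with -threaded and run with +RTS -N1):

```haskell
{-# LANGUAGE ForeignFunctionInterface #-}

import Control.Concurrent
import Control.Monad      (forM_)
import Foreign.C.Types    (CUInt(..))

-- sleep(3) as a stand-in for a blocking foreign call.
foreign import ccall safe "unistd.h sleep"
  c_sleep :: CUInt -> IO CUInt

-- Each blocked safe call gets its own OS thread, so even with a single
-- capability all ten workers finish after about one second, not ten.
main :: IO ()
main = do
  done <- newMVar (0 :: Int)
  forM_ [1 .. 10 :: Int] $ \_ -> forkIO $ do
    _ <- c_sleep 1
    modifyMVar_ done (return . (+1))
  threadDelay 1500000               -- 1.5 seconds
  readMVar done >>= print           -- expect 10
```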

@simonmar
Author

On 16/05/2012 04:23, Trevor L. McDonell wrote:

I've no objection to changing the foreign calls to "safe".

Best would be to mark just the long-running ones as safe and leave the
short-running ones as unsafe, since a safe call has quite a high
overhead.

Cheers,
Simon
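Concretely, the split would look something like this (the particular driver functions here are only illustrative):

```haskell
{-# LANGUAGE ForeignFunctionInterface #-}

import Foreign.C.Types (CInt(..))
import Foreign.Ptr     (Ptr)

-- Long-running, potentially blocking calls: "safe", so the RTS can keep
-- running other Haskell threads while they block.
foreign import ccall safe   "cuCtxSynchronize"
  cuCtxSynchronize :: IO CInt

-- Short, non-blocking queries: keep "unsafe" to avoid paying the
-- safe-call overhead on every invocation.
foreign import ccall unsafe "cuDeviceGetCount"
  cuDeviceGetCount :: Ptr CInt -> IO CInt
```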

@rrnewton
Member

I'm finding it difficult to dig up information on exactly how the CUDA driver interacts with the OS (e.g. how does it block and wake pthreads?).

http://forums.nvidia.com/index.php?showtopic=177417

@tmcdonell
Member

@rrnewton notes:

For reference, here's a link to the new bit of doc:

https://github.com/AccelerateHS/accelerate-cuda/blob/4575f474fa85d15bf48a88a4abb59ec62992d1cf/Data/Array/Accelerate/CUDA.hs#L85

Would it be a solution to walk the AST (during or before convertAcc) and force all the input arrays, before doing anything with the context?

Wouldn't that be a built-in equivalent of the seq Simon uses above? The Accelerate computation is strict in all the arrays that get used, so I don't think this approach would force anything that isn't due to be forced anyway.

Shall we keep this issue open until a fix for the underlying issue is found? As Simon pointed out, behavior should not differ from the interpreter.

If we make the right things strict in the library, the user shouldn't have to deal with adding seq themselves.
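For instance, a strict variant of 'use' along these lines (hypothetical helper, assuming the Arrays-polymorphic type of 'use') would complete any pending 'run' that produced the argument before the next computation takes the context:

```haskell
import Data.Array.Accelerate (Acc, Arrays, use)

-- Hypothetical strict wrapper: forcing the argument to WHNF completes
-- any pending 'run' that produced it, before the next computation grabs
-- the CUDA context MVar.
useStrict :: Arrays a => a -> Acc a
useStrict arr = arr `seq` use arr
```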

@rrnewton
Member

This was an issue that cropped up despite single-threaded use of the API.

What are the issues with multi-threaded use of the API? Won't two threads calling "run" both grab the default context (the same context), leading to the same issue? Or is there something I'm forgetting?

@tmcdonell
Member

If two threads call run with independent computations, then yes, they will be serialised when they try to grab the default context. But that would have happened anyway because the GPU can only process work from a single thread at a time (although it looks like the GK110, released yesterday, lifts this restriction).

So if you have a multi-threaded application, you probably want to use the runIn form and have each thread supply its own context referring to a distinct device (possible lurking bug: driver contexts hold thread-local state?).

The issue here was that the computations were dependent, and the second had already taken the context before realising the first was still to be evaluated --- deadlock.

It might be enough to make 'use' in the front-end library strict in its argument. But then again, I'm not certain the context needs an MVar around it anyway.

Aside: it always calls forkIO on the computation; otherwise, I found, the finaliser threads that free device memory didn't get a chance to run.
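The multi-device pattern would be roughly this; treat it as pseudocode, since the exact context type that runIn expects (and the creation flags) are assumptions here:

```haskell
import Data.Array.Accelerate         (Acc, Arrays)
import Data.Array.Accelerate.CUDA    (runIn)
import qualified Foreign.CUDA.Driver as CUDA

-- Sketch: run a computation in its own context on a given device.
-- Each application thread would call this with a distinct device
-- number.  Assumes CUDA.initialise [] has already been called.
runOnDevice :: Arrays a => Int -> Acc a -> IO a
runOnDevice n acc = do
  dev <- CUDA.device n
  ctx <- CUDA.create dev []
  let r = runIn ctx acc
  r `seq` CUDA.destroy ctx    -- force the result before tearing down the context
  return r
```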

@mchakravarty
Member

@tmcdonell Yes, why have an MVar around the context? Without the MVar, all this wouldn't be an issue, or would it?

@ghost ghost assigned tmcdonell May 23, 2012
@tmcdonell
Member

I mangled the commit tag, but here it is: AccelerateHS/accelerate-cuda@9f019ce

Thinking about why I did this: what I really wanted was exclusive access to the device, to avoid context switching (I recall trying that once and it was much slower than executing the two programs sequentially). But actually, using a single context from different points is fine.

@mchakravarty
Member

Ok, great, but can't we close the issue in this case? Or is there an outstanding problem?

@tmcdonell
Member

The 'thread blocked' issue is resolved. I've created a new issue for what Ryan mentions above regarding context synchronisation behaviour.

@mchakravarty
Member

Thanks.
