Use CUDA 7's per-thread default stream, with automatic syncing #2077

longjon · 2015-03-09T06:47:31Z

Currently Caffe treats CUDA streams like this: we normally only use the default (NULL) stream, although a few layers use explicit streams which they are responsible for synchronizing.

Since we do not explicitly synchronize the default stream at the end of any layer, layer calls can actually be asynchronous with respect to the host. This normally goes unnoticed because all subsequent CUDA calls (including memcpys) synchronize with the default stream. However, this use of the default stream also prevents GPU concurrency between threads.

CUDA 7 introduces the nvcc flag --default-stream per-thread, which turns the default stream into a normal, per-thread stream. This automatically allows concurrency with our existing layers, but also makes consecutive layer calls potentially asynchronous with respect to each other, which breaks normal net computation.

We could eliminate all use of the default stream, and insist that layers always block until their computation actually completes. This would require additional code in all existing CUDA layers. An alternative, presented here, is to introduce the convention that the null stream is always implicitly synchronized after a forward or backward pass. This maintains the old semantics of writing layers using the default stream in CUDA <7, while allowing layer concurrency despite the default stream in CUDA 7 with the appropriate flag. Since the default stream will block the whole device anyway in CUDA <7, there is no reason to avoid synchronizing it between layer calls.

After this patch, it should be okay to compile with --default-stream per-thread.

longjon · 2015-03-17T08:40:57Z

Er, this doesn't seem to work... it seems 0 is not the handle of the per-thread default stream. Update to come; if you are compiling with CUDA 7, you can use cudaStreamPerThread instead, so it's probably necessary to have some ifdef switching to make this work.
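For illustration, the ifdef switching mentioned above might look roughly like the sketch below. The helper name `default_stream_handle` is hypothetical (it is not part of Caffe or the CUDA API); `CUDART_VERSION` and `cudaStreamPerThread` come from the CUDA runtime headers, and the version check assumes CUDA 7.0 reports itself as 7000.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper (not from this patch): pick the stream handle that
// "the default stream" should mean for the toolkit we are compiled against.
inline cudaStream_t default_stream_handle() {
#if CUDART_VERSION >= 7000
  // CUDA 7 and later expose an explicit handle for the per-thread default stream.
  return cudaStreamPerThread;
#else
  // Older toolkits only have the legacy null stream, whose handle is 0.
  return static_cast<cudaStream_t>(0);
#endif
}
```

Call sites that currently pass 0 as the stream could pass `default_stream_handle()` instead, which keeps the version check in one place.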

longjon · 2015-03-20T21:53:49Z

Update: this should be a more sensible patch now, though it requires CUDA 7.

Here is what it does now:

- require CUDA 7, and always use --default-stream per-thread
- establish the convention that the default stream is automatically synced after Forward or Backward

This makes existing layer code work as if it were rewritten to always use an explicit stream with synchronization.
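A minimal sketch of that convention, for readers who want to see it concretely. Everything here is a toy stand-in rather than the actual Caffe code: `scale_kernel`, `toy_forward_gpu`, and `toy_forward` are hypothetical names, and the wrapper only mimics the idea of syncing after a layer's forward call.

```cuda
#include <cuda_runtime.h>

// Toy kernel standing in for a layer's GPU work.
__global__ void scale_kernel(float* x, int n, float a) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) x[i] *= a;
}

// Hypothetical stand-in for a layer's GPU forward: it enqueues work and
// returns without waiting. With --default-stream per-thread, a launch with no
// stream argument lands on this same per-thread stream; naming it explicitly
// keeps the sketch correct even when the flag is not set.
void toy_forward_gpu(float* x, int n) {
  const int threads = 256;
  const int blocks = (n + threads - 1) / threads;
  scale_kernel<<<blocks, threads, 0, cudaStreamPerThread>>>(x, n, 2.0f);
}

// Hypothetical wrapper applying the convention: after the layer's forward,
// block the host until the per-thread default stream has drained.
cudaError_t toy_forward(float* x, int n) {
  toy_forward_gpu(x, n);
  cudaError_t err = cudaGetLastError();  // report launch failures
  if (err != cudaSuccess) return err;
  return cudaStreamSynchronize(cudaStreamPerThread);
}
```

Because only the calling thread's stream is synchronized, other host threads running their own layers should not be blocked by this wait.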

The only reason I can think of not to take this approach is if CUDA 7 is an undue constraint; so far I don't know any reason why it would be, but comments are welcome.

Alternatively, we could just outlaw the default stream, but that would require changes to most existing layers.

longjon · 2015-03-27T23:43:09Z

Update: we also need to explicitly use the default stream for cuBLAS and cuDNN. This will continue to fail until Travis is switched to CUDA 7.
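As a rough sketch of what this means for cuBLAS (the helper name `use_per_thread_stream` is hypothetical, and this is not necessarily how the patch does it): a library handle is pointed at the per-thread stream with `cublasSetStream`, so that its work is covered by the same sync as the layer kernels.

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>

// Hypothetical helper: route all work issued through this cuBLAS handle onto
// the calling thread's default stream (requires CUDA 7 for cudaStreamPerThread).
inline cublasStatus_t use_per_thread_stream(cublasHandle_t handle) {
  return cublasSetStream(handle, cudaStreamPerThread);
}
```

cuDNN has an analogous call, `cudnnSetStream`, which takes a `cudnnHandle_t` and a stream in the same way.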

shelhamer · 2015-03-27T23:47:49Z

CUDA 7 was officially released last week, so we could update the Travis script in our copious free time.

shelhamer · 2017-04-13T06:57:20Z

Closing as #2219 was closed.

longjon added the in progress label Mar 17, 2015

longjon force-pushed the synch-default-stream branch from 62be6f1 to 07fc34c Compare March 20, 2015 21:40

longjon changed the title ~~Always synch the default stream after GPU forward or backward~~ Use CUDA 7's per-thread default stream, with automatic syncing Mar 20, 2015

longjon added 3 commits March 20, 2015 14:54

require CUDA 7

77072a7

This simplifies the OS X build, and will allow use of the per-thread default stream for running existing layer code asynchronously.

[build] use CUDA 7's per thread default stream

c261659

Note that this may cause issues with code that assumes either explicit or device-level synchronization, which we'll fix in the next commit.

always sync the default stream after GPU forward or backward

1800672

This ensures that layers are synchronous with respect to each other, even when layer code doesn't use explicit streams.

longjon force-pushed the synch-default-stream branch from 07fc34c to 1800672 Compare March 20, 2015 21:54

longjon added 2 commits March 27, 2015 16:39

use per-thread stream as default for cuDNN

395245e

use per-thread stream as default for cuBLAS

74fd7a6

longjon force-pushed the synch-default-stream branch 2 times, most recently from 8e0b0c4 to 74fd7a6 Compare March 27, 2015 23:42

longjon mentioned this pull request Mar 28, 2015

Synchronous SGD via layer-wise parallelism #2219

Closed

longjon mentioned this pull request Jul 23, 2015

Explicitly synchronize streams in cuDNN code #2798

Closed

longjon mentioned this pull request Aug 29, 2015

Fix a recently introduced race condition in DataLayer #2998

Merged

shelhamer closed this Apr 13, 2017
