Use CUDA 7's per-thread default stream, with automatic syncing #2077

Closed
wants to merge 5 commits into from

@longjon (Contributor) commented Mar 9, 2015

Currently Caffe treats CUDA streams like this: we normally only use the default (NULL) stream, although a few layers use explicit streams which they are responsible for synchronizing.

Since we do not explicitly synchronize the default stream at the end of any layers, layer calls can actually be asynchronous with respect to the host. This normally goes unnoticed because all future CUDA calls (including memcpys) synch with the default stream. However, this use of the default stream also prevents GPU concurrency between threads.

CUDA 7 introduces the nvcc flag --default-stream per-thread, which turns the default stream into a normal, per-thread stream. This automatically allows concurrency with our existing layers, but also makes consecutive layer calls potentially asynchronous with respect to each other, which breaks normal net computation.

We could eliminate all use of the default stream and insist that layers always block until their computation actually completes, but that would require additional code in all existing CUDA layers. The alternative presented here is to introduce the convention that the default (null) stream is always implicitly synchronized after a forward or backward pass. This preserves the old semantics for layers written against the default stream in CUDA <7, while allowing layer concurrency across threads in CUDA 7 when compiled with the appropriate flag. Since the default stream blocks the whole device anyway in CUDA <7, there is no reason to avoid synchronizing it between layer calls.
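To make the proposed convention concrete, here is a minimal sketch of a wrapper that synchronizes after every forward or backward call. It is not the code from this patch: `Layer`, `SyncedLayer`, and `TraceForwardBackward` are hypothetical names, and the synchronization call is injected as a callable so the sketch compiles without the CUDA toolkit. In a CUDA 7 build the callable would presumably wrap something like `cudaStreamSynchronize(cudaStreamPerThread)`.

```cpp
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a Caffe layer: its GPU methods only enqueue work
// on the default stream and may return before that work has finished.
struct Layer {
  std::function<void()> forward_gpu;
  std::function<void()> backward_gpu;
};

// Wraps layer calls so the default stream is synchronized after each one.
// Assumption: in a real CUDA build, `sync` would be something like
//   [] { cudaStreamSynchronize(cudaStreamPerThread); }
// It is injected here so the sketch does not depend on the CUDA headers.
class SyncedLayer {
 public:
  SyncedLayer(Layer layer, std::function<void()> sync)
      : layer_(std::move(layer)), sync_(std::move(sync)) {}

  void Forward() {
    layer_.forward_gpu();  // enqueue work; may return early
    sync_();               // block until the default stream has drained
  }

  void Backward() {
    layer_.backward_gpu();  // enqueue work; may return early
    sync_();                // block until the default stream has drained
  }

 private:
  Layer layer_;
  std::function<void()> sync_;
};

// Runs one forward and one backward call and records the order of events.
std::vector<std::string> TraceForwardBackward() {
  std::vector<std::string> trace;
  Layer layer{[&trace] { trace.push_back("forward"); },
              [&trace] { trace.push_back("backward"); }};
  SyncedLayer synced(layer, [&trace] { trace.push_back("sync"); });
  synced.Forward();
  synced.Backward();
  return trace;
}
```

The point of the design is that layer code stays unchanged: the caller, not each layer, is responsible for the synchronization, so consecutive layers never overlap within one thread.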

After this patch, it should be okay to compile with --default-stream per-thread.

@longjon (Contributor Author) commented Mar 17, 2015

Er, this doesn't seem to work... it seems 0 is not the handle of the per-thread default stream. Update to come; if you are compiling with CUDA 7, you can use cudaStreamPerThread instead, so it's probably necessary to have some ifdef switching to make this work.
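The ifdef switching mentioned above might look roughly like the following sketch. `CUDART_VERSION` and `cudaStreamPerThread` come from the CUDA runtime headers; the `SKETCH_*` macros and `DefaultStreamKind` are hypothetical names used only for illustration, and the header is included conditionally so the snippet also compiles where the toolkit is not installed.

```cpp
#include <string>

// Pull in the CUDA runtime header only when it is available, so this sketch
// also compiles on machines without the toolkit.
#if defined(__has_include)
#if __has_include(<cuda_runtime.h>)
#include <cuda_runtime.h>
#endif
#endif

// CUDART_VERSION is defined by the CUDA runtime headers; 7000 means CUDA 7.0.
#if defined(CUDART_VERSION) && CUDART_VERSION >= 7000
// CUDA 7+: explicit handle for the calling thread's default stream.
#define SKETCH_DEFAULT_STREAM cudaStreamPerThread
#define SKETCH_PER_THREAD 1
#else
// Older toolkits (or no toolkit): fall back to the null stream.
#define SKETCH_DEFAULT_STREAM 0
#define SKETCH_PER_THREAD 0
#endif

// Reports which branch was selected at compile time (hypothetical helper).
inline std::string DefaultStreamKind() {
  return SKETCH_PER_THREAD ? "per-thread" : "null";
}
```

Code that needs to synchronize would then pass `SKETCH_DEFAULT_STREAM` to calls such as `cudaStreamSynchronize` instead of a literal `0`.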

@longjon longjon changed the title Always synch the default stream after GPU forward or backward Use CUDA 7's per-thread default stream, with automatic syncing Mar 20, 2015
@longjon (Contributor Author) commented Mar 20, 2015

Update: this should be a more sensible patch now, though it requires CUDA 7.

Here is what it does now:

  • require CUDA 7, and always use --default-stream per-thread
  • establish the convention that the default stream is automatically synced after Forward or Backward

This makes existing layer code work as if it were rewritten to always use an explicit stream with synchronization.

The only reason I can think of not to take this approach is if CUDA 7 is an undue constraint; so far I don't know any reason why it would be, but comments are welcome.

Alternatively, we could just outlaw the default stream, but that would require changes to most existing layers.

This simplifies the OS X build, and will allow use of the per-thread default stream for running existing layer code asynchronously. Note that this may cause issues with code that assumes either explicit or device-level synchronization, which we'll fix in the next commit.

This ensures that layers are synchronous with respect to each other, even when layer code doesn't use explicit streams.
@longjon longjon force-pushed the synch-default-stream branch 2 times, most recently from 8e0b0c4 to 74fd7a6 Compare March 27, 2015 23:42
@longjon (Contributor Author) commented Mar 27, 2015

Update: we also need to explicitly use the default stream for cuBLAS and cuDNN. The Travis build will continue to fail until Travis is switched to CUDA 7.

@shelhamer (Member) commented

CUDA 7 was officially released last week, so we could update the Travis script in our copious free time.

@shelhamer (Member) commented

Closing as #2219 was closed.

@shelhamer shelhamer closed this Apr 13, 2017