Use CUDA 7's per-thread default stream, with automatic syncing #2077
Er, this doesn't seem to work... it seems
62be6f1 to 07fc34c
Update: this should be a more sensible patch now, though it requires CUDA 7. Here is what it does now:
This makes existing layer code work as if it were rewritten to always use an explicit stream with synchronization. The only reason I can think of not to take this approach is if CUDA 7 is an undue constraint; so far I don't know any reason why it would be, but comments are welcome. Alternatively, we could just outlaw the default stream, but that would require changes to most existing layers.
This simplifies the OS X build, and will allow use of the per-thread default stream for running existing layer code asynchronously.
Note that this may cause issues with code that assumes either explicit or device-level synchronization, which we'll fix in the next commit.
This ensures that layers are synchronous with respect to each other, even when layer code doesn't use explicit streams.
07fc34c to 1800672
8e0b0c4 to 74fd7a6
Update: we also need to explicitly use the default stream for cuBLAS and cuDNN. This will continue to fail until Travis is switched to CUDA 7.
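As a rough illustration of the cuBLAS half of that update: precompiled libraries are not built with the per-thread flag, so one way to keep their work on the calling thread's default stream is to bind the handle to `cudaStreamPerThread` explicitly. This is a minimal sketch, not the code in this patch; the helper names `create_handle_on_per_thread_stream` and `scale_with_cublas` are hypothetical. cuDNN offers the analogous `cudnnSetStream` call.

```cuda
#include <vector>
#include <cublas_v2.h>
#include <cuda_runtime.h>

// Hypothetical helper: create a cuBLAS handle and bind it to the calling
// thread's default stream instead of the legacy (device-wide) null stream.
bool create_handle_on_per_thread_stream(cublasHandle_t* handle) {
  if (cublasCreate(handle) != CUBLAS_STATUS_SUCCESS) return false;
  if (cublasSetStream(*handle, cudaStreamPerThread) != CUBLAS_STATUS_SUCCESS) {
    cublasDestroy(*handle);
    return false;
  }
  return true;
}

// Hypothetical demo: scale a vector of ones by alpha with cublasSscal.
// Returns the number of wrong elements, or -1 if any CUDA/cuBLAS call fails.
int scale_with_cublas(int n, float alpha) {
  std::vector<float> host(n, 1.0f);
  float* d_x = NULL;
  cublasHandle_t handle;
  if (cudaMalloc(&d_x, n * sizeof(float)) != cudaSuccess) return -1;
  if (!create_handle_on_per_thread_stream(&handle)) {
    cudaFree(d_x);
    return -1;
  }
  bool ok = cudaMemcpy(d_x, host.data(), n * sizeof(float),
                       cudaMemcpyHostToDevice) == cudaSuccess;
  ok = ok && cublasSscal(handle, n, &alpha, d_x, 1) == CUBLAS_STATUS_SUCCESS;
  // Wait for the cuBLAS work queued on this thread's default stream.
  ok = ok && cudaStreamSynchronize(cudaStreamPerThread) == cudaSuccess;
  ok = ok && cudaMemcpy(host.data(), d_x, n * sizeof(float),
                        cudaMemcpyDeviceToHost) == cudaSuccess;
  cublasDestroy(handle);
  cudaFree(d_x);
  if (!ok) return -1;
  int mismatches = 0;
  for (int i = 0; i < n; ++i) {
    if (host[i] != alpha) ++mismatches;
  }
  return mismatches;
}
```

The sketch needs CUDA 7 or later for `cudaStreamPerThread` and must be linked against cuBLAS (for example with `-lcublas`).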
CUDA 7 was officially released last week, so we could update the Travis script in our copious free time.
Closing as #2219 was closed.
Currently Caffe treats CUDA streams like this: we normally only use the default (NULL) stream, although a few layers use explicit streams which they are responsible for synchronizing.
Since we do not explicitly synchronize the default stream at the end of any layers, layer calls can actually be asynchronous with respect to the host. This normally goes unnoticed because all future CUDA calls (including memcpys) synch with the default stream. However, this use of the default stream also prevents GPU concurrency between threads.
CUDA 7 introduces the nvcc flag --default-stream per-thread, which turns the default stream into a normal, per-thread stream. This automatically allows concurrency with our existing layers, but also makes consecutive layer calls potentially asynchronous with respect to each other, which breaks normal net computation.
We could eliminate all use of the default stream, and insist that layers always block until their computation actually completes. This would require additional code in all existing CUDA layers. An alternative, presented here, is to introduce the convention that the null stream is always implicitly synchronized after a forward or backward. This maintains the old semantics of writing layers using the default stream in CUDA <7, while allowing layer concurrency despite the default stream in CUDA 7 with the appropriate flag. Since the default stream will block the whole device anyway in CUDA <7, there is no reason to avoid synchronizing it between layer calls.
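The convention described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this patch: `toy_layer_forward` is a hypothetical stand-in for an existing layer's GPU forward pass, and `forward_then_sync` is a hypothetical wrapper showing where the synchronization would sit. Stream handle 0 names the legacy default stream in a normal build and the calling thread's own default stream when compiled with --default-stream per-thread, so the same call covers both cases.

```cuda
#include <vector>
#include <cuda_runtime.h>

__global__ void scale_kernel(int n, float alpha, float* data) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] *= alpha;
}

// Hypothetical stand-in for an existing layer: it launches on the default
// stream and returns without waiting for the kernel to finish.
void toy_layer_forward(int n, float* d_data) {
  scale_kernel<<<(n + 255) / 256, 256>>>(n, 2.0f, d_data);
}

// Hypothetical wrapper: the caller, not the layer, synchronizes the null
// stream, so this returns only once the layer's default-stream work is done.
cudaError_t forward_then_sync(int n, float* d_data) {
  toy_layer_forward(n, d_data);
  cudaError_t err = cudaGetLastError();  // catch launch errors
  if (err != cudaSuccess) return err;
  return cudaStreamSynchronize(0);
}

// Runs two toy layers back to back on a vector of ones.
// Returns the number of elements that are not 4, or -1 on a CUDA error.
int run_two_layer_net(int n) {
  std::vector<float> host(n, 1.0f);
  float* d_data = NULL;
  if (cudaMalloc(&d_data, n * sizeof(float)) != cudaSuccess) return -1;
  bool ok = cudaMemcpy(d_data, host.data(), n * sizeof(float),
                       cudaMemcpyHostToDevice) == cudaSuccess;
  ok = ok && forward_then_sync(n, d_data) == cudaSuccess;
  ok = ok && forward_then_sync(n, d_data) == cudaSuccess;
  ok = ok && cudaMemcpy(host.data(), d_data, n * sizeof(float),
                        cudaMemcpyDeviceToHost) == cudaSuccess;
  cudaFree(d_data);
  if (!ok) return -1;
  int mismatches = 0;
  for (int i = 0; i < n; ++i) {
    if (host[i] != 4.0f) ++mismatches;
  }
  return mismatches;
}
```

The layer body itself is unchanged, which is the point of the convention: only the call site gains a synchronization.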
After this patch, it should be okay to compile with --default-stream per-thread.
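To make the per-thread behavior concrete, here is a small sketch in which several host threads each launch a kernel on the default stream and then synchronize only that stream. The names `fill_kernel`, `worker`, and `run_workers` are hypothetical and not part of Caffe. Built with the flag, each thread's launch goes to its own stream and may overlap with the others; built without it, the same code runs correctly but all launches share the legacy default stream.

```cuda
#include <thread>
#include <vector>
#include <cuda_runtime.h>

__global__ void fill_kernel(int n, float value, float* out) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) out[i] = value;
}

// Each worker uses only the default stream (handle 0). With
// --default-stream per-thread, that is a separate stream for every thread.
void worker(int n, float value, float* d_out, cudaError_t* status) {
  fill_kernel<<<(n + 255) / 256, 256>>>(n, value, d_out);
  cudaError_t err = cudaGetLastError();
  if (err == cudaSuccess) err = cudaStreamSynchronize(0);
  *status = err;
}

// Launches num_threads workers, each filling its own buffer with t + 1.
// Returns how many buffers hold the expected values, or -1 if allocation fails.
int run_workers(int num_threads, int n) {
  std::vector<float*> bufs(num_threads, (float*)NULL);
  std::vector<cudaError_t> status(num_threads, cudaSuccess);
  for (int t = 0; t < num_threads; ++t) {
    if (cudaMalloc(&bufs[t], n * sizeof(float)) != cudaSuccess) {
      for (int u = 0; u < t; ++u) cudaFree(bufs[u]);
      return -1;
    }
  }
  std::vector<std::thread> threads;
  for (int t = 0; t < num_threads; ++t) {
    threads.emplace_back(worker, n, float(t + 1), bufs[t], &status[t]);
  }
  for (size_t t = 0; t < threads.size(); ++t) threads[t].join();

  int correct = 0;
  std::vector<float> host(n);
  for (int t = 0; t < num_threads; ++t) {
    bool ok = status[t] == cudaSuccess &&
              cudaMemcpy(host.data(), bufs[t], n * sizeof(float),
                         cudaMemcpyDeviceToHost) == cudaSuccess;
    for (int i = 0; ok && i < n; ++i) ok = (host[i] == float(t + 1));
    if (ok) ++correct;
    cudaFree(bufs[t]);
  }
  return correct;
}
```

Assuming the file is saved as `per_thread.cu`, something like `nvcc --default-stream per-thread per_thread.cu -o per_thread` should build it; older toolchains may also need `-std=c++11` and a threading flag for `std::thread`.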