Fix a recently introduced race condition in DataLayer #2998
Conversation
ronghanghu and 1 other commented on an outdated diff on Aug 29, 2015
@@ -13,6 +13,9 @@ void BasePrefetchingDataLayer<Dtype>::Forward_gpu(
   // Copy the data
   caffe_copy(batch->data_.count(), batch->data_.gpu_data(),
       top[0]->mutable_gpu_data());
+  // Ensure the copy is synchronous wrt the host, so that the next batch isn't
+  // copied in meanwhile.
+  CUDA_CHECK(cudaStreamSynchronize(cudaStreamLegacy));
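For context, the pattern the diff establishes can be sketched as follows. This is a hedged sketch, not the exact merged code: it assumes CUDA 7+ (where the `cudaStreamLegacy` handle was introduced), and `prefetch_free_` is meant to suggest the layer's free-batch queue, whose actual name may differ.

```cpp
// caffe_copy issues a device-to-device cudaMemcpy on the null stream
// and returns to the host before the copy has completed.
caffe_copy(batch->data_.count(), batch->data_.gpu_data(),
           top[0]->mutable_gpu_data());

// Block the host until all work queued on the legacy default stream
// (including the copy above) has drained, so the prefetch thread
// cannot start overwriting batch->data_ with the next batch while
// the device-to-device copy is still in flight.
CUDA_CHECK(cudaStreamSynchronize(cudaStreamLegacy));

// Only after the sync is it safe to return the batch to the free queue.
prefetch_free_.push(batch);
```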
ronghanghu added the bug label on Aug 29, 2015

ronghanghu (Member) commented:
Thanks @longjon for finding this. Synchronizing on the default stream is done in a couple of places in parallel.cpp. If cudaStreamLegacy needs to be used instead for old CUDA versions, I can change it. P2PSync used to start async copies and call cudaStreamSynchronize once, but after @thatguymike's tuning it calls sync after each copy, so we could have the same code as base_data_layer.cu.

@cypof, I think what I'm saying about
longjon added the ready for review label on Aug 30, 2015
Ah OK, let's keep cudaStreamDefault everywhere then. For the docs, I don't know; they have the programming guides and samples, which can be helpful in addition to the headers/Doxygen.

This PR should receive high priority, since it affects the correctness of all ongoing training with DataLayer, whether single-GPU or multi-GPU. I hope to merge this as soon as possible (if no one opposes).
+1

LGTM, thanks for tracking this down @longjon!
jeffdonahue added a commit that referenced this pull request on Aug 30, 2015: d362894
longjon commented on Aug 29, 2015
This PR fixes a race condition introduced to `DataLayer` by #2903, which can cause incorrect data to appear on `top` after a GPU forward. This affects single- and multi-GPU usage with any data source.

`DataLayer`'s forward copies prefetch GPU data -> top GPU data using `caffe_copy`. Meanwhile, the prefetch thread copies prefetch CPU data -> prefetch GPU data using a non-blocking CUDA stream. `caffe_copy` is asynchronous wrt the host (when device -> device). That means these two copies can happen in any order, giving you either this batch's data or the next's (or some combination). If you have two synchronized data sources (e.g., separate images and labels), this can be catastrophic.

Note that the queue pair is no help here; the `batch` is reinserted into the free queue immediately after the copy is issued, before it has completed.

To reproduce this issue easily, set `PREFETCH_COUNT` to `1`, and put the copy at https://github.com/BVLC/caffe/blob/master/src/caffe/layers/base_data_layer.cu#L14 in a loop that executes, e.g., 1000 times. That shouldn't affect correctness, but it gives the race enough time to occur reliably (on my system, at least).

The fix here explicitly synchronizes the null stream used by `caffe_copy`. However, I think it requires CUDA 7. @thatguymike or others, what's the right way to do this without switching right away to CUDA 7?

It would be nice if there were some way to test that this doesn't happen again, but that seems difficult...
Please note: Caffe as such, with few exceptions, uses the default stream with no explicit synchronization. Layer calls are asynchronous wrt the host. (That's why there's, e.g., #2077.)
`caffe_copy` (as `cudaMemcpy`) is asynchronous wrt the host when device -> device. If you create a non-blocking stream, don't expect it to be synchronous wrt any existing Caffe GPU code.
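That note can be illustrated with the bare CUDA runtime API (a hedged fragment, not Caffe code; `dst_dev`, `src_dev`, and `bytes` are hypothetical device pointers and a size):

```cpp
// Device-to-device cudaMemcpy is asynchronous with respect to the host:
// the call returns once the copy is enqueued on the null stream, not
// once it has completed.
CUDA_CHECK(cudaMemcpy(dst_dev, src_dev, bytes, cudaMemcpyDeviceToDevice));

// Host code here may run while the copy is still in flight; letting
// another (non-blocking) stream write to src_dev now is a race.

// Synchronizing the null stream (handle 0) makes completion visible to
// the host; cudaDeviceSynchronize() would be a heavier alternative.
CUDA_CHECK(cudaStreamSynchronize(0));
```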