Multi-GPU Data Parallelism (with Parallel Data Layers) #2903
Conversation
cypof
and others
added some commits
Apr 28, 2015
shelhamer
added in progress focus speed-up
labels
Aug 11, 2015
|
Well, tests pass, but training runs seem to hang in the data prefetch queue. Not sure the new datareader code is behaving. |
|
@thatguymike I'll look into this issue shortly and see why training hangs. I expect to do a rebase tonight and test on my data with multiple GPUs. |
|
It's a great thing to get rid of the data reader and unify all data layer types. One thing I'm concerned about in this design, though, is the ordering of threads on the lock. It might not be absolutely required, but if we want runs to be reproducible at the numerical-precision level, each solver needs to take data items in the same order, which I don't believe the lock can enforce as it is. Each run might see items distributed to solvers differently. The gradient sum should be the same, but with slight differences as items would have been added in a different order. |
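A lock enforcing such an ordering could look roughly like the following standalone sketch (hypothetical code, not from this PR; the names `OrderedLock`, `Acquire`, and `Release` are made up): solver threads with ids 0..n-1 are admitted in fixed round-robin order, so the item-to-solver assignment is identical in every run.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>

// Hypothetical ordered lock: solvers must acquire it in round-robin
// order 0 -> 1 -> ... -> n-1 -> 0, so each solver always pulls the
// same data items across runs.
class OrderedLock {
 public:
  explicit OrderedLock(int n_solvers) : n_(n_solvers), turn_(0) {}

  void Acquire(int solver_id) {
    std::unique_lock<std::mutex> lock(mu_);
    // Block until it is this solver's turn to pull a batch.
    cv_.wait(lock, [&] { return turn_ == solver_id; });
  }

  void Release() {
    std::lock_guard<std::mutex> lock(mu_);
    turn_ = (turn_ + 1) % n_;  // hand the lock to the next solver
    cv_.notify_all();
  }

  int turn() {
    std::lock_guard<std::mutex> lock(mu_);
    return turn_;
  }

 private:
  const int n_;      // number of worker solvers
  int turn_;         // id of the solver allowed to acquire next
  std::mutex mu_;
  std::condition_variable cv_;
};
```

A plain mutex admits whichever thread the scheduler wakes first; this variant trades a little parallelism in the acquire path for determinism.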
|
Regarding Michael Houston's concern:
In this PR, only one single DataLayer is shared among all worker solvers. Since data in lmdb/leveldb is read in this DataLayer prefetch thread rather than worker solver thread, the data prefetch behavior doesn't deviate from single GPU. |
@cypof I thought about this issue. However, I am not too concerned about it, since in general this PR produces more consistent and numerically identical results for all other data layers (except leveldb/lmdb) than #2870 does. In #2870 you get random behavior if a data layer supports and turns on shuffling, or e.g. a 4X effective learning rate otherwise. In both situations, the behavior is clearly worse than this PR and deviates from single-GPU training with an increased batch size. The latter behavior also defeats the purpose of Multi-GPU data parallelism. |
|
Travis CI fails because NVCC generates a warning for boost/thread.hpp included in layer.hpp (see Travis CI build details).
@shelhamer any suggestions to fix/suppress this warning? |
|
@thatguymike I made some updates, removed the data reader, and successfully trained on MNIST. I am also training on ILSVRC-2012-CLS with this PR. Can you test again? Since data in lmdb/leveldb is read in this DataLayer prefetch thread rather than in a worker solver thread, the data prefetch behavior shouldn't deviate from single GPU. |
|
Seems to work functionally, but scaling perf took a significant hit for some reason at 4 GPUs for AlexNet. Quite a significant slowdown. |
|
@thatguymike I'll look into this today. |
|
@thatguymike To be specific, are you seeing a lot of the following log messages?
|
|
How many transform threads are created by the shared data layer? |
|
@cypof There should be only one single prefetch thread, in which transform is performed. Only Forward is done multi-threaded in each solver, via a lock. @thatguymike looking into the drift issue you mentioned. |
|
I am seeing a few notices of the data layer prefetch queue being empty that in theory I shouldn't be seeing. I don't see them with #2870 because I'm on fast SSDs and my LMDB should be in the kernel filecache. |
|
If I understand correctly, the shared layer prefetch thread is now doing both load and transform. This might not be fast enough to feed several GPUs.
|
Good point. Let me try with an uncompressed LMDB. A single thread should be fast enough for the raw IO, but might not be fast enough for the decode and transforms. |
|
The data layer empty message is only shown once every 1000 occurrences to avoid filling the logs. |
|
Adding to the comment from @cypof, if we are going to do a shared read, we can read the superbatch, notify all and have each solver worker pull from the image array with their offset. That should make things consistent. We might need to rework the code to handle decode and transform in each worker, ideally in an independent thread, to keep up. |
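The super-batch scheme described above could be sketched roughly as follows (hypothetical standalone code, not from this PR; `SolverChunk` and the plain `int` records standing in for decoded datums are made up): a single reader fills one buffer of `n_solvers * batch_size` records, and each worker solver copies the contiguous slice at offset `batch_size * solver_id`, so the item-to-solver assignment stays fixed across runs.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: return the slice of the super-batch that belongs
// to one worker solver. Records are ints here for simplicity; in Caffe
// they would be the decoded datums read from LMDB/LevelDB.
std::vector<int> SolverChunk(const std::vector<int>& super_batch,
                             std::size_t batch_size,
                             std::size_t solver_id) {
  std::size_t begin = batch_size * solver_id;  // this solver's fixed offset
  return std::vector<int>(super_batch.begin() + begin,
                          super_batch.begin() + begin + batch_size);
}
```

Decode and transform could then run per-solver (or per-thread) on each chunk, keeping the IO single-threaded while spreading the CPU-heavy work.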
|
As for NVCC warnings, we should be able to send a suppression flag to the front-end. |
|
@cypof @thatguymike Right now there is only one prefetch thread. I think I can use multiple prefetch threads instead, still within one single shared DataLayer. My idea is to have each data layer read, decode, and transform as fast as it can to fill the prefetch batch space. This can be done via multi-threading within a data layer. Each data layer should be responsible for providing data at the highest possible speed. |
|
To clarify, the first 9 commits in this PR are exactly the same as in the current #2870. The only changes are the last two commits.
Looking into the correctness issue right now. Will address speed issue afterwards. |
|
Regarding the numerically non-deterministic issue, I plan to use a more advanced lock that only allows threads to acquire it in e.g. 1->2->3->4->1->2->3->4->1 order on shared data layers. But right now I'll first handle the more drastic drift I observed. |
|
Multiple prefetch threads from LMDB isn't going to help generally as one thread should be able to saturate the IO. It's the image decode and image transforms we need to spread across CPU threads. Forcing the threads to visit in a specific order might work, but again, you only would want to do that on the data load and not on the processing (decode/transform) of the data. I think it would make more sense to have a single thread pulling the super batch from the IO system and then each thread get notified to read their chunk using a specific thread ID offset (batch*thread_id). |
I wonder if that can be done with a simple solution e.g. OpenMP over a for loop, within prefetch thread? |
|
Mixing OpenMP with boost threads is going to be a pain and a debugging hell. Better to launch threads as we do now and use the blocking queues. However, if we start heading back to what @cypof originally did, that should get us back perf-wise, but it's the numerical drift that has me currently more concerned. I'll bet it's all wrapped up in the same changes to the data loading. |
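For reference, the blocking-queue pattern mentioned above is roughly the following (a minimal standalone sketch; Caffe's actual `BlockingQueue` lives in `util/blocking_queue.hpp` and is built on boost primitives, while this version uses the std equivalents):

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <queue>

// Minimal blocking queue: producers Push, consumers Pop; Pop blocks
// until an item is available instead of spinning or returning empty.
template <typename T>
class BlockingQueue {
 public:
  void Push(const T& t) {
    {
      std::lock_guard<std::mutex> lock(mu_);
      q_.push(t);
    }
    cv_.notify_one();  // wake one waiting consumer
  }

  T Pop() {
    std::unique_lock<std::mutex> lock(mu_);
    cv_.wait(lock, [&] { return !q_.empty(); });  // block while empty
    T t = q_.front();
    q_.pop();
    return t;
  }

 private:
  std::queue<T> q_;
  std::mutex mu_;
  std::condition_variable cv_;
};
```

The prefetch thread pushes filled batches and each solver pops them; backpressure falls out naturally if a bounded variant caps the queue size.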
|
@thatguymike @cypof Due to these issues, while continuing to debug my current branch, I took a step back in this PR and kept the data reader. Now the LMDB/LEVELDB DataLayer isn't shared. There are only 94 additions and 20 deletions on top of #2870, which should be easy to review. |
|
Wow, this is it! I'm already the old fart annoyed at the young padawan who wants to change everything. Anyway, if you change the data layer to be multi-threaded, you will probably end up replicating the data reader pattern. So for now it might be better to just support multi-gpu only with DBs. As a longer-term solution, there is either moving the other data layer types to the data reader, or the more general solution of allowing layers to run asynchronously from each other. Something like having queues between layers, and letting them run independently on their own threads. I'm not sure what that would look like, but then a big network could have several sections running in parallel, feeding each other activations and gradients through queues. It's probably best to design this as part of the larger discussion about model parallelism etc. |
|
@cypof OK, let's get the NVCC warning suppressed. Any idea? Right now other data layers like HDF5DataLayer are shared to make the solver tests pass. |
|
So reproducibility and perf are back it seems. Why the scope lock in Forward? For the suppression, let's try: -Xcudafe "--diag_suppress=boolean_controlling_expr_is_constant" Should work, but a little ugly. |
|
Thanks for all your work in this thread, everybody! @cypof layer parallelism is covered by @longjon's #2219. I can understand the frustration as we figure out how to bridge parallel IO. @ronghanghu for NVCC / boost warnings, we're usually careful about includes. For threads / parallelism, let's keep exclusively to boost, as commented.
|
In short, this lets the solver tests on HDF5 pass. Quote from @jeffdonahue:
So, in order to test solvers with deterministic data, the current workaround was to load hdf5 files with HDF5DataLayer, which, when shared, needs to be locked during forward. I'll modify it to only apply the lock to those shared layers. |
|
Well, without the mutex, the HDF5 tests crash in entertaining ways in the multi-GPU tests. For training, it has little performance impact until you run cuDNN v3 or really start to scale up (8 GPUs). Why is the mutex required to serialize Forward for the parallel layers (hdf5) to pass tests? I fear we are heading back into hacky land. For LMDB training, it doesn't seem required... |
|
We raced on comments. The mutex hurts perf as we serialize the forward passes. For 4 GPUs without cuDNN, it's a ~10% perf hit. 8 GPUs, ~25% perf hit. With the cuDNN v3 backend, ~30% hit at 4 GPUs.
I'll only apply it when necessary. To be updated. |
|
Now the remaining task is to suppress the NVCC warnings on boost
|
It might work to load a data layer from the data reader, I can look at it.
|
@cypof Yes, I think so. |
|
Right now forward is locked on a layer only if it is actually shared. Fixed NVCC warnings on boost. @cypof @thatguymike I suppose this should be working. |
|
All of my sanity tests are checking out. I have kicked off a full training run with 4GPUs on AlexNet to validate. It will take time to complete, but convergence looks correct at the moment. I think this is ready to merge and then we can come back around and look more carefully at data prefetch/decode/transform and clean up shared/not shared layers. @cypof and @shelhamer any concerns at this point? |
|
No, I think it's all good.
|
|
Great work! In case you want to revisit it, here's an earlier attempt I did to better separate data sources from pre-processing and backend. Although it is outdated, I think the basic idea of separating the reading from the DB, the pre-processing, and the dataLayer is still useful. The basic idea is to have a factory of Datum_DB that only allows each data source to be opened once, but allows multiple cursors. So each data layer can read from there and do the needed preprocessing. I think encapsulating the Datum_DB with its own parameters could be a good thing down the road. |
|
As we scale and keep tuning GPU perf, we are going to hit limits on the image handling. I'll try to find a block of time to look at #1568. |
|
For the sake of progress, and to not stall Multi-GPU again, I expect this PR to be merged within this week, since it makes #2870 pass the solver tests and does not cause a performance drop, as DataReader is preserved. @jeffdonahue @longjon Please take a look when you have time. Sharing the data layer may be a temporary solution to get #2870 to pass the solver tests. The original idea of a parallel data layer suffers from a speed drop due to the decode and transform bottleneck, as well as numerical drift, and I admit it is quite difficult or hacky to fix.
The current design of shared data layers can be revisited if more general solutions are provided. |
|
Nice work @ronghanghu -- thanks for working out all these issues and getting this ready! And thanks of course to @cypof and @thatguymike for all your previous work and keeping up with the changes, and continuing to benchmark on your end. The new changes LGTM, and given that the newly rigorous unit tests pass and @thatguymike and @cypof's blessing, I think this is ready for merge. Would be good for @longjon to take a look as well though. |
|
This approach has my blessing, with the consensus of @jeffdonahue @longjon. Thanks @ronghanghu!
|
|
This PR is planned to be merged once @thatguymike 's tests pass. I am also testing it on BVLC machine. |
ronghanghu
added ready for review and removed in progress
labels
Aug 13, 2015
|
One more thought, although for testing the code we probably want to control the order in which the data is processed, for actual training introducing more randomness in the order of the data is helpful. |
|
Okay, my 2 intensive tests have passed convergence training. I think we should merge, and then we can get back to the perf issues. @sguada - I think we need to look more generally at randomly building batches. You don't want to do this in LMDB. As a stop-gap, especially if we are going to thread the transforms, I agree we can at least start supporting randomization of the images in the batch in the data layer. |
|
Thanks to @cypof @thatguymike for development and testing. Thanks to all community contributors for working on and reviewing Multi-GPU. For now, Multi-GPU data parallelism in Caffe is expected to be used with the LMDB/LEVELDB DataLayer to achieve maximum performance, and is available only in the training phase. Generalizations to other types of data layers and/or revisions of the multi-GPU design can follow in future issues and PRs. |
ronghanghu
added a commit
that referenced
this pull request
Aug 13, 2015
|
|
ronghanghu |
bb0a90e
|
ronghanghu
merged commit bb0a90e
into
BVLC:master
Aug 13, 2015
1 check passed
ronghanghu
deleted the
ronghanghu:multi_gpu branch
Aug 13, 2015
|
Congrats to everyone involved!!! Sergio
|
This was referenced Aug 14, 2015
|
I was able to reconcile this code with #2610 (device abstraction and OpenCL backend). However, while all the new code is merged, it is not fully functional at the moment. If anyone has time to review it or test it on NVIDIA hardware, that would be great :) |
This was referenced Aug 16, 2015
zxt881108
commented
Aug 21, 2015
|
@shelhamer @ronghanghu In my experiments, I found that Multi-GPU works with the Tesla K40 but not with the Titan X (I use four K40s and four Titan Xs). With the Titan Xs, iterations are very, very slow, and the result is wrong (loss = 87). Do you know what the problem could be — the CUDA driver? Maxwell? Thanks! I tested on CUDA 7.0 and CUDA 7.5. |
lukeyeager
referenced
this pull request
Aug 21, 2015
Closed
CUDA version required for multi-GPU #2954
|
Adding @thatguymike as he is working on a 4 TitanX box. |
|
@zxt881108, we need a lot more information. What network, cuDNN vs BLAS, etc? (CaffeNet is a touchy network generally to train...) Did you properly adjust your batch size in train_val.prototxt down as you add GPUs or correct the learning rate in your solver to account for added GPUs? It's odd the iteration is very slow, it implies the Titan X boards are not on the same root complex, and/or your motherboard has problems talking between devices, and/or there is some stability issue with one of your boards. What does the output of nvidia-smi topo -m show? If you run the nvidia sample p2pBandwidthLatencyTest, what is the output? For DIGITS devbox (4 Titan X): Bidirectional P2P=Enabled Bandwidth Matrix (GB/s) |
zxt881108
commented
Aug 23, 2015
|
@thatguymike We tried again; the problem is still not solved. We tested AlexNet and GoogleNet with cuDNN v2, and the steps you mentioned look right (we can reproduce multi-GPU training with the same configuration on Tesla K40m). So the main problem might be our server. Based on your suggestion, we ran the NVIDIA p2pBandwidthLatencyTest; the result may indicate a problem, but we don't know the cause. I am uploading the result now; could you please take a look? Thanks. When we run p2pBandwidthLatencyTest, we can only get partial results: at the "P2P=Enabled Latency Matrix (us)" stage, the program gets stuck and never produces a result. |
|
The topology on your server looks very strange. What server/motherboard and what GPUs exactly are you trying to run? Generally on a 2-socket node, GPUs 0,1,2,3 and 4,5,6,7 should be able to talk fast within each group. However, per the bandwidth test on your server, 0<->1 is the slowest connection, as is 2<->3, etc. This looks like a system BIOS or motherboard issue. You can try working around it by manually grouping 0,2,4,6 together (-gpu=0,2,4,6) to see if the device numbering from the BIOS is incorrect. |
zxt881108
commented
Aug 23, 2015
|
@thatguymike Thanks! Just now I used device ids -gpu 0,2,4,6 and the problem is partly solved, but the speedup is still poor (GoogleNet, quick_solver, mini-batch=64: device_id=0 takes 9 s per 20 iterations; device_id=0,2 takes 12 s; device_id=0,2,4,6 takes 23 s). What speedup do you get on the DIGITS devbox (4 Titan X)? Our server is a Tyan B7079, the GPUs are Titan Xs, the CPUs are Intel E2650v3 (x2), memory is 32 GB DDR4 (x24), and the hard disks are all SSDs. It now seems there are still some problems with our server's system BIOS; we have called the manufacturer. Thanks again! |
|
Remember that your effective batch size scales up as well, so your 2-device speedup doesn't look too bad, though it's clearly not great. Note from your p2pBandwidthLatencyTest results that your server has about half the bandwidth between boards of the DIGITS DevBox, so you are going to be MUCH more communication-bound when scaling than some other systems. I will note that issues with scaling performance and performance stability are exactly why my team designed the DevBox the way we did. You can replicate most of our build if you wish from online documents.

You can try larger batches to see how your performance changes, but something is up with your server. You might want to check the server logs for PCIe errors and definitely check the system BIOS. You can also systematically try different combinations of devices to find the fast and slow pairs, and then the fast and slow sets of 4 boards. 8 boards on that machine are not going to perform well with the current code, if ever, because you have to cross the PCIe bridge (especially as one of your links is only 1 GB/s per your bandwidth test results).

You might also want to validate the scaling performance you achieve with AlexNet, as there is more published work on that. Also, running Titan Xs in a server chassis at that density is likely not going to behave how you want in the long run without careful cooling design. (Note the modifications we had to make in the DIGITS DevBox to keep 4 Titan Xs thermally happy without crazy fan setups.) |
|
Okay, my numbers for GoogleNet with cudnnv3 on DIGITS DevBox (X99-E WS chipset and 4x TitanX) Weak scaling (default behavior of master) My P2P bidirectional perf: Bidirectional P2P=Enabled Bandwidth Matrix (GB/s) |
zxt881108
commented
Aug 25, 2015
|
@thatguymike Thanks for your suggestion; we have solved the P2P bandwidth problem between GPU ids 0 & 1. The system BIOS version was too old; after updating it, the P2P bandwidth values look normal. |
raingo
referenced
this pull request
Aug 26, 2015
Closed
A MultiGPU bug with multiple input layers #2977
eldar
commented
Aug 28, 2015
|
I tried it and got quite poor scaling: |
This was referenced Aug 29, 2015
|
Are test iterations also distributed across the GPUs? |
zxxmac
commented
Sep 10, 2015
|
I want to run prediction on images with Caffe on Windows, but the result is the same for every image. I don't know how to do prediction correctly.
|
|
Test iterations run on a single GPU. |
zxxmac
commented
Sep 10, 2015
|
Yes, it is a single GPU.
|
This was referenced Sep 12, 2015
leizhangcn
commented on ddcdc9d
Oct 1, 2015
|
This is great to check. Yet I am almost done modifying |
This was referenced Oct 6, 2015
ronghanghu
added the
parallelism
label
Oct 28, 2015
This was referenced Oct 29, 2015
weiliu89
commented
Feb 17, 2016
|
@ronghanghu This PR is great!! Any hint on how to modify the code to do testing on Multi-GPU as well? |
alfredox10
commented
Sep 2, 2016
|
Does this enable multi-GPU detection when executing? |
ronghanghu commented Aug 11, 2015
This is my package of #2870 (and originally, #2114)
Modification: Allow data layers (and also PythonLayer when used as a data layer) to be shared among worker solvers' training nets, and also their test nets for future-proofing, in case one wants to do Multi-GPU testing. Data layers are locked during forward to ensure sequential forwarding. Now all worker solvers fetch data from one single data layer.
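The locking scheme described above could be sketched roughly as follows (hypothetical standalone code; the real implementation uses boost threading primitives inside Caffe's layer classes, and `SharedDataLayer`/`next_batch_` here are made-up names): one layer instance is shared by all worker solvers, and Forward is serialized with a mutex so every solver pulls a distinct, sequential batch.

```cpp
#include <cassert>
#include <mutex>

// Hypothetical shared data layer: Forward is serialized with a mutex
// so concurrent worker solvers never receive the same batch twice.
class SharedDataLayer {
 public:
  // Stand-in for Layer::Forward: returns the index of the batch the
  // caller receives; the lock guarantees batches are handed out in
  // strict sequence, one per call.
  int Forward() {
    std::lock_guard<std::mutex> lock(forward_mutex_);
    return next_batch_++;
  }

 private:
  std::mutex forward_mutex_;
  int next_batch_ = 0;  // next batch index to hand out
};
```

With per-solver layer copies there is no such coordination point, which is exactly the duplication problem the paragraph below describes.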
This ensures that single-GPU training is consistent with multi-GPU training, and allows the tests in #2870 to pass. Otherwise, as in #2870 (#2114), multiple data layers are created, one per worker solver, and these data layers are unaware of each other. This can be a serious issue if one uses deterministic data layers or turns off shuffling. In that case, since the data layers in each worker solver read the same data, one eventually gets the same gradient on each solver, so it is almost equivalent to multiplying the learning rate by the number of GPUs. This is definitely not the desired behavior of Multi-GPU data parallelism, since one should train on different subsets of the dataset. Although in #2114 a DataReader is provided, it only applies to leveldb and lmdb, and is hardly extensible to other data layers.
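The learning-rate pitfall above can be made concrete with a toy calculation (hypothetical helper, not Caffe code): if every one of the n solvers sees the same batch, each computes the same gradient g, and the accumulated update equals lr * n * g — the same as single-GPU training with the learning rate multiplied by the GPU count.

```cpp
#include <cassert>

// Toy illustration: n_solvers workers that all saw the same batch each
// contribute an identical gradient g, so the accumulated update is
// lr * n_solvers * g rather than lr * g.
double AccumulatedUpdate(double g, int n_solvers, double lr) {
  double sum = 0.0;
  for (int i = 0; i < n_solvers; ++i) {
    sum += g;  // identical gradient from each worker solver
  }
  return lr * sum;
}
```

With distinct data subsets the per-solver gradients differ and their sum approximates the gradient of the combined (larger) batch, which is the intended semantics of data parallelism.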
DataReader is preserved in this PR and LMDB/LEVELDB DataLayer is not shared.
TODOs
- Remove DataReader; restore the old behavior of DataLayer. (DataReader is kept.)
- make runtest on a multiple-GPU machine.

Drawback
Multi-GPU training is numerically non-deterministic on data layers except for the LMDB/LEVELDB DataLayer, see #2903 (comment)