Multi-GPU Data Parallelism (with Parallel Data Layers) #2903

Merged
merged 11 commits on Aug 13, 2015

Member

ronghanghu commented Aug 11, 2015

This is my package of #2870 (and originally, #2114)

Modification: allow data layers (and also PythonLayer when used as a data layer) to be shared among the worker solvers' training nets, and also their test nets for future-proofing in case one wants to do multi-GPU testing. Data layers are locked during forward to ensure sequential forwards. Now all worker solvers fetch data from one single data layer.

This ensures that single-GPU training is consistent with multi-GPU training, and allows the tests in #2870 to pass. Otherwise, as in #2870 (#2114), multiple data layers are created for the worker solvers, and these data layers are unaware of each other. This can be a serious issue if one uses deterministic data layers or turns off shuffling. In such a case, since the data layers in each worker solver read the same data, one eventually gets the same gradient on each solver, so it is almost equivalent to multiplying the learning rate by the number of GPUs. This is definitely not the desired behavior of multi-GPU data parallelism, since one should train on different subsets of the dataset. Although #2114 provides a DataReader, it only applies to leveldb and lmdb, and is hardly extensible to other data layers.
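The sharing-plus-lock scheme described above can be sketched roughly as follows. All names here are hypothetical stand-ins (the actual changes live in layer.hpp, data_layer.hpp, and the net construction code, and Caffe uses boost::mutex rather than std::mutex):

```cpp
#include <cassert>
#include <map>
#include <mutex>
#include <string>

// Stand-in for a data layer that may be shared across worker solvers.
class SharedDataLayer {
 public:
  // True marks the layer as shareable (data layers and PythonLayer in this
  // PR; the LMDB/LEVELDB DataLayer stays unshared and keeps its DataReader).
  bool ShareInParallel() const { return true; }

  // Forward is serialized with a lock, so workers fetch batches sequentially
  // from the one shared layer instead of all re-reading the same data.
  int Forward() {
    std::lock_guard<std::mutex> guard(forward_mutex_);
    return next_batch_index_++;  // each caller receives a distinct batch
  }

 private:
  std::mutex forward_mutex_;
  int next_batch_index_ = 0;
};

// Worker nets look up the root net's layer by name instead of constructing
// their own copy (ownership elided for brevity in this sketch).
SharedDataLayer* GetOrShare(std::map<std::string, SharedDataLayer*>* registry,
                            const std::string& name) {
  auto it = registry->find(name);
  if (it != registry->end() && it->second->ShareInParallel()) {
    return it->second;
  }
  return (*registry)[name] = new SharedDataLayer();
}
```

Because every worker resolves to the same layer object, two solvers forwarding "in parallel" receive consecutive batches rather than duplicates.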

DataReader is preserved in this PR, and the LMDB/LEVELDB DataLayer is not shared.

TODOs

  • Add ShareInParallel function to layer.hpp, data_layer.hpp and pythonlayer.hpp .
  • Implement share layers during net construction, construct top blobs of shared layers.
  • Add lock to forward in layer.hpp to lock layers.
  • Share layers during worker solver construction.
  • ~~Remove DataReader. Restore old behavior of DataLayer.~~ (Reverted: DataReader is kept.)
  • Test make runtest on multiple GPU machine.
  • Test multi-gpu training on MNIST. (log: https://gist.github.com/ronghanghu/d66d63882c25b31b6148)
  • Test multi-gpu training on ILSVRC.
  • Fix NVCC warning on boost/thread.hpp to get Travis CI pass.

Drawback

Multi-GPU training is numerically non-deterministic on data layers except for the LMDB/LEVELDB DataLayer, see #2903 (comment)

cypof and others added some commits Apr 28, 2015

@cypof @shelhamer cypof Thread-local Caffe 45d792e
@cypof @shelhamer cypof Add BlockingQueue for inter-thread communication d94ca3f
@cypof @shelhamer cypof Change the way threads are started and stopped
- Interrupt the thread before waiting on join
- Provide a method for looping threads to exit on demand
- CHECK if start and stop succeed instead of returning an error
73b3d13
@cypof @shelhamer cypof Persistent prefetch thread ddcdc9d
@cypof @shelhamer cypof Add DataReader for parallel training with one DB session
- Make sure each solver accesses a different subset of the data
- Sequential reading of DB for performance
- Prefetch a configurable amount of data to host memory
- Distribute data to solvers in round-robin way for determinism
bcc8f50
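The round-robin distribution this commit describes can be illustrated with a small sketch (illustrative names, not the actual DataReader API): records are read sequentially from one DB cursor and dealt out so each solver sees a disjoint, deterministic subset.

```cpp
#include <cassert>
#include <queue>
#include <vector>

// Stand-in for one record read sequentially from the DB.
struct Datum { int record; };

// Deal sequentially-read records to per-solver queues in round-robin order,
// as the commit message describes: solver k gets records k, k+n, k+2n, ...
std::vector<std::queue<Datum>> Distribute(int num_records, int num_solvers) {
  std::vector<std::queue<Datum>> queues(num_solvers);
  for (int i = 0; i < num_records; ++i) {
    queues[i % num_solvers].push(Datum{i});
  }
  return queues;
}
```

With 6 records and 2 solvers, solver 0 receives records 0, 2, 4 and solver 1 receives 1, 3, 5: disjoint subsets in a reproducible order.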
@cypof @shelhamer cypof Allocate host memory through cudaMallocHost
thanks to discussion by @thatguymike and @flx42
d2f0457
@cypof @shelhamer cypof Multi-GPU
- Parallelize batches among GPUs and tree-reduce the gradients
- The effective batch size scales with the number of devices
- Detect machine topology (twin-GPU boards, P2P connectivity)
- Track device in syncedmem (thanks @thatguymike)
- Insert a callback in the solver for minimal code change
- Accept list for gpu flag of caffe tool, e.g. '-gpu 0,1' or '-gpu all'.
  Run on default GPU if no ID given.
- Add multi-GPU solver test
- Deterministic architecture for reproducible runs
e5575cf
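The tree reduction of gradients mentioned in the commit message can be sketched as follows (plain floats stand in for per-GPU gradient buffers; this is an illustration of the reduction pattern, not the actual synchronization code):

```cpp
#include <cassert>
#include <vector>

// Reduce gradients up a binary tree: in each round, device i adds in the
// gradient of device i + stride. After about log2(n) rounds, device 0 holds
// the full sum, which would then be broadcast back down to all devices.
float TreeReduce(std::vector<float>* grads) {
  const int n = static_cast<int>(grads->size());
  for (int stride = 1; stride < n; stride *= 2) {
    for (int i = 0; i + stride < n; i += 2 * stride) {
      (*grads)[i] += (*grads)[i + stride];
    }
  }
  return (*grads)[0];
}
```

The pairwise pattern matters because each round's transfers can run in parallel over P2P links, which is why the topology detection (twin-GPU boards, P2P connectivity) feeds into how devices are paired.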
@shelhamer mhouston Detect topology corner cases and improve broadcast order
- Start with distant nodes in broadcast
- Fix outside loop to loop for full tree depth
335bee7
@shelhamer shelhamer [docs] add multi-gpu usage note to interfaces 8771d0f

shelhamer referenced this pull request Aug 11, 2015

Merged

Multi-GPU #2870

10 of 10 tasks complete
Contributor

thatguymike commented Aug 12, 2015

Well, tests pass, but training runs seem to hang in the data prefetch queue. Not sure the new DataReader code is behaving.

Member

ronghanghu commented Aug 12, 2015

@thatguymike I'll look into this issue shortly and see why training hangs. I expect to do a rebase tonight and test on my data with multiple GPUs.

Contributor

cypof commented Aug 12, 2015

It's a great thing to get rid of the data reader and unify all data layer types. One thing I'm concerned about in this design, though, is the ordering of threads on the lock. It might not be absolutely required, but if we want runs to be reproducible at the numerical-precision level, each solver needs to take data items in the same order, which I don't believe the lock can enforce as it is. Each run might see items distributed to solvers differently. The gradient sum should be the same, but with slight differences since items would have been added in a different order.

Member

ronghanghu commented Aug 12, 2015

Regarding Michael Houston's concern:

I wouldn’t be surprised if in the new code we are violating some internal assumption about LMDB thread access causing deadlocks, I have hit those before.

In this PR, only one single DataLayer is shared among all worker solvers. Since data in lmdb/leveldb is read in this DataLayer's prefetch thread rather than in the worker solver threads, the data prefetch behavior doesn't deviate from the single-GPU case.

Member

ronghanghu commented Aug 12, 2015

It might not be absolutely required, but if we want runs to be reproducible at the numerical precision level, each solver needs to take data items in the same order, which I don't believe the lock can enforce as it is.

@cypof I thought about this issue. However, I am not too concerned about it, since in general this PR produces more consistent and numerically identical results for all other data layers (except leveldb/lmdb) than #2870 does.

In #2870 you'll get random behavior if a data layer supports and turns on shuffling, or get e.g. a 4X effective learning rate otherwise. In both situations, the behavior is clearly worse than this PR's and deviates from single-GPU training with an increased batch size. The latter behavior also defeats the purpose of multi-GPU data parallelism.

Member

ronghanghu commented Aug 12, 2015

Travis CI fails because NVCC generates a warning over boost/thread.hpp, which is included in layer.hpp (see the Travis CI build details)

/home/travis/miniconda/include/boost/thread/pthread/thread_data.hpp(42): warning: controlling expression is constant

@shelhamer any suggestions to fix/suppress this warning?

Member

ronghanghu commented Aug 12, 2015

@thatguymike I made some updates, removed the data reader, and successfully trained on MNIST. I am also training on ILSVRC-2012-CLS with this PR.

Can you test again? Since data in lmdb/leveldb is read in this DataLayer's prefetch thread rather than in the worker solver threads, the data prefetch behavior shouldn't deviate from the single-GPU case.

Contributor

thatguymike commented Aug 12, 2015

Seems to work functionally, but scaling perf took a significant hit for some reason at 4 GPUs for AlexNet. Quite a significant slowdown.

Member

ronghanghu commented Aug 12, 2015

@thatguymike I'll look into this today.

Member

ronghanghu commented Aug 12, 2015

@thatguymike To be specific, are you seeing a lot of the following log messages?

I0812 08:57:39.468806 24173 blocking_queue.cpp:49] Data layer prefetch queue empty
Contributor

cypof commented Aug 12, 2015

How many transform threads are created by the shared data layer?

Member

ronghanghu commented Aug 12, 2015

@cypof There should be only one single prefetch thread, in which the transform is performed. Only forward is done multi-threaded in each solver, via a lock.

@thatguymike looking into the drift issue you mentioned.

Contributor

thatguymike commented Aug 12, 2015

I am seeing a few notices of the data layer prefetch queue being empty that in theory I shouldn't be seeing. I don't see them with #2870 because I'm on fast SSDs and my LMDB should be in the kernel file cache.

Contributor

cypof commented Aug 12, 2015

If I understand well, the shared layer's prefetch thread is now doing both load and transform. That might not be fast enough to feed several GPUs.

Contributor

thatguymike commented Aug 12, 2015

Good point. Let me try with an uncompressed LMDB. A single thread should be fast enough for the raw IO, but might not be fast enough for the decode and transforms.

Contributor

cypof commented Aug 12, 2015

The "data layer prefetch queue empty" message is only shown every 1000 occurrences to avoid filling the logs.
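The every-1000 throttling can be sketched as a simple modulo counter (an illustrative stand-in, not the actual logging macro Caffe uses):

```cpp
#include <cassert>

// Returns true only on the 1st, 1001st, 2001st, ... call, so the
// "prefetch queue empty" warning doesn't flood the logs.
bool ShouldLogQueueEmpty() {
  static int count = 0;
  return (count++ % 1000) == 0;
}
```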

Contributor

thatguymike commented Aug 12, 2015

Adding to the comment from @cypof: if we are going to do a shared read, we can read the superbatch, notify all, and have each solver worker pull from the image array at its own offset. That should make things consistent. We might need to rework the code to handle decode and transform in each worker, ideally in an independent thread, to keep up.

Contributor

thatguymike commented Aug 12, 2015

As for NVCC warnings, we should be able to send a suppression flag to the front-end.

Member

ronghanghu commented Aug 12, 2015

@cypof @thatguymike Right now there is only one prefetch thread. I think I can use multiple prefetch threads instead, still within one single shared DataLayer.

My idea is to make each data layer read, decode, and transform as fast as it can to fill the prefetch batch space. This can be done via multi-threading within a data layer. Each data layer should be responsible for providing data at the highest possible speed.
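The multi-threaded prefetch idea could look roughly like this: several worker threads inside one shared data layer filling a common queue. This is a hypothetical structure using std::thread and std::condition_variable; Caffe's own prefetching uses boost threads and its BlockingQueue.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Minimal thread-safe queue standing in for the prefetch BlockingQueue.
class PrefetchQueue {
 public:
  void Push(int batch) {
    std::lock_guard<std::mutex> g(m_);
    q_.push(batch);
    cv_.notify_one();
  }
  int Pop() {
    std::unique_lock<std::mutex> g(m_);
    cv_.wait(g, [this] { return !q_.empty(); });
    int b = q_.front();
    q_.pop();
    return b;
  }
  size_t Size() {
    std::lock_guard<std::mutex> g(m_);
    return q_.size();
  }
 private:
  std::mutex m_;
  std::condition_variable cv_;
  std::queue<int> q_;
};

// Each worker thread reads/decodes/transforms its own records and pushes
// finished batches; solvers then pop from the single shared queue.
void FillWithThreads(PrefetchQueue* q, int num_threads, int per_thread) {
  std::vector<std::thread> workers;
  for (int t = 0; t < num_threads; ++t) {
    workers.emplace_back([q, t, per_thread] {
      for (int i = 0; i < per_thread; ++i) q->Push(t * per_thread + i);
    });
  }
  for (auto& w : workers) w.join();
}
```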

Member

ronghanghu commented Aug 12, 2015

To clarify, the first 9 commits in this PR are exactly the same as in current #2870. The only changes are the last two commits,

  • enable sharing data layer
  • remove data reader

Looking into the correctness issue right now. Will address speed issue afterwards.

Member

ronghanghu commented Aug 12, 2015

Regarding the numerically non-deterministic issue, I plan to use a more advanced lock that only allows threads to visit it in e.g. 1->2->3->4->1->2->3->4 order on shared data layers. But right now I'll first handle the more drastic drift I observed.
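The ordered-lock idea could be sketched as a turn-based gate that admits solver threads in a fixed cyclic order, so each run distributes data items to solvers identically. This is hypothetical; such a lock was not part of the PR as merged.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Admits threads strictly in the cycle 0 -> 1 -> ... -> n-1 -> 0 -> ...
class TurnLock {
 public:
  explicit TurnLock(int num_solvers) : n_(num_solvers) {}

  void Acquire(int solver_id) {
    std::unique_lock<std::mutex> g(m_);
    cv_.wait(g, [this, solver_id] { return turn_ == solver_id; });
  }

  void Release() {
    std::lock_guard<std::mutex> g(m_);
    turn_ = (turn_ + 1) % n_;  // hand the turn to the next solver
    cv_.notify_all();
  }

 private:
  int n_;
  int turn_ = 0;
  std::mutex m_;
  std::condition_variable cv_;
};
```

Unlike a plain mutex, which grants the lock to whichever thread the scheduler wakes first, this gate makes the data-item-to-solver assignment reproducible across runs, at the cost of forcing slower solvers to stall the cycle.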

Contributor

thatguymike commented Aug 12, 2015

Multiple prefetch threads reading from LMDB aren't going to help in general, since one thread should be able to saturate the IO. It's the image decode and image transforms we need to spread across CPU threads.

Forcing the threads to visit in a specific order might work, but again, you only want to do that on the data load and not on the processing (decode/transform) of the data. I think it would make more sense to have a single thread pulling the superbatch from the IO system and then have each thread get notified to read its chunk using a specific thread-ID offset (batch_size * thread_id).
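The superbatch-plus-offset scheme suggested here can be sketched as follows (illustrative names, not Caffe API): one IO thread fills a superbatch, then each worker takes the slice starting at batch_size * thread_id.

```cpp
#include <cassert>
#include <vector>

// After a single IO thread has filled `superbatch`, worker `thread_id`
// copies out its own contiguous chunk. Record indices stand in for images.
std::vector<int> WorkerSlice(const std::vector<int>& superbatch,
                             int batch_size, int thread_id) {
  auto begin = superbatch.begin() + batch_size * thread_id;
  return std::vector<int>(begin, begin + batch_size);
}
```

Each worker's slice is disjoint and fixed by its thread ID, which is what makes the assignment reproducible; decode and transform could then run per-worker on each slice.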

Member

ronghanghu commented Aug 12, 2015

It's the image decode and image transforms we need to spread across CPU threads.

I wonder if that can be done with a simple solution, e.g. OpenMP over a for loop within the prefetch thread?
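A sketch of that OpenMP idea (illustrative only; the thread ultimately decided against mixing OpenMP with boost threads). The pragma splits the independent per-item transforms across CPU threads, and the loop simply runs serially if OpenMP is not enabled at compile time.

```cpp
#include <cassert>
#include <vector>

// Stand-in for decode + mean-subtract + crop on one encoded datum.
int Transform(int encoded) { return encoded * 2; }

std::vector<int> TransformBatch(const std::vector<int>& batch) {
  std::vector<int> out(batch.size());
  // Each iteration is independent, so the loop can be split across threads.
  #pragma omp parallel for
  for (int i = 0; i < static_cast<int>(batch.size()); ++i) {
    out[i] = Transform(batch[i]);
  }
  return out;
}
```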

Contributor

thatguymike commented Aug 12, 2015

Mixing OpenMP with boost threads is going to be a pain and debugging hell. Better to launch threads as we do now and use the blocking queues. However, if we start heading back to what @cypof originally did, that should get us back perf-wise, but it's the numerical drift that has me currently more concerned. I'll bet it's all wrapped up in the same changes in the data loading.

@ronghanghu ronghanghu Data Layers Parallel for Multi-GPU
Allow data layers (and also PythonLayer when used as data layer) to be shared
among worker solver's training net, and also test net for future-proof if one
wants to do Multi-GPU testing. Data layers are locked during forward to ensure
sequential forward.
0d34d5b
Member

ronghanghu commented Aug 12, 2015

@thatguymike @cypof Due to the issues, while continuing to debug my current branch, I took a step back in this PR and kept the data reader. Now the LMDB/LEVELDB DataLayer isn't shared.

There are only 94 additions and 20 deletions upon #2870, which should be easy to review.

Contributor

cypof commented Aug 12, 2015

Wow, this is it! I'm already the old fart annoyed at the young padawan who wants to change everything. Anyway, if you change the data layer to be multi-threaded, you will probably end up replicating the data reader pattern. So for now it might be better to just support multi-gpu only with DBs.

As a longer term solution, there is either moving the other data layer types to the data reader, or the more general solution to allow layers to run asynchronously from each other. Something like having queues between layers, and letting them run independently on their own thread.

I'm not sure what that would look like, but then a big network could have several sections running in parallel, feeding each other activations and gradients through queues. It's probably best to design this as part of the larger discussion about model parallelism etc.

Member

ronghanghu commented Aug 12, 2015

@cypof OK, let's get the NVCC warning suppressed. Any idea?

Right now other data layers like HDF5DataLayer are shared to make the solver tests pass.

Contributor

thatguymike commented Aug 12, 2015

So reproducibility and perf are back, it seems. Why the scoped lock in Forward?

For the suppression, let's try: -Xcudafe "--diag_suppress=boolean_controlling_expr_is_constant". Should work, but it's a little ugly.

Owner

shelhamer commented Aug 12, 2015

Thanks for all your work in this thread everybody!

@cypof layer parallelism is covered by @longjon's #2219; let's keep that PR as the reference for the full conversation when we have it (but let's try to solve the current IO issue first).

I can understand the frustration as we figure out how to bridge parallel IO
and different data formats. Do you see a path to change or extend
DataReader to other formats without duplicating the data layers?

@ronghanghu for NVCC / boost warnings we're usually careful about includes,
forward declarations, and façades to hide boost from NVCC.

For threads / parallelism, let's keep exclusively to boost as commented by
@thatguymike

Member

ronghanghu commented Aug 12, 2015

so reproducibility and perf are back it seems. Why the scope lock in Forward?

In short, let solver tests on HDF5 pass.

Quote from @jeffdonahue

The previous version of TestGradientBasedSolver in #2114 was not really testing much of anything because it was using a DummyDataLayer with a constant filler -- all data were just 1 vectors.
This means that a solver which reorders data in arbitrary ways would pass the tests (as the data were all 1s). In #2866 I changed this so most tests used a Gaussian filler (thereby checking that solvers don't somehow reorder the data). When the parallel branch was rebased on the new version of the test, parallel tests were no longer passing due to a somewhat complex issue with the RNG being in a different state.
To avoid the messiness of relying on RNG state to verify correctness, but still check that solvers don't reorder the data, Evan again changed the test to use fixed HDF5 data (#2867). Rebasing the parallel branch again with the newer version of the test using HDF5 data, the parallel tests still failed, this time because DataReader did not work for HDF5 data. As Evan explained in his original reply, this led us to reconsider the DataReader design choice, which is the current ongoing work by Ronghang.

So, in order to test solvers with deterministic data, the current workaround is to load HDF5 files with HDF5DataLayer, which, when shared, needs to be locked during forward. I'll modify this so the lock only applies to those shared layers.

Contributor

thatguymike commented Aug 12, 2015

Well, without the mutex, the HDF5 tests crash in entertaining ways in multi-GPU tests. For training, it has little performance impact until you run cuDNNv3 or really start to scale up (8 GPUs). Why is the mutex required to serialize forward for the parallel layers (HDF5) to pass tests? I fear we are heading back into hacky land. For LMDB training, it doesn't seem required...

Contributor

thatguymike commented Aug 12, 2015

We raced on comments. The mutex hurts perf since we serialize the forward passes. For 4 GPUs without cuDNN, it's a ~10% perf hit; at 8 GPUs, ~25%. With the cuDNNv3 backend, ~30% at 4 GPUs.

Member

ronghanghu commented Aug 12, 2015

The mutex hurts perf as we serialize the forward parts. For 4 GPUs without cuDNN, it's a ~10% perf hit. 8 GPUs, ~25% perf hit. With the cuDNNv3 backend, ~30% hit at 4 GPUs.

I'll only apply it when necessary. To be updated.

Member

ronghanghu commented Aug 12, 2015

Now the remaining task is to suppress the NVCC warnings on boost.

Contributor

cypof commented Aug 12, 2015

It might work to load a data layer from the data reader, I can look at it.
Is the lock enough to get the tests to pass for now?


Member

ronghanghu commented Aug 12, 2015

Is the lock enough to get the tests to pass for now?

@cypof Yes, I think so.

Member

ronghanghu commented Aug 12, 2015

Right now forward is locked on a layer only if that layer is actually shared. Fixed the NVCC warnings on boost.

@cypof @thatguymike I suppose this should be working.

Contributor

thatguymike commented Aug 12, 2015

All of my sanity tests are checking out. I have kicked off a full training run with 4GPUs on AlexNet to validate. It will take time to complete, but convergence looks correct at the moment.

I think this is ready to merge and then we can come back around and look more carefully at data prefetch/decode/transform and clean up shared/not shared layers.

@cypof and @shelhamer any concerns at this point?

Contributor

cypof commented Aug 12, 2015

No, I think it's all good.


Contributor

sguada commented Aug 12, 2015

Great work. In case you want to revisit it, here is an earlier attempt I made to better separate data sources from pre-processing and backend. Although it is outdated, I think the basic idea of separating the DB reading, the pre-processing, and the data layer is still useful.

#1568

The basic idea is to have a factory of Datum_DB that only allows opening each data source once, but allows multiple cursors. So each data layer can read from there and do the needed preprocessing. I think encapsulating the Datum_DB with its own parameters could be a good thing down the road.

Contributor

thatguymike commented Aug 12, 2015

As we scale and keep tuning GPU perf, we are going to hit limits on the image handling. I'll try to find a block of time to look at #1568.

Member

ronghanghu commented Aug 12, 2015

For the sake of progress and to not stall multi-GPU again, I expect this PR to be merged within this week, since it makes #2870 pass the solver tests and does not cause a performance drop while preserving the DataReader. @jeffdonahue @longjon please take a look when you have time.

Sharing the data layer may be a temporary solution to get #2870 to pass the solver tests. The original idea of parallel data layers suffers from a speed drop due to the decode and transform bottleneck, as well as numerical drift, and I admit it is quite difficult or hacky to fix.

As a longer term solution, there is either moving the other data layer types to the data reader, or the more general solution to allow layers to run asynchronously from each other. Something like having queues between layers, and letting them run independently on their own thread.

The current design of shared data layers can be superseded once a more general solution is available.

Contributor

jeffdonahue commented Aug 12, 2015

Nice work @ronghanghu -- thanks for working out all these issues and getting this ready! And thanks of course to @cypof and @thatguymike for all your previous work and keeping up with the changes, and continuing to benchmark on your end.

The new changes LGTM, and given that the newly rigorous unit tests pass, and with @thatguymike's and @cypof's blessing, I think this is ready for merge. It would be good for @longjon to take a look as well, though.

Owner

shelhamer commented Aug 13, 2015

This approach has my blessing with the consensus of @jeffdonahue @longjon
@cypof @thatguymike

Thanks @ronghanghu!

Member

ronghanghu commented Aug 13, 2015

This PR is planned to be merged once @thatguymike 's tests pass. I am also testing it on BVLC machine.

Contributor

sguada commented Aug 13, 2015

One more thought: although for testing the code we probably want to control the order in which the data is processed, for actual training, introducing more randomness into the order of the data is helpful.
So it would be great if, after reading many Datum from the sequential DB, we could process them in random order, so that they introduce randomness into each batch processed by each GPU.

Contributor

thatguymike commented Aug 13, 2015

Okay, my 2 intensive tests have passed convergence training. I think we should merge and then we can get back to perf issues.

@sguada - I think we need to look more generally at randomly building batches. You don't want to do this in LMDB. As a stop-gap, especially if we are going to thread the transforms, I agree we can at least start supporting randomization of the images within the batch in the data layer.

Member

ronghanghu commented Aug 13, 2015

Thanks to @cypof @thatguymike for development and test. Thanks to all community contributors for working on and reviewing Multi-GPU.

For now, multi-GPU data parallelism in Caffe is expected to be used with the LMDB/LEVELDB DataLayer to achieve maximum performance, and is available only in the training phase. Generalizations to other types of data layers and/or revisions of the multi-GPU design can follow in future issues and PRs.

@ronghanghu ronghanghu added a commit that referenced this pull request Aug 13, 2015

@ronghanghu ronghanghu Merge pull request #2903 from ronghanghu/multi_gpu
Multi-GPU Data Parallelism
bb0a90e

@ronghanghu ronghanghu merged commit bb0a90e into BVLC:master Aug 13, 2015

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed

ronghanghu deleted the ronghanghu:multi_gpu branch Aug 13, 2015

Contributor

cypof commented Aug 13, 2015

Woohoo! Thank you Caffe team and fellow contributors!


ronghanghu referenced this pull request Aug 13, 2015

Closed

Adam solver #2856

Contributor

sguada commented Aug 13, 2015

Congrats to everyone involved!!!

Sergio


Member

naibaf7 commented Aug 15, 2015

I was able to reconcile this code with #2610 (device abstraction and OpenCL backend). However, while all the new code is merged, it is not fully functional at the moment.
The OpenCL backend needs some additional functions to support this the same way CUDA does, and I suspect something with the GPU IDs will be wrong at the moment when using it with CUDA.
I can't fix it currently because I don't have two nVidia GPUs, but I can fix it as soon as I have written the new OpenCL multi-GPU functions.

If anyone has time to review it or test it on nVidia, that would be great :)

zxt881108 commented

@shelhamer @ronghanghu In my experiments, I found that multi-GPU works with Tesla K40 but not with Titan X (I use four K40s & four Titan X). With Titan X, iterations are very, very slow, and the result is wrong (loss = 87). Do you know the problem? CUDA driver? Maxwell? Thanks! I tested on CUDA 7.0 and CUDA 7.5.

Contributor

cypof commented Aug 21, 2015

Adding @thatguymike as he is working on a 4 TitanX box.

Contributor

thatguymike commented Aug 21, 2015

@zxt881108, we need a lot more information. What network, cuDNN vs. BLAS, etc.? (CaffeNet is a touchy network to train generally...) Did you properly adjust the batch size in train_val.prototxt down as you added GPUs, or correct the learning rate in your solver to account for the added GPUs?

It's odd that iteration is very slow; it implies the Titan X boards are not on the same root complex, and/or your motherboard has problems talking between devices, and/or there is some stability issue with one of your boards.

What does the output of nvidia-smi topo -m show? If you run the nvidia sample p2pBandwidthLatencyTest, what is the output? For DIGITS devbox (4 Titan X):

Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
D\D 0 1 2 3
0 128.28 26.12 20.32 20.32
1 26.16 127.91 20.33 20.32
2 20.32 20.33 128.08 26.14
3 20.31 20.32 26.13 127.74

@thatguymike We tried again; still not solved. We tested AlexNet & GoogleNet with cuDNN v2, and the settings you mentioned look right (we can reproduce multi-GPU training with the same configuration on Tesla K40m). So the main problem might be our server. Based on your suggestion, we ran the NVIDIA p2pBandwidthLatencyTest; the results may show some problem, but we don't know the reason. I am uploading the results; could you please take a look? Thanks.
The nvidia-smi topo -m output: [screenshot]

When we run the NVIDIA sample p2pBandwidthLatencyTest, we can only get partial results: when the program reaches the P2P=Enabled Latency Matrix (us), it gets stuck there and we can't get the result. [screenshots]

naibaf7 referenced this pull request Aug 23, 2015

Closed

Caffe OpenCL support #2610

Contributor

thatguymike commented Aug 23, 2015

The topology on your server looks very strange. What server/motherboard and what GPUs exactly are you trying to run?

Generally on a 2-socket node, GPUs 0,1,2,3 and 4,5,6,7 should be able to talk fast within each group. However, per the bandwidth test on your server, 0<->1 is the slowest connection, as is 2<->3, etc. This looks like a system BIOS or motherboard issue. You can try working around it by manually grouping 0,2,4,6 together (-gpu=0,2,4,6) to see if the device numbering from the BIOS is incorrect.

@thatguymike Thanks! Just now I used device IDs -gpu 0,2,4,6 and the problem is partly solved, but the speedup ratio is horrible (GoogleNet, quick_solver, mini-batch=64: device_id=0, iter20=9s; device_id=0,2, iter20=12s; device_id=0,2,4,6, iter20=23s). What is your speedup ratio on the DIGITS DevBox (4 Titan X)? Our server is a Tyan B7079, the GPUs are Titan X, the CPUs are Intel E2650v3 (x2), memory is 32G DDR4 (x24), and the hard disks are all SSDs. It now seems there are still some problems with our server's system BIOS; we have called the manufacturer. Thanks again!

Contributor

thatguymike commented Aug 23, 2015

Remember that your effective batch size scales up as well, so your 2-device speedup doesn't look too bad, but it's clearly not great. Note from your p2pBandwidth test results that your server has about half the bandwidth between boards that the DIGITS DevBox has, so you are going to be MUCH more communication-bound on scaling than some other systems. I will note that issues with scaling performance and performance stability are exactly why my team designed the DevBox the way we did. You can replicate most of our build from online documents if you wish.

You can try larger batches to see how your performance changes, but something is up with your server. You might want to check the server logs for PCIe errors and definitely check the system BIOS. You can also systematically try different combinations of devices to find the fast and slow pairs, and then the fast and slow sets of 4 boards. 8 boards on that machine are not going to perform well with the current code, if ever, because you have to cross the PCIe bridge (especially as one of your links is only 1 GB/s per your bandwidth test results).

You might also want to validate the scaling performance you achieve with AlexNet, as there is more published work on that.

Also, running Titan X's in a server chassis at that density is likely not going to behave how you want in the long run without careful cooling design. (Note the modifications we had to make in the DIGITS DevBox to keep 4 Titan X's thermally happy without crazy fan setups.)

Contributor

thatguymike commented Aug 24, 2015

Okay, my numbers for GoogleNet with cudnnv3 on DIGITS DevBox (X99-E WS chipset and 4x TitanX)

Weak scaling (default behavior of master)
1 GPU: 7.9 sec/20 iterations
2 GPU: 8.3 sec/20 iterations
4 GPU: 11.3 sec/20 iterations

My P2P bidirectional perf:

Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
D\D 0 1 2 3
0 128.03 26.16 20.32 20.32
1 26.17 127.99 20.31 20.32
2 20.32 20.32 127.91 26.14
3 20.31 20.32 26.14 127.49

@thatguymike Thanks for your suggestion; we have solved the P2P bandwidth problem between GPU IDs 0 & 1. The system BIOS version was too old; after updating it, the P2P bandwidth values look normal. [screenshot]

eldar commented Aug 28, 2015

I tried it and got quite poor scaling:
1 GPU: 39 sec/20 iterations
2 GPUs: 67 sec/20 iterations
I use a custom data layer derived from WindowDataLayer. The version of Caffe is the latest from master. How can I profile what's going on there?

Contributor

erogol commented Sep 10, 2015

Are test iterations also distributed across GPUs?

zxxmac commented Sep 10, 2015

I want to predict images with caffe-window, but the result is the same for different images. I don't know how to do prediction correctly.


Member

ronghanghu commented Sep 10, 2015

Test iterations are running on single GPU.

zxxmac commented Sep 10, 2015

Yes it is single GPU


This is great to check. I am almost done modifying MILDataLayer to make it compilable. Give me 15 mins.

@ronghanghu This PR is great!! Any hint on how to modify the code to do testing on Multi-GPU as well?

ih4cku referenced this pull request in ih4cku/caffe-notes Jul 31, 2016

Open

caffe multiple card #13

Does this enable multi-GPU detection when executing?
prediction = net.forward()
