Synchronous SGD via layer-wise parallelism #2219

Closed · wants to merge 33 commits

Conversation

@longjon (Contributor) commented Mar 28, 2015

This PR upgrades Caffe with integrated support for layer-wise multi-device parallelism. This allows Caffe to perform parallel synchronous SGD, using model-parallel, data-parallel, or hybrid approaches. This includes parallelization across GPUs, CPU/GPU parallelism, and multi-threaded CPU parallelism. This is a work in progress, currently in a proof-of-concept state.

The basic approach is as follows:

  • Layer parameters are augmented with a device and a thread_id. Together these give a (logical) description of which device the layer will run on (and where its memory will be allocated). Thread IDs work like CUDA streams: layers are launched in their specified order and automatically synchronized according to the network DAG. Devices and thread IDs are specified manually; poor choices will result in poor performance.
  • With one exception, each layer executes on a single device and has its tops, bottoms, and params stored on that same device. The exception is split layer, which is allowed to have tops that live on different devices and is used to implement automatic copying. The actual top devices have to be specified as a parameter (top_device) to split layer, but this is filled in automatically for automatically inserted split layers (a minimal sketch follows this list).
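
For concreteness, here is a minimal prototxt sketch of what this placement might look like, assuming the device/thread_id fields on layers and the top_device field on split layer take roughly the shape below. The exact field nesting, enum values, and blob names are illustrative assumptions, not the PR's literal schema.

```
# Sketch only: field nesting and enum values are assumed for illustration;
# the actual definitions are in this PR's proto changes.
layer {
  name: "split_data"
  type: "Split"
  bottom: "data"
  top: "data_d0"
  top: "data_d1"
  # Split is the one layer whose tops may live on different (logical) devices;
  # one top_device entry per top. Auto-inserted splits get these filled in.
  top_device { type: GPU device_id: 0 }
  top_device { type: GPU device_id: 1 }
  thread_id: 1   # a manually specified split can do its copies in its own thread
}
layer {
  name: "conv1_d1"
  type: "Convolution"
  bottom: "data_d1"
  top: "conv1_d1"
  convolution_param { num_output: 96 kernel_size: 11 stride: 4 }
  device { type: GPU device_id: 1 }  # this layer, its tops/bottoms, and its params live here
  thread_id: 2                       # layers with different thread_ids may run concurrently
}
```

Layers sharing a thread_id execute in their listed order; layers on different thread_ids are synchronized only where the network DAG requires it.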

Limitations:

  • Parameters-as-such may not be shared between devices. Note that Caffe currently provides two different mechanisms for sharing things, which implement essentially the same functionality: parameter sharing and split layers. Rather than extend both versions of this functionality, this PR chooses to extend split layer. Shared parameters can still be used; they just need to be expressed with the parameter layer for learning any bottom (#2079) and the option to take parameters from bottoms (#2166).
  • Each layer's forward and backward must be performed on the same device.
  • Automatically inserted split layers are given thread_id -1, and therefore all copies are done in the same thread. This can be worked around by manually specifying split layers.
  • There is no way to specify which physical devices to use. (device_ids as given to layers are purely logical and mapped arbitrarily to physical IDs starting from 0.) Use CUDA_VISIBLE_DEVICES instead.
  • The currently included examples use DummyDataLayer. Using this with real data layers will cause problems, due to some subtle and unnecessary singleton interaction. Ultimately prefetching should probably just be integrated into this framework, but that requires additional functionality for pipelining computation. There are very ugly ways to hack around this for now if you really want to.
  • Many checks that should exist are missing.
  • I have only verified the correctness of a previous version; there could be correctness bugs in this one.
  • This PR adds a protobuf enum for device types, which should probably be merged with Caffe's own Brew enum somehow.

A simple example is included for CaffeNet data parallelism (not using real data layers). I've found that it achieves nearly a 2x speedup on 2 GPUs, ~2.5x on 3 GPUs, and scales poorly on 4 GPUs. I'm sure that with additional tuning and better execution planning, much closer-to-linear scaling can be achieved. The example is generated using #2086 (but the output is included). Note that actual solving will incur some (small) additional overhead from host-to-device transfer of input data (which should be made asynchronous) and from the solver's parameter-update code.
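
To make the data-parallel pattern concrete, one replica's weights might be wired up roughly as below: a parameter layer (per #2079) owns the master copy of the weights on one device, a split layer copies them to each replica (accumulating the weight diffs on the way back), and each replica's convolution takes its weights from a bottom (per #2166). The Parameter layer syntax and the blob names here are assumptions for illustration, not the generated example.

```
# Rough sketch of weight sharing for data parallelism (not the generated example);
# the Parameter-layer syntax follows #2079/#2166 and is assumed here.
layer {
  name: "conv1_w"
  type: "Parameter"                          # learnable blob exposed as a top (#2079)
  top: "conv1_w"
  parameter_param { shape { dim: 96 dim: 3 dim: 11 dim: 11 } }
  device { type: GPU device_id: 0 }          # master copy of the weights lives here
}
layer {
  name: "split_conv1_w"
  type: "Split"
  bottom: "conv1_w"
  top: "conv1_w_d0"
  top: "conv1_w_d1"
  top_device { type: GPU device_id: 0 }
  top_device { type: GPU device_id: 1 }      # copy to replica 1; its diff is accumulated back
}
layer {
  name: "conv1_d1"
  type: "Convolution"
  bottom: "data_d1"
  bottom: "conv1_w_d1"                       # weights taken from a bottom (#2166)
  top: "conv1_d1"
  convolution_param { num_output: 96 kernel_size: 11 stride: 4 bias_term: false }
  device { type: GPU device_id: 1 }
}
```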

NVTX instrumentation of forward/backward calls is included, and together with nvprof/nvvp it makes performance issues pretty easy to find. It's a build option, WITH_NVTX.

A hacky option, -notime_layers, is added to caffe time to do rough-and-ready parallel timing.

This PR is largely orthogonal to the net-level parallelism of the parallel branch, even though both can be used to implement some of the same things, like pure data parallelism for synchronous SGD (see #2114). Some common functionality has been factored out. I have no plans to support multi-node parallelism or asynchronous SGD using this code; you can combine it with the parallel branch for that.

There are many existing PRs included in this one.

Note that, omitting the included PRs and the (very long) examples, the total diff here is on the order of a few hundred lines.

This is still work-in-progress, but hopefully usable/hackable by eager beavers.

Major TODOs:

  • Provide a sensible way to specify (multiple) devices, and either make CPU/GPU mode a construction-time Net option or coordinate this PR with mode switching in some other way.
  • Provide a mechanism for pipelining and integrate data loading/transformation into this framework.
  • Verify correctness of learning in this framework (any deviation from single-device learning is a bug), and provide tests.
  • Verify that known schemes for model/data parallelism perform as expected, e.g., Krizhevsky's "one weird trick" and the data parallelism employed by VGG and MSRA.

cypof and others added 30 commits on March 27, 2015 at 16:40. Commit messages:

  • This means that Caffe::Get has to be moved to common.cpp, and loses its "inline" (but there are no real performance implications).
  • Instead of just keeping track of input and output blobs, also keep track of layer dependencies. (Also adjust AppendBottom's argument types to avoid passing an input as a pointer.)
  • This simplifies the OS X build, and will allow use of the per-thread default stream for running existing layer code asynchronously. Note that this may cause issues with code that assumes either explicit or device-level synchronization, which we'll fix in the next commit.
  • This ensures that layers are synchronous with respect to each other, even when layer code doesn't use explicit streams.
  • There are no cases where Forward is called without Reshape, so we can simplify the call structure.
  • This will allow us to cleanly kill compute threads that are waiting for work.
  • This gives us a way to specify layer-level execution placement for layerwise parallelism, implemented in future commits.
  • Split layer gains a param, top_device, which allows tops to exist on different (explicitly specified) devices. Params are automatically copied and diffs are automatically accumulated. Because the implementation is now device-agnostic, it's done in (only) the *_cpu functions.
  • This fills in the top_device param of split layer according to the device params of the connecting layers.
  • This is necessary to ensure that buffers are allocated on the correct devices.
  • Compute threads hold (blocking) queues of forward or backward commands, which are synchronized according to the layer graph through Net member variables.
  • This fully exercises the multi-GPU case, and saves time.
  • This is necessary to ensure that operations are performed on the correct device.
@futurely commented May 4, 2015

Why would a PR include and be blocked by so many other PRs?

@dzhwinter commented:

The peer copy fails when the cudaDeviceEnablePeerAccess() function is used in net.cpp line 349. I tested on a multi-GPU setup with 2 cards.

@shelhamer (Member) commented:

Closing as multi-GPU dev took the other turn, although I always liked this branch.

forking paths
