Synchronous SGD via layer-wise parallelism #2219

Closed · wants to merge 33 commits

Conversation

@longjon (Contributor) commented Mar 28, 2015

This PR upgrades Caffe with integrated support for layer-wise multi-device parallelism. This allows Caffe to perform parallel synchronous SGD, using model-parallel, data-parallel, or hybrid approaches. This includes parallelization across GPUs, CPU/GPU parallelism, and multi-threaded CPU parallelism. This is a work in progress, currently in a proof-of-concept state.

The basic approach is as follows:

  • Layer parameters are augmented with a device and a thread_id. Together these give a (logical) description of which device the layer will run on (and where its memory will be allocated). Thread IDs work like CUDA streams: layers are launched in their specified order and automatically synchronized according to the network DAG. Devices and thread IDs are specified manually; poor choices will result in poor performance.
  • With one exception, each layer executes on a single device and has its tops, bottoms, and params stored on that same device. The exception is split layer, which is allowed to have tops that live on different devices and is used to implement automatic copying. The actual top devices have to be specified as a parameter (top_device) to split layer, but this is filled in automatically for automatically inserted split layers (a minimal sketch follows this list).
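
For concreteness, here is a minimal prototxt sketch of what this placement might look like, assuming the device/thread_id fields on layers and the top_device field on split layer take roughly the shape below. The exact field nesting, enum values, and blob names are illustrative assumptions, not the PR's literal schema.

```
# Sketch only: field nesting and enum values are assumed for illustration;
# the actual definitions are in this PR's proto changes.
layer {
  name: "split_data"
  type: "Split"
  bottom: "data"
  top: "data_d0"
  top: "data_d1"
  # Split is the one layer whose tops may live on different (logical) devices;
  # one top_device entry per top. Auto-inserted splits get these filled in.
  top_device { type: GPU device_id: 0 }
  top_device { type: GPU device_id: 1 }
  thread_id: 1   # a manually specified split can do its copies in its own thread
}
layer {
  name: "conv1_d1"
  type: "Convolution"
  bottom: "data_d1"
  top: "conv1_d1"
  convolution_param { num_output: 96 kernel_size: 11 stride: 4 }
  device { type: GPU device_id: 1 }  # this layer, its tops/bottoms, and its params live here
  thread_id: 2                       # layers with different thread_ids may run concurrently
}
```

Layers sharing a thread_id execute in their listed order; layers on different thread_ids are synchronized only where the network DAG requires it.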

Limitations:

  • Parameters-as-such may not be shared between devices. Note that Caffe currently provides two different mechanisms for sharing things, which implement essentially the same functionality: parameter sharing and split layers. Rather than extend both versions of this functionality, this PR chooses to extend split layer. Shared parameters can still be used; they just need to be expressed with the parameter layer for learning any bottom (#2079) and the option to take parameters from bottoms (#2166).
  • Each layer's forward and backward must be performed on the same device.
  • Automatically inserted split layers are given thread_id -1, and therefore all copies are done in the same thread. This can be worked around by manually specifying split layers.
  • There is no way to specify which physical devices to use. (device_ids as given to layers are purely logical and mapped arbitrarily to physical IDs starting from 0.) Use CUDA_VISIBLE_DEVICES instead.
  • The currently included examples use DummyDataLayer. Using this with real data layers will cause problems, due to some subtle and unnecessary singleton interaction. Ultimately prefetching should probably just be integrated into this framework, but that requires additional functionality for pipelining computation. There are very ugly ways to hack around this for now if you really want to.
  • Many checks that should exist are missing.
  • I have only verified the correctness of a previous version; there could be correctness bugs in this one.
  • This PR adds a protobuf enum for device types, which should probably be merged with Caffe's own Brew enum somehow.

A simple example is included for CaffeNet data parallelism (not using real data layers). I've found that it achieves nearly a 2x speedup on 2 GPUs, ~2.5x on 3 GPUs, and scales poorly on 4 GPUs. I'm sure that with additional tuning and better execution planning, much closer-to-linear scaling can be achieved. The example is generated using #2086 (but the output is included). Note that actual solving will incur some (small) additional overhead from host-to-device transfer of input data (which should be made asynchronous) and from the solver's parameter-update code.
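
To make the data-parallel pattern concrete, one replica's weights might be wired up roughly as below: a parameter layer (per #2079) owns the master copy of the weights on one device, a split layer copies them to each replica (accumulating the weight diffs on the way back), and each replica's convolution takes its weights from a bottom (per #2166). The Parameter layer syntax and the blob names here are assumptions for illustration, not the generated example.

```
# Rough sketch of weight sharing for data parallelism (not the generated example);
# the Parameter-layer syntax follows #2079/#2166 and is assumed here.
layer {
  name: "conv1_w"
  type: "Parameter"                          # learnable blob exposed as a top (#2079)
  top: "conv1_w"
  parameter_param { shape { dim: 96 dim: 3 dim: 11 dim: 11 } }
  device { type: GPU device_id: 0 }          # master copy of the weights lives here
}
layer {
  name: "split_conv1_w"
  type: "Split"
  bottom: "conv1_w"
  top: "conv1_w_d0"
  top: "conv1_w_d1"
  top_device { type: GPU device_id: 0 }
  top_device { type: GPU device_id: 1 }      # copy to replica 1; its diff is accumulated back
}
layer {
  name: "conv1_d1"
  type: "Convolution"
  bottom: "data_d1"
  bottom: "conv1_w_d1"                       # weights taken from a bottom (#2166)
  top: "conv1_d1"
  convolution_param { num_output: 96 kernel_size: 11 stride: 4 bias_term: false }
  device { type: GPU device_id: 1 }
}
```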

NVTX instrumentation of forward/backward calls is included, and together with nvprof/nvvp it makes performance issues pretty easy to find. It's a build option, WITH_NVTX.

A hacky option, -notime_layers, is added to caffe time to do rough-and-ready parallel timing.

This PR is largely orthogonal to the net-level parallelism of the parallel branch, even though both can be used to implement some of the same things, like pure data parallelism for synchronous SGD (see #2114). Some common functionality has been factored out. I have no plans to support multi-node parallelism or asynchronous SGD using this code; you can combine it with the parallel branch for that.

There are many existing PRs included in this one.

Note that, omitting the included PRs and the (very long) examples, the total diff here is on the order of a few hundred lines.

This is still work-in-progress, but hopefully usable/hackable by eager beavers.

Major TODOs:

  • Provide a sensible way to specify (multiple) devices, and either make CPU/GPU mode a construction-time Net option or coordinate this PR with mode switching in some other way.
  • Provide a mechanism for pipelining and integrate data loading/transformation into this framework.
  • Verify correctness of learning in this framework (any deviation from single-device learning is a bug), and provide tests.
  • Verify that known schemes for model/data parallelism perform as expected, e.g., Krizhevsky's "one weird trick" and the data parallelism employed by VGG and MSRA.

cypof and others added 30 commits on March 27, 2015 at 16:40. Commit messages:

  • This means that Caffe::Get has to be moved to common.cpp, and loses its "inline" (but there are no real performance implications).
  • Instead of just keeping track of input and output blobs, also keep track of layer dependencies. (Also adjust AppendBottom's argument types to avoid passing an input as a pointer.)
  • This simplifies the OS X build, and will allow use of the per-thread default stream for running existing layer code asynchronously. Note that this may cause issues with code that assumes either explicit or device-level synchronization, which we'll fix in the next commit.
  • This ensures that layers are synchronous with respect to each other, even when layer code doesn't use explicit streams.
  • There are no cases where Forward is called without Reshape, so we can simplify the call structure.
  • This will allow us to cleanly kill compute threads that are waiting for work.
  • This gives us a way to specify layer-level execution placement for layerwise parallelism, implemented in future commits.
  • Split layer gains a param, top_device, which allows tops to exist on different (explicitly specified) devices. Params are automatically copied and diffs are automatically accumulated. Because the implementation is now device-agnostic, it's done in (only) the *_cpu functions.
  • This fills in the top_device param of split layer according to the device params of the connecting layers.
  • This is necessary to ensure that buffers are allocated on the correct devices.
  • Compute threads hold (blocking) queues of forward or backward commands, which are synchronized according to the layer graph through Net member variables.
  • This fully exercises the multi-GPU case, and saves time.
  • This is necessary to ensure that operations are performed on the correct device.
@futurely commented May 4, 2015

Why would a PR include and be blocked by so many other PRs?

@dzhwinter commented:

The peer copy fails when the cudaDeviceEnablePeerAccess() function is used in net.cpp line 349. I tested on a multi-GPU setup with 2 cards.

@shelhamer (Member) commented:

Closing as multi-GPU dev took the other turn, although I always liked this branch.

forking paths
