
Merge NVIDIA's NCCL multi-GPU, switch it to python #4563

Merged: 5 commits, Jan 17, 2017

Conversation

cypof
Member

@cypof cypof commented Aug 9, 2016

This is an attempt to merge NVIDIA's multi-GPU work on NCCL while fixing the open issues with Python support and the parallel data pipeline. Some of them are non-trivial, particularly around the GIL, so the idea is to switch to a Python implementation that uses processes instead. NCCL supports it, and transfers are direct GPU-to-GPU anyway, so performance should be the same. The code in train.py should have the same functionality as the command-line and C++ version, but it is much simpler and gives users more flexibility. Custom setups from users will also be easier to share with Caffe2.

To test it, you need to clone NCCL, build and install it (make install), and set the new flag in Makefile.config.

The GIL, and the fact that sub-processes must be forked before CUDA is initialized, made it tricky to get good performance, but it looks OK now. It should be as good as the NV branch with the new thread pool, and maybe a bit faster, since the current pipeline does an in-device copy as the last step whereas this one is fully zero-copy using CUDA IPC. I have tried it on a couple of 4- and 8-GPU boxes so far and everything seems to stay busy.

If NVIDIA is interested in helping with testing and benchmarking: it seems that layer-wise reduction, which overlaps NCCL communication with compute, is often slower than a single big allreduce at the end. I'm not sure whether I'm doing something wrong, or whether it can be fixed or improved. Also, I don't think layer-wise reduction works with shared weights; we would need to compute actual dependencies between layers, and that is better left to Caffe2. For now I simply disable layer-wise reduction if the network has shared weights.

If we are happy with this, I suggest we deprecate multi-GPU in the command-line version and remove all associated code before 1.0. There is a lot of code we could remove around the round-robin I/O, the shared-layers lock, root solvers, etc. The complexity has grown a bit out of control, and NV had to add another layer of thread pool in their branch to make it fast, with more Transformer functions, etc. I hope we can avoid that by switching to this.

TODO: break the PR down into simpler ones and modularize the Python code. For now it exposes a single function, train(), that emulates the command line. The different parts could be made easier to use and customize individually.
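
A rough sketch of the process-per-GPU scheme described above, loosely following train.py. It relies on the pycaffe calls mentioned in this thread (caffe.set_solver_count, caffe.set_solver_rank, caffe.set_multiprocess, caffe.NCCL with new_uid()); the exact NCCL constructor, bcast(), and add_callback() wiring are assumptions based on this discussion, not a committed API.

```python
# Sketch only: one process per GPU, NCCL allreduce on the gradients.
from multiprocessing import Process

import caffe


def solve(proto, gpus, uid, rank):
    # Each solver runs in its own process, so CUDA is initialized after the
    # fork and the GIL is never shared between solvers.
    caffe.set_mode_gpu()
    caffe.set_device(gpus[rank])
    caffe.set_solver_count(len(gpus))
    caffe.set_solver_rank(rank)
    caffe.set_multiprocess(True)

    solver = caffe.SGDSolver(proto)
    nccl = caffe.NCCL(solver, uid)   # per-solver NCCL communicator
    nccl.bcast()                     # broadcast weights from rank 0
    solver.add_callback(nccl)        # allreduce gradients every iteration
    solver.step(10000)


if __name__ == '__main__':
    gpus = [0, 1]
    uid = caffe.NCCL.new_uid()       # shared id so the ranks can rendezvous
    procs = [Process(target=solve, args=('solver.prototxt', gpus, uid, r))
             for r in range(len(gpus))]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```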

@cypof
Member Author

cypof commented Aug 27, 2016

New commit that allows training with any data layer. Lots of cleanup, and most of the old parallel code removed. I simplified train.py a lot, so it is really easy to customize now, and moved the advanced bits, like the multi-threaded pipeline, to a separate example.

@@ -409,7 +409,7 @@ CXXFLAGS += -MMD -MP
# Complete build flags.
COMMON_FLAGS += $(foreach includedir,$(INCLUDE_DIRS),-I$(includedir))
CXXFLAGS += -pthread -fPIC $(COMMON_FLAGS) $(WARNINGS)
NVCCFLAGS += -ccbin=$(CXX) -Xcompiler -fPIC $(COMMON_FLAGS)
NVCCFLAGS += -D_FORCE_INLINES -ccbin=$(CXX) -Xcompiler -fPIC $(COMMON_FLAGS)
Member

This is only a workaround for an Ubuntu 16.04 issue that should be fixed upstream by Ubuntu, or at least that's what I remember. In any case it should not be committed as part of this patch, since it is a build detail and not about parallelism or NCCL.

Member

@shelhamer shelhamer left a comment

I did a short pass but I need to review this again with more coffee. In the meantime I made a few points that you could address. I can confirm that this builds and passes tests on multi-GPU machines.

static string new_uid();

/**
* Broadcast weigths from rank 0 other solvers.
Member

spellcheck: weights

@@ -125,34 +125,53 @@ void HDF5DataLayer<Dtype>::LayerSetUp(const vector<Blob<Dtype>*>& bottom,
}

template <typename Dtype>
bool HDF5DataLayer<Dtype>::Skip() {
Member

👍 on pulling this into Skip() and Next().

DISABLE_COPY_AND_ASSIGN(Solver);
};
// Timing information, handy to tune e.g. nbr of GPUs
Timer iteration_timer_;
Member

Can we take all of these timers back out? I don't think we need to expose the Caffe Timer to Python and the like, since existing profiling tools could be used instead.

Member Author

Timing is pretty handy for adjusting the number of GPUs, etc. I'm not sure how to do that conveniently without it. The only way to measure accurately is to insert events in the GPU stream, so people would have to use something like PyCUDA; is that OK?
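
For reference, a PyCUDA-based alternative could look roughly like the sketch below. It is untested and comes with a caveat: pycuda.autoinit creates its own CUDA context, so unless the events are recorded in the stream Caffe actually uses, this only gives a coarse per-step measurement.

```python
# Rough sketch: time solver.step() with CUDA events via PyCUDA.
import pycuda.autoinit  # noqa: F401  -- creates and activates a CUDA context
import pycuda.driver as cuda


def time_step(solver, iters=1):
    """Return the elapsed time of solver.step(iters) in milliseconds."""
    start, stop = cuda.Event(), cuda.Event()
    start.record()
    solver.step(iters)
    stop.record()
    stop.synchronize()
    return start.time_till(stop)
```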

Member

I've made my peace with this timer. It can stay this time.

@@ -68,15 +68,16 @@ class BasePrefetchingDataLayer :
const vector<Blob<Dtype>*>& top);

// Prefetches batches (asynchronously if to GPU memory)
static const int PREFETCH_COUNT = 3;
static const int PREFETCH_COUNT = 4; // same as proto
Member

If this is the same as the proto def, then drop this. No need to have fragile duplication.

Member Author

I haven't found a way to read the default value from the proto. One option would be to instantiate a data_param, read the value, and destroy it. That seems like overkill, so I just copied the constant here.
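
For what it's worth, the protobuf-generated classes do expose the declared default without keeping an instance around: C++ has DataParameter::default_instance(), and on the Python side something like the sketch below should work (assuming the new field is DataParameter.prefetch).

```python
# Sketch: read the declared default straight from the generated proto class.
from caffe.proto import caffe_pb2

# An unset field reports its declared default, so no instance state is needed.
default_prefetch = caffe_pb2.DataParameter().prefetch
print(default_prefetch)  # expected to print the proto default, e.g. 4
```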

@@ -448,7 +454,18 @@ endif

all: lib tools examples

lib: $(STATIC_NAME) $(DYNAMIC_NAME)
ifeq ($(CPU_ONLY), 1)
Member

These build checks should be spun out into another PR.

@@ -70,7 +70,7 @@ class GradientBasedSolverTest : public MultiDeviceTest<TypeParam> {
string RunLeastSquaresSolver(const Dtype learning_rate,
const Dtype weight_decay, const Dtype momentum, const int num_iters,
const int iter_size = 1, const int devices = 1,
const bool snapshot = false, const char* from_snapshot = NULL) {
const bool snapshot = false, const string from_snapshot = "") {
Member

@shelhamer shelhamer Oct 2, 2016

Why char* -> string and the need to now call c_str()?

@@ -565,7 +581,9 @@ class SGDSolverTest : public GradientBasedSolverTest<TypeParam> {

protected:
virtual void InitSolver(const SolverParameter& param) {
this->solver_.reset(new SGDSolver<Dtype>(param));
SolverParameter new_param = param;
new_param.set_type("SGD");
Member

Setting the type has no effect when instantiating the solver directly from its type as done here with SGDSolver and the others. It's only needed when making use of the solver registry.

Member Author

It used to work because the other workers did not apply gradients. Now each worker runs the full iteration, so the solvers need to be of the right type, including the ones created by parallel.cpp based on the param type.

Member

This should no longer be needed as of #5009. Please drop these set_type() calls to check.

@@ -51,7 +51,18 @@ const int NPY_DTYPE = NPY_FLOAT32;
void set_mode_cpu() { Caffe::set_mode(Caffe::CPU); }
void set_mode_gpu() { Caffe::set_mode(Caffe::GPU); }

void set_random_seed(unsigned int seed) { Caffe::set_random_seed(seed); }
void InitLog(int level) {
Member

I'm not sure why this is here, and in particular in this PR.

Member Author

The Python code might want to initialize each process with a different log level, e.g. full logging for the master and warnings only for each worker, so that there is less log duplication.
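
For example, something like the sketch below, assuming the new binding is exposed to Python as caffe.init_log and takes a glog severity level:

```python
# Sketch: full logging on the master process, warnings only on the workers.
import caffe


def init_process_logging(rank):
    # glog severities: 0 = INFO, 1 = WARNING, 2 = ERROR, 3 = FATAL
    caffe.init_log(0 if rank == 0 else 1)
```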

Member

Logging could be its own commit.

Member

It'd be nice, but not strictly necessary, to figure out how to do this with boost python call policies instead of hardcoding it. (Like the Net constructor.)

@@ -359,10 +411,18 @@ BOOST_PYTHON_MODULE(_caffe) {
bp::return_internal_reference<>()))
.def("setup", &Layer<Dtype>::LayerSetUp)
.def("reshape", &Layer<Dtype>::Reshape)
.add_property("type", bp::make_function(&Layer<Dtype>::type));
.add_property("type", bp::make_function(&Layer<Dtype>::type))
Member

Many of the members exposed in these changes don't seem specific to this PR. Could they be pulled out on their own?

@@ -89,6 +89,12 @@ const Dtype* Blob<Dtype>::cpu_data() const {
template <typename Dtype>
void Blob<Dtype>::set_cpu_data(Dtype* data) {
CHECK(data);
// Make sure CPU and GPU sizes remain equal
Member

@shelhamer shelhamer Oct 2, 2016

How can the sizes differ? This seems like it should never happen.

Member Author

It's a weird case. As an optimization, Blob doesn't release its buffers when its size is reduced; it only updates its size and uses part of the allocated buffers. Unfortunately, the underlying SyncedMemory has no notion of size vs. capacity and always copies its full length between CPU and GPU. If a pointer is set directly via set_cpu_data with the current blob size, SyncedMemory might try to copy an old, larger capacity into it.

Member

Right.

@@ -74,6 +74,7 @@ void DataLayer<Dtype>::Next() {
<< "Restarting data prefetching from start.";
cursor_->SeekToFirst();
}
offset_++;
Member

Shouldn't this reset when seeking back to the beginning? Otherwise it will add up endlessly.

Member Author

It's on purpose: the round robin can continue the same way even if the database has to reset. It's easier to reason about it on the unrolled dataset.
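
A small illustration of that unrolled view (just the arithmetic, not the committed C++): the offset keeps growing across database wrap-arounds, so the rank-based partition stays disjoint and balanced.

```python
# Rank-based round robin on the unrolled stream of records.
def skip(offset, solver_rank, solver_count):
    """True if this record belongs to another solver."""
    return offset % solver_count != solver_rank

db_size, solver_count = 5, 2
for rank in range(solver_count):
    consumed = [offset % db_size             # record index after cursor wraps
                for offset in range(12)      # 12 reads over the database
                if not skip(offset, rank, solver_count)]
    print(rank, consumed)  # 0: [0, 2, 4, 1, 3, 0]   1: [1, 3, 0, 2, 4, 1]
```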

std::random_shuffle(data_permutation_.begin(), data_permutation_.end());
for (int i = 0; i < batch_size; ++i) {
while (Skip()) {
Next();
Member

At a glance it looks like Next() is still doing the work of loading data. Can Skip() instead advance over the data w/o doing all the work of Next()?

Member Author

I checked the database code; it only moves the cursor, which should be cheap.

@cypof
Member Author

cypof commented Oct 3, 2016

Thanks for reviewing, I will update this tomorrow.

@junshi15

@cypof since you are getting rid of P2Psync, how would this affect RDMASync used in CaffeOnSpark?

@cypof
Member Author

cypof commented Oct 27, 2016

@junshi15 GPUParams has not changed; it still gathers all the weights in a single buffer, so the distributed version should not require any changes. You run the distributed allreduce on this buffer at the end of the local one, as before. The code will need to extend NCCL instead of P2PSync, but they are similar.

@junshi15

@cypof Thanks for the info. How much performance gain do you see with NCCL compared to the original P2PSync?

@cypof
Member Author

cypof commented Oct 27, 2016

I don't have numbers yet, but it's better. If you plan to switch I would be happy to know what you get.

@@ -51,7 +51,18 @@ const int NPY_DTYPE = NPY_FLOAT32;
void set_mode_cpu() { Caffe::set_mode(Caffe::CPU); }
void set_mode_gpu() { Caffe::set_mode(Caffe::GPU); }

void set_random_seed(unsigned int seed) { Caffe::set_random_seed(seed); }
Member

@shelhamer shelhamer Nov 19, 2016

This shouldn't go missing! Can't drop this from the API.

@@ -72,7 +72,8 @@ class Solver {
inline const vector<shared_ptr<Net<Dtype> > >& test_nets() {
return test_nets_;
}
int iter() { return iter_; }
int iter() const { return iter_; }
void set_iter(int value) { iter_ = value; }
Member

@shelhamer shelhamer Nov 19, 2016

Should not be necessary: please double-check and drop. Restore() should handle this.

@@ -105,6 +105,32 @@ class DataLayerTest : public MultiDeviceTest<TypeParam> {
}
}

void TestSkip() {
Member

ImageDataLayer needs a TestSkip() too.

inline static void set_root_solver(bool val) { Get().root_solver_ = val; }
inline static int solver_rank() { return Get().solver_rank_; }
inline static void set_solver_rank(int val) { Get().solver_rank_ = val; }
inline static bool multi_process() { return Get().multi_process_; }
Member

multi_process -> multiprocess?

@shelhamer
Member

@cypof Thanks for the updates. Still need to drop set_iter() and set_type() but this is looking ready otherwise!

@oyxhust

oyxhust commented Feb 20, 2017

I have the same question as @iacopomasi. When I use my custom Python layer I get this issue, so do I have to use train.py? @shelhamer

@shelhamer
Member

shelhamer commented Feb 20, 2017

@iacopomasi @oyxhust Right, train.py is the demonstration of the new pycaffe multi-GPU interface which includes training nets with Python layers (as each net is now in its own process, sidestepping earlier parallelization issues). For further usage questions please ask on the mailing list.

@cypof it might be helpful to include further documentation in the form of an ipython notebook?

@weiliu89

weiliu89 commented Mar 9, 2017

@cypof Thanks for including NCCL in Caffe, which speeds up multi-GPU training quite a bit. However, I am facing two issues:

  1. As you mentioned, shared weights are not supported by the layer-wise allreduce. Do you have any plan to support this, or can you give any hints on how to add it? When a net has shared weights it can be handled by setting layer_wise_reduce to false, but then training becomes much slower (about 2x). I think supporting shared weights would be very useful (recurrent networks, for example, need weight sharing).

  2. When I cancel a job trained with multi-GPU, it seems to hang at ncclCommDestroy. Have you encountered this issue?

Thanks!

@cypof
Member Author

cypof commented Mar 10, 2017

@shelhamer @weiliu89 Sorry, I missed the previous message. Yes, a notebook would be great; I will try to do that. About shared weights: it's a matter of ordering the graph so that weights are synced only once all the layers that use them are done. It might not be too difficult, but we need to look at the weight-sharing code, how to list the layers for a given blob, etc. I haven't seen the hang on shutdown; there is not much we can do if it hangs inside ncclCommDestroy, so it might be useful to open an issue on the NCCL repo.

CUDA_CHECK(cudaGetDevice(&device));
CHECK_EQ(device, device_);
#endif
param.set_type(rank0_->type());

Is this necessary? The type info has been copied from rank0_.

caffe.set_device(gpus[rank])
caffe.set_solver_count(len(gpus))
caffe.set_solver_rank(rank)
caffe.set_multiprocess(True)

Is this necessary? This has already been done in Caffe::set_multiprocess(true).

wqvbjhc added a commit to wqvbjhc/caffe-ssd that referenced this pull request Jul 16, 2019