Merge NVIDIA's NCCL multi-GPU, switch it to python #4563

Merged 5 commits into BVLC:master on Jan 17, 2017

Conversation

Contributor

cypof commented Aug 9, 2016

This is an attempt to merge NVIDIA's multi-GPU work on NCCL while fixing the open issues with Python support and the parallel data pipeline. Some of them are non-trivial, particularly around the GIL, so the idea is to switch to a Python implementation using processes instead of threads. NCCL supports this, and transfers are direct GPU-GPU anyway, so performance should be the same. The code in train.py should have the same functionality as the command-line C++ version, but it is much simpler and gives users more flexibility. Users' custom setups will also be easier to share with Caffe2.

To test it, you need to clone NCCL, build and install it (make install), and set the new flag in Makefile.config.

The GIL, and the fact that you need to fork sub-processes before initializing CUDA, made it tricky to get good performance, but it looks OK now. It should be as good as the NV branch with the new thread pool, maybe a bit faster, as the current pipeline does an in-device copy as its last step whereas this one is fully zero-copy using CUDA IPC. I have tried it on a couple of 4- and 8-GPU boxes so far, and everything seems to stay busy.

If NVIDIA is interested in helping test and benchmark: layer-wise reduction, overlapping NCCL communication with compute, often seems slower than a single big allreduce at the end. I'm not sure if I'm doing something wrong, or whether it can be fixed or improved. Also, I don't think layer-wise reduction works with shared weights; we would need to compute the actual dependencies between layers, and that's better left for Caffe2. For now I simply disable layer-wise reduction if the network has shared weights.

If we are happy with this, I suggest we deprecate multi-GPU in the command-line version and remove all the associated code before 1.0. There is a lot of code we could remove around the round-robin IO, the shared-layers lock, root solvers, etc. The complexity has grown a bit out of control, and NV had to add another layer of thread pool in their branch to make it fast, with more Transformer functions, etc. I hope we can avoid that by switching to this.

TODO: break down the PR into simpler ones, and modularize the Python code. For now it exposes a single function, train(), that emulates the command line; the different parts could be made easier to use and customize individually.
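The process-per-GPU design described above can be sketched with the standard library alone. This is an illustrative stand-in, not the real train.py: the worker body is a stub, and the pycaffe/NCCL calls it would make are only noted in comments, not executed.

```python
import multiprocessing as mp

def worker(rank, gpus, uid, queue):
    # In the real train.py, CUDA is initialized only here, *after* the
    # fork; the hypothetical shape of the pycaffe calls would be roughly:
    #   caffe.set_device(gpus[rank])
    #   caffe.set_solver_count(len(gpus))
    #   caffe.set_solver_rank(rank)
    # then the solver is created and wrapped with NCCL using `uid`.
    queue.put((rank, gpus[rank]))  # stub: report which GPU this rank owns

def train(gpus):
    uid = "nccl-uid-placeholder"  # real code obtains this from NCCL before forking
    queue = mp.Queue()
    procs = [mp.Process(target=worker, args=(rank, gpus, uid, queue))
             for rank in range(len(gpus))]
    for p in procs:
        p.start()
    results = sorted(queue.get() for _ in procs)
    for p in procs:
        p.join()
    return results

if __name__ == "__main__":
    print(train([0, 1, 2, 3]))  # [(0, 0), (1, 1), (2, 2), (3, 3)]
```

The key point is that no CUDA state exists in the parent when the children fork, which is what makes the per-process design sidestep both the GIL and the fork-after-CUDA pitfall.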

Contributor

cypof commented Aug 27, 2016

New commit that allows training with any data layer, plus lots of cleanup; most of the old parallel code is removed. I simplified train.py a lot, so it's really easy to customize now, and moved the advanced bits, like the multi-threaded pipeline, to a separate example.

@shelhamer shelhamer commented on an outdated diff Aug 29, 2016

@@ -409,7 +409,7 @@ CXXFLAGS += -MMD -MP
# Complete build flags.
COMMON_FLAGS += $(foreach includedir,$(INCLUDE_DIRS),-I$(includedir))
CXXFLAGS += -pthread -fPIC $(COMMON_FLAGS) $(WARNINGS)
-NVCCFLAGS += -ccbin=$(CXX) -Xcompiler -fPIC $(COMMON_FLAGS)
+NVCCFLAGS += -D_FORCE_INLINES -ccbin=$(CXX) -Xcompiler -fPIC $(COMMON_FLAGS)
@shelhamer

shelhamer Aug 29, 2016

Owner

This is only a workaround for an Ubuntu 16.04 issue that should be fixed upstream by Ubuntu, or at least that's what I remember. At any rate it should not be committed as part of this patch since it is a build detail and not about parallelism and NCCL.

@shelhamer shelhamer commented on an outdated diff Aug 29, 2016

include/caffe/layers/data_layer.hpp
@@ -29,9 +28,13 @@ class DataLayer : public BasePrefetchingDataLayer<Dtype> {
virtual inline int MaxTopBlobs() const { return 2; }
protected:
+ void Next();
+ bool skip();
@shelhamer

shelhamer Aug 29, 2016

Owner

Should have consistent case: next().

@shelhamer shelhamer commented on an outdated diff Aug 29, 2016

src/caffe/layers/base_data_layer.cpp
@@ -88,6 +88,7 @@ void BasePrefetchingDataLayer<Dtype>::InternalThreadEntry() {
#ifndef CPU_ONLY
if (Caffe::mode() == Caffe::GPU) {
batch->data_.data().get()->async_gpu_push(stream);
+ batch->label_.data().get()->async_gpu_push(stream);
@shelhamer

shelhamer Aug 29, 2016

Owner

Should have this->output_labels_ condition.

@jeffdonahue jeffdonahue commented on an outdated diff Aug 29, 2016

include/caffe/data_transformer.hpp
@@ -23,7 +23,7 @@ class DataTransformer {
* @brief Initialize the Random number generations if needed by the
* transformation.
*/
- void InitRand();
+ void InitRand(unsigned int seed);
@jeffdonahue

jeffdonahue Aug 29, 2016

Contributor

Can we avoid most of the InitRand call changes by keeping the argument-less version and invoking InitRand(caffe_rng_rand()) in that case?

@shelhamer

I did a short pass but I need to review this again with more coffee. In the meantime I made a few points that you could address. I can confirm that this builds and passes tests on multi-GPU machines.

include/caffe/parallel.hpp
@@ -14,6 +15,12 @@
#include "caffe/syncedmem.hpp"
#include "caffe/util/blocking_queue.hpp"
+#ifdef USE_NCCL
@shelhamer

shelhamer Oct 2, 2016

Owner

I'm a bit confused by the guards here. With the old parallelism removed, shouldn't all of this be guarded?

@cypof

cypof Oct 3, 2016

Contributor

Yes we could guard the whole file and the corresponding sections in caffe.cpp and python.

include/caffe/parallel.hpp
+ static string new_uid();
+
+ /**
+ * Broadcast weigths from rank 0 other solvers.
@shelhamer

shelhamer Oct 2, 2016

Owner

spellcheck: weights

@@ -125,34 +125,53 @@ void HDF5DataLayer<Dtype>::LayerSetUp(const vector<Blob<Dtype>*>& bottom,
}
template <typename Dtype>
+bool HDF5DataLayer<Dtype>::Skip() {
@shelhamer

shelhamer Oct 2, 2016

Owner

👍 on pulling this into Skip() and Next().

- DISABLE_COPY_AND_ASSIGN(Solver);
-};
+ // Timing information, handy to tune e.g. nbr of GPUs
+ Timer iteration_timer_;
@shelhamer

shelhamer Oct 2, 2016

Owner

Can we take all of these timers back out? I don't think we need to expose the Caffe Timer to Python and the like, since existing profiling tools could be used instead.

@cypof

cypof Oct 3, 2016

Contributor

Timing is pretty handy for adjusting the number of GPUs etc. I'm not sure how to do that conveniently without it. The only way to measure accurately is to insert events in the GPU stream, so people would have to use something like PyCUDA; is that OK?

@shelhamer

shelhamer Nov 23, 2016

Owner

I've made my peace with this timer. It can stay this time.

include/caffe/layers/base_data_layer.hpp
@@ -68,15 +68,16 @@ class BasePrefetchingDataLayer :
const vector<Blob<Dtype>*>& top);
// Prefetches batches (asynchronously if to GPU memory)
- static const int PREFETCH_COUNT = 3;
+ static const int PREFETCH_COUNT = 4; // same as proto
@shelhamer

shelhamer Oct 2, 2016

Owner

If this is the same as the proto def, then drop this. No need to have fragile duplication.

@cypof

cypof Oct 3, 2016

Contributor

I haven't found a way to read the default value from the proto. One way could be to instantiate a data_param, read the value, and destroy it. That seems overkill, so I just copied the constant here.

Makefile
@@ -448,7 +454,18 @@ endif
all: lib tools examples
-lib: $(STATIC_NAME) $(DYNAMIC_NAME)
+ifeq ($(CPU_ONLY), 1)
@shelhamer

shelhamer Oct 2, 2016

Owner

These build checks should be spun out into another PR.

Makefile
@@ -495,7 +512,8 @@ examples: $(EXAMPLE_BINS)
py$(PROJECT): py
-py: $(PY$(PROJECT)_SO) $(PROTO_GEN_PY)
+py: checks \
@shelhamer

shelhamer Oct 2, 2016

Owner

Why does the Python target run the checks too?

@cypof

cypof Oct 3, 2016

Contributor

If people only compile the Python target, the checks would be skipped. Compiling with NCCL but not CUDA is valid, I believe, but doesn't work at runtime.

Makefile.config.example
+# cd nccl
+# make -j
+# USE_NCCL := 1
+# INCLUDE_DIRS += $(HOME)/nccl/src
@shelhamer

shelhamer Oct 2, 2016 edited

Owner

Is there not a standard or suggested NCCL install path?

@cypof

cypof Oct 3, 2016

Contributor

If they install it, it goes to the default /usr/local path, so they don't need that. I could remove this; I'm not sure whether installing would be the typical setup for NCCL.

src/caffe/parallel.cpp
+#endif
+ shared_ptr<Solver<Dtype> > s(SolverRegistry<Dtype>::CreateSolver(param));
+ if (restore_.size()) {
+ // Could not make NCCL broadcast solver state, it seems to crash
@shelhamer

shelhamer Oct 2, 2016

Owner

This is pretty weird. Let's have a closer look in person at some point.

src/caffe/parallel.cpp
+ // Solve
+ s->Step(param.max_iter() - start_iter);
+ barrier_->wait();
+#ifdef DEBUG
@shelhamer

shelhamer Oct 2, 2016

Owner

I'd rather see this kind of check in tests than debugging blocks in the main code.

@@ -3,26 +3,41 @@
#include "caffe/util/math_functions.hpp"
namespace caffe {
+SyncedMemory::SyncedMemory()
@shelhamer

shelhamer Oct 2, 2016 edited

Owner

Is there some kind of corresponding test we can substitute for these debugging blocks and all the invocations of check_device()? I'd rather do without these.

@shelhamer

shelhamer Nov 19, 2016

Owner

I'm still not so enthused for these, but if they have to stay at least add a comment to document the need for check_device() at its definition.

@jeffdonahue

jeffdonahue Jan 3, 2017

Contributor

All the extra debugging code is a bit ugly, but if it makes debugging easier and doesn't affect performance outside of debug mode (and it looks like it shouldn't as everything's in #ifdef DEBUG guards), I'm ok with it. check_device could use a comment though.

@@ -70,7 +70,7 @@ class GradientBasedSolverTest : public MultiDeviceTest<TypeParam> {
string RunLeastSquaresSolver(const Dtype learning_rate,
const Dtype weight_decay, const Dtype momentum, const int num_iters,
const int iter_size = 1, const int devices = 1,
- const bool snapshot = false, const char* from_snapshot = NULL) {
+ const bool snapshot = false, const string from_snapshot = "") {
@shelhamer

shelhamer Oct 2, 2016 edited

Owner

Why char* -> string and the need to now call c_str()?

@@ -565,7 +581,9 @@ class SGDSolverTest : public GradientBasedSolverTest<TypeParam> {
protected:
virtual void InitSolver(const SolverParameter& param) {
- this->solver_.reset(new SGDSolver<Dtype>(param));
+ SolverParameter new_param = param;
+ new_param.set_type("SGD");
@shelhamer

shelhamer Oct 2, 2016

Owner

Setting the type has no effect when instantiating the solver directly from its type as done here with SGDSolver and the others. It's only needed when making use of the solver registry.

@cypof

cypof Oct 3, 2016

Contributor

It used to work because the other workers did not apply gradients. Now each worker does the full iteration, so they need to be of the right type, including the ones created by parallel.cpp based on the param type.

@shelhamer

shelhamer Nov 23, 2016

Owner

This should no longer be needed as of #5009. Please drop these set_type() calls to check.

python/caffe/_caffe.cpp
@@ -51,7 +51,18 @@ const int NPY_DTYPE = NPY_FLOAT32;
void set_mode_cpu() { Caffe::set_mode(Caffe::CPU); }
void set_mode_gpu() { Caffe::set_mode(Caffe::GPU); }
-void set_random_seed(unsigned int seed) { Caffe::set_random_seed(seed); }
+void InitLog(int level) {
@shelhamer

shelhamer Oct 2, 2016

Owner

I'm not sure why this is here, and in particular in this PR.

@cypof

cypof Oct 3, 2016

Contributor

The Python code might want to initialize each process with a different log level, e.g. full logging for the master and warnings only for each worker, so that there are fewer duplicated logs.
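A rough illustration of that scheme, using Python's standard logging module rather than glog; the init_worker_logging helper is hypothetical, not part of the PR:

```python
import logging

def init_worker_logging(rank):
    # Master (rank 0) logs everything; workers log warnings and above
    # only, so multi-process output is not duplicated once per GPU.
    level = logging.INFO if rank == 0 else logging.WARNING
    logger = logging.getLogger("solver-%d" % rank)
    logger.setLevel(level)
    return logger

master = init_worker_logging(0)
worker = init_worker_logging(1)
print(master.isEnabledFor(logging.INFO), worker.isEnabledFor(logging.INFO))
# True False
```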

@shelhamer

shelhamer Nov 19, 2016

Owner

Logging could be its own commit.

@shelhamer

shelhamer Nov 19, 2016

Owner

It'd be nice, but not strictly necessary, to figure out how to do this with boost python call policies instead of hardcoding it. (Like the Net constructor.)

python/caffe/_caffe.cpp
@@ -359,10 +411,18 @@ BOOST_PYTHON_MODULE(_caffe) {
bp::return_internal_reference<>()))
.def("setup", &Layer<Dtype>::LayerSetUp)
.def("reshape", &Layer<Dtype>::Reshape)
- .add_property("type", bp::make_function(&Layer<Dtype>::type));
+ .add_property("type", bp::make_function(&Layer<Dtype>::type))
@shelhamer

shelhamer Oct 2, 2016

Owner

Many of the members exposed in these changes don't seem specific to this PR. Could they be pulled out on their own?

@@ -89,6 +89,12 @@ const Dtype* Blob<Dtype>::cpu_data() const {
template <typename Dtype>
void Blob<Dtype>::set_cpu_data(Dtype* data) {
CHECK(data);
+ // Make sure CPU and GPU sizes remain equal
@shelhamer

shelhamer Oct 2, 2016 edited

Owner

How can the sizes differ? This seems like it should never happen.

@cypof

cypof Oct 3, 2016

Contributor

It's a weird case. As an optimization, Blob doesn't release its buffers when its size is reduced; it only updates its size and uses part of the allocated buffers. Unfortunately, the underlying SyncedMemory has no notion of size vs. capacity: it always copies its full length between CPU and GPU. If a pointer is set directly via set_cpu_data with the current blob size, SyncedMemory might try to copy an older, larger capacity into it.
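A toy model of the hazard described here; the names and types are illustrative and do not match the real Blob/SyncedMemory API:

```python
class ToyBlob:
    def __init__(self, count):
        self.capacity = count          # elements actually allocated
        self.count = count             # logical size
        self.gpu = [0.0] * count       # stand-in for the device buffer

    def reshape(self, count):
        self.count = count
        if count > self.capacity:      # grow only; shrinking keeps the old buffer
            self.capacity = count
            self.gpu = [0.0] * count

    def set_cpu_data(self, data):
        # A sync that copies `capacity` elements into a buffer that only
        # holds `count` would overrun; return True if that hazard exists.
        assert len(data) == self.count
        return self.capacity > len(data)

blob = ToyBlob(10)
blob.reshape(4)                        # shrink: capacity stays at 10
print(blob.set_cpu_data([1.0] * 4))    # True: stale-capacity copy would overrun
```

The fix in the PR keeps the CPU and GPU sizes equal so the full-length copy is always safe.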

src/caffe/layers/data_layer.cpp
@@ -74,6 +74,7 @@ void DataLayer<Dtype>::Next() {
<< "Restarting data prefetching from start.";
cursor_->SeekToFirst();
}
+ offset_++;
@shelhamer

shelhamer Oct 2, 2016

Owner

Shouldn't this reset when seeking back to the beginning or otherwise it will add up endlessly?

@cypof

cypof Oct 3, 2016

Contributor

It's on purpose: the round robin can continue the same way even if the database has to reset. It's easier to think of it in terms of the unrolled dataset.

- std::random_shuffle(data_permutation_.begin(), data_permutation_.end());
+ for (int i = 0; i < batch_size; ++i) {
+ while (Skip()) {
+ Next();
@shelhamer

shelhamer Oct 2, 2016

Owner

At a glance it looks like Next() is still doing the work of loading data. Can Skip() instead advance over the data w/o doing all the work of Next()?

@cypof

cypof Oct 3, 2016

Contributor

I checked the database code; it only moves the cursor, which should be cheap.

Contributor

cypof commented Oct 3, 2016

Thanks for reviewing, I will update this tomorrow.

@cypof since you are getting rid of P2Psync, how would this affect RDMASync used in CaffeOnSpark?

Contributor

cypof commented Oct 27, 2016

@junshi15 GPUParams has not changed; it still gathers all the weights in a single buffer, so the distributed version should not require any changes. You run the distributed allreduce on this buffer at the end of the local one, like before. The code will need to extend NCCL instead of P2PSync, but they are similar.
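A pure-Python sketch of the single-buffer idea: per-layer weights are packed into one contiguous buffer, so a single allreduce (here, an element-wise average across workers) covers the whole model in one pass. The helper names are illustrative, not Caffe's:

```python
def pack(layers):
    # Concatenate per-layer weight lists into one flat buffer, recording
    # (offset, length) so each layer can still view its own slice.
    buf, offsets = [], []
    for w in layers:
        offsets.append((len(buf), len(w)))
        buf.extend(w)
    return buf, offsets

def allreduce_avg(buffers):
    # Element-wise average across workers, like an NCCL sum-allreduce
    # followed by division by the number of ranks.
    n = len(buffers)
    avg = [sum(vals) / n for vals in zip(*buffers)]
    return [list(avg) for _ in range(n)]

w0 = [[1.0, 1.0], [3.0]]       # worker 0 gradients, two layers
w1 = [[3.0, 5.0], [1.0]]       # worker 1 gradients
b0, offsets = pack(w0)
b1, _ = pack(w1)
b0, b1 = allreduce_avg([b0, b1])
print(b0)  # [2.0, 3.0, 2.0] -- averaged in one pass over one buffer
```

A distributed version would simply run another allreduce on the same flat buffer across machines after the local one, as described above.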

@cypof Thanks for the info. How much performance gain do you see with NCCL against original P2PSync?

Contributor

cypof commented Oct 27, 2016

I don't have numbers yet, but it's better. If you plan to switch I would be happy to know what you get.

python/caffe/_caffe.cpp
@@ -51,7 +51,18 @@ const int NPY_DTYPE = NPY_FLOAT32;
void set_mode_cpu() { Caffe::set_mode(Caffe::CPU); }
void set_mode_gpu() { Caffe::set_mode(Caffe::GPU); }
-void set_random_seed(unsigned int seed) { Caffe::set_random_seed(seed); }
@shelhamer

shelhamer Nov 19, 2016 edited

Owner

This shouldn't go missing! Can't drop this from the API.

include/caffe/solver.hpp
@@ -72,7 +72,8 @@ class Solver {
inline const vector<shared_ptr<Net<Dtype> > >& test_nets() {
return test_nets_;
}
- int iter() { return iter_; }
+ int iter() const { return iter_; }
+ void set_iter(int value) { iter_ = value; }
@shelhamer

shelhamer Nov 19, 2016 edited

Owner

Should not be necessary: please double-check and drop. Restore() should handle this.

@@ -105,6 +105,32 @@ class DataLayerTest : public MultiDeviceTest<TypeParam> {
}
}
+ void TestSkip() {
@shelhamer

shelhamer Nov 19, 2016

Owner

ImageDataLayer needs a TestSkip() too.

include/caffe/common.hpp
- inline static void set_root_solver(bool val) { Get().root_solver_ = val; }
+ inline static int solver_rank() { return Get().solver_rank_; }
+ inline static void set_solver_rank(int val) { Get().solver_rank_ = val; }
+ inline static bool multi_process() { return Get().multi_process_; }
@shelhamer

shelhamer Nov 23, 2016

Owner

multi_process -> multiprocess?

Owner

shelhamer commented Nov 23, 2016

@cypof Thanks for the updates. Still need to drop set_iter() and set_type() but this is looking ready otherwise!

Although training an LSTM network with multiple GPUs completes successfully, I failed to resume that training from a snapshot with multiple GPUs. (I'm trying to find the reason, but it looks hard for me.)

Multi-GPU training --> success.
Snapshot by multi-GPU --> resume with multi-GPU --> fail.
Snapshot by multi-GPU --> resume with single GPU --> success.

-----Error message when failed ------
I0101 20:55:55.651378 6305 recurrent_layer.cpp:20] Initializing recurrent layer: assuming input batch contains 10 timesteps of 30 independent streams.
I0101 20:55:56.042740 6305 recurrent_layer.cpp:150] Adding parameter 0: W_xc
I0101 20:55:56.042780 6305 recurrent_layer.cpp:150] Adding parameter 1: b_c
I0101 20:55:56.042794 6305 recurrent_layer.cpp:150] Adding parameter 2: W_hc
I0101 20:55:56.043316 6305 recurrent_layer.cpp:20] Initializing recurrent layer: assuming input batch contains 10 timesteps of 30 independent streams.
I0101 20:55:56.462378 6305 recurrent_layer.cpp:150] Adding parameter 0: W_xc
I0101 20:55:56.462421 6305 recurrent_layer.cpp:150] Adding parameter 1: b_c
I0101 20:55:56.462432 6305 recurrent_layer.cpp:150] Adding parameter 2: W_xc_static
I0101 20:55:56.462441 6305 recurrent_layer.cpp:150] Adding parameter 3: W_hc
F0101 20:55:56.544127 6305 solver.cpp:465] Check failed: Caffe::root_solver()

src/caffe/layers/base_data_layer.cpp
- for (int i = 0; i < PREFETCH_COUNT; ++i) {
- prefetch_free_.push(&prefetch_[i]);
+ prefetch_(param.has_data_param() ?
+ param.data_param().prefetch() : PREFETCH_COUNT),
@jeffdonahue

jeffdonahue Jan 3, 2017

Contributor

Unless I'm misunderstanding something, this should just be prefetch_(param.data_param().prefetch()), no? That should just use the default value if there's no explicit data_param or data_param.prefetch set, removing the need to also duplicate and hardcode PREFETCH_COUNT.

@cypof

cypof Jan 4, 2017

Contributor

You're right thanks, I removed it.

+#ifndef CPU_ONLY
+template <typename Dtype>
+void caffe_gpu_scal(const int N, const Dtype alpha, Dtype* X, cudaStream_t str);
+#endif
@jeffdonahue

jeffdonahue Jan 3, 2017

Contributor

I couldn't find where this function was called -- does it get called somewhere? If not, it should be removed from this PR.

@cypof

cypof Jan 4, 2017

Contributor

It's called in parallel.cpp:232, only if USE_NCCL is set.

Contributor

jeffdonahue commented Jan 3, 2017

Thanks @cypof -- other than the minor comments above, this looks good to me.

Contributor

jeffdonahue commented Jan 3, 2017 edited

@SIshijima that's strange, not sure why that would be. I've never tried multi-GPU training with recurrent layers -- does that work with Caffe master as is (without this PR)? Also, does single GPU training with recurrence still work with this PR?

Thanks @jeffdonahue. I'm working on Caffe master with this PR, and found the problem:
Caffe master without this PR cannot train a recurrent network with multi-GPU either. (With a single GPU, it works well.) #4851

Contributor

jeffdonahue commented Jan 4, 2017 edited

@SIshijima great, thanks for checking that -- issue is unrelated to this PR then. Feel free to post a separate issue about it if you like. (edit: oops, I now see you already pointed to an existing issue about this.)

Owner

shelhamer commented Jan 4, 2017

Thanks @jeffdonahue for the co-review. @cypof this looks ready for merge once you do a pass to groom the history (for instance squashing style fixes like the line edit, and fixes to CMake, etc.).

Since #5154 is an existing issue, it can be resolved by another PR.

flx42 referenced this pull request Jan 6, 2017

Merged

Docker refresh #5153

Contributor

cypof commented Jan 6, 2017

OK, done; the commits look better now.

@shelhamer shelhamer merged commit 536851c into BVLC:master Jan 17, 2017

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed
Owner

shelhamer commented Jan 17, 2017

@cypof thanks for the migration to NCCL! I think a follow-up PR that updates the interface docs and adds a notebook for Python parallelism would be helpful.

Contributor

willyd commented Jan 17, 2017

Multi-GPU training from Python is great; thanks for this work! However, this breaks multi-GPU training on Windows, as NCCL does not build natively there.

In case someone is interested in porting this change to Windows, I have started a minimal CMake-based NCCL build for Windows at https://github.com/willyd/nccl/tree/windows. It can build a static library with VS 2015, but I haven't ported any of the tests yet.

Credits to @tbennun for the initial port.

awan-10 commented Jan 19, 2017

After this integration of NCCL, will the old multi-GPU code still work? Or is the NCCL dependency a must for multi-GPU training from now on?

Contributor

cypof commented Jan 19, 2017

@shelhamer, I checked master after this PR was merged.
Multi-GPU training for LSTMs completes successfully, but resuming that training still fails:
-- #5154 still occurs.
I used COCO captioning to check.

@shelhamer, @SIshijima, I came across the same problem when I tried to resume training a normal CNN with multiple GPUs. I think this line in solver.cpp should be removed, because parameter restoration is now applied on every GPU.

Contributor

cypof commented Jan 23, 2017

You're right, the check is no longer valid; removing it.

oyxhust commented Jan 25, 2017

So I can train a network with a Python layer on multiple GPUs now?

Hi @cypof, @shelhamer,
Is this true that now DataLayer cannot be shared by multi-GPUs in C++ interface? When you skip a data item in database, you advance the cursor into a new position, and we will never visit some items. Is it true? How offset_ variable shared between multiple GPUs? Hope to have you guys explanations. I am confused about this new DataLayer and the fact of deleting DataReader function.

This part is also confusing to me. Currently each GPU has its own process, and thus its own instance of DataLayer. During training the layer parameters are synchronized with NCCL, but the DataLayer's DB cursor apparently is not, leading to unsynchronized data-fetching orders: each GPU fetches data in its own order, and they do not share a cursor.

In theory this might lead to slower convergence (in the worst case, exactly the same data points on each GPU). Yet with random shuffling/skipping this should rarely happen, and the effect should wash out on large-scale datasets. I am curious how this unsynchronized fetching order affects accuracy and convergence on datasets of different sizes.

Contributor

cypof commented Feb 5, 2017 edited

It's not mentioned in the PR description, but the current system is exactly equivalent to the old one. Before, there was a single instance of the DB and some code to round-robin items between GPUs, so that training would still be deterministic. Having a single DB cursor seemed to help with the mechanical disks we used when it was written a couple of years ago, but it doesn't matter that much, and people use SSDs on large GPU boxes now anyway, so the new implementation is simpler. Each network opens a separate cursor and skips items so that it only keeps the ones it would have been given by the round-robin code. E.g. with 4 solvers, GPU 0 keeps items 0, 4, 8, ...; GPU 1 keeps 1, 5, 9; etc. This also allows training from Python, where each solver needs its own process.
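The rule above can be written down in a few lines of plain Python (illustrative names, not the actual DataLayer code):

```python
def kept_items(rank, solver_count, num_fetches):
    # Rank r keeps exactly the offsets with offset % solver_count == r.
    return [o for o in range(num_fetches) if o % solver_count == rank]

print(kept_items(0, 4, 12))  # GPU 0 keeps [0, 4, 8]
print(kept_items(1, 4, 12))  # GPU 1 keeps [1, 5, 9]

# offset_ is never reset, so when the DB wraps, the item actually read is
# offset % db_size: the round robin continues on the "unrolled" dataset.
db_size = 10
print([o % db_size for o in kept_items(2, 4, 14)])  # [2, 6, 0]
```

Since each rank's kept offsets are disjoint and together cover every offset, the union of what the solvers see is the same stream the old single-cursor round robin produced.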

happyharrycn commented Feb 5, 2017 edited

Thanks @cypof for the explanation! Now I got it.

I wonder how we should capture this information. It is important if someone wants to write a new data layer for the multi-GPU setting in C++/Python: they should implement a proper round-robin schedule by checking solver_rank.

I thought that with this PR it would be possible to run Caffe on multiple GPUs with a custom Python layer, but doing so I got:

PythonLayer does not support CLI Multi-GPU, use train.py

I am not sure if I understood correctly: can I use my custom Python layer if I use train.py?

Thanks

oyxhust commented Feb 20, 2017

I have the same question as @iacopomasi. I used my custom Python layer and got the same message. So I have to use train.py? @shelhamer

Owner

shelhamer commented Feb 20, 2017 edited

@iacopomasi @oyxhust Right, train.py is the demonstration of the new pycaffe multi-GPU interface, which includes training nets with Python layers (as each net is now in its own process, sidestepping the earlier parallelization issues). For further usage questions please ask on the mailing list.

@cypof it might be helpful to include further documentation in the form of an ipython notebook?

weiliu89 commented Mar 9, 2017

@cypof Thanks for including NCCL in Caffe, which speeds up multi-GPU training quite a bit. However, I am facing two issues:

  1. As you mentioned, weight sharing is not supported for layer-wise allreduce. Do you have any plan to support it, or any hints on how to add it? When a net shares weights, this can be worked around by setting layer_wise_reduce to false, but training then becomes much slower (2x). Supporting shared weights would be very useful (for example, recurrent networks need weight sharing).

  2. When I cancel a job trained with multiple GPUs, it seems to hang in ncclCommDestroy. Have you encountered this issue?

Thanks!

Contributor

cypof commented Mar 10, 2017

@shelhamer @weiliu89 Sorry I missed the previous message. Yes, a notebook would be great; I will try to do that. About shared weights: it's a matter of ordering the graph so weights are synced only once all their layers are done. It might not be too difficult, but we need to look at the weight-sharing code, how to list the layers for a given blob, etc. I haven't seen the hang on shutdown; there is not much we can do if it hangs inside NCCL, so it might be useful to open an issue on the NCCL repo.

