Merge NVIDIA's NCCL multi-GPU, switch it to python #4563
Conversation
shelhamer
added parallelism multi-GPU Python
labels
Aug 9, 2016
cypof
referenced
this pull request
Aug 15, 2016
Closed
Non-deterministic data reading of image_data_layer in parallel training #4590
New commit that allows training with any data layer. Lots of cleaning, and removed most of the old parallel code. I simplified train.py a lot, it's really easy to customize now, and moved the advanced bits like the multi-threaded pipeline to a separate example.
shelhamer
commented on an outdated diff
Aug 29, 2016
@@ -409,7 +409,7 @@ CXXFLAGS += -MMD -MP
# Complete build flags.
COMMON_FLAGS += $(foreach includedir,$(INCLUDE_DIRS),-I$(includedir))
CXXFLAGS += -pthread -fPIC $(COMMON_FLAGS) $(WARNINGS)
-NVCCFLAGS += -ccbin=$(CXX) -Xcompiler -fPIC $(COMMON_FLAGS)
+NVCCFLAGS += -D_FORCE_INLINES -ccbin=$(CXX) -Xcompiler -fPIC $(COMMON_FLAGS)
shelhamer
Owner
shelhamer
commented on an outdated diff
Aug 29, 2016
shelhamer
commented on an outdated diff
Aug 29, 2016
jeffdonahue
commented on an outdated diff
Aug 29, 2016
@@ -23,7 +23,7 @@ class DataTransformer {
 * @brief Initialize the Random number generations if needed by the
 * transformation.
 */
-  void InitRand();
+  void InitRand(unsigned int seed);
jeffdonahue
Contributor
willyd
referenced
this pull request
Sep 12, 2016
Closed
enable multi-gpu training with python interface ? #4119
shelhamer
reviewed
Oct 2, 2016
I did a short pass but I need to review this again with more coffee. In the meantime I made a few points that you could address. I can confirm that this builds and passes tests on multi-GPU machines.
@@ -14,6 +15,12 @@
#include "caffe/syncedmem.hpp"
#include "caffe/util/blocking_queue.hpp"
+#ifdef USE_NCCL
shelhamer
Oct 2, 2016
Owner
I'm a bit confused by the guards here. With the old parallelism removed, shouldn't all of this be guarded?
cypof
Oct 3, 2016
Contributor
Yes we could guard the whole file and the corresponding sections in caffe.cpp and python.
+  static string new_uid();
+
+  /**
+   * Broadcast weights from rank 0 to the other solvers.
@@ -125,34 +125,53 @@ void HDF5DataLayer<Dtype>::LayerSetUp(const vector<Blob<Dtype>*>& bottom,
}
template <typename Dtype>
+bool HDF5DataLayer<Dtype>::Skip() {
-  DISABLE_COPY_AND_ASSIGN(Solver);
-};
+  // Timing information, handy to tune e.g. the number of GPUs
+  Timer iteration_timer_;
shelhamer
Oct 2, 2016
Owner
Can we take all of these timers back out? I don't think we need to expose the Caffe Timer to Python and the like, since existing profiling tools could be used instead.
cypof
Oct 3, 2016
Contributor
Timing is pretty handy for adjusting the number of GPUs etc. I'm not sure how to do that conveniently without it. The only way to measure accurately is to insert events in the GPU stream, so people would have to use something like pycuda; is that OK?
@@ -68,15 +68,16 @@ class BasePrefetchingDataLayer :
      const vector<Blob<Dtype>*>& top);
  // Prefetches batches (asynchronously if to GPU memory)
-  static const int PREFETCH_COUNT = 3;
+  static const int PREFETCH_COUNT = 4;  // same as proto
shelhamer
Oct 2, 2016
Owner
If this is the same as the proto def, then drop this. No need to have fragile duplication.
cypof
Oct 3, 2016
Contributor
I haven't found a way to read the default value from the proto. One way could be to instantiate a data_param, read the value, and destroy it. That seems like overkill, so I just copied the constant here.
@@ -448,7 +454,18 @@ endif
all: lib tools examples
-lib: $(STATIC_NAME) $(DYNAMIC_NAME)
+ifeq ($(CPU_ONLY), 1)
@@ -495,7 +512,8 @@ examples: $(EXAMPLE_BINS)
py$(PROJECT): py
-py: $(PY$(PROJECT)_SO) $(PROTO_GEN_PY)
+py: checks \
cypof
Oct 3, 2016
Contributor
If people only compile the python target, the checks are skipped. Compiling with NCCL but not CUDA is valid, I believe, but doesn't work at runtime.
+# cd nccl
+# make -j
+# USE_NCCL := 1
+# INCLUDE_DIRS += $(HOME)/nccl/src
cypof
Oct 3, 2016
Contributor
If they install it, it's in the default /usr/local path, so they don't need this. I could remove it; I'm not sure installing would be the typical setup for NCCL.
+#endif
+  shared_ptr<Solver<Dtype> > s(SolverRegistry<Dtype>::CreateSolver(param));
+  if (restore_.size()) {
+    // Could not make NCCL broadcast solver state, it seems to crash
+  // Solve
+  s->Step(param.max_iter() - start_iter);
+  barrier_->wait();
+#ifdef DEBUG
shelhamer
Oct 2, 2016
Owner
I'd rather see this kind of check in tests than debugging blocks in the main code.
@@ -3,26 +3,41 @@
#include "caffe/util/math_functions.hpp"
namespace caffe {
+SyncedMemory::SyncedMemory()
shelhamer
Oct 2, 2016
Owner
Is there some kind of corresponding test we can substitute for these debugging blocks and all the invocations of check_device()? I'd rather do without these.
shelhamer
Nov 19, 2016
Owner
I'm still not so enthused for these, but if they have to stay at least add a comment to document the need for check_device() at its definition.
jeffdonahue
Jan 3, 2017
Contributor
All the extra debugging code is a bit ugly, but if it makes debugging easier and doesn't affect performance outside of debug mode (and it looks like it shouldn't as everything's in #ifdef DEBUG guards), I'm ok with it. check_device could use a comment though.
@@ -70,7 +70,7 @@ class GradientBasedSolverTest : public MultiDeviceTest<TypeParam> {
  string RunLeastSquaresSolver(const Dtype learning_rate,
      const Dtype weight_decay, const Dtype momentum, const int num_iters,
      const int iter_size = 1, const int devices = 1,
-      const bool snapshot = false, const char* from_snapshot = NULL) {
+      const bool snapshot = false, const string from_snapshot = "") {
@@ -565,7 +581,9 @@ class SGDSolverTest : public GradientBasedSolverTest<TypeParam> {
 protected:
  virtual void InitSolver(const SolverParameter& param) {
-    this->solver_.reset(new SGDSolver<Dtype>(param));
+    SolverParameter new_param = param;
+    new_param.set_type("SGD");
shelhamer
Oct 2, 2016
Owner
Setting the type has no effect when instantiating the solver directly from its type as done here with SGDSolver and the others. It's only needed when making use of the solver registry.
cypof
Oct 3, 2016
Contributor
It used to work as the other workers did not apply gradients. Now each worker does the full iteration, so they need to be of the right type, including the ones created by parallel.cpp, based on param type.
shelhamer
Nov 23, 2016
Owner
This should no longer be needed as of #5009. Please drop these set_type() calls to check.
@@ -51,7 +51,18 @@ const int NPY_DTYPE = NPY_FLOAT32;
void set_mode_cpu() { Caffe::set_mode(Caffe::CPU); }
void set_mode_gpu() { Caffe::set_mode(Caffe::GPU); }
-void set_random_seed(unsigned int seed) { Caffe::set_random_seed(seed); }
+void InitLog(int level) {
cypof
Oct 3, 2016
Contributor
The python code might want to initialize each process with a different log level, e.g. full logging for the master and warnings only for each worker, so that there are fewer duplicated logs.
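For illustration, a minimal sketch of the per-process setup cypof describes: the master gets full INFO logging while workers only log warnings. The caffe.init_log binding and its level argument are taken from this PR's diff; treat the exact name and signature as assumptions.

```python
def log_level_for(rank):
    """Pick a glog severity per solver process: 0=INFO, 1=WARNING."""
    return 0 if rank == 0 else 1

def init_solver_process(rank):
    level = log_level_for(rank)
    # Hypothetical binding added by this PR (name and signature assumed):
    # import caffe
    # caffe.init_log(level)
    return level

# Master is verbose; the three workers only log warnings and above.
print([init_solver_process(r) for r in range(4)])  # → [0, 1, 1, 1]
```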
shelhamer
Nov 19, 2016
Owner
It'd be nice, but not strictly necessary, to figure out how to do this with boost python call policies instead of hardcoding it. (Like the Net constructor.)
@@ -359,10 +411,18 @@ BOOST_PYTHON_MODULE(_caffe) {
      bp::return_internal_reference<>()))
  .def("setup", &Layer<Dtype>::LayerSetUp)
  .def("reshape", &Layer<Dtype>::Reshape)
-  .add_property("type", bp::make_function(&Layer<Dtype>::type));
+  .add_property("type", bp::make_function(&Layer<Dtype>::type))
shelhamer
Oct 2, 2016
Owner
Many of the members exposed in these changes don't seem specific to this PR. Could they be pulled out on their own?
@@ -89,6 +89,12 @@ const Dtype* Blob<Dtype>::cpu_data() const {
template <typename Dtype>
void Blob<Dtype>::set_cpu_data(Dtype* data) {
  CHECK(data);
+  // Make sure CPU and GPU sizes remain equal
cypof
Oct 3, 2016
Contributor
It's a weird case. As an optimization, Blob doesn't release its buffers if the size is reduced; it only updates its size and uses part of the allocated buffers. Unfortunately the underlying syncedmem doesn't have a notion of size vs. capacity — it always copies its full length between CPU and GPU. If a pointer is set directly from set_cpu_data with the current blob size, syncedmem might try to copy an old, larger capacity into it.
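The pitfall can be modeled outside Caffe. In this toy Python sketch (not Caffe code; the class and names are invented for illustration), a buffer that copies its full allocated capacity, rather than the current logical size, would overrun a caller-supplied array sized to the shrunken blob:

```python
import numpy as np

class ToySyncedMem:
    """Models syncedmem: knows only its allocated length, not the blob's size."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.device = np.arange(capacity, dtype=np.float32)

    def copy_to_cpu(self, host):
        # Always copies the full capacity, like syncedmem's full-length copy.
        if host.size < self.capacity:
            raise ValueError("host buffer smaller than allocated capacity")
        host[:self.capacity] = self.device

mem = ToySyncedMem(capacity=8)               # blob once held 8 elements
small_host = np.zeros(4, dtype=np.float32)   # blob later shrunk to 4
try:
    mem.copy_to_cpu(small_host)              # would overrun without the check
except ValueError as e:
    print("unsafe:", e)
```

This is why the PR's CHECK in set_cpu_data keeps CPU and GPU sizes equal before handing syncedmem a raw pointer.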
@@ -74,6 +74,7 @@ void DataLayer<Dtype>::Next() {
        << "Restarting data prefetching from start.";
    cursor_->SeekToFirst();
  }
+  offset_++;
shelhamer
Oct 2, 2016
Owner
Shouldn't this reset when seeking back to the beginning or otherwise it will add up endlessly?
cypof
Oct 3, 2016
Contributor
It's on purpose: the round robin can continue the same way even if the database has to reset. It's easier to think of it on the unrolled dataset.
-  std::random_shuffle(data_permutation_.begin(), data_permutation_.end());
+  for (int i = 0; i < batch_size; ++i) {
+    while (Skip()) {
+      Next();
shelhamer
Oct 2, 2016
Owner
At a glance it looks like Next() is still doing the work of loading data. Can Skip() instead advance over the data w/o doing all the work of Next()?
cypof
Oct 3, 2016
Contributor
I checked the database code; it's only moving the cursor, which should be cheap.
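The round-robin skip being discussed can be sketched in a few lines of Python (illustrative only, not the actual C++ Skip()/Next() implementation). Each solver keeps only the records whose running offset lands on its rank; because the offset is never reset, the schedule stays consistent across database wrap-arounds:

```python
def keeps(offset, rank, num_solvers):
    # Skip() returns True for every record the global round-robin assigns
    # to another solver; this predicate is the complement of Skip().
    return offset % num_solvers == rank

num_solvers = 4
# First pass over a 12-record database plus the start of the second epoch.
schedule = {r: [off for off in range(14) if keeps(off, r, num_solvers)]
            for r in range(num_solvers)}
print(schedule[0])  # → [0, 4, 8, 12]: GPU 0 keeps offset 12 (record 0 again)
```

Offset 12 wraps back to record 0 of the database, yet still goes to GPU 0 — the round robin on the unrolled dataset is undisturbed by the reset.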
Thanks for reviewing, I will update this tomorrow.
This was referenced Oct 17, 2016
junshi15
commented
Oct 27, 2016
@cypof since you are getting rid of P2PSync, how would this affect RDMASync used in CaffeOnSpark?
@junshi15 GPUParams has not changed; it still gathers all the weights in a single buffer, so the distributed version should not require any changes. You run the distributed allreduce on this buffer at the end of the local one, like before. The code will need to extend NCCL instead of P2PSync, but they are similar.
junshi15
commented
Oct 27, 2016
@cypof Thanks for the info. How much performance gain do you see with NCCL against the original P2PSync?
I don't have numbers yet, but it's better. If you plan to switch I would be happy to know what you get.
@@ -51,7 +51,18 @@ const int NPY_DTYPE = NPY_FLOAT32;
void set_mode_cpu() { Caffe::set_mode(Caffe::CPU); }
void set_mode_gpu() { Caffe::set_mode(Caffe::GPU); }
-void set_random_seed(unsigned int seed) { Caffe::set_random_seed(seed); }
@@ -72,7 +72,8 @@ class Solver {
  inline const vector<shared_ptr<Net<Dtype> > >& test_nets() {
    return test_nets_;
  }
-  int iter() { return iter_; }
+  int iter() const { return iter_; }
+  void set_iter(int value) { iter_ = value; }
shelhamer
Nov 19, 2016
Owner
Should not be necessary: please double-check and drop. Restore() should handle this.
@@ -105,6 +105,32 @@ class DataLayerTest : public MultiDeviceTest<TypeParam> {
    }
  }
+  void TestSkip() {
shelhamer
referenced
this pull request
Nov 21, 2016
Merged
Solver: check and set type to reconcile class and proto type #5009
-  inline static void set_root_solver(bool val) { Get().root_solver_ = val; }
+  inline static int solver_rank() { return Get().solver_rank_; }
+  inline static void set_solver_rank(int val) { Get().solver_rank_ = val; }
+  inline static bool multi_process() { return Get().multi_process_; }
@cypof Thanks for the updates. Still need to drop
SIshijima
commented
Jan 1, 2017
Although training an LSTM network with multiple GPUs completes successfully, I failed to resume the training from a snapshot with multi-GPU. Multi-GPU training --> Success. -----Error message when failed------
-  for (int i = 0; i < PREFETCH_COUNT; ++i) {
-    prefetch_free_.push(&prefetch_[i]);
+  prefetch_(param.has_data_param() ?
+      param.data_param().prefetch() : PREFETCH_COUNT),
jeffdonahue
Jan 3, 2017
Contributor
Unless I'm misunderstanding something, this should just be prefetch_(param.data_param().prefetch()), no? That should just use the default value if there's no explicit data_param or data_param.prefetch set, removing the need to also duplicate and hardcode PREFETCH_COUNT.
+#ifndef CPU_ONLY
+template <typename Dtype>
+void caffe_gpu_scal(const int N, const Dtype alpha, Dtype* X, cudaStream_t str);
+#endif
jeffdonahue
Jan 3, 2017
Contributor
I couldn't find where this function was called -- does it get called somewhere? If not, it should be removed from this PR.
Thanks @cypof -- other than the minor comments above, this looks good to me.
@SIshijima that's strange, not sure why that would be. I've never tried multi-GPU training with recurrent layers -- does that work with Caffe master as is (without this PR)? Also, does single GPU training with recurrence still work with this PR?
SIshijima
commented
Jan 4, 2017
Thanks @jeffdonahue. I'm working from Caffe master with this PR, and found the problem.
@SIshijima great, thanks for checking that -- issue is unrelated to this PR then. Feel free to post a separate issue about it if you like. (edit: oops, I now see you already pointed to an existing issue about this.)
SIshijima
referenced
this pull request
Jan 4, 2017
Closed
Failed to resume training for LSTM with multi GPU #5154
Thanks @jeffdonahue for the co-review. @cypof this looks ready for merge once you do a pass to groom the history (for instance squashing style fixes like the line edit, and fixes to CMake, etc.). Since #5154 is an existing issue, it can be resolved by another PR.
cypof
and others
added some commits
Nov 22, 2016
OK done, commits look better now.
shelhamer
merged commit 536851c
into
BVLC:master
Jan 17, 2017
1 check passed
@cypof thanks for the migration to NCCL! I think a follow-up PR that updates the interface docs and adds a notebook for Python parallelism would be helpful.
Multi-GPU training from python is great, thanks for this work! However, this breaks multi-GPU training on Windows, as NCCL does not build natively on Windows. In case someone is interested in porting this change to Windows, I have started a minimal NCCL CMake-based build for Windows at https://github.com/willyd/nccl/tree/windows. This can build a static library with VS 2015, but I haven't ported any of the tests yet. Credits to @tbennun for the initial port.
This was referenced Jan 17, 2017
awan-10
commented
Jan 19, 2017
After this integration of NCCL, will the old multi-GPU code work, or is the NCCL dependency a must for multi-GPU training from now on?
Yes, NCCL became a required dependency for multi-GPU training.
shelhamer
referenced
this pull request
Jan 20, 2017
Closed
Serialize python execution in Multi-GPU #2939
SIshijima
commented
Jan 22, 2017
@shelhamer, I checked master with this PR merged.
happynear
commented
Jan 22, 2017
@shelhamer, @SIshijima, I came across the same problem when I tried to resume training a normal CNN with multi-GPU. I think this line in
You're right, the check is not valid anymore, removing it.
This was referenced Jan 23, 2017
oyxhust
commented
Jan 25, 2017
So I can train the python layer on multi-gpu now?
Hi @cypof, @shelhamer,
happyharrycn
commented
Feb 3, 2017
This part is also confusing to me. Currently, each GPU has its own process and thus its own instance of DataLayer. During training, layer parameters are broadcast using NCCL, but the DataLayer's db/cursor is not, leading to unsynchronized data-fetching orders. Each GPU has its own order of fetching data; they do not share the same cursor. In theory, this might lead to slower convergence (in the worst case, exactly the same data points on each GPU). Yet with random shuffling/skipping this should never happen, and the effect will factor out on large-scale datasets. I am curious how this unsynchronized data-fetching order affects accuracy/convergence on datasets of different sizes.
It's not mentioned in the PR description, but the current system is exactly equivalent to the old one. Before, there was a single instance of the DB and some code to round-robin items between GPUs, so that training would still be deterministic. Having a single DB cursor seemed to help with the mechanical disks we used when it was written a couple of years ago. But it doesn't matter that much, and people use SSDs now on large GPU boxes anyway, so the new implementation is simpler. Each network opens a separate cursor and skips some items, so that they only keep the ones they would have been given by the round-robin code. E.g. with 4 solvers, GPU 0 keeps 0, 4, 8; GPU 1 keeps 1, 5, 9; etc. This also allows training from python, where each solver needs its own process.
happyharrycn
commented
Feb 5, 2017
Thanks @cypof for the explanation! Now I get it. I wonder how we should capture this information. It is important if someone wants to write a new data layer for the multi-GPU setting in C++/Python: they should implement a proper round-robin schedule by checking the solver rank.
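To make that concrete, here is a hedged sketch of what a rank-aware Python data layer could look like. Since it is not guaranteed that the C++ Caffe::solver_rank() is exposed to Python layers, the rank and solver count are passed through param_str here; the class and method names are illustrative, not part of the Caffe API:

```python
import json

class RankAwareDataLayer:
    """Would subclass caffe.Layer in a real setup (illustrative sketch)."""

    def setup(self, bottom, top):
        params = json.loads(self.param_str)
        self.rank = params["solver_rank"]
        self.count = params["solver_count"]
        self.offset = 0  # never reset, mirroring the PR's offset_ counter

    def next_index(self, dataset_size):
        # Advance until the global round-robin assigns a record to this solver.
        while self.offset % self.count != self.rank:
            self.offset += 1
        index = self.offset % dataset_size  # wrap around the database
        self.offset += 1
        return index

layer = RankAwareDataLayer()
layer.param_str = json.dumps({"solver_rank": 1, "solver_count": 4})
layer.setup(None, None)
print([layer.next_index(10) for _ in range(3)])  # → [1, 5, 9]
```

With 4 solvers, rank 1 sees records 1, 5, 9, ... — the same assignment the old single-cursor round robin would have produced.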
shelhamer
referenced
this pull request
Feb 17, 2017
Closed
fix the bug of some gpus are blocked during a training iteration #4368
iacopomasi
commented
Feb 18, 2017
I thought that with this pull it would be possible to run caffe with multiple GPUs and a custom Python layer, but doing so I got:
I am not sure if I understood correctly: can I use my custom python layer if I use
Thanks
oyxhust
commented
Feb 20, 2017
I have the same question as @iacopomasi. When I used my custom python layer I got this issue. So I have to use train.py? @shelhamer
@iacopomasi @oyxhust Right. @cypof, it might be helpful to include further documentation in the form of an ipython notebook?
weiliu89
commented
Mar 9, 2017
@cypof Thanks for including NCCL in Caffe, which helps speed up multi-GPU training quite a bit. However, I am facing two issues:
Thanks!
@shelhamer @weiliu89 Sorry I missed the previous message. Yes, a notebook would be great; I will try to do that. About share_weight, it's a matter of ordering the graph to sync weights only when all their layers are done. It might not be too difficult, but we need to look at the code for weight sharing, how to list layers for a given blob, etc. I haven't seen the hang on shutdown; there is not much we can do if it hangs inside NCCL. It might be useful to create an issue on the NCCL repo.
cypof commented Aug 9, 2016
This is an attempt to merge NVIDIA's multi-GPU work on NCCL, while fixing the open issues with python support and the parallel data pipeline. Some of them are non-trivial, particularly around the GIL, so the idea is to switch to a python implementation using processes instead. NCCL supports it, and transfers are direct GPU-GPU anyway, so performance should be the same. The code in train.py should have the same functionality as the command line and C++ version, but is much simpler and gives more flexibility to users. Custom setups from users will also be easier to share with Caffe2.
To test it, you need to clone NCCL, run make install, and set the new flag in Makefile.config. The GIL, and the fact that you need to fork sub-processes before initializing CUDA, made it tricky to get good performance, but it looks OK now. It should be as good as the NV branch with the new thread pool, maybe a bit faster, as the current pipeline does an in-device copy as the last step, whereas this one is fully zero-copy using CUDA IPC. I have tried it on a couple of 4- and 8-GPU boxes so far and everything seems to stay busy.
If NV is interested to help testing and benchmarking, it seems layer-wise reduction and overlapping NCCL communications with compute is often slower than a single big allreduce at the end. Not sure if I'm doing something wrong, and if it can be fixed/improved. Also I don't think layer-wise works with shared weights. We need to compute actual dependencies between layers and that's better left for Caffe2. For now I simply disable layer-wise if the network has shared weights.
If we are happy with this, I suggest we deprecate multi-GPU from the command line version, and remove all associated code before 1.0. There is a lot of code we could remove around the round-robin IO, shared layers lock, root solvers etc. The complexity has grown a bit out of control, and NV had to add another layer of thread pool in their branch to make it go fast, with more Transformer functions etc. I hope we can avoid it by switching to this.
TODO: break down the PR into simpler ones, and modularize the python code. For now it exposes a single function train() that emulates the command line. The different parts could be made easier to use and customize individually.
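As a rough sketch of what train() does under the hood: one process per GPU, forked before any CUDA initialization, all joined into one NCCL clique via a shared uid. The caffe calls below (set_solver_rank, NCCL, new_uid, bcast, add_callback) follow the names in this PR's train.py, but treat the exact signatures as assumptions rather than a definitive API:

```python
import multiprocessing as mp

def assign_devices(gpus):
    # One solver process per GPU: rank r drives device gpus[r].
    return {rank: gpu for rank, gpu in enumerate(gpus)}

def worker(uid, rank, gpus, solver_proto):
    # Runs in a child process, before the parent touches CUDA.
    import caffe  # names below follow this PR's train.py; signatures assumed
    caffe.set_mode_gpu()
    caffe.set_device(gpus[rank])
    caffe.set_solver_count(len(gpus))
    caffe.set_solver_rank(rank)
    solver = caffe.SGDSolver(solver_proto)
    nccl = caffe.NCCL(solver, uid)  # per-iteration allreduce callback
    nccl.bcast()                    # sync initial weights from rank 0
    solver.add_callback(nccl)
    solver.step(solver.param.max_iter)

def train(gpus, solver_proto):
    import caffe
    uid = caffe.NCCL.new_uid()      # shared id so all ranks join one clique
    procs = [mp.Process(target=worker, args=(uid, r, gpus, solver_proto))
             for r in range(len(gpus))]
    for p in procs:
        p.start()
    for p in procs:
        p.join()

print(assign_devices([0, 1, 2, 3]))  # → {0: 0, 1: 1, 2: 2, 3: 3}
```

Forking before any CUDA call matters because a CUDA context cannot be safely shared across a fork; each child must create its own.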