No multi-GPU capability with shared weights #5538

Closed
peteanderson80 opened this issue Apr 14, 2017 · 5 comments

@peteanderson80

Issue summary

It appears that it is no longer possible to train a network with shared weights across multiple GPUs. This worked in rc3. Was this functionality deliberately sacrificed in the upgrade to use NCCL? If so, it's a bit of a shame for us at least, as we can't upgrade past rc3.

Steps to reproduce

./build/tools/caffe train --solver=solver_referencing_net_with_shared_weights.prototxt

If compiled with USE_NCCL, this will trigger "Layer-wise reduce is not supported for nets with shared weights." (from parallel.cpp).

Otherwise, this will fail with "Multi-GPU execution not available - rebuild with USE_NCCL" (from caffe.cpp).
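
For context, the shared weights follow the standard Caffe convention of giving parameters the same name in more than one layer. A minimal, purely illustrative net fragment (layer and blob names are made up, not from our actual model) that hits the check above when trained on multiple GPUs:

layer {
  name: "fc_a"
  type: "InnerProduct"
  bottom: "data_a"
  top: "out_a"
  param { name: "shared_fc_w" }  # weights shared by name
  param { name: "shared_fc_b" }  # bias shared by name
  inner_product_param { num_output: 128 }
}
layer {
  name: "fc_b"
  type: "InnerProduct"
  bottom: "data_b"  # bottoms must have matching shapes so the shared blobs line up
  top: "out_b"
  param { name: "shared_fc_w" }  # same names, so fc_b reuses fc_a's blobs
  param { name: "shared_fc_b" }
  inner_product_param { num_output: 128 }
}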

@shelhamer
Member

shelhamer commented Apr 15, 2017

Setting layer_wise_reduce: false in the solver specification should resolve this. The issue is that the gradient order with weight sharing does not necessarily respect the topological ordering of the layer graph, which the parallel implementation follows to overlap communication with computation. The error is triggered to keep from accidentally computing the wrong gradients.

Reducing at the end of backward ensures correctness for shared weights at the cost of efficiency. Parallel training can still give a speed-up in this case depending on the architecture.
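
For example, a minimal solver sketch with the workaround in place (the net path and hyperparameters are placeholders, not taken from this report):

net: "net_with_shared_weights.prototxt"  # placeholder path
base_lr: 0.01
lr_policy: "fixed"
max_iter: 10000
solver_mode: GPU
layer_wise_reduce: false  # reduce once after the full backward pass instead of layer by layer

Training is then launched as usual, e.g. ./build/tools/caffe train --solver=solver.prototxt --gpu 0,1.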

@cypof, please confirm.

@cypof
Member

cypof commented Apr 15, 2017

Yes, layer_wise_reduce is only an optimization. It often doesn't make a huge difference, so multi-GPU training is definitely still worthwhile without it.

cypof closed this as completed Apr 15, 2017
@peteanderson80
Author

Problem sorted, thank you!

@shelhamer
Member

@cypof this should likely be documented, for instance in docs/multigpu.md

@billhhh

billhhh commented Aug 16, 2017

My problem was that USE_NCCL was not uncommented in my Makefile.config.
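
For anyone hitting the same thing, the line to enable in Makefile.config (as I understand the stock config; double-check your own copy) is:

# Uncomment to build with NCCL and enable multi-GPU training (shipped as "# USE_NCCL := 1")
USE_NCCL := 1

and then rebuild (e.g. make clean && make all).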
