No multi-GPU capability with shared weights #5538
Comments
Setting layer_wise_reduce: false (reducing at the end of backward) ensures correctness for shared weights at the cost of efficiency. Parallel training can still give a speed-up in this case, depending on the architecture. @cypof, please confirm.
Yes, layer_wise_reduce is only an optimization. It often doesn't make a huge difference, so it's definitely still worth it to run multi-GPU without it.
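For anyone else hitting this, disabling the optimization is a one-line addition to the solver prototxt. A minimal sketch, assuming the rest of the solver is already set up (the net path and solver fields below are placeholders):

net: "net_with_shared_weights.prototxt"
base_lr: 0.01
solver_mode: GPU
# Reduce gradients once at the end of backward instead of layer by layer;
# required for correctness when the net shares weights across layers.
layer_wise_reduce: false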
Problem sorted, thank you!
@cypof this should likely be documented, for instance in docs/multigpu.md |
My problem was that NCCL was not uncommented in my Makefile.config.
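For reference, multi-GPU support requires NCCL at build time. In the stock Makefile.config this means uncommenting the USE_NCCL line (a sketch; the comment wording in your copy may differ):

# Uncomment to enable multi-GPU training with NVIDIA NCCL
USE_NCCL := 1

then rebuilding from a clean tree (make clean && make all), or passing -DUSE_NCCL=ON if building with CMake.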
Issue summary
It appears that it is no longer possible to train a network with shared weights across multiple GPUs. This worked in rc3. Was this functionality deliberately sacrificed in the upgrade to NCCL? If so, it's a bit of a shame, at least for us, since we can't upgrade past rc3.
Steps to reproduce
./build/tools/caffe train --solver=solver_referencing_net_with_shared_weights.prototxt
If compiled with USE_NCCL, this will trigger "Layer-wise reduce is not supported for nets with shared weights." (from parallel.cpp).
Otherwise, this will fail with "Multi-GPU execution not available - rebuild with USE_NCCL" (from caffe.cpp).
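For context, "shared weights" here means two layers bound to the same parameter blob by name. A minimal sketch of the prototxt pattern that trips the check (layer names, bottoms/tops, and shapes are placeholders):

layer {
  name: "ip_a"
  type: "InnerProduct"
  bottom: "data_a"
  top: "out_a"
  param { name: "shared_w" }   # weights shared by name
  inner_product_param { num_output: 10 }
}
layer {
  name: "ip_b"
  type: "InnerProduct"
  bottom: "data_b"
  top: "out_b"
  param { name: "shared_w" }   # same name, so the same underlying blob
  inner_product_param { num_output: 10 }
}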