Multi-GPU Training with DDP #1096
What exactly should DDP wrap? Essentially, a DDP module wraps a set of parameters, and a function (the wrapped module's `forward`). Later, when `backward()` is called on the results, DDP's hooks reduce the gradients of those parameters across processes. Note that the results' `backward()` has to go through the DDP output for that synchronization to happen.
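A minimal sketch of that contract (not ALF code), assuming the process group has already been initialized in each subprocess:

```python
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Assumes dist.init_process_group("nccl") has already been called in each subprocess.
rank = dist.get_rank()
model = nn.Linear(16, 4).to(rank)
ddp_model = DDP(model, device_ids=[rank])   # wraps the parameters + forward

x = torch.randn(8, 16, device=rank)
loss = ddp_model(x).sum()                   # forward: same result as model(x).sum()
loss.backward()                             # backward: DDP hooks all-reduce the gradients
```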
Can we have more than one DDP-wrapped module in a distributed training? The answer is yes. Theoretically it works, and I coded an experiment to verify that. It is worth noting that if you have more than one DDP-wrapped module, the order in which they are called must be exactly the same in every subprocess. Because of how DDP works, if the order differs, the reducer of module A in process 1 might be waiting for its counterpart in process 2, while in process 2 the reducer of module B is waiting for its counterpart in process 1 - effectively a textbook example of deadlock.
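A hedged illustration of that constraint (the two linear modules are made up; only the call order matters):

```python
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Assumes the process group is already initialized.
rank = dist.get_rank()
ddp_a = DDP(nn.Linear(16, 8).to(rank), device_ids=[rank])
ddp_b = DDP(nn.Linear(8, 4).to(rank), device_ids=[rank])

x = torch.randn(32, 16, device=rank)
# Every process must call the two wrapped modules in the same order (A then B).
# If process 1 did A -> B while process 2 did B -> A, A's reducer in process 1
# would wait on process 2 while B's reducer in process 2 waits on process 1.
loss = ddp_b(ddp_a(x)).sum()
loss.backward()
```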
While working on enabling ... After some digging I found that the problem comes from the fact that when DDP starts to sync (reduce), it syncs the buffers of the wrapped module as well. All the offending buffers are within the replay buffer. I am working on a generic way to rule them out before they get wrapped by DDP. One of the problems is that the ...
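Not necessarily what this issue settled on, but for reference there are two standard ways to keep such buffers out of DDP's sync; a sketch, assuming the same `model` / `rank` names as above:

```python
import torch
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# Option 1: turn off DDP's buffer broadcast entirely. Note this affects ALL
# buffers, including ones (e.g. BatchNorm running stats) you may want synced.
ddp_model = DDP(model, device_ids=[rank], broadcast_buffers=False)

# Option 2 (sketch, hypothetical class): keep large replay-buffer storage as a
# plain tensor attribute instead of a registered buffer, so DDP never sees it.
class ReplayStorage(nn.Module):
    def __init__(self, capacity: int, obs_dim: int):
        super().__init__()
        self._storage = torch.zeros(capacity, obs_dim)  # not register_buffer()
```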
With some hacks I was able to run PPG with DDP on two 3080s. Below is the comparison with the same setup trained on a single 3090.
Note that the DDP version did better when looking at the by-env-steps graph. Also, the time consumed is less than on the single 3090. It is actually not 2x but 1.5x faster. I think one of the factors is that a 3090 has better performance than a single 3080. Another reason could be that in this hacky version I had to let DDP figure out which parameters are "unused", which adds overhead. I am still working on removing those hacks.
I got stuck on how ... The reason we need it for PPG is that the auxiliary output of PPG's network is not used for the policy-phase update, only for the auxiliary-phase update. Therefore the corresponding parameters become "unused", and DDP does not like that, as it is waiting for hooks to be called on all parameters.
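For reference, the standard PyTorch knob for this situation is DDP's `find_unused_parameters` flag (reusing the `model` / `rank` names from the sketches above):

```python
# find_unused_parameters=True lets the reducer tolerate parameters that get no
# gradient in a given backward pass (e.g. the auxiliary head during the policy
# phase), at the cost of traversing the autograd graph every iteration.
ddp_model = DDP(model, device_ids=[rank], find_unused_parameters=True)
```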
When turning on DDP, PPG + Metadrive can get stuck after several iterations (or several hundred iterations), seemingly arbitrarily. To make sure that it is DDP causing the problem, I also ran another training without DDP, and the result looks good. See below for the comparison.
For an explanation of the above debugging log, see below. Further debugging shows that when it gets stuck, it is inside the ...
Was able to pinpoint the problem at

```python
experience = alf.nest.map_structure(lambda x: x[indices],
                                    experience)
```

where one of the processes got stuck, which is outside the DDP-wrapped code. This is consistently reproducible on 2 different machines. The above code comes from https://github.com/HorizonRobotics/alf/blob/pytorch/alf/algorithms/algorithm.py#L1372 Both ...
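For context, assuming `alf.nest.map_structure` behaves like other nest/tree utilities (applying the function to every tensor leaf of the nested structure), the line above is roughly doing the following, spelled out for a hypothetical flat dict of tensors:

```python
# Each leaf tensor of `experience` gets indexed with the same `indices`
# tensor to select the current mini-batch.
experience = {name: leaf[indices] for name, leaf in experience.items()}
```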
This one seems related: https://discuss.pytorch.org/t/training-get-stuck-at-some-iteration-step/48329
Latest experiment result: after moving the shuffle into each mini_batch, it worked around the previous stuck point. However, it then freezes at calling the DDP-wrapped function ...
Can rule out ...
This is a follow-up to #913
Motivation
Add full support for multi-process and multi-GPU training in alf with PyTorch's DDP.
Goals
- `forward` of the wrapped DDP module is equivalent to calling the original function, with distributed hooks added to the result (Implement @data_distributed Decorator #1098)
- `unroll()` should not go through DDP in the off-policy branch (Conditional @data_distributed_when to disable DDP on unroll for Off-policy Algorithms #1114)

While achieving the main goals above, we should also make sure that the following specific use cases are considered:

- Parameters updated outside of `backward` and `optimizer` (e.g. the target updater in SAC). Make sure that the behavior is consistent with the non-distributed version.
- `num_env_steps` ...
- When terminated with `SIGINT`, there are defunct zombie processes left

Blockers and Issues: