How to deal with the case when one or more processes are much faster than others #21

BichengYing · 2020-04-14T06:46:34Z

By its nature, one-sided communication lets processes progress at different rates, and the gap can become large, especially in a heterogeneous environment. If we simply write the code like this:
for e in range(epochs):
    xxx
    some_collective_ops

Then the collective op at the end of each epoch makes every process wait for the slowest one, which wastes the advantage of one-sided communication. We need a better way to structure the code or otherwise deal with this situation.
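
To put a rough number on the waiting cost, here is a minimal timing model. It is only a sketch: it assumes each process needs a constant time per epoch, ignores the cost of the communication itself, and the function name `finish_times` and the sample timings are made up for illustration.

```python
def finish_times(step_times, epochs, sync_each_epoch):
    """Return the time at which each process finishes `epochs` epochs.

    step_times[i] is the (assumed constant) time process i needs per epoch.
    """
    if sync_each_epoch:
        # A collective op completes only once the slowest process arrives,
        # so every epoch effectively costs max(step_times) for everyone.
        total = epochs * max(step_times)
        return [total] * len(step_times)
    # Without synchronization each process advances at its own pace.
    return [epochs * t for t in step_times]


# Hypothetical setup: two fast processes and one slow process.
step_times = [1.0, 1.0, 4.0]
synced = finish_times(step_times, epochs=10, sync_each_epoch=True)
free = finish_times(step_times, epochs=10, sync_each_epoch=False)

# Time each process spends idle, waiting inside the collective op.
idle = [s - f for s, f in zip(synced, free)]
print(synced)  # [40.0, 40.0, 40.0]
print(free)    # [10.0, 10.0, 40.0]
print(idle)    # [30.0, 30.0, 0.0]
```

In this toy setup the two fast processes spend 30 of their 40 time units waiting, which is exactly the advantage a per-epoch collective op gives away.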

BichengYing · 2020-04-28T03:21:49Z

Thoughts:
1. Use a barrier every N iterations. This can help when per-iteration performance is unstable, but it does not help in a heterogeneous setting.
2. Run for a very long time and rely on early stopping: whichever node/agent reaches the stopping criterion first sends a stop signal to the others, and that agent's model is used as the final result.
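
The second thought can be sketched with plain Python threads standing in for agents. This is an illustration only, not a proposal for the actual API: `threading.Event` plays the role of the stop signal, `time.sleep` simulates compute of different speeds, and `run_agents`, `target_loss`, and `decay` are hypothetical names.

```python
import threading
import time


def run_agents(step_times, target_loss=0.1, decay=0.8, max_steps=10_000):
    """Each agent iterates at its own pace; the first to reach
    target_loss signals all the others to stop."""
    stop = threading.Event()   # stand-in for the "stop signal"
    lock = threading.Lock()    # protects `result` if two agents finish together
    result = {"winner": None, "loss": None}
    steps_done = [0] * len(step_times)

    def agent(rank, step_time):
        loss = 1.0
        for _ in range(max_steps):
            if stop.is_set():          # another agent already met the criterion
                return
            time.sleep(step_time)      # simulated compute for one iteration
            loss *= decay              # simulated training progress
            steps_done[rank] += 1
            if loss <= target_loss:    # stopping criterion reached
                with lock:
                    if result["winner"] is None:   # only the first finisher counts
                        result["winner"] = rank
                        result["loss"] = loss
                stop.set()             # tell everyone else to stop
                return

    threads = [
        threading.Thread(target=agent, args=(rank, t))
        for rank, t in enumerate(step_times)
    ]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return result, steps_done


# Hypothetical setup: agent 0 is roughly 50x faster than agent 1.
result, steps_done = run_agents([0.001, 0.05])
print(result["winner"], steps_done)
```

Nobody blocks on anybody else here: slower agents only poll the flag between iterations, so a fast agent never waits. In a real distributed run the flag would have to be delivered through the communication layer, and the winner's model would still need to be sent to whoever consumes the final result.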

BichengYing added the enhancement and investigation labels Apr 14, 2020

BichengYing self-assigned this Apr 14, 2020
