How to deal with the case that when one or some processes are much faster than others #21

Open
BichengYing opened this issue Apr 14, 2020 · 1 comment
Assignees
Labels
enhancement New feature or request investigation This issues requires more investigation

Comments

@BichengYing
Collaborator

Because of the nature of one-sided communication, the progress of different processes may vary a lot, especially in a heterogeneous environment. If we simply write code like
for e in range(epochs):
    xxx
    some_collective_ops

then the collective ops at the end of each epoch make every process wait for the slowest one, which wastes the advantage of one-sided communication. We need a better way to design the code or deal with this situation.
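To make the cost concrete, here is a rough back-of-the-envelope sketch, not a measurement. It assumes the collective op is blocking, so each epoch takes as long as the slowest process, and it ignores communication time. The function name and the per-epoch timings are hypothetical.

```python
def idle_time_per_process(step_times, epochs):
    """Seconds each process spends waiting if every epoch ends in a blocking collective.

    step_times: hypothetical seconds of local work per epoch, one entry per process.
    Assumes the collective completes only once the slowest process arrives.
    """
    slowest = max(step_times)
    return [epochs * (slowest - t) for t in step_times]


# Three processes, one of them four times slower than the others (made-up numbers).
step_times = [1.0, 1.0, 4.0]
idle = idle_time_per_process(step_times, epochs=10)
print(idle)
```

Under these assumptions the two fast processes sit idle for three quarters of the run, which is exactly the time that one-sided communication was supposed to let them use.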

@BichengYing BichengYing added enhancement New feature or request investigation This issues requires more investigation labels Apr 14, 2020
@BichengYing BichengYing self-assigned this Apr 14, 2020
@BichengYing
Collaborator Author

Thoughts:
1. Call a barrier every N iterations. This can help when performance is unstable, but it does not help in the heterogeneous case.
2. Run for a very long time and rely on early stopping: whichever node/agent meets the stopping criterion first sends a stop signal to the others, and that agent's model is used as the final result.
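The second idea can be sketched with plain Python threads standing in for processes. This is only an illustration under stated assumptions: `run_agents`, the per-step delays, and the "reach `target_steps`" stopping criterion are all hypothetical, and a real implementation would deliver the stop signal through the communication library rather than a `threading.Event`.

```python
import threading
import time


def run_agents(step_delays, target_steps):
    """Run one thread per agent; the first to meet the criterion stops everyone."""
    stop = threading.Event()        # stand-in for the "stop signal"
    lock = threading.Lock()
    result = {"winner": None}
    steps_done = [0] * len(step_delays)

    def agent(rank, delay):
        while not stop.is_set():
            time.sleep(delay)       # stand-in for one local training step
            steps_done[rank] += 1
            # Hypothetical stopping criterion: a fixed number of steps.
            if steps_done[rank] >= target_steps:
                with lock:
                    if result["winner"] is None:
                        result["winner"] = rank   # this agent's model would be kept
                stop.set()          # tell the other agents to stop

    threads = [threading.Thread(target=agent, args=(rank, delay))
               for rank, delay in enumerate(step_delays)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result["winner"], steps_done


# One fast agent and two slow ones (made-up delays, in seconds).
winner, steps = run_agents([0.001, 0.02, 0.02], target_steps=20)
print("winner:", winner, "steps per agent:", steps)
```

No agent ever waits on another here: slow agents simply finish with fewer steps. The first idea would look different, with every agent calling something like `threading.Barrier.wait()` every N steps, which reintroduces the waiting described above whenever one agent is consistently slower.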
