Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

More user-friendly errors and automatic restarts in case of engines crashing due to OOM #695

Open
sahil1105 opened this issue Apr 19, 2022 · 5 comments

Comments

@sahil1105
Copy link
Collaborator

The errors we report in case of OOM and Segmentation-Fault are now much better, but I was wondering is there a way to make them more "user-friendly"?

  1. Currently, at least for the MPI case, we report the mpiexec output, which is great, but could there be a way to report a cleaner error in addition to this, that could clearly identify this as a OOM error (or a seg-fault if possible)?
  2. Is there something that packages (like Bodo) could do to make this experience better/easier?
  3. What's the best way to automate restart of engines in this case? Ideally, if enabled, in cases where the engines crash, if we could clean up the processes, display a message (e.g. "engines crashed due to OOM, restarting engines..."), and then restart the engines, that would be useful.
@minrk
Copy link
Member

minrk commented Apr 21, 2022

I think it's hard to do this in general such that it fits in the base class, but Launchers have two relevant methods:

  1. _log_output which is called on stop. This is what logs the mpi errors. You can override this in your custom Launcher to do further processing/parsing of the output to change what's logged by default instead of or in addition to the current MPI output
  2. Launcher.on_stop allows registering arbitrary stop callbacks. example notebook.

If you already have a custom launcher, you can combine these to add self.on_stop(self.custom_log_message) at the end of .start() to always add your own custom stop handlers.

@sahil1105
Copy link
Collaborator Author

Thanks @minrk! Will try this out.

@sahil1105
Copy link
Collaborator Author

@minrk Any feedback on the automatic restart setup?

@minrk
Copy link
Member

minrk commented Apr 27, 2022

Sorry, missed that part. Automatic restart could possibly also be achieved through the on_stop callback. The question becomes whether it makes sense to restart the same engine set vs starting a new one. Restarting in-place would probably feel cleaner, but likely would also make debugging more challenging (e.g. losing handles on the logs for the crashed engines). Starting a new engine set is simpler, because you only need to call cluster.start_engines(n).

I think it's reasonable for restart-on-fail to be a built-in feature for Engine[Set]Launcher, but it should be possible now via on_stop.

@sahil1105
Copy link
Collaborator Author

sahil1105 commented May 6, 2022

Thanks @minrk! Will try out building restart in a custom launcher.
Will also open a separate issue for built-in restart support.


UPDATE: Opened this issue: #706

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants