More user-friendly errors and automatic restarts in case of engines crashing due to OOM #695

sahil1105 · 2022-04-19T15:29:19Z

The errors we report in case of OOM and segmentation faults are now much better, but I was wondering whether there is a way to make them more "user-friendly".

Currently, at least for the MPI case, we report the mpiexec output, which is great, but could we also report a cleaner error that clearly identifies this as an OOM error (or a seg-fault, if possible)?
Is there something that packages (like Bodo) could do to make this experience better/easier?
What's the best way to automate restarting the engines in this case? Ideally, if enabled, when the engines crash we could clean up the processes, display a message (e.g. "engines crashed due to OOM, restarting engines..."), and then restart the engines.

minrk · 2022-04-21T08:52:39Z

I think it's hard to do this in general such that it fits in the base class, but Launchers have two relevant methods:

- _log_output, which is called on stop. This is what logs the MPI errors. You can override it in your custom Launcher to further process or parse the output, changing what is logged instead of, or in addition to, the current MPI output.
- Launcher.on_stop, which allows registering arbitrary stop callbacks (see the example notebook).

If you already have a custom launcher, you can combine these to add self.on_stop(self.custom_log_message) at the end of .start() to always add your own custom stop handlers.
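As a rough illustration of what such a custom stop handler could look like, here is a minimal sketch of a callback that turns an exit code into a friendlier message. It deliberately does not import ipyparallel. The names describe_exit and custom_log_message are hypothetical, and the assumption that the stop data is a dict carrying an "exit_code" key should be checked against your launcher. The signal numbers are the standard Linux values for SIGKILL (9, which the kernel OOM killer uses) and SIGSEGV (11).

```python
# Hypothetical helper names; none of this is part of ipyparallel itself.
SIGKILL = 9   # sent by the Linux OOM killer, but also by a plain `kill -9`
SIGSEGV = 11  # segmentation fault


def describe_exit(exit_code):
    """Return a short, human-readable guess at why the engines stopped."""
    if exit_code is None:
        return "engines stopped (exit code unknown)"
    if exit_code == 0:
        return "engines exited normally"
    if exit_code < 0:
        # Python's subprocess reports death-by-signal as a negative code
        signum = -exit_code
    elif 128 < exit_code <= 128 + 64:
        # shells conventionally report death-by-signal as 128 + signal number
        signum = exit_code - 128
    else:
        return f"engines exited with code {exit_code}"
    if signum == SIGKILL:
        return "engines were killed (SIGKILL); this is often the kernel OOM killer"
    if signum == SIGSEGV:
        return "engines crashed with a segmentation fault (SIGSEGV)"
    return f"engines were terminated by signal {signum}"


def custom_log_message(stop_data):
    """Stop callback; assumes stop_data is a dict with an 'exit_code' key."""
    exit_code = stop_data.get("exit_code") if isinstance(stop_data, dict) else None
    message = describe_exit(exit_code)
    print(message)
    return message
```

A custom launcher would then register it with self.on_stop(custom_log_message) at the end of .start(), as described above. Note that mpiexec may report its own exit code rather than the engine's, so for the MPI case parsing the captured output in _log_output may still be needed.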

sahil1105 · 2022-04-27T01:13:28Z

Thanks @minrk! Will try this out.

sahil1105 · 2022-04-27T01:31:03Z

@minrk Any feedback on the automatic restart setup?

minrk · 2022-04-27T07:27:55Z

Sorry, missed that part. Automatic restart could possibly also be achieved through the on_stop callback. The question becomes whether it makes sense to restart the same engine set vs starting a new one. Restarting in-place would probably feel cleaner, but likely would also make debugging more challenging (e.g. losing handles on the logs for the crashed engines). Starting a new engine set is simpler, because you only need to call cluster.start_engines(n).

I think it's reasonable for restart-on-fail to be a built-in feature for Engine[Set]Launcher, but it should be possible now via on_stop.
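A minimal sketch of the "start a new engine set" approach, written as a plain callback factory so it runs without ipyparallel. The names make_restart_callback, on_engines_stopped and max_restarts are hypothetical, the start_engines argument stands in for something like cluster.start_engines, and the "exit_code" key in the stop data is an assumption. The restart cap is an added safeguard so a job that always runs out of memory does not restart forever.

```python
def make_restart_callback(start_engines, n, max_restarts=3):
    """Build a stop callback that starts a fresh set of n engines after a crash.

    start_engines is any callable taking the number of engines; it is a
    stand-in here, not the real ipyparallel method.
    """
    restarts = 0

    def on_engines_stopped(stop_data):
        nonlocal restarts
        # Assumption: stop_data is a dict that may carry an 'exit_code' key.
        exit_code = stop_data.get("exit_code") if isinstance(stop_data, dict) else None
        if exit_code == 0:
            return False  # clean shutdown, nothing to restart
        if restarts >= max_restarts:
            print(f"engines crashed again (exit code {exit_code}); restart limit reached")
            return False
        restarts += 1
        print(
            f"engines crashed (exit code {exit_code}), "
            f"restarting engines ({restarts}/{max_restarts})..."
        )
        start_engines(n)
        return True

    return on_engines_stopped
```

This would be registered through on_stop on the engine launcher. Two caveats: if start_engines is asynchronous in your version, the callback needs to schedule it rather than call it directly, and the newly started engine set presumably needs the callback registered on its own launcher to be restarted in turn.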

sahil1105 · 2022-05-06T03:12:06Z

Thanks @minrk! Will try out building restart in a custom launcher.
Will also open a separate issue for built-in restart support.

UPDATE: Opened this issue: #706

sahil1105 mentioned this issue May 6, 2022

Add option to restart engines #706

Open
