The conductor is not robust when forced to terminate abnormally #40

FrankD412 · 2017-08-11T00:24:16Z

This covers the case where the conductor fails during execution due to unexpected issues (a system crash, or file permission errors caused by file system problems). When the conductor crashes, the user currently must restart the study manually by pointing the conductor to the pkl file in the output directory. If this is done soon enough after the crash, the scheduler most likely still holds the job states and the conductor can resume.

However, in the usual case the conductor crashes without the user's knowledge. By the time the conductor is restarted, the scheduler has usually lost the state of the jobs, so the ExecutionGraph doesn't recover because it cannot find the states of jobs that were previously running. The best case is that the jobs are long running and are still being managed by the scheduler. The worst case is that jobs have finished and their state is no longer kept.

The ExecutionGraph needs a more graceful way to handle these conditions -- there are a couple of options here:

If a step in the graph is detected to have been running and the job isn't detected -- restart the simulation from scratch.
If a step was pending -- restart it from scratch.
If a step had failed -- either consider it failed (make sure all dependent steps are marked as such) OR attempt to restart the step from scratch.
Otherwise, treat the step normally.
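
A rough sketch of the rules above as a single decision function. The names here (`StepState`, `Action`, `recovery_action`, `retry_failed`) are hypothetical and made up for illustration; they are not the conductor's or the ExecutionGraph's actual API.

```python
from enum import Enum, auto


class StepState(Enum):
    """Hypothetical step states as recorded in the pickled study."""
    PENDING = auto()
    RUNNING = auto()
    FAILED = auto()
    FINISHED = auto()


class Action(Enum):
    """Hypothetical recovery actions the graph could take on restart."""
    RESTART = auto()      # resubmit the step from scratch
    MARK_FAILED = auto()  # keep it failed and fail all dependent steps
    CONTINUE = auto()     # treat the step normally


def recovery_action(state, job_found, retry_failed=False):
    """Pick a recovery action for one step after a conductor restart.

    state: the StepState recorded before the crash.
    job_found: whether the scheduler still reports the step's job.
    retry_failed: assumed policy switch for the "had failed" rule.
    """
    if state is StepState.RUNNING and not job_found:
        return Action.RESTART
    if state is StepState.PENDING:
        return Action.RESTART
    if state is StepState.FAILED:
        return Action.RESTART if retry_failed else Action.MARK_FAILED
    return Action.CONTINUE
```

Keeping the failed-step choice behind one flag leaves the "consider it failed OR restart" question as a policy setting rather than something hard-coded.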

tadesautels · 2020-02-18T17:31:24Z

I'd like to bring this up again; I've had a couple of cases lately where the conductor has gone down for one reason or another (in one case, the conductor was killed by a software update on the login node), but the job steps in the scheduler have continued to run normally.

What's the current behavior if I just invoke the conductor pointed at the existing pickle?
Would it be possible to implement a simple case where the conductor makes the assumption that any job step that is no longer running must have completed successfully?
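
To make the simple case concrete, here is a minimal sketch of that assumption. `reconcile_optimistically`, the state strings, and the data shapes are all hypothetical, not taken from the conductor's code. Note the trade-off: a job that crashed or was cancelled while the conductor was down would also be read as a success.

```python
def reconcile_optimistically(recorded, scheduler_jobs):
    """Update recorded step states using the optimistic assumption.

    recorded: dict mapping step name -> (state string, job id), as it
        might be stored in the pickle (hypothetical layout).
    scheduler_jobs: set of job ids the scheduler still reports.

    Any step recorded as RUNNING whose job the scheduler no longer
    knows about is assumed to have finished successfully.
    """
    updated = {}
    for step, (state, job_id) in recorded.items():
        if state == "RUNNING" and job_id not in scheduler_jobs:
            updated[step] = "FINISHED"
        else:
            updated[step] = state
    return updated
```

For example, with `{"sim": ("RUNNING", 101), "post": ("PENDING", None)}` recorded and no jobs left in the scheduler, `sim` would be marked finished and `post` would stay pending.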

Note: This is similar to the discussion in issue 95.

FrankD412 · 2020-02-18T20:23:18Z

@tadesautels -- I've not attempted to invoke the conductor on an existing study directory. I could see it getting most of the way to booting successfully, but there will be cases where it can't find a job, since the schedulers only seem to keep job information for a very limited amount of time.

I'll have to give it a shot -- I expect that it will fail, but worth a shot.

FrankD412 added High Priority Infrastructure labels Aug 11, 2017
