This covers the case where the conductor fails during execution due to unexpected issues (a system crash, or file permission errors caused by file system problems). When the conductor crashes, the user currently must manually restart the study by pointing the conductor to the pkl file in the output directory. If this is done soon enough after a crash, the scheduler most likely still holds the job states and the conductor can resume.
However, in the usual case the conductor crashes without the user's knowledge. By the time the conductor is restarted, the scheduler has usually lost the state of the jobs, so the ExecutionGraph doesn't recover because it cannot find the states of jobs that were previously running. The best case is that the jobs are long running and are still being managed by the scheduler. The worst case is that the jobs have finished and their state is no longer kept.
The ExecutionGraph needs a more graceful way to handle these conditions -- there are a couple of options here:
If a step in the graph is detected to have been running and the job isn't detected -- restart the simulation from scratch.
If a step was pending -- restart it from scratch.
If a step had failed -- either consider it failed (make sure all dependent steps are marked as such) OR attempt to restart the step from scratch.
Otherwise, treat the step normally.
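The options above could be sketched roughly as follows. This is only an illustration of the decision logic, not the actual ExecutionGraph code: `StepState`, `Action`, `recovery_action`, `downstream_steps`, and the `retry_failed` flag are all hypothetical names made up for this sketch.

```python
from enum import Enum, auto


class StepState(Enum):
    """Hypothetical step states as recorded in the pickled study."""
    PENDING = auto()
    RUNNING = auto()
    FAILED = auto()
    FINISHED = auto()


class Action(Enum):
    """Hypothetical actions the conductor could take on restart."""
    RESTART = auto()
    MARK_FAILED = auto()
    CONTINUE = auto()


def recovery_action(recorded_state, job_found, retry_failed=False):
    """Pick a recovery action for one step after a conductor restart.

    recorded_state: state stored before the crash (assumed available).
    job_found: whether the scheduler still knows about the step's job.
    retry_failed: policy switch for failed steps (an assumption).
    """
    if recorded_state is StepState.RUNNING and not job_found:
        # Was running, but the scheduler has no record: start over.
        return Action.RESTART
    if recorded_state is StepState.PENDING:
        return Action.RESTART
    if recorded_state is StepState.FAILED:
        return Action.RESTART if retry_failed else Action.MARK_FAILED
    # Anything else is handled the way it is today.
    return Action.CONTINUE


def downstream_steps(failed_step, dependents):
    """Collect every step that depends, directly or not, on failed_step.

    dependents: dict mapping a step name to the names of its children.
    """
    affected = set()
    stack = [failed_step]
    while stack:
        node = stack.pop()
        for child in dependents.get(node, ()):
            if child not in affected:
                affected.add(child)
                stack.append(child)
    return affected
```

If a failed step is not retried, `downstream_steps` gives the set of dependent steps that would also need to be marked as failed.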
I'd like to bring this up again; I've had a couple of cases lately where the conductor has gone down for one reason or another (in one case, the conductor was killed by a software update on the login node), but the job steps in the scheduler have continued to run normally.
What's the current behavior if I just invoke the conductor pointed at the existing pickle?
Would it be possible to implement a simple case where the conductor makes the assumption that any job step that is no longer running must have completed successfully?
Note: This is similar to the discussion in issue 95.
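To make the simple case concrete, here is a rough sketch of what I have in mind. It is not based on the conductor's real data structures: `assume_finished_if_missing`, the state strings, and the shape of the inputs are all assumptions for illustration.

```python
def assume_finished_if_missing(recorded, scheduler_jobs):
    """Reconcile recorded step states with what the scheduler reports.

    recorded: dict of step name -> state string saved before the crash
              (hypothetical values: "PENDING", "RUNNING", "FINISHED", ...).
    scheduler_jobs: set of step names the scheduler still knows about.

    Any step recorded as running that the scheduler no longer reports is
    assumed to have completed successfully; everything else is unchanged.
    """
    reconciled = {}
    for step, state in recorded.items():
        if state == "RUNNING" and step not in scheduler_jobs:
            reconciled[step] = "FINISHED"
        else:
            reconciled[step] = state
    return reconciled
```

The obvious risk is that a step that died while the conductor was down would also be treated as finished, so this would probably need to be opt-in.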
@tadesautels -- I've not attempted to invoke the conductor on an existing study directory. I could see it getting most of the way to booting successfully, but there will be cases where it can't find a job, since the schedulers only seem to keep job information for a very limited amount of time.
I'll have to give it a shot -- I expect that it will fail, but worth a shot.