The conductor is not robust when forced to terminate abnormally #40

Open
FrankD412 opened this issue Aug 11, 2017 · 2 comments

@FrankD412

This covers the case where the conductor fails during execution because of unexpected problems, such as a system crash or file permission errors caused by file system issues. When the conductor crashes, the user currently must restart the study manually by pointing the conductor at the pkl file in the output directory. If this is done soon enough after the crash, the scheduler most likely still holds the job states and the conductor can resume.

However, in the usual case, the conductor crashes without the user's knowledge. By the time the conductor is restarted, the scheduler has usually lost the state of the jobs, so the ExecutionGraph cannot recover because it cannot find the states of jobs that were previously running. The best case is that the jobs are long-running and are still being managed by the scheduler. The worst case is that the jobs have finished and their state is no longer kept.

The ExecutionGraph needs a more graceful way to handle these conditions. There are a couple of options here:

  • If a step in the graph was recorded as running but its job can no longer be found, restart the simulation from scratch.
  • If a step was pending, restart it from scratch.
  • If a step had failed, either consider it failed (and make sure all dependent steps are marked as failed too) OR attempt to restart the step from scratch.
  • Otherwise, treat the step normally.
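
The options above can be sketched as a small decision function. This is only an illustration under stated assumptions: `State`, `Action`, `recovery_action`, and `failed_closure` are hypothetical names, not the ExecutionGraph's actual API, and the dependency graph is assumed to be a plain dict mapping each step to the steps that depend on it.

```python
from enum import Enum


class State(Enum):
    # Hypothetical step states; the real ExecutionGraph may name these differently.
    PENDING = "pending"
    RUNNING = "running"
    FINISHED = "finished"
    FAILED = "failed"


class Action(Enum):
    # Hypothetical recovery actions, one per option in the list above.
    RESTART = "restart"
    MARK_FAILED = "mark_failed"
    CONTINUE = "continue"


def recovery_action(recorded_state, job_found, retry_failed=False):
    """Decide what to do with one step when the conductor is restarted.

    recorded_state: the State stored for the step before the crash.
    job_found: whether the scheduler still knows about the step's job.
    retry_failed: restart failed steps instead of leaving them failed.
    """
    if recorded_state is State.RUNNING and not job_found:
        return Action.RESTART
    if recorded_state is State.PENDING:
        return Action.RESTART
    if recorded_state is State.FAILED:
        return Action.RESTART if retry_failed else Action.MARK_FAILED
    # Anything else (finished, or running and still visible) is handled normally.
    return Action.CONTINUE


def failed_closure(dependents, failed_step):
    """Return failed_step plus every step that transitively depends on it."""
    seen = {failed_step}
    stack = [failed_step]
    while stack:
        for child in dependents.get(stack.pop(), ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen
```

For a failed step, `failed_closure` gives the set of steps to mark as failed, so that nothing downstream is submitted against missing inputs.
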
@tadesautels

I'd like to bring this up again; I've had a couple of cases lately where the conductor has gone down for one reason or another (in one case, the conductor was killed by a software update on the login node), but the job steps in the scheduler have continued to run normally.

What's the current behavior if I just invoke the conductor pointed at the existing pickle?
Would it be possible to implement a simple case where the conductor makes the assumption that any job step that is no longer running must have completed successfully?
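
That simple case might look roughly like the sketch below. It assumes the recorded study state can be read as a mapping from step name to a `(state, job_id)` pair and that the scheduler can report the set of job IDs it still tracks; `reconcile_optimistically` and the string state names are hypothetical, not part of the conductor.

```python
def reconcile_optimistically(recorded, active_job_ids):
    """Update step states on restart, assuming vanished jobs succeeded.

    recorded: dict of step name -> (state, job_id), as saved before the crash.
    active_job_ids: set of job IDs the scheduler still reports.
    Returns a dict of step name -> new state.
    """
    updated = {}
    for step, (state, job_id) in recorded.items():
        if state == "running" and job_id not in active_job_ids:
            # Optimistic assumption: a job that disappeared finished cleanly.
            updated[step] = "finished"
        else:
            updated[step] = state
    return updated


# Example: job 101 is gone from the scheduler, job 102 is still running.
recorded = {
    "setup": ("finished", 100),
    "simulate": ("running", 101),
    "analyze": ("running", 102),
}
new_states = reconcile_optimistically(recorded, active_job_ids={102})
```

The trade-off is that a job that crashed or was cancelled while the conductor was down would also be treated as finished, so dependent steps could start against incomplete output.
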

Note: This is similar to the discussion in issue 95.

@FrankD412

@tadesautels -- I've not attempted to invoke the conductor on an existing study directory. I could see it getting most of the way through startup, but there will be cases where it can't find a job, since the schedulers only seem to keep job information for a very limited amount of time.

I'll have to give it a try -- I expect that it will fail, but it's worth a shot.
