error handling logic for inference runs #320

Merged: 4 commits, May 26, 2020

jwills (Collaborator) commented May 25, 2020

Currently, a single failure in one of the child jobs causes the parent job for a run to be marked as failed, which prevents the copy job from running in the last phase of the pipeline. A small number of child job failures isn't actually a problem for us, so I'm adding a handler function that will:

  1. Trigger a retry of the job on a failure when the retry count is less than 3,
  2. Exit the job on its third failure as "successful," while writing a failure indicator file to S3 that marks the job as having failed.
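
The two rules above can be sketched roughly as follows. This is an illustration, not the code in this PR: the function signature, the `write_failure_marker` callback (standing in for the S3 write), and the assumption that a non-zero exit code is what makes the scheduler retry the job are all hypothetical.

```python
from typing import Callable

# From the description: retry on the first two failures, give up on the third.
MAX_FAILURES = 3


def error_handler(failure_number: int, write_failure_marker: Callable[[], None]) -> int:
    """Return the exit code a job should finish with after a failure.

    failure_number is 1-based: 1 for the first failure, 3 for the third.
    write_failure_marker is a hypothetical callback that stands in for
    writing the failure indicator file to S3.
    """
    if failure_number < MAX_FAILURES:
        # Assumption: a non-zero exit code causes the scheduler to retry the job.
        return 1
    # Third failure: record it, then report "success" so that the parent job
    # is not marked as failed and the copy job can still run.
    write_failure_marker()
    return 0
```

Passing the marker writer in as a callback keeps the decision logic testable without touching S3.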

To guard against the case where every job in a run fails but is still marked as succeeding, I'm adding a check at the start of each inference run that counts the failure indicator files in the failure indicator directory; if the count is greater than 10, every job in the run will fail.
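
A minimal sketch of that start-of-run check, using a local directory as a stand-in for the S3 failure indicator directory. The function names and the one-file-per-failed-job layout are assumptions made for illustration; only the threshold of 10 comes from the description.

```python
from pathlib import Path

# From the description: the run fails once there are more than 10 indicator files.
FAILURE_THRESHOLD = 10


def count_failure_markers(marker_dir: Path) -> int:
    """Count the failure indicator files in marker_dir (0 if it does not exist)."""
    if not marker_dir.is_dir():
        return 0
    return sum(1 for entry in marker_dir.iterdir() if entry.is_file())


def too_many_failures(marker_dir: Path, threshold: int = FAILURE_THRESHOLD) -> bool:
    """Return True when the run should abort: strictly more than threshold markers."""
    return count_failure_markers(marker_dir) > threshold
```

A job would call `too_many_failures(...)` before doing any work and exit with a failure status when it returns `True`. Note the strict comparison: exactly 10 indicator files still lets the run proceed.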

Note that one consequence of this change is that some inference runs will restart from scratch after an initial failure, so some slots will have been generated using fewer sims than others. I'm not sure of the right way to handle this, but I'm assuming that dropping such a slot entirely is an acceptable fallback.
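
If dropping short slots does turn out to be the fallback, the filter could look something like this. It is purely hypothetical: the mapping of slot number to completed sim count and the `expected_sims` parameter are not part of this PR.

```python
from typing import Dict, List


def complete_slots(sims_per_slot: Dict[int, int], expected_sims: int) -> List[int]:
    """Return the slots that ran at least expected_sims sims, in ascending order.

    Slots that restarted and finished with fewer sims are dropped entirely.
    """
    return sorted(slot for slot, n in sims_per_slot.items() if n >= expected_sims)
```

For example, `complete_slots({1: 300, 2: 120, 3: 300}, expected_sims=300)` keeps slots 1 and 3 and drops slot 2.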

jwills (Collaborator, Author) commented May 26, 2020

Verified that this works using test jobs that intentionally called the error_handler: I triggered some fake "successes" and saw that downstream jobs failed once the threshold of 10 failures was exceeded.

@jkamins7 jkamins7 merged commit b5fb7d6 into inference May 26, 2020
@jwills jwills deleted the jwills_better_retry branch May 31, 2020 20:43