error handling logic for inference runs #320

jwills · 2020-05-25T00:02:10Z

Currently, a single failure in one of the child jobs causes the parent job for a run to be marked as failed, which prevents the copy job from running during the last phase of the pipeline. A small number of child job failures isn't actually a problem for us, so I'm adding a handler function that will:

Trigger a retry of the job on a failure when the retry count is less than 3.
Exit the job on its third failure as "successful," but write a failure indicator file to S3 recording that the job had a failure.
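
The two handler steps above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code in this PR: `handle_failure`, `failure_number`, `job_id`, and the `.failed` file suffix are hypothetical names, and a local directory stands in for the S3 failure indicator location. It also assumes that a non-zero exit code is what causes the scheduler to retry the job.

```python
import tempfile
from pathlib import Path

MAX_FAILURES = 3  # from the description: give up on the third failure


def handle_failure(failure_number: int, job_id: str, indicator_dir: Path) -> int:
    """Return the exit code a failed job should report.

    failure_number counts this job's failures so far, including the current
    one (hypothetical parameter; the real handler may track this differently).
    """
    if failure_number < MAX_FAILURES:
        # Non-zero exit: the job is seen as failed and gets retried.
        return 1
    # Third failure: leave a marker file, then report success so the parent
    # job is not marked as failed and the copy job can still run.
    indicator_dir.mkdir(parents=True, exist_ok=True)
    (indicator_dir / f"{job_id}.failed").touch()
    return 0


if __name__ == "__main__":
    # Local stand-in for the S3 failure indicator directory.
    with tempfile.TemporaryDirectory() as tmp:
        indicators = Path(tmp) / "failures"
        print(handle_failure(1, "slot-7", indicators))  # first failure
        print(handle_failure(3, "slot-7", indicators))  # third failure
        print(sorted(p.name for p in indicators.iterdir()))
```

Writing the marker before returning 0 means a job that reports a fake "success" always leaves evidence behind for the start-of-run check.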

To avoid the situation where every job in a run fails but is still marked as succeeded, I'm adding a check at the start of each inference run that counts the failure indicator files in the failure indicator directory; if the count is greater than 10, every job in the run will fail.
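
A minimal sketch of that start-of-run check, again with a local directory standing in for the S3 failure indicator directory. `too_many_failures` and `FAILURE_THRESHOLD` are hypothetical names chosen for illustration; the only detail taken from the description is that the run fails when the count is greater than 10.

```python
import sys
import tempfile
from pathlib import Path

FAILURE_THRESHOLD = 10  # from the description: fail when the count exceeds 10


def too_many_failures(indicator_dir: Path, threshold: int = FAILURE_THRESHOLD) -> bool:
    """Return True when the indicator directory holds more than `threshold` files."""
    if not indicator_dir.is_dir():
        # No directory yet means no job has recorded a failure.
        return False
    count = sum(1 for p in indicator_dir.iterdir() if p.is_file())
    return count > threshold


if __name__ == "__main__":
    # Local stand-in for the S3 failure indicator directory.
    with tempfile.TemporaryDirectory() as tmp:
        indicators = Path(tmp)
        for i in range(11):
            (indicators / f"job-{i}.failed").touch()
        if too_many_failures(indicators):
            # A real failure this time, so the whole run stops.
            sys.exit("too many failure indicator files; failing this job")
```

The comparison is strict, so exactly 10 indicator files still lets the run proceed.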

Note that a consequence of this change is that some inference runs will restart from scratch after an initial failure, which means that some slots will have been generated using fewer sims than other slots. I'm not sure of the right way to handle this, but I'm assuming that dropping such a slot entirely is an acceptable fallback.

jwills · 2020-05-26T05:17:53Z

Verified that this works with some test jobs that intentionally called the error_handler: they triggered some fake "successes," and downstream jobs failed once the threshold of 10 failures was exceeded.

jwills added 4 commits May 24, 2020 16:53

error handling logic for inference runs

d955ba0

fix typo

02aaeb8

consistency fix

18db77a

load-bearing typo

95d597c

jkamins7 merged commit b5fb7d6 into inference May 26, 2020

jwills deleted the jwills_better_retry branch May 31, 2020 20:43
