Signal job failures when recovering dead jobs #16
Conversation
for i in range(0, 5):
    spin.schedule(do_something, i)

# Start two workers; test that only one job runs at once as per the
You speak English, yeah? Well, an odd version of it, but I think you do.
if job["retries"] < job["max_retries"] then
    job["retries"] = job["retries"] + 1
I think it might be better to unconditionally increment retries here, rather than treat it as a "mulligan".
As an unrealistic example from my fevered imagination: a job that happens to kill its worker somehow every time it runs would just endlessly re-queue itself and never really leave any indication that something has gone wrong. If it's incrementing the retries counter, there would at least be some forensics to use when looking around.
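To make the two orderings concrete, here is a loose Python paraphrase of the options under discussion. It is an editorial sketch only, reusing the field names from the snippet above; the surrounding logic of the real enqueue_jobs_from_dead_broker.lua script may differ.

def recover_as_in_snippet(job, re_enqueue, fail):
    # Ordering shown in the snippet: the counter is only bumped while it is
    # still below max_retries; once the budget is exhausted the job is failed
    # (per the description of this change) rather than re-enqueued.
    if job["retries"] < job["max_retries"]:
        job["retries"] += 1
        re_enqueue(job)
    else:
        fail(job)

def recover_counting_every_time(job, re_enqueue, fail):
    # Suggested alternative: every recovery bumps the counter first, so the
    # number of times a job came back from a dead worker is always recorded
    # and can be inspected afterwards.
    job["retries"] += 1
    if job["retries"] <= job["max_retries"]:
        re_enqueue(job)
    else:
        fail(job)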
I figured this might be controversial. Blindly incrementing retries is what gets you into the problematic situation in the first place, and if you're looking for forensics I'd say your log file will be full of them.
I suppose we do need a way to clean out bad jobs that keep killing the worker (cough cough), so if you have any ideas, please say now.
The only other thing that came to mind as an alternative is to make the failure handler run when the last retry is passed. That would require some extensive changes here, I think, which is why I went this way first. But the bottom line for me is that I didn't think we should couple broker death with job failure.
Fair enough.
Ok so this version catches failed jobs as a cause of the dead worker, and sends appropriate signals.
The `enqueue_jobs_from_dead_broker.lua` script doesn't re-enqueue jobs if the max_retries was exceeded. This is very surprising behaviour for most people, and because the failure handler is not called it means jobs can be left in an inconsistent state.

This change will make sure that the failure handler is called and the job moved to the FAILED state.

Drive-by: Add functional tests for concurrency that I forgot to `git add` previously.
Drive-by: Add tags file to .gitignore
Drive-by: Add flake8 to tox venv dependencies so that vscode works better in that venv.

Fixes: NicolasLM#14
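As a rough illustration of how such a failure would now surface to application code, here is a minimal sketch. The signals import, the job_failed signal name, the receiver signature, and the max_retries option are assumptions about the spinach API (blinker-style signals are assumed), not anything verified by this change.

from spinach import Engine, MemoryBroker, Tasks, signals

tasks = Tasks()

@tasks.task(name='do_something', max_retries=3)
def do_something(i):
    print('working on', i)

@signals.job_failed.connect
def on_job_failed(namespace, **kwargs):
    # After this change, a job recovered from a dead broker whose retry budget
    # is exhausted should also end up here in the FAILED state, instead of
    # being silently left behind.
    print('job failed in namespace', namespace, kwargs.get('job'))

# MemoryBroker keeps the sketch self-contained; the dead-broker recovery path
# itself presumably only applies to the Redis broker.
spin = Engine(MemoryBroker())
spin.attach_tasks(tasks)
spin.schedule(do_something, 1)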
Tested in anger on an actual deployment and my job was eventually marked failed correctly.
@NicolasLM Would you care to review and merge this please? Would you also consider giving us write access to your repo so we can manage some of this?
Sorry for the delay, I've been quite busy and this fell off my radar. I am all for making changes so that failed jobs always call the failure handler; however, it seems to remove an inconsistency by introducing another one:
The second option seems even more surprising to me than the current behavior. Please let me know your thoughts.
@bigjools I gave you write access to the repository if you need it.
Thanks for replying Nicolas, and also thanks for giving me write access, I promise to be careful :) I'm not sure if you want to give me write access to PyPI as well so I can make releases?

You raise a good point about the non-idempotent jobs not raising a signal. Given that we're desperate to get this change released and into production, and the existing behaviour has not changed for non-idempotent jobs (no new inconsistency has been introduced!), I'm inclined to file a ticket and leave it to a follow-up branch. We don't make use of these types of jobs at all and consider them fair fodder for unacknowledged death, but I can see that at least it should send a signal like it would do under normal circumstances.