
Signal job failures when recovering dead jobs #16

Merged · bigjools merged 1 commit into NicolasLM:master from issue/14 on Mar 10, 2022

Conversation

@bigjools (Collaborator) commented Mar 3, 2022

The enqueue_jobs_from_dead_broker.lua script doesn't re-enqueue jobs if
max_retries has been exceeded. This is very surprising behaviour for most
people, and because the failure handler is not called, jobs can be left in
an inconsistent state.

This change makes sure that the failure handler is called and the job is
moved to the FAILED state.

Drive-by: Add functional tests for concurrency that I forgot to git add previously.

Drive-by: Add tags file to .gitignore

Drive-by: Add flake8 to the tox venv dependencies so that vscode works
better in that venv

Fixes: #14
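
For context, applications observe these failures through spinach's failure signal. The sketch below is illustrative only; it assumes the job_failed signal name and the receiver shape (sending namespace first, the failed Job as a keyword argument) that spinach uses for its signals.

from spinach import signals

@signals.job_failed.connect
def on_job_failed(namespace, job, **kwargs):
    # Assumed receiver shape: the sending namespace comes first and the
    # failed Job arrives as a keyword argument, as with spinach's other
    # signals. With this change, a job recovered from a dead broker that has
    # exhausted max_retries also arrives here, already in the FAILED state.
    print(f'Job {job} failed in namespace {namespace}')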

for i in range(0, 5):
    spin.schedule(do_something, i)

# Start two workers; test that only one job runs at once as per the
Contributor commented:

[...] one job runs at once as per the [...]

(reaction GIF)

bigjools (Collaborator, Author) replied:

You speak English, yeah? Well, an odd version of it, but I think you do.

Comment on lines 17 to 18 of enqueue_jobs_from_dead_broker.lua:
if job["retries"] < job["max_retries"] then
    job["retries"] = job["retries"] + 1
Contributor commented:

I think it might be better to unconditionally increment retries here, rather than treat it as a "mulligan".

As an unrealistic example from my fevered imagination: a job that happens to kill its worker somehow every time it runs would just endlessly re-queue itself and never really leave any indication that something has gone wrong. If it's incrementing the retries counter, there would at least be some forensics to use when looking around.

bigjools (Collaborator, Author) replied:

I figured this might be controversial. Blindly incrementing retries is what gets you into the problematic situation in the first place, and if you're looking for forensics, I'd say your log file will be full of them.

I suppose we do need a way to clean out bad jobs that keep killing the worker (cough cough), so if you have any ideas, please say now.

The only other alternative that came to mind is to make the failure handler run when the last retry has passed. That would require some extensive changes here, I think, which is why I went this way first. But the bottom line for me is that I didn't think we should couple broker death with job failure.

Contributor replied:

Fair enough.
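
For readers following along, here is a rough Python rendering of the decision the Lua script makes after this change; enqueue() and handle_failure() are hypothetical stand-ins named only for illustration, not spinach's actual internals.

def recover_job_from_dead_broker(job):
    # Illustrative only: mirrors the behaviour described in this PR, not
    # the real enqueue_jobs_from_dead_broker.lua implementation.
    if job["retries"] < job["max_retries"]:
        # Retries left: count this attempt and re-enqueue the job.
        job["retries"] += 1
        enqueue(job)  # hypothetical helper
    else:
        # Out of retries: mark the job FAILED and invoke the failure handler
        # so the usual signals fire, instead of silently dropping the job.
        job["status"] = "FAILED"
        handle_failure(  # hypothetical helper
            job, Exception("Worker died and max_retries exceeded"))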


@juledwar (Contributor) commented Mar 8, 2022

OK, so this version catches jobs that failed because of the dead worker, and sends the appropriate signals.

bigjools changed the title from "Don't consider max_retries when recovering dead jobs" to "Signal job failures when recovering dead jobs" on Mar 9, 2022
@bigjools (Collaborator, Author) commented Mar 9, 2022

helios_1        | Worker bbed04e7-113a-4153-9760-dfa6235fa5b9 on 2a57c732afed marked as dead, 0 jobs were re-enqueued
helios_1        | Error during execution 3/3 of Job <display FAILED cad958c4-4d54-4f6e-a59e-142997505dc2> after 0 ms
helios_1        | Exception: Worker 2a57c732afed died and max_retries exceeded

Tested in anger on an actual deployment and my job was eventually marked failed correctly.

@bigjools (Collaborator, Author) commented Mar 9, 2022

@NicolasLM Would you care to review and merge this please? Would you also consider giving us write access to your repo so we can manage some of this?

@NicolasLM (Owner) commented:

Sorry for the delay, I've been quite busy and this fell off my radar.

I am all for making changes so that failed jobs always call the failure handler, however it seems to remove one inconsistency by introducing another:

  • Before this change, jobs from dead brokers did not trigger the failure signal, which is surprising because users expect all failing jobs to call this signal no matter how they failed.
  • After this change, only idempotent jobs (max_retries > 0) call the signal when they fail because of a dead broker. Jobs with the default setting of max_retries = 0 will not call it.

The second option seems even more surprising to me than the current behavior. Please let me know your thoughts.
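
The distinction above hinges on how each task declares max_retries. A minimal sketch in the style of spinach's quickstart; the task names and the retry count of 3 are invented for illustration:

from spinach import Engine, MemoryBroker

spin = Engine(MemoryBroker())

# Default max_retries=0: treated as non-idempotent, so with this change it
# still gets no failure signal if its broker dies.
@spin.task(name='non_idempotent')
def non_idempotent():
    ...

# max_retries > 0: treated as idempotent; on broker death it is re-enqueued
# until its retries run out, then marked FAILED and the failure signal fires.
@spin.task(name='idempotent', max_retries=3)
def idempotent():
    ...

spin.schedule(idempotent)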

@NicolasLM (Owner) commented:

@bigjools I gave you write access to the repository if you need it.

@bigjools (Collaborator, Author) commented Mar 9, 2022

Thanks for replying, Nicolas, and also thanks for giving me write access; I promise to be careful :) I'm not sure if you want to give me write access to PyPI as well so I can make releases?

You raise a good point about the non-idempotent jobs not raising a signal. Given that we're desperate to get this change released and into production, and the existing behaviour has not changed for non-idempotent jobs (no new inconsistency has been introduced!), I'm inclined to file a ticket and leave it to a follow-up branch. We don't make use of these types of jobs at all and consider them fair fodder for unacknowledged death, but I can see that it should at least send a signal as it would under normal circumstances.

bigjools merged commit 96b6c7f into NicolasLM:master on Mar 10, 2022
bigjools deleted the issue/14 branch on March 10, 2022 at 00:06
Closes: Sometimes, jobs from dead workers are not recovered (#14)