
Signal job failures when recovering dead jobs #16

Merged · bigjools merged 1 commit into NicolasLM:master from issue/14 on Mar 10, 2022

Conversation

@bigjools (Collaborator) commented Mar 3, 2022

The enqueue_jobs_from_dead_broker.lua script doesn't re-enqueue jobs if
max_retries has been exceeded. This is very surprising behaviour for most
people, and because the failure handler is not called, jobs can be left in
an inconsistent state.

This change makes sure that the failure handler is called and the job is
moved to the FAILED state.

Drive-by: Add functional tests for concurrency that I forgot to git add previously.

Drive-by: Add tags file to .gitignore

Drive-by: Add flake8 to the tox venv dependencies so that vscode works
better in that venv

Fixes: #14
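
For context, applications observe these failures through spinach's failure signal. The sketch below is illustrative only; it assumes the job_failed signal name and the receiver shape (sending namespace first, the failed Job as a keyword argument) that spinach uses for its signals.

from spinach import signals

@signals.job_failed.connect
def on_job_failed(namespace, job, **kwargs):
    # Assumed receiver shape: the sending namespace comes first and the
    # failed Job arrives as a keyword argument, as with spinach's other
    # signals. With this change, a job recovered from a dead broker that has
    # exhausted max_retries also arrives here, already in the FAILED state.
    print(f'Job {job} failed in namespace {namespace}')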

for i in range(0, 5):
    spin.schedule(do_something, i)

# Start two workers; test that only one job runs at once as per the
Contributor commented:

[...] one job runs at once as per the [...]

(reaction GIF)

bigjools (Collaborator, Author) replied:

You speak English, yeah? Well, an odd version of it, but I think you do.

Comment on lines 17 to 18 of enqueue_jobs_from_dead_broker.lua:
if job["retries"] < job["max_retries"] then
    job["retries"] = job["retries"] + 1
Contributor commented:

I think it might be better to unconditionally increment retries here, rather than treat it as a "mulligan".

As an unrealistic example from my fevered imagination: a job that happens to kill its worker somehow every time it runs would just endlessly re-queue itself and never really leave any indication that something has gone wrong. If it's incrementing the retries counter, there would at least be some forensics to use when looking around.

bigjools (Collaborator, Author) replied:

I figured this might be controversial. Blindly incrementing retries is what gets you into the problematic situation in the first place, and if you're looking for forensics, I'd say your log file will be full of them.

I suppose we do need a way to clean out bad jobs that keep killing the worker (cough cough), so if you have any ideas, please say now.

The only other alternative that came to mind is to make the failure handler run when the last retry has passed. That would require some extensive changes here, I think, which is why I went this way first. But the bottom line for me is that I didn't think we should couple broker death with job failure.

Contributor replied:

Fair enough.
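
For readers following along, here is a rough Python rendering of the decision the Lua script makes after this change; enqueue() and handle_failure() are hypothetical stand-ins named only for illustration, not spinach's actual internals.

def recover_job_from_dead_broker(job):
    # Illustrative only: mirrors the behaviour described in this PR, not
    # the real enqueue_jobs_from_dead_broker.lua implementation.
    if job["retries"] < job["max_retries"]:
        # Retries left: count this attempt and re-enqueue the job.
        job["retries"] += 1
        enqueue(job)  # hypothetical helper
    else:
        # Out of retries: mark the job FAILED and invoke the failure handler
        # so the usual signals fire, instead of silently dropping the job.
        job["status"] = "FAILED"
        handle_failure(  # hypothetical helper
            job, Exception("Worker died and max_retries exceeded"))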


@juledwar (Contributor) commented Mar 8, 2022

OK, so this version catches jobs that failed because of the dead worker, and sends the appropriate signals.

bigjools changed the title from "Don't consider max_retries when recovering dead jobs" to "Signal job failures when recovering dead jobs" on Mar 9, 2022
@bigjools (Collaborator, Author) commented Mar 9, 2022

helios_1        | Worker bbed04e7-113a-4153-9760-dfa6235fa5b9 on 2a57c732afed marked as dead, 0 jobs were re-enqueued
helios_1        | Error during execution 3/3 of Job <display FAILED cad958c4-4d54-4f6e-a59e-142997505dc2> after 0 ms
helios_1        | Exception: Worker 2a57c732afed died and max_retries exceeded

Tested in anger on an actual deployment and my job was eventually marked failed correctly.

@bigjools (Collaborator, Author) commented Mar 9, 2022

@NicolasLM Would you care to review and merge this please? Would you also consider giving us write access to your repo so we can manage some of this?

@NicolasLM (Owner) commented:

Sorry for the delay, I've been quite busy and this fell off my radar.

I am all for making changes so that failed jobs always call the failure handler, however it seems to remove one inconsistency by introducing another:

  • Before this change, jobs from dead brokers did not trigger the failure signal, which is surprising because users expect all failing jobs to call this signal no matter how they failed.
  • After this change, only idempotent jobs (max_retries > 0) call the signal when they fail because of a dead broker. Jobs with the default setting of max_retries = 0 will not call it.

The second option seems even more surprising to me than the current behavior. Please let me know your thoughts.
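
The distinction above hinges on how each task declares max_retries. A minimal sketch in the style of spinach's quickstart; the task names and the retry count of 3 are invented for illustration:

from spinach import Engine, MemoryBroker

spin = Engine(MemoryBroker())

# Default max_retries=0: treated as non-idempotent, so with this change it
# still gets no failure signal if its broker dies.
@spin.task(name='non_idempotent')
def non_idempotent():
    ...

# max_retries > 0: treated as idempotent; on broker death it is re-enqueued
# until its retries run out, then marked FAILED and the failure signal fires.
@spin.task(name='idempotent', max_retries=3)
def idempotent():
    ...

spin.schedule(idempotent)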

@NicolasLM (Owner) commented:

@bigjools I gave you write access to the repository if you need it.

@bigjools (Collaborator, Author) commented Mar 9, 2022

Thanks for replying, Nicolas, and also thanks for giving me write access; I promise to be careful :) I'm not sure if you want to give me write access to PyPI as well so I can make releases?

You raise a good point about the non-idempotent jobs not raising a signal. Given that we're desperate to get this change released and into production, and the existing behaviour has not changed for non-idempotent jobs (no new inconsistency has been introduced!), I'm inclined to file a ticket and leave it to a follow-up branch. We don't make use of these types of jobs at all and consider them fair fodder for unacknowledged death, but I can see that it should at least send a signal as it would under normal circumstances.

bigjools merged commit 96b6c7f into NicolasLM:master on Mar 10, 2022
bigjools deleted the issue/14 branch on March 10, 2022 at 00:06
Closes: Sometimes, jobs from dead workers are not recovered (#14)