Sometimes, jobs from dead workers are not recovered #14

Closed

juledwar opened this issue Jan 31, 2022 · 5 comments · Fixed by #16

Comments

@juledwar
Contributor

I'm filing this as a placeholder and will fill in more info as and when I can; the data I have is quite sparse and I'm not sure where else to look at the moment.

However, we're seeing jobs that don't get recovered in some cases where a worker thread dies. Since most of our jobs are I/O bound, we run many threads inside one process, and after a worker dies we get messages like this (in docker-compose logs):

helios_1        | Worker 2c2db37f-f554-4a60-85c3-b24e2c2c2dc6 on 846e083b0422 marked as dead, 1 jobs were re-enqueued 
helios_1        | Worker 2c2db37f-f554-4a60-85c3-b24e2c2c2dc6 on 846e083b0422 marked as dead, 0 jobs were re-enqueued 
helios_1        | Worker 2c2db37f-f554-4a60-85c3-b24e2c2c2dc6 on 846e083b0422 marked as dead, 0 jobs were re-enqueued 
helios_1        | Worker 2c2db37f-f554-4a60-85c3-b24e2c2c2dc6 on 846e083b0422 marked as dead, 0 jobs were re-enqueued

Note that the same worker is mentioned more than once, and the subsequent log lines don't recover anything. This makes me think a loop is going wrong somewhere and iterating over the same worker object.

I can reproduce this at will in my application; however, I don't yet know how to boil it down to a simple test case.

@juledwar
Contributor Author

I also note that the failure handler is not called in this scenario. I think there needs to be some notification that the job died and was re-enqueued.
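
For context, the kind of hook I mean is something connected to Spinach's job_failed signal; a minimal sketch, assuming the documented signals API (the handler below is illustrative, not code from our application):

from spinach import signals

@signals.job_failed.connect
def on_job_failed(namespace, job, **kwargs):
    # In the dead-worker scenario above this never fires, so nothing
    # downstream learns that the job was lost.
    print(f'Job {job} failed in namespace {namespace}')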

@juledwar
Contributor Author

I can reproduce this at will with a simple Task that does something like this:

import logging
import time
from spinach import Tasks

logger = logging.getLogger(__name__)
tasks = Tasks()

@tasks.task(name='display')
def display(arg):
    logger.debug(f"DOING: {arg}")
    time.sleep(125)  # long enough for the dead broker detection to kick in
    logger.debug(f"DONE: {arg}")

I wait until I see the DOING log, then kill the broker and restart it. After the dead broker threshold expires, I see a message as above saying the worker was marked as dead and 0 jobs are re-enqueued.
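
For completeness, the harness around that task is roughly the following; this is a sketch using the documented Spinach API, and the namespace, argument, and worker count are placeholders rather than our real setup:

from spinach import Engine, RedisBroker

spin = Engine(RedisBroker(), namespace='repro')
spin.attach_tasks(tasks)         # the Tasks registry from the snippet above
spin.schedule(display, 'hello')  # schedule one job, then kill/restart the broker while it sleeps
spin.start_workers(number=1)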

@juledwar
Contributor Author

@NicolasLM Do you have any thoughts on this before I spend time trying to trace it?

@juledwar
Contributor Author

juledwar commented Mar 3, 2022

I have got to the bottom of this. In my example there's no "max_retries" defined, so the job never gets moved to a new broker. That got me looking more closely at the enqueue_jobs_from_dead_broker.lua script, and I can see that it also doesn't re-enqueue jobs whose max_retries has been exceeded. This is very surprising behaviour for me, and because the failure handler is not called, jobs can be left in an inconsistent state.

The easiest and most obvious change is to remove the max_retries check. That makes some sense to me, as I don't think the fact that a broker died should be counted as an inherent job failure.
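
In the meantime, opting the task in to retries does get it re-enqueued when its worker is marked dead; a sketch, assuming Spinach's documented max_retries option (the value 10 is arbitrary):

@tasks.task(name='display', max_retries=10)
def display(arg):
    logger.debug(f"DOING: {arg}")
    time.sleep(125)
    logger.debug(f"DONE: {arg}")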

bigjools added a commit to bigjools/spinach that referenced this issue Mar 3, 2022
The `enqueue_jobs_from_dead_broker.lua` script doesn't re-enqueue jobs
if the max_retries was exceeded. This is very surprising behaviour for
most people, and because the failure handler is not called it means jobs
can be left in an inconsistent state.

The easiest and most obvious change to make is to remove the max_retries
check. It makes some sense as I don't think the fact that a broker died
should be considered an inherent job failure.

Drive-by: Add functional tests for concurrency that I forgot to `git
add` previously.

Drive-by: Add tags file to .gitignore

Drive-by: add flake8 to tox venv dependencies so that vscode works
better in that venv

Fixes: NicolasLM#14
bigjools added a commit to bigjools/spinach that referenced this issue Mar 8, 2022
The `enqueue_jobs_from_dead_broker.lua` script doesn't re-enqueue jobs
if the max_retries was exceeded. This is very surprising behaviour for
most people, and because the failure handler is not called it means jobs
can be left in an inconsistent state.

This change will make sure that the failure handler is called and the
job moved to the FAILED state.

Drive-by: Add functional tests for concurrency that I forgot to `git
add` previously.

Drive-by: Add tags file to .gitignore

Drive-by: add flake8 to tox venv dependencies so that vscode works
better in that venv

Fixes: NicolasLM#14
@NicolasLM
Owner

Hi, as you figured out, this is actually the intended behavior. With Spinach, max_retries == 0 means the task is not idempotent and should never be retried in case of failure. Since Spinach cannot know which user code is safe to rerun, it defaults to assuming that no task is safe to retry, which is why tasks need to opt in to being rerun in case of a dead broker.
