Sometimes, jobs from dead workers are not recovered #14

Closed

juledwar opened this issue Jan 31, 2022 · 5 comments · Fixed by #16

Comments

@juledwar
Contributor

I'm filing this as a placeholder and will fill in more info as and when I can; the data I have is quite sparse and I'm not sure where else to look at the moment.

However, we're seeing jobs that don't get recovered in some cases where a worker thread dies. Since most of our jobs are I/O bound, we run many threads inside one process, and after a worker dies we get messages like this (in docker-compose logs):

helios_1        | Worker 2c2db37f-f554-4a60-85c3-b24e2c2c2dc6 on 846e083b0422 marked as dead, 1 jobs were re-enqueued 
helios_1        | Worker 2c2db37f-f554-4a60-85c3-b24e2c2c2dc6 on 846e083b0422 marked as dead, 0 jobs were re-enqueued 
helios_1        | Worker 2c2db37f-f554-4a60-85c3-b24e2c2c2dc6 on 846e083b0422 marked as dead, 0 jobs were re-enqueued 
helios_1        | Worker 2c2db37f-f554-4a60-85c3-b24e2c2c2dc6 on 846e083b0422 marked as dead, 0 jobs were re-enqueued

Note that the same worker is mentioned more than once, and the subsequent log lines don't recover anything. This makes me think a loop is going wrong somewhere and iterating over the same worker object.

I can reproduce this at will in my application; however, I don't yet know how to boil it down to a simple test case.

@juledwar
Contributor Author

I also note that the failure handler is not called in this scenario. I think there needs to be some notification that the job died and was re-enqueued.
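
For context, the kind of hook I mean is something connected to Spinach's job_failed signal; a minimal sketch, assuming the documented signals API (the handler below is illustrative, not code from our application):

from spinach import signals

@signals.job_failed.connect
def on_job_failed(namespace, job, **kwargs):
    # In the dead-worker scenario above this never fires, so nothing
    # downstream learns that the job was lost.
    print(f'Job {job} failed in namespace {namespace}')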

@juledwar
Contributor Author

I can reproduce this at will with a simple Task that does something like this:

import logging
import time
from spinach import Tasks

logger = logging.getLogger(__name__)
tasks = Tasks()

@tasks.task(name='display')
def display(arg):
    logger.debug(f"DOING: {arg}")
    time.sleep(125)  # long enough for the dead broker detection to kick in
    logger.debug(f"DONE: {arg}")

I wait until I see the DOING log, then kill the broker and restart it. After the dead broker threshold expires, I see a message as above saying the worker was marked as dead and 0 jobs are re-enqueued.
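
For completeness, the harness around that task is roughly the following; this is a sketch using the documented Spinach API, and the namespace, argument, and worker count are placeholders rather than our real setup:

from spinach import Engine, RedisBroker

spin = Engine(RedisBroker(), namespace='repro')
spin.attach_tasks(tasks)         # the Tasks registry from the snippet above
spin.schedule(display, 'hello')  # schedule one job, then kill/restart the broker while it sleeps
spin.start_workers(number=1)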

@juledwar
Contributor Author

@NicolasLM Do you have any thoughts on this before I spend time trying to trace it?

@juledwar
Contributor Author

juledwar commented Mar 3, 2022

I have got to the bottom of this. In my example there's no "max_retries" defined, so the job never gets moved to a new broker. That got me looking more closely at the enqueue_jobs_from_dead_broker.lua script, and I can see that it also doesn't re-enqueue jobs whose max_retries has been exceeded. This is very surprising behaviour for me, and because the failure handler is not called, jobs can be left in an inconsistent state.

The easiest and most obvious change is to remove the max_retries check. That makes some sense to me, as I don't think the fact that a broker died should be counted as an inherent job failure.
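
In the meantime, opting the task in to retries does get it re-enqueued when its worker is marked dead; a sketch, assuming Spinach's documented max_retries option (the value 10 is arbitrary):

@tasks.task(name='display', max_retries=10)
def display(arg):
    logger.debug(f"DOING: {arg}")
    time.sleep(125)
    logger.debug(f"DONE: {arg}")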

bigjools added a commit to bigjools/spinach that referenced this issue Mar 3, 2022
The `enqueue_jobs_from_dead_broker.lua` script doesn't re-enqueue jobs
if the max_retries was exceeded. This is very surprising behaviour for
most people, and because the failure handler is not called it means jobs
can be left in an inconsistent state.

The easiest and most obvious change to make is to remove the max_retries
check. It makes some sense as I don't think the fact that a broker died
should be considered an inherent job failure.

Drive-by: Add functional tests for concurrency that I forgot to `git
add` previously.

Drive-by: Add tags file to .gitignore

Drive-by: add flake8 to tox venv dependencies so that vscode works
better in that venv

Fixes: NicolasLM#14
bigjools added a commit to bigjools/spinach that referenced this issue Mar 8, 2022
The `enqueue_jobs_from_dead_broker.lua` script doesn't re-enqueue jobs
if the max_retries was exceeded. This is very surprising behaviour for
most people, and because the failure handler is not called it means jobs
can be left in an inconsistent state.

This change will make sure that the failure handler is called and the
job moved to the FAILED state.

Drive-by: Add functional tests for concurrency that I forgot to `git
add` previously.

Drive-by: Add tags file to .gitignore

Drive-by: add flake8 to tox venv dependencies so that vscode works
better in that venv

Fixes: NicolasLM#14
@NicolasLM
Owner

Hi, as you figured out, this is actually the intended behavior. With Spinach, max_retries == 0 means the task is not idempotent and should never be retried in case of failure. Since Spinach cannot know which user code is safe to rerun, it defaults to assuming that no task is safe to retry, which is why tasks need to opt in to being rerun in case of a dead broker.
