Sometimes, jobs from dead workers are not recovered #14
I'm filing this as a placeholder and will fill in more info as and when I can; the data I have is quite sparse and I'm not sure where else to look at the moment.

However, we're seeing jobs that don't get recovered in some cases where a worker thread dies. Since most of our jobs are I/O bound, we have many threads inside one process, and after a worker dies we get messages like this (in docker-compose logs):

[log excerpt not preserved]

Note that the same worker is mentioned more than once, and the subsequent log lines don't recover anything. This makes me think a loop is going wrong somewhere and iterating over the same worker object (see the illustrative sketch below).

I can reproduce this at will in my application; however, I don't yet know how to boil it down to a simple test case.
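Purely as an illustration of that hypothesis (this is not Spinach's actual code): if a dead worker were never removed from the registry once its jobs are re-enqueued, every sweep would log the same worker again and recover nothing.

```python
# Hypothetical sketch, not Spinach's code: a dead-worker registry that
# is never cleared reprocesses the same worker on every sweep and finds
# 0 jobs after the first pass.
jobs_by_worker = {'worker-abc': ['job-1', 'job-2']}
dead_workers = {'worker-abc'}

for sweep in range(3):
    for worker in dead_workers:
        jobs = jobs_by_worker.pop(worker, [])
        print(f'sweep {sweep}: {worker} marked dead, '
              f're-enqueued {len(jobs)} job(s)')
    # bug: dead_workers is never cleared, so sweeps 1 and 2 find nothing
```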
I also note that the failure handler is not called in this scenario. I think there needs to be some notification that the job died and was re-enqueued.
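For reference, the kind of hook being referred to is a Spinach signal subscriber; a minimal sketch, assuming the blinker-based `spinach.signals` module (the exact signal payload may vary between versions):

```python
# Sketch only: subscribe to job failures via Spinach's signals module.
# In the dead-worker scenario described in this issue, this hook never
# fires, because the job is dropped without being marked as failed.
from spinach import signals

@signals.job_failed.connect
def on_job_failed(namespace, job, **kwargs):
    print(f'job {job} failed in namespace {namespace}')
```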
I can reproduce this at will with a simple Task that does something like this:
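(The original snippet was not preserved in this copy; below is a minimal sketch of such a task, assuming Spinach's standard `Tasks`/`Engine` API. The long sleep keeps the job in flight so the broker can be killed and restarted while it runs.)

```python
import logging
import time

from spinach import Engine, RedisBroker, Tasks

logger = logging.getLogger(__name__)
tasks = Tasks()

@tasks.task(name='slow_task')
def slow_task():
    logger.info('DOING')  # kill the broker once this line is logged
    time.sleep(60)
    logger.info('DONE')

spin = Engine(RedisBroker())
spin.attach_tasks(tasks)
spin.schedule(slow_task)
spin.start_workers()
```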
I wait until I see the DOING log, then kill the broker and restart it. After the dead-broker threshold expires, I see a message as above saying the worker was marked as dead and 0 jobs were re-enqueued.
@NicolasLM Do you have any thoughts on this before I spend time trying to trace it?
I have got to the bottom of this. In my example, there's no `max_retries` defined, so the job never gets moved to a new broker. This got me looking more closely at the `enqueue_jobs_from_dead_broker.lua` script. The easiest and most obvious change to make is to remove the max_retries check. It makes some sense, as I don't think the fact that a broker died should be considered an inherent job failure.
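For context, `max_retries` is an opt-in, per-task setting on the decorator; a sketch of the difference (task names are hypothetical):

```python
from spinach import Tasks

tasks = Tasks()

# Without max_retries the job is treated as non-retryable, so the
# dead-broker script declines to re-enqueue it (the behaviour above).
@tasks.task(name='fragile_task')
def fragile_task():
    ...

# With max_retries set, jobs from a dead broker can be re-enqueued.
@tasks.task(name='resilient_task', max_retries=10)
def resilient_task():
    ...
```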
The `enqueue_jobs_from_dead_broker.lua` script doesn't re-enqueue jobs if max_retries was exceeded. This is very surprising behaviour for most people, and because the failure handler is not called it means jobs can be left in an inconsistent state. This change will make sure that the failure handler is called and the job moved to the FAILED state.

Drive-by: Add functional tests for concurrency that I forgot to `git add` previously.
Drive-by: Add tags file to .gitignore.
Drive-by: Add flake8 to tox venv dependencies so that vscode works better in that venv.

Fixes: NicolasLM#14
Hi, as you figured, it is actually the intended behavior. With Spinach …