Task run concurrency slots not released when flow runs in Kubernetes are cancelled #8566
Comments
I've posted to the Slack group about this too, but this is not exclusive to Kubernetes. I have stock-standard tasks being sent to a Dask cluster, and when the parent flow crashes for any reason, the slots aren't released. I wonder if this is the same issue as reported over in #5995.
Hmmm, if I've understood the merge, then potentially, though it would be good to have that CLI endpoint invoked by Prefect itself. I can see a reset method is available in https://docs.prefect.io/api-ref/prefect/cli/concurrency_limit/, so I could add a flow that runs every few minutes and simply calls reset on all the active limits I've got defined. That said, does that reset endpoint clear all slots, or just zombie slots? It looks like the slot override would end up being None, so it would remove even valid, still-running tasks from the slot, right?
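One way around the "reset clears everything" concern is to compute a `slot_override` that keeps only the slots held by task runs that are genuinely still running. A minimal sketch of that filtering logic follows; the pure function is self-contained, while the commented client calls (`read_concurrency_limits`, `reset_concurrency_limit_by_tag` with `slot_override`) reflect the Prefect 2.x orchestration client and may differ between versions, so treat them as an assumption to verify against your installed release.

```python
def live_slots(active_slots, running_task_run_ids):
    """Return only the slots whose task runs are still in a Running state.

    Slots whose task run is cancelled, crashed, or otherwise unknown are
    treated as zombies and dropped, so a reset with this list as the
    slot_override would release zombies without evicting healthy tasks."""
    running = set(running_task_run_ids)
    return [slot for slot in active_slots if slot in running]


# Hedged usage sketch (requires a reachable Prefect API; not executed here):
#
# from prefect.client.orchestration import get_client
#
# async with get_client() as client:
#     for limit in await client.read_concurrency_limits(limit=100, offset=0):
#         # Read task runs currently in a Running state for this tag
#         # (filter construction omitted), then keep only their slots.
#         running_ids = [tr.id for tr in task_runs]
#         await client.reset_concurrency_limit_by_tag(
#             tag=limit.tag,
#             slot_override=live_slots(limit.active_slots, running_ids),
#         )
```

Run from a scheduled "reaper" flow every few minutes, this would approximate the zombie-only cleanup the CLI reset currently doesn't offer.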
I have the same problem with tasks run with the concurrent runner (the default runner). They become zombies, probably because they use Dask for the xarray math and some deadlock occurs in Dask.

`@task(tags=["memory_intensive"], retries=2, retry_delay_seconds=400, timeout_seconds=1000)`

The fact that the task timeout does not work well is the main problem, but in any case the concurrency limit should be able to release long-running tasks. It could use the task's timeout and release them even if they are still running, or take a dedicated timeout of its own.

Currently this has a big impact on my work: I'm under-using a cluster because about 50% of the tasks get stuck every ~10 hours of processing. I have to reset the concurrency limit, which is a problem because it also releases the tasks that are effectively still running (the other 50%). That takes more memory and makes my whole system more unstable.
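The "dedicated timeout on the concurrency limit" idea above can be sketched as a small policy function. This is not existing Prefect behavior, just an illustration of the proposal: each slot records when it was acquired, and any slot held longer than a TTL is considered a zombie and released, while slots within the TTL are kept.

```python
import time


def expire_slots(slot_acquired_at, ttl_seconds, now=None):
    """Split slots into (live, expired) based on how long they've been held.

    slot_acquired_at: mapping of slot / task-run id -> acquisition timestamp.
    Any slot held longer than ttl_seconds is treated as a zombie and
    released, mirroring the proposed 'timeout on the concurrency limit'."""
    now = time.time() if now is None else now
    live, expired = [], []
    for slot, acquired in slot_acquired_at.items():
        (expired if now - acquired > ttl_seconds else live).append(slot)
    return live, expired
```

With this policy, a periodic sweep would release only the ~50% of stuck slots instead of resetting the whole limit and evicting the healthy half as well.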
I have a similar issue with task concurrency on Kubernetes as well; I mentioned this in Slack.
First check
Bug summary
When running a Prefect flow as a Kubernetes job, if the flow run is cancelled while tasks are in a Running state, the concurrency slots used by those tasks are not released, even though the tasks end up in a Cancelled state.
This is reproducible via the following steps, using the code below, with a flow run triggered as a Kubernetes job.
KubernetesJob Config:
Potentially related but separate issue:
#7732
Reproduction
Error
No response
Versions
Additional context
Cluster config, minus any sensitive information