We see that when the pod for a flow started from a Kubernetes worker is evicted, the flow stays stuck in a Running status. Furthermore, cancelling a flow in this status leaves it stuck in Cancelling. This is very troubling because evictions are expected in Kubernetes.
Additionally, we have tried setting timeout and retry annotations on our flows, and these are not respected when the pod is evicted.
Expectation / Proposal
I expect the flow to enter a Crashed state (or be retried automatically). Either the flow should handle evictions gracefully and report Crashed, or the worker should respond to the evicted pod and report Crashed or retry. The worker should also handle evictions gracefully, to protect against the scenario where workers are restarted while monitoring flows.
We also expect timeout and retry annotations to apply in the case of evictions.
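To make the proposal concrete, here is a minimal sketch of the decision the worker could make when it observes the flow pod. It is not how prefect-kubernetes is implemented. The function names `is_evicted` and `proposed_state` are hypothetical, the pod status is passed as a plain dict, and the sketch assumes a node-pressure eviction, where the kubelet marks the pod with `phase: Failed` and `reason: Evicted`. Other eviction paths (for example a node drain that deletes the pod) may not leave this marker and would need separate handling.

```python
from typing import Optional


def is_evicted(pod_status: dict) -> bool:
    """Return True if the pod status looks like a kubelet (node-pressure) eviction.

    Assumption: the status dict mirrors the `status` field of a Pod object,
    where an evicted pod has phase "Failed" and reason "Evicted".
    """
    return (
        pod_status.get("phase") == "Failed"
        and pod_status.get("reason") == "Evicted"
    )


def proposed_state(pod_status: dict, retries_left: int) -> Optional[str]:
    """Suggest what the worker could report for the flow run (hypothetical helper).

    Returns None when the pod was not evicted, so the caller keeps monitoring.
    The returned strings are illustrative labels, not a confirmed Prefect API.
    """
    if not is_evicted(pod_status):
        return None
    # Honor the flow's retry setting if any retries remain, otherwise crash.
    return "AwaitingRetry" if retries_left > 0 else "Crashed"


if __name__ == "__main__":
    evicted = {"phase": "Failed", "reason": "Evicted"}
    print(proposed_state(evicted, retries_left=0))
    print(proposed_state(evicted, retries_left=2))
    print(proposed_state({"phase": "Running"}, retries_left=2))
```

Either outcome would move the flow out of Running, which is the behavior requested above.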
Traceback / Example
Flow:
prefect-client==2.16.2
K8s Worker:
prefect-client==2.16.2
prefect-kubernetes==0.3.5
Steps to reproduce:
Run deployment
prefect deployment run 'test_long_running_job/long-running-job'
Wait a few minutes (9 minutes, in my tests) and evict the flow pod