
Handle failed kubernetes scheduling events more gracefully #12071

Open · 4 tasks done
zangell44 opened this issue Feb 23, 2024 · 1 comment
Labels
bug Something isn't working

Comments

@zangell44
Collaborator

First check

  • I added a descriptive title to this issue.
  • I used the GitHub search to find a similar issue and didn't find it.
  • I searched the Prefect documentation for this issue.
  • I checked that this issue is related to Prefect and not one of its dependencies.

Bug summary

When a Kubernetes cluster is unable to schedule a pod in time, flow runs are reported as Crashed but subsequently complete successfully.

Here is the log output from an example run:

[screenshot: worker log output from the example run]

Note that `Reported flow run 'c605bcd5-8048-4140-bea4-0d04f0a2af2c' as crashed: Flow run infrastructure exited with non-zero status code -1.` occurs after the pod is scheduled successfully.

Reproduction

The repro can be a bit complex, but we should be able to understand this from the k8s events without reproducing it. A minimal deployment sketch follows the steps below.

- Start a Kubernetes worker
- Have the worker start a flow run in a cluster with no nodes available
- After ~2 minutes, ensure a node is available on the cluster
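
For reference, a minimal sketch of the flow-run side of the repro, assuming Prefect's `flow.deploy` API and a Kubernetes work pool named `k8s-pool` (hypothetical). The cluster-side steps (no schedulable nodes at submission, then a node becoming available after ~2 minutes) still have to be arranged on the cluster itself:

```python
from prefect import flow


@flow(log_prints=True)
def scheduling_repro():
    # The flow body is irrelevant; the issue is about what the worker
    # reports while the pod is still Pending.
    print("pod was eventually scheduled and the flow ran")


if __name__ == "__main__":
    # Hypothetical work pool name; the pool must be backed by a cluster
    # that has no schedulable nodes when the run is submitted.
    scheduling_repro.deploy(
        name="k8s-scheduling-repro",
        work_pool_name="k8s-pool",
        image="prefecthq/prefect:2-latest",
        build=False,
        push=False,
    )
```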

Error

No response

Versions

`2.15.0`

Additional context

No response

@zangell44 zangell44 added bug Something isn't working needs:triage Needs feedback from the Prefect product team labels Feb 23, 2024
@serinamarie serinamarie removed the needs:triage Needs feedback from the Prefect product team label Feb 23, 2024
@zangell44
Collaborator Author

The workaround for this is to increase the Pod Watch Timeout on the work pool to a value large enough to give the pod time to schedule.
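
As a per-deployment sketch of that workaround, assuming the work pool's base job template exposes the `pod_watch_timeout_seconds` variable (the stock Kubernetes worker template does) and that your Prefect version accepts `job_variables` in `flow.deploy`; adjust the names if your setup differs:

```python
from prefect import flow


@flow
def my_flow():
    ...


if __name__ == "__main__":
    # Give the pod up to 10 minutes to be scheduled before the worker
    # gives up and reports the run as crashed. The variable name assumes
    # the stock Kubernetes worker base job template.
    my_flow.deploy(
        name="long-pod-watch",
        work_pool_name="k8s-pool",  # hypothetical pool name
        image="prefecthq/prefect:2-latest",
        build=False,
        push=False,
        job_variables={"pod_watch_timeout_seconds": 600},
    )
```

The same value can instead be raised pool-wide by editing the Pod Watch Timeout in the work pool's base job template, which is what the comment above suggests.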
