
Handle failed kubernetes scheduling events more gracefully #12071

Open · 4 tasks done
zangell44 opened this issue Feb 23, 2024 · 1 comment
Labels
bug Something isn't working

Comments

@zangell44
Collaborator

First check

  • I added a descriptive title to this issue.
  • I used the GitHub search to find a similar issue and didn't find it.
  • I searched the Prefect documentation for this issue.
  • I checked that this issue is related to Prefect and not one of its dependencies.

Bug summary

When a Kubernetes cluster is unable to schedule a pod in time, flow runs are reported as Crashed but subsequently complete successfully.

Here is the log output from an example run:

[screenshot: worker log output from the example run]

Note that `Reported flow run 'c605bcd5-8048-4140-bea4-0d04f0a2af2c' as crashed: Flow run infrastructure exited with non-zero status code -1.` occurs after the pod is scheduled successfully.

Reproduction

The repro can be a bit complex, but we should be able to understand this from the k8s events without reproducing it. A minimal deployment sketch follows the steps below.

- Start a Kubernetes worker
- Have the worker start a flow run in a cluster with no nodes available
- After ~2 minutes, ensure a node is available on the cluster
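
For reference, a minimal sketch of the flow-run side of the repro, assuming Prefect's `flow.deploy` API and a Kubernetes work pool named `k8s-pool` (hypothetical). The cluster-side steps (no schedulable nodes at submission, then a node becoming available after ~2 minutes) still have to be arranged on the cluster itself:

```python
from prefect import flow


@flow(log_prints=True)
def scheduling_repro():
    # The flow body is irrelevant; the issue is about what the worker
    # reports while the pod is still Pending.
    print("pod was eventually scheduled and the flow ran")


if __name__ == "__main__":
    # Hypothetical work pool name; the pool must be backed by a cluster
    # that has no schedulable nodes when the run is submitted.
    scheduling_repro.deploy(
        name="k8s-scheduling-repro",
        work_pool_name="k8s-pool",
        image="prefecthq/prefect:2-latest",
        build=False,
        push=False,
    )
```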

Error

No response

Versions

`2.15.0`

Additional context

No response

@zangell44 zangell44 added bug Something isn't working needs:triage Needs feedback from the Prefect product team labels Feb 23, 2024
@serinamarie serinamarie removed the needs:triage Needs feedback from the Prefect product team label Feb 23, 2024
@zangell44
Collaborator Author

The workaround for this is to increase the Pod Watch Timeout on the work pool to a value large enough to give the pod time to schedule.
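
As a per-deployment sketch of that workaround, assuming the work pool's base job template exposes the `pod_watch_timeout_seconds` variable (the stock Kubernetes worker template does) and that your Prefect version accepts `job_variables` in `flow.deploy`; adjust the names if your setup differs:

```python
from prefect import flow


@flow
def my_flow():
    ...


if __name__ == "__main__":
    # Give the pod up to 10 minutes to be scheduled before the worker
    # gives up and reports the run as crashed. The variable name assumes
    # the stock Kubernetes worker base job template.
    my_flow.deploy(
        name="long-pod-watch",
        work_pool_name="k8s-pool",  # hypothetical pool name
        image="prefecthq/prefect:2-latest",
        build=False,
        push=False,
        job_variables={"pod_watch_timeout_seconds": 600},
    )
```

The same value can instead be raised pool-wide by editing the Pod Watch Timeout in the work pool's base job template, which is what the comment above suggests.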
