This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Replicate kube events to Prefect that occur before the pod starts #86

Closed · 1 task
tekumara opened this issue Aug 21, 2023 · 3 comments

Comments

tekumara commented Aug 21, 2023

Expectation / Proposal

As with agents in Prefect 1, when using Prefect 2 Kubernetes workers, make kube events that occur before the pod starts visible via Prefect.

At the moment all we get via Prefect is:

Worker 'KubernetesWorker edf7261c-e388-46f8-a0c5-9eb7de8c7c0f' submitting flow run 'ee662a6b-5303-430f-9ab7-acd5468f5d22' 02:20:25 PM prefect.flow_runs.worker
Creating Kubernetes job... 02:20:26 PM prefect.flow_runs.worker
Completed submission of flow run 'ee662a6b-5303-430f-9ab7-acd5468f5d22' 02:20:26 PM prefect.flow_runs.worker
Job 'beige-stingray-tzkrv': Pod has status 'Pending'. 02:20:27 PM prefect.flow_runs.worker
Job 'beige-stingray-tzkrv': Pod never started. 02:21:26 PM prefect.flow_runs.worker
Reported flow run 'ee662a6b-5303-430f-9ab7-acd5468f5d22' as crashed: Flow run infrastructure exited with non-zero status code -1. 02:21:27 PM prefect.flow_runs.worker

It would be useful to surface these kube events so that we can diagnose the failure from Prefect, without having to reach for kubectl etc.

prefect-kubernetes 0.2.8

Traceback / Example

Example events that are available via kubectl but not Prefect:

17m         Normal    SuccessfulCreate    job/beige-stingray-tzkrv              Created pod: beige-stingray-tzkrv-vglpm
16m         Normal    TriggeredScaleUp    pod/beige-stingray-tzkrv-vglpm        pod triggered scale-up: [{gpu-accelerated-us-east-1a 1->2 (max: 5)}]
16m         Warning   FailedScheduling    pod/beige-stingray-tzkrv-vglpm        0/51 nodes are available: 1 node(s) had taint {sandbox: true}, that the pod didn't tolerate, 1 node(s) were unschedulable, 45 Insufficient cpu, 49 Insufficient nvidia.com/gpu, 7 Insufficient memory.
16m         Warning   FailedScheduling    pod/beige-stingray-tzkrv-vglpm        0/51 nodes are available: 1 node(s) had taint {sandbox: true}, that the pod didn't tolerate, 1 node(s) were unschedulable, 45 Insufficient cpu, 49 Insufficient nvidia.com/gpu, 6 Insufficient memory.
15m         Warning   FailedScheduling    pod/beige-stingray-tzkrv-vglpm        0/51 nodes are available: 1 node(s) had taint {node.kubernetes.io/not-ready: }, that the pod didn't tolerate, 1 node(s) had taint {sandbox: true}, that the pod didn't tolerate, 45 Insufficient cpu, 49 Insufficient nvidia.com/gpu, 6 Insufficient memory.
15m         Warning   FailedScheduling    pod/beige-stingray-tzkrv-vglpm        0/51 nodes are available: 1 node(s) had taint {sandbox: true}, that the pod didn't tolerate, 45 Insufficient cpu, 50 Insufficient nvidia.com/gpu, 6 Insufficient memory.
14m         Normal    Scheduled           pod/beige-stingray-tzkrv-vglpm        Successfully assigned awesome-app/beige-stingray-tzkrv-vglpm to ip-10-144-171-199.ec2.internal
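For illustration, a rough sketch of one way a worker could surface these, assuming the official kubernetes Python client (this is not how prefect-kubernetes is implemented, and log_pod_events is a hypothetical helper): list events scoped to the pod it created and forward them to the flow run logs.

```python
from kubernetes import client, config


def log_pod_events(namespace: str, pod_name: str) -> None:
    """Hypothetical helper: list kube events for a pod and emit them
    (print is a stand-in for the Prefect flow run logger)."""
    config.load_incluster_config()  # use config.load_kube_config() when running outside the cluster
    core_v1 = client.CoreV1Api()
    events = core_v1.list_namespaced_event(
        namespace,
        field_selector=f"involvedObject.name={pod_name}",
    )
    for event in events.items:
        print(f"{event.last_timestamp} {event.type} {event.reason}: {event.message}")
```

e.g. log_pod_events("awesome-app", "beige-stingray-tzkrv-vglpm") would show the TriggeredScaleUp / FailedScheduling events above alongside the worker's own logs.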
zzstoatzz (Collaborator) commented Aug 21, 2023

Hi @tekumara - I don't exactly recall how this worked in Prefect 1, but this seems useful! We'll need some internal feedback on how / which events should be obtained and displayed.

** redacted my previously confused comments 🙂 **

tekumara (Author)

It's possible this is resolved by #90 ... I'd be happy to test this; is there a new release planned soon?

desertaxle (Member)
Addressed by #91
