Add detail to k8s agent logs when pod never starts (and maybe raise FAILED state?) #6394

zzstoatzz · 2022-08-11T23:15:19Z

First check

I added a descriptive title to this issue.
I used the GitHub search to find a similar request and didn't find it.
I searched the Prefect documentation for this feature.

Prefect Version

2.x

Describe the proposed behavior

If there is an issue with the k8s agent setup (e.g. a kubernetes-job block references the incorrect cluster service account) such that the pod is never able to start, we could surface the job's events, as kubectl describe job tuscan-flamingojqdmg- would:

....
Events:
  Type     Reason        Age                    From            Message
  ----     ------        ----                   ----            -------
  Warning  FailedCreate  3m30s (x27 over 134m)  job-controller  Error creating: pods "tuscan-flamingojqdmg-" is forbidden: error looking up service account prefect-2/prefect-agent-1659997862: serviceaccount "prefect-agent-1659997862" not found

... and mark the corresponding flow as Failed
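
A rough sketch of how the events could be fetched, not Prefect's actual implementation. It assumes the official `kubernetes` Python client, whose `CoreV1Api.list_namespaced_event` accepts a `field_selector`. The helper names `format_event` and `job_event_lines` are hypothetical.

```python
from typing import Any, List


def format_event(event: Any) -> str:
    """Render one event roughly like a row of `kubectl describe job` output."""
    # `count` can be missing or None on some events; treat that as one occurrence.
    count = getattr(event, "count", None) or 1
    return f"{event.type} {event.reason} (x{count}): {event.message}"


def job_event_lines(core_v1: Any, namespace: str, job_name: str) -> List[str]:
    """Return formatted events whose involved object is the given job.

    Assumption: `core_v1` is a `kubernetes.client.CoreV1Api` instance (or any
    object exposing a compatible `list_namespaced_event` method).
    """
    selector = f"involvedObject.kind=Job,involvedObject.name={job_name}"
    events = core_v1.list_namespaced_event(namespace, field_selector=selector)
    return [format_event(e) for e in events.items]
```

The resulting lines could be logged alongside the existing "Pod never started" error. Moving the flow run to Failed would be a separate step and is not shown here.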

Describe the current behavior

Agent logs just state that the pod never started, and the flow run stays in the Pending state:

23:04:48.021 | DEBUG   | prefect.agent - Checking for flow runs...
23:04:48.147 | INFO    | prefect.agent - Submitting flow run 'cd17f57d-13ce-4a8c-92ef-475d0e1067d4'
23:04:53.153 | DEBUG   | prefect.agent - Checking for flow runs...
23:04:56.097 | ERROR   | prefect.infrastructure.kubernetes-job - Job 'tuscan-flamingojqdmg-': Pod never started.

Example Use

No response

Additional context

using the helm chart for k8s agent as found here


tekumara · 2022-09-18T23:44:56Z

Related to PrefectHQ/prefect-kubernetes#90 and #5489

zanieb · 2022-09-19T01:13:32Z

Maybe we should use the same pattern as the ECS block, where we wait until the task starts and do not report the task as started until that occurs.
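
A minimal sketch of that wait-then-report pattern, written for illustration and not taken from the ECS block's code. `wait_for_pod_start` and its `get_phase` callable are hypothetical. The phase strings are the standard Kubernetes pod phases.

```python
import time
from typing import Callable, Optional


def wait_for_pod_start(
    get_phase: Callable[[], Optional[str]],
    timeout_seconds: float,
    poll_interval: float = 1.0,
) -> bool:
    """Poll until the pod leaves Pending; return True only if it started.

    `get_phase` is a hypothetical callable that returns the pod's phase
    ("Pending", "Running", "Succeeded", "Failed") or None if no pod exists yet.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        phase = get_phase()
        if phase in ("Running", "Succeeded"):
            return True
        # Give up on an outright failure or once the deadline has passed.
        if phase == "Failed" or time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)
```

A caller would report the run as started only when this returns True. On False it could collect the job's events and fail the run instead of leaving it Pending.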

avishniakov · 2022-09-28T10:17:24Z

We are experiencing the same issue on 2.3.2. It would be great to have the run reported as failed and to get the related logs back in the run.

zanieb · 2022-09-28T14:48:40Z

I thought I'd have time to work on this but did not. This is open for contribution.

zzstoatzz added the enhancement (An improvement of an existing feature) and status:triage labels Aug 11, 2022

zanieb self-assigned this Aug 12, 2022

zanieb added the priority:medium and status:in-progress (Someone is working on this, check-in here before beginning new work) labels and removed the status:triage label Aug 12, 2022

zanieb removed their assignment Sep 28, 2022

zanieb added the status:accepted (We may work on this; we will accept work from external contributors) label and removed the status:in-progress (Someone is working on this, check-in here before beginning new work) label Sep 28, 2022

zanieb added the good first issue (This issue is good for newcomers) label Sep 28, 2022

cicdw removed the priority:medium label Aug 15, 2023

gabcoyne mentioned this issue Jun 13, 2024

Migrates to Kubernetes_asyncio for asynchronous support #13910

Open

6 tasks
