Add detail to k8s agent logs when pod never starts (and maybe raise FAILED state?) #6394

Open
3 tasks done
zzstoatzz opened this issue Aug 11, 2022 · 4 comments
Labels
enhancement An improvement of an existing feature good first issue This issue is good for newcomers status:accepted We may work on this; we will accept work from external contributors

Comments

@zzstoatzz
Contributor

zzstoatzz commented Aug 11, 2022

First check

  • I added a descriptive title to this issue.
  • I used the GitHub search to find a similar request and didn't find it.
  • I searched the Prefect documentation for this feature.

Prefect Version

2.x

Describe the proposed behavior

If there is an issue with the k8s agent setup (e.g. an instance of a kubernetes-job block references the wrong cluster service account) such that the pod can never start, the agent could surface the job's events, the way kubectl describe job tuscan-flamingojqdmg- does:

....
Events:
  Type     Reason        Age                    From            Message
  ----     ------        ----                   ----            -------
  Warning  FailedCreate  3m30s (x27 over 134m)  job-controller  Error creating: pods "tuscan-flamingojqdmg-" is forbidden: error looking up service account prefect-2/prefect-agent-1659997862: serviceaccount "prefect-agent-1659997862" not found

... and mark the corresponding flow run as Failed.
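
One way this could look, as a rough sketch rather than Prefect's implementation: fetch the events attached to the job and turn the non-Normal ones into log lines. JobEvent, summarize_job_events, and fetch_job_events are hypothetical names. The sketch assumes the official kubernetes Python client and in-cluster credentials; none of it is Prefect's actual API.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class JobEvent:
    """Hypothetical, minimal stand-in for a Kubernetes event object."""

    type: str  # e.g. "Warning"
    reason: str  # e.g. "FailedCreate"
    message: str
    count: int = 1


def summarize_job_events(job_name: str, events: Iterable[JobEvent]) -> List[str]:
    """Return one log line per non-Normal event recorded for the job."""
    lines = []
    for event in events:
        if event.type == "Normal":
            continue  # only warnings are interesting when a pod never starts
        lines.append(
            f"Job {job_name!r}: {event.type} {event.reason} "
            f"(x{event.count}): {event.message}"
        )
    return lines


def fetch_job_events(job_name: str, namespace: str) -> List[JobEvent]:
    """Fetch events for a Job. Needs the `kubernetes` package and cluster access."""
    # Imported lazily so the formatting helper above works without the package.
    from kubernetes import client, config

    config.load_incluster_config()  # assumption: the agent runs inside the cluster
    core = client.CoreV1Api()
    result = core.list_namespaced_event(
        namespace,
        field_selector=(
            f"involvedObject.kind=Job,involvedObject.name={job_name}"
        ),
    )
    return [
        JobEvent(type=e.type, reason=e.reason, message=e.message, count=e.count or 1)
        for e in result.items
    ]
```

Fed the FailedCreate event shown above, summarize_job_events would produce a single line beginning with Job 'tuscan-flamingojqdmg-': Warning FailedCreate (x27): Error creating: and carrying the service account message, which the agent could log next to "Pod never started."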

Describe the current behavior

The agent logs only state that the pod never started, and the flow run stays in a Pending state:

23:04:48.021 | DEBUG   | prefect.agent - Checking for flow runs...
23:04:48.147 | INFO    | prefect.agent - Submitting flow run 'cd17f57d-13ce-4a8c-92ef-475d0e1067d4'
23:04:53.153 | DEBUG   | prefect.agent - Checking for flow runs...
23:04:56.097 | ERROR   | prefect.infrastructure.kubernetes-job - Job 'tuscan-flamingojqdmg-': Pod never started.

Example Use

No response

Additional context

Using the Helm chart for the k8s agent, as found here

@zzstoatzz zzstoatzz added enhancement An improvement of an existing feature status:triage labels Aug 11, 2022
@zanieb zanieb self-assigned this Aug 12, 2022
@zanieb zanieb added priority:medium status:in-progress Someone is working on this, check-in here before beginning new work and removed status:triage labels Aug 12, 2022
@tekumara
Contributor

Related to PrefectHQ/prefect-kubernetes#90 and #5489

@zanieb
Contributor

zanieb commented Sep 19, 2022

Maybe we should use the same pattern as the ECS block, where we wait until the task starts and do not report the task as started until that occurs:
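
The wait-then-report idea can be sketched generically as below. This is not the ECS block's code, only an illustration under assumptions: wait_until_started, InfrastructureNeverStarted, and the is_started callback are hypothetical names. How a caller would turn the raised error into a Failed flow run state is left out.

```python
import time
from typing import Callable


class InfrastructureNeverStarted(RuntimeError):
    """Hypothetical error raised when the workload does not start in time."""


def wait_until_started(
    is_started: Callable[[], bool],
    timeout_seconds: float,
    poll_interval: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> None:
    """Poll `is_started` until it returns True or the timeout elapses.

    Returns normally once the workload has started, so the caller can report
    it as started only at that point. Raises InfrastructureNeverStarted
    otherwise, giving the caller a place to mark the run as failed.
    """
    deadline = clock() + timeout_seconds
    while True:
        if is_started():
            return
        if clock() >= deadline:
            raise InfrastructureNeverStarted(
                f"workload did not start within {timeout_seconds} seconds"
            )
        sleep(poll_interval)
```

For the Kubernetes case, is_started would presumably check whether the job's pod has been created and left the Pending phase; that check is an assumption and is not shown here.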

@avishniakov

We experience the same issue on 2.3.2. Would be cool to have the run reported as failed and also get the logs for that back in the run.

@zanieb zanieb removed their assignment Sep 28, 2022
@zanieb zanieb added status:accepted We may work on this; we will accept work from external contributors and removed status:in-progress Someone is working on this, check-in here before beginning new work labels Sep 28, 2022
@zanieb
Contributor

zanieb commented Sep 28, 2022

I thought I'd have time to work on this but did not. This is open for contribution.
