Long running kubernetes jobs are marked as crashed by the agent #8050
Comments
I can confirm the same issue happens on 2.7.7, but there is no stack trace anymore. It's also interesting that one of the Databricks subflows (the one that monitors the job status) keeps running fine, while the parent flow is marked as crashed.
Thanks for the additional details! I'm investigating this issue and have added logs in #8097 to give us some insight into the suspicious code path.
Also, since this is occurring in …
If you give me a hint on how to get more useful info, I can help with debugging. There is nothing like a flow error in the logs (I also checked the agent logs), and the K8s job is still alive and working just fine.
Found this in agent logs:
The job is still alive and working, though.
Tried without the job_watch_timeout_seconds setting - still the same. The parent flow is marked as crashed after 4+ hours, while the subflows keep running fine.
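For reference, job_watch_timeout_seconds is configured on the Prefect 2.x KubernetesJob infrastructure block; a minimal sketch (the image, namespace, and block name are placeholders, and None is assumed to mean no explicit client-side watch limit):

```python
# Rough sketch: where job_watch_timeout_seconds is set on a Prefect 2.x
# KubernetesJob infrastructure block. Image, namespace, and block name are
# placeholder values.
from prefect.infrastructure import KubernetesJob

k8s_job = KubernetesJob(
    image="registry.example.com/flows:latest",
    namespace="prefect",
    job_watch_timeout_seconds=None,  # None: no explicit client-side limit
)
k8s_job.save("long-running-k8s-job", overwrite=True)
```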
We have a separate report that at ~4 hours the Kubernetes job watcher will mark the pod as crashed. I have no idea why 4 hours is the magic number. Perhaps it's an issue with the Kubernetes client implementation? We'll definitely need to add some sort of workaround here.
We are presuming this is a bug with the Kubernetes client's watch implementation.
I think a lot of users have Kubernetes jobs running for more than 4 hours; we'll definitely need a workaround.
Contributions are welcome here! We'll probably just want to move that watch out into a helper method that calls the watch repeatedly. See related at kubernetes-client/csharp#486 (comment). Ideally we'd use an Informer API, but that issue is quite stale: kubernetes-client/python#868. We're likely to take this up ourselves next week.
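For concreteness, a rough sketch of what such a retrying helper could look like with the official Python kubernetes client; the namespace, label selector, and terminal-phase check are placeholders, not Prefect's actual implementation:

```python
# Sketch of a helper that restarts the Kubernetes watch whenever the server
# or an intermediary closes the streaming connection (e.g. at the ~4 hour
# idle timeout). Namespace and label selector are placeholder values.
from kubernetes import client, config, watch


def watch_job_pods_until_done(namespace: str, label_selector: str) -> None:
    config.load_incluster_config()  # or config.load_kube_config() outside the cluster
    v1 = client.CoreV1Api()
    while True:
        w = watch.Watch()
        # timeout_seconds asks the API server to end the watch well before any
        # proxy/kubelet idle timeout, so we re-enter the loop cleanly.
        for event in w.stream(
            v1.list_namespaced_pod,
            namespace=namespace,
            label_selector=label_selector,
            timeout_seconds=300,
        ):
            pod = event["object"]
            if pod.status.phase in ("Succeeded", "Failed"):
                w.stop()
                return
        # The stream ended (timeout or dropped connection) without a terminal
        # phase: loop around and start a fresh watch.
```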
Noticed kopf's implementation mentioned in a comment; it might be relevant.
We've made significant progress here, but we'll also need to add handling for different resource versions before this is done.
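To illustrate what handling resource versions can involve: a restarted watch can resume from the last observed resourceVersion, and a 410 Gone response means that version has expired and the watcher must relist. A hedged sketch along those lines (again, not Prefect's actual code; names and values are placeholders):

```python
# Sketch: resuming a restarted watch from the last seen resourceVersion and
# recovering from a 410 Gone ("resource version too old") response.
from kubernetes import client, config, watch
from kubernetes.client.rest import ApiException


def watch_pods_resuming(namespace: str, label_selector: str) -> None:
    config.load_kube_config()
    v1 = client.CoreV1Api()
    resource_version = None
    while True:
        w = watch.Watch()
        try:
            for event in w.stream(
                v1.list_namespaced_pod,
                namespace=namespace,
                label_selector=label_selector,
                resource_version=resource_version,
                timeout_seconds=300,
            ):
                # Remember the last seen version so a reconnect resumes here
                # instead of replaying (or missing) events. A real
                # implementation would also return on a terminal pod phase,
                # as in the sketch further up the thread.
                resource_version = event["object"].metadata.resource_version
        except ApiException as exc:
            if exc.status == 410:
                # "Gone": our resource_version is too old; relist from scratch.
                resource_version = None
                continue
            raise
```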
Is there any update on this initiative? Running into the same issue now.
We'd like to know the status too.
Hi folks, I'm going to be looking into this. Thank you all for the info so far, including the pointer to kopf's implementation.
Another article that I came across while researching this: https://platform9.com/kb/kubernetes/kubectl-exec-is-timed-out-after-4-hours
I believe the fix would be to wrap the relevant watch calls so they are restarted when the connection drops. I'll keep you posted.
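One way to wrap the watch, sketched here under the assumption that the ~4 hour limit is an idle-connection timeout somewhere between the client and the API server: pair the server-side timeout_seconds with the client-side _request_timeout so a dead connection is detected promptly rather than hanging. Values and namespace are illustrative.

```python
# Rough sketch: combining a server-side watch timeout with a client-side
# socket timeout so the client can never block past a long idle window.
from kubernetes import client, config, watch

config.load_kube_config()
v1 = client.CoreV1Api()

w = watch.Watch()
for event in w.stream(
    v1.list_namespaced_pod,
    namespace="prefect",      # placeholder namespace
    timeout_seconds=300,      # API server ends the watch after 5 minutes
    _request_timeout=330,     # client raises if the socket goes silent longer
):
    print(event["type"], event["object"].metadata.name)
# A silently dropped connection now surfaces as a timeout exception that a
# retrying helper (like the sketch earlier in the thread) can catch and
# restart, instead of hanging indefinitely.
```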
@m-denton and @klayhb, do you have any configuration details to share? Your Prefect and Kubernetes versions, and whether you're using the prefect_kubernetes.KubernetesWorker worker or the prefect.infrastructure.kubernetes.KubernetesJob? Any logs you have from your agent/worker would be helpful.
I see several changes in the past 8 months that may have improved this situation, so I'd love to know if you're still experiencing this and on what versions.
I'm not having much luck reproducing this on minikube by lowering the kubelet or apiserver timeouts. I'm next going to try just reproducing this at 4 hours with a flow running in the background.
I'm currently on vacation, but as soon as I get back I'll give all the details 👍
Update: I set up a flow run that runs for many hours and I couldn't reproduce this. A few ideas/questions: …
Thanks for your patience while we look into this; it's being stubborn to reproduce.
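For reference, the kind of minimal long-running flow used for these repro attempts might look something like this (the flow name and duration are placeholders):

```python
# Rough sketch of a minimal long-running flow for this kind of repro attempt:
# it sleeps in small increments so the pod stays alive for hours without
# producing logs.
import time

from prefect import flow


@flow
def long_sleeper(hours: float = 5.0) -> None:
    deadline = time.monotonic() + hours * 3600
    while time.monotonic() < deadline:
        time.sleep(60)


if __name__ == "__main__":
    long_sleeper()
```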
@chrisguidry Sorry, there's been some confusion on our part: we mixed up this issue with one I opened myself, #10620. I wrongly assumed it might be related to our issue, but it looks like it's a separate thing. So I'm probably not the one to ask for details here; I'll give the stage to whoever else reported it.
No worries! I'll take a few more stabs at reproducing this, and if I don't get any more info, I'll close it out in the next day or so.
Update: still no repro of this issue with a flow that doesn't log. It ran fine for 5 hours without issue.
@mmartsen or @m-denton, would either of you be able to share any details about your setup or what you're experiencing?
After two more attempts yesterday, I was unable to reproduce this. Please feel free to reopen this issue if you have more information to help with debugging, including the software versions involved and perhaps a rough idea of the network topology you're using to access the cluster.
First check
Bug summary
I have a flow that runs a Spark job using the prefect-databricks connector. If the job runs for more than 4 hours, the flow in Prefect is marked as crashed after 4 hours plus 1-4 minutes.
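For context, a rough sketch of this kind of setup with the prefect-databricks collection; the credentials block name, the task payload, and the exact parameter names of jobs_runs_submit_and_wait_for_completion are assumptions here and may differ between collection versions:

```python
# Rough sketch (not the reporter's actual code) of a flow that submits a
# long-running Databricks job and waits for it via the prefect-databricks
# collection. Block name, cluster id, and notebook path are placeholders.
from prefect import flow
from prefect_databricks import DatabricksCredentials
from prefect_databricks.flows import jobs_runs_submit_and_wait_for_completion


@flow
def run_long_spark_job():
    credentials = DatabricksCredentials.load("my-databricks-creds")
    return jobs_runs_submit_and_wait_for_completion(
        databricks_credentials=credentials,
        run_name="long-spark-job",
        tasks=[
            {
                "task_key": "main",
                "notebook_task": {"notebook_path": "/Jobs/long_running_job"},
                "existing_cluster_id": "1234-567890-abcde123",
            }
        ],
    )
```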
Reproduction
Error
Versions
Additional context
No response