allow `KubernetesJob._watch_job` to track long-running jobs #8189
Conversation
@zzstoatzz this makes sense, but is the …
@madkinsz I see. So would we want to re-introduce the …? I haven't been able to reproduce that exact situation, so I'm not sure how we'd want to differentiate in the …
Adding more info in case it could help. Inspired by the previous updates of this PR and with …. After a while, the …
@zzstoatzz have you observed …? In my context, I used the watch interface to stream logs, and in both cases, i.e. the pod ending before 4 hours and the pod still running when the stream times out at 4 hours, there is no difference and the loop just finishes without any error/exception.
hi @naveedhd

yep I have, I started seeing the exact error described in #7653 around 4 hours into flow runs.

I didn't notice this before ~4 hours into flow runs, but whether or not the job completes before 4 hours, our handling should remain consistent - i.e. at any point if …

it's likely there is a more precise way of handling this, but I didn't want to be overly selective in the exception handling without more information on why exactly …
@zzstoatzz @madkinsz Sorry to comment on a closed PR, but we still hit this issue (…).
I can also confirm that this still exists on 2.7.10.
I believe #8345 is related (if not the exact same issue).
Overview
The purpose of this PR is to address the long-running Kubernetes job issues described in #8050 and #7653. After around 4 hours of execution, Kubernetes jobs were being marked as crashed by Prefect although the job continued to run. Other users encountered an `InvalidChunkLength` error around the same time.

Given user descriptions and my exploration of the issue, there seem to be two problems here:

- the `streaming-connection-idle-timeout` (hard-coded default of 4 hours by Kubernetes) is reached and the watch stream exits without exception, causing Prefect to mark the job as failed prematurely and lose track of the job execution
- with `stream_output=True` on the `KubernetesJob`, the pod logs stream also becomes stale after 4 hours and encounters an `InvalidChunkLength` exception (as previously mentioned here).

Here we address these two problems by changing the following in `KubernetesJob._watch_job` (respectively):

- keep re-entering the watch until the job completes or `job_watch_timeout_seconds` elapses (see the sketch below)
- handle the stale log stream opened from `KubernetesJob.run()` where `stream_output=True`.

The behavior I've observed by running a k8s agent using these changes is the following:

- no more `InvalidChunkLength` exception thrown at around 4 hours on calling `logs.stream()`, previously called here

I have not directly reproduced the case where job tracking is lost and `Job did not complete` is logged prematurely, although we have received multiple reports of this case.
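To illustrate the first change, here is a minimal, self-contained sketch of the re-entrant watch idea, written against the Kubernetes Python client. It is not Prefect's actual implementation: the `watch_job` helper, its parameters, and the `ProtocolError` handling are illustrative assumptions, not Prefect APIs.

```python
import time
from typing import Optional

import urllib3
from kubernetes import client, config, watch


def watch_job(
    job_name: str,
    namespace: str = "default",
    job_watch_timeout_seconds: Optional[int] = None,
) -> bool:
    """Return True once the job reports a completion time, or False if the
    overall timeout elapses first (illustrative helper, not a Prefect API)."""
    config.load_kube_config()  # use config.load_incluster_config() inside a cluster
    batch_v1 = client.BatchV1Api()

    deadline = (
        time.monotonic() + job_watch_timeout_seconds
        if job_watch_timeout_seconds is not None
        else None
    )

    while True:
        remaining = None if deadline is None else int(deadline - time.monotonic())
        if remaining is not None and remaining <= 0:
            return False  # the overall watch timeout elapsed

        kwargs = dict(
            namespace=namespace,
            field_selector=f"metadata.name={job_name}",
        )
        if remaining is not None:
            kwargs["timeout_seconds"] = remaining

        try:
            for event in watch.Watch().stream(batch_v1.list_namespaced_job, **kwargs):
                if event["object"].status.completion_time is not None:
                    return True
        except urllib3.exceptions.ProtocolError:
            # Connections can be dropped mid-stream (e.g. by idle timeouts);
            # fall through and re-enter the watch instead of giving up.
            pass

        # The stream can also end silently (idle/watch timeout) without the job
        # completing; loop around and start a fresh watch rather than assuming
        # the job failed.
```

The key design point is that an ended or broken watch stream is treated as a signal to re-enter the watch, not as evidence that the job failed.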
Background

The default value of a kubelet's `--streaming-connection-idle-timeout` is 4 hours.

The value of `streaming-connection-idle-timeout` can be configured by adjusting the kubelet config - however, note that it is not generally recommended to disable this timeout (setting a value of 0), as it opens the door to denial-of-service attacks, persisting inactive connections, and running out of ephemeral ports.

thanks to @naveedhd and @john-jam for the helpful context
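For reference only (not part of this PR's changes), the timeout can be raised in the kubelet configuration file. The field name comes from the `kubelet.config.k8s.io/v1beta1` `KubeletConfiguration` type; the file path and the 8-hour value below are just examples:

```yaml
# Excerpt of a kubelet config file (commonly /var/lib/kubelet/config.yaml).
# The 8h value is an arbitrary example; setting 0 disables the timeout
# entirely, which is not recommended for the reasons noted above.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
streamingConnectionIdleTimeout: 8h0m0s
```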
Example
Checklist
- This pull request references any related issue by including "closes <link to issue>"
- This pull request includes a label categorizing the change, e.g. `fix`, `feature`, `enhancement`