allow KubernetesJob._watch_job to track long-running jobs #8189

Merged
merged 15 commits into main from restart-k8s-job-watch
Jan 26, 2023

Conversation

zzstoatzz
Collaborator

@zzstoatzz zzstoatzz commented Jan 18, 2023

Overview

The purpose of this PR is to address the long-running Kubernetes job issues described in #8050 and #7653. After around 4 hours of execution, Kubernetes jobs were being marked as crashed by Prefect even though the jobs continued to run. Other users encountered an InvalidChunkLength error at around the same point.

Given user descriptions and my exploration of the issue, there seem to be two problems here:

  • the job watch event stream becomes stale when the streaming-connection-idle-timeout (hard-coded default of 4 hours by Kubernetes) is reached and the watch stream exits without exception, causing Prefect to mark the job as failed prematurely and lose track of the job execution
  • when using stream_output=True on the KubernetesJob, the pod logs stream also becomes stale after 4 hours, and encounters an InvalidChunkLength exception (as previously mentioned here).

Here we address these two problems by making the following changes (respectively) in KubernetesJob._watch_job; a rough sketch of the approach follows the list:

  • we now restart the k8s job event watch when it exits without an exception, looping until a completion time is observed on the job or until job_watch_timeout_seconds elapses.
  • we now emit a warning (instead of letting the exception propagate) when an exception is encountered while streaming pod logs during KubernetesJob.run() with stream_output=True.
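
For illustration, here is a minimal sketch of the restart-the-watch approach. This is not the actual Prefect implementation; batch_client, job_name, namespace, and deadline are hypothetical names, and the kubernetes Python client is assumed.

```python
# Minimal sketch of the restart-on-silent-exit approach; NOT the actual Prefect
# implementation. `batch_client`, `job_name`, `namespace`, and `deadline` are
# hypothetical names; assumes the `kubernetes` Python client.
import time

from kubernetes.watch import Watch


def watch_until_complete(batch_client, job_name, namespace, deadline=None):
    """Watch a job, restarting the watch whenever the stream exits silently."""
    while True:
        if deadline is not None and time.monotonic() > deadline:
            return None  # job_watch_timeout_seconds elapsed without completion

        watch = Watch()
        # The watch stream can exit without raising once the kubelet's
        # streaming-connection-idle-timeout (default 4h) is hit, so instead of
        # treating that as a failure we simply start a fresh watch.
        for event in watch.stream(
            batch_client.list_namespaced_job,
            namespace=namespace,
            field_selector=f"metadata.name={job_name}",
        ):
            job = event["object"]
            # Note: completion_time is only set when a job succeeds; a real
            # implementation would also want to inspect job.status.failed.
            if job.status.completion_time is not None:
                watch.stop()
                return job
```

A deadline derived from job_watch_timeout_seconds keeps the outer loop from restarting the watch forever if the job never completes.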

The behavior I've observed by running a k8s agent using these changes is the following:

  • job execution proceeds as normal for the first 4 hours
  • an InvalidChunkLength exception is thrown at around 4 hours when calling logs.stream(), previously called here
  • the exception is surfaced in the agent logs
  • the job continues, and flow run logs remain visible in the agent logs and in the UI

I have not directly reproduced the case where job tracking is lost and Job did not complete is logged prematurely, although we have received multiple reports of this case.

Background

The default value of a kubelet's --streaming-connection-idle-timeout is 4 hours.

The value of streaming-connection-idle-timeout can be configured by adjusting the kubelet config. Note, however, that it is generally not recommended to disable this timeout (by setting a value of 0), as doing so opens the door to denial-of-service attacks, inactive connections persisting indefinitely, and running out of ephemeral ports.

thanks to @naveedhd and @john-jam for the helpful context

Example

Checklist

  • This pull request references any related issue by including "closes <link to issue>"
    • If no issue exists and your change is not a small fix, please create an issue first.
  • This pull request includes tests or only affects documentation.
  • This pull request includes a label categorizing the change e.g. fix, feature, enhancement

@netlify
netlify bot commented Jan 18, 2023

Deploy Preview for prefect-orion ready!

🔨 Latest commit 946084c
🔍 Latest deploy log https://app.netlify.com/sites/prefect-orion/deploys/63d2b8df8011b00008352c42
😎 Deploy Preview https://deploy-preview-8189--prefect-orion.netlify.app/api-ref/prefect/infrastructure

@zzstoatzz zzstoatzz changed the title from "Restart k8s job watch" to "Except timeout errors while streaming pod logs" Jan 25, 2023
@zzstoatzz zzstoatzz marked this pull request as ready for review January 25, 2023 22:13
@zzstoatzz zzstoatzz requested a review from a team as a code owner January 25, 2023 22:13
@zanieb
Contributor

zanieb commented Jan 25, 2023

@zzstoatzz this makes sense, but is the Watch also not a streaming operation? We have clear reports that Job did not complete was logged, which only occurs at https://github.com/PrefectHQ/prefect/blob/main/src/prefect/infrastructure/kubernetes.py#L608 and is not covered by this change.

@zzstoatzz
Collaborator Author

zzstoatzz commented Jan 25, 2023

@madkinsz I see. So would we want to re-introduce the while around the Watch, like we had earlier in this PR, in order to restart the watch?

I haven't been able to reproduce that exact situation, so I'm not sure how we'd want to differentiate, in the for's else, between a genuine failure and a failure that would be handled by restarting the job watch.

@john-jam
Contributor

john-jam commented Jan 26, 2023

Adding more info if that could help.

Inspired by the previous revisions of this PR, we applied those updates with stream_output=False and fixed the false-positive Job did not complete error.

After a while, the for event in watch.stream(...) loop stops waiting for events even though the job has not finished yet; it exits without an exception. By retrying it until event["object"].status.completion_time is not None, I think we can catch both genuinely failing jobs and genuinely succeeded jobs.
We could then decide whether or not to return -1 here based on event["object"].status.failed or event["object"].status.succeeded.
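
For illustration, a small sketch of the kind of check described above; interpret_job_event is a hypothetical helper (not Prefect's code), using the V1JobStatus fields of the kubernetes Python client:

```python
# Hypothetical helper (not Prefect's code) illustrating the suggestion above.
# `event` is an item yielded by watch.stream(batch_client.list_namespaced_job, ...).
def interpret_job_event(event):
    """Return -1 for a failed job, 0 for a succeeded job, None to keep watching."""
    status = event["object"].status  # a V1JobStatus
    if status.failed:  # number of failed pods, or None
        return -1
    if status.succeeded and status.completion_time is not None:
        # completion_time is only set when a job finishes successfully
        return 0
    # Still running, or the watch stream exited silently: keep (or restart) the watch.
    return None
```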

@zzstoatzz zzstoatzz added the fix label (A fix for a bug in an existing feature) Jan 26, 2023
@zzstoatzz zzstoatzz changed the title from "Except timeout errors while streaming pod logs" to "allow KubernetesJob._watch_job to track long-running jobs" Jan 26, 2023
@naveedhd

naveedhd commented Jan 26, 2023

@zzstoatzz have you observed InvalidChunkLength with logs.stream() at around 4 hours? And do you know whether it is not raised when the job ends before 4 hours?

In my setup, I used the watch interface to stream logs, and in both cases (the pod ending before 4 hours, and the pod still running when the stream times out at 4 hours) there is no difference: the loop just finishes without any error or exception.

@zzstoatzz
Collaborator Author

hi @naveedhd

have you observed InvalidChunkLength with logs.stream() around 4 hours?

Yep, I have. I started seeing the exact error described in #7653 around 4 hours into flow runs.

do you know if this is not raised when the job ends before 4 hours?

I didn't notice this before ~4 hours into flow runs, but whether or not the job completes before 4 hours, our handling should remain consistent. That is, if stream_output=True (otherwise we won't attempt to stream pod logs at all) and we encounter an exception while attempting to stream pod logs, we should:

  • not crash the Prefect flow
  • stop sending pod logs to stdout

It's likely there is a more precise way of handling this, but I didn't want to be overly selective in the exception handling without more information on why exactly InvalidChunkLength occurs.
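
For reference, a rough sketch of this "warn instead of crash" handling around the pod log stream; the names (core_client, pod_name, namespace, logger) are hypothetical and the kubernetes Python client is assumed, so this is not the exact Prefect code:

```python
# Rough sketch only: hypothetical names, not the exact Prefect implementation.
def stream_pod_logs(core_client, pod_name, namespace, logger):
    logs = core_client.read_namespaced_pod_log(
        name=pod_name,
        namespace=namespace,
        follow=True,
        _preload_content=False,  # return the raw response so it can be streamed
    )
    try:
        for line in logs.stream():
            print(line.decode().rstrip())
    except Exception:
        # e.g. urllib3's InvalidChunkLength after ~4 hours of streaming:
        # stop forwarding logs, but let the job (and the job watch) continue.
        logger.warning(
            "Error occurred while streaming logs; the job will continue to run, "
            "but logs will no longer be streamed to stdout.",
            exc_info=True,
        )
```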

@zzstoatzz zzstoatzz merged commit a41e43c into main Jan 26, 2023
@zzstoatzz zzstoatzz deleted the restart-k8s-job-watch branch January 26, 2023 18:00
github-actions bot pushed a commit that referenced this pull request Jan 26, 2023
Co-authored-by: Alexander Streed <alex.s@prefect.io>
github-actions bot pushed a commit that referenced this pull request Jan 26, 2023
Co-authored-by: Alexander Streed <alex.s@prefect.io>
@john-jam
Contributor

john-jam commented Feb 1, 2023

@zzstoatzz @madkinsz Sorry to comment on a closed PR, but we still hit this issue (Job did not complete within timeout of Nones.) with prefect==2.7.10.
I think it's because of this else block. When the for event in watch.stream(...) exits silently (even though the job is still running), the else branch is executed and makes the agent stop watching the job; instead, it should go into another iteration of the wrapping while not completed: loop, IMHO.

@BitTheByte
Contributor

I can also confirm that this still exists on 2.7.10.

@EmilRex
Contributor

EmilRex commented Feb 1, 2023

I believe #8345 is related (if not the same exact issue).

@john-jam
Contributor

john-jam commented Feb 1, 2023

@EmilRex It is the exact same issue. This PR removes the problematic else statement and will hopefully be merged soon 🎉

zanieb pushed a commit that referenced this pull request Feb 6, 2023
Co-authored-by: Alexander Streed <alex.s@prefect.io>
zanieb pushed a commit that referenced this pull request Feb 6, 2023
Co-authored-by: Alexander Streed <alex.s@prefect.io>