allow KubernetesJob._watch_job to track long-running jobs #8189

Merged
merged 15 commits into main from restart-k8s-job-watch
Jan 26, 2023

Conversation

zzstoatzz
Collaborator

@zzstoatzz zzstoatzz commented Jan 18, 2023

Overview

The purpose of this PR is to address the long-running Kubernetes job issues described in #8050 and #7653. After around 4 hours of execution, Kubernetes jobs were being marked as crashed by Prefect even though the jobs continued to run. Other users encountered an InvalidChunkLength error at around the same point.

Given user descriptions and my exploration of the issue, there seem to be two problems here:

  • the job watch event stream becomes stale when the streaming-connection-idle-timeout (hard-coded default of 4 hours by Kubernetes) is reached and the watch stream exits without exception, causing Prefect to mark the job as failed prematurely and lose track of the job execution
  • when using stream_output=True on the KubernetesJob, the pod logs stream also becomes stale after 4 hours, and encounters an InvalidChunkLength exception (as previously mentioned here).

Here we address these two problems by making the following changes (respectively) in KubernetesJob._watch_job; a rough sketch of the approach follows the list:

  • we now restart the k8s job event watch when it exits without an exception, looping until a completion time is observed on the job or until job_watch_timeout_seconds elapses.
  • we now emit a warning (instead of letting the exception propagate) when an exception is encountered while streaming pod logs during KubernetesJob.run() with stream_output=True.
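
For illustration, here is a minimal sketch of the restart-the-watch approach. This is not the actual Prefect implementation; batch_client, job_name, namespace, and deadline are hypothetical names, and the kubernetes Python client is assumed.

```python
# Minimal sketch of the restart-on-silent-exit approach; NOT the actual Prefect
# implementation. `batch_client`, `job_name`, `namespace`, and `deadline` are
# hypothetical names; assumes the `kubernetes` Python client.
import time

from kubernetes.watch import Watch


def watch_until_complete(batch_client, job_name, namespace, deadline=None):
    """Watch a job, restarting the watch whenever the stream exits silently."""
    while True:
        if deadline is not None and time.monotonic() > deadline:
            return None  # job_watch_timeout_seconds elapsed without completion

        watch = Watch()
        # The watch stream can exit without raising once the kubelet's
        # streaming-connection-idle-timeout (default 4h) is hit, so instead of
        # treating that as a failure we simply start a fresh watch.
        for event in watch.stream(
            batch_client.list_namespaced_job,
            namespace=namespace,
            field_selector=f"metadata.name={job_name}",
        ):
            job = event["object"]
            # Note: completion_time is only set when a job succeeds; a real
            # implementation would also want to inspect job.status.failed.
            if job.status.completion_time is not None:
                watch.stop()
                return job
```

A deadline derived from job_watch_timeout_seconds keeps the outer loop from restarting the watch forever if the job never completes.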

The behavior I've observed by running a k8s agent using these changes is the following:

  • job execution proceeds as normal for the first 4 hours
  • an InvalidChunkLength exception is thrown at around 4 hours when calling logs.stream(), previously called here
  • the exception is surfaced in the agent logs
  • the job continues, and flow run logs remain visible in the agent logs and in the UI

I have not directly reproduced the case where job tracking is lost and Job did not complete is logged prematurely, although we have received multiple reports of this case.

Background

The default value of a kubelet's --streaming-connection-idle-timeout is 4 hours.

The value of streaming-connection-idle-timeout can be configured by adjusting the kubelet config. Note, however, that it is generally not recommended to disable this timeout (by setting a value of 0), as doing so opens the door to denial-of-service attacks, inactive connections persisting indefinitely, and running out of ephemeral ports.

thanks to @naveedhd and @john-jam for the helpful context

Example

Checklist

  • This pull request references any related issue by including "closes <link to issue>"
    • If no issue exists and your change is not a small fix, please create an issue first.
  • This pull request includes tests or only affects documentation.
  • This pull request includes a label categorizing the change e.g. fix, feature, enhancement

@netlify
netlify bot commented Jan 18, 2023

Deploy Preview for prefect-orion ready!

🔨 Latest commit 946084c
🔍 Latest deploy log https://app.netlify.com/sites/prefect-orion/deploys/63d2b8df8011b00008352c42
😎 Deploy Preview https://deploy-preview-8189--prefect-orion.netlify.app/api-ref/prefect/infrastructure

@zzstoatzz zzstoatzz changed the title from "Restart k8s job watch" to "Except timeout errors while streaming pod logs" Jan 25, 2023
@zzstoatzz zzstoatzz marked this pull request as ready for review January 25, 2023 22:13
@zzstoatzz zzstoatzz requested a review from a team as a code owner January 25, 2023 22:13
@zanieb
Contributor

zanieb commented Jan 25, 2023

@zzstoatzz this makes sense, but is the Watch also not a streaming operation? We have clear reports that Job did not complete was logged, which only occurs at https://github.com/PrefectHQ/prefect/blob/main/src/prefect/infrastructure/kubernetes.py#L608 and is not covered by this change.

@zzstoatzz
Collaborator Author

zzstoatzz commented Jan 25, 2023

@madkinsz I see. So would we want to re-introduce the while around the Watch, like we had earlier in this PR, in order to restart the watch?

I haven't been able to reproduce that exact situation, so I'm not sure how we'd want to differentiate, in the for's else, between a genuine failure and a failure that would be handled by restarting the job watch.

@john-jam
Contributor

john-jam commented Jan 26, 2023

Adding more info if that could help.

Inspired by the previous revisions of this PR, we applied those updates with stream_output=False and fixed the false-positive Job did not complete error.

After a while, the for event in watch.stream(...) loop stops waiting for events even though the job has not finished yet; it exits without an exception. By retrying it until event["object"].status.completion_time is not None, I think we can catch both genuinely failing jobs and genuinely succeeded jobs.
We could then decide whether or not to return -1 here based on event["object"].status.failed or event["object"].status.succeeded.
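
For illustration, a small sketch of the kind of check described above; interpret_job_event is a hypothetical helper (not Prefect's code), using the V1JobStatus fields of the kubernetes Python client:

```python
# Hypothetical helper (not Prefect's code) illustrating the suggestion above.
# `event` is an item yielded by watch.stream(batch_client.list_namespaced_job, ...).
def interpret_job_event(event):
    """Return -1 for a failed job, 0 for a succeeded job, None to keep watching."""
    status = event["object"].status  # a V1JobStatus
    if status.failed:  # number of failed pods, or None
        return -1
    if status.succeeded and status.completion_time is not None:
        # completion_time is only set when a job finishes successfully
        return 0
    # Still running, or the watch stream exited silently: keep (or restart) the watch.
    return None
```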

@zzstoatzz zzstoatzz added the fix label (A fix for a bug in an existing feature) Jan 26, 2023
@zzstoatzz zzstoatzz changed the title from "Except timeout errors while streaming pod logs" to "allow KubernetesJob._watch_job to track long-running jobs" Jan 26, 2023
@naveedhd

naveedhd commented Jan 26, 2023

@zzstoatzz have you observed InvalidChunkLength with logs.stream() at around 4 hours? And do you know whether it is not raised when the job ends before 4 hours?

In my setup, I used the watch interface to stream logs, and in both cases (the pod ending before 4 hours, and the pod still running when the stream times out at 4 hours) there is no difference: the loop just finishes without any error or exception.

@zzstoatzz
Collaborator Author

hi @naveedhd

have you observed InvalidChunkLength with logs.stream() around 4 hours?

Yep, I have. I started seeing the exact error described in #7653 around 4 hours into flow runs.

do you know if this is not raised when the job ends before 4 hours?

I didn't notice this before ~4 hours into flow runs, but whether or not the job completes before 4 hours, our handling should remain consistent. That is, if stream_output=True (otherwise we won't attempt to stream pod logs at all) and we encounter an exception while attempting to stream pod logs, we should:

  • not crash the Prefect flow
  • stop sending pod logs to stdout

It's likely there is a more precise way of handling this, but I didn't want to be overly selective in the exception handling without more information on why exactly InvalidChunkLength occurs.
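
For reference, a rough sketch of this "warn instead of crash" handling around the pod log stream; the names (core_client, pod_name, namespace, logger) are hypothetical and the kubernetes Python client is assumed, so this is not the exact Prefect code:

```python
# Rough sketch only: hypothetical names, not the exact Prefect implementation.
def stream_pod_logs(core_client, pod_name, namespace, logger):
    logs = core_client.read_namespaced_pod_log(
        name=pod_name,
        namespace=namespace,
        follow=True,
        _preload_content=False,  # return the raw response so it can be streamed
    )
    try:
        for line in logs.stream():
            print(line.decode().rstrip())
    except Exception:
        # e.g. urllib3's InvalidChunkLength after ~4 hours of streaming:
        # stop forwarding logs, but let the job (and the job watch) continue.
        logger.warning(
            "Error occurred while streaming logs; the job will continue to run, "
            "but logs will no longer be streamed to stdout.",
            exc_info=True,
        )
```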

@zzstoatzz zzstoatzz merged commit a41e43c into main Jan 26, 2023
@zzstoatzz zzstoatzz deleted the restart-k8s-job-watch branch January 26, 2023 18:00
github-actions bot pushed a commit that referenced this pull request Jan 26, 2023
Co-authored-by: Alexander Streed <alex.s@prefect.io>
github-actions bot pushed a commit that referenced this pull request Jan 26, 2023
Co-authored-by: Alexander Streed <alex.s@prefect.io>
@john-jam
Contributor

john-jam commented Feb 1, 2023

@zzstoatzz @madkinsz Sorry to comment on a closed PR, but we still hit this issue (Job did not complete within timeout of Nones.) with prefect==2.7.10.
I think it's because of this else block. When the for event in watch.stream(...) exits silently (even though the job is still running), the else branch is executed and makes the agent stop watching the job; instead, it should go into another iteration of the wrapping while not completed: loop, IMHO.

@BitTheByte
Contributor

I can also confirm that this still exists on 2.7.10.

@EmilRex
Contributor

EmilRex commented Feb 1, 2023

I believe #8345 is related (if not the same exact issue).

@john-jam
Contributor

john-jam commented Feb 1, 2023

@EmilRex It is the exact same issue. This PR removes the problematic else statement and will hopefully be merged soon 🎉

zanieb pushed a commit that referenced this pull request Feb 6, 2023
Co-authored-by: Alexander Streed <alex.s@prefect.io>
zanieb pushed a commit that referenced this pull request Feb 6, 2023
Co-authored-by: Alexander Streed <alex.s@prefect.io>