
Long running kubernetes jobs are marked as crashed by the agent #8050

Closed · 4 tasks done · mmartsen opened this issue Jan 4, 2023 · 24 comments
Labels: bug (Something isn't working)

Comments
mmartsen commented Jan 4, 2023

First check

  • I added a descriptive title to this issue.
  • I used the GitHub search to find a similar issue and didn't find it.
  • I searched the Prefect documentation for this issue.
  • I checked that this issue is related to Prefect and not one of its dependencies.

Bug summary

I have a flow that runs a Spark job using the prefect-databricks connector. If the job runs for more than 4 hours, the flow in Prefect is marked as crashed after 4 hours plus 1-4 minutes.

Reproduction

# Kubernetes job infra defined as

k8s_job = KubernetesJob(
        namespace=namespace,
        image=image_name,
        image_pull_policy=KubernetesImagePullPolicy.ALWAYS,
        finished_job_ttl=300,
        service_account_name="prefect-server",
        job_watch_timeout_seconds=180,
        customizations={ .. volumes & secrets here },
        env={
            .. bunch of envs here.. 
        }
    )

# the image is custom-built on top of the prefect 2.7.4 base image:

ARG PREFECT_VERSION=2.7.4
ARG PYTHON_VERSION=python3.10
FROM prefecthq/prefect:${PREFECT_VERSION}-${PYTHON_VERSION}


# databricks job is submitted like this:

@flow
def databricks_run_reload(notebook_path, params):

    # compile job task settings
    job_task_settings = JobTaskSettings(
        new_cluster=new_cluster,
        notebook_task=spark_python_notebook_task,
        task_key=task_key,
        timeout_seconds=86400,
        libraries=[]
    )

    idempotency_key = str(uuid.uuid4())
    multi_task_runs = jobs_runs_submit_and_wait_for_completion(
        databricks_credentials=databricks_credentials,
        run_name=run_name,
        git_source=git,
        tasks=[job_task_settings],
        max_wait_seconds=86400,
        idempotency_token=idempotency_key
    )
    return multi_task_runs

# and used in other flow:

@flow
# @flow(persist_result=True, result_storage=S3.load(config.DEFAULT_S3_BLOCK_NAME)) ## tried this as well, but no difference
def do_reload():
    spark = databricks_run_reload(notebook_path="some_notebook", params=some_params)
    updated = execute_dml(merge_query, params=query_params, wait_for=[spark])

Error

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/prefect/engine.py", line 637, in orchestrate_flow_run
    result = await run_sync(flow_call)
  File "/usr/local/lib/python3.10/site-packages/prefect/utilities/asyncutils.py", line 91, in run_sync_in_worker_thread
    return await anyio.to_thread.run_sync(
  File "/usr/local/lib/python3.10/site-packages/anyio/to_thread.py", line 31, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "/usr/local/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 937, in run_sync_in_worker_thread
    return await future
  File "/usr/local/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 867, in run
    result = context.run(func, *args)
  File "/opt/prefect/flows/spark.py", line 45, in do_reload
    updated = execute_dml(query, params=query_params, wait_for=[spark])
  File "/usr/local/lib/python3.10/site-packages/prefect/tasks.py", line 436, in __call__
    return enter_task_run_engine(
  File "/usr/local/lib/python3.10/site-packages/prefect/engine.py", line 927, in enter_task_run_engine
    return run_async_from_worker_thread(begin_run)
  File "/usr/local/lib/python3.10/site-packages/prefect/utilities/asyncutils.py", line 177, in run_async_from_worker_thread
    return anyio.from_thread.run(call)
  File "/usr/local/lib/python3.10/site-packages/anyio/from_thread.py", line 49, in run
    return asynclib.run_async_from_thread(func, *args)
  File "/usr/local/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 970, in run_async_from_thread
    return f.result()
  File "/usr/local/lib/python3.10/concurrent/futures/_base.py", line 458, in result
    return self.__get_result()
  File "/usr/local/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/usr/local/lib/python3.10/site-packages/prefect/engine.py", line 1068, in get_task_call_return_value
    return await future._result()
  File "/usr/local/lib/python3.10/site-packages/prefect/futures.py", line 237, in _result
    return await final_state.result(raise_on_failure=raise_on_failure, fetch=True)
  File "/usr/local/lib/python3.10/site-packages/prefect/states.py", line 100, in _get_state_result
    raise MissingResult(
prefect.exceptions.MissingResult: State data is missing. Typically, this occurs when result persistence is disabled and the state has been retrieved from the API.

Versions

Version:             2.7.4
API version:         0.8.4
Python version:      3.10.8
Git commit:          30db76e7
Built:               Thu, Dec 22, 2022 2:55 PM
OS/Arch:             linux/x86_64
Profile:             dev
Server type:         hosted

Additional context

No response

mmartsen added the bug (Something isn't working) and status:triage labels Jan 4, 2023
mmartsen (Author) commented Jan 8, 2023

I can confirm the same issue happens on 2.7.7, but there is no stack trace anymore. It's also interesting that one of the Databricks subflows (the one that monitors the job status) keeps running fine, while the parent flow is marked as crashed.

zanieb (Contributor) commented Jan 8, 2023

Thanks for the additional details! I'm investigating this issue and have added logs in #8097 to give us some insight into the suspicious code path.

zanieb (Contributor) commented Jan 8, 2023

Also, since this is occurring in enter_task_run_engine/get_task_call_return_value, this traceback looks like a failure in the flow run itself. Can you share more details about how this is ending up in a CRASHED state?

mmartsen (Author) commented Jan 8, 2023

It would help if you could give me a hint on how to get more useful info for debugging. There is nothing like a flow error in the logs (I also checked the agent logs), and the K8s job is still alive and working just fine.
I can try to reproduce it with just a single long-running Databricks flow (i.e. no other tasks/subflows etc.). Will that help?

mmartsen (Author) commented Jan 9, 2023

Found this in agent logs:

20:22:28.756 | ERROR | prefect.infrastructure.kubernetes-job - Job 'psi3-albali-jstdh': Job did not complete.
20:22:28.781 | INFO | prefect.agent - Reported flow run 'f89cf54a-388a-48f9-bfe7-72f073243cb3' as crashed: Flow run infrastructure exited with non-zero status code -1.

The job is still alive and working, though.

zanieb (Contributor) commented Jan 9, 2023

@mmartsen since you set job_watch_timeout_seconds=180, the agent will report the job as crashed if it does not see activity within 180 seconds. When a job runs for a long time, Kubernetes will not send any events in that window. We recently changed the default to None in #7786 to avoid this.
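
A minimal sketch of that configuration, reusing the KubernetesJob block from the reproduction above (the namespace and image values are hypothetical): leaving job_watch_timeout_seconds unset, or setting it to None, keeps the agent from treating a quiet watch as a crash.

from prefect.infrastructure import KubernetesJob

k8s_job = KubernetesJob(
    namespace="my-namespace",        # hypothetical values for illustration
    image="my-registry/my-image:tag",
    finished_job_ttl=300,
    job_watch_timeout_seconds=None,  # don't time out the job watch
)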

mmartsen (Author) commented:

Tried without the job_watch_timeout_seconds setting - still the same. The parent flow is marked as crashed after 4+ hours, while the subflows keep running fine.

zanieb (Contributor) commented Jan 11, 2023

We have a separate report that at ~4 hours the Kubernetes job watcher will mark the pod as crashed. I have no idea why 4 hours is the magic number. Perhaps it's an issue with the Kubernetes client implementation? We'll definitely need to add some sort of workaround here.

zanieb (Contributor) commented Jan 11, 2023

We are presuming this is a bug with the Watch implementation in the Kubernetes library, where it does not handle dropped connections; we'll need to add more robust wrapping to handle the failure modes there.

jmprovencher commented:

> We have a separate report that at ~4 hours the Kubernetes job watcher will mark the pod as crashed. I have no idea why 4 hours is the magic number. Perhaps it's an issue with the Kubernetes client implementation? We'll definitely need to add some sort of workaround here.

I think a lot of users have Kubernetes jobs running for more than 4 hours; we'll definitely need a workaround.

zanieb (Contributor) commented Jan 12, 2023

Contributions are welcome here! We'll probably just want to move that watch out into a helper method that calls the watch repeatedly.

See related at kubernetes-client/csharp#486 (comment)

Ideally we'd use an Informer API, but the issue is quite stale kubernetes-client/python#868

We're likely to take this up ourselves next week.
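
A minimal sketch, under assumed names, of the "helper method that calls the watch repeatedly" idea mentioned above: keep re-opening the watch until the Job reports a terminal status, instead of treating a dropped stream as a crash. This is not Prefect's actual implementation, just an illustration using the kubernetes Python client.

from kubernetes import client, config, watch

def watch_job_until_complete(job_name: str, namespace: str) -> None:
    config.load_incluster_config()  # or config.load_kube_config() outside a pod
    batch_v1 = client.BatchV1Api()
    while True:
        # The stream can end early (e.g. kubelet/apiserver idle timeouts);
        # simply restart it and keep watching instead of giving up.
        for event in watch.Watch().stream(
            batch_v1.list_namespaced_job,
            namespace=namespace,
            field_selector=f"metadata.name={job_name}",
        ):
            status = event["object"].status
            if status.succeeded or status.failed:
                return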

hateyouinfinity (Contributor) commented:

Noticed kopf's implementation mentioned in a comment; it might be relevant.

naveedhd commented Jan 25, 2023

4 hours is the maximum time limit for streaming connections in kubelet, which causes the pod log loop to end even while the job/pod is still running.

zanieb (Contributor) commented Mar 10, 2023

We've made significant progress here, but we will also need to add handling for different resource versions before this is done.
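
A hedged sketch (again, not Prefect's actual code) of what "handling for different resource versions" can look like with the kubernetes client: track the last resourceVersion seen, resume the watch from it, and fall back to a fresh list when the server answers 410 Gone because the stored version is too old. The namespace and job name are hypothetical, and depending on the client version the 410 may surface slightly differently.

from kubernetes import client, config, watch
from kubernetes.client.exceptions import ApiException

def wait_for_job(job_name: str, namespace: str = "default") -> None:
    config.load_kube_config()
    batch_v1 = client.BatchV1Api()
    resource_version = None
    while True:
        try:
            for event in watch.Watch().stream(
                batch_v1.list_namespaced_job,
                namespace=namespace,
                field_selector=f"metadata.name={job_name}",
                resource_version=resource_version,  # resume where we left off
            ):
                job = event["object"]
                resource_version = job.metadata.resource_version
                if job.status.succeeded or job.status.failed:
                    return
        except ApiException as exc:
            if exc.status == 410:
                # The stored resourceVersion is too old; re-list from scratch.
                resource_version = None
                continue
            raise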

m-denton commented Aug 8, 2023

Is there any update on this initiative? Running into the same issue now.

klayhb commented Aug 30, 2023

We'd like to know the status too.

chrisguidry self-assigned this Sep 15, 2023
chrisguidry (Collaborator) commented:

Hi folks, I'm going to be looking into this. Thank you all for the info so far; the pointer to kubelet's default timeout really helps! It looks like that setting is streamingConnectionIdleTimeout, and I should be able to reproduce the problem in, well, less than 4 hours :D

Another article that I came across researching this: https://platform9.com/kb/kubernetes/kubectl-exec-is-timed-out-after-4-hours

I believe the fix would be to wrap the relevant Watch.stream calls in a loop that can handle being disconnected early. This could also help when there's cluster maintenance and kubelet needs to be restarted (which shouldn't affect any running pods).

I'll keep you posted.

chrisguidry (Collaborator) commented:

@m-denton and @klayhb, do you have any configuration details to share? Your Prefect and Kubernetes versions, and whether you're using the prefect_kubernetes.KubernetesWorker worker or the prefect.infrastructure.kubernetes.KubernetesJob? Any logs you have from your agent/worker would be helpful.

I see several changes in the past 8 months that may have improved this situation, so I'd love to know if you're still experiencing this and on what versions.

I'm not having much luck reproducing this on minikube by lowering the kubelet or apiserver timeouts. I'm next going to try just reproducing this at 4 hours with a flow running in the background.

klayhb commented Sep 15, 2023 via email

chrisguidry (Collaborator) commented:

Update: I set up a flow run that runs for many hours and I couldn't reproduce this on minikube. I'm wondering if there are some other complicating factors here.

A few ideas/questions:

  • Can you share your versions of prefect and kubernetes (pip freeze | grep "prefect\|kubernetes")
  • Which kind of Kubernetes cluster are you running on, and can you share the version? (kubectl version)
  • Is it a whole flow or a single task that lasts 4 hours?
  • For the flow/task that runs for that long, does it log any output during that time? One idea I'm exploring is whether the pod's lack of logs may cause something to timeout the watch
  • Are you running the Prefect Agent (prefect agent start ...) or a Prefect KubernetesWorker (prefect worker start ...)?

Thanks for your patience while we look into this; it's proving stubborn to reproduce.
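
As an illustration, here is a hedged sketch of a reproduction flow along the lines described: a run that stays busy for several hours without emitting any logs, to exercise the job watch across the kubelet's streaming idle timeout. The flow name and duration are made up for the example.

import time

from prefect import flow

@flow
def quiet_long_runner(hours: float = 5):
    # Sleep in small increments so the run stays alive but completely silent.
    deadline = time.monotonic() + hours * 3600
    while time.monotonic() < deadline:
        time.sleep(60)

if __name__ == "__main__":
    quiet_long_runner()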

chrisguidry added a commit to chrisguidry/prefect-coolness that referenced this issue Sep 18, 2023
klayhb commented Sep 18, 2023

@chrisguidry sorry, there's been some confusion on our part, since we mixed up this issue with one that I opened myself: #10620.

I wrongly assumed it might be related to our issue, but it looks like it's a separate thing, so I'm probably not the one to ask for details on this. I'll give the stage to whoever else reported it.

chrisguidry (Collaborator) commented:

No worries! I'll take a few more stabs at reproducing this and if I don't get any more info, I'll close it out in the next day or so.

chrisguidry (Collaborator) commented:

Update: still no repro of this issue with a flow that doesn't log. It ran fine for 5 hours without issue against minikube.

@mmartsen or @m-denton, would either of you be able to share any details about your setup or what you're experiencing?

  • Can you share your versions of prefect and kubernetes (pip freeze | grep "prefect\|kubernetes")
  • Which kind of Kubernetes cluster are you running on, and can you share the version? (kubectl version)
  • Is it a whole flow or a single task that lasts 4 hours?
  • For the flow/task that runs for that long, does it log any output during that time? One idea I'm exploring is whether the pod's lack of logs may cause something to timeout the watch
  • Are you running the Prefect Agent (prefect agent start ...) or a Prefect KubernetesWorker (prefect worker start ...)?

chrisguidry (Collaborator) commented:

After two more attempts yesterday, I was unable to reproduce this. Please feel free to reopen this issue if you have more information to help with debugging, including the software versions involved and perhaps a rough idea of the network topology you're using to access the cluster.
