
Long running kubernetes jobs are marked as crashed by the agent #8050

Closed · 4 tasks done · mmartsen opened this issue Jan 4, 2023 · 24 comments
Labels: bug (Something isn't working)

Comments
mmartsen commented Jan 4, 2023

First check

  • I added a descriptive title to this issue.
  • I used the GitHub search to find a similar issue and didn't find it.
  • I searched the Prefect documentation for this issue.
  • I checked that this issue is related to Prefect and not one of its dependencies.

Bug summary

I have a flow that runs a Spark job using the prefect-databricks connector. If the job runs for more than 4 hours, the flow in Prefect is marked as crashed after 4 hours plus 1-4 minutes.

Reproduction

# Kubernetes job infra defined as

k8s_job = KubernetesJob(
        namespace=namespace,
        image=image_name,
        image_pull_policy=KubernetesImagePullPolicy.ALWAYS,
        finished_job_ttl=300,
        service_account_name="prefect-server",
        job_watch_timeout_seconds=180,
        customizations={ .. volumes & secrets here },
        env={
            .. bunch of envs here.. 
        }
    )

# the image is custom-built on top of the prefect 2.7.4 base image:

ARG PREFECT_VERSION=2.7.4
ARG PYTHON_VERSION=python3.10
FROM prefecthq/prefect:${PREFECT_VERSION}-${PYTHON_VERSION}


# databricks job is submitted like this:

@flow
def databricks_run_reload(notebook_path, params):

    # compile job task settings
    job_task_settings = JobTaskSettings(
        new_cluster=new_cluster,
        notebook_task=spark_python_notebook_task,
        task_key=task_key,
        timeout_seconds=86400,
        libraries=[]
    )

    idempotency_key = str(uuid.uuid4())
    multi_task_runs = jobs_runs_submit_and_wait_for_completion(
        databricks_credentials=databricks_credentials,
        run_name=run_name,
        git_source=git,
        tasks=[job_task_settings],
        max_wait_seconds=86400,
        idempotency_token=idempotency_key
    )
    return multi_task_runs

# and used in other flow:

@flow
# @flow(persist_result=True, result_storage=S3.load(config.DEFAULT_S3_BLOCK_NAME)) ## tried this as well, but no difference
def do_reload():
    spark = databricks_run_reload(notebook_path="some_notebook", params=some_params)
    updated = execute_dml(merge_query, params=query_params, wait_for=[spark])

Error

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/prefect/engine.py", line 637, in orchestrate_flow_run
    result = await run_sync(flow_call)
  File "/usr/local/lib/python3.10/site-packages/prefect/utilities/asyncutils.py", line 91, in run_sync_in_worker_thread
    return await anyio.to_thread.run_sync(
  File "/usr/local/lib/python3.10/site-packages/anyio/to_thread.py", line 31, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "/usr/local/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 937, in run_sync_in_worker_thread
    return await future
  File "/usr/local/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 867, in run
    result = context.run(func, *args)
  File "/opt/prefect/flows/spark.py", line 45, in do_reload
    updated = execute_dml(query, params=query_params, wait_for=[spark])
  File "/usr/local/lib/python3.10/site-packages/prefect/tasks.py", line 436, in __call__
    return enter_task_run_engine(
  File "/usr/local/lib/python3.10/site-packages/prefect/engine.py", line 927, in enter_task_run_engine
    return run_async_from_worker_thread(begin_run)
  File "/usr/local/lib/python3.10/site-packages/prefect/utilities/asyncutils.py", line 177, in run_async_from_worker_thread
    return anyio.from_thread.run(call)
  File "/usr/local/lib/python3.10/site-packages/anyio/from_thread.py", line 49, in run
    return asynclib.run_async_from_thread(func, *args)
  File "/usr/local/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 970, in run_async_from_thread
    return f.result()
  File "/usr/local/lib/python3.10/concurrent/futures/_base.py", line 458, in result
    return self.__get_result()
  File "/usr/local/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/usr/local/lib/python3.10/site-packages/prefect/engine.py", line 1068, in get_task_call_return_value
    return await future._result()
  File "/usr/local/lib/python3.10/site-packages/prefect/futures.py", line 237, in _result
    return await final_state.result(raise_on_failure=raise_on_failure, fetch=True)
  File "/usr/local/lib/python3.10/site-packages/prefect/states.py", line 100, in _get_state_result
    raise MissingResult(
prefect.exceptions.MissingResult: State data is missing. Typically, this occurs when result persistence is disabled and the state has been retrieved from the API.

Versions

Version:             2.7.4
API version:         0.8.4
Python version:      3.10.8
Git commit:          30db76e7
Built:               Thu, Dec 22, 2022 2:55 PM
OS/Arch:             linux/x86_64
Profile:             dev
Server type:         hosted

Additional context

No response

mmartsen added the bug (Something isn't working) and status:triage labels Jan 4, 2023
mmartsen (Author) commented Jan 8, 2023

I can confirm the same issue happens on 2.7.7, but there is no stack trace anymore. It's also interesting that one of the Databricks subflows (the one that monitors the job status) keeps running fine, while the parent flow is marked as crashed.

zanieb (Contributor) commented Jan 8, 2023

Thanks for the additional details! I'm investigating this issue and have added logs in #8097 to give us some insight into the suspicious code path.

zanieb (Contributor) commented Jan 8, 2023

Also, since this is occurring in enter_task_run_engine/get_task_call_return_value, this traceback looks like a failure in the flow run itself. Can you share more details about how this is ending up in a CRASHED state?

mmartsen (Author) commented Jan 8, 2023

It would help if you could give me a hint on how to get more useful info for debugging. There is nothing like a flow error in the logs (I also checked the agent logs), and the K8s job is still alive and working just fine.
I can try to reproduce it with just a single long-running Databricks flow (i.e. no other tasks/subflows etc.). Will that help?

mmartsen (Author) commented Jan 9, 2023

Found this in agent logs:

20:22:28.756 | ERROR | prefect.infrastructure.kubernetes-job - Job 'psi3-albali-jstdh': Job did not complete.
20:22:28.781 | INFO | prefect.agent - Reported flow run 'f89cf54a-388a-48f9-bfe7-72f073243cb3' as crashed: Flow run infrastructure exited with non-zero status code -1.

The job is still alive and working, though.

zanieb (Contributor) commented Jan 9, 2023

@mmartsen since you set job_watch_timeout_seconds=180, the agent will report the job as crashed if it does not see activity within 180 seconds. When a job runs for a long time, Kubernetes will not send any events in that window. We recently changed the default to None in #7786 to avoid this.
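
A minimal sketch of that configuration, reusing the KubernetesJob block from the reproduction above (the namespace and image values are hypothetical): leaving job_watch_timeout_seconds unset, or setting it to None, keeps the agent from treating a quiet watch as a crash.

from prefect.infrastructure import KubernetesJob

k8s_job = KubernetesJob(
    namespace="my-namespace",        # hypothetical values for illustration
    image="my-registry/my-image:tag",
    finished_job_ttl=300,
    job_watch_timeout_seconds=None,  # don't time out the job watch
)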

mmartsen (Author) commented:

Tried without the job_watch_timeout_seconds setting - still the same. The parent flow is marked as crashed after 4+ hours, while the subflows keep running fine.

zanieb (Contributor) commented Jan 11, 2023

We have a separate report that at ~4 hours the Kubernetes job watcher will mark the pod as crashed. I have no idea why 4 hours is the magic number. Perhaps it's an issue with the Kubernetes client implementation? We'll definitely need to add some sort of workaround here.

zanieb (Contributor) commented Jan 11, 2023

We are presuming this is a bug with the Watch implementation in the Kubernetes library, where it does not handle dropped connections; we'll need to add more robust wrapping to handle the failure modes there.

jmprovencher commented:

> We have a separate report that at ~4 hours the Kubernetes job watcher will mark the pod as crashed. I have no idea why 4 hours is the magic number. Perhaps it's an issue with the Kubernetes client implementation? We'll definitely need to add some sort of workaround here.

I think a lot of users have Kubernetes jobs running for more than 4 hours; we'll definitely need a workaround.

zanieb (Contributor) commented Jan 12, 2023

Contributions are welcome here! We'll probably just want to move that watch out into a helper method that calls the watch repeatedly.

See related at kubernetes-client/csharp#486 (comment)

Ideally we'd use an Informer API, but the issue is quite stale kubernetes-client/python#868

We're likely to take this up ourselves next week.
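
A minimal sketch, under assumed names, of the "helper method that calls the watch repeatedly" idea mentioned above: keep re-opening the watch until the Job reports a terminal status, instead of treating a dropped stream as a crash. This is not Prefect's actual implementation, just an illustration using the kubernetes Python client.

from kubernetes import client, config, watch

def watch_job_until_complete(job_name: str, namespace: str) -> None:
    config.load_incluster_config()  # or config.load_kube_config() outside a pod
    batch_v1 = client.BatchV1Api()
    while True:
        # The stream can end early (e.g. kubelet/apiserver idle timeouts);
        # simply restart it and keep watching instead of giving up.
        for event in watch.Watch().stream(
            batch_v1.list_namespaced_job,
            namespace=namespace,
            field_selector=f"metadata.name={job_name}",
        ):
            status = event["object"].status
            if status.succeeded or status.failed:
                return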

hateyouinfinity (Contributor) commented:

Noticed kopf's implementation mentioned in a comment; it might be relevant.

naveedhd commented Jan 25, 2023

4 hours is the maximum time limit for streaming connections in kubelet, which causes the pod log loop to end even while the job/pod is still running.

zanieb (Contributor) commented Mar 10, 2023

We've made significant progress here, but we will also need to add handling for different resource versions before this is done.
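
A hedged sketch (again, not Prefect's actual code) of what "handling for different resource versions" can look like with the kubernetes client: track the last resourceVersion seen, resume the watch from it, and fall back to a fresh list when the server answers 410 Gone because the stored version is too old. The namespace and job name are hypothetical, and depending on the client version the 410 may surface slightly differently.

from kubernetes import client, config, watch
from kubernetes.client.exceptions import ApiException

def wait_for_job(job_name: str, namespace: str = "default") -> None:
    config.load_kube_config()
    batch_v1 = client.BatchV1Api()
    resource_version = None
    while True:
        try:
            for event in watch.Watch().stream(
                batch_v1.list_namespaced_job,
                namespace=namespace,
                field_selector=f"metadata.name={job_name}",
                resource_version=resource_version,  # resume where we left off
            ):
                job = event["object"]
                resource_version = job.metadata.resource_version
                if job.status.succeeded or job.status.failed:
                    return
        except ApiException as exc:
            if exc.status == 410:
                # The stored resourceVersion is too old; re-list from scratch.
                resource_version = None
                continue
            raise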

m-denton commented Aug 8, 2023

Is there any update on this initiative? Running into the same issue now.

klayhb commented Aug 30, 2023

We'd like to know the status too.

chrisguidry self-assigned this Sep 15, 2023
chrisguidry (Collaborator) commented:

Hi folks, I'm going to be looking into this. Thank you all for the info so far; the pointer to kubelet's default timeout really helps! It looks like that setting is streamingConnectionIdleTimeout, and I should be able to reproduce the problem in, well, less than 4 hours :D

Another article that I came across researching this: https://platform9.com/kb/kubernetes/kubectl-exec-is-timed-out-after-4-hours

I believe the fix would be to wrap the relevant Watch.stream calls in a loop that can handle being disconnected early. This could also help when there's cluster maintenance and kubelet needs to be restarted (which shouldn't affect any running pods).

I'll keep you posted.

chrisguidry (Collaborator) commented:

@m-denton and @klayhb, do you have any configuration details to share? Your Prefect and Kubernetes versions, and whether you're using the prefect_kubernetes.KubernetesWorker worker or the prefect.infrastructure.kubernetes.KubernetesJob? Any logs you have from your agent/worker would be helpful.

I see several changes in the past 8 months that may have improved this situation, so I'd love to know if you're still experiencing this and on what versions.

I'm not having much luck reproducing this on minikube by lowering the kubelet or apiserver timeouts. I'm next going to try just reproducing this at 4 hours with a flow running in the background.

klayhb commented Sep 15, 2023 via email

chrisguidry (Collaborator) commented:

Update: I set up a flow run that runs for many hours and I couldn't reproduce this on minikube. I'm wondering if there are some other complicating factors here.

A few ideas/questions:

  • Can you share your versions of prefect and kubernetes (pip freeze | grep "prefect\|kubernetes")
  • Which kind of Kubernetes cluster are you running on, and can you share the version? (kubectl version)
  • Is it a whole flow or a single task that lasts 4 hours?
  • For the flow/task that runs for that long, does it log any output during that time? One idea I'm exploring is whether the pod's lack of logs may cause something to timeout the watch
  • Are you running the Prefect Agent (prefect agent start ...) or a Prefect KubernetesWorker (prefect worker start ...)?

Thanks for your patience while we look into this; it's proving stubborn to reproduce.
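
As an illustration, here is a hedged sketch of a reproduction flow along the lines described: a run that stays busy for several hours without emitting any logs, to exercise the job watch across the kubelet's streaming idle timeout. The flow name and duration are made up for the example.

import time

from prefect import flow

@flow
def quiet_long_runner(hours: float = 5):
    # Sleep in small increments so the run stays alive but completely silent.
    deadline = time.monotonic() + hours * 3600
    while time.monotonic() < deadline:
        time.sleep(60)

if __name__ == "__main__":
    quiet_long_runner()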

chrisguidry added a commit to chrisguidry/prefect-coolness that referenced this issue Sep 18, 2023
klayhb commented Sep 18, 2023

@chrisguidry sorry, there's been some confusion on our part, since we mixed up this issue with one that I opened myself: #10620.

I wrongly assumed it might be related to our issue, but it looks like it's a separate thing, so I'm probably not the one to ask for details on this. I'll give the stage to whoever else reported it.

chrisguidry (Collaborator) commented:

No worries! I'll take a few more stabs at reproducing this and if I don't get any more info, I'll close it out in the next day or so.

chrisguidry (Collaborator) commented:

Update: still no repro of this issue with a flow that doesn't log. It ran fine for 5 hours without issue against minikube.

@mmartsen or @m-denton, would either of you be able to share any details about your setup or what you're experiencing?

  • Can you share your versions of prefect and kubernetes (pip freeze | grep "prefect\|kubernetes")
  • Which kind of Kubernetes cluster are you running on, and can you share the version? (kubectl version)
  • Is it a whole flow or a single task that lasts 4 hours?
  • For the flow/task that runs for that long, does it log any output during that time? One idea I'm exploring is whether the pod's lack of logs may cause something to timeout the watch
  • Are you running the Prefect Agent (prefect agent start ...) or a Prefect KubernetesWorker (prefect worker start ...)?

chrisguidry (Collaborator) commented:

After two more attempts yesterday, I was unable to reproduce this. Please feel free to reopen this issue if you have more information to help with debugging, including the software versions involved and perhaps a rough idea of the network topology you're using to access the cluster.
