
Kubernetes Flows / Workers do not gracefully handle evictions #12988

Open · meggers opened this issue Mar 7, 2024 · 1 comment
Assignees: gabcoyne
Labels: enhancement (An improvement of an existing feature)

Comments

@meggers

meggers commented Mar 7, 2024

We see that when a pod for a flow started by a Kubernetes worker is evicted, the flow run is left stuck in a Running state. Furthermore, cancelling a flow in this state leaves it stuck in Cancelling. This is a very troubling scenario because evictions are a routine, expected event in Kubernetes.

Additionally, we have tried setting timeout and retry annotations on our flows, but they are not respected when the pod is evicted.

Expectation / Proposal

I expect the flow run to enter a Crashed state (or be retried automatically): either the flow itself should handle the eviction gracefully and report back Crashed, or the worker should detect the evicted pod and mark the run Crashed or retry it. The worker should also handle evictions gracefully to cover the scenario where the worker itself is restarted while monitoring flows.
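
For illustration, one way the worker side could notice an evicted flow pod (a hypothetical sketch, not Prefect's actual worker code; the pod and namespace names are placeholders):

# Hypothetical sketch: check whether the flow pod was evicted.
# Not the actual worker code; pod/namespace names are placeholders.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

pod = v1.read_namespaced_pod(name="flow-pod-name", namespace="your-namespace")
# Node-pressure evictions leave the pod in phase Failed with reason "Evicted";
# API-initiated evictions delete the pod, which shows up as a 404 on read.
if pod.status.phase == "Failed" and (pod.status.reason or "") == "Evicted":
    print("Pod was evicted; this is where the worker could report the run as Crashed.")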

We also expect that timeouts/retry annotations apply in the case of evictions.
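
For reference, a minimal sketch of the kind of flow-level retry/timeout settings in question (illustrative values; assuming Prefect's flow-level parameters rather than Kubernetes job annotations):

from time import sleep
from prefect import flow

# Illustrative values only: these are the settings we would expect to be
# respected (retry or time out the run) when the flow pod is evicted.
@flow(log_prints=True, retries=2, retry_delay_seconds=60, timeout_seconds=900)
def test_long_running_job():
    sleep(630)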

Traceback / Example

Flow:
prefect-client==2.16.2

K8s Worker:
prefect-client==2.16.2
prefect-kubernetes==0.3.5

Steps to reproduce:

  1. Create a deployment:
from time import sleep
from prefect import flow

@flow(log_prints=True)
def test_long_running_job():
    sleep(630)

if __name__ == "__main__":
    test_long_running_job.deploy(
        name="long-running-job", 
        work_pool_name="my-workpool", 
        image="my/image"
    )
  2. Run the deployment:
    prefect deployment run 'test_long_running_job/long-running-job'

  3. Wait a few minutes (9 minutes, in my tests), then evict the flow pod:

from kubernetes import client, config

k8s_config_file = "your/config/file"
cluster = "your-cluster-context"
namespace = "your-namespace"
pod_name = "flow-pod-name"

config.load_kube_config(config_file=k8s_config_file, context=cluster)
v1 = client.CoreV1Api()
# Evict the flow pod via the Kubernetes Eviction API (the same API used by kubectl drain).
v1.create_namespaced_pod_eviction(
    name=pod_name,
    namespace=namespace,
    body=client.V1Eviction(metadata=client.V1ObjectMeta(name=pod_name))
)
  4. Observe that the flow remains in Running. Worker logs are shown below (note there are no entries indicating the worker observed the eviction); a sketch for checking the flow run state programmatically follows the screenshot.

[screenshot: worker logs]
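
To confirm the stuck state programmatically, here is a minimal sketch using the Prefect client (the flow run ID is a placeholder):

import asyncio
from prefect import get_client

async def check_flow_run(flow_run_id: str):
    # Read the flow run and print its current state; after the eviction it
    # still reports RUNNING rather than CRASHED.
    async with get_client() as client:
        flow_run = await client.read_flow_run(flow_run_id)
        print(flow_run.state.type, flow_run.state.name)

asyncio.run(check_flow_run("your-flow-run-id"))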

@urimandujano urimandujano added the enhancement An improvement of an existing feature label Mar 7, 2024
@gabcoyne gabcoyne self-assigned this Apr 2, 2024
@meggers
Author

meggers commented Apr 25, 2024

A couple of notes:

This continues to happen almost daily for us. It is becoming a larger issue.

I notice that pod_watch_timeout_seconds is passed to KubernetesEventsReplicator, where it ends up being consumed by the kubernetes watch.stream call. I think this is incorrect: job_watch_timeout_seconds should be passed instead, and only if it is set; if it is not set, there should be no timeout at all. See https://github.com/kubernetes-client/python/blob/master/examples/watch/timeout-settings.md
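
For context, a minimal sketch of the server-side watch timeout described in the linked example (the namespace is a placeholder):

from kubernetes import client, config, watch

config.load_kube_config()
v1 = client.CoreV1Api()
w = watch.Watch()

# With timeout_seconds set, the server closes the watch after ~60s and the
# loop exits. Omitting the argument keeps the watch open until the server
# ends it, which is the behavior I'd expect when no job timeout is configured.
for event in w.stream(v1.list_namespaced_pod, namespace="your-namespace", timeout_seconds=60):
    print(event["type"], event["object"].metadata.name)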

I also notice that there are two watches: one from KubernetesEventsReplicator and one from _watch_job. Are both necessary?

In a recent case, we see that _replicate_pod_events failed after initially submitting the Job.
[screenshot: _replicate_pod_events failure]

@desertaxle desertaxle transferred this issue from PrefectHQ/prefect-kubernetes Apr 26, 2024