Marked Failed by a Zombie Killer process #1954
I noticed similar issues have been raised, but below is a simple reproducible example with the flow's script and the worker's spec (a representative sketch of both follows). Running the flow, my tasks get marked Failed by a Zombie Killer process.
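(The reporter's actual snippets are not reproduced above, so the following is only a representative sketch of that kind of setup: a mapped Prefect 0.x flow plus a Dask worker pod spec. Every name, image, and resource value here is an illustrative assumption, not the original code.)

```python
from prefect import Flow, task
from dask_kubernetes import make_pod_spec

@task
def transform(x):
    # Stand-in for the real work; the actual task body was not shown.
    return x + 1

with Flow("zombie-repro") as flow:
    transform.map(list(range(10)))

# Hypothetical Dask worker pod spec; note the 2-CPU request, which
# turns out to matter later in the thread.
worker_spec = make_pod_spec(
    image="my-registry.example.com/zombie-repro:latest",
    cpu_request="2",
    cpu_limit="2",
)
```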
Hi @marwan116 - first question for you: does the Docker image you're using have an init process like tini? The way heartbeats currently work in Prefect is that each individual task creates a small subprocess that polls the Cloud API on a 30-second loop. If a heartbeat is not received for 2 minutes, the task is considered a "Zombie". It seems we have a subset of users who are disproportionately affected by the presence of Zombies, so we're trying to isolate the cause. One theory is that Docker images without good process cleanup run into resource bottlenecks, preventing the subprocesses from running correctly. Because the heartbeat is sent via a subprocess, though, it shouldn't interfere with your task's runtime logic in any way.

Another question for you: does your Flow run successfully 100% of the time whenever you disable heartbeats? You can do this by navigating to your Flow page and clicking the far-right "Settings" tab. You should see a toggle for heartbeats; if you toggle it off, these subprocesses will never be created. I'd be very curious to know the result.
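(A rough sketch of the mechanism described above, for illustration only; this is not Prefect's actual implementation, and `post_heartbeat` is a hypothetical stand-in for the Cloud API call.)

```python
import time
from multiprocessing import Process

HEARTBEAT_INTERVAL = 30  # seconds between pings, per the description above
ZOMBIE_THRESHOLD = 120   # the server marks the task a "Zombie" after 2 minutes of silence

def post_heartbeat(task_run_id):
    # Hypothetical stand-in for the real Cloud API call.
    print(f"heartbeat for {task_run_id}")

def beat(task_run_id):
    while True:
        post_heartbeat(task_run_id)
        time.sleep(HEARTBEAT_INTERVAL)

if __name__ == "__main__":
    # Each task run spawns a subprocess like this. If the child can never be
    # scheduled (e.g. the container is starved of CPU, or dead processes pile
    # up because there is no init/reaper such as tini), no heartbeat arrives
    # and the server marks the task Failed, even though the task body itself
    # may still be running fine.
    heartbeat = Process(target=beat, args=("example-task-run",), daemon=True)
    heartbeat.start()
```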
Hi @cicdw
I believe so - I am building the image in the above code using Prefect's Docker storage, which uses the prefect image (which has tini) as a base image and pip-installs the specified dependencies - correct?
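(For reference, the pattern being described looks roughly like this; the registry and dependencies are made up.)

```python
from prefect.environments.storage import Docker

# Sketch: with no base_image given, Docker storage builds on a matching
# prefecthq/prefect base image (which, per the discussion above, ships
# with tini) and pip-installs the listed dependencies on top.
storage = Docker(
    registry_url="my-registry.example.com",  # hypothetical registry
    python_dependencies=["pandas"],          # whatever the flow needs
)
# This would then be attached to the flow as `flow.storage = storage`.
```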
Thank you for recommending this - when I toggle heartbeats off, my flow still fails. Sorry about the false alarm; this doesn't seem to be a heartbeat-related problem. Thanks for explaining the potential pitfalls of the current design. Inspecting the logs, I think I see the problem: "insufficient cpu". I will have to reason about whether that makes sense given the available resources in the cluster and get back to you. See the kubectl event logs below for more information.
@cicdw - I got the problem resolved by lowering the CPU requirement from 2 to 1. The problem was purely "resource" related ...
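(In terms of the hypothetical worker spec sketched earlier, the fix amounts to the following.)

```python
from dask_kubernetes import make_pod_spec

# Lowering the CPU request from 2 to 1 lets the scheduler place the
# worker pod on the available nodes, clearing the "insufficient cpu" event.
worker_spec = make_pod_spec(
    image="my-registry.example.com/zombie-repro:latest",  # hypothetical
    cpu_request="1",
    cpu_limit="1",
)
```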
Awesome, thanks for the update @marwan116, and glad you were able to figure it out!
Archived from the Prefect Public Slack Community
braun: I am seeing a situation where a task is getting zombie-killed after about 4 mins of running, but then that task gets set to Success when the actual task on the remote environment completes and reports success
chris: Hey Braun this is super interesting! First question I have: do you see any logs for that task related to “heartbeats”?
braun: no i do not
braun: last log before the kill is
Task 'refresh_renewals_data[5]': Calling task.run() method...
braun: this is a mapped task on a local dask cluster with 1 worker and 3 threads
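(For context, that configuration looks roughly like the following; the DaskExecutor wiring is an assumption about the setup, not taken from the thread.)

```python
from dask.distributed import Client, LocalCluster
from prefect.engine.executors import DaskExecutor

if __name__ == "__main__":
    # One worker process, three threads, as described above.
    cluster = LocalCluster(n_workers=1, threads_per_worker=3)
    client = Client(cluster)

    # Pointing Prefect's executor at the local cluster (assumed wiring).
    executor = DaskExecutor(address=cluster.scheduler_address)
```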
chris: interesting; so here’s what’s happening behind the scenes:
chris: for some reason the heartbeat thread “disappears” on certain dask configurations (specifically it seems when worker clients get involved), but we haven’t been able to track down the root cause yet
braun: mmm... I wonder if it has to do with the thread limit on the LocalCluster
chris: Yea it’s possible, but that thread limit shouldn’t apply to threads spawned by individual tasks
chris: <@ULVA73B9P> archive “Marked Failed by a Zombie Killer process”