Marked Failed by a Zombie Killer process #1954

Closed
marvin-robot opened this issue Jan 27, 2020 · 5 comments
Comments

@marvin-robot
Member

Archived from the Prefect Public Slack Community

braun: I am seeing a situation where a task is getting zombie-killed after about 4 mins of running, but then that task gets set to Success when the actual task on the remote environment completes and reports success

chris: Hey Braun this is super interesting! First question I have: do you see any logs for that task related to “heartbeats”?

braun: no i do not

braun: last log before the kill is Task 'refresh_renewals_data[5]': Calling task.run() method...

braun: this is a mapped task on a local dask cluster with 1 worker and 3 threads

chris: interesting; so here’s what’s happening behind the scenes:

  • each task run spawns a new thread which is responsible for polling the API every 30 seconds
  • if a task doesn’t send a heartbeat after 2 minutes, Cloud marks it as a “Zombie”
  • we’ve seen other dask users experience unexpected Zombies when using mapping before, and we’ve been investigating but it’s a deep rabbit hole
  • if this keeps happening for you, i recommend turning off heartbeats as per this doc: https://docs.prefect.io/cloud/concepts/flows.html#disable-heartbeats
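
A minimal sketch of the polling pattern chris describes (illustrative only; send_heartbeat is a placeholder callable, not Prefect's actual client API):

import threading


def start_heartbeat(send_heartbeat, interval=30):
    """Spawn a daemon thread that calls send_heartbeat() every `interval` seconds.

    Illustrative sketch only -- not Prefect's implementation. If this thread
    dies or never runs, the server stops receiving beats and, after roughly
    2 minutes of silence, would mark the task run a "Zombie".
    """
    stop = threading.Event()

    def beat():
        # Event.wait returns False on timeout, True once stop is set
        while not stop.wait(interval):
            send_heartbeat()

    threading.Thread(target=beat, daemon=True).start()
    return stop  # call stop.set() when the task finishes

If the dask configuration somehow prevents a background thread like this from running, no beats are sent even though the task itself is still making progress, which matches the behaviour braun describes.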

chris: for some reason the heartbeat thread “disappears” on certain dask configurations (specifically it seems when worker clients get involved), but we haven’t been able to track down the root cause yet

braun: mmm....I wonder if it has to do with the thread limit on the LocalCluster

chris: Yea it’s possible, but that thread limit shouldn’t apply to threads spawned by individual tasks

chris: <@ULVA73B9P> archive “Marked Failed by a Zombie Killer process”

@marwan116
Contributor

marwan116 commented May 21, 2020

I noticed similar issues have been raised, but below is a simple reproducible example:

here is the flow's script:

from prefect import Flow, Parameter, task
from prefect.environments import DaskKubernetesEnvironment
from prefect.environments.storage import Docker
from prefect.engine.results import S3Result
import numpy as np
import os


@task()
def form_array(size, num_files):
    array = [size for _ in range(num_files)]
    return array


@task()
def generate_data(size):
    """
    Creates data of size {{size}}GB and saves it to S3
    """
    # float64 elements are 64 bits (8 bytes): convert the size in GB to an element count
    size = int(np.round((size * 10**9) / (64 / 8)))
    x = np.random.normal(size=size)
    return x


@task()
def produce_feature(x):
    return x * 4


def main():
    s3_result = S3Result(
        bucket=os.environ['AWS_BUCKET'],
    )

    with Flow(
        "Data Processing",
        environment=DaskKubernetesEnvironment(
            worker_spec_file="worker_spec.yaml",
            min_workers=1,
            max_workers=3,
        ),
        storage=Docker(
            registry_url=os.environ['GITLAB_REGISTRY'],
            image_name="dask-k8s-flow",
            image_tag="0.1.0",
            python_dependencies=[
                'boto3==1.13.14',
                'numpy==1.18.4'
            ]
        ),
        result=s3_result,
    ) as flow:

        size = Parameter("size", default=0.05)
        num_files = Parameter("num_files", default=2)

        sizes = form_array(size=size, num_files=num_files)
        data = generate_data.map(sizes)
        produce_feature.map(data)

    flow.register('Test Project')


if __name__ == "__main__":
    main()

here is the worker's spec:

kind: Pod
metadata:
  labels:
    app: prefect-dask-worker
spec:
  replicas: 2  # note: not a standard Pod field; the worker count is controlled by min_workers/max_workers above
  restartPolicy: Never
  imagePullSecrets:
  - name: gitlab-secret
  containers:
    - image: registry.gitlab.com/xxxx
      imagePullPolicy: IfNotPresent
      args: [dask-worker, --nthreads, "2", --no-bokeh, --memory-limit, 4GB]
      name: dask-worker
      env:
        - name: AWS_BUCKET
          value: xxxxx
        - name: AWS_ACCESS_KEY_ID
          valueFrom:
            secretKeyRef:
              name: aws-secret
              key: AWS_ACCESS_KEY_ID
        - name: AWS_SECRET_ACCESS_KEY
          valueFrom:
            secretKeyRef:
              name: aws-secret
              key: AWS_SECRET_ACCESS_KEY
      resources:
        limits:
          cpu: "2"
          memory: 4G
        requests:
          cpu: "2"
          memory: 2G

Running the flow with size=0.05 (50MB) works fine.
Running the flow with size=0.25 (250MB) fails at produce_feature.

Is this because, while produce_feature is trying to download the data from S3, it can't respond to the heartbeat? (Trying to understand why that would be the case.)
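
For reference, a quick back-of-the-envelope check of what size=0.25 means under generate_data's formula (the peak-memory note is an estimate, not something confirmed in the thread):

# Same arithmetic as generate_data: float64 elements are 8 bytes each.
size_gb = 0.25
n_elements = int(round(size_gb * 10**9 / 8))  # 31,250,000 elements
array_bytes = n_elements * 8                  # ~0.25 GB per mapped task
# produce_feature computes x * 4, which allocates a second array of the same
# size, so each mapped pair peaks at roughly 0.5 GB against the 4GB
# --memory-limit in the worker spec.
print(f"{n_elements:,} elements, {array_bytes / 1e9:.2f} GB")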

@cicdw
Member

cicdw commented May 21, 2020

Hi @marwan116 - first question for you: does the docker image you're using have an init process like tini?

The way heartbeats currently work in Prefect is that each individual task creates a small subprocess that polls the Cloud API on a 30 second loop. If a heartbeat is not received for 2 minutes, the task is considered a "Zombie". It seems we have a subset of users who are disproportionately affected by the presence of Zombies, so we're trying to isolate the cause. One theory is that docker images without good process cleanup run into resource bottlenecks, preventing subprocesses from running correctly.

Because the heartbeat is sent via a subprocess, though, it shouldn't interfere with your task's runtime logic in any way.
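
A rough sketch of the subprocess approach described here (the "heartbeat" module name and the TASK_RUN_ID environment variable are placeholders, not Prefect's real entrypoints):

import os
import subprocess
import sys


def run_with_heartbeat(run_task):
    # Launch a separate process that sends beats on a 30-second loop; because
    # it is a different process, it keeps beating even while run_task() is
    # busy (e.g. blocked on an S3 download).
    proc = subprocess.Popen(
        [sys.executable, "-m", "heartbeat", os.environ["TASK_RUN_ID"]],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        return run_task()
    finally:
        # Without an init process such as tini to reap children, terminated
        # subprocesses like this one can linger and exhaust resources.
        proc.terminate()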

Another question for you is: does your Flow run successfully 100% of the time whenever you disable heartbeats? You can do this by navigating to your Flow page and clicking the far right "Settings" tab. You should see a toggle for heartbeats. If you toggle this off, these subprocesses will never be created.

I'd be very curious to know about tini and whether toggling off heartbeats works for you, as I'm currently considering a redesign of how we detect zombies and all user data helps!

@marwan116
Contributor

marwan116 commented May 22, 2020

Hi @cicdw

first question for you is does your docker image that you're using have an init process like tini?

I believe so - I am building the image in the above code using Prefect's Docker storage, which basically uses the prefect image (which has tini) as a base image and pip installs the specified dependencies - correct?

Another question for you is: does your Flow run successfully 100% of the time whenever you disable heartbeats? You can do this by navigating to your Flow page and clicking the far right "Settings" tab. You should see a toggle for heartbeats. If you toggle this off, these subprocesses will never be created.

Thank you for recommending this. When I toggle heartbeats off, my flow still fails, so sorry about the false alarm - this doesn't seem to be a heartbeat-related problem. Thanks for explaining the potential pitfalls of the current design.

Inspecting the logs, I think I see the problem: "Insufficient cpu". I will have to reason about whether that makes sense given the available resources in the cluster and get back to you.

See the kubectl event logs below for more information

6m46s       Warning   FailedScheduling   pod/dask-root-3c7ea69b-0csjjw                                     0/2 nodes are available: 2 Insufficient cpu.
6m46s       Normal    Created            pod/dask-root-3c7ea69b-0ssxgv                                     Created container dask-worker
6m45s       Normal    Started            pod/dask-root-3c7ea69b-0ssxgv                                     Started container dask-worker
6m44s       Warning   FailedScheduling   pod/dask-root-3c7ea69b-0csjjw                                     skip schedule deleting pod: default/dask-root-3c7ea69b-0csjjw
6m37s       Normal    Killing            pod/dask-root-3c7ea69b-0lpqk5                                     Stopping container dask-worker

@marwan116
Contributor

marwan116 commented May 22, 2020

@cicdw - I got the problem resolved by lowering the cpu requirements from 2 to 1

The problem was purely "resource" related ...

@cicdw
Member

cicdw commented May 22, 2020

Awesome thanks for the update @marwan116 , and glad you were able to figure it out!
