Marked Failed by a Zombie Killer process #1954

Closed
marvin-robot opened this issue Jan 27, 2020 · 5 comments
Comments

@marvin-robot
Member

Archived from the Prefect Public Slack Community

braun: I am seeing a situation where a task is getting zombie-killed after about 4 mins of running, but then that task gets set to Success when the actual task on the remote environment completes and reports success

chris: Hey Braun this is super interesting! First question I have: do you see any logs for that task related to “heartbeats”?

braun: no i do not

braun: last log before the kill is Task 'refresh_renewals_data[5]': Calling task.run() method...

braun: this is a mapped task on a local dask cluster with 1 worker and 3 threads

chris: interesting; so here’s what’s happening behind the scenes:

  • each task run spawns a new thread which is responsible for polling the API every 30 seconds
  • if a task doesn’t send a heartbeat after 2 minutes, Cloud marks it as a “Zombie”
  • we’ve seen other dask users experience unexpected Zombies when using mapping before, and we’ve been investigating but it’s a deep rabbit hole
  • if this keeps happening for you, i recommend turning off heartbeats as per this doc: https://docs.prefect.io/cloud/concepts/flows.html#disable-heartbeats
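
A minimal sketch of the polling pattern chris describes (illustrative only; send_heartbeat is a placeholder callable, not Prefect's actual client API):

import threading


def start_heartbeat(send_heartbeat, interval=30):
    """Spawn a daemon thread that calls send_heartbeat() every `interval` seconds.

    Illustrative sketch only -- not Prefect's implementation. If this thread
    dies or never runs, the server stops receiving beats and, after roughly
    2 minutes of silence, would mark the task run a "Zombie".
    """
    stop = threading.Event()

    def beat():
        # Event.wait returns False on timeout, True once stop is set
        while not stop.wait(interval):
            send_heartbeat()

    threading.Thread(target=beat, daemon=True).start()
    return stop  # call stop.set() when the task finishes

If the dask configuration somehow prevents a background thread like this from running, no beats are sent even though the task itself is still making progress, which matches the behaviour braun describes.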

chris: for some reason the heartbeat thread “disappears” on certain dask configurations (specifically it seems when worker clients get involved), but we haven’t been able to track down the root cause yet

braun: mmm....I wonder if it has to do with the thread limit on the LocalCluster

chris: Yea it’s possible, but that thread limit shouldn’t apply to threads spawned by individual tasks

chris: <@ULVA73B9P> archive “Marked Failed by a Zombie Killer process”

@marwan116
Contributor

marwan116 commented May 21, 2020

I noticed similar issues have been raised, but below is a simple reproducible example:

here is the flow's script:

from prefect import Flow, Parameter, task
from prefect.environments import DaskKubernetesEnvironment
from prefect.environments.storage import Docker
from prefect.engine.results import S3Result
import numpy as np
import os


@task()
def form_array(size, num_files):
    array = [size for _ in range(num_files)]
    return array


@task()
def generate_data(size):
    """
    Creates data of size {{size}}GB and saves it to S3
    """
    # float64 elements are 64 bits (8 bytes): convert the size in GB to an element count
    size = int(np.round((size * 10**9) / (64 / 8)))
    x = np.random.normal(size=size)
    return x


@task()
def produce_feature(x):
    return x * 4


def main():
    s3_result = S3Result(
        bucket=os.environ['AWS_BUCKET'],
    )

    with Flow(
        "Data Processing",
        environment=DaskKubernetesEnvironment(
            worker_spec_file="worker_spec.yaml",
            min_workers=1,
            max_workers=3,
        ),
        storage=Docker(
            registry_url=os.environ['GITLAB_REGISTRY'],
            image_name="dask-k8s-flow",
            image_tag="0.1.0",
            python_dependencies=[
                'boto3==1.13.14',
                'numpy==1.18.4'
            ]
        ),
        result=s3_result,
    ) as flow:

        size = Parameter("size", default=0.05)
        num_files = Parameter("num_files", default=2)

        sizes = form_array(size=size, num_files=num_files)
        data = generate_data.map(sizes)
        produce_feature.map(data)

    flow.register('Test Project')


if __name__ == "__main__":
    main()

here is the worker's spec:

kind: Pod
metadata:
  labels:
    app: prefect-dask-worker
spec:
  replicas: 2  # note: not a standard Pod field; the worker count is controlled by min_workers/max_workers above
  restartPolicy: Never
  imagePullSecrets:
  - name: gitlab-secret
  containers:
    - image: registry.gitlab.com/xxxx
      imagePullPolicy: IfNotPresent
      args: [dask-worker, --nthreads, "2", --no-bokeh, --memory-limit, 4GB]
      name: dask-worker
      env:
        - name: AWS_BUCKET
          value: xxxxx
        - name: AWS_ACCESS_KEY_ID
          valueFrom:
            secretKeyRef:
              name: aws-secret
              key: AWS_ACCESS_KEY_ID
        - name: AWS_SECRET_ACCESS_KEY
          valueFrom:
            secretKeyRef:
              name: aws-secret
              key: AWS_SECRET_ACCESS_KEY
      resources:
        limits:
          cpu: "2"
          memory: 4G
        requests:
          cpu: "2"
          memory: 2G

Running the flow with size=0.05 (50MB) works fine.
Running the flow with size=0.25 (250MB) fails at produce_feature.

Is this because, while produce_feature is trying to download the data from S3, it can't respond to the heartbeat? (Trying to understand why that would be the case.)
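
For reference, a quick back-of-the-envelope check of what size=0.25 means under generate_data's formula (the peak-memory note is an estimate, not something confirmed in the thread):

# Same arithmetic as generate_data: float64 elements are 8 bytes each.
size_gb = 0.25
n_elements = int(round(size_gb * 10**9 / 8))  # 31,250,000 elements
array_bytes = n_elements * 8                  # ~0.25 GB per mapped task
# produce_feature computes x * 4, which allocates a second array of the same
# size, so each mapped pair peaks at roughly 0.5 GB against the 4GB
# --memory-limit in the worker spec.
print(f"{n_elements:,} elements, {array_bytes / 1e9:.2f} GB")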

@cicdw
Member

cicdw commented May 21, 2020

Hi @marwan116 - first question for you: does the docker image you're using have an init process like tini?

The way heartbeats currently work in Prefect is that each individual task creates a small subprocess that polls the Cloud API on a 30 second loop. If a heartbeat is not received for 2 minutes, the task is considered a "Zombie". It seems we have a subset of users who are disproportionately affected by the presence of Zombies, so we're trying to isolate the cause. One theory is that docker images without good process cleanup run into resource bottlenecks, preventing subprocesses from running correctly.

Because the heartbeat is sent via a subprocess, though, it shouldn't interfere with your task's runtime logic in any way.
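
A rough sketch of the subprocess approach described here (the "heartbeat" module name and the TASK_RUN_ID environment variable are placeholders, not Prefect's real entrypoints):

import os
import subprocess
import sys


def run_with_heartbeat(run_task):
    # Launch a separate process that sends beats on a 30-second loop; because
    # it is a different process, it keeps beating even while run_task() is
    # busy (e.g. blocked on an S3 download).
    proc = subprocess.Popen(
        [sys.executable, "-m", "heartbeat", os.environ["TASK_RUN_ID"]],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        return run_task()
    finally:
        # Without an init process such as tini to reap children, terminated
        # subprocesses like this one can linger and exhaust resources.
        proc.terminate()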

Another question for you is: does your Flow run successfully 100% of the time whenever you disable heartbeats? You can do this by navigating to your Flow page and clicking the far right "Settings" tab. You should see a toggle for heartbeats. If you toggle this off, these subprocesses will never be created.

I'd be very curious to know about tini and whether toggling off heartbeats works for you, as I'm currently considering a redesign of how we detect zombies and all user data helps!

@marwan116
Contributor

marwan116 commented May 22, 2020

Hi @cicdw

first question for you is does your docker image that you're using have an init process like tini?

I believe so - I am building the image in the above code using Prefect's Docker storage, which basically uses the prefect image (which has tini) as a base image and pip installs the specified dependencies - correct?

Another question for you is: does your Flow run successfully 100% of the time whenever you disable heartbeats? You can do this by navigating to your Flow page and clicking the far right "Settings" tab. You should see a toggle for heartbeats. If you toggle this off, these subprocesses will never be created.

Thank you for recommending this. When I toggle heartbeats off, my flow still fails, so sorry about the false alarm - this doesn't seem to be a heartbeat-related problem. Thanks for explaining the potential pitfalls of the current design.

Inspecting the logs, I think I see the problem: "Insufficient cpu". I will have to reason about whether that makes sense given the available resources in the cluster and get back to you.

See the kubectl event logs below for more information

6m46s       Warning   FailedScheduling   pod/dask-root-3c7ea69b-0csjjw                                     0/2 nodes are available: 2 Insufficient cpu.
6m46s       Normal    Created            pod/dask-root-3c7ea69b-0ssxgv                                     Created container dask-worker
6m45s       Normal    Started            pod/dask-root-3c7ea69b-0ssxgv                                     Started container dask-worker
6m44s       Warning   FailedScheduling   pod/dask-root-3c7ea69b-0csjjw                                     skip schedule deleting pod: default/dask-root-3c7ea69b-0csjjw
6m37s       Normal    Killing            pod/dask-root-3c7ea69b-0lpqk5                                     Stopping container dask-worker

@marwan116
Contributor

marwan116 commented May 22, 2020

@cicdw - I got the problem resolved by lowering the cpu requirements from 2 to 1

The problem was purely "resource" related ...

@cicdw
Member

cicdw commented May 22, 2020

Awesome thanks for the update @marwan116 , and glad you were able to figure it out!
