Celery Worker docker healthcheck causes a memory leak #21026

Closed · 2 tasks done
mtraynham opened this issue Jan 21, 2022 · 6 comments
Labels: area:core, duplicate, kind:bug

Comments

mtraynham (Contributor) commented Jan 21, 2022

Apache Airflow version

2.2.3 (latest released)

What happened

With a Docker setup as defined by this compose file, the airflow-worker service's healthcheck.test command causes a gradual increase in memory use over time, even when idle. This was observed with Airflow 2.1.4 and 2.2.3.

healthcheck:
  test:
    - "CMD-SHELL"
    - 'celery --app airflow.executors.celery_executor.app inspect ping -d "celery@$${HOSTNAME}"'
  interval: 10s
  timeout: 10s
  retries: 5
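
For reference, the same ping the healthcheck performs can be run by hand inside the worker container (a sketch; <worker-container> is a placeholder for whatever name or ID docker ps reports for the worker):

# Manually invoke the healthcheck command inside the running worker container.
# bash -c keeps ${HOSTNAME} expanding to the container's hostname rather than the host's.
docker exec <worker-container> bash -c \
  'celery --app airflow.executors.celery_executor.app inspect ping -d "celery@${HOSTNAME}"'

Each invocation starts a short-lived Celery app that connects to the broker to ping the worker, so the check is not free even when the worker is idle.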

We observed this in our AWS ECS cluster, which runs workers with 0.5 CPU / 1 GB memory. We strangely had a task fail (at the second dip in memory use in the picture below), which prompted further investigation. The task had actually succeeded, but for some reason it notified dependent tasks as failed; subsequent tasks were marked as upstream-failed, even though the webserver reported the task as a success. We noticed the metrics page looked like the image below.
[image: ECS worker memory metrics showing a gradual climb while idle]

We raised the resources to 2 CPU / 4 GB memory and restarted the service, which still produced a gradual increase in memory.
[image: memory metrics after the resource increase, still trending upward]

What you expected to happen

Memory use should not increase while the system is idle; at most it should spike during the healthcheck and then be released back to the host.

How to reproduce

We use a modified version of the compose file and instead favor docker stack, but the same setup should apply to the documented compose file. A slimmed-down compose file is below; it has two workers, one with a healthcheck and one without.

A secondary script was written to scrape the Docker statistics at 10-second intervals and write them to a CSV file.

Both steps can be executed like so:

$ docker stack deploy -c docker-compose.yaml airflow
$ nohup ./collect_stats.sh > stats.csv &

The necessary files are below. I'm also including a sample of the CSV output from a local run.
worker_stats.csv

It shows the gradual memory increase of airflow_worker_healthcheck over a ~2 hour period: roughly 45 MB per hour with the healthcheck running at 10-second intervals.
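For a rough sense of scale, a 10-second interval works out to about 360 pings per hour, so ~45 MB / 360 ≈ 128 KB retained per ping (back-of-the-envelope, assuming all of the growth is attributable to the healthcheck).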

Date Container CPU Percent Mem Usage Mem Percent
2022-01-21T19:17:57UTC airflow_worker_no_healthcheck.1.z3lwru6mh22tzpe6dc8h8ekvi 0.47% 1.108GiB / 14.91GiB 7.43%
2022-01-21T19:17:57UTC airflow_worker_healthcheck.1.vw9d1q18dx75v7w3zcfb7fypt 0.57% 1.1GiB / 14.91GiB 7.38%
2022-01-21T20:34:01UTC airflow_worker_no_healthcheck.1.z3lwru6mh22tzpe6dc8h8ekvi 0.28% 1.108GiB / 14.91GiB 7.43%
2022-01-21T20:34:01UTC airflow_worker_healthcheck.1.vw9d1q18dx75v7w3zcfb7fypt 0.76% 1.157GiB / 14.91GiB 7.76%

collect_stats.sh

#!/usr/bin/env sh

# Emit a CSV header, then sample `docker stats` every 10 seconds,
# keeping only the worker containers and prefixing each row with a UTC timestamp.
echo "Date,Container,CPU Percent,Mem Usage,Mem Percent"
while true; do
    time=$(date --utc +%FT%T%Z)
    docker stats \
      --format "table {{.Name}},{{.CPUPerc}},{{.MemUsage}},{{.MemPerc}}" \
      --no-stream \
      | grep worker \
      | awk -vT="${time}," '{ print T $0 }'
    sleep 10
done

docker-compose.yaml

---
version: '3.7'

networks:
  net:
    driver: overlay
    attachable: true

volumes:
  postgres-data:
  redis-data:

services:
  postgres:
    image: postgres:13.2-alpine
    volumes:
      - postgres-data:/var/lib/postgresql/data
    environment:
      POSTGRES_USER: airflow
      POSTGRES_PASSWORD: airflow
      POSTGRES_DB: airflow
    healthcheck:
      test: pg_isready -U airflow -d airflow
      interval: 10s
      timeout: 3s
      start_period: 15s
    ports:
      - '5432:5432'
    networks:
      - net

  redis:
    image: redis:6.2
    volumes:
      - redis-data:/data
    healthcheck:
      test: redis-cli ping
      interval: 10s
      timeout: 3s
      start_period: 15s
    ports:
      - '6379:6379'
    networks:
      - net

  webserver:
    image: apache/airflow:2.2.3-python3.8
    command:
      - bash
      - -c
      - 'airflow db init
      && airflow db upgrade
      && airflow users create --username admin --firstname Admin --lastname User --password admin --role Admin --email test@admin.org
      && airflow webserver'
    environment:
      AIRFLOW__API__AUTH_BACKEND: airflow.api.auth.backend.basic_auth
      AIRFLOW__CELERY__BROKER_URL: redis://redis:6379/1
      AIRFLOW__CELERY__RESULT_BACKEND: db+postgresql://airflow:airflow@postgres:5432/airflow
      AIRFLOW__CORE__EXECUTOR: CeleryExecutor
      AIRFLOW__CORE__FERNET_KEY: yxfSDUw_7SG6BhBstIt7dFzL5rpnxvr_Jkv0tFyEJ3s=
      AIRFLOW__CORE__SQL_ALCHEMY_CONN: postgresql://airflow:airflow@postgres:5432/airflow
      AIRFLOW__LOGGING__LOGGING_LEVEL: INFO
      AIRFLOW__WEBSERVER__SECRET_KEY: 0123456789
    healthcheck:
      test: curl --fail http://localhost:8080/health
      interval: 10s
      timeout: 10s
      retries: 10
      start_period: 90s
    ports:
      - '8080:8080'
    networks:
      - net

  scheduler:
    image: apache/airflow:2.2.3-python3.8
    command: scheduler
    environment:
      AIRFLOW__API__AUTH_BACKEND: airflow.api.auth.backend.basic_auth
      AIRFLOW__CELERY__BROKER_URL: redis://redis:6379/1
      AIRFLOW__CELERY__RESULT_BACKEND: db+postgresql://airflow:airflow@postgres:5432/airflow
      AIRFLOW__CORE__EXECUTOR: CeleryExecutor
      AIRFLOW__CORE__FERNET_KEY: yxfSDUw_7SG6BhBstIt7dFzL5rpnxvr_Jkv0tFyEJ3s=
      AIRFLOW__CORE__SQL_ALCHEMY_CONN: postgresql://airflow:airflow@postgres:5432/airflow
      AIRFLOW__LOGGING__LOGGING_LEVEL: INFO
      AIRFLOW__WEBSERVER__SECRET_KEY: 0123456789
    healthcheck:
      test: airflow db check
      interval: 20s
      timeout: 10s
      retries: 5
      start_period: 40s
    networks:
      - net

  worker_healthcheck:
    image: apache/airflow:2.2.3-python3.8
    command: celery worker
    environment:
      AIRFLOW__API__AUTH_BACKEND: airflow.api.auth.backend.basic_auth
      AIRFLOW__CELERY__BROKER_URL: redis://redis:6379/1
      AIRFLOW__CELERY__RESULT_BACKEND: db+postgresql://airflow:airflow@postgres:5432/airflow
      AIRFLOW__CORE__EXECUTOR: CeleryExecutor
      AIRFLOW__CORE__FERNET_KEY: yxfSDUw_7SG6BhBstIt7dFzL5rpnxvr_Jkv0tFyEJ3s=
      AIRFLOW__CORE__SQL_ALCHEMY_CONN: postgresql://airflow:airflow@postgres:5432/airflow
      AIRFLOW__LOGGING__LOGGING_LEVEL: DEBUG
      AIRFLOW__WEBSERVER__SECRET_KEY: 0123456789
    healthcheck:
      test:
        - "CMD-SHELL"
        - 'celery --app airflow.executors.celery_executor.app inspect ping -d "celery@$${HOSTNAME}"'
      interval: 10s
      timeout: 10s
      retries: 5
      start_period: 40s
    networks:
      - net

  worker_no_healthcheck:
    image: apache/airflow:2.2.3-python3.8
    command: celery worker
    environment:
      AIRFLOW__API__AUTH_BACKEND: airflow.api.auth.backend.basic_auth
      AIRFLOW__CELERY__BROKER_URL: redis://redis:6379/1
      AIRFLOW__CELERY__RESULT_BACKEND: db+postgresql://airflow:airflow@postgres:5432/airflow
      AIRFLOW__CORE__EXECUTOR: CeleryExecutor
      AIRFLOW__CORE__FERNET_KEY: yxfSDUw_7SG6BhBstIt7dFzL5rpnxvr_Jkv0tFyEJ3s=
      AIRFLOW__CORE__SQL_ALCHEMY_CONN: postgresql://airflow:airflow@postgres:5432/airflow
      AIRFLOW__LOGGING__LOGGING_LEVEL: DEBUG
      AIRFLOW__WEBSERVER__SECRET_KEY: 0123456789
    networks:
      - net

Operating System

Ubuntu 20.04.3 LTS

Versions of Apache Airflow Providers

Using the base Python 3.8 Docker images from:
https://hub.docker.com/layers/apache/airflow/2.2.3-python3.8/images/sha256-a8c86724557a891104e91da8296157b4cabd73d81011ee1f733cbb7bbe61d374?context=explore
https://hub.docker.com/layers/apache/airflow/2.1.4-python3.8/images/sha256-d14244034721583a4a2d9760ffc9673307a56be5d8c248df02c466ca86704763?context=explore

Deployment

Docker-Compose is included above.

Deployment details

Tested with Python 3.8 images

Anything else

We did not see similar issues with the Webserver or Scheduler deployments.

My colleague and I think this might be related to some underlying Celery memory leaks. He informed me of an upcoming release that includes #19703. I'd be interested to see whether a similar issue occurs with the newer version.

I don't believe there's much else that can be done on Airflow's part here besides upgrading Celery; I just wanted to bring awareness to this outstanding issue. We are currently in search of a different healthcheck that potentially avoids Celery. If there are suggestions, I would gladly create a PR to update the documented compose file.
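
One direction we have been considering (a sketch only, not something we have validated) is a purely process-based check that avoids spawning a Celery control client at all, assuming procps/pgrep is available in the apache/airflow image:

# Candidate healthcheck command: succeed only if a process whose command line mentions celery is running.
# Note: the worker's process title may be rewritten (e.g. by setproctitle), so the pattern
# would need to be verified against the actual command line inside the container.
pgrep -f celery > /dev/null || exit 1

used as the CMD-SHELL entry of healthcheck.test. This only verifies that the worker process is alive, not that it can reach the broker, so it is a strictly weaker check than inspect ping.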

Other related issues may be:
celery/celery#4843
celery/kombu#1470

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

  • I agree to follow this project's Code of Conduct

mtraynham added the area:core and kind:bug labels on Jan 21, 2022
boring-cyborg (bot) commented Jan 21, 2022

Thanks for opening your first issue here! Be sure to follow the issue template!

potiuk (Member) commented Jan 21, 2022

We will release 2.2.4 (and 2.3.0 later) with Celery 5.2.3 which should solve the problem. See #19703 for details

potiuk added the duplicate label on Jan 21, 2022
potiuk closed this as completed on Jan 21, 2022
potiuk (Member) commented Jan 21, 2022

Closed as duplicate of #19703

mtraynham (Contributor, Author) commented Feb 28, 2022

@potiuk I've tested Airflow 2.2.4 and still see this issue with the recommended healthcheck.test, https://github.com/apache/airflow/blob/958860fcd7c9ecdf60b7ebeef4397b348835c8db/docs/apache-airflow/start/docker-compose.yaml

I've taken the docker-compose.yaml file from above and replaced references of 2.2.3 with 2.2.4. Using the stats-collection script included above, I found the following:

Date Container CPU Percent Mem Usage Mem Percent
2022-02-28T16:59:44UTC airflow_worker_no_healthcheck.1.fyihqxhf8hb574ivnz7uflpbz 2.08% 1.123GiB / 14.91GiB 7.53%
2022-02-28T16:59:44UTC airflow_worker_healthcheck.1.rq48arboysa4b8fqn6i5jb7ad 0.43% 1.139GiB / 14.91GiB 7.64%
2022-02-28T21:00:41UTC airflow_worker_no_healthcheck.1.fyihqxhf8hb574ivnz7uflpbz 0.75% 1.123GiB / 14.91GiB 7.54%
2022-02-28T21:00:41UTC airflow_worker_healthcheck.1.rq48arboysa4b8fqn6i5jb7ad 88.68% 1.319GiB / 14.91GiB 8.85%

Over a period of 4 hours, memory on the container with the healthcheck increased by ~180 MB, or about 45 MB per hour, which matches what I observed before. Over the course of, say, 3 days, that becomes ~3.24 GB of unreclaimed memory.

For reference, the test environment is:
Airflow 2.2.4 - Celery Executor
1 Webserver, 1 Scheduler, 2 Workers (1 with health check)
Ubuntu 20.04.3 LTS
Linux host1 5.4.0-88-generic 99-Ubuntu SMP Thu Sep 23 17:29:00 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
Docker version 20.10.12, build e91ed57

I've included the dataset as a CSV. Every now and then memory spikes and some is reclaimed, but usage still keeps trickling up, unlike the service without a healthcheck.

stats_airflow.csv

Is there potentially a different health check the worker service could use instead?

mtraynham (Contributor, Author) commented
I've posted a general Q&A discussion on the Celery repo, celery/celery#7327. General feedback has suggested this is potentially related to celery/celery#6009.

auvipy (Contributor) commented Mar 6, 2022

Some memory leaks still persist in Celery.
