Celery Worker docker healthcheck causes a memory leak #21026

Closed · 2 tasks done
mtraynham opened this issue Jan 21, 2022 · 6 comments
Labels: area:core, duplicate, kind:bug

Comments

mtraynham (Contributor) commented Jan 21, 2022

Apache Airflow version

2.2.3 (latest released)

What happened

With a Docker setup as defined by this compose file, the airflow-worker service's healthcheck.test command causes a gradual increase in memory use over time, even when idle. This was observed with Airflow 2.1.4 and 2.2.3.

healthcheck:
  test:
    - "CMD-SHELL"
    - 'celery --app airflow.executors.celery_executor.app inspect ping -d "celery@$${HOSTNAME}"'
  interval: 10s
  timeout: 10s
  retries: 5
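
For reference, the same ping the healthcheck performs can be run by hand inside the worker container (a sketch; <worker-container> is a placeholder for whatever name or ID docker ps reports for the worker):

# Manually invoke the healthcheck command inside the running worker container.
# bash -c keeps ${HOSTNAME} expanding to the container's hostname rather than the host's.
docker exec <worker-container> bash -c \
  'celery --app airflow.executors.celery_executor.app inspect ping -d "celery@${HOSTNAME}"'

Each invocation starts a short-lived Celery app that connects to the broker to ping the worker, so the check is not free even when the worker is idle.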

We observed this in our AWS ECS cluster, which runs workers with 0.5 CPU / 1 GB memory. We strangely had a task fail (at the second dip in memory use in the picture below), which prompted further investigation. The task had actually succeeded, but for some reason it notified dependent tasks as failed; subsequent tasks were marked as upstream-failed, even though the webserver reported the task as a success. We noticed the metrics page looked like the image below.
[image: ECS worker memory metrics showing a gradual climb while idle]

We raised the resources to 2 CPU / 4 GB memory and restarted the service, which still produced a gradual increase in memory.
[image: memory metrics after the resource increase, still trending upward]

What you expected to happen

Memory use should not increase while the system is idle; at most it should spike during the healthcheck and then be released back to the host.

How to reproduce

We use a modified version of the compose file and instead favor docker stack, but the same setup should apply to the documented compose file. A slimmed-down compose file is below; it has two workers, one with a healthcheck and one without.

A secondary script was written to scrape the Docker statistics at 10-second intervals and write them to a CSV file.

Both steps can be executed like so:

$ docker stack deploy -c docker-compose.yaml airflow
$ nohup ./collect_stats.sh > stats.csv &

The necessary files are below. I'm also including a sample of the CSV output from a local run.
worker_stats.csv

It shows the gradual memory increase of airflow_worker_healthcheck over a ~2 hour period: roughly 45 MB per hour with the healthcheck running at 10-second intervals.
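For a rough sense of scale, a 10-second interval works out to about 360 pings per hour, so ~45 MB / 360 ≈ 128 KB retained per ping (back-of-the-envelope, assuming all of the growth is attributable to the healthcheck).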

Date Container CPU Percent Mem Usage Mem Percent
2022-01-21T19:17:57UTC airflow_worker_no_healthcheck.1.z3lwru6mh22tzpe6dc8h8ekvi 0.47% 1.108GiB / 14.91GiB 7.43%
2022-01-21T19:17:57UTC airflow_worker_healthcheck.1.vw9d1q18dx75v7w3zcfb7fypt 0.57% 1.1GiB / 14.91GiB 7.38%
2022-01-21T20:34:01UTC airflow_worker_no_healthcheck.1.z3lwru6mh22tzpe6dc8h8ekvi 0.28% 1.108GiB / 14.91GiB 7.43%
2022-01-21T20:34:01UTC airflow_worker_healthcheck.1.vw9d1q18dx75v7w3zcfb7fypt 0.76% 1.157GiB / 14.91GiB 7.76%

collect_stats.sh

#!/usr/bin/env sh

# Emit a CSV header, then sample `docker stats` every 10 seconds,
# keeping only the worker containers and prefixing each row with a UTC timestamp.
echo "Date,Container,CPU Percent,Mem Usage,Mem Percent"
while true; do
    time=$(date --utc +%FT%T%Z)
    docker stats \
      --format "table {{.Name}},{{.CPUPerc}},{{.MemUsage}},{{.MemPerc}}" \
      --no-stream \
      | grep worker \
      | awk -vT="${time}," '{ print T $0 }'
    sleep 10
done

docker-compose.yaml

---
version: '3.7'

networks:
  net:
    driver: overlay
    attachable: true

volumes:
  postgres-data:
  redis-data:

services:
  postgres:
    image: postgres:13.2-alpine
    volumes:
      - postgres-data:/var/lib/postgresql/data
    environment:
      POSTGRES_USER: airflow
      POSTGRES_PASSWORD: airflow
      POSTGRES_DB: airflow
    healthcheck:
      test: pg_isready -U airflow -d airflow
      interval: 10s
      timeout: 3s
      start_period: 15s
    ports:
      - '5432:5432'
    networks:
      - net

  redis:
    image: redis:6.2
    volumes:
      - redis-data:/data
    healthcheck:
      test: redis-cli ping
      interval: 10s
      timeout: 3s
      start_period: 15s
    ports:
      - '6379:6379'
    networks:
      - net

  webserver:
    image: apache/airflow:2.2.3-python3.8
    command:
      - bash
      - -c
      - 'airflow db init
      && airflow db upgrade
      && airflow users create --username admin --firstname Admin --lastname User --password admin --role Admin --email test@admin.org
      && airflow webserver'
    environment:
      AIRFLOW__API__AUTH_BACKEND: airflow.api.auth.backend.basic_auth
      AIRFLOW__CELERY__BROKER_URL: redis://redis:6379/1
      AIRFLOW__CELERY__RESULT_BACKEND: db+postgresql://airflow:airflow@postgres:5432/airflow
      AIRFLOW__CORE__EXECUTOR: CeleryExecutor
      AIRFLOW__CORE__FERNET_KEY: yxfSDUw_7SG6BhBstIt7dFzL5rpnxvr_Jkv0tFyEJ3s=
      AIRFLOW__CORE__SQL_ALCHEMY_CONN: postgresql://airflow:airflow@postgres:5432/airflow
      AIRFLOW__LOGGING__LOGGING_LEVEL: INFO
      AIRFLOW__WEBSERVER__SECRET_KEY: 0123456789
    healthcheck:
      test: curl --fail http://localhost:8080/health
      interval: 10s
      timeout: 10s
      retries: 10
      start_period: 90s
    ports:
      - '8080:8080'
    networks:
      - net

  scheduler:
    image: apache/airflow:2.2.3-python3.8
    command: scheduler
    environment:
      AIRFLOW__API__AUTH_BACKEND: airflow.api.auth.backend.basic_auth
      AIRFLOW__CELERY__BROKER_URL: redis://redis:6379/1
      AIRFLOW__CELERY__RESULT_BACKEND: db+postgresql://airflow:airflow@postgres:5432/airflow
      AIRFLOW__CORE__EXECUTOR: CeleryExecutor
      AIRFLOW__CORE__FERNET_KEY: yxfSDUw_7SG6BhBstIt7dFzL5rpnxvr_Jkv0tFyEJ3s=
      AIRFLOW__CORE__SQL_ALCHEMY_CONN: postgresql://airflow:airflow@postgres:5432/airflow
      AIRFLOW__LOGGING__LOGGING_LEVEL: INFO
      AIRFLOW__WEBSERVER__SECRET_KEY: 0123456789
    healthcheck:
      test: airflow db check
      interval: 20s
      timeout: 10s
      retries: 5
      start_period: 40s
    networks:
      - net

  worker_healthcheck:
    image: apache/airflow:2.2.3-python3.8
    command: celery worker
    environment:
      AIRFLOW__API__AUTH_BACKEND: airflow.api.auth.backend.basic_auth
      AIRFLOW__CELERY__BROKER_URL: redis://redis:6379/1
      AIRFLOW__CELERY__RESULT_BACKEND: db+postgresql://airflow:airflow@postgres:5432/airflow
      AIRFLOW__CORE__EXECUTOR: CeleryExecutor
      AIRFLOW__CORE__FERNET_KEY: yxfSDUw_7SG6BhBstIt7dFzL5rpnxvr_Jkv0tFyEJ3s=
      AIRFLOW__CORE__SQL_ALCHEMY_CONN: postgresql://airflow:airflow@postgres:5432/airflow
      AIRFLOW__LOGGING__LOGGING_LEVEL: DEBUG
      AIRFLOW__WEBSERVER__SECRET_KEY: 0123456789
    healthcheck:
      test:
        - "CMD-SHELL"
        - 'celery --app airflow.executors.celery_executor.app inspect ping -d "celery@$${HOSTNAME}"'
      interval: 10s
      timeout: 10s
      retries: 5
      start_period: 40s
    networks:
      - net

  worker_no_healthcheck:
    image: apache/airflow:2.2.3-python3.8
    command: celery worker
    environment:
      AIRFLOW__API__AUTH_BACKEND: airflow.api.auth.backend.basic_auth
      AIRFLOW__CELERY__BROKER_URL: redis://redis:6379/1
      AIRFLOW__CELERY__RESULT_BACKEND: db+postgresql://airflow:airflow@postgres:5432/airflow
      AIRFLOW__CORE__EXECUTOR: CeleryExecutor
      AIRFLOW__CORE__FERNET_KEY: yxfSDUw_7SG6BhBstIt7dFzL5rpnxvr_Jkv0tFyEJ3s=
      AIRFLOW__CORE__SQL_ALCHEMY_CONN: postgresql://airflow:airflow@postgres:5432/airflow
      AIRFLOW__LOGGING__LOGGING_LEVEL: DEBUG
      AIRFLOW__WEBSERVER__SECRET_KEY: 0123456789
    networks:
      - net

Operating System

Ubuntu 20.04.3 LTS

Versions of Apache Airflow Providers

Using the base Python 3.8 Docker images from:
https://hub.docker.com/layers/apache/airflow/2.2.3-python3.8/images/sha256-a8c86724557a891104e91da8296157b4cabd73d81011ee1f733cbb7bbe61d374?context=explore
https://hub.docker.com/layers/apache/airflow/2.1.4-python3.8/images/sha256-d14244034721583a4a2d9760ffc9673307a56be5d8c248df02c466ca86704763?context=explore

Deployment

Docker-Compose is included above.

Deployment details

Tested with Python 3.8 images

Anything else

We did not see similar issues with the Webserver or Scheduler deployments.

My colleague and I think this might be related to some underlying Celery memory leaks. He informed me of an upcoming release that includes #19703. I'd be interested to see whether a similar issue occurs with the newer version.

I don't believe there's much else that can be done on Airflow's part here besides upgrading Celery; I just wanted to bring awareness to this outstanding issue. We are currently in search of a different healthcheck that potentially avoids Celery. If there are suggestions, I would gladly create a PR to update the documented compose file.
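
One direction we have been considering (a sketch only, not something we have validated) is a purely process-based check that avoids spawning a Celery control client at all, assuming procps/pgrep is available in the apache/airflow image:

# Candidate healthcheck command: succeed only if a process whose command line mentions celery is running.
# Note: the worker's process title may be rewritten (e.g. by setproctitle), so the pattern
# would need to be verified against the actual command line inside the container.
pgrep -f celery > /dev/null || exit 1

used as the CMD-SHELL entry of healthcheck.test. This only verifies that the worker process is alive, not that it can reach the broker, so it is a strictly weaker check than inspect ping.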

Other related issues may be:
celery/celery#4843
celery/kombu#1470

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

  • I agree to follow this project's Code of Conduct

mtraynham added the area:core and kind:bug labels on Jan 21, 2022
boring-cyborg (bot) commented Jan 21, 2022

Thanks for opening your first issue here! Be sure to follow the issue template!

potiuk (Member) commented Jan 21, 2022

We will release 2.2.4 (and 2.3.0 later) with Celery 5.2.3 which should solve the problem. See #19703 for details

potiuk added the duplicate label on Jan 21, 2022
potiuk closed this as completed on Jan 21, 2022
potiuk (Member) commented Jan 21, 2022

Closed as duplicate of #19703

mtraynham (Contributor, Author) commented Feb 28, 2022

@potiuk I've tested Airflow 2.2.4 and still see this issue with the recommended healthcheck.test, https://github.com/apache/airflow/blob/958860fcd7c9ecdf60b7ebeef4397b348835c8db/docs/apache-airflow/start/docker-compose.yaml

I've taken the docker-compose.yaml file from above and replaced references of 2.2.3 with 2.2.4. Using the stats-collection script included above, I found the following:

Date Container CPU Percent Mem Usage Mem Percent
2022-02-28T16:59:44UTC airflow_worker_no_healthcheck.1.fyihqxhf8hb574ivnz7uflpbz 2.08% 1.123GiB / 14.91GiB 7.53%
2022-02-28T16:59:44UTC airflow_worker_healthcheck.1.rq48arboysa4b8fqn6i5jb7ad 0.43% 1.139GiB / 14.91GiB 7.64%
2022-02-28T21:00:41UTC airflow_worker_no_healthcheck.1.fyihqxhf8hb574ivnz7uflpbz 0.75% 1.123GiB / 14.91GiB 7.54%
2022-02-28T21:00:41UTC airflow_worker_healthcheck.1.rq48arboysa4b8fqn6i5jb7ad 88.68% 1.319GiB / 14.91GiB 8.85%

Over a period of 4 hours, memory on the container with the healthcheck increased by ~180 MB, or about 45 MB per hour, which matches what I observed before. Over the course of, say, 3 days, that becomes ~3.24 GB of unreclaimed memory.

For reference, the test environment is:
Airflow 2.2.4 - Celery Executor
1 Webserver, 1 Scheduler, 2 Workers (1 with health check)
Ubuntu 20.04.3 LTS
Linux host1 5.4.0-88-generic 99-Ubuntu SMP Thu Sep 23 17:29:00 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
Docker version 20.10.12, build e91ed57

I've included the dataset as a CSV. Every now and then memory spikes and some is reclaimed, but usage still keeps trickling up, unlike the service without a healthcheck.

stats_airflow.csv

Is there potentially a different health check the worker service could use instead?

mtraynham (Contributor, Author) commented
I've posted a general Q&A discussion on the Celery repo, celery/celery#7327. General feedback has suggested this is potentially related to celery/celery#6009.

auvipy (Contributor) commented Mar 6, 2022

Some memory leaks still persist in Celery.
