Celery Worker docker healthcheck causes a memory leak #21026
Comments
Thanks for opening your first issue here! Be sure to follow the issue template!
We will release 2.2.4 (and 2.3.0 later) with Celery 5.2.3, which should solve the problem. See #19703 for details.
Closed as duplicate of #19703 |
@potiuk I've tested Airflow 2.2.4 and still see this issue with the recommended healthcheck.test from https://github.com/apache/airflow/blob/958860fcd7c9ecdf60b7ebeef4397b348835c8db/docs/apache-airflow/start/docker-compose.yaml
Over a period of 4 hours, memory on the container with the healthcheck increased by ~180 MB, or about 45 MB per hour, which is what I observed before. Over the course of, say, 3 days that eventually becomes ~3.24 GB of unreclaimed memory. I've included the dataset as a CSV. Every now and then memory spikes and some is reclaimed, but usage still keeps trickling up, unlike the service without a healthcheck. Is there potentially a different healthcheck the worker service could use instead?
I've posted a general Q&A discussion on the Celery repo, celery/celery#7327. General feedback suggests this is potentially related to celery/celery#6009.
Still, some memory leaks persist in Celery.
Apache Airflow version
2.2.3 (latest released)
What happened
With a Docker setup as defined by this compose file, the airflow-worker service's healthcheck.test command causes a general increase in memory use over time, even when idle. This was observed with Airflow 2.1.4 and 2.2.3. See airflow/docs/apache-airflow/start/docker-compose.yaml, lines 131 to 137 at commit 958860f.
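For reference, the worker healthcheck in the documented compose file is of roughly this shape (a sketch from memory; the exact values live at the referenced lines — the key point is that each probe spawns a fresh `celery inspect ping` process):

```yaml
airflow-worker:
  healthcheck:
    test:
      - "CMD-SHELL"
      - 'celery --app airflow.executors.celery_executor.app inspect ping -d "celery@$${HOSTNAME}"'
    interval: 10s
    timeout: 10s
    retries: 5
```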
We observed this in our AWS ECS cluster, which has a 0.5 CPU / 1 GB memory worker setup. Strangely, we had a task fail (at the 2nd dip in memory use in the picture below), which prompted further investigation. The task had actually succeeded, but for some reason notified dependent tasks as failed. Subsequent tasks were marked as upstream failures, yet the webserver reported the task as a success. We noticed the metrics page looked like the image below.
We raised the resources to 2 CPU / 4 GB memory and restarted the service, which still showed a gradual increase in memory.
What you expected to happen
Memory use should not increase while the system is idle; instead, it should spike during the healthcheck and then be released back to the host.
How to reproduce
We use a modified version of the compose file and favor docker stack instead, but the same setup should apply to the documented compose file. A slimmed-down compose file is below. It has two workers, one with a healthcheck and one without.
A secondary script scrapes the Docker statistics at 10-second intervals and writes them to a CSV file.
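The attached collect_stats.sh is the authoritative version; a hypothetical sketch of such a scraper (file name, column choice, and CSV layout are assumptions, not the original script) could look like:

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a stats scraper like the one described above.
# Samples `docker stats` every 10 seconds and appends one CSV row per container.
set -euo pipefail

OUT="${1:-worker_stats.csv}"     # output CSV path (assumed argument convention)
INTERVAL="${2:-10}"              # sampling interval in seconds

echo "timestamp,name,mem_usage,cpu_perc" > "$OUT"
while true; do
  ts="$(date -u +%Y-%m-%dT%H:%M:%SZ)"
  # --no-stream takes a single sample instead of streaming continuously
  docker stats --no-stream --format '{{.Name}},{{.MemUsage}},{{.CPUPerc}}' \
    | sed "s/^/${ts},/" >> "$OUT"
  sleep "$INTERVAL"
done
```

It requires a running Docker daemon, so it is shown here only as an illustration of the measurement method.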
Executing both commands can be done like so:
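The exact invocation was attached to the original issue; a plausible sketch, assuming the file names used here (the stack name is hypothetical):

```shell
# Deploy the slimmed-down stack (the description mentions docker stack rather
# than docker-compose; the stack name below is an assumption)
docker stack deploy -c docker-compose.yaml airflow_leak_test

# Run the stats scraper in the background, writing samples to worker_stats.csv
./collect_stats.sh worker_stats.csv &
```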
The necessary files are below. I'm also including a sample of the CSV file from a local run.
worker_stats.csv
It shows the general increase in memory of airflow_worker_healthcheck over a ~2-hour period: it consumes ~45 MB per hour when the healthcheck occurs at 10-second intervals.
collect_stats.sh
docker-compose.yaml
Operating System
Ubuntu 20.04.3 LTS
Versions of Apache Airflow Providers
Using the base Python 3.8 Docker images from:
https://hub.docker.com/layers/apache/airflow/2.2.3-python3.8/images/sha256-a8c86724557a891104e91da8296157b4cabd73d81011ee1f733cbb7bbe61d374?context=explore
https://hub.docker.com/layers/apache/airflow/2.1.4-python3.8/images/sha256-d14244034721583a4a2d9760ffc9673307a56be5d8c248df02c466ca86704763?context=explore
Deployment
Docker-Compose is included above.
Deployment details
Tested with Python 3.8 images
Anything else
We did not see similar issues with the Webserver or Scheduler deployments.
My colleague and I think this might be related to some underlying Celery memory leaks. He informed me of an upcoming release that includes #19703. I'd be interested to see whether a similar issue occurs with the newer version.
I don't believe there's much else that can be done on Airflow's part here besides upgrading Celery; I just wanted to raise awareness of this outstanding issue. We are currently in search of a different healthcheck that potentially avoids Celery. If there are suggestions, I would gladly create a PR to update the documented compose file.
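As one hypothetical direction (not validated here, and assuming pgrep is available in the image), a process-liveness probe would avoid spawning a fresh Celery/Python process and broker connection on every check, at the cost of no longer verifying broker connectivity:

```yaml
airflow-worker:
  healthcheck:
    # Hypothetical lighter-weight probe: only checks that a celery worker
    # process is alive, instead of running `celery inspect ping`, which
    # starts a new Python process and broker connection on each check.
    test: ["CMD-SHELL", "pgrep -f 'celery.*worker' || exit 1"]
    interval: 30s
    timeout: 10s
    retries: 5
```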
Other related issues may be:
celery/celery#4843
celery/kombu#1470
Are you willing to submit PR?
Code of Conduct