Large numbers of workers take too long to boot, or don't boot at all the first time #3365
Replies: 4 comments
-
Have you checked your system logs? Do they say anything interesting?
-
Are you giving Docker enough memory? Maybe #3193 is what's happening? You could also try #3236 (comment).
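One quick way to confirm from inside the container whether a memory cap is set at all; a minimal sketch (the cgroup path is an assumption about your host's cgroup version):

```ruby
# Minimal sketch: read the container's memory limit from the cgroup filesystem.
# Path assumes cgroup v2; on cgroup v1 it is /sys/fs/cgroup/memory/memory.limit_in_bytes.
path = "/sys/fs/cgroup/memory.max"
limit = File.exist?(path) ? File.read(path).strip : "unknown"
puts "container memory limit: #{limit}"  # "max" means no limit is set
```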
-
Decided to check again: nothing in journalctl or dmesg :(
Docker settings are at their defaults; the only thing I've changed is the file descriptor limit in
Just tried this one.
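If it helps to double-check, here is a minimal sketch for verifying that the raised file descriptor limit is actually in effect inside the container:

```ruby
# Minimal sketch: print the soft/hard RLIMIT_NOFILE as seen by the Ruby process.
soft, hard = Process.getrlimit(Process::RLIMIT_NOFILE)
puts "RLIMIT_NOFILE: soft=#{soft} hard=#{hard}"
```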
-
I think you need to dig deeper. Maybe https://github.com/amrabed/strace-docker can help. Can you try the same thing without Docker?
-
Describe the bug
I have some machines with a 64C/128T CPU and 128 GB of RAM. I'm having trouble running a Rails application with Puma on them (inside Docker, of course). When the number of workers exceeds 50, Puma gets stuck while booting. Not all of the workers come up on the first attempt, and I have to restart the Docker container once or twice to get a proper boot.
When the issue occurs, the Puma control server reports, for example, 27 workers, but even those are not shown as booted. htop also shows only 27 workers, so this isn't a problem with the control server's metrics. Despite all of this, Puma is able to accept incoming connections.
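A minimal sketch for querying those numbers from the control server, assuming it is bound to tcp://127.0.0.1:9293 with auth token "token" (both are placeholders, not the config from this report):

```ruby
# Minimal sketch: ask Puma's control app how many workers report as booted.
require "net/http"
require "json"

stats = JSON.parse(Net::HTTP.get(URI("http://127.0.0.1:9293/stats?token=token")))
booted = stats["worker_status"].count { |w| w["booted"] }
puts "#{booted}/#{stats['workers']} workers booted (booted_workers=#{stats['booted_workers']})"
```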
No errors appear during a "bad boot", even at debug log level.
Puma config:
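(The config itself didn't survive the copy; below is a minimal sketch of a cluster-mode setup consistent with the report. The values and the control-app binding are assumptions, not the original file.)

```ruby
# Hypothetical reconstruction, not the original config: a cluster-mode setup
# in the shape the report describes.
workers 100                 # boot issues start above ~50 workers
threads 1, 5
preload_app!
bind "tcp://0.0.0.0:3000"
activate_control_app "tcp://127.0.0.1:9293", auth_token: "token"
```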
I'm running it through a docker_entrypoint.rb file, via a function in it described below.
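(The original function wasn't captured either; as a stand-in, here is a hypothetical sketch of what such an entrypoint typically looks like: one-off setup, then exec'ing Puma so it becomes PID 1 and receives signals directly.)

```ruby
#!/usr/bin/env ruby
# Hypothetical sketch of a docker_entrypoint.rb; not the file from this report.
def boot_app
  # One-off setup before the server starts (assumed step).
  system("bundle", "exec", "rails", "db:prepare") || abort("db:prepare failed")
  # exec replaces this process, so Puma runs as PID 1 and receives signals.
  exec("bundle", "exec", "puma", "-C", "config/puma.rb")
end

boot_app
```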
I'm not sure about running multiple containers with fewer workers each (balanced by nginx), as that creates some inconveniences with logging.
Any ideas why this happens?
Some additional info
I've reproduced the issue and noticed that the last-checkin time has not updated since the application was deployed (it should update every 5 seconds, but the metrics below were copied more than 40 minutes after that). Also, the workers can be divided into groups of 1-3, with each group starting 60-90 seconds apart.
Current number of workers: 100
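A minimal sketch for spotting stale checkins automatically, assuming the same (placeholder) control-app endpoint as in the sketch above; Puma's default worker check interval is 5 seconds:

```ruby
# Minimal sketch: flag workers whose last_checkin is older than expected.
require "net/http"
require "json"
require "time"

stats = JSON.parse(Net::HTTP.get(URI("http://127.0.0.1:9293/stats?token=token")))
stats["worker_status"].each do |w|
  age = Time.now - Time.parse(w["last_checkin"])
  puts "worker #{w['index']} (pid #{w['pid']}): last checkin #{age.round}s ago" if age > 10
end
```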
Some additional info 2
If the number of workers exceeds 100, I get the following errors during a "bad boot":
These TimeOut and Out-Of-Sync errors don't appear with fewer workers, and they also don't appear every time a "bad boot" happens.
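One thing that may be worth trying (an assumption on my part, not a confirmed fix): Puma's config DSL lets you give workers more time to boot and check in before the master process declares a timeout, e.g.:

```ruby
# Sketch: relax the boot/checkin deadlines for very large worker counts.
worker_timeout 120        # seconds between checkins before a worker is killed
worker_boot_timeout 300   # extra allowance for the initial boot (defaults to worker_timeout)
```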
Environment
Thanks in advance :)