Large number of workers are booting too long or not booting the first time #3365

snowboy932 · 2024-04-01T11:02:52Z

snowboy932
Apr 1, 2024

Describe the bug
I have some machines with 64C/128T CPU and 128GB RAM. I'm experiencing issues with running Rails application with Puma in it (using Docker, of course). When the number of workers exceeds 50 Puma stucks while booting. Not all the workers can be booted for the first time and I need to restart Docker container 1 or 2 times to make it boot properly.
When the issue exists, Puma control server shows, for example, 27 workers, but even they're not shown as booted. htop also shows only 27 workers, so that's not the control server metrics problem. In spite of all of this, Puma is able to accept incoming connections.
No errors appear during "bad boot" even with debug loglevel.

Puma config:

queue_requests false

preload_app!

before_fork do
  if defined?(ActiveRecord::Base)
    ActiveRecord::Base.connection_pool.disconnect!
  end
end

after_worker_fork do
  require 'prometheus_exporter/instrumentation'
  PrometheusExporter::Instrumentation::Process.start(type: 'web')
end

after_worker_boot do
  require 'prometheus_exporter/instrumentation'
  PrometheusExporter::Instrumentation::Puma.start unless PrometheusExporter::Instrumentation::Puma.started?
end

require 'myapplication'

on_worker_boot do
  if defined?(ActiveRecord::Base)
    ActiveSupport.on_load(:active_record) do
      ActiveRecord::Base.establish_connection
    end

    PrometheusExporter::Instrumentation::Puma.start unless PrometheusExporter::Instrumentation::Puma.started?

    require 'prometheus_exporter/instrumentation'
    PrometheusExporter::Instrumentation::ActiveRecord.start(
      custom_labels: { type: 'puma_worker' },
      config_labels: %i[database host],
    )
  end
end

I'm running with a docker_entrypoint.rb file via function in it described below

def web_server(app)
    socket_backlog = ENV.fetch('SOCKET_BACKLOG')
    port = ENV.fetch('PORT')
    default_bind = "tcp://0.0.0.0:#{port}?backlog=#{socket_backlog}"
    default_control_url = 'tcp://127.0.0.1:9293'
    default_control_token = 'DefaultPumaControlToken'

    run_command_within(
      app,
      'bundle exec puma',
      '-t :threads::threads -w :workers -e :env -b :bind --control-url :control_url --control-token :control_token',
      threads: ENV.fetch('THREADS'),
      workers: ENV.fetch('WORKERS'),
      env: ENV.fetch('RAILS_ENV'),
      bind: ENV.fetch('BIND', default_bind),
      control_url: ENV.fetch('PUMA_CONTROL_URL', default_control_url),
      control_token: ENV.fetch('PUMA_CONTROL_TOKEN', default_control_token),
    )
  end

# <...>

def run_command_within(app, command, options_string = '', options = {})
    validate_app(app)

    command = Cocaine::CommandLine.new(
      command,
      options_string,
    ).command(options)

    Dir.chdir("apps/#{app}")
    exec(command)
  end

I'm not sure abount running multiple containers with a less number of workers (balanced by nginx) as it creates some inconvinieces with logging.

Any ideas why does it happen?

Some additional info
I've reproduced the issue and noticed that the time of last checkin is not updating since the application deployment (it should update every 5 seconds, but the metrics below were copied more than 40 minutes ago. Also, the workers could be divided into groups of 1-3 workers. Each group started 60-90 seconds apart.

Current amount of workers: 100

{
   "started_at":"2024-04-01T10:00:40Z",
   "workers":27,
   "phase":0,
   "booted_workers":0,
   "old_workers":0,
   "worker_status":[
      {
         "started_at":"2024-04-01T10:00:46Z",
         "pid":27,
         "index":0,
         "phase":0,
         "booted":false,
         "last_checkin":"2024-04-01T10:00:46Z",
         "last_status":{
            
         }
      },
      {
         "started_at":"2024-04-01T10:00:46Z",
         "pid":34,
         "index":1,
         "phase":0,
         "booted":false,
         "last_checkin":"2024-04-01T10:00:46Z",
         "last_status":{
            
         }
      },
      {
         "started_at":"2024-04-01T10:01:16Z",
         "pid":43,
         "index":2,
         "phase":0,
         "booted":false,
         "last_checkin":"2024-04-01T10:01:16Z",
         "last_status":{
            
         }
      },
      {
         "started_at":"2024-04-01T10:01:16Z",
         "pid":80,
         "index":3,
         "phase":0,
         "booted":false,
         "last_checkin":"2024-04-01T10:01:16Z",
         "last_status":{
            
         }
      },
      {
         "started_at":"2024-04-01T10:01:46Z",
         "pid":85,
         "index":4,
         "phase":0,
         "booted":false,
         "last_checkin":"2024-04-01T10:01:46Z",
         "last_status":{
            
         }
      },
      {
         "started_at":"2024-04-01T10:01:46Z",
         "pid":116,
         "index":5,
         "phase":0,
         "booted":false,
         "last_checkin":"2024-04-01T10:01:46Z",
         "last_status":{
            
         }
      },
      {
         "started_at":"2024-04-01T10:01:46Z",
         "pid":122,
         "index":6,
         "phase":0,
         "booted":false,
         "last_checkin":"2024-04-01T10:01:46Z",
         "last_status":{
            
         }
      },
      {
         "started_at":"2024-04-01T10:02:16Z",
         "pid":128,
         "index":7,
         "phase":0,
         "booted":false,
         "last_checkin":"2024-04-01T10:02:16Z",
         "last_status":{
            
         }
      },
     <...>
      {
         "started_at":"2024-04-01T10:05:16Z",
         "pid":468,
         "index":26,
         "phase":0,
         "booted":false,
         "last_checkin":"2024-04-01T10:05:16Z",
         "last_status":{
            
         }
      }
   ],
   "versions":{
      "puma":"6.2.0",
      "ruby":{
         "engine":"ruby",
         "version":"3.1.4",
         "patchlevel":223
      }
   }
}

Some additional info 2
If amount of workers exceeds 100, I'm getting the following errors during "bad boot"

This TimeOut and Out-Of-Sync errors doesn't appear with less amount of workers and also doesn't appear every time when "bad boot" happens.

Environment

Server: (64C/128T) AMD EPYC 7713P
OS: CentOS 7, Linux kernel: 5.4.260
Docker version: 23.0.6
Puma Version 6.2.0
Ruby version: 3.1.4

Thanks in advance :)

dentarg · 2024-04-01T21:49:06Z

dentarg
Apr 1, 2024
Maintainer

Have you checked your system logs? Does they say something interesting?

0 replies

dentarg · 2024-04-01T21:52:29Z

dentarg
Apr 1, 2024
Maintainer

Are you giving Docker enough memory? Maybe this #3193 is happening?

Maybe you can try #3236 (comment)

0 replies

snowboy932 · 2024-04-02T12:07:59Z

snowboy932
Apr 2, 2024
Author

@dentarg

Have you checked your system logs? Does they say something interesting?

Decided to check it again - nothing in journalctl and dmesg :(

Are you giving Docker enough memory? Maybe this #3193 is happening?

Docker settings are set by default, I've only changed file descriptors limit in sysctl (/etc/sysctl.d/99-custom.conf) and ulimit (/etc/security/limits.conf). According to docker stats all the containers I have can use max memory (128Gi)

Maybe you can try #3236 (comment)

Just tried this one.
I've set 150 workers and 1 threads and made some tries to launch the container
The booting stagging is working as expected, but it stops booting and spawning new workers after 38-49 workers with 5 seconds interval (obviously). Last time it stopped on "workers":48 and "booted_workers":47. And, of course, got nothing new in system logs.

0 replies

dentarg · 2024-04-03T17:42:47Z

dentarg
Apr 3, 2024
Maintainer

I think you need to dig deeper. Maybe https://github.com/amrabed/strace-docker can help

Can you try the same thing without Docker?

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Large number of workers are booting too long or not booting the first time #3365

{{title}}

Replies: 4 comments

{{title}}

{{title}}

{{title}}

{{title}}

Select a reply

Large number of workers are booting too long or not booting the first time #3365

snowboy932 Apr 1, 2024

Replies: 4 comments

dentarg Apr 1, 2024 Maintainer

dentarg Apr 1, 2024 Maintainer

snowboy932 Apr 2, 2024 Author

dentarg Apr 3, 2024 Maintainer

snowboy932
Apr 1, 2024

dentarg
Apr 1, 2024
Maintainer

dentarg
Apr 1, 2024
Maintainer

snowboy932
Apr 2, 2024
Author

dentarg
Apr 3, 2024
Maintainer