Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PHP-FPM constantly restarting children with New Relic extension enabled in container with non root mode #862

Open
Aspirin4k opened this issue Mar 20, 2024 · 2 comments
Labels
bug Something isn't working

Comments

@Aspirin4k
Copy link

Description

I am trying to implement rootless containers for our application (php-fpm + nginx in k8s).
Problem occurs only when new relic extension is enabled (more precisely - when extension successfully connects to daemon. if not - everything works fine)
Depending on pm mode I observe following situations:

  1. In dynamic mode at some point children start receiving SIGKILL after ~10-20 seconds. Master process constantly restarting these processes as it expects children to finish with SIGQUIT
  2. In static mode if max_requests is set php-fpm stop correctly working after all children exceeds max_requests value, Master process can't respawn children
  3. In static mode if max_requests is 0 everything works fine. When running container under root user everything works fine as well.

At the same time I don't see any suspicious logs and metrics/transactions are correctly sent to New Relic collectors. There are no traffic on application except kube probe. There are no any spikes in memory/cpu utilisation

Steps to Reproduce

  1. Set up pod with enabled security context and non root user with enabled new relic extension

  2. Generate traffic (for example, 5 VU for 5 minutes)

  3. Stop traffic

Expected Behavior

PHP-FPM should work as expected under non root user

Relevant Logs / Console output

[20-Mar-2024 09:49:15.245718] DEBUG: pid 1, fpm_pctl_perform_idle_server_maintenance(), line 395: [pool www] currently 0 active children, 7 spare children, 7 running children. Spawning rate 1
[20-Mar-2024 09:49:15.249579] DEBUG: pid 1, fpm_event_loop(), line 440: event module triggered 1 events
[20-Mar-2024 09:49:15.249662] WARNING: pid 1, fpm_children_bury(), line 258: [pool www] child 287 exited on signal 9 (SIGKILL) after 14.015627 seconds from start
[20-Mar-2024 09:49:15.249741] DEBUG: pid 1, fpm_children_make(), line 405: blocking signals before child birth
[20-Mar-2024 09:49:15.249427] DEBUG: pid 1, fpm_event_loop(), line 440: event module triggered 2 events
[20-Mar-2024 09:49:15.249520] DEBUG: pid 1, fpm_got_signal(), line 82: received SIGCHLD
[20-Mar-2024 09:49:15.253627] NOTICE: pid 1, fpm_children_make(), line 435: [pool www] child 294 started
[20-Mar-2024 09:49:15.253408] DEBUG: pid 1, fpm_children_make(), line 429: unblocking signals, child born
2024/03/20 09:49:16.176045 (26) Debug: received binary message, len=9940
2024-03-20 09:49:16.178 +0000 (288 288) debug: APPINFO reply connected
2024-03-20 09:49:16.179 +0000 (288 288) debug: APPINFO reply full app='***' agent_run_id=BQAEdEbs***
2024-03-20 09:49:16.179 +0000 (288 288) debug: Adaptive sampling configuration. Connect: 1710928127000000 us. Frequency: 60000000 us. Target: 10.
2024-03-20 09:49:16.181 +0000 (288 288) debug: 'WT_IS_FILENAME & SCRIPT_FILENAME' naming is '/index.php'
2024-03-20 09:49:16.183 +0000 (288 288) warning: User instrumentation from opcache: missing 'scripts' key in status information
2024-03-20 09:49:16.184 +0000 (288 288) debug: detected library=Guzzle 6
2024-03-20 09:49:16.244 +0000 (288 288) debug: detected library=Monolog
[20-Mar-2024 09:49:16.247757] DEBUG: pid 1, fpm_pctl_perform_idle_server_maintenance(), line 395: [pool www] currently 1 active children, 6 spare children, 7 running children. Spawning rate 1
2024-03-20 09:49:16.255 +0000 (289 289) debug: closed daemon connection fd=7
2024-03-20 09:49:16.254 +0000 (289 289) debug: MSHUTDOWN processing started
2024-03-20 09:49:16.330 +0000 (288 288) debug: ignoring this transaction
2024-03-20 09:49:16.334 +0000 (288 288) debug: # elements: 5, # buckets used: 4
[20-Mar-2024 09:49:16.334698] DEBUG: pid 1, fpm_event_loop(), line 440: event module triggered 1 events
2024-03-20 09:49:16.334 +0000 (288 288) debug: collisions - min: 1, max: 2, avg: 2
2024/03/20 09:49:17.086909 (26) Debug: harvesting log events
2024/03/20 09:49:17.087126 (26) Debug: harvesting error events
2024/03/20 09:49:17.087204 (26) Debug: harvesting custom events
2024/03/20 09:49:17.087026 (26) Debug: harvesting transaction events
[20-Mar-2024 09:49:17.249764] DEBUG: pid 1, fpm_pctl_perform_idle_server_maintenance(), line 395: [pool www] currently 0 active children, 7 spare children, 7 running children. Spawning rate 1
2024-03-20 09:49:17.255 +0000 (288 288) debug: closed daemon connection fd=7
2024-03-20 09:49:17.255 +0000 (288 288) debug: MSHUTDOWN processing started
[20-Mar-2024 09:49:18.251010] DEBUG: pid 1, fpm_pctl_perform_idle_server_maintenance(), line 395: [pool www] currently 0 active children, 7 spare children, 7 running children. Spawning rate 1
[20-Mar-2024 09:49:18.256774] DEBUG: pid 1, fpm_got_signal(), line 82: received SIGCHLD
[20-Mar-2024 09:49:18.256957] DEBUG: pid 1, fpm_children_make(), line 405: blocking signals before child birth
[20-Mar-2024 09:49:18.256902] WARNING: pid 1, fpm_children_bury(), line 258: [pool www] child 288 exited on signal 9 (SIGKILL) after 15.021025 seconds from start
[20-Mar-2024 09:49:18.256844] DEBUG: pid 1, fpm_event_loop(), line 440: event module triggered 1 events
[20-Mar-2024 09:49:18.256668] DEBUG: pid 1, fpm_event_loop(), line 440: event module triggered 1 events
[20-Mar-2024 09:49:18.256574] DEBUG: pid 1, fpm_event_loop(), line 440: event module triggered 1 events
[20-Mar-2024 09:49:18.259668] DEBUG: pid 1, fpm_children_make(), line 429: unblocking signals, child born
[20-Mar-2024 09:49:18.259823] NOTICE: pid 1, fpm_children_make(), line 435: [pool www] child 295 started
[20-Mar-2024 09:49:19.252095] DEBUG: pid 1, fpm_pctl_perform_idle_server_maintenance(), line 395: [pool www] currently 0 active children, 7 spare children, 7 running children. Spawning rate 1
[20-Mar-2024 09:49:19.256189] DEBUG: pid 1, fpm_event_loop(), line 440: event module triggered 1 events
[20-Mar-2024 09:49:19.256306] DEBUG: pid 1, fpm_children_make(), line 405: blocking signals before child birth
[20-Mar-2024 09:49:19.256250] WARNING: pid 1, fpm_children_bury(), line 258: [pool www] child 289 exited on signal 9 (SIGKILL) after 14.015966 seconds from start
[20-Mar-2024 09:49:19.256113] DEBUG: pid 1, fpm_event_loop(), line 440: event module triggered 2 events
[20-Mar-2024 09:49:19.256177] DEBUG: pid 1, fpm_got_signal(), line 82: received SIGCHLD
[20-Mar-2024 09:49:19.259848] DEBUG: pid 1, fpm_children_make(), line 429: unblocking signals, child born
[20-Mar-2024 09:49:19.260000] NOTICE: pid 1, fpm_children_make(), line 435: [pool www] child 296 started
[20-Mar-2024 09:49:20.254242] DEBUG: pid 1, fpm_pctl_perform_idle_server_maintenance(), line 395: [pool www] currently 0 active children, 7 spare children, 7 running children. Spawning rate 1
2024-03-20 09:49:20.262 +0000 (290 290) debug: closed daemon connection fd=7
2024-03-20 09:49:20.262 +0000 (290 290) debug: MSHUTDOWN processing started
[20-Mar-2024 09:49:21.255477] DEBUG: pid 1, fpm_pctl_perform_idle_server_maintenance(), line 395: [pool www] currently 0 active children, 7 spare children, 7 running children. Spawning rate 1
[20-Mar-2024 09:49:21.259052] DEBUG: pid 1, fpm_got_signal(), line 82: received SIGCHLD
[20-Mar-2024 09:49:21.259138] WARNING: pid 1, fpm_children_bury(), line 258: [pool www] child 290 exited on signal 9 (SIGKILL) after 14.018960 seconds from start

Your Environment

Dockerfile

FROM ubuntu:focal

ENV \
  USER='user' \
  UID=1033 \
  GROUP='app' \
  GID=1033 \
  PHP_VERSION="8.0"

RUN \
  set -x && \
  groupadd --system --gid ${GID} ${GROUP}  && \
  useradd --system --gid ${GID} --no-create-home --home /nonexistent --shell /bin/false --uid ${UID} ${USER}

# Installing packages
RUN \
  apt-get update && \
  apt-get install -y \
    curl \
    apt-utils \
    gettext-base \
    apt-transport-https \
    ca-certificates \
    language-pack-en-base \
    software-properties-common && \
  LC_ALL=en_US.UTF-8 apt-add-repository ppa:ondrej/php && \
  echo 'deb http://apt.newrelic.com/debian/ newrelic non-free' | tee /etc/apt/sources.list.d/newrelic.list && \
  curl https://download.newrelic.com/548C16BF.gpg | apt-key add - && \
  apt-get update && \
  apt-get install -y --no-install-recommends \
    php${PHP_VERSION} \
    php${PHP_VERSION}-fpm \
    php${PHP_VERSION}-bcmath \
    php${PHP_VERSION}-bz2 \
    php${PHP_VERSION}-yaml \
    php${PHP_VERSION}-curl \
    php${PHP_VERSION}-dba \
    php${PHP_VERSION}-gd \
    php${PHP_VERSION}-gmp \
    php${PHP_VERSION}-apcu \
    php${PHP_VERSION}-imap \
    php${PHP_VERSION}-intl \
    php${PHP_VERSION}-ldap \
    php${PHP_VERSION}-mbstring \
    php${PHP_VERSION}-mysql \
    php${PHP_VERSION}-opcache \
    php${PHP_VERSION}-pgsql \
    php${PHP_VERSION}-readline \
    php${PHP_VERSION}-soap \
    php${PHP_VERSION}-xml \
    php${PHP_VERSION}-xmlrpc \
    php${PHP_VERSION}-xsl \
    php${PHP_VERSION}-zip \
    php${PHP_VERSION}-amqp \
    php${PHP_VERSION}-gnupg \
    php${PHP_VERSION}-igbinary \
    php${PHP_VERSION}-memcached \
    php${PHP_VERSION}-imagick \
    php${PHP_VERSION}-mongo \
    php${PHP_VERSION}-mongodb \
    php${PHP_VERSION}-redis \
    php${PHP_VERSION}-ssh2 \
    php${PHP_VERSION}-tideways \
    php${PHP_VERSION}-xhprof \
    php${PHP_VERSION}-grpc \
    newrelic-php5 && \
  apt-get autoremove -y && \
  apt-get clean -y && \
  rm -rf /var/lib/apt/lists/* && \
  # creating symlink for the current php version
  ln -s $(which php-fpm${PHP_VERSION}) /usr/local/sbin/php-fpm

COPY configs/php-fpm.template /etc/php/${PHP_VERSION}/fpm/php-fpm.template
COPY configs/newrelic.ini.template /etc/php/${PHP_VERSION}/fpm/newrelic.ini.template
COPY scripts/startup.sh /usr/local/bin/startup.sh
COPY configs/conf_dir.php_mod_apcu/ /etc/php/${PHP_VERSION}

RUN \
  mkdir -p \
    /app \
    /run/php/ \
    /var/run/syslog-ng/ \
    /var/log/php \
    /usr/local/etc/webapp && \
  # stage env for old projects, and dafault value for the image
  echo "test" > /usr/local/etc/webapp/srvtype && \
  touch /run/php${PHP_VERSION}-fpm.pid && \
  touch /run/php${PHP_VERSION}-newrelic.pid && \
  touch /tmp/.newrelic.sock && \
  chown -R ${USER}:${GROUP} \
    /run/php${PHP_VERSION}-fpm.pid \
    /run/php${PHP_VERSION}-newrelic.pid \
    /etc/php/ \
    /run/php/ \
    /var/log/ \
    /usr/local/etc/webapp/srvtype \
    /tmp/.newrelic.sock \
    /app

WORKDIR /app
COPY .opcache.preload.php /app

USER ${USER}

ENV \
  # Clear env=no, for the app's that are using vars in the fpm pool
  PHP_FPM_CLEAR_ENV='yes' \
  PHP_FPM_ERROR_LOG='/var/log/php/php-error.log' \
  PHP_FPM_SLOW_LOG='/var/log/php/php-slow.log' \
  PHP_FPM_SLOW_LOG_TIMEOUT='0' \
  # port or socket
  PHP_FPM_LISTEN_TYPE='port' \
  PHP_FPM_LISTEN_PORT=9000 \
  PHP_FPM_LISTEN_SOCKET='/run/php/php-fpm.sock' \
  # Old projects env selection /usr/local/etc/webapp/srvtype
  SRV_TYPE='test' \
  # New relic vars
  NEW_RELIC_ENABLED='false' \
  NEW_RELIC_LICENSE='' \
  NEW_RELIC_APP_NAME='' \
  PHP_FPM_STATUS_PATH='/fpm_status'

# Generating default php-fpm config.
# Config will be overriten if default variables are changed
# Or in case, if image entrypoint will be overriten, so startup script will not launched.
RUN \
  /usr/local/bin/startup.sh ls -l /etc/php/${PHP_VERSION}/fpm/php-fpm.conf

EXPOSE ${PHP_FPM_LISTEN_PORT}

STOPSIGNAL SIGQUIT

ENTRYPOINT [ "/usr/local/bin/startup.sh" ]

CMD [ "php-fpm" ]

PHP-FPM configuration

[global]
pid = /run/php${PHP_VERSION}-fpm.pid
error_log = ${PHP_FPM_ERROR_LOG}
daemonize = no
log_level = debug

[www]
clear_env = ${PHP_FPM_CLEAR_ENV}

catch_workers_output = yes

user   = ${USER}
group  = ${GROUP}
listen = ${PHP_FPM_LISTEN_REAL_ADDR}
listen.mode  = 0660
listen.owner = ${USER}
listen.group = ${GROUP}

pm = dynamic
pm.status_path = ${PHP_FPM_STATUS_PATH}
pm.max_children  = 10
pm.start_servers = 2
pm.max_requests  = 0
pm.min_spare_servers = 2
pm.max_spare_servers = 5
pm.process_idle_timeout = 60s

slowlog = ${PHP_FPM_SLOW_LOG}
request_slowlog_timeout   = ${PHP_FPM_SLOW_LOG_TIMEOUT}
request_terminate_timeout = 0

New Relic configuration

extension = "newrelic.so"

[newrelic]
newrelic.enabled = "${NEW_RELIC_ENABLED}"

newrelic.daemon.app_connect_timeout=15s
newrelic.daemon.start_timeout=5s

newrelic.license = "${NEW_RELIC_LICENSE}"

newrelic.logfile = "${PHP_FPM_ERROR_LOG}"
newrelic.loglevel = "debug"

newrelic.appname = "${NEW_RELIC_APP_NAME}"

newrelic.daemon.logfile = "${PHP_FPM_ERROR_LOG}"
newrelic.daemon.loglevel = "debug"

newrelic.daemon.port = "/tmp/.newrelic.sock"

newrelic.daemon.pidfile = "/run/php${PHP_VERSION}-newrelic.pid"

newrelic.browser_monitoring.auto_instrument = false

PHP version:

PHP 8.0.28 (cli) (built: Feb 14 2023 18:32:57) ( NTS )
Copyright (c) The PHP Group
Zend Engine v4.0.28, Copyright (c) Zend Technologies
    with Zend OPcache v8.0.28, Copyright (c), by Zend Technologies

New Relic version: New Relic daemon version 10.18.0.8-e78afe1fd086

@Aspirin4k Aspirin4k added the bug Something isn't working label Mar 20, 2024
@Aspirin4k
Copy link
Author

Debugging a little bit deeper revealed following: here are traces for master and child processes
Master

--
getpid()                                = 1
write(10, "C", 1)                       = 1
rt_sigreturn({mask=[]})                 = 2
read(24, "", 1023)                      = 0
epoll_ctl(12, EPOLL_CTL_DEL, 24, 0x7ffdfa69f06c) = 0
close(24)                               = 0
getpid()                                = 1
read(17, "", 1023)                      = 0
epoll_ctl(12, EPOLL_CTL_DEL, 17, 0x7ffdfa69f06c) = 0
close(17)                               = 0
getpid()                                = 1
getpid()                                = 1
epoll_wait(12, [{EPOLLIN, {u32=2398040064, u64=94616232603648}}], 21, 998) = 1
read(9, "C", 1)                         = 1
read(9, 0x7ffdfa69f517, 1)              = -1 EAGAIN (Resource temporarily unavailable)
getpid()                                = 1
wait4(-1, [{WIFSIGNALED(s) && WTERMSIG(s) == SIGKILL}], WNOHANG|WSTOPPED, NULL) = 42173
write(8, "[21-Mar-2024 02:13:38] WARNING: "..., 119) = 119
pipe([17, 23])                          = 0
pipe([24, 34])                          = 0
fcntl(17, F_GETFL)                      = 0 (flags O_RDONLY)
fcntl(17, F_SETFL, O_RDONLY|O_NONBLOCK) = 0
fcntl(24, F_GETFL)                      = 0 (flags O_RDONLY)
fcntl(24, F_SETFL, O_RDONLY|O_NONBLOCK) = 0
rt_sigprocmask(SIG_BLOCK, [QUIT USR1 USR2 TERM CHLD], NULL, 8) = 0
clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f8e9c433890) = 42213
rt_sigprocmask(SIG_UNBLOCK, ~[RTMIN RT_1], NULL, 8) = 0
close(23)                               = 0
close(34)                               = 0
epoll_ctl(12, EPOLL_CTL_ADD, 17, {EPOLLIN, {u32=2418135400, u64=94616252698984}}) = 0
epoll_ctl(12, EPOLL_CTL_ADD, 24, {EPOLLIN, {u32=2418135472, u64=94616252699056}}) = 0
write(8, "[21-Mar-2024 02:13:38] NOTICE: ["..., 62) = 62
wait4(-1, 0x7ffdfa69f49c, WNOHANG|WSTOPPED, NULL) = 0
getpid()                                = 1
getpid()                                = 1
epoll_wait(12, [], 21, 995)             = 0
getpid()                                = 1
getsockopt(11, SOL_TCP, TCP_INFO, "\n\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\377\1\0\0"..., [104]) = 0
kill(42177, SIGQUIT)                    = 0
getpid()                                = 1
getpid()                                = 1
epoll_wait(12, [], 21, 1)               = 0
getpid()                                = 1
getpid()                                = 1
epoll_wait(12, [], 21, 999)             = 0
getpid()                                = 1
getsockopt(11, SOL_TCP, TCP_INFO, "\n\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\377\1\0\0"..., [104]) = 0
kill(42177, SIGKILL)                    = 0
getpid()                                = 1
getpid()                                = 1
epoll_wait(12, [{EPOLLHUP, {u32=2418136144, u64=94616252699728}}, {EPOLLHUP, {u32=2418136072, u64=94616252699656}}], 21, 1000) = 2
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_KILLED, si_pid=42177, si_uid=1033, si_status=SIGKILL, si_utime=0, si_stime=0} ---

Child:

strace: Process 2476046 attached
accept(19, 0x7ffdfa69f540, [112])       = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
--- SIGQUIT {si_signo=SIGQUIT, si_code=SI_USER, si_pid=1, si_uid=1033} ---
close(19)                               = 0
socket(AF_UNIX, SOCK_STREAM, 0)         = 8
rt_sigreturn({mask=[]})                 = 43
accept(19, 0x7ffdfa69f540, [112])       = -1 EBADF (Bad file descriptor)
munmap(0x7f8e978b8000, 127760)          = 0
...
gettid()                                = 41701
getpid()                                = 41701
writev(3, [{iov_base="2024-03-21 02:09:41.584 +0000 (4"..., iov_len=51}, {iov_base="MSHUTDOWN processing started", iov_len=28}, {iov_base="\n", iov_len=1}], 3) = 80
close(3)                                = 0
munmap(0x7f8e97f35000, 1104248)         = 0
munmap(0x7f8e98043000, 166040)          = 0
munmap(0x7f8e9806c000, 75048)           = 0
socket(AF_UNIX, SOCK_DGRAM|SOCK_CLOEXEC, 0) = 3
connect(3, {sa_family=AF_UNIX, sun_path="/dev/log"}, 110) = -1 ENOENT (No such file or directory)
close(3) 

munmap(0x7f8e99da6000, 20720)           = 0
futex(0x7f8e982e3bd0, FUTEX_WAKE_PRIVATE, 2147483647) = 0
munmap(0x7f8e9808a000, 2492416)         = 0
munmap(0x7f8e9807f000, 41112)           = 0
munmap(0x7f8e9aab9000, 94520)           = 0
munmap(0x7f8e9847f000, 107456)          = 0
munmap(0x7f8e987a9000, 490896)          = 0
...
munmap(0x7f8e9946c000, 69968)           = 0
munmap(0x7f8e9947e000, 53496)           = 0
futex(0x7f8e994ace40, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7f8e994ac140, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY) = ?
+++ killed by SIGKILL +++

Combining with logs above it looks like master process sends SIGQUIT signal but child cannot end in 1 second (time between idle maintenance heartbeat in php-fpm) and next master sends SIGKILL. Having child exited with SIGKILL is considered as unexpected that's why it creates new child (So each child takes 2 iterations to restart (2 seconds), having pool of 6-10 workers leads to killing children after 12-20 seconds).

It looks like problem occur in MSHUTDOWN handler of new relic agent. More precisely when handling one of the mutexes (probably applist->applist_lock)

@Aspirin4k
Copy link
Author

It turned out that SEGFAULT in grpc extension for some reason leads to hanging mutex in new relic extension.
I didn't manage to find out why grpc crashes in rootless environment, but disabling it resolves the problem

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

1 participant