100% CPU usage of nginx worker process after custom plugin crash #9301

Closed
ghost opened this issue Aug 24, 2022 · 10 comments · Fixed by #9384

Comments

ghost commented Aug 24, 2022

Is there an existing issue for this?

  • I have searched the existing issues

Kong version ($ kong version)

2.7.1

Current Behavior

When a custom plugin crashes for some reason, the Nginx worker process gets stuck in some kind of loop and the CPU usage of the worker process reaches 100%. After some time the container starts throttling and half of the requests going to this pod result in 502s. This remains the same for almost 10 minutes. Eventually the Kong main process exits and the container restarts. The custom plugins are written in Golang and run in embedded server mode.

Attaching a screenshot of the pod where CPU is high after the restart:
[screenshot: pod CPU usage, 2022-08-19 11:50 AM]
Memory usage of the same container also increases steadily:
[screenshot: pod memory usage, 2022-08-19 11:50 AM]
Output of the top command within the container whose CPU is high:
[screenshot: top output, 2022-08-22 5:39 PM]

GDB output of the worker process using bt:

(gdb) bt
#0  0x00007f8972a2942a in epoll_wait (epfd=11, events=0x5635d3a4cb90, maxevents=512, timeout=timeout@entry=0) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1  0x00005635d2244f8c in ngx_epoll_process_events (cycle=0x5635d398cd20, timer=0, flags=1) at src/event/modules/ngx_epoll_module.c:800
#2  0x00005635d223bbe8 in ngx_process_events_and_timers (cycle=cycle@entry=0x5635d398cd20) at src/event/ngx_event.c:257
#3  0x00005635d22435c0 in ngx_worker_process_cycle (cycle=cycle@entry=0x5635d398cd20, data=data@entry=0x0) at src/os/unix/ngx_process_cycle.c:782
#4  0x00005635d2241b8d in ngx_spawn_process (cycle=cycle@entry=0x5635d398cd20, proc=proc@entry=0x5635d2243540 <ngx_worker_process_cycle>, data=data@entry=0x0, 
    name=name@entry=0x5635d2398899 "worker process", respawn=respawn@entry=-3) at src/os/unix/ngx_process.c:199
#5  0x00005635d2242d74 in ngx_start_worker_processes (cycle=cycle@entry=0x5635d398cd20, n=1, type=type@entry=-3) at src/os/unix/ngx_process_cycle.c:382
#6  0x00005635d2244032 in ngx_master_process_cycle (cycle=0x5635d398cd20) at src/os/unix/ngx_process_cycle.c:135
#7  0x00005635d221a7c6 in main (argc=<optimized out>, argv=<optimized out>) at src/core/nginx.c:386

strace output of the process:

$ strace -p 1130
strace: Process 1130 attached
gettimeofday({tv_sec=1661325156, tv_usec=619141}, NULL) = 0
epoll_wait(11, [], 512, 0)              = 0 
gettimeofday({tv_sec=1661325156, tv_usec=619459}, NULL) = 0
epoll_wait(11, [], 512, 0)              = 0 
gettimeofday({tv_sec=1661325156, tv_usec=619640}, NULL) = 0
epoll_wait(11, [], 512, 0)              = 0
gettimeofday({tv_sec=1661325156, tv_usec=619750}, NULL) = 0
epoll_wait(11, [], 512, 0)              = 0 
gettimeofday({tv_sec=1661325156, tv_usec=619878}, NULL) = 0
epoll_wait(11, [], 512, 0)              = 0 
gettimeofday({tv_sec=1661325156, tv_usec=620019}, NULL) = 0
epoll_wait(11, [], 512, 0)              = 0 
gettimeofday({tv_sec=1661325156, tv_usec=620272}, NULL) = 0
epoll_wait(11, [], 512, 0)              = 0 
gettimeofday({tv_sec=1661325156, tv_usec=620454}, NULL) = 0

Expected Behavior

If a custom plugin crashes, it should not have any impact on the Nginx worker process.

Steps To Reproduce

1. Create a plugin whose logic panics in a specific case and run it in embedded server mode (a sketch of such a plugin is shown after this list).
2. Execute the panic use case.
3. Once the plugin crashes it gets restarted; hit the same API where the plugin is enabled again.
4. The Nginx worker process gets stuck in a loop and CPU usage reaches 100%.
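
For reference, a dummy plugin along these lines is enough to exercise the crash path. This is only a sketch built on the Kong Go PDK (github.com/Kong/go-pdk); the plugin name, config field, version, and priority below are placeholders rather than the plugin from this report:

```go
// go-crash.go - minimal sketch of a Go plugin that panics on demand,
// assuming the Kong Go PDK; field names and values are illustrative only.
package main

import (
	"github.com/Kong/go-pdk"
	"github.com/Kong/go-pdk/server"
)

type Config struct {
	// When true, the Access phase panics to simulate the plugin crash.
	Panic bool `json:"panic"`
}

func New() interface{} {
	return &Config{}
}

// Access runs in the access phase of every request the plugin is enabled on.
func (c *Config) Access(kong *pdk.PDK) {
	if c.Panic {
		panic("intentional crash to reproduce the 100% CPU issue")
	}
	_ = kong.Response.SetHeader("x-go-crash", "ok")
}

func main() {
	// Version and priority values are arbitrary for this sketch.
	_ = server.StartServer(New, "0.1", 1)
}
```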

Anything else?

We are running Kong in hybrid mode. The plugins are running in embedded server mode. Apart from the custom plugin we are also using the correlation-id, request-transformer, jwt, and response-transformer plugins.

fairyqb (Contributor) commented Aug 24, 2022

Recommended: capture a CPU flame graph of the process.

ghost changed the title from "100% CPU usage of nginx worker process after customs plugin crash" to "100% CPU usage of nginx worker process after custom plugin crash" on Aug 24, 2022
ghost (Author) commented Aug 25, 2022

@fairyqb Attaching the CPU flame graph of the process:

[CPU flame graph image]

hanshuebner (Contributor) commented:
Looks like epoll looping, maybe because a socket is reported as ready (readable/writeable) that is then not handled (read/written), causing the next epoll call to immediately return again. Given that this is about the Golang plugin, what does it mean that the plugin "crashed"? Can we even expect to gracefully handle this situation?
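
The strace output above is consistent with a busy loop in the event loop: epoll_wait is being called with a zero timeout, so it returns immediately on every iteration instead of blocking. The following standalone sketch (not Kong code; Linux only, using golang.org/x/sys/unix) just shows why that pattern alone pegs a CPU core:

```go
// spin.go - standalone sketch of the pattern visible in the strace output:
// epoll_wait called with a zero timeout in a tight loop never blocks, so the
// process spends all of its time in syscalls and burns a full CPU core.
package main

import (
	"fmt"
	"time"

	"golang.org/x/sys/unix"
)

func main() {
	epfd, err := unix.EpollCreate1(0)
	if err != nil {
		panic(err)
	}
	defer unix.Close(epfd)

	events := make([]unix.EpollEvent, 512)
	const iterations = 1_000_000

	start := time.Now()
	for i := 0; i < iterations; i++ {
		// timeout=0: returns immediately even when no fd is ready, exactly
		// like the repeated `epoll_wait(11, [], 512, 0) = 0` lines above.
		if _, err := unix.EpollWait(epfd, events, 0); err != nil && err != unix.EINTR {
			panic(err)
		}
	}
	fmt.Printf("%d zero-timeout epoll_wait calls in %v (all CPU, no blocking)\n",
		iterations, time.Since(start))
}
```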

ghost (Author) commented Sep 1, 2022 via email

samugi (Member) commented Sep 1, 2022

> Hi Samuele, we are not using go-pluginserver to run custom plugins; we are running a plugin in embedded server mode.

Hi @dhrumil29699, yes, I meant to review my answer after noticing that detail in your issue's description, to verify whether it made a difference. After testing this in embedded server mode I can confirm that I still could not reproduce this behaviour: after the embedded server terminates and restarts due to the plugin's crash, resource utilization remains stable, even after running a heavy load of requests on the gateway.

Based on that, I believe this problem might be due to the particular nature of your plugin's crash, the environment configuration, or the plugin implementation. The following additional details could help identify the root cause:

Can this be replicated under the following conditions?

  • a dummy plugin, with no logic other than the intentional panic
  • no other plugins configured
  • the latest version of Kong gateway - 2.8.1.4
  • the external go-pluginserver (this would provide better isolation, so it could potentially result in a different behaviour)

ghost (Author) commented Sep 1, 2022

@samugi after the plugin crash, if we hit the same API again then Nginx gets stuck in the loop, resulting in 100% CPU. For the crash I am using a Go panic (https://gobyexample.com/panic).

samugi (Member) commented Sep 3, 2022

Hello @dhrumil29699, thanks for getting back to us. As mentioned, I couldn't replicate this in a test environment based on your description. I would advise going through each bullet point from my previous answer, thank you.

ghost (Author) commented Sep 5, 2022

@samugi I have tried to replicate the issue under the conditions mentioned above.

  1. a dummy plugin, with no logic other than the intentional panic - yes, the issue is replicated
  2. no other plugins configured - yes, the issue is replicated
  3. the latest version of Kong gateway - 2.8.1.4 - yes, the issue is replicated (I used the latest docker image)
  4. the external go-pluginserver - the issue is not replicated; Kong works normally after the restart
    Although, in the go-pluginserver case, the logs suggest the plugin server restart logic is different from that in embedded server mode

go-pluginserver mode logs

2022/09/05 05:25:24 [error] 1121#0: *89 [kong] mp_rpc.lua:308 [go-crash] No plugin instance 0, client: 172.22.0.1, server: kong, request: "GET /get HTTP/1.1", host: "localhost:8080"
2022/09/05 05:25:24 [error] 1121#0: *89 [kong] mp_rpc.lua:308 [go-crash] no data, client: 172.22.0.1, server: kong, request: "GET /get HTTP/1.1", host: "localhost:8080"
2022/09/05 05:25:24 [notice] 1121#0: signal 17 (SIGCHLD) received from 1184
2022/09/05 05:25:24 [notice] 1121#0: *4 [kong] process.lua:272 external pluginserver 'go' terminated: exit 2, context: ngx.timer
2022/09/05 05:25:24 [notice] 1121#0: *4 [kong] process.lua:256 Starting go, context: ngx.timer
172.22.0.1 - - [05/Sep/2022:05:25:25 +0000] "GET /get HTTP/1.1" 200 1124 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36"

embedded server mode logs

2022/09/05 05:32:09 [error] 1122#0: *122 [kong] pb_rpc.lua:365 [go-crash] closed, client: 172.22.0.1, server: kong, request: "GET /get HTTP/1.1", host: "localhost:8080"
2022/09/05 05:32:09 [error] 1122#0: *122 connect() to unix:/usr/local/kong/go-crash.socket failed (111: Connection refused), client: 172.22.0.1, server: kong, request: "GET /get HTTP/1.1", host: "localhost:8080"
2022/09/05 05:32:09 [error] 1122#0: *122 [kong] init.lua:297 [go-crash] ...cal/share/lua/5.1/kong/runloop/plugin_servers/pb_rpc.lua:274: connection refused, client: 172.22.0.1, server: kong, request: "GET /get HTTP/1.1", host: "localhost:8080"
2022/09/05 05:32:09 [notice] 1122#0: signal 17 (SIGCHLD) received from 1123
2022/09/05 05:32:09 [notice] 1122#0: *4 [kong] process.lua:272 external pluginserver 'go-crash' terminated: exit 2, context: ngx.timer
2022/09/05 05:32:09 [notice] 1122#0: *4 [kong] process.lua:256 Starting go-crash, context: ngx.timer
172.22.0.1 - - [05/Sep/2022:05:32:09 +0000] "GET /get HTTP/1.1" 500 42 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36"

Embedded server mode had one more issue earlier, where Kong was not able to call the plugin again after a crash; fixing that required a few changes in the pb_rpc.lua file, as described in #8293.

samugi (Member) commented Sep 5, 2022

Hi @dhrumil29699, thank you for the additional details. Can you confirm that you can only replicate this by applying the patch to pb_rpc.lua? This should not be reproducible with any Kong release that is currently available. I am investigating that patch and will update you with any relevant details.

ghost (Author) commented Sep 6, 2022

Hi @samugi, yes: without the patch the plugin server is not restarted by Kong, and the endpoint using the plugin gets proxied without the plugin logic.

samugi linked a pull request on Sep 6, 2022 that will close this issue
samugi added a commit that referenced this issue Sep 6, 2022
This is a fix for #9301.
When a request made the plugin fail, the creation of a new instance
was not handling a failed connection properly.
Upon failure to connect to the socket, it was possible to end up
in a state where the instance existed without an instance id,
causing an infinite loop and high CPU utilization.
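
To make the commit message concrete, here is a small illustrative sketch in Go of the bug class and the shape of the guard (Kong's actual code is the Lua in pb_rpc.lua, and every name below is made up): if connecting to the plugin socket fails, the caller must see the error instead of being handed an id-less instance to retry against forever.

```go
// instance.go - illustrative sketch (not Kong's actual Lua from pb_rpc.lua) of
// the failure mode in the commit message: after a failed connect the instance
// has no id, and the request path keeps retrying forever. Names are hypothetical.
package main

import (
	"errors"
	"fmt"
)

type instance struct{ id int } // id == 0 means "no instance id"

// connect simulates the plugin server socket refusing connections post-crash.
func connect() (*instance, error) {
	return nil, errors.New("connection refused")
}

// buggyGetID mirrors the bug: the connect error is not propagated, so the
// caller never obtains an instance id and loops, pegging the CPU.
func buggyGetID() int {
	for {
		inst, _ := connect() // error swallowed
		if inst != nil && inst.id != 0 {
			return inst.id
		}
		// no id yet -> try again immediately: the infinite loop
	}
}

// fixedGetID sketches the shape of the fix: a failed connect is surfaced as an
// error instead of leaving a half-initialized instance to retry against.
func fixedGetID() (int, error) {
	inst, err := connect()
	if err != nil {
		return 0, err // the request fails fast instead of spinning
	}
	return inst.id, nil
}

func main() {
	if _, err := fixedGetID(); err != nil {
		fmt.Println("fail fast:", err)
	}
	_ = buggyGetID // not called here: it would spin forever by design
}
```

The real change is in the linked pull request (#9384); this sketch is only meant to show why surfacing the failed connection breaks the loop.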