100% CPU usage of nginx worker process after custom plugin crash #9301

Closed
ghost opened this issue Aug 24, 2022 · 10 comments · Fixed by #9384

Comments

ghost commented Aug 24, 2022

Is there an existing issue for this?

  • I have searched the existing issues

Kong version ($ kong version)

2.7.1

Current Behavior

When a custom plugin crashes for some reason, the Nginx worker process gets stuck in some kind of loop and the CPU usage of the worker process reaches 100%. After some time the container starts throttling and half of the requests going to this pod result in 502s. This remains the same for almost 10 minutes. Eventually the Kong main process exits and the container restarts. The custom plugins are written in Golang and run in embedded server mode.

Attaching a screenshot of the pod where CPU is high after the restart:
[screenshot: pod CPU usage, 2022-08-19 11:50 AM]
Memory usage of the same container also increases steadily:
[screenshot: pod memory usage, 2022-08-19 11:50 AM]
Output of the top command within the container whose CPU is high:
[screenshot: top output, 2022-08-22 5:39 PM]

GDB output of the worker process using bt:

(gdb) bt
#0  0x00007f8972a2942a in epoll_wait (epfd=11, events=0x5635d3a4cb90, maxevents=512, timeout=timeout@entry=0) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1  0x00005635d2244f8c in ngx_epoll_process_events (cycle=0x5635d398cd20, timer=0, flags=1) at src/event/modules/ngx_epoll_module.c:800
#2  0x00005635d223bbe8 in ngx_process_events_and_timers (cycle=cycle@entry=0x5635d398cd20) at src/event/ngx_event.c:257
#3  0x00005635d22435c0 in ngx_worker_process_cycle (cycle=cycle@entry=0x5635d398cd20, data=data@entry=0x0) at src/os/unix/ngx_process_cycle.c:782
#4  0x00005635d2241b8d in ngx_spawn_process (cycle=cycle@entry=0x5635d398cd20, proc=proc@entry=0x5635d2243540 <ngx_worker_process_cycle>, data=data@entry=0x0, 
    name=name@entry=0x5635d2398899 "worker process", respawn=respawn@entry=-3) at src/os/unix/ngx_process.c:199
#5  0x00005635d2242d74 in ngx_start_worker_processes (cycle=cycle@entry=0x5635d398cd20, n=1, type=type@entry=-3) at src/os/unix/ngx_process_cycle.c:382
#6  0x00005635d2244032 in ngx_master_process_cycle (cycle=0x5635d398cd20) at src/os/unix/ngx_process_cycle.c:135
#7  0x00005635d221a7c6 in main (argc=<optimized out>, argv=<optimized out>) at src/core/nginx.c:386

strace output of the process:

$ strace -p 1130
strace: Process 1130 attached
gettimeofday({tv_sec=1661325156, tv_usec=619141}, NULL) = 0
epoll_wait(11, [], 512, 0)              = 0 
gettimeofday({tv_sec=1661325156, tv_usec=619459}, NULL) = 0
epoll_wait(11, [], 512, 0)              = 0 
gettimeofday({tv_sec=1661325156, tv_usec=619640}, NULL) = 0
epoll_wait(11, [], 512, 0)              = 0
gettimeofday({tv_sec=1661325156, tv_usec=619750}, NULL) = 0
epoll_wait(11, [], 512, 0)              = 0 
gettimeofday({tv_sec=1661325156, tv_usec=619878}, NULL) = 0
epoll_wait(11, [], 512, 0)              = 0 
gettimeofday({tv_sec=1661325156, tv_usec=620019}, NULL) = 0
epoll_wait(11, [], 512, 0)              = 0 
gettimeofday({tv_sec=1661325156, tv_usec=620272}, NULL) = 0
epoll_wait(11, [], 512, 0)              = 0 
gettimeofday({tv_sec=1661325156, tv_usec=620454}, NULL) = 0

Expected Behavior

If a custom plugin crashes, it should not have any impact on the Nginx worker process.

Steps To Reproduce

1. Create a plugin whose logic panics in a specific case and run it in embedded server mode (a sketch of such a plugin is shown after this list).
2. Execute the panic use case.
3. Once the plugin crashes it gets restarted; hit the same API where the plugin is enabled again.
4. The Nginx worker process gets stuck in a loop and CPU usage reaches 100%.
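
For reference, a dummy plugin along these lines is enough to exercise the crash path. This is only a sketch built on the Kong Go PDK (github.com/Kong/go-pdk); the plugin name, config field, version, and priority below are placeholders rather than the plugin from this report:

```go
// go-crash.go - minimal sketch of a Go plugin that panics on demand,
// assuming the Kong Go PDK; field names and values are illustrative only.
package main

import (
	"github.com/Kong/go-pdk"
	"github.com/Kong/go-pdk/server"
)

type Config struct {
	// When true, the Access phase panics to simulate the plugin crash.
	Panic bool `json:"panic"`
}

func New() interface{} {
	return &Config{}
}

// Access runs in the access phase of every request the plugin is enabled on.
func (c *Config) Access(kong *pdk.PDK) {
	if c.Panic {
		panic("intentional crash to reproduce the 100% CPU issue")
	}
	_ = kong.Response.SetHeader("x-go-crash", "ok")
}

func main() {
	// Version and priority values are arbitrary for this sketch.
	_ = server.StartServer(New, "0.1", 1)
}
```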

Anything else?

We are running Kong in hybrid mode. The plugins are running in embedded server mode. Apart from the custom plugin we are also using the correlation-id, request-transformer, jwt, and response-transformer plugins.

fairyqb (Contributor) commented Aug 24, 2022

Recommended: capture a CPU flame graph of the process.

ghost changed the title from "100% CPU usage of nginx worker process after customs plugin crash" to "100% CPU usage of nginx worker process after custom plugin crash" on Aug 24, 2022
ghost (Author) commented Aug 25, 2022

@fairyqb Attaching the CPU flame graph of the process:

[CPU flame graph image]

hanshuebner (Contributor) commented:
Looks like epoll looping, maybe because a socket is reported as ready (readable/writeable) that is then not handled (read/written), causing the next epoll call to immediately return again. Given that this is about the Golang plugin, what does it mean that the plugin "crashed"? Can we even expect to gracefully handle this situation?
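
The strace output above is consistent with a busy loop in the event loop: epoll_wait is being called with a zero timeout, so it returns immediately on every iteration instead of blocking. The following standalone sketch (not Kong code; Linux only, using golang.org/x/sys/unix) just shows why that pattern alone pegs a CPU core:

```go
// spin.go - standalone sketch of the pattern visible in the strace output:
// epoll_wait called with a zero timeout in a tight loop never blocks, so the
// process spends all of its time in syscalls and burns a full CPU core.
package main

import (
	"fmt"
	"time"

	"golang.org/x/sys/unix"
)

func main() {
	epfd, err := unix.EpollCreate1(0)
	if err != nil {
		panic(err)
	}
	defer unix.Close(epfd)

	events := make([]unix.EpollEvent, 512)
	const iterations = 1_000_000

	start := time.Now()
	for i := 0; i < iterations; i++ {
		// timeout=0: returns immediately even when no fd is ready, exactly
		// like the repeated `epoll_wait(11, [], 512, 0) = 0` lines above.
		if _, err := unix.EpollWait(epfd, events, 0); err != nil && err != unix.EINTR {
			panic(err)
		}
	}
	fmt.Printf("%d zero-timeout epoll_wait calls in %v (all CPU, no blocking)\n",
		iterations, time.Since(start))
}
```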

ghost (Author) commented Sep 1, 2022 via email

samugi (Member) commented Sep 1, 2022

> Hi Samuele, we are not using go-pluginserver to run custom plugins; we are running a plugin in embedded server mode.

Hi @dhrumil29699, yes, I meant to review my answer after noticing that detail in your issue's description, to verify whether it made a difference. After testing this in embedded server mode I can confirm that I still could not reproduce this behaviour: after the embedded server terminates and restarts due to the plugin's crash, resource utilization remains stable, even after running a heavy load of requests on the gateway.

Based on that, I believe this problem might be due to the particular nature of your plugin's crash, the environment configuration, or the plugin implementation. The following additional details could help identify the root cause:

Can this be replicated under the following conditions?

  • a dummy plugin, with no logic other than the intentional panic
  • no other plugins configured
  • the latest version of Kong gateway - 2.8.1.4
  • the external go-pluginserver (this would provide better isolation, so it could potentially result in a different behaviour)

ghost (Author) commented Sep 1, 2022

@samugi after the plugin crash, if we hit the same API again then Nginx gets stuck in the loop, resulting in 100% CPU. For the crash I am using a Go panic (https://gobyexample.com/panic).

samugi (Member) commented Sep 3, 2022

Hello @dhrumil29699, thanks for getting back to us. As mentioned, I couldn't replicate this in a test environment based on your description. I would advise going through each bullet point from my previous answer, thank you.

ghost (Author) commented Sep 5, 2022

@samugi I have tried to replicate the issue under the conditions mentioned above.

  1. a dummy plugin, with no logic other than the intentional panic - yes, the issue is replicated
  2. no other plugins configured - yes, the issue is replicated
  3. the latest version of Kong gateway - 2.8.1.4 - yes, the issue is replicated (I used the latest docker image)
  4. the external go-pluginserver - the issue is not replicated; Kong works normally after the restart
    Although, in the go-pluginserver case, the logs suggest the plugin server restart logic is different from that in embedded server mode

go-pluginserver mode logs

2022/09/05 05:25:24 [error] 1121#0: *89 [kong] mp_rpc.lua:308 [go-crash] No plugin instance 0, client: 172.22.0.1, server: kong, request: "GET /get HTTP/1.1", host: "localhost:8080"
2022/09/05 05:25:24 [error] 1121#0: *89 [kong] mp_rpc.lua:308 [go-crash] no data, client: 172.22.0.1, server: kong, request: "GET /get HTTP/1.1", host: "localhost:8080"
2022/09/05 05:25:24 [notice] 1121#0: signal 17 (SIGCHLD) received from 1184
2022/09/05 05:25:24 [notice] 1121#0: *4 [kong] process.lua:272 external pluginserver 'go' terminated: exit 2, context: ngx.timer
2022/09/05 05:25:24 [notice] 1121#0: *4 [kong] process.lua:256 Starting go, context: ngx.timer
172.22.0.1 - - [05/Sep/2022:05:25:25 +0000] "GET /get HTTP/1.1" 200 1124 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36"

embedded server mode logs

2022/09/05 05:32:09 [error] 1122#0: *122 [kong] pb_rpc.lua:365 [go-crash] closed, client: 172.22.0.1, server: kong, request: "GET /get HTTP/1.1", host: "localhost:8080"
2022/09/05 05:32:09 [error] 1122#0: *122 connect() to unix:/usr/local/kong/go-crash.socket failed (111: Connection refused), client: 172.22.0.1, server: kong, request: "GET /get HTTP/1.1", host: "localhost:8080"
2022/09/05 05:32:09 [error] 1122#0: *122 [kong] init.lua:297 [go-crash] ...cal/share/lua/5.1/kong/runloop/plugin_servers/pb_rpc.lua:274: connection refused, client: 172.22.0.1, server: kong, request: "GET /get HTTP/1.1", host: "localhost:8080"
2022/09/05 05:32:09 [notice] 1122#0: signal 17 (SIGCHLD) received from 1123
2022/09/05 05:32:09 [notice] 1122#0: *4 [kong] process.lua:272 external pluginserver 'go-crash' terminated: exit 2, context: ngx.timer
2022/09/05 05:32:09 [notice] 1122#0: *4 [kong] process.lua:256 Starting go-crash, context: ngx.timer
172.22.0.1 - - [05/Sep/2022:05:32:09 +0000] "GET /get HTTP/1.1" 500 42 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36"

Embedded server mode had one more issue earlier, where Kong was not able to call the plugin again after a crash; fixing that required a few changes in the pb_rpc.lua file, as described in #8293.

samugi (Member) commented Sep 5, 2022

Hi @dhrumil29699, thank you for the additional details. Can you confirm that you can only replicate this by applying the patch to pb_rpc.lua? This should not be reproducible with any Kong release that is currently available. I am investigating that patch and will update you with any relevant details.

ghost (Author) commented Sep 6, 2022

Hi @samugi, yes: without the patch the plugin server is not restarted by Kong, and the endpoint using the plugin gets proxied without the plugin logic.

samugi linked a pull request on Sep 6, 2022 that will close this issue
samugi added a commit that referenced this issue Sep 6, 2022
This is a fix for #9301.
When a request made the plugin fail, the creation of a new instance
was not handling a failed connection properly.
Upon failure to connect to the socket, it was possible to end up
in a state where the instance existed without an instance id,
causing an infinite loop and high CPU utilization.
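
To make the commit message concrete, here is a small illustrative sketch in Go of the bug class and the shape of the guard (Kong's actual code is the Lua in pb_rpc.lua, and every name below is made up): if connecting to the plugin socket fails, the caller must see the error instead of being handed an id-less instance to retry against forever.

```go
// instance.go - illustrative sketch (not Kong's actual Lua from pb_rpc.lua) of
// the failure mode in the commit message: after a failed connect the instance
// has no id, and the request path keeps retrying forever. Names are hypothetical.
package main

import (
	"errors"
	"fmt"
)

type instance struct{ id int } // id == 0 means "no instance id"

// connect simulates the plugin server socket refusing connections post-crash.
func connect() (*instance, error) {
	return nil, errors.New("connection refused")
}

// buggyGetID mirrors the bug: the connect error is not propagated, so the
// caller never obtains an instance id and loops, pegging the CPU.
func buggyGetID() int {
	for {
		inst, _ := connect() // error swallowed
		if inst != nil && inst.id != 0 {
			return inst.id
		}
		// no id yet -> try again immediately: the infinite loop
	}
}

// fixedGetID sketches the shape of the fix: a failed connect is surfaced as an
// error instead of leaving a half-initialized instance to retry against.
func fixedGetID() (int, error) {
	inst, err := connect()
	if err != nil {
		return 0, err // the request fails fast instead of spinning
	}
	return inst.id, nil
}

func main() {
	if _, err := fixedGetID(); err != nil {
		fmt.Println("fail fast:", err)
	}
	_ = buggyGetID // not called here: it would spin forever by design
}
```

The real change is in the linked pull request (#9384); this sketch is only meant to show why surfacing the failed connection breaks the loop.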