Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

graceful reload: kong reload on a machine with reasonable traffic and a go plugin leads to 404 and DOS #8531

Closed
1 task done
diefans opened this issue Mar 10, 2022 · 6 comments

Comments

@diefans
Copy link

diefans commented Mar 10, 2022

Is there an existing issue for this?

  • I have searched the existing issues

Kong version ($ kong version)

2.7.1

Current Behavior

When just reloading kong while the host receives reasonable traffic passed through a go plugin, all responses after reload have a status==404 rendering the service useless.

The error log shows:

kong_1               | 192.168.33.1 - - [10/Mar/2022:17:34:35 +0000] "GET /auth/version HTTP/1.1" 404 396 "-" "-"
kong_1               | 2022/03/10 17:34:35 [error] 3442#0: *1453547 [kong] pb_rpc.lua:365 [bm-authx] closed, client: 192.168.33.1, server: kong, request: "GET /auth/version HTTP/1.1", host: "example.com"
kong_1               | 192.168.33.1 - - [10/Mar/2022:17:34:35 +0000] "GET /auth/version HTTP/1.1" 404 396 "-" "-"
kong_1               | 2022/03/10 17:34:35 [error] 3442#0: *1453556 [kong] pb_rpc.lua:365 [bm-authx] closed, client: 192.168.33.1, server: kong, request: "GET /auth/version HTTP/1.1", host: "example.com"
kong_1               | 192.168.33.1 - - [10/Mar/2022:17:34:35 +0000] "GET /auth/version HTTP/1.1" 404 396 "-" "-"
kong_1               | 2022/03/10 17:34:35 [error] 3442#0: *1453548 [kong] pb_rpc.lua:365 [bm-authx] closed, client: 192.168.33.1, server: kong, request: "GET /auth/version HTTP/1.1", host: "example.com"

Expected Behavior

kong reloads gracefully without notice to a client.

Steps To Reproduce

1. involve go plugin route in a loadtest
2. reload kong e.g. `kong reload`

Anything else?

We believe, this is related to how kong is managing recreating its workers and the go plugin process, e.g. binding new workers to the old plugin process and stopping it the latter.

@diefans
Copy link
Author

diefans commented Mar 14, 2022

It seems I am able to "fix" the big 404 issue by resetting the plugin instance:

diff --git a/kong/runloop/plugin_servers/pb_rpc.lua b/kong/runloop/plugin_servers/pb_rpc.lua
index 38f3aafe7..ae4531cdd 100644
--- a/kong/runloop/plugin_servers/pb_rpc.lua
+++ b/kong/runloop/plugin_servers/pb_rpc.lua
@@ -396,7 +396,7 @@ function Rpc:handle_event(plugin_name, conf, phase)
   if not res then
     kong.log.err(err)
 
-    if string.match(err:lower(), "no plugin instance") then
+    if string.match(err:lower(), "no plugin instance") or err:lower() == "closed" then
       self.reset_instance(plugin_name, conf)
       return self:handle_event(plugin_name, conf, phase)
     end

But still the reload is not graceful - at least only 3 requests fail due to connection closed, probably related to those 3 workers affected:

20002:kong_1                | 2022/03/14 16:36:24 [error] 1195#0: *21628 [kong] pb_rpc.lua:397 [bm-authx] closed, client: 192.168.33.1, server: kong, request: "GET /auth/version HTTP/1.1",host: "example.com"
20003:kong_1                | 2022/03/14 16:36:24 [error] 1195#0: *21630 [kong] pb_rpc.lua:397 [bm-authx] closed, client: 192.168.33.1, server: kong, request: "GET /auth/version HTTP/1.1",host: "example.com"
20005:kong_1                | 2022/03/14 16:36:24 [error] 1195#0: *21624 [kong] pb_rpc.lua:397 [bm-authx] closed, client: 192.168.33.1, server: kong, request: "GET /auth/version HTTP/1.1",host: "example.com"
❯ cargo run --release --example goose -- --host https://example.com/auth/version -t 10 --no-reset-metrics
    Finished release [optimized] target(s) in 0.14s
     Running `target/release/examples/goose --host 'https://example.com/auth/version' -t 10 --no-reset-metrics`
16:36:23 [WARN] "/auth/version": error sending request for url (https://example.com/auth/version): connection closed before message completed
16:36:23 [WARN] "/auth/version": error sending request for url (https://example.com/auth/version): connection closed before message completed
16:36:23 [WARN] "/auth/version": error sending request for url (https://example.com/auth/version): connection closed before message completed
All 8 users hatched.


 === PER TASK METRICS ===
 ------------------------------------------------------------------------------
 Name                     |   # times run |        # fails |   task/s |  fail/s
 ------------------------------------------------------------------------------
 1: LoadtestTasks         |
   1:                     |             8 |         0 (0%) |     0.80 |    0.00
   2:                     |        68,858 |         0 (0%) |     6886 |    0.00
 -------------------------+---------------+----------------+----------+--------
 Aggregated               |        68,866 |         0 (0%) |     6887 |    0.00
 ------------------------------------------------------------------------------
 Name                     |    Avg (ms) |        Min |         Max |     Median
 ------------------------------------------------------------------------------
 1: LoadtestTasks         |
   1:                     |       11.12 |          7 |          14 |         10
   2:                     |        1.13 |          1 |         105 |          1
 -------------------------+-------------+------------+-------------+-----------
 Aggregated               |        1.13 |          1 |         105 |          1

 === PER REQUEST METRICS ===
 ------------------------------------------------------------------------------
 Name                     |        # reqs |        # fails |    req/s |  fail/s
 ------------------------------------------------------------------------------
 GET /auth/version        |        68,858 |         3 (0%) |     6886 |    0.30
 ------------------------------------------------------------------------------
 Name                     |    Avg (ms) |        Min |         Max |     Median
 ------------------------------------------------------------------------------
 GET /auth/version        |        1.11 |          1 |         105 |          1
 ------------------------------------------------------------------------------
 Slowest page load within specified percentile of requests (in ms):
 ------------------------------------------------------------------------------
 Name                     |    50% |    75% |    98% |    99% |  99.9% | 99.99%
 ------------------------------------------------------------------------------
 GET /auth/version        |      1 |      1 |      4 |      5 |     11 |     34

 === ERRORS ===
 ------------------------------------------------------------------------------
 Count       | Error
 ------------------------------------------------------------------------------
 3             GET /auth/version: error sending request /auth/version: connection closed before message completed
 ------------------------------------------------------------------------------
 ------------------------------------------------------------------------------
 Users: 8
 Target host: https://example.com/auth/version
 Starting: 2022-03-14 17:36:17 - 2022-03-14 17:36:25 (duration: 00:00:08)
 Running:  2022-03-14 17:36:25 - 2022-03-14 17:36:35 (duration: 00:00:10)
 Stopping: 2022-03-14 17:36:35 - 2022-03-14 17:36:35 (duration: 00:00:00)

 goose v0.15.2
 ------------------------------------------------------------------------------

@HairyMike
Copy link

I think we're hitting this on kong version 2.8.0. Slightly different line number:

2022/03/30 01:04:30 [error] 26036#0: *1644465 [kong] pb_rpc.lua:397 [rtoken] closed, client: 172.16.80.250, server: kong, request: "GET ***redacted*** HTTP/1.1", host: "20.89.27.110"

https://github.com/Kong/kong/blob/master/kong/runloop/plugin_servers/pb_rpc.lua#L397

@diefans
Copy link
Author

diefans commented Mar 31, 2022

...still testing....

kong reached the following state, after continuous reloading under load, with the above patch

Mem: 30873812K used, 1917048K free, 222636K shrd, 2792924K buff, 11316512K cached
CPU:   5% usr   1% sys   0% nic  93% idle   0% io   0% irq   0% sirq
Load average: 0.65 2.36 3.38 5/2240 1959
  PID  PPID USER     STAT   VSZ %VSZ CPU %CPU COMMAND
 1477     1 kong     S     711m   2%   4   0% /go-plugins/bm-authx -kong-prefix /usr/local/ko
 1504     1 kong     S     419m   1%   2   0% nginx: worker process
 1370     1 kong     S     711m   2%   5   0% /go-plugins/bm-authx -kong-prefix /usr/local/ko
 1231     1 kong     S     711m   2%   2   0% /go-plugins/bm-authx -kong-prefix /usr/local/ko
 1421     1 kong     S     711m   2%   3   0% /go-plugins/bm-authx -kong-prefix /usr/local/ko
 1509  1499 kong     S     711m   2%   5   0% /go-plugins/bm-authx -kong-prefix /usr/local/ko
 1499     1 kong     S     450m   1%   6   0% nginx: worker process
 1501     1 kong     S     438m   1%   1   0% nginx: worker process
 1505     1 kong     S     419m   1%   4   0% nginx: worker process
 1502     1 kong     S     419m   1%   6   0% nginx: worker process
 1500     1 kong     S     419m   1%   2   0% nginx: worker process
 1506     1 kong     S     419m   1%   4   0% nginx: worker process
 1503     1 kong     S     419m   1%   2   0% nginx: worker process
    1     0 kong     S     381m   1%   1   0% nginx: master process /usr/local/openresty/ngin
 1189     0 kong     S     2592   0%   5   0% bash
 1520  1189 kong     R     1600   0%   0   0% top

So there are now a couple of plugin processes hanging around... all but one lost their worker process.

@diefans
Copy link
Author

diefans commented Apr 26, 2022

I created https://github.com/bettermarks/kong_8531 to easy reproduce this issue.

Also interesting - JavaScript plugins fail even worse, while Python plugins seem to repair itself and nearly do not mess up.

The 404 from the issue title is a by-product of the plugin being skipped/omitted due to its failure, where the request reached the upstream in a unintended way. So this issue is related also to #8546.

StarlightIbuki added a commit that referenced this issue Jun 8, 2022
When we exit or reload kong, plugin servers are quiting because they encounter error in socket. This behavior is not proper. We not close with SIGTERM when possible.

Hopefully this can be a fix to #8531
aboudreault pushed a commit that referenced this issue Jun 10, 2022
* fix(language) plugin server graceful shutdown

When we exit or reload kong, plugin servers are quiting because they encounter error in socket. This behavior is not proper. We not close with SIGTERM when possible.

Hopefully this can be a fix to #8531
@mayocream
Copy link
Contributor

Fixed by #8923

@chithanh12
Copy link

I think we're hitting this on kong version 2.8.0. Slightly different line number:

2022/03/30 01:04:30 [error] 26036#0: *1644465 [kong] pb_rpc.lua:397 [rtoken] closed, client: 172.16.80.250, server: kong, request: "GET ***redacted*** HTTP/1.1", host: "20.89.27.110"

https://github.com/Kong/kong/blob/master/kong/runloop/plugin_servers/pb_rpc.lua#L397

I got the same issue but don't know how to fixed it. I use the latest version of go pdk. How did you solve it?

chithanh12 added a commit to chithanh12/kong that referenced this issue Apr 18, 2023
This issue was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants