
Kong 2.3.0, cpu 100%, nginx no response #6739

Closed
RyouZhang opened this issue Jan 15, 2021 · 11 comments · Fixed by #6760

Comments

@RyouZhang

Summary

SUMMARY_GOES_HERE

Steps To Reproduce

Additional Details & Logs

@bungle
Member

bungle commented Jan 15, 2021

@RyouZhang, please give us more information. Did you do an in-place update? If so, you might have conflicting rockspecs installed.
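
A quick way to check for that (a sketch; run wherever Kong is installed, e.g. inside the container):

# List the installed versions of the DNS client rock; more than one version
# listed would point to a conflicting install
luarocks list lua-resty-dns-client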

@bungle bungle added the pending author feedback Waiting for the issue author to get back to a maintainer with findings, more details, etc... label Jan 15, 2021
@RyouZhang
Author

RyouZhang commented Jan 15, 2021

We deploy with Docker. Yesterday we upgraded to Kong 2.3.0 and it worked fine at first, but then Kong's CPU suddenly went to 100% and it stopped responding.

After a docker restart it ran for about 24 hours, then hit 100% CPU and stopped responding again; it's very interesting.

I ran a test with two nodes: one Kong instance receives normal traffic, the other only serves metrics; after 24 hours both were at 100% CPU and not responding, so I suspect the metrics path may be involved.

After another 24 hours the issue reappeared; after a kong reload it responds again, but CPU stays at 100% and memory keeps leaking.

Proc info
/proc/{pid}/wchan = 0
/proc/{pid}/syscall = running
/proc/{pid}/stack = [<0000000000000000>] 0xffffffffffffffff
/proc/{pid}/stat = 24 (nginx) R 1 1 1 0 -1 4202816 2247197 0 0 0 85695 7564 0 0 20 0 1 0 3437536495 657178624 38614 18446744073709551615 94407332544512 94407334300730 140726817353536 140726817351104 139739583651365 0 65536 1073745920 402745863 0 0 0 17 16 0 0 0 0 0 94407334675680 94407334830717 94407338496000 140726817355784 140726817355855 140726817355855 140726817357778 0

strace -d output:
strace: new tcb for pid 39217, active tcbs:1
strace: ptrace_setoptions = 0x11
strace: attach to pid 39217 (main) succeeded
strace: Process 39217 attached
strace: [wait(0x80057f) = 39217] WIFSTOPPED,sig=SIGTRAP,EVENT_STOP (128)
strace: pid 39217 has TCB_STARTUP, initializing it
strace: pid 39217 invalid syscall -1
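
For reference, a sketch of how these per-process diagnostics can be collected for a spinning worker (replace 39217 with the PID of the busy nginx worker, e.g. taken from top):

# Kernel-side view of what the process is doing
cat /proc/39217/wchan; echo
cat /proc/39217/syscall
cat /proc/39217/stack
cat /proc/39217/stat

# Attach strace with its own debug output enabled
strace -d -p 39217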

@ghost

ghost commented Jan 19, 2021

Having the same issue. Reproduction steps below:


Install gojira

Follow the below execution flow:

export GOJIRA_IMAGE=kong:2.3.0
gojira hybrid up

# Start the CP
gojira hybrid run@kong-cp kong migrations bootstrap
gojira hybrid run@kong-cp kong start

# Start the DP
gojira hybrid run@kong-dp kong start

# Install deck on the CP and sync the kong.yaml below
gojira hybrid shell@kong-cp
kong-cp:/# ./install_deck.sh
kong-cp:/# deck sync -s kong.yaml

# Start monitoring the CPU of both the CP and the DP (see the sketch after this block).
# Wait a few minutes. I had to wait 1 to 2 minutes.
# Now all workers from both the DP and the CP are each consuming 50% of CPU.
# This issue does not happen in traditional mode deployment. Also does not happen on 2.2.0 hybrid mode following the same steps.
# Proxy requests just hang and time out after 30 seconds. Nothing in the notice error logs (attached).
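
To make the CPU-monitoring step concrete, a small sketch (the exact container names depend on what gojira created; adjust as needed):

# On the host: per-container CPU usage for the CP and DP containers
docker stats --no-stream

# Inside a node (e.g. via gojira hybrid shell@kong-cp): per-worker CPU usage
ps -eo pid,pcpu,args | grep '[n]ginx: worker'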


kong.yaml:

_format_version: "1.1"
services:
- connect_timeout: 60000
  host: testUps6066_01
  name: httpbin
  port: 80
  protocol: http
  read_timeout: 60000
  retries: 5
  write_timeout: 60000
  routes:
  - id: c62fbb13-d29a-4bac-9c2a-92610ac7b6e6
    paths:
    - /test
    path_handling: v0
    preserve_host: false
    protocols:
    - http
    regex_priority: 0
    strip_path: true
    https_redirect_status_code: 426
upstreams:
- name: testUps6066_01
  algorithm: round-robin
  slots: 10000
  healthchecks:
    active:
      concurrency: 10
      healthy:
        http_statuses:
        - 200
        - 302
        interval: 1
        successes: 1
      http_path: /status/404
      https_verify_certificate: true
      type: http
      timeout: 1
      unhealthy:
        http_failures: 1
        http_statuses:
        - 429
        - 404
        - 500
        - 501
        - 502
        - 503
        - 504
        - 505
        tcp_failures: 1
        timeouts: 1
        interval: 1
    passive:
      healthy:
        http_statuses:
        - 200
        - 201
        - 202
        - 203
        - 204
        - 205
        - 206
        - 207
        - 208
        - 226
        - 300
        - 301
        - 302
        - 303
        - 304
        - 305
        - 306
        - 307
        - 308
        successes: 0
      type: http
      unhealthy:
        http_failures: 0
        http_statuses:
        - 429
        - 500
        - 503
        tcp_failures: 0
        timeouts: 0
    threshold: 0
  hash_on: none
  hash_fallback: none
  hash_on_cookie_path: /
  targets:
  - target: httpbin.org:80
    weight: 100
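
For completeness, a sketch of confirming on the CP that the upstream and its health checks were created after the deck sync (the default Admin API port 8001 is assumed):

# From inside the CP node: inspect the upstream and its target health
curl -s http://localhost:8001/upstreams/testUps6066_01
curl -s http://localhost:8001/upstreams/testUps6066_01/health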

error logs:

cp_logs.txt
dp_logs.txt

@ghost

ghost commented Jan 19, 2021

Both DP and CP logs at debug level start filling up extremely quickly with:

2021/01/19 14:39:24 [debug] 70#0: *733 [lua] base.lua:493: f(): [upstream:testUps6066_01 2] no dns changes detected for httpbin.org
2021/01/19 14:39:25 [debug] 72#0: *803 [lua] base.lua:448: [upstream:testUps6066_01 2] executing requery for: httpbin.org
2021/01/19 14:39:25 [debug] 71#0: *805 [lua] base.lua:448: [upstream:testUps6066_01 2] executing requery for: httpbin.org
2021/01/19 14:39:24 [debug] 70#0: *733 [lua] base.lua:448: [upstream:testUps6066_01 2] executing requery for: httpbin.org
2021/01/19 14:39:25 [debug] 72#0: *803 [lua] base.lua:640: queryDns(): [upstream:testUps6066_01 2] querying dns for httpbin.org
2021/01/19 14:39:25 [debug] 71#0: *805 [lua] base.lua:640: queryDns(): [upstream:testUps6066_01 2] querying dns for httpbin.org
2021/01/19 14:39:24 [debug] 70#0: *733 [lua] base.lua:640: queryDns(): [upstream:testUps6066_01 2] querying dns for httpbin.org
2021/01/19 14:39:25 [debug] 72#0: *803 [lua] base.lua:493: f(): [upstream:testUps6066_01 2] no dns changes detected for httpbin.org
2021/01/19 14:39:25 [debug] 71#0: *805 [lua] base.lua:493: f(): [upstream:testUps6066_01 2] no dns changes detected for httpbin.org
2021/01/19 14:39:25 [debug] 72#0: *803 [lua] base.lua:448: [upstream:testUps6066_01 2] executing requery for: httpbin.org
2021/01/19 14:39:24 [debug] 70#0: *733 [lua] base.lua:493: f(): [upstream:testUps6066_01 2] no dns changes detected for httpbin.org
2021/01/19 14:39:25 [debug] 71#0: *805 [lua] base.lua:448: [upstream:testUps6066_01 1] executing requery for: httpbin.org
2021/01/19 14:39:25 [debug] 72#0: *803 [lua] base.lua:640: queryDns(): [upstream:testUps6066_01 2] querying dns for httpbin.org
2021/01/19 14:39:24 [debug] 70#0: *733 [lua] base.lua:448: [upstream:testUps6066_01 2] executing requery for: httpbin.org
2021/01/19 14:39:25 [debug] 71#0: *805 [lua] base.lua:640: queryDns(): [upstream:testUps6066_01 1] querying dns for httpbin.org

To the point that I can't even attach the debug-level error log from either the DP or the CP, because of how big they get after just a few seconds.
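
A rough way to quantify the flood from the debug log (a sketch; the path assumes Kong's default prefix of /usr/local/kong):

# Count requery events per second in the debug-level error log
grep 'executing requery for' /usr/local/kong/logs/error.log \
  | awk '{print $1, $2}' | sort | uniq -c | tail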

@bungle bungle added task/bug and removed pending author feedback Waiting for the issue author to get back to a maintainer with findings, more details, etc... labels Jan 19, 2021
@bungle
Member

bungle commented Jan 19, 2021

@murillopaula thank you for the pointers. It looks like a bug in health checks. @RyouZhang, I will look into what is happening and report back. It seems to be health-check related and happens mostly in hybrid mode. @RyouZhang, which mode are you running?

@RyouZhang
Author

RyouZhang commented Jan 19, 2021

I'll try to get a debug log tomorrow.
Today I got a new case: using the kong:2.3.0-ubuntu image with 4 workers, after 24 hours 3 workers were at 100% CPU while 1 worker was still fine.

Memory monitor:
微信图片_20210120003535
微信图片_20210120003543

@ghost

ghost commented Jan 19, 2021

@RyouZhang do you use hybrid mode? Do you use upstreams/targets with health checks (passive/active)?

@RyouZhang
Author

RyouZhang commented Jan 20, 2021

OK, here are my kong.yaml and docker command:
Archive.zip
CPU usage over 2 days:
91b96aaf112458679cc527602649ec27

@ghost

ghost commented Jan 20, 2021

@RyouZhang ok, it seems like you are also using health checks. We've found a temporary workaround to mitigate this issue:

luarocks remove lua-resty-dns-client --force
luarocks install lua-resty-dns-client 5.1.1 --force

i.e., downgrade lua-resty-dns-client to 5.1.1.

Can you let us know if that works for you? Note this is a temporary hack and should not be considered a definitive solution.
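
For a Docker-based deployment like the one reported above, a sketch of applying this workaround inside a running container (the container name "kong" is an assumption; the change does not survive recreating the container):

# Downgrade the rock inside the running container, then reload Kong
docker exec -u root kong luarocks remove lua-resty-dns-client --force
docker exec -u root kong luarocks install lua-resty-dns-client 5.1.1 --force
docker exec kong kong reload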

@RyouZhang
Author

With lua-resty-dns-client 5.1.1 it has been working for 24 hours; it's fine.

bungle added a commit that referenced this issue Jan 21, 2021
### Summary

@RyouZhang reported with #6739 that Kong starts to spin CPUs at 100% with `2.3.0` when healthchecks are
enabled. @murillopaula confirmed it and gave additional information, and we could reproduce it easily.

We identified an issue in the lua-resty-dns-client library together with @Tieske and @locao and fixed it.

This PR bumps the Kong dependency to the fixed version of the library.

### Issues Resolved

Fix #6739
@bungle
Member

bungle commented Jan 21, 2021

@RyouZhang we released lua-resty-dns-client 5.2.1, which also fixes the issue, and it will be included in Kong 2.3.1, which we are releasing within the next few days. Thank you for your cooperation and for letting us know about it!

locao pushed a commit that referenced this issue Jan 21, 2021
### Summary

@RyouZhang reported with #6739 that Kong starts to spin CPUs at 100% with `2.3.0` when healthchecks are
enabled. @murillopaula confirmed it and gave additional information, and we could reproduce it easily.

We identified an issue in the lua-resty-dns-client library together with @Tieske and @locao and fixed it.

This PR bumps the Kong dependency to the fixed version of the library.

### Issues Resolved

Fix #6739