Plugin and timer performance on Kong 3.x #9964

Closed · 1 task done
dcarley opened this issue Dec 14, 2022 · 13 comments
Labels: core/performance, stale/pending revisit (Too old ticket. Closed, but we may revisit later.)

Comments

dcarley (Contributor) commented Dec 14, 2022

Is there an existing issue for this?

  • I have searched the existing issues

Kong version ($ kong version)

3.0.1 (also tested against 3.1.0)

Current Behavior

(sorry, this is long)

We have a set of integration tests that run against our installation of Kong. Our rate-limiting tests have been failing intermittently, with more requests allowed than there should be, when testing an upgrade from Kong 2.8.3 to Kong 3.x (initially 3.0.0, but we've more recently tried 3.0.1 and 3.1.0). When this happens it appears to take a while to recover, and sometimes it doesn't recover at all. That's not a terribly useful report without a complete reproduction of how we deploy and test Kong, though, so I've attempted to narrow it down.

One way I've been able to reproduce the same symptoms is by configuring the Zipkin plugin (which we normally use) with an endpoint that times out (e.g. an unreachable IP address). When generating requests against 3.x, this immediately causes more requests to be allowed than there should be. Increasing the request rate eventually causes other plugins and operations that depend on timers to fail as well and not recover:

api-gateway-v1-kong-5cd66f4b67-m5ql4 proxy 2022/12/02 12:03:18 [error] 1109#0: *17773 [kong] handler.lua:106 zipkin reporter flush failed to request: timeout, context: ngx.timer
api-gateway-v1-kong-5cd66f4b67-m5ql4 proxy 2022/12/02 12:03:19 [error] 1109#0: *16756 [kong] handler.lua:106 zipkin reporter flush failed to request: timeout, context: ngx.timer
api-gateway-v1-kong-5cd66f4b67-m5ql4 proxy 2022/12/02 12:03:19 [error] 1109#0: *18694 [kong] handler.lua:106 zipkin reporter flush failed to request: timeout, context: ngx.timer
api-gateway-v1-kong-5cd66f4b67-m5ql4 proxy 2022/12/02 12:03:19 [error] 1109#0: *18691 [kong] handler.lua:106 zipkin reporter flush failed to request: timeout, context: ngx.timer
api-gateway-v1-kong-5cd66f4b67-m5ql4 proxy 2022/12/02 12:03:19 [warn] 1110#0: *151 [lua] super.lua:120: scaling_execute(): [timer-ng] overload: load_avg: 10.239453125, runable_jobs_avg: 2621.3, alive_threads_avg: 256, context: ngx.timer
api-gateway-v1-kong-5cd66f4b67-m5ql4 proxy 2022/12/02 12:03:31 [warn] 1110#0: *151 [lua] super.lua:120: scaling_execute(): [timer-ng] overload: load_avg: 10.26875, runable_jobs_avg: 2628.8, alive_threads_avg: 256, context: ngx.timer
api-gateway-v1-kong-5cd66f4b67-m5ql4 proxy 2022/12/02 12:03:44 [warn] 1110#0: *151 [lua] super.lua:120: scaling_execute(): [timer-ng] overload: load_avg: 10.294921875, runable_jobs_avg: 2635.5, alive_threads_avg: 256, context: ngx.timer
api-gateway-v1-kong-5cd66f4b67-m5ql4 proxy 2022/12/02 12:03:57 [warn] 1110#0: *151 [lua] super.lua:120: scaling_execute(): [timer-ng] overload: load_avg: 10.324609375, runable_jobs_avg: 2643.1, alive_threads_avg: 256, context: ngx.timer
api-gateway-v1-kong-5cd66f4b67-m5ql4 proxy 2022/12/02 12:03:57 [error] 1110#0: *16493 [kong] handler.lua:74 failed to create Statsd logger: failed to connect to metrics.kube-system.svc.cluster.local.:8125: [cosocket] DNS resolution failed: dns lookup pool exceeded retries (1): timeout. Tried: ["(short)metrics.kube-system.svc.cluster.local.:(na) - cache-miss","metrics.kube-system.svc.cluster.local.:33 - cache-miss/scheduled/try 1 error: timeout/scheduled/try 2 error: timeout/dns lookup pool exceeded retries (1): timeout","metrics.kube-system.svc.cluster.local.:1 - cache-miss/scheduled/try 1 error: timeout/scheduled/try 2 error: timeout/dns lookup pool exceeded retries (1): timeout","metrics.kube-system.svc.cluster.local.:5 - cache-miss/scheduled/try 1 error: timeout/scheduled/try 2 error: timeout/dns lookup pool exceeded retries (1): timeout"], context: ngx.timer
api-gateway-v1-kong-5cd66f4b67-m5ql4 proxy 2022/12/02 12:03:57 [error] 1110#0: *16494 [kong] handler.lua:74 failed to create Statsd logger: failed to connect to metrics.kube-system.svc.cluster.local.:8125: [cosocket] DNS resolution failed: dns lookup pool exceeded retries (1): timeout. Tried: ["(short)metrics.kube-system.svc.cluster.local.:(na) - cache-miss","metrics.kube-system.svc.cluster.local.:33 - cache-miss/in progress (sync)/try 1 error: timeout/in progress (sync)/try 2 error: timeout/dns lookup pool exceeded retries (1): timeout","metrics.kube-system.svc.cluster.local.:1 - cache-miss/in progress (sync)/try 1 error: timeout/in progress (sync)/try 2 error: timeout/dns lookup pool exceeded retries (1): timeout","metrics.kube-system.svc.cluster.local.:5 - cache-miss/in progress (sync)/try 1 error: timeout/in progress (sync)/try 2 error: timeout/dns lookup pool exceeded retries (1): timeout"], context: ngx.timer
api-gateway-v1-kong-5cd66f4b67-m5ql4 proxy 2022/12/02 12:03:57 [error] 1110#0: *16495 [kong] handler.lua:74 failed to create Statsd logger: failed to connect to metrics.kube-system.svc.cluster.local.:8125: [cosocket] DNS resolution failed: dns lookup pool exceeded retries (1): timeout. Tried: ["(short)metrics.kube-system.svc.cluster.local.:(na) - cache-miss","metrics.kube-system.svc.cluster.local.:33 - cache-miss/in progress (sync)/try 1 error: timeout/in progress (sync)/try 2 error: timeout/dns lookup pool exceeded retries (1): timeout","metrics.kube-system.svc.cluster.local.:1 - cache-miss/in progress (sync)/try 1 error: timeout/in progress (sync)/try 2 error: timeout/dns lookup pool exceeded retries (1): timeout","metrics.kube-system.svc.cluster.local.:5 - cache-miss/in progress (sync)/try 1 error: timeout/in progress (sync)/try 2 error: timeout/dns lookup pool exceeded retries (1): timeout"], context: ngx.timer
…
api-gateway-v1-kong-5cd66f4b67-m5ql4 proxy 2022/12/02 12:03:57 [error] 1110#0: *16501 [kong] init.lua:75 failed to connect to Redis: [cosocket] DNS resolution failed: dns lookup pool exceeded retries (1): timeout. Tried: ["(short)redis.kong.svc.cluster.local.:(na) - cache-miss","redis.kong.svc.cluster.local.:33 - cache-miss/scheduled/try 1 error: timeout/scheduled/try 2 error: timeout/dns lookup pool exceeded retries (1): timeout","redis.kong.svc.cluster.local.:1 - cache-miss/scheduled/try 1 error: timeout/scheduled/try 2 error: timeout/dns lookup pool exceeded retries (1): timeout","redis.kong.svc.cluster.local.:5 - cache-miss/scheduled/try 1 error: timeout/scheduled/try 2 error: timeout/dns lookup pool exceeded retries (1): timeout"], context: ngx.timer
…
api-gateway-v1-kong-5cd66f4b67-m5ql4 proxy 2022/12/02 12:03:57 [error] 1110#0: *16505 [kong] init.lua:75 failed to connect to Redis: [cosocket] DNS resolution failed: dns lookup pool exceeded retries (1): timeout. Tried: ["(short)redis.kong.svc.cluster.local.:(na) - cache-miss","redis.kong.svc.cluster.local.:33 - cache-miss/in progress (sync)/try 1 error: timeout/in progress (sync)/try 2 error: timeout/dns lookup pool exceeded retries (1): timeout","redis.kong.svc.cluster.local.:1 - cache-miss/in progress (sync)/try 1 error: timeout/in progress (sync)/try 2 error: timeout/dns lookup pool exceeded retries (1): timeout","redis.kong.svc.cluster.local.:5 - cache-miss/in progress (sync)/try 1 error: timeout/in progress (sync)/try 2 error: timeout/dns lookup pool exceeded retries (1): timeout"], context: ngx.timer
api-gateway-v1-kong-5cd66f4b67-m5ql4 proxy 2022/12/02 12:03:57 [error] 1110#0: *16506 [kong] init.lua:75 failed to connect to Redis: [cosocket] DNS resolution failed: dns lookup pool exceeded retries (1): timeout. Tried: ["(short)redis.kong.svc.cluster.local.:(na) - cache-miss","redis.kong.svc.cluster.local.:33 - cache-miss/in progress (sync)/try 1 error: timeout/in progress (sync)/try 2 error: timeout/dns lookup pool exceeded retries (1): timeout","redis.kong.svc.cluster.local.:1 - cache-miss/in progress (sync)/try 1 error: timeout/in progress (sync)/try 2 error: timeout/dns lookup pool exceeded retries (1): timeout","redis.kong.svc.cluster.local.:5 - cache-miss/in progress (sync)/try 1 error: timeout/in progress (sync)/try 2 error: timeout/dns lookup pool exceeded retries (1): timeout"], context: ngx.timer

The significant difference between 2.x and 3.x is that 3.x appears to fail at lower request rates, earlier on, and doesn't recover. I think rate limiting acts as an "early warning" of this because delays in running timers mean that usage counters aren't incremented quickly enough within each one-second bucket (see the sketch below).
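To illustrate what I mean, here's a rough sketch of counting via a detached timer. This is hypothetical, not Kong's actual rate-limiting code, but it shows why a backed-up timer queue makes per-second counters lag:

-- hypothetical sketch, not Kong's rate-limiting implementation
local redis = require "resty.redis"

local function increment(premature, key)
  if premature then return end
  local red = redis:new()
  red:set_timeout(1000) -- 1s
  local ok, err = red:connect("redis", 6379)
  if not ok then
    ngx.log(ngx.ERR, "failed to connect to Redis: ", err)
    return
  end
  red:incr(key)
  red:expire(key, 2)
  red:set_keepalive(10000, 100)
end

-- access phase: the bucket key is computed now, but the INCR runs whenever
-- the timer system gets around to it; under timer-ng overload that can be
-- seconds later, so the current bucket under-counts and the limit fails open
local key = "ratelimit:" .. ngx.var.remote_addr .. ":" .. math.floor(ngx.now())
local ok, err = ngx.timer.at(0, increment, key)
if not ok then
  ngx.log(ngx.ERR, "failed to create timer: ", err)
end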

Expected Behavior

The performance of one plugin and its network dependencies shouldn't adversely affect the performance of other plugins and operations.

I expect that changes like #9538 (rate-limiting) and #9521 (Datadog) will alleviate some of the symptoms. I wasn't able to reproduce the same problem with the OpenTelemetry plugin, which already batches submissions and sends them at intervals. I suspect that applying the same changes to the Zipkin plugin would help; a rough sketch of that pattern follows.
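For reference, the batch-and-interval pattern I mean looks roughly like this (a sketch under my own assumptions, not the actual code from #9538 or #9521): buffer spans per worker and flush them from a single recurring timer, instead of creating one timer per request:

-- sketch of a batch-and-interval reporter (hypothetical, not the PRs' code)
local buffer = {}

local function flush(premature)
  if premature then return end
  if #buffer == 0 then return end
  local batch = buffer
  buffer = {}
  -- send `batch` to the collector over HTTP here: a slow or failed send
  -- delays only this one flush instead of piling up a pending timer per request
end

-- one recurring timer per worker instead of ngx.timer.at(0, ...) per request
local ok, err = ngx.timer.every(1, flush)
if not ok then
  ngx.log(ngx.ERR, "failed to create flush timer: ", err)
end

-- per request: just append to the buffer
table.insert(buffer, { trace_id = "...", duration_ms = 42 })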

However, I think there's also an underlying problem where timer performance appears to be worse under certain circumstances in 3.x, and these kinds of issues are likely to recur without some safety guards. That's what I'm most interested in focusing on.

Steps To Reproduce

This is the simplest way I could reproduce the symptom. I wasn't able to push it hard enough in Gojira to cause timer errors, as we can in Minikube, because Docker appeared to fall over first.

Install:

Create egg.yaml:

version: '3.5'
services:
  kong:
    ports:
      - 8000:8000
      - 8001:8001
  httpbin:
    image: "kong/httpbin:latest"
    labels:
      com.konghq.gojira: "True"
    networks:
    - gojira

Create kong.yaml:

_format_version: "2.1"
_transform: true

plugins:
  - name: zipkin
    config:
      http_endpoint: http://172.18.0.254/will/timeout
      sample_ratio: 1
  - name: rate-limiting
    config:
      second: 50
      limit_by: ip
      policy: redis
      redis_host: redis

services:
- name: httpbin
  url: http://httpbin:80
  routes:
  - name: httpbin
    paths:
    - /

Start the container:

gojira up --tag 3.0.1
gojira run --tag 3.0.1 make dev
gojira snapshot --tag 3.0.1
gojira up --tag 3.0.1 --egg egg.yaml --yml kong.yaml

Run a load test:

echo "GET http://127.0.0.1:8000/status/200" | vegeta attack -rate 200 -duration 2m | tee vegeta.out | vegeta encode | ~/go/bin/jaggr '@count=rps' 'hist[100,200,300,400,500]:code' | tee jaggr.out

Check the results, ideally:

  • jaggr.out should show roughly 50 responses with 200 statuses each second
  • vegeta report vegeta.out should show a success rate of 25% (50/200)

Anything else?

I've put the results of my tests in a Gist so as not to clutter the issue: https://gist.github.com/dcarley/cc3c8959fd8f6a811d0b3c0ddf458a5c

flrgh (Contributor) commented Dec 14, 2022

@ADD-SP what do ya think?

guanlan assigned guanlan and ADD-SP, and unassigned guanlan, Dec 28, 2022
ADD-SP (Contributor) commented Dec 28, 2022

@flrgh We've run into this type of issue before; it's essentially timer abuse, and I think we should fix it the way #9521 did.

kikito (Member) commented Jan 3, 2023

While we will investigate this issue and put some limits on the amount of memory and the number of timers that zipkin consumes, I must point out that this setting is not realistic for production environments:

sample_ratio: 1

The intended usage for the Zipkin plugin is that tracing is done on a sample of all requests; that is why the default value for sample_ratio is 0.001, meaning 1 out of every 1000 requests produces traces (illustrated below). Setting it to 1 is akin to setting the log level to debug: something you would do on a production system only temporarily, in an emergency, to debug a critical failure.
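Conceptually, the sampling decision amounts to this (an illustrative sketch, not the plugin's actual code):

-- illustrative only, not the Zipkin plugin's implementation:
-- with sample_ratio = 0.001, roughly 1 in 1000 requests is traced;
-- with sample_ratio = 1, every request produces and reports spans
local sample_ratio = 0.001

local function should_sample()
  return math.random() < sample_ratio
end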

I will make sure that we point this out in our docs.

dcarley (Contributor, Author) commented Jan 3, 2023

We use a sampling rate of 100% because we do dynamic sampling in Refinery. We've been using this setup with 2.x in production mostly without problems since we added timeouts in #8735. We're planning to migrate to the OpenTelemetry plugin, with the same sample rate, once we've validated that the upgrade path to 3.x is safe. Which is why I'm most interested in:

However I think there's also an underlying problem where timer performance appears to be worse under certain circumstances in 3.x and these kinds of issues are likely to reoccur without some safety guards. Which is what I'm most interested in focusing on.

I'm not able to view the issues FTI-4367 and FTI-4269 that were referenced in #9521. Are you able to provide some more information about what originally prompted that change?

Is there anything else that I can do to help?

ADD-SP (Contributor) commented Jan 4, 2023

@dcarley I'm curious about the cases in which timer performance becomes bad. In 3.x we introduced lua-resty-timer-ng to reduce the overhead of timers, and it shouldn't have gotten any worse.

chronolaw added the pending author feedback label, Jan 5, 2023
stale bot commented Jan 20, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the stale label, Jan 20, 2023
dcarley (Contributor, Author) commented Jan 20, 2023

@dcarley I'm curious about the cases in which timer performance becomes bad. In 3.x we introduced lua-resty-timer-ng to reduce the overhead of timers, and it shouldn't have gotten any worse.

We've observed that 3.x performance is worse under load when a plugin makes calls to an external network dependency and those calls take longer than normal.
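Based on the timer-ng overload warnings in the logs above (alive_threads_avg: 256), my rough mental model is the following. It's a hypothetical sketch, assuming timer jobs run on a fixed pool of worker threads rather than one nginx timer each, not Kong's actual code:

-- hypothetical: a timer job that blocks on a slow network call holds one
-- timer-ng worker thread for the full timeout, so a few hundred slow Zipkin
-- flushes starve every other timer (DNS, Redis, statsd, ...)
local http = require "resty.http"

local function slow_flush(premature, endpoint)
  if premature then return end
  local client = http.new()
  client:set_timeout(10000) -- this worker thread is now tied up for up to 10s
  local res, err = client:request_uri(endpoint, { method = "POST", body = "[]" })
  if not res then
    ngx.log(ngx.ERR, "flush failed: ", err)
  end
end

ngx.timer.at(0, slow_flush, "http://172.18.0.254/will/timeout")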

Are you able to provide any information from FTI-4367, FTI-4269, and FT-3464? Were those issues prompted by timer performance?

stale bot commented Feb 4, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the stale label, Feb 4, 2023
ADD-SP removed the pending author feedback label, Feb 5, 2023
stale bot removed the stale label, Feb 5, 2023
surenraju-careem commented:

Another issue related to timer abuse: #9959

xiaotuzididi commented:

Good, thank you.

ADD-SP (Contributor) commented Mar 28, 2023

Hi @dcarley, could you please try 3.2.2 to check whether this issue still exists?

ADD-SP added the pending author feedback label, Mar 28, 2023
dcarley (Contributor, Author) commented Apr 3, 2023

Hi @dcarley, could you please try 3.2.2 to check whether this issue still exists?

3.2.2 appears to be worse than 3.0.1 for the reproduction case where rate-limiting "fails open" while the Zipkin endpoint is timing out. I've updated the results in the gist: https://gist.github.com/dcarley/cc3c8959fd8f6a811d0b3c0ddf458a5c#gojira-322

ADD-SP removed the pending author feedback label, Apr 3, 2023
StarlightIbuki (Contributor) commented:

Dear contributor,
We're closing this issue as there hasn't been any update to it for a long time. If the issue is still relevant in the latest version, please feel free to reopen it. We're more than happy to revisit it again. Your contribution is greatly appreciated!
Please have a look at our pledge to the community for more information.
Sincerely,
Kong Gateway Team

StarlightIbuki added the stale/pending revisit label, Oct 11, 2023