Plugin and timer performance on Kong 3.x #9964

Closed · 1 task done
dcarley opened this issue Dec 14, 2022 · 13 comments
Labels: core/performance, stale/pending revisit (Too old ticket. Closed, but we may revisit later.)

Comments

dcarley (Contributor) commented Dec 14, 2022

Is there an existing issue for this?

  • I have searched the existing issues

Kong version ($ kong version)

3.0.1 (also tested against 3.1.0)

Current Behavior

(sorry, this is long)

We have a set of integration tests that run against our installation of Kong. Our rate-limiting tests have been failing intermittently, with more requests allowed than there should be, when testing an upgrade from Kong 2.8.3 to Kong 3.x (initially 3.0.0, but we've more recently tried 3.0.1 and 3.1.0). When this happens it appears to take a while to recover, and sometimes it doesn't recover at all. That's not a terribly useful report without a complete reproduction of how we deploy and test Kong, though, so I've attempted to narrow it down.

One way I've been able to reproduce the same symptoms is by configuring the Zipkin plugin (which we normally use) with an endpoint that times out (e.g. an unreachable IP address). When generating requests against 3.x, this immediately causes more requests to be allowed than there should be. Increasing the request rate eventually causes other plugins and operations that depend on timers to fail as well and not recover:

api-gateway-v1-kong-5cd66f4b67-m5ql4 proxy 2022/12/02 12:03:18 [error] 1109#0: *17773 [kong] handler.lua:106 zipkin reporter flush failed to request: timeout, context: ngx.timer
api-gateway-v1-kong-5cd66f4b67-m5ql4 proxy 2022/12/02 12:03:19 [error] 1109#0: *16756 [kong] handler.lua:106 zipkin reporter flush failed to request: timeout, context: ngx.timer
api-gateway-v1-kong-5cd66f4b67-m5ql4 proxy 2022/12/02 12:03:19 [error] 1109#0: *18694 [kong] handler.lua:106 zipkin reporter flush failed to request: timeout, context: ngx.timer
api-gateway-v1-kong-5cd66f4b67-m5ql4 proxy 2022/12/02 12:03:19 [error] 1109#0: *18691 [kong] handler.lua:106 zipkin reporter flush failed to request: timeout, context: ngx.timer
api-gateway-v1-kong-5cd66f4b67-m5ql4 proxy 2022/12/02 12:03:19 [warn] 1110#0: *151 [lua] super.lua:120: scaling_execute(): [timer-ng] overload: load_avg: 10.239453125, runable_jobs_avg: 2621.3, alive_threads_avg: 256, context: ngx.timer
api-gateway-v1-kong-5cd66f4b67-m5ql4 proxy 2022/12/02 12:03:31 [warn] 1110#0: *151 [lua] super.lua:120: scaling_execute(): [timer-ng] overload: load_avg: 10.26875, runable_jobs_avg: 2628.8, alive_threads_avg: 256, context: ngx.timer
api-gateway-v1-kong-5cd66f4b67-m5ql4 proxy 2022/12/02 12:03:44 [warn] 1110#0: *151 [lua] super.lua:120: scaling_execute(): [timer-ng] overload: load_avg: 10.294921875, runable_jobs_avg: 2635.5, alive_threads_avg: 256, context: ngx.timer
api-gateway-v1-kong-5cd66f4b67-m5ql4 proxy 2022/12/02 12:03:57 [warn] 1110#0: *151 [lua] super.lua:120: scaling_execute(): [timer-ng] overload: load_avg: 10.324609375, runable_jobs_avg: 2643.1, alive_threads_avg: 256, context: ngx.timer
api-gateway-v1-kong-5cd66f4b67-m5ql4 proxy 2022/12/02 12:03:57 [error] 1110#0: *16493 [kong] handler.lua:74 failed to create Statsd logger: failed to connect to metrics.kube-system.svc.cluster.local.:8125: [cosocket] DNS resolution failed: dns lookup pool exceeded retries (1): timeout. Tried: ["(short)metrics.kube-system.svc.cluster.local.:(na) - cache-miss","metrics.kube-system.svc.cluster.local.:33 - cache-miss/scheduled/try 1 error: timeout/scheduled/try 2 error: timeout/dns lookup pool exceeded retries (1): timeout","metrics.kube-system.svc.cluster.local.:1 - cache-miss/scheduled/try 1 error: timeout/scheduled/try 2 error: timeout/dns lookup pool exceeded retries (1): timeout","metrics.kube-system.svc.cluster.local.:5 - cache-miss/scheduled/try 1 error: timeout/scheduled/try 2 error: timeout/dns lookup pool exceeded retries (1): timeout"], context: ngx.timer
api-gateway-v1-kong-5cd66f4b67-m5ql4 proxy 2022/12/02 12:03:57 [error] 1110#0: *16494 [kong] handler.lua:74 failed to create Statsd logger: failed to connect to metrics.kube-system.svc.cluster.local.:8125: [cosocket] DNS resolution failed: dns lookup pool exceeded retries (1): timeout. Tried: ["(short)metrics.kube-system.svc.cluster.local.:(na) - cache-miss","metrics.kube-system.svc.cluster.local.:33 - cache-miss/in progress (sync)/try 1 error: timeout/in progress (sync)/try 2 error: timeout/dns lookup pool exceeded retries (1): timeout","metrics.kube-system.svc.cluster.local.:1 - cache-miss/in progress (sync)/try 1 error: timeout/in progress (sync)/try 2 error: timeout/dns lookup pool exceeded retries (1): timeout","metrics.kube-system.svc.cluster.local.:5 - cache-miss/in progress (sync)/try 1 error: timeout/in progress (sync)/try 2 error: timeout/dns lookup pool exceeded retries (1): timeout"], context: ngx.timer
api-gateway-v1-kong-5cd66f4b67-m5ql4 proxy 2022/12/02 12:03:57 [error] 1110#0: *16495 [kong] handler.lua:74 failed to create Statsd logger: failed to connect to metrics.kube-system.svc.cluster.local.:8125: [cosocket] DNS resolution failed: dns lookup pool exceeded retries (1): timeout. Tried: ["(short)metrics.kube-system.svc.cluster.local.:(na) - cache-miss","metrics.kube-system.svc.cluster.local.:33 - cache-miss/in progress (sync)/try 1 error: timeout/in progress (sync)/try 2 error: timeout/dns lookup pool exceeded retries (1): timeout","metrics.kube-system.svc.cluster.local.:1 - cache-miss/in progress (sync)/try 1 error: timeout/in progress (sync)/try 2 error: timeout/dns lookup pool exceeded retries (1): timeout","metrics.kube-system.svc.cluster.local.:5 - cache-miss/in progress (sync)/try 1 error: timeout/in progress (sync)/try 2 error: timeout/dns lookup pool exceeded retries (1): timeout"], context: ngx.timer
…
api-gateway-v1-kong-5cd66f4b67-m5ql4 proxy 2022/12/02 12:03:57 [error] 1110#0: *16501 [kong] init.lua:75 failed to connect to Redis: [cosocket] DNS resolution failed: dns lookup pool exceeded retries (1): timeout. Tried: ["(short)redis.kong.svc.cluster.local.:(na) - cache-miss","redis.kong.svc.cluster.local.:33 - cache-miss/scheduled/try 1 error: timeout/scheduled/try 2 error: timeout/dns lookup pool exceeded retries (1): timeout","redis.kong.svc.cluster.local.:1 - cache-miss/scheduled/try 1 error: timeout/scheduled/try 2 error: timeout/dns lookup pool exceeded retries (1): timeout","redis.kong.svc.cluster.local.:5 - cache-miss/scheduled/try 1 error: timeout/scheduled/try 2 error: timeout/dns lookup pool exceeded retries (1): timeout"], context: ngx.timer
…
api-gateway-v1-kong-5cd66f4b67-m5ql4 proxy 2022/12/02 12:03:57 [error] 1110#0: *16505 [kong] init.lua:75 failed to connect to Redis: [cosocket] DNS resolution failed: dns lookup pool exceeded retries (1): timeout. Tried: ["(short)redis.kong.svc.cluster.local.:(na) - cache-miss","redis.kong.svc.cluster.local.:33 - cache-miss/in progress (sync)/try 1 error: timeout/in progress (sync)/try 2 error: timeout/dns lookup pool exceeded retries (1): timeout","redis.kong.svc.cluster.local.:1 - cache-miss/in progress (sync)/try 1 error: timeout/in progress (sync)/try 2 error: timeout/dns lookup pool exceeded retries (1): timeout","redis.kong.svc.cluster.local.:5 - cache-miss/in progress (sync)/try 1 error: timeout/in progress (sync)/try 2 error: timeout/dns lookup pool exceeded retries (1): timeout"], context: ngx.timer
api-gateway-v1-kong-5cd66f4b67-m5ql4 proxy 2022/12/02 12:03:57 [error] 1110#0: *16506 [kong] init.lua:75 failed to connect to Redis: [cosocket] DNS resolution failed: dns lookup pool exceeded retries (1): timeout. Tried: ["(short)redis.kong.svc.cluster.local.:(na) - cache-miss","redis.kong.svc.cluster.local.:33 - cache-miss/in progress (sync)/try 1 error: timeout/in progress (sync)/try 2 error: timeout/dns lookup pool exceeded retries (1): timeout","redis.kong.svc.cluster.local.:1 - cache-miss/in progress (sync)/try 1 error: timeout/in progress (sync)/try 2 error: timeout/dns lookup pool exceeded retries (1): timeout","redis.kong.svc.cluster.local.:5 - cache-miss/in progress (sync)/try 1 error: timeout/in progress (sync)/try 2 error: timeout/dns lookup pool exceeded retries (1): timeout"], context: ngx.timer

The significant difference between 2.x and 3.x is that 3.x appears to fail at lower request rates, earlier on, and doesn't recover. I think rate limiting acts as an "early warning" of this because delays in running timers mean that usage counters aren't incremented quickly enough within each one-second bucket (see the sketch below).
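To illustrate what I mean, here's a rough sketch of counting via a detached timer. This is hypothetical, not Kong's actual rate-limiting code, but it shows why a backed-up timer queue makes per-second counters lag:

-- hypothetical sketch, not Kong's rate-limiting implementation
local redis = require "resty.redis"

local function increment(premature, key)
  if premature then return end
  local red = redis:new()
  red:set_timeout(1000) -- 1s
  local ok, err = red:connect("redis", 6379)
  if not ok then
    ngx.log(ngx.ERR, "failed to connect to Redis: ", err)
    return
  end
  red:incr(key)
  red:expire(key, 2)
  red:set_keepalive(10000, 100)
end

-- access phase: the bucket key is computed now, but the INCR runs whenever
-- the timer system gets around to it; under timer-ng overload that can be
-- seconds later, so the current bucket under-counts and the limit fails open
local key = "ratelimit:" .. ngx.var.remote_addr .. ":" .. math.floor(ngx.now())
local ok, err = ngx.timer.at(0, increment, key)
if not ok then
  ngx.log(ngx.ERR, "failed to create timer: ", err)
end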

Expected Behavior

The performance of one plugin and its network dependencies shouldn't adversely affect the performance of other plugins and operations.

I expect that changes like #9538 (rate-limiting) and #9521 (Datadog) will alleviate some of the symptoms. I wasn't able to reproduce the same problem with the OpenTelemetry plugin, which already batches submissions and sends them at intervals. I suspect that applying the same changes to the Zipkin plugin would help; a rough sketch of that pattern follows.
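For reference, the batch-and-interval pattern I mean looks roughly like this (a sketch under my own assumptions, not the actual code from #9538 or #9521): buffer spans per worker and flush them from a single recurring timer, instead of creating one timer per request:

-- sketch of a batch-and-interval reporter (hypothetical, not the PRs' code)
local buffer = {}

local function flush(premature)
  if premature then return end
  if #buffer == 0 then return end
  local batch = buffer
  buffer = {}
  -- send `batch` to the collector over HTTP here: a slow or failed send
  -- delays only this one flush instead of piling up a pending timer per request
end

-- one recurring timer per worker instead of ngx.timer.at(0, ...) per request
local ok, err = ngx.timer.every(1, flush)
if not ok then
  ngx.log(ngx.ERR, "failed to create flush timer: ", err)
end

-- per request: just append to the buffer
table.insert(buffer, { trace_id = "...", duration_ms = 42 })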

However, I think there's also an underlying problem where timer performance appears to be worse under certain circumstances in 3.x, and these kinds of issues are likely to recur without some safety guards. That's what I'm most interested in focusing on.

Steps To Reproduce

This is the simplest way I could reproduce the symptom. I wasn't able to push it hard enough in Gojira to cause timer errors, as we can in Minikube, because Docker appeared to fall over first.

Install:

Create egg.yaml:

version: '3.5'
services:
  kong:
    ports:
      - 8000:8000
      - 8001:8001
  httpbin:
    image: "kong/httpbin:latest"
    labels:
      com.konghq.gojira: "True"
    networks:
    - gojira

Create kong.yaml:

_format_version: "2.1"
_transform: true

plugins:
  - name: zipkin
    config:
      http_endpoint: http://172.18.0.254/will/timeout
      sample_ratio: 1
  - name: rate-limiting
    config:
      second: 50
      limit_by: ip
      policy: redis
      redis_host: redis

services:
- name: httpbin
  url: http://httpbin:80
  routes:
  - name: httpbin
    paths:
    - /

Start the container:

gojira up --tag 3.0.1
gojira run --tag 3.0.1 make dev
gojira snapshot --tag 3.0.1
gojira up --tag 3.0.1 --egg egg.yaml --yml kong.yaml

Run a load test:

echo "GET http://127.0.0.1:8000/status/200" | vegeta attack -rate 200 -duration 2m | tee vegeta.out | vegeta encode | ~/go/bin/jaggr '@count=rps' 'hist[100,200,300,400,500]:code' | tee jaggr.out

Check the results, ideally:

  • jaggr.out should show roughly 50 responses with 200 statuses each second
  • vegeta report vegeta.out should show a success rate of 25% (50/200)

Anything else?

I've put the results of my tests in a Gist so as not to clutter the issue: https://gist.github.com/dcarley/cc3c8959fd8f6a811d0b3c0ddf458a5c

flrgh (Contributor) commented Dec 14, 2022

@ADD-SP what do ya think?

guanlan assigned guanlan and ADD-SP, and unassigned guanlan, Dec 28, 2022
ADD-SP (Contributor) commented Dec 28, 2022

@flrgh We've run into this type of issue before; it's essentially timer abuse, and I think we should fix it the way #9521 did.

kikito (Member) commented Jan 3, 2023

While we will investigate this issue and put some limits on the amount of memory and the number of timers that zipkin consumes, I must point out that this setting is not realistic for production environments:

sample_ratio: 1

The intended usage for the Zipkin plugin is that tracing is done on a sample of all requests; that is why the default value for sample_ratio is 0.001, meaning 1 out of every 1000 requests produces traces (illustrated below). Setting it to 1 is akin to setting the log level to debug: something you would do on a production system only temporarily, in an emergency, to debug a critical failure.
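Conceptually, the sampling decision amounts to this (an illustrative sketch, not the plugin's actual code):

-- illustrative only, not the Zipkin plugin's implementation:
-- with sample_ratio = 0.001, roughly 1 in 1000 requests is traced;
-- with sample_ratio = 1, every request produces and reports spans
local sample_ratio = 0.001

local function should_sample()
  return math.random() < sample_ratio
end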

I will make sure that we point this out in our docs.

dcarley (Contributor, Author) commented Jan 3, 2023

We use a sampling rate of 100% because we do dynamic sampling in Refinery. We've been using this setup with 2.x in production mostly without problems since we added timeouts in #8735. We're planning to migrate to the OpenTelemetry plugin, with the same sample rate, once we've validated that the upgrade path to 3.x is safe. Which is why I'm most interested in:

However I think there's also an underlying problem where timer performance appears to be worse under certain circumstances in 3.x and these kinds of issues are likely to reoccur without some safety guards. Which is what I'm most interested in focusing on.

I'm not able to view the issues FTI-4367 and FTI-4269 that were referenced in #9521. Are you able to provide some more information about what originally prompted that change?

Is there anything else that I can do to help?

ADD-SP (Contributor) commented Jan 4, 2023

@dcarley I'm curious about the cases in which timer performance becomes bad. In 3.x we introduced lua-resty-timer-ng to reduce the overhead of timers, and it shouldn't have gotten any worse.

chronolaw added the pending author feedback label, Jan 5, 2023
stale bot commented Jan 20, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the stale label, Jan 20, 2023
dcarley (Contributor, Author) commented Jan 20, 2023

@dcarley I'm curious about the cases in which timer performance becomes bad. In 3.x we introduced lua-resty-timer-ng to reduce the overhead of timers, and it shouldn't have gotten any worse.

We've observed that 3.x performance is worse under load when a plugin makes calls to an external network dependency and those calls take longer than normal.
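Based on the timer-ng overload warnings in the logs above (alive_threads_avg: 256), my rough mental model is the following. It's a hypothetical sketch, assuming timer jobs run on a fixed pool of worker threads rather than one nginx timer each, not Kong's actual code:

-- hypothetical: a timer job that blocks on a slow network call holds one
-- timer-ng worker thread for the full timeout, so a few hundred slow Zipkin
-- flushes starve every other timer (DNS, Redis, statsd, ...)
local http = require "resty.http"

local function slow_flush(premature, endpoint)
  if premature then return end
  local client = http.new()
  client:set_timeout(10000) -- this worker thread is now tied up for up to 10s
  local res, err = client:request_uri(endpoint, { method = "POST", body = "[]" })
  if not res then
    ngx.log(ngx.ERR, "flush failed: ", err)
  end
end

ngx.timer.at(0, slow_flush, "http://172.18.0.254/will/timeout")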

Are you able to provide any information from FTI-4367, FTI-4269, and FT-3464? Were those issues prompted by timer performance?

stale bot commented Feb 4, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the stale label, Feb 4, 2023
ADD-SP removed the pending author feedback label, Feb 5, 2023
stale bot removed the stale label, Feb 5, 2023
surenraju-careem commented:

Another issue related to timer abuse: #9959

xiaotuzididi commented:

Good, thank you.

ADD-SP (Contributor) commented Mar 28, 2023

Hi @dcarley, could you please try 3.2.2 to check whether this issue still exists?

ADD-SP added the pending author feedback label, Mar 28, 2023
dcarley (Contributor, Author) commented Apr 3, 2023

Hi @dcarley, could you please try 3.2.2 to check whether this issue still exists?

3.2.2 appears to be worse than 3.0.1 for the reproduction case where rate-limiting "fails open" while the Zipkin endpoint is timing out. I've updated the results in the gist: https://gist.github.com/dcarley/cc3c8959fd8f6a811d0b3c0ddf458a5c#gojira-322

ADD-SP removed the pending author feedback label, Apr 3, 2023
StarlightIbuki (Contributor) commented:

Dear contributor,
We're closing this issue as there hasn't been any update to it for a long time. If the issue is still relevant in the latest version, please feel free to reopen it. We're more than happy to revisit it again. Your contribution is greatly appreciated!
Please have a look at our pledge to the community for more information.
Sincerely,
Kong Gateway Team

StarlightIbuki added the stale/pending revisit label, Oct 11, 2023