Plugin and timer performance on Kong 3.x #9964
@ADD-SP what do ya think?
While we will investigate this issue and put some limits on the amount of memory and the number of timers that the Zipkin plugin consumes, I must point out that this setting is not realistic for production environments:
The intended usage for the Zipkin plugin is that tracing is done on a sample of all the requests; that is why the default value for … I will make sure that we point this out in our docs.
We use a sampling rate of 100% because we do dynamic sampling in Refinery. We've been using this setup with 2.x in production, mostly without problems, since we added timeouts in #8735. We are planning to migrate to the OpenTelemetry plugin, with the same sample rate, once we've validated that the upgrade path to 3.x is safe. Which is why I'm most interested in:
I'm not able to view the issues FTI-4367 and FTI-4269 that were referenced in #9521. Are you able to provide some more information about what originally prompted that change? Is there anything else that I can do to help?
@dcarley I'm curious about the cases in which timer performance becomes bad. In 3.x we introduced lua-resty-timer-ng to reduce the overhead of timers, and performance shouldn't get any worse.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
We've observed that 3.x performance is worse under load when a plugin makes calls to an external network dependency and those calls take longer than normal. Are you able to provide any information from FTI-4367, FTI-4269, and FT-3464? Were those issues prompted by timer performance?
Another issue related to timer abuse: #9959
Good, thank you.
Hi @dcarley, could you please try 3.2.2 to check whether this issue still occurs?
3.2.2 appears to be worse than 3.0.1 for the reproduction case where |
Dear contributor, |
Is there an existing issue for this?
Kong version (`$ kong version`): 3.0.1 (also tested against 3.1.0)
Current Behavior
(sorry, this is long)
We have a set of integration tests that run against our installation of Kong. Our rate limiting tests have been failing intermittently, with more requests allowed than there should be, when testing an upgrade from Kong 2.8.3 to Kong 3.x (initially 3.0.0, but we've more recently tried 3.0.1 and 3.1.0). When this happens it appears to take a while to recover, and sometimes it doesn't recover at all. That's not a terribly useful report without a complete reproduction of how we deploy and test Kong, though, so I've attempted to narrow it down.
One way that I've been able to reproduce the same symptoms is by configuring the Zipkin plugin (which we normally use) with an endpoint that times out (e.g. an unreachable IP address). When generating requests against 3.x this immediately causes more requests to be allowed than there should be. When increasing the request rate it eventually causes other plugins and operations that depend on timers to also fail and not recover:
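As an illustration, a minimal sketch of a Zipkin plugin entry that triggers this scenario (the non-routable address is a hypothetical stand-in for "an endpoint that times out"; `http_endpoint` and `sample_ratio` are the plugin's standard parameters):

```yaml
plugins:
  - name: zipkin
    config:
      # Hypothetical non-routable IP: connections hang until they time out
      http_endpoint: http://10.255.255.1:9411/api/v2/spans
      # Report every request, so each request schedules a reporting timer
      sample_ratio: 1
```

With 100% sampling, every proxied request queues work against the unreachable collector, which is what exercises the timer subsystem under load.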
The significant difference between 2.x and 3.x is that 3.x appears to fail at lower request rates, earlier on, and doesn't recover. I think that rate limiting is acting as an "early warning" of this because delays in running timers mean that usage counters aren't being incremented quickly enough within each "second" bucket.
Expected Behavior
The performance of one plugin and its network dependencies shouldn't adversely affect the performance of other plugins and operations.
I expect that changes like #9538 (rate-limiting) and #9521 (Datadog) will alleviate some of the symptoms. I wasn't able to reproduce the same problem with the OpenTelemetry plugin, which already batches submissions and sends them at intervals. I suspect that applying the same changes to the Zipkin plugin would help.
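For comparison, a hedged sketch of the OpenTelemetry plugin's batching settings (parameter names as I understand the 3.x plugin; the endpoint and values are illustrative assumptions, not from our setup):

```yaml
plugins:
  - name: opentelemetry
    config:
      # Hypothetical collector address
      endpoint: http://collector.example.com:4318/v1/traces
      # Spans are buffered and flushed in batches on an interval,
      # so a slow or unreachable collector doesn't cost one timer per request
      batch_span_count: 200
      batch_flush_delay: 3
```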
However, I think there's also an underlying problem where timer performance appears to be worse under certain circumstances in 3.x, and these kinds of issues are likely to recur without some safety guards. That is what I'm most interested in focusing on.
Steps To Reproduce
This is the simplest way that I could reproduce the symptom. I wasn't able to push it hard enough to cause timer errors in Gojira, like we can in Minikube, because Docker appeared to fall over first.
1. Install:
2. Create `egg.yaml`:
3. Create `kong.yaml`:
4. Start the container:
5. Run a load test:
Check the results. Ideally:

- `jaggr.out` should show roughly 50 responses with 200 statuses each second
- `vegeta report vegeta.out` should show a success rate of 25% (50/200)

Anything else?
I've put the results of my tests in a Gist so as not to clutter the issue: https://gist.github.com/dcarley/cc3c8959fd8f6a811d0b3c0ddf458a5c