Kong performance breaks down while updating an entity via the Admin API #7543
Comments
@23henne do you have
Good catch, it defaults to
@23henne, yes let's see if we can drop strict on
@bungle Questions
Thanks
Hello @23henne.
Hi @23henne! #6833 is related to
Hello @locao! Coming back to your questions:
Do you see any
But we know such targets were available.
Can you share more details on your
What is the behavior during the low performance period? Do clients get connection refused errors, timeouts, 500s?
To be honest, if I didn't know better, I would say it looks like we hit some kind of TCP or resource limit. It is striking that we have huge CPU consumption during this time frame. I thought it was caused by update propagation, which is why it was very promising to play with
It is really strange. Maybe it is worth mentioning that we have 2k RPS across our 6 instances.
Hello @23henne, could you check your DNS resolver's performance? If the DNS resolver is slow, it may cause new requests to be blocked on DNS resolution after
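A minimal way to sanity-check resolver latency from a Kong node (a sketch only; the resolver address and hostname below are placeholders, not values from this thread):

```sh
# Query the resolver Kong is configured to use a few times and watch the reported query time.
# 10.0.0.53 and upstream.example.internal are placeholders.
for i in $(seq 1 10); do
  dig @10.0.0.53 upstream.example.internal | grep 'Query time'
done
```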
Hi @23henne, could you try using target IP addresses instead of host names, so we can pinpoint whether this is related to DNS or not? Out of curiosity, and unrelated to this issue, are you using HTTPS for active health checks on the upstreams?
@locao, are we talking about upstreams? We don't use hostnames in there! But we also have service targets which do not refer to KONG upstreams but to some other external destination using a domain name. We use active health checks, but over HTTP, not HTTPS. This is what happens if I perform 3000 requests like this
via the KONG Admin API within 50 minutes. And this applies to all cluster nodes. I have no clue.
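The actual request snippet did not survive in this thread; a hypothetical Admin API call of this shape (route name and field are placeholders, not the reporter's real request) might look like:

```sh
# Hypothetical example only -- the original request from the thread was not preserved.
curl -sS -X PATCH http://localhost:8001/routes/example-route \
  --data 'tags[]=load-test'
```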
Hello @23henne, could you share the debug-level logs for Kong from when the latency spike happens?
Hello @23henne, yeah, please share the debug logs with us once they are ready. If you don't feel comfortable uploading them on GitHub, you can also email the logs to me at: datong.sun-konghq.com (replace - with @). Just mention the issue # in the email. Also, your system CPU usage isn't that high (84% idling); I wonder if increasing the number of Nginx workers on the machine could help reduce the CPU usage of individual workers?
Hello @dndx, thank you, hopefully we can provide logfiles shortly. As I said, we already increased worker processes to 16 per node, and in that case too all of them were more or less busy. Don't you think that was enough? Shame on me, but I didn't execute an
While going through the documentation we found the following paragraph concerning
To be honest, this confuses me: it defaults to 128M, but you recommend 500M or as large as possible. Maybe you can help us find a proper sizing for our demands.
This is what our current configuration looks like. We were also wondering whether it would be worth playing with the update frequency parameters.
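The configuration snippet itself was not preserved here; for reference, the parameters discussed in this thread can be set in kong.conf or via the corresponding environment variables, roughly as follows (values are illustrative, not the reporter's actual settings):

```sh
# Illustrative values only -- not the settings from this deployment.
export KONG_MEM_CACHE_SIZE=512m                # default is 128m
export KONG_NGINX_WORKER_PROCESSES=16
export KONG_WORKER_CONSISTENCY=eventual        # defaults to "strict" in Kong 2.x
export KONG_WORKER_STATE_UPDATE_FREQUENCY=5
export KONG_DB_UPDATE_FREQUENCY=5
export KONG_DB_UPDATE_PROPAGATION=0            # only relevant for Cassandra deployments
kong restart
```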
Thank you so much
It is most probably a router rebuild that is happening there. The router is rebuilt on each worker when you change settings (more workers, more work to be done). If you change things in a batch, e.g. multiple changes, it may rebuild the router multiple times. Eventual rebuilding happens in the background, but it is still work to be done.
Sure, but it should never block the complete environment for several minutes.
We set
This is what was issued via the Admin API.
I think it's normal behaviour that we see thousands of entries like the ones below while running an
@23henne because Nginx uses non-blocking I/O,
One way to rule out the router rebuild as the culprit is to add debug logs that print timestamps around the calls to https://github.com/Kong/kong/blob/master/kong/runloop/handler.lua#L671. If the router rebuild is slow, either due to a slow DB or a slow rebuild, it should be very obvious. If possible, you can also run the https://github.com/kong/stapxx#lj-lua-stacks tool on the offending instance according to its directions to capture a CPU flame graph of the LuaJIT VM; it will also tell you where the CPU time is being spent.
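As a rough sketch of the flame-graph suggestion (the worker PID is a placeholder, and the exact flags and helper scripts may differ between stapxx, openresty-systemtap-toolkit and FlameGraph versions, so verify against the linked README before running):

```sh
# Sample ~5 seconds of LuaJIT stacks from one busy Nginx worker (PID 12345 is a placeholder),
# then fold the stacks into a flame graph. Assumes systemtap, stapxx,
# openresty-systemtap-toolkit (fix-lua-bt) and FlameGraph are installed on the host.
./samples/lj-lua-stacks.sxx --arg time=5 --skip-badvars -x 12345 > worker.bt
fix-lua-bt worker.bt > worker.fixed.bt
stackcollapse-stap.pl worker.fixed.bt > worker.folded
flamegraph.pl worker.folded > worker.svg
```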
@dndx We will implement the logging enhancement by next week.
Finally, we were able to reproduce the problem on test environments and find the root cause. It is related to the local cache in router.lua, and more precisely to the way it is built. After an update or a Kong restart, this cache is built from scratch and each incoming request has to go through all route definitions to find the one that matches it. The matching algorithm evaluates the path definitions that are regular expressions first, and the prefix definitions (without wildcards) after that. So, for each incoming URI that is not yet in the router's local cache and matches no regular expression definition but does match a prefix definition, Kong has to evaluate all the regular expressions anyway, which costs a lot of CPU time. If we have many regex path definitions (in our case more than 3000), this takes a lot of time. In combination with high incoming traffic it causes a rapid increase in CPU usage. The situation is made worse by the fact that each Nginx worker has a separate local cache to build. An additional side effect we discovered is that sending many requests which generate 404 errors may lead to a DoS, because for each such request Kong has to go through all route definitions every time.
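One way to observe this effect from the outside (a hedged sketch; the host, port and paths are placeholders) is to time requests for URIs that miss the router cache and match no route, since each of them pays the full regex scan described above:

```sh
# Time 20 requests for URIs that match no route (expected 404s).
# Elevated time_total here, compared to a cached or prefix-matched URI, points at the regex scan.
for i in $(seq 1 20); do
  curl -s -o /dev/null -w '%{http_code} %{time_total}s\n' \
    "http://kong.example.internal:8000/no-such-path-$i"
done
```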
@marcinkierus Glad you were able to find the issue. Regex routes in general are very expensive, because they are hard to parallelize. Even software like Nginx runs them one by one, which is essentially an O(n) solution. There are possibly some ways to combine multiple regexes into a few bigger automata that can be run simultaneously, but that would require a custom regex engine to work. Out of curiosity, could you share with us what your 3000 regex routes look like? Are there any similarities between them that might enable a more efficient search algorithm (e.g. a radix tree)?
@dndx I'd like to give an artificial example to explain our use case, since our ruleset changes frequently and a one-time optimization might not help in the long run: there are multiple base paths we'd like to route to a service serving a specific part of our web portal, e.g. /products. To achieve this we are currently forced to use a regex for the exact match, but if we define high-traffic rules (like /products) as a prefix instead, we observe the performance problems explained above by @marcinkierus, since prefixes are processed at the end of the ruleset. To reduce our problem we currently define high-traffic routes which don't need exact matches as regexes with priority 1, e.g. /static(/|$). In general the number of exact-match rules might grow, and an alternative, less CPU-hungry approach to exact matching would be helpful. (I hope I was able to clarify our general situation.)
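To make that workaround concrete, a hypothetical Admin API call creating such a high-priority regex route might look like this (the service name, route name and path are illustrative, not from this deployment):

```sh
# Illustrative only: a regex path with an explicit regex_priority so it is evaluated
# before the large pool of other regex routes.
curl -sS -X POST http://localhost:8001/services/static-service/routes \
  --data 'name=static-high-priority' \
  --data 'paths[]=/static(/|$)' \
  --data 'regex_priority=1'
```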
@dndx after further investigation it turned out that our problem might be related to #7482. First tests show that after increasing lua_regex_cache_max_entries to a value greater than the number of all our existing regex rules, latency and CPU usage look much better while updating entities. Since I consider this more a workaround than a solution, I am wondering whether this is a limitation of Kong in terms of regex usage? The default value of lua_regex_cache_max_entries is 1024, and our problems started around the time the number of rules exceeded this value.
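For reference, a hedged sketch of that workaround using Kong's injected Nginx directives (the value is only an example; it should exceed the number of distinct regex paths in use):

```sh
# Raise OpenResty's PCRE regex cache above the number of distinct regex paths.
# 4096 is an illustrative value, not a recommendation; the lua-nginx-module default is 1024.
export KONG_NGINX_HTTP_LUA_REGEX_CACHE_MAX_ENTRIES=4096
kong restart
```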
@23henne Have you solved this problem? I have a similar problem, updating Kong.yml every second,
We have a similar problem. After we edit a plugin config, the related route shows high latency on the first request. We fixed it with this related PR:
@ADD-SP Could you take a look and update?
This issue is marked as stale because it has been open for 14 days with no activity.
Dear contributor, we are automatically closing this issue because it has not seen any activity for three weeks. Your contribution is greatly appreciated! Please have a look
Sincerely,
Summary
We are currently facing a production issue. Once we apply a change via PATCH on one particular route entity, the KONG nodes are no longer able to operate at a normal level. They do not accept requests, or at least process them very slowly. See some statistics attached.
Latency also explodes during this period.
How can we find out whether we suffer from bad performance during cache invalidation?
We run a six-node cluster using Cassandra as the datastore. We cannot see any problems at the database level. As per the documentation (at least as I understand it), only the particular cache key is invalidated once it is updated, not the whole cache. Why do I face issues on all services/routes when changing only one particular entity?
We use version 2.2.1. I don't have any clue how to identify the root cause 😦
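One minimal way to check this (a sketch; hostname and path are placeholders) would be to watch Kong's own latency response headers on one node while the PATCH is issued: if X-Kong-Proxy-Latency spikes while X-Kong-Upstream-Latency stays flat, the time is being spent inside Kong rather than in the upstream.

```sh
# Sample Kong's latency response headers once per second on a placeholder route.
while true; do
  curl -s -o /dev/null -D - http://kong.example.internal:8000/some-route \
    | grep -i 'x-kong.*latency'
  sleep 1
done
```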
Thanks in advance
Henning