Prometheus plugin causes much longer serving duration for concurrent HTTP requests #8844
Comments
We can add some
In one Kubernetes cluster suffering this problem, we have 428 services and 1,633 routes. I understand that yielding occasionally during the scrape might help, but the real problem here is the periodic call to `get_keys`. Better would be for each NGINX worker to keep track of its metric values only locally, and have each of them respond to a "scrape" request—be it via IPC (per an example from five years ago) or an HTTP request—from the worker that handles the scrape.
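The per-worker design proposed above can be sketched roughly as follows. This is an illustrative Python sketch, not Kong or OpenResty code; the class and function names are hypothetical. The point is that the hot path (`incr`) touches only worker-local memory, and aggregation across workers happens only when a scrape arrives:

```python
# Hypothetical sketch of per-worker metric ownership with scrape-time
# aggregation. Each "worker" increments only its own local dict; the
# scrape path polls every worker and sums the results, so ordinary
# request handling never contends on a shared lock.
from collections import Counter

class WorkerMetrics:
    """Per-worker metric store: plain local memory, no shared state."""
    def __init__(self):
        self._counters = Counter()

    def incr(self, name, value=1):
        # Hot path: local mutation only.
        self._counters[name] += value

    def report(self):
        # Called only when a scrape request arrives (e.g. via IPC).
        return dict(self._counters)

def scrape(workers):
    """Aggregate metrics across all workers at scrape time."""
    total = Counter()
    for w in workers:
        total.update(w.report())
    return dict(total)

workers = [WorkerMetrics() for _ in range(4)]
for i, w in enumerate(workers):
    w.incr("http_requests_total", i + 1)
print(scrape(workers))  # {'http_requests_total': 10}
```

In the real proposal the `report` step would run over ngx_lua_ipc or an internal HTTP request rather than a direct method call, but the ownership split is the same.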
I have been experimenting with a modified copy of the Prometheus plugin, testing my hypotheses about the cause of this problem. It turns out that I was wrong on several counts—or, more forgivingly, at least wrong about the significance of several design choices. I disabled the following operations, both in isolation and in various combinations, repeatedly issuing requests to see whether the approximately 1.5-second delay remained:
To my surprise, none of these had any observable effect on the maximum request serving duration. The only operation that seems to have any effect is the fetching of the set of keys from the shared dictionary via the `get_keys` function.

At present, in one of the Kubernetes clusters in which I've been testing, our shared dictionary has 3,284 keys. I suspect that the key count is higher—on the order of twice that—in some of our other clusters.

I studied both the documentation and the implementation for the `get_keys` function. I think the next step is to experiment with an alternative design for this bookkeeping. Again, the ngx_lua_ipc library looks like a promising part of eliminating our heavy use of this shared dictionary, but getting a Kong container image with OpenResty built with this module included is quite difficult.
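A toy illustration of why that key enumeration hurts concurrent requests (this is plain Python, not Kong code, and the 0.2-second sleep is a stand-in for real per-key work): while one thread holds the lock to scan all keys, every increment from a concurrently served request must wait out the scan.

```python
# Toy illustration (not Kong code) of the contention pattern: a single
# lock protects both key enumeration (the scrape) and counter
# increments (request handling), so a slow get_keys-style scan delays
# every concurrent increment by roughly the scan's duration.
import threading
import time

lock = threading.Lock()
metrics = {f"metric_{i}": 0 for i in range(3000)}

def scrape_keys():
    with lock:
        keys = list(metrics)        # full key scan under the lock
        time.sleep(0.2)             # stand-in for per-key work
    return keys

def incr(name):
    start = time.monotonic()
    with lock:                      # blocks while the scan holds the lock
        metrics[name] += 1
    return time.monotonic() - start

t = threading.Thread(target=scrape_keys)
t.start()
time.sleep(0.05)                    # let the scan grab the lock first
waited = incr("metric_0")
t.join()
print(f"increment waited {waited:.2f}s")  # roughly the scan duration
```

The shared-dictionary mutex in NGINX behaves analogously across worker processes, which would match the observed ~1.5-second tail latencies during scrapes.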
One incremental approach I neglected to mention is to shard or partition the single shared memory zone used for the Prometheus metrics dictionary. Choose a fixed split factor such as 256, create that many shared memory zones, write a surjective projection function that maps from a short metric name to an integer in [0, 255], and access the dictionary in the appropriate shared memory zone for each metric name. That's more bookkeeping overhead, but if we assume that the projection function won't yield "hot" partitions, we'd divide the size of the key set retrieved from each shared dictionary by the chosen factor. It may turn out, though, that the overhead of making, say, 256 separate `get_keys` calls (one per zone) costs more than it saves.
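The projection function described above could look something like this. The sketch is in Python for illustration; the shard count of 256 comes from the comment, while the choice of FNV-1a as the stable hash is my own assumption (any stable hash over the metric name would do):

```python
# Sketch of partitioning metric names across SHARDS shared-memory
# zones. A stable hash (FNV-1a 64-bit here; the hash choice is an
# assumption) maps each metric name to one zone index, so a scrape
# enumerates 256 small key sets instead of one large one.
SHARDS = 256

def shard_of(name: str) -> int:
    """Surjective projection from a metric name to a zone index in [0, SHARDS)."""
    h = 0xcbf29ce484222325                      # FNV-1a 64-bit offset basis
    for byte in name.encode("utf-8"):
        h = ((h ^ byte) * 0x100000001b3) & 0xFFFFFFFFFFFFFFFF
    return h % SHARDS

# Every name lands in exactly one zone, and the mapping is stable
# across workers, so readers and writers agree on placement.
assert 0 <= shard_of("http_requests_total") < SHARDS
assert shard_of("kong_latency_bucket") == shard_of("kong_latency_bucket")
```

In OpenResty this index would select among zones declared with 256 separate `lua_shared_dict` directives (or generated configuration), which is exactly the extra bookkeeping the comment anticipates.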
Hello @seh, I was looking at this issue (for which you have done a very good job of analyzing the problem) and realized that you are using Kong Enterprise. This is the Kong open-source repo. Enterprise issues have a separate tracking (and priority) system, the first step being to contact your sales representative or customer success engineer at Kong. If you have not done so already, I recommend you initiate that as soon as possible. On your issue: I don't have an answer or a quick fix for it. I can confirm that reading from shared dictionaries, especially enumerating keys with `get_keys`, is expensive.
Thank you, Enrique. We have been working with Kong account representatives and engineers since I filed this issue, so we are in good hands there. I filed it publicly deliberately because the Prometheus plugin's source is open, and it behaves no differently in the Enterprise Gateway version. I wanted more eyes on the problem—assuming that there may be other users out there who have not yet realized the cause of this serving delay. After all, it took us about two and a half years to notice it. We've had to cease not only our use of the Prometheus plugin (hooking the "log" serving phase), but also our allowing Prometheus to scrape our proxy's Admin API. Removing the plugin was not enough: just serving the HTTP requests against the /metrics path still caused intolerable delays in the proxy's serving of concurrent requests.
Not at all happy with this approach. It might be better to just scrape host metrics from Prometheus, depending on this performance issue: - Kong#8844
This was an interesting read, and it might explain some spikes I was experiencing. On behalf of the open-source users, thank you.
Hi all, regarding the Prometheus problem: we have published a new blog article with guidance on how to use Prometheus with Kong Gateway.
I'm closing this issue as it was fixed by #9028. |
Partially fixed, that is. |
Yes, you are right. Completely fixing it requires a considerably deeper change. Before the fix, using Kong in a high-traffic environment with a large number of entities and the Prometheus plugin active at the same time was virtually impossible.
Is there an existing issue for this?
Kong version (`$ kong version`)
Kong Enterprise 2.8.1.0
Current Behavior
When we activate Kong's Prometheus plugin in "global" mode—by labeling the KongClusterPlugin object with the "global" label with a value of "true"—and have Prometheus scrape the proxy container at 0.1 Hz (once every ten seconds), Kong will take an extra 1.5 seconds or so to serve many other requests that arrive concurrently with Kong responding to that scrape request. Note that the Kubernetes kubelet is also issuing probes to the Kong Admin API's /status HTTP path at the same frequency, but we don't find that those probes cause this problem when we stop Prometheus from scraping the /metrics HTTP path.
When Prometheus is not scraping Kong, and even in the approximately seven second period between when Kong starts and later finishes responding to the scrape request, we can issue thousands of requests against nonexistent routes, and Kong will respond with status code 404 within 80ms or less. For example, we might use the bombardier tool to issue three seconds' worth of requests, at around 10,000 requests per second, and Kong will never take more than 80ms to respond to any of them. However, if those requests overlap with the Prometheus scrape, then some of them—less than 1% of them as reported by bombardier—will take around 1.6 seconds.
Here is a sample invocation of bombardier, provoking 404 responses deliberately by requesting a virtual hostname for which we have no Kubernetes Ingress object and hence no route within Kong's proxy:
We can issue that from the same machine hosting the Kong proxy container in a Kubernetes pod. We use TLS here to mimic a common request pattern that our proxies actually serve daily.
If we patch the KongClusterPlugin for Prometheus and either set the "global" label's value to "false" or remove it entirely, strangely we don't see this problem abate. Even without the Prometheus plugin activated, Kong still serves metrics on the /metrics path; it just doesn't collect them for requests targeting covered routes and services. Only when we stop Prometheus from scraping Kong—or when we decrease its scrape frequency, elongating the periods between scrape requests—does the problem abate.
While running top on the same machine on which Kong's proxy is running, we can see the reported CPU use by each of the nginx processes swell from around 10% to 50% approximately every ten seconds, where it hovers before falling again about three seconds later.
I have studied the source code for the Prometheus plugin, looking for evidence that a long-running /metrics request could block adjustment of metrics while serving other requests concurrently. I do see that `Prometheus:collect` calls `Prometheus:metric_data`, which calls on `self._counter.sync`, and then calls `self.dict:get_keys`. That last call hits the `ngx.shared.DICT.get_keys` function, whose documentation mentions that its implementation involves acquiring and holding a mutex on the shared memory pool.

Having seen that, I went looking for some call path involved in updating metrics for other HTTP requests that NGINX is serving concurrently. `PrometheusHandler.log` calls `exporter.log`. Within that latter `log` function, we find many calls to lua-resty-counter's `incr` function. That function only mutates what looks to be a worker-local table, the content of which it synchronizes only periodically, via an NGINX timer that calls the `sync` function. That `sync` function calls on the shared dictionary's `incr` function repeatedly, the implementation of which also requires acquiring and holding a mutex. However, if those calls only arise in the context of an NGINX timer running in the background, presumably not interlocked with any normal HTTP request processing, it's not yet clear to me how the call to `Prometheus:collect` winds up getting blocked by any of this other per-worker synchronization with the shared memory dictionary. Am I missing some other call path involved in normal HTTP request processing that would also try to acquire this same mutex?

I also considered the two calls to `ngx.sleep`—one in `prometheus.del`, and one in `prometheus.reset`—as each of those could cause a stall for the hard-coded synchronization interval of one second. However, since the only metric that gets reset frequently is a gauge, we don't wind up calling on `ngx.sleep`.

Expected Behavior
It would be acceptable if using Kong's Prometheus plugin added a millisecond or two to the request serving duration, to allow for counter and gauge adjustments—ideally as atomic operations. We don't expect to suffer these drastic variations in request serving durations while Prometheus is scraping Kong's proxy.
Steps To Reproduce
Anything else?
Operating system: Flatcar Container Linux version 2765.1.0, kernel version 5.10.16-flatcar
Kong configuration:
HTTP GET against Admin port