Unusual spike in response with 499 status code #2072
27 comments · 17 replies
-
Hi Kishan! I don't believe that Ocelot contains a major bug that would manifest as a spike in your logs. However, let's brainstorm possibilities...
Could you point me to any code snippet where Ocelot forcibly cancels upstream or downstream requests on its own initiative?
Do you think that if the downstream request is cancelled, the service still returns a body that we need to relay back upstream? Interesting... Why do you need this response? What will you do with the technical data in the body? Isn't a 499 status code sufficient for the upstream client to make decisions?
Cancelled requests can be replicated easily through page reloading from a browser. However, regarding webhooks, some systems may cancel an ongoing webhook request if there's a new state or it's re-triggered.
It's unclear. Have you attempted deploying Ocelot versions prior to 23.0.0? What were the outcomes? Did you observe similar spikes in the logs? Theoretically, the new changes to the Ocelot kernel in v23.0.0 could affect webhook behavior, but further investigation is required.
Currently, we cannot determine the root cause, as we don't oversee your environment. However, we can brainstorm together and collectively suggest the next steps for identifying it.
Understood!
It's commonly believed that software built on SaaS or SOA architectures invariably encounters "spike" problems in its log graphs. 😄
-
@raman-m it's here: Ocelot/src/Ocelot/Errors/Middleware/ExceptionHandlerMiddleware.cs Lines 40 to 47 in 6e9a975 I haven't checked it yet, but what could cause this is the default request timeout, 90 seconds...
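For readers following the thread, here is a minimal sketch (not the actual Ocelot source; the class and member names are illustrative) of the kind of logic such an exception-handling middleware applies: a cancelled request is surfaced with the non-standard 499 "Client Closed Request" status instead of a 5xx error.

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Http;

// Illustrative sketch only; not the actual Ocelot middleware.
public class CancellationTo499MiddlewareSketch
{
    private readonly RequestDelegate _next;

    public CancellationTo499MiddlewareSketch(RequestDelegate next) => _next = next;

    public async Task Invoke(HttpContext context)
    {
        try
        {
            await _next(context);
        }
        catch (OperationCanceledException) when (context.RequestAborted.IsCancellationRequested)
        {
            // The client aborted the request (browser refresh, closed tab,
            // webhook re-trigger, ...): surface the non-standard 499 status.
            context.Response.StatusCode = 499;
        }
        catch (Exception)
        {
            // Anything else is a genuine gateway error.
            context.Response.StatusCode = StatusCodes.Status500InternalServerError;
        }
    }
}
```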
-
@ggnaegi Thanks! I'm aware of all the references to the 499 status in our code. Indeed, timeout events cancel requests and can cause some "spikes." However, in this instance, I'm puzzled by the issue reporting. @kishan-vachhani, could you please take a screenshot of the entire page showing the spike and share it with us? What type of spikes are you experiencing? Additionally, please provide more details from your logs or the graphs from your monitoring tool.
-
@raman-m I'm currently conducting further investigation to determine the source of the request cancellation. I understand that if the upstream request is cancelled, the downstream request should also be cancelled. This could potentially result in a spike in log entries. What concerns me is that I've noticed a consistent spike in log entries following the deployment of the Ocelot version upgrade (on 05/08/2024). Please refer to the image below.
I agree with you that if the downstream request is cancelled, its response shouldn't be relayed in the upstream response. What I was trying to convey is that requests shouldn't be cancelled unless it's done manually or due to a timeout.
Yes, refreshing the browser or closing the tab while a request is executing will cancel it. However, it's concerning that we're observing cancellations in production for routes (not only webhook ones) where such actions, like refreshing the browser or closing the tab, are very unlikely. This behavior seems unexpected and, in my opinion, requires further investigation.
It appears that after upgrading to the latest version of Ocelot, we've observed a significant increase in occurrences of the 499 response code, as shown in the first attached image. This notable change prompted me to dig deeper into the root cause behind the surge, especially since I was running a lower version of Ocelot previously.
Certainly, I understand your perspective. To help, I've attached a screenshot containing all the logs pertaining to a single request that resulted in a 499 response. I'm seeking collaborative efforts to identify and rectify this issue (if it really is one). In the meantime, could you please provide guidance on potential methods to pinpoint the source of the cancellation? One notable change I've observed is the shift from using the HttpClient's Timeout property to employing the TimeoutDelegatingHandler in combination with the CancellationToken.
Yeah, true 😄
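For reference, here is a rough sketch of the change mentioned above, assuming the handler follows the common linked-token pattern; the class name is invented and this is not Ocelot's exact code. With HttpClient.Timeout the client itself enforces the limit, whereas a timeout delegating handler links its own timeout to the caller's CancellationToken and then has to decide whether a cancellation came from the client or from the timeout.

```csharp
using System;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

// Illustrative sketch of a timeout delegating handler (not Ocelot's source).
public class TimeoutHandlerSketch : DelegatingHandler
{
    private readonly TimeSpan _timeout;

    public TimeoutHandlerSketch(TimeSpan timeout) => _timeout = timeout;

    protected override async Task<HttpResponseMessage> SendAsync(
        HttpRequestMessage request, CancellationToken cancellationToken)
    {
        // One CancellationTokenSource per request, linked to the caller's token.
        using var cts = CancellationTokenSource.CreateLinkedTokenSource(cancellationToken);
        cts.CancelAfter(_timeout);
        try
        {
            return await base.SendAsync(request, cts.Token);
        }
        catch (OperationCanceledException) when (!cancellationToken.IsCancellationRequested)
        {
            // Only the linked timeout fired, not the caller: report it as a
            // timeout (-> 503) rather than a client cancellation (-> 499).
            throw new TimeoutException();
        }
    }
}
```

The exception filter is the subtle part: if the caller's token is already cancelled, the OperationCanceledException propagates unchanged (and ends up reported as 499), while a pure timeout is converted into a TimeoutException (and ends up as 503).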
-
@kishan-vachhani @raman-m OK, I will compare the Timeout in HttpClient with our custom Timeout Delegating Handler. What would be great is to identify a scenario that we could reproduce.
-
@kishan-vachhani Do you use the QoS feature for the routes?
I'm confused by this graph. What does the Y-axis represent? Is it the number of 499 status codes, or is it the count of log entries? Could you attach (copy-paste) the full content of your configuration?
-
I can't see major differences between the timeout logic in HttpClient and the delegating handler we have implemented.
-
Gui, is this the logic you're referring to? 👉 Ocelot/src/Ocelot/Requester/MessageInvokerPool.cs Lines 59 to 66 in 6e9a975 🆗... Here's my understanding of the reported "spikes" issue:
It may not even need to be treated as a problem: it's not an issue with Ocelot itself, but rather incidents of no response from the downstream system, leading Ocelot to naturally cancel the requests. The absence of spikes before the deployment of v23.0.0 is because Ocelot did not generate the 499 status prior to this version, correct? Since the introduction of v23.0, Ocelot has been producing the 499 status, which the monitoring tool logs, resulting in the observed spikes. Bingo! 💥 @ggnaegi Is this the same conclusion you've reached?
-
@raman-m Yes, from the code, it appears that if no timeout is specified, the gateway uses the default timeout of 90 seconds; if the downstream application does not respond within this timeframe, it should throw an exception. Since I haven't configured any Quality of Service (QoS) settings or specified a timeout, it defaults to 90 seconds. Moreover, the downstream application is operational: as evident from the screenshot of the single request trace provided earlier, the gateway responded with a 499 status code within 148.4331 milliseconds, well within the default timeout period of 90 seconds (the same is the case for all of them).

Furthermore, with the introduction of new logic in the 23.0.0 release, a timeout error is returned with status code 503. It's worth noting that Ocelot generated the 499 status prior to this version as well. Ocelot/src/Ocelot/Requester/HttpExceptionToErrorMapper.cs Lines 34 to 38 in 6e9a975

Also, according to the code snippet below, if the downstream application takes too long to respond or is unavailable, cancellationToken.IsCancellationRequested should be false. That condition triggers a TimeoutException, resulting in a response status code of 503. Ocelot/src/Ocelot/Requester/TimeoutDelegatingHandler.cs Lines 16 to 30 in 6e9a975

IMO, something is triggering the cancellation token prematurely. 🤔
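To make that 499-versus-503 decision path concrete, here is a simplified, hypothetical mapper in the spirit of the referenced code; the type and method names are made up, and the real HttpExceptionToErrorMapper works through Ocelot's error classes rather than returning status codes directly.

```csharp
using System;
using System.Net;
using System.Threading;

// Simplified, hypothetical mapper; not Ocelot's HttpExceptionToErrorMapper.
public static class ExceptionToStatusSketch
{
    public static HttpStatusCode Map(Exception exception, CancellationToken requestToken) =>
        exception switch
        {
            // The timeout handler converted an elapsed timeout into TimeoutException.
            TimeoutException => HttpStatusCode.ServiceUnavailable,              // 503

            // The caller's token was cancelled: the client went away.
            OperationCanceledException when requestToken.IsCancellationRequested
                => (HttpStatusCode)499,                                         // non-standard 499

            // Anything else is an unexpected gateway error.
            _ => HttpStatusCode.InternalServerError,                            // 500
        };
}
```

The point of the comment above is that the traces show a 499 after roughly 148 ms, far below the 90-second default, so it is the caller's token rather than the timeout that is being cancelled; the open question is what triggers that cancellation.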
-
@kishan-vachhani Could you give us some metrics about your environment, such as requests per second? From our side, it's very difficult to draw conclusions without more detailed observations. Besides, the changes were tested and rolled out on production environments under very heavy load.
-
@raman-m I'm not utilizing the Quality of Service (QoS) feature for any of my routes. The Y-axis of the graph represents the number of responses with 499 status codes, while the X-axis represents the timeline. Unfortunately, I cannot share my configuration here.
-
@ggnaegi The production environment where I'm encountering the issue handles an average throughput of 2.37k requests per minute (rpm) over the past 24 hours; during peak hours, this figure rises to 8k rpm.
-
@kishan-vachhani OK, the latest version is running on a production environment with the following metrics on average (24h): 650 requests per second, 39k requests per minute. I checked the request: why do you have a 102 status code? It's unknown. Maybe this is the cause of the cancellation? ... wait a minute... Why did we do that, dear @raman-m? Ocelot/src/Ocelot/Requester/Middleware/HttpRequesterMiddleware.cs Lines 40 to 43 in 171e3a7
-
@ggnaegi, do you want to discuss the #1953 changes, or do you want to find the root cause of the reported issue?
-
@raman-m I was looking for the error, and then this 102 status code popped up. It's not the real status, so why would you write a log message with a status code that isn't correct? It's only a symptom. We might indeed have a threading issue somewhere... As a matter of fact, after a short review, the design of the Timeout Handler is, in my opinion, thread safe: ... Further investigations tomorrow...
-
The "rustic" way of managing the timout without QoS is, I think, the source of your problems (we have several open issues on the subject, it should at least be configurable). |
-
@ggnaegi I can confirm what I told you yesterday: no 499 in the last 48 hours (I can't go back any further).
-
@RaynaldM Thanks a lot!
-
@raman-m @RaynaldM Maybe we should move the default timeout to the QoS and provide some global parameters for it.
-
I don't think so; they're two very different systems.
-
@RaynaldM OK, but we could use a default Polly implementation and, as soon as QoS parameters are defined, use the QoS... We wouldn't have the timeout as a delegating handler, and we would avoid discussions with colleagues using the solution.
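As a hedged illustration of this direction (a sketch of the idea, not an agreed design; the helper name and parameter are invented), a default Polly timeout could back every route and be replaced whenever QoS options define their own value:

```csharp
using System;
using System.Net.Http;
using Polly;
using Polly.Timeout;

// Hypothetical helper sketching "default Polly timeout unless QoS is configured".
public static class TimeoutPolicySketch
{
    // Global fallback, mirroring today's hard-coded 90-second behaviour.
    private static readonly TimeSpan DefaultTimeout = TimeSpan.FromSeconds(90);

    public static IAsyncPolicy<HttpResponseMessage> For(TimeSpan? qosTimeout)
    {
        // If the route defines QoS options, honour them; otherwise fall back
        // to the global default. The optimistic strategy cooperates via the
        // CancellationToken passed down to the HTTP call.
        var timeout = qosTimeout ?? DefaultTimeout;
        return Policy.TimeoutAsync<HttpResponseMessage>(timeout, TimeoutStrategy.Optimistic);
    }
}
```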
-
But @kishan-vachhani, I'm quite sure the delegating handler is thread safe though...
-
I also think there could be a threading issue causing request cancellation due to a race condition. I've reviewed my custom delegating handler, but had no luck. However, this pattern of 499 status codes persists for routes without a custom delegating handler as well. @RaynaldM @ggnaegi The issue is not caused by individual requests hitting the 90-second timeout threshold, so setting up QoS may not help. If I am mistaken, please let me know.
-
I've had enough of this debate! Currently, I perceive no problems with Ocelot. @kishan-vachhani, I encourage you to partake in the discussion more light-heartedly. You are obliged to demonstrate that there is indeed a bug in Ocelot❗
-
@kishan-vachhani you're right, I was trying to find the source of a possible race condition, and the delegating handlers are possible candidates. So I reviewed the design, realized that thread safety was preserved, and thought: "Well, we could go a step further and avoid problems and discussions by moving the timeout to Polly and QoS."
-
Proof that I'm old, but I couldn't find any better expression: ROFL! 😸
-
@RaynaldM @ggnaegi Regarding timeouts: surprise! 💥 We have this old property in FileRoute (route JSON configuration), but it is never used. So, this is a good candidate (a starting point) for implementing default timeouts per route or globally. We already have appropriate issues in the backlog.
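Purely as a hypothetical sketch of where that starting point could lead (the property is not wired up today, and everything below except the idea of a route-level timeout is invented), the effective timeout could be resolved from the route value, then a global default, then the current hard-coded 90 seconds:

```csharp
using System;

// Hypothetical wiring sketch: pick the effective timeout for a route from
// (1) the route-level value, (2) a global default, (3) the hard-coded 90 s.
// "routeTimeoutSeconds" stands in for the currently unused FileRoute property.
public static class RouteTimeoutSketch
{
    private static readonly TimeSpan HardCodedDefault = TimeSpan.FromSeconds(90);

    public static TimeSpan Resolve(int? routeTimeoutSeconds, int? globalTimeoutSeconds)
    {
        if (routeTimeoutSeconds is > 0)
            return TimeSpan.FromSeconds(routeTimeoutSeconds.Value);

        if (globalTimeoutSeconds is > 0)
            return TimeSpan.FromSeconds(globalTimeoutSeconds.Value);

        return HardCodedDefault;
    }
}
```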
-
Expected Behavior
The response from the downstream service should be forwarded for the incoming request, and the gateway should not return a 499 status code.
Actual Behavior
Random requests to the downstream service are being canceled, resulting in the gateway returning a 499 status code.
Steps to Reproduce the Problem
I don't have the exact steps to reproduce this issue, but it seems to occur more frequently for routes with high incoming request rates, such as webhooks. These requests are primarily automated, reducing the likelihood of manual cancellation of the CancellationToken.
Upon reviewing the change log for the major release of version 23.0.0, I noticed updates to the downstream implementation for performance enhancement. This includes the introduction of HttpMessageInvoker and the addition of PooledConnectionIdleTimeout. Could these changes be contributing to the issue?
I will continue investigating and update this issue with any additional findings or if I can identify the exact steps to reproduce the problem. Any assistance in identifying the cause would be appreciated.
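For context on the two changes mentioned: HttpMessageInvoker is the lower-level counterpart of HttpClient and has no built-in Timeout property, so any timeout must travel through the CancellationToken, and PooledConnectionIdleTimeout controls how long an idle pooled connection is kept alive. A minimal sketch with illustrative values (not Ocelot's actual defaults) of how these pieces are typically combined:

```csharp
using System;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

// Sketch of an HttpMessageInvoker over SocketsHttpHandler; the values shown
// are illustrative, not what Ocelot configures.
public static class MessageInvokerSketch
{
    public static HttpMessageInvoker Create()
    {
        var handler = new SocketsHttpHandler
        {
            // Idle pooled connections are closed after this interval.
            PooledConnectionIdleTimeout = TimeSpan.FromSeconds(120),
        };

        // Unlike HttpClient, HttpMessageInvoker has no Timeout property:
        // any timeout must be enforced via the CancellationToken passed to SendAsync.
        return new HttpMessageInvoker(handler, disposeHandler: true);
    }

    public static async Task<HttpResponseMessage> SendAsync(
        HttpMessageInvoker invoker, HttpRequestMessage request, CancellationToken token)
    {
        return await invoker.SendAsync(request, token);
    }
}
```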
Specifications
Anyone facing the same issue is welcome to add more details or findings about it.