Unusual spike in response with 499 status code #2072
27 comments · 17 replies
-
Hi Kishan! I don't believe that Ocelot contains a major bug that would manifest as a spike in your logs. However, let's brainstorm possibilities...
Could you point me to any code snippet where Ocelot forcibly cancels upstream or downstream requests on its own initiative?
Do you think that if the downstream request is cancelled, the service still returns a body that we need to relay back upstream? Interesting... Why do you need this response? What will you do with the technical data in the body? Isn't a 499 status code sufficient for the upstream client to make decisions?
Cancelled requests can be replicated easily through page reloading from a browser. However, regarding webhooks, some systems may cancel an ongoing webhook request if there's a new state or it's re-triggered.
It's unclear. Have you attempted deploying Ocelot versions prior to 23.0.0? What were the outcomes? Did you observe similar spikes in the logs? Theoretically, the new changes to the Ocelot kernel in v23.0.0 could affect webhook behavior, but further investigation is required.
Currently, we cannot determine the root cause, as we don't oversee your environment. However, we can brainstorm together and collectively suggest the next steps for identifying it.
Understood!
It's commonly believed that software built on SaaS or SOA architectures invariably encounters "spike" problems in its log graphs. 😄
-
@raman-m it's here: Ocelot/src/Ocelot/Errors/Middleware/ExceptionHandlerMiddleware.cs Lines 40 to 47 in 6e9a975 I haven't checked it yet, but what could cause this is the default request timeout, 90 seconds...
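For readers following the thread, here is a minimal sketch (not the actual Ocelot source; the class and member names are illustrative) of the kind of logic such an exception-handling middleware applies: a cancelled request is surfaced with the non-standard 499 "Client Closed Request" status instead of a 5xx error.

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Http;

// Illustrative sketch only; not the actual Ocelot middleware.
public class CancellationTo499MiddlewareSketch
{
    private readonly RequestDelegate _next;

    public CancellationTo499MiddlewareSketch(RequestDelegate next) => _next = next;

    public async Task Invoke(HttpContext context)
    {
        try
        {
            await _next(context);
        }
        catch (OperationCanceledException) when (context.RequestAborted.IsCancellationRequested)
        {
            // The client aborted the request (browser refresh, closed tab,
            // webhook re-trigger, ...): surface the non-standard 499 status.
            context.Response.StatusCode = 499;
        }
        catch (Exception)
        {
            // Anything else is a genuine gateway error.
            context.Response.StatusCode = StatusCodes.Status500InternalServerError;
        }
    }
}
```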
-
@ggnaegi Thanks! I'm aware of all the references to the 499 status in our code. Indeed, timeout events cancel requests and can cause some "spikes." However, in this instance, I'm puzzled by the issue reporting. @kishan-vachhani, could you please take a screenshot of the entire page showing the spike and share it with us? What type of spikes are you experiencing? Additionally, please provide more details from your logs or the graphs from your monitoring tool.
-
@raman-m I'm currently conducting further investigation to determine the source of the request cancellation. I understand that if the upstream request is cancelled, the downstream request should also be cancelled. This could potentially result in a spike in log entries. What concerns me is that I've noticed a consistent spike in log entries following the deployment of the Ocelot version upgrade (on 05/08/2024). Please refer to the image below.
I agree with you that if the downstream request is cancelled, its response shouldn't be relayed in the upstream response. What I was trying to convey is that requests shouldn't be cancelled unless it's done manually or due to a timeout.
Yes, refreshing the browser or closing the tab while a request is executing will cancel it. However, it's concerning that we're observing cancellations in production for routes (not only webhook ones) where such actions, like refreshing the browser or closing the tab, are very unlikely. This behavior seems unexpected and, in my opinion, requires further investigation.
It appears that after upgrading to the latest version of Ocelot, we've observed a significant increase in occurrences of the 499 response code, as shown in the first attached image. This notable change prompted me to dig deeper into the root cause behind the surge, especially since I was running a lower version of Ocelot previously.
Certainly, I understand your perspective. To help, I've attached a screenshot containing all the logs pertaining to a single request that resulted in a 499 response. I'm seeking collaborative efforts to identify and rectify this issue (if it really is one). In the meantime, could you please provide guidance on potential methods to pinpoint the source of the cancellation? One notable change I've observed is the shift from using the HttpClient's Timeout property to employing the TimeoutDelegatingHandler in combination with the CancellationToken.
Yeah, true 😄
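For reference, here is a rough sketch of the change mentioned above, assuming the handler follows the common linked-token pattern; the class name is invented and this is not Ocelot's exact code. With HttpClient.Timeout the client itself enforces the limit, whereas a timeout delegating handler links its own timeout to the caller's CancellationToken and then has to decide whether a cancellation came from the client or from the timeout.

```csharp
using System;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

// Illustrative sketch of a timeout delegating handler (not Ocelot's source).
public class TimeoutHandlerSketch : DelegatingHandler
{
    private readonly TimeSpan _timeout;

    public TimeoutHandlerSketch(TimeSpan timeout) => _timeout = timeout;

    protected override async Task<HttpResponseMessage> SendAsync(
        HttpRequestMessage request, CancellationToken cancellationToken)
    {
        // One CancellationTokenSource per request, linked to the caller's token.
        using var cts = CancellationTokenSource.CreateLinkedTokenSource(cancellationToken);
        cts.CancelAfter(_timeout);
        try
        {
            return await base.SendAsync(request, cts.Token);
        }
        catch (OperationCanceledException) when (!cancellationToken.IsCancellationRequested)
        {
            // Only the linked timeout fired, not the caller: report it as a
            // timeout (-> 503) rather than a client cancellation (-> 499).
            throw new TimeoutException();
        }
    }
}
```

The exception filter is the subtle part: if the caller's token is already cancelled, the OperationCanceledException propagates unchanged (and ends up reported as 499), while a pure timeout is converted into a TimeoutException (and ends up as 503).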
-
@kishan-vachhani @raman-m OK, I will compare the Timeout in HttpClient with our custom Timeout Delegating Handler. What would be great is to identify a scenario that we could reproduce.
-
@kishan-vachhani Do you use the QoS feature for the routes?
I'm confused by this graph. What does the Y-axis represent? Is it the number of 499 status codes, or is it the count of log entries? Could you attach (copy-paste) the full content of your configuration?
-
I can't see major differences between the timeout logic in HttpClient and the delegating handler we have implemented.
-
Gui, is this the logic you're referring to? 👉 Ocelot/src/Ocelot/Requester/MessageInvokerPool.cs Lines 59 to 66 in 6e9a975 🆗... Here's my understanding of the reported "spikes" issue:
It may not even need to be treated as a problem: it's not an issue with Ocelot itself, but rather incidents of no response from the downstream system, leading Ocelot to naturally cancel the requests. The absence of spikes before the deployment of v23.0.0 is because Ocelot did not generate the 499 status prior to this version, correct? Since the introduction of v23.0, Ocelot has been producing the 499 status, which the monitoring tool logs, resulting in the observed spikes. Bingo! 💥 @ggnaegi Is this the same conclusion you've reached?
-
@raman-m Yes, from the code, it appears that if no timeout is specified, the gateway uses the default timeout of 90 seconds; if the downstream application does not respond within this timeframe, it should throw an exception. Since I haven't configured any Quality of Service (QoS) settings or specified a timeout, it defaults to 90 seconds. Moreover, the downstream application is operational: as evident from the screenshot of the single request trace provided earlier, the gateway responded with a 499 status code within 148.4331 milliseconds, well within the default timeout period of 90 seconds (the same is the case for all of them).

Furthermore, with the introduction of new logic in the 23.0.0 release, a timeout error is returned with status code 503. It's worth noting that Ocelot generated the 499 status prior to this version as well. Ocelot/src/Ocelot/Requester/HttpExceptionToErrorMapper.cs Lines 34 to 38 in 6e9a975

Also, according to the code snippet below, if the downstream application takes too long to respond or is unavailable, cancellationToken.IsCancellationRequested should be false. That condition triggers a TimeoutException, resulting in a response status code of 503. Ocelot/src/Ocelot/Requester/TimeoutDelegatingHandler.cs Lines 16 to 30 in 6e9a975

IMO, something is triggering the cancellation token prematurely. 🤔
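To make that 499-versus-503 decision path concrete, here is a simplified, hypothetical mapper in the spirit of the referenced code; the type and method names are made up, and the real HttpExceptionToErrorMapper works through Ocelot's error classes rather than returning status codes directly.

```csharp
using System;
using System.Net;
using System.Threading;

// Simplified, hypothetical mapper; not Ocelot's HttpExceptionToErrorMapper.
public static class ExceptionToStatusSketch
{
    public static HttpStatusCode Map(Exception exception, CancellationToken requestToken) =>
        exception switch
        {
            // The timeout handler converted an elapsed timeout into TimeoutException.
            TimeoutException => HttpStatusCode.ServiceUnavailable,              // 503

            // The caller's token was cancelled: the client went away.
            OperationCanceledException when requestToken.IsCancellationRequested
                => (HttpStatusCode)499,                                         // non-standard 499

            // Anything else is an unexpected gateway error.
            _ => HttpStatusCode.InternalServerError,                            // 500
        };
}
```

The point of the comment above is that the traces show a 499 after roughly 148 ms, far below the 90-second default, so it is the caller's token rather than the timeout that is being cancelled; the open question is what triggers that cancellation.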
-
@kishan-vachhani Could you give us some metrics about your environment, such as requests per second? From our side, it's very difficult to draw conclusions without more detailed observations. Besides, the changes were tested and rolled out on production environments under very heavy load.
-
@raman-m I'm not utilizing the Quality of Service (QoS) feature for any of my routes. The Y-axis of the graph represents the number of responses with 499 status codes, while the X-axis represents the timeline. Unfortunately, I cannot share my configuration here.
-
@ggnaegi The production environment where I'm encountering the issue handles an average throughput of 2.37k requests per minute (rpm) over the past 24 hours; during peak hours, this figure rises to 8k rpm.
-
@kishan-vachhani OK, the latest version is running on a production environment with the following metrics on average (24h): 650 requests per second, 39k requests per minute. I checked the request: why do you have a 102 status code? It's unknown. Maybe this is the cause of the cancellation? ... wait a minute... Why did we do that, dear @raman-m? Ocelot/src/Ocelot/Requester/Middleware/HttpRequesterMiddleware.cs Lines 40 to 43 in 171e3a7
-
@ggnaegi, do you want to discuss the #1953 changes, or do you want to find the root cause of the reported issue?
-
@raman-m I was looking for the error, and then this 102 status code popped up. It's not the real status, so why would you write a log message with a status code that isn't correct? It's only a symptom. We might indeed have a threading issue somewhere... As a matter of fact, after a short review, the design of the Timeout Handler is, in my opinion, thread safe: ... Further investigations tomorrow...
-
The "rustic" way of managing the timout without QoS is, I think, the source of your problems (we have several open issues on the subject, it should at least be configurable). |
-
@ggnaegi I can confirm what I told you yesterday: no 499 in the last 48 hours (I can't go back any further).
-
@RaynaldM Thanks a lot!
-
@raman-m @RaynaldM Maybe we should move the default timeout to the QoS and provide some global parameters for it.
-
I don't think so; they're two very different systems.
-
@RaynaldM OK, but we could use a default Polly implementation and, as soon as QoS parameters are defined, use the QoS... We wouldn't have the timeout as a delegating handler, and we would avoid discussions with colleagues using the solution.
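As a hedged illustration of this direction (a sketch of the idea, not an agreed design; the helper name and parameter are invented), a default Polly timeout could back every route and be replaced whenever QoS options define their own value:

```csharp
using System;
using System.Net.Http;
using Polly;
using Polly.Timeout;

// Hypothetical helper sketching "default Polly timeout unless QoS is configured".
public static class TimeoutPolicySketch
{
    // Global fallback, mirroring today's hard-coded 90-second behaviour.
    private static readonly TimeSpan DefaultTimeout = TimeSpan.FromSeconds(90);

    public static IAsyncPolicy<HttpResponseMessage> For(TimeSpan? qosTimeout)
    {
        // If the route defines QoS options, honour them; otherwise fall back
        // to the global default. The optimistic strategy cooperates via the
        // CancellationToken passed down to the HTTP call.
        var timeout = qosTimeout ?? DefaultTimeout;
        return Policy.TimeoutAsync<HttpResponseMessage>(timeout, TimeoutStrategy.Optimistic);
    }
}
```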
-
But @kishan-vachhani, I'm quite sure the delegating handler is thread safe though...
-
I also think there could be a threading issue causing request cancellation due to a race condition. I've reviewed my custom delegating handler, but had no luck. However, this pattern of 499 status codes persists for routes without a custom delegating handler as well. @RaynaldM @ggnaegi The issue is not caused by individual requests hitting the 90-second timeout threshold, so setting up QoS may not help. If I am mistaken, please let me know.
-
I've had enough of this debate! Currently, I perceive no problems with Ocelot. @kishan-vachhani, I encourage you to partake in the discussion more light-heartedly. You are obliged to demonstrate that there is indeed a bug in Ocelot❗
-
@kishan-vachhani you're right, I was trying to find the source of a possible race condition, and the delegating handlers are possible candidates. So I reviewed the design, realized that thread safety was preserved, and thought: "Well, we could go a step further and avoid problems and discussions by moving the timeout to Polly and QoS."
-
Proof that I'm old, but I couldn't find any better expression: ROFL! 😸
-
@RaynaldM @ggnaegi Regarding timeouts: surprise! 💥 We have this old property in FileRoute (route JSON configuration), but it is never used. So, this is a good candidate (a starting point) for implementing default timeouts per route or globally. We already have appropriate issues in the backlog.
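Purely as a hypothetical sketch of where that starting point could lead (the property is not wired up today, and everything below except the idea of a route-level timeout is invented), the effective timeout could be resolved from the route value, then a global default, then the current hard-coded 90 seconds:

```csharp
using System;

// Hypothetical wiring sketch: pick the effective timeout for a route from
// (1) the route-level value, (2) a global default, (3) the hard-coded 90 s.
// "routeTimeoutSeconds" stands in for the currently unused FileRoute property.
public static class RouteTimeoutSketch
{
    private static readonly TimeSpan HardCodedDefault = TimeSpan.FromSeconds(90);

    public static TimeSpan Resolve(int? routeTimeoutSeconds, int? globalTimeoutSeconds)
    {
        if (routeTimeoutSeconds is > 0)
            return TimeSpan.FromSeconds(routeTimeoutSeconds.Value);

        if (globalTimeoutSeconds is > 0)
            return TimeSpan.FromSeconds(globalTimeoutSeconds.Value);

        return HardCodedDefault;
    }
}
```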
-
Expected Behavior
The response from the downstream service should be forwarded for the incoming request, and the gateway should not return a 499 status code.
Actual Behavior
Random requests to the downstream service are being canceled, resulting in the gateway returning a 499 status code.
Steps to Reproduce the Problem
I don't have the exact steps to reproduce this issue, but it seems to occur more frequently for routes with high incoming request rates, such as webhooks. These requests are primarily automated, reducing the likelihood of manual cancellation of the CancellationToken.
Upon reviewing the change log for the major release of version 23.0.0, I noticed updates to the downstream implementation for performance enhancement. This includes the introduction of HttpMessageInvoker and the addition of PooledConnectionIdleTimeout. Could these changes be contributing to the issue?
I will continue investigating and update this issue with any additional findings or if I can identify the exact steps to reproduce the problem. Any assistance in identifying the cause would be appreciated.
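For context on the two changes mentioned: HttpMessageInvoker is the lower-level counterpart of HttpClient and has no built-in Timeout property, so any timeout must travel through the CancellationToken, and PooledConnectionIdleTimeout controls how long an idle pooled connection is kept alive. A minimal sketch with illustrative values (not Ocelot's actual defaults) of how these pieces are typically combined:

```csharp
using System;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

// Sketch of an HttpMessageInvoker over SocketsHttpHandler; the values shown
// are illustrative, not what Ocelot configures.
public static class MessageInvokerSketch
{
    public static HttpMessageInvoker Create()
    {
        var handler = new SocketsHttpHandler
        {
            // Idle pooled connections are closed after this interval.
            PooledConnectionIdleTimeout = TimeSpan.FromSeconds(120),
        };

        // Unlike HttpClient, HttpMessageInvoker has no Timeout property:
        // any timeout must be enforced via the CancellationToken passed to SendAsync.
        return new HttpMessageInvoker(handler, disposeHandler: true);
    }

    public static async Task<HttpResponseMessage> SendAsync(
        HttpMessageInvoker invoker, HttpRequestMessage request, CancellationToken token)
    {
        return await invoker.SendAsync(request, token);
    }
}
```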
Specifications
Anyone facing the same issue is welcome to add more details or findings about it.