RedisTimeoutException exception message question #973
Without more context it is hard to know what is happening. On the surface,
the custom pool seems to think it is idle. Whether that is right or not is
complex - I should ensure that is visible in the log. The high "in" is
concerning, though. I hope we haven't found a new way of stealing reader
threads!
On Fri, 12 Oct 2018, 22:53, Chris Dargis wrote:
Hi. Running version 2.0.495. I've seen this
<https://github.com/StackExchange/StackExchange.Redis/blob/master/docs/Timeouts.md>
page and have found some great information. I have a question about some of
the information in the timeout exception message. As an example:
StackExchange.Redis.RedisTimeoutException: Timeout awaiting response
(15281ms elapsed, timeout is 15000ms), inst: 0, qs: 414, in: 65536,
serverEndpoint: ***, mgr: 10 of 10 available, clientName: ***, IOCP:
(Busy=1,Free=999,Min=2,Max=1000), WORKER:
(Busy=68,Free=32699,Min=2,Max=32767)
I understand that Busy > Min = waiting 500ms for a new thread to get added
to the pool. What I am wondering is why mgr: 10 of 10 available -
shouldn't the mgr count read 0 of 10 available since we are in the global
thread pool? In pretty much all of our timeout errors there are at least 9
of 10 available. I've seen this with both worker and IOCP threads being >
than the min.
Thanks for the prompt response. For context: we recently upgraded from 1.2 and have seen 10x more timeouts. We use Redis to cache our NoSQL objects (which we can pull from Dynamo under the same load with no issues). The cached objects are around 200-300k; some can be a little bigger. We now realize the recommendation is to keep objects at 100k, and we'll be making those changes. We use hashed objects for most of the larger objects and shoot for bucket sizes of 1000.

I can say we experience these timeouts when bursts of parallel async work come in (Task.WhenAll()). Previously we did not need to set min thread counts, but now we're setting them as high as 1000 to resolve these timeouts. With the min thread count this high we can get through the timeouts, but everything is much more performant going straight to Dynamo (we don't need to touch the min thread count for Dynamo).

EDIT: It is possible there are other contributing factors to our recent surge in timeouts, such as changes in the shape and size of our data. Is there any more context I can give you? Some follow-up questions:
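The "setting min threads" mitigation described above can also be done in code at application startup rather than in config. A minimal sketch; the values are purely illustrative, not a recommendation:

```csharp
// Sketch: raise the thread-pool minimums at startup so bursts of parallel
// work don't pay the ~500ms-per-thread injection delay once Busy > Min.
using System;
using System.Threading;

class ThreadPoolWarmup
{
    static void Main()
    {
        // Read the current minimums before changing anything.
        ThreadPool.GetMinThreads(out int worker, out int iocp);
        Console.WriteLine($"before: worker={worker}, iocp={iocp}");

        // Raise both minimums; SetMinThreads returns false if the
        // requested values are rejected by the runtime.
        if (!ThreadPool.SetMinThreads(workerThreads: 200, completionPortThreads: 200))
            Console.WriteLine("SetMinThreads rejected the requested values");
    }
}
```

Raising the minimum only avoids the thread-injection delay during spikes; it doesn't fix an underlying saturation problem.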
Another error that may help provide insight here:
This is the first time I've seen 0 of 10 available in our errors, but I think the most interesting thing here is:
930 seconds is 15.5 minutes. No idea why the host would be shutting down the connection.
I'm having similar timeout issues with the 2.0+ libraries compared to 1.2. The issues seem to come from what looks like higher server load/server-side retries (it starts to fall apart for me around 2k connections and 20% server load on an Azure Redis C6 instance). The general errors I see are similar to the example below.

Based on this, the qs numbers seem really high. My understanding from the documentation is that these are outstanding requests that have not been processed. I'm trying to figure out how these numbers get so high, as based on my code, I don't think it allows creating more than 5k requests. Is there a way to clear out outstanding requests, or flush a connection under timeouts? I've looked through the source code, but it's not clear to me how the library clears out requests that have been "abandoned" by the client.

@mgravell: Is there a recommended setting for high loads with the 2.0 libraries? I'm experimenting with some different settings now, although the general consensus I'm coming to is that the retry settings that worked for 1.2 don't seem to behave well for us with the 2.0 libraries.
I'm having the same error in a website, but only during peak access (more requests than my average). I already increased the thread count in machine.config and even implemented a "connection pool" as described in https://stackexchange.github.io/StackExchange.Redis/Timeouts, but no luck with that. Here is a sample of the errors:
I noticed the "in" amount in one of the log messages (in: 65536) is exactly the same as @CDargis's (!?). Does that mean something? Any ideas @mgravell?
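For anyone curious what the "connection pool" workaround mentioned in these comments might look like, here is a hedged sketch: a fixed set of multiplexers handed out round-robin so one saturated socket doesn't back up every request. The `RedisPool` type and its members are hypothetical, not part of StackExchange.Redis:

```csharp
// Sketch of a simple round-robin multiplexer pool (illustrative names).
using System.Threading;
using StackExchange.Redis;

public sealed class RedisPool
{
    private readonly ConnectionMultiplexer[] _connections;
    private int _next = -1;

    public RedisPool(string configuration, int size = 4)
    {
        _connections = new ConnectionMultiplexer[size];
        for (int i = 0; i < size; i++)
            _connections[i] = ConnectionMultiplexer.Connect(configuration);
    }

    public IDatabase Get()
    {
        // Round-robin across the pool; Interlocked keeps the index
        // thread-safe, and the uint cast handles overflow wraparound.
        int index = (int)((uint)Interlocked.Increment(ref _next) % _connections.Length);
        return _connections[index].GetDatabase();
    }
}
```

Note that the library is designed around a single shared multiplexer, so pooling like this is a workaround, not an officially recommended pattern.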
I've also implemented connection pooling, and it doesn't seem to help much. We have a typical cache-aside pattern with Redis/Dynamo. What I noticed is that we experience the timeouts when large amounts of async work come in.
We are experiencing similar issues in v2. Our Redis is a pretty over-provisioned ElastiCache cluster. Workload is mostly cache-aside with small items (only a few KB at most). I've noticed the errors come in three flavors. Here are some representative samples:
We have hundreds of containers in production all talking to the same ElastiCache cluster, but these connection errors only occur on one host at a time. I.e., it's not a general Redis outage; it's something localized to a specific container.
Any update on this? There seems to be a pretty noticeable increase in connection issues reported for v2.x. To expand on my previous comment, the primary issue isn't that we see a handful of connection errors/timeouts here and there. It's that we'll often see a single container suddenly start spewing hundreds of these connection errors. SE.Redis fails to recover, and the only resolution is to kill the container.
We have the same scenario as @hpk. Is there any way to fix this?
Got the same issue here: the cache was working fine with 1.2.6, but is continually falling over with 2+.
Hello, same kind of issue in 2.0, but I've only had this issue since I started using ElastiCache with Redis 4.0.
I tried adding way more vCPUs, but the issue is the same.
Hi, I'm experiencing the same issue in a .NET Core 2.1 project. Any idea on this issue? I'm not seeing a lot; on average around 5 to 10 per day.
We are seeing this on AWS Redis 5.
Hello all, all of a sudden our production server (Azure) is throwing these exception messages:

Exception category 1: "Timeout performing EVAL, inst: 1, mgr: Inactive, err: never, queue: 17, qu: 0, qs: 17, qc: 0, wr: 0, wq: 0, in: 0, ar: 0, clientName......"

Exception category 2: "No connection is available to service this operation: EVAL; A connection attempt failed because the connected party did not properly respond after a period of time..."

Is this something related to thread-pool throttling?
Same problem here: Exception occurred: Timeout performing HGET (5000ms), next: HGET ....com, inst: 3, qu: 0, qs: 2, aw: False, rs: ReadAsync, ws: Idle, in: 0, serverEndpoint: Unspecified/....redis.cache.windows.net:6380, mgr: 10 of 10 available, clientName: RD..., IOCP: (Busy=0,Free=1000,Min=1,Max=1000), WORKER: (Busy=8,Free=8183,Min=1,Max=8191), v: 2.0.601.3402
Getting random errors - |
Hi all; I'm sorry for delays getting back onto here - lots of plates spinning, and I simply haven't been available. I'm ramping up some investigations into timeouts, so: if this is still impacting you, I'm genuinely here again! Some things that I'm keenly interested in for all of these things, because they radically change the moving pieces:
(For example, on the "rs" one: yesterday I was talking to someone who was seeing rs: CompletePendingMessage, which sounds like there may be a TPL issue re thread-stealing; I'm investigating that possibility currently.)
Hi @mgravell, nice to see that you're back.
I'm on macOS, .NET Core 3.1.101.
For me the rs value seems to always be ReadAsync. |
OS: Windows 2012 R2. rs seems to be ReadAsync most of the time. Here are a few of the different logged errors:
@jasoncono the first one looks like a genuine server (or network) stall (nothing inbound or outbound); the other two... could be just about anything, annoyingly; yet more data points, at least... |
To follow up on my earlier comment, below are some more error examples. We're connecting to a Redis cluster that never sees higher than 15% CPU usage. No TLS is involved, and there's nothing in the Redis slowlog to indicate that the server is getting overwhelmed. We have a couple environments, and see these errors in both: OS: Windows Server 2019 in a Docker image on AWS ECS
OS: Windows & Linux, running in Docker images on AWS EKS
The nodes indicated by
Happy to provide any more information you need; just let me know!
Endpoint: AWS ElastiCache Redis
.NET framework 4.7.2
Yes. AWS configuration of Redis says "Encryption in-transit: Yes"
I do not see this in the logs
The metrics on the server seemed normal.
We're exploring a theory on a now-known cause and could use your help. For anyone experiencing issues on a .NET Framework application - are you in either of the following cases?

.NET Framework web projects: in your config, either
<add key="aspnet:UseTaskFriendlySynchronizationContext" value="false" />
or something less than 4.5 in
<httpRuntime targetFramework="4.x" />

.NET Framework non-web projects: in
To clarify: it doesn't matter what your
If this matches anyone here, it'd be hugely helpful to know. @mgravell is working on a workaround for this scenario.
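For reference, a sketch of where the two settings being asked about live in a .NET Framework web project's configuration file. The values shown here are the "safe" ones, i.e. what you would want to confirm you have; the 4.7.2 target framework is purely illustrative:

```xml
<!-- Hedged example: the problem case described above is value="false"
     on the app setting, or a targetFramework below 4.5. -->
<configuration>
  <appSettings>
    <!-- true (or the key being absent entirely) is fine -->
    <add key="aspnet:UseTaskFriendlySynchronizationContext" value="true" />
  </appSettings>
  <system.web>
    <!-- 4.5 or higher avoids the legacy synchronization context -->
    <httpRuntime targetFramework="4.7.2" />
  </system.web>
</configuration>
```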
Hey @NickCraver, @mgravell. We have aspnet:UseTaskFriendlySynchronizationContext set to true. I can't see that we have httpRuntime set.

We were having issues with version 1.2.6 in production. Occasionally the queue on one of our servers would stop processing and the queue number kept incrementing. We put a fudge in, so that if we get x amount of these consistently we recreate the multiplexer. Looking around your project we saw you had rewritten this, so we upgraded to the latest version. That upgrade is currently failing in QA, as we get similar issues under load.

"StackExchange.Redis.RedisTimeoutException: Timeout performing GET (5000ms), inst: 0, qu: 1187, qs: 0, aw: False, bw: Inactive, rs: ReadAsync, ws: Idle, in: 0, in-pipe: 0, out-pipe: 0, serverEndpoint: Unspecified/removed, mgr: 10 of 10 available, clientName: Removed IOCP: (Busy=0,Free=1000,Min=100,Max=1000), WORKER: (Busy=5,Free=32762,Min=600,Max=32767), v: 2.0.601.3402"

We too are using AWS ElastiCache. I am not sure how much time you have dedicated to this project, but I could probably recreate this with you in our staging environment, if that would help.
I have a similar issue. It worked fine for 4 months, and suddenly I get this error:
@hshazli your workers just hit 210 and your min was 200. Do you think increasing your min workers would help? I'm not an expert in this, but you could try it.
@afinzel but that's only one error message; in other error messages the workers hit 300 or above. What min value should I use, and how do I select the proper one?
@hshazli Yep, that's normal. In your case, it seems to be a very overloaded thread pool that wasn't able to pull bytes off the queue fast enough during a spike. It's possible network blips resulting in spikes of incoming traffic contributed to that.

@afinzel I believe your issue is different (#1120), but there's a fix queued in #1374 for it. Does yours happen after a network blip of some sort (even AWS failing over)? That's the situation there; the connection's hung in a bad state afterwards.
@NickCraver, I didn't think so but I see that changed is merged, so I will get the latest version and retest. Thanks. edit: It looks like it isn't on Nuget yet. |
Hi, I've been using Redis for over a month without problems, but I just got this error yesterday. Logged:
Info:
I hope that's helpful, thanks.
@NickCraver, just wanted to clarify here: is the following fine in my App.config?
According to this Microsoft documentation, any .NET Framework version from 4.0-4.8 should use "v4.0". Your comment had seemed to imply that we could have a value greater than
Hi there, also getting a similar error. This is running on an Azure Function, built with .NET Core 2.2 using version 2.1.39.

System.Exception: Error fetching customer tags for devices 'xxxx'. ---> StackExchange.Redis.RedisTimeoutException: Timeout performing EXISTS (5000ms), next: EXISTS xxxxxxx, inst: 1, qu: 0, qs: 16, aw: False, rs: ReadAsync, ws: Idle, in: 850, serverEndpoint: tttttt:6380, mc: 1/1/0, mgr: 10 of 10 available, clientName: xxxx, IOCP: (Busy=16,Free=984,Min=6,Max=1000), WORKER: (Busy=0,Free=32767,Min=6,Max=32767), v: 2.1.39.39788 (Please take a look at this article for some common client-side issues that can cause timeouts: https://stackexchange.github.io/StackExchange.Redis/Timeouts)

Any help appreciated. Cheers,
I get the errors below when I try to get or set big data (about 70 megabytes). Any idea? Can I increase the Redis timeout? I use Redis on Windows. What framework exactly? .NET Framework 4.6.1. Is there TLS involved? No.
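On the "can I increase the Redis timeout?" question above: the SyncTimeout and AsyncTimeout properties on ConfigurationOptions control how long the client waits for an operation. A minimal sketch; the endpoint and millisecond values are placeholders:

```csharp
// Sketch: raise the client-side timeouts for very large payloads.
using StackExchange.Redis;

class TimeoutConfig
{
    static ConnectionMultiplexer Connect()
    {
        var options = ConfigurationOptions.Parse("localhost:6379");
        options.SyncTimeout = 30000;   // ms allowed for synchronous operations
        options.AsyncTimeout = 30000;  // ms allowed for async operations (v2.x)
        return ConnectionMultiplexer.Connect(options);
    }
}
```

Raising timeouts only masks the underlying cost of moving 70 MB values; splitting the payload (per the 100k recommendation earlier in this thread) is the better fix.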
This issue lives on - 2 years later. @CDargis was definitely onto something with: "We only experience this when running many task / with (Task.WhenAll())" . In my case we see this when doing 1,000s of GETs while using Parallel.ForEach. Setting MaxDegreeOfParallelism = Environment.ProcessorCount was a big help - but does not fix everything. Has anyone else found a better solution? Active issue for 2 years now. |
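As a sketch of the throttling mitigation described above: bounding the concurrency of a burst of GETs instead of firing thousands at once. A SemaphoreSlim gate is a common way to do this for async work (as an alternative to Parallel.ForEach, which is aimed at CPU-bound loops); the helper name and parameters here are illustrative, not part of the library:

```csharp
// Sketch: issue many StringGetAsync calls, but cap how many are in
// flight at once, so a burst can't flood the connection's queue.
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;
using StackExchange.Redis;

static class ThrottledGets
{
    public static async Task<RedisValue[]> GetManyAsync(
        IDatabaseAsync db, IEnumerable<RedisKey> keys, int maxConcurrency)
    {
        using var gate = new SemaphoreSlim(maxConcurrency);
        var tasks = keys.Select(async key =>
        {
            await gate.WaitAsync();        // wait for a free slot
            try { return await db.StringGetAsync(key); }
            finally { gate.Release(); }    // free the slot for the next key
        });
        return await Task.WhenAll(tasks);
    }
}
```

A maxConcurrency in the low tens is a reasonable starting point to experiment with; tune against your own qs/in numbers.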
Found a fix here: change GetDatabase to return an IDatabaseAsync instead of an IDatabase, and the problem disappears.
@mgot90 |
Having the same exception without any obvious reason.
I set both |
@jeffras I'm just another dev trying to fix the same problem. Here's what we did:
In the end, we still have this happening every now and then. We're using Azure as our Redis provider. I haven't ruled out that our pricing tier is also related: https://azure.microsoft.com/en-us/pricing/details/cache/ Not sure if it's useful to you, but here are the Azure best practices as well: https://docs.microsoft.com/en-us/azure/azure-cache-for-redis/cache-best-practices Did you guys have luck or try anything different?
@mgot90 Thanks for the info. Did you modify the StackExchange code for IDatabaseAsync? I did see it in the source but didn't want to maintain a branch to accomplish this. We still have the problem. We ended up having to wrap our code in a lock and it went away, but our throughput clearly suffers.
We've been dealing with these kinds of errors for months (if not years). Three months ago we decided to reimplement our caching layer with the following design assumptions:
I'm aware that this approach might not be ideal for everyone and has some performance overhead (as you need to check the size), but at least it may shed some more light. Our setup:
Going to close this out to clean up - we had lots of various cases leading to timeouts in a myriad of environments and combinations. Many of these reports led to tweaks, some improved handling, and much better error messages giving us more information. I want to close this old issue out so that anyone hitting something today, on a much newer version with a much newer error message, can more easily get help.
Hi. Using version 2.0.495. I've seen this page and have found some great information. I have a question about some of the information in the timeout exception message. As an example:
I understand that Busy > Min = waiting 500ms for a new thread to get added to the pool. What I am wondering is why mgr: 10 of 10 available - shouldn't the mgr count read 0 of 10 available, since we are in the global thread pool? In pretty much all of our timeout errors there are at least 9 of 10 available (and we have A LOT of timeouts). I've seen this with both worker and IOCP threads being > the min.