
Timeout issues if running in docker container #2253

Closed
isenmann opened this issue Sep 15, 2022 · 7 comments

Comments

@isenmann

Hi,
we discovered a strange behaviour when we run our code with the StackExchange.Redis client in a Linux Docker container. It seems that inside Docker the StackExchange.Redis client handles connections differently than on my own machine (Windows) locally.

The Redis server is in both cases located locally on my machine.

Let me try to explain the scenario:

  • 100,000 key/value pairs in Redis
  • the keys of those pairs are stored in a set so they can be accessed quickly
  • receiving all members of the set (which are the keys of the key/value pairs)
  • receiving all (100,000) key/value pairs with one call
  • writing all (100,000) key/value pairs with one call

Those calls are done in 3 threads in parallel.
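For reference, the access pattern above looks roughly like this in StackExchange.Redis calls (a minimal sketch with illustrative key names and connection string, not the actual demo code):

```csharp
using System;
using System.Collections.Generic;
using StackExchange.Redis;

var muxer = await ConnectionMultiplexer.ConnectAsync("localhost:6379");
var db = muxer.GetDatabase();

// the keys of all 100,000 key/value pairs are tracked in one set
RedisValue[] members = await db.SetMembersAsync("all-keys");
RedisKey[] keys = Array.ConvertAll(members, m => (RedisKey)m.ToString());

// read all 100,000 values with a single call
RedisValue[] values = await db.StringGetAsync(keys);

// write all 100,000 pairs back with a single call
var pairs = new KeyValuePair<RedisKey, RedisValue>[keys.Length];
for (int i = 0; i < keys.Length; i++)
    pairs[i] = new KeyValuePair<RedisKey, RedisValue>(keys[i], values[i]);
await db.StringSetAsync(pairs);
```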

Locally on Windows (not in Docker), these calls run in a while loop and take about 300 ms each; that's a little slow, but it works.

BUT running the code in a Linux Docker container, after two to three loops there are lots of timeouts in nearly every call to Redis. The timeout is set to the default value (5 s).

Because of that issue we improved the requests and split the set into buckets of 1,000 elements per set, which improved the access time dramatically on the local machine. But even with this approach Docker is much slower than my local machine under Windows, and after running for several minutes the requests get slower and slower.
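The bucket idea is essentially chunking that one huge read (a sketch, assuming the same db and keys as in the sketch above; Enumerable.Chunk requires .NET 6+):

```csharp
// fetch 1,000 keys per call instead of one 100,000-key payload
var results = new List<RedisValue>(keys.Length);
foreach (RedisKey[] bucket in keys.Chunk(1000))
{
    results.AddRange(await db.StringGetAsync(bucket));
}
```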

Below is a link to a repository with a small program that demonstrates this issue. It can be run with or without the bucket optimization; by default it does not use the improved bucket version. You can enable it by setting useBuckets to true in Program.cs:line 3.
The Redis connection string must be set in the appsettings.json configuration file.

Our main question now is: why do the timeouts appear in the non-optimized version only when it runs in Docker and not locally? And why do they happen even in the optimized version after several minutes? We even tried this in AWS with Docker, and the timeouts appear there as well.

Demo project/repository to show the issue

Any help/hint is appreciated.

@NickCraver
Collaborator

NickCraver commented Sep 15, 2022

It's hard to tell from the demo because I see a few major issues in the demo itself:

  1. Console logging is effectively serialized behind a lock, so all that parallelism is going to be heavily blocked on console writing, dramatically slowing your throughput.
  2. There are a lot of allocations happening with key conversions and lists so that's going to net a lot of GC time. This includes List resizes as well. Also of note are array allocations between most of the RedisClient commands allocating a collection - and at your sizes, those arrays are on the large object heap.
  3. No need to .GetDatabase() on each call, you can keep a ref to this :)
  4. There's a lot of Parallel.ForEachAsync but it won't actually be parallel. Especially in, say, your Docker container, only so many threads are going to be allocated by the thread pool based on the hardware perceived by the .NET runtime (same with GC heaps), so what the code is attempting to do and how much is actually parallel is going to vary drastically between these kinds of environments. (A sketch after this list illustrates points 3 and 4.)
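A minimal sketch of what points 3 and 4 mean in practice (names here are illustrative, not prescriptive):

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using StackExchange.Redis;

// Point 4: inside a container, effective parallelism follows what the runtime
// perceives, not what the host machine has.
Console.WriteLine($"ProcessorCount: {Environment.ProcessorCount}");
ThreadPool.GetMinThreads(out int minWorker, out int minIo);
Console.WriteLine($"Min thread-pool threads: worker={minWorker}, IO={minIo}");

// Point 3: resolve IDatabase once and keep the reference.
public sealed class RedisService
{
    private readonly IDatabase _db;
    public RedisService(IConnectionMultiplexer muxer) => _db = muxer.GetDatabase();
    public Task<RedisValue> GetAsync(RedisKey key) => _db.StringGetAsync(key);
}
```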

That's the code issues, which I want to separate from the Redis usage - suggestions:

  1. Doing 100,000 keys at once across the wire isn't awesome - that's a huge payload and bad for multiplexing a connection. We'd recommend much smaller batches.
  2. Perhaps in line with number 1, I see you're getting the key names from a set and then getting the keys - you don't need 2 roundtrips to do this, that's better as a Lua operation server-side (but again, not with a huge payload per call).
  3. If you don't need the result of a command, you need not wait on it - see CommandFlags.FireAndForget for args on a command to put it on the buffer and release your CPU quicker to do other work (see the sketch after this list).
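A hedged sketch of suggestions 1 and 3 combined - modest batches, and FireAndForget where the result is never read (the helper name and batch size are illustrative):

```csharp
using System.Collections.Generic;
using System.Linq;
using StackExchange.Redis;

static class RedisWriteHelpers
{
    public static void WriteInBatches(IDatabase db,
        IReadOnlyList<KeyValuePair<RedisKey, RedisValue>> pairs, int batchSize = 1000)
    {
        for (int i = 0; i < pairs.Count; i += batchSize)
        {
            // keep each payload small instead of 100,000 pairs at once...
            var batch = pairs.Skip(i).Take(batchSize).ToArray();
            // ...and don't wait on results we never read
            db.StringSet(batch, flags: CommandFlags.FireAndForget);
        }
    }
}
```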

I know the above is a bit vague, but just from looking at it the demo has enough internal issues with LOH arrays that it's hard to suggest targeted fixes there. But hopefully the above advice helps on direction.

@isenmann
Author

Thanks for the feedback. I will try to add some more information about the demo project to clarify the mentioned code issues.

  1. Console logging is effectively serialized behind a lock, so all that parallelism is going to be heavily blocked on console writing, dramatically slowing your throughput.

Sure, but those are only there to print out the measurements. The timeouts also occur if all console logging is removed, so it makes no difference here, and having some output while running it is helpful in the end. There is no console logging in production code ;)

  2. There are a lot of allocations happening with key conversions and lists so that's going to net a lot of GC time. This includes List resizes as well. Also of note are array allocations between most of the RedisClient commands allocating a collection - and at your sizes, those arrays are on the large object heap.

I heard you and I have changed it in my last commit. Every array allocation is now done only once at the beginning before anything starts.

  3. No need to .GetDatabase() on each call, you can keep a ref to this :)

Also done at the beginning now. :) But I had just followed the official Basic Usage guide, which mentions that this is a cheap pass-thru object and does not need to be stored. ;-)

  4. There's a lot of Parallel.ForEachAsync but it won't actually be parallel. Especially in, say, your Docker container, only so many threads are going to be allocated by the thread pool based on the hardware perceived by the .NET runtime (same with GC heaps), so what the code is attempting to do and how much is actually parallel is going to vary drastically between these kinds of environments.

Even if it is not really parallel, those blocks should demonstrate that such requests can come from different threads. In the production code we have a singleton Redis cache service which receives requests from different threads, so this parallel foreach is just there for demo purposes and should somehow "simulate" our scenario.

That's the code issues, which I want to separate from the Redis usage - suggestions:

  1. Doing 100,000 keys at once across the wire isn't awesome - that's a huge payload and bad for multiplexing a connection. We'd recommend much smaller batches.

That's right, and that's why we already use buckets of 1,000 keys in the production code; this can also be enabled in the demo project by setting the useBuckets variable to true. Using those buckets improves performance dramatically. But as mentioned, even with buckets it's much slower in Docker than on Windows.

  2. Perhaps in line with number 1, I see you're getting the key names from a set and then getting the keys - you don't need 2 roundtrips to do this, that's better as a Lua operation server-side (but again, not with a huge payload per call).

Good point, didn't know that. This could be a future improvement. Thanks for this hint.

  3. If you don't need the result of a command, you need not wait on it - see CommandFlags.FireAndForget for args on a command to put it on the buffer and release your CPU quicker to do other work.

Thanks for this one, too. Also an improvement for the future :) I did not add it to the demo project in my last commit.

I know the above is a bit vague, but just from looking at it the demo has enough internal issues with LOH arrays that it's hard to suggest targeted fixes there. But hopefully the above advice helps on direction.

Thanks for all the information, even if it is a bit vague. :)
Nevertheless we wonder why it runs into timeouts under Docker, even with the array allocation fix from above and so on.

Meanwhile I have tested it natively under Linux and it runs without any issues there, too. Just to be sure that it's not an issue with the runtime under Linux.

@NickCraver
Collaborator

Wanted to follow up here - were you able to determine any more? The only advice I'd have beyond the above is "get a profile in the Docker container", which... I'm not an expert on. But we'd really need to observe what's happening. Is it GC contention? Core/heap issues? Bandwidth restrictions? Some system call overhead? It's such a different environment that it's really hard to say anything other than: profile and observe what the bottleneck actually is. Given the LOH bits we saw, I think you may be into GC-style issues more so with Docker than on something exposing more cores, for example. Tools like dotnet-monitor and friends (side menu here) can help - you can also observe with realmon.
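Not part of the original suggestion, but before reaching for a full profiler a few in-process counters can already hint at thread-pool starvation or GC pressure (a sketch; the ThreadPool counters need .NET Core 3.0+ and PauseTimePercentage needs .NET 5+):

```csharp
using System;
using System.Threading;

var gcInfo = GC.GetGCMemoryInfo();
Console.WriteLine($"Thread pool threads: {ThreadPool.ThreadCount}, " +
                  $"queued work items: {ThreadPool.PendingWorkItemCount}");
Console.WriteLine($"Heap size: {gcInfo.HeapSizeBytes / (1024 * 1024)} MB, " +
                  $"GC pause: {gcInfo.PauseTimePercentage:F1}%");
```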

I wish I had more specific advice here but that's the best I can think of as next steps if still having issues here.

@isenmann
Author

Sadly I haven't had much time to dive deeper into that issue again. As I wrote, it was working acceptably with the bucket approach, so it wasn't that urgent anymore. As soon as there is enough time again, I will try to analyze it as you suggested.

@anthony-keller

anthony-keller commented Nov 13, 2022

I've just migrated a web app into a Linux docker container and started seeing this error:

StackExchange.Redis.RedisTimeoutException: Timeout performing HMGET (5000ms), next: HMGET ec6ea9e7-8694-897d-969f-2f5f3b34a4fa, inst: 0, qu: 0, qs: 9, aw: False, rs: ReadAsync, ws: Idle, in: 2312, in-pipe: 0, out-pipe: 0, serverEndpoint: de-prod-cache-001.qvhagr.0001.apse2.cache.amazonaws.com:6379, mc: 1/1/0, mgr: 10 of 10 available, clientName: FPCS.Web, IOCP: (Busy=0,Free=1000,Min=2,Max=1000), WORKER: (Busy=14,Free=32753,Min=2,Max=32767), v: 2.2.4.27433 (Please take a look at this article for some common client-side issues that can cause timeouts: https://stackexchange.github.io/StackExchange.Redis/Timeouts)
   at StackExchange.Redis.ConnectionMultiplexer.ExecuteSyncImpl[T](Message message, ResultProcessor`1 processor, ServerEndPoint server) in /_/src/StackExchange.Redis/ConnectionMultiplexer.cs:line 2817
   at StackExchange.Redis.RedisBase.ExecuteSync[T](Message message, ResultProcessor`1 processor, ServerEndPoint server) in /_/src/StackExchange.Redis/RedisBase.cs:line 54
   at StackExchange.Redis.RedisDatabase.HashGet(RedisKey key, RedisValue[] hashFields, CommandFlags flags) in /_/src/StackExchange.Redis/RedisDatabase.cs:line 324
   at Microsoft.Extensions.Caching.StackExchangeRedis.RedisExtensions.HashMemberGet(IDatabase cache, String key, String[] members)
   at Microsoft.Extensions.Caching.StackExchangeRedis.RedisCache.GetAndRefresh(String key, Boolean getData)
   at Microsoft.AspNetCore.Session.DistributedSession.Load()

This error does not occur when the same web app is deployed to a Windows EC2 server.

I'm going to test using the latest version tomorrow.

We're seeing this error when using Redis as the Session provider.

Environment
mcr.microsoft.com/dotnet/aspnet:6.0-focal
Container running in AWS AppMesh, ECS Fargate
AWS Elasticache 5.0.5

I'd be happy to provide more info if needed.
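For context, a minimal sketch of the kind of setup the stack trace implies - a Redis-backed IDistributedCache used as the ASP.NET Core session store; the wiring and connection string here are illustrative, not the actual application code:

```csharp
using Microsoft.Extensions.DependencyInjection;

var builder = WebApplication.CreateBuilder(args);

// Microsoft.Extensions.Caching.StackExchangeRedis provides the IDistributedCache
// that DistributedSession.Load() ends up calling HMGET through.
builder.Services.AddStackExchangeRedisCache(options =>
{
    options.Configuration = "de-prod-cache-001.qvhagr.0001.apse2.cache.amazonaws.com:6379";
});
builder.Services.AddSession();

var app = builder.Build();
app.UseSession();
app.Run();
```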

@isenmann
Author

Finally I have some time to give some feedback on this issue again. It seems that we finally solved our problem. The final solution was to increase the minimum threads in the thread pool and to use the ThreadPool-based SocketManager instead of the shared one. With this change the timeouts are gone and it's working like a charm right now. I have read through some documentation and some blog posts to understand the exception completely, and it really seems that the number of busy worker threads exceeded the default minimum thread count, and spinning up new threads was too slow. At least that's how I understand the issue. Running on .NET 6, I think using the ThreadPool is not an issue.

So, just for reference, here is the commit which solved our issue: isenmann/Redis.Issue.Testsuite@44f468c
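For anyone landing here later, the change described above boils down to something like this (a sketch; the thread counts and connection string are illustrative, not necessarily what the linked commit uses):

```csharp
using System.Threading;
using StackExchange.Redis;

// raise the floor so the pool doesn't have to ramp up thread-by-thread
ThreadPool.SetMinThreads(workerThreads: 200, completionPortThreads: 200);

// let the multiplexer do its socket work on the regular .NET ThreadPool
// instead of the shared dedicated SocketManager
var options = ConfigurationOptions.Parse("localhost:6379");
options.SocketManager = SocketManager.ThreadPool;

var muxer = await ConnectionMultiplexer.ConnectAsync(options);
```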

But from what I have seen, timeouts can have many causes, and increasing the threads does not always solve them.

@NickCraver
Collaborator

Glad you got a resolution here - thread pool is a constant source of confusion and I'm thinking of adding a bit to our logging there to help users debug even more. Closing this one out to tidy up!
