
Thread pool starvation caused by sync waiting inside PhysicalBridge #2664

Closed · krasin-ga opened this issue Mar 7, 2024 · 12 comments · Fixed by #2667

@krasin-ga

Here is the code that starts a dedicated backlog processing thread; the comments provide some motivation for the decision not to use the thread pool:

// Start the backlog processor; this is a bit unorthodox, as you would *expect* this to just
// be Task.Run; that would work fine when healthy, but when we're falling on our face, it is
// easy to get into a thread-pool-starvation "spiral of death" if we rely on the thread-pool
// to unblock the thread-pool when there could be sync-over-async callers. Note that in reality,
// the initial "enough" of the back-log processor is typically sync, which means that the thread
// we start is actually useful, despite thinking "but that will just go async and back to the pool"
var thread = new Thread(s => ((PhysicalBridge)s!).ProcessBacklogAsync().RedisFireAndForget())
{
    IsBackground = true, // don't keep process alive (also: act like the thread-pool used to)
    Name = "StackExchange.Redis Backlog", // help anyone looking at thread-dumps
};
thread.Start(this);

But after asynchronously awaiting ProcessBridgeBacklogAsync, the rest of the code (including subsequent loop iterations) not only continues on a thread pool thread, it also blocks that thread with a synchronous wait:

private async Task ProcessBacklogAsync()
{
    _backlogStatus = BacklogStatus.Starting;
    try
    {
        while (!isDisposed)
        {
            if (!_backlog.IsEmpty)
            {
                // TODO: vNext handoff this backlog to another primary ("can handle everything") connection
                // and remove any per-server commands. This means we need to track a bit of whether something
                // was server-endpoint-specific in PrepareToPushMessageToBridge (was the server ref null or not)
                await ProcessBridgeBacklogAsync().ForAwait();
            }
            // The cost of starting a new thread is high, and we can bounce in and out of the backlog a lot.
            // So instead of just exiting, keep this thread waiting for 5 seconds to see if we got another backlog item.
            _backlogStatus = BacklogStatus.SpinningDown;
            // Note this is happening *outside* the lock
            var gotMore = _backlogAutoReset.WaitOne(5000);
            if (!gotMore)
            {
                break;
            }
        }

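To make the hop concrete, here is a minimal standalone sketch (my own illustration, not library code; Task.Delay stands in for ProcessBridgeBacklogAsync) showing that after the first await the continuation runs on a thread pool thread, so the subsequent blocking wait ties up the pool instead of the dedicated thread:

using System;
using System.Threading;
using System.Threading.Tasks;

class AwaitHopDemo
{
    private static readonly AutoResetEvent Signal = new AutoResetEvent(false);

    private static async Task LoopAsync()
    {
        Console.WriteLine($"before await: pool thread? {Thread.CurrentThread.IsThreadPoolThread}"); // False: dedicated thread
        await Task.Delay(10); // stands in for ProcessBridgeBacklogAsync
        Console.WriteLine($"after await:  pool thread? {Thread.CurrentThread.IsThreadPoolThread}"); // True: we hopped to the pool
        Signal.WaitOne(5000); // this now blocks a thread pool thread for up to 5 seconds
    }

    private static void Main()
    {
        var thread = new Thread(() => { _ = LoopAsync(); }) { IsBackground = true, Name = "dedicated" };
        thread.Start();
        Thread.Sleep(6000); // keep the process alive long enough to observe the output
    }
}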
In our case, we have several relatively large Redis clusters with tens to hundreds of nodes, and sometimes we experience situations where the majority of thread pool threads are being blocked inside instances of PhysicalBridge:
[stack trace screenshot]

So we get the best of both worlds: hundreds of expensive worker-thread creations and thread pool starvation 🙂

@mgravell
Collaborator

mgravell commented Mar 7, 2024

Now this is very interesting (pinging @NickCraver @philon-msft ) - it looks like the intent is for ProcessBacklogAsync to execute on a dedicated thread, but the fact that we have an await in there means that as soon as there is anything in the backlog, we immediately jump to the thread pool - and we also have a big fat WaitOne(5000). This looks... unintentional. And bad, especially (as you say) for large clusters.

Great report @krasin-ga ! Will investigate further.

@NickCraver
Collaborator

Yep, fully agree. We changed how this works several times back at the beginning, but the thread pool hopping can now happen at several points. This needs love, and potentially some cap on thread creation, lest we create hundreds of threads for hundreds of endpoints during a storm.

@krasin-ga
Author

Thanks for the quick reply!

Please allow me to add my two cents.

I understand that this logic was added to circumvent issues related to using synchronous operations and thread pool starvation in general.

However, it seems that this is not a problem the Redis client library should solve. It would be more rational to focus on minimizing overhead within the library itself. At first glance, it appears that abandoning synchronous locks in favor of wait-free alternatives and using System.Threading.Channels for the backlog could simplify the code and improve performance in high-load scenarios. Even better would be to extract the backlog and its processing logic out of PhysicalBridge into its own interface, making it configurable for different scenarios.
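To illustrate the idea, a rough sketch of a channel-based backlog could look like this (the names are mine and this is only the shape of the suggestion, not a drop-in replacement):

using System;
using System.Threading.Channels;
using System.Threading.Tasks;

class ChannelBacklog<T>
{
    private readonly Channel<T> _channel = Channel.CreateUnbounded<T>(
        new UnboundedChannelOptions { SingleReader = true, SingleWriter = false });

    // Producers never block; TryWrite always succeeds on an unbounded channel.
    public bool TryEnqueue(T item) => _channel.Writer.TryWrite(item);

    // One long-running consumer per bridge; it parks asynchronously while the
    // backlog is empty instead of blocking a thread on a WaitOne.
    public async Task ConsumeAsync(Func<T, Task> process)
    {
        await foreach (var item in _channel.Reader.ReadAllAsync())
        {
            await process(item).ConfigureAwait(false);
        }
    }
}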

@NickCraver
Collaborator

We're tweaking how this works in #2667 to reflect the current dedicated thread behavior. With respect to the follow-up comments: I'm just going to assume you've only taken a glance because the rest of it's pretty uninformed with respect to how things in the library behave.

I'm going to write off the other comments as not having taken any deep view of things in the library or dealt with thread pool starvation in hundreds of apps as we have and continue to help others with and leave it at that.

@krasin-ga
Author

As I understand the current architecture:

At a high level, PhysicalBridge is responsible for pushing messages to the backlog, from which they are read by a dedicated thread and serialized to a pipe (from which the serialized messages are pushed to the socket by a PipeScheduler) or completed by timeout. Dedicated threads are only used to deal with situations when the runtime is experiencing thread pool starvation. There is one PhysicalBridge instance per connection. As of now there is a way to configure the PipeScheduler, but there is no way to configure the backlog processing behavior.

My points:

  1. Having a thread per backlog/connection is far from optimal, especially when the Redis cluster is quite large. More threads mean more overhead on context switches, scheduling, etc.
  2. If my application is experiencing thread pool starvation due to synchronous I/O or synchronous locking, this workaround will only hide the real source of the issue and not help much (especially if I use SocketManager.ThreadPool), and my application will still perform very poorly.
  3. The .NET thread pool has gotten significantly better at detecting starvation and will quickly create threads by itself.
  4. It's definitely a good idea to help people diagnose thread pool starvation, as the RedisTimeoutException message already does, or to prevent such issues with a Roslyn analyzer, or to find likely sources of starvation with Ben.BlockingDetector, etc.
  5. It would not make things worse if the backlog behavior could be made configurable (a rough sketch of what I mean follows this list).
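Purely as an illustration of point 5, something along these lines (every name here is hypothetical; nothing like this exists in the library today):

// Hypothetical extension point; both the interface and its members are invented for illustration.
public interface IBacklogProcessor<TMessage>
{
    // Called by the bridge when a message cannot be written directly.
    bool TryEnqueue(TMessage message);

    // Starts draining the backlog using whatever strategy the implementation chooses
    // (dedicated thread, channel consumer, thread pool work item, ...).
    void Start();

    void Stop();
}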

I wrote a quick benchmark that emulates backlog behavior and processes 10,000,000 items:

BenchmarkDotNet v0.13.12, Windows 11
AMD Ryzen 9 3900X, 1 CPU, 24 logical and 12 physical cores
.NET SDK 8.0.200
  [Host]     : .NET 8.0.2 (8.0.224.6711), X64 RyuJIT AVX2
  DefaultJob : .NET 8.0.2 (8.0.224.6711), X64 RyuJIT AVX2

| Method         | NumberOfConsumers | NumberOfProducers | Mean        | Error     | StdDev    | Median      |
|----------------|------------------ |------------------ |------------:|----------:|----------:|------------:|
| Tasks+Channels | 1                 | 1                 |  2,515.8 ms |  48.65 ms |  47.78 ms |  2,500.9 ms |
| Threads+CQ     | 1                 | 1                 |  2,518.7 ms |  44.62 ms |  41.74 ms |  2,517.4 ms |
| Tasks+Channels | 1                 | 8                 |  2,774.9 ms |  22.63 ms |  21.17 ms |  2,770.6 ms |
| Threads+CQ     | 1                 | 8                 |  3,265.1 ms |  88.79 ms | 261.79 ms |  3,432.6 ms |
| Tasks+Channels | 1                 | 32                |  2,831.8 ms |  55.89 ms |  87.02 ms |  2,825.2 ms |
| Threads+CQ     | 1                 | 32                |  3,791.6 ms |  72.86 ms |  77.96 ms |  3,770.4 ms |
| Tasks+Channels | 1                 | 128               |  3,117.2 ms |  59.07 ms |  68.03 ms |  3,127.5 ms |
| Threads+CQ     | 1                 | 128               |  3,986.0 ms |  77.86 ms |  98.47 ms |  3,987.2 ms |
| Tasks+Channels | 8                 | 1                 |  5,156.9 ms |  41.39 ms |  38.72 ms |  5,152.8 ms |
| Threads+CQ     | 8                 | 1                 | 19,966.5 ms |  86.58 ms |  80.99 ms | 19,966.4 ms |
| Tasks+Channels | 8                 | 8                 |    864.5 ms |  16.60 ms |  18.45 ms |    859.2 ms |
| Threads+CQ     | 8                 | 8                 |  1,586.8 ms |  31.37 ms |  55.76 ms |  1,601.4 ms |
| Tasks+Channels | 8                 | 32                |  1,117.6 ms |  16.38 ms |  15.32 ms |  1,116.2 ms |
| Threads+CQ     | 8                 | 32                |    912.3 ms |  18.10 ms |  47.67 ms |    913.7 ms |
| Tasks+Channels | 8                 | 128               |  1,206.8 ms |  23.42 ms |  30.45 ms |  1,198.5 ms |
| Threads+CQ     | 8                 | 128               |    983.0 ms |  18.51 ms |  19.80 ms |    988.7 ms |
| Tasks+Channels | 32                | 1                 |  5,191.6 ms |  31.90 ms |  28.28 ms |  5,184.7 ms |
| Threads+CQ     | 32                | 1                 | 19,619.6 ms | 162.84 ms | 144.36 ms | 19,644.6 ms |
| Tasks+Channels | 32                | 8                 |    977.1 ms |   8.87 ms |   8.29 ms |    978.7 ms |
| Threads+CQ     | 32                | 8                 |  1,825.8 ms |  35.98 ms |  38.50 ms |  1,838.9 ms |
| Tasks+Channels | 32                | 32                |    652.4 ms |   5.04 ms |   4.47 ms |    654.0 ms |
| Threads+CQ     | 32                | 32                |    835.5 ms |  16.55 ms |  45.58 ms |    839.3 ms |
| Tasks+Channels | 32                | 128               |    661.7 ms |  13.03 ms |  19.10 ms |    659.8 ms |
| Threads+CQ     | 32                | 128               |    839.7 ms |  16.69 ms |  36.29 ms |    844.8 ms |
| Tasks+Channels | 128               | 1                 |  4,756.7 ms |  21.07 ms |  17.59 ms |  4,760.6 ms |
| Threads+CQ     | 128               | 1                 | 19,599.3 ms | 262.33 ms | 245.39 ms | 19,649.7 ms |
| Tasks+Channels | 128               | 8                 |    995.9 ms |   9.45 ms |   8.84 ms |    997.0 ms |
| Threads+CQ     | 128               | 8                 |  2,869.4 ms |  56.35 ms | 100.17 ms |  2,864.1 ms |
| Tasks+Channels | 128               | 32                |    581.5 ms |  11.47 ms |  20.09 ms |    583.2 ms |
| Threads+CQ     | 128               | 32                |    867.5 ms |  17.07 ms |  27.56 ms |    869.9 ms |
| Tasks+Channels | 128               | 128               |    618.2 ms |   6.56 ms |   6.14 ms |    618.5 ms |
| Threads+CQ     | 128               | 128               |    912.5 ms |  19.55 ms |  57.66 ms |    921.6 ms |

Both implementations could be optimized further, but I hope they are still close enough for illustrative purposes.
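Roughly, the two strategies in the table look like this: "Tasks+Channels" uses a System.Threading.Channels consumer similar to the sketch I posted above, while "Threads+CQ" uses dedicated threads draining a ConcurrentQueue and parking on an AutoResetEvent, e.g. (simplified, not the exact benchmark code):

using System;
using System.Collections.Concurrent;
using System.Threading;

class ThreadQueueBacklog<T>
{
    private readonly ConcurrentQueue<T> _queue = new ConcurrentQueue<T>();
    private readonly AutoResetEvent _signal = new AutoResetEvent(false);

    public void Enqueue(T item)
    {
        _queue.Enqueue(item);
        _signal.Set(); // wake a parked consumer
    }

    public void StartConsumer(Action<T> process)
    {
        var thread = new Thread(() =>
        {
            while (true)
            {
                while (_queue.TryDequeue(out var item))
                {
                    process(item);
                }
                _signal.WaitOne(5000); // park instead of exiting, like the backlog loop above
            }
        })
        {
            IsBackground = true,
        };
        thread.Start();
    }
}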

PS: I'm not a native English speaker and don't use English for communication very often, so my selection of words may not be very accurate. Please forgive any unintended offense.

@NickCraver
Collaborator

@krasin-ga I think we have a mismatch in understanding here: that the backlog is part of the normal flow at all. It is not. The backlog is not normally written to, but it must be synchronously drained to help prevent spiral-of-death scenarios (because it's quite likely that writes or response processing cannot complete due to starvation). The normal path for async commands is fully async and uses Pipes.

The backlog is only used as a buffer when we can't write to the connection, e.g. we're disconnected. And as per the docs, you can turn this off if you want to fail immediately, indicating no connections are available. You're stating this can be optimized: hey sure, I'm all ears, but it needs to be done in a way that the backlog still has a chance of draining when thread pool starvation is hit (which is a common scenario for applications losing connections unexpectedly, often having some sync-over-async elsewhere we don't control). That's why we create a thread to do this work and listen for incoming work for 5 more seconds, instead of repeatedly spinning up a thread if we're recovering from overwhelming the outbound socket buffer.
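For reference, opting out of the backlog looks roughly like this (assuming the BacklogPolicy option exposed on ConfigurationOptions; check the docs for the exact details):

using System.Threading.Tasks;
using StackExchange.Redis;

class Example
{
    static async Task Main()
    {
        var options = ConfigurationOptions.Parse("localhost:6379");
        // Fail commands immediately while disconnected instead of queueing them in the backlog.
        options.BacklogPolicy = BacklogPolicy.FailFast;
        using var muxer = await ConnectionMultiplexer.ConnectAsync(options);
    }
}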

As for the thread pool growing on its own, there's often a misconception here once you get into the details.

  1. We support more than .NET 6+, e.g. .NET Framework, which doesn't have any love for rapid thread pool expansion at all.
  2. The .NET Core mechanism you're thinking of doesn't apply to Monitor.Wait. We discovered this after a discussion with Stephen, triggered by "Rollback: remove .NET 6 build" (#1992), when we attempted to utilize that behavior.

@krasin-ga
Author

krasin-ga commented Mar 10, 2024

I think we have a mismatch in understanding here: that the backlog is part of the normal flow at all. It is not

Yep, I certainly misunderstood that part. Thanks for the clarification!

Upd: But not by that much: if we can't take the mutex, we still end up enqueuing to the backlog, and this will happen quite often under load.

gotLock = _singleWriterMutex.Wait(0);
if (!gotLock)
#else
token = _singleWriterMutex.TryWait(WaitOptions.NoDelay);
if (!token.Success)
#endif
{
    // If we can't get it *instantaneously*, pass it to the backlog for throughput
    if (TryPushToBacklog(message, onlyIfExists: false))
    {
        return WriteResult.Success; // queued counts as success
    }

NickCraver added a commit that referenced this issue Mar 11, 2024
… dedicated thread (#2667)

Resolving #2664.

This went through a few iterations early on, but in the current form the issue is that on an await we can hop to a thread pool thread and end up blocking our wait there - not awesome. We want to avoid that entirely and use a fully synchronous path here.
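In rough terms, "a full sync path" means keeping the loop synchronous on the dedicated thread, something like the sketch below (illustrative only; it reuses the fields from the snippet quoted earlier, and the synchronous ProcessBridgeBacklog name is invented here - see #2667 for the actual change):

// Illustrative sketch, not the actual #2667 diff: keep the whole loop synchronous so it
// never leaves the dedicated thread, even while draining.
private void ProcessBacklog()
{
    _backlogStatus = BacklogStatus.Starting;
    while (!isDisposed)
    {
        if (!_backlog.IsEmpty)
        {
            ProcessBridgeBacklog(); // hypothetical synchronous counterpart of ProcessBridgeBacklogAsync
        }
        _backlogStatus = BacklogStatus.SpinningDown;
        // Blocking here is fine: this is the dedicated backlog thread, not a pool thread.
        if (!_backlogAutoReset.WaitOne(5000))
        {
            break;
        }
    }
}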
@NickCraver
Collaborator

@krasin-ga Thanks again for the report! Version 2.7.33 with this fixed up is available on NuGet now :)

@krasin-ga
Author

Nice, thanks for the quick fix!

@giuliano-macedo

@krasin-ga @NickCraver Do you know which version this issue started happening in?

@NickCraver
Collaborator

@giuliano-macedo I'd assume this has been a potential issue when consuming a lot of endpoints since 2.5.x; that's my best guess here.

@giuliano-macedo

@NickCraver Great! And thanks for the fix!
