
PhysicalBridge always returns NoConnection after exception during message write / connection flush #1374

Merged
3 commits merged into StackExchange:master, Mar 18, 2020

Conversation

hamish-omny
Contributor

@hamish-omny commented Mar 6, 2020

We recently encountered an issue where we repeatedly (for several hours) got the exception "No connection is available to service this operation". We were using version 2.0.601 (with AbortOnConnectFail set to false), running on ASP.NET Core in Azure and connecting to an Azure Redis instance. This was only resolved by restarting the instance; other application instances could connect to the Redis instance during this time.

The full exception was as follows

No connection is available to service this operation: SMEMBERS XXXX; An existing connection was forcibly closed by the remote host; IOCP: (Busy=1,Free=999,Min=2,Max=1000), WORKER: (Busy=0,Free=32767,Min=2,Max=32767), Local-CPU: n/a
SocketFailure on xxx.redis.cache.windows.net:6380/Interactive, Idle/Faulted, last: GET, origin: ReadFromPipe, outstanding: 3, last-read: 0s ago, last-write: 0s ago, unanswered-write: 0s ago, keep-alive: 10s, state: ConnectedEstablished, mgr: 8 of 10 available, in: 0, last-heartbeat: 0s ago, last-mbeat: 0s ago, global: 0s ago, v: 2.0.601.3402

I was able to reproduce this situation with a small console app that repeatedly writes and reads keys; while it is running I kill the connection from the redis-cli with `client kill type normal`. This sometimes causes an exception to be thrown from within one of the WriteMessage* methods. That in turn means the _activeMessage field in PhysicalBridge is never set back to null, and from that point forward the compare-exchange logic in WriteMessageInsideLock always returns "NoConnection". At this point every request through the ConnectionMultiplexer fails and it never recovers (at least in my testing).
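For anyone following along, here is a minimal, self-contained sketch of that failure mode. The names (`BridgeSketch`, `Write`, `WriteResult`) are simplified stand-ins for illustration only, not the actual PhysicalBridge internals:

```csharp
using System;
using System.Threading;

enum WriteResult { Success, NoConnectionAvailable }

class BridgeSketch
{
    // Non-null means "a write is currently in flight" - the single-writer sentinel.
    private object _activeMessage;

    public WriteResult Write(object message, Action<object> send)
    {
        // Claim the sentinel; if a stale value is still there, report no connection.
        if (Interlocked.CompareExchange(ref _activeMessage, message, null) != null)
            return WriteResult.NoConnectionAvailable;

        send(message);                                  // may throw (e.g. socket reset)
        Interlocked.Exchange(ref _activeMessage, null); // never reached when send throws
        return WriteResult.Success;
    }
}
```

If `send` throws, `_activeMessage` is never cleared, so every later call sees a non-null sentinel and returns NoConnectionAvailable indefinitely, which matches the symptom above.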

I note there have been a number of issues opened that have a similar error message and symptoms. Looking at the commit history, this commit moved away from clearing _activeMessage in ReleaseSingleWriterLock (which was called from the finally block) to the current approach, although obviously along with a significant number of other changes. Version 2.0.519 doesn't include this change, which might explain why people have had some success rolling back to it.

This PR moves back to always clearing _activeMessage within the finally block of the three relevant write methods, ensuring it's cleared even when an exception occurs. Happy to try writing some tests for this; I just wanted to validate that the change is acceptable first.
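Roughly, the change amounts to the pattern below, again using the simplified stand-in names from the sketch above rather than the real method bodies:

```csharp
using System;
using System.Threading;

enum WriteResult { Success, NoConnectionAvailable }

class BridgeSketchFixed
{
    private object _activeMessage;

    public WriteResult Write(object message, Action<object> send)
    {
        if (Interlocked.CompareExchange(ref _activeMessage, message, null) != null)
            return WriteResult.NoConnectionAvailable;
        try
        {
            send(message); // may throw during the write or the flush
            return WriteResult.Success;
        }
        finally
        {
            // Always release the sentinel, success or failure, so the bridge can recover.
            Interlocked.Exchange(ref _activeMessage, null);
        }
    }
}
```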

Finally, for reference, the following exception was thrown during my testing from PhysicalBridge.WriteMessageTakingWriteLockAsync while flushing the PhysicalConnection.

   at Pipelines.Sockets.Unofficial.Internal.Throw.Socket(Int32 errorCode) in C:\Code\Pipelines.Sockets.Unofficial\src\Pipelines.Sockets.Unofficial\Internal\Throw.cs:line 59
   at Pipelines.Sockets.Unofficial.SocketAwaitableEventArgs.GetResult() in C:\Code\Pipelines.Sockets.Unofficial\src\Pipelines.Sockets.Unofficial\SocketAwaitableEventArgs.cs:line 74
   at Pipelines.Sockets.Unofficial.SocketConnection.<DoSendAsync>d__82.MoveNext() in C:\Code\Pipelines.Sockets.Unofficial\src\Pipelines.Sockets.Unofficial\SocketConnection.Send.cs:line 64
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.IO.Pipelines.PipeCompletion.ThrowLatchedException()
   at System.IO.Pipelines.Pipe.GetFlushResult(FlushResult& result)
   at System.IO.Pipelines.Pipe.PrepareFlush(CompletionData& completionData, ValueTask`1& result, CancellationToken cancellationToken)
   at System.IO.Pipelines.Pipe.FlushAsync(CancellationToken cancellationToken)
   at System.IO.Pipelines.Pipe.DefaultPipeWriter.FlushAsync(CancellationToken cancellationToken)
   at Pipelines.Sockets.Unofficial.SocketConnection.WrappedWriter.FlushAsync(CancellationToken cancellationToken) in C:\Code\Pipelines.Sockets.Unofficial\src\Pipelines.Sockets.Unofficial\SocketConnection.cs:line 434
   at StackExchange.Redis.PhysicalConnection.FlushAsync(Boolean throwOnFailure) in C:\Users\XXXX\source\repos\StackExchange.Redis\src\StackExchange.Redis\PhysicalConnection.cs:line 905
   at StackExchange.Redis.PhysicalBridge.WriteMessageTakingWriteLockAsync(PhysicalConnection physical, Message message) in C:\Users\XXXX\source\repos\StackExchange.Redis\src\StackExchange.Redis\PhysicalBridge.cs:line 977

…on is thrown while trying to send the message or flush the connection. Leaving _activeMessage set causes WriteMessageInsideLock to always return a "NoConnectionAvailable" error indefinitely
@hamish-omny
Contributor Author

hamish-omny commented Mar 6, 2020

This is the source of my test project (a minimal sketch of its core loop follows the list below). Note:

  • I referenced a locally built debug version of StackExchange.Redis, so you may need to fix the references.
  • I tested with a non-standard Redis port (6382); see the top of Program.cs to change this.
  • It only writes errors to the console. You can stop the threads reading/writing to Redis with s then enter, and start them again with r then enter. This doesn't recreate the multiplexer; I just used it to test that pausing requests didn't allow it to recover.
    RedisStuckRepo.zip
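The core of the repro is just a tight write/read loop against the multiplexer, something along these lines (this is a simplified reconstruction rather than the exact contents of Program.cs; the port and key names are assumptions):

```csharp
using System;
using System.Threading.Tasks;
using StackExchange.Redis;

class Program
{
    static async Task Main()
    {
        // abortConnect=false matches the production configuration described above.
        var muxer = await ConnectionMultiplexer.ConnectAsync("localhost:6382,abortConnect=false");
        var db = muxer.GetDatabase();

        while (true)
        {
            try
            {
                // Keep traffic flowing; kill the connection externally with
                // `redis-cli client kill type normal` while this runs.
                await db.StringSetAsync("repro:key", Guid.NewGuid().ToString());
                await db.StringGetAsync("repro:key");
            }
            catch (Exception ex)
            {
                // Once the bug is triggered, this logs "No connection is available..."
                // on every iteration and never recovers.
                Console.WriteLine(ex.Message);
            }
        }
    }
}
```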

@mgravell
Collaborator

mgravell commented Mar 6, 2020

This looks very interesting. I'm going to look at this first thing today.

@davidchencloudexceed

Glad to see this is a WIP. Is there any update you can share with us, and while waiting for the final fix, is there any temporary mitigation?

@Deep232

Deep232 commented Mar 12, 2020

We are also facing the same issue with Azure Redis Cache. Any update, or pointers on how to resolve it?

@dceravigupta

@hamish-omny thanks for the local repro instructions.

@mgravell I was also able to repro the "No connection is available to service this operation" problem with v1.2.6 and Azure Redis Cache. In my case, I noticed that while reconfiguring the connection in ConnectionMultiplexer, the PING command to the server endpoint sometimes fails and we fall back to a ServerSelectionStrategy.ServerType of Standalone; because the endpoints are actually of type Cluster (in the case of Azure Redis), the ConnectionMultiplexer sometimes gets stuck and continues to throw "No connection is available to service this operation".

My idea for a fix is to give the user an option to set ServerSelectionStrategy.ServerType explicitly, so that we can be more deterministic instead of doing best-effort detection, especially at times of failure. I opened PR 1381 for this. Let me know your thoughts.

@NickCraver
Collaborator

This PR is hard to read for me since the rename affects a lot more places than the fix - I'm going to go rename this in master and merge in so the diff here is much easier to review :)

NickCraver pushed a commit that referenced this pull request Mar 14, 2020
Tidying this up so the diff/fix in #1374 is easier to analyze.
@NickCraver
Collaborator

@hamish-omny Awesome analysis here, thank you for everything!

@mgravell I went through this, my 2 questions are:

  1. Should we be clearing only if token.Success? It seems we should clear regardless of success.
  2. Should we move the .SetIdle() calls to the same position here? I can't see a downside to keeping these coupled and logically it makes more sense. Sure it errored, but the connection is locally idle afterwards - the activity status is independent of the health.

@NickCraver linked six issues (five on Mar 14 and one on Mar 15, 2020) that may be closed by this pull request
@mgravell merged commit 58e8d63 into StackExchange:master Mar 18, 2020
edgarrs pushed a commit to microsoft/BuildXL that referenced this pull request Apr 24, 2020
This PR updates the `StackExchange.Redis` library to version 2.1.30 to include a fix that supposedly solves the problem with a process being unable to connect to a Redis instance.

This is the PR that this version includes: StackExchange/StackExchange.Redis#1374

We have many cases where the process fails to connect to a Redis instance and then, after a service restart, it immediately (within seconds) connects successfully.

Related work items: #1697377
@pdonovan

I've been searching the issues here for this error (No connection is available to service this operation), as it's been causing us lots of problems in production. This pull request does seem to tally with what we see - a write exception (preceded by lots of timeouts), then 'no connection available'. We're running ASP.NET Core 3.1, connecting to an Azure Redis C5 instance. These are our logs for two different days when the problem occurred (under high system usage):

26 Jul 2020 00:19:50.623[2020-07-26 00:18:30Z ERR] Mx=RD0003FF85E598 (Td=5) CacheException: Timeout performing PSETEX (15000ms), next: PSETEX m:ScheduledPublished, inst: 0, qu: 0, qs: 3, aw: True, rs: ReadAsync, ws: Flushing, in: 0, serverEndpoint: Unspecified/bm8-au-cache.redis.cache.windows.net:6380, mgr: 10 of 10 available, clientName: RD0003FF85E598, IOCP: (Busy=0,Free=1000,Min=2,Max=1000), WORKER: (Busy=39,Free=8152,Min=2,Max=8191), v: 2.0.601.3402 (Please take a look at this article for some common client-side issues that can cause timeouts: https://stackexchange.github.io/StackExchange.Redis/Timeouts) \

26 Jul 2020 00:19:50.631[2020-07-26 00:18:30Z ERR] Mx=RD0003FF85E598 (Td=14) CacheException: An unknown error occurred when writing the message

26 Jul 2020 00:19:50.631[2020-07-26 00:18:30Z ERR] Mx=RD0003FF85E598 (Td=73) CacheException: The timeout was reached before the message could be written to the output buffer, and it was not sent, command=PSETEX, timeout: 15000, outbound: 0KiB, inbound: 0KiB, next: PSETEX m:ScheduledPublished, inst: 0, qu: 0, qs: 3, aw: True, rs: ReadAsync, ws: Idle, in: 0, serverEndpoint: Unspecified/bm8-au-cache.redis.cache.windows.net:6380, mgr: 10 of 10 available, clientName: RD0003FF85E598, IOCP: (Busy=0,Free=1000,Min=2,Max=1000), WORKER: (Busy=39,Free=8152,Min=2,Max=8191), v: 2.0.601.3402 (Please take a look at this article for some common client-side issues that can cause timeouts: https://stackexchange.github.io/StackExchange.Redis/Timeouts)

26 Jul 2020 00:19:50.631[2020-07-26 00:18:30Z ERR] Mx=RD0003FF85E598 (Td=48) CacheException: No connection is available to service this operation: GET m:identityidkey:Yor-C3SWMDk8Y4ya4LECw2djAkKO7ious4Umog2; IOCP: (Busy=0,Free=1000,Min=2,Max=1000), WORKER: (Busy=39,Free=8152,Min=2,Max=8191), Local-CPU: n/a 
27 Jul 2020 00:10:59.591[2020-07-27 00:10:56Z ERR] Mx=RD0003FF85C879 (Td=5) CacheException:Timeout performing TYPE (5000ms), next: PSETEX m:ScheduledPublished, inst: 36, qu: 0, qs: 26, aw: True, rs: ReadAsync, ws: Flushing, in: 0, serverEndpoint: Unspecified/bm8-au-cache.redis.cache.windows.net:6380, mgr: 10 of 10 available, clientName: RD0003FF85C879, IOCP: (Busy=0,Free=1000,Min=2,Max=1000), WORKER: (Busy=40,Free=8151,Min=2,Max=8191), v: 2.0.601.3402 (Please take a look at this article for some common client-side issues that can cause timeouts: https://stackexchange.github.io/StackExchange.Redis/Timeouts)

27 Jul 2020 00:10:59.591[2020-07-27 00:10:56Z ERR] Mx=RD0003FF85C879 (Td=77) CacheException: An unknown error occurred when writing the message

27 Jul 2020 00:10:59.592[2020-07-27 00:10:56Z ERR] Mx=RD0003FF85C879 (Td=89) CacheException: No connection is available to service this operation: TYPE m:template:105_2; IOCP: (Busy=0,Free=1000,Min=2,Max=1000), WORKER: (Busy=40,Free=8151,Min=2,Max=8191), Local-CPU: n/a

Td=XX is the thread id.

I know the issue may (?) have been fixed in 2.1, but hopefully this log might lend further confidence to that?

@rayansinha

I'm sorry, but I cannot see any resolution to this issue. Can someone please give a solution?

@jwagon

jwagon commented Sep 25, 2020

I'm curious as well; we're seeing similar behavior. Do we know for sure which versions this affects, and which version the fix is in, if any? Thanks.
