
PhysicalBridge always returns NoConnection after exception during message write / connection flush #1374

Merged
3 commits merged into StackExchange:master, Mar 18, 2020

Conversation

hamish-omny
Contributor

@hamish-omny commented Mar 6, 2020

We recently encountered an issue where we repeatedly (for several hours) got the exception "No connection is available to service this operation". We were using version 2.0.601 (with AbortOnConnectFail set to false), running on ASP.NET Core in Azure and connecting to an Azure Redis instance. This was only resolved by restarting the instance; other application instances could connect to the Redis instance during this time.

The full exception was as follows

No connection is available to service this operation: SMEMBERS XXXX; An existing connection was forcibly closed by the remote host; IOCP: (Busy=1,Free=999,Min=2,Max=1000), WORKER: (Busy=0,Free=32767,Min=2,Max=32767), Local-CPU: n/a
SocketFailure on xxx.redis.cache.windows.net:6380/Interactive, Idle/Faulted, last: GET, origin: ReadFromPipe, outstanding: 3, last-read: 0s ago, last-write: 0s ago, unanswered-write: 0s ago, keep-alive: 10s, state: ConnectedEstablished, mgr: 8 of 10 available, in: 0, last-heartbeat: 0s ago, last-mbeat: 0s ago, global: 0s ago, v: 2.0.601.3402

I was able to reproduce this situation with a small console app that repeatedly writes and reads keys; while it is running I kill the connection from the redis-cli with `client kill type normal`. This sometimes causes an exception to be thrown from within one of the WriteMessage* methods. That in turn means the _activeMessage field in PhysicalBridge is never set back to null, and from that point forward the compare-exchange logic in WriteMessageInsideLock always returns "NoConnection". At this point every request through the ConnectionMultiplexer fails and it never recovers (at least in my testing).
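For anyone following along, here is a minimal, self-contained sketch of that failure mode. The names (`BridgeSketch`, `Write`, `WriteResult`) are simplified stand-ins for illustration only, not the actual PhysicalBridge internals:

```csharp
using System;
using System.Threading;

enum WriteResult { Success, NoConnectionAvailable }

class BridgeSketch
{
    // Non-null means "a write is currently in flight" - the single-writer sentinel.
    private object _activeMessage;

    public WriteResult Write(object message, Action<object> send)
    {
        // Claim the sentinel; if a stale value is still there, report no connection.
        if (Interlocked.CompareExchange(ref _activeMessage, message, null) != null)
            return WriteResult.NoConnectionAvailable;

        send(message);                                  // may throw (e.g. socket reset)
        Interlocked.Exchange(ref _activeMessage, null); // never reached when send throws
        return WriteResult.Success;
    }
}
```

If `send` throws, `_activeMessage` is never cleared, so every later call sees a non-null sentinel and returns NoConnectionAvailable indefinitely, which matches the symptom above.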

I note there have been a number of issues opened that have a similar error message and symptoms. Looking at the commit history, this commit moved away from clearing _activeMessage in ReleaseSingleWriterLock (which was called from the finally block) to the current approach, although obviously along with a significant number of other changes. Version 2.0.519 doesn't include this change, which might explain why people have had some success rolling back to it.

This PR moves back to always clearing _activeMessage within the finally block of the three relevant write methods, ensuring it's cleared even when an exception occurs. Happy to try writing some tests for this; I just wanted to validate that the change is acceptable first.
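Roughly, the change amounts to the pattern below, again using the simplified stand-in names from the sketch above rather than the real method bodies:

```csharp
using System;
using System.Threading;

enum WriteResult { Success, NoConnectionAvailable }

class BridgeSketchFixed
{
    private object _activeMessage;

    public WriteResult Write(object message, Action<object> send)
    {
        if (Interlocked.CompareExchange(ref _activeMessage, message, null) != null)
            return WriteResult.NoConnectionAvailable;
        try
        {
            send(message); // may throw during the write or the flush
            return WriteResult.Success;
        }
        finally
        {
            // Always release the sentinel, success or failure, so the bridge can recover.
            Interlocked.Exchange(ref _activeMessage, null);
        }
    }
}
```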

Finally, for reference, the following exception was thrown during my testing from PhysicalBridge.WriteMessageTakingWriteLockAsync while flushing the PhysicalConnection.

   at Pipelines.Sockets.Unofficial.Internal.Throw.Socket(Int32 errorCode) in C:\Code\Pipelines.Sockets.Unofficial\src\Pipelines.Sockets.Unofficial\Internal\Throw.cs:line 59
   at Pipelines.Sockets.Unofficial.SocketAwaitableEventArgs.GetResult() in C:\Code\Pipelines.Sockets.Unofficial\src\Pipelines.Sockets.Unofficial\SocketAwaitableEventArgs.cs:line 74
   at Pipelines.Sockets.Unofficial.SocketConnection.<DoSendAsync>d__82.MoveNext() in C:\Code\Pipelines.Sockets.Unofficial\src\Pipelines.Sockets.Unofficial\SocketConnection.Send.cs:line 64
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.IO.Pipelines.PipeCompletion.ThrowLatchedException()
   at System.IO.Pipelines.Pipe.GetFlushResult(FlushResult& result)
   at System.IO.Pipelines.Pipe.PrepareFlush(CompletionData& completionData, ValueTask`1& result, CancellationToken cancellationToken)
   at System.IO.Pipelines.Pipe.FlushAsync(CancellationToken cancellationToken)
   at System.IO.Pipelines.Pipe.DefaultPipeWriter.FlushAsync(CancellationToken cancellationToken)
   at Pipelines.Sockets.Unofficial.SocketConnection.WrappedWriter.FlushAsync(CancellationToken cancellationToken) in C:\Code\Pipelines.Sockets.Unofficial\src\Pipelines.Sockets.Unofficial\SocketConnection.cs:line 434
   at StackExchange.Redis.PhysicalConnection.FlushAsync(Boolean throwOnFailure) in C:\Users\XXXX\source\repos\StackExchange.Redis\src\StackExchange.Redis\PhysicalConnection.cs:line 905
   at StackExchange.Redis.PhysicalBridge.WriteMessageTakingWriteLockAsync(PhysicalConnection physical, Message message) in C:\Users\XXXX\source\repos\StackExchange.Redis\src\StackExchange.Redis\PhysicalBridge.cs:line 977

…on is thrown while trying to send the message or flush the connection. Leaving _activeMessage set causes WriteMessageInsideLock to always return a "NoConnectionAvailable" error indefinitely
@hamish-omny
Contributor Author

hamish-omny commented Mar 6, 2020

This is the source of my test project (a minimal sketch of its core loop follows the list below). Note:

  • I referenced a locally built debug version of StackExchange.Redis, so you may need to fix the references.
  • I tested with a non-standard Redis port (6382); see the top of Program.cs to change this.
  • It only writes errors to the console. You can stop the threads reading/writing to Redis with s then enter, and start them again with r then enter. This doesn't recreate the multiplexer; I just used it to test that pausing requests didn't allow it to recover.
    RedisStuckRepo.zip
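The core of the repro is just a tight write/read loop against the multiplexer, something along these lines (this is a simplified reconstruction rather than the exact contents of Program.cs; the port and key names are assumptions):

```csharp
using System;
using System.Threading.Tasks;
using StackExchange.Redis;

class Program
{
    static async Task Main()
    {
        // abortConnect=false matches the production configuration described above.
        var muxer = await ConnectionMultiplexer.ConnectAsync("localhost:6382,abortConnect=false");
        var db = muxer.GetDatabase();

        while (true)
        {
            try
            {
                // Keep traffic flowing; kill the connection externally with
                // `redis-cli client kill type normal` while this runs.
                await db.StringSetAsync("repro:key", Guid.NewGuid().ToString());
                await db.StringGetAsync("repro:key");
            }
            catch (Exception ex)
            {
                // Once the bug is triggered, this logs "No connection is available..."
                // on every iteration and never recovers.
                Console.WriteLine(ex.Message);
            }
        }
    }
}
```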

@mgravell
Collaborator

mgravell commented Mar 6, 2020

This looks very interesting. I'm going to look at this first thing today.

@davidchencloudexceed

Glad to see this is a WIP. Is there any update you can share with us, and while waiting for the final fix, is there any temporary mitigation?

@Deep232

Deep232 commented Mar 12, 2020

We are also facing the same issue with Azure Redis Cache. Any update, or pointers on how to resolve it?

@dceravigupta

@hamish-omny thanks for the local repro instructions.

@mgravell I was also able to repro the "No connection is available to service this operation" problem with v1.2.6 and Azure Redis Cache. In my case, I noticed that while reconfiguring the connection in ConnectionMultiplexer, the PING command to the server endpoint sometimes fails and we fall back to a ServerSelectionStrategy.ServerType of Standalone; because the endpoints are actually of type Cluster (in the case of Azure Redis), the ConnectionMultiplexer sometimes gets stuck and continues to throw "No connection is available to service this operation".

My idea for a fix is to give the user an option to set ServerSelectionStrategy.ServerType explicitly, so that we can be more deterministic instead of doing best-effort detection, especially at times of failure. I opened PR 1381 for this. Let me know your thoughts.

@NickCraver
Collaborator

This PR is hard to read for me since the rename affects a lot more places than the fix - I'm going to go rename this in master and merge in so the diff here is much easier to review :)

NickCraver pushed a commit that referenced this pull request Mar 14, 2020
Tidying this up so the diff/fix in #1374 is easier to analyze.
@NickCraver
Collaborator

@hamish-omny Awesome analysis here, thank you for everything!

@mgravell I went through this, my 2 questions are:

  1. Should we be clearing only if token.Success? It seems we should clear regardless of success.
  2. Should we move the .SetIdle() calls to the same position here? I can't see a downside to keeping these coupled and logically it makes more sense. Sure it errored, but the connection is locally idle afterwards - the activity status is independent of the health.

@NickCraver linked six issues (five on Mar 14 and one on Mar 15, 2020) that may be closed by this pull request
@mgravell merged commit 58e8d63 into StackExchange:master Mar 18, 2020
edgarrs pushed a commit to microsoft/BuildXL that referenced this pull request Apr 24, 2020
This PR updates the `StackExchange.Redis` library to version 2.1.30 to include a fix that supposedly solves the problem with a process being unable to connect to a Redis instance.

This is the PR that this version includes: StackExchange/StackExchange.Redis#1374

We have many cases where the process fails to connect to a Redis instance and then, after a service restart, it immediately (within seconds) connects successfully.

Related work items: #1697377
@pdonovan

I've been searching the issues here for this error (No connection is available to service this operation), as it's been causing us lots of problems in production. This pull request does seem to tally with what we see - a write exception (preceded by lots of timeouts), then 'no connection available'. We're running ASP.NET Core 3.1, connecting to an Azure Redis C5 instance. These are our logs for two different days when the problem occurred (under high system usage):

26 Jul 2020 00:19:50.623[2020-07-26 00:18:30Z ERR] Mx=RD0003FF85E598 (Td=5) CacheException: Timeout performing PSETEX (15000ms), next: PSETEX m:ScheduledPublished, inst: 0, qu: 0, qs: 3, aw: True, rs: ReadAsync, ws: Flushing, in: 0, serverEndpoint: Unspecified/bm8-au-cache.redis.cache.windows.net:6380, mgr: 10 of 10 available, clientName: RD0003FF85E598, IOCP: (Busy=0,Free=1000,Min=2,Max=1000), WORKER: (Busy=39,Free=8152,Min=2,Max=8191), v: 2.0.601.3402 (Please take a look at this article for some common client-side issues that can cause timeouts: https://stackexchange.github.io/StackExchange.Redis/Timeouts) \

26 Jul 2020 00:19:50.631[2020-07-26 00:18:30Z ERR] Mx=RD0003FF85E598 (Td=14) CacheException: An unknown error occurred when writing the message

26 Jul 2020 00:19:50.631[2020-07-26 00:18:30Z ERR] Mx=RD0003FF85E598 (Td=73) CacheException: The timeout was reached before the message could be written to the output buffer, and it was not sent, command=PSETEX, timeout: 15000, outbound: 0KiB, inbound: 0KiB, next: PSETEX m:ScheduledPublished, inst: 0, qu: 0, qs: 3, aw: True, rs: ReadAsync, ws: Idle, in: 0, serverEndpoint: Unspecified/bm8-au-cache.redis.cache.windows.net:6380, mgr: 10 of 10 available, clientName: RD0003FF85E598, IOCP: (Busy=0,Free=1000,Min=2,Max=1000), WORKER: (Busy=39,Free=8152,Min=2,Max=8191), v: 2.0.601.3402 (Please take a look at this article for some common client-side issues that can cause timeouts: https://stackexchange.github.io/StackExchange.Redis/Timeouts)

26 Jul 2020 00:19:50.631[2020-07-26 00:18:30Z ERR] Mx=RD0003FF85E598 (Td=48) CacheException: No connection is available to service this operation: GET m:identityidkey:Yor-C3SWMDk8Y4ya4LECw2djAkKO7ious4Umog2; IOCP: (Busy=0,Free=1000,Min=2,Max=1000), WORKER: (Busy=39,Free=8152,Min=2,Max=8191), Local-CPU: n/a 
27 Jul 2020 00:10:59.591[2020-07-27 00:10:56Z ERR] Mx=RD0003FF85C879 (Td=5) CacheException:Timeout performing TYPE (5000ms), next: PSETEX m:ScheduledPublished, inst: 36, qu: 0, qs: 26, aw: True, rs: ReadAsync, ws: Flushing, in: 0, serverEndpoint: Unspecified/bm8-au-cache.redis.cache.windows.net:6380, mgr: 10 of 10 available, clientName: RD0003FF85C879, IOCP: (Busy=0,Free=1000,Min=2,Max=1000), WORKER: (Busy=40,Free=8151,Min=2,Max=8191), v: 2.0.601.3402 (Please take a look at this article for some common client-side issues that can cause timeouts: https://stackexchange.github.io/StackExchange.Redis/Timeouts)

27 Jul 2020 00:10:59.591[2020-07-27 00:10:56Z ERR] Mx=RD0003FF85C879 (Td=77) CacheException: An unknown error occurred when writing the message

27 Jul 2020 00:10:59.592[2020-07-27 00:10:56Z ERR] Mx=RD0003FF85C879 (Td=89) CacheException: No connection is available to service this operation: TYPE m:template:105_2; IOCP: (Busy=0,Free=1000,Min=2,Max=1000), WORKER: (Busy=40,Free=8151,Min=2,Max=8191), Local-CPU: n/a

Td=XX is the thread id.

I know the issue may (?) have been fixed in 2.1, but hopefully this log might lend further confidence to that?

@rayansinha

I'm sorry, but I cannot see any resolution to this issue. Can someone please give a solution?

@jwagon

jwagon commented Sep 25, 2020

I'm curious as well; we're seeing similar behavior. Do we know for sure which versions this affects, and which version the fix is in, if any? Thanks.
