PhysicalBridge always returns NoConnection after exception during message write / connection flush #1374
…on is thrown while trying to send the message or flush the connection. Leaving _activeMessage set causes WriteMessageInsideLock to always return a "NoConnectionAvailable" error indefinitely
This is the source of my test project. Note
|
This looks very interesting. I'm going to look at this first thing today. |
Glad to see this is in progress. Is there any update you can share, and is there a temporary mitigation while we wait for the final fix? |
We are also facing the same issue with Azure Redis Cache. Is there any update, or a pointer on how to resolve it? |
@hamish-omny thanks for the local repro instructions. @mgravell I was also able to reproduce the "No connection is available to service this operation" problem with v1.2.6 and Azure Redis Cache. In my case, I noticed that while the ConnectionMultiplexer is reconfiguring the connection, the PING command to the server endpoint sometimes fails and we fall back to a ServerSelectionStrategy.ServerType of Standalone. Because the endpoints are actually of type Cluster (in the case of Azure Redis), the ConnectionMultiplexer sometimes gets stuck and continues to throw "No connection is available to service this operation". My idea for a fix is to give the user an option to set ServerSelectionStrategy.ServerType, so that the behavior is deterministic instead of best-effort, especially during failures. I opened PR 1381 for this. Let me know your thoughts. |
This PR is hard to read for me since the rename affects a lot more places than the fix - I'm going to go rename this in master and merge in so the diff here is much easier to review :) |
Tidying this up so the diff/fix in #1374 is easier to analyze.
@hamish-omny Awesome analysis here, thank you for everything! @mgravell I went through this, my 2 questions are:
|
This PR updates the `StackExchange.Redis` library to version 2.1.30 to include a fix that supposedly solves the problem of a process being unable to connect to a Redis instance. This is the PR that this version includes: StackExchange/StackExchange.Redis#1374 We have many cases where a process fails to connect to a Redis instance and, after a service restart, immediately (within seconds) connects successfully. Related work items: #1697377
I've been searching the issues here for this error (No connection is available to service this operation), as it's been causing us lots of problems in production. This pull request does seem to tally with what we see - a write exception (preceded by lots of timeouts), then 'no connection available'. We're running ASP.NET Core 3.1, connecting to an Azure Redis C5 instance. These are our logs for two different days when the problem occurred (under high system usage):
Td=XX is the thread id. I know the issue may (?) have been fixed in 2.1, but hopefully this log might lend further confidence to that? |
I am sorry, but I cannot see any resolution to this issue. Can someone please give a solution? |
I'm curious as well, we're seeing similar. Do we know for sure what version this affects, and which version the fix is in if any? Thanks. |
We recently encountered an issue where we repeatedly (for several hours) got the exception "No connection is available to service this operation". We were using version 2.0.601 (with `AbortOnConnectFail` set to false), running on ASP.NET Core in Azure and connecting to an Azure Redis instance. This was only resolved by restarting the instance; other application instances could connect to the Redis instance during this time.

The full exception was as follows
I was able to reproduce this situation with a small console app that repeatedly writes and reads keys; while it is running, I kill the connection from the Redis CLI with `client kill type normal`. This sometimes causes an exception to be thrown from within one of the `WriteMessage*` methods. This in turn causes the `_activeMessage` field in `PhysicalBridge` not to be set back to null, and from that point forward the compare-exchange logic in `WriteMessageInsideLock` will always return "NoConnection". At this point any request through the `ConnectionMultiplexer` fails and it won't recover (at least in my testing).

I note there have been a number of issues opened that have a similar error message and symptoms. Looking at the commit history, this commit changed from clearing `_activeMessage` in `ReleaseSingleWriterLock`, which was called from the finally block, to the current approach, although obviously along with a significant number of other changes. Version 2.0.519 doesn't include this change, and that might explain why people have had some success rolling back to it.

This PR moves back to always clearing `_activeMessage` within the finally block of the three relevant write methods, ensuring it's cleared even if an exception occurs. Happy to try to write some tests for this; I just wanted to validate that the change is acceptable first.

Finally, for reference, as a specific example the following exception was thrown during my testing, from `PhysicalBridge.WriteMessageTakingWriteLockAsync` when flushing the PhysicalConnection.
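
For readers who want to see the failure mode in isolation, here is a minimal, self-contained sketch of the pattern described in this PR. It is not the library's code: `SketchBridge`, `WriteResult`, `Demo`, and the `clearInFinally` and `simulateFailure` parameters are hypothetical names invented for illustration, and the real `PhysicalBridge` does considerably more. It only assumes that the write path claims a single-writer slot with a compare-exchange on `_activeMessage` and must release it again afterwards.

```csharp
using System;
using System.Threading;

// Top-level statements: run both variants and print what the second write returns.
Console.WriteLine(Demo.WriteAfterFailure(clearInFinally: false)); // prints "NoConnectionAvailable"
Console.WriteLine(Demo.WriteAfterFailure(clearInFinally: true));  // prints "Success"

// Hypothetical stand-in for the result returned by the write path.
enum WriteResult { Success, NoConnectionAvailable }

// Hypothetical, heavily simplified model of the bridge's write path.
sealed class SketchBridge
{
    // Non-null while a write is considered to be in flight.
    private object _activeMessage;

    // true  = clear the slot in a finally block (the approach this PR returns to)
    // false = clear the slot only on the success path (the behavior being fixed)
    private readonly bool _clearInFinally;

    public SketchBridge(bool clearInFinally) => _clearInFinally = clearInFinally;

    public WriteResult Write(object message, bool simulateFailure)
    {
        // Claim the slot; if it is still occupied, the write is refused.
        if (Interlocked.CompareExchange(ref _activeMessage, message, null) != null)
            return WriteResult.NoConnectionAvailable;

        try
        {
            if (simulateFailure)
                throw new InvalidOperationException("simulated failure while flushing");

            // Without the fix, the slot is released only when nothing throws.
            if (!_clearInFinally) _activeMessage = null;
            return WriteResult.Success;
        }
        finally
        {
            // With the fix, the slot is released even when an exception escapes.
            if (_clearInFinally) _activeMessage = null;
        }
    }
}

static class Demo
{
    // Performs one failing write, then reports the result of the next write.
    public static WriteResult WriteAfterFailure(bool clearInFinally)
    {
        var bridge = new SketchBridge(clearInFinally);
        try
        {
            bridge.Write("first", simulateFailure: true);
        }
        catch (InvalidOperationException)
        {
            // The caller sees the original failure; what matters is the state left behind.
        }
        return bridge.Write("second", simulateFailure: false);
    }
}
```

In the sketch, a single exception is enough to leave the slot occupied, so every later write is refused until the process restarts. That would match the symptoms reported in this thread. Clearing the field in `finally` lets the next write proceed.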