Investigating 'No connection is available to service this operation' #1277

Closed
StfnStrn opened this issue Nov 18, 2019 · 2 comments

Comments

@StfnStrn

Hey StackExchange.Redis team,

There are several open issues concerning the error message 'No connection is available to service this operation'. We face the same issue in our application: an Azure WebApp on containers, using StackExchange.Redis 2.0.601. In our case, the root cause is a SocketDisposedException. It usually occurs after a deployment, when the app is started and the deployment slots are swapped. Our Redis server is an Azure Redis instance. The problem first occurred in April, so I planned to investigate, but a few days later we didn't see it any more and I dropped the issue. For the past three weeks, the problem has been back.

This time I started to investigate the issue and cloned the StackExchange.Redis repository. I wrote a test case which simulates the error. The test connects to an Azure Redis instance and sets a key. Then the test calls Dispose() on the underlying Socket object, an operation that only became possible after I made the relevant fields in PhysicalBridge internal and exposed the VirtualSocket in PhysicalConnection via an internal getter. After disposing the socket, the test tries to read the previously set string value. That fails with a SocketDisposedException. So far, so good.
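For reference, a minimal sketch of what that test does (the connection string is a placeholder, and GetUnderlyingSocket is a hypothetical helper standing in for the internal plumbing I exposed in my local clone):

```csharp
using System;
using System.Net.Sockets;
using System.Threading.Tasks;
using StackExchange.Redis;
using Xunit;

public class DisposedSocketRepro
{
    [Fact]
    public async Task ReadFailsAfterSocketDispose()
    {
        // Placeholder connection string for the Azure Redis instance.
        var muxer = await ConnectionMultiplexer.ConnectAsync(
            "myredis.redis.cache.windows.net:6380,ssl=true,password=<secret>");
        var db = muxer.GetDatabase();

        await db.StringSetAsync("repro:key", "expected-value");

        // Dispose the socket underneath the live connection.
        Socket socket = GetUnderlyingSocket(muxer);
        socket.Dispose();

        // The next read fails because the socket is gone.
        await Assert.ThrowsAnyAsync<Exception>(async () => await db.StringGetAsync("repro:key"));
    }

    // Hypothetical helper: stands in for the access path I added locally
    // (PhysicalBridge -> PhysicalConnection.VirtualSocket via internal getters).
    private static Socket GetUnderlyingSocket(ConnectionMultiplexer muxer)
        => throw new NotImplementedException("requires the modified library source");
}
```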

Using that first StringGetAsync call as a trigger for the connection to reconnect, I ignored the first exception and added a second read operation. This one failed occasionally, but sometimes returned the proper value. So I added a pause, using await Task.Delay(TimeSpan.FromSeconds(1)), to give the connection a chance to repair itself. This led to the second call succeeding on every test execution.
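In code, that step looks roughly like this (same db and key as in the sketch above):

```csharp
// First read after the dispose: expected to fail, used only as the trigger
// for the multiplexer to notice the dead socket and reconnect.
try { await db.StringGetAsync("repro:key"); }
catch (Exception) { /* ignored */ }

// Give the connection a chance to repair itself.
await Task.Delay(TimeSpan.FromSeconds(1));

// With the pause in place, this second read succeeds on every run.
var value = await db.StringGetAsync("repro:key");
```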

This made me curious, so I removed the delay and replaced the two StringGetAsync calls with a for-loop. It launches count StringGetAsync operations, putting each task in a list, and waits delay milliseconds before starting the next read operation. The result strongly depends on several factors: the value of delay, whether I launch the test in debug or in run mode, and most likely the executing machine. With some values, some of the tasks return a valid result, so the connection repairs itself. With other values, not a single task returns. For all combinations, the product of delay and count is between one and two seconds.
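Roughly, the loop is shaped like this (again a sketch, reusing db from the first snippet; count and delay are the two knobs I varied):

```csharp
// Launch `count` reads, `delay` milliseconds apart, without awaiting them individually.
static async Task<RedisValue[]> ReadInLoopAsync(IDatabase db, int count, int delay)
{
    var tasks = new List<Task<RedisValue>>();
    for (int i = 0; i < count; i++)
    {
        tasks.Add(db.StringGetAsync("repro:key"));
        await Task.Delay(delay);   // milliseconds between successive reads
    }
    // Depending on count, delay, debug vs. run mode and the machine,
    // some, all or none of these tasks complete with the expected value.
    return await Task.WhenAll(tasks);
}
```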

If I insert a one-second delay after disposing the Socket, before the for-loop starts, all operations always succeed.
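That variant is just one extra line before the loop from the sketch above:

```csharp
socket.Dispose();
await Task.Delay(TimeSpan.FromSeconds(1));              // single pause before any reads are issued
var results = await ReadInLoopAsync(db, count, delay);  // now every read returns the expected value
```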

That's how far I've come. If anyone has hints on where to look for an error, I'm open to suggestions. From my point of view, it looks like the read requests disturb the internal reconnection operation. There is also the heartbeat operation, and I haven't checked how the subscription property of the PhysicalBridge deals with all this. Since I haven't found a fix yet, I didn't create a pull request. If you'd like to run the test case on your end, I can clean it up and share the current test code. As long as I haven't run out of ideas, I'll keep digging into this and try to find the root cause of why the self-reconnect logic does not work reliably and why a disposed socket can lead to a dead ConnectionMultiplexer object.

StackTrace.txt

@mgravell
Collaborator

mgravell commented Nov 18, 2019 via email

@NickCraver
Collaborator

Going to close this out to fold all discussions together - see #1374 which we hope to vet and get in a 2.1.x release. Watch this week for some progress here. And thanks to everyone for adding more info and context around this - the hang situation is subtle and it's appreciated.
