
Remove indirection to mitigate a presumed deadlock. Fixes StackExchange/StackExchange.Redis#42. #353

Merged (1 commit) on Feb 8, 2016

Conversation

rjmooney
Contributor

@rjmooney rjmooney commented Feb 4, 2016

So I ran into this issue as well. After reading through some of the comments and taking a look at the code, I noticed an unnecessary indirection in ConnectionMultiplexer.ConnectImpl that appears to be causing a deadlock:

// note that task has timeouts internally, so it might take *just over* the regular timeout
// wrap into task to force async execution
var task = Factory.StartNew(() => muxer.ReconfigureAsync(true, false, log, null, "connect").Result);

What's happening here is that the code spawns a task which blocks synchronously on an async task's result. This is bad juju. What appears to happen next is that the code blocks the context thread waiting for the wrapper task created with Factory.StartNew to return:

if (!task.Wait(muxer.SyncConnectTimeout(true)))

When the continuation from ReconfigureAsync returns, it waits for the context thread to become available. Since the context thread is blocked waiting for the continuation, it results in deadlock. Stephen Cleary has an excellent description of the pattern on his blog.
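The thread-starvation shape of this deadlock can be sketched outside of .NET as well. The following is a hypothetical Python analogue (not the actual StackExchange.Redis code): a single-worker pool stands in for the blocked context thread, and the inner work item can never run because the only worker is stuck waiting on its result.

```python
import concurrent.futures
from concurrent.futures import ThreadPoolExecutor

# a single worker plays the role of the one context thread
pool = ThreadPoolExecutor(max_workers=1)

def inner():
    # stands in for the work ReconfigureAsync would do
    return "reconfigured"

def outer():
    # resubmits to the same single-worker pool, then blocks on the result;
    # but the only worker is busy running outer itself, so inner never starts
    return pool.submit(inner).result(timeout=1)

try:
    outcome = pool.submit(outer).result(timeout=3)
except concurrent.futures.TimeoutError:
    # without the timeouts above, this would hang forever
    outcome = "deadlock"
```

The timeouts exist only so the sketch terminates; remove them and the program hangs, which is the behavior being reported here.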

The fix, I believe, is simple: spawn the ReconfigureAsync method directly:

// note that task has timeouts internally, so it might take *just over* the regular timeout
var task = muxer.ReconfigureAsync(true, false, log, null, "connect");

if (!task.Wait(muxer.SyncConnectTimeout(true)))
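The fixed shape, in the same hypothetical Python analogue: the caller thread (not a pool worker) blocks on the future, so the pool's single worker remains free to execute it.

```python
from concurrent.futures import ThreadPoolExecutor

pool = ThreadPoolExecutor(max_workers=1)

def reconfigure():
    # stands in for the work ReconfigureAsync would do
    return "connected"

# submit once from the caller thread; the worker is idle, so it runs immediately,
# and the caller's timed wait completes normally instead of deadlocking
future = pool.submit(reconfigure)
result = future.result(timeout=2)
```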

If I had to guess, I'd say ReconfigureAsync was once a synchronous method that was converted to async without updating the ConnectImpl implementation to remove the async wrapping.

The above is conjecture at this point, however in my limited testing it appears to solve the problem.

@NickCraver
Collaborator

This problem originates from aeedada which was via PR #129 to fix a timeout race.

I agree that fix was bad, though, because it created the deadlock; just adding context since we need to fix both issues.

mgravell added a commit that referenced this pull request Feb 8, 2016
Remove indirection to mitigate a presumed deadlock. Fixes #42.
@mgravell mgravell merged commit 050053a into StackExchange:master Feb 8, 2016
@mgravell
Collaborator

mgravell commented Feb 8, 2016

Yeah, there is definitely something odd here. The TaskFactory approach seems an arbitrary and likely unreliable way of fixing it, especially as there were a couple of missing ForAwait() calls (now added, but that isn't the point). Thanks; will see how it plays.

@NickCraver
Collaborator

@rjmooney heads up, new package is on NuGet with this fix: https://www.nuget.org/packages/StackExchange.Redis/1.1.572-alpha

@rjmooney
Contributor Author

rjmooney commented Feb 9, 2016

I agree with @NickCraver that both the initial connection issue #42 and the timeout issue in #129 need to be resolved. This patch seems to resolve the former. However, in synchronous mode the connection will use the default socket connect timeout instead of the ConnectionOptions timeout, causing the unexpected behavior described in #129. This causes the ConnectToUnexistingHostFailsWithinTimeout synchronous unit test to fail.

The culprit is the call to AsyncWaitHandle.WaitOne() in RunWithCompletionType, which doesn't specify a timeout. My limited insight into the design of the library notwithstanding, making the current connection context available to downstream methods, so that ConnectionOptions can be referenced, would allow RunWithCompletionType to pass a timeout to WaitOne. I think this is the ideal solution, although it will require refactoring.

In the meantime, rewrapping ConnectAsync's call to ReconfigureAsync in a Task.Run for CompletionType.Sync invocations should resolve the inconsistent timeout behavior in synchronous mode, despite the smell.
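The idea behind that rewrap, running the blocking operation on a worker so the caller can enforce its own deadline regardless of what the operation's internal timeouts do, can be sketched in Python. This is a hypothetical helper, not library code:

```python
import threading
import time

def run_with_timeout(fn, timeout):
    """Run fn on a daemon worker thread; return its result, or None on timeout.

    On timeout the worker is abandoned, not cancelled, which mirrors the
    "task keeps running after Wait gives up" caveat of the Task.Run approach.
    """
    box = {}
    worker = threading.Thread(target=lambda: box.setdefault("value", fn()), daemon=True)
    worker.start()
    worker.join(timeout)  # caller-side deadline, independent of fn's own timeouts
    return box.get("value") if not worker.is_alive() else None

fast = run_with_timeout(lambda: "done", 1.0)                    # returns "done"
slow = run_with_timeout(lambda: time.sleep(2) or "late", 0.2)   # returns None
```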

@rjmooney rjmooney deleted the rjmooney/issue_42 branch February 9, 2016 17:07
@PashaPash
Contributor

@rjmooney thanks for the fix! Actually, both the test and the Factory.StartNew wrapper were added only to cover the removal of CompletionType.Sync in SocketManager.

RunWithCompletionType with CompletionType.Sync is a test-only mode for #113 unit tests, so it is not necessary to support timeout in it.

It was used in #129 only as a way to put something long-running inside ReconfigureAsync that would totally ignore all timeout settings. It was basically a way to check what would happen if ReconfigureAsync ignored ConnectTimeout (as it actually does) and did not time out internally in a reasonable time. So RunWithCompletionType should not support a timeout directly.

There is a comment on that in the code:

// note that task has timeouts internally, so it might take *just over* the regular timeout

If that *just over* case is not a critical problem for anyone, then it is better to leave it as is and just remove [TestCase(CompletionType.Sync)] from ConnectToUnexistingHostFailsWithinTimeout.
