
Remove indirection to mitigate a presumed deadlock. Fixes StackExchange/StackExchange.Redis#42. #353

Merged (1 commit) on Feb 8, 2016

Conversation

rjmooney
Contributor

@rjmooney rjmooney commented Feb 4, 2016

So I ran into this issue as well. After reading through some of the comments and taking a look at the code, I noticed an unnecessary indirection in ConnectionMultiplexer.ConnectImpl that appears to be causing a deadlock:

// note that task has timeouts internally, so it might take *just over* the regular timeout
// wrap into task to force async execution
var task = Factory.StartNew(() => muxer.ReconfigureAsync(true, false, log, null, "connect").Result);

What's happening here is that the code spawns a task which blocks synchronously on an async task's result. This is bad juju. What appears to happen next is that the code blocks the context thread waiting for the wrapper task created with Factory.StartNew to return:

if (!task.Wait(muxer.SyncConnectTimeout(true)))

When the continuation from ReconfigureAsync returns, it waits for the context thread to become available. Since the context thread is blocked waiting for the continuation, it results in deadlock. Stephen Cleary has an excellent description of the pattern on his blog.
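The thread-starvation shape of this deadlock can be sketched outside of .NET as well. The following is a hypothetical Python analogue (not the actual StackExchange.Redis code): a single-worker pool stands in for the blocked context thread, and the inner work item can never run because the only worker is stuck waiting on its result.

```python
import concurrent.futures
from concurrent.futures import ThreadPoolExecutor

# a single worker plays the role of the one context thread
pool = ThreadPoolExecutor(max_workers=1)

def inner():
    # stands in for the work ReconfigureAsync would do
    return "reconfigured"

def outer():
    # resubmits to the same single-worker pool, then blocks on the result;
    # but the only worker is busy running outer itself, so inner never starts
    return pool.submit(inner).result(timeout=1)

try:
    outcome = pool.submit(outer).result(timeout=3)
except concurrent.futures.TimeoutError:
    # without the timeouts above, this would hang forever
    outcome = "deadlock"
```

The timeouts exist only so the sketch terminates; remove them and the program hangs, which is the behavior being reported here.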

The fix, I believe, is simple: spawn the ReconfigureAsync method directly:

// note that task has timeouts internally, so it might take *just over* the regular timeout
var task = muxer.ReconfigureAsync(true, false, log, null, "connect");

if (!task.Wait(muxer.SyncConnectTimeout(true)))
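The fixed shape, in the same hypothetical Python analogue: the caller thread (not a pool worker) blocks on the future, so the pool's single worker remains free to execute it.

```python
from concurrent.futures import ThreadPoolExecutor

pool = ThreadPoolExecutor(max_workers=1)

def reconfigure():
    # stands in for the work ReconfigureAsync would do
    return "connected"

# submit once from the caller thread; the worker is idle, so it runs immediately,
# and the caller's timed wait completes normally instead of deadlocking
future = pool.submit(reconfigure)
result = future.result(timeout=2)
```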

If I had to guess, I'd say ReconfigureAsync was once a synchronous method that was converted to async without updating the ConnectImpl implementation to remove the async wrapping.

The above is conjecture at this point, however in my limited testing it appears to solve the problem.

@NickCraver
Collaborator

This problem originates from aeedada which was via PR #129 to fix a timeout race.

I agree that fix was bad, though, because it created the deadlock; just adding context since we need to fix both issues.

mgravell added a commit that referenced this pull request Feb 8, 2016
Remove indirection to mitigate a presumed deadlock. Fixes #42.
@mgravell mgravell merged commit 050053a into StackExchange:master Feb 8, 2016
@mgravell
Collaborator

mgravell commented Feb 8, 2016

Yeah, there is definitely something odd here. The TaskFactory approach seems an arbitrary and likely unreliable way of fixing it, especially as there were a couple of missing ForAwait() calls (now added, but that isn't the point). Thanks; will see how it plays.

@NickCraver
Collaborator

@rjmooney heads up, new package is on NuGet with this fix: https://www.nuget.org/packages/StackExchange.Redis/1.1.572-alpha

@rjmooney
Contributor Author

rjmooney commented Feb 9, 2016

I agree with @NickCraver that both the initial connection issue #42 and the timeout issue in #129 need to be resolved. This patch seems to resolve the former. However, in synchronous mode the connection will use the default socket connect timeout instead of the ConnectionOptions timeout, causing the unexpected behavior described in #129. This causes the ConnectToUnexistingHostFailsWithinTimeout synchronous unit test to fail.

The culprit is the call to AsyncWaitHandle.WaitOne() in RunWithCompletionType, which doesn't specify a timeout. My limited insight into the design of the library notwithstanding, making the current connection context available to downstream methods, so that ConnectionOptions can be referenced, would allow RunWithCompletionType to pass a timeout to WaitOne. I think this is the ideal solution, although it will require refactoring.

In the meantime, rewrapping ConnectAsync's call to ReconfigureAsync in a Task.Run for CompletionType.Sync invocations should resolve the inconsistent timeout behavior in synchronous mode, despite the smell.
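The idea behind that rewrap, running the blocking operation on a worker so the caller can enforce its own deadline regardless of what the operation's internal timeouts do, can be sketched in Python. This is a hypothetical helper, not library code:

```python
import threading
import time

def run_with_timeout(fn, timeout):
    """Run fn on a daemon worker thread; return its result, or None on timeout.

    On timeout the worker is abandoned, not cancelled, which mirrors the
    "task keeps running after Wait gives up" caveat of the Task.Run approach.
    """
    box = {}
    worker = threading.Thread(target=lambda: box.setdefault("value", fn()), daemon=True)
    worker.start()
    worker.join(timeout)  # caller-side deadline, independent of fn's own timeouts
    return box.get("value") if not worker.is_alive() else None

fast = run_with_timeout(lambda: "done", 1.0)                    # returns "done"
slow = run_with_timeout(lambda: time.sleep(2) or "late", 0.2)   # returns None
```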

@rjmooney rjmooney deleted the rjmooney/issue_42 branch February 9, 2016 17:07
@PashaPash
Contributor

@rjmooney thanks for the fix! Actually, both the test and the Factory.StartNew wrapper were added only to cover the removal of CompletionType.Sync in SocketManager.

RunWithCompletionType with CompletionType.Sync is a test-only mode for #113 unit tests, so it is not necessary to support timeout in it.

It was used in #129 only as a way to put something long-running inside ReconfigureAsync that would totally ignore all timeout settings. It was basically a way to check what would happen if ReconfigureAsync ignored ConnectTimeout (as it actually does) and did not time out internally in a reasonable time. So RunWithCompletionType should not support a timeout directly.

There is a comment on that in the code:

// note that task has timeouts internally, so it might take *just over* the regular timeout

If that *just over* case is not a critical problem for anyone, then it is better to leave it as is and just remove [TestCase(CompletionType.Sync)] from ConnectToUnexistingHostFailsWithinTimeout.
