New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce API to avoid cancel/fail races with CancellableContinuation #830

Open
elizarov opened this Issue Nov 14, 2018 · 10 comments

Comments

Projects
None yet
6 participants
@elizarov
Collaborator

elizarov commented Nov 14, 2018

Consider this attempt to integrate coroutines with cancellable callback-based API using suspendCancellableCoroutine. This example is for Retrofit, but the same issue could appear with any cancellable API:

suspend fun <T> Call<T>.await(): T = 
    suspendCancellableCoroutine { cont ->
        enqueue(object : Callback<T> { // install callback
            override fun onResponse(call: Call<T>, response: Response<T>) {
                 // resume normally -- omitted for clarity
            }
    
            override fun onFailure(call: Call<T>, t: Throwable) {
                if (cont.isCancelled) return     // LINE (1) 
                cont.resumeWithException(t)      // LINE (2)
            }
        })
        // cancel Call when continuation is cancelled
        cont.invokeOnCancellation {
            cancel()
        }
    }

Resuming a CancellableContinuation with exception is documented to behave in the following way (see https://kotlin.github.io/kotlinx.coroutines/kotlinx-coroutines-core/kotlinx.coroutines/-cancellable-continuation/index.html):

Invocation of resume or resumeWithException in resumed state produces IllegalStateException. Invocation of resume in cancelled state is ignored (it is a trivial race between resume from the continuation owner and outer job cancellation and cancellation wins). Invocation of resumeWithException in cancelled state triggers exception handling of passed exception.

So, this code attempts to avoid calling resumingWithException on an already cancelled continuation by doing a check in LINE (1). This creates a "check & act" race in between LINE (1) and LINE (2). In particular, if continuation gets cancelled from a different thread when this code executes between LINE (1) and LINE (2), then resumeWithException is invoked on cancelled continuation and triggers handling of "uncaught" exception.

There is no public (non-experimental, non-internal) API at the moment to work around this race.

Workarounds

W1. Use internal API on CancellableContinuation:

Replace LINE (1) and LINE(2) with the following code:

            @UseExperimental(InternalCoroutinesApi::class)
            cont.tryResumeWithException(t)?.let { cont.completeResume(it) }

This workaround ignores exceptions when continuation is cancelled and this is the disadvantage of this workaround at the same time. If a true exception (failure) happens concurrently with cancellation, then it is going to be ignored instead of being handled. Arguably this problem is much less severe, but still.

W2. Use internal API on Job:

  1. Replace suspendCancellableCoroutine with suspendCoroutine.
  2. Replace cont.invokeOnCancellation { ... } with the following code:
        @UseExperimental(InternalCoroutinesApi::class)
        cont.context[Job]?.invokeOnCompletion(onCancelling = true) {
            cancel()
        }
  1. Replace LINE (1) with detection of "Cancelled" state on a Call (to see if the call had failed or was cancelled) and resume continuation with CancellationException in case of cancellation:
            if (call.isCanceled) {
                cont.resumeWithException(CancellationException("Cancelled"))
                return
            }

This workaround correctly handles exceptions that occur concurrently with cancellation (the call is either going to be cancelled or fails and we learn what had happened).

Proposed solutions

There are several possible solutions:

S1. Ignore exceptions when resuming a cancelled continuation using resumeWithException. Technically, this is a breaking change, but that is not a major problem here. The problem is that there could be a genuine exception that needs to be handled and ignoring it in resumeWithException is as bad as any other instance of ignoring an exception.

S2. Introduce new resumeWithExceptionOrIgnoreOnCancel method (name is purely provisional). It is not a breaking change, but it suffers from the same problem of ignoring a genuine exception when it happens concurrently with cancellation. This is not a severe problem, though, and this could still be one of the solutions.

S3. New API for cancellable callbacks with an intermediate cancelling state. The idea is that invocation of cancel() should not immediately result in resumption of the suspended coroutine, but shall wait until resumeWith (value or exception) is invoked. It gives a chance to see whether the operation was indeed cancelled successfully or failed while we trying to cancel it. It requires writing somewhat more code, though, similar to the code described in the second workaround.

  • This can be a new mode operation for suspendCancellableCoroutine.
  • This can be a new top-level function.

@elizarov elizarov changed the title from Introduce CancellableContinuation API to avoid cancel/fail races to Introduce API to avoid cancel/fail races with CancellableContinuation Nov 14, 2018

@LouisCAD

This comment has been minimized.

Contributor

LouisCAD commented Nov 14, 2018

I'm not sure I understand the solution *S3resumeWith is not called if the continuation is cancelled and the underlying callback is unregistered and its methods are never called? Or are you talking about resumeWith in another place?

@elizarov

This comment has been minimized.

Collaborator

elizarov commented Nov 14, 2018

Let me clarify S3. Right now, when CancellableContinuation is cancelled it immediately resumes the coroutine with CancellationException, so now if the operation we were waiting for had already crashed and resumeWithException is about to be called, it has no choice but to either ignore this exception or invoke uncaught exception handler.

With S3 the idea the we don't immediately resume a cancelled coroutine. Instead, we mark it as "cancelling" and wait until resumeXxx is invoked, so that if resumeWithException is invoked, we can resume the coroutine with this exception, so that we don't have this ugly choice between losing exception or handling it as uncaught one.

@LouisCAD

This comment has been minimized.

Contributor

LouisCAD commented Nov 14, 2018

@elizarov

This comment has been minimized.

Collaborator

elizarov commented Nov 14, 2018

@LouisCAD Yes, S3 works only for callback-based API that would still invoke a callback when you cancel. Retrofit and many other APIs are like that, but not all of them, so it cannot become the only solution out there.

@LouisCAD

This comment has been minimized.

Contributor

LouisCAD commented Nov 14, 2018

@elizarov Intriguing, I didn't know some callback based APIs did that.

I'm wondering how the callback is invoked after call to cancel() on Call<T> in Retrofit2, since I see no mention of the behavior you're talking about in the javadoc. EDIT: there's a test for that.

@Tolriq

This comment has been minimized.

Tolriq commented Nov 21, 2018

@elizarov Since all my tests and beta works perfectly with W1 I think I'll go to prod with that solution.

Since it's internal, is there any risk that this workaround is no more working at some point before this new API is created, or is the risk low enough to be OK for prod?

@IRus

This comment has been minimized.

Contributor

IRus commented Nov 23, 2018

@LouisCAD for example in apache http client FutureCallback has cancelled listener.

@Tolriq

This comment has been minimized.

Tolriq commented Nov 26, 2018

@elizarov So went to prod with https://gist.github.com/Tolriq/dcbdc62c0e29d3b321034e990c3c85ce containing Workaround 1.

While it greatly reduce the problem (Catched exception re thrown) (no more seen in 7K user beta group) it still happens with a way larger active user base.

I then tested to add a CoroutineExceptionHandler in case the issue was somewhere else it did not fix anything.

I then tested Workaround 2 and it did not change anything.

I then stopped relying on Exceptions and encapsulated them in a sealed class

    private suspend fun Call.await(): WorkerResult {
        return suspendCancellableCoroutine { continuation ->
            continuation.invokeOnCancellation {
                cancel()
            }
            enqueue(object : Callback {
                override fun onResponse(call: Call, response: Response) {
                    continuation.resume(WorkerResult.Success(response))
                }

                override fun onFailure(call: Call, e: IOException) {
                    if (call.isCanceled || !isActive) {
                        continuation.resume(WorkerResult.Error(CancellationException("Cancelled")))
                    } else {
                        continuation.resume(WorkerResult.Error(e))
                    }
                }
            })
        }
    }

And problem solved.

So I wonder where else in the code there's a race, maybe somewhere I the worker pool I use, or maybe inside couroutine or channels, but no idea where to search.

@pmhsfelix

This comment has been minimized.

pmhsfelix commented Nov 28, 2018

My expectation is that most APIs will still call the completion callback on a cancellation scenario, so S3 seems adequate for most cases (@LouisCAD regarding Retrofit2, the documentation is not clear, however there seems to be a test just for that case - https://github.com/square/retrofit/blob/master/retrofit/src/test/java/retrofit2/CallTest.java#L647)

IMO, there should also be a non-internal/experimental way of achieving the behaviour of W2 also for the success case, i.e., ignoring a cancellation if the callback is called with success.

@cuichanghao

This comment has been minimized.

cuichanghao commented Dec 13, 2018

I occur the same error.
local scope async cannot catch error.
https://stackoverflow.com/questions/50230466/kotlin-withcontext-vs-async-await
so withContext can catch error.

bad code:

launch {
            try {
                val result = async(Dispatchers.IO) { //this line
             }
}

to

correct code:

launch {
            try {
                val result = withContext(Dispatchers.IO) { //this line
             }
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment