Thread local is not cleaned up sometimes #2930

Closed
frost13it opened this issue Sep 13, 2021 · 16 comments

@frost13it frost13it commented Sep 13, 2021

I'm using a ThreadContextElement that sets the value of a ThreadLocal. After #985 was resolved, it worked perfectly.
But after upgrading to 1.5.0 I've got a similar problem: sometimes the last value of the thread local gets stuck in a worker thread.
Equivalent code:

while(true) {
    someCode {
        // here the thread local may already have a value from previous iteration
        withContext(threadLocal.asContextElement("foo")) {
            someOtherCode()
        }
    }
}

Actual code of the ThreadContextElement implementation is here.

It is hard to reproduce the issue, but I'm facing it periodically in production (it may take hours or days to arise).
Tested 1.5.0 and 1.5.2; both behave the same. Running with -ea.

@qwwdfsad qwwdfsad added the bug label Sep 27, 2021
Member

@qwwdfsad qwwdfsad commented Sep 27, 2021

It's hard to tell what exactly is wrong without seeing the whole coroutine's hierarchy.

What I suspect may be the root cause is a 3rd-party implementation of a coroutine builder that either does not implement CoroutineStackFrame or completely rewrites the coroutine context instead of overriding only the required elements with +.

#985 uses stack-walking capabilities, leveraging CoroutineStackFrame and the fact that all suspending coroutine builders implement it; it also relies on the context being properly propagated.
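To illustrate the point about context propagation, here is a minimal sketch that uses only kotlin.coroutines from the standard library. `ParentMarker` and `RequestId` are hypothetical elements invented for this example (they are not kotlinx.coroutines API); `ParentMarker` merely stands in for any element that a parent builder has installed. Adding an element with + keeps what the parent provided, while building a context from scratch silently drops it.

```kotlin
import kotlin.coroutines.AbstractCoroutineContextElement
import kotlin.coroutines.CoroutineContext

// Hypothetical stand-in for an element installed by a parent coroutine builder.
class ParentMarker : AbstractCoroutineContextElement(ParentMarker) {
    companion object Key : CoroutineContext.Key<ParentMarker>
}

// Hypothetical user-defined element.
class RequestId(val id: String) : AbstractCoroutineContextElement(RequestId) {
    companion object Key : CoroutineContext.Key<RequestId>
}

// Propagating: the parent context is kept, only RequestId is added or replaced.
fun propagate(parent: CoroutineContext, id: String): CoroutineContext =
    parent + RequestId(id)

// Rewriting: the parent context is ignored, so ParentMarker is lost.
@Suppress("UNUSED_PARAMETER")
fun rewrite(parent: CoroutineContext, id: String): CoroutineContext =
    RequestId(id)

fun main() {
    val parent: CoroutineContext = ParentMarker() + RequestId("old")
    val propagated = propagate(parent, "new")
    val rewritten = rewrite(parent, "new")

    println(propagated[ParentMarker] != null) // true
    println(propagated[RequestId]?.id)        // new
    println(rewritten[ParentMarker] != null)  // false
}
```

A builder that behaves like `rewrite` would hide elements that the rest of the machinery expects to find, which is the kind of misbehaviour suspected above.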

When the exception is thrown, can you please check if the coroutine context in the most nested coroutine contains UndispatchedMarker?
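One possible way to perform such a check is sketched below; it is an assumption about how one might do it, not an official diagnostic. The helper `hasElementNamed` is hypothetical and matches elements by class name, because the marker is not part of the public API. `FakeMarker` is a stand-in used only to make the example runnable without kotlinx.coroutines.

```kotlin
import kotlin.coroutines.CoroutineContext
import kotlin.coroutines.EmptyCoroutineContext

// Hypothetical helper: true if any element of the context has the given simple class name.
fun CoroutineContext.hasElementNamed(simpleName: String): Boolean =
    fold(false) { found, element -> found || element.javaClass.simpleName == simpleName }

// Hypothetical stand-in element that is its own key, used only for this demonstration.
object FakeMarker : CoroutineContext.Element, CoroutineContext.Key<FakeMarker> {
    override val key: CoroutineContext.Key<*> get() = this
}

fun main() {
    println(FakeMarker.hasElementNamed("FakeMarker"))            // true
    println(EmptyCoroutineContext.hasElementNamed("FakeMarker")) // false
}
```

Inside a suspending function the same helper could be applied to the standard `coroutineContext` property, for example `coroutineContext.hasElementNamed("UndispatchedMarker")`, assuming the element's class really carries that name.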

Author

@frost13it frost13it commented Sep 27, 2021

Thanks for the response. I'll check that.

Author

@frost13it frost13it commented Oct 4, 2021

I've checked for the presence of UndispatchedMarker. It is present inside withContext(threadLocal) { } (but not outside it) and at the most nested suspension points (which are CompletableFuture.await()).
Besides CompletableFuture.await(), the only coroutine-related things in the project are basic coroutine builders (withContext { }, coroutineScope { }, launch { }, withTimeout { }). There are no 3rd-party builders or anything like that.

@michail-nikolaev michail-nikolaev commented Oct 12, 2021

Hello.

We are seeing something like this after a few days in production.

We have a loop like this:


class RequestContextsStorage()
val threadLocalForRequestContext = ThreadLocal<RequestContextsStorage>()

class RequestContextThreadContextElement(private val storage: RequestContextsStorage) :
    ThreadContextElement<RequestContextsStorage> {

    // Key for CoroutineContext key-value storage
    private object Key : CoroutineContext.Key<RequestContextThreadContextElement>

    override val key: CoroutineContext.Key<*> get() = Key

    override fun updateThreadContext(context: CoroutineContext): RequestContextsStorage {
        val oldState = threadLocalForRequestContext.get()
        threadLocalForRequestContext.set(storage)
        return oldState
    }

    override fun restoreThreadContext(context: CoroutineContext, oldState: RequestContextsStorage) {
        threadLocalForRequestContext.set(oldState)
    }
}

private var otherThreadLocal = ThreadLocal<String?>()
private val scope = CoroutineScope(Dispatchers.IO)

scope.launch(otherThreadLocal.asContextElement("x")) {
    while (isActive) {
        delay(100)
        // sometimes here we could see some value in threadLocalForRequestContext
        someStuff()
    }
}

Also, all builders are pretty standard. Maybe there are some tricks with cancellation, exceptions, etc.

1.5.1 version.

And of course, we have a lot of code like this:

runBlocking(Dispatchers.IO) {
    withContext(RequestContextThreadContextElement(someValue) + otherThreadLocal.asContextElement("x")) {
        // everything seems to be fine here
    }
}

@michail-nikolaev michail-nikolaev commented Oct 25, 2021

Workaround like this:

scope.launch(otherThreadLocal.asContextElement("x")) {
    while (isActive) {
        withContext(RequestContextThreadContextElement(empty)) {
            delay(100)
            // sometimes here we could see some value in threadLocalForRequestContext
            someStuff()
        }
    }
}

fixed the issue in our case.

Author

@frost13it frost13it commented Jan 17, 2022

Still reproducible on 1.6.0.

Author

@frost13it frost13it commented Jan 21, 2022

Finally I've managed to write a small reproducer:

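import kotlinx.coroutines.*
import kotlinx.coroutines.sync.Semaphore
import kotlin.coroutines.resume

val threadLocal = ThreadLocal<String>()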

suspend fun main() {
    while (true) {
        coroutineScope {
            repeat(100) {
                launch {
                    doSomeJob()
                }
            }
        }
    }
}

private suspend fun doSomeJob() {
    check(threadLocal.get() == null)
    withContext(threadLocal.asContextElement("foo")) {
        val semaphore = Semaphore(1, 1)
        suspendCancellableCoroutine<Unit> { cont ->
            Dispatchers.Default.asExecutor().execute {
                cont.resume(Unit)
            }
        }
        cancel()
        semaphore.acquire()
    }
}

It completes almost instantly on my machine and takes some time on play.kotlinlang.org.

Member

@qwwdfsad qwwdfsad commented Jan 21, 2022

Great job with the reproducer! I verified that it reproduces; we'll fix it in 1.6.1.

qwwdfsad added a commit that referenced this issue Jan 25, 2022
… in order to avoid state interference when the coroutine is updated concurrently.

Concurrency is inevitable in this scenario: when the coroutine that has UndispatchedCoroutine as its completion suspends, we have to clear the thread context, but while we are doing so, concurrent resume of the coroutine could've happened that also ends up in save/clear/update context

Fixes #2930
Author

@frost13it frost13it commented Mar 14, 2022

Is there a planned release date for 1.6.1?

Author

@frost13it frost13it commented Apr 4, 2022

Unfortunately, the issue seems to be still there.

The following code throws an exception on versions from 1.5.0 up to the current develop branch (262876b):

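import java.util.concurrent.CompletableFuture
import kotlinx.coroutines.*
import kotlinx.coroutines.future.await
import kotlinx.coroutines.sync.Semaphore

val threadLocal = ThreadLocal<String>()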

suspend fun main() {
    doSomeJob()
    doSomeJob()
}

private suspend fun doSomeJob() {
    check(threadLocal.get() == null)
    withContext(threadLocal.asContextElement("foo")) {
        try {
            coroutineScope {
                val semaphore = Semaphore(1, 1)
                dummyAwait()
                cancel()
                semaphore.acquire()
            }
        } catch (e: CancellationException) {
            println("cancelled")
        }
    }
}

private suspend fun dummyAwait() {
    CompletableFuture.runAsync({ }, Dispatchers.Default.asExecutor()).await()
}

Member

@qwwdfsad qwwdfsad commented Apr 4, 2022

Could you please recheck on 1.6.1?

I cannot reproduce it as is; I will give it a few more tries later to see if it still reproduces. Anyway, 1.6.1 fixes at least one serious bug in thread locals, so it's worth upgrading.

Author

@frost13it frost13it commented Apr 4, 2022

The same on 1.6.1, every time. Checked on Liberica JDK 11.0.14 and some build of OpenJDK 17.
Tried kotlinx.coroutines.scheduler.core.pool.size values from 1 to 8, no change. Is there something else that could depend on the environment?

@qwwdfsad qwwdfsad reopened this Apr 4, 2022
Member

@qwwdfsad qwwdfsad commented Apr 4, 2022

Aha, I see: it only reproduces with an entry point that is not related to kotlinx.coroutines, namely suspend fun main!

I'll fix it separately. Meanwhile, it would be nice to know whether you are still affected in the production environment, as it's unlikely that someone has a suspend provider without integration with kotlinx.coroutines.

Depending on that, we'll decide on the urgency of the fix.

Author

@frost13it frost13it commented Apr 4, 2022

Indeed, suspend fun main() is not a production case. But it is pretty sad (not fatal, though) when a "quick and dirty" piece of code fails with such an exception. I faced this situation while using one of our support tools.

The potential production case is an application based on Ktor 1.6.8. When using the Netty engine and kotlinx.coroutines 1.6.1, it fails in exactly the same way. I don't know if it is a Ktor issue, but I achieved the same effect without it.
Since this bug scares the hell out of me now, I can't leave it unattended. With the CIO engine everything is OK, but who knows where it will strike again without any chance of a quick fix.

We have some hooks in the infrastructure to ensure that code runs with an initial value of ThreadLocal. This approach could work even on pre-1.4.3 kotlinx.coroutines and let us work around this bug in most cases. But still every GlobalScope.launch { } is a potential bomb.
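As an illustration of what such a hook might look like, here is a minimal sketch built only on the JDK. It is not the commenter's actual infrastructure: `ClearingExecutor` is a hypothetical wrapper that clears a given ThreadLocal on the worker thread before each task runs, so a stale value left by an earlier task is never observed.

```kotlin
import java.util.concurrent.Executor

// Hypothetical wrapper: resets `local` on the executing thread before every task.
class ClearingExecutor(
    private val delegate: Executor,
    private val local: ThreadLocal<*>
) : Executor {
    override fun execute(command: Runnable) {
        delegate.execute {
            local.remove() // drop whatever a previous task left behind
            command.run()
        }
    }
}

fun main() {
    val local = ThreadLocal<String>()
    val sameThread = Executor { it.run() } // runs tasks on the calling thread
    local.set("stale")                     // simulate a leaked value

    var seen: String? = "unset"
    ClearingExecutor(sameThread, local).execute { seen = local.get() }
    println(seen) // null
}
```

A guard like this only helps for code that is actually dispatched through the wrapped executor; anything that runs on a thread without passing through it is not covered, which matches the concern about GlobalScope.launch { } above.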

Author

@frost13it frost13it commented Apr 4, 2022

The reproducer for Ktor does not differ much:

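import java.util.concurrent.CompletableFuture
import io.ktor.routing.*
import io.ktor.server.engine.*
import io.ktor.server.netty.*
import kotlinx.coroutines.*
import kotlinx.coroutines.future.await
import kotlinx.coroutines.sync.Semaphore

val threadLocal = ThreadLocal<String>()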

fun main() {
    val engine = embeddedServer(Netty, port = 8080) {
        routing {
            get {
                doSomeJob()
                doSomeJob()
            }
        }
    }
    engine.start()
}

private suspend fun doSomeJob() {
    check(threadLocal.get() == null)
    withContext(threadLocal.asContextElement("foo")) {
        try {
            coroutineScope {
                val semaphore = Semaphore(1, 1)
                dummyAwait()
                cancel()
                semaphore.acquire()
            }
        } catch (e: CancellationException) {
            println("cancelled")
        }
    }
}

private suspend fun dummyAwait() {
    CompletableFuture.runAsync({ }, Dispatchers.Default.asExecutor()).await()
}

Member

@qwwdfsad qwwdfsad commented Apr 13, 2022

Thanks for both the Ktor and the regular reproducer!

The source of the issue is indeed the non-kotlinx.coroutines-related entry point that Ktor leverages in order to optimize its internal machinery (SuspendFunGun). #3155 fixed a completely different bug that happened to be reproducible with the very same snippet :)

I have a potential solution in mind (#3252) and also a future-proof plan to avoid similar problems (#3253). I believe this issue alone is enough to release 1.6.2 with a fix, though I cannot give you a strict timeline here.

qwwdfsad added a commit that referenced this issue Apr 14, 2022