
Editor : Fix deadlock caused by garbage collection of Settings #5893

Merged
merged 3 commits into 1.4_maintenance from garbageCollectionDeadlockWorkaround on Jun 14, 2024

Conversation

johnhaddon
Member

I believe this fixes #5877. Either of the two commits is sufficient on its own to avoid the hang. I think we definitely want to keep the first one, as it deals most directly with the cause. I'm in two minds about keeping the second one - have a read of the commit message and see what you think.

Would be good to get this into a release before anybody else gets hit by this problem...

@danieldresser-ie
Contributor

I don't feel I 100% understand the specifics of this, but the code all looks good to me.

As for your question on the second commit: I see what you mean about the message not being that useful compared to a stack trace. Since you've already written the code to detect it, though, one option would be to print the message but not return - you'd probably still get a hang, but the message would make it a lot clearer why you're getting it. The message could even contain instructions, or point to a webpage about how to collect a stack trace.

Oh, I just noticed the crash on linux-debug though, that's kinda concerning.

@johnhaddon
Member Author

> I just noticed the crash on linux-debug though, that's kinda concerning.

I'm pretty sure that's an unrelated problem - one we've seen sporadically for a while. I have a feeling it's more related to the testing than the reality, and I haven't found time to look into it yet.

@johnhaddon
Member Author

Pushed one more commit, which - as discussed offline - keeps the detection of deadlocks but removes the mitigation, so we have a greater chance of catching one should it ever occur again. Did it as a separate commit as I'm still in two minds...

@johnhaddon johnhaddon force-pushed the garbageCollectionDeadlockWorkaround branch from 86b65ed to 3d8dd6f Compare June 12, 2024 11:42
The problematic sequence of operations was this :

1. Destroy Editor. But the Settings node lives on, because it is a wrapped RefCounted object and hence requires garbage collection.
2. Start an unrelated BackgroundTask, which inadvertently triggers `IECore.RefCounted.collectGarbage()` on a background thread.
3. The Settings node is destroyed on the background thread by the garbage collection. All plugs are disconnected before destruction, including the `__scriptNode` plug.
4. The disconnections cause cancellation of background tasks associated with the ScriptNode, via `BackgroundTask::cancelAffectedTasks()`. Although the Settings node has no parent, the ScriptNode is still found due to the (about to be removed) connection to the `__scriptNode` plug.
5. `BackgroundTask::cancelAndWait()` never returns, because it is being called from the task's own thread.
6. The UI thread then waits for the task to finish, and we have a complete deadlock.

This is worked around by removing the `__scriptNode` plug connection on the main thread at the time the Editor is destroyed.
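The self-cancellation in steps 4-6 can be illustrated with a minimal, hypothetical Python sketch (not Gaffer's actual API): a function running on a worker thread tries to wait on its own thread, which is exactly the wait that can never complete. Python's `Thread.join()` happens to detect the self-join and raise rather than hang, which makes the hazard visible without deadlocking the example.

```python
import threading

result = {}

def background_task():
    # Steps 3-5 in miniature: teardown running on this worker thread tries
    # to cancel-and-wait on the very task that is executing it. In C++ this
    # wait would never return; Python's join() detects the self-join.
    try:
        threading.current_thread().join()  # cancelAndWait() from the task's own thread
    except RuntimeError as e:
        result["error"] = str(e)  # "cannot join current thread"

t = threading.Thread(target=background_task)
t.start()
t.join()
print(result["error"])
```

The C++ equivalent has no such built-in guard, which is why the process simply hangs.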

Why is this only happening now? Because we only introduced the Settings node and the `__scriptNode` plug mechanism recently in 830de76.

But we have always had lots of other Python-derived nodes that require garbage collection, so why weren't _they_ causing problems? Because when they are collected, they will have no parent, and the standard way of finding the ScriptNode for cancellation is to look for a ScriptNode ancestor. The special case using the `__scriptNode` plug only applies to the Settings node.
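The two lookup paths can be sketched with hypothetical classes (not Gaffer's API): a parentless node defeats the ancestor walk, so cancellation is never triggered for ordinary collected nodes, whereas the Settings node's `__scriptNode` connection still reaches the ScriptNode.

```python
class Node:
    def __init__(self, parent=None):
        self.parent = parent

class ScriptNode(Node):
    pass

def ancestor_script_node(node):
    # Standard path: walk up the parent chain looking for a ScriptNode.
    while node is not None:
        if isinstance(node, ScriptNode):
            return node
        node = node.parent
    return None

script = ScriptNode()

collected = Node(parent=None)  # typical Python-derived node at collection time
print(ancestor_script_node(collected))  # None -> no cancellation, no deadlock

settings = Node(parent=None)           # Settings node: also parentless, but...
settings.script_plug_source = script   # ...the `__scriptNode` connection remains
print(settings.script_plug_source is script)  # True -> ScriptNode still found
```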

Longer term it would be good to come up with a better mechanism than the `__scriptNode` plug, but I think this is a sufficient workaround in the meantime.

Fixes #5877
This would also have avoided the deadlock fixed in the previous commit. I'm not sure if it's worth keeping or not. On the one hand, it would identify similar problems in future without exposing the user to hangs. But on the other hand, the warning makes it harder to trace the exact source of the problem, whereas a hung process makes that relatively easy with `eu-stack`.
But keep detection, so in the unlikely event this happens we can prompt the user to provide a useful bug report.
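The kept detection could look something like this sketch (the names `Task` and `cancelAndWait` mirror the discussion but are assumptions, not Gaffer's real implementation): record which thread runs the task, and if `cancelAndWait()` is invoked from that same thread, report the would-be deadlock instead of waiting.

```python
import threading

class Task:
    def __init__(self):
        self._thread = None
        self._done = threading.Event()
        self._cancelled = threading.Event()

    def run(self, fn):
        def wrapper():
            fn(self)
            self._done.set()
        self._thread = threading.Thread(target=wrapper)
        self._thread.start()

    def cancelAndWait(self):
        self._cancelled.set()
        if self._thread is threading.current_thread():
            # Waiting here could never complete: the task can't finish while
            # its own thread is blocked. Report it instead of hanging, so the
            # user can provide a useful bug report.
            return "deadlock detected: cancelAndWait() called from the task's own thread"
        self._thread.join()
        return "task finished"

task = Task()
messages = []

def body(t):
    # Simulate step 4: teardown on the worker thread cancels the task itself.
    messages.append(t.cancelAndWait())

task.run(body)
task._done.wait()
print(messages[0])
```

Calling `cancelAndWait()` from any other thread still waits normally; only the self-cancellation case takes the reporting path.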
@murraystevenson murraystevenson force-pushed the garbageCollectionDeadlockWorkaround branch from 3d8dd6f to 51a4f99 Compare June 14, 2024 18:17
@murraystevenson
Contributor


Thanks! We'll go with the detection of deadlocks but no mitigation. I've tweaked Changes.md to reflect that as an Improvement as we still had the mitigation listed under API, then rebased to fix the inevitable Changes.md merge conflict...

@murraystevenson murraystevenson merged commit ff5358d into 1.4_maintenance Jun 14, 2024
5 checks passed
@johnhaddon johnhaddon deleted the garbageCollectionDeadlockWorkaround branch June 20, 2024 08:34