fixes #1827 Windows a deadlock on nng_close() #1828
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## master #1828 +/- ##
==========================================
+ Coverage 79.32% 79.38% +0.05%
==========================================
Files 95 95
Lines 21491 21484 -7
==========================================
+ Hits 17048 17054 +6
+ Misses 4443 4430 -13
Please have another go with this branch -- I made another commit which I hope will help.
Well, that didn't work as well as hoped. Seems that the read/write cbs are also here.
Ah, reaping is needed because we are in the callback when we fail. And it's interesting that this happens consistently for IPC, so that suggests that I'm on the right path.
When closing pipes, we defer them to be reaped, but also leave them in the match list, where they might be picked up by ep_match, or leak. It's best to reap these proactively and ensure that they are not allowed to live any longer once they have errored during the negotiation phase.
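The idea in this commit message can be sketched in isolation. The following is a minimal illustration only, not NNG's actual code: the `demo_pipe` and `demo_ep` types and every `demo_*` function are hypothetical stand-ins, and "reaping" is reduced to setting a flag. It shows a failed pipe being unlinked from the pending list before it is handed off, so a later match cannot return it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical pipe that is still negotiating. */
typedef struct demo_pipe {
    struct demo_pipe *next;
    bool              failed;
    bool              reaped; /* stand-in for "handed to the reaper" */
} demo_pipe;

/* Hypothetical endpoint holding pipes that are candidates for matching. */
typedef struct {
    demo_pipe *pending;
} demo_ep;

static void
demo_ep_add(demo_ep *ep, demo_pipe *p)
{
    p->next     = ep->pending;
    ep->pending = p;
}

/* Remove p from the pending list; returns true if it was present. */
static bool
demo_ep_unlink(demo_ep *ep, demo_pipe *p)
{
    for (demo_pipe **pp = &ep->pending; *pp != NULL; pp = &(*pp)->next) {
        if (*pp == p) {
            *pp     = p->next;
            p->next = NULL;
            return (true);
        }
    }
    return (false);
}

/* On a negotiation error: unlink first, then reap. */
static void
demo_pipe_fail(demo_ep *ep, demo_pipe *p)
{
    p->failed = true;
    (void) demo_ep_unlink(ep, p); /* the matcher can no longer see it */
    p->reaped = true;             /* assumed: real code defers this */
}

/* Take the first pending pipe, or NULL if none remain. */
static demo_pipe *
demo_ep_match(demo_ep *ep)
{
    demo_pipe *p = ep->pending;
    if (p != NULL) {
        ep->pending = p->next;
        p->next     = NULL;
    }
    return (p);
}
```

With two pipes added and one failed, `demo_ep_match` should only ever hand back the healthy one.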
(Another go, restoring the reaping..)
I think it does not do what is expected as I see in the debugger that
there are two types of crashes here: one in from: https://en.wikipedia.org/wiki/Magic_number_(programming)#Debug_values
so it seems the
from my observations, the problem occurs when
@alzix thanks for the analysis. I will try to get to the bottom of this soon ... I've just been completely swamped with $dayjob.
Definitely a use-after-free.
This is very definitely Windows specific. It may impact TCP as well, but the callback structure here is used with overlapped I/O (a Windows thing).
So I guess the send_cb is somehow still running. I'm still trying to get to the bottom of this, because I would not expect that there are any posted I/Os at that point.
added some info in PR #1831 (comment)
According to https://learn.microsoft.com/en-us/windows/win32/fileio/canceling-pending-i-o-operations
perhaps this is the case?
Then the driver should continue to completion which would be fine. But Windows named pipes and TCP both support cancellation. The problem is a defect in my logic, not missing Windows functionality. I'm still working to get to the bottom of it -- I thought I had understood it but clearly I was missing something.
My current theory is that, for some reason that I don't yet fully understand, we have code waiting on the condition without the closing flag having been set. (Possibly this is a synchronization failure, since s_closing is changed while not protected by the global lock.)
At any rate, the attempt to avoid the cost of a wake up here is silly, as pthread_cond_broadcast (and one assumes other variants like the Windows implementation to which I don't have source) are nearly free when there are no waiters. (Pthreads uses a relaxed order memory read to look for waiters, so no barrier is involved.)
So we can just do the wake unconditionally.
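As a rough illustration of the unconditional wake (a sketch with POSIX threads, not the actual NNG change): the `demo_sock` type and all `demo_*` functions below are hypothetical, and the real code goes through NNG's own platform layer. The releasing side broadcasts every time it drops a reference, with no check of a closing flag, so a waiter cannot be missed because that flag was read or written unsafely.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical socket-like object; not NNG's real structure. */
typedef struct {
    pthread_mutex_t mtx;
    pthread_cond_t  cv;
    bool            closing;
    int             refs; /* outstanding users the closer waits for */
} demo_sock;

static void
demo_sock_init(demo_sock *s, int refs)
{
    pthread_mutex_init(&s->mtx, NULL);
    pthread_cond_init(&s->cv, NULL);
    s->closing = false;
    s->refs    = refs;
}

/* Drop one reference and wake any waiter, unconditionally. */
static void
demo_sock_rele(demo_sock *s)
{
    pthread_mutex_lock(&s->mtx);
    s->refs--;
    /* Deliberately no "if (s->closing)" guard around the wake. */
    pthread_cond_broadcast(&s->cv);
    pthread_mutex_unlock(&s->mtx);
}

/* Mark the socket closing and wait until every reference is gone. */
static void
demo_sock_close(demo_sock *s)
{
    pthread_mutex_lock(&s->mtx);
    s->closing = true;
    while (s->refs > 0) {
        pthread_cond_wait(&s->cv, &s->mtx);
    }
    pthread_mutex_unlock(&s->mtx);
}

static void *
demo_user_thread(void *arg)
{
    demo_sock_rele(arg);
    return (NULL);
}

/* Run one closer against one user; returns the final reference count. */
static int
demo_run(void)
{
    demo_sock s;
    pthread_t tid;
    int       rv;

    demo_sock_init(&s, 1);
    rv = pthread_create(&tid, NULL, demo_user_thread, &s);
    assert(rv == 0);
    demo_sock_close(&s);
    pthread_join(tid, NULL);
    rv = s.refs;
    pthread_cond_destroy(&s.cv);
    pthread_mutex_destroy(&s.mtx);
    return (rv);
}
```

Calling `demo_run()` from a `main` exercises both orderings over repeated runs: if the user thread releases first, the closer never waits; if the closer waits first, the unconditional broadcast wakes it.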
I'd appreciate it if folks who are encountering the problem could tell me whether this change resolves it for them.