Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fixes #1827 Windows a deadlock on nng_close() #1828

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

gdamore
Copy link
Contributor

@gdamore gdamore commented Apr 25, 2024

My current theory is that for some reason that I don't yet fully understand, we have code waiting in the condition that didn't set the closing. (Possibly the failure is a synchronization since s_closing is changed while not protected by the global lock.)

At any rate, the attempt to avoid the cost of a wake up here is silly, as pthread_cond_broadcast (and one assumes other variants like the Windows implementation to which I don't have source) are nearly free when there are no waiters. (Pthreads uses a relaxed order memory read to look for waiters, so no barrier is involved.)

So we can just do the wake unconditionally.

I'd appreciate it if folks who are encountering the problem can tell me if this change resolves for them.

Copy link

codecov bot commented Apr 25, 2024

Codecov Report

Attention: Patch coverage is 81.81818% with 2 lines in your changes are missing coverage. Please review.

Project coverage is 79.38%. Comparing base (aac4dc3) to head (1ab81ce).

Files Patch % Lines
src/sp/transport/tcp/tcp.c 0.00% 1 Missing ⚠️
src/sp/transport/tls/tls.c 0.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1828      +/-   ##
==========================================
+ Coverage   79.32%   79.38%   +0.05%     
==========================================
  Files          95       95              
  Lines       21491    21484       -7     
==========================================
+ Hits        17048    17054       +6     
+ Misses       4443     4430      -13     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@alzix
Copy link
Contributor

alzix commented Apr 25, 2024

the issue is still reproducible on this branch.
image

@gdamore
Copy link
Contributor Author

gdamore commented Apr 27, 2024

Please have another go with this branch -- another commit made which I hope will help.

@gdamore
Copy link
Contributor Author

gdamore commented Apr 27, 2024

Well, that didn't work as well as hoped. Seems that the read/write cbs are also here.

@gdamore
Copy link
Contributor Author

gdamore commented Apr 27, 2024

Ah reaping is needed because we are in the callback when we fail. And its interesting that this happens consistently for IPC, so that suggests that I'm on the right path.

When closing pipes, we defer them to be reaped, but also leave
them in the match list where they might be picked up by ep_match,
or leak.  It's best to reap these proactively and ensure that they
are not allowed to life longer once they have errored during the
negotiation phase.
@gdamore
Copy link
Contributor Author

gdamore commented Apr 27, 2024

(Another go, restoring the reaping..)

@alzix
Copy link
Contributor

alzix commented Apr 28, 2024

i was not able to reproduce the original issue anymore, but I cannot get a decent number of iterations as the server is crashing on nni_list_node_remove as was previously reported
image

image

@alzix
Copy link
Contributor

alzix commented Apr 28, 2024

win_ipcconn.c:229 in ipc_send_cb there is a check:

if ((aio = nni_list_first(&c->send_aios)) == NULL) {
	// Should indicate that it was closed.
	nni_mtx_unlock(&c->mtx);
	return;
}

I think it does not do what is expected as I see in the debugger that c->closed == true

@alzix
Copy link
Contributor

alzix commented Apr 28, 2024

there are two types of crashes here: one in ipc_send_cb and the other in ipc_recv_cb. Both occur on close.
Based on my observation in these cases the aio object is malformed, and later leads to a crash.
image
either memory was not properly initialized or some other thread overwrote it.

from: https://en.wikipedia.org/wiki/Magic_number_(programming)#Debug_values

0xDDDDDDDD pattern is used by Microsoft's C/C++ debug free() function to mark freed heap memory

so it seems the aio contains dangling pointers...

@alzix
Copy link
Contributor

alzix commented Apr 29, 2024

from my observations, the problem occurs when ipc_recv_cb and/or ipc_send_cb are executed after nni_sock_shutdown

@gdamore
Copy link
Contributor Author

gdamore commented May 3, 2024

@alzix thanks for the analysis. I will try to get to the bottom of this soon ... I've just been completely swamped with $dayjob.

@gdamore
Copy link
Contributor Author

gdamore commented May 3, 2024

Definitely a use-after-free.

@gdamore
Copy link
Contributor Author

gdamore commented May 5, 2024

This is very definitely windows specific. It may impact TCP as well, but the callback structure here is used with overlapped IO (a Windows thing.)

@gdamore
Copy link
Contributor Author

gdamore commented May 22, 2024

So I guess the send_cb is somehow still running. I'm still trying to get to the bottom of this, because I would not expect that there are any posted I/Os at that point.

@itayzafrir
Copy link

added some info in PR #1831 (comment)

@alzix
Copy link
Contributor

alzix commented May 22, 2024

So I guess the send_cb is somehow still running. I'm still trying to get to the bottom of this, because I would not expect that there are any posted I/Os at that point.

According to https://learn.microsoft.com/en-us/windows/win32/fileio/canceling-pending-i-o-operations

There is no guarantee that underlying drivers correctly support cancellation.

perhaps this is the case?

@gdamore
Copy link
Contributor Author

gdamore commented May 25, 2024

So I guess the send_cb is somehow still running. I'm still trying to get to the bottom of this, because I would not expect that there are any posted I/Os at that point.

According to https://learn.microsoft.com/en-us/windows/win32/fileio/canceling-pending-i-o-operations

There is no guarantee that underlying drivers correctly support cancellation.

perhaps this is the case?

Then the driver should continue to completion which would be fine. But Windows named pipes and TCP both support cancellation. The problem is a defect in my logic, not missing Windows functionality. I'm still working to get to the bottom of it -- I thought I had understood it but clearly I was missing something.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants