nvme/rdma: rdma_destroy_id deadlock when removing nvme controller #3347
Comments
In this code, if CM event processing times out, the qpair is recycled directly. In that case, will rdma_destroy_id keep waiting?
@jimharris @AlekseyMarchuk When the deadlock occurs, events_completed is 4 while ucma_destroy_kern_id returns 5, so the thread enters the condition wait. If the remaining event is only delivered afterwards, the thread deadlocks. I don't know which CM event is outstanding; normally the last event is RDMA_CM_EVENT_TIMEWAIT_EXIT. In addition, I found that nvme_rdma_poll_group_process_completions processes disconnected_qpairs first and only then processes the other CM events. Is it possible that the event is processed after rdma_destroy_id? Finally, the combination of vhost v24.01 and target v24.01 does not hit this problem; only the combination of vhost v23.01 and target v23.01 does.
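For context on why the thread parks forever: librdmacm's rdma_destroy_id first asks the kernel to destroy the id, and the kernel replies with the number of CM events it delivered for that id; user space then waits until the same number of events have been acknowledged with rdma_ack_cm_event. A simplified paraphrase of that logic (based on rdma-core's cma.c; internal names and details vary between versions, so treat this as a sketch rather than verbatim source):

```c
/* Simplified paraphrase of rdma_destroy_id() from rdma-core's cma.c.
 * cma_id_private, ucma_destroy_kern_id and ucma_free_id are internal
 * to librdmacm; this sketch is not compilable standalone. */
int rdma_destroy_id(struct rdma_cm_id *id)
{
	struct cma_id_private *id_priv =
		container_of(id, struct cma_id_private, id);
	int events_reported;

	/* Ask the kernel to destroy the id; the return value is the
	 * number of CM events the kernel delivered for this id. */
	events_reported = ucma_destroy_kern_id(id->channel->fd, id_priv->handle);
	if (events_reported < 0)
		return events_reported;

	/* Wait until every delivered event has been acknowledged via
	 * rdma_ack_cm_event(). In the hang described above,
	 * events_reported == 5 but events_completed == 4, so this
	 * condition wait never returns. */
	pthread_mutex_lock(&id_priv->mut);
	while (id_priv->events_completed < events_reported)
		pthread_cond_wait(&id_priv->cond, &id_priv->mut);
	pthread_mutex_unlock(&id_priv->mut);

	ucma_free_id(id_priv);
	return 0;
}
```

So the hang means exactly one CM event for that id was delivered by the kernel but never acknowledged in user space.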
I can't identify a single commit that would fix this problem; there were several commits related to the qpair disconnect process. What was the backtrace when the issue happened? Just to double-check: v24.01 works fine?
The backtrace I can see is:
I ran the test for 3 days in the v24.01 environment, and this problem did not occur.
Thank you for your reply. I'll check the behaviour of nvme_rdma_qpair_wait_until_quiet in v23 and v24 later this week.
I encountered the same problem with SPDK v23.01. I think the reason is that the CM channel is shared between all qpairs, so the CM events for one I/O qpair may be polled by any other I/O thread, as the sketch below illustrates.
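To make the race concrete, here is a hypothetical two-thread sketch (handle_event and the threading layout are made up for illustration; this is not SPDK code): when the event channel is shared, the polling thread can dequeue the last CM event for a qpair that another thread is simultaneously destroying. If that event is dropped instead of acknowledged, events_completed for the id never catches up and rdma_destroy_id blocks forever.

```c
#include <rdma/rdma_cma.h>

/* Hypothetical per-qpair event dispatcher. */
void handle_event(void *qpair_ctx, struct rdma_cm_event *event);

/* Thread A: polls the event channel shared by all qpairs. */
void poll_shared_channel(struct rdma_event_channel *channel)
{
	struct rdma_cm_event *event;

	while (rdma_get_cm_event(channel, &event) == 0) {
		/* The event may belong to a qpair owned by a different
		 * thread that is concurrently tearing down its cm_id. */
		handle_event(event->id->context, event);

		/* If this ack is skipped -- e.g. because the owning
		 * qpair was already recycled -- events_completed for
		 * that id never reaches events_reported. */
		rdma_ack_cm_event(event);
	}
}

/* Thread B: destroys one qpair's cm_id. This blocks until every event
 * the kernel delivered for the id has been acked, including any event
 * sitting unacked in thread A at that moment. */
void destroy_one_qpair(struct rdma_cm_id *id)
{
	rdma_destroy_id(id);
}
```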
@dwgseu Thanks for your comment! You are right, that is a race condition. But normally cm_events are polled under the controller lock and …
@AlekseyMarchuk Thanks a lot! Your comment reminded me; I reorganized the code logic and, as you said, …
The fix can be found here: https://review.spdk.io/gerrit/c/spdk/spdk/+/23324
@AlekseyMarchuk Thanks, let me test it in our environment.
@AlekseyMarchuk I tested with the same reproduction steps in my environment, and this patch works well.
This patch works well; I'm closing this issue now.
Sighting report
When I delete the NVMe controller on the vhost, vhost often gets stuck.
vhost has the following logs:
Through gdb, I saw that the reactor_2 thread was stuck. It seemed that the rdma_destroy_id function was deadlocked.
Expected Behavior
nvme controller can be deleted successfully and vhost runs normally.
Current Behavior
A vhost thread is stuck.
Possible Solution
Before rdma_destroy_id is called, there are qpairs that have not yet disconnected, or CM events that have not yet been processed, which causes rdma_destroy_id to block forever; see the sketch below.
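A minimal sketch of the drain-before-destroy idea (this is not the actual fix from the Gerrit change linked above; drain_and_destroy is a hypothetical helper, and it assumes the channel fd can be switched to non-blocking mode):

```c
#include <fcntl.h>
#include <rdma/rdma_cma.h>

/* Hypothetical helper: acknowledge every pending CM event on the
 * channel before destroying the id, so that events_completed can
 * catch up with events_reported and rdma_destroy_id() cannot park. */
static void drain_and_destroy(struct rdma_event_channel *channel,
			      struct rdma_cm_id *id)
{
	struct rdma_cm_event *event;

	/* Make the channel non-blocking so the drain loop terminates. */
	fcntl(channel->fd, F_SETFL,
	      fcntl(channel->fd, F_GETFL) | O_NONBLOCK);

	/* rdma_get_cm_event() fails with EAGAIN once the queue is empty. */
	while (rdma_get_cm_event(channel, &event) == 0)
		rdma_ack_cm_event(event);

	rdma_destroy_id(id);
}
```

Note that with a channel shared across qpairs, the drained events would still have to be dispatched to their owning qpairs rather than merely acknowledged, so a real fix also has to restructure the event routing.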
Steps to Reproduce
Context (Environment including OS version, SPDK version, etc.)
OS Version: AlmaLinux release 9.1 (Lime Lynx)
Kernel Version: 5.14.0-162.23.1.el9
NVMf Target Version: v23.01
Vhost Version: v24.01