Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid device_watcher deadlock after reset #11933

Closed

Conversation

Arun-Prasad-V
Copy link
Contributor

Fixing: #10482

Issue:

  • Deadlock happens after reset

Root cause:

  • After hardware_reset, when the device class object is created, it detects that the devices are changed (due to reset) and triggers the on_device_changed() callback
    • Before this callback getting executed, the class object goes out of scope and the destructor ~device() is getting called in different thread
  • When both on_device_changed() callback and destructor ~device() runs at same time in different threads, deadlock happens as both of their call stacks require same set of mutexes but in different order
    • First thread: (Dispatcher thread)
      • Acquired the _dispatch_mutex at line and following the callstack it calls the on_device_changed() callback and waits for mutex _devices_changed_callbacks_mtx at line
    • Second thread:
      • When the device object goes out of scope, ~device() destructor is called and then, it calls unregister_internal_device_callback() function at line
        • There it acquires _devices_changed_callbacks_mtx at line
      • Then it calls device_watcher->stop(), which calls dispatcher::stop()
        • Here it waits for _dispatch_mutex at line

Proposed Fix:

  • To have a timed mutex lock in on_device_changed() callback. If the timeout happens, this callback can be stopped - which will in turn resumes the other thread as well

@Nir-Az Nir-Az requested a review from maloel June 22, 2023 12:46
@Milnaye
Copy link

Milnaye commented Aug 10, 2023

Hi Arun,

I tried this fix on our robot, based on librealsense v2.53.1.
For the reset part, it effectively work, I didn't see deadlock in multiples tests. However sometimes it prevent the recovery of disconnected cameras by realsense-ros.

As a reminder, here is the testing conditions (scenario 1) : #10517 (comment)
I tried both with and without this patch, nothing else was different. The issue happens only while using this fix but it's not systematic.

I join examples of log, when this issue happen and when not :
with_issue.log
without_issue.log

For what I understand, the library seems to "miss" the disconnection of the camera (which might be linked to the timeout you added on some mutex), and so it never searches for it again as it only thinks it failed to access it.

Also, we don't intend to unplug/replug our camera while using our robot, but it will be active for long periods of time without human intervention and we had bad experiences with cables, so we don't want a poor connection to cause issues.

@Arun-Prasad-V
Copy link
Contributor Author

@Milnaye , Thanks for trying this out. We will check and get back. Thanks.

cc: @Nir-Az

@maloel
Copy link
Collaborator

maloel commented Oct 3, 2023

Hi @Arun-Prasad-V

This code is currently undergoing some major changes: the device will no longer use the context's callbacks and therefore the deadlock should be gone. I hope this will make this PR unnecessary.

I can update here once the change is made, if you'd like (and assuming you are willing to use our development branch -- it will not be a single PR that you can simply cherry-pick). Otherwise the next release is likely a few months away.

I should have looked at your PR earlier, so sorry about neglecting this. If you're OK with the above, can I close the PR?

@Arun-Prasad-V
Copy link
Contributor Author

Thanks, @maloel. That makes sense.

Closing this PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants