Avoid taking open handler mutex from request manager failure. #8130

Merged
merged 1 commit into from Mar 7, 2015

@bbockelm bbockelm commented Mar 7, 2015

Observed deadlock:

Thread 1:

  • FileTimer::Run holds FileTimer::pMutex
  • FileStateHandler::Tick wants to take the FileStateHandler::pMutex

Thread 2:

  • FileStateHandler::OnStateError holds the FileStateHandler::pMutex lock
  • RequestManager::requestFailure calls OpenHandler::current_source,
    which wants the OpenHandler::m_mutex

Thread 3:

  • OpenHandler::HandleResponseWithHosts holds OpenHandler::m_mutex,
  • ~FileStateHandler calls FileTimer::UnRegisterFileObject which tries
    to get the FileTimer::pMutex.
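
The three threads above form a closed wait-for cycle. As a minimal sketch (not code from this patch), the cycle can be checked mechanically by recording, for each thread, the mutex it holds and the mutex it wants. `ThreadState`, `hasLockCycle`, and `observedThreads` are hypothetical names introduced only for this illustration, and the check assumes each mutex has a single holder.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical model: one entry per thread, naming the mutex it holds
// and the mutex it is blocked on.
struct ThreadState {
    std::string holds;
    std::string wants;
};

// Returns true if following "holder of X waits for Y" edges leads back
// to the starting mutex. Assumes each mutex has at most one holder.
bool hasLockCycle(const std::vector<ThreadState>& threads) {
    // Map each held mutex to the mutex its holder is waiting for.
    std::map<std::string, std::string> waitsFor;
    for (const auto& t : threads) {
        waitsFor[t.holds] = t.wants;
    }

    for (const auto& t : threads) {
        std::string current = t.holds;
        // A cycle through N threads closes within N hops.
        for (std::size_t hop = 0; hop < threads.size(); ++hop) {
            auto next = waitsFor.find(current);
            if (next == waitsFor.end()) {
                break;  // nobody holds this mutex, so the chain ends
            }
            current = next->second;
            if (current == t.holds) {
                return true;
            }
        }
    }
    return false;
}

// The three threads from the observed deadlock.
std::vector<ThreadState> observedThreads() {
    return {
        {"FileTimer::pMutex", "FileStateHandler::pMutex"},
        {"FileStateHandler::pMutex", "OpenHandler::m_mutex"},
        {"OpenHandler::m_mutex", "FileTimer::pMutex"},
    };
}
```

In this model, clearing the `wants` entry of Thread 2 (the effect of no longer calling OpenHandler::current_source) removes one edge, and the cycle no longer closes.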

We remove the call to OpenHandler::current_source to break the deadlock.

If a file-open is in progress, we cannot take the open handler
mutex from within RequestManager::requestFailure.

It is safe to call XrdAdaptor::RequestManager::OpenHandler::open
from within the requestFailure callback; if the file-open was in progress, it will return the shared
future and not touch Xrootd code. If the file-open was not in progress, it
is safe to take the open handler mutex in the first place.
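
To illustrate the shared-future behaviour described above, here is a minimal sketch. `ToyOpenHandler` is a hypothetical stand-in, not the XrdAdaptor class: it is single-threaded, omits all locking, and resolves to a plain string rather than a source object. It only shows the idea that a second `open()` call made while an open is pending returns the same future instead of starting new work.

```cpp
#include <future>
#include <string>

// Hypothetical, single-threaded stand-in for an open handler.
// Synchronization is deliberately left out of this sketch.
class ToyOpenHandler {
public:
    // Returns the pending future if an open is already in progress;
    // otherwise starts a new open and returns its future.
    std::shared_future<std::string> open() {
        if (m_inProgress) {
            return m_future;  // reuse: no new open is issued
        }
        m_promise = std::promise<std::string>();
        m_future = m_promise.get_future().share();
        m_inProgress = true;
        ++m_opensStarted;
        return m_future;
    }

    // Models the asynchronous open response arriving.
    void complete(const std::string& source) {
        m_promise.set_value(source);
        m_inProgress = false;
    }

    // Number of opens actually started (not the number of open() calls).
    int opensStarted() const { return m_opensStarted; }

private:
    std::promise<std::string> m_promise;
    std::shared_future<std::string> m_future;
    bool m_inProgress = false;
    int m_opensStarted = 0;
};
```

In this sketch, two callers that invoke `open()` before `complete()` both wait on one result, and only one open is counted as started.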

See also #8129, which also fixes the deadlock quoted above. However, this patch has the advantage of not trying to acquire OpenHandler::m_mutex while OpenHandler::HandleResponseWithHosts is running and calling into the Xrootd library.

bbockelm commented Mar 7, 2015

@Dr15Jones here's the counterpart to #8129.

Still trying to figure out how to completely avoid taking this lock from both inside and outside an Xrootd callback. I'm at FNAL on Monday/Tuesday; I'll see if I can sneak out of meetings and brainstorm.

cmsbuild commented Mar 7, 2015

A new Pull Request was created by @bbockelm (Brian Bockelman) for CMSSW_7_5_X.

Avoid taking open handler mutex from request manager failure.

It involves the following packages:

Utilities/XrdAdaptor

@cmsbuild, @Dr15Jones, @ktf, @nclopezo can you please review it and eventually sign? Thanks.
@Martin-Grunewald, @wddgit this is something you requested to watch as well.
You can sign-off by replying to this message having '+1' in the first line of your reply.
You can reject by replying to this message having '-1' in the first line of your reply.
If you are a L2 or a release manager you can ask for tests by saying 'please test' in the first line of a comment.
@nclopezo, @ktf you are the release manager for this.
You can merge this pull request by typing 'merge' in the first line of your comment.

@Dr15Jones

Please test

@Dr15Jones

We also need this for 7_4

cmsbuild commented Mar 7, 2015

The tests are being triggered in jenkins.

cmsbuild commented Mar 7, 2015

@Dr15Jones

+1

cmsbuild commented Mar 7, 2015

This pull request is fully signed and it will be integrated in one of the next CMSSW_7_5_X IBs unless changes (tests are also fine). This pull request requires discussion in the ORP meeting before it's merged. @davidlange6, @nclopezo, @ktf, @smuzaffar

@davidlange6

+1
