
Allow FileSystem queries to occur from callback threads. #13080

Merged
merged 4 commits into cms-sw:CMSSW_8_0_X on Jan 29, 2016

Conversation

bbockelm
Contributor

To query a filesystem property, a free callback thread is needed
with the default synchronous versions of XrdCl::FileSystem methods.

Thus, if the synchronous method is called from a callback thread,
then there must be an idle thread to handle the response; if there
are N threads in the pool and N simultaneous queries, then the
filesystem query would cause a deadlock.

Worse yet, the timeout mechanism for XrdCl relies on an idle callback
thread to function.

As making the synchronous query asynchronous in its current use case
is quite hard, I only partially solved the problem: I wrote a new
response handler that can time out without an idle thread.

Technically, any version of CMSSW 7_6 or later should be affected. Pragmatically, do we actually open multiple files simultaneously except in the threaded mixing module?

@cmsbuild
Contributor

A new Pull Request was created by @bbockelm (Brian Bockelman) for CMSSW_8_0_X.

It involves the following packages:

Utilities/XrdAdaptor

@cmsbuild, @smuzaffar, @Dr15Jones, @davidlange6 can you please review it and eventually sign? Thanks.
@Martin-Grunewald, @wddgit this is something you requested to watch as well.
@slava77, @Degano, @smuzaffar you are the release manager for this.

cms-bot commands are listed here: #13028

@Dr15Jones
Contributor

please test

@cmsbuild
Contributor

The tests are being triggered in jenkins.
https://cmssdt.cern.ch/jenkins/job/ib-any-integration/10754/console

// m_mutex protects m_status
std::unique_lock<std::mutex> guard(m_mutex);
// On exit from the block, make sure m_status is set; it needs to be set before we notify threads.
std::unique_ptr<char, std::function<void(char*)>> exit_guard(nullptr, [&](char *) {m_status.reset(new XrdCl::XRootDStatus(XrdCl::stError, XrdCl::errInternal));});
Contributor

I didn't think unique_ptr allowed a delete handler.

Contributor Author

Yup, it does!

Since I want to release() the data later, I felt it was appropriate here.

@Dr15Jones
Contributor

Why not make m_status a std::atomic<...*>? If you did that, you could use a non-null value to tell you the response was sent and m_response is OK.

@cmsbuild
Contributor

-1

Tested at: aa0a0b7
I found errors in the following addon tests:

cmsDriver.py RelVal -s L1REPACK:GT2 --data --scenario=HeavyIons -n 10 --conditions auto:run2_hlt_HIon --relval 9000,50 --datatier "RAW" --eventcontent RAW --customise=HLTrigger/Configuration/CustomConfigs.L1T --era Run2_HI --magField 38T_PostLS1 --fileout file:RelVal_Raw_HIon_DATA.root --filein /store/hidata/HIRun2015/HIHardProbes/RAW-RECO/HighPtJet-PromptReco-v1/000/263/689/00000/1802CD9A-DDB8-E511-9CF9-02163E0138CA.root : FAILED - time: date Tue Jan 26 23:11:16 2016-date Tue Jan 26 23:10:36 2016 s - exit: 23552
cmsRun /afs/cern.ch/cms/sw/ReleaseCandidates/vol1/slc6_amd64_gcc493/cms/cmssw-patch/CMSSW_8_0_X_2016-01-26-1100/src/HLTrigger/Configuration/test/OnLine_HLT_HIon.py realData=True globalTag=@ inputFiles=@ : FAILED - time: date Tue Jan 26 23:11:16 2016-date Tue Jan 26 23:10:36 2016 s - exit: 21504
cmsDriver.py RelVal -s HLT:HIon,RAW2DIGI,L1Reco,RECO --data --scenario=HeavyIons -n 10 --conditions auto:run2_data_HIon --relval 9000,50 --datatier "RAW-HLT-RECO" --eventcontent FEVTDEBUGHLT --customise=HLTrigger/Configuration/CustomConfigs.L1THLT --era Run2_HI --magField 38T_PostLS1 --processName=HLTRECO --filein file:RelVal_Raw_HIon_DATA.root --fileout file:RelVal_Raw_HIon_DATA_HLT_RECO.root : FAILED - time: date Tue Jan 26 23:11:16 2016-date Tue Jan 26 23:10:36 2016 s - exit: 21504

you can see the results of the tests here:
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-13080/10754/summary.html


@bbockelm
Contributor Author

@Dr15Jones - I considered the atomic. However, we already need to have the mutex there for the condition variable. Further:

  • There's essentially no contention (it's only synchronizing between two threads).
  • There has to be a network round trip anyway, meaning there will be little contention on the mutex.
  • It simplifies the implementation of HandleResponse. (Honestly, this was the driving reason: with raw pointers, exception safety got complicated.)

@bbockelm
Contributor Author

For the static check warnings - I think these are all "correct" and may just need to be annotated. What are the annotations I should use?

Alternatively, there might be some way to tweak the singletons to make them "more correct". Chris, ideas?

Also, I notice that catch(...) is banned by the static analyzer. I use this to try to make more coherent error messages. What's the more-correct suggestion?

@Martin-Grunewald
Contributor

please test
(HIon problem should be solved)

@cmsbuild
Contributor

The tests are being triggered in jenkins.
https://cmssdt.cern.ch/jenkins/job/ib-any-integration/10765/console

All "real" state is kept in a sub-object; the QueryAttrHandler keeps
a weak_ptr to the "real" state.  While the main thread is waiting,
it keeps a reference to the shared_ptr for the state, keeping the
weak_ptr alive.

If the timeout occurs, the shared_ptr is dropped and the weak_ptr
cannot be locked.

The QueryAttrHandler is always deleted by the callback if it is
successfully registered.
Previously, the presence of a pointer in m_file indicated an open-
file request was outstanding.  However, after the file was opened,
the unique_ptr was moved to the Source object (which does a synchronous
network query in its constructor).  Hence, it appeared there was
no outstanding file-open and a second one could start.

Now, we keep a separate boolean flag that is only set to false when
the HandleResponse method exits.
@cmsbuild
Contributor

Pull request #13080 was updated. @cmsbuild, @smuzaffar, @Dr15Jones, @davidlange6 can you please check and sign again.

@bbockelm
Contributor Author

The update fixes the obvious build issue (sorry! I refactored a line after testing... guess which one) and tackles the refactor suggested by Chris.

@Dr15Jones
Contributor

please test

@cmsbuild
Contributor

The tests are being triggered in jenkins.
https://cmssdt.cern.ch/jenkins/job/ib-any-integration/10848/console

@@ -246,9 +355,7 @@ Source::getXrootdSiteFromURL(std::string url, std::string &site)
delete response;
return false;
}
std::string rsite = response->ToString();
delete response;
Contributor

It looks like response is an unused variable?

@Dr15Jones
Contributor

+1
Any further changes can be done on a different pull request since this will fix a problem which is presently 'deadlocking' MC jobs in the IB RelVals for CMSSW_8_0_THREADED_X.

@cmsbuild
Contributor

This pull request is fully signed and it will be integrated in one of the next CMSSW_8_0_X IBs after it passes the integration tests. This pull request requires discussion in the ORP meeting before it's merged. @slava77, @davidlange6, @Degano, @smuzaffar

@Dr15Jones
Contributor

Upon further reflection, I don't think this will solve the underlying problem and will only push the problem to a slightly later stage.

What is happening is that xrootd has only 3 threads it uses to answer client 'queries'. In our case each of those threads gets stuck with the following stack trace:

#0  0x00000037912e5199 in syscall () from /lib64/libc.so.6
#1  0x00007fab05781500 in XrdSys::LinuxSemaphore::Wait() () from /afs/cern.ch/cms/sw/ReleaseCandidates/vol1/slc6_amd64_gcc493/cms/cmssw/CMSSW_8_0_THREADED_X_2016-01-28-1100/external/slc6_amd64_gcc493/lib/libXrdCl.so.2
#2  0x00007fab0577609a in XrdCl::FileSystem::Query(XrdCl::QueryCode::Code, XrdCl::Buffer const&, XrdCl::Buffer*&, unsigned short) () from /afs/cern.ch/cms/sw/ReleaseCandidates/vol1/slc6_amd64_gcc493/cms/cmssw/CMSSW_8_0_THREADED_X_2016-01-28-1100/external/slc6_amd64_gcc493/lib/libXrdCl.so.2
#3  0x00007fab05702d70 in XrdAdaptor::Source::getXrootdSiteFromURL(std::string, std::string&) () from /afs/cern.ch/cms/sw/ReleaseCandidates/vol1/slc6_amd64_gcc493/cms/cmssw/CMSSW_8_0_THREADED_X_2016-01-28-1100/lib/slc6_amd64_gcc493/libUtilitiesXrdAdaptor.so
#4  0x00007fab0570448e in XrdAdaptor::Source::getXrootdSite(XrdCl::File&, std::string&) () from /afs/cern.ch/cms/sw/ReleaseCandidates/vol1/slc6_amd64_gcc493/cms/cmssw/CMSSW_8_0_THREADED_X_2016-01-28-1100/lib/slc6_amd64_gcc493/libUtilitiesXrdAdaptor.so
#5  0x00007fab05704a40 in XrdAdaptor::Source::setXrootdSite() () from /afs/cern.ch/cms/sw/ReleaseCandidates/vol1/slc6_amd64_gcc493/cms/cmssw/CMSSW_8_0_THREADED_X_2016-01-28-1100/lib/slc6_amd64_gcc493/libUtilitiesXrdAdaptor.so
#6  0x00007fab05705da5 in XrdAdaptor::Source::Source(timespec, std::unique_ptr<XrdCl::File, std::default_delete<XrdCl::File> >, std::string const&) () from /afs/cern.ch/cms/sw/ReleaseCandidates/vol1/slc6_amd64_gcc493/cms/cmssw/CMSSW_8_0_THREADED_X_2016-01-28-1100/lib/slc6_amd64_gcc493/libUtilitiesXrdAdaptor.so
#7  0x00007fab056f3822 in XrdAdaptor::RequestManager::OpenHandler::HandleResponseWithHosts(XrdCl::XRootDStatus*, XrdCl::AnyObject*, std::vector<XrdCl::HostInfo, std::allocator<XrdCl::HostInfo> >*) () from /afs/cern.ch/cms/sw/ReleaseCandidates/vol1/slc6_amd64_gcc493/cms/cmssw/CMSSW_8_0_THREADED_X_2016-01-28-1100/lib/slc6_amd64_gcc493/libUtilitiesXrdAdaptor.so
#8  0x00007fab057a2dfc in ?? () from /afs/cern.ch/cms/sw/ReleaseCandidates/vol1/slc6_amd64_gcc493/cms/cmssw/CMSSW_8_0_THREADED_X_2016-01-28-1100/external/slc6_amd64_gcc493/lib/libXrdCl.so.2
#9  0x00007fab057860cf in XrdCl::XRootDMsgHandler::HandleResponse() () from /afs/cern.ch/cms/sw/ReleaseCandidates/vol1/slc6_amd64_gcc493/cms/cmssw/CMSSW_8_0_THREADED_X_2016-01-28-1100/external/slc6_amd64_gcc493/lib/libXrdCl.so.2
#10 0x00007fab05787ffe in XrdCl::XRootDMsgHandler::Process(XrdCl::Message*) () from /afs/cern.ch/cms/sw/ReleaseCandidates/vol1/slc6_amd64_gcc493/cms/cmssw/CMSSW_8_0_THREADED_X_2016-01-28-1100/external/slc6_amd64_gcc493/lib/libXrdCl.so.2
#11 0x00007fab0576b20a in XrdCl::Stream::HandleIncMsgJob::Run(void*) () from /afs/cern.ch/cms/sw/ReleaseCandidates/vol1/slc6_amd64_gcc493/cms/cmssw/CMSSW_8_0_THREADED_X_2016-01-28-1100/external/slc6_amd64_gcc493/lib/libXrdCl.so.2
#12 0x00007fab057cbc16 in XrdCl::JobManager::RunJobs() () from /afs/cern.ch/cms/sw/ReleaseCandidates/vol1/slc6_amd64_gcc493/cms/cmssw/CMSSW_8_0_THREADED_X_2016-01-28-1100/external/slc6_amd64_gcc493/lib/libXrdCl.so.2
#13 0x00007fab057cbdf9 in ?? () from /afs/cern.ch/cms/sw/ReleaseCandidates/vol1/slc6_amd64_gcc493/cms/cmssw/CMSSW_8_0_THREADED_X_2016-01-28-1100/external/slc6_amd64_gcc493/lib/libXrdCl.so.2
#14 0x0000003791607a51 in start_thread () from /lib64/libpthread.so.0
#15 0x00000037912e893d in clone () from /lib64/libc.so.6

The reason they are stuck is that XrdCl::FileSystem::Query puts another client query on xrootd's list of queries to run, but that new query can never start: every thread in the query pool is itself blocked waiting on a queued query, so the pool is exhausted.

This pull request allows the threads managed by cmsRun (i.e. the TBB threads) to eventually time out waiting for xrootd to respond and then proceed to run. However, all subsequent calls will also time out, since from this point on in the program xrootd will never be able to deliver a client response! Eventually we'll try to open a new file; that open will time out, which will lead to an exception that aborts the job.

The real fix we need is to not get xrootd's query threads jammed up in the first place. I'll start looking into how to accomplish that task.

@bbockelm
Contributor Author

This pull request allows the threads managed by cmsRun (i.e. the TBB threads) to eventually time out waiting for xrootd to respond and then proceed to run

Nope. This fix affects this frame:

#3  0x00007fab05702d70 in XrdAdaptor::Source::getXrootdSiteFromURL(std::string, std::string&) () from /afs/cern.ch/cms/sw/ReleaseCandidates/vol1/slc6_amd64_gcc493/cms/cmssw/CMSSW_8_0_THREADED_X_2016-01-28-1100/lib/slc6_amd64_gcc493/libUtilitiesXrdAdaptor.so

I.e., the change is that getXrootdSiteFromURL will time out without needing an XRootD callback. Hence, the OpenHandler callback will not block; it will finish and free up the XRootD callback thread.

This is all for the XRootD callback threads; there shouldn't be changes for the TBB-managed threads.

@Dr15Jones
Contributor

Ah, it is because you now call the 'asynchronous' version of XrdCl::FileSystem::Query rather than the synchronous version with the same name. It was the same name that led to my confusion when I looked at the stack trace again. Man, I wish they had called it XrdCl::FileSystem::AsyncQuery.

Sorry for the false alarm.

davidlange6 added a commit that referenced this pull request Jan 29, 2016
Allow FileSystem queries to occur from callback threads.
@davidlange6 davidlange6 merged commit 93fb804 into cms-sw:CMSSW_8_0_X Jan 29, 2016