New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow FileSystem queries to occur from callback threads. #13080
Allow FileSystem queries to occur from callback threads. #13080
Conversation
To query a filesystem property, a free callback thread is needed with the default synchronous versions of XrdCl::FileSystem methods. Thus, if the synchronous method is called from a *callback thread*, then there must be an idle thread to handle the response; if there are N threads in the pool and N simultaneous queries, then the filesystem query would cause a deadlock. Worse yet, the timeout mechanism for XrdCl relies on an idle callback thread to function. As making the synchronous query asynchronous in its currrent use case is quite hard, I only partially solved the problem: wrote a new response handler that can timeout without an idle thread.
A new Pull Request was created by @bbockelm (Brian Bockelman) for CMSSW_8_0_X. It involves the following packages: Utilities/XrdAdaptor @cmsbuild, @smuzaffar, @Dr15Jones, @davidlange6 can you please review it and eventually sign? Thanks. cms-bot commands are list here #13028 |
please test |
The tests are being triggered in jenkins. |
// m_mutex protects m_status | ||
std::unique_lock<std::mutex> guard(m_mutex); | ||
// On exit from the block, make sure m_status is set; it needs to be set before we notify threads. | ||
std::unique_ptr<char, std::function<void(char*)>> exit_guard(nullptr, [&](char *) {m_status.reset(new XrdCl::XRootDStatus(XrdCl::stError, XrdCl::errInternal));}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't think unique_ptr allowed a delete handler.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup, it does!
Since I want to release()
the data later, I felt it was appropriate here.
Why not make |
-1 Tested at: aa0a0b7 cmsDriver.py RelVal -s L1REPACK:GT2 --data --scenario=HeavyIons -n 10 --conditions auto:run2_hlt_HIon --relval 9000,50 --datatier "RAW" --eventcontent RAW --customise=HLTrigger/Configuration/CustomConfigs.L1T --era Run2_HI --magField 38T_PostLS1 --fileout file:RelVal_Raw_HIon_DATA.root --filein /store/hidata/HIRun2015/HIHardProbes/RAW-RECO/HighPtJet-PromptReco-v1/000/263/689/00000/1802CD9A-DDB8-E511-9CF9-02163E0138CA.root : FAILED - time: date Tue Jan 26 23:11:16 2016-date Tue Jan 26 23:10:36 2016 s - exit: 23552 you can see the results of the tests here: |
@Dr15Jones - I considered the atomic. However, we already need to have the mutex there for the condition variable. Further:
|
For the static check warnings - I think these are all "correct" and may just need to be annotated. What are the annotations I should use? Alternately, there might be some way to tweak the singletons to make the "more correct". Chris, ideas? Also, I notice that |
please test |
The tests are being triggered in jenkins. |
All "real" state is kept in a sub-object; the QueryAttrHandler keeps a weak_ptr to the "real" state. While the main thread is waiting, it keeps a reference to the shared_ptr for the state, keeping the weak_ptr alive. If the timeout occurs, the shared_ptr is dropped and the weak_ptr cannot be locked. The QueryAttrHandler is always deleted by the callback if it is successfully registered.
Previously, the presence of a pointer in m_file indicated an open- file request was outstanding. However, after the file was opened, the unique_ptr was moved to the Source object (which does a synchronous network query in its constructor). Hence, it appeared there was no outstanding file-open and a second one could start. Now, we keep a separate boolean flag that is only set to false when the HandleResponse method exits.
Pull request #13080 was updated. @cmsbuild, @smuzaffar, @Dr15Jones, @davidlange6 can you please check and sign again. |
The update fixes a the obvious build issue (sorry! refactored a line after testing... guess which one) and attacks the refactor suggested by Chris. |
please test |
The tests are being triggered in jenkins. |
@@ -246,9 +355,7 @@ Source::getXrootdSiteFromURL(std::string url, std::string &site) | |||
delete response; | |||
return false; | |||
} | |||
std::string rsite = response->ToString(); | |||
delete response; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like response
is an unused variable?
+1 |
This pull request is fully signed and it will be integrated in one of the next CMSSW_8_0_X IBs after it passes the integration tests. This pull request requires discussion in the ORP meeting before it's merged. @slava77, @davidlange6, @Degano, @smuzaffar |
Upon further reflection, I don't think this will solve the underlying problem and will only push the problem to a slightly later stage. What is happening is xrootd only has 3 threads it uses to answer client 'queries'. In our case each of those threads gets stuck with the following stacktrace
The reason they are stuck is This pull request allows the threads mangaged by cmsRun (i.e. the TBB threads) to eventually timeout waiting for xrootd to respond and then proceed to run. However, all subsequent calls will also time out since from this point on in the program xrootd will never be able to give a client response! Eventually we'll try to open a new file, it will time out which will lead to an exception which will abort the job. The real fix we need is to not get xrootd's query threads jammed up in the first place. I'll start looking into how to accomplish that task. |
Nope. This fix affects this frame:
I.e., the change is that This is all for the XRootD callback threads; there shouldn't be changes for the TBB-managed threads. |
Ah, it is because you now call the 'asynchronous' version of Sorry for the false alarm. |
Allow FileSystem queries to occur from callback threads.
To query a filesystem property, a free callback thread is needed
with the default synchronous versions of XrdCl::FileSystem methods.
Thus, if the synchronous method is called from a callback thread,
then there must be an idle thread to handle the response; if there
are N threads in the pool and N simultaneous queries, then the
filesystem query would cause a deadlock.
Worse yet, the timeout mechanism for XrdCl relies on an idle callback
thread to function.
As making the synchronous query asynchronous in its currrent use case
is quite hard, I only partially solved the problem: wrote a new
response handler that can timeout without an idle thread.
Technically, any version of CMSSW 7_6 or later should be affected. Pragmatically, do we actually open multiple files simultaneously except in the threaded mixing module?