Multisource xrootd #558

Merged
merged 7 commits into from
Sep 4, 2013

bbockelm
Contributor

This patch series switches XrdAdaptor to the new asynchronous XrdCl interface.

It internally tracks file-server performance and actively load-balances across two active sources in order to avoid poorly performing endpoints.

Understandably, this is not a simple algorithm. See Utilities/XrdAdaptor/doc/multisource_algorithm_design.txt for a full description.
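
As a rough illustration of the idea only: the real algorithm is the one described in the design document, and everything below (the `SourceStats` struct, the exponential moving average, the `alpha` value, and the function names) is a hypothetical sketch, not the code in this patch. Tracking per-source performance and preferring the better of two active sources could look something like this:

```cpp
#include <array>
#include <cstddef>

// Hypothetical per-source statistics; not the data structure used by XrdAdaptor.
struct SourceStats {
  double avgMsPerByte = 0.0;  // smoothed cost of reading one byte
  bool hasSample = false;     // false until the first read completes

  // Fold a completed read into an exponential moving average.
  // The smoothing factor alpha is an assumption chosen for illustration.
  void record(double elapsedMs, std::size_t bytes, double alpha = 0.25) {
    if (bytes == 0) return;
    const double sample = elapsedMs / static_cast<double>(bytes);
    avgMsPerByte = hasSample ? alpha * sample + (1.0 - alpha) * avgMsPerByte : sample;
    hasSample = true;
  }
};

// Pick the index (0 or 1) of the source with the lower smoothed cost.
// A source with no samples yet is preferred so that it gets measured.
inline std::size_t pickSource(const std::array<SourceStats, 2>& sources) {
  if (!sources[0].hasSample) return 0;
  if (!sources[1].hasSample) return 1;
  return sources[0].avgMsPerByte <= sources[1].avgMsPerByte ? 0 : 1;
}
```

A caller would invoke `record()` when a read completes and `pickSource()` before issuing the next one. A production version would also need locking and a policy for replacing a consistently slow source, which this sketch leaves out.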

if (m_parent1 && m_parent2)
{
  timespec stop;
  clock_gettime(CLOCK_MONOTONIC, &stop);
Contributor

Did you test this on mac? IIRC it does not work.

Contributor Author

Good catch!

  1. What should we do on Mac OS X instead of the monotonic clock?
  2. How do I get started with Mac OS X development in CMSSW?

My plan would be to fix compilation now, but wait to test until a pre-release (as this also requires a new version of Xrootd).

@bbockelm
Contributor Author

Forgot to mention -- this requires Xrootd 3.3.3. If you want to test it (from CERN), you can do:

scram setup /afs/cern.ch/cms/slc5_amd64_gcc472/external/xrootd-toolfile/1.0-cms6/etc/scram.d/xrootd.xml

@ghost ghost assigned bbockelm Aug 16, 2013
@ktf
Contributor

ktf commented Aug 16, 2013

Do you really need those metrics on Mac? I would simply ifdef them out, if not.

If yes, you can probably have a look at:

http://stackoverflow.com/questions/11680461/monotonic-clock-on-osx

@bbockelm
Contributor Author

I would prefer to have the timing metrics (and, in general, have this work as well as on Linux) -- I don't want to cripple Mac OS X support.
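
For reference, one way to keep the timing metrics on both platforms is a small wrapper that hides the platform difference. This is a minimal sketch under stated assumptions, not necessarily the fix that went into this pull request: the helper names `monotonicNanos` and `elapsedMs` are hypothetical, and the Mac branch assumes the Mach `mach_absolute_time()` / `mach_timebase_info()` calls as the substitute for `clock_gettime(CLOCK_MONOTONIC, ...)`.

```cpp
#include <cstdint>
#include <time.h>
#ifdef __APPLE__
#include <mach/mach_time.h>
#endif

// Hypothetical helper (the name is an assumption): monotonic time in nanoseconds.
inline std::uint64_t monotonicNanos() {
#ifdef __APPLE__
  // OS X releases of this era do not provide clock_gettime().
  // mach_absolute_time() returns ticks; the timebase ratio converts them to ns.
  // Note: this lazy initialisation is not hardened for concurrent first use.
  static mach_timebase_info_data_t timebase = {0, 0};
  if (timebase.denom == 0) {
    mach_timebase_info(&timebase);
  }
  return mach_absolute_time() * timebase.numer / timebase.denom;
#else
  timespec ts;
  clock_gettime(CLOCK_MONOTONIC, &ts);
  return static_cast<std::uint64_t>(ts.tv_sec) * 1000000000ULL +
         static_cast<std::uint64_t>(ts.tv_nsec);
#endif
}

// Elapsed milliseconds between two readings of monotonicNanos().
inline double elapsedMs(std::uint64_t start, std::uint64_t stop) {
  return static_cast<double>(stop - start) / 1.0e6;
}
```

Call sites would then take a reading before and after a request and pass both to `elapsedMs()`, with no `#ifdef` outside the helper. On older Linux C libraries, `clock_gettime` may additionally require linking against `librt`.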

@ktf
Contributor

ktf commented Aug 16, 2013

Ok.

To get Mac support, the easiest thing is to use the usual recipe to install CMSSW on your laptop:

https://twiki.cern.ch/twiki/bin/view/CMSPublic/SDTCMSSW_aptinstaller

Just use a mac architecture.

You can also use --repository cms.week0 (or week1) and install IBs.

@ktf
Contributor

ktf commented Aug 21, 2013

@nclopezo can you compile and test this (also on osx108, please)?

@ghost ghost assigned nclopezo Aug 21, 2013
@nclopezo
Contributor

Hi,

When I ran the RelVals I got the following error in workflow 4.22, step2:

globaltag = PRE_62_V8::All
271 DQMStore::DQMStore 
22-Aug-2013 11:18:44 CEST  Initiating request to open file root://eoscms//eos/cms/store/data/Run2011A/Cosmics/RAW/v1/000/160/960/049F6443-8E53-E011-A943-003048F117EA.root?svcClass=default
[2013-08-22 11:18:44 +0200][Error  ][XRootD            ] [lxfsrf49c01.cern.ch:1095] Handling error while processing : [ERROR] Error response.
[2013-08-22 11:18:44 +0200][Error  ][File              ] [0x638f7140@root://eoscms//eos/cms/store/data/Run2011A/Cosmics/RAW/v1/000/160/960/049F6443-8E53-E011-A943-003048F117EA.root?svcClass=default] Fatal file state error. Message  returned with [ERROR] Server responded with an error: [3011] Unable to stat file /eos/cms/store/data/Run2011A/Cosmics/RAW/v1/000/160/960/049F6443-8E53-E011-A943-003048F117EA.root; No such file or directory
22-Aug-2013 11:18:45 CEST  Fallback request to file root://xrootd.ba.infn.it//store/data/Run2011A/Cosmics/RAW/v1/000/160/960/049F6443-8E53-E011-A943-003048F117EA.root
XrdSec: No authentication protocols are available.
[2013-08-22 11:18:45 +0200][Error  ][XRootDTransport   ] [xrootd.ba.infn.it:1094 #0.0] No protocols left to try
[2013-08-22 11:18:45 +0200][Error  ][AsyncSock         ] [xrootd.ba.infn.it:1094 #0.0] Socket error while handshaking: [FATAL] Auth failed
[2013-08-22 11:18:45 +0200][Error  ][PostMaster        ] [xrootd.ba.infn.it:1094 #0] Unable to recover: [FATAL] Auth failed.
[2013-08-22 11:18:45 +0200][Error  ][XRootD            ] [xrootd.ba.infn.it:1094] Impossible to send message . Trying to recover.
[2013-08-22 11:18:45 +0200][Error  ][XRootD            ] [xrootd.ba.infn.it:1094] Handling error while processing : [FATAL] Auth failed.
----- Begin Fatal Exception 22-Aug-2013 11:18:45 CEST-----------------------
An exception of category 'FallbackFileOpenError' occurred while
   [0] Constructing the EventProcessor
   [1] Constructing input source of type PoolSource
   [2] Calling RootInputFileSequence::initFile()
   [3] Calling StorageFactory::open()
   [4] Calling XrdFile::open()
Exception Message:
Failed to open the file 'root://xrootd.ba.infn.it//store/data/Run2011A/Cosmics/RAW/v1/000/160/960/049F6443-8E53-E011-A943-003048F117EA.root'
   Additional Info:
      [a] Input file root://eoscms//eos/cms/store/data/Run2011A/Cosmics/RAW/v1/000/160/960/049F6443-8E53-E011-A943-003048F117EA.root?svcClass=default could not be opened.
Fallback Input file root://xrootd.ba.infn.it//store/data/Run2011A/Cosmics/RAW/v1/000/160/960/049F6443-8E53-E011-A943-003048F117EA.root also could not be opened.
      [b] XrdCl::File::Open(name='root://xrootd.ba.infn.it//store/data/Run2011A/Cosmics/RAW/v1/000/160/960/049F6443-8E53-E011-A943-003048F117EA.root', flags=0x10, permissions=0660) => error '[FATAL] Auth failed' (errno=0, code=204)
----- End Fatal Exception -------------------------------------------------

@nclopezo
Contributor

Hi,

And also, I tried to build a CMSSW_7_0_XROOTD_X_2013-08-22-0200 for osx108, which is CMSSW_7_0_X_2013-08-22-0200 plus this pull request.

While building, I got the following error:

/Volumes/build1/dmendezl/tmp/BUILDROOT/e81c0e286d98ac393fff5d6ad19197f6/opt/cmssw/osx108_amd64_gcc472/cms/cmssw/CMSSW_7_0_XROOTD_X_2013-08-22-0200/src/Utilities/XrdAdaptor/src/XrdRequestManager.cc: In constructor 'XrdAdaptor::RequestManager::RequestManager(const string&, XrdCl::OpenFlags::Flags, XrdCl::Access::Mode)':
/Volumes/build1/dmendezl/tmp/BUILDROOT/e81c0e286d98ac393fff5d6ad19197f6/opt/cmssw/osx108_amd64_gcc472/cms/cmssw/CMSSW_7_0_XROOTD_X_2013-08-22-0200/src/Utilities/XrdAdaptor/src/XrdRequestManager.cc:62:17: error: 'CLOCK_MONOTONIC' was not declared in this scope
/Volumes/build1/dmendezl/tmp/BUILDROOT/e81c0e286d98ac393fff5d6ad19197f6/opt/cmssw/osx108_amd64_gcc472/cms/cmssw/CMSSW_7_0_XROOTD_X_2013-08-22-0200/src/Utilities/XrdAdaptor/src/XrdRequestManager.cc:62:37: error: 'clock_gettime' was not declared in this scope
/Volumes/build1/dmendezl/tmp/BUILDROOT/e81c0e286d98ac393fff5d6ad19197f6/opt/cmssw/osx108_amd64_gcc472/cms/cmssw/CMSSW_7_0_XROOTD_X_2013-08-22-0200/src/Utilities/XrdAdaptor/src/XrdRequestManager.cc: In member function 'std::future XrdAdaptor::RequestManager::handle(std::shared_ptr)':
/Volumes/build1/dmendezl/tmp/BUILDROOT/e81c0e286d98ac393fff5d6ad19197f6/opt/cmssw/osx108_amd64_gcc472/cms/cmssw/CMSSW_7_0_XROOTD_X_2013-08-22-0200/src/Utilities/XrdAdaptor/src/XrdRequestManager.cc:230:17: error: 'CLOCK_MONOTONIC' was not declared in this scope
/Volumes/build1/dmendezl/tmp/BUILDROOT/e81c0e286d98ac393fff5d6ad19197f6/opt/cmssw/osx108_amd64_gcc472/cms/cmssw/CMSSW_7_0_XROOTD_X_2013-08-22-0200/src/Utilities/XrdAdaptor/src/XrdRequestManager.cc:230:38: error: 'clock_gettime' was not declared in this scope
/Volumes/build1/dmendezl/tmp/BUILDROOT/e81c0e286d98ac393fff5d6ad19197f6/opt/cmssw/osx108_amd64_gcc472/cms/cmssw/CMSSW_7_0_XROOTD_X_2013-08-22-0200/src/Utilities/XrdAdaptor/src/XrdRequestManager.cc: In member function 'std::future XrdAdaptor::RequestManager::handle(std::shared_ptr >)':
/Volumes/build1/dmendezl/tmp/BUILDROOT/e81c0e286d98ac393fff5d6ad19197f6/opt/cmssw/osx108_amd64_gcc472/cms/cmssw/CMSSW_7_0_XROOTD_X_2013-08-22-0200/src/Utilities/XrdAdaptor/src/XrdRequestManager.cc:318:19: error: 'CLOCK_MONOTONIC' was not declared in this scope
/Volumes/build1/dmendezl/tmp/BUILDROOT/e81c0e286d98ac393fff5d6ad19197f6/opt/cmssw/osx108_amd64_gcc472/cms/cmssw/CMSSW_7_0_XROOTD_X_2013-08-22-0200/src/Utilities/XrdAdaptor/src/XrdRequestManager.cc:318:40: error: 'clock_gettime' was not declared in this scope
/Volumes/build1/dmendezl/tmp/BUILDROOT/e81c0e286d98ac393fff5d6ad19197f6/opt/cmssw/osx108_amd64_gcc472/cms/cmssw/CMSSW_7_0_XROOTD_X_2013-08-22-0200/src/Utilities/XrdAdaptor/src/XrdRequestManager.cc: In member function 'void XrdAdaptor::RequestManager::requestFailure(std::shared_ptr)':
/Volumes/build1/dmendezl/tmp/BUILDROOT/e81c0e286d98ac393fff5d6ad19197f6/opt/cmssw/osx108_amd64_gcc472/cms/cmssw/CMSSW_7_0_XROOTD_X_2013-08-22-0200/src/Utilities/XrdAdaptor/src/XrdRequestManager.cc:404:23: error: 'CLOCK_MONOTONIC' was not declared in this scope
/Volumes/build1/dmendezl/tmp/BUILDROOT/e81c0e286d98ac393fff5d6ad19197f6/opt/cmssw/osx108_amd64_gcc472/cms/cmssw/CMSSW_7_0_XROOTD_X_2013-08-22-0200/src/Utilities/XrdAdaptor/src/XrdRequestManager.cc:404:44: error: 'clock_gettime' was not declared in this scope
/Volumes/build1/dmendezl/tmp/BUILDROOT/e81c0e286d98ac393fff5d6ad19197f6/opt/cmssw/osx108_amd64_gcc472/cms/cmssw/CMSSW_7_0_XROOTD_X_2013-08-22-0200/src/Utilities/XrdAdaptor/src/XrdRequestManager.cc: In member function 'virtual void XrdAdaptor::RequestManager::OpenHandler::HandleResponseWithHosts(XrdCl::XRootDStatus*, XrdCl::AnyObject*, XrdCl::HostList*)':
/Volumes/build1/dmendezl/tmp/BUILDROOT/e81c0e286d98ac393fff5d6ad19197f6/opt/cmssw/osx108_amd64_gcc472/cms/cmssw/CMSSW_7_0_XROOTD_X_2013-08-22-0200/src/Utilities/XrdAdaptor/src/XrdRequestManager.cc:538:23: error: 'CLOCK_MONOTONIC' was not declared in this scope
/Volumes/build1/dmendezl/tmp/BUILDROOT/e81c0e286d98ac393fff5d6ad19197f6/opt/cmssw/osx108_amd64_gcc472/cms/cmssw/CMSSW_7_0_XROOTD_X_2013-08-22-0200/src/Utilities/XrdAdaptor/src/XrdRequestManager.cc:538:44: error: 'clock_gettime' was not declared in this scope
gmake: *** [tmp/osx108_amd64_gcc472/src/Utilities/XrdAdaptor/src/UtilitiesXrdAdaptor/XrdRequestManager.o] Error 1

@bbockelm
Contributor Author

@nclopezo - how do I reproduce workflow 4.22, step2? From the message, it looks like an EOS error -- tough to tell though! It would be nice if I had your build available too...

I see the issue with CLOCK_MONOTONIC - I had missed fixes for a source file. I'll get to that later today.

@bbockelm
Contributor Author

Actually, my calendar just reminded me I have a faculty retreat all morning long. I did a quick fix for the CLOCK_MONOTONIC issue - still don't have a way to test on Mac OS X, can you try again?

@nclopezo
Contributor

Hi @bbockelm

To reproduce the error you can execute the following commands:

scram p CMSSW_7_0_X_2013-08-23-0200
cd CMSSW_7_0_X_2013-08-23-0200/
cmsenv
git cms-merge-topic 558
scram setup /afs/cern.ch/cms/slc5_amd64_gcc472/external/xrootd-toolfile/1.0-cms6/etc/scram.d/xrootd.xml
scram b -j 12
runTheMatrix.py -l 4.22

I ran it again with jenkins and I noticed that the same error appears on workflows 4.53 and 1000.0, you can see the logs here:

https://cmssdt.cern.ch/jenkins/job/Pull-Request-Integration/ARCHITECTURE=slc5_amd64_gcc472/230/console

@nclopezo
Contributor

Hi @bbockelm

I took your last commit and I built again for osx. This time it compiled without errors.

@nclopezo
Contributor

Hi,

I ran the RelVals on my osx installation, and I am getting the same errors on workflows 4.22, 4.53 and 1000.0 that I showed you in previous messages when I tested on slc5.

For example, this is the message for 4.22:

globaltag = PRE_62_V8::All
271 DQMStore::DQMStore 
26-Aug-2013 15:35:24 CEST  Initiating request to open file root://eoscms//eos/cms/store/data/Run2011A/Cosmics/RAW/v1/000/160/960/049F6443-8E53-E011-A943-003048F117EA.root?svcClass=default
[2013-08-26 15:35:24 +0200][Error  ][XRootD            ] [lxfsrf45c01.cern.ch:1095] Handling error while processing : [ERROR] Error response.
[2013-08-26 15:35:24 +0200][Error  ][File              ] [0x2e67b0f0@root://eoscms//eos/cms/store/data/Run2011A/Cosmics/RAW/v1/000/160/960/049F6443-8E53-E011-A943-003048F117EA.root?svcClass=default] Fatal file state error. Message  returned with [ERROR] Server responded with an error: [3011] Unable to stat file /eos/cms/store/data/Run2011A/Cosmics/RAW/v1/000/160/960/049F6443-8E53-E011-A943-003048F117EA.root; No such file or directory
----- Begin Fatal Exception 26-Aug-2013 15:35:25 CEST-----------------------
An exception of category 'FileOpenError' occurred while
   [0] Constructing the EventProcessor
   [1] Constructing input source of type PoolSource
   [2] Calling RootInputFileSequence::initFile()
   [3] Calling StorageFactory::open()
   [4] Calling XrdFile::open()
Exception Message:
Failed to open the file 'root://eoscms//eos/cms/store/data/Run2011A/Cosmics/RAW/v1/000/160/960/049F6443-8E53-E011-A943-003048F117EA.root?svcClass=default'
   Additional Info:
      [a] Input file root://eoscms//eos/cms/store/data/Run2011A/Cosmics/RAW/v1/000/160/960/049F6443-8E53-E011-A943-003048F117EA.root?svcClass=default could not be opened.
      [b] XrdCl::File::Stat(name='root://eoscms//eos/cms/store/data/Run2011A/Cosmics/RAW/v1/000/160/960/049F6443-8E53-E011-A943-003048F117EA.root?svcClass=default) => error '[ERROR] Error response: Unknown error: 3011' (errno=3011, code=400)
      [c] Active source: lxfsrf45c01.cern.ch:1095
----- End Fatal Exception -------------------------------------------------

@ktf
Contributor

ktf commented Aug 27, 2013

-1
To avoid having to keep looking at this one: this fails on EOS.

Fixup error messages to use the correct ToStr.
@bbockelm
Contributor Author

bbockelm commented Sep 3, 2013

Figured out the issue - it's a bug in EOS that the old client avoided. The recent push does the same workaround as before; I'll follow up with EOS separately.

Following @nclopezo's reproduction recipe above, I can confirm the issue has gone away.

@Dr15Jones
Contributor

@nclopezo David, please rerun the standard tests on this updated request

@bbockelm
Contributor Author

bbockelm commented Sep 3, 2013

@nclopezo - is it reasonable for me to do runTheMatrix from lxplus? If not, do I have access to any machine at CERN where I can do this myself?

Would save a few round trips...

@nclopezo
Contributor

nclopezo commented Sep 3, 2013

Hi @bbockelm

I am currently running the tests, and you can see the logs here:

https://cmssdt.cern.ch/jenkins/job/Pull-Request-Integration/ARCHITECTURE=slc5_amd64_gcc472/376/console

@nclopezo
Contributor

nclopezo commented Sep 3, 2013

Hi @bbockelm

Sorry, I forgot to include the command for setting up xrootd 3.3.3. You can see the logs for the new run here:

https://cmssdt.cern.ch/jenkins/job/Pull-Request-Integration/ARCHITECTURE=slc5_amd64_gcc472/377/console

@nclopezo
Contributor

nclopezo commented Sep 3, 2013

The tests finished; all passed.

@Dr15Jones
Contributor

+1

@ktf @davidlt @smuzaffar
This needs to be coordinated with switching to xrootd 3.3.3 as an external.

@cmsbuild
Contributor

cmsbuild commented Sep 3, 2013

This pull request is fully signed and it will be integrated in one of the next IBs unless changes or unless it breaks tests.

@davidlt
Contributor

davidlt commented Sep 4, 2013

@Dr15Jones I just merged my xrootd changes to CMSDIST. All should be available in 0200 IB.

davidlt added a commit that referenced this pull request Sep 4, 2013
@davidlt davidlt merged commit 680e3cb into cms-sw:CMSSW_7_0_X Sep 4, 2013
davidlt pushed a commit to davidlt/cmssw that referenced this pull request Sep 5, 2013
bbockelm added a commit to bbockelm/cmssw that referenced this pull request Sep 16, 2013
Revert "Revert "Merge pull request cms-sw#558 from bbockelm/multisource-xrootd""

This reverts commit d0a3832.
cmsbuild pushed a commit that referenced this pull request Oct 20, 2020
Adjust include guards.
Adjust comments on preprocessor macros.