Integrate CMSSW with HTCondor's update service. #10056
This commit provides a new default service, CondorStatusService, which automatically reports basic progress statistics (# events,

The service will automatically detect if it's running as part of a HTCondor job and probes to see if HTCondor has user-level updates enabled (and is a sufficient version). If either check fails, then the service does not register any callbacks with the framework and effectively becomes a no-op.

To see if it is within a HTCondor job, the service looks for the _CONDOR_CHIRP_CONFIG environment variable - a very cheap check. To see if HTCondor supports this feature, it spawns a new process and looks at the exit code (more expensive).

This uses the 'set_job_attr_delayed' mechanism of HTCondor, which causes these updates to 'tag along' with the existing updates for memory / disk / CPU usage. Hence, the extra cost is the few extra bytes in the update packet that goes out once every 5 minutes.

The update only goes as far as the local daemon; condor_chirp does not wait for it to propagate to the remote host. Hence, it exits rapidly.

CMSSW will only update once every updateIntervalSeconds (defaults to 15 minutes). The HTCondor worker node and central components all have additional rate limiting mechanisms to prevent overload.
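For readers who want to see the shape of the checks described above, here is a minimal sketch, not the code in this PR. It assumes a POSIX system. The helper names (`insideCondorJob`, `commandSucceeds`, `chirpCommand`, `UpdateThrottle`) are hypothetical. The description does not say which command is spawned for the probe, so the sketch takes the command as a parameter. The `ChirpCMSSW` attribute prefix is taken from the attribute names listed in this PR.

```cpp
#include <cassert>
#include <cstdlib>
#include <ctime>
#include <string>
#include <sys/wait.h>  // WIFEXITED / WEXITSTATUS (POSIX)

// Cheap check: is _CONDOR_CHIRP_CONFIG present in the environment?
bool insideCondorJob() {
  return std::getenv("_CONDOR_CHIRP_CONFIG") != nullptr;
}

// Expensive check: run a command in a child process and inspect its exit code.
// Which probe command the real service runs is not stated; the caller supplies it.
bool commandSucceeds(const std::string& command) {
  const int status = std::system(command.c_str());
  return status != -1 && WIFEXITED(status) && WEXITSTATUS(status) == 0;
}

// Build a delayed-update command line for one attribute (assumed layout).
std::string chirpCommand(const std::string& key, long value) {
  return "condor_chirp set_job_attr_delayed ChirpCMSSW" + key + " " +
         std::to_string(value);
}

// Hypothetical rate limiter: allow at most one update per interval.
class UpdateThrottle {
public:
  explicit UpdateThrottle(std::time_t intervalSeconds)
      : interval_(intervalSeconds) {
    assert(intervalSeconds >= 0);
  }

  // Returns true (and records the time) if an update may be sent at 'now'.
  bool shouldUpdate(std::time_t now) {
    if (hasUpdated_ && now - last_ < interval_) {
      return false;
    }
    last_ = now;
    hasUpdated_ = true;
    return true;
  }

private:
  std::time_t interval_;
  std::time_t last_ = 0;
  bool hasUpdated_ = false;
};
```

A caller would construct `UpdateThrottle` with 900 seconds to match the 15-minute default. It would send `chirpCommand(...)` through `commandSucceeds` only when `insideCondorJob()` is true and `shouldUpdate(std::time(nullptr))` allows it.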
A new Pull Request was created by @bbockelm (Brian Bockelman) for CMSSW_7_6_X.

Integrate CMSSW with HTCondor's update service.

It involves the following packages:

FWCore/Framework

@cmsbuild, @smuzaffar, @Dr15Jones can you please review it and eventually sign? Thanks.
This commit adds a few more reported attributes:
- ChirpCMSSWFinished: Unix timestamp of when the job has finished.
- ChirpCMSSWLastUpdate: Unix timestamp of when the last update occurred.
- ChirpCMSSWMaxEvents: Maximum number of input events CMSSW is configured to process. From the process's maxEvents pset.
- ChirpCMSSWMaxLumis: Maximum number of lumis CMSSW is configured to process. From process's maxLuminosityBlocks pset and some simple processing of the source.

If no max is configured, the attribute is not reported.

The motivation behind these attributes is they:
- Simplify deadlock detection. Using the defaults, the LastUpdate attribute should never be more than 30 minutes if Finished isn't set.
- Provide simple estimates of percent completion. When we can determine the number of events or lumis to process (something we can for the majority of the grid use cases), we'll be able to determine the number of events/lumis left to process and an aggregate event/lumi processing rate.
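As an illustration of how a monitoring tool might consume these attributes, here is a rough sketch under stated assumptions, not part of this PR. The `ChirpStatus` struct and both functions are hypothetical. The `eventsProcessed` field stands in for whatever attribute carries the running event count, since its name is not given here. The 30-minute threshold comes from the commit message above. The sketch requires C++17 for `std::optional`.

```cpp
#include <optional>

// Hypothetical snapshot of the attributes read back from a job's ClassAd.
struct ChirpStatus {
  std::optional<long> finished;   // ChirpCMSSWFinished, unset while running
  long lastUpdate = 0;            // ChirpCMSSWLastUpdate (Unix timestamp)
  std::optional<long> maxEvents;  // ChirpCMSSWMaxEvents, unset if no max
  long eventsProcessed = 0;       // assumed: events processed so far
};

// A running job whose last update is older than the threshold looks stalled.
// 1800 s = 30 minutes, the bound quoted for the default settings.
bool looksStalled(const ChirpStatus& s, long now, long thresholdSeconds = 1800) {
  return !s.finished.has_value() && (now - s.lastUpdate) > thresholdSeconds;
}

// Percent completion, available only when a positive maximum is known.
std::optional<double> percentComplete(const ChirpStatus& s) {
  if (!s.maxEvents.has_value() || *s.maxEvents <= 0) {
    return std::nullopt;
  }
  const double pct = 100.0 * static_cast<double>(s.eventsProcessed) /
                     static_cast<double>(*s.maxEvents);
  return pct > 100.0 ? 100.0 : pct;
}
```

Returning `std::optional` keeps "no maximum configured" distinct from "0% done", which matches the rule that the attribute is simply not reported when no max is set.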
I added a few new attributes:
These allow us to:
Pull request #10056 was updated. @cmsbuild, @smuzaffar, @Dr15Jones can you please check and sign again.
please test
The tests are being triggered in jenkins.
@@ -332,6 +332,7 @@ int main(int argc, char* argv[]) {
defaultServices.push_back("AdaptorConfig");
defaultServices.push_back("SiteLocalConfigService");
defaultServices.push_back("StatisticsSenderService");
defaultServices.push_back("CondorStatusService");
Why make this a default service? Why not require WMAgent to add it to the configurations since this is only needed on grid jobs?
Why not have CondorStatusUpdater.cc also hold the content of CondorStatusUpdater.h and move it to plugins and then register the service directly in CondorStatusUpdater.cc? As far as I can tell there is no need for anyone to attempt to talk directly to CondorStatusUpdater.
+1 The following merge commits were also included on top of IB + this PR after doing git cms-merge-topic:
This pull request is fully signed and it will be integrated in one of the next CMSSW_7_6_X IBs unless it breaks tests. This pull request requires discussion in the ORP meeting before it's merged. @davidlange6, @Degano, @smuzaffar
-1
runTheMatrix-results/50202.0_TTbar_13+TTbar_13+DIGIUP15_PU50+RECOUP15_PU50+HARVESTUP15_PU50+MINIAODMCUP1550/step2_TTbar_13+TTbar_13+DIGIUP15_PU50+RECOUP15_PU50+HARVESTUP15_PU50+MINIAODMCUP1550.log
----- Begin Fatal Exception 21-Jul-2015 16:57:31 CEST-----------------------
An exception of category 'Configuration' occurred while
[0] Constructing the EventProcessor
[1] Constructing module: class=MixingModule label='mix'
Exception Message:
RootEmbeddedFileSequence no input files specified for secondary input source.
----- End Fatal Exception -------------------------------------------------
25202.0 step3
runTheMatrix-results/25202.0_TTbar_13+TTbar_13+DIGIUP15_PU25+RECOUP15_PU25+HARVESTUP15_PU25+MINIAODMCUP15/step3_TTbar_13+TTbar_13+DIGIUP15_PU25+RECOUP15_PU25+HARVESTUP15_PU25+MINIAODMCUP15.log
----- Begin Fatal Exception 21-Jul-2015 17:15:11 CEST-----------------------
An exception of category 'Configuration' occurred while
[0] Constructing the EventProcessor
[1] Constructing module: class=MixingModule label='mix'
Exception Message:
RootEmbeddedFileSequence no input files specified for secondary input source.
----- End Fatal Exception -------------------------------------------------
you can see the results of the tests here:
This pull request is fully signed and it will be integrated in one of the next CMSSW_7_6_X IBs (but tests are reportedly failing). This pull request requires discussion in the ORP meeting before it's merged. @davidlange6, @Degano, @smuzaffar
Integrate CMSSW with HTCondor's update service.
@Dr15Jones - how do you suggest I do the backport without squashing? Does CMS have a how-to page for that (I always squash...)
I've always just used
Ah, ok. I thought there was a deeper secret than that.