Integrate CMSSW with HTCondor's update service. #10056

Merged
davidlange6 merged 8 commits into cms-sw:CMSSW_7_6_X on Jul 21, 2015

bbockelm
Contributor

@bbockelm bbockelm commented Jul 6, 2015

This commit provides a new default service, CondorStatusService,
which automatically reports basic progress statistics (# events,

The service automatically detects whether it is running as part of
an HTCondor job and probes whether HTCondor has user-level updates
enabled (and is a sufficient version). If either check fails, then
the service does not register any callbacks with the framework and
effectively becomes a no-op.

To see if it is within an HTCondor job, the service looks for the
_CONDOR_CHIRP_CONFIG environment variable - a very cheap check.
To see if HTCondor supports this feature, it spawns a new process
and looks at the exit code (more expensive).
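
The two checks described above could be sketched roughly as follows. This is a minimal illustration, not the code in this pull request: the helper names `insideCondorJob`, `commandSucceeds`, and `shouldEnableUpdates` are hypothetical, and the probe command is left as a parameter because the exact command the service runs is not stated here. It assumes a POSIX system.

```cpp
#include <cstdlib>
#include <string>
#include <sys/wait.h>

// Cheap check: the presence of _CONDOR_CHIRP_CONFIG is taken as the sign
// that the process is running inside an HTCondor job.
bool insideCondorJob() {
  return std::getenv("_CONDOR_CHIRP_CONFIG") != nullptr;
}

// Expensive check: spawn a process and inspect its exit code.
// `probeCommand` is a placeholder for whatever command exercises the feature.
bool commandSucceeds(const std::string& probeCommand) {
  int status = std::system(probeCommand.c_str());
  if (status == -1) {
    return false;  // the child process could not be created
  }
  return WIFEXITED(status) && WEXITSTATUS(status) == 0;
}

// Run the cheap check first so a process is only spawned when needed.
bool shouldEnableUpdates(const std::string& probeCommand) {
  return insideCondorJob() && commandSucceeds(probeCommand);
}
```

If `shouldEnableUpdates` returns false, the caller would simply skip registering its callbacks, which matches the no-op behavior described above.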

This uses the 'set_job_attr_delayed' mechanism of HTCondor,
which causes these updates to 'tag along' with the existing updates
for memory / disk / CPU usage. Hence, the extra cost is the few
extra bytes in the update packet that goes out once every 5 minutes.
The update only goes as far as the local daemon; condor_chirp does
not wait for it to propagate to the remote host. Hence, it exits
rapidly.
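
As a rough sketch, a delayed update amounts to one short condor_chirp invocation per attribute. The helper name `delayedUpdateCommand` is hypothetical, and the argument order (attribute name, then value) is an assumption based on the condor_chirp documentation rather than on the code in this pull request.

```cpp
#include <string>

// Builds the shell command for a delayed job-attribute update.
// Assumption: condor_chirp takes the attribute name followed by its value.
std::string delayedUpdateCommand(const std::string& attribute,
                                 const std::string& value) {
  return "condor_chirp set_job_attr_delayed " + attribute + " " + value;
}
```

For example, `delayedUpdateCommand("ChirpCMSSWMaxEvents", "5")` yields `condor_chirp set_job_attr_delayed ChirpCMSSWMaxEvents 5`.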

CMSSW will only update once every updateIntervalSeconds (defaults
to 15 minutes). The HTCondor worker node and central components
all have additional rate limiting mechanisms to prevent overload.
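
The client-side limit can be pictured as a simple time-based throttle. The class below is a hypothetical sketch, not the service's actual implementation; only the 15-minute default comes from the description above.

```cpp
#include <ctime>

// Hypothetical throttle: allows at most one update per interval.
class UpdateThrottle {
public:
  explicit UpdateThrottle(std::time_t intervalSeconds = 15 * 60)
      : interval_(intervalSeconds) {}

  // Returns true (and records `now`) if enough time has passed since the
  // last accepted update; the very first call is always accepted.
  bool shouldUpdate(std::time_t now) {
    if (hasUpdated_ && now - last_ < interval_) {
      return false;
    }
    last_ = now;
    hasUpdated_ = true;
    return true;
  }

private:
  std::time_t interval_;
  std::time_t last_ = 0;
  bool hasUpdated_ = false;
};
```

A caller would invoke `shouldUpdate(std::time(nullptr))` from its progress callback and send an update only when it returns true.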

@cmsbuild
Contributor

cmsbuild commented Jul 6, 2015

A new Pull Request was created by @bbockelm (Brian Bockelman) for CMSSW_7_6_X.

Integrate CMSSW with HTCondor's update service.

It involves the following packages:

FWCore/Framework
FWCore/Services

@cmsbuild, @smuzaffar, @Dr15Jones can you please review it and eventually sign? Thanks.
@Martin-Grunewald, @wddgit, @wmtan this is something you requested to watch as well.
You can sign-off by replying to this message having '+1' in the first line of your reply.
You can reject by replying to this message having '-1' in the first line of your reply.
If you are a L2 or a release manager you can ask for tests by saying 'please test' in the first line of a comment.
@Degano you are the release manager for this.
You can merge this pull request by typing 'merge' in the first line of your comment.

This commit adds a few more reported attributes:

- ChirpCMSSWFinished: Unix timestamp of when the job has finished.
- ChirpCMSSWLastUpdate: Unix timestamp of when the last update occurred.
- ChirpCMSSWMaxEvents: Maximum number of input events CMSSW is
  configured to process.  From the process's maxEvents pset.
- ChirpCMSSWMaxLumis: Maximum number of lumis CMSSW is configured
  to process.  From process's maxLuminosityBlocks pset and some
  simple processing of the source.

If no max is configured, the attribute is not reported.

The motivation behind these attributes is that they:
- Simplify deadlock detection.  Using the defaults, the LastUpdate
  attribute should never be more than 30 minutes old if Finished isn't set.
- Provide simple estimates of percent completion.  When we can determine
  the number of events or lumis to process (something we can do for the
  majority of the grid use cases), we'll be able to determine the
  number of events/lumis left to process and an aggregate event/lumi
  processing rate.
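
A monitoring tool consuming these attributes might apply the two ideas above along the following lines. This is only a sketch under stated assumptions: the `JobStatus` struct and both function names are hypothetical, and the count of processed events is passed in as a plain parameter because the attribute that carries it is not named here.

```cpp
#include <ctime>

// Snapshot of the reported attributes for one job (hypothetical layout).
struct JobStatus {
  std::time_t lastUpdate;  // ChirpCMSSWLastUpdate
  bool finished;           // true once ChirpCMSSWFinished has been reported
  long maxEvents;          // ChirpCMSSWMaxEvents, or -1 when not reported
};

// A job is suspicious if it has not finished and its last update is older
// than the threshold (30 minutes here, following the text above).
bool looksDeadlocked(const JobStatus& job, std::time_t now,
                     std::time_t thresholdSeconds = 30 * 60) {
  return !job.finished && (now - job.lastUpdate) > thresholdSeconds;
}

// Returns the fraction of events processed, or a negative value when no
// usable maximum was reported.
double fractionComplete(const JobStatus& job, long eventsProcessed) {
  if (job.maxEvents <= 0) {
    return -1.0;
  }
  return static_cast<double>(eventsProcessed) /
         static_cast<double>(job.maxEvents);
}
```
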
@bbockelm
Contributor Author

bbockelm commented Jul 7, 2015

I added a few new attributes:

ChirpCMSSWFinished = 1436240094
ChirpCMSSWLastUpdate = 1436240094
ChirpCMSSWMaxEvents = 5
ChirpCMSSWMaxLumis = 5

These allow us to:

  • Better detect deadlocks. LastUpdate and Finished will allow us to detect whether CMSSW is running and, if so, how long it has been since an event was completed.
  • Provide rough estimates of percent completion. The logic to detect MaxEvents / MaxLumis from the pset is far from perfect, but ought to work for 99% of the grid jobs out there (the core use case).

@cmsbuild
Contributor

cmsbuild commented Jul 7, 2015

Pull request #10056 was updated. @cmsbuild, @smuzaffar, @Dr15Jones can you please check and sign again.

@Dr15Jones
Contributor

please test

@cmsbuild
Contributor

cmsbuild commented Jul 7, 2015

The tests are being triggered in jenkins.

@@ -332,6 +332,7 @@ int main(int argc, char* argv[]) {
 defaultServices.push_back("AdaptorConfig");
 defaultServices.push_back("SiteLocalConfigService");
 defaultServices.push_back("StatisticsSenderService");
+defaultServices.push_back("CondorStatusService");

Why make this a default service? Why not require WMAgent to add it to the configurations since this is only needed on grid jobs?

@Dr15Jones
Contributor

Why not have CondorStatusUpdater.cc also hold the content of CondorStatusUpdater.h and move it to plugins and then register the service directly in CondorStatusUpdater.cc? As far as I can tell there is no need for anyone to attempt to talk directly to CondorStatusUpdater.

@cmsbuild
Contributor

cmsbuild commented Jul 7, 2015

@cmsbuild
Contributor

This pull request is fully signed and it will be integrated in one of the next CMSSW_7_6_X IBs unless it breaks tests. This pull request requires discussion in the ORP meeting before it's merged. @davidlange6, @Degano, @smuzaffar

@cmsbuild
Contributor

-1
Tested at: acd0eea
When I ran the RelVals I found errors in the following workflows:
50202.0 step2

runTheMatrix-results/50202.0_TTbar_13+TTbar_13+DIGIUP15_PU50+RECOUP15_PU50+HARVESTUP15_PU50+MINIAODMCUP1550/step2_TTbar_13+TTbar_13+DIGIUP15_PU50+RECOUP15_PU50+HARVESTUP15_PU50+MINIAODMCUP1550.log
----- Begin Fatal Exception 21-Jul-2015 16:57:31 CEST-----------------------
An exception of category 'Configuration' occurred while
   [0] Constructing the EventProcessor
   [1] Constructing module: class=MixingModule label='mix'
Exception Message:
RootEmbeddedFileSequence no input files specified for secondary input source.
----- End Fatal Exception -------------------------------------------------

25202.0 step3

runTheMatrix-results/25202.0_TTbar_13+TTbar_13+DIGIUP15_PU25+RECOUP15_PU25+HARVESTUP15_PU25+MINIAODMCUP15/step3_TTbar_13+TTbar_13+DIGIUP15_PU25+RECOUP15_PU25+HARVESTUP15_PU25+MINIAODMCUP15.log
----- Begin Fatal Exception 21-Jul-2015 17:15:11 CEST-----------------------
An exception of category 'Configuration' occurred while
   [0] Constructing the EventProcessor
   [1] Constructing module: class=MixingModule label='mix'
Exception Message:
RootEmbeddedFileSequence no input files specified for secondary input source.
----- End Fatal Exception -------------------------------------------------

you can see the results of the tests here:
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-10056/6402/summary.html

@cmsbuild
Contributor

This pull request is fully signed and it will be integrated in one of the next CMSSW_7_6_X IBs (but tests are reportedly failing). This pull request requires discussion in the ORP meeting before it's merged. @davidlange6, @Degano, @smuzaffar

davidlange6 added a commit that referenced this pull request Jul 21, 2015
Integrate CMSSW with HTCondor's update service.
@davidlange6 davidlange6 merged commit 4e98e7c into cms-sw:CMSSW_7_6_X Jul 21, 2015
@bbockelm
Contributor Author

@Dr15Jones - how do you suggest I do the backport without squashing? Does CMS have a how-to page for that (I always squash...)

@Dr15Jones
Contributor

I've always just used git cherry-pick. I think the new version of git even allows you to specify multiple commits in one command.

@bbockelm
Contributor Author

Ah, ok. I thought there was a deeper secret than that.
