Run modules concurrently during global begin transitions #18451

Merged

Dr15Jones
Contributor

Modules are run concurrently during global begin Run and LuminosityBlock transitions.

We do not yet run concurrently during global end transitions because that would break the use of the DQMStore.

Test getting products that are only made at end transitions during begin transitions and during event.
Changed the module interface to allow querying which data products a module consumes for all transition types.
In the future, the Principal will need to know whether it is at an end Run or end LuminosityBlock transition, so that items made only at the end transition are prefetched only at that transition.
Prefetching will be used for non-Event transitions, so we need to be certain to send the ActivityRegistry signals meant for the Event only during Event processing.
Requesting a data product without specifying a process name can give different results depending on whether the request happens before the end transition. This occurs when a module in the job puts its data into the Principal only at an end transition while the source contains a related data product from a previous process. A request made before the end transition returns the previous process's data product, while a request made at the end transition returns the newly created data product.
To accommodate prefetching of data products from Runs and LuminosityBlocks, we need to reset the cached lookup information at the end transition to allow the newly created item to be obtained.
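The lookup behaviour described above can be illustrated with a small toy model. This is only a sketch in Python, not the framework's C++ code: `ToyPrincipal`, its methods, and the process names `RECO` and `ANALYSIS` are hypothetical, and it assumes that a request without a process name resolves to the most recent process holding the product and that the resolution is cached.

```python
class ToyPrincipal:
    """Toy stand-in for a Run/LuminosityBlock principal (hypothetical, not the CMSSW class)."""

    def __init__(self, process_history):
        # Processes ordered oldest to newest; the last one is the current job.
        self.process_history = list(process_history)
        self.products = {}      # (label, process) -> value
        self.lookup_cache = {}  # label -> process chosen by an earlier lookup

    def put(self, label, process, value):
        self.products[(label, process)] = value

    def get(self, label, process=None):
        # An explicit process name bypasses the cached resolution.
        if process is not None:
            return self.products[(label, process)]
        if label not in self.lookup_cache:
            # Prefer the most recent process that currently holds the product.
            for proc in reversed(self.process_history):
                if (label, proc) in self.products:
                    self.lookup_cache[label] = proc
                    break
            else:
                raise KeyError(label)
        return self.products[(label, self.lookup_cache[label])]

    def reset_lookup_cache(self):
        # Assumed to be called at the end transition.
        self.lookup_cache.clear()


def demo():
    principal = ToyPrincipal(["RECO", "ANALYSIS"])
    principal.put("summary", "RECO", "from previous process")
    # Request before the end transition: only the older product exists.
    before_end = principal.get("summary")
    # The current process puts its product at the end transition.
    principal.put("summary", "ANALYSIS", "made at end transition")
    stale = principal.get("summary")       # cached resolution still wins
    principal.reset_lookup_cache()
    after_reset = principal.get("summary")  # now finds the new product
    return before_end, stale, after_reset
```

In this sketch the second request still returns the previous process's product until the cache is reset, which is the situation the reset at the end transition is meant to avoid.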
Use new async version for global begin Run and Luminosity transitions.
Extended the concurrent running of modules on global begin transitions to SubProcesses.
Child SubProcesses are also run concurrently.
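As a rough illustration of running modules concurrently within one begin transition, here is a minimal Python sketch. It is not the framework's implementation: `run_begin_transition` and the module labels are hypothetical, and it assumes a simplified scheme in which modules whose consumed products are already available run together in "waves" on a thread pool.

```python
from concurrent.futures import ThreadPoolExecutor


def run_begin_transition(modules, max_workers=4):
    """Run toy modules for one transition in concurrent waves.

    `modules` maps a module label to (consumes, function). `consumes` lists
    the labels whose products the function needs; the function takes a dict
    of those products and returns this module's product.
    Returns (products, waves).
    """
    products = {}
    waves = []
    pending = dict(modules)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while pending:
            # Modules whose inputs are all available can run now.
            ready = sorted(
                label for label, (consumes, _) in pending.items()
                if all(dep in products for dep in consumes)
            )
            if not ready:
                raise RuntimeError(
                    "unsatisfied or circular dependency: %s" % sorted(pending)
                )
            futures = {}
            for label in ready:
                consumes, func = pending.pop(label)
                inputs = {dep: products[dep] for dep in consumes}
                futures[label] = pool.submit(func, inputs)
            # Results are stored from this thread only, so no lock is needed.
            for label, future in futures.items():
                products[label] = future.result()
            waves.append(ready)
    return products, waves


# Hypothetical modules: two independent ones and one that consumes both.
example_modules = {
    "geometry": ([], lambda inputs: 10),
    "conditions": ([], lambda inputs: 5),
    "calibration": (
        ["geometry", "conditions"],
        lambda inputs: inputs["geometry"] + inputs["conditions"],
    ),
}
example_products, example_waves = run_begin_transition(example_modules)
```

Here `geometry` and `conditions` are submitted together, and `calibration` runs only once both of its inputs exist. Knowing what each module consumes for every transition type is what makes this kind of ordering possible.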
@cmsbuild
Contributor

A new Pull Request was created by @Dr15Jones (Chris Jones) for master.

It involves the following packages:

FWCore/Framework
FWCore/Integration

@cmsbuild, @smuzaffar, @Dr15Jones, @davidlange6 can you please review it and eventually sign? Thanks.
@Martin-Grunewald, @wddgit this is something you requested to watch as well.
@Muzaffar, @davidlange6, @smuzaffar you are the release manager for this.

cms-bot commands are listed here #13028

@Dr15Jones
Contributor Author

please test

@cmsbuild
Contributor

cmsbuild commented Apr 24, 2017

The tests are being triggered in jenkins.
https://cmssdt.cern.ch/jenkins/job/ib-any-integration/19347/console Started: 2017/04/24 19:57

@cmsbuild
Contributor

-1

Tested at: d677356

You can see the results of the tests here:
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-18451/19347/summary.html

I found the following errors while testing this PR

Failed tests: Build

  • Build:

I found an error when building:

gmake[1]: Target 'PostBuild' not remade because of errors.
gmake[1]: Leaving directory '/build/cmsbld/jenkins-workarea/workspace/ib-any-integration/CMSSW_9_1_X_2017-04-24-1100'
config/SCRAM/GMake/Makefile.rules:2035: recipe for target 'src' failed
gmake: *** [src] Error 2
gmake: Target 'all' not remade because of errors.
gmake: *** [There are compilation/build errors. Please see the detail log above.] Error 2


@cmsbuild
Contributor

Comparison not run due to Build errors (RelVals and Igprof tests were also skipped)

@Dr15Jones
Contributor Author

The build failed because of a problem in a Python __init__.py file generated by scram.

@Dr15Jones
Contributor Author

please test

@Dr15Jones
Contributor Author

The RelVal failures are all from xrootd socket timeouts and are not caused by the changes in this pull request.

@Dr15Jones
Contributor Author

The AddOnTest failures are all from xrootd errors and are not caused by the changes in this pull request.

@Dr15Jones
Contributor Author

+1

@cmsbuild
Contributor

This pull request is fully signed and it will be integrated in one of the next master IBs (but tests are reportedly failing). This pull request requires discussion in the ORP meeting before it's merged. @Muzaffar, @davidlange6, @smuzaffar

@Dr15Jones
Contributor Author

I rebuilt CMSSW with this change and ran the full runTheMatrix.py. The only errors were ones that already occur in the IBs.

@Dr15Jones
Contributor Author

please test

@cmsbuild
Contributor

cmsbuild commented Apr 26, 2017

The tests are being triggered in jenkins.
https://cmssdt.cern.ch/jenkins/job/ib-any-integration/19415/console Started: 2017/04/26 14:27

@Dr15Jones
Contributor Author

From my run of the full runTheMatrix there were no problems, but I'll rerun the tests to see if we can get comparisons out.

@cmsbuild
Contributor

@cmsbuild
Contributor

Comparison job queued.

@cmsbuild
Contributor

Comparison is ready
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-18451/19415/summary.html

Comparison Summary:

  • You potentially added 28 lines to the logs
  • Reco comparison results: 1767 differences found in the comparisons
  • DQMHistoTests: Total files compared: 23
  • DQMHistoTests: Total histograms compared: 1780008
  • DQMHistoTests: Total failures: 6218
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 1773617
  • DQMHistoTests: Total skipped: 173
  • DQMHistoTests: Total Missing objects: 0
  • Checked 94 log files, 14 edm output root files, 23 DQM output files

@davidlange6 davidlange6 merged commit 588aa71 into cms-sw:master Apr 26, 2017
@Dr15Jones
Contributor Author

@davidlange6 given the unforeseen thread-safety problem with LumiProducer (the static analyzer doesn't catch it), do you want to roll back this change?
I hope to get a workaround done tomorrow, but there are other places that call frontier indirectly, and I don't know if any of them are used in production.

@davidlange6
Contributor

To be safe, I will.

@Dr15Jones Dr15Jones deleted the unscheduledBeginTransitionHandling branch May 18, 2017 14:52