Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug fix for behavior after exceptions with concurrent lumis #26349

Merged
merged 3 commits into from Apr 7, 2019

Conversation

wddgit
Copy link
Contributor

@wddgit wddgit commented Apr 3, 2019

PR description:

This fixes a bug that would be encountered when more than
one lumi is being processed concurrently and there is an
exception. If the exception is associated with a lumi other
than the last one read, an infinite wait will be entered.
The wait is in the processLumis function of the EventProcessor.
It waits on every task in the stream SerialTaskQueues.
In the above situation the serial queue is not resumed
before the wait and there are still tasks in the queues.
There is code that would clean these up in the function
endUnfinishedLumis but that only runs after the wait
is over and processLumis has returned. Too late.

This commit also fixes the maxEvents output parameter which was
broken when more than one lumi is processed concurrently. As far
as I know nothing uses this, but it is an advertised parameter
of the Framework that is supposed to work. One note about
this. If events are running concurrently when the limit is
reached, then all the events already running will complete, so
the job could write more than the requested number of events
to output.

The modified code is not executed if there are no exceptions and
the maxEvents output parameter is not used. Under normal
circumstances this should not have any effect on output or
performance.

PR validation:

This adds two new unit tests. One test would fail with an infinite
wait if executed in a release before this commit. It creates the
condition described above. The other tests the maxEvent output
parameter. Plus I added an assert that checks that the reference
count is one in normalEnd when we try to delete RunResources
and get endRun to execute (something to be concerned about in
concurrent lumi mode).

This fixes a bug that would be encountered when more than
one lumi is being processed concurrently and there is an
exception. If the exception is associated with a lumi other
than the last one read, an infinite wait will be entered.
The wait is in the processLumis function of the EventProcessor.
It waits on every task in the stream SerialTaskQueues.
In the above situation the serial queue is not resumed
before the wait and there are still tasks in the queues.
There is code that would clean these up in the function
endUnfinishedLumis but that only runs after the wait
is over and processLumis has returned. Too late.

This commit also fixes the maxEvent output parameter which was
broken when more than one lumi is processed concurrently. As far
as I know nothing uses this, but it is an advertised parameter
of the Framework that is supposed to work. One note about
this. If events are running concurrently when the limit is
reached, then all the events already running will complete, so
the job could write more than the requested number of events
to output.

This adds two new unit tests. One test would fail with an infinite
wait if executed in a release before this commit. The other tests
the maxEvent output parameter. Plus I added an assert that
checks that the reference count is one in normalEnd when we
try to delete RunResources and get endRun to execute.
@cmsbuild
Copy link
Contributor

cmsbuild commented Apr 3, 2019

The code-checks are being triggered in jenkins.

@cmsbuild
Copy link
Contributor

cmsbuild commented Apr 3, 2019

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-26349/9064

  • This PR adds an extra 48KB to repository

@cmsbuild
Copy link
Contributor

cmsbuild commented Apr 3, 2019

A new Pull Request was created by @wddgit (W. David Dagenhart) for master.

It involves the following packages:

FWCore/Framework

@cmsbuild, @smuzaffar, @Dr15Jones can you please review it and eventually sign? Thanks.
@makortel, @Martin-Grunewald this is something you requested to watch as well.
@davidlange6, @slava77, @fabiocos you are the release manager for this.

cms-bot commands are listed here

@wddgit
Copy link
Contributor Author

wddgit commented Apr 3, 2019

please test

@cmsbuild
Copy link
Contributor

cmsbuild commented Apr 3, 2019

The tests are being triggered in jenkins.
https://cmssdt.cern.ch/jenkins/job/ib-any-integration/33947/console Started: 2019/04/04 00:41

@cmsbuild
Copy link
Contributor

cmsbuild commented Apr 4, 2019

-1

Tested at: 1259b5b

You can see the results of the tests here:
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-26349/33947/summary.html

I found follow errors while testing this PR

Failed tests: UnitTests

  • Unit Tests:

I found errors in the following unit tests:

---> test TestFWCoreFrameworkCmsRun had ERRORS

@cmsbuild
Copy link
Contributor

cmsbuild commented Apr 4, 2019

Comparison job queued.

@cmsbuild
Copy link
Contributor

cmsbuild commented Apr 4, 2019

Comparison is ready
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-26349/33947/summary.html

Comparison Summary:

  • No significant changes to the logs found
  • Reco comparison results: 0 differences found in the comparisons
  • DQMHistoTests: Total files compared: 32
  • DQMHistoTests: Total histograms compared: 3139747
  • DQMHistoTests: Total failures: 1
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3139549
  • DQMHistoTests: Total skipped: 197
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 31 files compared)
  • Checked 133 log files, 14 edm output root files, 32 DQM output files

@Dr15Jones
Copy link
Contributor

It looks like the new test failed?

Failure running cmsRun /build/cmsbld/jenkins/workspace/ib-any-integration/CMSSW_10_6_X_2019-04-03-1100/src/FWCore/Framework/test/testMaxEventsOutput_cfg.py: status 65

@cmsbuild
Copy link
Contributor

cmsbuild commented Apr 5, 2019

The code-checks are being triggered in jenkins.

@wddgit
Copy link
Contributor Author

wddgit commented Apr 5, 2019

I just pushed a fix to Chris's push. And as I looked at it some more I saw some other ways to improve it. It is simplified in a few places and also will behave better when there are multiple exceptions getting raised, usually a primary one and others that are side effects of the primary exception.

@cmsbuild
Copy link
Contributor

cmsbuild commented Apr 5, 2019

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-26349/9117

  • This PR adds an extra 56KB to repository

@wddgit
Copy link
Contributor Author

wddgit commented Apr 5, 2019

please test

@cmsbuild
Copy link
Contributor

cmsbuild commented Apr 5, 2019

The tests are being triggered in jenkins.
https://cmssdt.cern.ch/jenkins/job/ib-any-integration/34021/console Started: 2019/04/05 23:01

@cmsbuild
Copy link
Contributor

cmsbuild commented Apr 5, 2019

Pull request #26349 was updated. @cmsbuild, @smuzaffar, @Dr15Jones can you please check and sign again.

@cmsbuild
Copy link
Contributor

cmsbuild commented Apr 5, 2019

@cmsbuild
Copy link
Contributor

cmsbuild commented Apr 5, 2019

Comparison job queued.

@cmsbuild
Copy link
Contributor

cmsbuild commented Apr 6, 2019

Comparison is ready
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-26349/34021/summary.html

Comparison Summary:

  • No significant changes to the logs found
  • Reco comparison results: 4 differences found in the comparisons
  • DQMHistoTests: Total files compared: 32
  • DQMHistoTests: Total histograms compared: 3140495
  • DQMHistoTests: Total failures: 1
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3140297
  • DQMHistoTests: Total skipped: 197
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 31 files compared)
  • Checked 133 log files, 14 edm output root files, 32 DQM output files

@Dr15Jones
Copy link
Contributor

+1

@cmsbuild
Copy link
Contributor

cmsbuild commented Apr 6, 2019

This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @davidlange6, @slava77, @smuzaffar, @fabiocos (and backports should be raised in the release meeting by the corresponding L2)

@fabiocos
Copy link
Contributor

fabiocos commented Apr 7, 2019

+1

@cmsbuild cmsbuild merged commit d11322c into cms-sw:master Apr 7, 2019
@wddgit wddgit deleted the bugFixConcurrentLumisWithException branch August 9, 2019 18:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants