Bug fix for behavior after exceptions with concurrent lumis #26349
The code-checks are being triggered in jenkins.
+code-checks Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-26349/9064
A new Pull Request was created by @wddgit (W. David Dagenhart) for master.
It involves the following packages: FWCore/Framework
@cmsbuild, @smuzaffar, @Dr15Jones can you please review it and eventually sign? Thanks.
cms-bot commands are listed here
please test
The tests are being triggered in jenkins.
-1 Tested at: 1259b5b You can see the results of the tests here: I found the following errors while testing this PR. Failed tests: UnitTests
I found errors in the following unit tests: ---> test TestFWCoreFrameworkCmsRun had ERRORS
Comparison job queued.
Comparison is ready
It looks like the new test failed?
The code-checks are being triggered in jenkins.
I just pushed a fix to Chris's push. And as I looked at it some more I saw some other ways to improve it. It is simplified in a few places and also will behave better when there are multiple exceptions getting raised, usually a primary one and others that are side effects of the primary exception.
+code-checks Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-26349/9117
please test
The tests are being triggered in jenkins.
Pull request #26349 was updated. @cmsbuild, @smuzaffar, @Dr15Jones can you please check and sign again.
Comparison job queued.
Comparison is ready
+1
This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @davidlange6, @slava77, @smuzaffar, @fabiocos (and backports should be raised in the release meeting by the corresponding L2)
+1
PR description:
This fixes a bug that would be encountered when more than
one lumi is being processed concurrently and there is an
exception. If the exception is associated with a lumi other
than the last one read, an infinite wait will be entered.
The wait is in the processLumis function of the EventProcessor.
It waits on every task in the stream SerialTaskQueues.
In the above situation the serial queue is not resumed
before the wait and there are still tasks in the queues.
There is code that would clean these up in the function
endUnfinishedLumis but that only runs after the wait
is over and processLumis has returned. Too late.
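The hang described above can be illustrated with a minimal sketch, assuming a simplified single-threaded queue. `PausableQueue` and `waitWouldFinish` are hypothetical names used only for illustration; they are not the CMSSW `SerialTaskQueue` interface, and the real code waits on tasks running in a thread pool.

```cpp
#include <cstddef>
#include <deque>
#include <functional>
#include <utility>

// Hypothetical stand-in for a pausable serial queue (not the CMSSW API).
class PausableQueue {
public:
  // Queue a task; it runs immediately only if the queue is not paused.
  void push(std::function<void()> task) {
    pending_.push_back(std::move(task));
    drain();
  }
  void pause() { paused_ = true; }
  // Resuming lets the tasks that piled up while paused finally run.
  void resume() {
    paused_ = false;
    drain();
  }
  std::size_t pendingCount() const { return pending_.size(); }

private:
  void drain() {
    while (!paused_ && !pending_.empty()) {
      auto task = std::move(pending_.front());
      pending_.pop_front();
      task();
    }
  }
  bool paused_ = false;
  std::deque<std::function<void()>> pending_;
};

// A wait on "every queued task has finished" can only return once nothing
// is left pending. With a paused queue holding tasks, that never happens.
bool waitWouldFinish(PausableQueue const& queue) { return queue.pendingCount() == 0; }
```

In this model, a task pushed onto a paused queue stays pending, so `waitWouldFinish` stays false until something calls `resume()`. That mirrors the ordering problem in the description: the cleanup that would release the queues runs only after the wait has returned.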
This commit also fixes the maxEvents output parameter which was
broken when more than one lumi is processed concurrently. As far
as I know nothing uses this, but it is an advertised parameter
of the Framework that is supposed to work. One note about
this. If events are running concurrently when the limit is
reached, then all the events already running will complete, so
the job could write more than the requested number of events
to output.
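The overshoot noted above can be sketched with a deliberately simplified model. It assumes events are started in lock-step batches of `concurrency` events and that the limit is checked only before a batch starts; the real Framework schedules events independently. The function name and parameters are hypothetical.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical model: count the events written to output when up to
// `concurrency` events run at once and the output limit is checked only
// before new events start. Events already running always complete.
int eventsWritten(int limit, int concurrency, int available) {
  assert(concurrency > 0);
  int written = 0;
  int remaining = available;
  while (written < limit && remaining > 0) {
    // Every event in this batch started before the limit was reached.
    int batch = std::min(concurrency, remaining);
    written += batch;
    remaining -= batch;
  }
  return written;
}
```

Under these assumptions, `eventsWritten(10, 4, 100)` gives 12 rather than 10, while `eventsWritten(10, 1, 100)` gives exactly 10. In this model the excess is at most `concurrency - 1` events.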
The modified code is not executed if there are no exceptions and
the maxEvents output parameter is not used. Under normal
circumstances this should not have any effect on output or
performance.
PR validation:
This adds two new unit tests. One test would fail with an infinite
wait if executed in a release before this commit. It creates the
condition described above. The other tests the maxEvents output
parameter. Plus I added an assert that checks that the reference
count is one in normalEnd when we try to delete RunResources
and get endRun to execute (something to be concerned about in
concurrent lumi mode).
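The idea behind the reference count assert can be sketched as follows, assuming the run resources are held through something like `std::shared_ptr` and that the end-of-run work is triggered when the last reference goes away. `RunResourcesSketch` and `releaseAtNormalEnd` are hypothetical names, not the Framework's actual types or functions.

```cpp
#include <cassert>
#include <memory>

// Hypothetical stand-in: destroying the object represents endRun executing.
struct RunResourcesSketch {
  explicit RunResourcesSketch(bool& endRunCalled) : endRunCalled_(endRunCalled) {}
  ~RunResourcesSketch() { endRunCalled_ = true; }
  bool& endRunCalled_;
};

// Drop the reference at the end of the job. If another holder still had a
// reference (for example a lumi that has not finished), reset() would not
// destroy the object and the end-of-run work would silently be skipped.
void releaseAtNormalEnd(std::shared_ptr<RunResourcesSketch>& resources) {
  assert(resources.use_count() == 1);
  resources.reset();
}
```

With a single owner, `releaseAtNormalEnd` destroys the object and the flag passed to the constructor becomes true. With a second owner, the assert fires instead of letting the job end without the end-of-run step.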