Fix race condition in DAQ modules when exception is thrown in event processing (only affecting multithreading) - 75X #12200

Merged
merged 1 commit into from
Nov 4, 2015

Conversation

smorovic
Contributor

A rare race condition occurs when an exception is thrown while processing the last few events in a file and LS. In this case, another thread may already have requested the next event from the source. If the next event belongs to the next LS, the input source reports the total number of events in the previous LS to the FastMonitoringService.

Normally, when an exception is thrown, we skip writing the JSON stream output (by catching the exception action callback in the FastMonitoringService), and hltd subsequently assigns the missing events as error events to close the micro-merge of that LS. However, the suppression did not happen once the input source had already reported the total number of events to the FastMonitoringService. This led to an incomplete micro-merge for some streams. The problem is present only in multithreading, because in single-threaded mode the source cannot get a request for the next event before the exception on the currently processed event is thrown (i.e. event requests are aborted and the run/LS get closed).

In this update, the JSON output is suppressed whenever an exception has been thrown, regardless of the input source report.
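The ordering problem described above can be sketched with a small model. This is only an illustration under assumptions, not the code in EventFilter/Utilities: the class `LumiMonitor` and its methods are hypothetical names, and the "old" check is a simplified reading of the description (the exception check being skipped once the source has reported the event count for the LS).

```cpp
#include <mutex>

// Hypothetical per-LS monitoring state. None of these names are taken from
// the FastMonitoringService; they only model the behaviour described above.
class LumiMonitor {
public:
    // Called by the input source when it opens the next LS and reports
    // the total number of events of the previous one.
    void reportEventCount(unsigned nEvents) {
        std::lock_guard<std::mutex> lock(mutex_);
        sourceReported_ = true;
        sourceEventCount_ = nEvents;
    }

    // Called from the exception callback of the thread that failed.
    void notifyException() {
        std::lock_guard<std::mutex> lock(mutex_);
        exceptionSeen_ = true;
    }

    // Simplified model of the faulty behaviour: once the source has
    // reported, the exception flag is never consulted.
    bool shouldWriteJsonOld() const {
        std::lock_guard<std::mutex> lock(mutex_);
        if (sourceReported_) {
            return true;
        }
        return !exceptionSeen_;
    }

    // Model of the fix: an exception always suppresses the JSON output,
    // whether or not the source has already reported.
    bool shouldWriteJsonFixed() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return !exceptionSeen_;
    }

private:
    mutable std::mutex mutex_;
    bool sourceReported_ = false;
    unsigned sourceEventCount_ = 0;
    bool exceptionSeen_ = false;
};
```

In this model, calling `reportEventCount` before `notifyException` (the order that can only arise with multiple threads) makes `shouldWriteJsonOld()` return true while `shouldWriteJsonFixed()` returns false, which is the difference the update is meant to introduce.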

…g, with another thread already requesting the next event from the source. The source can then open the next LS (internally) and report the number of events in the past LS to the FastMonitoringService. In this case preEndLumi, triggered by the exception, can run later than the source report, in which case the exception check was (incorrectly) being skipped.
@cmsbuild
Contributor

A new Pull Request was created by @smorovic (Srecko Morovic) for CMSSW_7_5_X.

Fix race condition in DAQ modules when exception is thrown in event processing (only affecting multithreading) - 75X

It involves the following packages:

EventFilter/Utilities

@mommsen, @cvuosalo, @cmsbuild, @emeschi, @slava77 can you please review it and eventually sign? Thanks.
@Martin-Grunewald this is something you requested to watch as well.
You can sign-off by replying to this message having '+1' in the first line of your reply.
You can reject by replying to this message having '-1' in the first line of your reply.
If you are a L2 or a release manager you can ask for tests by saying 'please test' or '@cmsbuild, please test' in the first line of a comment.
@Degano you are the release manager for this.
You can merge this pull request by typing 'merge' in the first line of your comment.

@slava77
Contributor

slava77 commented Oct 30, 2015

@cmsbuild please test

@cmsbuild
Contributor

The tests are being triggered in jenkins.
https://cmssdt.cern.ch/jenkins/job/ib-any-integration/9380/console

@cmsbuild
Contributor

-1
Tested at: d7117b0
When I ran the RelVals I found an error in the following workflows:
25.0 step3

runTheMatrix-results/25.0_TTbar+TTbar+DIGI+RECOAlCaCalo+HARVEST+ALCATT/step3_TTbar+TTbar+DIGI+RECOAlCaCalo+HARVEST+ALCATT.log
----- Begin Fatal Exception 30-Oct-2015 14:12:39 CET-----------------------
An exception of category 'FatalRootError' occurred while
   [0] Calling EventProcessor::runToCompletion (which does almost everything after beginJob and before endJob)
   Additional Info:
      [a] Fatal Root Error: @SUB=TFile::Flush
error flushing file step3_inDQM.root (Disk quota exceeded)
----- End Fatal Exception -------------------------------------------------

1330.0 step1

runTheMatrix-results/1330.0_ZMM_13+ZMM_13+DIGIUP15+RECOUP15+HARVESTUP15/step1_ZMM_13+ZMM_13+DIGIUP15+RECOUP15+HARVESTUP15.log

you can see the results of the tests here:
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-12200/9380/summary.html

@cvuosalo
Contributor

@cmsbuild please test

@cmsbuild
Contributor

The tests are being triggered in jenkins.
https://cmssdt.cern.ch/jenkins/job/ib-any-integration/9401/console

@cvuosalo
Contributor

+1

For #12200 d7117b0

Fixing rare multi-threading race condition in event processing by DAQ modules. There should be no change in monitored quantities.

The code changes are satisfactory, and Jenkins tests against baseline CMSSW_7_5_X_2015-10-30-1100 show no significant differences, as expected.

@mommsen
Contributor

mommsen commented Nov 4, 2015

+1

@cmsbuild
Contributor

cmsbuild commented Nov 4, 2015

This pull request is fully signed and it will be integrated in one of the next CMSSW_7_5_X IBs (tests are also fine) and once validation in the development release cycle CMSSW_7_6_X is complete. This pull request requires discussion in the ORP meeting before it's merged. @davidlange6, @Degano, @smuzaffar

@davidlange6
Contributor

+1

cmsbuild added a commit that referenced this pull request Nov 4, 2015
Fix race condition in DAQ modules when exception is thrown in event processing (only affecting multithreading) - 75X
@cmsbuild cmsbuild merged commit 33b45e9 into cms-sw:CMSSW_7_5_X Nov 4, 2015
@smorovic smorovic deleted the exception-eols-fix-75X branch November 13, 2015 10:45
6 participants