
81X - Added CSC Unpacker check for FED/DDU<->chamber mapping inconsistencies #15330

Conversation

@barvic (Contributor) commented Jul 30, 2016

  • Added a check to the CSC Unpacker for FED/DDU<->chamber mapping inconsistencies, to prevent CSC reco crashes in case of rare data corruption (a case when a faulty chamber sends corrupted data posing as a chamber with a different ID). A sketch of the idea follows below.
  • Handled a few reported CMSSW static analyzer warnings.

80X PR #15329
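
As a rough illustration of the kind of guard this adds (a minimal stand-alone sketch, not the actual EventFilter/CSCRawToDigi code; ChamberKey, MappingTable, and ChamberBlock are hypothetical stand-ins for the real mapping and DetId classes): each chamber block's claimed ID is compared against the ID the FED/DDU<->chamber mapping predicts for its readout location, and a mismatching block is reported and dropped before it can reach reconstruction.

```cpp
#include <cstdio>
#include <map>
#include <tuple>
#include <vector>

// Hypothetical stand-in for CSCDetId: identifies one chamber.
struct ChamberKey {
  int endcap, station, ring, chamber;
  bool operator==(const ChamberKey& o) const {
    return std::tie(endcap, station, ring, chamber) ==
           std::tie(o.endcap, o.station, o.ring, o.chamber);
  }
};

// Expected mapping: (FED id, DDU input, DMB slot) -> the chamber it reads out.
using MappingTable = std::map<std::tuple<int, int, int>, ChamberKey>;

// What one chamber block in the raw payload claims about itself.
struct ChamberBlock {
  int fed, dduInput, dmbSlot;
  ChamberKey claimedId;  // decoded from the (possibly corrupted) data
};

// Keep only blocks whose claimed ID matches what the mapping predicts for
// their readout location; a corrupted block posing as another chamber is
// reported and dropped instead of being handed to reconstruction.
std::vector<ChamberBlock> unpackWithCheck(const std::vector<ChamberBlock>& raw,
                                          const MappingTable& mapping) {
  std::vector<ChamberBlock> good;
  for (const auto& blk : raw) {
    auto it = mapping.find({blk.fed, blk.dduInput, blk.dmbSlot});
    if (it == mapping.end() || !(it->second == blk.claimedId)) {
      std::fprintf(stderr,
                   "CSC unpacker: FED %d, DDU input %d, DMB slot %d claims an "
                   "ID inconsistent with the mapping; skipping block\n",
                   blk.fed, blk.dduInput, blk.dmbSlot);
      continue;
    }
    good.push_back(blk);
  }
  return good;
}

int main() {
  MappingTable mapping{{{836, 2, 5}, {1, 1, 2, 17}}};
  // The second block is "corrupted": it poses as a different chamber.
  std::vector<ChamberBlock> raw{{836, 2, 5, {1, 1, 2, 17}},
                                {836, 2, 5, {1, 1, 2, 18}}};
  std::printf("kept %zu of %zu blocks\n",
              unpackWithCheck(raw, mapping).size(), raw.size());
}
```

With a guard of this shape, a corrupted block posing as another chamber never reaches reco, so the duplicate-ID condition described later in this thread cannot arise.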

Added CSC Unpacker check for FED/DDU<->chamber mapping inconsistencies to prevent CSC reco crashes in case of rare data corruptions. Fixed a few reported CMS static analyzer warnings.
@cmsbuild (Contributor):

A new Pull Request was created by @barvic for CMSSW_8_1_X.

It involves the following packages:

EventFilter/CSCRawToDigi

@cmsbuild, @cvuosalo, @slava77, @davidlange6 can you please review it and eventually sign? Thanks.
@Martin-Grunewald, @ptcox this is something you requested to watch as well.
@slava77, @smuzaffar, you are the release managers for this.

cms-bot commands are listed here: #13028

@slava77 (Contributor) commented Jul 30, 2016

@cmsbuild please test

@cmsbuild (Contributor) commented Jul 30, 2016

The tests are being triggered in jenkins.
https://cmssdt.cern.ch/jenkins/job/ib-any-integration/14296/console

@slava77 (Contributor) commented Jul 30, 2016

@barvic please provide a link to slides from a CSC meeting with details of the issues addressed by this PR.
I don't recall recent crash reports in reco due to CSC unpacking. In which environment do we get the problems that are fixed by this PR?
A file and a config to test that the problem is resolved would be helpful.

@barvic (Contributor, Author) commented Jul 30, 2016

@slava77 You could check Tim's (@ptcox) CSC DPG report presentation from the July 20th CSC Weekly Meeting for some details (page 10):
https://indico.cern.ch/event/557894/contributions/2249187/attachments/1313017/1965659/160720_csc_dpg.pdf

  • On July 17, HLT observed 1553 crashes in 9 minutes during run 276870 due to CSC possibly sending corrupted data.
  • An HLT expert (Charles Mueller) was able to provide a few data dumps with those bad events for debugging (I think one file is still available at P5: /nfshome0/muell149/public/Run276870_ls1956_1972_ErrorStream_RAW.root, and I have a copy if needed).

Here is also some info from our mail exchange between experts, where I was trying to explain what's going on:
"And it seems that one of the chambers sent partially corrupted data, exposing itself as a chamber with a different ID, and coincidentally the chamber with that ID was also present in the readout for that event. So currently I think that is what could cause the double-entry error in the CSCRecHitDProducer module."

I have a test config file for reproducing this problem, and I think any common RECO config would trigger those crashes with that data file if the patch is not applied.
Please tell me if you need more details.

run276870_crash_debug_cfg.py.txt (https://github.com/cms-sw/cmssw/files/392142/run276870_crash_debug_cfg.py.txt)

@slava77 (Contributor) commented Jul 31, 2016

These details look good enough for now.

How urgently is this needed from the point of view of RC or RFM and CSC ops?

@barvic (Contributor, Author) commented Jul 31, 2016

@slava77 So far there has been just a single incident, on July 17th. I cannot really predict how soon it might happen again, but if HLT contacts us with the same issue again very soon, then we will definitely ask for higher priority for the release and deployment of this update.
I don't think we really want to wait too long (a few months) for deployment.
@ptcox Tim, what do you think? Should we discuss it with CSC ops?

@slava77 (Contributor) commented Jul 31, 2016

We usually have a release in 80X every week or two.
It sounds like it will be OK for the 80X version of this PR to make it on this time scale.

@ptcox (Contributor) commented Jul 31, 2016

Yes, I agree. It should be released in 80X as soon as possible, just in case. This is a fix to keep CMS running, of course, so it should be 'CMS' which is requiring it with urgency, not necessarily us :) However, we've never seen the problem before, or since.

Tim


@cvuosalo (Contributor) commented Aug 1, 2016

@barvic: Could you please make the bad events file available on AFS? I don't have access to P5 computers. Thanks.

@barvic (Contributor, Author) commented Aug 2, 2016

@cvuosalo Please try /afs/cern.ch/user/b/barvic/public/Run276870_ls1956_1972_ErrorStream_RAW.root

@cvuosalo (Contributor) commented Aug 2, 2016

@barvic: Thank you. I was very easily able to replicate the crash and confirm the fix.

@cvuosalo (Contributor) commented Aug 2, 2016

+1

For #15330 88d4f1e

Fix for CSC unpacker chamber mapping inconsistencies. This PR eliminates a very rare crash that can be triggered by data corruption. There should be no change in monitored quantities.

#15329 is the 80X version of this PR, and it has already been approved.

The code changes are satisfactory, and Jenkins tests against baseline CMSSW_8_1_X_2016-07-30-1100 show no significant differences, as expected. For #15329, the crash was replicated in CMSSW_8_0_16, and the fix was confirmed.

@cmsbuild (Contributor) commented Aug 2, 2016

This pull request is fully signed and it will be integrated in one of the next CMSSW_8_1_X IBs (tests are also fine). This pull request requires discussion in the ORP meeting before it's merged. @slava77, @davidlange6, @smuzaffar

@davidlange6 (Contributor):

+1

cmsbuild merged commit 42d8f51 into cms-sw:CMSSW_8_1_X on Aug 2, 2016