
81X - Added CSC Unpacker check for FED/DDU<->chamber mapping inconsistencies #15330

Conversation

@barvic (Contributor) commented Jul 30, 2016

  • Added a check to the CSC Unpacker for FED/DDU<->chamber mapping inconsistencies, to prevent CSC reco crashes in case of rare data corruption (a case when a faulty chamber sends corrupted data posing as a chamber with a different ID). A sketch of the idea follows below.
  • Handled a few reported CMSSW static analyzer warnings.

80X PR #15329
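
As a rough illustration of the kind of guard this adds (a minimal stand-alone sketch, not the actual EventFilter/CSCRawToDigi code; ChamberKey, MappingTable, and ChamberBlock are hypothetical stand-ins for the real mapping and DetId classes): each chamber block's claimed ID is compared against the ID the FED/DDU<->chamber mapping predicts for its readout location, and a mismatching block is reported and dropped before it can reach reconstruction.

```cpp
#include <cstdio>
#include <map>
#include <tuple>
#include <vector>

// Hypothetical stand-in for CSCDetId: identifies one chamber.
struct ChamberKey {
  int endcap, station, ring, chamber;
  bool operator==(const ChamberKey& o) const {
    return std::tie(endcap, station, ring, chamber) ==
           std::tie(o.endcap, o.station, o.ring, o.chamber);
  }
};

// Expected mapping: (FED id, DDU input, DMB slot) -> the chamber it reads out.
using MappingTable = std::map<std::tuple<int, int, int>, ChamberKey>;

// What one chamber block in the raw payload claims about itself.
struct ChamberBlock {
  int fed, dduInput, dmbSlot;
  ChamberKey claimedId;  // decoded from the (possibly corrupted) data
};

// Keep only blocks whose claimed ID matches what the mapping predicts for
// their readout location; a corrupted block posing as another chamber is
// reported and dropped instead of being handed to reconstruction.
std::vector<ChamberBlock> unpackWithCheck(const std::vector<ChamberBlock>& raw,
                                          const MappingTable& mapping) {
  std::vector<ChamberBlock> good;
  for (const auto& blk : raw) {
    auto it = mapping.find({blk.fed, blk.dduInput, blk.dmbSlot});
    if (it == mapping.end() || !(it->second == blk.claimedId)) {
      std::fprintf(stderr,
                   "CSC unpacker: FED %d, DDU input %d, DMB slot %d claims an "
                   "ID inconsistent with the mapping; skipping block\n",
                   blk.fed, blk.dduInput, blk.dmbSlot);
      continue;
    }
    good.push_back(blk);
  }
  return good;
}

int main() {
  MappingTable mapping{{{836, 2, 5}, {1, 1, 2, 17}}};
  // The second block is "corrupted": it poses as a different chamber.
  std::vector<ChamberBlock> raw{{836, 2, 5, {1, 1, 2, 17}},
                                {836, 2, 5, {1, 1, 2, 18}}};
  std::printf("kept %zu of %zu blocks\n",
              unpackWithCheck(raw, mapping).size(), raw.size());
}
```

With a guard of this shape, a corrupted block posing as another chamber never reaches reco, so the duplicate-ID condition described later in this thread cannot arise.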

Added CSC Unpacker check for FED/DDU<->chamber mapping inconsistencies to prevent CSC reco crashes in case of rare data corruptions. Fixed a few reported CMS static analyzer warnings.
@cmsbuild (Contributor):

A new Pull Request was created by @barvic for CMSSW_8_1_X.

It involves the following packages:

EventFilter/CSCRawToDigi

@cmsbuild, @cvuosalo, @slava77, @davidlange6 can you please review it and eventually sign? Thanks.
@Martin-Grunewald, @ptcox this is something you requested to watch as well.
@slava77, @smuzaffar, you are the release managers for this.

cms-bot commands are listed here: #13028

@slava77 (Contributor) commented Jul 30, 2016

@cmsbuild please test

@cmsbuild (Contributor) commented Jul 30, 2016

The tests are being triggered in jenkins.
https://cmssdt.cern.ch/jenkins/job/ib-any-integration/14296/console

@slava77 (Contributor) commented Jul 30, 2016

@barvic please provide a link to slides from a CSC meeting with details of the issues addressed by this PR.
I don't recall recent crash reports in reco due to CSC unpacking. In which environment do we get the problems that are fixed by this PR?
A file and a config to test that the problem is resolved would be helpful.

@barvic (Contributor, Author) commented Jul 30, 2016

@slava77 You could check Tim's (@ptcox) CSC DPG report presentation from the July 20th CSC Weekly Meeting for some details (page 10):
https://indico.cern.ch/event/557894/contributions/2249187/attachments/1313017/1965659/160720_csc_dpg.pdf

  • On July 17, HLT observed 1553 crashes in 9 minutes during run 276870 due to CSC possibly sending corrupted data.
  • An HLT expert (Charles Mueller) was able to provide a few data dumps with those bad events for debugging (I think one file is still available at P5: /nfshome0/muell149/public/Run276870_ls1956_1972_ErrorStream_RAW.root, and I have a copy if needed).

Here is also some info from our mail exchange between experts, where I was trying to explain what's going on:
"And it seems that one of the chambers sent partially corrupted data, exposing itself as a chamber with a different ID, and coincidentally the chamber with that ID was also present in the readout for that event. So currently I think that is what could cause the double-entry error in the CSCRecHitDProducer module."

I have a test config file for reproducing this problem, and I think any common RECO config would trigger those crashes with that data file if the patch is not applied.
Please tell me if you need more details.

run276870_crash_debug_cfg.py.txt (https://github.com/cms-sw/cmssw/files/392142/run276870_crash_debug_cfg.py.txt)

@slava77 (Contributor) commented Jul 31, 2016

These details look good enough for now.

How urgently is this needed from the point of view of RC or RFM and CSC ops?

@barvic (Contributor, Author) commented Jul 31, 2016

@slava77 So far there has been just a single incident, on July 17th. I cannot really predict how soon it might happen again, but if HLT contacts us with the same issue again very soon, then we will definitely ask for higher priority for the release and deployment of this update.
I don't think we really want to wait too long (a few months) for deployment.
@ptcox Tim, what do you think? Should we discuss it with CSC ops?

@slava77 (Contributor) commented Jul 31, 2016

We usually have a release in 80X every week or two.
It sounds like it will be OK for the 80X version of this PR to make it on this time scale.

@ptcox (Contributor) commented Jul 31, 2016

Yes, I agree. It should be released in 80X as soon as possible, just in case. This is a fix to keep CMS running, of course, so it should be 'CMS' which is requiring it with urgency, not necessarily us :) However, we've never seen the problem before, or since.

Tim


@cvuosalo (Contributor) commented Aug 1, 2016

@barvic: Could you please make the bad events file available on AFS? I don't have access to P5 computers. Thanks.

@barvic (Contributor, Author) commented Aug 2, 2016

@cvuosalo Please try /afs/cern.ch/user/b/barvic/public/Run276870_ls1956_1972_ErrorStream_RAW.root

@cvuosalo (Contributor) commented Aug 2, 2016

@barvic: Thank you. I was very easily able to replicate the crash and confirm the fix.

@cvuosalo (Contributor) commented Aug 2, 2016

+1

For #15330 88d4f1e

Fix for CSC unpacker chamber mapping inconsistencies. This PR eliminates a very rare crash that can be triggered by data corruption. There should be no change in monitored quantities.

#15329 is the 80X version of this PR, and it has already been approved.

The code changes are satisfactory, and Jenkins tests against baseline CMSSW_8_1_X_2016-07-30-1100 show no significant differences, as expected. For #15329, the crash was replicated in CMSSW_8_0_16, and the fix was confirmed.

@cmsbuild (Contributor) commented Aug 2, 2016

This pull request is fully signed and it will be integrated in one of the next CMSSW_8_1_X IBs (tests are also fine). This pull request requires discussion in the ORP meeting before it's merged. @slava77, @davidlange6, @smuzaffar

@davidlange6 (Contributor):

+1

cmsbuild merged commit 42d8f51 into cms-sw:CMSSW_8_1_X on Aug 2, 2016