Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

treat a predicted state too far as invalid in CosmicMuonSmoother #33168

Conversation

slava77
Copy link
Contributor

@slava77 slava77 commented Mar 13, 2021

this is a follow-up to the origins of the issue reported in https://hypernews.cern.ch/HyperNews/CMS/get/recoDevelopment/1706.html
in processing of date in Run: 340323, which has a seg fault.

This is a case of runaway fit leading to running over numerical precision limits with a final result as NaN. Some later code using this can crash (fixed separately in #33157).

@cmsbuild
Copy link
Contributor

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-33168/21562

  • This PR adds an extra 16KB to repository

@cmsbuild
Copy link
Contributor

A new Pull Request was created by @slava77 (Slava Krutelyov) for master.

It involves the following packages:

RecoMuon/CosmicMuonProducer

@perrotta, @jpata, @cmsbuild, @slava77 can you please review it and eventually sign? Thanks.
@calderona, @abbiendi, @jhgoh, @bellan, @HuguesBrun, @Fedespring, @trocino, @cericeci, @rociovilar this is something you requested to watch as well.
@silviodonato, @dpiparo, @qliphy you are the release manager for this.

cms-bot commands are listed here

@slava77
Copy link
Contributor Author

slava77 commented Mar 13, 2021

@cmsbuild please test

@cmsbuild
Copy link
Contributor

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-8d8e71/13480/summary.html
COMMIT: 114c370
CMSSW: CMSSW_11_3_X_2021-03-12-2300/slc7_amd64_gcc900
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/33168/13480/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 4 differences found in the comparisons
  • DQMHistoTests: Total files compared: 37
  • DQMHistoTests: Total histograms compared: 2635087
  • DQMHistoTests: Total failures: 6
  • DQMHistoTests: Total nulls: 1
  • DQMHistoTests: Total successes: 2635058
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: -0.004 KiB( 36 files compared)
  • DQMHistoSizes: changed ( 312.0 ): -0.004 KiB MessageLogger/Warnings
  • Checked 155 log files, 37 edm output root files, 37 DQM output files

@slava77
Copy link
Contributor Author

slava77 commented Mar 13, 2021

@ArnabPurohit @trocino
please check if this fix is acceptable for MUO POG.
Thank you.

@slava77
Copy link
Contributor Author

slava77 commented Mar 16, 2021

@ArnabPurohit @trocino
please check if this fix is acceptable for MUO POG.
Thank you.

please clarify if someone in MUO POG is looking at this.

Unfortunately the reference root file /store/data/Commissioning2021/Cosmics/RAW/v1/000/340/323/00000/7ce1719d-fa9c-4e91-b053-d373dd3b8a0b.root fell off the disk to just tape. So, rerunning for a debug would need a transfer.

@trocino
Copy link
Contributor

trocino commented Mar 17, 2021

Thanks Slava for taking care of this, and apologies for the late response. Apparently my spam filter doesn't like GitHub threads lately...

The fix (square modulus of the predicted state global position not to exceed 1e+12) is a minimal safety check and is certainly harmless for non-pathological cases, so it is perfectly acceptable on the MUO POG side.

Just for my understanding, can I ask a couple more questions?

  • If I understand correctly from the log files, the crash happens in event 19951024, where the propagation of a predicted state goes over 1e+8 in position and the corresponding covariance matrix is filled with NaNs. Is that it?
  • In PR #33157 you fixed some code in TECLayer: is it the piece of code that crashes from the NaNs? I'm having a hard time understanding that from the log files and the stack trace. It seems like several modules are affected by the NaN state

@slava77
Copy link
Contributor Author

slava77 commented Mar 17, 2021

Just for my understanding, can I ask a couple more questions?

  • If I understand correctly from the log files, the crash happens in event 19951024, where the propagation of a predicted state goes over 1e+8 in position and the corresponding covariance matrix is filled with NaNs. Is that it?

Right, the crash is in 340323:1225:19951024 (run:ls:evt). The crash itself is in OutsideInMuonSeeder:cosmicDCSeeds where the reco::Muon::outerTrack()::innerMomentum() and position are used to find compatible dets; at this point the state is a NaN, not just the covariance.
The problem starts upstream in CosmicMuonProducer:cosmicMuons , or more specifically in the CosmicMuonSmoother::fit where the fix is applied

You can perhaps follow the following debug from printouts in CosmicMuonSmoother::fit defined in CMSSW_11_2_3...slava77:CMSSW_11_2_3/t0-340323/CosmicMuon-debug

This is compressed with my additional comments after #

TO PREDICT from global parameters
x =   -6.6554e+07  3.82524e+07 -1.13034e+07
p =    7.8655e-06 -4.52515e-06  4.23877e-05
# the rest of the parameters are OK
INITIAL PRED isNotValid
ALMOST FINAL PRED isNotValid
# after two attempts to propagate, tries with StraightLinePropagator
FINAL PRED global parameters
x =  -1.20086e+08    6.905e+07 -2.99789e+08
p =    7.8655e-06 -4.52515e-06  4.23877e-05
local parameters (q/p,v',w',v,w)
     -23069.1 -5.65162e+07 -2.63655e+08 -1.38602e+08 -2.99753e+08
local error
         -nan         -nan         -nan         -nan         -nan ...
  ADDED from validHit pos  
 predTsos global parameters
x =  -1.20086e+08    6.905e+07 -2.99789e+08
p =    7.8655e-06 -4.52515e-06  4.23877e-05
# the Kalman update now returns both invalid state as well as covariance
 currTsos global parameters
x =          -nan         -nan         -nan
p =          -nan         -nan         -nan
  • In PR #33157 you fixed some code in TECLayer: is it the piece of code that crashes from the NaNs? I'm having a hard time understanding that from the log files and the stack trace. It seems like several modules are affected by the NaN state

#33157 is a downstream fix; strictly speaking it is not needed for this problematic event after this (#33168) PR. The logistics of the crash is the TECLayer call gets NaN state, then uses it to compute an intersection of this state with a plane and (pre-fix) returns a valid state with NaN values, then the code converts the state phi to an integer bin index which is then used to get an element of an array happening out of bounds.

We usually make an effort to get rid of NaNs at the origin; the reason for this PR.
However, rather generic code that may get a NaN as an input anyways is still better to have some protection against invalid inputs; the reason for #33157.

I hope this clarifies.

@trocino
Copy link
Contributor

trocino commented Mar 18, 2021

It does clarify, thanks.

As I said, your fix is perfectly fine on the MUO side, no objections to merging it.

@slava77
Copy link
Contributor Author

slava77 commented Mar 18, 2021

+reconstruction

for #33168 114c370

  • code changes are confirmed by MUO POG contact(s)
  • jenkins tests pass and comparisons with the baseline show no (relevant) differences; only rather rare cases with bad cosmic muon states are expected to lead to a change.

@cmsbuild
Copy link
Contributor

This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @silviodonato, @dpiparo, @qliphy (and backports should be raised in the release meeting by the corresponding L2)

@qliphy
Copy link
Contributor

qliphy commented Mar 19, 2021

+1

@cmsbuild cmsbuild merged commit 8a3e75a into cms-sw:master Mar 19, 2021
cmsbuild added a commit that referenced this pull request Mar 19, 2021
…uonSmoother

treat a predicted state too far as invalid in CosmicMuonSmoother (backport of #33168)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants