Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Protect access to updatedState #14060

Merged
merged 1 commit into from Apr 18, 2016

Conversation

rovere
Copy link
Contributor

@rovere rovere commented Apr 14, 2016

While testing if a newly added TrajectoryMeasurement to a Trajectory is
compatible or not with the cut on the cluster charge, a protection has
been added to test the validity of the updatedState before using it.
This fix a segmentation fault that was due to an invalid updatedState
caused by a failed KfUpdator. Full log avaialable here:
https://hypernews.cern.ch/HyperNews/CMS/get/recoDevelopment/1436.html

While testing if a newly added TrajectoryMeasurement to a Trajectory is
compatible or not with the cut on the cluster charge, a protection has
been added to test the validity of the updatedState before using it.
This fix a segmentation fault that was due to an invalid updatedState
caused by a failed KfUpdator. Full log avaialable here:
https://hypernews.cern.ch/HyperNews/CMS/get/recoDevelopment/1436.html
@cmsbuild
Copy link
Contributor

A new Pull Request was created by @rovere (Marco Rovere) for CMSSW_8_0_X.

It involves the following packages:

TrackingTools/PatternTools

@cmsbuild, @cvuosalo, @slava77, @davidlange6 can you please review it and eventually sign? Thanks.
@battibass, @makortel, @abbiendi, @GiacomoSguazzoni, @jhgoh, @VinInn, @bellan, @istaslis, @gpetruc, @cerati, @trocino, @dgulhan this is something you requested to watch as well.
@slava77, @Degano, @smuzaffar you are the release manager for this.

cms-bot commands are list here #13028

@@ -118,6 +118,8 @@ bool TempTrajectory::badForCCC(const TrajectoryMeasurement &tm) {
return false;
if (thit->isPixel())
return false;
if (!tm.updatedState().isValid())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

uhmmm, in principle we should NOT store invalid states in the trajectory. we should abort Pattern-reco for that traj...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ciao Vincenzo,
I agree, but at this level I cannot abort PatternReco. I can return true instead of false, but this will not guarantee to stop the PR.

@davidlange6
Copy link
Contributor

please test

@cmsbuild
Copy link
Contributor

The tests are being triggered in jenkins.
https://cmssdt.cern.ch/jenkins/job/ib-any-integration/12365/console

@cmsbuild
Copy link
Contributor

@slava77
Copy link
Contributor

slava77 commented Apr 14, 2016

@rovere
what is the status of 81X PR?

@rovere
Copy link
Contributor Author

rovere commented Apr 14, 2016

Ciao @slava77,
I still have to prepare it. I'd like to test it on real data, but I need to cook the proper cmsDriver command. If you happen to have one at hand, that would help.

@cmsbuild
Copy link
Contributor

@slava77
Copy link
Contributor

slava77 commented Apr 14, 2016

Ciao Marco,

something like this:

cmsDriver.py step3 --conditions  -s RAW2DIGI,L1Reco,RECO,EI,PAT,[putDQM/ALCA here if needed] \
--runUnscheduled --data --scenario pp --datatier RECO,MINIAOD \
--era ppEra_Run2_2016_trackingLowPU \
   --customise  Configuration/DataProcessing/RecoTLR.customisePostEra_Run2_2016

I omitted input/output file names
For tracking tests you don't really need the last customization (it contains settings for data for HCAL)

@rovere
Copy link
Contributor Author

rovere commented Apr 14, 2016

@slava77 it seems that era ppEra_Run2_2016_trackingLowPU is not part of 81X, see, e.g., https://github.com/cms-sw/cmssw/blob/CMSSW_8_1_X/Configuration/StandardSequences/python/Eras.py

@slava77
Copy link
Contributor

slava77 commented Apr 14, 2016

On 4/14/16 8:22 AM, Marco Rovere wrote:

@slava77 https://github.com/slava77 it seems that era
|ppEra_Run2_2016_trackingLowPU| is not part of 81X, see, e.g.,
https://github.com/cms-sw/cmssw/blob/CMSSW_8_1_X/Configuration/StandardSequences/python/Eras.py

ah, sorry,
this is not merged yet
add #13988
to test


You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHub
#14060 (comment)

@makortel
Copy link
Contributor

@slava77 @rovere With #13988 one should use --era Run2_2016_trackingLowPU (without the leading ppEra_), right?

@slava77
Copy link
Contributor

slava77 commented Apr 14, 2016

Matti, thanks for correcting my typo in the cmsDriver.
Yes, it should be Run2_...

@slava77
Copy link
Contributor

slava77 commented Apr 15, 2016

+1

for #14060 8f0ec1f

  • changes are in line with the description; the logic affects only events that would otherwise have a crash
    • I was not able to reproduce the crash today with ced2dece-0172-11e6-a395-001e67ac06a0-6-3-logArchive.tar.gz in CMSSW_8_0_4 with the supplied PSet.py (neither on AMD not on Intel). The fix is logically in a correct place though and according to @rovere it does the job
  • jenkins tests pass and comparisons with baseline show no difference

@cmsbuild
Copy link
Contributor

This pull request is fully signed and it will be integrated in one of the next CMSSW_8_0_X IBs (tests are also fine). This pull request requires discussion in the ORP meeting before it's merged. @slava77, @davidlange6, @Degano, @smuzaffar

@rovere
Copy link
Contributor Author

rovere commented Apr 15, 2016

Ciao @slava77,
that's very weird. I picked the very same tgz and I am currently able to constantly reproduce the crash on an Intel i7 Haswell. If you are still interested, you could quickly go to the offending event by:

process.source.skipEvents = cms.untracked.uint32(535)

and processing only 1 event.

The crash happens both running with 4 threads or running with 1 thread.
Finally, after having recompiled almost all 81X with #13988, I can see the very same crash also on 81X, so I am in a better position to provide a separate fix also for that release. I'd not keep this one on hold, though.

rovere added a commit to rovere/cmssw that referenced this pull request Apr 15, 2016
While testing if a newly added TrajectoryMeasurement to a Trajectory is
compatible or not with the cut on the cluster charge, a protection has
been added to test the validity of the updatedState before using it.
This fix a segmentation fault that was due to an invalid updatedState
caused by a failed KfUpdator. Full log avaialable here:
https://hypernews.cern.ch/HyperNews/CMS/get/recoDevelopment/1436.html

See also Pull Request cms-sw#14060 for the corresponding PR for the 80X
branch.
@rovere
Copy link
Contributor Author

rovere commented Apr 15, 2016

I have a branch ready for 81X, but would rather prefer to have #13988 integrate first, otherwise my PR will bring in the very same changes as that PR.
Let me know.

Marco.

@slava77
Copy link
Contributor

slava77 commented Apr 15, 2016

On 4/15/16 1:17 AM, Marco Rovere wrote:

Ciao @slava77 https://github.com/slava77,
that's very weird. I picked the very same tgz and I am currently able to
constantly reproduce the crash on an Intel i7 Haswell. If you are still
interested, you could quickly go to the offending event by:

|process.source.skipEvents = cms.untracked.uint32(535)|

and processing only 1 event.

Marco,

Can you try on cmsdev02 (that's where I tested)?

I have just tried in CMSSW_8_0_4 with the PSet.py with extra lines

process.source.skipEvents = cms.untracked.uint32(535)
process.options.numberOfThreads = 1
process.maxEvents.input = 10

to match what you describe.

It ran to completion.

Begin processing the 1st record. Run 269223, Event 9231372, LumiSection
10 at 15-Apr-2016 14:46:17.303 CEST
Begin processing the 2nd record. Run 269223, Event 9383584, LumiSection
10 at 15-Apr-2016 14:46:52.066 CEST
Begin processing the 3rd record. Run 269223, Event 9172000, LumiSection
10 at 15-Apr-2016 14:46:56.021 CEST
Begin processing the 4th record. Run 269223, Event 8728400, LumiSection
10 at 15-Apr-2016 14:46:57.442 CEST
Begin processing the 5th record. Run 269223, Event 9326019, LumiSection
10 at 15-Apr-2016 14:47:13.941 CEST
Begin processing the 6th record. Run 269223, Event 9335299, LumiSection
10 at 15-Apr-2016 14:47:16.681 CEST
Begin processing the 7th record. Run 269223, Event 9335303, LumiSection
10 at 15-Apr-2016 14:47:22.143 CEST
Begin processing the 8th record. Run 269223, Event 8453721, LumiSection
10 at 15-Apr-2016 14:47:23.557 CEST
Begin processing the 9th record. Run 269223, Event 9153410, LumiSection
10 at 15-Apr-2016 14:47:25.072 CEST
Begin processing the 10th record. Run 269223, Event 9047646, LumiSection
10 at 15-Apr-2016 14:47:27.148 CEST

The crash happens both running with 4 threads or running with 1 thread.
Finally, after having recompiled almost all 81X with #13988
#13988, I can see the very same
crash also on 81X, so I am in a better position to provide a separate
fix also for that release. I'd not keep this one on hold, though.


You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHub
#14060 (comment)

@rovere
Copy link
Contributor Author

rovere commented Apr 15, 2016

Ciao @slava77,
I confirm your findings, yet, as I checked also this morning, I consistently have the crash on both 804 and 810 on the machine I mentioned.
Could Tier-0 guys report back the kind of machine on which we had the crash?

@slava77
Copy link
Contributor

slava77 commented Apr 15, 2016

I tried on the following without any crashes

AMD Opteron(tm) Processor 6128
Westmere E56xx/L56xx/X56xx (Nehalem-C)
Intel Xeon E312xx (Sandy Bridge)
Intel Core i7 9xx (Nehalem Class Core i7)
Intel(R) Xeon(R) CPU           L5420  @ 2.50GHz

is this Ofast doing or is it something else, I wonder.

@slava77
Copy link
Contributor

slava77 commented Apr 15, 2016

finally got a crash on

 Intel(R) Xeon(R) CPU           X5650  @ 2.67GHz

on FNAL LPC

@slava77
Copy link
Contributor

slava77 commented Apr 15, 2016

@davidlange6
when do we get this into 80X?
are you waiting for 81X PR integration?

@davidlange6 davidlange6 merged commit 2694fd6 into cms-sw:CMSSW_8_0_X Apr 18, 2016
@rovere rovere deleted the checkUpdatedStateValidity branch October 3, 2023 07:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants