New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
make RU CSC segment algorithm reproducible by enforcing constness #19421
make RU CSC segment algorithm reproducible by enforcing constness #19421
Conversation
…an AlgoState percolated through all methods. This is a somewhat mindless method to make each call independent and resolve the problem with reproducibility running the algorithm in multithreaded mode or otherwise reordered events.
A new Pull Request was created by @slava77 (Slava Krutelyov) for master. It involves the following packages: RecoLocalMuon/CSCSegment @perrotta, @cmsbuild, @slava77, @davidlange6 can you please review it and eventually sign? Thanks. cms-bot commands are listed here |
@cmsbuild please test |
The tests are being triggered in jenkins. |
-1 Tested at: 404d1d9 The following merge commits were also included on top of IB + this PR after doing git cms-merge-topic: You can see the results of the tests here: I found follow errors while testing this PR Failed tests: RelVals
When I ran the RelVals I found an error in the following worklfows: runTheMatrix-results/136.731_RunSinglePh2016B+RunSinglePh2016B+HLTDR2_2016+RECODR2_2016reHLT_skimSinglePh_HIPM+HARVESTDR2/step3_RunSinglePh2016B+RunSinglePh2016B+HLTDR2_2016+RECODR2_2016reHLT_skimSinglePh_HIPM+HARVESTDR2.log The following merge commits were also included on top of IB + this PR after doing git cms-merge-topic: |
Comparison not run due to runTheMatrix errors (RelVals and Igprof tests were also skipped) |
chi2Norm_2D_ = 5*chi2Norm_2D_; | ||
chi2_str_ = 100; | ||
chi2Max = 2*chi2Max; | ||
if(aState.doCollisions && search_disp && int(rechits.size()-used_rh)>2){//check if there are enough recHits left to build a segment from displaced vertices |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please also apply here the (unrelated) fix pointed out in #19081 (review)?
The crash to wf 136.731 is already in the latest CMSSW_9_2_X_2017-06-23-2300 IB, and therefore unrelated from this PR. It is quite likely originated from the merging of #19194 |
Hey, Slava! Thanks for doing our work for us. This seems like a sledgehammer fix! I still want Nikolay to i) simplify the logic flow, ii) leave the config parameters const and not perform algebra on them, and iii) remove historical comments that have no relation to the current code. But it is great that you've solved the non-reproducibility like this. Thanks! |
On 6/24/17 2:50 AM, ptcox wrote:
Hey, Slava! Thanks for doing our work for us. This seems like a
sledgehammer fix! I still want Nikolay to i) simplify the logic flow,
ii) leave the config parameters const and not perform algebra on them,
and iii) remove historical comments that have no relation to the current
code. But it is great that you've solved the non-reproducibility like
this. Thanks!
Hi Tim,
Regarding the "sledgehammer fix", given the attempts by others to read
into the code and find an issue,
it felt much easier to make it reproducible from the first principles
and not spend much effort trying to understand the full logic of the
algorithm.
Just to be clear:
are you OK with this fix or would you rather we revert the algorithm to
the old one and wait for Nikolay to provide fixes on the 3 items in your
list?
Please clarify.
I'm fine either way, but we should know soon before the release is to be
built.
Thank you.
…--slava
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#19421 (comment)>, or
mute the thread
<https://github.com/notifications/unsubscribe-auth/AEdcbudfnDtXSec3GqHhki3IKRV6jNHgks5sHNvmgaJpZM4OEQTC>.
|
Hi Slava,
I'm happy with your fix temporarily. Longer term I think we need the code cleaned up as we all see it needs. But I (and all CSC) are grateful to you that we won't have to just revert to the old algorithm.
Thanks,
Tim
________________________________
From: Slava Krutelyov [notifications@github.com]
Sent: 24 June 2017 17:16
To: cms-sw/cmssw
Cc: Tim Cox; Mention
Subject: Re: [cms-sw/cmssw] make RU CSC segment algorithm reproducible by enforcing constness (#19421)
On 6/24/17 2:50 AM, ptcox wrote:
Hey, Slava! Thanks for doing our work for us. This seems like a
sledgehammer fix! I still want Nikolay to i) simplify the logic flow,
ii) leave the config parameters const and not perform algebra on them,
and iii) remove historical comments that have no relation to the current
code. But it is great that you've solved the non-reproducibility like
this. Thanks!
Hi Tim,
Regarding the "sledgehammer fix", given the attempts by others to read
into the code and find an issue,
it felt much easier to make it reproducible from the first principles
and not spend much effort trying to understand the full logic of the
algorithm.
Just to be clear:
are you OK with this fix or would you rather we revert the algorithm to
the old one and wait for Nikolay to provide fixes on the 3 items in your
list?
Please clarify.
I'm fine either way, but we should know soon before the release is to be
built.
Thank you.
--slava
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#19421 (comment)>, or
mute the thread
<https://github.com/notifications/unsubscribe-auth/AEdcbudfnDtXSec3GqHhki3IKRV6jNHgks5sHNvmgaJpZM4OEQTC>.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub<#19421 (comment)>, or mute the thread<https://github.com/notifications/unsubscribe-auth/AE2FnntsGdKSAagli2ce1pRklTAcPgyqks5sHShrgaJpZM4OEQTC>.
|
@cmsbuild please test it looks like failures in 136.731 are somewhat random (the baseline used in the last test CMSSW_9_2_X_2017-06-23-2300 did not have the error). |
The tests are being triggered in jenkins. |
+1 The following merge commits were also included on top of IB + this PR after doing git cms-merge-topic: |
Comparison job queued. |
Comparison is ready There are some workflows for which there are errors in the baseline: Comparison Summary:
|
+1
@perrotta I couldn't convince myself that the change mentioned in #19421 (comment) is required (it definitely would be if there was no int() on the left hand side already). |
This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request requires discussion in the ORP meeting before it's merged. @davidlange6, @smuzaffar |
Slava Krutelyov <notifications@github.com> ha scritto:
@perrotta I couldn't convince myself that the change mentioned in
#19421 (comment) is
required (it definitely would be if there was no int() on the left
hand side already).
Fine with me, because here the result will not change.
Simply, I find in this case only useless, but normally also error
prone, casting to an int the whole difference between an unsigned type
and an int.
|
Hi Andrea,
I agree with you and I'll make sure Nikolay includes cleaning this up in the more thorough revision of the code in the next few weeks.
Tim
________________________________
From: perrotta [notifications@github.com]
Sent: 25 June 2017 11:50
To: cms-sw/cmssw
Cc: Tim Cox; Mention
Subject: Re: [cms-sw/cmssw] make RU CSC segment algorithm reproducible by enforcing constness (#19421)
Slava Krutelyov <notifications@github.com> ha scritto:
@perrotta I couldn't convince myself that the change mentioned in
#19421 (comment) is
required (it definitely would be if there was no int() on the left
hand side already).
Fine with me, because here the result will not change.
Simply, I find in this case only useless, but normally also error
prone, casting to an int the whole difference between an unsigned type
and an int.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub<#19421 (comment)>, or mute the thread<https://github.com/notifications/unsubscribe-auth/AE2FnrbZS3FTI_63suBxCaIjVgI4bHb6ks5sHi1ggaJpZM4OEQTC>.
|
+1 |
make method buildSegments const and move all varying data members to an AlgoState percolated through all methods.
This is a somewhat mindless method to make each call independent and resolve the problem with reproducibility running the algorithm in multithreaded mode or otherwise reordered events.
With this solution the reproducibility is effectively enforced by the compiler.
The code is called chamber-by chamber. Clearly, there was a changing memory between calls to build a segment in different chambers. After the constness is enforced, the order of calls between chambers or between events shouldn't matter.
Changes, compared to the baseline (black CMSSW_9_2_3_patch1) in wf 27411 (10 muons per event)
in one thread:
Baseline CMSSW_9_2_3_patch1 comparison between single-thread run (black) and multi-thread (red, using 8 threads)
in this test the events are still somewhat in order and on the same events there should be no differences. This explains much smaller size of changes in the multithread-single thread.
After the fix there are no differences in the cscSegments distributions in comparison between MT1 and MT8 runs.