make RU CSC segment algorithm reproducible by enforcing constness #19421

Merged

@slava77 slava77 commented Jun 24, 2017

make method buildSegments const and move all varying data members to an AlgoState percolated through all methods.

This is a somewhat mindless way to make each call independent and to resolve the reproducibility problem seen when running the algorithm in multithreaded mode or on otherwise reordered events.
With this solution the reproducibility is effectively enforced by the compiler.

The code is called chamber by chamber. Clearly, some memory was changing between the calls that build segments in different chambers. Now that constness is enforced, the order of calls between chambers or between events shouldn't matter.
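
As a rough illustration of the pattern described above (not the actual CMSSW code), the sketch below keeps the configured value in a const member and copies it into a per-call state object that is passed by reference to the helper methods. Only the names `buildSegments` and `AlgoState` come from this PR; the class name, the single `chi2Max` field, and the helpers are hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the segment builder; not the real class.
class SegAlgoSketch {
public:
  explicit SegAlgoSketch(double chi2Max) : chi2Max_(chi2Max) {}

private:
  // Everything that may change during one call lives here, not in the class.
  struct AlgoState {
    double chi2Max = 0.0;
  };

public:
  // const: any write to a data member from this method fails to compile.
  std::size_t buildSegments(const std::vector<double>& hitChi2, bool relaxCuts) const {
    AlgoState state;           // fresh state for every call
    state.chi2Max = chi2Max_;  // always start from the configured value
    if (relaxCuts) loosenCuts(state);
    return countAccepted(hitChi2, state);
  }

private:
  // Helpers receive the state explicitly instead of touching members.
  void loosenCuts(AlgoState& state) const { state.chi2Max *= 2.0; }

  std::size_t countAccepted(const std::vector<double>& hitChi2, const AlgoState& state) const {
    std::size_t n = 0;
    for (double c : hitChi2) {
      if (c < state.chi2Max) ++n;
    }
    return n;
  }

  const double chi2Max_;  // configuration, never modified after construction
};
```

With this layout, a call with relaxed cuts cannot influence a later call, because the relaxed value exists only in the local `AlgoState` and is discarded on return.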

Changes compared to the baseline (black, CMSSW_9_2_3_patch1) in wf 27411 (10 muons per event), running in one thread:
all_sign935-mt1vsorig_tenmuextendede2023d17wf27411p0c_cscdetidcscsegmentsownedrangemap_cscsegments__reco_obj_collection__data__degreesoffreedom

Baseline CMSSW_9_2_3_patch1 comparison between single-thread run (black) and multi-thread (red, using 8 threads)
all_orig-mt8vsorig_tenmuextendede2023d17wf27411p0c_cscdetidcscsegmentsownedrangemap_cscsegments__reco_obj_collection__data__localposition_x
In this test the events are still somewhat in order, and on the same events there should be no differences. This explains the much smaller size of the changes in the multi-thread vs. single-thread comparison.

After the fix there are no differences in the cscSegments distributions in comparison between MT1 and MT8 runs.

@cmsbuild

A new Pull Request was created by @slava77 (Slava Krutelyov) for master.

It involves the following packages:

RecoLocalMuon/CSCSegment

@perrotta, @cmsbuild, @slava77, @davidlange6 can you please review it and eventually sign? Thanks.
@ptcox, @bellan, @abbiendi, @jhgoh this is something you requested to watch as well.
@davidlange6 you are the release manager for this.

cms-bot commands are listed here

@slava77

slava77 commented Jun 24, 2017

@cmsbuild please test

@cmsbuild

cmsbuild commented Jun 24, 2017

The tests are being triggered in jenkins.
https://cmssdt.cern.ch/jenkins/job/ib-any-integration/20898/console Started: 2017/06/24 09:31

@cmsbuild

-1

Tested at: 404d1d9

The following merge commits were also included on top of IB + this PR after doing git cms-merge-topic:
821918a
You can see more details here:
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-19421/20898/git-log-recent-commits
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-19421/20898/git-merge-result

You can see the results of the tests here:
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-19421/20898/summary.html

I found the following errors while testing this PR

Failed tests: RelVals

  • RelVals:

When I ran the RelVals I found an error in the following workflows:
136.731 step3

runTheMatrix-results/136.731_RunSinglePh2016B+RunSinglePh2016B+HLTDR2_2016+RECODR2_2016reHLT_skimSinglePh_HIPM+HARVESTDR2/step3_RunSinglePh2016B+RunSinglePh2016B+HLTDR2_2016+RECODR2_2016reHLT_skimSinglePh_HIPM+HARVESTDR2.log

@cmsbuild

Comparison not run due to runTheMatrix errors (RelVals and Igprof tests were also skipped)

chi2Norm_2D_ = 5*chi2Norm_2D_;
chi2_str_ = 100;
chi2Max = 2*chi2Max;
if(aState.doCollisions && search_disp && int(rechits.size()-used_rh)>2){//check if there are enough recHits left to build a segment from displaced vertices

Could you please also apply here the (unrelated) fix pointed out in #19081 (review)?

@perrotta

The crash in wf 136.731 is already present in the latest CMSSW_9_2_X_2017-06-23-2300 IB, and is therefore unrelated to this PR. It quite likely originates from the merging of #19194

@ptcox

ptcox commented Jun 24, 2017

Hey, Slava! Thanks for doing our work for us. This seems like a sledgehammer fix! I still want Nikolay to i) simplify the logic flow, ii) leave the config parameters const and not perform algebra on them, and iii) remove historical comments that have no relation to the current code. But it is great that you've solved the non-reproducibility like this. Thanks!

@slava77

slava77 commented Jun 24, 2017 via email

@ptcox

ptcox commented Jun 24, 2017 via email

@slava77

slava77 commented Jun 24, 2017

@cmsbuild please test

It looks like the failures in 136.731 are somewhat random (the baseline used in the last test, CMSSW_9_2_X_2017-06-23-2300, did not have the error).
Maybe they will go away.

@cmsbuild

cmsbuild commented Jun 24, 2017

The tests are being triggered in jenkins.
https://cmssdt.cern.ch/jenkins/job/ib-any-integration/20900/console Started: 2017/06/24 17:55

@cmsbuild

Comparison job queued.

@slava77

slava77 commented Jun 24, 2017

Here are some plots from running on 1K events with pt=1000 muons:

The trend from the PR description seems to repeat: the CSC segments are becoming somewhat shorter, but more abundant
all_sign935vsorig_singlemupt1000in2017wf10009p0c_cscdetidcscsegmentsownedrangemap_cscsegments__reco_obj_collection__data__degreesoffreedom

This change corresponds to somewhat better DyDz residuals (the effect is less pronounced on Dy or Dx)
wf10009_gm_csc1_ddydz

wf10009_gm_csc2_ddydz

There are more hits on tracks
wf10009_gm_cschits_eta

and there is probably a higher efficiency (one bin here, not stat significant)
wf10009_gm_eff_eta

The more restrictive definition of efficiency (IIRC, the numerator requires a fraction of hits to be from muon sim hits) is clearly better by ~4-5% in the endcaps
wf10009_staupd_effq075_eta

q/pt pull (and other pulls) is not changing significantly
wf10009_staupd_pullqop

The plots above suggest to me that the behavior of the algorithm starting from a fixed initial state is appropriate. In other words, the original version did not get to its performance by starting from an incorrect initial point and then settling on a better one after a few segment fits, by virtue of settings changed in one fit being remembered in the next fit calls.
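
To make the "remembered in the next fit calls" mechanism concrete, here is a deliberately simplified sketch of the pre-fix situation. It is an assumption-laden toy, not the original code: the class `StatefulSketch` and its single cut are hypothetical, and the in-place doubling only mimics the style of the `chi2Max = 2*chi2Max;` line quoted earlier in this thread.

```cpp
#include <cstddef>
#include <vector>

// Toy model of a non-const builder whose cut is a mutable data member.
class StatefulSketch {
public:
  explicit StatefulSketch(double chi2Max) : chi2Max_(chi2Max) {}

  // Non-const: the method is free to modify chi2Max_.
  std::size_t buildSegments(const std::vector<double>& hitChi2, bool relaxCuts) {
    if (relaxCuts) chi2Max_ = 2 * chi2Max_;  // rescaled in place, never reset
    std::size_t n = 0;
    for (double c : hitChi2) {
      if (c < chi2Max_) ++n;
    }
    return n;
  }

private:
  double chi2Max_;  // carries over from one call to the next
};
```

In this toy, the same hits with the same `relaxCuts` flag can give a different count depending on which calls came before, which is the kind of order dependence that shows up when chambers or events are processed in a different sequence.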

@cmsbuild

Comparison is ready
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-19421/20900/summary.html

There are some workflows for which there are errors in the baseline:
10824.0 step 3
The results for the comparisons for these workflows could be incomplete
This most likely means that the IB is having errors in the relvals. The error does NOT come from this pull request

Comparison Summary:

  • You potentially added 3 lines to the logs
  • Reco comparison results: 931 differences found in the comparisons
  • DQMHistoTests: Total files compared: 21
  • DQMHistoTests: Total histograms compared: 1669851
  • DQMHistoTests: Total failures: 722
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 1668971
  • DQMHistoTests: Total skipped: 158
  • DQMHistoTests: Total Missing objects: 0
  • Checked 85 log files, 14 edm output root files, 21 DQM output files

@slava77

slava77 commented Jun 24, 2017

+1

for #19421 404d1d9

  • jenkins tests pass and comparisons with baseline show small changes that start in cscSegments and propagate downstream
  • local tests with multimuon, high pt muon without PU and also ttbar and ZMM with PU35 show essentially the same if not slightly better performance related to the updates in CSC segment reco

@perrotta I couldn't convince myself that the change mentioned in #19421 (comment) is required (it definitely would be if there was no int() on the left hand side already).
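
The linked review is not quoted in this thread, so the following is only a guess at the concern: it illustrates why the `int()` cast in `int(rechits.size()-used_rh)>2` matters. The helper names are hypothetical, and both counts are taken as `std::size_t` for simplicity. The cast relies on the out-of-range unsigned-to-int conversion wrapping around, which is guaranteed since C++20 and is the usual behavior on common platforms before that.

```cpp
#include <cstddef>

// Without a cast: the subtraction is done in unsigned arithmetic, so a
// "negative" difference wraps to a huge positive value and passes the check.
inline bool enoughHitsUnsigned(std::size_t nHits, std::size_t nUsed) {
  return nHits - nUsed > 2;
}

// With the int() cast, as in the quoted line: the wrapped value converts
// back to a negative int, so the comparison behaves as intended.
inline bool enoughHitsSigned(std::size_t nHits, std::size_t nUsed) {
  return int(nHits - nUsed) > 2;
}
```

For example, with 3 hits and 5 marked as used, the uncast version reports that enough hits remain, while the cast version does not.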

@cmsbuild

This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request requires discussion in the ORP meeting before it's merged. @davidlange6, @smuzaffar

@perrotta

perrotta commented Jun 25, 2017 via email

@ptcox

ptcox commented Jun 25, 2017 via email

@davidlange6

+1

@cmsbuild cmsbuild merged commit 59a599d into cms-sw:master Jun 25, 2017