
Turning on ClusterRepair for edge hits only #26263

Merged
merged 10 commits into cms-sw:master on Apr 11, 2019

Conversation

@cmantill (Contributor) commented Mar 26, 2019

PR description:

This PR turns ON the ClusterRepair CPE for edge hits only.
It adds an option to the configuration file so that it is possible to choose which pixel sub-detectors/layers to run on. By default this option is "PXB 2","PXB 3","PXB 4", which refers to layers 2, 3, and 4 of the pixel barrel.
The PR further adds a flag to turn ClusterRepair on/off for truncated clusters; this flag is false by default.
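For illustration, a minimal sketch of the selection logic these options imply. This is not the actual CMSSW code: ClusterRepairConfig, recommend2D, runDamagedClusters, and eligibleForRepair are illustrative names, and only the defaults ("PXB 2","PXB 3","PXB 4", flag false) come from the description above.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Illustrative defaults mirroring the PR description.
struct ClusterRepairConfig {
  std::vector<std::string> recommend2D{"PXB 2", "PXB 3", "PXB 4"};  // layers to run on
  bool runDamagedClusters = false;  // truncated-cluster repair, off by default
};

// True if ClusterRepair should be attempted for a hit on this subdetector/layer.
bool eligibleForRepair(const ClusterRepairConfig& cfg,
                       const std::string& subdetLayer,
                       bool isEdgeHit) {
  if (!isEdgeHit)
    return false;  // this PR restricts ClusterRepair to edge hits
  return std::find(cfg.recommend2D.begin(), cfg.recommend2D.end(), subdetLayer) !=
         cfg.recommend2D.end();
}
```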

The PR follows from the discussion at the DPG meeting: https://indico.cern.ch/event/787658/?showDate=all&showSession=6

The following are updated tags for autoCond:

Update pixel local reco conditions in the offline GTs

Data

New offline GT is
https://cms-conddb.cern.ch/cmsDbBrowser/list/Prod/gts/106X_dataRun2_v4
Diff
https://cms-conddb.cern.ch/cmsDbBrowser/diff/Prod/gts/106X_dataRun2_v4/106X_dataRun2_v1
New offline relval GT is
https://cms-conddb.cern.ch/cmsDbBrowser/list/Prod/gts/106X_dataRun2_relval_v2
Diff
https://cms-conddb.cern.ch/cmsDbBrowser/diff/Prod/gts/106X_dataRun2_relval_v2/106X_dataRun2_relval_1

Only the 2017 conditions are updated for GenErr, the Lorentz angle, and the 1D template.
The 2D template changes for 2016 and 2017, but it was not consumed before this PR.

PR validation:

It has been tested locally on data using cmsDriver step3 (on JetHT, and on a run for which a 2D template exists), and it was verified via printouts that ClusterRepair is called only for the layers defined in the configuration file.
It has also been tested with the standard runTheMatrix tests, using the 105X_dataRun2_Candidate_2019_04_03_13_37_01 GT and checking step3 independently for runs corresponding to each 2017 IOV.

@cmsbuild (Contributor)

The code-checks are being triggered in jenkins.

@cmsbuild (Contributor)

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-26263/8938

  • This PR adds an extra 24KB to repository

@cmsbuild (Contributor)

A new Pull Request was created by @cmantill (Cristina Ana Mantilla Suarez) for master.

It involves the following packages:

RecoLocalTracker/SiPixelRecHits
RecoTracker/TransientTrackingRecHit

@cmsbuild, @perrotta, @slava77 can you please review it and eventually sign? Thanks.
@makortel, @felicepantaleo, @GiacomoSguazzoni, @rovere, @VinInn, @mschrode, @dkotlins, @gpetruc, @ebrondol, @threus, @dgulhan this is something you requested to watch as well.
@davidlange6, @slava77, @fabiocos you are the release manager for this.

cms-bot commands are listed here

@cmantill (Contributor, Author)

Standard runTheMatrix tests have been performed, but they are failing due to a global tag error [1].
The current release I'm using is CMSSW_10_6_0_pre2 on sl7.
We expect jobs to fail because there are no updated conditions for the 2D templates, but this error does not seem related. If anyone has any suggestions, please let me know.

[1]
----- Begin Fatal Exception 26-Mar-2019 18:37:15 CDT-----------------------
An exception of category 'StdException' occurred while
   [0] Constructing the EventProcessor
   [1] Constructing ESSource: class=PoolDBESSource label='GlobalTag'
Exception Message:
A std::exception was thrown.
Connection on "frontier://(preferipfamily=0)(proxyconfigurl=http://grid-wpad/wpad.dat)(backupproxyurl=http://cmst0frontier.cern.ch:3128)(backupproxyurl=http://cmst0frontier1.cern.ch:3128)(backupproxyurl=http://cmst0frontier2.cern.ch:3128)(backupproxyurl=http://cmsbpfrontier.cern.ch:3128)(backupproxyurl=http://cmsbpfrontier1.cern.ch:3128)(backupproxyurl=http://cmsbpfrontier2.cern.ch:3128)(backupproxyurl=http://cmsbproxy.fnal.gov:3128)(serverurl=http://cmsfrontier.cern.ch:8000/FrontierProd)(serverurl=http://cmsfrontier1.cern.ch:8000/FrontierProd)(serverurl=http://cmsfrontier2.cern.ch:8000/FrontierProd)(serverurl=http://cmsfrontier3.cern.ch:8000/FrontierProd)(serverurl=http://cmsfrontier4.cern.ch:8000/FrontierProd)/CMS_CONDITIONS" cannot be established ( CORAL : "ConnectionPool::getSessionFromNewConnection" from "CORAL/Services/ConnectionService" )
----- End Fatal Exception -------------------------------------------------

@cmantill (Contributor, Author)

Quick update: I repeated the standard runTheMatrix tests on lxplus and most of them completed (it seems the earlier error was related to the connection on the Fermilab cluster).

The following workflows fail:
136.85_RunEGamma2018A+RunEGamma2018A+HLTDR2_2018+RECODR2_2018reHLT_skimEGamma_Prompt_L1TEgDQM+HARVEST2018_L1TEgDQM
140.56_RunHI2018+RunHI2018+RECOHID18+HARVESTDHI18
136.788_RunSinglePh2017B+RunSinglePh2017B+HLTDR2_2017+RECODR2_2017reHLT_skimSinglePh_Prompt+HARVEST2017

[2]
Exception Message:
SiPixelTemplate2D::interpolate can't find needed template ID = 6534, Are you using the correct global tag?
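For context, a minimal sketch of the failure mode behind this message (illustrative only, not the real SiPixelTemplate2D interface): the reconstruction asks the 2D-template payload carried by the global tag for a template ID it simply does not contain.

```cpp
#include <map>
#include <stdexcept>
#include <string>

struct Template2D {};  // stand-in for a stored 2D template entry

// Look up a template by ID; throw with the message seen above when it is
// missing, which is what happens when the GT carries the wrong payload.
const Template2D& getTemplate2D(const std::map<int, Template2D>& store, int id) {
  auto it = store.find(id);
  if (it == store.end())
    throw std::runtime_error("SiPixelTemplate2D::interpolate can't find needed template ID = " +
                             std::to_string(id) + ", Are you using the correct global tag?");
  return it->second;
}
```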

@tvami (Contributor) commented Mar 28, 2019

Hi @tocheng, the PR fails because it wants to use the prompt GTs. PR #26281 seems to fix this. Can we do some kind of workaround to pass these tests?

@cmantill (Contributor, Author)

@tvami I tried your GT, 105X_dataRun2_v8, and it works for all 3 workflows that were failing before.

But I tried PR #26281 and it still does not work, e.g. for workflow 136.788:

No HLT prescale table found
Using default empty table with all prescales 1
%MSG
----- Begin Fatal Exception 29-Mar-2019 03:20:27 CET-----------------------
An exception of category 'DataCorrupt' occurred while
   [0] Processing  Event run: 297557 lumi: 119 event: 189755411 stream: 0
   [1] Running path 'Flag_METFilters'
   [2] Prefetching for module BooleanFlagFilter/'HBHENoiseFilter'
   [3] Prefetching for module HBHENoiseFilterResultProducer/'HBHENoiseFilterResultProducer'
   [4] Prefetching for module HcalNoiseInfoProducer/'hcalnoise'
   [5] Prefetching for module FastjetJetProducer/'ak4PFJets'
   [6] Prefetching for module PFLinker/'particleFlow'
   [7] Prefetching for module PFProducer/'particleFlowTmp'
   [8] Prefetching for module PFBlockProducer/'particleFlowBlock'
   [9] Prefetching for module PFElecTkProducer/'pfTrackElec'
   [10] Prefetching for module GsfTrackProducer/'electronGsfTracks'
   [11] Prefetching for module MeasurementTrackerEventProducer/'MeasurementTrackerEvent'
   [12] Prefetching for module JetCoreClusterSplitter/'siPixelClusters'
   [13] Prefetching for module PrimaryVertexProducer/'firstStepPrimaryVerticesPreSplitting'
   [14] Calling method for module TrackProducer/'initialStepTracksPreSplitting'
Exception Message:
SiPixelTemplate2D::interpolate can't find needed template ID = 715, Are you using the correct global tag?
----- End Fatal Exception -------------------------------------------------
%MSG-w BeamFitter:  AlcaBeamMonitor:AlcaBeamMonitor@endLumi  29-Mar-2019 03:20:27 CET Run: 297557 Lumi: 119
No event read! No Fitting!
%MSG

@tocheng (Contributor) commented Mar 29, 2019

@tvami @cmantill
I suspect the 2D template works for 2017.
If you look at the tag,
https://cms-conddb.cern.ch/cmsDbBrowser/list/Prod/tags/SiPixel2DTemplateDBObject_38T_v1_offline
the first IOV covers from run 1 to run 312209, and the 2016 and 2017 runs all fall into this IOV.

@mmusich (Contributor) commented Mar 29, 2019

> @tvami @cmantill
> I suspect the 2D template works for 2017.
> If you look at the tag,
> https://cms-conddb.cern.ch/cmsDbBrowser/list/Prod/tags/SiPixel2DTemplateDBObject_38T_v1_offline
> the first IOV covers from run 1 to run 312209, and the 2016 and 2017 runs all fall into this IOV.

@tocheng I don't believe that's the problem for the issue reported in #26263 (comment), because all the other workflows run fine except these 3:

136.85_RunEGamma2018A+RunEGamma2018A+HLTDR2_2018+RECODR2_2018reHLT_skimEGamma_Prompt_L1TEgDQM+HARVEST2018_L1TEgDQM
140.56_RunHI2018+RunHI2018+RECOHID18+HARVESTDHI18
136.788_RunSinglePh2017B+RunSinglePh2017B+HLTDR2_2017+RECODR2_2017reHLT_skimSinglePh_Prompt+HARVEST2017

By the way, the phase-0 detector reconstruction is intentionally not touched by this PR, see:
https://github.com/cms-sw/cmssw/pull/26263/files#diff-f952152458b1577d1a7da17a1db86daaR16

As far as I can see, all three failing workflows use the same 2D template tag, SiPixel2DTemplateDBObject_38T_v1_prompt:

136.788 => GT: auto:run2_data_promptlike (105X_dataRun2_PromptLike_v8 in master branch)
136.85 => GT: auto:run2_data_promptlike (105X_dataRun2_PromptLike_v8 in master branch)
140.56 => GT: auto:run2_data_promptlike_hi (105X_dataRun2_PromptLike_HI_v3 in master branch)

and all of these GTs use the same tag [0].
As the failing workflows fail for runs 297557, 315489, and 326479, it looks like all 3 payloads in that tag are problematic.

Since the tag contains

Run          Insertion Time       Payload                                   Object Type
-----------  -------------------  ----------------------------------------  ------------------------- 
1            2018-06-18 22:16:47  a114682c4d0a1df6cafb53f9ce37af591cfd7bfb  SiPixel2DTemplateDBObject 
312209       2018-06-18 22:20:22  50758a12856b842ffe768ac29348d73bda9480b3  SiPixel2DTemplateDBObject 
324276       2018-10-10 19:02:50  916a68aa85d2bbbb4c61f0db8a1bdb6a30008829  SiPixel2DTemplateDBObject

an easy workaround would be if you could supply a new Global Tag for the prompt-like workflows that contains the tag SiPixel2DTemplateDBObject_38T_v1_offline instead of SiPixel2DTemplateDBObject_38T_v1_prompt.
Please let us know.

[0]

$ conddb list 105X_dataRun2_PromptLike_v8 | grep SiPixel2D
SiPixel2DTemplateDBObjectRcd                            numerator                                         SiPixel2DTemplateDBObject_38T_v1_prompt                                        
$ conddb list 105X_dataRun2_PromptLike_HI_v3 | grep SiPixel2D
SiPixel2DTemplateDBObjectRcd                            numerator                                         SiPixel2DTemplateDBObject_38T_v1_prompt        
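To make the IOV arithmetic above concrete, here is a minimal standalone sketch (not CMSSW code) of how a run number maps to a payload in the tag; the boundaries are the "since" runs from the conddb listing above.

```cpp
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <vector>

// Each payload is valid from its "since" run up to the next boundary.
int payloadIndexForRun(const std::vector<uint32_t>& since, uint32_t run) {
  // First boundary strictly greater than run; the active payload is the one before it.
  auto it = std::upper_bound(since.begin(), since.end(), run);
  return static_cast<int>(it - since.begin()) - 1;
}

int main() {
  const std::vector<uint32_t> since{1, 312209, 324276};
  // The three failing runs each fall into a different IOV, which is why all
  // three payloads in SiPixel2DTemplateDBObject_38T_v1_prompt look problematic.
  for (uint32_t run : {297557u, 315489u, 326479u})
    std::cout << "run " << run << " -> payload #" << payloadIndexForRun(since, run) << '\n';
}
```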

@tvami (Contributor) commented Mar 29, 2019

Hi,
to make things a bit faster, I've queued SiPixel2DTemplateDBObject_38T_v1_offline to
105X_dataRun2_PromptLike_Queue
Tamas

@tvami (Contributor) commented Mar 29, 2019

Hi @tocheng, I've created SiPixel2DTemplateDBObject_38T_v2_prompt as a copy of SiPixel2DTemplateDBObject_38T_v1_offline but with prompt synchronization and queued to the same queue as above.

@cmsbuild (Contributor) commented Apr 4, 2019

The code-checks are being triggered in jenkins.

@cmsbuild (Contributor) commented Apr 4, 2019

-code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-26263/9101

  • This PR adds an extra 24KB to repository

Code check has found code style and quality issues which could be resolved by applying a patch in https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-26263/9101/git-diff.patch
e.g. curl https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-26263/9101/git-diff.patch | patch -p1

You can run scram build code-checks to apply code checks directly

@cmantill (Contributor, Author) commented Apr 4, 2019

@mmusich @tocheng I am confused by this patch:
https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-26263/9101/git-diff.patch
The last lines seem to update the GTs according to this PR: f7c4a8a#diff-8cbfe6480e6018966eac475523946acd
which is a reasonable thing to fix.

However, the first lines are reverting the commit we just did to update pixel local reco conditions for 2017.

Please let me know how to proceed.

@tocheng (Contributor) commented Apr 4, 2019

> @mmusich @tocheng I am confused by this patch:
> https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-26263/9101/git-diff.patch
> The last lines seem to update the GTs according to this PR: f7c4a8a#diff-8cbfe6480e6018966eac475523946acd
> which is a reasonable thing to fix.
>
> However, the first lines are reverting the commit we just did to update pixel local reco conditions for 2017.
>
> Please let me know how to proceed.

@cmantill Can you rebase the PR to the most recent IB?

@cmsbuild (Contributor)

Pull request #26263 was updated. @perrotta, @christopheralanwest, @tocheng, @cmsbuild, @franzoni, @slava77, @pohsun can you please check and sign again.

@cmantill (Contributor, Author)

@perrotta We found a bug in the current implementation of the 2D fit call:

For cases where we are in the correct layers (BPIX 2, 3, 4) but the hit is not an edge hit, theClusterParam.recommended2D_ is set to a garbage value (!= 0) that later passes the 2D call. Thus, instead of calling the 2D reco for 0.5% of the hits, we were calling it for 10% of them. We have fixed this in the latest commit by setting theClusterParam.recommended2D_ to false by default.

I should clarify that this bug should not have affected the hit residuals by a significant amount. For the hits affected, the 2D reco would have been running with the option edgeY=0, i.e. with ClusterRepair not trying to repair anything, which is pretty much the same as the 1D reco. It was only making things much slower, as shown by the results below.
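A minimal sketch of the bug pattern and the fix (illustrative: only recommended2D_ comes from the comment above; the struct and helper names are assumptions, not the actual CMSSW code):

```cpp
// Before the fix the flag was left uninitialized, so for non-edge hits on the
// recommended layers it held a garbage value that a later non-zero check
// could read as "true", sending ~10% of hits through the 2D path.
struct ClusterParamSketch {
  bool recommended2D_ = false;  // the fix: default to false
};

// Only explicitly recommend the 2D fit for edge hits on configured layers;
// every other cluster keeps the safe default.
void flagCluster(ClusterParamSketch& p, bool onRecommendedLayer, bool isEdgeHit) {
  if (onRecommendedLayer && isEdgeHit)
    p.recommended2D_ = true;
}
```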

With the fix introduced in the latest commit we have updated the timing and CPU time tests as follows:

   +0.062444      +0.00%        10.43 ms/ev ->        11.10 ms/ev jetCoreRegionalStepTracks
   +0.057465      +0.04%       113.48 ms/ev ->       120.19 ms/ev initialStepTracksPreSplitting
   +0.065809      +0.05%       121.00 ms/ev ->       129.24 ms/ev detachedTripletStepTracks
   +0.070813      +0.06%       134.83 ms/ev ->       144.73 ms/ev highPtTripletStepTracks
   +0.071519      +0.06%       130.54 ms/ev ->       140.22 ms/ev initialStepTracks
   +0.069876      +0.06%       138.94 ms/ev ->       149.00 ms/ev lowPtTripletStepTracks
   +0.087252      +0.07%       128.88 ms/ev ->       140.64 ms/ev lowPtQuadStepTracks
Job total:  15.7222 s/ev ==> 17.0266 s/ev

Although the time increases for tracking, the increase is smaller than the one shown before the bugfix. We think the overall increase in time is due to random fluctuations in other modules (also, this comparison is against a test with ClusterRepair OFF from two days ago).

[1] https://cmantill.web.cern.ch/cmantill/tmp/newtiming_afterbugfix.log

@perrotta (Contributor)

please test

@cmsbuild (Contributor) commented Apr 10, 2019

The tests are being triggered in jenkins.
https://cmssdt.cern.ch/jenkins/job/ib-any-integration/34137/console Started: 2019/04/10 23:17

@cmsbuild (Contributor)

Comparison job queued.

@cmsbuild (Contributor)

Comparison is ready
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-26263/34137/summary.html

Comparison Summary:

  • No significant changes to the logs found
  • Reco comparison results: 8476 differences found in the comparisons
  • DQMHistoTests: Total files compared: 32
  • DQMHistoTests: Total histograms compared: 3140495
  • DQMHistoTests: Total failures: 12869
  • DQMHistoTests: Total nulls: 6
  • DQMHistoTests: Total successes: 3127423
  • DQMHistoTests: Total skipped: 197
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: -0.028 KiB (31 files compared)
  • DQMHistoSizes: changed ( 136.788 ): -0.016 KiB JetMET/SUSYDQM
  • DQMHistoSizes: changed ( 136.85 ): -0.012 KiB JetMET/SUSYDQM
  • Checked 133 log files, 14 edm output root files, 32 DQM output files

@perrotta (Contributor)

After the last fix, the overall timing is quite comparable between using ClusterRepair on a limited number of hits, as in this PR, and having it turned off, as in the baseline IB. I think that the differences you report in #26263 (comment) are mostly due to an overall normalization factor, as I see no relevant increase while running on the usual 100 events of the TTbar PU sample 11024.

The number of hits affected being now rather limited, as you pointed out, the effects on general tracks also show up only as small deviations from the baseline; the same holds for the derived higher-order objects.
[comparison plots attached in the original comment]

A detailed validation cannot be done with just these small test samples, but one is certainly expected based on pre4, provided this PR can get merged for it.

@perrotta (Contributor)

+1

  • ClusterRepair has been turned on for the restricted class of hits given in the PR description.
  • With the restriction to that set of hits, the impact on code performance is negligible, and the propagation to the reco objects also has a limited overall effect.
  • Jenkins tests pass and show small deviations in tracking and derived quantities, as expected.

@pohsun commented Apr 11, 2019

+1

@cmsbuild (Contributor)

This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @davidlange6, @slava77, @smuzaffar, @fabiocos (and backports should be raised in the release meeting by the corresponding L2)

@fabiocos (Contributor)

+1

@cmsbuild cmsbuild merged commit c2e23af into cms-sw:master Apr 11, 2019