Rewrote checkForModuleDependencyCorrectness #34735

Merged
merged 10 commits into cms-sw:master on Aug 10, 2021

Conversation

Dr15Jones
Contributor

@Dr15Jones Dr15Jones commented Aug 2, 2021

PR description:

  • The IBs showed that the old algorithm, which used the Boost Graph Library, could hit pathological cases and take >10 minutes to run.
  • The new algorithm simulates how the framework would run the modules and checks to see if a deadlock would occur.
  • New unit tests were added to test the higher level algorithm interface and to check exception messages.
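
The idea in the second bullet can be illustrated with a toy model. The sketch below is not the code from this PR; it is a minimal Python simulation under stated assumptions: a module may run once every module it consumes from has run and, if it is on any path, once all modules placed before it on at least one of those paths have run. Modules on no path are treated as run-on-demand. The name `find_deadlocked_modules` and the `paths`/`consumes` inputs are hypothetical.

```python
def find_deadlocked_modules(paths, consumes):
    """Return the modules that can never run in this simplified model.

    paths    -- dict: path name -> ordered list of module labels (assumed input)
    consumes -- dict: module label -> set of module labels it reads data from
    """
    modules = set(consumes)
    for mods in paths.values():
        modules.update(mods)
    for deps in consumes.values():
        modules.update(deps)

    done = set()
    progress = True
    while progress:  # keep sweeping until no further module can be run
        progress = False
        for m in sorted(modules - done):
            if not consumes.get(m, set()) <= done:
                continue  # still waiting for data from another module
            on_paths = [mods for mods in paths.values() if m in mods]
            # Assumption: a module on no path runs on demand; otherwise some
            # path must have finished every module placed before it.
            reachable = not on_paths or any(
                all(p in done for p in mods[: mods.index(m)]) for mods in on_paths
            )
            if reachable:
                done.add(m)
                progress = True
    return sorted(modules - done)  # non-empty means the schedule would stall


# 'a' reads from 'b', but 'b' sits after 'a' on the only path: neither can run.
stalled = find_deadlocked_modules({"p1": ["a", "b"]}, {"a": {"b"}})
# Swapping the order makes the schedule runnable.
runnable = find_deadlocked_modules({"p1": ["b", "a"]}, {"a": {"b"}})
```

A non-empty result plays the role of the "Unrunnable schedule" report; the real framework has more states (filters, EndPaths, tasks) than this model covers.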

PR validation:

Code compiles. Framework unit tests (including new ones) pass.

fixes #34633
fixes #31199
fixes cms-sw/framework-team#210

The function is no longer needed as the dependency checks are
now done using a different algorithm.
@cmsbuild
Contributor

cmsbuild commented Aug 2, 2021

-code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-34735/24382

Code check has found code style and quality issues which could be resolved by applying following patch(s)

@cmsbuild
Contributor

cmsbuild commented Aug 2, 2021

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-34735/24383

@cmsbuild
Contributor

cmsbuild commented Aug 2, 2021

A new Pull Request was created by @Dr15Jones (Chris Jones) for master.

It involves the following packages:

  • FWCore/Framework (core)

@makortel, @smuzaffar, @cmsbuild, @Dr15Jones can you please review it and eventually sign? Thanks.
@makortel, @wddgit this is something you requested to watch as well.
@silviodonato, @dpiparo, @qliphy, @perrotta you are the release manager for this.

cms-bot commands are listed here

@Dr15Jones
Contributor Author

I tested this change on step2 of workflow 11725.0 under CMSSW_12_0_X_2021-07-28-2300. On the machine I used, the original job took more than 10 minutes to do the job initialization; with this code, it took less than 2 minutes. However, the new code gave an error stating that a dependent module appeared later on a path. The cause was that the same modules appear multiple times on the same path, which confused the part of the algorithm that is meant to enforce policy, not the part that tests for runnability.

I'll modify the algorithm to ignore duplicate modules on the same path.
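
One simple way to ignore repeated modules on a path is to keep only the first occurrence of each label before applying the policy check. This is a hedged sketch of that idea, not the change actually made in the PR; `unique_in_order` is a hypothetical helper.

```python
def unique_in_order(path_modules):
    """Drop repeated module labels, keeping the first occurrence of each."""
    # dict preserves insertion order (Python 3.7+), so path order is kept.
    return list(dict.fromkeys(path_modules))


deduplicated = unique_in_order(["a", "b", "a", "c", "b"])
```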

@cmsbuild
Contributor

cmsbuild commented Aug 2, 2021

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-34735/24384

@cmsbuild
Contributor

cmsbuild commented Aug 2, 2021

Pull request #34735 was updated. @makortel, @smuzaffar, @cmsbuild, @Dr15Jones can you please check and sign again.

@Dr15Jones
Contributor Author

please test

@Dr15Jones
Contributor Author

Dr15Jones commented Aug 2, 2021

@Martin-Grunewald @fwyzard The new module dependency checking algorithm explicitly adds some enforcements of policy which the old algorithm appeared to be doing.

One such policy is: if module 'a' consumes data from module 'b', and module 'b' appears on at least 1 path with module 'a', then 'b' must appear on ALL paths that contain module 'a'. We believe that policy was requested by the HLT.

NOTE: this is strictly policy enforcement. In the cases mentioned above, the framework could actually schedule the modules properly even if the paths were not completely consistent.
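
Read against the exception text quoted in this thread, the policy check amounts to: for each consumer, a producer that shares at least one path with it must be present on every path that contains the consumer. The following is a minimal sketch of such a check, not the framework's implementation; `find_inconsistent_paths`, the toy path names, and the input layout are all assumptions made for illustration.

```python
def find_inconsistent_paths(paths, consumes):
    """List (consumer, producer, paths_with_producer, paths_missing_producer).

    paths    -- dict: path name -> list of module labels (assumed input)
    consumes -- dict: module label -> set of module labels it reads data from
    """
    problems = []
    for consumer in sorted(consumes):
        consumer_paths = [name for name, mods in paths.items() if consumer in mods]
        for producer in sorted(consumes[consumer]):
            present = [n for n in consumer_paths if producer in paths[n]]
            missing = [n for n in consumer_paths if producer not in paths[n]]
            # Only a mix of "present on some, missing on others" violates
            # the policy; a producer on none of the paths is allowed here.
            if present and missing:
                problems.append((consumer, producer, present, missing))
    return problems


# Toy configuration: 'a' reads from 'b', which is on path1 but not on path2.
paths = {
    "path1": ["b", "a"],
    "path2": ["a"],
}
problems = find_inconsistent_paths(paths, {"a": {"b"}})
```

Each tuple carries the same pieces of information as the "depends on ... which appears on paths ... but is missing from ..." message.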

The reason I mention this is I was just testing step2 of workflow 11725.0 and it is failing with

----- Begin Fatal Exception 02-Aug-2021 15:16:31 CDT-----------------------
An exception of category 'ScheduleExecutionFailure' occurred while
   [0] Calling beginJob
Exception Message:
Unrunnable schedule
Paths are non consistent
  module 'hltPixelTrackerHVOn' depends on 'hltScalersRawToDigi' which appears on paths
  HLT_Ele24_eta2p1_WPTight_Gsf_LooseChargedIsoPFTauHPS30_eta2p1_CrossL1_v1 HLT_Ele24_eta2p1_WPTight_Gsf_MediumChargedIsoPFTauHPS30_eta2p1_CrossL1_v1 HLT_Ele24_eta2p1_WPTight_Gsf_TightChargedIsoPFTauHPS30_eta2p1_CrossL1_v1 HLT_Ele24_eta2p1_WPTight_Gsf_LooseChargedIsoPFTauHPS30_eta2p1_TightID_CrossL1_v1 HLT_Ele24_eta2p1_WPTight_Gsf_MediumChargedIsoPFTauHPS30_eta2p1_TightID_CrossL1_v1 HLT_Ele24_eta2p1_WPTight_Gsf_TightChargedIsoPFTauHPS30_eta2p1_TightID_CrossL1_v1 HLT_IsoMu20_eta2p1_LooseChargedIsoPFTauHPS27_eta2p1_CrossL1_v4 HLT_IsoMu20_eta2p1_MediumChargedIsoPFTauHPS27_eta2p1_CrossL1_v1 HLT_IsoMu20_eta2p1_TightChargedIsoPFTauHPS27_eta2p1_CrossL1_v1 HLT_IsoMu20_eta2p1_LooseChargedIsoPFTauHPS27_eta2p1_TightID_CrossL1_v1 HLT_IsoMu20_eta2p1_MediumChargedIsoPFTauHPS27_eta2p1_TightID_CrossL1_v1 HLT_IsoMu20_eta2p1_TightChargedIsoPFTauHPS27_eta2p1_TightID_CrossL1_v1 HLT_IsoMu24_eta2p1_TightChargedIsoPFTauHPS35_Trk1_eta2p1_Reg_CrossL1_v1 HLT_IsoMu24_eta2p1_MediumChargedIsoPFTauHPS35_Trk1_TightID_eta2p1_Reg_CrossL1_v1 HLT_IsoMu24_eta2p1_TightChargedIsoPFTauHPS35_Trk1_TightID_eta2p1_Reg_CrossL1_v1 HLT_IsoMu24_eta2p1_MediumChargedIsoPFTauHPS35_Trk1_eta2p1_Reg_CrossL1_v4 HLT_IsoMu24_eta2p1_MediumChargedIsoPFTauHPS30_Trk1_eta2p1_Reg_CrossL1_v1 HLT_IsoMu27_LooseChargedIsoPFTauHPS20_Trk1_eta2p1_SingleL1_v1 HLT_IsoMu27_MediumChargedIsoPFTauHPS20_Trk1_eta2p1_SingleL1_v1 HLT_IsoMu27_TightChargedIsoPFTauHPS20_Trk1_eta2p1_SingleL1_v1 HLT_HT425_v9 HLT_HT430_DisplacedDijet40_DisplacedTrack_v13 HLT_HT500_DisplacedDijet40_DisplacedTrack_v13 HLT_HT430_DisplacedDijet60_DisplacedTrack_v13 HLT_HT400_DisplacedDijet40_DisplacedTrack_v13 HLT_HT650_DisplacedDijet60_Inclusive_v13 HLT_HT550_DisplacedDijet60_Inclusive_v13 AlCa_LumiPixelsCounts_ZeroBias_v1 HLT_DoubleMediumChargedIsoPFTauHPS30_L1MaxMass_Trk1_eta2p1_Reg_v1 HLT_DoubleTightChargedIsoPFTauHPS35_Trk1_eta2p1_Reg_v1 HLT_DoubleMediumChargedIsoPFTauHPS35_Trk1_TightID_eta2p1_Reg_v1 HLT_DoubleMediumChargedIsoPFTauHPS35_Trk1_eta2p1_Reg_v4 
HLT_DoubleTightChargedIsoPFTauHPS35_Trk1_TightID_eta2p1_Reg_v1 HLT_DoubleMediumChargedIsoPFTauHPS40_Trk1_eta2p1_Reg_v1 HLT_DoubleTightChargedIsoPFTauHPS40_Trk1_eta2p1_Reg_v1 HLT_DoubleMediumChargedIsoPFTauHPS40_Trk1_TightID_eta2p1_Reg_v1 HLT_DoubleTightChargedIsoPFTauHPS40_Trk1_TightID_eta2p1_Reg_v1 HLT_VBF_DoubleLooseChargedIsoPFTauHPS20_Trk1_eta2p1_v1 HLT_VBF_DoubleMediumChargedIsoPFTauHPS20_Trk1_eta2p1_v1 HLT_VBF_DoubleTightChargedIsoPFTauHPS20_Trk1_eta2p1_v1 
but is missing from
  AlCa_LumiPixelsCounts_Random_v1 
----- End Fatal Exception -------------------------------------------------

Therefore strict enforcement of this policy will likely break the IBs as the old algorithm was not catching all the cases.

So we need to know if this policy must actually be enforced and if so, who will clean up the existing problems.

@fwyzard
Contributor

fwyzard commented Aug 5, 2021

Hi Chris,
that's a problem with the mkFit customisation, not with the HLT menu.

@mmasciov @slava77 @makortel may be able to help fix it.

In the meantime, the agreement when it was introduced was that HLT-related failures in that test should not block other PRs from being merged.

@Martin-Grunewald
Contributor

Martin-Grunewald commented Aug 5, 2021

@fwyzard
Sorry, I do not get your point re mkfit problem in this context, could you please clarify?
I also do not see how module 'hltIter0PFlowCkfTrackCandidates' depends on 'hltIter0PFlowCkfTrackCandidatesMkFitEventOfHits' - the latter is not an InputTag parameter of the former...

@fwyzard
Contributor

fwyzard commented Aug 5, 2021

Sure !

The error reported by Chris

Paths are non consistent
  module 'hltIter0PFlowCkfTrackCandidates' depends on 'hltIter0PFlowCkfTrackCandidatesMkFitEventOfHits' which appears on paths
  HLT_AK8PFJet360_TrimMass30_v18 HLT_AK8PFJet380_TrimMass30_v11 HLT_AK8PFJet400_TrimMass30_v12 ...[cut many, many, many paths]
but is missing from
  HLT_IsoTrackHB_v4 HLT_IsoTrackHE_v4

mentions the module hltIter0PFlowCkfTrackCandidatesMkFitEventOfHits, which is not part of the HLT menu itself, but is added by the mkFit customisation at RecoTracker/MkFit/python/customizeHLTIter0ToMkFit.py.

.7 is the workflow modifier used by runTheMatrix.py to switch on the use of mkFit.

@fwyzard
Contributor

fwyzard commented Aug 5, 2021

See #33802 and the discussion around #33802 (comment) .

By the way, I should correct myself: the agreement was that the mkFit customisation should not block any HLT-related work from being merged - for the framework changes, it's up to the Core Software group.

@Martin-Grunewald
Contributor

Martin-Grunewald commented Aug 5, 2021

Thanks Andrea!

Sorry to say, but my first reaction is that this sucks: That customisation is alien to HLT, why is an HLT modification done by Reco? In view of integration into HLT 'some time in the future'? At this stage it may help Reco but is no good for HLT. Who is the proponent to push this into HLT? On what time scale?

Hmm, I would prefer if that modification of HLT would be removed altogether from IB and other 'official' tests, and kept private for now until it gets proposed and approved for integration into HLT.

@fwyzard
Contributor

fwyzard commented Aug 5, 2021

> Sorry to say, but my first reaction is that this sucks: That customisation is alien to HLT, why is an HLT modification done by Reco?

Eh... I don't disagree.

> In view of integration into HLT 'some time in the future'? At this stage it may help Reco but is no good for HLT.
> Who is the proponent to push this into HLT? On what time scale?

I would say @mmasciov, @slava77, and @makortel, based on the previous presentations, discussion, and work on PRs.

However it's not clear (to me) if it will be useful for the HLT, at least on the timescale of the beginning of Run 3.

@cmsbuild
Contributor

cmsbuild commented Aug 5, 2021

-1

Failed Tests: RelVals RelVals-INPUT
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-76ce15/17549/summary.html
COMMIT: 6bf59e3
CMSSW: CMSSW_12_1_X_2021-08-04-1100/slc7_amd64_gcc900
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/34735/17549/install.sh to create a dev area with all the needed externals and cmssw changes.

RelVals

----- Begin Fatal Exception 04-Aug-2021 23:15:24 CEST-----------------------
An exception of category 'ScheduleExecutionFailure' occurred while
   [0] Calling beginJob
Exception Message:
Unrunnable schedule
Paths are non consistent
  module 'ALCARECOHcalCalPhisymDQM' depends on 'hbherecoNoise' which appears on paths
  pathALCARECOHcalCalMinBias 
but is missing from
  pathALCARECOHcalCalIterativePhiSym 
----- End Fatal Exception -------------------------------------------------

RelVals-INPUT

  • 1000.0: 1000.0_RunMinBias2011A+RunMinBias2011A+TIER0+SKIMD+HARVESTDfst2+ALCASPLIT/step2_RunMinBias2011A+RunMinBias2011A+TIER0+SKIMD+HARVESTDfst2+ALCASPLIT.log
  • 11634.7: 11634.7_TTbar_14TeV+2021_trackingMkFit+TTbar_14TeV_TuneCP5_GenSimINPUT+Digi+Reco+HARVEST/step2_TTbar_14TeV+2021_trackingMkFit+TTbar_14TeV_TuneCP5_GenSimINPUT+Digi+Reco+HARVEST.log

@Dr15Jones
Contributor Author

please test with #34793, #34784

@cmsbuild
Contributor

cmsbuild commented Aug 5, 2021

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-76ce15/17580/summary.html
COMMIT: 6bf59e3
CMSSW: CMSSW_12_1_X_2021-08-05-1100/slc7_amd64_gcc900
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/34735/17580/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 4 differences found in the comparisons
  • DQMHistoTests: Total files compared: 39
  • DQMHistoTests: Total histograms compared: 2999410
  • DQMHistoTests: Total failures: 10
  • DQMHistoTests: Total nulls: 1
  • DQMHistoTests: Total successes: 2999377
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 173.577 KiB( 38 files compared)
  • DQMHistoSizes: changed ( 1000.0 ): 24.607 KiB ALCAStreamHcalIterativePhiSym/MBdepth1
  • DQMHistoSizes: changed ( 1000.0 ): 24.607 KiB ALCAStreamHcalIterativePhiSym/MBdepth2
  • DQMHistoSizes: changed ( 1000.0 ): 24.607 KiB ALCAStreamHcalIterativePhiSym/MBdepth3
  • DQMHistoSizes: changed ( 1000.0 ): 24.607 KiB ALCAStreamHcalIterativePhiSym/MBdepth4
  • DQMHistoSizes: changed ( 1000.0 ): 24.607 KiB ALCAStreamHcalIterativePhiSym/MBdepth5
  • DQMHistoSizes: changed ( 1000.0 ): 24.607 KiB ALCAStreamHcalIterativePhiSym/MBdepth6
  • DQMHistoSizes: changed ( 1000.0 ): 24.607 KiB ALCAStreamHcalIterativePhiSym/MBdepth7
  • DQMHistoSizes: changed ( 1000.0 ): 0.440 KiB ALCAStreamHcalIterativePhiSym/DistrHBHEsize
  • DQMHistoSizes: changed ( 1000.0 ): 0.438 KiB ALCAStreamHcalIterativePhiSym/DistrHFsize
  • DQMHistoSizes: changed ( 1000.0 ): 0.438 KiB ALCAStreamHcalIterativePhiSym/DistrHOsize
  • DQMHistoSizes: changed ( 1000.0 ): ...
  • Checked 165 log files, 37 edm output root files, 39 DQM output files
  • TriggerResults: no differences found

@Dr15Jones
Contributor Author

+1
requires #34793 and #34784 in order to avoid failures in the RelVals.

@cmsbuild
Contributor

cmsbuild commented Aug 6, 2021

This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @silviodonato, @dpiparo, @qliphy, @perrotta (and backports should be raised in the release meeting by the corresponding L2)

@perrotta
Contributor

@cmsbuild cmsbuild merged commit b6555c2 into cms-sw:master Aug 10, 2021
@Dr15Jones Dr15Jones deleted the improveUnrunnableScheduledFinder branch August 17, 2021 15:20