Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[12_2_X] Add Run3 relval GTs and use fixed snapshot in online GTs #37670

Merged

Conversation

francescobrivio
Copy link
Contributor

@francescobrivio francescobrivio commented Apr 25, 2022

PR description:

Backport of #37171
As reported in #37653 (comment) (and later in the AlCaDB CMS Talk forum by @Martin-Grunewald), since datataking switched to CMSSW_12_3_X (and the boost version in use is now 1_78) the HLT tests in the 12_2_X IBs are failing because they are still using the auto:run3_hlt and auto:run3_data_prompt GTs (which have an infinite snapshot).

This PR introduces, for the 12_2_X cycle, the same "relval" autoCond entries for run3 data there were already introduced in 12_3_X and 12_4_X, namely:

Following @missirol suggestion I also updated autoCondHLT.py to use the newly created GTs.

Edit:
After some discussion in @cms-sw/alca-l2 we decided it is better NOT to have "open-ended" GTs in the release as they can cause all sort of troubles in the IBs and also possibly reproducibility issues.
So I updated all the online GTs (HLT/Express/Prompt) to a "frozen" version, i.e. with a fixed snapshot at the time of the last run processed with 12_2_X in Tier0 (as announced in this CMSTalk post): Run 350279 - Snapshot: 2022-04-12 15:15:00

Differences wrt the hlt and offline Run3 data GTs are:
Run 3 HLT RelVals
https://cms-conddb.cern.ch/cmsDbBrowser/diff/Prod/gts/122X_dataRun3_HLT_v4/122X_dataRun3_HLT_relval_v2

Run 3 data RelVals
https://cms-conddb.cern.ch/cmsDbBrowser/diff/Prod/gts/122X_dataRun3_v3/122X_dataRun3_relval_v2

Run 3 HLT
https://cms-conddb.cern.ch/cmsDbBrowser/diff/Prod/gts/122X_dataRun3_HLT_v4/122X_dataRun3_HLT_frozen_v1

Run 3 Express
https://cms-conddb.cern.ch/cmsDbBrowser/diff/Prod/gts/122X_dataRun3_Express_v3/122X_dataRun3_Express_frozen_v1

Run 3 Prompt
https://cms-conddb.cern.ch/cmsDbBrowser/diff/Prod/gts/122X_dataRun3_Prompt_v3/122X_dataRun3_Prompt_frozen_v1

PR validation:

Code compiles and I run the same HLT test that is running in the IBs (thanks @missirol for the recipe):

git cms-addpkg HLTrigger/Configuration
cd HLTrigger/Configuration/test
./runAll.csh dev

As expected this produces:

16:23:01 cmsRun RelVal_RECO_GRun_DATA.py >& RelVal_RECO_GRun_DATA.log
191.193u 35.487s 3:09.59 119.5%    0+0k 4524536+79240io 13954pf+0w
16:26:11 exit status: 0

Proving that the updates are correct.

Backport

backport of #37171

@cmsbuild
Copy link
Contributor

cmsbuild commented Apr 25, 2022

A new Pull Request was created by @francescobrivio for CMSSW_12_2_X.

It involves the following packages:

  • Configuration/AlCa (alca)
  • Configuration/HLT (hlt)

@malbouis, @yuanchao, @cmsbuild, @missirol, @Martin-Grunewald, @francescobrivio, @tvami can you please review it and eventually sign? Thanks.
@silviodonato, @tocheng, @Martin-Grunewald, @missirol, @mmusich, @fabiocos this is something you requested to watch as well.
@perrotta, @dpiparo, @qliphy you are the release manager for this.

cms-bot commands are listed here

@tvami
Copy link
Contributor

tvami commented Apr 25, 2022

@cmsbuild , please test

@francescobrivio
Copy link
Contributor Author

backport of #37171

@cmsbuild
Copy link
Contributor

-1

Failed Tests: RelVals-INPUT
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-79a9f7/24194/summary.html
COMMIT: 6ff8cb4
CMSSW: CMSSW_12_2_X_2022-04-25-1100/slc7_amd64_gcc900
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/37670/24194/install.sh to create a dev area with all the needed externals and cmssw changes.

RelVals-INPUT

  • 138.4138.4_PromptCollisions+RunMinimumBias2021+ALCARECOPROMPTR3+HARVESTDPROMPTR3/step2_PromptCollisions+RunMinimumBias2021+ALCARECOPROMPTR3+HARVESTDPROMPTR3.log

Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 4 differences found in the comparisons
  • DQMHistoTests: Total files compared: 42
  • DQMHistoTests: Total histograms compared: 3251324
  • DQMHistoTests: Total failures: 6
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3251296
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 41 files compared)
  • Checked 177 log files, 37 edm output root files, 42 DQM output files
  • TriggerResults: no differences found

@Martin-Grunewald
Copy link
Contributor

Can we progress on this PR, so that the IB HLT-validation and TSG tests work again? The test failure can not be related to this PR.

In general the use case from the TSG side for these two special GTs is to run the (more recent and developing) HLT menus on data taken with an older HLT menu. The goal is NOT to reproduce the old HLT from the time the data was taken, but so see the performance of the new HLT menu, in turn using new calibrations etc, on that old data (to be replaced by real Run3 collisions once we have them).
A secondary goal would be if we run the same HLT menu now and in a week (in the same IB/release), we'd like to get same results, but this is a secondary goal, not the primary, and thus not need to be guaranteed for these two special GTs.

Of course, if 12_2 is no longer needed, then we could let it rot but then it should be removed from IB/relval testing, ie,
removed from https://cmssdt.cern.ch/SDT/html/cmssdt-ib/ to save resources. @perrotta @qliphy

@cmsbuild
Copy link
Contributor

cmsbuild commented May 1, 2022

Pull request #37670 was updated. @malbouis, @yuanchao, @cmsbuild, @missirol, @Martin-Grunewald, @francescobrivio, @tvami can you please check and sign again.

@francescobrivio
Copy link
Contributor Author

test parameters:

  • workflow = 138.4

@francescobrivio
Copy link
Contributor Author

@cmsbuild please test

@francescobrivio
Copy link
Contributor Author

Can we progress on this PR, so that the IB HLT-validation and TSG tests work again? The test failure can not be related to this PR

Hi Martin, sorry for the delay I was offline for a couple of days, it should be fixed now.
After discussing with @cms-sw/alca-l2 and @mmusich we agreed it's better to have a fixed snapshot for the online GTs (including the two "relval" ones). I'll update the PR description and title to give more extensive details.

We would like to test in 12_2_X IBs if everything is ok like this, if so we will update also 12_3_X and 12_4_X with the same policy. @perrotta @qliphy I know the forward porting is not the standard procedure, but in this case using the 12_2_X IBs is perfect to reproduce the boost version change between releases, I hope that is fine with you 😄

@francescobrivio francescobrivio changed the title [12_2_X] Add Run3 relval GTs [12_2_X] Add Run3 relval GTs and use fixed snapshot in online GTs May 1, 2022
@cmsbuild
Copy link
Contributor

cmsbuild commented May 1, 2022

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-79a9f7/24365/summary.html
COMMIT: 5306e7f
CMSSW: CMSSW_12_2_X_2022-05-01-0000/slc7_amd64_gcc900
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/37670/24365/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

There are some workflows for which there are errors in the baseline:
138.4 step 2
The results for the comparisons for these workflows could be incomplete
This means most likely that the IB is having errors in the relvals.The error does NOT come from this pull request

Summary:

  • You potentially added 940 lines to the logs
  • Reco comparison results: 0 differences found in the comparisons
  • DQMHistoTests: Total files compared: 42
  • DQMHistoTests: Total histograms compared: 3251324
  • DQMHistoTests: Total failures: 0
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3251302
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 41 files compared)
  • Checked 179 log files, 37 edm output root files, 42 DQM output files
  • TriggerResults: no differences found

@Martin-Grunewald
Copy link
Contributor

Can we progress on this PR, so that the IB HLT-validation and TSG tests work again? The test failure can not be related to this PR

Hi Martin, sorry for the delay I was offline for a couple of days, it should be fixed now. After discussing with @cms-sw/alca-l2 and @mmusich we agreed it's better to have a fixed snapshot for the online GTs (including the two "relval" ones). I'll update the PR description and title to give more extensive details.

We would like to test in 12_2_X IBs if everything is ok like this, if so we will update also 12_3_X and 12_4_X with the same policy. @perrotta @qliphy I know the forward porting is not the standard procedure, but in this case using the 12_2_X IBs is perfect to reproduce the boost version change between releases, I hope that is fine with you smile

Sorry, but does this fixed snapshot mean the GTs do not (automatically) update for new (later) calibration records?
Would we then need to manually request from you an update of those GTs?

@francescobrivio
Copy link
Contributor Author

Sorry, but does this fixed snapshot mean the GTs do not (automatically) update for new (later) calibration records?
Would we then need to manually request from you an update of those GTs?

Hi Martin, yes that is correct.
We will update manually all the "frozen" GTs every time that the "online" ones are updated. This is somehow restoring the same policy that we had back at the beginning of Run2, as pointed out by Marco few messages above. In order to keep the history of what is used in data-taking I added in the commented lines all the info needed to go back from the "frozen" GTs to their exact corresponding "online" GTs.

Please note that for the relval GTs the snapshot has always been frozen, see e.g. the current 123X relval GTs:

@Martin-Grunewald
Copy link
Contributor

OK, thanks for the clarification and the workflow to update.

@malbouis
Copy link
Contributor

malbouis commented May 2, 2022

+alca

  • introducing new policy for "online" GTs in autoCond.py

@cmsbuild
Copy link
Contributor

cmsbuild commented May 2, 2022

This pull request is fully signed and it will be integrated in one of the next CMSSW_12_2_X IBs (tests are also fine) and once validation in the development release cycle CMSSW_12_4_X is complete. This pull request will now be reviewed by the release team before it's merged. @perrotta, @dpiparo, @qliphy (and backports should be raised in the release meeting by the corresponding L2)

@mmusich
Copy link
Contributor

mmusich commented May 2, 2022

introducing new policy for "online" GTs in autoCond.py

Except is not new at all :D

@francescobrivio
Copy link
Contributor Author

@perrotta @qliphy it would be nice to merge this PR and check the effect on the 12_2_X IBs, which should now be fixed :D

@perrotta
Copy link
Contributor

perrotta commented May 3, 2022

Summary:

  • You potentially added 940 lines to the logs
  • Reco comparison results: 0 differences found in the comparisons

Is anybody able to trace where do those extra lines in the logs come from?
I checked a few Run3 wf logs (at random) and I didn't find any. They could potentially point out some issue, though.

@tvami
Copy link
Contributor

tvami commented May 3, 2022

Isnt that reffering to this
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-79a9f7/24365/build.log
?

Most things I see here are not related to this PR

@mmusich
Copy link
Contributor

mmusich commented May 3, 2022

That wf does not run HLT, and I don't know if those log messages are to be expected.

If that wf does not run HLT, why are we running the HLT tracking monitoring in it? That is producing a gazillion of product not found warnings...

@missirol
Copy link
Contributor

missirol commented May 3, 2022

Good question for DQM? @cms-sw/dqm-l2

@missirol
Copy link
Contributor

missirol commented May 3, 2022

Good question for DQM?

Actually, the question might be better directed to who designed, or maintains, that wf.

In any case, one way to fix this is [*]. I can't judge if that's the correct fix, that's left to the experts.

It's clear that this issue is not introduced by this PR, and it likely needs to be fixed in master first.

Arguably, this PR should not be delayed because of it.

[*]

diff --git a/Configuration/PyReleaseValidation/python/relval_steps.py b/Configuration/PyReleaseValidation/python/relval_steps.py
index d7a461e1bda..2dde975c0c4 100644
--- a/Configuration/PyReleaseValidation/python/relval_steps.py
+++ b/Configuration/PyReleaseValidation/python/relval_steps.py
@@ -2738,7 +2738,7 @@ steps['ALCARECOEXPR3']=merge([{'-s':'ALCAOUTPUT:SiPixelCalZeroBias+SiStripCalZer
                                '--triggerResultsProcess': 'RECO',
                                '--customise':'Configuration/DataProcessing/RecoTLR.customiseExpress'},steps['RECODR3']])
 
-steps['ALCARECOPROMPTR3']=merge([{'-s':'RAW2DIGI,L1Reco,RECO,ALCA:SiStripCalZeroBias+SiStripCalMinBias+TkAlMinBias+HcalCalHO+HcalCalIterativePhiSym+HcalCalHBHEMuonProducerFilter+HcalCalIsoTrkProducerFilter,DQM',
+steps['ALCARECOPROMPTR3']=merge([{'-s':'RAW2DIGI,L1Reco,RECO,ALCA:SiStripCalZeroBias+SiStripCalMinBias+TkAlMinBias+HcalCalHO+HcalCalIterativePhiSym+HcalCalHBHEMuonProducerFilter+HcalCalIsoTrkProducerFilter,DQM:@standardDQMFakeHLT',
                                   '--conditions':'auto:run3_data_prompt',
                                   '--scenario':'pp',
                                   '--era':'Run3',

@missirol
Copy link
Contributor

missirol commented May 5, 2022

@perrotta, the issue of spurious warnings in wf 138.4 is being addressed elsewhere.

Do you think this PR can now be integrated?

@mmusich
Copy link
Contributor

mmusich commented May 5, 2022

@smuzaffar

You potentially added 940 lines to the logs

is there any easy way to scan from where these come from?

@smuzaffar
Copy link
Contributor

@mmusich
Copy link
Contributor

mmusich commented May 5, 2022

Just a guess: maybe it has to do with the wf that was failing in IB, and passes now thanks to this PR, i.e. 138.4.

I think this is indeed probably the cause of the additions (from the log above, thanks @smuzaffar !):

You added 2712 to /data/cmsbld/jenkins/workspace/compare-root-files-short-matrix/data/PR-79a9f7/138.4_PromptCollisions+RunMinimumBias2021+ALCARECOPROMPTR3+HARVESTDPROMPTR3/step2_PromptCollisions+RunMinimumBias2021+ALCARECOPROMPTR3+HARVESTDPROMPTR3.log

@perrotta
Copy link
Contributor

perrotta commented May 5, 2022

Thank you @smuzaffar : it is not an easy read, but I also think that it can confirm that the extra lines in the log mostly come from 138.4.

According to that link the extra lines in that workflow log are 2712, while #37670 (comment) says that overall the extra lines are 940. It means that all other log comprehensively remove O(1800) lines. I think they could be the oned detailed in the bottom of https://cmssdt.cern.ch/SDT/jenkins-artifacts/baseLineComparisons/CMSSW_12_2_X_2022-05-01-0000+79a9f7/49980/validateJR/logRootQA.log where "You removed..." is written. But I don't think that any more detailed debug of the issue is needed ;-)

Conclusion: what @missirol suggested in #37670 (comment) makes sense, and we can merge this PR, so that next IB tests in 12_2_X will be clean.

@perrotta
Copy link
Contributor

perrotta commented May 5, 2022

+1

  • Meant to remove the errors to the HLT tests in 12_2_X IBs

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

9 participants