[DQM] Prompt Matrix reconfiguration #4619

Closed · wants to merge 13 commits

Conversation


@jfernan2 jfernan2 commented Oct 11, 2021

Changes for the Prompt Matrix DQM reduction:

Reduced (see the sketch after this list):

  • L1TMon to SingleMuon, SingleElectron/EG and ZB
  • TAU to SingleMuon, SingleElectron/EG and TAU
  • TRK to SingleMuon, SingleElectron/EG, DoubleMuon, JetHT, JetMET and ZB, by creating a commonReduced group which excludes TRK
  • CTPPS to DoubleMuon, SingleElectron/EG and ZB

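For orientation, here is a minimal, self-contained sketch of the resulting mapping, written in plain Python rather than the actual Tier0 configuration syntax. Dataset and sequence labels are copied verbatim from the list above (ZB = ZeroBias); the real change is carried by the per-dataset DQM sequence lists in ReplayOfflineConfiguration.py.

    # Illustrative only -- not the actual diff of this PR.
    # Which primary datasets still run each of the reduced DQM sequences.
    REDUCED_SEQUENCES = {
        "L1TMon": ["SingleMuon", "SingleElectron/EG", "ZB"],
        "TAU":    ["SingleMuon", "SingleElectron/EG", "TAU"],
        "TRK":    ["SingleMuon", "SingleElectron/EG", "DoubleMuon",
                   "JetHT", "JetMET", "ZB"],  # via the new commonReduced group, which excludes TRK
        "CTPPS":  ["DoubleMuon", "SingleElectron/EG", "ZB"],
    }

    def runs_sequence(sequence, dataset):
        """True if `sequence` is still scheduled for `dataset` after the reduction."""
        return dataset in REDUCED_SEQUENCES.get(sequence, [])

    assert runs_sequence("CTPPS", "DoubleMuon")
    assert not runs_sequence("TRK", "TAU")
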
Replay Request

Requestor
PPD-DQM

Describe the configuration

Purpose of the test

I am not sure whether a replay is needed. PPD has started a campaign to reduce the DQM load in the Prompt matrix. This PR reduces the number of sequences as agreed here:

https://docs.google.com/presentation/d/1AF65xzq7T70Yt-0o_OV47YTFBv-kbvHZDlRmUSCZsAw/edit#slide=id.geedc047305_0_0

T0 Operations HyperNews thread
https://hypernews.cern.ch/HyperNews/CMS/get/tier0-Ops/2314.html

Thanks

Changes for Prompt Matrix DQM reduction
@cmsdmwmbot

Can one of the admins verify this patch?

@francescobrivio
Contributor

There are two options here:

  1. Replay recent cosmics runs as in Replay testing CMSSW_12_1_0_pre4 #4616:

    • Runs: 343082,344063
    • GTs:
      • expressGlobalTag: 120X_dataRun3_Express_v2
      • promptrecoGlobalTag: 120X_dataRun3_Prompt_v2
      • alcap0GlobalTag: 120X_dataRun3_Prompt_v2
  2. On the other hand, I believe most of these sequences are not really exercised with cosmics, so we could test it with 2018 pp collisions (same as in AlCaDB test of PCL workflows #4602)

    • Runs: 317696
    • GTs:
      • expressGlobalTag: 120X_dataRun3_Express_Candidate_2021_09_30_18_52_55
      • promptrecoGlobalTag: 120X_dataRun3_Prompt_Candidate_2021_09_30_19_06_33
      • alcap0GlobalTag: 120X_dataRun3_Prompt_Candidate_2021_09_30_19_06_33

I'll let @germanfgv and the other Tier0 experts comment further.

@germanfgv
Contributor

I agree that a 2018 pp collisions test would be better. We cannot start such a test yet, but we can do it later during the week. In any case, it seems like RelVal shows some issues with cms-sw/cmssw#35605, so there is no point in running a replay.

Also, please propose this kind of change in the T0 HyperNews first, so everyone is aware.

@jfernan2 changed the title from "Update ReplayOfflineConfiguration.py" to "[DQM] Prompt Matrix reconfiguration" on Oct 11, 2021
@jfernan2
Author

cms-sw/cmssw#35605 has been fixed; it was an unrelated problem affecting Run1 datasets.

Hypernews announcement: https://hypernews.cern.ch/HyperNews/CMS/get/tier0-Ops/2278.html

@germanfgv
Contributor

@jfernan2 I'm a bit confused. If we need cms-sw/cmssw#35605 to properly test this configuration, then that PR should already be merged into a release. We can only test CMSSW code that is available in /cvmfs/.

If that's not the case, then what CMSSW release would you like us to use?
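
As an aside, since only releases deployed on /cvmfs/ can be tested, here is a quick sketch of how one might check for a release's presence there. The path layout /cvmfs/cms.cern.ch/<scram_arch>/cms/cmssw/<release> is an assumption of this example, not something quoted in the thread.

    import glob

    # Assumed cvmfs layout: /cvmfs/cms.cern.ch/<scram_arch>/cms/cmssw/<release>
    release = "CMSSW_12_1_0_pre5"
    matches = glob.glob(f"/cvmfs/cms.cern.ch/*/cms/cmssw/{release}")
    print(f"{release}: {'deployed on cvmfs' if matches else 'not (yet) deployed'}")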

@jfernan2
Author

OK @germanfgv, sorry, I thought an extra PR could be applied on top. So @qliphy, please let's integrate cms-sw/cmssw#35605 into a release so that Tier0 can run the test.
Thanks!

@tvami
Contributor

tvami commented Oct 23, 2021

Hi @jfernan2, this is not relevant for the pilot beams, right?
Since 12_1_0 is about to come out, I'd suggest waiting for that, and then running the replay with the 121X GTs as well, maybe even on the pilot beam data?

@jfernan2
Author

Right, this is not relevant for the pilot beams. It is just for computing resource saving purposes, so it can wait.
Thanks

@jfernan2
Author

jfernan2 commented Nov 3, 2021

@germanfgv PR #35605 was merged in CMSSW_12_1_0_pre5
https://github.com/cms-sw/cmssw/releases/tag/CMSSW_12_1_0_pre5
So, if you could perform the replay at some point, it would be appreciated.
Thanks

@tvami
Contributor

tvami commented Nov 3, 2021

Hi @jfernan2 I think we wanted to wait for CMSSW_12_1_0 to come out, i.e. tomorrow

@qliphy
Contributor

qliphy commented Nov 5, 2021

@tvami @jfernan2 @germanfgv CMSSW_12_1_0 is now ready.

@germanfgv
Contributor

@jfernan2 @tvami I updated the CMSSW version. Can you check the GTs and runs before triggering the replay?

@francescobrivio
Contributor

francescobrivio commented Nov 5, 2021

@germanfgv I think the best way to test the CPU usage reduction with the new DQM Matrix is to use a 2018 pp run, as specified in #4619 (comment).
So you should update in the configuration (see the sketch below):

  • the run number
  • the GTs
  • the scenario (pp collisions instead of cosmics)

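For concreteness, the sketch referenced above: those three updates written as plain assignments, using the run, candidate GTs and pp scenario proposed earlier in this thread. The global-tag variable names are the ones quoted in the options comment; injectRuns is shorthand here rather than the configuration's actual field name, and ppScenario follows the replay summary posted later in the thread.

    # Illustrative values for the 2018 pp replay, not the actual configuration diff.
    injectRuns          = [317696]
    expressGlobalTag    = "120X_dataRun3_Express_Candidate_2021_09_30_18_52_55"
    promptrecoGlobalTag = "120X_dataRun3_Prompt_Candidate_2021_09_30_19_06_33"
    alcap0GlobalTag     = "120X_dataRun3_Prompt_Candidate_2021_09_30_19_06_33"
    ppScenario          = "ppEra_Run2_2018"  # pp collisions scenario instead of the cosmics one
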
@germanfgv
Contributor

@francescobrivio I made the necessary changes. I'll start the replay.

@francescobrivio
Contributor

Looks good! Thanks @germanfgv!

@germanfgv
Contributor

run replay please

@cmsdmwmbot

There are 16 repack workflows.
There are 4 express workflows.
There are 1016 filesets not closed.
There are 769 paused jobs in the replay.

@cmsdmwmbot

There are 16 repack workflows.
There are 5 express workflows.
There are 1297 filesets not closed.
There are 3599 paused jobs in the replay.

@cmsdmwmbot

There are 12 repack workflows.
There are 5 express workflows.
There are 1314 filesets not closed.
There are 4380 paused jobs in the replay.

@germanfgv
Contributor

@francescobrivio it looks like there is an issue with the GTs. We are getting the following error, for both Express and Prompt:

An exception of category 'NoRecord' occurred while
   [0] Processing global begin Run run: 317696
   [1] Prefetching for module TotemTimingDQMSource/'totemTimingDQMSource'
   [2] Prefetching for EventSetup module CTPPSGeometryESModule/'ctppsGeometryESModule'
   [3] Calling method for EventSetup module CTPPSGeometryESModule/'ctppsGeometryESModule'
   [4] While getting dependent Record from Record VeryForwardRealGeometryRecord
Exception Message:
No "VeryForwardIdealGeometryRecord" record found in the EventSetup.

 The Record is delivered by an ESSource or ESProducer but there is no valid IOV for the synchronization value.
 Please check 
   a) if the synchronization value is reasonable and report to the hypernews if it is not.
   b) else check that all ESSources have been properly configured.

Should we be using different GTs?

@cmsdmwmbot

There are 8 repack workflows.
There are 5 express workflows.
There are 728 filesets not closed.
There are 1 paused jobs in the replay.

@tvami
Contributor

tvami commented Nov 6, 2021

Hi @germanfgv,
yes, this VeryForwardIdealGeometryRecord was added recently.
Let's use these GTs:

  • Express: 121X_dataRun3_Express_v11
  • Prompt: 121X_dataRun3_Prompt_v10

For 2018

  • Express: 121X_dataRun3_Express_Candidate_2021_11_06_15_02_30
  • Prompt: 121X_dataRun3_Prompt_Candidate_2021_11_06_15_02_45

GTs to fix issue with VeryForwardIdealGeometryRecord
@germanfgv
Contributor

@jfernan2 I'm not sure what you are referring to when you say that the memory issues have been clarified. Did I miss something? Could you point me to the discussion?

In any case, if there is a fix to the issue we can test to see if it works. Is it available in one of the pre-releases?

@jfernan2
Author

I am sorry @germanfgv, I thought you were following the Tier0 thread that you initiated:
https://hypernews.cern.ch/HyperNews/CMS/get/tier0-Ops/2314/3/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1.html

@germanfgv
Contributor

@jfernan2 I think I missed those last few messages. From what I see, there are no changes that need to be tested, just the increase in the T0 memory limit (we did discuss this in our T0 internal meeting). In any case, I'd like to confirm that 2 GB/core is enough to run the jobs. For that, I'll redeploy the replay.

@germanfgv
Contributor

run replay please

@jfernan2
Author

Thanks @germanfgv

@cmsdmwmbot

Replay testing PR '[DQM] Prompt Matrix reconfiguration'
An automatic replay has been requested by jfernan2.
Here is a brief description of the replay.
Github PR : #4619
PR author : jfernan2
Requestor : PPD-DQM
Injected runs : 317696
CMSSW release : CMSSW_12_1_0
Tier0 release : 3.0.1
ppScenario : ppEra_Run2_2018
Tier0 Config : https://cmst0.web.cern.ch/CMST0/tier0/offline_config/ReplayOfflineConfiguration_047.php
Container ID : 1
Jenkins Build : https://cmssdt.cern.ch/dmwm-jenkins/job/DMWM-T0-PR-test-job/366/
Jira Issue : https://its.cern.ch/jira/browse/CMSTZDEV-702

@cmsdmwmbot

There are 17 repack workflows.
There are 5 express workflows.
There are 1128 filesets not closed.
There are 1 paused jobs in the replay.

@germanfgv
Contributor

After this latest retry, we still have one MET Prompt Reco job exceeding the 2 GB/core memory limit. @jfernan2, from what I understand this may improve in the next 12_1_X release, is that right?

@jfernan2
Author

@germanfgv From the RECO side I do not know; from the DQM side no memory improvements are expected: this PR already reduces the number of modules run, and hence the memory, w.r.t. 12_0_X.

@slava77

slava77 commented Nov 29, 2021

After this latest retry, we still have one MET Prompt Reco job exceeding the 2 GB/core memory limit. @jfernan2, from what I understand this may improve in the next 12_1_X release, is that right?

is the log for the job available?
(I've forgotten how to find the replay details on the web with wmstats)

@jfernan2
Author

@slava77 would a backport of this PR reduce the memory?
cms-sw/cmssw#36246
Thanks

@slava77

slava77 commented Nov 29, 2021

@slava77 would a backport of this PR reduce the memory?
cms-sw/cmssw#36246

I think so

@germanfgv
Contributor

After this latest retry, we still have one MET Prompt Reco job exceeding the 2 GB/core memory limit. @jfernan2, from what I understand this may improve in the next 12_1_X release, is that right?

is the log for the job available? (I've forgotten how to find the replay details on the web with wmstats)

@slava77 You can find the job's tarball here:
C1_vocms047.cern.ch-23926-0-log.tar.gz

I'll put it in AFS in a few minutes.

@germanfgv
Contributor

germanfgv commented Nov 29, 2021

@slava77 In case you prefer:
/afs/cern.ch/user/c/cmst0/public/PausedJobs/DQMseq/job_23926/tarball

@slava77

slava77 commented Nov 29, 2021

@slava77 In case you prefer: /afs/cern.ch/user/c/cmst0/public/PausedJobs/DQMseq/job_23926/tarball

Thanks.
I was mainly looking for

Job has exceeded maxPSS: 16000 MB
Job has PSS: 16823 MB
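
For reference, a quick reading of those numbers against the 2 GB/core policy mentioned earlier in the thread; the core count is inferred from the quoted limit, not stated in the log.

    # 16000 MB of maxPSS is consistent with an 8-core job at 2 GB (2000 MB) per core.
    cores      = 8
    max_pss_mb = cores * 2000           # 16000 MB, the limit quoted in the log
    pss_mb     = 16823                  # PSS reported for the paused job
    excess_pct = 100.0 * (pss_mb - max_pss_mb) / max_pss_mb
    print(f"over the limit by {excess_pct:.1f}%")  # ~5.1%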

@jfernan2
Author

jfernan2 commented Dec 6, 2021

Hi @germanfgv,
I have switched to CMSSW_12_1_1, now that the memory issues have been reduced there.

@germanfgv
Contributor

@jfernan2 great! I'll manually run the test to see if it's enough.

@germanfgv
Contributor

@jfernan2 The replay finished without problems. The 16 GB of memory were enough to reconstruct the MET dataset.

@Jetmet

Jetmet commented Dec 8, 2021 via email

@jfernan2
Author

Hi @germanfgv
What are the next steps?
Thanks

@germanfgv
Contributor

Hi @jfernan2, I'll make a clean PR adding these changes to the Replay config and the Production config.

jhonatanamado added a commit to jhonatanamado/T0 that referenced this pull request Jan 19, 2022
Following what was discussed here with a new configuration of PromptMatrix
dmwm#4619 (comment)
jhonatanamado added a commit to jhonatanamado/T0 that referenced this pull request Jan 20, 2022
Following what was discussed here with a new configuration of PromptMatrix
dmwm#4619 (comment)
@tvami
Contributor

tvami commented Apr 21, 2022

Hi @jfernan2, my understanding is that this was added to the prod config, so I believe this PR can be closed. Do you agree?

@Jetmet

Jetmet commented Apr 21, 2022 via email
