
Split DQM configuration tests into smaller sets to make them run fast #34186

Merged
merged 1 commit into from Jun 20, 2021

Conversation

smuzaffar
Contributor

Last night we updated the configuration to kill any test which runs for over 1 hour, and caught these tests failing. It looks like these tests take over an hour to run, which is why a few of them failed.

I propose to split these tests into smaller sets (e.g. 20 each) so that they can run in parallel and finish within one hour.

@smuzaffar
Contributor Author

please test

@cmsbuild
Contributor

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-34186/23400

  • This PR adds an extra 12KB to the repository

@cmsbuild
Contributor

A new Pull Request was created by @smuzaffar (Malik Shahzad Muzaffar) for master.

It involves the following packages:

DQMOffline/Configuration

@andrius-k, @kmaeshima, @ErnestaP, @ahmad3213, @jfernan2, @rvenditti can you please review it and eventually sign? Thanks.
@threus, @rociovilar this is something you requested to watch as well.
@silviodonato, @dpiparo, @qliphy you are the release manager for this.

cms-bot commands are listed here

@cmsbuild
Contributor

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-d97bc8/16111/summary.html
COMMIT: b4457bd
CMSSW: CMSSW_12_0_X_2021-06-18-2300/slc7_amd64_gcc900
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/34186/16111/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 0 differences found in the comparisons
  • DQMHistoTests: Total files compared: 38
  • DQMHistoTests: Total histograms compared: 2785631
  • DQMHistoTests: Total failures: 1
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 2785608
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 37 files compared)
  • Checked 160 log files, 37 edm output root files, 38 DQM output files
  • TriggerResults: no differences found

@qliphy
Contributor

qliphy commented Jun 20, 2021

@smuzaffar Thanks! We also need to backport this to 11_3_X

@jfernan2
Contributor

+1

@cmsbuild
Contributor

This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @silviodonato, @dpiparo, @qliphy (and backports should be raised in the release meeting by the corresponding L2)

@qliphy
Contributor

qliphy commented Jun 20, 2021

+1

@smuzaffar
Contributor Author

smuzaffar commented Jun 23, 2021

@jfernan2 , these tests still take a lot of time; e.g. the ppc64le tests still time out after 1 hour. Is there any reason not to break them into sets of 10 or 5 (or, if the overhead of starting a test is not high, maybe 1 per test)? SCRAM is soon going to support ( cms-sw/cmsdist#7056 )

<test name="FOO${loop}" command="command args ${loop}" loop="start,stop,step"/>

so we can use that to split these tests, e.g.

<test name="TestDQMOfflineConfiguration${loop}" command="runtests.sh 5 ${loop}" loop="0,295,5"/>

which should run TestDQMOfflineConfiguration0, TestDQMOfflineConfiguration5, ..., TestDQMOfflineConfiguration295
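Assuming the stop value in loop="start,stop,step" is inclusive (as the expected test names above suggest), the expansion can be sketched in Python. This is illustrative only, not SCRAM code:

```python
# Sketch of how a loop="start,stop,step" attribute could expand a templated
# <test name="...${loop}"> into concrete test names (stop assumed inclusive).
def expand_loop(name_template, start, stop, step):
    return [name_template.format(loop=i) for i in range(start, stop + 1, step)]

names = expand_loop("TestDQMOfflineConfiguration{loop}", 0, 295, 5)
# names[0]  -> "TestDQMOfflineConfiguration0"
# names[-1] -> "TestDQMOfflineConfiguration295"
# 60 independent tests, each small enough to finish well under an hour
```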

@jfernan2
Contributor

Hi @smuzaffar
I am not against the atomization you suggest; the point is whether the timeouts are due to delays from AAA serving the input file or to the tests themselves.
If the timeout is produced by slow file serving, making more tests will increase the problem, right?

@smuzaffar
Contributor Author

AAA gets involved only when we try to access new data; otherwise all data (once accessed by cmsRun in IB/PR tests) should be cached and accessed via the CERN IB EOS area. I see that all tests have messages like

entry /store/data/Run2018A/EGamma/RAW/v1/000/315/489/00000/004D960A-EA4C-E811-A908-FA163ED1F481.root

which is available on cern eos cached area.

Of course, if there is an issue with CERN EOS then yes, we will have a problem.

@jfernan2
Contributor

OK, by AAA I meant either xrootd access or the fallback to EOS; sorry for not being precise.
The point is that the tests themselves are not long, but if the input is not properly served the modules keep waiting.
Can you point me to any of the logs for the failing ppc64le tests?
Thanks

@smuzaffar
Contributor Author

Unfortunately the logs are gone now, as I had restarted the tests with a 7200s timeout, which allowed them to finish. But see https://cmssdt.cern.ch/SDT/cgi-bin/logreader/slc7_ppc64le_gcc9/CMSSW_12_0_X_2021-06-22-2300/unitTestLogs/DQMOffline/Configuration#/816 (search for TestTime:) and you will notice that some tests took over 7000s to finish. For amd64, these tests take 1300s-2200s.
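The TestTime entries mentioned above can be pulled out of a log mechanically; a hypothetical sketch (the exact "TestTime: <seconds>" log format and the helper name are assumptions):

```python
import re

# Hypothetical helper: extract "TestTime: <seconds>" entries from a
# unit-test log and flag the tests that exceed a wall-time threshold.
def slow_test_times(log_text, threshold_s=3600):
    times = [int(t) for t in re.findall(r"TestTime:\s*(\d+)", log_text)]
    return [t for t in times if t > threshold_s]

sample = "TestTime: 1300\nTestTime: 7123\nTestTime: 2200\n"
print(slow_test_times(sample))  # -> [7123]
```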

Anyway, I ran the tests locally and noticed that most of the time is spent outside cmsRun. See the average times below for 3 tests run in parallel, each with --limit 20:

  • AMD64:

    • cmsDriver: 10s
    • cmsRun: 30s
    • Overall job time: 1260s
    • Avg time without cmsRun: 1260 - (30 x 20) = 660s
  • PPC64LE:

    • cmsDriver: 17s
    • cmsRun: 55s
    • Overall job time: 2100s
    • Avg time without cmsRun: 2100 - (55 x 20) = 1000s

I think the non-cmsRun time should remain nearly the same, and it is already too big on ppc64le. So running with --limit 5 should allow the tests to run faster.
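The arithmetic above amounts to a simple wall-time model: job time is roughly a fixed non-cmsRun overhead plus (--limit x per-workflow cmsRun time). A sketch with the measured numbers; the constant-overhead assumption is smuzaffar's, not verified here:

```python
# Wall-time model for one test job: a fixed non-cmsRun overhead plus
# (--limit x per-workflow cmsRun time), using the measurements above.
def job_time(overhead_s, cmsrun_s, limit):
    return overhead_s + cmsrun_s * limit

amd64 = job_time(660, 30, 20)        # 1260s, matches the observed AMD64 total
ppc64 = job_time(1000, 55, 20)       # 2100s, matches the observed PPC64LE total

# If the overhead really stays roughly constant, --limit 5 on ppc64le gives:
ppc64_small = job_time(1000, 55, 5)  # 1275s, comfortably under the 1-hour cap
```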

@jfernan2
Contributor

jfernan2 commented Jun 23, 2021

Thanks @smuzaffar
I am not familiar with the PPC64LE architecture, so I cannot imagine any issue which may cause the difference w.r.t. AMD64.
In the logs you point out, I see at least one test which failed at the cmsDriver step: TestDQMOfflineConfiguration140, which seems strange since the same test had several successful cmsDriver calls before.

For my understanding, which process does non-cmsRun or non-cmsDriver time correspond to?

@smuzaffar
Contributor Author

@jfernan2
Contributor

Yes, but non-cmsRun time is in practice what? Waiting time? Idle time? Doing what?
Thanks

@smuzaffar
Contributor Author

No idea @jfernan2, I thought that you would know (you are one of the developers of https://github.com/cms-sw/cmssw/blob/master/DQMOffline/Configuration/scripts/cmsswSequenceInfo.py :-) )

@jfernan2
Contributor

Yes, but everything in there is either cmsDriver, cmsRun or edmPluginHelp/edmConfigDump, i.e. a splitting of sequences into pieces to be run with cmsRun... no further magic inside the script.
Is Python optimized for the PPC64LE arch?

@smuzaffar
Contributor Author

Python on ppc64le is built the same way as for amd64 ( https://github.com/cms-sw/cmsdist/blob/IB/CMSSW_12_0_X/master/python3.spec ). Power8 systems are big servers (128 cores) and are also shared with other users, so their performance is not the same as our dedicated cmsbuild machines.

@jfernan2
Contributor

Ok thanks.
And is there any way to measure where the time is being spent, something IgProf-like? With the characteristics you quote, I can only think of larger latency in serving the input files, or in the output throughput, in one case w.r.t. the other; or is the file delivery time already counted in your cmsRun time?

cmsRun/cmsDriver time differs by almost a factor of two in your numbers above (55/30), so it seems no surprise that non-cmsRun time differs by a factor of 1.5 (1000/660).

@smuzaffar
Contributor Author

By the way, if large latency in serving the input files is an issue, then we can try something like

<test name="GetTestDQMOfflineConfigurationFile" command="edmCopyUtil /store/data/foo/bar/input.root $(LOCALTOP)/tmp/"/>
<test name="TestDQMOfflineConfiguration${loop}" command="runtests.sh   ${step} ${loop}" loop="0,295,5">
  <flags PRE_TEST="GetTestDQMOfflineConfigurationFile"/>
</test>

and update cmsswSequenceInfo.py to read the local file from $(LOCALTOP)/tmp/input.root (or pass it via the command line). This way the input file is downloaded once, and every cmsRun command run by cmsswSequenceInfo.py will read a local file.
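A minimal sketch of that local-copy fallback; the helper name, the "file:" prefix handling, and the LOCALTOP lookup are assumptions, not the actual cmsswSequenceInfo.py code:

```python
import os

def resolve_input(lfn, local_dir=None):
    # Prefer a pre-fetched local copy (e.g. the one made by the
    # edmCopyUtil pre-test above); otherwise fall back to the /store
    # path so cmsRun resolves it via xrootd/EOS as before.
    local_dir = local_dir or os.path.join(os.environ.get("LOCALTOP", "."), "tmp")
    candidate = os.path.join(local_dir, os.path.basename(lfn))
    if os.path.exists(candidate):
        return "file:" + candidate
    return lfn
```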

@jfernan2
Contributor

I don't know if latency is an issue for PPC64LE, but if CPU is not the real difference I don't know what else to blame.

In the past this kind of test gave problems ONLY when EOS was having timeouts: since in this PR you have fragmented the tests even further, CPU should not be responsible... but this is just my guess; without a proper benchmark I cannot be sure.

On the other hand, from your log above, the same file is used 212 times in the test, so a first unique copy can indeed be a solution. Just tell me if you want me to implement it. Thanks

@smuzaffar
Contributor Author

Once the scram changes are merged, I will update the test to first make a local copy of the input file, and update cmsswSequenceInfo.py to use the local copy.
