
Split DQM configuration tests into smaller sets to make them run fast #34186

Merged
merged 1 commit into from Jun 20, 2021

Conversation

smuzaffar
Contributor

Last night we updated the configuration to kill any test which runs for over 1 hour, and caught these tests failing. It looks like these tests take over an hour to run, which is why a few of them failed.

I propose to split these tests into smaller sets (e.g. 20 each) so that they can run in parallel and finish within one hour.

@smuzaffar
Contributor Author

please test

@cmsbuild
Contributor

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-34186/23400

  • This PR adds an extra 12KB to the repository

@cmsbuild
Contributor

A new Pull Request was created by @smuzaffar (Malik Shahzad Muzaffar) for master.

It involves the following packages:

DQMOffline/Configuration

@andrius-k, @kmaeshima, @ErnestaP, @ahmad3213, @jfernan2, @rvenditti can you please review it and eventually sign? Thanks.
@threus, @rociovilar this is something you requested to watch as well.
@silviodonato, @dpiparo, @qliphy you are the release manager for this.

cms-bot commands are listed here

@cmsbuild
Contributor

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-d97bc8/16111/summary.html
COMMIT: b4457bd
CMSSW: CMSSW_12_0_X_2021-06-18-2300/slc7_amd64_gcc900
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/34186/16111/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 0 differences found in the comparisons
  • DQMHistoTests: Total files compared: 38
  • DQMHistoTests: Total histograms compared: 2785631
  • DQMHistoTests: Total failures: 1
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 2785608
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 37 files compared)
  • Checked 160 log files, 37 edm output root files, 38 DQM output files
  • TriggerResults: no differences found

@qliphy
Contributor

qliphy commented Jun 20, 2021

@smuzaffar Thanks! We also need to backport this to 11_3_X

@jfernan2
Contributor

+1

@cmsbuild
Contributor

This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @silviodonato, @dpiparo, @qliphy (and backports should be raised in the release meeting by the corresponding L2)

@qliphy
Contributor

qliphy commented Jun 20, 2021

+1

@smuzaffar
Contributor Author

smuzaffar commented Jun 23, 2021

@jfernan2 , these tests still take a lot of time; e.g. the ppc64le tests still time out after 1 hour. Is there any reason not to break them into sets of 10 or 5 (or, if the overhead of starting a test is not high, maybe 1 per test)? SCRAM is soon going to support ( cms-sw/cmsdist#7056 )

<test name="FOO${loop}" command="command args ${loop}" loop="start,stop,step"/>

so we can use that to split these tests, e.g.

<test name="TestDQMOfflineConfiguration${loop}" command="runtests.sh 5 ${loop}" loop="0,295,5"/>

which should run TestDQMOfflineConfiguration0, TestDQMOfflineConfiguration5, ..., TestDQMOfflineConfiguration295
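Assuming the stop value in loop="start,stop,step" is inclusive (as the expected test names above suggest), the expansion can be sketched in Python. This is illustrative only, not SCRAM code:

```python
# Sketch of how a loop="start,stop,step" attribute could expand a templated
# <test name="...${loop}"> into concrete test names (stop assumed inclusive).
def expand_loop(name_template, start, stop, step):
    return [name_template.format(loop=i) for i in range(start, stop + 1, step)]

names = expand_loop("TestDQMOfflineConfiguration{loop}", 0, 295, 5)
# names[0]  -> "TestDQMOfflineConfiguration0"
# names[-1] -> "TestDQMOfflineConfiguration295"
# 60 independent tests, each small enough to finish well under an hour
```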

@jfernan2
Contributor

Hi @smuzaffar
I am not against the atomization you suggest; the point is whether the timeouts are due to delays from AAA serving the input file or to the tests themselves.
If the timeout is produced by slow file serving, making more tests will increase the problem, right?

@smuzaffar
Contributor Author

AAA gets involved only when we try to access new data; otherwise all data (once accessed by cmsRun in IB/PR tests) should be cached and accessed via the CERN IB EOS area. I see that all tests have messages like

entry /store/data/Run2018A/EGamma/RAW/v1/000/315/489/00000/004D960A-EA4C-E811-A908-FA163ED1F481.root

which is available on cern eos cached area.

Of course, if there is an issue with CERN EOS then yes, we will have a problem.

@jfernan2
Contributor

OK, by AAA I meant either xrootd access or the fallback to EOS; sorry for not being precise.
The point is that the tests themselves are not long, but if the input is not properly served the modules keep waiting.
Can you point me to any of the logs for the failing ppc64le tests?
Thanks

@smuzaffar
Contributor Author

Unfortunately the logs are gone now, as I had restarted the tests with a 7200s timeout, which allowed them to finish. But see https://cmssdt.cern.ch/SDT/cgi-bin/logreader/slc7_ppc64le_gcc9/CMSSW_12_0_X_2021-06-22-2300/unitTestLogs/DQMOffline/Configuration#/816 (search for TestTime:) and you will notice that some tests took over 7000s to finish. For amd64, these tests take 1300s-2200s.
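The TestTime entries mentioned above can be pulled out of a log mechanically; a hypothetical sketch (the exact "TestTime: <seconds>" log format and the helper name are assumptions):

```python
import re

# Hypothetical helper: extract "TestTime: <seconds>" entries from a
# unit-test log and flag the tests that exceed a wall-time threshold.
def slow_test_times(log_text, threshold_s=3600):
    times = [int(t) for t in re.findall(r"TestTime:\s*(\d+)", log_text)]
    return [t for t in times if t > threshold_s]

sample = "TestTime: 1300\nTestTime: 7123\nTestTime: 2200\n"
print(slow_test_times(sample))  # -> [7123]
```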

Anyway, I ran the tests locally and noticed that most of the time is spent outside cmsRun. See the average times below for 3 tests run in parallel, each with --limit 20:

  • AMD64:

    • cmsDriver: 10s
    • cmsRun: 30s
    • Overall job time: 1260s
    • Avg time without cmsRun: 1260 - (30 x 20) = 660s
  • PPC64LE:

    • cmsDriver: 17s
    • cmsRun: 55s
    • Overall job time: 2100s
    • Avg time without cmsRun: 2100 - (55 x 20) = 1000s

I think the non-cmsRun time should remain nearly the same, and it is already too big on ppc64le. So running with --limit 5 should allow the tests to run faster.
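The arithmetic above amounts to a simple wall-time model: job time is roughly a fixed non-cmsRun overhead plus (--limit x per-workflow cmsRun time). A sketch with the measured numbers; the constant-overhead assumption is smuzaffar's, not verified here:

```python
# Wall-time model for one test job: a fixed non-cmsRun overhead plus
# (--limit x per-workflow cmsRun time), using the measurements above.
def job_time(overhead_s, cmsrun_s, limit):
    return overhead_s + cmsrun_s * limit

amd64 = job_time(660, 30, 20)        # 1260s, matches the observed AMD64 total
ppc64 = job_time(1000, 55, 20)       # 2100s, matches the observed PPC64LE total

# If the overhead really stays roughly constant, --limit 5 on ppc64le gives:
ppc64_small = job_time(1000, 55, 5)  # 1275s, comfortably under the 1-hour cap
```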

@jfernan2
Contributor

jfernan2 commented Jun 23, 2021

Thanks @smuzaffar
I am not familiar with the PPC64LE architecture, so I cannot imagine any issue which may cause the difference w.r.t. AMD64.
In the logs you point out, I see at least one test which failed at the cmsDriver step: TestDQMOfflineConfiguration140, which seems strange since the same test had several successful cmsDriver calls before.

For my understanding, which process does non-cmsRun or non-cmsDriver time correspond to?

@smuzaffar
Contributor Author

@jfernan2
Contributor

Yes, but non-cmsRun time is in practice what? Waiting time? Idle time? Doing what?
Thanks

@smuzaffar
Contributor Author

No idea @jfernan2, I thought that you would know (you are one of the developers of https://github.com/cms-sw/cmssw/blob/master/DQMOffline/Configuration/scripts/cmsswSequenceInfo.py :-) )

@jfernan2
Contributor

Yes, but everything in there is either cmsDriver, cmsRun or edmPluginHelp/edmConfigDump, i.e. a splitting of sequences into pieces to be run with cmsRun... no further magic inside the script.
Is Python optimized for the PPC64LE arch?

@smuzaffar
Contributor Author

Python on ppc64le is built the same way as for amd64 ( https://github.com/cms-sw/cmsdist/blob/IB/CMSSW_12_0_X/master/python3.spec ). Power8 systems are big servers (128 cores) and are also shared with other users, so their performance is not the same as our dedicated cmsbuild machines.

@jfernan2
Contributor

Ok thanks.
And is there any way to measure where the time is being spent, something IgProf-like? With the characteristics you quote, I can only think of larger latency in serving the input files, or in the output throughput, in one case w.r.t. the other; or is the file delivery time already counted in your cmsRun time?

cmsRun/cmsDriver time differs by almost a factor of two in your numbers above (55/30), so it seems no surprise that non-cmsRun time differs by a factor of 1.5 (1000/660).

@smuzaffar
Contributor Author

By the way, if large latency in serving the input files is an issue, then we can try something like

<test name="GetTestDQMOfflineConfigurationFile" command="edmCopyUtil /store/data/foo/bar/input.root $(LOCALTOP)/tmp/"/>
<test name="TestDQMOfflineConfiguration${loop}" command="runtests.sh   ${step} ${loop}" loop="0,295,5">
  <flags PRE_TEST="GetTestDQMOfflineConfigurationFile"/>
</test>

and update cmsswSequenceInfo.py to read the local file from $(LOCALTOP)/tmp/input.root (or pass it via the command line). This way the input file is downloaded once, and every cmsRun command run by cmsswSequenceInfo.py will read a local file.
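A minimal sketch of that local-copy fallback; the helper name, the "file:" prefix handling, and the LOCALTOP lookup are assumptions, not the actual cmsswSequenceInfo.py code:

```python
import os

def resolve_input(lfn, local_dir=None):
    # Prefer a pre-fetched local copy (e.g. the one made by the
    # edmCopyUtil pre-test above); otherwise fall back to the /store
    # path so cmsRun resolves it via xrootd/EOS as before.
    local_dir = local_dir or os.path.join(os.environ.get("LOCALTOP", "."), "tmp")
    candidate = os.path.join(local_dir, os.path.basename(lfn))
    if os.path.exists(candidate):
        return "file:" + candidate
    return lfn
```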

@jfernan2
Contributor

I don't know if latency is an issue for PPC64LE, but if CPU is not the real difference I don't know what else to blame.

In the past this kind of test gave problems ONLY when EOS was having timeouts: since in this PR you have fragmented the tests even further, CPU should not be responsible... but this is just my guess; without a proper benchmark I cannot be sure.

On the other hand, from your log above, the same file is used 212 times in the test, so a first unique copy can indeed be a solution. Just tell me if you want me to implement it. Thanks

@smuzaffar
Contributor Author

Once the scram changes are merged, I will update the test to first make a local copy of the input file, and update cmsswSequenceInfo.py to use the local copy.
