Split DQM configuration tests into smaller sets to make them run faster #34186
Conversation
please test
+code-checks Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-34186/23400
A new Pull Request was created by @smuzaffar (Malik Shahzad Muzaffar) for master. It involves the following packages: DQMOffline/Configuration. @andrius-k, @kmaeshima, @ErnestaP, @ahmad3213, @jfernan2, @rvenditti can you please review it and eventually sign? Thanks. cms-bot commands are listed here
+1 Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-d97bc8/16111/summary.html Comparison Summary:
@smuzaffar Thanks! We also need to backport this to 11_3_X
+1
This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @silviodonato, @dpiparo, @qliphy (and backports should be raised in the release meeting by the corresponding L2)
+1
@jfernan2 , these tests still take a lot of time, e.g. for ppc64le they still time out after 1 hour. Is there any reason not to break these into sets of 10 or 5 (if the overhead of starting a test is not high then maybe 1 per test)? SCRAM is soon going to support ( cms-sw/cmsdist#7056 )
so we can use that to split these tests e.g.
which should run
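The splitting proposed above (sets of 10 or 5, so the sets can run in parallel within the timeout) can be sketched in Python. This is an illustrative sketch only, not the actual unit-test wiring: the `chunk` helper and the placeholder sequence names are assumptions, not code from this PR.

```python
# Hypothetical sketch: split a flat list of DQM test sequences into
# fixed-size sets so each set can run as a separate (parallel) test unit.

def chunk(items, size):
    """Yield successive lists of at most `size` elements from `items`."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

# Placeholder sequence names standing in for the real DQM sequences.
sequences = [f"sequence_{i}" for i in range(47)]

# e.g. sets of 10, as suggested above; each set would become one test target
test_sets = list(chunk(sequences, 10))
print(len(test_sets))      # 5 sets: four of 10 and one of 7
print(len(test_sets[-1]))  # 7
```

With 47 sequences and a set size of 10 this yields five sets, so five tests can run concurrently instead of one long serial test.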
Hi @smuzaffar
which is available on the CERN EOS cached area. Of course, if there is an issue with CERN EOS then yes we will have problems.
OK, by AAA I meant either xrootd access or fallback to eos, sorry for not being precise.
unfortunately the logs are gone now as I had restarted the tests with Anyway, I ran the tests locally and noticed that most of the time spent is not in
I think the
Thanks @smuzaffar For my understanding, which process does the non-cmsRun or non-cmsDriver time correspond to?
Yes, but non-cmsRun time is in practice what? Waiting time? Idle time? Doing what?
No idea @jfernan2, I thought that you should know (you are one of the developers of https://github.com/cms-sw/cmssw/blob/master/DQMOffline/Configuration/scripts/cmsswSequenceInfo.py :-) )
Yes, but everything in there is either cmsDriver, cmsRun or edmPluginHelp/edmConfigDump, i.e. a splitting of sequences into pieces to be run with cmsRun... no further magic inside the script.
python on ppc64le is built the same way as it is for amd64 ( https://github.com/cms-sw/cmsdist/blob/IB/CMSSW_12_0_X/master/python3.spec ). Power8 systems are big servers (128 cores) and are also shared by other users, so their performance is not the same as our dedicated
Ok thanks. cmsRun/cmsDriver time is almost a factor of two in your numbers above (55/30), so it seems no surprise that non-cmsRun is a factor of 1.5 (1000/660)
by the way, if large latency in serving the input files is an issue then we can try something like
and update the
I don't know if latency is an issue for PPC64LE, but if CPU is not the real difference I don't know what else to blame. In the past this kind of test gave problems ONLY when eos was having timeouts: if this PR has just fragmented the tests even more, CPU should not be the culprit... but this is just my guess; without a proper benchmark I cannot be sure. On the other hand, from your log above, the same file is being used 212 times in the test, so a first unique copy can indeed be a solution. Just tell me if you want me to implement it. Thanks
once the scram changes are merged I will update the test to first do a local copy of the input file and update the
Last night we updated the configuration to kill any test which runs over 1 hour and caught this test failing. Looks like these tests are taking over an hour to run, which is why a few of them failed.
I propose to split these tests into smaller sets (e.g. 20 each) so that they can run in parallel and finish within one hour.