Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make time report sane for threading #7599

Merged

Conversation

Dr15Jones
Copy link
Contributor

The tools for measuring CPU time actually measure CPU time for all cores being used by the process. Therefore recording such information per Event, Path, and Module makes no sense when multiple threads are running. We now only record CPU time for the event loop as a whole and report only wall clock time for all other measurements. We also now report both the wall clock time for the entire event loop but also the sum of the time for each all Streams. The former allows throughput to be measured while the later is useful for determining the average time it takes one Event to be processed.

CPU time measurement is for the application as a whole and therefore cannot be determined on an event by event basis. Therefore the event level CPU measurement has been dropped.
The average event time is now properly calculated so it is the average time it takes an event to processes on a stream.
We now calculate event throughput which measures how fast the entire job is processing events on average.
The edm::WallclockTimer is meant to replace edm::CPUTimer for the cases where measuring CPU time in a threaded application is not appropriate.
The tools for measuring CPU time actually measure CPU time for all cores being used by the process. Therefore recording such information per Event, Path, and Module makes no sense when multiple threads are running. We now only record CPU time for the event loop as a whole and report only wall clock time for all other measurements. We also now report both the wall clock time for the entire event loop but also the sum of the time for each all Streams. The former allows throughput to be measured while the later is useful for determining the average time it takes one Event to be processed.
@cmsbuild
Copy link
Contributor

cmsbuild commented Feb 6, 2015

A new Pull Request was created by @Dr15Jones (Chris Jones) for CMSSW_7_4_X.

Make time report sane for threading

It involves the following packages:

FWCore/Framework
FWCore/ServiceRegistry
FWCore/Services
FWCore/Utilities

@cmsbuild, @Dr15Jones, @ktf, @nclopezo can you please review it and eventually sign? Thanks.
@Martin-Grunewald, @wddgit, @wmtan this is something you requested to watch as well.
You can sign-off by replying to this message having '+1' in the first line of your reply.
You can reject by replying to this message having '-1' in the first line of your reply.
If you are a L2 or a release manager you can ask for tests by saying 'please test' in the first line of a comment.
@Degano you are the release manager for this.
You can merge this pull request by typing 'merge' in the first line of your comment.

@Dr15Jones
Copy link
Contributor Author

please test

@Dr15Jones
Copy link
Contributor Author

+1

@cmsbuild
Copy link
Contributor

cmsbuild commented Feb 6, 2015

The tests are being triggered in jenkins.

@cmsbuild
Copy link
Contributor

cmsbuild commented Feb 6, 2015

This pull request is fully signed and it will be integrated in one of the next CMSSW_7_4_X IBs unless changes or unless it breaks tests. This pull request requires discussion in the ORP meeting before it's merged. @davidlange6, @Degano, @ktf, @smuzaffar

@cmsbuild
Copy link
Contributor

cmsbuild commented Feb 6, 2015

This pull request is fully signed and it will be integrated in one of the next CMSSW_7_4_X IBs unless changes or unless it breaks tests. This pull request requires discussion in the ORP meeting before it's merged. @davidlange6, @Degano, @ktf, @smuzaffar

@cmsbuild
Copy link
Contributor

cmsbuild commented Feb 6, 2015

-1

Tested at: e07466f
I ran the usual tests and I found errors in the following unit tests:

---> test runtestPhysicsToolsPatAlgos had ERRORS
---> test testJetMETCorrectionsType1MET had ERRORS
---> test runtestTqafTopHitFit had ERRORS
---> test runtestTqafExamples had ERRORS
---> test runtestTqafTopKinFitter had ERRORS
---> test runtestTqafTopEventSelection had ERRORS
---> test runtestTqafTopEventProducers had ERRORS
---> test runtestTqafTopJetCombination had ERRORS

you can see the results of the tests here:
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-7599/2421/summary.html

@cmsbuild
Copy link
Contributor

cmsbuild commented Feb 6, 2015

This pull request is fully signed and it will be integrated in one of the next CMSSW_7_4_X IBs unless changes (but tests are reportedly failing). This pull request requires discussion in the ORP meeting before it's merged. @davidlange6, @Degano, @ktf, @smuzaffar

@cmsbuild
Copy link
Contributor

cmsbuild commented Feb 7, 2015

-1

Tested at: 5963526
I ran the usual tests and I found errors in the following unit tests:

---> test runtestTqafTopJetCombination had ERRORS
---> test testJetMETCorrectionsType1MET had ERRORS
---> test runtestPhysicsToolsPatAlgos had ERRORS
---> test runtestTqafTopEventSelection had ERRORS
---> test runtestTqafTopHitFit had ERRORS
---> test runtestTqafTopKinFitter had ERRORS
---> test runtestTqafTopEventProducers had ERRORS
---> test runtestTqafExamples had ERRORS

you can see the results of the tests here:
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-7599/2437/summary.html

davidlange6 added a commit that referenced this pull request Feb 7, 2015
@davidlange6 davidlange6 merged commit f67f7a2 into cms-sw:CMSSW_7_4_X Feb 7, 2015
@slava77
Copy link
Contributor

slava77 commented Feb 13, 2015

Chris, please put back the CPU timing for single-thread applications.

@cvuosalo
Copy link
Contributor

@Dr15Jones:
I agree with Slava. Please restore this functionality. We reco conveners depend on it.

@Dr15Jones
Copy link
Contributor Author

It would help if you could state what problem you are trying to solve with the use of CPU time.

@slava77
Copy link
Contributor

slava77 commented Feb 13, 2015

Hi Chris,

basic resource use monitoring by module is much more reliable with CPU time,
it is not significantly affected by database fetch latencies or
filesystem file open times.

To the first order, a wall-clock summary based on events excluding the
first one would do,
but it's not always covering and is not as clean as the CPU time.

On 2/13/15 9:08 AM, Chris Jones wrote:

It would help if you could state what problem you are trying to solve
with the use of CPU time.


Reply to this email directly or view it on GitHub
#7599 (comment).


Vyacheslav (Slava) Krutelyov
TAMU: Physics Dept Texas A&M MS4242, College Station, TX 77843-4242
CERN: 42-R-027
AIM/Skype: siava16 googleTalk: slava77@gmail.com
(630) 291-5128 Cell (US) +41 76 275 7116 Cell (CERN)


@Dr15Jones
Copy link
Contributor Author

The time reading from the edm::Source is actually already removed from the per module timing. This was a side effect of removing the time spent in 'getBy*' in order to not overcount when doing unscheduled execution.
As for DB latencies, the HLT folk get around this in their timing measurements by running EventSetupRecordDataGetter as the first module in their jobs. I've started to think about how we might exclude the cost of 'get' from the EventSetup but its harder than the event case since in the event we pass to the module the transient 'edm::Event' which is a wrapper around the actual class which manages the event data products.

@Dr15Jones Dr15Jones deleted the makeTimeReportSaneForThreading branch February 25, 2015 20:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants