
Decrease begin Run startup time for HLT #29492

Merged — 3 commits merged into cms-sw:master on Apr 17, 2020

Conversation

Dr15Jones
Contributor

PR description:

Running igprof on an example HLT configuration uncovered that the vast majority of the time spent in the begin Run transition went to the ShmStreamConsumer OutputModules each calculating the ParameterSetBlobs to be stored in the files. This pull request adds the option to run ParameterSetBlobProducer at begin Run to create the ParameterSetBlobs once and then have all the OutputModules use that information. This decreased the time spent making the blobs from 147s in the original job to 4s.

PR validation:

The test configuration I was using ran fine both with and without ParameterSetBlobProducer added.

Added the needed dictionaries.
When many output modules were used in the HLT job, the begin Run
transition was completely dominated by the calculation of the ParameterSetBlobs.
This adds an option to have the ParameterSetBlobs created once and
then shared by all the output modules.
@cmsbuild
Contributor

The code-checks are being triggered in jenkins.

@cmsbuild
Contributor

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-29492/14723

  • This PR adds an extra 40KB to the repository

  • There are other open Pull requests which might conflict with changes you have proposed:

@cmsbuild
Contributor

A new Pull Request was created by @Dr15Jones (Chris Jones) for master.

It involves the following packages:

DataFormats/Common
DataFormats/Provenance
EventFilter/Utilities
IOPool/Streamer

@perrotta, @smuzaffar, @Dr15Jones, @makortel, @emeschi, @cmsbuild, @slava77, @mommsen can you please review it and eventually sign? Thanks.
@makortel, @Martin-Grunewald, @rovere, @wddgit this is something you requested to watch as well.
@silviodonato, @dpiparo you are the release managers for this.

cms-bot commands are listed here

@Dr15Jones
Contributor Author

please test

@cmsbuild
Contributor

cmsbuild commented Apr 16, 2020

The tests are being triggered in jenkins.
https://cmssdt.cern.ch/jenkins/job/ib-run-pr-tests/5720/console Started: 2020/04/16 16:13

@Dr15Jones
Contributor Author

@fwyzard FYI I found this from the configuration you sent me.

@fwyzard
Contributor

fwyzard commented Apr 16, 2020

Thanks, looks like an interesting improvement!

@Dr15Jones
Contributor Author

In order to turn on the feature, I added the following to the test configuration I was using:

process.psetMap = cms.EDProducer("ParameterSetBlobProducer")
process.PhysicsMuonsOutput.associate(cms.Task(process.psetMap))

That is, I added the EDProducer via a Task to one of the EndPaths, which made it available to all of the OutputModules.

@mommsen
Contributor

mommsen commented Apr 16, 2020

This sounds like a great improvement. Thanks for looking into this.
@smorovic, @emeschi, this could be of interest for you, too.

@emeschi
Contributor

emeschi commented Apr 16, 2020 via email

@cmsbuild
Contributor

+1
Tested at: 2ea92c1
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-30d713/5720/summary.html
CMSSW: CMSSW_11_1_X_2020-04-16-1100
SCRAM_ARCH: slc7_amd64_gcc820

@cmsbuild
Contributor

Comparison job queued.

@smorovic
Contributor

@fwyzard thanks for the files.
We should install 11_1_0_pre6 on the Hilton and on the BU appliance that I use for tests as soon as it's available. It was being prepared this morning, so we can expect it soon.
I can run tests with scripts in the BU appliance. It is equivalent to the Hilton tests, so we can compare (though I expect the same conclusion). In the appliance test we can see the init period from the microstate plot. The event sample won't matter, since this all happens before the first event.

@makortel
Contributor

+core

@mommsen
Contributor

mommsen commented Apr 17, 2020

+1

@Martin-Grunewald
Contributor

@Dr15Jones
Since ConfDB is not yet task-aware, how should this new producer be included in a menu?
Adding it to each EndPath with an OutputModule?

@fwyzard
Contributor

fwyzard commented Apr 17, 2020

I would add it to the HLTriggerFirstPath, together with hltGetConditions and hltGetRaw .
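As a sketch, that suggestion could look like the fragment below. This is hypothetical: it reuses the psetMap label from the earlier example, assumes an existing cms.Process named process with an HLTriggerFirstPath (as in a full HLT menu dump), and uses the standard FWCore.ParameterSet.Config sequence API; the insert position within the path is arbitrary.

```python
import FWCore.ParameterSet.Config as cms

# Hypothetical fragment: assumes `process` is an existing cms.Process
# holding a full HLT menu with an HLTriggerFirstPath.
process.psetMap = cms.EDProducer("ParameterSetBlobProducer")

# Schedule the producer on the first HLT path, next to hltGetConditions
# and hltGetRaw.
process.HLTriggerFirstPath.insert(0, process.psetMap)
```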

@Dr15Jones
Contributor Author

I would add it to the HLTriggerFirstPath, together with hltGetConditions and hltGetRaw .

I think Andrea's suggestion is a good one. It actually doesn't matter much, since during Run and LuminosityBlock transitions modules are run in data-dependency order, not in strict Path order.

@perrotta
Contributor

+1

  • It adds the option to run ParameterSetBlobProducer at begin Run to create the ParameterSetBlobs once and then have all the OutputModules use that information. This may speed up the begin Run startup time for HLT. The option is not enabled by default.
  • Reco is only affected by the needed modifications (a new parameter added) in EventFilter/Utilities/EvFOutputModule
  • Jenkins tests pass

@cmsbuild
Contributor

This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @silviodonato, @dpiparo (and backports should be raised in the release meeting by the corresponding L2)

@fwyzard
Contributor

fwyzard commented Apr 17, 2020

On a job with 4 threads I see a reduction in the startup time (measured from the %MSG-i ThreadStreamSetup pre-events message to the Begin processing the 1st record message) from 106 ± 2 s to 62 ± 2 s.

The diff to the HLT configuration is attached: [diff.txt](https://github.com/cms-sw/cmssw/files/4493310/diff.txt)

@Dr15Jones do you think this is coherent with your results?
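A measurement like the one above can be scripted by taking the timestamp difference between the two log messages. The sketch below is a generic helper, not part of CMSSW; the timestamp format and sample log lines are assumptions and should be adjusted to the actual MessageLogger output.

```python
import re
from datetime import datetime

START_MARKER = "ThreadStreamSetup"            # pre-events message
END_MARKER = "Begin processing the 1st record"

def startup_seconds(log_lines):
    """Seconds between the first START_MARKER and first END_MARKER lines.

    Assumes each relevant line carries a 'dd-Mon-yyyy HH:MM:SS' timestamp;
    adjust the regex for the actual log format.
    """
    stamp = re.compile(r"(\d{2}-\w{3}-\d{4} \d{2}:\d{2}:\d{2})")
    times = {}
    for line in log_lines:
        for marker in (START_MARKER, END_MARKER):
            if marker in line and marker not in times:
                m = stamp.search(line)
                if m:
                    times[marker] = datetime.strptime(
                        m.group(1), "%d-%b-%Y %H:%M:%S")
    return (times[END_MARKER] - times[START_MARKER]).total_seconds()

# Fabricated example lines, for illustration only:
log = [
    "%MSG-i ThreadStreamSetup: 16-Apr-2020 16:13:05 CEST",
    "... conditions loading ...",
    "Begin processing the 1st record. 16-Apr-2020 16:14:47 CEST",
]
print(startup_seconds(log))  # → 102.0
```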

@Dr15Jones
Contributor Author

do you think this is coherent with your results ?

There are two things happening:

  1. When using the old code under multi-threading, several output modules could run concurrently at begin Run, each just doing the exact same work. So the biggest benefit happens in a 1-thread job (since everything is serialized and doing the same work over and over hurts the most). When using multiple threads, if the framework can find more work to do (based on data dependencies), then you can get a bigger 'win'.
  2. Conditions access can become the dominant part of the begin Run timing (I saw large variations in timing because of different response times from the frontier servers).

The work I'm trying to do for allowing EventSetup modules to run concurrently should also allow the framework to do better scheduling at begin Run.

@fwyzard
Contributor

fwyzard commented Apr 17, 2020

OK, I'll try again

  • with a single threaded job
  • with a multi-threaded job
  • with a single threaded job, without any actual events and conditions data
  • with a multi-threaded job, without any actual events and conditions data

@smorovic
Contributor

I get similar results with the BU-FU appliance.
With 12 FUs, running processes with 4 streams/threads (1 thread per hyperthread). I did two tries for each case.

116 s, 113 s (no blob producer module)
73 s, 78 s (module added)

Average improvement: 39 seconds.
Note: one HLT machine took longer than the others, but only in the "no blob producer" case. I ignored it for this result.

As far as I know, squid cache expiration is 30 seconds in HLT, and repeated attempts don't happen within that time frame.

@fwyzard
Contributor

fwyzard commented Apr 17, 2020

As far as I know, squid cache expiration is 30 seconds in HLT, and repeated attempts don't happen within that time frame.

Is that only for the data with a short lifetime (i.e. IOVs) or also for the actual immutable payloads?

@Dr15Jones
Contributor Author

Another thing to keep in mind: when accessing data from the EventSetup, the module takes a lock. The framework doesn't know about the lock, so in multi-threading it can schedule multiple modules that all want the EventSetup lock, and no progress gets made on some of the threads while they wait for it. (This is again why I'm working on running EventSetup modules concurrently, and why we are adding 'consumes' to modules for EventSetup products.) The TimeReport does say how much time was spent waiting on the EventSetup lock.
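The contention described above can be illustrated with a plain-Python sketch (this is not framework code; the lock simply stands in for the single EventSetup lock):

```python
import threading
import time

es_lock = threading.Lock()  # stand-in for the shared EventSetup lock

def module(results):
    # Each "module" needs EventSetup data, so it grabs the shared lock.
    # The scheduler doesn't know about the lock, so modules it runs
    # concurrently end up serialized here.
    with es_lock:
        time.sleep(0.05)  # simulated conditions access
        results.append(threading.get_ident())

results = []
threads = [threading.Thread(target=module, args=(results,)) for _ in range(4)]
start = time.monotonic()
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.monotonic() - start

# Despite 4 worker threads, wall time is close to 4 * 0.05 s because the
# lock serializes the simulated EventSetup access.
print(len(results), elapsed > 0.18)  # → 4 True
```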

@smorovic
Contributor

@fwyzard good question. It was discussed in the context of lumi-based conditions, and Dave Dykstra mentioned it, but did not specify whether there is a different setting for IOVs and payloads.

@fwyzard
Contributor

fwyzard commented Apr 17, 2020

For a single threaded job the time drops from 150 ± 20s (best: 138s) to 63 ± 1s (best: 62s), with a gain of -87 ± 20s (best: -76s).

For a multi threaded job the time drops from 107 ± 3s (best: 103s) to 62 ± 2s (best: 60s), with a gain of -45 ± 3s (best: -43s).

So, with the new approach the startup time is independent of the number of threads, while the original approach got some benefit from multi-threading (as expected, following the explanation by Chris).

In any case, the improvement is impressive :-)

@silviodonato
Contributor

+1
wow! Impressive improvement!

@cmsbuild cmsbuild merged commit 3548ccd into cms-sw:master Apr 17, 2020
@Dr15Jones Dr15Jones deleted the fastPSetStorage branch April 21, 2020 14:21