
Concurrent ExternalLHEProducer #28899

Merged
merged 2 commits into cms-sw:master on Feb 18, 2020

Conversation

@Dr15Jones (Contributor)

PR description:

Add the ability to use multiple TBB tasks to execute the script concurrently during the begin run phase.

PR validation:

The code compiles and the new unit tests succeed. In addition, I ran the code using an example job I had been given. If the new parameter is not used, the old code is used.

If requested, use multiple tbb::tasks where each task runs the
script. Each task is assigned its own directory to run the script.
We then read the generated LHE files sequentially during event
processing.
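The scheme in the commit message can be sketched in Python (an illustration only; the actual implementation in this PR is C++ using TBB tasks, and `run_script` here is a stand-in for the external LHE script):

```python
# Illustrative sketch of the concurrent-generation scheme: run N copies of
# a "script" concurrently, each in its own working directory (thread0,
# thread1, ...), then read the generated files back sequentially.
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

def run_script(workdir: str, task_index: int) -> str:
    """Stand-in for the external LHE script; writes one output file per task."""
    outfile = os.path.join(workdir, "cmsgrid_final.lhe")
    with open(outfile, "w") as f:
        f.write(f"<events from task {task_index}>\n")
    return outfile

def generate_concurrently(n_tasks: int) -> list[str]:
    base = tempfile.mkdtemp()
    # Each task is assigned its own directory: thread0/, thread1/, ...
    dirs = [os.path.join(base, f"thread{i}") for i in range(n_tasks)]
    for d in dirs:
        os.mkdir(d)
    # Run all tasks concurrently, one per directory.
    with ThreadPoolExecutor(max_workers=n_tasks) as pool:
        futures = [pool.submit(run_script, d, i) for i, d in enumerate(dirs)]
        outputs = [f.result() for f in futures]
    # Event processing then reads the generated files sequentially, in task order.
    contents = []
    for path in outputs:
        with open(path) as f:
            contents.append(f.read())
    return contents

print(generate_concurrently(4))
```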
@cmsbuild (Contributor) commented Feb 9, 2020

The code-checks are being triggered in jenkins.

@cmsbuild (Contributor) commented Feb 9, 2020

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-28899/13685

  • This PR adds an extra 20KB to the repository

@cmsbuild (Contributor) commented Feb 9, 2020

A new Pull Request was created by @Dr15Jones (Chris Jones) for master.

It involves the following packages:

GeneratorInterface/LHEInterface

@SiewYan, @efeyazgan, @mkirsano, @cmsbuild, @agrohsje, @alberto-sanchez, @qliphy can you please review it and eventually sign? Thanks.
@alberto-sanchez, @agrohsje, @mkirsano this is something you requested to watch as well.
@davidlange6, @silviodonato, @fabiocos you are the release manager for this.

cms-bot commands are listed here

@Dr15Jones (Contributor, Author)

please test

@cmsbuild (Contributor) commented Feb 9, 2020

The tests are being triggered in jenkins.
https://cmssdt.cern.ch/jenkins/job/ib-run-pr-tests/4562/console Started: 2020/02/09 18:48

@cmsbuild (Contributor) commented Feb 9, 2020

+1
Tested at: f59e40a
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-2fdaa4/4562/summary.html
CMSSW: CMSSW_11_1_X_2020-02-09-0000
SCRAM_ARCH: slc7_amd64_gcc820

@cmsbuild (Contributor) commented Feb 9, 2020

Comparison job queued.

@cmsbuild (Contributor) commented Feb 9, 2020

Comparison is ready
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-2fdaa4/4562/summary.html

Comparison Summary:

  • No significant changes to the logs found
  • Reco comparison results: 0 differences found in the comparisons
  • DQMHistoTests: Total files compared: 34
  • DQMHistoTests: Total histograms compared: 2698043
  • DQMHistoTests: Total failures: 1
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 2697696
  • DQMHistoTests: Total skipped: 346
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 33 files compared)
  • Checked 147 log files, 16 edm output root files, 34 DQM output files

@qliphy (Contributor) commented Feb 10, 2020

@Dr15Jones Thanks. Looks interesting! Can you give an example of how to test this? Will the LHE files produced in each thread be merged at the end? Current gridpacks (MadGraph, Powheg...) don't support multi-threading yet, so does this mean the gridpack needs to be copied and tarred/untarred multiple times?

@mseidel42 (Contributor)

Hi, I tried to produce a couple of Powheg ttbar events, and it seems to work :)

I am just wondering about the use of consecutive random seeds. These would overlap when consecutive job numbers are used as seeds. How is it done in production? Consecutive or "random" random seeds?

@qliphy Yes, the gridpack is unpacked and used multiple times in directories thread0, thread1, etc.

I am using this fragment:

process.externalLHEProducer = cms.EDProducer("ExternalLHEProducer",
    args = cms.vstring('/cvmfs/cms.cern.ch/phys_generator/gridpacks/slc6_amd64_gcc481/13TeV/powheg/V2/TT_hdamp_Tune4_NNPDF30/TT_hdamp_Tune4_NNPDF30_13TeV_powheg_hvq.tgz'),
    nEvents = cms.untracked.uint32(process.maxEvents.input.value()),
    numberOfParameters = cms.uint32(1),
    outputFile = cms.string('cmsgrid_final.lhe'),
    scriptName = cms.FileInPath('GeneratorInterface/LHEInterface/data/run_generic_tarball_cvmfs.sh'),
    generateConcurrently = cms.untracked.bool(True)
)

@Dr15Jones (Contributor, Author)

@intrepid42 already answered these but just for completeness.

Can you give an example of how to test this?

You can take any existing configuration and add generateConcurrently = cms.untracked.bool(True) to the ExternalLHEProducer configuration.
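For illustration, the change is a one-liner in an existing configuration (a sketch; `process.externalLHEProducer` is whatever name your configuration already uses for the module):

```python
# Enable concurrent generation on an existing ExternalLHEProducer.
# All other parameters of the module stay exactly as they were.
process.externalLHEProducer.generateConcurrently = cms.untracked.bool(True)
```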

Will the LHE files produced in each thread be merged at the end?

Yes

Current gridpacks (MadGraph, Powheg...) don't support multi-threading yet, so does this mean the gridpack needs to be copied and tarred/untarred multiple times?

Yes

Now, as for the random numbers:

I am just wondering about the use of consecutive random seeds. These would overlap when consecutive job numbers are used as seeds. How is it done in production? Consecutive or "random" random seeds?

The framework already uses consecutive random numbers to handle seed assignment for threading. The production system assigns a single random number seed to each module in a configuration. Then, for each stream in a job, that seed is modified by adding the stream index (a number from 0 to #streams - 1).
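The per-stream scheme just described, and the overlap concern raised above, can be sketched as follows (an illustration of the arithmetic, not the actual framework code):

```python
# Each module gets one configured seed; each stream offsets it by its
# stream index, so an N-stream job consumes N consecutive seeds.
def per_stream_seeds(module_seed: int, n_streams: int) -> list[int]:
    return [module_seed + stream for stream in range(n_streams)]

# A 4-stream job configured with seed 1000 uses seeds 1000..1003:
print(per_stream_seeds(1000, 4))  # -> [1000, 1001, 1002, 1003]

# Consecutive *job* seeds can therefore share stream seeds, e.g. jobs
# seeded 1000 and 1001 both use 1001..1003 -- the overlap concern above.
overlap = set(per_stream_seeds(1000, 4)) & set(per_stream_seeds(1001, 4))
print(sorted(overlap))  # -> [1001, 1002, 1003]
```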

@qliphy (Contributor) commented Feb 13, 2020

Thanks @Dr15Jones @intrepid42
I have done a test based on SMP-RunIISummer19UL17wmLHEGEN-00001
which is a complicated case, i.e. WJetsToLNu_TuneCP5_13TeV-madgraph261MLM-pythia8 (W+up to 4 jets).

(1).
cmsrel CMSSW_11_1_X_2020-02-12-2300
cd CMSSW_11_1_X_2020-02-12-2300/src
git cms-addpkg GeneratorInterface/LHEInterface
git cms-merge-topic Dr15Jones:concurrentExternalLHEProducer
scram b

curl -s --insecure https://cms-pdmv.cern.ch/mcm/public/restapi/requests/get_fragment/SMP-RunIISummer19UL17wmLHEGEN-00001 --retry 2 --create-dirs -o Configuration/GenProduction/python/SMP-RunIISummer19UL17wmLHEGEN-00001-fragment.py
scram b

cmsDriver.py Configuration/GenProduction/python/SMP-RunIISummer19UL17wmLHEGEN-00001-fragment.py --fileout file:SMP-RunIISummer19UL17wmLHEGEN-00001.root --mc --eventcontent RAWSIM,LHE --datatier GEN,LHE --conditions 106X_mc2017_realistic_v6 --beamspot Realistic25ns13TeVEarly2017Collision --step LHE,GEN --geometry DB:Extended --era Run2_2017 --python_filename SMP-RunIISummer19UL17wmLHEGEN-00001_1_cfg.py --no_exec --customise Configuration/DataProcessing/Utils.addMonitoring --customise_commands process.source.numberEventsInLuminosityBlock="cms.untracked.uint32(2222)" -n 200

(2) same as (1),
but add "generateConcurrently = cms.untracked.bool(True)"
and in addition
process.options.numberOfThreads=cms.untracked.uint32(4)
process.options.numberOfStreams=cms.untracked.uint32(0)
process.options.numberOfConcurrentLuminosityBlocks=cms.untracked.uint32(1)

I indeed see 4 threads running, each untarring and running its own gridpack.
However, from the output below, it seems the concurrent mode actually took longer to finish; in particular, the gridpack initialization was slower.

This is probably because this example uses a very complicated physics process. I will have a look at a simpler case.

The output of (1):
TimeReport> Time report complete in 1141.85 seconds
Time Summary:

  • Min event: 0.039495
  • Max event: 6.11634
  • Avg event: 1.53845
  • Total loop: 1128.07
  • Total init: 12.4372
  • Total job: 1141.85
  • EventSetup Lock: 5.00679e-06
  • EventSetup Get: 0.00770998
    Event Throughput: 0.177295 ev/s
    CPU Summary:
  • Total loop: 518.395
  • Total init: 4.91161
  • Total extra: 0
  • Total job: 523.881
    Processing Summary:
  • Number of Events: 200
  • Number of Global Begin Lumi Calls: 1
  • Number of Global Begin Run Calls: 1

The output of (2):
TimeReport> Time report complete in 1788.17 seconds
Time Summary:

  • Min event: 0.179609
  • Max event: 17.4009
  • Avg event: 6.54978
  • Total loop: 1778.47
  • Total init: 8.827
  • Total job: 1788.17
  • EventSetup Lock: 6.19888e-06
  • EventSetup Get: 0.022414
    Event Throughput: 0.112456 ev/s
    CPU Summary:
  • Total loop: 1383.01
  • Total init: 4.88403
  • Total extra: 0
  • Total job: 1388.48
    Processing Summary:
  • Number of Events: 200
  • Number of Global Begin Lumi Calls: 1
  • Number of Global Begin Run Calls: 1

@Dr15Jones (Contributor, Author)

Thanks for doing the study! If we look just at the single-threaded case, we see that the CPU efficiency is poor:

Wallclock Total loop: 1128.07
CPU Total loop: 518.395

 518.395/1128.07 = 0.46

I was able to replicate what you ran, and by watching the job with 'top' I could see that the generation step is incredibly CPU inefficient. It does lots of work with 'tar', and even when 'madevent' is running, that application is also very CPU inefficient. Given that the single-threaded job is so IO bound, it stands to reason that running 4 IO-bound jobs (which is what the change to ExternalLHEProducer replicates) would be slower, since the 4 applications are fighting each other for disk resources.

There is one thing to keep in mind: the workflow management system will not run just 1 single-threaded GEN job on a node, as a batch slot typically has 8 cores assigned to it. Instead, the workflow system is likely to assign 8 single-threaded GEN jobs to a node. Therefore a better test would be to compare the 4-threaded case to the time it takes to run 4 single-threaded jobs on the node.

As a test, I ran the configuration you created under 3 different conditions

  1. just the way it was created, using only 1 thread (only 1 job, not 4 of them)
  2. setting generateConcurrently to true and running with 4 threads
  3. including the Pythia8ConcurrentHadronizerFilter from my other pull request and running 4 threads

The results were as follows.

Single threaded
Wallclock Total loop: 3134.77
CPU Total loop: 2826.53
efficiency 2826.53/3134.77 = 0.90

4 threads with only ExternalLHEProducer
Wallclock Total loop: 3295.3
CPU Total loop: 5828.19
efficiency 5828.19/3295.3/4 = 0.44

4 threads with ExternalLHEProducer and Pythia8ConcurrentHadronizerFilter
Wallclock Total loop: 1699.53
CPU Total loop: 4559.94
efficiency 4559.94/1699.53/4 = 0.67

So using all the changes processed the events 3134.77/1699.53 = 1.8 times faster than the single-threaded case.
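The quoted efficiencies can be reproduced from the reported "Total loop" times; a small sketch of the arithmetic:

```python
# Recompute the efficiencies above from the reported wallclock and CPU
# "Total loop" times: efficiency = CPU time / wallclock / n_threads.
def cpu_efficiency(cpu: float, wall: float, n_threads: int = 1) -> float:
    return cpu / wall / n_threads

print(round(cpu_efficiency(2826.53, 3134.77), 2))    # single threaded -> 0.9
print(round(cpu_efficiency(5828.19, 3295.3, 4), 2))  # ExternalLHEProducer only -> 0.44
print(round(cpu_efficiency(4559.94, 1699.53, 4), 2)) # + concurrent hadronizer -> 0.67
print(round(3134.77 / 1699.53, 1))                   # overall speedup -> 1.8
```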

@Dr15Jones (Contributor, Author)

As a test, I ran 4 single-threaded jobs concurrently. The timings were:
Wallclock Total loop: 3610.61
CPU Total loop: 3006.08

Wallclock Total loop: 3737.87
CPU Total loop: 3186.13

Wallclock Total loop: 3663.9
CPU Total loop: 3130.87

Wallclock Total loop: 3780.75
CPU Total loop: 3182.42

So each job was much slower than the case of just 1 job. This did process 4x as many events as the previous test, but it does show that multiple IO-bound jobs interfere with one another.

@qliphy (Contributor) commented Feb 15, 2020

Thanks @Dr15Jones, that makes sense.

One question,

You got two single-threaded numbers:

(1)
Wallclock Total loop: 1128.07
CPU Total loop: 518.395
518.395/1128.07 = 0.46

(2)
Single threaded
Wallclock Total loop: 3134.77
CPU Total loop: 2826.53
efficiency 2826.53/3134.77 = 0.90

The efficiencies differ a lot. Why?

@Dr15Jones (Contributor, Author)

The efficiencies differ a lot. Why?

We each ran them on different machines, and it looks like one machine has a much faster CPU than the other. If the IO systems on the two machines are about the same, then the machine with the slower CPU would spend proportionally less of its time doing IO than the faster machine. That would mean the faster machine shows proportionally worse CPU utilization.
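A toy model of this argument (all numbers invented for illustration): if wallclock = CPU time + IO wait, and the IO wait is the same on both machines, the machine with the faster CPU shows the lower CPU efficiency.

```python
# Toy model: wallclock = CPU time + IO wait. With the same IO wait on
# both machines, the faster CPU spends a larger *fraction* of its
# wallclock waiting on IO, so its measured CPU efficiency is worse.
def efficiency(cpu_time: float, io_time: float) -> float:
    return cpu_time / (cpu_time + io_time)

io = 600.0                        # same IO wait on both machines (assumed)
slow_cpu = efficiency(2800.0, io) # slower CPU: more CPU seconds per event
fast_cpu = efficiency(500.0, io)  # faster CPU: same work in less CPU time
print(round(slow_cpu, 2), round(fast_cpu, 2))  # -> 0.82 0.45
```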

@Dr15Jones (Contributor, Author)

What further checks/code do you want in order to finish this pull request?

@qliphy (Contributor) commented Feb 17, 2020

This PR looks good to me. Probably we can check not only GEN-only but also GEN+SIM, but that can be done later.

@qliphy (Contributor) commented Feb 17, 2020

+1

@cmsbuild (Contributor)

This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @davidlange6, @silviodonato, @fabiocos (and backports should be raised in the release meeting by the corresponding L2)

@silviodonato (Contributor)

+1

@cmsbuild cmsbuild merged commit bd87a91 into cms-sw:master Feb 18, 2020
@Dr15Jones Dr15Jones deleted the concurrentExternalLHEProducer branch February 24, 2020 20:12
colizz added a commit to colizz/cmssw that referenced this pull request Jul 1, 2020
Backported from cms-sw#28899
Note: the macro CMSSW_SA_ALLOW is disabled.
colizz added a commit to colizz/cmssw that referenced this pull request Jul 1, 2020
Backported from cms-sw#28899
Note: a unit test using TestProcessor::testEndRun() is disabled.
@silviodonato (Contributor)

+1

@bendavid (Contributor) commented Jul 3, 2020

@silviodonato I guess you meant to +1 to #30481 instead?

cmsbuild added a commit that referenced this pull request Jul 3, 2020