
Concurrent ExternalLHEProducer #28899

Merged
merged 2 commits into cms-sw:master on Feb 18, 2020

Conversation

@Dr15Jones (Contributor)

PR description:

Add the ability to use multiple TBB tasks to execute the script concurrently during the begin run phase.

PR validation:

The code compiles and the new unit tests succeed. In addition, I ran the code using an example job I had been given. If the new parameter is not used, the old code is used.

If requested, use multiple tbb::tasks where each task runs the
script. Each task is assigned its own directory to run the script.
We then read the generated LHE files sequentially during event
processing.
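The scheme in the commit message can be sketched in Python (an illustration only; the actual implementation in this PR is C++ using TBB tasks, and `run_script` here is a stand-in for the external LHE script):

```python
# Illustrative sketch of the concurrent-generation scheme: run N copies of
# a "script" concurrently, each in its own working directory (thread0,
# thread1, ...), then read the generated files back sequentially.
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

def run_script(workdir: str, task_index: int) -> str:
    """Stand-in for the external LHE script; writes one output file per task."""
    outfile = os.path.join(workdir, "cmsgrid_final.lhe")
    with open(outfile, "w") as f:
        f.write(f"<events from task {task_index}>\n")
    return outfile

def generate_concurrently(n_tasks: int) -> list[str]:
    base = tempfile.mkdtemp()
    # Each task is assigned its own directory: thread0/, thread1/, ...
    dirs = [os.path.join(base, f"thread{i}") for i in range(n_tasks)]
    for d in dirs:
        os.mkdir(d)
    # Run all tasks concurrently, one per directory.
    with ThreadPoolExecutor(max_workers=n_tasks) as pool:
        futures = [pool.submit(run_script, d, i) for i, d in enumerate(dirs)]
        outputs = [f.result() for f in futures]
    # Event processing then reads the generated files sequentially, in task order.
    contents = []
    for path in outputs:
        with open(path) as f:
            contents.append(f.read())
    return contents

print(generate_concurrently(4))
```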
@cmsbuild (Contributor) commented Feb 9, 2020

The code-checks are being triggered in jenkins.

@cmsbuild (Contributor) commented Feb 9, 2020

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-28899/13685

  • This PR adds an extra 20KB to the repository

@cmsbuild (Contributor) commented Feb 9, 2020

A new Pull Request was created by @Dr15Jones (Chris Jones) for master.

It involves the following packages:

GeneratorInterface/LHEInterface

@SiewYan, @efeyazgan, @mkirsano, @cmsbuild, @agrohsje, @alberto-sanchez, @qliphy can you please review it and eventually sign? Thanks.
@alberto-sanchez, @agrohsje, @mkirsano this is something you requested to watch as well.
@davidlange6, @silviodonato, @fabiocos you are the release manager for this.

cms-bot commands are listed here

@Dr15Jones (Contributor, Author)

please test

@cmsbuild (Contributor) commented Feb 9, 2020

The tests are being triggered in jenkins.
https://cmssdt.cern.ch/jenkins/job/ib-run-pr-tests/4562/console Started: 2020/02/09 18:48

@cmsbuild (Contributor) commented Feb 9, 2020

+1
Tested at: f59e40a
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-2fdaa4/4562/summary.html
CMSSW: CMSSW_11_1_X_2020-02-09-0000
SCRAM_ARCH: slc7_amd64_gcc820

@cmsbuild (Contributor) commented Feb 9, 2020

Comparison job queued.

@cmsbuild (Contributor) commented Feb 9, 2020

Comparison is ready
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-2fdaa4/4562/summary.html

Comparison Summary:

  • No significant changes to the logs found
  • Reco comparison results: 0 differences found in the comparisons
  • DQMHistoTests: Total files compared: 34
  • DQMHistoTests: Total histograms compared: 2698043
  • DQMHistoTests: Total failures: 1
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 2697696
  • DQMHistoTests: Total skipped: 346
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 33 files compared)
  • Checked 147 log files, 16 edm output root files, 34 DQM output files

@qliphy (Contributor) commented Feb 10, 2020

@Dr15Jones Thanks. Looks interesting! Can you give an example of how to test this? Will the LHE files produced in each thread be merged at the end? Current gridpacks (MadGraph, Powheg...) don't support multi-threading yet, so does this mean the gridpack needs to be copied and tarred/untarred multiple times?

@mseidel42 (Contributor)

Hi, I tried to produce a couple of Powheg ttbar events, and it seems to work :)

I am just wondering about the use of consecutive random seeds. These would overlap when consecutive job numbers are used as seeds. How is it done in production? Consecutive or "random" random seeds?

@qliphy Yes, the gridpack is unpacked and used multiple times in directories thread0, thread1, etc.

I am using this fragment:

process.externalLHEProducer = cms.EDProducer("ExternalLHEProducer",
    args = cms.vstring('/cvmfs/cms.cern.ch/phys_generator/gridpacks/slc6_amd64_gcc481/13TeV/powheg/V2/TT_hdamp_Tune4_NNPDF30/TT_hdamp_Tune4_NNPDF30_13TeV_powheg_hvq.tgz'),
    nEvents = cms.untracked.uint32(process.maxEvents.input.value()),
    numberOfParameters = cms.uint32(1),
    outputFile = cms.string('cmsgrid_final.lhe'),
    scriptName = cms.FileInPath('GeneratorInterface/LHEInterface/data/run_generic_tarball_cvmfs.sh'),
    generateConcurrently = cms.untracked.bool(True)
)

@Dr15Jones (Contributor, Author)

@intrepid42 already answered these but just for completeness.

Can you give an example of how to test this?

You can take any existing configuration and add generateConcurrently = cms.untracked.bool(True) to the ExternalLHEProducer configuration.
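For illustration, the change is a one-liner in an existing configuration (a sketch; `process.externalLHEProducer` is whatever name your configuration already uses for the module):

```python
# Enable concurrent generation on an existing ExternalLHEProducer.
# All other parameters of the module stay exactly as they were.
process.externalLHEProducer.generateConcurrently = cms.untracked.bool(True)
```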

Will the LHE files produced in each thread be merged at the end?

Yes

Current gridpacks (MadGraph, Powheg...) don't support multi-threading yet, so does this mean the gridpack needs to be copied and tarred/untarred multiple times?

Yes

Now, as for the random numbers:

I am just wondering about the use of consecutive random seeds. These would overlap when consecutive job numbers are used as seeds. How is it done in production? Consecutive or "random" random seeds?

The framework already uses consecutive random numbers to handle seed assignment for threading. The production system assigns a single random number seed to each module in a configuration. Then, for each stream in a job, that seed is modified by adding the stream index (a number from 0 to #streams - 1).
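The per-stream scheme just described, and the overlap concern raised above, can be sketched as follows (an illustration of the arithmetic, not the actual framework code):

```python
# Each module gets one configured seed; each stream offsets it by its
# stream index, so an N-stream job consumes N consecutive seeds.
def per_stream_seeds(module_seed: int, n_streams: int) -> list[int]:
    return [module_seed + stream for stream in range(n_streams)]

# A 4-stream job configured with seed 1000 uses seeds 1000..1003:
print(per_stream_seeds(1000, 4))  # -> [1000, 1001, 1002, 1003]

# Consecutive *job* seeds can therefore share stream seeds, e.g. jobs
# seeded 1000 and 1001 both use 1001..1003 -- the overlap concern above.
overlap = set(per_stream_seeds(1000, 4)) & set(per_stream_seeds(1001, 4))
print(sorted(overlap))  # -> [1001, 1002, 1003]
```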

@qliphy (Contributor) commented Feb 13, 2020

Thanks @Dr15Jones @intrepid42
I have done a test based on SMP-RunIISummer19UL17wmLHEGEN-00001
which is a complicated case, i.e. WJetsToLNu_TuneCP5_13TeV-madgraph261MLM-pythia8 (W+up to 4 jets).

(1).
cmsrel CMSSW_11_1_X_2020-02-12-2300
cd CMSSW_11_1_X_2020-02-12-2300/src
git cms-addpkg GeneratorInterface/LHEInterface
git cms-merge-topic Dr15Jones:concurrentExternalLHEProducer
scram b

curl -s --insecure https://cms-pdmv.cern.ch/mcm/public/restapi/requests/get_fragment/SMP-RunIISummer19UL17wmLHEGEN-00001 --retry 2 --create-dirs -o Configuration/GenProduction/python/SMP-RunIISummer19UL17wmLHEGEN-00001-fragment.py
scram b

cmsDriver.py Configuration/GenProduction/python/SMP-RunIISummer19UL17wmLHEGEN-00001-fragment.py --fileout file:SMP-RunIISummer19UL17wmLHEGEN-00001.root --mc --eventcontent RAWSIM,LHE --datatier GEN,LHE --conditions 106X_mc2017_realistic_v6 --beamspot Realistic25ns13TeVEarly2017Collision --step LHE,GEN --geometry DB:Extended --era Run2_2017 --python_filename SMP-RunIISummer19UL17wmLHEGEN-00001_1_cfg.py --no_exec --customise Configuration/DataProcessing/Utils.addMonitoring --customise_commands process.source.numberEventsInLuminosityBlock="cms.untracked.uint32(2222)" -n 200

(2) same as (1),
but add "generateConcurrently = cms.untracked.bool(True)"
and in addition
process.options.numberOfThreads=cms.untracked.uint32(4)
process.options.numberOfStreams=cms.untracked.uint32(0)
process.options.numberOfConcurrentLuminosityBlocks=cms.untracked.uint32(1)

I indeed see 4 threads running, each untarring and running its own gridpack.
However, from the output below, it seems the concurrent mode actually took longer to finish; in particular, the gridpack initialization was slower.

This is probably because this example uses a very complicated physics process. I will have a look at a simpler case.

The output of (1):
TimeReport> Time report complete in 1141.85 seconds
Time Summary:

  • Min event: 0.039495
  • Max event: 6.11634
  • Avg event: 1.53845
  • Total loop: 1128.07
  • Total init: 12.4372
  • Total job: 1141.85
  • EventSetup Lock: 5.00679e-06
  • EventSetup Get: 0.00770998
    Event Throughput: 0.177295 ev/s
    CPU Summary:
  • Total loop: 518.395
  • Total init: 4.91161
  • Total extra: 0
  • Total job: 523.881
    Processing Summary:
  • Number of Events: 200
  • Number of Global Begin Lumi Calls: 1
  • Number of Global Begin Run Calls: 1

The output of (2):
TimeReport> Time report complete in 1788.17 seconds
Time Summary:

  • Min event: 0.179609
  • Max event: 17.4009
  • Avg event: 6.54978
  • Total loop: 1778.47
  • Total init: 8.827
  • Total job: 1788.17
  • EventSetup Lock: 6.19888e-06
  • EventSetup Get: 0.022414
    Event Throughput: 0.112456 ev/s
    CPU Summary:
  • Total loop: 1383.01
  • Total init: 4.88403
  • Total extra: 0
  • Total job: 1388.48
    Processing Summary:
  • Number of Events: 200
  • Number of Global Begin Lumi Calls: 1
  • Number of Global Begin Run Calls: 1

@Dr15Jones (Contributor, Author)

Thanks for doing the study! If we look just at the single-threaded case, we see that the CPU efficiency is poor:

Wallclock Total loop: 1128.07
CPU Total loop: 518.395

 518.395/1128.07 = 0.46

I was able to replicate what you ran, and by watching the job with 'top' I could see that the generation step is incredibly CPU inefficient. It does lots of work with 'tar', and even when 'madevent' is running, that application is also very CPU inefficient. Given that the single-threaded job is so IO bound, it stands to reason that running 4 IO-bound jobs (which is what the change to ExternalLHEProducer replicates) would be slower, since the 4 applications are fighting each other for disk resources.

There is one thing to keep in mind: the workflow management system will not run just 1 single-threaded GEN job on a node, as a batch slot typically has 8 cores assigned to it. Instead, the workflow system is likely to assign 8 single-threaded GEN jobs to a node. Therefore a better test would be to compare the 4-threaded case to the time it takes to run 4 single-threaded jobs on the node.

As a test, I ran the configuration you created under 3 different conditions

  1. just the way it was created, using only 1 thread (only 1 job, not 4 of them)
  2. setting generateConcurrently to true and running with 4 threads
  3. including the Pythia8ConcurrentHadronizerFilter from my other pull request and running 4 threads

The results were as follows.

Single threaded
Wallclock Total loop: 3134.77
CPU Total loop: 2826.53
efficiency 2826.53/3134.77 = 0.90

4 threads with only ExternalLHEProducer
Wallclock Total loop: 3295.3
CPU Total loop: 5828.19
efficiency 5828.19/3295.3/4 = 0.44

4 threads with ExternalLHEProducer and Pythia8ConcurrentHadronizerFilter
Wallclock Total loop: 1699.53
CPU Total loop: 4559.94
efficiency 4559.94/1699.53/4 = 0.67

So using all the changes processed the events 3134.77/1699.53 = 1.8 times faster than the single-threaded case.
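The quoted efficiencies can be reproduced from the reported "Total loop" times; a small sketch of the arithmetic:

```python
# Recompute the efficiencies above from the reported wallclock and CPU
# "Total loop" times: efficiency = CPU time / wallclock / n_threads.
def cpu_efficiency(cpu: float, wall: float, n_threads: int = 1) -> float:
    return cpu / wall / n_threads

print(round(cpu_efficiency(2826.53, 3134.77), 2))    # single threaded -> 0.9
print(round(cpu_efficiency(5828.19, 3295.3, 4), 2))  # ExternalLHEProducer only -> 0.44
print(round(cpu_efficiency(4559.94, 1699.53, 4), 2)) # + concurrent hadronizer -> 0.67
print(round(3134.77 / 1699.53, 1))                   # overall speedup -> 1.8
```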

@Dr15Jones (Contributor, Author)

As a test, I ran 4 single-threaded jobs concurrently. The timings were:
Wallclock Total loop: 3610.61
CPU Total loop: 3006.08

Wallclock Total loop: 3737.87
CPU Total loop: 3186.13

Wallclock Total loop: 3663.9
CPU Total loop: 3130.87

Wallclock Total loop: 3780.75
CPU Total loop: 3182.42

So each job was much slower than the case of just 1 job. This did process 4x as many events as the previous test, but it does show that multiple IO-bound jobs interfere with one another.

@qliphy (Contributor) commented Feb 15, 2020

Thanks @Dr15Jones, that makes sense.

One question,

You got two single-threaded numbers:

(1)
Wallclock Total loop: 1128.07
CPU Total loop: 518.395
518.395/1128.07 = 0.46

(2)
Single threaded
Wallclock Total loop: 3134.77
CPU Total loop: 2826.53
efficiency 2826.53/3134.77 = 0.90

The efficiencies differ a lot. Why?

@Dr15Jones (Contributor, Author)

The efficiencies differ a lot. Why?

We each ran them on different machines, and it looks like one machine has a much faster CPU than the other. If the IO systems on the two machines are about the same, then the machine with the slower CPU would spend proportionally less of its time doing IO than the faster machine. That would mean the faster machine shows proportionally worse CPU utilization.
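A toy model of this argument (all numbers invented for illustration): if wallclock = CPU time + IO wait, and the IO wait is the same on both machines, the machine with the faster CPU shows the lower CPU efficiency.

```python
# Toy model: wallclock = CPU time + IO wait. With the same IO wait on
# both machines, the faster CPU spends a larger *fraction* of its
# wallclock waiting on IO, so its measured CPU efficiency is worse.
def efficiency(cpu_time: float, io_time: float) -> float:
    return cpu_time / (cpu_time + io_time)

io = 600.0                        # same IO wait on both machines (assumed)
slow_cpu = efficiency(2800.0, io) # slower CPU: more CPU seconds per event
fast_cpu = efficiency(500.0, io)  # faster CPU: same work in less CPU time
print(round(slow_cpu, 2), round(fast_cpu, 2))  # -> 0.82 0.45
```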

@Dr15Jones (Contributor, Author)

What further checks/code do you want in order to finish this pull request?

@qliphy (Contributor) commented Feb 17, 2020

This PR looks good to me. Probably we can check not only GEN-only but also GEN+SIM, but that can be done later.

@qliphy (Contributor) commented Feb 17, 2020

+1

@cmsbuild (Contributor)

This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @davidlange6, @silviodonato, @fabiocos (and backports should be raised in the release meeting by the corresponding L2)

@silviodonato (Contributor)

+1

@cmsbuild cmsbuild merged commit bd87a91 into cms-sw:master Feb 18, 2020
@Dr15Jones Dr15Jones deleted the concurrentExternalLHEProducer branch February 24, 2020 20:12
colizz added a commit to colizz/cmssw that referenced this pull request Jul 1, 2020
Backported from cms-sw#28899
Note: the macro CMSSW_SA_ALLOW is disabled.
colizz added a commit to colizz/cmssw that referenced this pull request Jul 1, 2020
Backported from cms-sw#28899
Note: a unit test using TestProcessor::testEndRun() is disabled.
@silviodonato (Contributor)

+1

@bendavid (Contributor) commented Jul 3, 2020

@silviodonato I guess you meant to +1 to #30481 instead?

cmsbuild added a commit that referenced this pull request Jul 3, 2020