
New fastsim geometry reb #20666

Merged
merged 23 commits into from Feb 9, 2018

Conversation

skurz
Contributor

@skurz commented Sep 27, 2017

The goal is to prepare fastsim for the upgrade of the pixel detector. A new, configurable interface for the geometry of the fastsim tracker was developed. To this end, it was also necessary to write a new algorithm for the propagation of particles inside the tracker.
The geometry of the detector used to be hard-coded; it can now be specified in a single Python file (FastSimulation/Geometry/python/TrackerMaterial_cfi.py).
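
For illustration, a layer definition in such a file could look roughly like the sketch below; the parameter names (radius, limits, thickness, z) are illustrative assumptions and not necessarily the exact interface of TrackerMaterial_cfi.py:

    import FWCore.ParameterSet.Config as cms

    # Hypothetical sketch of the configurable geometry: one barrel layer defined as a
    # cylinder around the z-axis, one forward layer defined as a plane perpendicular to z.
    TrackerMaterial = cms.PSet(
        BarrelLayers = cms.VPSet(
            cms.PSet(
                radius = cms.untracked.double(4.2),        # cylinder radius [cm]
                limits = cms.untracked.vdouble(0., 26.),   # |z| range covered by material [cm]
                thickness = cms.untracked.vdouble(0.024)   # material budget [radiation lengths]
            )
        ),
        ForwardLayers = cms.VPSet(
            cms.PSet(
                z = cms.untracked.double(34.5),            # plane position along z [cm]
                limits = cms.untracked.vdouble(6., 15.),   # radial range covered by material [cm]
                thickness = cms.untracked.vdouble(0.050)
            )
        )
    )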

Motivation for a new algorithm (particle propagation inside the tracker)

  • it is almost impossible to adjust the old algorithm to the new, configurable geometry (hard-coded links would have to be removed)
  • the old algorithm is flawed
  • the new algorithm allows material and instruments to be placed on any plane perpendicular to z, or on any cylinder with the z-axis as its axis
  • the old algorithm, with its requirement of nested layers, forbids the long, narrow cylinders required for a decent simulation of the material in front of HF

@cmsbuild changed the base branch from CMSSW_9_4_X to master September 27, 2017 11:19
@cmsbuild
Contributor

@skurz, CMSSW_9_4_X branch is closed for direct updates. cms-bot is going to move this PR to master branch.
In the future, please use the cmssw master branch to submit your changes.

@cmsbuild
Contributor

The code-checks are being triggered in jenkins.

@cmsbuild
Contributor

-code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/PR-20666/986

Code check has found code style and quality issues which could be resolved by applying a patch in https://cmssdt.cern.ch/SDT/code-checks/PR-20666/986/git-diff.patch
e.g. curl https://cmssdt.cern.ch/SDT/code-checks/PR-20666/986/git-diff.patch | patch -p1

You can run scram build code-checks to apply the code checks directly.

@kpedro88
Contributor

@skurz please apply the code checks as indicated for this PR to proceed

@cmsbuild
Contributor

The code-checks are being triggered in jenkins.

@cmsbuild
Contributor

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/PR-20666/1024

@cmsbuild
Contributor

A new Pull Request was created by @skurz for master.

It involves the following packages:

Configuration/Applications
FastSimulation/Calorimetry
FastSimulation/Configuration
FastSimulation/Event
FastSimulation/ForwardDetectors
FastSimulation/HighLevelTrigger
FastSimulation/MuonSimHitProducer
FastSimulation/SimplifiedGeometryPropagator
FastSimulation/Tracking
FastSimulation/TrackingRecHitProducer
FastSimulation/Validation
HLTriggerOffline/Btag
IOMC/RandomEngine
MuonAnalysis/MomentumScaleCalibration
MuonAnalysis/MuonAssociators
PhysicsTools/HepMCCandAlgos
PhysicsTools/PatAlgos
PhysicsTools/PatExamples
RecoJets/JetAssociationProducers
RecoLuminosity/LumiProducer
RecoParticleFlow/Configuration
RecoParticleFlow/PFProducer
RecoParticleFlow/PFSimProducer
RecoParticleFlow/PFTracking
SLHCUpgradeSimulations/Configuration
SimCalorimetry/HcalSimProducers
SimG4Core/CustomPhysics
SimGeneral/DataMixingModule
SimGeneral/MixingModule
SimGeneral/TrackingAnalysis
SimMuon/MCTruth
Validation/EcalDigis
Validation/EcalHits
Validation/EcalRecHits
Validation/HcalDigis
Validation/HcalHits
Validation/HcalRecHits
Validation/RecoEgamma
Validation/RecoParticleFlow
Validation/RecoTrack

The following packages do not have a category, yet:

FastSimulation/SimplifiedGeometryPropagator
Please create a PR for https://github.com/cms-sw/cms-bot/blob/master/categories.py to assign category

@perrotta, @smuzaffar, @civanch, @Dr15Jones, @vazzolini, @lveldere, @kmaeshima, @ssekmen, @kpedro88, @dmitrijus, @cmsbuild, @franzoni, @mdhildreth, @slava77, @vanbesien, @monttj, @davidlange6 can you please review it and eventually sign? Thanks.
@ghellwig, @TaiSakuma, @gouskos, @felicepantaleo, @schoef, @rappoccio, @sviret, @argiro, @Martin-Grunewald, @imarches, @ahinzmann, @mmusich, @seemasharmafnal, @mmarionncern, @JyothsnaKomaragiri, @makortel, @acaudron, @jhgoh, @lgray, @jdolen, @HuguesBrun, @ferencek, @trocino, @abbiendi, @GiacomoSguazzoni, @rovere, @VinInn, @jdamgov, @bellan, @nhanvtran, @gkasieczka, @wmtford, @ebrondol, @mtosi, @dgulhan, @battibass, @wddgit, @calderona, @mverzett, @cbernet, @gpetruc, @matt-komm, @mariadalfonso, @pvmulder, @bachtis this is something you requested to watch as well.
@davidlange6, @slava77 you are the release manager for this.

cms-bot commands are listed here

@Dr15Jones
Contributor

please test

@Dr15Jones
Contributor

@smuzaffar can we get stack traces for any of these jobs?

@skurz
Contributor Author

skurz commented Feb 13, 2018

I am also looking into this now. I can reproduce the increase in runtime and I'm doing some timing tests to see where it comes from. However, I can't reproduce the test sometimes getting stuck.

@skurz
Contributor Author

skurz commented Feb 13, 2018

Ok, I found the problem. The calorimetry is initialized for every single event, which obviously doesn't make sense. These lines should rather go into a beginRun(...) method:
https://github.com/skurz/cmssw/blob/0c0d5f3f770e17ee599cc08f12c14a2b1c886bda/FastSimulation/SimplifiedGeometryPropagator/plugins/FastSimProducer.cc#L197-L212
What do you suggest? How should we proceed?

@Dr15Jones
Contributor

You could also use an ESWatcher to call the update code only when the conditions change. You can find the ESWatcher in FWCore/Utilities.
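
For reference, a minimal sketch of that pattern (the module base class, record type and member names below are illustrative assumptions; only the edm::ESWatcher usage itself is the point):

    #include "FWCore/Framework/interface/ESWatcher.h"
    #include "FWCore/Framework/interface/Event.h"
    #include "FWCore/Framework/interface/EventSetup.h"
    #include "FWCore/Framework/interface/stream/EDProducer.h"
    #include "Geometry/Records/interface/CaloGeometryRecord.h"

    class FastSimProducer : public edm::stream::EDProducer<> {
    public:
      void produce(edm::Event&, const edm::EventSetup&) override;

    private:
      // check() returns true only when the watched record gets a new IOV.
      edm::ESWatcher<CaloGeometryRecord> caloGeometryWatcher_;
    };

    void FastSimProducer::produce(edm::Event& iEvent, const edm::EventSetup& iSetup) {
      if (caloGeometryWatcher_.check(iSetup)) {
        // re-read the calorimeter geometry and re-initialize the calorimetry here,
        // instead of doing it unconditionally for every event
      }
      // ... rest of the per-event processing ...
    }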

@skurz
Contributor Author

skurz commented Feb 13, 2018

Ok, so what do I have to do, as the PR has already been merged? Do I have to make a new PR?
And which do you think makes more sense: using an ESWatcher or adding a beginRun method?

@Dr15Jones
Contributor

You need a new pull request and ESWatcher is the better solution.

@civanch
Contributor

civanch commented Feb 13, 2018

I would think that, if ESWatcher was not used in FastSim before, FastSim was run with fixed conditions. So I may be wrong, but beginRun corresponds to what was done before.

@skurz
Contributor Author

skurz commented Feb 13, 2018

Yes, beginRun was used before.

@Dr15Jones
Contributor

In general, it is never good to assume exactly when IOVs change. The ESWatcher always does the right thing. In this case, if we did run-dependent MC using this module and you were updating in beginRun, each run would, probably unnecessarily, update the geometry. The ESWatcher guarantees that the update happens only if needed.

@skurz
Contributor Author

skurz commented Feb 13, 2018

I am going to submit the PR as soon as I am sure there are no other flaws that affect the runtime.

{
// Should not be reached: Full Propagation does always have a solution "if(crosses(layer)) == -1"
// Even if particle is outside all layers -> can turn around in magnetic field
throw cms::Exception("FastSimulation") << "HelixTrajectory: should not be reached (no solution).";

Contributor

@skurz, since you are checking the effect of this already merged PR: perhaps the following failure
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-22152/26029/addOnTests/fastsim2/cmsDriver.py_TTbar_13TeV_TuneCUETP8M1_cfi_--conditions_auto:run2_mc_--fast__-n_100_--eventcontent_AODSIM,DQM_--relval_100000,1000_-s_GEN,SIM,RECOBEFMI.log
observed in the addOn tests of the apparently unrelated pull request #22152 is also worth a look

Contributor Author

Yes, this is definitely related to my PR, but it is still really strange. I'll try to find out what is going on.

@skurz
Contributor Author

skurz commented Feb 13, 2018

So it seems the runtime of the new FastSim propagator is fixed by the ESWatcher. However, there is something else going on that I don't understand. I used the fastsim1 config from the addOnTests to produce 100 events. The first events are processed really quickly (<10 s per event), and then it starts to slow down more and more (~1 min per event); at the same time, CPU consumption rises. I have no idea where this comes from, or how it could be related to my PR.

Here are some lines from the TimeReport:
TimeReport ---------- Path Summary ---[Real sec]----
TimeReport    per event   per exec    Name
TimeReport    32.375547   32.375547   generation_step
TimeReport     2.101804    2.101804   simulation_step
TimeReport     2.087158    2.087158   reconstruction_befmix_step
TimeReport    34.106832   34.106832   digitisation_step
TimeReport     3.613188    3.613188   L1simulation_step
TimeReport     3.615137    3.615137   digi2raw_step
TimeReport     2.825613    2.825613   L1Reco_step
TimeReport    35.093805   35.093805   reconstruction_step
TimeReport    25.964805   25.964805   eventinterpretaion_step
TimeReport    25.967529   25.967529   prevalidation_step

But looking e.g. at the details of the generation step, I don't understand why it takes about 32 s per event:
TimeReport ---------- Modules in Path: generation_step ---[Real sec]----
TimeReport    per event   per visit   Name
TimeReport     0.000000    0.000000   generator
TimeReport     0.000065    0.000065   randomEngineStateProducer
TimeReport     0.001556    0.001556   VtxSmeared
TimeReport     0.000025    0.000025   generatorSmeared
TimeReport     0.002150    0.002150   genParticles
TimeReport     0.000033    0.000033   genParticlesForJets
TimeReport     0.000025    0.000025   genParticlesForJetsNoNu
TimeReport     0.000071    0.000071   ak4GenJets
TimeReport     0.000064    0.000064   ak8GenJets
TimeReport     0.000049    0.000049   ak4GenJetsNoNu
TimeReport     0.000059    0.000059   ak8GenJetsNoNu
TimeReport     0.000025    0.000025   genCandidatesForMET
TimeReport     0.000025    0.000025   genParticlesForMETAllVisible
TimeReport     0.000008    0.000008   genMetCalo
TimeReport     0.000006    0.000006   genMetTrue

@kpedro88
Contributor

@skurz you can try to use igprof to get more information on the CPU usage (see e.g. https://github.com/kpedro88/utilities/blob/master/runIgprof.sh for an example of how to run it).
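
For reference, a typical CPU-profile invocation looks roughly like this (the config file name is a placeholder):

    igprof -d -pp -z -o igprof.pp.gz cmsRun fastsim1_cfg.py
    igprof-analyse -d -v -g igprof.pp.gz > igprof-report.txt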

The slowdown over time makes me suspect there could be a memory leak or failure to deallocate. You can try to watch the memory usage over time as it slows down. You can also run valgrind:

valgrind --tool=memcheck --leak-check=yes --show-reachable=yes --num-callers=20 --suppressions=$ROOTSYS/etc/valgrind-root.supp --track-origins=yes cmsRun ...

Or run igprof in memory-profile mode; this requires adding a service to the config:

    process.IgProfService = cms.Service("IgProfService",
        reportEventInterval = cms.untracked.int32(1),
        reportFirstEvent = cms.untracked.int32(1),
        reportToFileAtPostEndJob = cms.untracked.string('| gzip -c > test___memory___%I_EndOfJob.gz'),
        reportToFileAtPostEvent = cms.untracked.string('| gzip -c > test___memory___%I.gz')
    )

@Dr15Jones
Contributor

@skurz could you run the job in the debugger and, when it starts to slow down, do a Ctrl-C and then a thread apply all backtrace to get the stack traces for all the threads? You might need to try several times to see if we are 'stuck' in any particular thread.
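
For reference, one way to do that with gdb (the config file name is a placeholder):

    gdb cmsRun
    (gdb) run fastsim1_cfg.py
    ... once the job starts to slow down, interrupt it with Ctrl-C, then:
    (gdb) thread apply all backtrace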

@Dr15Jones
Contributor

@skurz you could also run top in another shell on the same machine and see if the memory is steadily increasing.

@skurz
Contributor Author

skurz commented Feb 13, 2018

Ok, I am going through your list.
valgrind apparently did not find anything.
top showed me that memory consumption stays at around 8.5% (maybe increasing very slowly) of the 4 GB reserved. CPU consumption is always at 400% (4 threads).

@kpedro88
Contributor

@skurz valgrind usually takes several hours to run, so I'm surprised you could have results already...

@Dr15Jones maybe the stall monitor could provide some insight here?
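
For completeness, a sketch of enabling the stall monitor in the configuration (the output file name is a placeholder; check the StallMonitor service documentation for the full set of parameters):

    process.add_(cms.Service("StallMonitor",
        fileName = cms.untracked.string("stallMonitor.log")
    ))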

@skurz
Contributor Author

skurz commented Feb 13, 2018

ok, you're right. I did something wrong. It's running now.

@skurz
Contributor Author

skurz commented Feb 13, 2018

Another update for today.
valgrind is still running: it has shown tons of error messages so far, but nothing is related to my PR as far as I can tell. Let's wait for the summary.
@Dr15Jones I ran the debugger a couple of times and did the backtrace. I attached a representative trace here: most of the time at least 2 or 3 threads are occupied with CaloSubdetectorGeometry::present(DetId const&). Does this help? Is this what you were looking for?

backtrace.txt

@Dr15Jones
Contributor

@skurz There doesn't seem to be anything mysterious about that function:
https://github.com/cms-sw/cmssw/blob/225ce8ab3efc5171be8f17491f9448d040ccc698/Geometry/CaloGeometry/src/CaloSubdetectorGeometry.cc

Maybe the mystery is why the ParticleFlow code is calling it so often?

@Dr15Jones
Contributor

@skurz can you post the complete timing report somewhere? Was this done using 1 thread? With the multi-threaded framework, the path times are now a lie. All Paths actually start simultaneously (even in single-threaded mode), and the timer for a Path doesn't stop until its last module runs. Because of contention between 'legacy' and 'one' modules, modules from different Paths can get intermingled in the processing. If you want to see this for yourself, just add
process.add_(cms.Service("Tracer")) to your job.

@skurz
Contributor Author

skurz commented Feb 13, 2018

Yeah, I've already seen that there is nothing special about CaloSubdetectorGeometry. The timing was done using 4 threads. How can I make sure that the correct timing is calculated? Is there a way to make sure that paths are not run simultaneously (like a change in the config file)? Anyway, here is the timing report: TimeReport.txt

@skurz
Contributor Author

skurz commented Feb 13, 2018

I finally found the problem. I made a stupid mistake when "fixing" the old issue. I am going to make a new PR soon. Thank you very much for your help!

@Dr15Jones
Contributor

@skurz It looks to me like your time is being spent in the ParticleFlow calculation, with many of those modules being 'unscheduled'. E.g.

TimeReport   1.492296     1.492296     1.492296  particleFlowClusterECAL
TimeReport   1.489819     1.489819     1.489819  particleFlowClusterOOTECAL
TimeReport   5.707759     5.707759     5.707759  particleFlowRecHitECAL
TimeReport   5.712172     5.712172     5.712172  particleFlowRecHitOOTECAL
TimeReport   1.811973     1.811973     1.811973  particleFlowSuperClusterECAL
TimeReport   1.804347     1.804347     1.804347  particleFlowSuperClusterOOTECAL

I think I understand why the generation_step Path takes so long. All Paths start their timers at the beginning of event processing (since the first module in every Path gets scheduled to 'be runnable' at that moment). The timer for a Path stops when the final module in the Path has finished running. Now, the time at which a module runs is actually up to the framework. The framework only guarantees that a module will run IF it is needed by another module or IF it is on a Path and there is no other module which can be run. So my theory is that, since generator is on several Paths, the other modules on generation_step are not needed by any other module except maybe the OutputModule. That means the framework is free to run them at the end, if that happens to be what works best at the time.

@skurz mentioned this pull request Feb 13, 2018
@skurz
Contributor Author

skurz commented Feb 13, 2018

Here is the new PR: #22202
