
New fastsim geometry reb #20666

Merged
merged 23 commits into from Feb 9, 2018

Conversation

skurz
Contributor

@skurz commented Sep 27, 2017

The goal is to prepare fastsim for the upgrade of the pixel detector. A new, configurable interface for the geometry of the fastsim tracker was developed. To this end, it was also necessary to write a new algorithm for the propagation of particles inside the tracker.
The geometry of the detector used to be hard-coded; it can now be specified in a single Python file (FastSimulation/Geometry/python/TrackerMaterial_cfi.py).
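
For illustration, a layer definition in such a file could look roughly like the sketch below; the parameter names (radius, limits, thickness, z) are illustrative assumptions and not necessarily the exact interface of TrackerMaterial_cfi.py:

    import FWCore.ParameterSet.Config as cms

    # Hypothetical sketch of the configurable geometry: one barrel layer defined as a
    # cylinder around the z-axis, one forward layer defined as a plane perpendicular to z.
    TrackerMaterial = cms.PSet(
        BarrelLayers = cms.VPSet(
            cms.PSet(
                radius = cms.untracked.double(4.2),        # cylinder radius [cm]
                limits = cms.untracked.vdouble(0., 26.),   # |z| range covered by material [cm]
                thickness = cms.untracked.vdouble(0.024)   # material budget [radiation lengths]
            )
        ),
        ForwardLayers = cms.VPSet(
            cms.PSet(
                z = cms.untracked.double(34.5),            # plane position along z [cm]
                limits = cms.untracked.vdouble(6., 15.),   # radial range covered by material [cm]
                thickness = cms.untracked.vdouble(0.050)
            )
        )
    )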

Motivation for a new algorithm (particle propagation inside the tracker)

  • it is almost impossible to adjust the old algorithm to the new, configurable geometry (hard-coded links would have to be removed)
  • the old algorithm is flawed
  • the new algorithm allows material and instruments to be placed on any plane perpendicular to z, or on any cylinder with the z-axis as its axis
  • the old algorithm, with its requirement of nested layers, forbids the long, narrow cylinders required for a decent simulation of the material in front of HF

@cmsbuild changed the base branch from CMSSW_9_4_X to master September 27, 2017 11:19
@cmsbuild
Contributor

@skurz, CMSSW_9_4_X branch is closed for direct updates. cms-bot is going to move this PR to master branch.
In the future, please use the cmssw master branch to submit your changes.

@cmsbuild
Contributor

The code-checks are being triggered in jenkins.

@cmsbuild
Contributor

-code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/PR-20666/986

Code check has found code style and quality issues which could be resolved by applying a patch in https://cmssdt.cern.ch/SDT/code-checks/PR-20666/986/git-diff.patch
e.g. curl https://cmssdt.cern.ch/SDT/code-checks/PR-20666/986/git-diff.patch | patch -p1

You can run scram build code-checks to apply the code checks directly.

@kpedro88
Contributor

@skurz please apply the code checks as indicated for this PR to proceed

@cmsbuild
Contributor

The code-checks are being triggered in jenkins.

@cmsbuild
Contributor

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/PR-20666/1024

@cmsbuild
Contributor

A new Pull Request was created by @skurz for master.

It involves the following packages:

Configuration/Applications
FastSimulation/Calorimetry
FastSimulation/Configuration
FastSimulation/Event
FastSimulation/ForwardDetectors
FastSimulation/HighLevelTrigger
FastSimulation/MuonSimHitProducer
FastSimulation/SimplifiedGeometryPropagator
FastSimulation/Tracking
FastSimulation/TrackingRecHitProducer
FastSimulation/Validation
HLTriggerOffline/Btag
IOMC/RandomEngine
MuonAnalysis/MomentumScaleCalibration
MuonAnalysis/MuonAssociators
PhysicsTools/HepMCCandAlgos
PhysicsTools/PatAlgos
PhysicsTools/PatExamples
RecoJets/JetAssociationProducers
RecoLuminosity/LumiProducer
RecoParticleFlow/Configuration
RecoParticleFlow/PFProducer
RecoParticleFlow/PFSimProducer
RecoParticleFlow/PFTracking
SLHCUpgradeSimulations/Configuration
SimCalorimetry/HcalSimProducers
SimG4Core/CustomPhysics
SimGeneral/DataMixingModule
SimGeneral/MixingModule
SimGeneral/TrackingAnalysis
SimMuon/MCTruth
Validation/EcalDigis
Validation/EcalHits
Validation/EcalRecHits
Validation/HcalDigis
Validation/HcalHits
Validation/HcalRecHits
Validation/RecoEgamma
Validation/RecoParticleFlow
Validation/RecoTrack

The following packages do not have a category, yet:

FastSimulation/SimplifiedGeometryPropagator
Please create a PR for https://github.com/cms-sw/cms-bot/blob/master/categories.py to assign category

@perrotta, @smuzaffar, @civanch, @Dr15Jones, @vazzolini, @lveldere, @kmaeshima, @ssekmen, @kpedro88, @dmitrijus, @cmsbuild, @franzoni, @mdhildreth, @slava77, @vanbesien, @monttj, @davidlange6 can you please review it and eventually sign? Thanks.
@ghellwig, @TaiSakuma, @gouskos, @felicepantaleo, @schoef, @rappoccio, @sviret, @argiro, @Martin-Grunewald, @imarches, @ahinzmann, @mmusich, @seemasharmafnal, @mmarionncern, @JyothsnaKomaragiri, @makortel, @acaudron, @jhgoh, @lgray, @jdolen, @HuguesBrun, @ferencek, @trocino, @abbiendi, @GiacomoSguazzoni, @rovere, @VinInn, @jdamgov, @bellan, @nhanvtran, @gkasieczka, @wmtford, @ebrondol, @mtosi, @dgulhan, @battibass, @wddgit, @calderona, @mverzett, @cbernet, @gpetruc, @matt-komm, @mariadalfonso, @pvmulder, @bachtis this is something you requested to watch as well.
@davidlange6, @slava77 you are the release manager for this.

cms-bot commands are listed here

@Dr15Jones
Contributor

please test

@Dr15Jones
Contributor

@smuzaffar can we get stack traces for any of these jobs?

@skurz
Contributor Author

skurz commented Feb 13, 2018

I am also looking into this now. I can reproduce the increase in runtime and I'm doing some timing tests to see where it comes from. However, I can't reproduce the test sometimes getting stuck.

@skurz
Contributor Author

skurz commented Feb 13, 2018

Ok, I found the problem. The calorimetry is initialized for every single event, which obviously doesn't make sense. These lines should rather go into a beginRun(...) method:
https://github.com/skurz/cmssw/blob/0c0d5f3f770e17ee599cc08f12c14a2b1c886bda/FastSimulation/SimplifiedGeometryPropagator/plugins/FastSimProducer.cc#L197-L212
What do you suggest? How should we proceed?

@Dr15Jones
Contributor

You could also use an ESWatcher to call the update code only when the conditions change. You can find the ESWatcher in FWCore/Utilities.
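
For reference, a minimal sketch of that pattern (the module base class, record type and member names below are illustrative assumptions; only the edm::ESWatcher usage itself is the point):

    #include "FWCore/Framework/interface/ESWatcher.h"
    #include "FWCore/Framework/interface/Event.h"
    #include "FWCore/Framework/interface/EventSetup.h"
    #include "FWCore/Framework/interface/stream/EDProducer.h"
    #include "Geometry/Records/interface/CaloGeometryRecord.h"

    class FastSimProducer : public edm::stream::EDProducer<> {
    public:
      void produce(edm::Event&, const edm::EventSetup&) override;

    private:
      // check() returns true only when the watched record gets a new IOV.
      edm::ESWatcher<CaloGeometryRecord> caloGeometryWatcher_;
    };

    void FastSimProducer::produce(edm::Event& iEvent, const edm::EventSetup& iSetup) {
      if (caloGeometryWatcher_.check(iSetup)) {
        // re-read the calorimeter geometry and re-initialize the calorimetry here,
        // instead of doing it unconditionally for every event
      }
      // ... rest of the per-event processing ...
    }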

@skurz
Contributor Author

skurz commented Feb 13, 2018

Ok, so what do I have to do, as the PR has already been merged? Do I have to make a new PR?
And which do you think makes more sense: using an ESWatcher or adding a beginRun method?

@Dr15Jones
Contributor

You need a new pull request and ESWatcher is the better solution.

@civanch
Contributor

civanch commented Feb 13, 2018

I would think that, if ESWatcher was not used in FastSim before, FastSim was run with fixed conditions. So I may be wrong, but beginRun corresponds to what was done before.

@skurz
Contributor Author

skurz commented Feb 13, 2018

Yes, beginRun was used before.

@Dr15Jones
Contributor

In general, it is never good to assume exactly when IOVs change. The ESWatcher always does the right thing. In this case, if we did run-dependent MC using this module and you were updating in beginRun, each run would, probably unnecessarily, update the geometry. The ESWatcher guarantees that the update happens only if needed.

@skurz
Contributor Author

skurz commented Feb 13, 2018

I am going to submit the PR as soon as I am sure there are no other flaws that affect the runtime.

{
// Should not be reached: Full Propagation does always have a solution "if(crosses(layer)) == -1"
// Even if particle is outside all layers -> can turn around in magnetic field
throw cms::Exception("FastSimulation") << "HelixTrajectory: should not be reached (no solution).";

Contributor

@skurz, since you are checking the effect of this already merged PR: perhaps the following failure
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-22152/26029/addOnTests/fastsim2/cmsDriver.py_TTbar_13TeV_TuneCUETP8M1_cfi_--conditions_auto:run2_mc_--fast__-n_100_--eventcontent_AODSIM,DQM_--relval_100000,1000_-s_GEN,SIM,RECOBEFMI.log
observed in the addOn tests of the apparently unrelated pull request #22152 is also worth a look

Contributor Author

Yes, this is definitely related to my PR, but it is still really strange. I'll try to find out what is going on.

@skurz
Contributor Author

skurz commented Feb 13, 2018

So it seems the runtime of the new FastSim propagator is fixed by the ESWatcher. However, there is something else going on that I don't understand. I used the fastsim1 config from the addOnTests to produce 100 events. The first events are processed really quickly (<10 s per event), and then it starts to slow down more and more (~1 min per event); at the same time, CPU consumption rises. I have no idea where this comes from, or how it could be related to my PR.

Here are some lines from the TimeReport:
TimeReport ---------- Path Summary ---[Real sec]----
TimeReport    per event   per exec    Name
TimeReport    32.375547   32.375547   generation_step
TimeReport     2.101804    2.101804   simulation_step
TimeReport     2.087158    2.087158   reconstruction_befmix_step
TimeReport    34.106832   34.106832   digitisation_step
TimeReport     3.613188    3.613188   L1simulation_step
TimeReport     3.615137    3.615137   digi2raw_step
TimeReport     2.825613    2.825613   L1Reco_step
TimeReport    35.093805   35.093805   reconstruction_step
TimeReport    25.964805   25.964805   eventinterpretaion_step
TimeReport    25.967529   25.967529   prevalidation_step

But looking e.g. at the details of the generation step, I don't understand why it takes about 32 s per event:
TimeReport ---------- Modules in Path: generation_step ---[Real sec]----
TimeReport    per event   per visit   Name
TimeReport     0.000000    0.000000   generator
TimeReport     0.000065    0.000065   randomEngineStateProducer
TimeReport     0.001556    0.001556   VtxSmeared
TimeReport     0.000025    0.000025   generatorSmeared
TimeReport     0.002150    0.002150   genParticles
TimeReport     0.000033    0.000033   genParticlesForJets
TimeReport     0.000025    0.000025   genParticlesForJetsNoNu
TimeReport     0.000071    0.000071   ak4GenJets
TimeReport     0.000064    0.000064   ak8GenJets
TimeReport     0.000049    0.000049   ak4GenJetsNoNu
TimeReport     0.000059    0.000059   ak8GenJetsNoNu
TimeReport     0.000025    0.000025   genCandidatesForMET
TimeReport     0.000025    0.000025   genParticlesForMETAllVisible
TimeReport     0.000008    0.000008   genMetCalo
TimeReport     0.000006    0.000006   genMetTrue

@kpedro88
Contributor

@skurz you can try to use igprof to get more information on the CPU usage (see e.g. https://github.com/kpedro88/utilities/blob/master/runIgprof.sh for an example of how to run it).
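
For reference, a typical CPU-profile invocation looks roughly like this (the config file name is a placeholder):

    igprof -d -pp -z -o igprof.pp.gz cmsRun fastsim1_cfg.py
    igprof-analyse -d -v -g igprof.pp.gz > igprof-report.txt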

The slowdown over time makes me suspect there could be a memory leak or failure to deallocate. You can try to watch the memory usage over time as it slows down. You can also run valgrind:

valgrind --tool=memcheck --leak-check=yes --show-reachable=yes --num-callers=20 --suppressions=$ROOTSYS/etc/valgrind-root.supp --track-origins=yes cmsRun ...

Or run igprof in memory-profile mode; this requires adding a service to the config:

    process.IgProfService = cms.Service("IgProfService",
        reportEventInterval = cms.untracked.int32(1),
        reportFirstEvent = cms.untracked.int32(1),
        reportToFileAtPostEndJob = cms.untracked.string('| gzip -c > test___memory___%I_EndOfJob.gz'),
        reportToFileAtPostEvent = cms.untracked.string('| gzip -c > test___memory___%I.gz')
    )

@Dr15Jones
Contributor

@skurz could you run the job in the debugger and, when it starts to slow down, do a Ctrl-C and then a thread apply all backtrace to get the stack traces for all the threads? You might need to try several times to see if we are 'stuck' in any particular thread.
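
For reference, one way to do that with gdb (the config file name is a placeholder):

    gdb cmsRun
    (gdb) run fastsim1_cfg.py
    ... once the job starts to slow down, interrupt it with Ctrl-C, then:
    (gdb) thread apply all backtrace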

@Dr15Jones
Contributor

@skurz you could also run top in another shell on the same machine and see if the memory is steadily increasing.

@skurz
Contributor Author

skurz commented Feb 13, 2018

Ok, I am going through your list.
valgrind apparently did not find anything.
top showed me that memory consumption stays at around 8.5% (maybe increasing very slowly) of the 4 GB reserved. CPU consumption is always at 400% (4 threads).

@kpedro88
Contributor

@skurz valgrind usually takes several hours to run, so I'm surprised you could have results already...

@Dr15Jones maybe the stall monitor could provide some insight here?
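
For completeness, a sketch of enabling the stall monitor in the configuration (the output file name is a placeholder; check the StallMonitor service documentation for the full set of parameters):

    process.add_(cms.Service("StallMonitor",
        fileName = cms.untracked.string("stallMonitor.log")
    ))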

@skurz
Contributor Author

skurz commented Feb 13, 2018

ok, you're right. I did something wrong. It's running now.

@skurz
Contributor Author

skurz commented Feb 13, 2018

Another update for today.
valgrind is still running: it has shown tons of error messages so far, but nothing is related to my PR as far as I can tell. Let's wait for the summary.
@Dr15Jones I ran the debugger a couple of times and did the backtrace. I attached a representative trace here: most of the time at least 2 or 3 threads are occupied with CaloSubdetectorGeometry::present(DetId const&). Does this help? Is this what you were looking for?

backtrace.txt

@Dr15Jones
Contributor

@skurz There doesn't seem to be anything mysterious about that function:
https://github.com/cms-sw/cmssw/blob/225ce8ab3efc5171be8f17491f9448d040ccc698/Geometry/CaloGeometry/src/CaloSubdetectorGeometry.cc

Maybe the mystery is why the ParticleFlow code is calling it so often?

@Dr15Jones
Contributor

@skurz can you post the complete timing report somewhere? Was this done using 1 thread? With the multi-threaded framework, the path times are now a lie. All Paths actually start simultaneously (even in single-threaded mode), and the timer for a Path doesn't stop until its last module runs. Because of contention between 'legacy' and 'one' modules, modules from different Paths can get intermingled in the processing. If you want to see this for yourself, just add
process.add_(cms.Service("Tracer")) to your job.

@skurz
Contributor Author

skurz commented Feb 13, 2018

Yeah, I've already seen that there is nothing special about CaloSubdetectorGeometry. The timing was done using 4 threads. How can I make sure that the correct timing is calculated? Is there a way to make sure that paths are not run simultaneously (like a change in the config file)? Anyway, here is the timing report: TimeReport.txt

@skurz
Contributor Author

skurz commented Feb 13, 2018

I finally found the problem. I made a stupid mistake when "fixing" the old issue. I am going to make a new PR soon. Thank you very much for your help!

@Dr15Jones
Contributor

@skurz It looks to me like your time is being spent in the ParticleFlow calculation, with many of those modules being 'unscheduled'. E.g.

TimeReport   1.492296     1.492296     1.492296  particleFlowClusterECAL
TimeReport   1.489819     1.489819     1.489819  particleFlowClusterOOTECAL
TimeReport   5.707759     5.707759     5.707759  particleFlowRecHitECAL
TimeReport   5.712172     5.712172     5.712172  particleFlowRecHitOOTECAL
TimeReport   1.811973     1.811973     1.811973  particleFlowSuperClusterECAL
TimeReport   1.804347     1.804347     1.804347  particleFlowSuperClusterOOTECAL

I think I understand why the generation_step Path takes so long. All Paths start their timers at the beginning of event processing (since the first module in every Path gets scheduled to 'be runnable' at that moment). The timer for a Path stops when the final module in the Path has finished running. Now, the time at which a module runs is actually up to the framework. The framework only guarantees that a module will run IF it is needed by another module or IF it is on a Path and there is no other module which can be run. So my theory is that, since generator is on several Paths, the other modules on generation_step are not needed by any other module except maybe the OutputModule. That means the framework is free to run them at the end, if that happens to be what works best at the time.

@skurz mentioned this pull request Feb 13, 2018
@skurz
Contributor Author

skurz commented Feb 13, 2018

Here is the new PR: #22202
