New fastsim geometry reb #20666
@skurz, CMSSW_9_4_X branch is closed for direct updates. cms-bot is going to move this PR to master branch. |
The code-checks are being triggered in jenkins. |
-code-checks Logs: https://cmssdt.cern.ch/SDT/code-checks/PR-20666/986 Code check has found code style and quality issues which could be resolved by applying a patch in https://cmssdt.cern.ch/SDT/code-checks/PR-20666/986/git-diff.patch You can run |
@skurz please apply the code checks as indicated for this PR to proceed |
The code-checks are being triggered in jenkins. |
+code-checks |
A new Pull Request was created by @skurz for master. It involves the following packages: Configuration/Applications The following packages do not have a category, yet: FastSimulation/SimplifiedGeometryPropagator @perrotta, @smuzaffar, @civanch, @Dr15Jones, @vazzolini, @lveldere, @kmaeshima, @ssekmen, @kpedro88, @dmitrijus, @cmsbuild, @franzoni, @mdhildreth, @slava77, @vanbesien, @monttj, @davidlange6 can you please review it and eventually sign? Thanks. cms-bot commands are listed here |
please test |
@smuzaffar can we get stack traces for any of these jobs? |
I am also looking into this now. I can reproduce the increase in runtime and I'm doing some timing test to see where it comes from. However, I can't reproduce that the test sometimes gets stuck. |
Ok, I found the problem. The calorimetry is initialized every single event, which obviously doesn't make sense. These lines should rather go into a beginRun(...) method:
You could also use an ESWatcher to call the update code only when the conditions change. You can find the ESWatcher in FWCore/Utilities. |
Ok, so what do I have to do as the PR has already been merged? Do I have to make a new PR? |
You need a new pull request and ESWatcher is the better solution. |
I would think that if ESWatcher was not used in FastSim before, this means that FastSim is run with fixed conditions. So, I may be wrong, but beginRun corresponds to what was there before.
Yes, beginRun was used before. |
In general, it is never good to assume exactly when IOVs change. The ESWatcher always does the right thing. In this case, if we did Run dependent MC using this module and you were updating in the beginRun, each run would, probably unnecessarily, update the geometry. The ESWatcher guarantees that the update happens only if needed. |
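The difference between the three update strategies discussed here (every event, every beginRun, only on a conditions change) can be illustrated with a small standalone sketch. It is written in Python for brevity and does not use the CMSSW API: `ChangeWatcher`, `count_updates`, and the event list are hypothetical names and data invented for illustration, and the watcher only mimics the idea behind ESWatcher.

```python
class ChangeWatcher:
    """Hypothetical stand-in for the watcher idea (not the CMSSW ESWatcher):
    remembers the last-seen conditions version and reports whether it changed."""

    def __init__(self):
        self._last_seen = None
        self._primed = False

    def check(self, version):
        # The first call always reports a change, so the initial setup happens once.
        changed = (not self._primed) or version != self._last_seen
        self._last_seen = version
        self._primed = True
        return changed


def count_updates(events, strategy):
    """Count how often the expensive geometry update would run.

    events: list of (run_number, conditions_version) in processing order.
    strategy: "every_event", "begin_run", or "watcher".
    """
    watcher = ChangeWatcher()
    last_run = None
    updates = 0
    for run, version in events:
        if strategy == "every_event":
            updates += 1
        elif strategy == "begin_run":
            if run != last_run:
                updates += 1
        elif strategy == "watcher":
            if watcher.check(version):
                updates += 1
        else:
            raise ValueError(f"unknown strategy: {strategy}")
        last_run = run
    return updates


# Assumed toy input: three runs of two events each; conditions change only at run 3.
events = [(1, "A"), (1, "A"), (2, "A"), (2, "A"), (3, "B"), (3, "B")]

print(count_updates(events, "every_event"))  # 6
print(count_updates(events, "begin_run"))    # 3
print(count_updates(events, "watcher"))      # 2
```

In this toy input the beginRun strategy performs one update more than needed (at run 2, where the conditions are unchanged), which is the run-dependent MC case described above.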
I am going to submit the PR as soon as I am sure there are no other flaws that have an influence on the runtime. |
{
// Should not be reached: Full Propagation does always have a solution "if(crosses(layer)) == -1"
// Even if particle is outside all layers -> can turn around in magnetic field
throw cms::Exception("FastSimulation") << "HelixTrajectory: should not be reached (no solution).";
@skurz, since you are checking the effect of this already merged PR: perhaps the following failure
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-22152/26029/addOnTests/fastsim2/cmsDriver.py_TTbar_13TeV_TuneCUETP8M1_cfi_--conditions_auto:run2_mc_--fast__-n_100_--eventcontent_AODSIM,DQM_--relval_100000,1000_-s_GEN,SIM,RECOBEFMI.log
observed in the addOn tests of the apparently unrelated pull request #22152 is also worth giving a look
Yes, this is definitely related to my PR but still it is really strange. I'll try to find out what is going on.
So it seems like the runtime of the new fastSim propagator is fixed by the ESWatcher. However, there is something else going on that I don't understand. I used the fastsim1 config from the addOnTests to produce 100 events. The first events are processed really quickly (<10s per evt) and then it starts to slow down more and more (~1min per evt); at the same time CPU consumption is rising. I have no idea where this comes from and I don't know how this should be related to my PR. Here are some lines from the TimeReport: But looking e.g. at the details of the generating step, I don't understand why it takes about 32s per evt: |
@skurz you can try to use igprof to get more information on the CPU usage (see e.g. https://github.com/kpedro88/utilities/blob/master/runIgprof.sh for an example of how to run it). The slowdown over time makes me suspect there could be a memory leak or failure to deallocate. You can try to watch the memory usage over time as it slows down. You can also run valgrind:
Or igprof in memory profile mode, this requires adding a service to the config:
@skurz could you run the job in the debugger and when it starts to slow down do a and then a |
@skurz you could also run |
Ok, I am going through your list. |
@skurz valgrind usually takes several hours to run, so I'm surprised you could have results already... @Dr15Jones maybe the stall monitor could provide some insight here? |
ok, you're right. I did something wrong. It's running now. |
Another update for today. |
@skurz There doesn't seem to be anything mysterious about that function: maybe the mystery is why the ParticleFlow code is calling it so often? |
@skurz can you post the complete timing report somewhere? Was this done using 1 thread? With the multi-threaded framework, the path times are now a lie. All Paths actually start simultaneously (even in single-threaded mode) and the timer for a path doesn't stop until its last module runs. Because of contention between 'legacy' and 'one' modules, modules from different Paths can get intermingled in the processing. If you want to see this for yourself, just add
Yeah, I've already seen that there is nothing special about the CaloSubdetectorGeometry. So, the timing was done using 4 threads. How can I make sure that the correct timing is calculated? Is there a way to make sure that paths are not run simultaneously (like a change in the config file)? Anyway, here is the timing report: TimeReport.txt
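Why a per-path time can look far larger than the work its own modules do can be shown with a toy model. This is only a sketch under stated assumptions: a single worker, all paths starting at time zero, and a simple round-robin interleaving of modules. The function `simulate_paths`, the path names, and the durations are hypothetical, and the real framework scheduler is more involved.

```python
def simulate_paths(paths):
    """Toy model of path timing on a single worker.

    paths: dict mapping path name -> list of module durations in seconds.
    All paths start at t = 0; modules run one at a time, round-robin across
    paths (an assumed interleaving, not the real scheduler).
    Returns dict: path name -> (reported_path_time, sum_of_own_module_times).
    """
    clock = 0.0
    finish = {}
    queues = {name: list(mods) for name, mods in paths.items()}
    while any(queues.values()):
        for name, queue in queues.items():
            if queue:
                clock += queue.pop(0)
                # The path timer stops only when the path's last module ends.
                finish[name] = clock
    return {
        name: (finish.get(name, 0.0), sum(mods))
        for name, mods in paths.items()
    }


# Assumed durations: a cheap generation path and an expensive reconstruction path.
report = simulate_paths({
    "generation": [1.0, 1.0],
    "reconstruction": [10.0, 10.0],
})

for name, (reported, own) in report.items():
    print(f"{name}: reported {reported:.1f}s, own modules {own:.1f}s")
# generation: reported 12.0s, own modules 2.0s
# reconstruction: reported 22.0s, own modules 20.0s
```

In this model the generation path is charged 12 s although its own modules need only 2 s, because it waits while a module of the other path runs. Per-module times are therefore the more reliable numbers to compare.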
I finally found the problem. I made a stupid mistake when "fixing" the old issue. I am going to make a new PR soon. Thank you very much for your help! |
@skurz It looks to me like your time is being spent in ParticleFlow calculation with many of those modules being 'unscheduled'. E.g.
I think I understand why the |
Here is the new PR: #22202 |
The goal is to prepare fastsim for the upgrade of the pixel detector. A new, configurable interface for the geometry of the fastsim tracker was developed. To this end, it was necessary to also write a new algorithm for the propagation of the particles inside the tracker.
The geometry of the detector used to be hard-coded but can now be specified in a single Python file (FastSimulation/Geometry/python/TrackerMaterial_cfi.py).
Motivation for a new algorithm (particle propagation inside the tracker)