Revert Provenance Prefetching #17556
When running on very many threads it appears that the framework sometimes thinks the prefetching for the PoolOutputModule never finishes and therefore the module is never run. Until the problem is found, we need to not do the prefetching.
A new Pull Request was created by @Dr15Jones (Chris Jones) for CMSSW_9_0_X. It involves the following packages: FWCore/Integration. @cmsbuild, @smuzaffar, @Dr15Jones, @davidlange6 can you please review it and eventually sign? Thanks. cms-bot commands are listed here #13028
@davidlange6 this needs to be in for pre5 to avoid problems using the release on Cori.
please test
+1
The tests are being triggered in jenkins.
This pull request is fully signed and it will be integrated in one of the next CMSSW_9_0_X IBs after it passes the integration tests. This pull request requires discussion in the ORP meeting before it's merged. @davidlange6, @smuzaffar
+1 The following merge commits were also included on top of IB + this PR after doing git cms-merge-topic:
Comparison job queued.
+1
@davidlange6 new information about this. The problem does not appear to be with this pull request but appears to only trigger the problem. The PoolOutputModule doesn't run because the module
@lgray The problem appears to be in Our 200 pileup job appears to be stuck in this routine (or one it calls) for hours. The tracebacks we get after that time which have that module are
@Dr15Jones OK - interesting. I think this can be solved by introducing a max-iterations cut, since simulated annealing converges smoothly towards the end of the cooling process.
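For readers unfamiliar with the idea, a max-iterations cut can be sketched as follows. This is a generic illustration in Python, not the DAClusterizerInZT code: the names `iterate_until_converged`, `halving_step`, `creeping_step`, `tolerance` and `max_iterations` are all hypothetical, and the update functions are toy stand-ins for the real clustering update.

```python
def iterate_until_converged(step, state, tolerance=1e-6, max_iterations=1000):
    """Apply `step` until the change is below `tolerance` or the cap is hit.

    `step` takes the current state and returns (new_state, delta), where
    `delta` is a non-negative measure of how much the state changed.
    Returns (state, iterations_used, converged).
    """
    for iteration in range(1, max_iterations + 1):
        state, delta = step(state)
        if delta < tolerance:
            # Normal exit: the update has effectively stopped moving.
            return state, iteration, True
    # Cap reached: stop and report non-convergence instead of looping on.
    return state, max_iterations, False


def halving_step(x):
    # Toy update that converges quickly.
    new_x = x / 2.0
    return new_x, abs(new_x - x)


def creeping_step(x):
    # Toy update that converges extremely slowly (hypothetical stand-in
    # for a loop that makes progress, but far too little per pass).
    new_x = x * 0.999999
    return new_x, abs(new_x - x)


fast = iterate_until_converged(halving_step, 1.0, tolerance=1e-3)
slow = iterate_until_converged(creeping_step, 1.0, tolerance=1e-12,
                               max_iterations=50)
```

With the cap in place the slow case returns after 50 passes with `converged` set to `False`, so the caller can decide whether to accept the partial result or flag the event, rather than waiting for hours.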
@gartung the two jobs I looked at were both on event number 10 (which is the 7th in the input file). Could you point people to the input file used as well as the configuration file?
On cmslpc /eos/uscms/store/user/gartung/step2/pu200/step2.root |
@davidlange6 could we revert this change, since we discovered the problem has nothing to do with this pull request? This change does appear to have a significant impact on the threading efficiency for large numbers of threads.
I ran the job for 10 events (using 4 threads and 4 streams) on a standard Xeon system (cmslpc27) using the releases CMSSW_9_0_X_2017-02-17-2300 (which has the prefetching) and CMSSW_9_0_X_2017-02-18-1100 (which doesn't have the prefetching). Both ran to completion just fine, including one of the events which had gotten stuck on the KNL system. While running those versions I noticed that the numbering of the modules was different, so the configuration did change slightly. This leads me to conclude that
In all of these the prefetching has no role in the problem.
I've now also run the configuration using CMSSW_9_0_0_pre4 with 4 threads/streams on a Xeon machine. The job finishes just fine. An interesting note is vanilla pre4 also has a different module numbering scheme than the KNL test.
#17564 reinstates the prefetching
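@lgray I was able to watch the job in the debugger. The code isn't in an infinite loop; it is just in an incredibly slowly converging loop (i.e. 8+ hours for one event). The job is 'stuck' in the purge while loop http://cmslxr.fnal.gov/source/RecoVertex/PrimaryVertexProducer/src/DAClusterizerInZT.cc?v=CMSSW_9_0_0_pre4#0587 When I checked, tks.size() = 1480 and y.size() = 1172. Then, more than 5 minutes later, y.size() = 1168.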
Running with CMSSW_9_0_X_2017-02-21-1100 over the first 10 events the job completed in a reasonable amount of time. I am trying now with 320 events.
Hi Chris,
We also found some numerical issues that might be related to this.
Best,
-Lindsey
On Tue, Feb 21, 2017 at 1:52 PM, Chris Jones wrote:
> @lgray I was able to watch the job in the debugger. The code isn't in an infinite loop, it is just in an incredibly slowly converging loop (i.e. 8+ hours for one event). The job is 'stuck' in the purge while loop
> http://cmslxr.fnal.gov/source/RecoVertex/PrimaryVertexProducer/src/DAClusterizerInZT.cc?v=CMSSW_9_0_0_pre4#0587
> When I checked, tks.size() = 1480 and y.size() = 1172. Then after greater than 5 minutes y.size() = 1168.
When running on very many threads it appears that the framework sometimes thinks the prefetching for the PoolOutputModule never finishes and therefore the module is never run. Until the problem is found, we need to not do the prefetching.
The problem was seen when running on KNL for 48 or 64 threads. Reverting only this part avoids large recompilation and allows the fix to be added with only minor recompilation later.