Protect against multiple shifting for out of time pileup #1615
A new Pull Request was created by @wmtan for CMSSW_7_0_X.
Protect against multiple shifting for out of time pileup
It involves the following packages:
SimDataFormats/CrossingFrame
@cmsbuild, @civanch, @nclopezo, @mdhildreth, @giamman can you please review it and eventually sign? Thanks. |
@mdhildreth @civanch will you actually manage to check it despite Thanksgiving? If not I'll simply go ahead with the release at this point. |
I believe that Mike will test it, despite Thanksgiving. |
-1
runTheMatrix-results/201.0_ZmumuJets_Pt_20_300+ZmumuJets_Pt_20_300+DIGIPU1+RECOPU1+HARVEST/step2_ZmumuJets_Pt_20_300+ZmumuJets_Pt_20_300+DIGIPU1+RECOPU1+HARVEST.log
----- Begin Fatal Exception 28-Nov-2013 12:22:11 CET-----------------------
An exception of category 'ProductNotFound' occurred while
[0] Processing run: 1 lumi: 1 event: 1
[1] Running path 'digitisation_step'
[2] Calling event method for module MixingModule/'mix'
Exception Message:
Principal::findProductByTag: Found zero products matching all criteria
Looking for type: std::vector
Looking for module label: g4SimHits
Looking for productInstanceName: CastorBU
Additional Info:
[a] If you wish to continue processing events after a ProductNotFound exception, add "SkipEvent = cms.untracked.vstring('ProductNotFound')" to the "options" PSet in the configuration.
----- End Fatal Exception -------------------------------------------------
you can see the results of the tests here: |
I just created a CMSSW_7_0_X_2013-11-28-0200 work area and then fetched the branch
built it and then ran the job
This worked fine for me
So if I built things properly, it means we have an intermittent problem. |
@nclopezo can you run the tests again? |
I tested myself and indeed it seems to be fine. @mdhildreth in your ballpark. |
Mmmm… I get the crash as well, also if I use a Neutrino gun as signal… |
Can someone run valgrind on the failing job? |
Hi All - testing now... |
For the record, I'm also running valgrind on it… Will take a while though... |
The problem appears to happen while trying to access collections in the doTheOffset method of the new boost::shared_ptr<Wrapper<std::vector > const> shPtr = getProductByTagstd::vector(ep, tag_, mcc); It crashes on the loop over the various collections while trying to access CastorBU. The problem for me is |
Hi, I had also started valgrind on 201.0 earlier this morning, it just finished, you can see the results here: |
Which does not show anything particular… Great… :-/ |
Since Giulio and I saw it work, what release are other people basing their builds upon? |
git cms-merge-topic 1615 |
@fabiocos Exactly what job did you run? |
You can run "runTheMatrix.py -l 201". I also produced 10 SingleNeutrinos (lxbuild170.cern.ch:/build/fabiocos/production/pileup/CMSSW_7_0_X_2013-11-28-0200/work/step1.root) and 100 fresh new minbias (lxbuild170:/build/fabiocos/production/pileup/CMSSW_7_0_X_2013-11-28-0200/work/minbias/minbias.root), and ran the step2 of workflow 201 replacing the input with I manage to get through the reduced sequence process.pdigi = cms.Sequence(cms.SequencePlaceholder("randomEngineStateProducer")+cms.SequencePlaceholder("mix")+process.addPileupInfo) if I take the Castor, Ecal and Hcal test beam, FP420 and Totem collections out of the process.mix configuration. They are not the only empty collections though: CaloHitsTk, for instance, is empty as well and passes smoothly (at least apparently). |
I have a new theory. Maybe we have a problem with functions that have the same name but different functionality, so the differences we see are due to shared library load order. To test this I have dumped the load order I see by doing
I have put library_load_order_mixing_module.trace in /afs/cern.ch/user/c/chrjones/public. So if one of you who sees the problem can do the same, we can compare. I was using 'CMSSW_7_0_X_2013-11-28-0200' on lxplus. |
/afs/cern.ch/user/f/fabiocos/public/ForChris/library_load_order_mixing_module.trace It looks like I am recompiling more than you do... |
Fabio,
So in my case, I think |
I did |
My new branch made from |
I have now run the job in the debugger. "CastorBU" is not the first data requested, it is the second. The second lookup does see that the branch was registered and goes to the input file to read it. However, the framework says that the data is not available in the file for that event. |
I just looked at the pileup file
Then in root I checked the branch to see if the
which gave
which means though the branch is in the file, there is no data stored in that branch. To test that I checked the first data objects requested by the mixing module
Which showed
So as far as I can see, the system is correct: the data isn't there for that (or any other) event. I don't know why the previous version ignored the missing data, but it is missing. |
@wmtan @fabiocos @mdhildreth I think I've figured out what happened. The source file doesn't have any 'CastorBU' PCaloHits. Originally, the MixingModule checked each event to see if the source contained the relevant data products, and only if it did would the workers be added to the list of workers to process that event. However, when Bill added the Adjusters, he made all the Adjusters run for each event, not just those Adjusters for which there was data in the Event. This is what causes the exception: we are trying to adjust the CastorBU collection even though it would never be used. |
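To make the diagnosis above concrete, here is a minimal Python sketch of the two behaviours. It is a toy model, not CMSSW code: `ToyEvent`, `adjust_all`, `adjust_present` and the collection name `PresentHits` are hypothetical names invented for illustration, and the per-event presence check is an assumption about how the pruning could work rather than a copy of the actual fix.

```python
from dataclasses import dataclass, field


class ProductNotFound(Exception):
    """Toy stand-in for the framework's ProductNotFound exception."""


@dataclass
class ToyEvent:
    # Maps a product instance name (e.g. "CastorBU") to a list of hit times.
    products: dict = field(default_factory=dict)

    def get(self, name):
        if name not in self.products:
            raise ProductNotFound(f"no product named {name!r}")
        return self.products[name]


def adjust_all(event, configured_names, offset):
    """First-pass behaviour: shift every configured collection, present or not."""
    for name in configured_names:
        hits = event.get(name)  # raises if the source never stored this product
        hits[:] = [t + offset for t in hits]


def adjust_present(event, configured_names, offset):
    """Pruned behaviour: only shift collections the source actually contains."""
    adjusted = []
    for name in configured_names:
        if name not in event.products:
            continue  # nothing to shift, so no lookup is attempted
        hits = event.products[name]
        hits[:] = [t + offset for t in hits]
        adjusted.append(name)
    return adjusted


# Hypothetical pileup event that has one collection but no "CastorBU".
event = ToyEvent(products={"PresentHits": [1.0, 2.5]})
print(adjust_present(event, ["PresentHits", "CastorBU"], 25.0))
```

Calling `adjust_all(event, ["CastorBU"], 25.0)` on the same event raises `ProductNotFound`, which mirrors the failure seen in workflow 201.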
Believe it or not, I found time to debug this tonight. I came to the exact same conclusion as Chris. |
Pull request #1615 was updated. @civanch, @Dr15Jones, @mdhildreth, @cmsbuild, @nclopezo, @giamman, @ktf can you please check and sign again. |
This pull request has been redone, and retested for 7_0_X. All the relvals that failed last time now pass. |
+1 !!!
Mike Hildreth e-mail: mikeh@undhep.hep.nd.edu |
-1 runTheMatrix-results/1003.0_RunMinBias2012A+RunMinBias2012A+RECODDQM+HARVESTDDQM/step2_RunMinBias2012A+RunMinBias2012A+RECODDQM+HARVESTDDQM.log you can see the results of the tests here: |
Are you referring to the fact that a real data workflow didn't run because of a DAS timeout in providing an output file? |
Hi Fabio, Sorry, that message is automatic: Jenkins saw the failure and automatically posted the message here. I can talk to Giulio to discuss how we should handle it when this happens. |
Ok, so nothing is preventing integration I assume. I see the Core signature missing, but I assume this is just forgotten. |
+1 |
This pull request is fully signed and it will be integrated in one of the next IBs unless changes or unless it breaks tests. @ktf can you please take care of it? |
Protect against multiple shifting for out of time pileup
This pull request is a modification to the mixing module to guarantee that the offsets are applied to the hits only once. The offsets are applied up front by the MixingModule using a new "Adjuster" class. With this fix, neither the digi accumulators nor the CrossingFrames can apply the shifts, as they do not have write access to the hits.
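The idea of shifting once up front and then handing out read-only hits can be sketched in Python as follows. This is an illustration only, not the Adjuster class from this pull request: `ToyHit`, `ToyAdjuster`, `do_offset`, `read_only_hits` and the explicit "already shifted" flag are hypothetical, and computing the offset as bunch crossing times bunch spacing is an assumption of the sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyHit:
    # Frozen, so downstream consumers cannot modify the time after the shift.
    time_ns: float


class ToyAdjuster:
    """Shifts one hit collection by the bunch-crossing offset, at most once."""

    def __init__(self, hits):
        self._hits = list(hits)
        self._shifted = False  # assumption: an explicit guard against re-shifting

    def do_offset(self, bunch_crossing, bunch_spacing_ns):
        if self._shifted:
            return  # a second call is a no-op, so the shift is never doubled
        offset = bunch_crossing * bunch_spacing_ns
        self._hits = [ToyHit(h.time_ns + offset) for h in self._hits]
        self._shifted = True

    def read_only_hits(self):
        # Consumers receive an immutable tuple of immutable hits.
        return tuple(self._hits)


adjuster = ToyAdjuster([ToyHit(10.0)])
adjuster.do_offset(bunch_crossing=-1, bunch_spacing_ns=25.0)
adjuster.do_offset(bunch_crossing=-1, bunch_spacing_ns=25.0)  # ignored
print(adjuster.read_only_hits())
```

In this toy, anything that receives the tuple from `read_only_hits()` plays the role of a digi accumulator or CrossingFrame: it can read the shifted times, but an attempt to assign to `time_ns` raises an error.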
This fix has been regression tested, but it needs to be tested that it does indeed fix the problem, and does not introduce other problems. In particular, it should be tested that the shifts are in fact applied.
This is the second pass at the fix. The unneeded adjusters have been pruned. All the relvals that failed the first time now pass.