Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Protect against multiple shifting for out of time pileup #1615

Merged
merged 1 commit into from Dec 3, 2013

Conversation

wmtan
Copy link
Contributor

@wmtan wmtan commented Nov 27, 2013

This pull request is a modification to the mixing module to guarantee that the offsets are applied to the hits only once. The offsets are applied up front by the MixingModule using a new "Adjuster" class. With this fix, neither the digit accumulators nor the CrossingFrames can apply the shifts, as they do not have write access to the hits.

This fix has been regression tested, but it needs to be tested that it does indeed fix the problem, and does not introduce other problems. In particular, it should be tested that the shifts are in fact applied.

This is the second pass at the fix. The unneeded adjusters have been pruned. All the relvals that failed the first time now pass.

@cmsbuild
Copy link
Contributor

A new Pull Request was created by @wmtan for CMSSW_7_0_X.

Protect against multiple shifting for out of time pileup

It involves the following packages:

SimDataFormats/CrossingFrame
SimGeneral/MixingModule

@cmsbuild, @civanch, @nclopezo, @mdhildreth, @giamman can you please review it and eventually sign? Thanks.
You can sign-off by replying to this message having '+1' in the first line of your reply.
You can reject by replying to this message having '-1' in the first line of your reply.
@ktf you are the release manager for this.

@ktf
Copy link
Contributor

ktf commented Nov 27, 2013

@mdhildreth @civanch do you actually manage to check it despite Thanksgiving? If not I'll simply go ahead with the release at this point.

@davidlange6 @fabiocos

@wmtan
Copy link
Contributor Author

wmtan commented Nov 27, 2013

I believe that Mike will test it, despite Thanksgiving.
But, I can't say anything for sure, except that it builds, and that the full relval matrix is still running, with no failures so far.

@cmsbuild
Copy link
Contributor

-1
When I ran the RelVals I found an error in the following worklfows:
201.0 step2

runTheMatrix-results/201.0_ZmumuJets_Pt_20_300+ZmumuJets_Pt_20_300+DIGIPU1+RECOPU1+HARVEST/step2_ZmumuJets_Pt_20_300+ZmumuJets_Pt_20_300+DIGIPU1+RECOPU1+HARVEST.log
----- Begin Fatal Exception 28-Nov-2013 12:22:11 CET-----------------------
An exception of category 'ProductNotFound' occurred while
   [0] Processing run: 1 lumi: 1 event: 1
   [1] Running path 'digitisation_step'
   [2] Calling event method for module MixingModule/'mix'
Exception Message:
Principal::findProductByTag: Found zero products matching all criteria
Looking for type: std::vector
Looking for module label: g4SimHits
Looking for productInstanceName: CastorBU
   Additional Info:
      [a] If you wish to continue processing events after a ProductNotFound exception,
add "SkipEvent = cms.untracked.vstring('ProductNotFound')" to the "options" PSet in the configuration.
----- End Fatal Exception -------------------------------------------------

you can see the results of the tests here:
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-1615/1520/summary.html

@Dr15Jones
Copy link
Contributor

I just created a CMSSW_7_0_X_2013-11-28-0200 work area and then fetched the branch

git fetch https://github.com/wmtan/cmssw.git ProtectAgainstMultipleOffsets:ProtectAgainstMultipleOffsets

built it and then ran the job

runTheMatrix.py -l '201.0'

This worked fine for me

201.0_ZmumuJets_Pt_20_300+ZmumuJets_Pt_20_300+DIGIPU1+RECOPU1+HARVEST Step0-PASSED Step1-PASSED Step2-PASSED Step3-PASSED  - time date Thu Nov 28 14:41:40 2013-date Thu Nov 28 14:20:04 2013; exit: 0 0 0 0
1 1 1 1 tests passed, 0 0 0 0 failed

So if I built things properly, it means we have an intermittent problem.

@ktf
Copy link
Contributor

ktf commented Nov 28, 2013

@nclopezo can you run the tests again?

@ktf
Copy link
Contributor

ktf commented Nov 28, 2013

I tested myself and indeed it seems to be fine. @mdhildreth in your ballpark.

@cmsbuild
Copy link
Contributor

@fabiocos
Copy link
Contributor

Mmmm… I get the crash as well, also if I use a Neutrino gun as signal…

On Nov 28, 2013, at 4:41 PM, cmsbuild notifications@github.com wrote:

+1
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-1615/1521/summary.html


Reply to this email directly or view it on GitHub.

@Dr15Jones
Copy link
Contributor

Can someone run valgrind on the failing job?

@mdhildreth
Copy link
Contributor

Hi All - testing now...

@ktf
Copy link
Contributor

ktf commented Nov 29, 2013

For the record, I'm also running valgrind on it… Will take a while though...

@fabiocos
Copy link
Contributor

The problem appears to happen while trying to access collections in the doTheOffset method of the new
Adjuster class, in particular here

boost::shared_ptr<Wrapper<std::vector > const> shPtr = getProductByTagstd::vector(ep, tag_, mcc);

It crashes on the loop over the various collections while trying to access CastorBU. The problem for me is
systematic and perfectly reproducible.

On Nov 29, 2013, at 11:35 AM, Giulio Eulisse notifications@github.com wrote:

For the record, I'm also running valgrind on it… Will take a while though...


Reply to this email directly or view it on GitHub.

@nclopezo
Copy link
Contributor

Hi,

I had also started valgrind on 201.0 earlier this morning, it just finished, you can see the results here:

https://cmssdt.cern.ch/jenkins/job/Pull-Request-Integration/ARCHITECTURE=slc5_amd64_gcc481/1532/valgrindResult/pid=21446/

@ktf
Copy link
Contributor

ktf commented Nov 29, 2013

@Dr15Jones
Copy link
Contributor

Since Giulio and I saw it work, what release are other people basing their builds upon?

@fabiocos
Copy link
Contributor

git cms-merge-topic 1615
on top of CMSSW_7_0_X_2013-11-28-0200

On Nov 29, 2013, at 4:55 PM, Chris Jones notifications@github.com wrote:

Since Giulio and I saw it work, what release are other people basing their builds upon?


Reply to this email directly or view it on GitHub.

@Dr15Jones
Copy link
Contributor

@fabiocos Exactly what job did you run?

@fabiocos
Copy link
Contributor

You can run "runTheMatrix.py -l 201". I produced also 10 SingleNeutrinos (lxbuild170.cern.ch:/build/fabiocos/production/pileup/CMSSW_7_0_X_2013-11-28-0200/work/step1.root), 100 fresh new minbias (lxbuild170:/build/fabiocos/production/pileup/CMSSW_7_0_X_2013-11-28-0200/work/minbias/minbias.root) and run the step2 of workflow 201 replacing the input with
process.mix.input.fileNames = cms.untracked.vstring('file:minbias.root')

I manage to go through the reduced sequence

process.pdigi = cms.Sequence(cms.SequencePlaceholder("randomEngineStateProducer")+cms.SequencePlaceholder("mix")+process.addPileupInfo)

if I take out of the process.mix configuraion the collections of Castor, Ecal and Hcal test beam, FP420 and Totem. They are not the only empty collections though, CaloHitsTk is empty as well for instance and passes smoothly (at least apparently)

@Dr15Jones
Copy link
Contributor

I have a new theory. Maybe we have a function with same name but different functionality problem. Therefore the differences we see are due to shared library load order. To test this I have dumped the load order I see by doing

strace cmsRun step2_DIGI_L1_DIGI2RAW_HLT_RAW2DIGI_L1Reco_PU.py > & trace.log
grep open trace.log | grep '.so' | grep -v '= -1' > library_load_order_mixing_module.trace

I have have put library_load_order_mixing_module.trace in /afs/cern.ch/user/c/chrjones/public.

So if one of you who sees the problem can do the same, we can compare. I was using 'CMSSW_7_0_X_2013-11-28-0200' on lxplus.

@fabiocos
Copy link
Contributor

fabiocos commented Dec 1, 2013

/afs/cern.ch/user/f/fabiocos/public/ForChris/library_load_order_mixing_module.trace

It looks like I am recompiling more than you do...

On Nov 29, 2013, at 5:46 PM, Chris Jones notifications@github.com wrote:

I have a new theory. Maybe we have a function with same name but different functionality problem. Therefore the differences we see are due to shared library load order. To test this I have dumped the load order I see by doing

strace cmsRun step2_DIGI_L1_DIGI2RAW_HLT_RAW2DIGI_L1Reco_PU.py > & trace.log
grep open trace.log | grep '.so' | grep -v '= -1' > library_load_order_mixing_module.trace
I have have put library_load_order_mixing_module.trace in /afs/cern.ch/user/c/chrjones/public.

So if one of you who sees the problem can do the same, we can compare. I was using 'CMSSW_7_0_X_2013-11-28-0200' on lxplus.


Reply to this email directly or view it on GitHub.

@Dr15Jones
Copy link
Contributor

Fabio,
Do you know what steps you took which gave you so many additional packages? From my history I see I did

scram project CMSSW_7_0_X_2013-11-28-0200
cd src/
git cms-init
git fetch official-cmssw
git fetch https://github.com/wmtan/cmssw.git ProtectAgainstMultipleOffsets:ProtectAgainstMultipleOffsets
git checkout ProtectAgainstMultipleOffsets
git cms-addpkg SimDataFormats/CrossingFrame
git cms-addpkg SimGeneral/MixingModule
git cms-checkdeps -a
scram b -j 12
cd ../
mkdir matrix
cd matrix/
runTheMatrix.py -l '201.0'

So in my case, I think git cms-checkdeps -a would not give me dependent packages with version corresponding to CMSSW_7_0_X_2013-11-28-0200, but rather corresponding to what Bill's code was based off originally. So we might be seeing a bad interaction from the merge of Bill's code with something else in the release.

@Dr15Jones
Copy link
Contributor

I did git log on my work area and I saw no indication of Bill's commit there. So it appears my fetch failed. I then switched to from-CMSSW_7_0_X_2013-11-28-0200 and tried a git cms-merge-topic 1615 but that failed because from-CMSSW_7_0_X_2013-11-28-0200 already existed. So I duplicated that branch to a new name, deleted the old one and did the git cms-merge-topic 1615 once more in the newly named branch. doing git log on the resulting job did show Bill's commit as the most recent on my new branch. I will now test that one.

@Dr15Jones
Copy link
Contributor

My new branch made from git cms-merge-topic 1615 does show the 'get' problem. So the reason mine worked must have been from a git error on my part.

@Dr15Jones
Copy link
Contributor

I have now run the job in the debugger. "CastorBU" is not the first data requested, it is the second. The second lookup does see that the branch was registered and goes to the input file to read it. However, the framework says that the data is not available in the file for that event.

@Dr15Jones
Copy link
Contributor

I just looked at the pileup file

root -l root://eoscms//eos/cms/store/relval/CMSSW_6_2_0_pre8/RelValMinBias/GEN-SIM/PRE_ST62_V8-v1/00000/7A58AD6E-BFE0-E211-BA29-003048D375AE.root 

Then in root I checked the branch to see if the

Events->Scan("PCaloHits_g4SimHits_CastorBU_SIM.present")

which gave

************************
*    Row   * PCaloHits *
************************
*        0 *         0 *
*        1 *         0 *
*        2 *         0 *
*        3 *         0 *
*        4 *         0 *

which means though the branch is in the file, there is no data stored in that branch. To test that I checked the first data objects requested by the mixing module

root [2] Events->Scan("PCaloHits_g4SimHits_CaloHitsTk_SIM.present")

Which showed

************************
*    Row   * PCaloHits *
************************
*        0 *         1 *
*        1 *         1 *
*        2 *         1 *
*        3 *         1 *
*        4 *         1 *

So as far as I can see, the system is correct, the data isn't there or that (or any other) event.

Now I don't know why the previous version ignored missing data but it is missing.

@Dr15Jones
Copy link
Contributor

@wmtan @fabiocos @mdhildreth I think I've figured out what happened. The source file doesn't have any 'CastorBU' PCaloHits. Originally, the MixingModule checked each event to see if the source contained the relevant data products and only if it did would the workers be added to the list of workers to process that event. However, when build added the Adjusters, he made all the Adjusters run for each event, not just those Adjusters for which there was data in the Event. This is what causes the exception to occur since we are trying to adjust the CastorBU collection even though it would never be used.

@wmtan
Copy link
Contributor Author

wmtan commented Dec 2, 2013

Believe it or not, I found time to debug this tonight. I came to the exact same conclusion as Chris.
Adjusters are running for all possible hits, including Castor, Totem!!, the TestBeam!!! hits, etc!
A fatal exception is thrown for a hit not found.
So, I need to figure out why adjusters are running for stuff that is not in the data. We either need to remove those adjusters (best), or avoid the exceptions (easy, but a last resort).

@cmsbuild
Copy link
Contributor

cmsbuild commented Dec 2, 2013

Pull request #1615 was updated. @civanch, @Dr15Jones, @mdhildreth, @cmsbuild, @nclopezo, @giamman, @ktf can you please check and sign again.

@wmtan
Copy link
Contributor Author

wmtan commented Dec 3, 2013

This pull request has been redone, and retested for 7_0_X. All the relvals that failed last time now pass.
There is no crash. (There never was a crash). The Adjusters have been pruned so that the unnecessary adjusters that caused the not found exceptions are no longer there.
It needs to be tested that the time shifts are applied properly.

@mdhildreth
Copy link
Contributor

+1

!!!

On Dec 2, 2013, at 7:00 PM, wmtan notifications@github.com wrote:

This pull request has been redone, and retested for 7_0_X. All the relvals that failed last time now pass.
There is no crash. (There never was a crash). The Adjusters have been pruned so that the unnecessary adjusters that caused the not found exceptions are no longer there.
It needs to be tested that the time shifts are applied properly.


Reply to this email directly or view it on GitHub.


Mike Hildreth e-mail: mikeh@undhep.hep.nd.edu
Department of Physics mikeh@fnal.gov
225 Nieuwland Sciences Hall telephone: 574-631-6458 (office)
University of Notre Dame 574-631-5952 (FAX)
Notre Dame, IN 46556 http://www.hep.nd.edu/MikeHildreth

@cmsbuild
Copy link
Contributor

cmsbuild commented Dec 3, 2013

-1
When I ran the RelVals I found an error in the following worklfows:
1003.0 step2

runTheMatrix-results/1003.0_RunMinBias2012A+RunMinBias2012A+RECODDQM+HARVESTDDQM/step2_RunMinBias2012A+RunMinBias2012A+RECODDQM+HARVESTDDQM.log

you can see the results of the tests here:
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-1615/1558/summary.html

@fabiocos
Copy link
Contributor

fabiocos commented Dec 3, 2013

Are you referring to the fact that a real data workflow didn't run because of a DAS timeout in providing an output file?
This 1) is not a release problem but an infrastructure one at most and anyway 2) is irrelevant for this change affecting only the simulation.

On Dec 3, 2013, at 11:53 AM, cmsbuild notifications@github.com wrote:

-1
When I ran the RelVals I found an error in the following worklfows:
1003.0 step2

runTheMatrix-results/1003.0_RunMinBias2012A+RunMinBias2012A+RECODDQM+HARVESTDDQM/step2_RunMinBias2012A+RunMinBias2012A+RECODDQM+HARVESTDDQM.log

you can see the results of the tests here:
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-1615/1558/summary.html


Reply to this email directly or view it on GitHub.

@nclopezo
Copy link
Contributor

nclopezo commented Dec 3, 2013

Hi Fabio,

Sorry, that message is automatic, jenkins saw the failure and automatically posted the message here. I can talk to Giulio to discuss how should we handle it when this happens.

@fabiocos
Copy link
Contributor

fabiocos commented Dec 3, 2013

Ok, so nothing is preventing integration I assume. I see the Core signature missing, but I assume this is just forgotten,
as the code has been provided by the group itself. Do we have a back-port of this to 62X? That is probably even more
urgently needed.

On Dec 3, 2013, at 1:38 PM, David Mendez notifications@github.com wrote:

Hi Fabio,

Sorry, that message is automatic, jenkins saw the failure and automatically posted the message here. I can talk to Giulio to discuss how should we handle it when this happens.


Reply to this email directly or view it on GitHub.

@Dr15Jones
Copy link
Contributor

+1

@cmsbuild
Copy link
Contributor

cmsbuild commented Dec 3, 2013

This pull request is fully signed and it will be integrated in one of the next IBs unless changes or unless it breaks tests. @ktf can you please take care of it?

nclopezo added a commit that referenced this pull request Dec 3, 2013
Protect against multiple shifting for out of time pileup
@nclopezo nclopezo merged commit e83f4e5 into cms-sw:CMSSW_7_0_X Dec 3, 2013
@wmtan wmtan deleted the ProtectAgainstMultipleOffsets branch December 3, 2013 17:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants