Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix BranchChildren/Parentage problems that occur with SubProcess and EDAlias #17950

Merged
merged 3 commits into from Mar 20, 2017

Conversation

wddgit
Copy link
Contributor

@wddgit wddgit commented Mar 15, 2017

Also adds unit tests to cover these cases.

…EDAlias

Also adds unit tests to cover these cases.
In the previous commit, I added pruning of BranchChildren.
It was a nice little algorithm and I wanted to get it into
the git history. But after I wrote it I realized one cannot
prune with this simple algorithm with secondary file input.
@cmsbuild
Copy link
Contributor

A new Pull Request was created by @wddgit (W. David Dagenhart) for master.

It involves the following packages:

DataFormats/Provenance
FWCore/Framework
FWCore/Integration
IOPool/Input
IOPool/Output

@cmsbuild, @smuzaffar, @Dr15Jones, @davidlange6 can you please review it and eventually sign? Thanks.
@Martin-Grunewald, @wmtan this is something you requested to watch as well.
@Muzaffar, @davidlange6, @smuzaffar you are the release manager for this.

cms-bot commands are listed here #13028

@Dr15Jones
Copy link
Contributor

please test

@cmsbuild
Copy link
Contributor

cmsbuild commented Mar 15, 2017

The tests are being triggered in jenkins.
https://cmssdt.cern.ch/jenkins/job/ib-any-integration/18470/console Started: 2017/03/16 00:46

@cmsbuild
Copy link
Contributor

-1

Tested at: 6cb2b78

You can see the results of the tests here:
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-17950/18470/summary.html

I found follow errors while testing this PR

Failed tests: UnitTests

  • Unit Tests:

I found errors in the following unit tests:

---> test TestSubProcess had ERRORS

@cmsbuild
Copy link
Contributor

Comparison job queued.

@cmsbuild
Copy link
Contributor

Comparison is ready
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-17950/18470/summary.html

There are some workflows for which there are errors in the baseline:
25.0 step 5
The results for the comparisons for these workflows could be incomplete
This means most likely that the IB is having errors in the relvals.The error does NOT come from this pull request

@Dr15Jones
Copy link
Contributor

Please test

@cmsbuild
Copy link
Contributor

cmsbuild commented Mar 16, 2017

The tests are being triggered in jenkins.
https://cmssdt.cern.ch/jenkins/job/ib-any-integration/18489/console Started: 2017/03/16 15:03

Copy link
Contributor

@Dr15Jones Dr15Jones left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done with my review.

@@ -45,6 +45,7 @@ namespace edm {

ProductProvenance const* productProvenance() const;
BranchID const& branchID() const {return stable().branchID();}
BranchID const& originalBranchID() const {return stable().originalBranchID();}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a comment explaining the difference between branchID and originalBranchID

@@ -36,6 +38,7 @@ namespace edm {
preg_(new SignallingProductRegistry(preg)),
branchIDListHelper_(new BranchIDListHelper),
thinnedAssociationsHelper_(new ThinnedAssociationsHelper),
subProcessParentageHelper_(new SubProcessParentageHelper),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about switching to std::make_shared<>()? That is actually more memory efficient

@@ -141,6 +144,7 @@ namespace edm {
*preg_,
*branchIDListHelper_,
*thinnedAssociationsHelper_,
subProcessParentageHelper_ ? &*subProcessParentageHelper_ : nullptr,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably simpler to do subProcessParentageHelper_ ? subProcessParentageHelper_.get() : nullptr

@@ -227,7 +227,7 @@ namespace edm {
std::shared_ptr<LuminosityBlockAuxiliary> fillLumiAuxiliary();
std::shared_ptr<RunAuxiliary> fillRunAuxiliary();
std::string const& newBranchToOldBranch(std::string const& newBranch) const;
void markBranchToBeDropped(bool dropDescendants, BranchID const& branchID, std::set<BranchID>& branchesToDrop) const;
void markBranchToBeDropped(bool dropDescendants, BranchDescription const& branchy, std::set<BranchID>& branchesToDrop, std::map<BranchID, BranchID> const& droppedToKeptAlias) const;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

branchy seems like an odd name

@Dr15Jones
Copy link
Contributor

@davidlange6 as far as David Dagenhart and I can tell, the unit test failure is from the problem fixed by #17946 and not from this pull request.

@davidlange6
Copy link
Contributor

davidlange6 commented Mar 20, 2017 via email

@cmsbuild
Copy link
Contributor

cmsbuild commented Mar 20, 2017

The tests are being triggered in jenkins.
https://cmssdt.cern.ch/jenkins/job/ib-any-integration/18562/console Started: 2017/03/20 16:51

@cmsbuild
Copy link
Contributor

@cmsbuild
Copy link
Contributor

Comparison job queued.

@davidlange6
Copy link
Contributor

+1

@cmsbuild cmsbuild merged commit ae40518 into cms-sw:master Mar 20, 2017
@cmsbuild
Copy link
Contributor

wddgit added a commit to wddgit/cmssw that referenced this pull request Mar 21, 2017
The intent is that dropOnInput removes entries in the
ProductRegistry in addition to dropping the products.
The recent commit inadvertently disabled that for branches
that were already dropped.

And also this exposed the fact that GetterOfProducts requires
dictionaries for all products in the ProductRegistry, even the
dropped ones. And this should not be necessary and wastes a little
CPU, so that is fixed also.
cmsbuild added a commit that referenced this pull request Mar 22, 2017
Fix problem introduced by recent PR #17950 related to dropOnInput
@dmitrijus
Copy link
Contributor

dmitrijus commented Mar 24, 2017

Hi @wddgit, can this be backported to 90x?

This problem causes a segfault in IOPool/Streamer modules (when using data from GEN-SIM), this PR fixes it.

@wddgit
Copy link
Contributor Author

wddgit commented Mar 24, 2017

@dmitrijus We can probably backport the change, but we are interested in understanding how and why the seg fault is occurring. Can you tell me how to reproduce it so I can investigate this? If you already know exactly how the seg fault is occurring, can you explain it? Thanks.

@dmitrijus
Copy link
Contributor

@wddgit
To reproduce merge #18066 and use
cmsRun DQMStreamerOutputRepackerTest_cfg.py <any_root_with_mc_data>.root

The crash is a (double) null-pointer dereference in this line,
https://github.com/cms-sw/cmssw/blob/master/IOPool/Streamer/src/StreamSerializer.cc#L149

@wddgit
Copy link
Contributor Author

wddgit commented Mar 24, 2017

@Dr15Jones I reproduced the problem. If you write a file containing an EDAlias to a ROOT format output file and then in a subsequent process you read that file and try to run the streamer output module, then a seg fault occurs. The problem is indeed in the line he pointed to in StreamSerializer.cc. It successfully get the product using the EDAlias, then using the EDAlias BranchID tries to get the ProductProvenance, but that is stored keyed on the original BranchID not the EDAlias BranchID so it fails and returns a nullptr. The streamer output module does not check the pointer and dereferences and there is a seg fault. This was fixed by this PR in 9_1_X, but not backported to 9_0_X

I reproduced the error running a Framework unit test to make the file with the EDAlias
(cmsRun FWCore/Integration/test/testGetBy1_cfg.py)

and the following to read it and seg fault. It has nothing to do with the DQM PR.

import FWCore.ParameterSet.Config as cms

process = cms.Process("PROD2")

process.source = cms.Source("PoolSource",
    fileNames = cms.untracked.vstring(
        'file:testGetBy1.root'
    )
)

process.out = cms.OutputModule("EventStreamFileWriter",
    fileName = cms.untracked.string('testSeriesOfProcessesHLT.dat'),
    compression_level = cms.untracked.int32(1),
    use_compression = cms.untracked.bool(True),
    max_event_size = cms.untracked.int32(7000000)
)

process.e = cms.EndPath(process.out)

Should I backport the entire PR (and the other related PR that followed) to 9_0_X? This has probably been there for a while. It is somewhat unusual to run streamer output on a file containing an EDAlias and that is probably why no one noticed before.

@wddgit
Copy link
Contributor Author

wddgit commented Mar 24, 2017

I did not try it but this might also fail if you dropped the parentage and tried to run streamer output. So it could be looked at as also a bug in streamer output.

@Dr15Jones
Copy link
Contributor

Let's start with a fix to the Streamer. Both in 9_1 and 9_0.

@wddgit
Copy link
Contributor Author

wddgit commented Mar 27, 2017

I'm working on fixing the Streamer IO. It actually is not built to handle a nullptr for the ProductProvenance at all so this is more than a one or two line fix. I'll need to trace this all the way through streamer input and output and then make several little fixes in the entire streamer input and output chain to make it work.

@dmitrijus
Copy link
Contributor

@wddgit Thanks!

@wddgit
Copy link
Contributor Author

wddgit commented Mar 30, 2017

@dmitrijus I submitted pull requests for the Streamer Output Module problem. PR #18126 was already merged into 9_1_X and PR #18135 is the back port to 9_0_X which was just submitted (not merged yet). These should be sufficient to avoid the seg fault you saw.
@Dr15Jones My understanding is that we are not backporting this PR because we want to maintain stability in 9_0_X, avoid the effort to resolve conflicts, and the risk associated with it. Let me know if you want me to backport this also.

@wddgit wddgit deleted the fixBranchChildren branch August 2, 2017 14:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants