Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Crashes in muon HLT reconstruction (reco::TrackExtra product not found) #39064

Closed
missirol opened this issue Aug 15, 2022 · 19 comments · Fixed by #39245
Closed

Crashes in muon HLT reconstruction (reco::TrackExtra product not found) #39064

missirol opened this issue Aug 15, 2022 · 19 comments · Fixed by #39245

Comments

@missirol
Copy link
Contributor

Over the last few weeks, HLT suffered 3 online crashes coming the HLT-muon reconstruction; the first 2 errors are almost identical, while the 3rd one is rather similar to the first 2, but comes from a different producer. The error messages are given below [*].

So far, no one has been able to reproduce any of these errors locally with the relevant error-stream files.

I open an issue to keep track of this, and to ask experts if the error messages suggest to them anything about what might be going wrong.

This config file should be representative of the HLT menu used online during the crashes (representative at least for what concerns the sequences that contain the problematic modules).

FYI: @JanFSchulte @khaosmos93 (Muon-HLT contacts), @silviodonato @Martin-Grunewald @fwyzard

[*]

  1. Run-356433 (CMSSW_12_4_3, Jul 29th):
[2] Calling method for module TSGFromL2Muon/'hltL3TrajSeedOIStateNoVtx'
Exception Message:
RefCore: A request to resolve a reference to a product of type 'std::vector<reco::TrackExtra>' with ProductID '1:2802'
can not be satisfied because the product cannot be found.
Probably the branch containing the product is not stored in the input file.
Additional Info:
[a] If you wish to continue processing events after a ProductNotFound exception,
add "SkipEvent = cms.untracked.vstring('ProductNotFound')" to the "options" PSet in the configuration.
  1. Run-356530 (CMSSW_12_4_3, Aug 1st):
[2] Calling method for module TSGFromL2Muon/'hltL3TrajSeedOIStateNoVtx'
Exception Message:
RefCore: A request to resolve a reference to a product of type 'std::vector<reco::TrackExtra>' with ProductID '1:2804'
can not be satisfied because the product cannot be found.
Probably the branch containing the product is not stored in the input file.
Additional Info:
[a] If you wish to continue processing events after a ProductNotFound exception,
add "SkipEvent = cms.untracked.vstring('ProductNotFound')" to the "options" PSet in the configuration.
  1. Run-357442 (CMSSW_12_4_6, Aug 14th):
[2] Calling method for module TSGForOIFromL2/'hltIterL3OISeedsFromL2MuonsNoVtx'
Exception Message:
RefCore: A request to resolve a reference to a product of type 'std::vector<reco::TrackExtra>' with ProductID '1:2782'
can not be satisfied because the product cannot be found.
Probably the branch containing the product is not stored in the input file.
Additional Info:
[a] If you wish to continue processing events after a ProductNotFound exception,
add "SkipEvent = cms.untracked.vstring('ProductNotFound')" to the "options" PSet in the configuration.
@cmsbuild
Copy link
Contributor

A new Issue was created by @missirol Marino Missiroli.

@Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar, @qliphy can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@makortel
Copy link
Contributor

assign hlt

@cmsbuild
Copy link
Contributor

New categories assigned: hlt

@missirol,@Martin-Grunewald you have been requested to review this Pull request/Issue and eventually sign? Thanks

@missirol missirol changed the title HLT crashes from muon HLT reconstruction (reco::TrackExtra product not found) Crashes in muon HLT reconstruction (reco::TrackExtra product not found) Aug 15, 2022
@JanFSchulte
Copy link
Contributor

The new error is at least somewhat illuminating since it comes from a different CMSSW module as before, one I am more familiar with. Looking at the code, the only place where a TrackRef is created in that module is here https://github.com/cms-sw/cmssw/blob/6ba826d71058bcc4456e66ad7dc2c01edc542626/RecoMuon/TrackerSeedGenerator/plugins/TSGForOIFromL2.cc#L118, which suggests that the missing TrackExtra collection is the one associated with L2 muons. However, I have still no idea why that one would be missing, since the L2 muons themselves are clearly there.

@missirol
Copy link
Contributor Author

For completeness, below are the details to rerun the relevant HLT menu on the problematic events, even though the 3 HLT crashes do not seem to be reproducible offline.

  1. Run-356433 (CMSSW_12_4_3)

    • Error:
    [2] Calling method for module TSGFromL2Muon/'hltL3TrajSeedOIStateNoVtx'
    Exception Message:
    RefCore: A request to resolve a reference to a product of type 'std::vector<reco::TrackExtra>' with ProductID '1:2802'
    can not be satisfied because the product cannot be found.
    Probably the branch containing the product is not stored in the input file.
    Additional Info:
    [a] If you wish to continue processing events after a ProductNotFound exception,
    add "SkipEvent = cms.untracked.vstring('ProductNotFound')" to the "options" PSet in the configuration.
    
    • Recipe (not reproducing error):
    cmsrel CMSSW_12_4_3; cd CMSSW_12_4_3/src; cmsenv
    hltConfigFromDB --runNumber 356433 > hlt.py
    cat >> hlt.py <<@EOF
    process.source.fileListMode = True
    process.source.fileNames = ['/afs/cern.ch/work/m/missirol/public/fog/error_stream/run356433/run356433_ls0007_index000039_fu-c2b01-15-01_pid176387.raw']
    @EOF
    cmsRun hlt.py &> hlt.log
  2. Run-356530 (CMSSW_12_4_3)

    • Error:
    [2] Calling method for module TSGFromL2Muon/'hltL3TrajSeedOIStateNoVtx'
    Exception Message:
    RefCore: A request to resolve a reference to a product of type 'std::vector<reco::TrackExtra>' with ProductID '1:2804'
    can not be satisfied because the product cannot be found.
    Probably the branch containing the product is not stored in the input file.
    Additional Info:
    [a] If you wish to continue processing events after a ProductNotFound exception,
    add "SkipEvent = cms.untracked.vstring('ProductNotFound')" to the "options" PSet in the configuration.
    
    • Recipe (not reproducing error):
    cmsrel CMSSW_12_4_3; cd CMSSW_12_4_3/src; cmsenv
    hltConfigFromDB --runNumber 356530 > hlt.py
    cat >> hlt.py <<@EOF
    process.source.fileListMode = True
    process.source.fileNames = ['/afs/cern.ch/work/m/missirol/public/fog/error_stream/run356530/run356530_ls0004_index000064_fu-c2b04-40-01_pid1803760.raw']
    @EOF
    cmsRun hlt.py &> hlt.log
  3. Run-357442 (CMSSW_12_4_6)

    • Error:
    [2] Calling method for module TSGForOIFromL2/'hltIterL3OISeedsFromL2MuonsNoVtx'
    Exception Message:
    RefCore: A request to resolve a reference to a product of type 'std::vector<reco::TrackExtra>' with ProductID '1:2782'
    can not be satisfied because the product cannot be found.
    Probably the branch containing the product is not stored in the input file.
    Additional Info:
    [a] If you wish to continue processing events after a ProductNotFound exception,
    add "SkipEvent = cms.untracked.vstring('ProductNotFound')" to the "options" PSet in the configuration.
    
    • Recipe (not reproducing error):
    cmsrel CMSSW_12_4_6; cd CMSSW_12_4_6/src; cmsenv
    hltConfigFromDB --runNumber 357442 > hlt.py
    cat >> hlt.py <<@EOF
    process.source.fileListMode = True
    process.source.fileNames = ['/afs/cern.ch/work/m/missirol/public/fog/error_stream/run357442/run357442_ls0880_index000100_fu-c2b02-39-01_pid416797.raw']
    @EOF
    cmsRun hlt.py &> hlt.log

@makortel
Copy link
Contributor

makortel commented Aug 19, 2022

I ran the recipe for Run-357442 by adding EventContentAnalyzer, and see that the ProductID 1:2782 corresponds to hltL2CosmicMuons module. In the configuration, I see hltL2CosmicMuons would be consumed by hltL2MuonCandidatesNoVtxMeanTimerCosmicSeed EDProducer and several EDFilters, and hltL2MuonCandidatesNoVtxMeanTimerCosmicSeed only by several EDFilters. Assuming these EDFilters do not produce data products into the event (that anything else would be consuming), I don't see how a TrackExtraRef to hltL2CosmicMuons could propagate to hltIterL3OISeedsFromL2MuonsNoVtx. (but I could easily be missing something)

@makortel
Copy link
Contributor

Ok, so I ran my earlier test in a flawed way that lead the ProductIDs to shift. Being more careful I see 1:2782 to correspond hltL2Muons which makes more sense (but doesn't really reveal anything new).

@missirol
Copy link
Contributor Author

Thanks for having a look, Matti.

Trying to find where the output collections of hltL2Muons are produced (incl. the TrackExtras), and I think this happens in

OrphanHandle<reco::TrackCollection> MuonTrackLoader::loadTracks(TrajectoryContainer& trajectories,

A non-expert look inside suggests that the trackExtra ref is always set for the output tracks (I might be wrong):

track.setExtra(trackExtraRef);

Also, I don't know enough to really understand the error message. Based on the latter, can one say that the collection std::vector<reco::TrackExtra> was not produced during the job?

@makortel
Copy link
Contributor

The exception in RefCore is thrown here

void RefCore::productNotFoundException(std::type_info const& type) const {
throw edm::Exception(errors::ProductNotFound)
<< "RefCore: A request to resolve a reference to a product of type '" << TypeID(type) << "' with ProductID '"
<< id() << "'"
<< "\ncan not be satisfied because the product cannot be found."
<< "\nProbably the branch containing the product is not stored in the input file.\n";
}

which is called from RefCore::getProductPtr()
WrapperBase const* product = prodGetter->getIt(tId);
if (product == nullptr) {
productNotFoundException(type);
}

or from RefCore::getThinnedProductPtr()
std::optional product = prodGetter->getThinnedProduct(tId, key);
if (not product.has_value()) {
productNotFoundException(type);
}

Since the TrackExtraRef is a Ref to a std::vector<reco::TrackExtra>, the getThinnedProductPtr() is actually the one being called (even if there is no thinning going on). The code goes to check the existence of a thinned product if a non-thinned one is not found in
C const* prod = edm::template tryToGetProductWithCoreFromRef<C>(product, getter);
if (prod != nullptr) {
F func;
item = func(*prod, key);
product.setProductPtr(item);
return item;
}
unsigned int thinnedKey;
std::tie(prod, thinnedKey) = edm::template getThinnedProduct<C>(product, key, getter);

The tryToGetProductWithCoreFromRef() ends up calling
WrapperBase const* product = prodGetter->getIt(tId);

(as in RefCore::getProductPtr()).

The prodGetter->getIt() call goes into

WrapperBase const* EventPrincipal::getIt(ProductID const& pid) const { return getByProductID(pid).wrapper(); }

and
BasicHandle EventPrincipal::getByProductID(ProductID const& pid) const {

where this case could happen only if there is no product at all with the given ID, or the producing module was run but did not produce the product (both of which should be reproducible, and should not be the case where).

I was first wondering why exactly the code ends up searching the vector<TrackExtra> product by ProductID, but that is because the TrackExtraRef is constructed from a RefProd<...> and index in

reco::TrackExtraRef trackExtraRef(trackExtraCollectionRefProd, trackExtraIndex++);

The conditions leading to the ProductNotFound exception should be reproducible. The fact that there is a rare failure makes me wonder a (strange) race condition somewhere.

@makortel
Copy link
Contributor

assign core

@cmsbuild
Copy link
Contributor

New categories assigned: core

@Dr15Jones,@smuzaffar,@makortel you have been requested to review this Pull request/Issue and eventually sign? Thanks

@makortel
Copy link
Contributor

@Dr15Jones and I discovered a "logical race condition" in the framework that would cause symptoms like this (but we can't tell if it is really causing these problems).

The L2MuonProducer that produced the Track and TrackExtra collections declares first the production of the Track collection and then the TrackExtra collection

produces<reco::TrackCollection>();
produces<reco::TrackCollection>("UpdatedAtVtx");
produces<TrackingRecHitCollection>();
produces<reco::TrackExtraCollection>();

and Track objects hold Refs to the TrackExtra. Downstream modules consume only the Track collection. The order of produces() declarations dictates the order where Event::commit_aux() moves the products into the EventPrincipal (and to the corresponding ProductResolvers) after the EDProducer::produce() has successfully finished. When a scheduled module puts a product in a ProductResolver, the consumers of that product become eligible to run (unless some other product they depend on have not yet been produced)
void PuttableProductResolver::putProduct(std::unique_ptr<WrapperBase> edp) const {
ProducedProductResolver::putProduct(std::move(edp));
bool expected = false;
if (prefetchRequested_.compare_exchange_strong(expected, true)) {
m_waitingTasks.doneWaiting(std::exception_ptr());
}
}

This means that the following can happen

  1. the Track collection is put in the ProductResolver (during the end of hltL2Muons module's produce)
  2. a consumer of that Track collection (e.g. hltIterL3OISeedsFromL2MuonsNoVtx) gets run by another thread
    • the other thread must have nothing else to do in order to steal a task from the thread still running hltL2Muons
  3. the operating system pauses the thread running the hltL2Muons
  4. the consumer calls a method Track that de-references the TrackExtraRef, but the TrackExtra collection is not in the corresponding ProductResolver yet
  5. the hltL2Muons puts the TrackExtra collection into the ProductResolver, but the job is going to terminate

A quick workaround (which I'm going to prepare) is to declare first the production of TrackExtra collection and only then the Track collection.

We will need some time to think for a more general solution (I'd guess the TrackExtra is not the only collection type we have that is only referenced-to by other products without explicit consumption).

As far as I can tell, unscheduled modules are not affected, because for them the product insertion into ProductResolver does not impact module scheduling. Instead, upon prefetch a task that releases all the modules consuming that product is inserted into the WaitingTaskList of the Worker producing the product.

void UnscheduledProductResolver::prefetchAsync_(WaitingTaskHolder waitTask,
Principal const& principal,
bool skipCurrentProcess,
ServiceToken const& token,
SharedResourcesAcquirer* sra,
ModuleCallingContext const* mcc) const {
if (skipCurrentProcess) {
return;
}
if (worker_ == nullptr) {
throw cms::Exception("LogicError") << "UnscheduledProductResolver::prefetchAsync_() called with null worker_. "
"This should not happen, please contact framework developers.";
}
//need to try changing prefetchRequested_ before adding to waitingTasks_
bool expected = false;
bool prefetchRequested = prefetchRequested_.compare_exchange_strong(expected, true);
waitingTasks_.add(waitTask);
if (prefetchRequested) {
//Have to create a new task which will make sure the state for UnscheduledProductResolver
// is properly set after the module has run
auto t = make_waiting_task([this](std::exception_ptr const* iPtr) {
//The exception is being rethrown because resolveProductImpl sets the ProductResolver to a failed
// state for the case where an exception occurs during the call to the function.
// Caught exception is propagated via WaitingTaskList
CMS_SA_ALLOW try {
resolveProductImpl<true>([iPtr]() {
if (iPtr) {
std::rethrow_exception(*iPtr);
}
});
} catch (...) {
waitingTasks_.doneWaiting(std::current_exception());
return;
}
waitingTasks_.doneWaiting(nullptr);
});
ParentContext parentContext(mcc);
EventTransitionInfo const& info = aux_->eventTransitionInfo();
worker_->doWorkAsync<OccurrenceTraits<EventPrincipal, BranchActionStreamBegin> >(
WaitingTaskHolder(*waitTask.group(), t),
info,
token,
info.principal().streamID(),
parentContext,
mcc->getStreamContext());
}
}

makortel added a commit to makortel/cmssw that referenced this issue Aug 25, 2022
This change works around a rare scheduling bug in the framework when
these modules are run as scheduled, see
cms-sw#39064
@makortel
Copy link
Contributor

A quick workaround (which I'm going to prepare) is to declare first the production of TrackExtra collection and only then the Track collection.

The PR is here #39201. I limited to modules that I easily saw were being used in the HLT.

missirol pushed a commit to missirol/cmssw that referenced this issue Aug 26, 2022
This change works around a rare scheduling bug in the framework when
these modules are run as scheduled, see
cms-sw#39064
@makortel
Copy link
Contributor

I was able to reproduce the exception with the "Run-357442" by adding a 1-second sleep between the iterations of the loop

for (unsigned int i = 0; i < products.size(); ++i) {
auto& p = get_underlying_safe(products[i]);
if (p) {
if (recordProv[i]) {
ep.put(provRecorder_.putTokenIndexToProductResolverIndex()[i], std::move(p), *presentParentageId);
} else {
ep.put(provRecorder_.putTokenIndexToProductResolverIndex()[i], std::move(p), s_emptyParentage);
}
}
}

for the hltL2Muons module (running with 1 stream and 4 threads).

I tested the mitigation in #39201 with this setup, and the test job succeeded. Of course this test does not imply that there couldn't be some other module(s), or other pairs of similar data products that could cause this problem to show up.

@makortel
Copy link
Contributor

I think I have a proper fix in #39245. At least it works with the "reproducer" in #39064 (comment) .

@makortel
Copy link
Contributor

#39245 converged. I also made backports to 12_4_X and 12_5_X in #39484 and #39485.

@missirol
Copy link
Contributor Author

+hlt

Thanks @makortel

@makortel
Copy link
Contributor

+core

@cmsbuild
Copy link
Contributor

This issue is fully signed and ready to be closed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants