GsfElectronProducer Reading off the end of a std::vector #38175

Closed
Dr15Jones opened this issue Jun 1, 2022 · 20 comments

Comments

@Dr15Jones
Contributor

Running a PR test under ASAN showed the problem in the addOnTest hlt_mc_PIon

==30644==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x60200315e05c at pc 0x2ac4da578973 bp 0x2ac4c2eec170 sp 0x2ac4c2eec168
READ of size 4 at 0x60200315e05c thread T4

This appears to be related to the use of the DNN.

@Dr15Jones
Contributor Author

The full traceback from ASAN is

==30644==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x60200315e05c at pc 0x2ac4da578973 bp 0x2ac4c2eec170 sp 0x2ac4c2eec168
READ of size 4 at 0x60200315e05c thread T4
Begin processing the 5th record. Run 1, Event 6, LumiSection 1 on stream 2 at 01-Jun-2022 03:53:46.621 CEST
    #0 0x2ac4da578972 in GsfElectronProducer::produce(edm::Event&, edm::EventSetup const&) (/cvmfs/cms-ib.cern.ch/nweek-02735/el8_amd64_gcc10/cms/cmssw/CMSSW_12_5_ASAN_X_2022-05-30-1100/lib/el8_amd64_gcc10/pluginRecoEgammaEgammaElectronProducersPlugins.so+0x169972)
    #1 0x2ac46a9c9d1f in edm::stream::EDProducerAdaptorBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) (/cvmfs/cms-ib.cern.ch/nweek-02735/el8_amd64_gcc10/cms/cmssw/CMSSW_12_5_ASAN_X_2022-05-30-1100/lib/el8_amd64_gcc10/libFWCoreFramework.so+0x8efd1f)
    #2 0x2ac46a952ee2 in edm::WorkerT<edm::stream::EDProducerAdaptorBase>::implDo(edm::EventTransitionInfo const&, edm::ModuleCallingContext const*) (/cvmfs/cms-ib.cern.ch/nweek-02735/el8_amd64_gcc10/cms/cmssw/CMSSW_12_5_ASAN_X_2022-05-30-1100/lib/el8_amd64_gcc10/libFWCoreFramework.so+0x878ee2)
    #3 0x2ac46a639a24 in decltype ({parm#1}()) edm::convertException::wrap<edm::Worker::runModule<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*)::{lambda()#1}>(edm::Worker::runModule<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*)::{lambda()#1}) (/cvmfs/cms-ib.cern.ch/nweek-02735/el8_amd64_gcc10/cms/cmssw/CMSSW_12_5_ASAN_X_2022-05-30-1100/lib/el8_amd64_gcc10/libFWCoreFramework.so+0x55fa24)
    #4 0x2ac46a63a26a in std::__exception_ptr::exception_ptr edm::Worker::runModuleAfterAsyncPrefetch<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(std::__exception_ptr::exception_ptr const*, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*) (/cvmfs/cms-ib.cern.ch/nweek-02735/el8_amd64_gcc10/cms/cmssw/CMSSW_12_5_ASAN_X_2022-05-30-1100/lib/el8_amd64_gcc10/libFWCoreFramework.so+0x56026a)
    #5 0x2ac46a645cc6 in edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >::execute() (/cvmfs/cms-ib.cern.ch/nweek-02735/el8_amd64_gcc10/cms/cmssw/CMSSW_12_5_ASAN_X_2022-05-30-1100/lib/el8_amd64_gcc10/libFWCoreFramework.so+0x56bcc6)
    #6 0x2ac46b349b81 in tbb::detail::d1::function_task<edm::WaitingTaskList::announce()::{lambda()#1}>::execute(tbb::detail::d1::execution_data&) (/cvmfs/cms-ib.cern.ch/nweek-02735/el8_amd64_gcc10/cms/cmssw/CMSSW_12_5_ASAN_X_2022-05-30-1100/lib/el8_amd64_gcc10/libFWCoreConcurrency.so+0x11b81)
    #7 0x2ac46c9a6ffb in tbb::detail::d1::task* tbb::detail::r1::task_dispatcher::local_wait_for_all<false, tbb::detail::r1::outermost_worker_waiter>(tbb::detail::d1::task*, tbb::detail::r1::outermost_worker_waiter&) /data/cmsbld/jenkins/workspace/jenkins-test-bootstrap/toolconf/BUILD/el8_amd64_gcc10/external/tbb/v2021.5.0-e966a5acb1e4d5fd7605074bafbb079c/tbb-v2021.5.0/src/tbb/task_dispatcher.h:322
    #8 0x2ac46c9a6ffb in tbb::detail::d1::task* tbb::detail::r1::task_dispatcher::local_wait_for_all<tbb::detail::r1::outermost_worker_waiter>(tbb::detail::d1::task*, tbb::detail::r1::outermost_worker_waiter&) /data/cmsbld/jenkins/workspace/jenkins-test-bootstrap/toolconf/BUILD/el8_amd64_gcc10/external/tbb/v2021.5.0-e966a5acb1e4d5fd7605074bafbb079c/tbb-v2021.5.0/src/tbb/task_dispatcher.h:463
    #9 0x2ac46c9a6ffb in tbb::detail::r1::arena::process(tbb::detail::r1::thread_data&) /data/cmsbld/jenkins/workspace/jenkins-test-bootstrap/toolconf/BUILD/el8_amd64_gcc10/external/tbb/v2021.5.0-e966a5acb1e4d5fd7605074bafbb079c/tbb-v2021.5.0/src/tbb/arena.cpp:138
    #10 0x2ac46c9b3592 in tbb::detail::r1::market::process(rml::job&) /data/cmsbld/jenkins/workspace/jenkins-test-bootstrap/toolconf/BUILD/el8_amd64_gcc10/external/tbb/v2021.5.0-e966a5acb1e4d5fd7605074bafbb079c/tbb-v2021.5.0/src/tbb/market.cpp:597
    #11 0x2ac46c9b3592 in tbb::detail::r1::rml::private_worker::run() /data/cmsbld/jenkins/workspace/jenkins-test-bootstrap/toolconf/BUILD/el8_amd64_gcc10/external/tbb/v2021.5.0-e966a5acb1e4d5fd7605074bafbb079c/tbb-v2021.5.0/src/tbb/private_server.cpp:267
    #12 0x2ac46c9b3592 in tbb::detail::r1::rml::private_worker::thread_routine(void*) /data/cmsbld/jenkins/workspace/jenkins-test-bootstrap/toolconf/BUILD/el8_amd64_gcc10/external/tbb/v2021.5.0-e966a5acb1e4d5fd7605074bafbb079c/tbb-v2021.5.0/src/tbb/private_server.cpp:221
    #13 0x2ac46d61e1ce in start_thread (/lib64/libpthread.so.0+0x81ce)
    #14 0x2ac46d86fd82 in clone (/lib64/libc.so.6+0x39d82)

0x60200315e05c is located 0 bytes to the right of 12-byte region [0x60200315e050,0x60200315e05c)
allocated by thread T4 here:
    #0 0x2ac4697bd607 in operator new(unsigned long) ../../../../libsanitizer/asan/asan_new_delete.cpp:99
    #1 0x2ac4c4a793e9 in egammaTools::EgammaDNNHelper::evaluate(std::vector<std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, float, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, float> > >, std::allocator<std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, float, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, float> > > > > const&, std::vector<tensorflow::Session*, std::allocator<tensorflow::Session*> > const&) const (/cvmfs/cms-ib.cern.ch/nweek-02735/el8_amd64_gcc10/cms/cmssw/CMSSW_12_5_ASAN_X_2022-05-30-1100/lib/el8_amd64_gcc10/libRecoEgammaEgammaTools.so+0x933e9)
    #2 0x2ac4c8e00e3a in ElectronDNNEstimator::evaluate(std::vector<reco::GsfElectron, std::allocator<reco::GsfElectron> > const&, std::vector<tensorflow::Session*, std::allocator<tensorflow::Session*> > const&) const (/cvmfs/cms-ib.cern.ch/nweek-02735/el8_amd64_gcc10/cms/cmssw/CMSSW_12_5_ASAN_X_2022-05-30-1100/lib/el8_amd64_gcc10/libRecoEgammaElectronIdentification.so+0x19e3a)
    #3 0x2ac4da5770f6 in GsfElectronProducer::produce(edm::Event&, edm::EventSetup const&) (/cvmfs/cms-ib.cern.ch/nweek-02735/el8_amd64_gcc10/cms/cmssw/CMSSW_12_5_ASAN_X_2022-05-30-1100/lib/el8_amd64_gcc10/pluginRecoEgammaEgammaElectronProducersPlugins.so+0x1680f6)
    #4 0x2ac46a9c9d1f in edm::stream::EDProducerAdaptorBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) (/cvmfs/cms-ib.cern.ch/nweek-02735/el8_amd64_gcc10/cms/cmssw/CMSSW_12_5_ASAN_X_2022-05-30-1100/lib/el8_amd64_gcc10/libFWCoreFramework.so+0x8efd1f)
    #5 0x2ac46a952ee2 in edm::WorkerT<edm::stream::EDProducerAdaptorBase>::implDo(edm::EventTransitionInfo const&, edm::ModuleCallingContext const*) (/cvmfs/cms-ib.cern.ch/nweek-02735/el8_amd64_gcc10/cms/cmssw/CMSSW_12_5_ASAN_X_2022-05-30-1100/lib/el8_amd64_gcc10/libFWCoreFramework.so+0x878ee2)
    #6 0x2ac46a639a24 in decltype ({parm#1}()) edm::convertException::wrap<edm::Worker::runModule<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*)::{lambda()#1}>(edm::Worker::runModule<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*)::{lambda()#1}) (/cvmfs/cms-ib.cern.ch/nweek-02735/el8_amd64_gcc10/cms/cmssw/CMSSW_12_5_ASAN_X_2022-05-30-1100/lib/el8_amd64_gcc10/libFWCoreFramework.so+0x55fa24)
    #7 0x2ac46a63a26a in std::__exception_ptr::exception_ptr edm::Worker::runModuleAfterAsyncPrefetch<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(std::__exception_ptr::exception_ptr const*, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*) (/cvmfs/cms-ib.cern.ch/nweek-02735/el8_amd64_gcc10/cms/cmssw/CMSSW_12_5_ASAN_X_2022-05-30-1100/lib/el8_amd64_gcc10/libFWCoreFramework.so+0x56026a)
    #8 0x2ac46a645cc6 in edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >::execute() (/cvmfs/cms-ib.cern.ch/nweek-02735/el8_amd64_gcc10/cms/cmssw/CMSSW_12_5_ASAN_X_2022-05-30-1100/lib/el8_amd64_gcc10/libFWCoreFramework.so+0x56bcc6)
    #9 0x2ac46b349b81 in tbb::detail::d1::function_task<edm::WaitingTaskList::announce()::{lambda()#1}>::execute(tbb::detail::d1::execution_data&) (/cvmfs/cms-ib.cern.ch/nweek-02735/el8_amd64_gcc10/cms/cmssw/CMSSW_12_5_ASAN_X_2022-05-30-1100/lib/el8_amd64_gcc10/libFWCoreConcurrency.so+0x11b81)
    #10 0x2ac46c9a6ffb in tbb::detail::d1::task* tbb::detail::r1::task_dispatcher::local_wait_for_all<false, tbb::detail::r1::outermost_worker_waiter>(tbb::detail::d1::task*, tbb::detail::r1::outermost_worker_waiter&) /data/cmsbld/jenkins/workspace/jenkins-test-bootstrap/toolconf/BUILD/el8_amd64_gcc10/external/tbb/v2021.5.0-e966a5acb1e4d5fd7605074bafbb079c/tbb-v2021.5.0/src/tbb/task_dispatcher.h:322
    #11 0x2ac46c9a6ffb in tbb::detail::d1::task* tbb::detail::r1::task_dispatcher::local_wait_for_all<tbb::detail::r1::outermost_worker_waiter>(tbb::detail::d1::task*, tbb::detail::r1::outermost_worker_waiter&) /data/cmsbld/jenkins/workspace/jenkins-test-bootstrap/toolconf/BUILD/el8_amd64_gcc10/external/tbb/v2021.5.0-e966a5acb1e4d5fd7605074bafbb079c/tbb-v2021.5.0/src/tbb/task_dispatcher.h:463
    #12 0x2ac46c9a6ffb in tbb::detail::r1::arena::process(tbb::detail::r1::thread_data&) /data/cmsbld/jenkins/workspace/jenkins-test-bootstrap/toolconf/BUILD/el8_amd64_gcc10/external/tbb/v2021.5.0-e966a5acb1e4d5fd7605074bafbb079c/tbb-v2021.5.0/src/tbb/arena.cpp:138
    #13 0x2ac46c9b3592 in tbb::detail::r1::market::process(rml::job&) /data/cmsbld/jenkins/workspace/jenkins-test-bootstrap/toolconf/BUILD/el8_amd64_gcc10/external/tbb/v2021.5.0-e966a5acb1e4d5fd7605074bafbb079c/tbb-v2021.5.0/src/tbb/market.cpp:597
    #14 0x2ac46c9b3592 in tbb::detail::r1::rml::private_worker::run() /data/cmsbld/jenkins/workspace/jenkins-test-bootstrap/toolconf/BUILD/el8_amd64_gcc10/external/tbb/v2021.5.0-e966a5acb1e4d5fd7605074bafbb079c/tbb-v2021.5.0/src/tbb/private_server.cpp:267
    #15 0x2ac46c9b3592 in tbb::detail::r1::rml::private_worker::thread_routine(void*) /data/cmsbld/jenkins/workspace/jenkins-test-bootstrap/toolconf/BUILD/el8_amd64_gcc10/external/tbb/v2021.5.0-e966a5acb1e4d5fd7605074bafbb079c/tbb-v2021.5.0/src/tbb/private_server.cpp:221

Thread T4 created by T0 here:
Begin processing the 6th record. Run 1, Event 8, LumiSection 1 on stream 1 at 01-Jun-2022 03:53:47.376 CEST
    #0 0x2ac469767282 in __interceptor_pthread_create ../../../../libsanitizer/asan/asan_interceptors.cpp:214
    #1 0x2ac46c9b2c4f in tbb::detail::r1::rml::internal::thread_monitor::launch(void* (*)(void*), void*, unsigned long) /data/cmsbld/jenkins/workspace/jenkins-test-bootstrap/toolconf/BUILD/el8_amd64_gcc10/external/tbb/v2021.5.0-e966a5acb1e4d5fd7605074bafbb079c/tbb-v2021.5.0/src/tbb/rml_thread_monitor.h:208
    #2 0x2ac46c9b2c4f in tbb::detail::r1::rml::private_worker::wake_or_launch() /data/cmsbld/jenkins/workspace/jenkins-test-bootstrap/toolconf/BUILD/el8_amd64_gcc10/external/tbb/v2021.5.0-e966a5acb1e4d5fd7605074bafbb079c/tbb-v2021.5.0/src/tbb/private_server.cpp:299
    #3 0x2ac46c9b2c4f in tbb::detail::r1::rml::private_server::wake_some(int) /data/cmsbld/jenkins/workspace/jenkins-test-bootstrap/toolconf/BUILD/el8_amd64_gcc10/external/tbb/v2021.5.0-e966a5acb1e4d5fd7605074bafbb079c/tbb-v2021.5.0/src/tbb/private_server.cpp:407

SUMMARY: AddressSanitizer: heap-buffer-overflow (/cvmfs/cms-ib.cern.ch/nweek-02735/el8_amd64_gcc10/cms/cmssw/CMSSW_12_5_ASAN_X_2022-05-30-1100/lib/el8_amd64_gcc10/pluginRecoEgammaEgammaElectronProducersPlugins.so+0x169972) in GsfElectronProducer::produce(edm::Event&, edm::EventSetup const&)
Shadow bytes around the buggy address:
  0x0c0480623bb0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c0480623bc0: fa fa fa fa fa fa 00 fa fa fa fa fa fa fa fa fa
  0x0c0480623bd0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c0480623be0: fa fa 05 fa fa fa 02 fa fa fa 03 fa fa fa fa fa
  0x0c0480623bf0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
=>0x0c0480623c00: fa fa 06 fa fa fa fa fa fa fa 00[04]fa fa fa fa
  0x0c0480623c10: fa fa 04 fa fa fa fd fa fa fa 04 fa fa fa fa fa
  0x0c0480623c20: fa fa fa fa fa fa 04 fa fa fa 00 00 fa fa 04 fa
  0x0c0480623c30: fa fa 00 00 fa fa 03 fa fa fa 02 fa fa fa 06 fa
  0x0c0480623c40: fa fa fa fa fa fa fa fa fa fa 04 fa fa fa fa fa
  0x0c0480623c50: fa fa fa fa fa fa fa fa fa fa 05 fa fa fa fa fa
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07 
  Heap left redzone:       fa
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Container overflow:      fc
  Array cookie:            ac
  Intra object redzone:    bb
  ASan internal:           fe
  Left alloca redzone:     ca
  Right alloca redzone:    cb
  Shadow gap:              cc
==30644==ABORTING

@Dr15Jones
Contributor Author

Assign reconstruction.

@cmsbuild
Contributor

cmsbuild commented Jun 1, 2022

A new Issue was created by @Dr15Jones Chris Jones.

@Dr15Jones, @perrotta, @dpiparo, @makortel, @smuzaffar, @qliphy can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@Dr15Jones
Contributor Author

assign reconstruction

@cmsbuild
Contributor

cmsbuild commented Jun 1, 2022

New categories assigned: reconstruction

@jpata,@slava77,@clacaputo you have been requested to review this Pull request/Issue and eventually sign? Thanks

@Dr15Jones
Contributor Author

Looking at the code, it appears to start from this call in GsfElectronProducer

const auto& dnn_ele_pfid = hoc->iElectronDNNEstimator->evaluate(electrons, tfSessions);
int jele = 0;
for (auto& el : electrons) {
  const auto& values = dnn_ele_pfid[jele];
  // get the previous values
  auto& mvaOutput = mva_outputs[jele];
  if (abs(el.superCluster()->eta()) <= extetaboundary) {
    mvaOutput.dnn_e_sigIsolated = values[0];
    mvaOutput.dnn_e_sigNonIsolated = values[1];
    mvaOutput.dnn_e_bkgNonIsolated = values[2];
    mvaOutput.dnn_e_bkgTau = values[3];
    mvaOutput.dnn_e_bkgPhoton = values[4];
  } else {
    mvaOutput.dnn_e_sigIsolated = values[0];
    mvaOutput.dnn_e_sigNonIsolated = 0.0;
    mvaOutput.dnn_e_bkgNonIsolated = values[1];
    mvaOutput.dnn_e_bkgTau = 0.0;
    mvaOutput.dnn_e_bkgPhoton = values[2];
  }

Given that the read is 0 bytes beyond a 12-byte region, I'm guessing that one of the std::vector<float> elements of the std::vector<std::vector<float>> returned by the call to evaluate holds only 3 floats when it is accessed. Therefore the problematic access is probably line 67, the read of values[3].

@Dr15Jones
Contributor Author

So the underlying determination of which model to use (and thus how many outputs will be present) is decided by this piece of code

inline uint electronModelSelector(
    const std::map<std::string, float>& vars, float ptThr, float etaThr, float endcapBoundary, float extEtaBoundary) {
  /*
  Selection of the model to be applied on the electron based on pt/eta cuts or whatever selection
  */
  const auto pt = vars.at("pt");
  const auto absEta = std::abs(vars.at("eta"));
  if (absEta <= endcapBoundary) {
    if (pt < ptThr)
      return 0;
    else {
      if (absEta <= etaThr) {
        return 1;
      } else {
        return 2;
      }
    }
  } else {
    if (absEta < extEtaBoundary)
      return 3;
    else {
      return 4;
    }

In particular, line 30 where absEta is a float. But the call at line 63 of GsfElectronProducer is using a double as defined here:

https://github.com/cms-sw/cmssw/blob/master/DataFormats/CaloRecHit/interface/CaloCluster.h#L181

What is worse, line 63 is

if (abs(el.superCluster()->eta()) <= extetaboundary) {

and since I can't find any using std; directive in that file, that call to abs is for the one from cmath, which would evaluate the value as an int!

@jpata
Contributor

jpata commented Jun 2, 2022

type egamma

@cmsbuild cmsbuild added the egamma label Jun 2, 2022
@jpata
Contributor

jpata commented Jun 2, 2022

Thanks Chris for the investigation!

@valsdav @a-kapoor @swagata87 @cms-sw/egamma-pog-l2 please take note of this issue.

@a-kapoor
Contributor

a-kapoor commented Jun 2, 2022

Looking at it now. @jpata

@Dr15Jones
Contributor Author

I think the underlying problem is with the interface presented by egammaTools::EgammaDNNHelper::evaluate. The function internally decides which model to run but the data structure returned from the function gives no way to know which model was applied to which elements. If the model info was returned from the function there would be no need to attempt to reapply the criteria externally to figure out how to interpret the std::vector<float> being returned.

@a-kapoor
Copy link
Contributor

a-kapoor commented Jun 2, 2022

@Dr15Jones Thanks a lot for all your comments. I completely agree with all the general comments. The interface egammaTools::EgammaDNNHelper::evaluate can be improved significantly.

I did some investigation, and to me it looks like the issue, rather, is inconsistent usage of el.superCluster()->eta() and el.eta().

The fact that there should be 3 nodes is decided based on el.eta(), when it should actually be decided based on el.superCluster()->eta().
Even a small difference between el.superCluster()->eta() and el.eta() will break this at the boundaries, and we know that these two numbers can sometimes be quite different. This could be why, in some cases, we have 3 node values yet ask for 5, which results in a crash.

Do you think this makes sense with what we see in the crash? @Dr15Jones

Regarding using std::abs vs just abs, I printed out some values (while running the hlt_mc_PIon addon test) and it looks like the abs function is working fine

abs(sc eta): 0.559093  std::abs(sc eta): 0.559093
abs(sc eta): 0.476116  std::abs(sc eta): 0.476116
abs(sc eta): 1.6099  std::abs(sc eta): 1.6099

Anyway, I also agree that we should still move to std::abs.

I am working on a fix and checking if this inconsistency exists in other places of the DNN code.

@Dr15Jones
Contributor Author

Dr15Jones commented Jun 2, 2022

> Do you think this makes sense with what we see in the crash? @Dr15Jones

Anything that could cause the caller of the function and the function to disagree on which model was used would cause the crash. Numerical differences causing the cut logic to be different between the two would definitely cause it.

> Regarding using std::abs vs just abs, I printed out some values (while running the hlt_mc_PIon addon test) and it looks like the abs function is working fine

Very strange. How did you do the print comparison? abs is very much an int only function, see

https://godbolt.org/z/EWh9d5c8v

NOTE: if somewhere in your test you have using std, then abs will actually be std::abs.

@a-kapoor
Contributor

a-kapoor commented Jun 2, 2022

@Dr15Jones

> > Regarding using std::abs vs just abs, I printed out some values (while running the hlt_mc_PIon addon test) and it looks like the abs function is working fine

> Very strange. How did you do the print comparison? abs is very much an int only function, see

Okay! Here is a commit I created with just the cout statement in GsfElectronProducer.cc

Maybe there is a "using std" hidden somewhere but it is not in GsfElectronProducer.cc. Could it be picking from somewhere else?

@Dr15Jones
Contributor Author

> Maybe there is a "using std" hidden somewhere but it is not in GsfElectronProducer.cc. Could it be picking from somewhere else?

I guess one of the included headers is bringing something in. I didn't find any evidence it comes from CMSSW. But for now, it doesn't matter since you've determined it is NOT an int conversion going on.

@a-kapoor
Contributor

a-kapoor commented Jun 2, 2022

> > Maybe there is a "using std" hidden somewhere but it is not in GsfElectronProducer.cc. Could it be picking from somewhere else?

> I guess one of the included headers is bringing something in. I didn't find any evidence it comes from CMSSW. But for now, it doesn't matter since you've determined it is NOT an int conversion going on.

Indeed. I am working on fixing the actual problem and will get back on this thread.

@VinInn
Contributor

VinInn commented Jun 8, 2022

grep -n abs /cvmfs/cms.cern.ch/slc7_amd64_gcc10/external/gcc/10.3.0-84898dea653199466402e67d73657f10/include/c++/10.3.0/stdlib.h
54:using std::abs;
63:using std::labs;

@a-kapoor
Contributor

@Dr15Jones @jpata @swagata87 @lfinco
A pull request with the fix and a more transparent model selection scheme is now open: #38356

@jpata
Contributor

jpata commented Jun 16, 2022

@cmsbuild
Contributor

This issue is fully signed and ready to be closed.
