ASAN problem in DeepTauId #32837

Closed
makortel opened this issue Feb 6, 2021 · 15 comments · Fixed by #32838

Comments


makortel commented Feb 6, 2021

CMSSW_11_3_ASAN_X_2021-02-05-2300 reports in 3.0 (also in many other workflows) step 3

=================================================================
==11751==ERROR: AddressSanitizer: stack-buffer-overflow on address 0x2b346d883e60 at pc 0x2b34791351a4 bp 0x2b346ec84700 sp 0x2b346ec846f8
READ of size 1 at 0x2b346d883e60 thread T4
    #0 0x2b34791351a3 in DeepTauId::fillGrids<edm::View<reco::Candidate>, pat::Tau>(pat::Tau const&, edm::View<reco::Candidate> const&, (anonymous namespace)::CellGrid&, (anonymous namespace)::CellGrid&)::{lambda(unsigned long, double, double, (anonymous namespace)::CellGrid&)#1}::operator()(unsigned long, double, double, (anonymous namespace)::CellGrid&) const [clone .isra.0] (/cvmfs/cms-ib.cern.ch/nweek-02666/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_ASAN_X_2021-02-05-2300/lib/slc7_amd64_gcc900/pluginRecoTauTagRecoTauPlugins.so+0x18b1a3)
    #1 0x2b34792061f5 in void DeepTauId::getPredictionsV2<pat::PackedCandidate, pat::Tau>(reco::BaseTau const&, unsigned long, edm::RefToBase<reco::BaseTau>, std::vector<pat::Electron, std::allocator<pat::Electron> > const*, std::vector<pat::Muon, std::allocator<pat::Muon> > const*, edm::View<reco::Candidate> const&, reco::Vertex const&, double, std::vector<tensorflow::Tensor, std::allocator<tensorflow::Tensor> >&, (anonymous namespace)::TauFunc) [clone .isra.0] (/cvmfs/cms-ib.cern.ch/nweek-02666/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_ASAN_X_2021-02-05-2300/lib/slc7_amd64_gcc900/pluginRecoTauTagRecoTauPlugins.so+0x25c1f5)
    #2 0x2b3479241d41 in DeepTauId::getPredictions(edm::Event&, edm::Handle<edm::View<reco::BaseTau> >) (/cvmfs/cms-ib.cern.ch/nweek-02666/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_ASAN_X_2021-02-05-2300/lib/slc7_amd64_gcc900/pluginRecoTauTagRecoTauPlugins.so+0x297d41)
    #3 0x2b3479a32ec7 in deep_tau::DeepTauBase::produce(edm::Event&, edm::EventSetup const&) (/cvmfs/cms-ib.cern.ch/nweek-02666/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_ASAN_X_2021-02-05-2300/lib/slc7_amd64_gcc900/libRecoTauTagRecoTau.so+0xffec7)
    #4 0x2b342cabd4cf in edm::stream::EDProducerAdaptorBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) (/cvmfs/cms-ib.cern.ch/nweek-02666/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_ASAN_X_2021-02-05-2300/lib/slc7_amd64_gcc900/libFWCoreFramework.so+0x8904cf)
    #5 0x2b342ca15282 in edm::WorkerT<edm::stream::EDProducerAdaptorBase>::implDo(edm::EventTransitionInfo const&, edm::ModuleCallingContext const*) (/cvmfs/cms-ib.cern.ch/nweek-02666/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_ASAN_X_2021-02-05-2300/lib/slc7_amd64_gcc900/libFWCoreFramework.so+0x7e8282)
    #6 0x2b342c72b034 in decltype ({parm#1}()) edm::convertException::wrap<edm::Worker::runModule<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*)::{lambda()#1}>(edm::Worker::runModule<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*)::{lambda()#1}) (/cvmfs/cms-ib.cern.ch/nweek-02666/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_ASAN_X_2021-02-05-2300/lib/slc7_amd64_gcc900/libFWCoreFramework.so+0x4fe034)
    #7 0x2b342c72b4cb in bool edm::Worker::runModule<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*) (/cvmfs/cms-ib.cern.ch/nweek-02666/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_ASAN_X_2021-02-05-2300/lib/slc7_amd64_gcc900/libFWCoreFramework.so+0x4fe4cb)
    #8 0x2b342c72be0f in std::__exception_ptr::exception_ptr edm::Worker::runModuleAfterAsyncPrefetch<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(std::__exception_ptr::exception_ptr const*, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*) (/cvmfs/cms-ib.cern.ch/nweek-02666/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_ASAN_X_2021-02-05-2300/lib/slc7_amd64_gcc900/libFWCoreFramework.so+0x4fee0f)
    #9 0x2b342c732df7 in edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >::execute() (/cvmfs/cms-ib.cern.ch/nweek-02666/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_ASAN_X_2021-02-05-2300/lib/slc7_amd64_gcc900/libFWCoreFramework.so+0x505df7)
    #10 0x2b342eac9044 in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::process_bypass_loop(tbb::internal::context_guard_helper<false>&, tbb::task*, long) ../../src/tbb/custom_scheduler.h:474
    #11 0x2b342eac92fa in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::local_wait_for_all(tbb::task&, tbb::task*) ../../src/tbb/custom_scheduler.h:636
    #12 0x2b342eac29a6 in tbb::internal::arena::process(tbb::internal::generic_scheduler&) ../../src/tbb/arena.cpp:196
    #13 0x2b342eac136f in tbb::internal::market::process(rml::job&) ../../src/tbb/market.cpp:667
    #14 0x2b342eabdb6b in tbb::internal::rml::private_worker::run() ../../src/tbb/private_server.cpp:266
    #15 0x2b342eabdd68 in tbb::internal::rml::private_worker::thread_routine(void*) ../../src/tbb/private_server.cpp:219
    #16 0x2b342f8d0ea4 in start_thread (/lib64/libpthread.so.0+0x7ea4)
    #17 0x2b342fbe396c in clone (/lib64/libc.so.6+0xfe96c)

Address 0x2b346d883e60 is located in stack of thread T2 at offset 5168 in frame
    #0 0x2b345d63dcef in reco::parser::MethodSetter::push(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<std::variant<signed char, unsigned char, short, unsigned short, int, unsigned int, long, unsigned long, double, float, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::variant<signed char, unsigned char, short, unsigned short, int, unsigned int, long, unsigned long, double, float, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > > const&, char const*, bool) const (/cvmfs/cms-ib.cern.ch/nweek-02666/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_ASAN_X_2021-02-05-2300/lib/slc7_amd64_gcc900/libCommonToolsUtils.so+0x19dcef)

  This frame has 47 object(s):
    [32, 33) '<unknown>'
    [48, 49) '<unknown>'
    [64, 65) '<unknown>'
    [80, 81) '<unknown>'
    [96, 97) '<unknown>'
    [112, 116) 'error' (line 60)
    [128, 136) 'member' (line 119)
    [160, 168) '<unknown>'
    [192, 200) '<unknown>'
    [224, 232) '<unknown>'
    [256, 264) '<unknown>'
    [288, 296) '<unknown>'
    [320, 328) '<unknown>'
    [352, 360) '<unknown>'
    [384, 392) '<unknown>'
    [416, 424) '<unknown>'
    [448, 456) '<unknown>'
    [480, 504) 'fixups' (line 59)
    [544, 568) '<unknown>'
    [608, 640) 'mem' (line 61)
    [672, 704) '<unknown>'
    [736, 768) '<unknown>'
    [800, 832) '<unknown>'
    [864, 896) '<unknown>'
    [928, 960) '<unknown>'
    [992, 1024) '<unknown>'
    [1056, 1088) '<unknown>'
    [1120, 1152) '<unknown>'
    [1184, 1216) '<unknown>'
    [1248, 1280) '<unknown>'
    [1312, 1344) '<unknown>'
    [1376, 1424) 'type' (line 58)
    [1456, 1504) 'retType' (line 64)
    [1536, 1584) '<unknown>'
    [1616, 1752) '<unknown>'
    [1824, 1960) '<unknown>'
    [2032, 2168) '<unknown>'
    [2240, 2672) '<unknown>'
    [2736, 3168) '<unknown>'
    [3232, 3664) '<unknown>'
    [3728, 4160) '<unknown>'
    [4224, 4656) '<unknown>'
    [4720, 5152) '<unknown>' <== Memory access at offset 5168 overflows this variable
    [5216, 5648) '<unknown>'
    [5712, 6144) '<unknown>'
    [6208, 6640) '<unknown>'
    [6704, 7136) '<unknown>'
HINT: this may be a false positive if your program uses some custom stack unwind mechanism, swapcontext or vfork
      (longjmp and C++ exceptions *are* supported)
Thread T2 created by T0 here:
    #0 0x2b342b83e9c2 in __interceptor_pthread_create ../../../../libsanitizer/asan/asan_interceptors.cc:208
    #1 0x2b342eabda58 in rml::internal::thread_monitor::launch(void* (*)(void*), void*, unsigned long) ../../src/tbb/../rml/server/thread_monitor.h:218
    #2 0x2b342eabda58 in tbb::internal::rml::private_worker::wake_or_launch() ../../src/tbb/private_server.cpp:297
    #3 0x2b342eabda58 in tbb::internal::rml::private_server::wake_some(int) ../../src/tbb/private_server.cpp:395
    #4 0x60c0000a8fff  (<unknown module>)

SUMMARY: AddressSanitizer: stack-buffer-overflow (/cvmfs/cms-ib.cern.ch/nweek-02666/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_ASAN_X_2021-02-05-2300/lib/slc7_amd64_gcc900/pluginRecoTauTagRecoTauPlugins.so+0x18b1a3) in DeepTauId::fillGrids<edm::View<reco::Candidate>, pat::Tau>(pat::Tau const&, edm::View<reco::Candidate> const&, (anonymous namespace)::CellGrid&, (anonymous namespace)::CellGrid&)::{lambda(unsigned long, double, double, (anonymous namespace)::CellGrid&)#1}::operator()(unsigned long, double, double, (anonymous namespace)::CellGrid&) const [clone .isra.0]
Shadow bytes around the buggy address:
  0x05670db08770: 00 00 00 f2 f2 f2 f2 f2 00 00 00 f2 f2 f2 f2 f2
  0x05670db08780: 00 00 00 f2 f2 f2 f2 f2 00 00 00 00 00 00 00 00
  0x05670db08790: 00 00 00 00 f3 f3 f3 f3 00 00 00 00 00 00 00 00
  0x05670db087a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x05670db087b0: 00 00 00 00 00 00 00 00 00 00 f1 f1 f1 f1 01 f2
=>0x05670db087c0: 04 f2 00 f2 f2 f2 00 f2 f2 f2 00 f2[f2]f2 00 f2
  0x05670db087d0: f2 f2 00 f2 f2 f2 00 f2 f2 f2 00 f2 f2 f2 00 f2
  0x05670db087e0: f2 f2 f8 f2 f2 f2 f8 f2 f2 f2 f8 f2 f2 f2 f8 f2
  0x05670db087f0: f2 f2 00 f2 f2 f2 00 f2 f2 f2 00 f2 f2 f2 00 f2
  0x05670db08800: f2 f2 00 f2 f2 f2 00 f2 f2 f2 00 f2 f2 f2 00 f2
  0x05670db08810: f2 f2 00 f2 f2 f2 00 f2 f2 f2 00 f2 f2 f2 00 f2
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07 
  Heap left redzone:       fa
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Container overflow:      fc
  Array cookie:            ac
  Intra object redzone:    bb
  ASan internal:           fe
  Left alloca redzone:     ca
  Right alloca redzone:    cb
  Shadow gap:              cc
Thread T4 created by T2 here:
    #0 0x2b342b83e9c2 in __interceptor_pthread_create ../../../../libsanitizer/asan/asan_interceptors.cc:208
    #1 0x2b342eabda58 in rml::internal::thread_monitor::launch(void* (*)(void*), void*, unsigned long) ../../src/tbb/../rml/server/thread_monitor.h:218
    #2 0x2b342eabda58 in tbb::internal::rml::private_worker::wake_or_launch() ../../src/tbb/private_server.cpp:297
    #3 0x2b342eabda58 in tbb::internal::rml::private_server::wake_some(int) ../../src/tbb/private_server.cpp:395
    #4 0x13efff  (<unknown module>)

==11751==ABORTING

https://cmssdt.cern.ch/SDT/cgi-bin/logreader/slc7_amd64_gcc900/CMSSW_11_3_ASAN_X_2021-02-05-2300/pyRelValMatrixLogs/run/3.0_ProdQCD_Pt_3000_3500+ProdQCD_Pt_3000_3500+DIGIPROD1+RECOPROD1/step3_ProdQCD_Pt_3000_3500+ProdQCD_Pt_3000_3500+DIGIPROD1+RECOPROD1.log#/


makortel commented Feb 6, 2021

assign reconstruction


cmsbuild commented Feb 6, 2021

New categories assigned: reconstruction

@slava77,@perrotta,@jpata you have been requested to review this Pull request/Issue and eventually sign? Thanks


cmsbuild commented Feb 6, 2021

A new Issue was created by @makortel Matti Kortelainen.

@Dr15Jones, @dpiparo, @silviodonato, @smuzaffar, @makortel, @qliphy can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here


makortel commented Feb 6, 2021

This problem might be causing the non-reproducibility discussed in #32628.


slava77 commented Feb 6, 2021

> This problem might be causing the non-reproducibility discussed in #32628.

the ASAN issue started appearing after #32676 was merged yesterday; so, the older issue may still be unrelated


slava77 commented Feb 6, 2021

@swozniewski
please take a look

cmsrel CMSSW_11_3_ASAN_X_2021-02-05-2300; .. cmsenv; git cms-addpkg RecoTauTag/RecoTau; scram b -r -j 16 USER_CXXFLAGS=-g
should help to see where the problems are


slava77 commented Feb 6, 2021

@makortel
3.0 is MC and it runs in this case in 4 threads; isn't it a bad reference for crashes due to the generic gen/sim/digi non-reproducibility in random sequences?


makortel commented Feb 6, 2021

> 3.0 is MC and it runs in this case in 4 threads; isn't it a bad reference for crashes due to the generic gen/sim/digi non-reproducibility in random sequences?

There are 534 failing workflows in that ASAN IB, 3.0 was the first one. I checked a few other failing workflows, the cause in them was the same. From data workflows e.g. 4.25 appears to fail pretty quickly.


slava77 commented Feb 6, 2021

> There are 534 failing workflows in that ASAN IB, 3.0 was the first one.

OK, this was more of a suggestion for the future reports to use more repeatable cases


slava77 commented Feb 6, 2021

IIUC, the problem is in
    static auto getCellIndex = [this](double x, double maxX, double size, int& index) {
RecoTauTag/RecoTau/plugins/DeepTauId.cc:1059 -> RecoTauTag/RecoTau/plugins/DeepTauId.cc:1047

    bool tryGetCellIndex(double deltaEta, double deltaPhi, CellIndex& cellIndex) const {
      static auto getCellIndex = [this](double x, double maxX, double size, int& index) {
        const double absX = std::abs(x);
        if (absX > maxX)
          return false;
        double absIndex;
        if (disable_CellIndex_workaround_) {

I looked at this earlier because of the static analyzer report, but considered the warning there to be a false positive.
Apparently not; does it mean that e.g. the lambda capture of the data member is done at the object construction time and compiled as a static variable or smth?


makortel commented Feb 6, 2021

> Apparently not; does it mean that e.g. the lambda capture of the data member is done at the object construction time and compiled as a static variable or smth?

I think that is the case, i.e. `this` is captured at the first call to the function that constructs the lambda object, and gets used in every call to the lambda afterwards. Thanks for the quick fix!
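
To make the explanation above concrete, here is a minimal sketch of the pattern. The `Grid` class and its members are hypothetical stand-ins for illustration only and are not taken from `DeepTauId`. It assumes nothing beyond standard C++: a function-local `static` lambda is initialized once, on the first call, so its `[this]` capture keeps pointing at whichever object made that first call.

```cpp
#include <cassert>

// Hypothetical class for illustration; not CMSSW code.
struct Grid {
  explicit Grid(int offset) : offset_(offset) {}

  // Problematic pattern: the static lambda is constructed only on the first
  // call, so it stores the `this` of the first object that calls it.
  int shiftStatic(int x) const {
    static auto shift = [this](int v) { return v + offset_; };
    return shift(x);
  }

  // Non-static lambda: rebuilt on every call, captures the current `this`.
  int shiftLocal(int x) const {
    auto shift = [this](int v) { return v + offset_; };
    return shift(x);
  }

  int offset_;
};

// Both objects stay alive here, so the behavior is well defined.
void demo() {
  Grid a(10);
  Grid b(100);
  assert(a.shiftStatic(1) == 11);   // first call: lambda captures &a
  assert(b.shiftStatic(1) == 11);   // reuses a's `this`, b's offset_ is ignored
  assert(b.shiftLocal(1) == 101);   // fresh lambda, uses b's offset_
}
```

In this sketch both objects outlive the calls, so the stale capture only produces a wrong value. If the first object has been destroyed in the meantime, the captured pointer dangles and any access through it is undefined behavior, which would be consistent with the invalid read ASAN reports above. Dropping `static` from the lambda is one way to avoid the problem; this thread does not spell out what #32838 changed.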


slava77 commented Feb 8, 2021

> @cmsbuild closed this in #32838 yesterday

I'm curious what was the mix of words that led to closing this issue immediately after merging of #32838. There I mentioned "this apparently fixes #32837"; was that enough?


makortel commented Feb 8, 2021

> @cmsbuild closed this in #32838 yesterday
>
> I'm curious what was the mix of words that led to closing this issue immediately after merging of #32838. There I mentioned "this apparently fixes #32837"; was that enough?

Yes, fixes #<issue number> is one of the keywords that link an issue to a PR (and lead to closing of the issue when the PR is merged to the default branch of the repo)
https://docs.github.com/en/github/managing-your-work-on-github/linking-a-pull-request-to-an-issue#linking-a-pull-request-to-an-issue-using-a-keyword


slava77 commented Feb 8, 2021

> Yes, fixes #<issue number> is one of the keywords

I thought this only worked in a more restricted context, when the keyword starts the message. Apparently it can also be part of a more complex sentence.

@swozniewski
Contributor

Fortunately, you did not write "Are we sure that this already fixes #32837 ?"
