Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Protect against concurrent access of DD4Hep Detector #33236

Merged
merged 1 commit into from Mar 23, 2021

Conversation

Dr15Jones
Copy link
Contributor

PR description:

Under the hood, DD4Hep uses ROOT's global gGeoManager and therefore
can not support concurrent changes.

  • usesResources declaration avoids ES modules from running at same time
  • Make sure to reset gGeoManager pointer to avoid accidental reuse.

PR validation:

  • Code compiles
  • Ran step 3 of workflow 11650.911 using 6 threads and 4 streams. Without this change it failed (either exception of seg fault) 4 out of 6 times. After this change, it never failed after trying 6 times.

Under the hood, DD4Hep uses ROOT's global gGeoManager and therefore
can not support concurrent changes.
- usesResources declaration avoids ES modules from running at same time
- Make sure to reset gGeoManager pointer to avoid accidental reuse.
@cmsbuild
Copy link
Contributor

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-33236/21686

  • This PR adds an extra 20KB to repository

@cmsbuild
Copy link
Contributor

A new Pull Request was created by @Dr15Jones (Chris Jones) for master.

It involves the following packages:

DetectorDescription/DDCMS
FWCore/Concurrency
MagneticField/GeomBuilder

@perrotta, @smuzaffar, @civanch, @Dr15Jones, @makortel, @cvuosalo, @ianna, @mdhildreth, @cmsbuild, @slava77, @jpata can you please review it and eventually sign? Thanks.
@namapane, @vargasa, @makortel, @wddgit, @fabiocos, @slomeo this is something you requested to watch as well.
@silviodonato, @dpiparo, @qliphy you are the release manager for this.

cms-bot commands are listed here

@Dr15Jones
Copy link
Contributor Author

Please test

@cmsbuild
Copy link
Contributor

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-e1281c/13644/summary.html
COMMIT: b568cef
CMSSW: CMSSW_11_3_X_2021-03-20-1100/slc7_amd64_gcc900
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/33236/13644/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 0 differences found in the comparisons
  • DQMHistoTests: Total files compared: 37
  • DQMHistoTests: Total histograms compared: 2639935
  • DQMHistoTests: Total failures: 1
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 2639912
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 36 files compared)
  • Checked 155 log files, 37 edm output root files, 37 DQM output files
  • TriggerResults: no differences found

@ianna
Copy link
Contributor

ianna commented Mar 22, 2021

Thanks, @Dr15Jones !

@ianna
Copy link
Contributor

ianna commented Mar 22, 2021

+1

@cvuosalo
Copy link
Contributor

Markus Frank has concerns about modifying gGeoManager. We need to discuss the issue some more with him.

@Dr15Jones
Copy link
Contributor Author

Dr15Jones commented Mar 22, 2021

@cvuosalo on a practical point, after many repeat runnings of the failed job I have yet to see a crash after this modification. If access to gGeoManager were necessary for our code after the construction phase, I would have expected a crash to have occured.

@civanch
Copy link
Contributor

civanch commented Mar 22, 2021

Hi all, few points to have in mind:

  • this crash happens when RECO geometry and MF are built from XML
  • we do not need CPU efficiency and parallel build of these objects, because this will be done only once to create DB geometry
  • in production we take RECO geometry from DB

May be we should not even try to force these geometry builders be so granular? Why not to have 1 module, which will call these modules sequentially and, if needed, provide everything as a parameters when call. gGeoManager may be called in one place only, not in each module.

@Dr15Jones
Copy link
Contributor Author

@civanch wrote

this crash happens when RECO geometry and MF are built from XML

This does not appear to be the case. The RECO geometry is coming from XML but the MF is coming from the module that says it gets it from the DB.

@Dr15Jones
Copy link
Contributor Author

So with my change, I see three instances of TGeoManager being created

  1. must be a static object in the libROOTEve.so which is being loaded when a ROOT dictionary is being read
#0  0x00007fffc757ef40 in TGeoManager::TGeoManager() () from /cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc900/cms/cmssw-patch/CMSSW_11_3_X_2021-03-20-1100/external/slc7_amd64_gcc900/lib/libGeom.so
#1  0x00007ffe981209a6 in _GLOBAL__sub_I_REveGeoShape.cxx () from /cvmfs/cms-ib.cern.ch/nweek-02672/slc7_amd64_gcc900/lcg/root/6.22.08-41a30d6c5dc13b67366c419d2bab8795/lib/libROOTEve.so
#2  0x00007ffff7dea9c3 in _dl_init_internal () from /lib64/ld-linux-x86-64.so.2
#3  0x00007ffff7def59e in dl_open_worker () from /lib64/ld-linux-x86-64.so.2
#4  0x00007ffff7dea7d4 in _dl_catch_error () from /lib64/ld-linux-x86-64.so.2
#5  0x00007ffff7deeb8b in _dl_open () from /lib64/ld-linux-x86-64.so.2
#6  0x00007ffff5d45fab in dlopen_doit () from /lib64/libdl.so.2
#7  0x00007ffff7dea7d4 in _dl_catch_error () from /lib64/ld-linux-x86-64.so.2
#8  0x00007ffff5d465ad in _dlerror_run () from /lib64/libdl.so.2
#9  0x00007ffff5d46041 in dlopen@@GLIBC_2.2.5 () from /lib64/libdl.so.2
#10 0x00007fffd6776b05 in cling::utils::platform::DLOpen(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*) ()
   from /cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc900/cms/cmssw-patch/CMSSW_11_3_X_2021-03-20-1100/external/slc7_amd64_gcc900/lib/libCling.so
...
   from /cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc900/cms/cmssw-patch/CMSSW_11_3_X_2021-03-20-1100/lib/slc7_amd64_gcc900/libCommonToolsUtils.so
#37 0x00007fffbf842dbf in reco::parser::cutParser(edm::TypeWithDict const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::shared_ptr<reco::parser::SelectorBase>&, bool) ()
   from /cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc900/cms/cmssw-patch/CMSSW_11_3_X_2021-03-20-1100/lib/slc7_amd64_gcc900/libCommonToolsUtils.so
#38 0x00007fff6892b5a5 in reco::tau::RecoTauStringCleanerPlugin::RecoTauStringCleanerPlugin(edm::ParameterSet const&, edm::ConsumesCollector&&) ()
   from /cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc900/cms/cmssw-patch/CMSSW_11_3_X_2021-03-20-1100/lib/slc7_amd64_gcc900/pluginRecoTauTagRecoTauPlugins.so
#39 0x00007fff6892bacf in edmplugin::PluginFactory<reco::tau::RecoTauCleanerPlugin* (edm::ParameterSet const&, edm::ConsumesCollector&&)>::PMaker<reco::tau::RecoTauStringCleanerPlugin>::create(edm::ParameterSet const&, edm::ConsumesCollector&&) const ()
   from /cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc900/cms/cmssw-patch/CMSSW_11_3_X_2021-03-20-1100/lib/slc7_amd64_gcc900/pluginRecoTauTagRecoTauPlugins.so
...
  1. From DDDetectorESProducer
#0  0x00007fffc7583b50 in TGeoManager::TGeoManager(char const*, char const*) () from /cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc900/cms/cmssw-patch/CMSSW_11_3_X_2021-03-20-1100/external/slc7_amd64_gcc900/lib/libGeom.so
#1  0x00007fffc6ea7968 in dd4hep::DetectorImp::DetectorImp(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) ()
   from /cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc900/cms/cmssw-patch/CMSSW_11_3_X_2021-03-20-1100/external/slc7_amd64_gcc900/lib/libDDCore.so.1.16
#2  0x00007fffc6ea7d06 in dd4hep::Detector::getInstance(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) ()
   from /cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc900/cms/cmssw-patch/CMSSW_11_3_X_2021-03-20-1100/external/slc7_amd64_gcc900/lib/libDDCore.so.1.16
#3  0x00007fffc738acda in cms::DDDetector::DDDetector(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, bool) ()
   from /uscms_data/d2/cdj/build/temp/dd4hep/CMSSW_11_3_X_2021-03-20-1100/lib/slc7_amd64_gcc900/libDetectorDescriptionDDCMS.so
#4  0x00007fffbf58a295 in DDDetectorESProducer::produceGeom(IdealGeometryRecord const&) () from /uscms_data/d2/cdj/build/temp/dd4hep/CMSSW_11_3_X_2021-03-20-1100/lib/slc7_amd64_gcc900/pluginDetectorDescriptionPlugins.so
...
  1. From DD4hep_VolumeBasedMagneticFieldESProducerFromDB
#0  0x00007fffc7583b50 in TGeoManager::TGeoManager(char const*, char const*) () from /cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc900/cms/cmssw-patch/CMSSW_11_3_X_2021-03-20-1100/external/slc7_amd64_gcc900/lib/libGeom.so
#1  0x00007fffc6ea7968 in dd4hep::DetectorImp::DetectorImp(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) ()
   from /cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc900/cms/cmssw-patch/CMSSW_11_3_X_2021-03-20-1100/external/slc7_amd64_gcc900/lib/libDDCore.so.1.16
#2  0x00007fffc6ea7d06 in dd4hep::Detector::getInstance(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) ()
   from /cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc900/cms/cmssw-patch/CMSSW_11_3_X_2021-03-20-1100/external/slc7_amd64_gcc900/lib/libDDCore.so.1.16
#3  0x00007fffc738acda in cms::DDDetector::DDDetector(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, bool) ()
   from /uscms_data/d2/cdj/build/temp/dd4hep/CMSSW_11_3_X_2021-03-20-1100/lib/slc7_amd64_gcc900/libDetectorDescriptionDDCMS.so
#4  0x00007fffbf6b9571 in magneticfield::DD4hep_VolumeBasedMagneticFieldESProducerFromDB::produce(IdealMagneticFieldRecord const&) () from /uscms_data/d2/cdj/build/temp/dd4hep/CMSSW_11_3_X_2021-03-20-1100/lib/slc7_amd64_gcc900/pluginDD4hep_MagneticFieldGeomBuilderPlugins.so

@Dr15Jones
Copy link
Contributor Author

So I did a further test. in the DDDetector constructor, instead of resetting gGeoManager back to its previous value at the end of the function, I instead set it always to nullptr. I than reran the test job. The program ran fine and checking in the debugger I still see only the same three calls to the TGeoManager constructor. This means outside of the DDDetector constructor no CMSSW call directly or indirectly calls gGeoManager.

@makortel
Copy link
Contributor

+1

@makortel
Copy link
Contributor

I think this PR is a step forward (to fix the crashes). If we need to adjust the use of gGeoManager based on feedback from DD4Hep authors, we can do it later.

@Dr15Jones
Copy link
Contributor Author

Resolves cms-sw/framework-team#92

@cvuosalo
Copy link
Contributor

@Dr15Jones Just a clarification about the magnetic field geometry building: The magnetic field geometry file record is fetched from the DB, but it is an XML file blob that has to be parsed to build the magnetic field. It is unlike the reco geometry that we store in the DB that has already been parsed and processed.

@slava77
Copy link
Contributor

slava77 commented Mar 23, 2021

+reconstruction

for #33236 b568cef

  • code changes are in line with the PR description
  • jenkins tests pass and comparisons with the baseline show no differences

@cmsbuild
Copy link
Contributor

This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @silviodonato, @dpiparo, @qliphy (and backports should be raised in the release meeting by the corresponding L2)

@silviodonato
Copy link
Contributor

+1

@cmsbuild cmsbuild merged commit 99380ed into cms-sw:master Mar 23, 2021
@Dr15Jones Dr15Jones deleted the fixThreadDD4Hep branch April 2, 2021 15:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

8 participants