Try a second time if pop fails in ReusableObjectHolder #36813

Closed

Dr15Jones
Contributor

PR description:

Experience has shown that in some builds tbb::concurrent_queue::try_pop appears to fail falsely, that is, it returns false even though an item should be available. This PR adds a second attempt to reduce how often that happens.
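
The idea can be sketched as follows. This is a minimal illustration, not the code in this PR: FlakyQueue is a hypothetical, single-threaded stand-in that simulates false failures, and tryPopWithRetry is a hypothetical helper name. The only thing assumed about the real queue is that it offers a try_pop(T&) member returning bool, as tbb::concurrent_queue does.

```cpp
#include <cassert>
#include <queue>
#include <utility>

// Hypothetical stand-in for a concurrent queue (NOT thread-safe): the first
// `spuriousFailures` calls to try_pop report failure even if an item exists.
template <typename T>
class FlakyQueue {
public:
  explicit FlakyQueue(int spuriousFailures) : failuresLeft_(spuriousFailures) {}

  void push(T value) { items_.push(std::move(value)); }

  bool try_pop(T& out) {
    if (failuresLeft_ > 0) {
      --failuresLeft_;
      return false;  // simulated false failure
    }
    if (items_.empty()) {
      return false;  // genuine failure: nothing to pop
    }
    out = std::move(items_.front());
    items_.pop();
    return true;
  }

private:
  std::queue<T> items_;
  int failuresLeft_;
};

// Calls try_pop up to `attempts` times and stops at the first success.
// Works with any queue type that provides bool try_pop(T&).
template <typename Queue, typename T>
bool tryPopWithRetry(Queue& queue, T& out, int attempts = 2) {
  for (int i = 0; i < attempts; ++i) {
    if (queue.try_pop(out)) {
      return true;
    }
  }
  return false;
}

// Example: one simulated false failure is absorbed by the second attempt.
inline int retryExample() {
  FlakyQueue<int> queue(1);
  queue.push(42);
  int value = 0;
  bool ok = tryPopWithRetry(queue, value);
  assert(ok);
  (void)ok;
  return value;
}
```

A second attempt only lowers the chance of a false failure; if both attempts fail, the caller still sees the queue as empty.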

PR validation:

Code compiles and the related unit test passes.

@cmsbuild
Contributor

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-36813/27969

  • This PR adds an extra 16KB to repository

@cmsbuild
Contributor

A new Pull Request was created by @Dr15Jones (Chris Jones) for master.

It involves the following packages:

  • FWCore/Utilities (core)

@cmsbuild, @smuzaffar, @Dr15Jones, @makortel can you please review it and eventually sign? Thanks.
@makortel, @felicepantaleo, @wddgit this is something you requested to watch as well.
@perrotta, @dpiparo, @qliphy you are the release manager for this.

cms-bot commands are listed here

@makortel
Contributor

@cmsbuild, please test

@makortel
Contributor

@cmsbuild, please test for slc7_aarch64_gcc11

@Dr15Jones
Contributor Author

I looked at the implementation of oneapi::tbb::concurrent_queue::try_pop. It uses std::memory_order::relaxed in a couple of places, which could definitely lead to the false failures.

@cmsbuild
Contributor

-1

Failed Tests: UnitTests RelVals
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-744ca0/22029/summary.html
COMMIT: 581c00f
CMSSW: CMSSW_12_3_X_2022-01-24-2300/slc7_aarch64_gcc11
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/36813/22029/install.sh to create a dev area with all the needed externals and cmssw changes.

The following merge commits were also included on top of IB + this PR after doing git cms-merge-topic:

You can see more details here:
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-744ca0/22029/git-recent-commits.json
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-744ca0/22029/git-merge-result

Unit Tests

I found errors in the following unit tests:

---> test TestFWCoreIntegrationStandalone had ERRORS
---> test unitTestsGroup_1 had ERRORS
---> test unitTestsGroup_4 had ERRORS
---> test testFWCoreUtilities had ERRORS
and more ...

RelVals

----- Begin Fatal Exception 26-Jan-2022 23:19:50 CET-----------------------
An exception of category 'Vertex' occurred while
   [0] Processing  Event run: 194533 lumi: 329 event: 462355458 stream: 0
   [1] Running path 'dqmofflineOnPAT_1_step'
   [2] Prefetching for module SingleTopTChannelLeptonDQM_miniAOD/'singleTopElectronMediumDQM_miniAOD'
   [3] Prefetching for module PATMuonSlimmer/'slimmedMuons'
   [4] Prefetching for module PATMuonSelector/'selectedPatMuons'
   [5] Prefetching for module PATMuonProducer/'patMuons'
   [6] Prefetching for module MuonProducer/'muons'
   [7] Prefetching for module PFProducer/'particleFlowTmp'
   [8] Prefetching for module PFBlockProducer/'particleFlowBlock'
   [9] Prefetching for module PFElecTkProducer/'pfTrackElec'
   [10] Prefetching for module PFConversionProducer/'pfConversions'
   [11] Calling method for module ConversionProducer/'allConversions'
Exception Message:
Refitted track not found in list
----- End Fatal Exception -------------------------------------------------
----- Begin Fatal Exception 26-Jan-2022 23:32:46 CET-----------------------
An exception of category 'Vertex' occurred while
   [0] Processing  Event run: 326479 lumi: 7 event: 1579493 stream: 0
   [1] Running path 'dqmoffline_8_step'
   [2] Prefetching for module SMPDQM/'SMPDQM'
   [3] Prefetching for module MuonProducer/'muons'
   [4] Prefetching for module PFProducer/'particleFlowTmp'
   [5] Prefetching for module PFBlockProducer/'particleFlowBlock'
   [6] Prefetching for module PFElecTkProducer/'pfTrackElec'
   [7] Prefetching for module PFConversionProducer/'pfConversions'
   [8] Calling method for module ConversionProducer/'allConversions'
Exception Message:
Refitted track not found in list
----- End Fatal Exception -------------------------------------------------
----- Begin Fatal Exception 26-Jan-2022 23:49:16 CET-----------------------
An exception of category 'Vertex' occurred while
   [0] Processing  Event run: 319450 lumi: 76 event: 106007323 stream: 0
   [1] Running path 'dqmoffline_10_step'
   [2] Prefetching for module SMPDQM/'SMPDQM'
   [3] Prefetching for module MuonProducer/'muons'
   [4] Prefetching for module PFProducer/'particleFlowTmp'
   [5] Prefetching for module PFBlockProducer/'particleFlowBlock'
   [6] Prefetching for module PFElecTkProducer/'pfTrackElec'
   [7] Prefetching for module PFConversionProducer/'pfConversions'
   [8] Calling method for module ConversionProducer/'allConversions'
Exception Message:
Refitted track not found in list
----- End Fatal Exception -------------------------------------------------

@makortel
Contributor

In a sample of one test run, this didn't seem to fully cure the "issue":

===== Test "testFWCoreUtilities" ====
Running ............................... # seen: 3 3
F...

reusableobjectholder_t.cppunit.cpp:241:Assertion
Test name: reusableobjectholder_test::testSimultaneousUse
assertion failed
- Expression: t1ItemsSeen.size() > 0 && t1ItemsSeen.size() < 3

Failures !!!
Run: 34   Failure total: 1   Failures: 1   Errors: 0

---> test testFWCoreUtilities had ERRORS

(Whether it had an impact on the distribution of elements in the ReusableObjectHolder would require a larger-scale test to determine.)
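
To illustrate why a false pop failure can change how many distinct objects are seen, here is a toy sketch. It assumes the holder follows a "pop a cached object, otherwise make a new one" pattern; ToyHolder, its popFails hook, and objectsCreated are hypothetical names for this example, and the code is single-threaded and is not the ReusableObjectHolder implementation or the failing unit test.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <queue>
#include <utility>

// Hypothetical, single-threaded toy pool: reuse a cached object when the pop
// succeeds, otherwise create a new one.
class ToyHolder {
public:
  // popFails: hypothetical hook that simulates a false try_pop failure.
  explicit ToyHolder(std::function<bool()> popFails) : popFails_(std::move(popFails)) {}

  std::unique_ptr<int> makeOrGet() {
    if (!cache_.empty() && !popFails_()) {
      auto item = std::move(cache_.front());
      cache_.pop();
      return item;  // reused object
    }
    ++created_;  // pop failed (genuinely or falsely): make a new object
    return std::make_unique<int>(created_);
  }

  // Return an object to the cache so a later call can reuse it.
  void giveBack(std::unique_ptr<int> item) { cache_.push(std::move(item)); }

  int created() const { return created_; }

private:
  std::queue<std::unique_ptr<int>> cache_;
  std::function<bool()> popFails_;
  int created_ = 0;
};

// Runs `cycles` sequential get/return cycles and reports how many objects
// were created. The pop is made to fail falsely on cycle `failOnCycle`
// (pass -1 for no simulated failure).
inline int objectsCreated(int cycles, int failOnCycle) {
  int cycle = 0;
  ToyHolder holder([&] { return cycle == failOnCycle; });
  for (cycle = 0; cycle < cycles; ++cycle) {
    auto item = holder.makeOrGet();
    holder.giveBack(std::move(item));
  }
  return holder.created();
}
```

In this toy, three sequential cycles without a simulated failure create one object, while a single false failure creates a second one. An upper bound on the number of distinct objects seen, like the one in the assertion above, is therefore sensitive to such failures.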

@cmsbuild
Contributor

-1

Failed Tests: UnitTests
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-744ca0/22028/summary.html
COMMIT: 581c00f
CMSSW: CMSSW_12_3_X_2022-01-26-1100/slc7_amd64_gcc10
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/36813/22028/install.sh to create a dev area with all the needed externals and cmssw changes.

The following merge commits were also included on top of IB + this PR after doing git cms-merge-topic:

You can see more details here:
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-744ca0/22028/git-recent-commits.json
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-744ca0/22028/git-merge-result

Unit Tests

I found errors in the following unit tests:

---> test materialBudgetTrackerPlots had ERRORS

Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 332 differences found in the comparisons
  • DQMHistoTests: Total files compared: 43
  • DQMHistoTests: Total histograms compared: 3449324
  • DQMHistoTests: Total failures: 89
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3449213
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 42 files compared)
  • Checked 181 log files, 42 edm output root files, 43 DQM output files
  • TriggerResults: no differences found

@makortel
Contributor

-1

Per our discussion earlier, a different approach would be needed.

@Dr15Jones Dr15Jones closed this Feb 8, 2022
@Dr15Jones Dr15Jones deleted the tryAgainReusableObjectHolder branch February 10, 2022 16:37