[12_5_X] Release dependent modules only after the worker has finished for scheduled modules #39485

Merged

makortel
Contributor

PR description:

Backport of #39245. Fixes a rare scheduling bug that was observed at the HLT in #39064 (which was first mitigated in #39201).

PR validation:

Tests in #39245 and #39484.

This change was triggered by a case where a PuttableProductResolver
was filled by a Worker that produced many products, one of which (A)
held a Ref to another product (B), and product B was not consumed by
any module (it was only accessed through the Ref). Because
putProduct() released the Resolver's WaitingTaskList as soon as A was
put, the consumer of A was allowed to run, and when that consumer
dereferenced the Ref it found that B was not there yet.
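
A deliberately simplified, single-threaded C++ model of this ordering problem is sketched below. WaitList, Resolver and consumerOfASeesB() are hypothetical stand-ins, not the CMSSW classes or their API: each resolver owns its own wait list and releases it inside putProduct(), so the consumer of A runs before the worker has put B.

```cpp
#include <cassert>
#include <functional>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a waiting-task list (not the CMSSW class):
// tasks added before doneWaiting() are queued; tasks added after it run at once.
class WaitList {
 public:
  void add(std::function<void()> task) {
    if (done_) {
      task();
    } else {
      tasks_.push_back(std::move(task));
    }
  }
  void doneWaiting() {
    assert(!done_ && "doneWaiting() called twice");
    done_ = true;
    std::vector<std::function<void()>> pending;
    pending.swap(tasks_);
    for (auto& task : pending) task();
  }

 private:
  bool done_ = false;
  std::vector<std::function<void()>> tasks_;
};

// Hypothetical resolver that owns its own WaitList (the pre-fix behavior).
struct Resolver {
  std::optional<std::string> product;
  WaitList waiters;
  void putProduct(std::string p) {
    product = std::move(p);
    waiters.doneWaiting();  // releases the consumers of THIS product immediately
  }
};

// Returns whether the consumer of A could see B at the moment it ran.
bool consumerOfASeesB() {
  Resolver a, b;
  bool sawB = false;
  // The consumer of A waits only on A; B is reachable only through A's "Ref".
  a.waiters.add([&] { sawB = b.product.has_value(); });
  // The worker puts A first, then B.
  a.putProduct("A (holds Ref to B)");
  b.putProduct("B");
  return sawB;
}
```

In this model consumerOfASeesB() returns false, because the consumer's callback runs inside a.putProduct(), before b.putProduct() is reached.
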

Besides scheduled modules, the other cases where PuttableProductResolver
is used are Sources inheriting from PuttableSourceBase, and TestProcessor.
In these cases the products are put into the Resolvers (or left as
non-produced) before the prefetching of the unscheduled system is
launched. Therefore, in these use cases, the consuming modules do not
need to wait for the Resolver to be filled.
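
In a similar hypothetical model (again not the CMSSW API), the Source-like case is harmless: every product is put, or left empty, and the list is marked done before any consumer is registered, so a consumer added afterwards runs immediately and already sees all products. This assumes, as a simplification, that adding a task to an already-released list runs it right away.

```cpp
#include <cassert>
#include <functional>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a waiting-task list (not the CMSSW class):
// tasks added before doneWaiting() are queued; tasks added after it run at once.
class WaitList {
 public:
  void add(std::function<void()> task) {
    if (done_) {
      task();
    } else {
      tasks_.push_back(std::move(task));
    }
  }
  void doneWaiting() {
    assert(!done_ && "doneWaiting() called twice");
    done_ = true;
    std::vector<std::function<void()>> pending;
    pending.swap(tasks_);
    for (auto& task : pending) task();
  }

 private:
  bool done_ = false;
  std::vector<std::function<void()>> tasks_;
};

// Models a Source-like step that fills every product before prefetching starts.
bool consumerRunsImmediatelyAfterSourceFill() {
  WaitList filled;
  std::optional<std::string> productA;
  std::optional<std::string> productB;

  // 1. The source puts its products (it could also leave some non-produced).
  productA = "A (holds Ref to B)";
  productB = "B";
  filled.doneWaiting();

  // 2. Only afterwards does prefetching register the consumer of A.
  bool ran = false;
  bool sawBoth = false;
  filled.add([&] {
    ran = true;
    sawBoth = productA.has_value() && productB.has_value();
  });
  // The consumer ran synchronously inside add(), with both products present.
  return ran && sawBoth;
}
```
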

After several fix attempts, it seemed easiest to use the Worker's
WaitingTaskList directly in PuttableProductResolver. This approach
fulfills the requirements of both the Worker and the Source(-like) use
cases, and even simplifies the code.

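
Continuing the hypothetical model, the fix can be sketched by having each resolver borrow the producing worker's wait list instead of owning one. putProduct() then only stores the product, and the worker releases all consumers once, after it has put everything. The names are illustrative assumptions; the real PuttableProductResolver and Worker interfaces are not reproduced here.

```cpp
#include <cassert>
#include <functional>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a waiting-task list (not the CMSSW class):
// tasks added before doneWaiting() are queued; tasks added after it run at once.
class WaitList {
 public:
  void add(std::function<void()> task) {
    if (done_) {
      task();
    } else {
      tasks_.push_back(std::move(task));
    }
  }
  void doneWaiting() {
    assert(!done_ && "doneWaiting() called twice");
    done_ = true;
    std::vector<std::function<void()>> pending;
    pending.swap(tasks_);
    for (auto& task : pending) task();
  }

 private:
  bool done_ = false;
  std::vector<std::function<void()>> tasks_;
};

// Hypothetical resolver that borrows the producing worker's WaitList.
struct Resolver {
  explicit Resolver(WaitList& workerWaiters) : waiters(workerWaiters) {}
  WaitList& waiters;
  std::optional<std::string> product;
  void putProduct(std::string p) { product = std::move(p); }  // no release here
};

// Hypothetical worker: puts every product first, then releases consumers once.
struct Worker {
  WaitList waiters;
  void run(Resolver& a, Resolver& b) {
    a.putProduct("A (holds Ref to B)");
    b.putProduct("B");
    waiters.doneWaiting();
  }
};

// Returns whether the consumer of A could see B at the moment it ran.
bool consumerOfASeesB() {
  Worker worker;
  Resolver a(worker.waiters), b(worker.waiters);
  bool sawB = false;
  a.waiters.add([&] { sawB = b.product.has_value(); });
  worker.run(a, b);
  return sawB;
}
```

Here consumerOfASeesB() returns true, since the consumer only runs from worker.run() after both products are stored.
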
@cmsbuild
Contributor

cmsbuild commented Sep 23, 2022

A new Pull Request was created by @makortel (Matti Kortelainen) for CMSSW_12_5_X.

It involves the following packages:

  • FWCore/Framework (core)

@cmsbuild, @smuzaffar, @Dr15Jones, @makortel can you please review it and eventually sign? Thanks.
@missirol, @wddgit this is something you requested to watch as well.
@perrotta, @dpiparo, @rappoccio you are the release managers for this.

cms-bot commands are listed here

@makortel
Contributor Author

enable threading

@makortel
Contributor Author

@cmsbuild, please test

@cmsbuild
Contributor

-1

Failed Tests: UnitTests
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-33bec1/27742/summary.html
COMMIT: ddf3b87
CMSSW: CMSSW_12_5_X_2022-09-22-2300/el8_amd64_gcc10
Additional Tests: THREADING
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/39485/27742/install.sh to create a dev area with all the needed externals and cmssw changes.

Unit Tests

I found errors in the following unit tests:

---> test EcalTPG_updateWeightIdMap_test had ERRORS
---> test TestIOPoolInputNoParentDictionary had ERRORS

Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 4 differences found in the comparisons
  • DQMHistoTests: Total files compared: 51
  • DQMHistoTests: Total histograms compared: 3699454
  • DQMHistoTests: Total failures: 8
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3699424
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB (50 files compared)
  • Checked 212 log files, 49 edm output root files, 51 DQM output files
  • TriggerResults: no differences found

@makortel
Contributor Author

The EcalTPG_updateWeightIdMap_test already fails in the IBs (is a backport of some fix missing?).

The TestIOPoolInputNoParentDictionary seems to have the problems that were fixed later in 12_6_X. @smuzaffar, should we backport the fixes to 12_5_X?

@smuzaffar
Contributor

smuzaffar commented Sep 24, 2022

The TestIOPoolInputNoParentDictionary seems to have the problems that were fixed later in 12_6_X. @smuzaffar, should we backport the fixes to 12_5_X?

@makortel, sure. I just opened PR #39492 for 12_5_X.

@makortel
Contributor Author

@cmsbuild, please test

@cmsbuild
Contributor

-1

Failed Tests: UnitTests
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-33bec1/27771/summary.html
COMMIT: ddf3b87
CMSSW: CMSSW_12_5_X_2022-09-26-1100/el8_amd64_gcc10
Additional Tests: THREADING
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/39485/27771/install.sh to create a dev area with all the needed externals and cmssw changes.

Unit Tests

I found errors in the following unit tests:

---> test EcalTPG_updateWeightGroup_test had ERRORS
---> test EcalTPG_updateWeightIdMap_test had ERRORS

Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 4 differences found in the comparisons
  • DQMHistoTests: Total files compared: 51
  • DQMHistoTests: Total histograms compared: 3699454
  • DQMHistoTests: Total failures: 14
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3699418
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB (50 files compared)
  • Checked 212 log files, 49 edm output root files, 51 DQM output files
  • TriggerResults: no differences found

@makortel
Contributor Author

Apparently #39235 that fixed the EcalTPG_updateWeightGroup_test and EcalTPG_updateWeightIdMap_test has not been backported to 12_5_X.

@makortel
Contributor Author

+1

@cmsbuild
Contributor

This pull request is fully signed and it will be integrated in one of the next CMSSW_12_5_X IBs (but tests are reportedly failing) and once validation in the development release cycle CMSSW_12_6_X is complete. This pull request will now be reviewed by the release team before it's merged. @perrotta, @dpiparo, @rappoccio (and backports should be raised in the release meeting by the corresponding L2)

@perrotta
Contributor

perrotta commented Sep 26, 2022

backport of #39245

@perrotta
Contributor

perrotta commented Sep 26, 2022

Apparently #39235 that fixed the EcalTPG_updateWeightGroup_test and EcalTPG_updateWeightIdMap_test has not been backported to 12_5_X.

Thank you for pointing it out, @makortel.
It is now backported, see #39503.

@perrotta
Contributor

please test with #39503

@cmsbuild
Contributor

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-33bec1/27779/summary.html
COMMIT: ddf3b87
CMSSW: CMSSW_12_5_X_2022-09-26-1100/el8_amd64_gcc10
Additional Tests: THREADING
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/39485/27779/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 8 differences found in the comparisons
  • DQMHistoTests: Total files compared: 51
  • DQMHistoTests: Total histograms compared: 3699454
  • DQMHistoTests: Total failures: 19
  • DQMHistoTests: Total nulls: 1
  • DQMHistoTests: Total successes: 3699412
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: -0.004 KiB (50 files compared)
  • DQMHistoSizes: changed (312.0): -0.004 KiB MessageLogger/Warnings
  • Checked 212 log files, 49 edm output root files, 51 DQM output files
  • TriggerResults: no differences found

@perrotta
Contributor

+1

cmsbuild merged commit 126d5c5 into cms-sw:CMSSW_12_5_X Sep 27, 2022
makortel deleted the puttableProductResolverPutProduct_125x branch September 27, 2022 13:15

4 participants