
Upgrade ConcurrentHadronizerFilter and ConcurrentGeneratorFilter to support concurrent runs #39374

Merged: 3 commits, Sep 28, 2022

Conversation

@wddgit (Contributor) commented Sep 12, 2022

PR description:

Upgrade the code in this module to support concurrent runs. When support for concurrent runs is added to the Framework in the near future, it will become possible for streamBeginRun and streamEndRun to run concurrently with other transitions. There is special code in this module which would be broken by this additional concurrency.
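
For illustration, here is a minimal hypothetical sketch of the kind of pattern this implies: a bookkeeping value updated from stream transitions has to become a std::atomic once those transitions may overlap. Only the member name mirrors the diff quoted later in this conversation; the struct, helper, and main() are invented for the example.

#include <atomic>
#include <iostream>
#include <thread>
#include <vector>

// Hypothetical sketch, not the actual ConcurrentGeneratorFilter code: a
// bookkeeping member updated from stream-end transitions that, with concurrent
// runs enabled, may execute in parallel with each other and with run transitions.
struct StreamEndBookkeeping {
  std::atomic<unsigned long long> greatestNStreamEndLumis_{0};

  void update(unsigned long long n) {
    unsigned long long current = greatestNStreamEndLumis_.load();
    // A plain "if (n > current) member = n;" on a non-atomic member would be a
    // data race once these transitions overlap; the compare-exchange loop keeps
    // the "store the largest value seen so far" update well defined.
    while (n > current && !greatestNStreamEndLumis_.compare_exchange_weak(current, n)) {
    }
  }
};

int main() {
  StreamEndBookkeeping bookkeeping;
  std::vector<std::thread> streams;
  for (unsigned long long s = 1; s <= 4; ++s) {
    streams.emplace_back([&bookkeeping, s] { bookkeeping.update(s * 10); });
  }
  for (auto& t : streams) {
    t.join();
  }
  std::cout << bookkeeping.greatestNStreamEndLumis_.load() << '\n';  // prints 40
}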

This should not affect the output of this module or its externally visible behavior. Results should be the same before and after this is merged.

@Dr15Jones and @makortel Could you take a look at this also? This is more a technical change related to Framework support of concurrency than a generator issue.

PR validation:

Given that this does not change the output or behavior of this module, I am relying on existing tests. The unit tests pass. I also ran existing tests under the debugger and with print statements manually added to verify things were working as expected.

If there are any additional tests that generator experts have available, it would be great if someone could run them to verify that I didn't inadvertently break something.

@cmsbuild (Contributor):

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-39374/32077

  • This PR adds an extra 20KB to the repository

@cmsbuild (Contributor):

A new Pull Request was created by @wddgit (W. David Dagenhart) for master.

It involves the following packages:

  • GeneratorInterface/Core (generators)

@SiewYan, @mkirsano, @Saptaparna, @cmsbuild, @alberto-sanchez, @menglu21, @GurpreetSinghChahal can you please review it and eventually sign? Thanks.
@alberto-sanchez, @mkirsano this is something you requested to watch as well.
@perrotta, @dpiparo, @rappoccio you are the release managers for this.

cms-bot commands are listed here

@wddgit (Contributor, Author) commented Sep 12, 2022

please test

@makortel (Contributor):

@wddgit Could you add ConcurrentHadronizerFilter and ConcurrentGeneratorFilter to the PR title?

@wddgit changed the title from "Upgrade to support concurrent runs" to "Upgrade ConcurrentHadronizerFilter and ConcurrentGeneratorFilter to support concurrent runs" on Sep 12, 2022
@wddgit (Contributor, Author) commented Sep 12, 2022

I modified the title. Thanks. That is better.

mutable std::atomic<unsigned long long> greatestNStreamEndLumis_{0};
mutable std::atomic<bool> streamEndRunComplete_{true};
// The next two data members are thread safe and can be safely mutable because
// they are only modified/read in globalBeginRun and globalBeginLuminosityBlock.
Contributor:

Can we (in principle) have globalBeginRun and globalBeginLuminosityBlock transitions running simultaneously (obviously the lumi would have to belong to a different run)? Or two globalBeginRuns, or two globalBeginLuminosityBlocks?

Contributor (Author):

This is an interesting question. I would be interested in hearing Chris's (@Dr15Jones) opinion about it.

The existing code in CMSSW serializes globalBeginLumi transitions (and all run transitions). The globalBeginLumi transition completes before the Framework even looks to see what transition should come next. The run concurrency PR currently under review continues that pattern for globalBeginRun. globalBeginLumi and globalBeginRun are serialized like input source activity. They will not run concurrently. It would not be trivial to change this. It would take some effort just to identify the things that would break if we changed this behavior. I think it is built into the current Framework implementation in multiple places.

On the other hand, the TWIKI that documents the multithreading design does not require this serialization: https://twiki.cern.ch/twiki/bin/view/CMSPublic/FWMultithreadedFrameworkTransitions
Maybe we should add this requirement to this TWIKI documentation?

If we want this code to support that kind of concurrency, in addition to adding support for run concurrency, then I need to do more work on this PR (and probably in multiple other places, in separate PRs later).

My personal guess is that the performance improvement this additional concurrency would allow is not worth the effort that would be required to implement it.
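
As a rough illustration of what "serialized like input source activity" means here, the sketch below is hypothetical and is not the framework's actual task-queue implementation: transitions submitted from any thread are executed strictly one at a time, so two global begin bodies never overlap even in a multi-threaded job.

#include <functional>
#include <iostream>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Rough illustration only (not the actual edm implementation): global begin
// transitions submitted from any thread run strictly one at a time.
class SerialTransitionQueue {
public:
  void submit(std::function<void()> transition) {
    std::unique_lock<std::mutex> lock(mutex_);
    queue_.push(std::move(transition));
    if (running_) {
      return;  // the thread currently draining the queue will run it
    }
    running_ = true;
    while (!queue_.empty()) {
      auto task = std::move(queue_.front());
      queue_.pop();
      lock.unlock();
      task();  // only one transition body executes at any moment
      lock.lock();
    }
    running_ = false;
  }

private:
  std::mutex mutex_;
  std::queue<std::function<void()>> queue_;
  bool running_ = false;
};

int main() {
  SerialTransitionQueue globalQueue;
  std::vector<std::thread> threads;
  for (int run = 1; run <= 3; ++run) {
    threads.emplace_back([&globalQueue, run] {
      globalQueue.submit([run] { std::cout << "globalBeginRun " << run << '\n'; });
    });
  }
  for (auto& t : threads) {
    t.join();
  }
}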

Contributor:

The global run/lumi transitions need to be serialized as the EventSetup system doesn't support concurrent determination of new IOVs.

Contributor:

Thanks. I think we should document that the global run/lumi transitions are serialized (I didn't read the twiki David pointed to carefully enough to say whether that is already implied by the text there).

Contributor (Author):

I added this additional requirement on global begin transitions to the TWIKI (in the bullet points under the figure; the next-to-last one is new). Feel free to edit it, or let me know if you want me to reword it.

https://twiki.cern.ch/twiki/bin/view/CMSPublic/FWMultithreadedFrameworkTransitions

Contributor:

Thanks David, looks good.

@cmsbuild (Contributor):

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-33da22/27486/summary.html
COMMIT: a8f007e
CMSSW: CMSSW_12_6_X_2022-09-12-1100/el8_amd64_gcc10
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/39374/27486/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 5 differences found in the comparisons
  • DQMHistoTests: Total files compared: 51
  • DQMHistoTests: Total histograms compared: 3618326
  • DQMHistoTests: Total failures: 8
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3618296
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 50 files compared)
  • Checked 212 log files, 49 edm output root files, 51 DQM output files
  • TriggerResults: no differences found

while (useInLumi_.load() == nullptr) {
}

Contributor:

How about a comment here explaining why the extra wait?

Contributor (Author):

Done. Thanks.

useInLumi_.store(nullptr);
gen::GenStreamCache<HAD, DEC>* streamCachePtr = this->streamCache(id);
// Initialize this stream's hadronizer for the new lumi only if its cache is not
// the one already initialized for this lumi (recorded in the lumi cache).
if (this->luminosityBlockCache(lumi.index())->useInLumi_ != streamCachePtr) {
initLumi(streamCachePtr, lumi, es);
Contributor:

Just for clarity, should we have an else which sets the lumi cache useInLumi_ to nullptr? It isn't strictly needed, as the lumi cache will go away at the end of the luminosity block, but it might make the logic easier to understand and avoid possible uses of that cache later.

Contributor:

On second thought, resetting the lumi cache useInLumi_ here would require the variable to become an atomic, which really isn't worth it.

Contributor (Author):

This transition will run multiple times (once per stream), possibly concurrently. useInLumi_ in the lumi cache is not atomic. Only reading it here is OK, but writing would create a data race. I could make it atomic, but it does not seem worth adding the overhead of an atomic. I'm also not sure whether your proposed extra code clarifies things or just leads the reader to wonder why it was set to null. As is, this lumi cache value is written in one line of code and read in only this one line of code.
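
A tiny hypothetical example of the read/write distinction being made here (the type and variable names are illustrative, not the real cache classes): concurrent reads of a plain pointer are well defined as long as the only write happens before the readers start, whereas a write from the concurrently running stream transitions would be a data race unless the member became atomic.

#include <thread>
#include <vector>

// Illustrative only; this is not the real stream cache type.
struct StreamCache {};

int main() {
  StreamCache selected;
  StreamCache* useInLumi = &selected;  // the single write happens before any concurrent access

  std::vector<std::thread> streams;
  for (int s = 0; s < 4; ++s) {
    streams.emplace_back([&useInLumi, &selected] {
      // Concurrent reads of a plain pointer are fine as long as nobody writes.
      bool alreadyInitialized = (useInLumi == &selected);
      (void)alreadyInitialized;
      // useInLumi = nullptr;  // a write here from several streams would be a data race
    });
  }
  for (auto& t : streams) {
    t.join();
  }
}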

Contributor (Author):

I should have refreshed my window. I didn't see your second comment before my reply.

}

auto lumiCache = std::make_shared<gen::GenLumiCache<HAD, DEC>>();
// Record which stream cache was selected for this lumi; streamBeginLuminosityBlock
// compares against this per-lumi copy rather than the mutable module member.
lumiCache->useInLumi_ = useInLumi_.load();
Contributor:

If globalBeginLuminosityBlockProduce used the lumi cache's useInLumi_, we could reset the member variable useInLumi_ here, which might make the logic easier to follow.

Contributor (Author):

It is critically important that streamBeginLuminosityBlock uses the value from the lumi cache. That is the main reason I created the lumi cache, and using it only for that purpose emphasizes that requirement.

For globalBeginLuminosityBlockProduce, we could use either value (the module data member or the lumi cache member). To me it seems easier to use the class data member because it avoids the extra code to get the value out of the lumi cache, and globalBeginLuminosityBlock and globalBeginLuminosityBlockProduce run one immediately after the other. If you really want me to change this, let me know and I will; I don't feel strongly about it.
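
Putting the pieces of this discussion together, here is a hypothetical sketch of the intended flow (the types and method bodies are illustrative stand-ins, not the actual module code): the atomic module member is snapshotted into the lumi cache during the serialized global begin-lumi transition, and the per-stream transition only compares against that snapshot.

#include <atomic>

// Illustrative stand-ins, not the real gen::GenStreamCache / gen::GenLumiCache.
struct StreamCache {};

struct LumiCache {
  StreamCache* useInLumi_ = nullptr;  // written once in globalBeginLumi, read-only afterwards
};

struct ModuleSketch {
  std::atomic<StreamCache*> useInLumi_{nullptr};  // selected stream cache for the next lumi

  void globalBeginLuminosityBlock(LumiCache& lumiCache) {
    lumiCache.useInLumi_ = useInLumi_.load();  // snapshot into the per-lumi cache
  }

  void globalBeginLuminosityBlockProduce() {
    // Would use the module member here (it runs immediately after
    // globalBeginLuminosityBlock, so either value works) and then clears it.
    useInLumi_.store(nullptr);
  }

  void streamBeginLuminosityBlock(const LumiCache& lumiCache, StreamCache* streamCachePtr) const {
    if (lumiCache.useInLumi_ != streamCachePtr) {
      // initLumi(streamCachePtr, ...): this stream's cache was not the one
      // already initialized for this lumi.
    }
  }
};

int main() {
  ModuleSketch module;
  StreamCache selectedStream, otherStream;
  module.useInLumi_.store(&selectedStream);

  LumiCache lumiCache;
  module.globalBeginLuminosityBlock(lumiCache);
  module.globalBeginLuminosityBlockProduce();
  module.streamBeginLuminosityBlock(lumiCache, &otherStream);     // would initialize this stream
  module.streamBeginLuminosityBlock(lumiCache, &selectedStream);  // already initialized, skipped
}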

// The next two data members are thread safe and can be safely mutable because
// they are only modified/read in globalBeginRun and globalBeginLuminosityBlock.
mutable unsigned long long nGlobalBeginRuns_{0};
mutable unsigned long long nInitializedInGlobalLumi_{0};
Contributor:

Is this actually nInitializedInGlobalLumiAfterNewRun_?

Contributor (Author):

You're right. That is a more accurate name. I will change it.

@@ -407,6 +449,9 @@ namespace edm {
if (rCache->product_.compare_exchange_strong(expect, griproduct.get())) {
griproduct.release();
}
if (cache == useInLumi_.load()) {
Contributor:

If useInLumi_ is set to nullptr in globalBeginLuminosityBlockProduce will this ever happen?

Contributor (Author):

Yes. streamEndLuminosityBlockSummary selects the stream cache to use in globalBeginLumi and that occurs before streamEndRun. So it will happen on the stream with the selected stream cache.

@makortel (Contributor):

@cms-sw/generators-l2 Kind ping.

@wddgit (Contributor, Author) commented Sep 26, 2022

I ran runTheMatrix.py with all the tests (not just the limited ones) and threading enabled. The tests all passed (except for 3 known and unrelated failures). That test also included PR #38801 and PR #39491. The run concurrency pull request is now waiting only on approval of these 3 PRs. We are hoping to get it merged into 12_6_X soon so we have time to gain experience with it before 12_6_0 is finalized.

@wddgit (Contributor, Author) commented Sep 27, 2022

PR #39491 (the DQMStore fix) was merged last night and PR #38801 was approved by Core last night. The run concurrency PR is waiting only on this PR now.

@Saptaparna (Contributor):

+1

@cmsbuild (Contributor):

This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @perrotta, @dpiparo, @rappoccio (and backports should be raised in the release meeting by the corresponding L2)

@perrotta (Contributor):

please test
(let's refresh the 11-day-old tests before merging)

@cmsbuild (Contributor):

-1

Failed Tests: RelVals-THREADING
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-33da22/27801/summary.html
COMMIT: fac3b87
CMSSW: CMSSW_12_6_X_2022-09-27-1100/el8_amd64_gcc10
Additional Tests: THREADING
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/39374/27801/install.sh to create a dev area with all the needed externals and cmssw changes.

RelVals-THREADING

  • 23234.023234.0_TTbar_14TeV+2026D49+TTbar_14TeV_TuneCP5_GenSimHLBeamSpot14INPUT+DigiTrigger+RecoGlobal+HARVESTGlobal/step3_TTbar_14TeV+2026D49+TTbar_14TeV_TuneCP5_GenSimHLBeamSpot14INPUT+DigiTrigger+RecoGlobal+HARVESTGlobal.log
  • 35034.035034.0_TTbar_14TeV+2026D77+TTbar_14TeV_TuneCP5_GenSimHLBeamSpot14INPUT+DigiTrigger+RecoGlobal+HARVESTGlobal/step3_TTbar_14TeV+2026D77+TTbar_14TeV_TuneCP5_GenSimHLBeamSpot14INPUT+DigiTrigger+RecoGlobal+HARVESTGlobal.log
  • 28234.028234.0_TTbar_14TeV+2026D60+TTbar_14TeV_TuneCP5_GenSimHLBeamSpot14+DigiTrigger+RecoGlobal+HARVESTGlobal/step3_TTbar_14TeV+2026D60+TTbar_14TeV_TuneCP5_GenSimHLBeamSpot14+DigiTrigger+RecoGlobal+HARVESTGlobal.log
Expand to see more relval errors ...

Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 4 differences found in the comparisons
  • DQMHistoTests: Total files compared: 51
  • DQMHistoTests: Total histograms compared: 3624368
  • DQMHistoTests: Total failures: 3
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3624343
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 50 files compared)
  • Checked 212 log files, 49 edm output root files, 51 DQM output files
  • TriggerResults: no differences found

@makortel (Contributor):

The workflow failures are the ones that should get fixed with #39500.

@perrotta (Contributor):

please test with #39500

@cmsbuild (Contributor):

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-33da22/27803/summary.html
COMMIT: fac3b87
CMSSW: CMSSW_12_6_X_2022-09-27-1100/el8_amd64_gcc10
Additional Tests: THREADING
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/39374/27803/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 1 difference found in the comparisons
  • DQMHistoTests: Total files compared: 51
  • DQMHistoTests: Total histograms compared: 3624368
  • DQMHistoTests: Total failures: 3
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3624343
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 50 files compared)
  • Checked 212 log files, 49 edm output root files, 51 DQM output files
  • TriggerResults: no differences found

@perrotta (Contributor):

+1
