Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add SwitchProducer mechanism to allow runtime decision which algorithm implementation to run #25439

Merged
merged 4 commits into from Jan 10, 2019

Conversation

makortel
Copy link
Contributor

@makortel makortel commented Dec 6, 2018

This PR provides an implementation for a SwitchProducer to allow a runtime decision on which EDProducer from a given set to run for a given module label. The main use case in mind are heterogeneous/offloaded/accelerated algorithms, where we would run the accelerated EDProducer if the accelerator device is available in the system, and run the "legacy" CPU EDProducer otherwise.

For a given accelerator type, the generic SwitchProducer is supposed to be inherited along

def _switch_foo():
    return (isFooDeviceEnabled(), 2)

class SwitchProducerFoo(cms.SwitchProducer):
    def __init__(self, **kargs):
        super(SwitchProducerFoo,self).__init__(
            dict(cpu = SwitchProducer.getCpu(),
                 foo = _switch_foo),
            **kargs
        )

Here the deriving class provides a dictionary of possible cases along with functions that return a (bool, int) tuple. The bool tells whether that device type is enabled on the machine, and the int gives a priority of that device wrt. other devices. SwitchProducer.getCpu() returns a function for the CPU (always enabled, priority 1). Note that in order to be able to pickle the configuration, these functions have to be at (python) module level (i.e. e.g. lambdas or static methods are not sufficient).

The decision logic chooses the case that has the largest priority among those that are enabled. The decision is made at the point where the python configuration is transformed for C++, i.e. in case of production at the worker node.

The switch decision does not affect the configuration hash, meaning that the framework will consider a run/lumi/event produced with isFooDeviceEnabled() == False the same as with True.

The concrete SwitchProducerFoo would then be used along (e.g. in a cfi file)

bar = SwitchProducerFoo(
    cpu = cms.EDProducer("Bar", ...),
    foo = cms.EDProducer("BarOnFooDevice", ...)
)

In this example, if isFooDeviceEnabled() returns

  • True => BarOnFooDevice EDProducer is run
  • False => Bar EDProducer is run

The switched EDProducers are required to declare exactly the same products with the produces() call.

When a client asks for a products if bar, it will get pointed to the exactly the same edm::Wrapper that contains the product by the chosen EDProducer. The additional "layer' is, however, tracked in provenance, such that the parent of a product bar is bar@cpu or bar@foo, depending which one of the EDProducers was run.

The python interface is likely to evolve with more experience with it, with accelerated EDProducers, and with devices beyond CUDA. Note that this PR adds only the generic infrastructure, a CUDA-specific implementation will follow in a separate PR (or e.g. as part of the effort of #25353).

Tested in 10_4_0_pre3, no changes expected.

@cmsbuild
Copy link
Contributor

cmsbuild commented Dec 6, 2018

The code-checks are being triggered in jenkins.

@cmsbuild
Copy link
Contributor

cmsbuild commented Dec 6, 2018

@cmsbuild
Copy link
Contributor

cmsbuild commented Dec 6, 2018

A new Pull Request was created by @makortel (Matti Kortelainen) for master.

It involves the following packages:

DataFormats/Common
DataFormats/Provenance
FWCore/Framework
FWCore/Integration
FWCore/Modules
FWCore/ParameterSet

@cmsbuild, @smuzaffar, @Dr15Jones can you please review it and eventually sign? Thanks.
@Martin-Grunewald, @rovere, @wddgit this is something you requested to watch as well.
@davidlange6, @slava77, @fabiocos you are the release manager for this.

cms-bot commands are listed here

@fwyzard
Copy link
Contributor

fwyzard commented Dec 6, 2018

The switched EDProducers are required to declare exactly the same products with the produces() call.

Doesn't this go in the opposite direction of what we discussed earlier, with the possibility of having different "chain" of modules used for e.g. the CPU vs GPU reconstruction ?

Or do you expect the switch to be used only for the last module in the chain, and relay on the dependencies to trigger different chains ?

But, wouldn't that lead to different provenances ?

@makortel
Copy link
Contributor Author

makortel commented Dec 6, 2018

@cmsbuild, please test

@cmsbuild
Copy link
Contributor

cmsbuild commented Dec 6, 2018

The tests are being triggered in jenkins.
https://cmssdt.cern.ch/jenkins/job/ib-any-integration/32044/console Started: 2018/12/06 23:05

@makortel
Copy link
Contributor Author

makortel commented Dec 6, 2018

@fwyzard

Doesn't this go in the opposite direction of what we discussed earlier, with the possibility of having different "chain" of modules used for e.g. the CPU vs GPU reconstruction ?

No, enabling that has been the plan all along.

Or do you expect the switch to be used only for the last module in the chain, and relay on the dependencies to trigger different chains ?

Exactly, the switch is used only for the last module in the chain, and the prefetching (along data dependencies) takes care of the rest.

Or to be more precise, the switch should be used for each step of the chain for which a product is (possibly) needed in the CPU (such that a CPU version of the algorithm exists as well). E.g. for the current state of patatrack (#25353) there would be switches for

  • siPixelDigis: between SiPixelRawToDigi and (e.g.) SiPixelDigisFromCUDA
  • siPixelClustersPreSplitting: between SiPixelClusterProducer and (e.g.) SiPixelClustersFromCUDA
  • siPixelRecHitsPreSplitting: between SiPixelRecHitConverter and (e.g.) SiPixelRecHitsFromCUDA
  • pixelTracksHitQuadruplets: between CAHitQuadrupletEDProducer and (e.g.) CAHitQuadrupletsFromCUDA

Here all *FromCUDA are the modules that convert the CPU-side SoA to the "legacy" data formats (and are "to be done", naming convention can be discussed).

But, wouldn't that lead to different provenances ?

At the event level provenance (ProductProvenance), yes, but that is also intentional, so that we can check afterwards what exactly was run. But it's really not different from a (hypothetical) case now where an EDProducer (producing C) reads a product A on some events and B on the rest. In such a case the ProductProvenance shows that in "A" events the parent of C is A, and in "B" events the parent is B.

@cmsbuild
Copy link
Contributor

cmsbuild commented Dec 7, 2018

@cmsbuild
Copy link
Contributor

cmsbuild commented Dec 7, 2018

Comparison job queued.

@cmsbuild
Copy link
Contributor

cmsbuild commented Dec 7, 2018

Comparison is ready
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-25439/32044/summary.html

Comparison Summary:

  • No significant changes to the logs found
  • Reco comparison results: 0 differences found in the comparisons
  • DQMHistoTests: Total files compared: 33
  • DQMHistoTests: Total histograms compared: 3136422
  • DQMHistoTests: Total failures: 100
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3136118
  • DQMHistoTests: Total skipped: 204
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 32 files compared)
  • Checked 137 log files, 14 edm output root files, 33 DQM output files

@cmsbuild
Copy link
Contributor

cmsbuild commented Jan 8, 2019

The code-checks are being triggered in jenkins.

@makortel
Copy link
Contributor Author

makortel commented Jan 8, 2019

Tidied up the history. Apparently nowadays GitHub provides a nice link to show the diff of a force-push. That provides a way to see a diff of an update even if some earlier commits got updated (as long as the base commit stays the same).

@cmsbuild
Copy link
Contributor

cmsbuild commented Jan 8, 2019

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-25439/7862

  • This PR adds an extra 152KB to repository

@cmsbuild
Copy link
Contributor

cmsbuild commented Jan 8, 2019

Pull request #25439 was updated. @cmsbuild, @smuzaffar, @Dr15Jones can you please check and sign again.

@makortel
Copy link
Contributor Author

makortel commented Jan 8, 2019

@cmsbuild, please test

@cmsbuild
Copy link
Contributor

cmsbuild commented Jan 8, 2019

The tests are being triggered in jenkins.
https://cmssdt.cern.ch/jenkins/job/ib-any-integration/32471/console Started: 2019/01/08 17:04

@cmsbuild
Copy link
Contributor

cmsbuild commented Jan 9, 2019

@cmsbuild
Copy link
Contributor

cmsbuild commented Jan 9, 2019

Comparison job queued.

@cmsbuild
Copy link
Contributor

cmsbuild commented Jan 9, 2019

Comparison is ready
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-25439/32471/summary.html

Comparison Summary:

  • No significant changes to the logs found
  • Reco comparison results: 2 differences found in the comparisons
  • DQMHistoTests: Total files compared: 33
  • DQMHistoTests: Total histograms compared: 3153717
  • DQMHistoTests: Total failures: 232
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3153281
  • DQMHistoTests: Total skipped: 204
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 32 files compared)
  • Checked 137 log files, 14 edm output root files, 33 DQM output files

@Dr15Jones
Copy link
Contributor

+1

@cmsbuild
Copy link
Contributor

This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @davidlange6, @slava77, @smuzaffar, @fabiocos (and backports should be raised in the release meeting by the corresponding L2)

@fabiocos
Copy link
Contributor

+1

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants