Add a generic mechanism to specify compute accelerators to use in the configuration #36699

makortel · 2022-01-12T21:36:30Z

PR description:

This PR resolves #31760. It adds a new parameter

process.options.accelerators = cms.untracked.vstring('*')

that can be used to specify the compute accelerator(s) a job is allowed to use. Patterns with * and ? wildcards are allowed (similar to shell globbing). The default value is * (i.e. the intersection of what is defined in the job and what is available on the worker node), which preserves the current behavior with CUDA. An empty vstring is an error. If a specific accelerator name is given and the worker node does not have that accelerator, the job is terminated with a specific exit code.
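As an illustration of the wildcard semantics described above, here is a minimal standalone sketch. It is not the code from this PR: the function name `select_accelerators` is hypothetical, and Python's `fnmatch` module is assumed to be a reasonable stand-in for the shell-like matching (it also accepts `[...]` character classes, which the description does not mention).

```python
from fnmatch import fnmatchcase


def select_accelerators(patterns, available):
    """Return the labels in `available` matched by at least one pattern.

    Hypothetical helper, for illustration only. An empty pattern list is
    rejected, mirroring "an empty vstring is an error".
    """
    if not patterns:
        raise ValueError("accelerators must not be empty")
    selected = []
    for label in available:
        # Case-sensitive match supporting '*' and '?' wildcards
        if any(fnmatchcase(label, pattern) for pattern in patterns):
            selected.append(label)
    return selected


# Labels assumed to be available on some worker node
available = ["cpu", "gpu-nvidia"]

print(select_accelerators(["*"], available))      # everything available
print(select_accelerators(["gpu-*"], available))  # only GPU labels
print(select_accelerators(["cp?"], available))    # only the CPU fallback
```

With the default `*`, every label available on the node is selected, which corresponds to the "intersection" behavior mentioned above.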

The recognized (and allowed) values are specified with instances of new ProcessAccelerator-derived classes whose objects are attached to the Process. The system implicitly adds a cpu label to denote the CPU fallback. Each ProcessAccelerator class defines the accelerator labels it recognizes, returns the subset of those labels that are enabled on a worker node, and has the possibility to customize (within restrictions) the Process right before the point where the python configuration is "serialized" for the C++ code at the worker node. These customizations must not change the configuration hash, which is ensured by wrapping the Process object into another class that (currently) gives access only to Services.
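The three responsibilities listed above (declare labels, report the enabled subset, customize Services) can be sketched in plain Python. This is a standalone mock, not the framework classes: `AcceleratorSketch`, `FakeGpuAccelerator`, the method names, the `gpu-fake` label, and the dictionary standing in for the restricted Services view are all assumptions made for illustration.

```python
class AcceleratorSketch:
    """Stand-in for a ProcessAccelerator-derived class (hypothetical interface)."""

    def labels(self):
        """All accelerator labels this object recognizes."""
        raise NotImplementedError

    def enabled_labels(self):
        """Subset of labels() usable on the current worker node."""
        raise NotImplementedError

    def apply(self, services):
        """Customize the job; only a Services-like view is exposed."""


class FakeGpuAccelerator(AcceleratorSketch):
    def __init__(self, device_present):
        # In a real job this would be probed on the worker node
        self._device_present = device_present

    def labels(self):
        return ["gpu-fake"]

    def enabled_labels(self):
        return self.labels() if self._device_present else []

    def apply(self, services):
        # Add the supporting service only if it is not configured already
        services.setdefault("FakeGpuService", {"enabled": True})


def enabled_accelerators(accelerators):
    """Collect enabled labels; 'cpu' is always added implicitly."""
    enabled = ["cpu"]
    for accelerator in accelerators:
        enabled.extend(accelerator.enabled_labels())
    return enabled


services = {}
accelerator = FakeGpuAccelerator(device_present=True)
accelerator.apply(services)
print(enabled_accelerators([accelerator]))
print(services)
```

Passing only a Services-like object to `apply` mirrors the restriction described above: the customization cannot touch anything that would change the configuration hash.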

In order to interoperate with ProcessAccelerator, the SwitchProducers are changed slightly: the case-specific functions that tell whether a case is enabled now take the list of enabled accelerator labels as an argument.
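A minimal sketch of what such case functions could look like follows. It is standalone and does not use the framework's SwitchProducer class. The `(enabled, priority)` return convention, the case names, and the `choose_case` helper are assumptions for illustration; the description above only states that the functions receive the enabled accelerator labels.

```python
def case_cpu(accelerators):
    # The CPU case is always enabled and has the lowest priority
    return (True, 1)


def case_cuda(accelerators):
    # Enabled only if the NVIDIA GPU label is among the enabled accelerators
    return ("gpu-nvidia" in accelerators, 2)


def choose_case(cases, accelerators):
    """Pick the enabled case with the highest priority (hypothetical helper)."""
    best_name, best_priority = None, None
    for name, is_enabled in cases.items():
        enabled, priority = is_enabled(accelerators)
        if enabled and (best_priority is None or priority > best_priority):
            best_name, best_priority = name, priority
    if best_name is None:
        raise RuntimeError("no enabled case for accelerators %r" % (accelerators,))
    return best_name


cases = {"cpu": case_cpu, "cuda": case_cuda}
print(choose_case(cases, ["cpu"]))                # CPU-only node
print(choose_case(cases, ["cpu", "gpu-nvidia"]))  # node with an NVIDIA GPU
```

Because the case functions only inspect the labels they are given, the decision of whether a device is actually present stays with the accelerator object rather than with the SwitchProducer.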

Encapsulating the accelerator knowledge into specific configuration objects defined outside of the framework keeps the framework independent of the accelerator technologies, and makes the system flexible and easy to extend. For example, the design should accommodate Alpaka (or any portability technology) that can support multiple accelerators and communicate additional information about the chosen accelerator to the framework (such as the namespace of the EDModules corresponding to the chosen accelerator). It should also help with #30044.

For CUDA this PR adds ProcessAcceleratorCUDA, and replaces all current loads of HeterogeneousCore.CUDAServices.CUDAService_cfi with HeterogeneousCore.CUDACore.ProcessAcceleratorCUDA_cfi. ProcessAcceleratorCUDA internally loads the CUDAService if it is not already loaded, and also adds CUDAService to the MessageLogger categories. The logic that decides whether CUDA is enabled on a worker node is moved from SwitchProducerCUDA to ProcessAcceleratorCUDA. For the NVIDIA GPU "accelerator label" I picked gpu-nvidia.

PR validation:

Unit tests pass (on a machine without a GPU and on a machine with a GPU)

cmsbuild · 2022-01-12T21:44:39Z

-code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-36699/27772

This PR adds an extra 284KB to repository

Code check has found code style and quality issues which could be resolved by applying following patch(s)

code-format:
https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-36699/27772/code-format.patch
e.g. curl -k https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-36699/27772/code-format.patch | patch -p1
You can also run scram build code-format to apply code format directly

cmsbuild · 2022-01-13T00:53:35Z

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-36699/27773

This PR adds an extra 272KB to repository

cmsbuild · 2022-01-13T00:54:00Z

A new Pull Request was created by @makortel (Matti Kortelainen) for master.

It involves the following packages:

Configuration/StandardSequences (operations)
FWCore/Framework (core)
FWCore/Integration (core)
FWCore/ParameterSet (core)
FWCore/TestProcessor (core)
FWCore/Utilities (core)
HLTrigger/Configuration (hlt)
HeterogeneousCore/CUDACore (heterogeneous)
HeterogeneousCore/CUDATest (heterogeneous)
RecoLocalCalo/EcalRecProducers (reconstruction)
RecoLocalCalo/HGCalRecProducers (upgrade, reconstruction)
RecoLocalCalo/HcalRecProducers (reconstruction)
Validation/HGCalValidation (dqm)

@Martin-Grunewald, @perrotta, @makortel, @ahmad3213, @cmsbuild, @missirol, @fwyzard, @pmandrik, @smuzaffar, @Dr15Jones, @emanueleusai, @AdrianoDee, @jfernan2, @slava77, @jpata, @qliphy, @fabiocos, @pbo0, @clacaputo, @srimanob, @davidlange6, @rvenditti can you please review it and eventually sign? Thanks.
@felicepantaleo, @argiro, @Martin-Grunewald, @bsunanda, @pfs, @thomreis, @lgray, @mmusich, @slomeo, @sethzenz, @apsallid, @silviodonato, @abdoulline, @JanFSchulte, @dgulhan, @missirol, @simonepigazzini, @vandreev11, @GiacomoSguazzoni, @rovere, @VinInn, @cseez, @hatakeyamak, @ebrondol, @mtosi, @fabiocos, @rchatter, @wddgit, @edjtscott, @lecriste, @mariadalfonso this is something you requested to watch as well.
@perrotta, @dpiparo, @qliphy you are the release manager for this.

cms-bot commands are listed here

makortel · 2022-01-13T01:02:45Z

test parameters:

enable_test = gpu

makortel · 2022-01-13T01:02:51Z

@cmsbuild, please test

makortel · 2022-01-13T01:03:17Z

@Dr15Jones please review

@fwyzard could you take a look too?

clacaputo · 2022-02-22T11:52:54Z

+reconstruction

jfernan2 · 2022-02-22T11:53:31Z

+1

fwyzard · 2022-02-22T12:10:48Z

FWCore/Framework/src/ensureAvailableAccelerators.cc

+  void ensureAvailableAccelerators(edm::ParameterSet const& parameterSet) {
+    auto const& selectedAccelerators =
+        parameterSet.getUntrackedParameter<std::vector<std::string>>("@selected_accelerators");
+    ParameterSet const& optionsPset(parameterSet.getUntrackedParameterSet("options"));


also for a follow-up PR, this could be moved inside the if block

srimanob · 2022-02-22T12:17:47Z

+Upgrade

Martin-Grunewald · 2022-02-22T13:35:59Z

+1

perrotta · 2022-02-23T13:55:28Z

+operations

A few additions and improvements suggested in several parts of the thread remain unimplemented: they can be implemented in a follow-up PR

cmsbuild · 2022-02-23T13:55:50Z

This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @perrotta, @dpiparo, @qliphy (and backports should be raised in the release meeting by the corresponding L2)

perrotta · 2022-02-23T13:56:21Z

+1

makortel mentioned this pull request Jan 12, 2022

[RFC] Add a mechanism to set the chosen case for SwitchProducer #35510

Closed

cmsbuild added this to the CMSSW_12_3_X milestone Jan 12, 2022

cmsbuild added code-checks-pending core-pending dqm-pending heterogeneous-pending hlt-pending operations-pending orp-pending pending-signatures reconstruction-pending tests-pending upgrade-pending labels Jan 12, 2022

This was referenced Jan 12, 2022

Add a way to specify compute accelerators in the configuration cms-sw/framework-team#21

Closed

Add a way to specify compute accelerators in the configuration #31760

Closed

cmsbuild added code-checks-rejected and removed code-checks-pending labels Jan 12, 2022

makortel force-pushed the useAccelerators_v2 branch from 98596a8 to 80a7513 Compare January 13, 2022 00:46

cmsbuild added code-checks-pending and removed code-checks-rejected labels Jan 13, 2022

cmsbuild added code-checks-approved and removed code-checks-pending labels Jan 13, 2022

cmsbuild added tests-started and removed tests-pending labels Jan 13, 2022

cmsbuild added core-approved and removed core-pending labels Feb 22, 2022

cmsbuild added reconstruction-approved and removed reconstruction-pending labels Feb 22, 2022

cmsbuild added dqm-approved and removed dqm-pending labels Feb 22, 2022

fwyzard reviewed Feb 22, 2022

cmsbuild added upgrade-approved and removed upgrade-pending labels Feb 22, 2022

cmsbuild added hlt-approved and removed hlt-pending labels Feb 22, 2022

cmsbuild added fully-signed operations-approved and removed operations-pending pending-signatures labels Feb 23, 2022

cmsbuild added orp-approved and removed orp-pending labels Feb 23, 2022

cmsbuild merged commit f5bd904 into cms-sw:master Feb 23, 2022

makortel deleted the useAccelerators_v2 branch February 23, 2022 17:10

fwyzard mentioned this pull request Mar 3, 2022

ECAL DQM - Add WF .513 for ECAL GPU vs. CPU validation #37123

Merged
