Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a generic mechanism to specify compute accelerators to use in the configuration #36699

Merged
merged 9 commits into from Feb 23, 2022

Conversation

makortel
Copy link
Contributor

@makortel makortel commented Jan 12, 2022

PR description:

This PR resolves #31760. It adds a new parameter

process.options.accelerators = cms.untracked.vstring('*')

that can be used to specify the compute accelerator(s) a job should use. A special value auto (also the default to preserve the current behavior with CUDA) can be used to let the job to pick the accelerators that are available on a worker node. An empty vstring means that no accelerators are used. If a specific accelerator name is given and the worker node does not have that accelerator, the job is terminated with a specific exit code.
that can be used to specify the compute accelerator(s) a job is allowed to use. Patterns with * and ? wildcards are allowed (similar to shell). Default value is * (i.e. the intersection what's defined in the job and available in the worker node) to preserve the current behavior with CUDA. An empty vstring is an error.

The recognized (and allowed) values are specified with instances of new ProcessAccelerator-derived classes whose objects are attached to the Process. The system implicitly adds cpu label to denote CPU fallback. Each ProcessAccelerator class defines the accelerator labels it recognizes, returns a subset of the labels that are enabled on a worker node, and have a possibility to customize (within restrictions) the Process right before the point where the python configuration is "serialized" for C++ code at the worker node. These customizations must not change the configuration hash, which is ensured by wrapping the Process object into another class that (currently) gives access only to Services.

In order to interoperate with ProcessAccelerator the SwitchProducers are changed slightly: the case-specific functions that tell whether that case is enabled or not take now the list labels of enabled accelerators as an argument.

Encapsulating the accelerator knowledge into specific configuration objects defined outside of the framework leaves the framework to stay independent of the accelerator technologies, and makes the system flexible and easy to extend. This design should be easy to extend e.g. for the use of Alpaka (or any portability technology) that can support multiple accelerators and communicate additional information about the chosen accelerator for the framework (such as the namespace of EDModules corresponding the chosen accelerator). It should also help with #30044.

For CUDA this PR adds ProcessAcceleratorCUDA, and changes all current loads of HeterogeneousCore.CUDAServices.CUDAService_cfi with HeterogeneousCore.CUDACore.ProcessAcceleratorCUDA_cfi. The ProcessAcceleratorCUDA internally loads the CUDAService if it is not loaded already, and also adds the CUDAService to MessageLogger categories. The logic of whether CUDA is enabled on a worker node or not is moved from SwitchProducerCUDA to ProcessAcceleratorCUDA. For the NVIDIA GPU "acclerator label" I picked gpu-nvidia.

PR validation:

Unit tests pass (on a machine without a GPU and on a machine with a GPU)

@cmsbuild
Copy link
Contributor

-code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-36699/27772

  • This PR adds an extra 284KB to repository

Code check has found code style and quality issues which could be resolved by applying following patch(s)

@cmsbuild
Copy link
Contributor

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-36699/27773

  • This PR adds an extra 272KB to repository

@cmsbuild
Copy link
Contributor

A new Pull Request was created by @makortel (Matti Kortelainen) for master.

It involves the following packages:

  • Configuration/StandardSequences (operations)
  • FWCore/Framework (core)
  • FWCore/Integration (core)
  • FWCore/ParameterSet (core)
  • FWCore/TestProcessor (core)
  • FWCore/Utilities (core)
  • HLTrigger/Configuration (hlt)
  • HeterogeneousCore/CUDACore (heterogeneous)
  • HeterogeneousCore/CUDATest (heterogeneous)
  • RecoLocalCalo/EcalRecProducers (reconstruction)
  • RecoLocalCalo/HGCalRecProducers (upgrade, reconstruction)
  • RecoLocalCalo/HcalRecProducers (reconstruction)
  • Validation/HGCalValidation (dqm)

@Martin-Grunewald, @perrotta, @makortel, @ahmad3213, @cmsbuild, @missirol, @fwyzard, @pmandrik, @smuzaffar, @Dr15Jones, @emanueleusai, @AdrianoDee, @jfernan2, @slava77, @jpata, @qliphy, @fabiocos, @pbo0, @clacaputo, @srimanob, @davidlange6, @rvenditti can you please review it and eventually sign? Thanks.
@felicepantaleo, @argiro, @Martin-Grunewald, @bsunanda, @pfs, @thomreis, @lgray, @mmusich, @slomeo, @sethzenz, @apsallid, @silviodonato, @abdoulline, @JanFSchulte, @dgulhan, @missirol, @simonepigazzini, @vandreev11, @GiacomoSguazzoni, @rovere, @VinInn, @cseez, @hatakeyamak, @ebrondol, @mtosi, @fabiocos, @rchatter, @wddgit, @edjtscott, @lecriste, @mariadalfonso this is something you requested to watch as well.
@perrotta, @dpiparo, @qliphy you are the release manager for this.

cms-bot commands are listed here

@makortel
Copy link
Contributor Author

test parameters:

  • enable_test = gpu

@makortel
Copy link
Contributor Author

@cmsbuild, please test

@makortel
Copy link
Contributor Author

@Dr15Jones please review

@fwyzard could you take a look too?

@clacaputo
Copy link
Contributor

+reconstruction

@jfernan2
Copy link
Contributor

+1

void ensureAvailableAccelerators(edm::ParameterSet const& parameterSet) {
auto const& selectedAccelerators =
parameterSet.getUntrackedParameter<std::vector<std::string>>("@selected_accelerators");
ParameterSet const& optionsPset(parameterSet.getUntrackedParameterSet("options"));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also for a follow-up PR, this could be moved inside the if block

@srimanob
Copy link
Contributor

+Upgrade

@Martin-Grunewald
Copy link
Contributor

+1

@perrotta
Copy link
Contributor

+operations

  • A few additions and improvements suggested in several parts of the thread remain uninstatiated: they can be implemented in a follow up PR

@cmsbuild
Copy link
Contributor

This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @perrotta, @dpiparo, @qliphy (and backports should be raised in the release meeting by the corresponding L2)

@perrotta
Copy link
Contributor

+1

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add a way to specify compute accelerators in the configuration
10 participants