Redesign all GPU workflows to detect if a GPU is present, and fall back to CPU otherwise (11.3.x) #33519

Merged
merged 11 commits into from May 12, 2021

@fwyzard fwyzard commented Apr 24, 2021

PR description:

Redesign the GPU workflows:

  • the CPU (e.g. ###.501) and GPU (e.g. ###.502) workflows should now be as close to each other as possible;
  • the implementation of the CPU and GPU workflows has been simplified;
  • all GPU workflows use the SwitchProducerCUDA mechanism to detect if a GPU is available and offload a module or task to the GPU; if not, they automatically fall back to the equivalent CPU modules and tasks;
  • when the "gpu" modifier is used, the pixel local reconstruction workflow uses the "HLT" payload type both on the CPU and on the GPU, for better consistency of the results;
  • the "Patatrack" pixel tracks reconstruction on CPU is based on a modifier (pixelNtupletFit) instead of a customisation, in line with the other workflows;
  • the HCAL-only workflows should follow more closely the implementation of the general reconstruction sequence, both for Run 2 (2018) and Run 3 scenarios.
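
The fallback behaviour described in the list above can be illustrated with a small, self-contained sketch. This is a toy model in plain Python, not the CMSSW `SwitchProducerCUDA` API: the class `ToySwitch`, the functions `clusterize_cpu` and `clusterize_gpu`, and the `available` set are all hypothetical names introduced only for illustration. It assumes the selection rule is "prefer the GPU case when a GPU is available, otherwise use the CPU case".

```python
# Toy model of the "switch" idea; this is NOT the CMSSW SwitchProducerCUDA API.
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class ToySwitch:
    """Holds one implementation per backend and picks one at run time."""
    cases: Dict[str, Callable[[list], list]]
    # Backends in order of preference (assumption); "cpu" is the fallback.
    priority: tuple = ("cuda", "cpu")

    def choose(self, available: set) -> str:
        # Return the first preferred backend that is both defined and available.
        for name in self.priority:
            if name in self.cases and name in available:
                return name
        raise RuntimeError("no usable backend")

    def run(self, data: list, available: set) -> list:
        # Dispatch to whichever backend was chosen.
        return self.cases[self.choose(available)](data)


def clusterize_cpu(digis: list) -> list:
    # Hypothetical stand-in for a CPU module.
    return sorted(digis)


def clusterize_gpu(digis: list) -> list:
    # Hypothetical stand-in for a GPU module; it should match the CPU result.
    return sorted(digis)


switch = ToySwitch(cases={"cpu": clusterize_cpu, "cuda": clusterize_gpu})

# With a GPU the "cuda" case is chosen; without one the switch falls back to "cpu".
with_gpu = switch.choose({"cpu", "cuda"})
without_gpu = switch.choose({"cpu"})
```

In this sketch the same workflow definition can be used on machines with and without a GPU, because the choice is made when the job runs rather than when the configuration is written.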

Some changes to the relevant EDProducers have made the definition of the workflows easier:

  • the SoA-to-legacy HCAL rechit producer has been updated to make the production of the SoA and/or legacy collections optional;
  • the legacy ECAL unpacker has been updated to declare only the event products it will actually produce;
  • the default labels used in many modules have been updated to reflect the labels used in the configuration.

Some other general changes and code clean up:

  • remove some no-longer-used files as well as some commented-out code;
  • always clone() a module used in a SwitchProducerCUDA;
  • move the implementation of the gpuVertexFinder kernels from gpuVertexFinderImpl.h to gpuVertexFinder.cc.
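
The description does not spell out why modules used in a SwitchProducerCUDA are always cloned. One general motivation for cloning a configuration object before reusing it is to avoid aliasing, where two labels refer to the same object and a later change to one silently changes the other. The sketch below shows that effect with a toy class; `ToyModule`, its `clone()` method and the parameter names are hypothetical and only mimic the idea, they are not the CMSSW configuration classes.

```python
# Toy illustration of aliasing versus cloning; not the CMSSW configuration API.
import copy


class ToyModule:
    """Hypothetical stand-in for a configurable module."""

    def __init__(self, **params):
        self.params = dict(params)

    def clone(self, **overrides):
        # Return an independent copy, optionally with some parameters changed.
        new = copy.deepcopy(self)
        new.params.update(overrides)
        return new


legacy = ToyModule(src="rawDataCollector", threshold=10)

shared_case = legacy          # same object under a second name
cloned_case = legacy.clone()  # independent copy

# A later customisation of the original module...
legacy.params["threshold"] = 99

# ...leaks into the shared object but not into the clone.
shared_threshold = shared_case.params["threshold"]
cloned_threshold = cloned_case.params["threshold"]
```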

PR validation:

The GPU workflows (e.g. ###.502) now also work without a GPU:

CUDA_VISIBLE_DEVICES= runTheMatrix.py -w upgrade -j 16 -l 10824.501,10824.502,10824.505,10824.506,10824.511,10824.512,10824.521,10824.522,11634.501,11634.502,11634.505,11634.506,11634.511,11634.512,11634.521,11634.522
...
10824.501_TTbar_13+2018_Patatrack_PixelOnlyCPU+TTbar_13TeV_TuneCUETP8M1_GenSim+Digi+RecoFakeHLT+HARVESTFakeHLT Step0-PASSED Step1-PASSED Step2-PASSED Step3-PASSED  - time date Sat Apr 24 08:26:11 2021-date Sat Apr 24 08:21:54 2021; exit: 0 0 0 0
10824.502_TTbar_13+2018_Patatrack_PixelOnlyGPU+TTbar_13TeV_TuneCUETP8M1_GenSim+Digi+RecoFakeHLT+HARVESTFakeHLT Step0-PASSED Step1-PASSED Step2-PASSED Step3-PASSED  - time date Sat Apr 24 08:26:08 2021-date Sat Apr 24 08:21:55 2021; exit: 0 0 0 0
10824.505_TTbar_13+2018_Patatrack_PixelOnlyTripletsCPU+TTbar_13TeV_TuneCUETP8M1_GenSim+Digi+RecoFakeHLT+HARVESTFakeHLT Step0-PASSED Step1-PASSED Step2-PASSED Step3-PASSED  - time date Sat Apr 24 08:26:12 2021-date Sat Apr 24 08:21:55 2021; exit: 0 0 0 0
10824.506_TTbar_13+2018_Patatrack_PixelOnlyTripletsGPU+TTbar_13TeV_TuneCUETP8M1_GenSim+Digi+RecoFakeHLT+HARVESTFakeHLT Step0-PASSED Step1-PASSED Step2-PASSED Step3-PASSED  - time date Sat Apr 24 08:26:08 2021-date Sat Apr 24 08:21:56 2021; exit: 0 0 0 0
10824.511_TTbar_13+2018_Patatrack_ECALOnlyCPU+TTbar_13TeV_TuneCUETP8M1_GenSim+Digi+RecoFakeHLT+HARVESTFakeHLT Step0-PASSED Step1-PASSED Step2-PASSED Step3-PASSED  - time date Sat Apr 24 08:26:03 2021-date Sat Apr 24 08:21:56 2021; exit: 0 0 0 0
10824.512_TTbar_13+2018_Patatrack_ECALOnlyGPU+TTbar_13TeV_TuneCUETP8M1_GenSim+Digi+RecoFakeHLT+HARVESTFakeHLT Step0-PASSED Step1-PASSED Step2-PASSED Step3-PASSED  - time date Sat Apr 24 08:26:03 2021-date Sat Apr 24 08:21:57 2021; exit: 0 0 0 0
10824.521_TTbar_13+2018_Patatrack_HCALOnlyCPU+TTbar_13TeV_TuneCUETP8M1_GenSim+Digi+RecoFakeHLT+HARVESTFakeHLT Step0-PASSED Step1-PASSED Step2-PASSED Step3-PASSED  - time date Sat Apr 24 08:26:03 2021-date Sat Apr 24 08:21:57 2021; exit: 0 0 0 0
10824.522_TTbar_13+2018_Patatrack_HCALOnlyGPU+TTbar_13TeV_TuneCUETP8M1_GenSim+Digi+RecoFakeHLT+HARVESTFakeHLT Step0-PASSED Step1-PASSED Step2-PASSED Step3-PASSED  - time date Sat Apr 24 08:26:03 2021-date Sat Apr 24 08:21:58 2021; exit: 0 0 0 0
11634.501_TTbar_14TeV+2021_Patatrack_PixelOnlyCPU+TTbar_14TeV_TuneCP5_GenSim+Digi+Reco+HARVEST Step0-PASSED Step1-PASSED Step2-PASSED Step3-PASSED  - time date Sat Apr 24 08:26:20 2021-date Sat Apr 24 08:21:58 2021; exit: 0 0 0 0
11634.502_TTbar_14TeV+2021_Patatrack_PixelOnlyGPU+TTbar_14TeV_TuneCP5_GenSim+Digi+Reco+HARVEST Step0-PASSED Step1-PASSED Step2-PASSED Step3-PASSED  - time date Sat Apr 24 08:26:17 2021-date Sat Apr 24 08:21:59 2021; exit: 0 0 0 0
11634.505_TTbar_14TeV+2021_Patatrack_PixelOnlyTripletsCPU+TTbar_14TeV_TuneCP5_GenSim+Digi+Reco+HARVEST Step0-PASSED Step1-PASSED Step2-PASSED Step3-PASSED  - time date Sat Apr 24 08:26:20 2021-date Sat Apr 24 08:21:59 2021; exit: 0 0 0 0
11634.506_TTbar_14TeV+2021_Patatrack_PixelOnlyTripletsGPU+TTbar_14TeV_TuneCP5_GenSim+Digi+Reco+HARVEST Step0-PASSED Step1-PASSED Step2-PASSED Step3-PASSED  - time date Sat Apr 24 08:26:17 2021-date Sat Apr 24 08:22:00 2021; exit: 0 0 0 0
11634.511_TTbar_14TeV+2021_Patatrack_ECALOnlyCPU+TTbar_14TeV_TuneCP5_GenSim+Digi+Reco+HARVEST Step0-PASSED Step1-PASSED Step2-PASSED Step3-PASSED  - time date Sat Apr 24 08:26:13 2021-date Sat Apr 24 08:22:00 2021; exit: 0 0 0 0
11634.512_TTbar_14TeV+2021_Patatrack_ECALOnlyGPU+TTbar_14TeV_TuneCP5_GenSim+Digi+Reco+HARVEST Step0-PASSED Step1-PASSED Step2-PASSED Step3-PASSED  - time date Sat Apr 24 08:26:13 2021-date Sat Apr 24 08:22:01 2021; exit: 0 0 0 0
11634.521_TTbar_14TeV+2021_Patatrack_HCALOnlyCPU+TTbar_14TeV_TuneCP5_GenSim+Digi+Reco+HARVEST Step0-PASSED Step1-PASSED Step2-PASSED Step3-PASSED  - time date Sat Apr 24 08:26:12 2021-date Sat Apr 24 08:22:01 2021; exit: 0 0 0 0
11634.522_TTbar_14TeV+2021_Patatrack_HCALOnlyGPU+TTbar_14TeV_TuneCP5_GenSim+Digi+Reco+HARVEST Step0-PASSED Step1-PASSED Step2-PASSED Step3-PASSED  - time date Sat Apr 24 08:26:12 2021-date Sat Apr 24 08:22:02 2021; exit: 0 0 0 0
16 16 16 16 tests passed, 0 0 0 0 failed
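
As a convenience, the workflow list used in the command above can be generated rather than typed by hand, and the final summary line can be checked automatically. The following is a minimal sketch that only builds the argument list and environment and parses a summary string; it does not launch `runTheMatrix.py`. The helper `parse_summary` and the variable names are hypothetical. Setting `CUDA_VISIBLE_DEVICES` to an empty string hides all GPUs from CUDA applications, which is how the command above exercises the CPU fallback.

```python
# Sketch: build the validation command and parse its summary line.
import os

bases = (10824, 11634)                                 # 2018 and 2021 workflows
suffixes = (501, 502, 505, 506, 511, 512, 521, 522)    # CPU/GPU variants

workflows = [f"{base}.{suffix}" for base in bases for suffix in suffixes]
workflow_list = ",".join(workflows)

# Same options as in the command shown in the PR validation section.
command = ["runTheMatrix.py", "-w", "upgrade", "-j", "16", "-l", workflow_list]

# Copy of the current environment with all GPUs hidden from CUDA.
env = dict(os.environ, CUDA_VISIBLE_DEVICES="")


def parse_summary(line):
    """Split 'a b c d tests passed, w x y z failed' into two lists of ints."""
    passed_part, failed_part = line.split(" tests passed, ")
    passed = [int(x) for x in passed_part.split()]
    failed = [int(x) for x in failed_part.replace(" failed", "").split()]
    return passed, failed


passed, failed = parse_summary("16 16 16 16 tests passed, 0 0 0 0 failed")
all_ok = all(n == 0 for n in failed) and all(n == len(workflows) for n in passed)
```

The pair `command` and `env` could then be passed to `subprocess.run(command, env=env)` inside a CMSSW environment.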

If this PR is a backport please specify the original PR and why you need to backport that PR:

This PR is a backport of #33428 to CMSSW_11_3_X (they are cut from the same branch).

@cmsbuild
Contributor

cmsbuild commented Apr 24, 2021

A new Pull Request was created by @fwyzard (Andrea Bocci) for CMSSW_11_3_X.

It involves the following packages:

CalibTracker/Configuration
Configuration/ProcessModifiers
Configuration/PyReleaseValidation
Configuration/StandardSequences
DQM/HcalTasks
DQM/Integration
EventFilter/EcalRawToDigi
EventFilter/SiPixelRawToDigi
RecoLocalCalo/Configuration
RecoLocalCalo/EcalRecProducers
RecoLocalCalo/HcalRecProducers
RecoLocalTracker/Configuration
RecoLocalTracker/SiPixelClusterizer
RecoLocalTracker/SiPixelRecHits
RecoParticleFlow/PFClusterProducer
RecoPixelVertexing/Configuration
RecoPixelVertexing/PixelTrackFitting
RecoPixelVertexing/PixelTriplets
RecoPixelVertexing/PixelVertexFinding
RecoTracker/Configuration

@andrius-k, @chayanit, @wajidalikhan, @kpedro88, @tlampen, @pohsun, @perrotta, @yuanchao, @silviodonato, @ErnestaP, @ahmad3213, @cmsbuild, @davidlange6, @jfernan2, @slava77, @jpata, @qliphy, @fabiocos, @francescobrivio, @malbouis, @jordan-martins, @kmaeshima, @christopheralanwest, @franzoni, @srimanob, @rvenditti can you please review it and eventually sign? Thanks.
@echabert, @felicepantaleo, @yduhm, @robervalwalsh, @argiro, @Martin-Grunewald, @OzAmram, @thomreis, @lgray, @threus, @mmusich, @seemasharmafnal, @mmarionncern, @battibass, @makortel, @abdoulline, @JanFSchulte, @dgulhan, @apsallid, @slomeo, @simonepigazzini, @pieterdavid, @DryRun, @GiacomoSguazzoni, @rovere, @VinInn, @ferencek, @tocheng, @hatakeyamak, @alesaggio, @ebrondol, @mtosi, @fabiocos, @clelange, @batinkov, @rchatter, @gbenelli, @dkotlins, @lecriste, @cbernet, @gpetruc, @mariadalfonso, @tvami this is something you requested to watch as well.
@silviodonato, @dpiparo, @qliphy you are the release manager for this.

cms-bot commands are listed here

@fwyzard
Contributor Author

fwyzard commented Apr 24, 2021

backport #33428

@fwyzard
Contributor Author

fwyzard commented Apr 24, 2021

please test

@fwyzard fwyzard changed the title from "Auto gpu workflows" to "Redesign all GPU workflows to detect if a GPU is present, and fall back to CPU otherwise (11.3.x)" Apr 24, 2021
@cmsbuild
Contributor

-1

Failed Tests: RelVals
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-4ddaba/14558/summary.html
COMMIT: 24fa417
CMSSW: CMSSW_11_3_X_2021-04-23-2300/slc7_amd64_gcc900
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/33519/14558/install.sh to create a dev area with all the needed externals and cmssw changes.

RelVals

  • 7.37.3_CosmicsSPLoose_UP18+CosmicsSPLoose_UP18+DIGICOS_UP18+RECOCOS_UP18+ALCACOS_UP18+HARVESTCOS_UP18/step3_CosmicsSPLoose_UP18+CosmicsSPLoose_UP18+DIGICOS_UP18+RECOCOS_UP18+ALCACOS_UP18+HARVESTCOS_UP18.log
  • 101.0101.0_SingleElectronE120EHCAL+SingleElectronE120EHCAL/step1_SingleElectronE120EHCAL+SingleElectronE120EHCAL.log

@fwyzard
Contributor Author

fwyzard commented May 6, 2021

> Would it be a good idea to have a 11_3_1 when this PR is merged? So that, local tests on any GPU machines can be done easily?

Yes - local tests and grid tests, too.

@silviodonato
Contributor

+Upgrade

> Would it be a good idea to have a 11_3_1 when this PR is merged? So that, local tests on any GPU machines can be done easily? @silviodonato @qliphy

yes, we will build a new release rather soon also for the MWGR

@jfernan2
Contributor

jfernan2 commented May 6, 2021

+1

@civanch
Contributor

civanch commented May 8, 2021

+1

@chayanit

chayanit commented May 8, 2021

+1

@qliphy
Contributor

qliphy commented May 10, 2021

ping @cms-sw/alca-l2

@yuanchao
Contributor

+1

@qliphy
Contributor

qliphy commented May 11, 2021

+operations

@cmsbuild
Contributor

This pull request is fully signed and it will be integrated in one of the next CMSSW_11_3_X IBs (tests are also fine) and once validation in the development release cycle CMSSW_12_0_X is complete. This pull request will now be reviewed by the release team before it's merged. @silviodonato, @dpiparo, @qliphy (and backports should be raised in the release meeting by the corresponding L2)

@qliphy
Contributor

qliphy commented May 12, 2021

+1

@cmsbuild cmsbuild merged commit fcba6d9 into cms-sw:CMSSW_11_3_X May 12, 2021
@fwyzard fwyzard deleted the auto_gpu_workflows branch August 18, 2021 13:35