Update CUDA to version 11.5.2 #7669


fwyzard
Contributor

@fwyzard fwyzard commented Mar 6, 2022

Update CUDA to version 11.5.2:

  • CUDA runtime version 11.5.117
  • NVIDIA drivers version 495.29.05

See https://docs.nvidia.com/cuda/archive/11.5.2/cuda-toolkit-release-notes/index.html for the full CUDA 11.5.x release notes and change log.

Update cuDNN to version 8.3.2.44.
See https://docs.nvidia.com/deeplearning/cudnn/release-notes/rel_8.html for the release notes and change log.

@cmsbuild
Contributor

cmsbuild commented Mar 6, 2022

A new Pull Request was created by @fwyzard (Andrea Bocci) for branch IB/CMSSW_12_3_X/master.

@cmsbuild, @smuzaffar, @iarspider can you please review it and eventually sign? Thanks.
@perrotta, @dpiparo, @qliphy you are the release manager for this.
cms-bot commands are listed here

@fwyzard
Contributor Author

fwyzard commented Mar 6, 2022

The comparisons done in cms-sw/cmssw#37129 indicate that CUDA 11.5.x should give a small performance improvement with respect to 11.4.x:
[image: performance comparison between CUDA 11.4.x and 11.5.x]

@fwyzard
Contributor Author

fwyzard commented Mar 6, 2022

@smuzaffar before merging this, I'd like to compare the performance in CMSSW itself using the PR artifacts.

@fwyzard
Contributor Author

fwyzard commented Mar 6, 2022

please test

@cmsbuild
Contributor

cmsbuild commented Mar 6, 2022

-1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-810d14/22886/summary.html
COMMIT: 2142753
CMSSW: CMSSW_12_3_X_2022-03-05-1100/slc7_amd64_gcc10
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmsdist/7669/22886/install.sh to create a dev area with all the needed externals and cmssw changes.

External Build

I found a compilation error when building:

+ chmod -Rf a+rX,u+w,g-w,o-w .
+ '[' 11.5 '!=' 11.4 ']'
+ echo 'Incompatible CUDA version in cudnn.spec!'
Incompatible CUDA version in cudnn.spec!
+ exit 1
error: Bad exit status from /pool/condor/dir_96975/jenkins/workspace/ib-run-pr-tests/testBuildDir/tmp/rpm-tmp.KFFlBn (%prep)


RPM build errors:
Bad exit status from /pool/condor/dir_96975/jenkins/workspace/ib-run-pr-tests/testBuildDir/tmp/rpm-tmp.KFFlBn (%prep)



@smuzaffar
Contributor

@fwyzard , looks like cudnn also needs an update
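
The trace above shows the `%prep` step comparing `11.5` against `11.4` and aborting. The actual `cudnn.spec` is not quoted in this thread, so the following is only a minimal Python sketch of that kind of guard; the function names and the idea of pinning cuDNN to a CUDA `major.minor` string are illustrative assumptions.

```python
def major_minor(version: str) -> str:
    """Return the 'major.minor' prefix of a dotted version string."""
    parts = version.split(".")
    if len(parts) < 2:
        raise ValueError(f"expected at least major.minor, got {version!r}")
    return ".".join(parts[:2])


def check_cuda_compat(toolkit_version: str, cudnn_built_for: str) -> None:
    """Raise if the toolkit's major.minor differs from the one cuDNN targets.

    Both names are hypothetical; they only mirror the comparison in the log.
    """
    found = major_minor(toolkit_version)
    if found != cudnn_built_for:
        raise RuntimeError(
            f"Incompatible CUDA version: toolkit is {found}, "
            f"cuDNN package expects {cudnn_built_for}"
        )


# Situation in the log: toolkit bumped to 11.5.2, cuDNN still pinned to 11.4.
try:
    check_cuda_compat("11.5.2", "11.4")
except RuntimeError as err:
    print(err)

# Once the cuDNN pin matches the toolkit, the check passes silently.
check_cuda_compat("11.5.2", "11.5")
```

Failing early on a mismatch like this is presumably why the build stopped in `%prep` rather than producing a cuDNN package built against the wrong toolkit.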

@fwyzard
Contributor Author

fwyzard commented Mar 7, 2022 via email

@cmsbuild
Contributor

cmsbuild commented Mar 7, 2022

Pull request #7669 was updated.

@fwyzard
Contributor Author

fwyzard commented Mar 7, 2022

please test

@fwyzard fwyzard changed the title from "Update to CUDA 11.5.2" to "Update CUDA to version 11.5.2" on Mar 7, 2022
@fwyzard fwyzard force-pushed the IB/CMSSW_12_3_X/master_cuda_11.5.2 branch from e646de7 to 2c2ae36 on March 7, 2022 22:38
@cmsbuild
Contributor

cmsbuild commented Mar 7, 2022

Pull request #7669 was updated.

@fwyzard
Contributor Author

fwyzard commented Mar 7, 2022

please test

@fwyzard
Contributor Author

fwyzard commented Mar 7, 2022

please test for aarch64

@smuzaffar smuzaffar changed the base branch from IB/CMSSW_12_3_X/master to IB/CMSSW_12_4_X/master on March 11, 2022 08:15
@smuzaffar
Contributor

enable gpu

@smuzaffar
Contributor

please test

@cmsbuild
Contributor

-1

Failed Tests: UnitTests
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-810d14/23056/summary.html
COMMIT: 1ff1772
CMSSW: CMSSW_12_4_X_2022-03-11-1100/slc7_amd64_gcc10
Additional Tests: GPU
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmsdist/7669/23056/install.sh to create a dev area with all the needed externals and cmssw changes.

The following merge commits were also included on top of IB + this PR after doing git cms-merge-topic:

You can see more details here:
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-810d14/23056/git-recent-commits.json
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-810d14/23056/git-merge-result

Unit Tests

I found errors in the following unit tests:

---> test test_edmPickEvents had ERRORS

GPU Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 0 differences found in the comparisons
  • Reco comparison had 3 failed jobs
  • DQMHistoTests: Total files compared: 4
  • DQMHistoTests: Total histograms compared: 19811
  • DQMHistoTests: Total failures: 1567
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 18244
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 3 files compared)
  • Checked 12 log files, 9 edm output root files, 4 DQM output files
  • TriggerResults: no differences found

Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 6 differences found in the comparisons
  • DQMHistoTests: Total files compared: 49
  • DQMHistoTests: Total histograms compared: 3695153
  • DQMHistoTests: Total failures: 56
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3695075
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 48 files compared)
  • Checked 204 log files, 45 edm output root files, 49 DQM output files
  • TriggerResults: no differences found
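
The DQMHistoTests tallies in the two summaries above can be sanity-checked with a few lines of arithmetic. The sketch below assumes that failures, nulls, successes and skipped histograms partition the "Total histograms compared" count; that reading matches the reported numbers but is not stated by the bot, and the `HistoTally` class is purely illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HistoTally:
    """Counts copied from one DQMHistoTests summary (hypothetical container)."""
    compared: int
    failures: int
    nulls: int
    successes: int
    skipped: int

    def is_consistent(self) -> bool:
        # Assumption: the four categories add up to the total compared.
        parts = self.failures + self.nulls + self.successes + self.skipped
        return parts == self.compared

    def failure_rate(self) -> float:
        return self.failures / self.compared


gpu = HistoTally(compared=19811, failures=1567, nulls=0, successes=18244, skipped=0)
standard = HistoTally(compared=3695153, failures=56, nulls=0, successes=3695075, skipped=22)

for name, tally in (("GPU", gpu), ("standard", standard)):
    print(f"{name}: consistent={tally.is_consistent()}, "
          f"failure rate={tally.failure_rate():.4%}")
```

Both tallies add up under this assumption. The GPU comparison fails for roughly 7.9% of its histograms, against about 0.0015% in the standard comparison.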

@smuzaffar
Contributor

please test for slc7_aarch64_gcc11

@smuzaffar
Contributor

please test for slc7_ppc64le_gcc11

@cmsbuild
Contributor

-1

Failed Tests: UnitTests
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-810d14/23070/summary.html
COMMIT: 1ff1772
CMSSW: CMSSW_12_4_X_2022-03-12-1100/slc7_ppc64le_gcc11
Additional Tests: GPU
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmsdist/7669/23070/install.sh to create a dev area with all the needed externals and cmssw changes.

Unit Tests

I found errors in the following unit tests:

---> test test_edmPickEvents had ERRORS
---> test DRNTest had ERRORS
---> test testFWCoreUtilities had ERRORS
---> test materialBudgetTrackerPlots had ERRORS
and more ...

@cmsbuild
Contributor

-1

Failed Tests: UnitTests RelVals
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-810d14/23073/summary.html
COMMIT: 1ff1772
CMSSW: CMSSW_12_4_X_2022-03-12-1100/slc7_aarch64_gcc11
Additional Tests: GPU
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmsdist/7669/23073/install.sh to create a dev area with all the needed externals and cmssw changes.

Unit Tests

I found errors in the following unit tests:

---> test test_edmPickEvents had ERRORS
---> test DRNTest had ERRORS
---> test TestFWCoreServicesDriver had ERRORS
---> test testFWCoreUtilities had ERRORS
and more ...

RelVals

----- Begin Fatal Exception 13-Mar-2022 22:02:01 CET-----------------------
An exception of category 'Vertex' occurred while
   [0] Processing  Event run: 194533 lumi: 329 event: 462355458 stream: 0
   [1] Running path 'dqmofflineOnPAT_1_step'
   [2] Prefetching for module SingleTopTChannelLeptonDQM_miniAOD/'singleTopElectronMediumDQM_miniAOD'
   [3] Prefetching for module PATMuonSlimmer/'slimmedMuons'
   [4] Prefetching for module PATMuonSelector/'selectedPatMuons'
   [5] Prefetching for module PATMuonProducer/'patMuons'
   [6] Prefetching for module MuonProducer/'muons'
   [7] Prefetching for module PFProducer/'particleFlowTmp'
   [8] Prefetching for module PFBlockProducer/'particleFlowBlock'
   [9] Prefetching for module PFElecTkProducer/'pfTrackElec'
   [10] Prefetching for module PFConversionProducer/'pfConversions'
   [11] Calling method for module ConversionProducer/'allConversions'
Exception Message:
Refitted track not found in list
----- End Fatal Exception -------------------------------------------------
----- Begin Fatal Exception 13-Mar-2022 22:13:34 CET-----------------------
An exception of category 'Vertex' occurred while
   [0] Processing  Event run: 326479 lumi: 7 event: 1579493 stream: 0
   [1] Running path 'dqmoffline_8_step'
   [2] Prefetching for module SMPDQM/'SMPDQM'
   [3] Prefetching for module MuonProducer/'muons'
   [4] Prefetching for module PFProducer/'particleFlowTmp'
   [5] Prefetching for module PFBlockProducer/'particleFlowBlock'
   [6] Prefetching for module PFElecTkProducer/'pfTrackElec'
   [7] Prefetching for module PFConversionProducer/'pfConversions'
   [8] Calling method for module ConversionProducer/'allConversions'
Exception Message:
Refitted track not found in list
----- End Fatal Exception -------------------------------------------------
----- Begin Fatal Exception 13-Mar-2022 22:23:29 CET-----------------------
An exception of category 'Vertex' occurred while
   [0] Processing  Event run: 319450 lumi: 76 event: 106007323 stream: 0
   [1] Running path 'dqmoffline_10_step'
   [2] Prefetching for module SMPDQM/'SMPDQM'
   [3] Prefetching for module MuonProducer/'muons'
   [4] Prefetching for module PFProducer/'particleFlowTmp'
   [5] Prefetching for module PFBlockProducer/'particleFlowBlock'
   [6] Prefetching for module PFElecTkProducer/'pfTrackElec'
   [7] Prefetching for module PFConversionProducer/'pfConversions'
   [8] Calling method for module ConversionProducer/'allConversions'
Exception Message:
Refitted track not found in list
----- End Fatal Exception -------------------------------------------------
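
All three RelVal failures above share the same category, failing module and message. When a log contains many such blocks, grouping them makes that pattern easier to spot. The following is a rough sketch, assuming the block layout shown above; the regular expressions and the `group_fatal_exceptions` helper are illustrative and not part of cms-bot, and the sample is abbreviated from the log.

```python
import re
from collections import Counter

# Patterns inferred from the log excerpt above (assumed layout).
BLOCK = re.compile(
    r"-+ Begin Fatal Exception[^\n]*\n(.*?)\n-+ End Fatal Exception",
    re.DOTALL,
)
CATEGORY = re.compile(r"An exception of category '([^']+)'")
CALLING = re.compile(r"Calling method for module (\S+?)/'([^']+)'")
MESSAGE = re.compile(r"Exception Message:\n(.*)", re.DOTALL)


def group_fatal_exceptions(log_text: str) -> Counter:
    """Count fatal exceptions by (category, module label, message)."""
    counts = Counter()
    for body in BLOCK.findall(log_text):
        category = CATEGORY.search(body)
        calling = CALLING.search(body)
        message = MESSAGE.search(body)
        key = (
            category.group(1) if category else "?",
            calling.group(2) if calling else "?",
            message.group(1).strip() if message else "?",
        )
        counts[key] += 1
    return counts


# Two of the blocks above, with the prefetching lines left out for brevity.
sample = """\
----- Begin Fatal Exception 13-Mar-2022 22:02:01 CET-----------------------
An exception of category 'Vertex' occurred while
   [0] Processing  Event run: 194533 lumi: 329 event: 462355458 stream: 0
   [11] Calling method for module ConversionProducer/'allConversions'
Exception Message:
Refitted track not found in list
----- End Fatal Exception -------------------------------------------------
----- Begin Fatal Exception 13-Mar-2022 22:13:34 CET-----------------------
An exception of category 'Vertex' occurred while
   [0] Processing  Event run: 326479 lumi: 7 event: 1579493 stream: 0
   [8] Calling method for module ConversionProducer/'allConversions'
Exception Message:
Refitted track not found in list
----- End Fatal Exception -------------------------------------------------
"""

for key, count in group_fatal_exceptions(sample).items():
    print(count, key)
```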

@smuzaffar
Contributor

+externals

> @smuzaffar before merging this, I'd like to compare the performance in CMSSW itself using the PR artifacts.

@fwyzard, if you want more tests, we can include it in DEVEL IBs first; otherwise this looks good to go in.

@cmsbuild
Contributor

This pull request is fully signed and it will be integrated in one of the next IB/CMSSW_12_4_X/master IBs (but tests are reportedly failing). This pull request will now be reviewed by the release team before it's merged. @perrotta, @dpiparo, @qliphy (and backports should be raised in the release meeting by the corresponding L2)

@fwyzard
Contributor Author

fwyzard commented Mar 14, 2022 via email

@smuzaffar
Contributor

The bot area for amd64 ( #7669 (comment) ) should remain available until Saturday, 19 March. If you still need a bot area after that, feel free to start another PR test.

@fwyzard
Contributor Author

fwyzard commented Mar 15, 2022

OK, a comparison of the performance of the pixel reconstruction in the full CMSSW looks good too:

CMSSW_12_4_X_2022-03-11-1100

Running 4 times over 4100 events with 4 jobs, each with 16 threads, 16 streams and 1 GPUs
  1271.2 ±   0.5 ev/s (4000 events, 99.4% overlap)
  1269.0 ±   0.5 ev/s (4000 events, 99.4% overlap)
  1265.1 ±   0.4 ev/s (4000 events, 99.3% overlap)
  1259.8 ±   0.6 ev/s (4000 events, 99.3% overlap)
 --------------------
  1266.3 ±   5.0 ev/s

CMSSW_12_4_X_2022-03-11-1100 + #7669

Running 4 times over 4100 events with 4 jobs, each with 16 threads, 16 streams and 1 GPUs
  1275.0 ±   0.4 ev/s (4000 events, 99.2% overlap)
  1273.5 ±   0.5 ev/s (4000 events, 99.1% overlap)
  1267.0 ±   0.8 ev/s (4000 events, 97.4% overlap)
  1265.2 ±   0.5 ev/s (4000 events, 99.0% overlap)
 --------------------
  1270.2 ±   4.8 ev/s
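
The summary line under each set of four runs is consistent with the mean and the sample standard deviation (n - 1 in the denominator) of the per-run throughputs. The benchmark tool's own method is not shown in this thread, so the sketch below is only a check under that assumption, with the values copied from the pixel measurements above.

```python
from statistics import mean, stdev

# Per-run throughputs in ev/s, copied from the pixel reconstruction results.
baseline = [1271.2, 1269.0, 1265.1, 1259.8]
with_pr = [1275.0, 1273.5, 1267.0, 1265.2]


def mean_and_spread(samples):
    """Return (mean, sample standard deviation) of the per-run throughputs."""
    return mean(samples), stdev(samples)


for label, samples in (("baseline", baseline), ("with #7669", with_pr)):
    m, s = mean_and_spread(samples)
    print(f"{label}: {m:.1f} ± {s:.1f} ev/s")
```

The per-run uncertainties (about 0.5 ev/s) are much smaller than the run-to-run spread (about 5 ev/s), so it is the spread that limits the comparison.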

@fwyzard
Contributor Author

fwyzard commented Mar 15, 2022

Same for the HCAL reconstruction:

CMSSW_12_4_X_2022-03-11-1100

Running 4 times over 4100 events with 4 jobs, each with 16 threads, 16 streams and 1 GPUs
  1454.3 ±   0.2 ev/s (4000 events, 99.9% overlap)
  1452.7 ±   0.2 ev/s (4000 events, 99.8% overlap)
  1457.8 ±   0.2 ev/s (4000 events, 99.9% overlap)
  1452.5 ±   0.2 ev/s (4000 events, 99.6% overlap)
 --------------------
  1454.3 ±   2.5 ev/s

CMSSW_12_4_X_2022-03-11-1100 + #7669

Running 4 times over 4100 events with 4 jobs, each with 16 threads, 16 streams and 1 GPUs
  1486.9 ±   0.2 ev/s (4000 events, 99.8% overlap)
  1484.9 ±   0.2 ev/s (4000 events, 99.8% overlap)
  1484.5 ±   0.2 ev/s (4000 events, 99.9% overlap)
  1480.1 ±   0.2 ev/s (4000 events, 99.9% overlap)
 --------------------
  1484.1 ±   2.9 ev/s
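
To express the HCAL result as a relative change, one option is to take the ratio of the two means and propagate the run-to-run spreads to first order. This is a sketch under two assumptions that do not come from the thread: the two sets of runs are treated as independent, and the sample standard deviation is used as the uncertainty on each mean, as in the summary lines.

```python
from math import sqrt
from statistics import mean, stdev

# Per-run throughputs in ev/s, copied from the HCAL reconstruction results.
baseline = [1454.3, 1452.7, 1457.8, 1452.5]
with_pr = [1486.9, 1484.9, 1484.5, 1480.1]


def relative_change(ref, new):
    """Return (change, uncertainty), both as fractions of the reference mean."""
    m_ref, m_new = mean(ref), mean(new)
    s_ref, s_new = stdev(ref), stdev(new)
    ratio = m_new / m_ref
    # First-order error propagation for a ratio of independent quantities.
    sigma = ratio * sqrt((s_ref / m_ref) ** 2 + (s_new / m_new) ** 2)
    return ratio - 1.0, sigma


change, sigma = relative_change(baseline, with_pr)
print(f"HCAL throughput change: {change:+.2%} ± {sigma:.2%}")
```

With these inputs the HCAL throughput gain is about 2%, several times larger than its propagated spread. The same calculation on the pixel numbers gives a difference smaller than the spread.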

@fwyzard
Contributor Author

fwyzard commented Mar 15, 2022

@smuzaffar @perrotta @qliphy for me this is good to go in.

@smuzaffar
Contributor

+externals

@smuzaffar smuzaffar merged commit 2229ebc into cms-sw:IB/CMSSW_12_4_X/master on Mar 15, 2022
@fwyzard fwyzard deleted the IB/CMSSW_12_3_X/master_cuda_11.5.2 branch on April 1, 2022 11:58