Update to CUDA 11.1 #6267


fwyzard
Contributor

@fwyzard fwyzard commented Sep 24, 2020

Update to CUDA 11.1:

  • CUDA version 11.1.74
  • NVIDIA drivers version 455.23.05

From the release notes:

  • add support for GCC 10 and clang 10
  • support multi-threaded launch to different CUDA streams
  • improve MPS error handling when using multiple GPUs
  • various bug fixes

See https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html .

@fwyzard
Contributor Author

fwyzard commented Sep 24, 2020

@cmsbuild, please test

@cmsbuild
Contributor

cmsbuild commented Sep 24, 2020

The tests are being triggered in jenkins.

@cmsbuild
Contributor

A new Pull Request was created by @fwyzard (Andrea Bocci) for branch IB/CMSSW_11_2_X/master.

@cmsbuild, @smuzaffar, @mrodozov can you please review it and eventually sign? Thanks.
cms-bot commands are listed here

@cmsbuild
Contributor

-1

Tested at: d32501c

  • Build:

I found a compilation error when building:

+ ln -s ../compute-sanitizer/compute-sanitizer /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/tmp/BUILDROOT/ab0db370d574467e557d482f013ac8dc/opt/cmssw/slc7_amd64_gcc820/external/cuda/11.1.0-f65abf/bin/compute-sanitizer
+ mv /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/BUILD/slc7_amd64_gcc820/external/cuda/11.1.0-f65abf/build/nvvm /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/tmp/BUILDROOT/ab0db370d574467e557d482f013ac8dc/opt/cmssw/slc7_amd64_gcc820/external/cuda/11.1.0-f65abf/
+ mv /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/BUILD/slc7_amd64_gcc820/external/cuda/11.1.0-f65abf/build/EULA.txt /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/tmp/BUILDROOT/ab0db370d574467e557d482f013ac8dc/opt/cmssw/slc7_amd64_gcc820/external/cuda/11.1.0-f65abf/
+ mv /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/BUILD/slc7_amd64_gcc820/external/cuda/11.1.0-f65abf/build/version.txt /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/tmp/BUILDROOT/ab0db370d574467e557d482f013ac8dc/opt/cmssw/slc7_amd64_gcc820/external/cuda/11.1.0-f65abf/
mv: cannot stat '/data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/BUILD/slc7_amd64_gcc820/external/cuda/11.1.0-f65abf/build/version.txt': No such file or directory
error: Bad exit status from /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/tmp/rpm-tmp.KtHgB2 (%install)


RPM build errors:
Bad exit status from /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/tmp/rpm-tmp.KtHgB2 (%install)



You can see the results of the tests here:
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-f65abf/9546/summary.html
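
The %install step above aborts because `mv` is asked to move a `version.txt` that is not present in the unpacked toolkit. As a minimal sketch only (not the change that was actually pushed to this PR), an install step could move just the files that exist. The helper name `move_if_present` is hypothetical:

```shell
# Hypothetical helper: move each listed file into DEST, skipping missing
# ones instead of failing the whole install step.
move_if_present() {
  dest=$1
  shift
  for f in "$@"; do
    if [ -e "$f" ]; then
      mv "$f" "$dest"/
    else
      # Report the skip on stderr so it stays visible in the build log.
      echo "skipping missing file: $f" >&2
    fi
  done
}
```

It would be called as `move_if_present "$DEST" build/EULA.txt build/version.txt`. Skipping silently can hide a genuinely broken package, which is why the sketch logs every skipped file.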

@fwyzard fwyzard force-pushed the IB/CMSSW_11_2_X/master_cuda_11.1 branch from d32501c to 7ccaf24 Compare September 24, 2020 16:16
@cmsbuild
Contributor

Pull request #6267 was updated.

@fwyzard
Contributor Author

fwyzard commented Sep 24, 2020 via email

@cmsbuild
Contributor

cmsbuild commented Sep 24, 2020

The tests are being triggered in jenkins.

@cmsbuild
Contributor

+1
Tested at: 7ccaf24
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-f65abf/9553/summary.html
CMSSW: CMSSW_11_2_X_2020-09-24-1100
SCRAM_ARCH: slc7_amd64_gcc820

@cmsbuild
Contributor

Comparison job queued.

@cmsbuild
Contributor

Comparison is ready
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-f65abf/9553/summary.html

Comparison Summary:

  • No significant changes to the logs found
  • Reco comparison results: 4 differences found in the comparisons
  • DQMHistoTests: Total files compared: 35
  • DQMHistoTests: Total histograms compared: 2539438
  • DQMHistoTests: Total failures: 7
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 2539409
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 34 files compared)
  • Checked 149 log files, 22 edm output root files, 35 DQM output files

@cmsbuild
Contributor

cmsbuild commented Sep 28, 2020

The tests are being triggered in jenkins.

@cmsbuild
Contributor

-1

Tested at: 5f70e0d

CMSSW: CMSSW_11_2_X_2020-09-27-0000
SCRAM_ARCH: slc7_ppc64le_gcc820
You can see the results of the tests here:
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-f65abf/9611/summary.html

I found the following errors while testing this PR:

Failed tests: UnitTests

  • Unit Tests:

I found errors in the following unit tests:

---> test TestCUDATest had ERRORS

@cmsbuild
Contributor

Comparison job queued.

@cmsbuild
Contributor

+1
Tested at: 5f70e0d
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-f65abf/9612/summary.html
CMSSW: CMSSW_11_2_X_2020-09-28-1100
SCRAM_ARCH: slc7_amd64_gcc820

@cmsbuild
Contributor

Comparison job queued.

@cmsbuild
Contributor

Comparison is ready
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-f65abf/9612/summary.html

Comparison Summary:

  • No significant changes to the logs found
  • Reco comparison results: 4 differences found in the comparisons
  • DQMHistoTests: Total files compared: 35
  • DQMHistoTests: Total histograms compared: 2539438
  • DQMHistoTests: Total failures: 7
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 2539409
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 34 files compared)
  • Checked 149 log files, 22 edm output root files, 35 DQM output files

@cmsbuild
Contributor

-1

Tested at: 5f70e0d

CMSSW: CMSSW_11_2_X_2020-09-27-0000
SCRAM_ARCH: slc7_ppc64le_gcc820
You can see the results of the tests here:
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-f65abf/9623/summary.html

I found the following errors while testing this PR:

Failed tests: UnitTests

  • Unit Tests:

I found errors in the following unit tests:

---> test TestCUDATest had ERRORS

@cmsbuild
Contributor

Comparison job queued.

@smuzaffar
Contributor

@fwyzard, something is wrong with the ppc64le CUDA distribution. We should no longer be using CUDA from the system (the PR tests properly set the bundled drivers path, see [a]), but the unit tests still fail/crash the same way.

[a]

+ CMS_NVIDIA_VERSION=455.23.05
+ '[' 440.33.01 ']'
++ parse_version 455.23.05
++ '[' 455.23.05 ']'
++ echo 455.23.05
++ IFS=.
++ read MAJOR MINOR PATCH
++ echo 455023005
++ IFS=.
++ read MAJOR MINOR PATCH
++ parse_version 440.33.01
++ '[' 440.33.01 ']'
++ echo 440.33.01
++ IFS=.
++ read MAJOR MINOR PATCH
++ echo 440033001
++ IFS=.
++ read MAJOR MINOR PATCH
+ ((  455023005 <= 440033001  ))
+ echo RUNTIME:path:append:LD_LIBRARY_PATH=/scratch/cmsbuild/jenkins_a/workspace/ib-run-pr-tests/testBuildDir/slc7_ppc64le_gcc820/external/cuda/11.1.0-f65abf/drivers
RUNTIME:path:append:LD_LIBRARY_PATH=/scratch/cmsbuild/jenkins_a/workspace/ib-run-pr-tests/testBuildDir/slc7_ppc64le_gcc820/external/cuda/11.1.0-f65abf/drivers
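
The trace in [a] packs each dotted driver version into a single integer (455.23.05 becomes 455023005) and compares the bundled version against the system one. A minimal sketch of that logic follows. It is reconstructed from the trace alone, so the real script may differ, and `needs_bundled_drivers` is a hypothetical name:

```shell
# Turn "MAJOR.MINOR.PATCH" into one integer, e.g. 455.23.05 -> 455023005.
# awk converts the fields as decimal numbers, so "05" is read as 5.
parse_version() {
  echo "$1" | awk -F. '{ printf "%d\n", ($1 * 1000 + $2) * 1000 + $3 }'
}

# Succeeds when the bundled driver libraries are newer than the system
# driver, i.e. when the "bundled <= system" check in the trace is false.
needs_bundled_drivers() {
  bundled=$(parse_version "$1")
  system=$(parse_version "$2")
  [ "$bundled" -gt "$system" ]
}

if needs_bundled_drivers 455.23.05 440.33.01; then
  echo "append the bundled drivers directory to LD_LIBRARY_PATH"
fi
```

With the versions from the trace the check succeeds, which matches the `RUNTIME:path:append:LD_LIBRARY_PATH=...` line that was emitted.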

@fwyzard
Contributor Author

fwyzard commented Sep 29, 2020

Is there a way to build the external (so I can test it exactly as it will be) without including it in the IBs (so they don't get broken)?

@smuzaffar
Contributor

On a Power machine you can run:

cmssw-cc7
git clone git@github.com:cms-sw/cms-bot
./cms-bot/test-prs.sh -a slc7_ppc64le_gcc820 -r CMSSW_11_2_X  cms-sw/cmsdist#6267

This should build the externals, create a CMSSW dev area, and set up the new tools.

Anyway, I have already done this, and the externals are available under ibmminsky-1:/scratch/d/externals. If you want to test the new CUDA, set up /scratch/d/externals/slc7_ppc64le_gcc820/external/cuda-toolfile/2.1-cms/etc/scram.d/*.xml in your CMSSW dev area.
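
Setting up the toolfiles by hand could be sketched as a small loop. It assumes that `scram setup` accepts the path to a toolfile and that it is run from inside the dev area. `setup_toolfiles` and the `SCRAM` override are hypothetical conveniences for illustration:

```shell
# Register every toolfile found in a directory with the current dev area.
# SCRAM can be overridden (e.g. SCRAM=echo) to print the commands instead
# of running them.
setup_toolfiles() {
  dir=$1
  for xml in "$dir"/*.xml; do
    # If nothing matches, the glob stays literal; skip it.
    [ -e "$xml" ] || continue
    ${SCRAM:-scram} setup "$xml" || return 1
  done
}
```

For example: `setup_toolfiles /scratch/d/externals/slc7_ppc64le_gcc820/external/cuda-toolfile/2.1-cms/etc/scram.d`, followed by a rebuild of the packages that depend on CUDA.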

@fwyzard
Contributor Author

fwyzard commented Sep 29, 2020

Thanks, will try to have a look later today.

@fwyzard
Contributor Author

fwyzard commented Sep 30, 2020

@smuzaffar I suspect the problem is actually with the Minsky 1 machine:

[2020-09-30 12:16:50] fwyzard@ibmminsky-1:~$ nvidia-smi
Wed Sep 30 12:16:52 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P100-SXM2...  Off  | 00000002:01:00.0 Off |                    0 |
| N/A   34C    P0    33W / 300W |      0MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla P100-SXM2...  Off  | 00000003:01:00.0 Off |                    0 |
| N/A   35C    P0    30W / 300W |      0MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla P100-SXM2...  Off  | 00000006:01:00.0 Off |                    0 |
| N/A   35C    P0   ERR! / 300W |     10MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla P100-SXM2...  Off  | 00000007:01:00.0 Off |                    0 |
| N/A   33C    P0    32W / 300W |     10MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

The third GPU (number 2) is in ERR! state, and when I try to run simple CUDA jobs they hang.

I'll try again on Minsky 2...
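
A check like the one above can be scripted so that a CI node is flagged before tests are scheduled on it. The sketch below scans the default `nvidia-smi` table for `ERR!` and reports the GPU index from the preceding row. It relies on the table layout shown above, and `report_gpu_errors` is a hypothetical name:

```shell
# Read `nvidia-smi` table output on stdin and report every GPU whose
# status row contains ERR!.
report_gpu_errors() {
  awk '
    # First row of a GPU entry, e.g. "|   2  Tesla ...": remember the index.
    /^[|] +[0-9]+ / { gpu = $2 }
    # Second row of the entry carries temperature and power; ERR! shows up here.
    /ERR!/          { print "GPU " gpu ": ERR! reported" }
  '
}
```

Usage: `nvidia-smi | report_gpu_errors`. For the output above this would print one line for GPU 2.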

@fwyzard
Contributor Author

fwyzard commented Sep 30, 2020

On Minsky 2 I can build and run applications using CUDA 11.1:

[2020-09-30 12:32:06] fwyzard@ibmminsky-2:/scratch/fwyzard/CMSSW_11_2_X_2020-09-29-2300$ test/slc7_ppc64le_gcc820/testCUDAService 

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
testCUDAService is a Catch v2.2.2 host application.
Run with -? for options

-------------------------------------------------------------------------------
Tests of CUDAService
  CUDAService enabled
  CUDA Queries
-------------------------------------------------------------------------------
/scratch/fwyzard/CMSSW_11_2_X_2020-09-29-2300/src/HeterogeneousCore/CUDAServices/test/testCUDAService.cpp:56
...............................................................................

/scratch/fwyzard/CMSSW_11_2_X_2020-09-29-2300/src/HeterogeneousCore/CUDAServices/test/testCUDAService.cpp:70: 
warning:
  CUDA Driver Version / Runtime Version: 11.1 / 11.1

/scratch/fwyzard/CMSSW_11_2_X_2020-09-29-2300/src/HeterogeneousCore/CUDAServices/test/testCUDAService.cpp:75: 
warning:
  Detected 4 CUDA Capable device(s)

-------------------------------------------------------------------------------
Tests of CUDAService
  CUDAService enabled
  CUDAService device free memory
-------------------------------------------------------------------------------
/scratch/fwyzard/CMSSW_11_2_X_2020-09-29-2300/src/HeterogeneousCore/CUDAServices/test/testCUDAService.cpp:94
...............................................................................

/scratch/fwyzard/CMSSW_11_2_X_2020-09-29-2300/src/HeterogeneousCore/CUDAServices/test/testCUDAService.cpp:101: 
warning:
  Device 0 memory total 17071734784 free 16795435008

/scratch/fwyzard/CMSSW_11_2_X_2020-09-29-2300/src/HeterogeneousCore/CUDAServices/test/testCUDAService.cpp:101: 
warning:
  Device 1 memory total 17071734784 free 16795369472

/scratch/fwyzard/CMSSW_11_2_X_2020-09-29-2300/src/HeterogeneousCore/CUDAServices/test/testCUDAService.cpp:101: 
warning:
  Device 2 memory total 17071734784 free 16795435008

/scratch/fwyzard/CMSSW_11_2_X_2020-09-29-2300/src/HeterogeneousCore/CUDAServices/test/testCUDAService.cpp:101: 
warning:
  Device 3 memory total 17071734784 free 16795435008

CUDA device 0: 16017 MB free / 16280 MB total memory
CUDA device 1: 16017 MB free / 16280 MB total memory
CUDA device 2: 16017 MB free / 16280 MB total memory
CUDA device 3: 16017 MB free / 16280 MB total memory
/scratch/fwyzard/CMSSW_11_2_X_2020-09-29-2300/src/HeterogeneousCore/CUDAServices/test/testCUDAService.cpp:108: 
warning:
  Device with most free memory 0
       as given by CUDAService 0

===============================================================================
All tests passed (12 assertions in 1 test case)
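
The byte counts and the MB figures in this output are consistent with dividing by 2^20 and truncating (16795435008 / 1048576 = 16017.375). Device 0 is reported as having the most free memory even though devices 2 and 3 tie with it. The sketch below reproduces that arithmetic, assuming ties go to the first device seen. It is an illustration, not the CUDAService code, and `most_free_device` is a hypothetical name:

```shell
# Read "DEVICE FREE_BYTES" pairs on stdin and print the first device with
# the largest free memory, converted to MB as bytes / 2^20, truncated.
most_free_device() {
  awk '
    # Strict ">" keeps the earliest device when several have the same value.
    NR == 1 || $2 + 0 > best { best = $2 + 0; dev = $1 }
    END { if (NR > 0) printf "device %d: %d MB free\n", dev, int(best / 1048576) }
  '
}
```

Usage: `printf '%s\n' '0 16795435008' '1 16795369472' '2 16795435008' '3 16795435008' | most_free_device`.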

@fwyzard
Contributor Author

fwyzard commented Sep 30, 2020

@smuzaffar @silviodonato I'd suggest merging this PR so we can have CUDA 11.1 in pre7.

@smuzaffar smuzaffar merged commit 9ee070d into cms-sw:IB/CMSSW_11_2_X/master Sep 30, 2020
@smuzaffar
Contributor

OK, merged now. I have disabled Minsky 1 in Jenkins and will ask the Openlab team to look into this issue.

@fwyzard fwyzard deleted the IB/CMSSW_11_2_X/master_cuda_11.1 branch May 10, 2021 21:18