Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add ROCm utilities and tests, and CUDA fallbacks #40619

Merged
merged 7 commits into from Feb 1, 2023

Conversation

fwyzard
Copy link
Contributor

@fwyzard fwyzard commented Jan 26, 2023

PR description:

Add basic ROCm functions:

  • hipCheck() macro to check if HIP API functions are successful, or raise an exception;
  • cms::rocmtest::requireDevices() function to check if the system has any ROCm GPUs, or exit successfully.

Add tests for these functions and for the corresponding CUDA functions.

Add basic ROCm utilities:

  • rocmIsEnabled exits with success if a ROCm device is available ad configured correctly
  • rocmComputeCapabilities lists all ROCm devices and their GCN architecture:
    $ rocmComputeCapabilities 
       0      gfx900    AMD Radeon Pro WX 9100

PR validation:

The new unit tests pass on machines with and without GPUs.

@fwyzard
Copy link
Contributor Author

fwyzard commented Jan 26, 2023

please test with cms-sw/cmsdist#8266

@cmsbuild
Copy link
Contributor

-code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-40619/33893

ERROR: Build errors found during clang-tidy run.

#error("Must define exactly one of __HIP_PLATFORM_AMD__ or __HIP_PLATFORM_NVIDIA__");
 ^
/cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02769/el8_amd64_gcc11/external/rocm/5.0.2-95c215630c939706b0552e3eee38861c/include/hip/hip_runtime_api.h:5501:2: error: ("Must define exactly one of __HIP_PLATFORM_AMD__ or __HIP_PLATFORM_NVIDIA__"); [clang-diagnostic-error]
#error("Must define exactly one of __HIP_PLATFORM_AMD__ or __HIP_PLATFORM_NVIDIA__");
 ^
/cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02769/el8_amd64_gcc11/external/rocm/5.0.2-95c215630c939706b0552e3eee38861c/include/hip/hip_runtime_api.h:5526:61: error: use of undeclared identifier 'hipHostMallocDefault' [clang-diagnostic-error]
                                       unsigned int flags = hipHostMallocDefault) {
                                                            ^
/cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02769/el8_amd64_gcc11/external/rocm/5.0.2-95c215630c939706b0552e3eee38861c/include/hip/hip_runtime_api.h:5532:61: error: use of undeclared identifier 'hipMemAttachGlobal' [clang-diagnostic-error]
                                       unsigned int flags = hipMemAttachGlobal) {
                                                            ^
/cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02769/el8_amd64_gcc11/external/rocm/5.0.2-95c215630c939706b0552e3eee38861c/include/hip/hip_vector_types.h:38:2: error: ("Must define exactly one of __HIP_PLATFORM_AMD__ or __HIP_PLATFORM_NVIDIA__"); [clang-diagnostic-error]
#error("Must define exactly one of __HIP_PLATFORM_AMD__ or __HIP_PLATFORM_NVIDIA__");
 ^
/cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02769/el8_amd64_gcc11/external/rocm/5.0.2-95c215630c939706b0552e3eee38861c/include/hip/library_types.h:48:2: error: ("Must define exactly one of __HIP_PLATFORM_AMD__ or __HIP_PLATFORM_NVIDIA__"); [clang-diagnostic-error]
#error("Must define exactly one of __HIP_PLATFORM_AMD__ or __HIP_PLATFORM_NVIDIA__");
 ^
HeterogeneousCore/ROCmUtilities/src/requireDevices.cc:12:19: error: use of undeclared identifier 'hipGetDeviceCount' [clang-diagnostic-error]
    auto status = hipGetDeviceCount(&devices);
                  ^
Suppressed 213 warnings (212 in non-user code, 1 with check filters).
--
gmake: *** [config/SCRAM/GMake/Makefile.coderules:129: code-checks] Error 2
gmake: *** [There are compilation/build errors. Please see the detail log above.] Error 2

@fwyzard
Copy link
Contributor Author

fwyzard commented Jan 26, 2023

@makortel these are the basic HIP/ROCm utilities I would add, to ease the implementation of the ROCm service and build rules.

However, we should settle on the "HIP" vs "ROCm" naming approach...

As far as I understand, "ROCm" indicates the backend and the kind of hardware, while "HIP" indicates the API.

Which makes sense: for example, one could program ROCm devices using other APIs, like OpenCL or SYCL, or use HIP to program other devices, like NVIDIA GPUs.

On the other hand, at the moment we do not plan to use HIP on other hardware than ROCm -- though HIP-over-Level 0 is an interesting idea to avoid SYCL...

So far I've used

  • HIP wherever the API is the most relevant aspect, e.g. hipCheck(...);
  • ROCm wherever the architecture is the most relevant aspect, e.g. the rocm SCRAM tool and the cms::rocmtest::requireDevices() function.

However, I guess we could also just use "HIP" everywhere (except maybe the tool).

What do you think ?

@fwyzard
Copy link
Contributor Author

fwyzard commented Jan 26, 2023

-code-checks

I guess the code checks do not use the flags from cms-sw/cmsdist#8266 ?

@smuzaffar
Copy link
Contributor

code-checks with cms.week1.PR_92dfa22c/53.0-3fa13e8f691137f6726f64f123800e88

lets run the code-checks using cms-sw/cmsdist#8266 tool-conf

@cmsbuild
Copy link
Contributor

-code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-40619/33895

Code check has found code style and quality issues which could be resolved by applying following patch(s)

@fwyzard
Copy link
Contributor Author

fwyzard commented Jan 26, 2023

code-checks with cms.week1.PR_92dfa22c/53.0-3fa13e8f691137f6726f64f123800e88

@cmsbuild
Copy link
Contributor

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-40619/33898

@cmsbuild
Copy link
Contributor

Pull request #40619 was updated. can you please check and sign again.

@cmsbuild
Copy link
Contributor

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-1f2297/30252/summary.html
COMMIT: bbcce06
CMSSW: CMSSW_13_0_X_2023-01-30-1100/el8_amd64_gcc11
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/40619/30252/install.sh to create a dev area with all the needed externals and cmssw changes.

The following merge commits were also included on top of IB + this PR after doing git cms-merge-topic:

You can see more details here:
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-1f2297/30252/git-recent-commits.json
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-1f2297/30252/git-merge-result

Comparison Summary

Summary:

  • You potentially added 3 lines to the logs
  • Reco comparison results: 12 differences found in the comparisons
  • DQMHistoTests: Total files compared: 49
  • DQMHistoTests: Total histograms compared: 3555495
  • DQMHistoTests: Total failures: 6
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3555467
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 48 files compared)
  • Checked 211 log files, 162 edm output root files, 49 DQM output files
  • TriggerResults: no differences found

@cmsbuild
Copy link
Contributor

This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @perrotta, @dpiparo, @rappoccio (and backports should be raised in the release meeting by the corresponding L2)

@fwyzard
Copy link
Contributor Author

fwyzard commented Jan 31, 2023

please test with cms-sw/cmsdist#8273

@cmsbuild
Copy link
Contributor

cmsbuild commented Feb 1, 2023

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-1f2297/30279/summary.html
COMMIT: bbcce06
CMSSW: CMSSW_13_0_X_2023-01-31-1100/el8_amd64_gcc11
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/40619/30279/install.sh to create a dev area with all the needed externals and cmssw changes.

The following merge commits were also included on top of IB + this PR after doing git cms-merge-topic:

You can see more details here:
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-1f2297/30279/git-recent-commits.json
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-1f2297/30279/git-merge-result

Comparison Summary

Summary:

  • You potentially added 6 lines to the logs
  • Reco comparison results: 10 differences found in the comparisons
  • DQMHistoTests: Total files compared: 49
  • DQMHistoTests: Total histograms compared: 3555495
  • DQMHistoTests: Total failures: 9
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3555464
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 48 files compared)
  • Checked 211 log files, 162 edm output root files, 49 DQM output files
  • TriggerResults: no differences found

@perrotta
Copy link
Contributor

perrotta commented Feb 1, 2023

+1

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants