[rocm7.0_internal_testing] Prevent static initialization of at::cuda::warp_size() #2293
Conversation
Jenkins build for 9991022d48d5480423fc3dc1d3b0fb93cdaa638a commit finished as FAILURE

Jenkins build for 9991022d48d5480423fc3dc1d3b0fb93cdaa638a commit is in progress
jeffdaily left a comment
I like it. Please submit upstream PR as well.
Actually, can you upstream this in combination with 80cca70?
Merged commit 944be5a into rocm7.0_internal_testing
Fixes SWDEV-540240, SWDEV-540309, SWDEV-539989

Error
```
...
```

Explanation
80cca70 created a static global variable that used `at::cuda::warp_size()` to initialize its value, which requires GPUs to be visible in order to query device properties. However, GPUs are not present on CPU-only build systems.

Solution
Convert the static variable into a static function, thus preventing static initialization.

Validation
http://rocm-ci.amd.com/job/pyt_whl_docker_mainline/1461/artifact/build_artifacts.txt/*view*/

Ran microbenchmark to confirm basic functionality:

```
root@ubb4-rack-22:/var/lib/jenkins/pytorch-micro-benchmarking# python3 micro_benchmarking_pytorch.py --network resnet50
INFO: running forward and backward for warmup.
INFO: running the benchmark..
OK: finished running benchmark..
--------------------SUMMARY--------------------------
Microbenchmark for network : resnet50
Num devices: 1
Dtype: FP32
Mini batch size [img] : 64
Time per mini-batch : 0.10158218145370483
Throughput [img/sec] : 630.0317544289736
```
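For readers unfamiliar with the pattern described in the Solution above, here is a minimal, self-contained sketch. The names (`demo`, `query_warp_size`, `warp_size`) are illustrative only and are not taken from the actual PyTorch diff; one common way to defer the device query is a function-local static, though the real change may simply call `at::cuda::warp_size()` on demand.

```cpp
// Before (problematic): a namespace-scope static whose initializer runs at
// library load time. Because at::cuda::warp_size() queries device properties,
// this fails on machines with no visible GPU (e.g. CPU-only build/CI hosts).
//
//   static const int kWarpSize = at::cuda::warp_size();   // illustrative name
//
// After (sketch): wrap the value in a function so the device query is deferred
// until first use, which only happens on code paths that actually touch a GPU.

#include <iostream>

namespace demo {

int query_warp_size() {
  // Placeholder for a real device-property query (e.g. via hipGetDeviceProperties).
  std::cout << "querying device properties...\n";
  return 64;  // typically 64 on AMD wavefronts, 32 on NVIDIA warps
}

// Function-local static: initialized lazily on first call, cached thereafter.
int warp_size() {
  static const int cached = query_warp_size();
  return cached;
}

}  // namespace demo

int main() {
  // No device query happens at program startup; it runs here, on first use.
  std::cout << "warp size = " << demo::warp_size() << "\n";
  std::cout << "warp size = " << demo::warp_size() << "\n";  // cached, no re-query
  return 0;
}
```

The key difference is when the initializer runs: a namespace-scope static is initialized during library load, before `main()`, while the function-local static above is initialized only the first time `warp_size()` is called, so a CPU-only process that never reaches GPU code never performs the query.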