compiler: hip: change default to use old driver for rocm 7.13#4331

Merged
skganesan008 merged 3 commits into main from amd-compiler-ww11-hip2BeOld
Apr 4, 2026

@ronlieb ronlieb commented Apr 3, 2026

Restores more efficient processing of rocgdb call stack test.

Flags to choose between the drivers (with this PR, the default goes back to --no-offload-new-driver):
--offload-new-driver
--no-offload-new-driver
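
For build scripts that should not depend on whichever default is in effect, the driver can be selected explicitly. The following is a minimal sketch, not part of this PR: the helper name `hip_compile_command`, the `clang++ -x hip` invocation, the file names, and the `gfx942` target are illustrative assumptions. Only the two driver flags come from the description above.

```python
from typing import List, Optional


def hip_compile_command(source: str, output: str, arch: str,
                        new_driver: Optional[bool] = None) -> List[str]:
    """Build a clang++ HIP compile command line (hypothetical helper).

    new_driver=None  -> pass neither flag, so the compiler default applies
    new_driver=True  -> pass --offload-new-driver
    new_driver=False -> pass --no-offload-new-driver
    """
    cmd = ["clang++", "-x", "hip", f"--offload-arch={arch}"]
    if new_driver is True:
        cmd.append("--offload-new-driver")
    elif new_driver is False:
        cmd.append("--no-offload-new-driver")
    cmd += [source, "-o", output]
    return cmd


if __name__ == "__main__":
    # Pin the old driver explicitly (assumed file names and target).
    print(" ".join(hip_compile_command("saxpy.hip", "saxpy", "gfx942",
                                       new_driver=False)))
```

Passing the flag explicitly keeps a build reproducible if the default changes again in a later release.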

For ROCm 7.14, we will attempt to re-enable the new driver.

@searlmc1 searlmc1 left a comment

LGTM

@skganesan008

A rerun of the rccl test failure shows the following:
Expected NULL handle for buffer size 4194304 bytes
[ INFO ] Test 'VariableSizeBuffers_Disabled' (PID: 5272) FAILED with exit code 1 after 2854 ms
[ INFO ] Running isolated test 'VariableSizeBuffers_Enabled' (PID: 5284) with env: NCCL_LOCAL_REGISTER=1
[ INFO ] Test 'VariableSizeBuffers_Enabled' PASSED (2857 ms)
[ INFO ] Running isolated test 'DeregisterNullHandle' (PID: 5296)
[16:40:13Z] Mem: 97.1/3023.4GB (3%) | CPU: 0% | Jobs: ~1/384 | Disk: 1341GB free
[ INFO ] Test 'DeregisterNullHandle' PASSED (2951 ms)
[ INFO ] Process-Isolated Tests: 4 passed, 3 failed, 0 skipped (20186 ms total)
[ INFO ] Failed: CommRegisterDeregister_Disabled - Test failed with exit code 1
[ INFO ] Failed: MultipleBufferRegistration_Disabled - Test failed with exit code 1
[ INFO ] Failed: VariableSizeBuffers_Disabled - Test failed with exit code 1
/_w/TheRock/TheRock/rocm-systems/projects/rccl/test/RegisterTests.cpp:244: Failure
Value of: passed

Actual: false
Expected: true
One or more isolated tests failed
[ FAILED ] Register.ProcessIsolatedRegisterTests (20189 ms)
[----------] 1 test from Register (20189 ms total)
[----------] 6 tests from Scatter

The same error is also seen on another open PR, #4344.
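
To compare failure sets between runs or PRs, the failed test names can be pulled out of a log like the one above. This is a minimal sketch that assumes the `[ INFO ] Failed: <name> - <reason>` line format seen in this log; the function name `failed_tests` is hypothetical and not part of the rccl test harness.

```python
import re
from typing import List

# Matches summary lines such as:
# [ INFO ] Failed: VariableSizeBuffers_Disabled - Test failed with exit code 1
FAILED_RE = re.compile(r"^\[ INFO \] Failed: (\S+) - (.*)$")


def failed_tests(log_text: str) -> List[str]:
    """Return the names of failed isolated tests, in log order."""
    names = []
    for line in log_text.splitlines():
        match = FAILED_RE.match(line.strip())
        if match:
            names.append(match.group(1))
    return names


if __name__ == "__main__":
    sample = """\
[ INFO ] Process-Isolated Tests: 4 passed, 3 failed, 0 skipped (20186 ms total)
[ INFO ] Failed: CommRegisterDeregister_Disabled - Test failed with exit code 1
[ INFO ] Failed: MultipleBufferRegistration_Disabled - Test failed with exit code 1
[ INFO ] Failed: VariableSizeBuffers_Disabled - Test failed with exit code 1
"""
    print(failed_tests(sample))
```

If two runs produce the same set of names, that supports treating the failure as common to both rather than specific to one change.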

All of the pytorch build failures seem to be related to the following error:
2026-04-04T05:33:16.6486344Z /opt/rh/gcc-toolset-13/root/usr/libexec/gcc/x86_64-redhat-linux/13/ld: /__w/TheRock/TheRock/external-builds/pytorch/pytorch/build/lib/libtorch_hip.so: undefined reference to `rsmi_init'
2026-04-04T05:33:16.6486962Z collect2: error: ld returned 1 exit status

This error is also seen on the already merged PR #4337.

@skganesan008

We are landing the PR, as the errors seen are not related to this PR based on the findings so far.

@skganesan008 skganesan008 merged commit 15e5441 into main Apr 4, 2026
718 of 744 checks passed
@skganesan008 skganesan008 deleted the amd-compiler-ww11-hip2BeOld branch April 4, 2026 18:01
@github-project-automation github-project-automation Bot moved this from TODO to Done in TheRock Triage Apr 4, 2026