Fix hardcoded TF_ROCM_AMDGPU_TARGETS #782

Merged
mmakevic-amd merged 2 commits into rocm-dev-infra from mmakevic/fix_mgpu_targets_list on Apr 7, 2026

@mmakevic-amd

Motivation

In CI, multiple mGPU tests are failing with:

Could not load RepeatBufferKernel: INTERNAL: Failed call to hipGetFuncBySymbol: hipError_t(98)

The problem is that we run them locally on gfx950, while our Bazel command hardcodes a target list that does not include it:

--repo_env=TF_ROCM_AMDGPU_TARGETS=gfx90a,gfx942
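
The mismatch can be checked mechanically. Below is a minimal sketch, assuming a hypothetical helper `arch_in_targets` (not part of this PR) that tests whether one architecture appears as a whole entry in a comma-separated target list:

```shell
# Hypothetical helper, not part of this PR: succeeds if the architecture in $1
# appears as a whole entry in the comma-separated list in $2.
arch_in_targets() {
    # Wrap both sides in commas so that only complete entries match.
    case ",$2," in
        *",$1,"*) return 0 ;;
        *) return 1 ;;
    esac
}

# The hardcoded list from the Bazel command does not cover a gfx950 runner.
if ! arch_in_targets "gfx950" "gfx90a,gfx942"; then
    echo "gfx950 missing from target list"
fi
```

A check like this only tells us the list is incomplete; it does not by itself say which list is the right one to pass.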

Technical Details

This change makes the testing script more flexible: it automatically detects the target present on the machine and builds only for that target.
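
As a rough illustration of what such detection could look like, here is a sketch that turns enumerator-style output (one architecture name per line) into a comma-separated list. The helper name `gfx_lines_to_targets` is hypothetical, and the use of `rocm_agent_enumerator` as the source of those lines is an assumption rather than something this PR confirms:

```shell
# Sketch only: reads one architecture name per line on stdin and prints a
# de-duplicated, comma-separated list. gfx000 is dropped on the assumption
# that the enumerator reports it for the CPU agent.
gfx_lines_to_targets() {
    grep -E '^gfx[0-9a-f]+$' | grep -v '^gfx000$' | sort -u | paste -sd, -
}

# Assumed usage on a ROCm machine (needs rocm_agent_enumerator on PATH):
#   AMDGPU_TARGETS="$(rocm_agent_enumerator | gfx_lines_to_targets)"

# Self-contained demonstration with canned input:
printf 'gfx000\ngfx950\ngfx950\n' | gfx_lines_to_targets
```

Note that this only reflects the GPU on the machine running the script, which may differ from the machine that actually executes the build or test.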

Test Plan

Needs to be tested in CI

Test Result

TBA

Submission Checklist

Collaborator

@leo-automation left a comment

Does RBE use TF_ROCM_AMDGPU_TARGETS? If so, I believe that when this script runs on a gfx950 runner, we'll send gfx950 to an RBE worker that runs a different arch.

@mmakevic-amd
Author

Ah, that makes sense. I don't think there is a good way to automatically detect the target then? Should I just add gfx950 to the hardcoded list?

@leo-automation
Collaborator

Yes, but I think doing it in the main if block might be better, like below:

for arg in "$@"; do
    if [[ "$arg" == "--config=ci_multi_gpu" ]]; then
        TAG_FILTERS=""
        TEST_TARGETS=("${TEST_TARGETS_MGPU[@]}")
        AMDGPU_TARGETS="${TF_ROCM_SGPU_AMDGPU_TARGETS:-gfx90a,gfx942}"
    fi
    if [[ "$arg" == "--config=ci_single_gpu" ]]; then
        TAG_FILTERS="${TAG_FILTERS},gpu,-multi_gpu,-no_oss"
        TEST_TARGETS=("${TEST_TARGETS_SGPU[@]}")
        AMDGPU_TARGETS="${TF_ROCM_SGPU_AMDGPU_TARGETS:-gfx950}"
    fi
done

mmakevic-amd force-pushed the mmakevic/fix_mgpu_targets_list branch from bd90398 to 61de8dd on April 7, 2026 13:33
mmakevic-amd changed the title from "Do not hardcode TF_ROCM_AMDGPU_TARGETS in test command" to "Fix hardcoded TF_ROCM_AMDGPU_TARGETS" on Apr 7, 2026
@mmakevic-amd
Author

Fixed, please take a look

if [[ "$arg" == "--config=ci_multi_gpu" ]]; then
    TAG_FILTERS=""
    TEST_TARGETS=("${TEST_TARGETS_MGPU[@]}")
    AMDGPU_TARGETS="${TF_ROCM_AMDGPU_TARGETS:-gfx90a,gfx942}"
Collaborator

Author

I originally did that, but as @leo-amd pointed out, it won't work when using RBE (see discussion above)

Collaborator

@i-chaochen left a comment

So IIUC, the latest change is to specify gfx942 for sGPU and gfx950 for mGPU?

@mmakevic-amd
Author

Yes, sGPU uses gfx90a and gfx942, and mGPU uses gfx950.
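
For reference, the mapping summarized here can be sketched as a small function. This is not the merged code: `select_amdgpu_targets` is a hypothetical name, `TF_ROCM_MGPU_AMDGPU_TARGETS` is a hypothetical override variable, and `TF_ROCM_SGPU_AMDGPU_TARGETS` is borrowed from the review suggestion earlier in the thread. The defaults follow the summary above (sGPU: gfx90a,gfx942; mGPU: gfx950).

```shell
# Sketch, not the merged code: map the CI config flag to a default target list.
# Both override variables are optional; the MGPU one is a hypothetical name.
select_amdgpu_targets() {
    targets=""
    for arg in "$@"; do
        case "$arg" in
            --config=ci_multi_gpu)
                targets="${TF_ROCM_MGPU_AMDGPU_TARGETS:-gfx950}" ;;
            --config=ci_single_gpu)
                targets="${TF_ROCM_SGPU_AMDGPU_TARGETS:-gfx90a,gfx942}" ;;
        esac
    done
    printf '%s\n' "$targets"
}

# The result would then feed the flag quoted in the PR description:
echo "--repo_env=TF_ROCM_AMDGPU_TARGETS=$(select_amdgpu_targets --config=ci_multi_gpu)"
```

Choosing the list from the config flag rather than from the local GPU keeps the value independent of the runner, which is the concern raised about RBE workers that run a different arch.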

mmakevic-amd merged commit 429f759 into rocm-dev-infra on Apr 7, 2026
1 check failed
mmakevic-amd deleted the mmakevic/fix_mgpu_targets_list branch on April 7, 2026 15:42