Fix hardcoded TF_ROCM_AMDGPU_TARGETS by mmakevic-amd · Pull Request #782 · ROCm/xla

mmakevic-amd · 2026-04-07T13:11:42Z

Motivation

In CI, multiple mGPU tests are failing with:

Could not load RepeatBufferKernel: INTERNAL: Failed call to hipGetFuncBySymbol: hipError_t(98)

The problem is that we run them locally on gfx950, but our Bazel command hardcodes a target list that does not include it:

--repo_env=TF_ROCM_AMDGPU_TARGETS=gfx90a,gfx942

Technical Details

This change makes the testing script more flexible: it automatically detects the target that is present and builds only for that target.
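A minimal sketch of what such detection could look like, assuming `rocminfo` is installed and prints agent lines of the form `Name: gfx950`. The helper name `detect_gfx_targets` is hypothetical and is not taken from this PR. As the review discussion points out, a locally detected target may not match the architecture of an RBE worker, so this is an illustration rather than the merged approach.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the PR): read rocminfo-style text on stdin,
# keep agent names that look like gfx targets, and join them with commas.
detect_gfx_targets() {
    awk '$1 == "Name:" && $2 ~ /^gfx[0-9a-f]+$/ { print $2 }' \
        | sort -u \
        | paste -sd, -
}

# Query the local GPU only when rocminfo is actually available.
if command -v rocminfo >/dev/null 2>&1; then
    AMDGPU_TARGETS="$(rocminfo | detect_gfx_targets)"
    echo "--repo_env=TF_ROCM_AMDGPU_TARGETS=${AMDGPU_TARGETS}"
fi
```

Matching only on the `Name:` field keeps ISA strings such as `amdgcn-amd-amdhsa--gfx950:sramecc+:xnack-` and CPU agents out of the result.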

Test Plan

Needs to be tested in CI

Test Result

TBA

Submission Checklist

Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

leo-automation

Does RBE use TF_ROCM_AMDGPU_TARGETS? If so, I believe that when this script runs on a gfx950 runner, we'll send gfx950 to an RBE worker that runs a different arch.

mmakevic-amd · 2026-04-07T13:17:12Z

Ah, that makes sense. I don't think there is a good way to automatically detect the target then? Should I just add gfx950 to the hardcoded list?

leo-automation · 2026-04-07T13:23:18Z

Yes, but I think doing it in the main if block might be better, like below:

    for arg in "$@"; do
        if [[ "$arg" == "--config=ci_multi_gpu" ]]; then
            TAG_FILTERS=""
            TEST_TARGETS=("${TEST_TARGETS_MGPU[@]}")
            AMDGPU_TARGETS="${TF_ROCM_SGPU_AMDGPU_TARGETS:-gfx90a,gfx942}"
        fi
        if [[ "$arg" == "--config=ci_single_gpu" ]]; then
            TAG_FILTERS="${TAG_FILTERS},gpu,-multi_gpu,-no_oss"
            TEST_TARGETS=("${TEST_TARGETS_SGPU[@]}")
            AMDGPU_TARGETS="${TF_ROCM_SGPU_AMDGPU_TARGETS:-gfx950}"
        fi
    done

mmakevic-amd · 2026-04-07T13:34:25Z

Fixed, please take a look

i-chaochen · 2026-04-07T13:36:27Z

build_tools/rocm/execute_ci_build_upstream.sh

    if [[ "$arg" == "--config=ci_multi_gpu" ]]; then
        TAG_FILTERS=""
        TEST_TARGETS=("${TEST_TARGETS_MGPU[@]}")
+        AMDGPU_TARGETS="${TF_ROCM_AMDGPU_TARGETS:-gfx90a,gfx942}"


Do we still have to hardcode the gfx targets? Can't we just query rocminfo, as is done here? https://github.com/openxla/xla/pull/36513/changes#diff-84cc06d38378056901dad1204d79317414d28d609a1fe884b4b1b52552749ac4R71-R76

I originally did that, but as @leo-amd pointed out, it won't work when using RBE (see discussion above)

build_tools/rocm/execute_ci_build_upstream.sh

i-chaochen

So IIUC, the latest change is to specify gfx942 for sGPU and gfx950 for mGPU?

mmakevic-amd · 2026-04-07T15:20:25Z

Yes, sGPU uses gfx90a and gfx942 and mGPU gfx950
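The mapping confirmed in this reply can be sketched as follows. This is an illustration under stated assumptions, not the merged script: the function name `select_amdgpu_targets` is hypothetical, and using `TF_ROCM_AMDGPU_TARGETS` as an environment override follows the diff line shown earlier in the review.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the PR): print the AMDGPU target list for a
# CI config flag. An already-set TF_ROCM_AMDGPU_TARGETS takes precedence;
# otherwise the per-config default from the discussion is used.
select_amdgpu_targets() {
    local config="$1"
    case "$config" in
        --config=ci_multi_gpu)  echo "${TF_ROCM_AMDGPU_TARGETS:-gfx950}" ;;
        --config=ci_single_gpu) echo "${TF_ROCM_AMDGPU_TARGETS:-gfx90a,gfx942}" ;;
        *) return 1 ;;  # not a config flag this sketch knows about
    esac
}

# Emit the Bazel flag for whichever CI config appears in the arguments.
for arg in "$@"; do
    if targets="$(select_amdgpu_targets "$arg")"; then
        echo "--repo_env=TF_ROCM_AMDGPU_TARGETS=${targets}"
    fi
done
```

The `${VAR:-default}` expansion keeps the hardcoded list as a fallback only, so a runner with different hardware can still override the targets without editing the script.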

mmakevic-amd requested review from i-chaochen and leo-automation April 7, 2026 13:11

leo-automation reviewed Apr 7, 2026

Set TF_ROCM_AMDGPU_TARGETS based on type of tests being run

61de8dd

mmakevic-amd force-pushed the mmakevic/fix_mgpu_targets_list branch from bd90398 to 61de8dd Compare April 7, 2026 13:33

mmakevic-amd changed the title ~~Do not hardcode TF_ROCM_AMDGPU_TARGETS in test command~~ Fix hardcoded TF_ROCM_AMDGPU_TARGETS Apr 7, 2026

i-chaochen approved these changes Apr 7, 2026

leo-automation requested changes Apr 7, 2026

build_tools/rocm/execute_ci_build_upstream.sh (Outdated)

Fix the lists

12c8d2f

i-chaochen reviewed Apr 7, 2026

leo-automation approved these changes Apr 7, 2026

mmakevic-amd merged commit 429f759 into rocm-dev-infra Apr 7, 2026
1 check failed

mmakevic-amd deleted the mmakevic/fix_mgpu_targets_list branch April 7, 2026 15:42

mmakevic-amd merged 2 commits into rocm-dev-infra from mmakevic/fix_mgpu_targets_list
