Fix hardcoded TF_ROCM_AMDGPU_TARGETS #782
leo-automation
left a comment
Does RBE use TF_ROCM_AMDGPU_TARGETS? If so, I believe that running this script on a gfx950 runner would send gfx950 to an RBE worker that runs a different arch.
Ah, that makes sense. I don't think there is a good way to automatically detect the target then? Should I just add gfx950 to the hardcoded list?
Yes, but I think doing it in the main if block might be better, like below:
Force-pushed from bd90398 to 61de8dd
Fixed, please take a look.
if [[ "$arg" == "--config=ci_multi_gpu" ]]; then
    TAG_FILTERS=""
    TEST_TARGETS=("${TEST_TARGETS_MGPU[@]}")
    AMDGPU_TARGETS="${TF_ROCM_AMDGPU_TARGETS:-gfx90a,gfx942}"
Do we still have to hardcode the gfx targets? Can't we just get them from rocminfo, like here? https://github.com/openxla/xla/pull/36513/changes#diff-84cc06d38378056901dad1204d79317414d28d609a1fe884b4b1b52552749ac4R71-R76
I originally did that, but as @leo-amd pointed out, it won't work when using RBE (see discussion above)
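For reference, the rocminfo-based detection discussed above could look roughly like the sketch below. This is not the code from either PR: `parse_gfx_targets` is a hypothetical helper name, and the sketch assumes that rocminfo prints the architecture as a `gfx...` token on its agent `Name:` lines. It also illustrates the RBE concern: the result describes only the GPUs on the machine that runs the script.

```bash
#!/usr/bin/env bash

# Hypothetical helper (not part of the PR): read rocminfo-style text on stdin
# and print the distinct gfx architectures as a comma-separated list.
# Assumption: each architecture shows up as a "gfx..." token in the output.
parse_gfx_targets() {
  grep -oE 'gfx[0-9a-f]+' | LC_ALL=C sort -u | paste -sd, -
}

# Intended use on a machine with ROCm installed:
#   AMDGPU_TARGETS="$(rocminfo | parse_gfx_targets)"
```

Because the list reflects the local runner only, a value detected on a gfx950 machine would still be forwarded to remote workers that may have a different architecture.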
i-chaochen
left a comment
So IIUC, the latest change is to specify gfx942 for sGPU and gfx950 for mGPU?
Yes, sGPU uses gfx90a and gfx942, and mGPU uses gfx950.
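A minimal sketch of that split, assuming the script keeps the `${TF_ROCM_AMDGPU_TARGETS:-...}` override pattern from the diff above. `select_amdgpu_targets` is a hypothetical function name used only for illustration; the actual script sets `AMDGPU_TARGETS` inline in its argument loop.

```bash
#!/usr/bin/env bash

# Hypothetical helper (illustration only): choose default AMDGPU targets from
# the Bazel config argument, letting TF_ROCM_AMDGPU_TARGETS override them.
select_amdgpu_targets() {
  local arg="$1"
  local default_targets
  if [[ "$arg" == "--config=ci_multi_gpu" ]]; then
    # mGPU tests run on gfx950 per the discussion above.
    default_targets="gfx950"
  else
    # sGPU tests keep the previous hardcoded list.
    default_targets="gfx90a,gfx942"
  fi
  printf '%s\n' "${TF_ROCM_AMDGPU_TARGETS:-$default_targets}"
}
```

Calling `select_amdgpu_targets --config=ci_multi_gpu` with `TF_ROCM_AMDGPU_TARGETS` unset prints the mGPU default; exporting the variable replaces either default.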
Motivation
In CI, multiple mGPU tests are failing with:
The problem is that we run them locally on gfx950, and in our Bazel command we hardcoded:
Technical Details
This change makes the testing script more flexible: it automatically detects the present target and builds only for that target.
Test Plan
Needs to be tested in CI
Test Result
TBA
Submission Checklist