
Bump rocm-systems from b9e258c to 478a7a4 #4337

Merged
amd-justchen merged 1 commit into main from bump-rocm-systems-478a7a4
Apr 4, 2026

@systems-assistant

Bumps ROCm/rocm-systems from b9e258c to 478a7a4.

Commits

See full comparison here:

ROCm/rocm-systems@b9e258c...478a7a4



@geomin12 geomin12 left a comment


lgtm if CI is good

@amd-justchen

amd-justchen commented Apr 4, 2026

Seeing the usual flaky tests on the first run attempt (as in #4324, which passed after re-runs).

Failed sub-jobs:

Linux::release / Test gfx1151 / Test Sanity Check / Test sanity (shard 1/1) (gfx1151)
https://github.com/ROCm/TheRock/actions/runs/23953754624/job/69881399380
Linux::release / Test gfx94X-dcgpu / Test hip-tests / Test hip-tests (shard 3/4) (gfx94X-dcgpu)
https://github.com/ROCm/TheRock/actions/runs/23953754624/job/69881598643
Linux::release / Test gfx94X-dcgpu / Test hip-tests / Test hip-tests (shard 2/4) (gfx94X-dcgpu)
https://github.com/ROCm/TheRock/actions/runs/23953754624/job/69881598669
Linux::release / Test gfx94X-dcgpu / Test hip-tests / Test hip-tests (shard 1/4) (gfx94X-dcgpu)
https://github.com/ROCm/TheRock/actions/runs/23953754624/job/69881598673
Linux::release / Test gfx94X-dcgpu / Test rocsolver / Test rocsolver (shard 1/2) (gfx94X-dcgpu)
https://github.com/ROCm/TheRock/actions/runs/23953754624/job/69881598794
Linux::release / Test gfx94X-dcgpu / Test hipblaslt / Test hipblaslt (shard 5/6) (gfx94X-dcgpu)
https://github.com/ROCm/TheRock/actions/runs/23953754624/job/69881598819
Linux::release / Test gfx94X-dcgpu / Test hipblaslt / Test hipblaslt (shard 1/6) (gfx94X-dcgpu)
https://github.com/ROCm/TheRock/actions/runs/23953754624/job/69881598848

Known flaky tests: these hit NonHost printf test errors that typically succeed on retry (across shards 1, 2, and 3 of 4 total). Re-running.
1780 - Unit_NonHost_Printf_loop (Subprocess aborted)
1781 - Unit_NonHost_Printf_multiple_Threads (Subprocess aborted)
1782 - Unit_NonHost_Printf_BufferAvailability (Subprocess aborted)
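
The re-run-until-pass triage described above can be sketched in a few lines. This is only an illustration, not the CI's actual retry mechanism: `run_until_pass`, `classify`, and `make_flaky` are hypothetical names, and the fake test stands in for a real test invocation. (CTest itself offers a comparable `--repeat until-pass:<n>` option.)

```python
from typing import Callable, Tuple


def run_until_pass(test: Callable[[], bool], max_attempts: int = 3) -> Tuple[bool, int]:
    """Run `test` until it passes or `max_attempts` is reached.

    Returns (passed, attempts_used).
    """
    for attempt in range(1, max_attempts + 1):
        if test():
            return True, attempt
    return False, max_attempts


def classify(passed: bool, attempts: int) -> str:
    """Label a result: clean pass, flaky (passed only on retry), or failure."""
    if passed and attempts == 1:
        return "pass"
    if passed:
        return "flaky"
    return "fail"


def make_flaky(failures_before_pass: int) -> Callable[[], bool]:
    """Hypothetical stand-in for a test that fails N times, then passes."""
    remaining = [failures_before_pass]

    def test() -> bool:
        if remaining[0] > 0:
            remaining[0] -= 1
            return False
        return True

    return test


# A test that aborts once and then succeeds is reported as flaky.
flaky_result = run_until_pass(make_flaky(1))
# A test that keeps failing exhausts its attempts and is reported as a failure.
broken_result = run_until_pass(make_flaky(5), max_attempts=3)
print(classify(*flaky_result), classify(*broken_result))
```

Capping the number of attempts keeps a genuinely broken test from being masked indefinitely by retries.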

Linux::release / Test gfx94X-dcgpu / Test rocwmma / Test rocwmma (shard 4/4) (gfx94X-dcgpu)
https://github.com/ROCm/TheRock/actions/runs/23953754624/job/69881598875
Linux::release / Test gfx94X-dcgpu / Test rocwmma / Test rocwmma (shard 1/4) (gfx94X-dcgpu)
https://github.com/ROCm/TheRock/actions/runs/23953754624/job/69881598882
Linux::release / Test gfx94X-dcgpu / Test rocwmma / Test rocwmma (shard 2/4) (gfx94X-dcgpu)
https://github.com/ROCm/TheRock/actions/runs/23953754624/job/69881598888
Linux::release / Test gfx94X-dcgpu / Test rocwmma / Test rocwmma (shard 3/4) (gfx94X-dcgpu)
https://github.com/ROCm/TheRock/actions/runs/23953754624/job/69881598891

Known flaky tests: all of these timed out after an hour. When they pass, they typically finish in ~5-15 minutes. Re-running.

Linux::release / Build PyTorch | gfx120X-all / Build PyTorch | gfx120X-all | torch release/2.10 | py3.12
https://github.com/ROCm/TheRock/actions/runs/23953754624/job/69884838061
Linux::release / Build PyTorch | gfx110X-all / Build PyTorch | gfx110X-all | torch release/2.10 | py3.12
https://github.com/ROCm/TheRock/actions/runs/23953754624/job/69884838064
Linux::release / Build PyTorch | gfx1151 / Build PyTorch | gfx1151 | torch release/2.10 | py3.12
https://github.com/ROCm/TheRock/actions/runs/23953754624/job/69884838079
Linux::release / Build PyTorch | gfx94X-dcgpu / Build PyTorch | gfx94X-dcgpu | torch release/2.10 | py3.12
https://github.com/ROCm/TheRock/actions/runs/23953754624/job/69884838081

Known failure with PyTorch 2.10, but there is no need to gate the bump PR on it. The PyTorch team is working on it, according to @amd-aakash.

Windows::release / Test gfx110X-all / Test rocthrust / Test rocthrust (shard 1/1) (gfx110X-all)
https://github.com/ROCm/TheRock/actions/runs/23953754624/job/69889817114

Test timed out. Re-running.

@amd-justchen

amd-justchen commented Apr 4, 2026

Allowing the PyTorch 2.10 failures and waiting only for the remaining rocwmma tests to pass (shards 2 and 3; shards 1 and 4 passed on re-run).

Making the call that these will most likely pass eventually with further re-runs, given that they did for the previously mentioned bump PR. (I also suspect these aren't actually being sharded and that all 4 jobs are running the same tests.)

With that said, I will merge this bump PR.

@amd-justchen amd-justchen merged commit 3bacd1e into main Apr 4, 2026
528 of 559 checks passed
@amd-justchen amd-justchen deleted the bump-rocm-systems-478a7a4 branch April 4, 2026 00:26
@github-project-automation github-project-automation Bot moved this from TODO to Done in TheRock Triage Apr 4, 2026
@amd-justchen

amd-justchen commented Apr 4, 2026

Details on the rocwmma jobs not sharding: each of the 4 jobs runs all 28 tests. (I discovered this by accident on a shared MI300 dev machine when reusing the hip-tests ctest command, which did have sharding args.)
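
For comparison, here is a minimal sketch of what a working 4-way split of 28 tests would look like, assuming stride-based sharding in the style of CTest's `-I Start,End,Stride` (`--tests-information`) option. Whether the hip-tests command uses exactly this form is an assumption; `shard_tests` and `ctest_shard_args` are hypothetical helper names.

```python
from typing import List


def shard_tests(test_ids: List[int], shard_index: int, total_shards: int) -> List[int]:
    """Return the tests a 1-based shard runs under a stride-based split.

    Shard i takes tests i, i + total_shards, i + 2 * total_shards, ...
    """
    if not 1 <= shard_index <= total_shards:
        raise ValueError("shard_index must be in 1..total_shards")
    return test_ids[shard_index - 1::total_shards]


def ctest_shard_args(shard_index: int, total_shards: int) -> List[str]:
    """Build ctest arguments of the form Start,,Stride (End left empty)."""
    return ["--tests-information", f"{shard_index},,{total_shards}"]


tests = list(range(1, 29))  # 28 tests, numbered the way ctest numbers them
shards = [shard_tests(tests, i, 4) for i in range(1, 5)]

for i, shard in enumerate(shards, start=1):
    print(i, len(shard), " ".join(ctest_shard_args(i, 4)))
```

With sharding in effect, each job would run 7 of the 28 tests and the four shards would be disjoint; a job that reports all 28 tests is running without the sharding arguments.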

amd-justchen added a commit that referenced this pull request Apr 6, 2026
lajagapp pushed a commit that referenced this pull request Apr 7, 2026
rahulc-gh pushed a commit that referenced this pull request Apr 9, 2026
Bumps [ROCm/rocm-systems](https://github.com/ROCm/rocm-systems) from
`b9e258c` to `478a7a4`.

<details>
<summary>Commits</summary>

See full comparison here:


ROCm/rocm-systems@b9e258c...478a7a4

</details>
<br />

Co-authored-by: therockbot <therockbot@amd.com>