Add ROCm Qwen3.5-9B smoke path by benenzhu · Pull Request #2 · AMD-AIM/Relax

benenzhu · 2026-04-27T09:15:54Z

Summary

- Adds a clean-start `amd/run_qwen35-9b.sh` wrapper for Qwen3.5-9B DAPO-Math ROCm smoke runs.
- Installs ROCm TransformerEngine wheels and flash-linear-attention in the ROCm Dockerfile so Megatron TE and Qwen3.5 GatedDeltaNet can initialize.
- Carries over the verified ROCm runtime workarounds: TE auto attention, BSHD, disabled Dynamo/JIT fuser, and 1-GPU SGLang rollout engines.
- Reduces the default Qwen3.5 runner to a smaller smoke profile after the full-style rollout configuration hit SGLang logits OOM.

Validation

- `pre-commit run --all-files --show-diff-on-failure` passes.
- `bash amd/run_qwen35-9b.sh` restarted Ray cleanly and launched the direct runner.
- Megatron Qwen3.5 GatedDeltaNet initialized after installing flash-linear-attention.
- Two 1-GPU SGLang rollout engines started and avoided the prior invalid device ordinal error.
- All 5 services registered successfully.
- The run reached step 0: rollout, reference logprob, actor_fwd logprob, advantages, and actor training started.
- TE selected FusedAttention backend (sub-backend 1) during Megatron logprob/training.
- The final reduced smoke profile was not rerun after lowering rollout pressure further.
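
The clean-start behavior described above (restart Ray, then hand off to the direct runner) could look roughly like the sketch below. This is not the contents of `amd/run_qwen35-9b.sh`, which this page does not show. `RUNNER`, `DRY_RUN`, `run`, and `clean_start` are hypothetical names, and the assumption that the wrapper itself starts the Ray head node may not match the real script. `ray stop --force` and `ray start --head` are standard Ray CLI commands.

```shell
# Hypothetical clean-start sketch; names and ordering are assumptions.
RUNNER="${RUNNER:-amd/run-qwen35-9b-dapo-math-direct.sh}"
DRY_RUN="${DRY_RUN:-1}"   # default: only print the commands

run() {
  # Echo the command, and execute it only when DRY_RUN=0.
  echo "+ $*"
  if [ "$DRY_RUN" = "0" ]; then
    "$@"
  fi
}

clean_start() {
  run ray stop --force     # tear down any stale Ray processes
  run ray start --head     # bring up a fresh head node
  run bash "$RUNNER"       # hand off to the direct runner
}
```

Calling `clean_start` with the default `DRY_RUN=1` only prints the three commands, which makes it easy to check the order before running them for real with `DRY_RUN=0 clean_start`.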

Notes

The full-style Qwen3.5 rollout settings progressed much further than before but eventually hit HIP OOM in SGLang logits allocation while full token usage reached ~1.0. This PR keeps the bring-up work and sets safer defaults for the next smoke attempt.

Made with Cursor

# ⭐ Feature

## Add Qwen3.5-9B ROCm smoke path

- Add one-command Qwen3.5-9B DAPO-Math smoke runner that restarts Ray from a clean state.
- Align Qwen3.5 runtime flags with the validated Qwen3-4B ROCm path: TE auto attention, BSHD, and disabled Dynamo/JIT fuser.
- Reduce the default Qwen3.5 run to a smaller smoke profile to avoid filling SGLang KV/logits memory during initial validation.

Made-with: Cursor

---

# 🐛 Bug Fix

## Install ROCm training dependencies in Dockerfile

- Install ROCm TransformerEngine 2.4.0 wheels for Megatron TE specs.
- Install flash-linear-attention for Qwen3.5 GatedDeltaNet initialization.

---

# 📝 Documentation

## Record Qwen3.5 ROCm bring-up results

- Document FLA import resolution, SGLang rollout device ordinal fix, and the full-profile rollout OOM finding.
- Record that the smoke reached all service registration and step-0 training/logprob flow.

---

# 🎨 Style

## Apply pre-commit formatting

- Apply import ordering, doc formatting, script copyright, and formatting fixes from pre-commit.

Resolve AMD runner conflicts after the Qwen3-4B ROCm PR landed on main. Keep the Qwen3.5 smoke runner changes while preserving the main branch's Qwen3-4B baseline, and avoid hardcoded private IP defaults in local smoke wrappers.

Made-with: Cursor
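
One way to avoid a hardcoded private IP default is to resolve the address at launch time. The sketch below is an illustration under assumptions, not the code merged in this PR: `detect_master_addr` is a hypothetical helper, and the precedence (explicit `MASTER_ADDR`, then the first address from `hostname -I` on Linux, then loopback) is one reasonable choice among several.

```shell
# Hypothetical helper: resolve the master address without a baked-in IP.
detect_master_addr() {
  # 1. An explicit override always wins.
  if [ -n "${MASTER_ADDR:-}" ]; then
    echo "$MASTER_ADDR"
    return 0
  fi
  # 2. Otherwise take the first address reported by `hostname -I` (Linux).
  local ip
  ip="$(hostname -I 2>/dev/null | awk '{print $1}')"
  # 3. Fall back to loopback, which is enough for a single-node smoke run.
  echo "${ip:-127.0.0.1}"
}

# Usage in a wrapper script:
#   MASTER_ADDR="$(detect_master_addr)"
#   export MASTER_ADDR
```

Keeping the override first means a multi-node launcher can still pin the address explicitly, while a local smoke run needs no site-specific default.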

gemini-code-assist

Code Review

This pull request introduces a one-command smoke runner for Qwen3.5-9B on ROCm and updates the documentation with validation and troubleshooting details. Key changes include updating the ROCm Dockerfile to install TransformerEngine and flash-linear-attention, implementing dynamic master address detection in shell scripts, and optimizing rollout parameters for smoke testing. Review feedback identified redundant path entries in the PYTHONPATH environment variable within the newly added shell scripts.
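
The review's point about redundant `PYTHONPATH` entries can be handled by deduplicating the list before exporting it. The actual suggestion applied in this PR is not shown on this page, so the following is only a sketch; `dedupe_path` is a hypothetical helper that keeps the first occurrence of each entry and drops empty ones.

```shell
# Hypothetical helper: remove repeated and empty entries from a
# colon-separated path list, preserving first-seen order.
dedupe_path() {
  local input="$1" entry result=""
  local IFS=':'
  for entry in $input; do
    [ -z "$entry" ] && continue
    case ":$result:" in
      *":$entry:"*) ;;                            # already present, skip
      *) result="${result:+$result:}$entry" ;;    # append new entry
    esac
  done
  echo "$result"
}

# Usage in a runner script:
#   PYTHONPATH="$(dedupe_path "${PYTHONPATH:-}")"
#   export PYTHONPATH
# Example: dedupe_path "/a:/b:/a" prints /a:/b
```

Preserving first-seen order matters because Python searches `PYTHONPATH` entries left to right, so removing later duplicates does not change which module wins.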

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

# 🐛 Bug Fix

## Avoid debug logging failures during tiny smoke runs

- Skip rollout metric logging when the transient training debug buffer does not include response length metadata yet.
- Keep training from failing before the rollout/logprob path has completed.

Made-with: Cursor

---

# ⭐ Feature

## Tighten Qwen3.5-9B smoke defaults

- Use a smaller global batch for the reduced smoke profile while preserving GRPO grouping constraints.
- Record the latest validation findings in the AMD runbook.
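
When shrinking the global batch, a quick pre-flight check can catch values that break GRPO grouping. The exact constraint enforced by this codebase is not stated in the PR, so the sketch below assumes the common one: the global batch must be a positive multiple of the number of samples per prompt, so that each group of responses stays together. `check_grpo_batch` and its arguments are hypothetical.

```shell
# Hypothetical pre-flight check, assuming the constraint is
#   global_batch % samples_per_prompt == 0.
# Prints the number of prompt groups per batch on success.
check_grpo_batch() {
  local global_batch="$1" samples_per_prompt="$2"
  if [ "$global_batch" -le 0 ] || [ "$samples_per_prompt" -le 0 ]; then
    echo "batch sizes must be positive" >&2
    return 1
  fi
  if [ $(( global_batch % samples_per_prompt )) -ne 0 ]; then
    echo "global batch $global_batch is not a multiple of group size $samples_per_prompt" >&2
    return 1
  fi
  echo "$(( global_batch / samples_per_prompt ))"
}

# Example: check_grpo_batch 16 8 prints 2; check_grpo_batch 12 8 fails.
```

Running such a check before launching Ray fails fast on a bad profile instead of after model loading.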

benenzhu added 4 commits April 27, 2026 11:28

add first qwen3

95f2698

zz

4b6d22b

merge: sync qwen35 branch with main

f7102af

gemini-code-assist Bot reviewed Apr 27, 2026

Comment thread amd/run-qwen35-9b-dapo-math-direct.sh Outdated

Comment thread amd/run_qwen35-9b.sh Outdated

benenzhu and others added 3 commits April 27, 2026 17:17

Update amd/run-qwen35-9b-dapo-math-direct.sh

d909fef

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

Update amd/run_qwen35-9b.sh

b5e2a03

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

benenzhu merged commit 1354cc4 into main Apr 28, 2026

benenzhu merged 7 commits into main from zty_dev_qwen35
