Add ROCm Qwen3.5-9B smoke path #2

Merged
benenzhu merged 7 commits into main from zty_dev_qwen35 on Apr 28, 2026
Conversation

@benenzhu
Summary

  • Adds a clean-start amd/run_qwen35-9b.sh wrapper for Qwen3.5-9B DAPO-Math ROCm smoke runs.
  • Installs ROCm TransformerEngine wheels and flash-linear-attention in the ROCm Dockerfile so Megatron TE and Qwen3.5 GatedDeltaNet can initialize.
  • Carries over the verified ROCm runtime workarounds: TE auto attention, BSHD, disabled Dynamo/JIT fuser, and 1-GPU SGLang rollout engines.
  • Reduces the default Qwen3.5 runner to a smaller smoke profile after the full-style rollout configuration hit SGLang logits OOM.

Validation

  • pre-commit run --all-files --show-diff-on-failure passes.
  • bash amd/run_qwen35-9b.sh restarted Ray cleanly and launched the direct runner.
  • Megatron Qwen3.5 GatedDeltaNet initialized after installing flash-linear-attention.
  • Two 1-GPU SGLang rollout engines started and avoided the prior invalid device ordinal error.
  • All 5 services registered successfully.
  • The run reached step 0: rollout, reference logprob, actor_fwd logprob, advantages, and actor training started.
  • TE selected FusedAttention backend (sub-backend 1) during Megatron logprob/training.
  • The final reduced smoke profile was not rerun after lowering rollout pressure further.

Notes

The full-style Qwen3.5 rollout settings progressed much further than before but eventually hit HIP OOM in SGLang logits allocation while full token usage reached ~1.0. This PR keeps the bring-up work and sets safer defaults for the next smoke attempt.
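
As a rough illustration of why logits allocation is sensitive to rollout pressure, the sketch below estimates the size of a dense logits tensor, assuming it is materialized as `[num_tokens, vocab_size]`. The function name and every number are placeholders for illustration, not measurements from this run or values taken from SGLang.

```python
def logits_bytes(num_tokens: int, vocab_size: int, bytes_per_element: int) -> int:
    """Size in bytes of a dense [num_tokens, vocab_size] logits tensor."""
    return num_tokens * vocab_size * bytes_per_element


# Hypothetical inputs (assumptions, not values from this PR).
tokens_with_logits = 16_384   # tokens whose logits exist at the same time
vocab_size = 150_000          # placeholder vocabulary size
bytes_per_element = 4         # float32

size_gib = logits_bytes(tokens_with_logits, vocab_size, bytes_per_element) / 2**30
print(f"estimated logits tensor: {size_gib:.2f} GiB")
```

Because the estimate scales linearly with the number of tokens, lowering batch size or response length in the smoke profile shrinks it proportionally.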

Made with Cursor

# ⭐ Feature

## Add Qwen3.5-9B ROCm smoke path

- Add one-command Qwen3.5-9B DAPO-Math smoke runner that restarts Ray from a clean state.
- Align Qwen3.5 runtime flags with the validated Qwen3-4B ROCm path: TE auto attention, BSHD, and disabled Dynamo/JIT fuser.
- Reduce the default Qwen3.5 run to a smaller smoke profile to avoid filling SGLang KV/logits memory during initial validation.

Made-with: Cursor

---

# 🐛 Bug Fix

## Install ROCm training dependencies in Dockerfile

- Install ROCm TransformerEngine 2.4.0 wheels for Megatron TE specs.
- Install flash-linear-attention for Qwen3.5 GatedDeltaNet initialization.

---

# 📝 Documentation

## Record Qwen3.5 ROCm bring-up results

- Document FLA import resolution, SGLang rollout device ordinal fix, and the full-profile rollout OOM finding.
- Record that the smoke reached all service registration and step-0 training/logprob flow.

---

# 🎨 Style

## Apply pre-commit formatting

- Apply import ordering, doc formatting, script copyright, and formatting fixes from pre-commit.

Resolve AMD runner conflicts after the Qwen3-4B ROCm PR landed on main.

Keep the Qwen3.5 smoke runner changes while preserving the main branch's Qwen3-4B baseline, and avoid hardcoded private IP defaults in local smoke wrappers.
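
One way to avoid a hardcoded address is to honor an explicit override and otherwise detect the local address at launch. The sketch below shows that idea in Python rather than in the shell wrappers themselves; `resolve_master_addr` is a hypothetical helper, and the use of the `MASTER_ADDR` environment variable as the override is an assumption.

```python
import os
import socket


def resolve_master_addr(env=os.environ) -> str:
    """Return an explicit MASTER_ADDR if set, else a detected local address."""
    override = env.get("MASTER_ADDR")
    if override:
        return override
    try:
        # Connecting a UDP socket sends no packets; it only makes the OS
        # pick the outbound interface, whose address we then read back.
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            sock.connect(("10.255.255.255", 1))
            return sock.getsockname()[0]
    except OSError:
        # No usable route (for example, an isolated container): use loopback.
        return "127.0.0.1"


print(resolve_master_addr())
```

The loopback fallback only makes sense for single-node smoke runs; a multi-node launch would need the override set explicitly.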

Made-with: Cursor
@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces a one-command smoke runner for Qwen3.5-9B on ROCm and updates the documentation with validation and troubleshooting details. Key changes include updating the ROCm Dockerfile to install TransformerEngine and flash-linear-attention, implementing dynamic master address detection in shell scripts, and optimizing rollout parameters for smoke testing. Review feedback identified redundant path entries in the PYTHONPATH environment variable within the newly added shell scripts.
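
The redundant `PYTHONPATH` entries flagged in the review can be illustrated with an order-preserving de-duplication. This is a minimal Python sketch of the idea, not the change applied to the shell scripts; `dedupe_path` and the example paths are hypothetical.

```python
import os


def dedupe_path(value: str, sep: str = os.pathsep) -> str:
    """Drop empty and repeated entries, keeping the first occurrence of each."""
    seen = set()
    kept = []
    for entry in value.split(sep):
        if entry and entry not in seen:
            seen.add(entry)
            kept.append(entry)
    return sep.join(kept)


# Hypothetical paths used only to show the behavior.
print(dedupe_path("/workspace/a:/workspace/b:/workspace/a", sep=":"))
```

Keeping the first occurrence preserves import precedence, since Python searches path entries in order.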

Comment thread amd/run-qwen35-9b-dapo-math-direct.sh Outdated
Comment thread amd/run_qwen35-9b.sh Outdated
benenzhu and others added 3 commits April 27, 2026 17:17
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
# 🐛 Bug Fix

## Avoid debug logging failures during tiny smoke runs

- Skip rollout metric logging when the transient training debug buffer does not include response length metadata yet.
- Keep training from failing before the rollout/logprob path has completed.
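
The guard described above might look like the following sketch. The key name `response_lengths`, the metric names, and the function itself are assumptions for illustration; the actual buffer layout in the repository is not shown in this PR.

```python
from typing import Any, Mapping, Optional


def rollout_length_metrics(debug_buffer: Mapping[str, Any]) -> Optional[dict]:
    """Return response-length metrics, or None if the metadata is not there yet."""
    lengths = debug_buffer.get("response_lengths")  # hypothetical key
    if not lengths:
        # Metadata missing or empty: skip logging instead of raising.
        return None
    return {
        "response_len/mean": sum(lengths) / len(lengths),
        "response_len/max": max(lengths),
    }


metrics = rollout_length_metrics({})  # early in a tiny smoke run
if metrics is not None:
    print(metrics)
```

Returning `None` lets the caller skip the log call, so a missing field never interrupts training.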

Made-with: Cursor

---

# ⭐ Feature

## Tighten Qwen3.5-9B smoke defaults

- Use a smaller global batch for the reduced smoke profile while preserving GRPO grouping constraints.
- Record the latest validation findings in the AMD runbook.
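
GRPO computes advantages within a group of responses to the same prompt, so a reduced batch still needs to hold whole groups. The check below is a sketch under two assumed constraints: the global batch is a multiple of the group size, and the samples from one rollout divide evenly into global batches. The parameter names are hypothetical and may not match the runner's flags.

```python
def steps_per_rollout(
    global_batch_size: int,
    samples_per_prompt: int,
    prompts_per_rollout: int,
) -> int:
    """Validate the assumed grouping constraints and return optimizer steps per rollout."""
    if global_batch_size % samples_per_prompt != 0:
        raise ValueError("global batch would split a prompt group")
    total_samples = prompts_per_rollout * samples_per_prompt
    if total_samples % global_batch_size != 0:
        raise ValueError("rollout samples do not fill whole global batches")
    return total_samples // global_batch_size


# Hypothetical smoke-sized values.
print(steps_per_rollout(global_batch_size=16, samples_per_prompt=8, prompts_per_rollout=4))
```

Running a check like this before launch surfaces a bad batch combination immediately rather than after the rollout has finished.
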
@benenzhu benenzhu merged commit 1354cc4 into main Apr 28, 2026