[AMD/Hyperloom] Tune dsr1-fp8-mi355x-sglang: --num-continuous-decode-steps 4 → 8 #1109
Open
lishuoshuo-amd wants to merge 2 commits into SemiAnalysisAI:main
Author: @claude review
Collaborator: cc @Duyi-Wang
limou102 approved these changes on Apr 27, 2026
54aee90 to b10c872
Description
Tune --num-continuous-decode-steps from 4 to 8 for DeepSeek-R1-0528 FP8 on MI355X (SGLang).
Increasing continuous decode steps reduces prefill/decode scheduling overhead, lowering per-token latency (TPOT) and improving overall throughput.
Changes
benchmarks/single_node/dsr1_fp8_mi355x.sh: --num-continuous-decode-steps 4 → 8
perf-changelog.yaml: Added changelog entry
Performance Results
Hyperloom CI Optimization Report (conc=64, 1k/1k)
Full Parameter Sweep (12 points, 0 failures)
Verified across the complete (tp, conc, isl, osl) search space from amd-master.yaml.
Average gain: +4.7%, with a positive improvement across all parameter combinations and no regressions.
Baseline Validation Against InferenceX Official
Baseline aligns within <1% of official InferenceX data, confirming test environment reliability.
Note: All throughput numbers in this PR refer to output (decode) token throughput, never total. The "Optimization Report" and "Baseline Validation" tables show per-GPU values; the "Full Parameter Sweep" table shows aggregate (TP-summed) values from raw SGLang output_throughput. Per-GPU = aggregate / TP. Gain percentages are unit-invariant.
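The unit conversion in the note above can be sketched in a few lines of Python. This is only an illustration: the function names, the TP value, and the throughput figures below are hypothetical and are not taken from the PR's tables. It shows that per-GPU throughput is the aggregate divided by TP, and that the gain percentage comes out the same in either unit because the TP factor cancels.

```python
import math


def per_gpu_throughput(aggregate_tok_s: float, tp: int) -> float:
    """Per-GPU output throughput = aggregate (TP-summed) throughput / TP."""
    if tp <= 0:
        raise ValueError("tp must be a positive integer")
    return aggregate_tok_s / tp


def gain_pct(baseline: float, tuned: float) -> float:
    """Relative gain of tuned over baseline, in percent."""
    return 100.0 * (tuned - baseline) / baseline


# Hypothetical values for illustration only (not measurements from this PR).
TP = 8
baseline_aggregate = 8000.0  # output tok/s summed over all TP ranks
tuned_aggregate = 8400.0

baseline_per_gpu = per_gpu_throughput(baseline_aggregate, TP)
tuned_per_gpu = per_gpu_throughput(tuned_aggregate, TP)

gain_aggregate = gain_pct(baseline_aggregate, tuned_aggregate)
gain_per_gpu = gain_pct(baseline_per_gpu, tuned_per_gpu)

# Dividing both numbers by the same TP leaves the ratio unchanged.
assert math.isclose(gain_aggregate, gain_per_gpu)
```

Because both the baseline and the tuned value are divided by the same TP, gains can be compared across the per-GPU and aggregate tables without conversion, as long as both runs use the same TP.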
Related Issue
Automated optimization by Hyperloom CI.
Type of Change
Checklist
perf-changelog.yaml