
dsv4-b300-sglang: update points#1179

Merged
Oseltamivir merged 19 commits into main from dsv4-b300-sglang-conc2048-mega-moe
Apr 28, 2026

Conversation

@yhyang201
Collaborator

@yhyang201 commented Apr 26, 2026

Summary

Test plan

🤖 Generated with Claude Code

@github-actions
Contributor

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR there first before we can merge your PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you.

PR authors are responsible for ensuring that, after merging, all GitHub Actions jobs fully pass. A lot of the time, failures are just flakes, and simply re-running the failed jobs will fix them. If re-running failed jobs is attempted, PR authors are responsible for ensuring they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

Comment on lines +80 to +94
export SGLANG_OPT_USE_DEEPGEMM_MEGA_MOE=1
export SGLANG_OPT_FIX_HASH_MEGA_MOE=1
export SGLANG_OPT_DEEPGEMM_MEGA_MOE_NUM_MAX_TOKENS_PER_RANK=288
PARALLEL_ARGS=(
--dp-size "$TP"
--enable-dp-attention
--moe-a2a-backend deepep
--cuda-graph-max-bs 288
--deepep-config "$DEEPEP_CONFIG"
--chunked-prefill-size 65536
--tokenizer-worker-num 4
--enable-prefill-delayer
)
MAX_RUNNING_REQUESTS=2560
MEM_FRACTION_STATIC=0.87
Contributor


🟡 Two pre-existing comments immediately above the DP_ATTENTION block became inaccurate after this PR added the CONC=2048 branch. The block comment at lines 63-66 still describes the recipe as "flashinfer_mxfp4 runner + halved prefill chunks + prefill-delayer", but the new CONC=2048 path uses --moe-a2a-backend deepep and --chunked-prefill-size 65536 (4x the non-DP value of 8192, not halved). Line 69 says the DP-attn branch "overrides to 0.94", but it now overrides to either 0.94 or 0.87 depending on CONC — worth refreshing the comments alongside this change so future maintainers don't trust stale assumptions.

Extended reasoning...

What the stale comments say

Lines 63-66 contain the rationale comment for the DP_ATTENTION dispatch block:

Pick the parallelism + MoE backend based on DP_ATTENTION (mirrors the vllm script's pattern). DP-attention runs the empirically-tuned high-concurrency recipe (flashinfer_mxfp4 runner + halved prefill chunks + prefill-delayer); single-instance uses flashinfer_mxfp4 with the cookbook defaults.

Line 69 contains:

# Default; the DP-attn branch below overrides to 0.94.

Both were accurate before this PR — the DP-attn branch was a single recipe that always used flashinfer_mxfp4, set --chunked-prefill-size 16384 (half the previous 32768 cookbook value, hence "halved"), and always set MEM_FRACTION_STATIC=0.94.

Why this PR makes them inaccurate

The new if [ "$CONC" = "2048" ]; then ... else ... split inside the DP-attn branch breaks both invariants:

  1. The CONC=2048 path uses --moe-a2a-backend deepep (not flashinfer_mxfp4), SGLANG_OPT_USE_DEEPGEMM_MEGA_MOE=1 (the mega_moe deepep recipe — described in the PR description and changelog as a different recipe family entirely), and --chunked-prefill-size 65536. The block comment now describes only half of the DP-attn cases.

  2. The wording "halved prefill chunks" is now actively misleading: 65536 is 8x the non-DP path's --chunked-prefill-size 8192, i.e. multiplied, not halved. A reader looking at line 65 next to lines 78-94 will see a direct contradiction.

  3. MEM_FRACTION_STATIC is now overridden to 0.94 (CONC<2048) or 0.87 (CONC=2048), so line 69's single-value claim is no longer correct.

Step-by-step proof

  • Before this PR: DP_ATTENTION=true → always --moe-runner-backend flashinfer_mxfp4, --chunked-prefill-size 16384, MEM_FRACTION_STATIC=0.94. Comments are correct.
  • After this PR with DP_ATTENTION=true CONC=2048: --moe-a2a-backend deepep (not flashinfer_mxfp4) ✗, --chunked-prefill-size 65536 (not halved relative to non-DP 8192 — it's 8x) ✗, MEM_FRACTION_STATIC=0.87 (not 0.94) ✗. All three claims fail.
  • After this PR with DP_ATTENTION=true CONC=1024: comments still happen to be correct, but a maintainer reading them as describing "the DP-attn recipe" will be wrong about the other branch.

Severity / impact

This is a documentation accuracy issue, not a behavioral bug — runtime behavior is unaffected. But the file's comments are explicitly there to give future maintainers the empirical rationale ("empirically-tuned", "cookbook defaults"), and silently letting them drift turns future debugging into a trap. Easiest fix is to update lines 63-66 to mention both recipes (flashinfer_mxfp4 + halved chunks for CONC<2048; mega_moe deepep + larger chunks for CONC=2048) and reword line 69 to say the DP-attn branch overrides to 0.94 or 0.87 depending on CONC.
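The comment refresh the reviewer suggests can be sketched as follows. This is an illustrative reconstruction, not the actual script: the variable names `DP_ATTENTION`, `CONC`, and `MEM_FRACTION_STATIC` and the flag values follow the snippet and review above, while `MOE_ARGS` and the `0.85` default are placeholders.

```shell
# Example inputs: the new high-concurrency DP-attention recipe.
DP_ATTENTION=true
CONC=2048

# Default (placeholder value); the DP-attn branch below overrides
# to 0.94 or 0.87 depending on CONC.
MEM_FRACTION_STATIC=0.85

# DP-attention now carries two empirically-tuned recipes:
#   CONC < 2048: flashinfer_mxfp4 runner + halved prefill chunks (16384) + prefill-delayer
#   CONC = 2048: mega_moe deepep recipe + larger prefill chunks (65536)
if [ "$DP_ATTENTION" = "true" ]; then
  if [ "$CONC" = "2048" ]; then
    MOE_ARGS="--moe-a2a-backend deepep --chunked-prefill-size 65536"
    MEM_FRACTION_STATIC=0.87
  else
    MOE_ARGS="--moe-runner-backend flashinfer_mxfp4 --chunked-prefill-size 16384"
    MEM_FRACTION_STATIC=0.94
  fi
else
  # Single-instance path: cookbook defaults.
  MOE_ARGS="--moe-runner-backend flashinfer_mxfp4 --chunked-prefill-size 8192"
fi

echo "$MEM_FRACTION_STATIC"
```

With `CONC=2048` this prints `0.87`, matching the reviewer's third point that the old single-value comment no longer holds.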

@yhyang201
Collaborator Author

/sweep test-config --config-files .github/configs/nvidia-master.yaml --config-keys dsv4-fp4-b300-sglang

@github-actions
Contributor

@yhyang201 Kicking off a sweep.

Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/24961231373
Command: test-config --config-files .github/configs/nvidia-master.yaml --config-keys dsv4-fp4-b300-sglang
Pinned ref: 6a02d2d
Approval: not required (trusted collaborator).

@yhyang201
Collaborator Author

/sweep test-config --config-files .github/configs/nvidia-master.yaml --config-keys dsv4-fp4-b300-sglang

@github-actions
Contributor

@yhyang201 Kicking off a sweep.

Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/24962186268
Command: test-config --config-files .github/configs/nvidia-master.yaml --config-keys dsv4-fp4-b300-sglang
Pinned ref: 0ea8e62
Approval: not required (trusted collaborator).

@cquil11
Collaborator

cquil11 commented Apr 26, 2026

@yhyang201 Hi, please hold off on sweeps until we get some CI unblocked.

@cquil11
Collaborator

cquil11 commented Apr 26, 2026

@Qiaolin-Yu self-assigned this Apr 26, 2026
@yhyang201
Collaborator Author

/sweep test-config --config-files .github/configs/nvidia-master.yaml --config-keys dsv4-fp4-b300-sglang

@github-actions
Contributor

@yhyang201 Kicking off a sweep.

Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/24978717689
Command: test-config --config-files .github/configs/nvidia-master.yaml --config-keys dsv4-fp4-b300-sglang
Pinned ref: e8685d9
Approval: not required (trusted collaborator).

@yhyang201
Collaborator Author

/sweep test-config --config-files .github/configs/nvidia-master.yaml --config-keys dsv4-fp4-b300-sglang

@github-actions
Contributor

@yhyang201 Kicking off a sweep.

Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/24991420778
Command: test-config --config-files .github/configs/nvidia-master.yaml --config-keys dsv4-fp4-b300-sglang
Pinned ref: 4575ce6
Approval: not required (trusted collaborator).

@yhyang201
Collaborator Author

/sweep test-config --config-files .github/configs/nvidia-master.yaml --config-keys dsv4-fp4-b300-sglang

@github-actions
Contributor

@yhyang201 Kicking off a sweep.

Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/24993602429
Command: test-config --config-files .github/configs/nvidia-master.yaml --config-keys dsv4-fp4-b300-sglang
Pinned ref: 0fb4d3c
Approval: not required (trusted collaborator).

@yhyang201
Collaborator Author

/sweep test-config --config-files .github/configs/nvidia-master.yaml --config-keys dsv4-fp4-b300-sglang

@github-actions
Contributor

@yhyang201 Kicking off a sweep.

Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/24994940494
Command: test-config --config-files .github/configs/nvidia-master.yaml --config-keys dsv4-fp4-b300-sglang
Pinned ref: c0f9334
Approval: not required (trusted collaborator).

@yhyang201
Collaborator Author

/sweep test-config --config-files .github/configs/nvidia-master.yaml --config-keys dsv4-fp4-b300-sglang

@github-actions
Contributor

@yhyang201 Kicking off a sweep.

Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/24997173342
Command: test-config --config-files .github/configs/nvidia-master.yaml --config-keys dsv4-fp4-b300-sglang
Pinned ref: 5352757
Approval: not required (trusted collaborator).

@yhyang201
Collaborator Author

/sweep test-config --config-files .github/configs/nvidia-master.yaml --config-keys dsv4-fp4-b300-sglang

@github-actions
Contributor

@yhyang201 Kicking off a sweep.

Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/24997928458
Command: test-config --config-files .github/configs/nvidia-master.yaml --config-keys dsv4-fp4-b300-sglang
Pinned ref: a6e7ea0
Approval: not required (trusted collaborator).

@yhyang201
Collaborator Author

/sweep test-config --config-files .github/configs/nvidia-master.yaml --config-keys dsv4-fp4-b300-sglang

@github-actions
Contributor

@yhyang201 Kicking off a sweep.

Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/24998946908
Command: test-config --config-files .github/configs/nvidia-master.yaml --config-keys dsv4-fp4-b300-sglang
Pinned ref: 758012f
Approval: not required (trusted collaborator).

@yhyang201
Collaborator Author

/sweep test-config --config-files .github/configs/nvidia-master.yaml --config-keys dsv4-fp4-b300-sglang

@github-actions
Contributor

@yhyang201 Kicking off a sweep.

Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/24999947919
Command: test-config --config-files .github/configs/nvidia-master.yaml --config-keys dsv4-fp4-b300-sglang
Pinned ref: 8e2d2ff
Approval: not required (trusted collaborator).

yhyang201 and others added 15 commits on April 28, 2026 at 13:37
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… configs

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…nc=2048

- YAML: conc=2048 and conc=4096 (both 1k1k and 8k1k) had tp=4, should be tp=8
- Script: conc=2048 was missing explicit SWA_FULL_TOKENS_RATIO=0.1, causing
  1k1k to incorrectly use 0.5 from the ISL-based default

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
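The SWA_FULL_TOKENS_RATIO fix described in this commit can be sketched as follows. The ISL-based default logic and its threshold are assumptions for illustration; only the values 0.5, 0.1, and the conc=2048 override come from the commit message.

```shell
# Example shape: 1k1k at conc=2048.
ISL=1024
CONC=2048

# Hypothetical ISL-based default: short inputs fall back to 0.5,
# which is what conc=2048 1k1k was incorrectly inheriting.
if [ "$ISL" -le 2048 ]; then
  SWA_FULL_TOKENS_RATIO=0.5
else
  SWA_FULL_TOKENS_RATIO=0.1
fi

# Fix: conc=2048 sets the ratio explicitly instead of relying on the default.
if [ "$CONC" = "2048" ]; then
  SWA_FULL_TOKENS_RATIO=0.1
fi

echo "$SWA_FULL_TOKENS_RATIO"
```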
Disable NVSHMEM IB transport in the two code paths that explicitly use
--moe-a2a-backend deepep (EP_SIZE=8 and CONC=2048/4096).
Pin dsv4-fp4-b300-sglang to lmsysorg/sglang:deepseek-v4-b300@sha256:2fec8d7958bb0d53b50d7bf04d6ae6a7de8a35503775826e0550a45dd8c3ee15.
Both high-conc (CONC=2048/4096) and medium-conc recipes use ep=8 in
the YAML, so EP_SIZE is always "8" for both. The previous if/elif
order meant EP_SIZE=8 matched first, shadowing the CONC=2048/4096
branch entirely. Swap the order so the more specific high-conc check
runs first.
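The branch-ordering fix described above can be sketched as follows. The `RECIPE` variable and its values are illustrative labels, not the script's real identifiers; the conditions follow the commit message.

```shell
EP_SIZE=8    # both recipe families use ep=8 in the YAML
CONC=2048

# The more specific high-concurrency check must run first; testing
# EP_SIZE=8 first matched every recipe and shadowed this branch.
if [ "$CONC" = "2048" ] || [ "$CONC" = "4096" ]; then
  RECIPE="high-conc mega_moe deepep"
elif [ "$EP_SIZE" = "8" ]; then
  RECIPE="medium-conc deepep"
else
  RECIPE="cookbook default"
fi

echo "$RECIPE"
```

With the checks in the old order (`EP_SIZE` first), every configuration would have selected the medium-conc branch, since `EP_SIZE` is always "8".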
- max-running-requests: 4608 → 4352
- swa-full-tokens-ratio: 0.06 → 0.075
- MEGA_MOE_NUM_MAX_TOKENS_PER_RANK: 544 → 8320
- add --decode-log-interval 5
- move SGLANG_LOG_FORWARD_ITERS to conc-2048 only
@yhyang201 force-pushed the dsv4-b300-sglang-conc2048-mega-moe branch from f596249 to 4ef3386 on April 28, 2026 at 05:39
- 1k1k: keep identical to main (tp:8/ep:1/conc:1, tp:4/ep:1/conc:32, tp:4/ep:4/conc:512)
- 8k1k: replace conc:512 with conc:2048 and conc:4096 (tp:8/ep:8 mega_moe deepep)
- Remove all tp:4/ep:8 entries (ep>tp is misleading)
- Remove temporary disable comments
@yhyang201 force-pushed the dsv4-b300-sglang-conc2048-mega-moe branch from b4f2b50 to 862d82e on April 28, 2026 at 06:48
@yhyang201 force-pushed the dsv4-b300-sglang-conc2048-mega-moe branch from fd64708 to 862d82e on April 28, 2026 at 06:54
Collaborator

@Oseltamivir left a comment


lgtm

Comment thread: perf-changelog.yaml
Comment thread: perf-changelog.yaml (outdated)
@Oseltamivir merged commit e3a8521 into main on Apr 28, 2026
15 of 26 checks passed
@Oseltamivir deleted the dsv4-b300-sglang-conc2048-mega-moe branch on April 28, 2026 at 07:10

4 participants