
dsv4-b300-sglang: update points#1179

Merged
Oseltamivir merged 19 commits into main from dsv4-b300-sglang-conc2048-mega-moe
Apr 28, 2026

Conversation

@yhyang201
Collaborator

@yhyang201 commented Apr 26, 2026

Summary

Test plan

🤖 Generated with Claude Code

@github-actions
Contributor

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR there first before we can merge your PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you.

PR authors are responsible for ensuring that, after merging, all GitHub Actions jobs fully pass. A lot of the time, failures are just flakes, and simply re-running the failed jobs will fix them. If re-running failed jobs is attempted, PR authors are responsible for ensuring they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

Comment on lines +80 to +94
export SGLANG_OPT_USE_DEEPGEMM_MEGA_MOE=1
export SGLANG_OPT_FIX_HASH_MEGA_MOE=1
export SGLANG_OPT_DEEPGEMM_MEGA_MOE_NUM_MAX_TOKENS_PER_RANK=288
PARALLEL_ARGS=(
--dp-size "$TP"
--enable-dp-attention
--moe-a2a-backend deepep
--cuda-graph-max-bs 288
--deepep-config "$DEEPEP_CONFIG"
--chunked-prefill-size 65536
--tokenizer-worker-num 4
--enable-prefill-delayer
)
MAX_RUNNING_REQUESTS=2560
MEM_FRACTION_STATIC=0.87
Contributor


🟡 Two pre-existing comments immediately above the DP_ATTENTION block became inaccurate after this PR added the CONC=2048 branch. The block comment at lines 63-66 still describes the recipe as "flashinfer_mxfp4 runner + halved prefill chunks + prefill-delayer", but the new CONC=2048 path uses --moe-a2a-backend deepep and --chunked-prefill-size 65536 (4x the non-DP value of 8192, not halved). Line 69 says the DP-attn branch "overrides to 0.94", but it now overrides to either 0.94 or 0.87 depending on CONC — worth refreshing the comments alongside this change so future maintainers don't trust stale assumptions.

Extended reasoning...

What the stale comments say

Lines 63-66 contain the rationale comment for the DP_ATTENTION dispatch block:

Pick the parallelism + MoE backend based on DP_ATTENTION (mirrors the vllm script's pattern). DP-attention runs the empirically-tuned high-concurrency recipe (flashinfer_mxfp4 runner + halved prefill chunks + prefill-delayer); single-instance uses flashinfer_mxfp4 with the cookbook defaults.

Line 69 contains:

# Default; the DP-attn branch below overrides to 0.94.

Both were accurate before this PR — the DP-attn branch was a single recipe that always used flashinfer_mxfp4, set --chunked-prefill-size 16384 (half the previous 32768 cookbook value, hence "halved"), and always set MEM_FRACTION_STATIC=0.94.

Why this PR makes them inaccurate

The new if [ "$CONC" = "2048" ]; then ... else ... split inside the DP-attn branch breaks both invariants:

  1. The CONC=2048 path uses --moe-a2a-backend deepep (not flashinfer_mxfp4), SGLANG_OPT_USE_DEEPGEMM_MEGA_MOE=1 (the mega_moe deepep recipe — described in the PR description and changelog as a different recipe family entirely), and --chunked-prefill-size 65536. The block comment now describes only half of the DP-attn cases.

  2. The wording "halved prefill chunks" is now actively misleading: 65536 is 8x the non-DP path's --chunked-prefill-size 8192, i.e. multiplied, not halved. A reader looking at line 65 next to lines 78-94 will see a direct contradiction.

  3. MEM_FRACTION_STATIC is now overridden to 0.94 (CONC<2048) or 0.87 (CONC=2048), so line 69's single-value claim is no longer correct.

Step-by-step proof

  • Before this PR: DP_ATTENTION=true → always --moe-runner-backend flashinfer_mxfp4, --chunked-prefill-size 16384, MEM_FRACTION_STATIC=0.94. Comments are correct.
  • After this PR with DP_ATTENTION=true CONC=2048: --moe-a2a-backend deepep (not flashinfer_mxfp4) ✗, --chunked-prefill-size 65536 (not halved relative to non-DP 8192 — it's 8x) ✗, MEM_FRACTION_STATIC=0.87 (not 0.94) ✗. All three claims fail.
  • After this PR with DP_ATTENTION=true CONC=1024: comments still happen to be correct, but a maintainer reading them as describing "the DP-attn recipe" will be wrong about the other branch.

Severity / impact

This is a documentation accuracy issue, not a behavioral bug — runtime behavior is unaffected. But the file's comments are explicitly there to give future maintainers the empirical rationale ("empirically-tuned", "cookbook defaults"), and silently letting them drift turns future debugging into a trap. Easiest fix is to update lines 63-66 to mention both recipes (flashinfer_mxfp4 + halved chunks for CONC<2048; mega_moe deepep + larger chunks for CONC=2048) and reword line 69 to say the DP-attn branch overrides to 0.94 or 0.87 depending on CONC.
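The comment refresh the reviewer suggests can be sketched as follows. This is an illustrative reconstruction, not the actual script: the variable names `DP_ATTENTION`, `CONC`, and `MEM_FRACTION_STATIC` and the flag values follow the snippet and review above, while `MOE_ARGS` and the `0.85` default are placeholders.

```shell
# Example inputs: the new high-concurrency DP-attention recipe.
DP_ATTENTION=true
CONC=2048

# Default (placeholder value); the DP-attn branch below overrides
# to 0.94 or 0.87 depending on CONC.
MEM_FRACTION_STATIC=0.85

# DP-attention now carries two empirically-tuned recipes:
#   CONC < 2048: flashinfer_mxfp4 runner + halved prefill chunks (16384) + prefill-delayer
#   CONC = 2048: mega_moe deepep recipe + larger prefill chunks (65536)
if [ "$DP_ATTENTION" = "true" ]; then
  if [ "$CONC" = "2048" ]; then
    MOE_ARGS="--moe-a2a-backend deepep --chunked-prefill-size 65536"
    MEM_FRACTION_STATIC=0.87
  else
    MOE_ARGS="--moe-runner-backend flashinfer_mxfp4 --chunked-prefill-size 16384"
    MEM_FRACTION_STATIC=0.94
  fi
else
  # Single-instance path: cookbook defaults.
  MOE_ARGS="--moe-runner-backend flashinfer_mxfp4 --chunked-prefill-size 8192"
fi

echo "$MEM_FRACTION_STATIC"
```

With `CONC=2048` this prints `0.87`, matching the reviewer's third point that the old single-value comment no longer holds.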

@yhyang201
Collaborator Author

/sweep test-config --config-files .github/configs/nvidia-master.yaml --config-keys dsv4-fp4-b300-sglang

@github-actions
Contributor

@yhyang201 Kicking off a sweep.

Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/24961231373
Command: test-config --config-files .github/configs/nvidia-master.yaml --config-keys dsv4-fp4-b300-sglang
Pinned ref: 6a02d2d
Approval: not required (trusted collaborator).

@yhyang201
Collaborator Author

/sweep test-config --config-files .github/configs/nvidia-master.yaml --config-keys dsv4-fp4-b300-sglang

@github-actions
Contributor

@yhyang201 Kicking off a sweep.

Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/24962186268
Command: test-config --config-files .github/configs/nvidia-master.yaml --config-keys dsv4-fp4-b300-sglang
Pinned ref: 0ea8e62
Approval: not required (trusted collaborator).

@cquil11
Collaborator

cquil11 commented Apr 26, 2026

@yhyang201 Hi, please hold off on sweeps until we get some CI unblocked.

@cquil11
Collaborator

cquil11 commented Apr 26, 2026

@Qiaolin-Yu self-assigned this Apr 26, 2026
@yhyang201
Collaborator Author

/sweep test-config --config-files .github/configs/nvidia-master.yaml --config-keys dsv4-fp4-b300-sglang

@github-actions
Contributor

@yhyang201 Kicking off a sweep.

Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/24978717689
Command: test-config --config-files .github/configs/nvidia-master.yaml --config-keys dsv4-fp4-b300-sglang
Pinned ref: e8685d9
Approval: not required (trusted collaborator).

@yhyang201
Collaborator Author

/sweep test-config --config-files .github/configs/nvidia-master.yaml --config-keys dsv4-fp4-b300-sglang

@github-actions
Contributor

@yhyang201 Kicking off a sweep.

Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/24991420778
Command: test-config --config-files .github/configs/nvidia-master.yaml --config-keys dsv4-fp4-b300-sglang
Pinned ref: 4575ce6
Approval: not required (trusted collaborator).

@yhyang201
Collaborator Author

/sweep test-config --config-files .github/configs/nvidia-master.yaml --config-keys dsv4-fp4-b300-sglang

@github-actions
Contributor

@yhyang201 Kicking off a sweep.

Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/24993602429
Command: test-config --config-files .github/configs/nvidia-master.yaml --config-keys dsv4-fp4-b300-sglang
Pinned ref: 0fb4d3c
Approval: not required (trusted collaborator).

@yhyang201
Collaborator Author

/sweep test-config --config-files .github/configs/nvidia-master.yaml --config-keys dsv4-fp4-b300-sglang

@github-actions
Contributor

@yhyang201 Kicking off a sweep.

Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/24994940494
Command: test-config --config-files .github/configs/nvidia-master.yaml --config-keys dsv4-fp4-b300-sglang
Pinned ref: c0f9334
Approval: not required (trusted collaborator).

@yhyang201
Collaborator Author

/sweep test-config --config-files .github/configs/nvidia-master.yaml --config-keys dsv4-fp4-b300-sglang

@github-actions
Contributor

@yhyang201 Kicking off a sweep.

Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/24997173342
Command: test-config --config-files .github/configs/nvidia-master.yaml --config-keys dsv4-fp4-b300-sglang
Pinned ref: 5352757
Approval: not required (trusted collaborator).

@yhyang201
Collaborator Author

/sweep test-config --config-files .github/configs/nvidia-master.yaml --config-keys dsv4-fp4-b300-sglang

@github-actions
Contributor

@yhyang201 Kicking off a sweep.

Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/24997928458
Command: test-config --config-files .github/configs/nvidia-master.yaml --config-keys dsv4-fp4-b300-sglang
Pinned ref: a6e7ea0
Approval: not required (trusted collaborator).

@yhyang201
Collaborator Author

/sweep test-config --config-files .github/configs/nvidia-master.yaml --config-keys dsv4-fp4-b300-sglang

@github-actions
Contributor

@yhyang201 Kicking off a sweep.

Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/24998946908
Command: test-config --config-files .github/configs/nvidia-master.yaml --config-keys dsv4-fp4-b300-sglang
Pinned ref: 758012f
Approval: not required (trusted collaborator).

@yhyang201
Collaborator Author

/sweep test-config --config-files .github/configs/nvidia-master.yaml --config-keys dsv4-fp4-b300-sglang

@github-actions
Contributor

@yhyang201 Kicking off a sweep.

Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/24999947919
Command: test-config --config-files .github/configs/nvidia-master.yaml --config-keys dsv4-fp4-b300-sglang
Pinned ref: 8e2d2ff
Approval: not required (trusted collaborator).

yhyang201 and others added 15 commits on April 28, 2026 at 13:37
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… configs

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…nc=2048

- YAML: conc=2048 and conc=4096 (both 1k1k and 8k1k) had tp=4, should be tp=8
- Script: conc=2048 was missing explicit SWA_FULL_TOKENS_RATIO=0.1, causing
  1k1k to incorrectly use 0.5 from the ISL-based default

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
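The SWA_FULL_TOKENS_RATIO fix described in this commit can be sketched as follows. The ISL-based default logic and its threshold are assumptions for illustration; only the values 0.5, 0.1, and the conc=2048 override come from the commit message.

```shell
# Example shape: 1k1k at conc=2048.
ISL=1024
CONC=2048

# Hypothetical ISL-based default: short inputs fall back to 0.5,
# which is what conc=2048 1k1k was incorrectly inheriting.
if [ "$ISL" -le 2048 ]; then
  SWA_FULL_TOKENS_RATIO=0.5
else
  SWA_FULL_TOKENS_RATIO=0.1
fi

# Fix: conc=2048 sets the ratio explicitly instead of relying on the default.
if [ "$CONC" = "2048" ]; then
  SWA_FULL_TOKENS_RATIO=0.1
fi

echo "$SWA_FULL_TOKENS_RATIO"
```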
Disable NVSHMEM IB transport in the two code paths that explicitly use
--moe-a2a-backend deepep (EP_SIZE=8 and CONC=2048/4096).
Pin dsv4-fp4-b300-sglang to lmsysorg/sglang:deepseek-v4-b300@sha256:2fec8d7958bb0d53b50d7bf04d6ae6a7de8a35503775826e0550a45dd8c3ee15.
Both high-conc (CONC=2048/4096) and medium-conc recipes use ep=8 in
the YAML, so EP_SIZE is always "8" for both. The previous if/elif
order meant EP_SIZE=8 matched first, shadowing the CONC=2048/4096
branch entirely. Swap the order so the more specific high-conc check
runs first.
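The branch-ordering fix described above can be sketched as follows. The `RECIPE` variable and its values are illustrative labels, not the script's real identifiers; the conditions follow the commit message.

```shell
EP_SIZE=8    # both recipe families use ep=8 in the YAML
CONC=2048

# The more specific high-concurrency check must run first; testing
# EP_SIZE=8 first matched every recipe and shadowed this branch.
if [ "$CONC" = "2048" ] || [ "$CONC" = "4096" ]; then
  RECIPE="high-conc mega_moe deepep"
elif [ "$EP_SIZE" = "8" ]; then
  RECIPE="medium-conc deepep"
else
  RECIPE="cookbook default"
fi

echo "$RECIPE"
```

With the checks in the old order (`EP_SIZE` first), every configuration would have selected the medium-conc branch, since `EP_SIZE` is always "8".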
- max-running-requests: 4608 → 4352
- swa-full-tokens-ratio: 0.06 → 0.075
- MEGA_MOE_NUM_MAX_TOKENS_PER_RANK: 544 → 8320
- add --decode-log-interval 5
- move SGLANG_LOG_FORWARD_ITERS to conc-2048 only
@yhyang201 force-pushed the dsv4-b300-sglang-conc2048-mega-moe branch from f596249 to 4ef3386 on April 28, 2026 at 05:39
- 1k1k: keep identical to main (tp:8/ep:1/conc:1, tp:4/ep:1/conc:32, tp:4/ep:4/conc:512)
- 8k1k: replace conc:512 with conc:2048 and conc:4096 (tp:8/ep:8 mega_moe deepep)
- Remove all tp:4/ep:8 entries (ep>tp is misleading)
- Remove temporary disable comments
@yhyang201 force-pushed the dsv4-b300-sglang-conc2048-mega-moe branch from b4f2b50 to 862d82e on April 28, 2026 at 06:48
@yhyang201 force-pushed the dsv4-b300-sglang-conc2048-mega-moe branch from fd64708 to 862d82e on April 28, 2026 at 06:54
Collaborator

@Oseltamivir left a comment


lgtm

Comment thread: perf-changelog.yaml
Comment thread: perf-changelog.yaml (outdated)
@Oseltamivir merged commit e3a8521 into main on Apr 28, 2026
15 of 26 checks passed
@Oseltamivir deleted the dsv4-b300-sglang-conc2048-mega-moe branch on April 28, 2026 at 07:10

4 participants