
Add pure TP configuration to H200 vLLM DSv4 deployment #1285

Merged
cquil11 merged 3 commits into main from dsv4-fp8-h200-vllm-pure-tp
May 5, 2026

Conversation

ywang96 (Collaborator) commented May 5, 2026

No description provided.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ywang96 ywang96 requested a review from a team May 5, 2026 20:39
github-actions bot (Contributor) commented May 5, 2026

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR there first; we can then merge your PR into the master branch. Let's ensure that the documentation is first-class so that the entire ML community can benefit from your hard work. Thank you!

PR authors are responsible for ensuring that all GitHub Action jobs fully pass after merging. Often, failures are just flakes, and simply re-running the failed jobs fixes them. If failed jobs are re-run, PR authors are responsible for ensuring they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

Comment thread on benchmarks/single_node/dsv4_fp8_h200.sh (Outdated)
--no-enable-prefix-caching \
--enable-expert-parallel \
--data-parallel-size $TP \
--tensor-parallel-size $TP \

🔴 This PR makes a major recipe change (EP+DP=8 → pure TP=8) on dsv4-fp8-h200-vllm and dsv4-fp8-h200-vllm-mtp, but perf-changelog.yaml is not updated. Per AGENTS.md, the changelog is the trigger for benchmark re-runs, so without a new entry push-to-main will not re-benchmark these configs and inferencex.com will keep publishing the old EP+DP numbers. Please append a new entry covering both config-keys to the END of perf-changelog.yaml (matching the pattern from #1279/#1222).

Extended reasoning...

What the bug is: This PR modifies benchmarks/single_node/dsv4_fp8_h200.sh and benchmarks/single_node/dsv4_fp8_h200_mtp.sh to swap the parallelism strategy from expert-parallel + data-parallel size 8 to pure tensor-parallel size 8 — a structural recipe change that will materially change throughput, latency, and GPU memory characteristics. However, no entry is added to perf-changelog.yaml.

Why this matters / code path: AGENTS.md is explicit:

  • Line 161-162: "perf-changelog.yaml triggers which configs to benchmark ... New entries MUST be appended to the END of the file — never insert in the middle or prepend."
  • Line 174: "Changes to perf-changelog.yaml trigger benchmark runs"
  • Lines 192-195 (Adding a New Benchmark Configuration): "Add corresponding entry to perf-changelog.yaml to trigger benchmark"
  • Lines 295-300 (Updating Docker Images / parameter changes): "MUST: Add an entry to perf-changelog.yaml"

The run-sweep.yml workflow only fires off post-merge benchmark sweeps in response to changes to perf-changelog.yaml. With no entry, push-to-main will not re-run the sweep for dsv4-fp8-h200-vllm or dsv4-fp8-h200-vllm-mtp, so the published dashboard will still reflect numbers measured under the old EP+DP recipe even after this code lands.
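The trigger described above can be sketched with a standard GitHub Actions `paths` filter (a hypothetical reconstruction based on this description; the actual contents of run-sweep.yml may differ):

```yaml
# Hypothetical sketch of the trigger in .github/workflows/run-sweep.yml;
# the branch list and everything else in the real file are assumptions.
on:
  push:
    branches: [main]
    paths:
      - 'perf-changelog.yaml'   # only changelog edits kick off a sweep
```

With a filter like this, a merge that touches only the two `.sh` files never enters the sweep's event stream at all, which is why the changelog entry is the required trigger.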

Strong precedent: The directly preceding PR #1279 added a changelog entry for an arguably much smaller change — bumping num_speculative_tokens from 1 → 2 on this same config. The tail of perf-changelog.yaml shows that entry (and #1222's MTP-add entry just before it). A switch from EP+DP=8 to pure TP=8 is a materially larger recipe shift than a single token-budget bump, so if #1279 warranted an entry, this PR definitely does.

Step-by-step proof:

  1. git diff main...HEAD --stat for this PR shows only the two .sh files changed; perf-changelog.yaml is untouched.
  2. Push-to-main triggers run-sweep.yml, which (per its paths: filter) only invokes the benchmark matrix when perf-changelog.yaml is in the changed-paths list.
  3. Since this PR doesn't touch perf-changelog.yaml, the post-merge job will skip both dsv4-fp8-h200-vllm and dsv4-fp8-h200-vllm-mtp.
  4. Result: inferencex.com continues serving the old EP+DP numbers despite the merged code now running pure TP=8 — a silent staleness that is exactly what the changelog mechanism is designed to prevent.

How to fix: Append a new entry to the END of perf-changelog.yaml (per AGENTS.md, never insert in the middle), e.g.:

- config-keys:
    - dsv4-fp8-h200-vllm
    - dsv4-fp8-h200-vllm-mtp
  description:
    - "Switch H200 vLLM DSv4 recipe from EP + DP=8 to pure TP=8"
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1285

This will cause the post-merge sweep to re-benchmark both configs under the new recipe and refresh the published numbers.

Comment thread on benchmarks/single_node/dsv4_fp8_h200.sh (Outdated)
--no-enable-prefix-caching \
--enable-expert-parallel \
--data-parallel-size $TP \
--tensor-parallel-size $TP \

🔴 The PR switches both dsv4_fp8_h200.sh and dsv4_fp8_h200_mtp.sh to pure TP, but .github/configs/nvidia-master.yaml lines 2627, 2631 (dsv4-fp8-h200-vllm) and lines 2649, 2653 (dsv4-fp8-h200-vllm-mtp) still specify { tp: 8, ep: 8, dp-attn: true, ... }. Those values are exported as EP_SIZE/DP_ATTENTION by benchmark-tmpl.yml, embedded in RESULT_FILENAME as ep8-dpaTrue, and recorded by utils/process_result.py as the ep and dp_attention metadata fields — so pure-TP runs will be silently mislabeled and aggregated under the old EP+DP-attn label. Update both search-space entries to ep: 1, dp-attn: false (or drop those keys) in this PR.

Extended reasoning...

Bug

After this PR, the script no longer passes --enable-expert-parallel or --data-parallel-size $TP; it just passes --tensor-parallel-size $TP. So actual deployment is pure TP=8, EP effectively 1, DP-attention disabled.

However, .github/configs/nvidia-master.yaml was not updated. Lines 2614–2653 still configure both DSv4 H200 vLLM entries with EP+DP-attn search-space:

dsv4-fp8-h200-vllm:
  ...
  scenarios:
    fixed-seq-len:
    - isl: 1024
      osl: 1024
      search-space:
      - { tp: 8, ep: 8, dp-attn: true, conc-start: 4, conc-end: 64 }
    - isl: 8192
      osl: 1024
      search-space:
      - { tp: 8, ep: 8, dp-attn: true, conc-start: 4, conc-end: 64 }

dsv4-fp8-h200-vllm-mtp:
  ...
      search-space:
      - { tp: 8, ep: 8, dp-attn: true, conc-start: 4, conc-end: 64, spec-decoding: mtp }
      ...
      - { tp: 8, ep: 8, dp-attn: true, conc-start: 4, conc-end: 64, spec-decoding: mtp }

How it manifests

.github/workflows/benchmark-tmpl.yml lines 103–105 export the search-space fields as env vars:

TP: ${{ inputs.tp }}
EP_SIZE: ${{ inputs.ep }}
DP_ATTENTION: ${{ inputs.dp-attn }}

Line 180 builds the result filename by interpolating those env vars verbatim:

RESULT_FILENAME: ${{ env.EXP_NAME }}_${{ env.PRECISION }}_${{ env.FRAMEWORK }}_tp${{ env.TP }}-ep${{ env.EP_SIZE }}-dpa${{ env.DP_ATTENTION }}_disagg-...
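With the search-space values quoted earlier, that interpolation plays out roughly like this (an illustrative shell sketch; the EXP_NAME, PRECISION, and FRAMEWORK values are made-up placeholders, and the exact case of the boolean depends on how the workflow renders it):

```shell
# Hypothetical illustration of the filename interpolation above;
# EXP_NAME/PRECISION/FRAMEWORK are placeholder values, not the real ones.
EXP_NAME="dsv4-fp8-h200-vllm"
PRECISION="fp8"
FRAMEWORK="vllm"
TP=8
EP_SIZE=8
DP_ATTENTION=true

# Mirrors the template: ..._tp${TP}-ep${EP_SIZE}-dpa${DP_ATTENTION}_disagg-...
RESULT_FILENAME="${EXP_NAME}_${PRECISION}_${FRAMEWORK}_tp${TP}-ep${EP_SIZE}-dpa${DP_ATTENTION}_disagg-..."
echo "$RESULT_FILENAME"
```

The point is that the `ep8` and `dpa...` components come straight from the search-space inputs, regardless of what flags the launch script actually passed.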

utils/process_result.py lines 110–119 then reads those env vars as required and writes them into the result-metadata JSON:

single_node_env = get_required_env_vars(['TP', 'EP_SIZE', 'DP_ATTENTION'])
tp_size      = int(single_node_env['TP'])
ep_size      = int(single_node_env['EP_SIZE'])
dp_attention = single_node_env['DP_ATTENTION']

single_node_data = {
    'is_multinode': False,
    'tp': tp_size,
    'ep': ep_size,
    'dp_attention': dp_attention,
    ...
}

Step-by-step proof

  1. CI workflow consumes nvidia-master.yaml and dispatches a dsv4-fp8-h200-vllm job with inputs tp=8, ep=8, dp-attn=true.
  2. benchmark-tmpl.yml exports TP=8, EP_SIZE=8, DP_ATTENTION=true into the job environment.
  3. The launcher invokes benchmarks/single_node/dsv4_fp8_h200.sh, which after this PR runs vllm serve ... --tensor-parallel-size 8 only — no EP, no DP-attention. The EP_SIZE and DP_ATTENTION env vars are now ignored by the script.
  4. RESULT_FILENAME is computed as ..._tp8-ep8-dpaTrue_disagg-....
  5. utils/process_result.py writes { tp: 8, ep: 8, dp_attention: "true", ... } into the result JSON despite the actual run being pure TP.
  6. Aggregated/published TP-only numbers are filed under an EP+DP-attn label, contaminating downstream comparison.

Why existing code doesn't prevent it

process_result.py has no way to discover what flags vllm serve was actually launched with — it just trusts the env vars from the search-space. The script itself silently ignores EP_SIZE and DP_ATTENTION now (since it no longer references them), so there is no error or warning. The previous commit message "switch from DP+EP to pure TP" matches the diff but missed the YAML side of the configuration.

Fix

Update both search-space entries in .github/configs/nvidia-master.yaml to match the new deployment:

dsv4-fp8-h200-vllm:
  ...
      search-space:
      - { tp: 8, ep: 1, dp-attn: false, conc-start: 4, conc-end: 64 }
    ...
      - { tp: 8, ep: 1, dp-attn: false, conc-start: 4, conc-end: 64 }

dsv4-fp8-h200-vllm-mtp:
  ...
      search-space:
      - { tp: 8, ep: 1, dp-attn: false, conc-start: 4, conc-end: 64, spec-decoding: mtp }
      ...
      - { tp: 8, ep: 1, dp-attn: false, conc-start: 4, conc-end: 64, spec-decoding: mtp }

This should be done in the same PR — otherwise the first run after merge produces mislabeled ep8-dpaTrue artifacts that are hard to correct retroactively.

Re-introduces the DP-attention + expert-parallel path on top of pure TP
by reading DP_ATTENTION/EP_SIZE env vars (matching dsv4_fp4_b200_vllm.sh).
nvidia-master.yaml now sweeps both {tp:8,ep:1,dp-attn:false} and
{tp:8,ep:8,dp-attn:true}, with conc-end raised from 64 to 256.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
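The env-var-gated launch this commit describes could look roughly like the following (a hypothetical sketch modeled on the commit message and the diff context; the real dsv4_fp8_h200.sh and dsv4_fp4_b200_vllm.sh may structure this differently):

```shell
#!/usr/bin/env bash
# Hypothetical sketch of an env-var-gated vLLM launch. The flag spellings
# match the diff context above, but the surrounding structure is assumed.
TP="${TP:-8}"
EP_SIZE="${EP_SIZE:-1}"
DP_ATTENTION="${DP_ATTENTION:-false}"

# Pure-TP baseline, always present.
ARGS=(--no-enable-prefix-caching --tensor-parallel-size "$TP")

if [ "$DP_ATTENTION" = "true" ]; then
  # DP-attention path: add the data-parallel flag from the old recipe.
  ARGS+=(--data-parallel-size "$TP")
fi

if [ "$EP_SIZE" -gt 1 ]; then
  # Expert-parallel path for the MoE layers.
  ARGS+=(--enable-expert-parallel)
fi

# With the defaults above this prints the pure-TP invocation;
# EP_SIZE=8 DP_ATTENTION=true restores the old EP+DP recipe.
echo "vllm serve ${ARGS[*]}"
```

Because both recipes now live behind env vars, the search-space entries in nvidia-master.yaml can sweep `{tp:8, ep:1, dp-attn:false}` and `{tp:8, ep:8, dp-attn:true}` against the same script, and the metadata recorded by process_result.py stays truthful for each run.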
@ywang96 changed the title from "Switch H200 vLLM DSv4 deployment to pure TP" to "Add pure TP configuration to H200 vLLM DSv4 deployment" May 5, 2026
ywang96 (Collaborator, Author) commented May 5, 2026

/sweep test-config --config-files .github/configs/nvidia-master.yaml --config-keys dsv4-fp8-h200-vllm

github-actions bot (Contributor) commented May 5, 2026

@ywang96 Kicking off a sweep.

Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/25403559794
Command: test-config --config-files .github/configs/nvidia-master.yaml --config-keys dsv4-fp8-h200-vllm
Pinned ref: 06bf7c4
Approval: not required (trusted collaborator).

cquil11 (Collaborator) left a comment

lgtm

@cquil11 cquil11 merged commit 366ffbc into main May 5, 2026
3 checks passed
@cquil11 cquil11 deleted the dsv4-fp8-h200-vllm-pure-tp branch May 5, 2026 23:28
cquil11 added a commit that referenced this pull request May 6, 2026
The original PR #1285 landed without this changelog entry; it was reverted
in #1286 and is being re-landed here together with the missing entry.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
cquil11 added a commit that referenced this pull request May 6, 2026
…ntry (#1287)

* Revert "Revert "Add pure TP configuration to H200 vLLM DSv4 deployment (#1285)""

This reverts commit 2d8547c.

* Add perf-changelog entry for dsv4-fp8-h200-vllm{,-mtp} pure-TP config

The original PR #1285 landed without this changelog entry; it was reverted
in #1286 and is being re-landed here together with the missing entry.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2 participants