[TRTLLM-12154][test] Add Qwen3-32B FP8 disagg stress test #14278
brnguyen2 wants to merge 4 commits into
|
/bot run --extra-stage "DGX_H200-8_GPUs-PyTorch-Post-Merge-1" |
181a22d to ce5f6a0
|
/bot run --extra-stage "DGX_H200-8_GPUs-PyTorch-Post-Merge-1" |
|
/bot help |
GitHub Bot Help
Provide a user-friendly way for developers to interact with a Jenkins server. See details below for each supported subcommand.
run
Launch build/test pipelines. All previously running jobs will be killed.
kill
Kill all running builds associated with pull request.
skip
Skip testing for latest commit on pull request.
reuse-pipeline
Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break. |
|
PR_Github #49195 [ run ] triggered by Bot. Commit: |
|
PR_Github #49195 [ run ] completed with state
|
c7357ef to e4166bb
|
/bot run --extra-stage "DGX_H200-8_GPUs-PyTorch-Post-Merge-1" |
|
PR_Github #49487 [ run ] triggered by Bot. Commit: |
e4166bb to 690a201
|
/bot kill |
|
/bot run --disable-fail-fast --stage-list "DGX_H200-8_GPUs-PyTorch-Post-Merge-1" |
|
PR_Github #49493 [ run ] triggered by Bot. Commit: |
|
PR_Github #49494 [ kill ] triggered by Bot. Commit: |
|
PR_Github #49493 [ run ] completed with state |
|
PR_Github #49487 [ run ] completed with state |
|
PR_Github #49494 [ kill ] completed with state |
|
/bot run --disable-fail-fast --stage-list "DGX_H200-8_GPUs-PyTorch-Post-Merge-1" |
|
PR_Github #49495 [ run ] triggered by Bot. Commit: |
|
Verified that the targeted rerun reached the newly added test and it passed:
The selected test run completed successfully with this pytest summary: |
690a201 to 2c22fef
|
PR_Github #49495 [ run ] completed with state |
|
/bot run --disable-fail-fast |
|
PR_Github #49765 [ kill ] triggered by Bot. Commit: |
|
PR_Github #49766 [ run ] triggered by Bot. Commit: |
|
PR_Github #49765 [ kill ] completed with state |
|
PR_Github #49751 [ run ] completed with state |
|
PR_Github #49766 [ run ] completed with state
|
|
The recent CI failure on this PR (
This PR only touches the Qwen3-32B FP8 disagg stress test plumbing; none of its files are related to
Failure analysis: https://pbss.s8k.io/v1/AUTH_svc_tensorrt/sw-tensorrt-ci-analysis/LLM/main/L0_MergeRequest_PR/39365/failure_analysis.html |
|
/bot run |
|
/bot kill |
|
/bot run --disable-fail-fast |
Add a Qwen3-32B FP8 disaggregated serving smoke and stress test that exercises Eagle3 with 4x TP1 context workers and 1x TP4 generation worker on 8 GPUs.
The YAML enables FP8 KV cache, chunked prefill, block and partial reuse, cache transfer, and a top-level Eagle3 speculative_config shared by context and generation workers. The draft model is stored as a model-root-relative path, and the disagg harness now resolves relative model and speculative_model values through llm_models_root while preserving absolute paths.
Wire the smoke test into the H200 L0 list and the full 10k-request stress case into the QA stress list. Add Qwen-specific output substring checks and keep the stress accuracy threshold aligned with the adjacent GPT-OSS stress case.
Signed-off-by: Brian Nguyen <brnguyen@nvidia.com>
Signed-off-by: Brian Nguyen <brnguyen@nvidia.com>
Signed-off-by: Brian Nguyen <brnguyen@nvidia.com>
…t_bf16_mtp[mtp_on]
The test crashes during autotuner warmup with 'NoneType' object has no attribute 'gather_ids' at modeling_speculative.py:1748 when MTP eagle one-model is combined with Qwen3.5-35B-A3B. Pre-existing regression on main introduced by the Qwen3.5 VL MoE landing (96a4a09); unrelated to this PR's changes. Tracked in https://nvbugs/6206179.
Signed-off-by: Brian Nguyen <brnguyen@nvidia.com>
|
PR_Github #49929 [ run ] triggered by Bot. Commit: |
473ba2c to 02a618a
|
/bot run --disable-fail-fast |
|
PR_Github #49930 [ kill ] triggered by Bot. Commit: |
|
PR_Github #49929 [ run ] completed with state |
|
PR_Github #49930 [ kill ] completed with state |
|
/bot kill |
|
PR_Github #49931 [ run ] triggered by Bot. Commit: |
|
PR_Github #49933 [ run ] triggered by Bot. Commit: |
|
PR_Github #49934 [ kill ] triggered by Bot. Commit: |
|
PR_Github #49933 [ run ] completed with state |
|
PR_Github #49931 [ run ] completed with state |
|
PR_Github #49934 [ kill ] completed with state |
|
/bot run --disable-fail-fast |
|
PR_Github #49935 [ run ] triggered by Bot. Commit: |
|
PR_Github #49935 [ run ] completed with state
|
Summary
Adds Qwen3-32B FP8 disaggregated serving coverage for the Eagle3 path on 8 GPUs.
The new disagg config uses:
- Qwen3/Qwen3-32B-FP8
- draft model Zhi-Create-Qwen3-32B-Eagle3, resolved relative to llm_models_root()
- a top-level speculative_config so context and generation workers agree on the cache-state handshake
The disagg harness now resolves relative model and speculative_config.speculative_model values through llm_models_root() while preserving absolute paths. This keeps the YAML portable across CI and local model roots.
Also makes the GSM8K lm-eval parser tolerate padded table cells, matching the output shape emitted by the local-completions run.
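As a rough illustration of the path resolution described above, the sketch below joins relative `model` and `speculative_config.speculative_model` values onto a models root and leaves absolute paths alone. It is not the harness code from this PR: `resolve_model_path`, `resolve_config_paths`, and the `/models` root are hypothetical names chosen for the example, and the root is passed in explicitly instead of calling the real `llm_models_root()`.

```python
import posixpath


def resolve_model_path(value: str, models_root: str) -> str:
    """Return absolute paths unchanged; join relative ones onto models_root."""
    if posixpath.isabs(value):
        return value
    return posixpath.join(models_root, value)


def resolve_config_paths(config: dict, models_root: str) -> dict:
    """Return a copy of config with model paths resolved (input is not mutated)."""
    resolved = dict(config)
    if isinstance(resolved.get("model"), str):
        resolved["model"] = resolve_model_path(resolved["model"], models_root)
    spec = resolved.get("speculative_config")
    if isinstance(spec, dict) and isinstance(spec.get("speculative_model"), str):
        # Rebuild the nested dict so the caller's speculative_config is untouched.
        resolved["speculative_config"] = {
            **spec,
            "speculative_model": resolve_model_path(
                spec["speculative_model"], models_root
            ),
        }
    return resolved


# Hypothetical config fragment; only the two path-like keys matter here.
config = {
    "model": "Qwen3/Qwen3-32B-FP8",
    "speculative_config": {"speculative_model": "Zhi-Create-Qwen3-32B-Eagle3"},
}
resolved = resolve_config_paths(config, models_root="/models")
print(resolved["model"])
print(resolved["speculative_config"]["speculative_model"])
```

Returning a copy instead of editing the loaded YAML in place keeps the original relative values available, which can help when the same config is reused against a different root.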
Test entries:
- test_disaggregated_qwen3_32b_fp8 smoke test in l0_dgx_h200.yml
- test_disaggregated_stress_test::qwen3_32b_fp8_stress in qa/llm_function_stress.txt, request_count=10000, accuracy_threshold=0.42
Refs: TRTLLM-12154
Test plan
tests/integration/defs/disaggregated/test_disaggregated.py::test_disaggregated_qwen3_32b_fp8 after rebuild
Summary by CodeRabbit
Release Notes