Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>
QiJune added a commit that referenced this pull request on May 28, 2026
…n DGX_B200

Source-verified each test's actual GPU requirement, then rebalanced placement so each test runs on a stage whose reserved GPU count matches what it uses.

l0_dgx_b200.yml:
- Add new 2-GPU pre-merge pytorch/mpi condition for 9 tests previously on a 4-GPU stage but using 2 GPUs (test_autotuner_distributed_strategy, two TestQwen3_5_35B_A3B::test_bf16[tp2-*], test_disaggregated_deepseek_v3_lite_fp8_nixl, three TestKVCacheV2DSv3Lite::test_mtp_*, two TestFlux* 2-GPU pipeline tests).
- Move TestDeepSeekV32::test_nvfp4_attn_multi_gpus from the 8-GPU post-merge stage to the 4-GPU post-merge stage (test uses tp=4 per @skip_less_mpi_world_size(4)).
- Remove test_configurable_moe_single_gpu -k "MEGAMOE_DEEPGEMM", 8 unconditional 1-GPU visual_gen tests, and test_ray_disaggregated_serving[tp2]; they now live in their right-sized stages (see below).

l0_b200.yml:
- Add the single-GPU MEGAMOE_DEEPGEMM row next to the existing test_configurable_moe_single_gpu CUTLASS/TRTLLM/CUTEDSL/DEEPGEMM/DENSEGEMM rows in the 1-GPU pre-merge pytorch condition.
- Add 8 visual_gen 1-GPU tests (test_visual_gen_quickstart, five LPIPS golden tests, two visual_gen_benchmark tests) to the 1-GPU post-merge pytorch condition. All use VisualGenArgs either without parallel_config or with explicit cfg_size=1 / ulysses_size=1.

l0_dgx_h100.yml:
- Add test_ray_disaggregated_serving[tp2] to the 4-GPU pytorch/ray pre-merge condition. The test is disaggregated with tp=2 in each of the context and generation servers; the in-body check skips when device_count < tp_size*2, so [tp2] actually needs 4 GPUs (not 2). Placed in DGX_H100-4_GPUs-PyTorch-Ray-1; no new stage needed.

L0_Test.groovy:
- Add DGX_B200-2_GPUs-PyTorch-1 stage to x86SlurmTestConfigs (single split, 2 GPUs, dgx-b200-flex pool).

Note: 6 conditional visual_gen tests under l0_dgx_b200.yml condition #5 (test_wan_t2v_example, four test_vbench_dimension_score_wan*, two test_vbench_dimension_score_ltx2_*) were considered but kept in place. They call _generate_wan_video / _generate_ltx2_video, which append --cfg_size 2 only when torch.cuda.device_count() >= 2. Moving them to a 1-GPU stage would silently drop the cfg_size=2 code path from CI coverage.

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
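
The GPU-count reasoning in the commit message (the tp_size*2 skip check for the disaggregated test, and the conditional --cfg_size 2 flag) can be sketched roughly as below. This is an illustration only, not code from the repository: the function names `disagg_required_gpus`, `should_skip`, and `build_cfg_args` are hypothetical, and the device count is passed in as a plain integer instead of being read from torch.cuda.device_count().

```python
from typing import List


def disagg_required_gpus(tp_size: int) -> int:
    """GPUs needed when the context and generation servers each use tp_size GPUs."""
    return tp_size * 2


def should_skip(device_count: int, tp_size: int) -> bool:
    """Assumed shape of the in-body check: skip when device_count < tp_size*2."""
    return device_count < disagg_required_gpus(tp_size)


def build_cfg_args(device_count: int) -> List[str]:
    """Append --cfg_size 2 only when at least 2 GPUs are visible (hypothetical helper)."""
    args: List[str] = []
    if device_count >= 2:
        args += ["--cfg_size", "2"]
    return args


if __name__ == "__main__":
    # [tp2] on a 2-GPU stage would be skipped; on a 4-GPU stage it runs.
    print(should_skip(2, 2), should_skip(4, 2))
    # On a 1-GPU stage the cfg_size=2 path is never exercised.
    print(build_cfg_args(1), build_cfg_args(2))
```

Under these assumptions, a [tp2] disaggregated test placed on a 2-GPU stage would always skip, and a conditional visual_gen test placed on a 1-GPU stage would still pass but without ever exercising the cfg_size=2 path, which is the coverage loss the note describes.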
No description provided.