Clean llm by QiJune · Pull Request #5 · QiJune/TensorRT-LLM

QiJune · 2025-06-25T06:36:37Z

No description provided.

Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>

…n DGX_B200

Source-verified each test's actual GPU requirement, then rebalanced placement so each test runs on a stage whose reserved GPU count matches what it uses.

l0_dgx_b200.yml:
- Add new 2-GPU pre-merge pytorch/mpi condition for 9 tests previously on a 4-GPU stage but using only 2 GPUs (test_autotuner_distributed_strategy, two TestQwen3_5_35B_A3B::test_bf16[tp2-*], test_disaggregated_deepseek_v3_lite_fp8_nixl, three TestKVCacheV2DSv3Lite::test_mtp_*, two TestFlux* 2-GPU pipeline tests).
- Move TestDeepSeekV32::test_nvfp4_attn_multi_gpus from the 8-GPU post-merge stage to the 4-GPU post-merge stage (test uses tp=4 per @skip_less_mpi_world_size(4)).
- Remove test_configurable_moe_single_gpu -k "MEGAMOE_DEEPGEMM", 8 unconditional 1-GPU visual_gen tests, and test_ray_disaggregated_serving[tp2]; they now live in their right-sized stages (see below).

l0_b200.yml:
- Add the single-GPU MEGAMOE_DEEPGEMM row next to the existing test_configurable_moe_single_gpu CUTLASS/TRTLLM/CUTEDSL/DEEPGEMM/DENSEGEMM rows in the 1-GPU pre-merge pytorch condition.
- Add 8 visual_gen 1-GPU tests (test_visual_gen_quickstart, five LPIPS golden tests, two visual_gen_benchmark tests) to the 1-GPU post-merge pytorch condition. All use VisualGenArgs without parallel_config or explicit cfg_size=1 / ulysses_size=1.

l0_dgx_h100.yml:
- Add test_ray_disaggregated_serving[tp2] to the 4-GPU pytorch/ray pre-merge condition. The test is disaggregated with tp=2 in each of the context and generation servers; the in-body check skips when device_count < tp_size*2, so [tp2] actually needs 4 GPUs (not 2). Placed in DGX_H100-4_GPUs-PyTorch-Ray-1; no new stage needed.

L0_Test.groovy:
- Add DGX_B200-2_GPUs-PyTorch-1 stage to x86SlurmTestConfigs (single split, 2 GPU, dgx-b200-flex pool).

Note: 6 conditional visual_gen tests under l0_dgx_b200.yml condition #5 (test_wan_t2v_example, four test_vbench_dimension_score_wan*, two test_vbench_dimension_score_ltx2_*) were considered but kept in place. They call _generate_wan_video / _generate_ltx2_video, which append --cfg_size 2 only when torch.cuda.device_count() >= 2. Moving them to a 1-GPU stage would silently drop the cfg_size=2 code path from CI coverage.

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
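The two GPU-count rules described in the commit message above can be sketched in plain Python. This is only an illustration under stated assumptions: the function names (`disaggregated_gpus_needed`, `should_skip`, `build_cfg_args`) are hypothetical and do not come from the repository, and the device count is passed in as an integer rather than read from `torch.cuda.device_count()` so the sketch runs without a GPU.

```python
from typing import List


def disaggregated_gpus_needed(tp_size: int) -> int:
    """GPUs needed when the context and generation servers each use tp_size GPUs.

    Hypothetical helper: mirrors the `device_count < tp_size*2` check
    described in the commit message, not the actual test code.
    """
    return tp_size * 2


def should_skip(device_count: int, tp_size: int) -> bool:
    """Return True when the machine has too few GPUs for the disaggregated test."""
    return device_count < disaggregated_gpus_needed(tp_size)


def build_cfg_args(device_count: int) -> List[str]:
    """Append `--cfg_size 2` only when at least 2 GPUs are visible.

    Hypothetical stand-in for the behaviour attributed to
    _generate_wan_video / _generate_ltx2_video in the commit message.
    """
    args: List[str] = []
    if device_count >= 2:
        args += ["--cfg_size", "2"]
    return args


if __name__ == "__main__":
    # [tp2] on a 2-GPU stage would be skipped; on a 4-GPU stage it runs.
    print(should_skip(device_count=2, tp_size=2))
    print(should_skip(device_count=4, tp_size=2))
    # On a 1-GPU stage the cfg_size=2 path is never exercised.
    print(build_cfg_args(device_count=1))
    print(build_cfg_args(device_count=2))
```

Under these assumptions, placing a `[tp2]` disaggregated test on a 2-GPU stage would make it skip rather than run, and placing the conditional visual_gen tests on a 1-GPU stage would leave the `--cfg_size 2` branch untested, which matches the placement decisions stated in the commit message.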

QiJune added 23 commits June 24, 2025 11:40

split _build_model method for TorchLlm and TrtLlm

53a7adf

Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>

Merge branch 'main' into split_build

6f202df

fix ci

8fafd59

Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>

rebase

f48dc53

Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>

clean

03506eb

Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>

fix

7d63875

Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>

Merge branch 'main' into split_build

f14abca

fix ci

b0df829

Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>

rebase

5383592

Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>

Merge branch 'main' into split_build

73bb27c

rebase

a5ed3c4

Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>

rebase

a622382

Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>

refactor _build_model method of TorchLlm

a7b2e5d

Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>

clean

a39a09e

Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>

fix

1e453d3

Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>

fix

cb753c9

Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>

fix

7f67e93

Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>

fix

593e5d9

Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>

fix

b5e300b

Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>

fix

42a131d

Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>

rebase

3031621

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>

fix

34ea621

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>

rebase

1d263f1

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>

QiJune wants to merge 23 commits into main from clean_llm
