
[TRTLLM-10695][ci] add verl stage in CI#11306

Merged
Superjomn merged 21 commits into NVIDIA:main from Superjomn:add-verl-stage
Mar 13, 2026

@Superjomn Superjomn commented Feb 5, 2026

Changes

This PR adds a CI stage that runs the verl tests related to TRT-LLM rollout.

The CI stage

This is modeled after the Triton server CI. Environment setup is consolidated in tests/integration/defs/verl/test_verl_cases.py, and each VERL test file is wrapped with a dedicated test wrapper so the stage can be plugged into TRT-LLM like a standard CI stage.

Activated test cases

| Test | Status | Duration |
|---|---|---|
| test_adapter | PASSED | 331.7s |
| test_async_server | PASSED | 252.8s |
| test_rollout_utils | PASSED | 356.6s |

Here's the full inventory of verl TRT-LLM tests at tag 4cda6af:

test_async_server.py (4 tests) — all enabled

| Test | Requirements |
|---|---|
| test_placement_group_with_sub_ray_resource_pool | mocked, no GPU |
| test_placement_group_with_ray_resource_pool | mocked, no GPU |
| test_async_generate | GPU + Qwen2.5-0.5B-Instruct |
| test_async_memory_management | GPU + Qwen2.5-0.5B-Instruct |

test_adapter.py (5 tests) — all enabled

| Test | Requirements |
|---|---|
| test_make_async_request_get_method | mocked |
| test_make_async_request_post_method | mocked |
| test_make_async_request_http_error | mocked |
| test_make_async_request_max_attempts_exceeded | mocked |
| test_init_without_device_mesh | GPU + Ray + Hydra config |

test_trtllm_rollout_utils.py (8 tests, 23 after parametrize) — partially enabled

| Test | Requirements | Status |
|---|---|---|
| test_unimodal_generate (×3 prompts) | GPU + Qwen2.5-Math-7B | excluded (-k not ...) |
| test_unimodal_batch_generate | GPU + Qwen2.5-Math-7B | excluded (-k not ...) |
| test_multimodal_generate_with_image (×3) | GPU + Qwen2.5-VL-7B-Instruct | enabled |
| test_multimodal_different_image_sizes (×3) | GPU + Qwen2.5-VL-7B-Instruct | enabled |
| test_multimodal_text_only_fallback | GPU + Qwen2.5-VL-7B-Instruct | enabled |
| test_wake_sleep_cycle | GPU + Qwen2.5-Math-7B | enabled* |

Currently excluded: test_unimodal_generate and test_unimodal_batch_generate — they require Qwen2.5-Math-7B which isn't in the CI cache.

*Note: test_wake_sleep_cycle also uses Qwen2.5-Math-7B. It passed in build #29880, so it may have a fallback, but this remains a potential issue.
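
The exact `-k` expression is elided in the table above. As a rough sketch of how such a deselection expression could be assembled from the two excluded test names, the helper below joins them with `and`. The name `build_k_filter` is hypothetical, and the resulting string is an assumption about the shape of the filter, not the expression this PR actually uses.

```python
def build_k_filter(excluded_tests):
    """Join test names into a pytest -k expression that deselects each one."""
    return " and ".join(f"not {name}" for name in excluded_tests)


# Test names taken from the inventory above.
EXCLUDED = ["test_unimodal_generate", "test_unimodal_batch_generate"]
k_expr = build_k_filter(EXCLUDED)

# The expression would be handed to pytest as: pytest -k "<k_expr>" <test file>
print(k_expr)
```

Because `-k` matches on substrings of the test name, an expression of this shape also deselects the parametrized variants of each excluded test.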

Summary by CodeRabbit

Release Notes

  • New Features

    • Added support for Verl backend integration testing with new test configurations
    • Enabled Verl-based testing on DGX B200 GPUs in post-merge pipelines
  • Tests

    • Added Verl test suite configuration with environment setup for dependency installation and build steps
    • Extended test infrastructure to recognize and process Verl-specific test paths

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

Details

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only supports [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with pull request.

skip

skip --comment COMMENT

Skip testing for latest commit on pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

@Superjomn Superjomn requested review from a team as code owners February 5, 2026 06:43
@Superjomn Superjomn requested review from mlefeb01 and ruodil February 5, 2026 06:43
@Superjomn Superjomn marked this pull request as draft February 5, 2026 06:43
@Superjomn Superjomn force-pushed the add-verl-stage branch 2 times, most recently from d1343f3 to 36ff776 Compare February 10, 2026 06:02
@Superjomn
Collaborator Author

/bot run --stage-list "DGX_B200-4_GPUs-Verl-Post-Merge-1"

@Superjomn Superjomn marked this pull request as ready for review February 10, 2026 06:16
@coderabbitai
Contributor

coderabbitai bot commented Feb 10, 2026

📝 Walkthrough


This change adds Verl backend support to the Jenkins test framework by transforming verl-prefixed test paths to actual paths, extending test configuration to recognize Verl stage names, and implementing Verl environment setup with repository cloning and configuration management. New test shards and configuration files are introduced for Verl integration testing.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Jenkins Groovy Script: jenkins/L0_Test.groovy | Added Verl backend recognition in getMakoArgsFromStageName by checking for "-Verl-" in stage names. Implemented processShardTestList to transform verl:: prefixed test paths using VERL_ROOT. Extended runLLMTestlistOnPlatformImpl with Verl environment setup including verl_config.yml parsing, repo cloning, and environment variable configuration. Added "DGX_B200-4_GPUs-Verl-Post-Merge-1" test shard entries for regular and Slurm mappings. |
| Verl Test Configuration: tests/integration/test_lists/test-db/l0_verl.yml | New test selection file defining conditions for 4-GPU B200 systems running post-merge tests with Verl backend and MPI orchestration. Includes single test entry with verl:: prefix for async server rollout testing. |
| Verl Environment Setup: tests/integration/test_lists/test-db/verl_config.yml | New Verl CI configuration specifying repository location and tag, install commands for gdrcopy, nvshmem, DeepEP with patching, and Python dependencies. Defines environment variables (NVSHMEM_DIR, LD_LIBRARY_PATH, PATH) for container setup. |

Sequence Diagram

sequenceDiagram
    participant Jenkins as Jenkins Pipeline
    participant GroovyScript as L0_Test.groovy
    participant Config as verl_config.yml
    participant Repo as Verl Repository
    participant Env as Environment Setup
    participant TestRunner as Test Execution

    Jenkins->>GroovyScript: runLLMTestlistOnPlatformImpl(stageName="-Verl-")
    GroovyScript->>Config: Read verl_config.yml
    Config-->>GroovyScript: repo_url, install_commands
    GroovyScript->>Repo: Clone Verl repository
    Repo-->>GroovyScript: Repo cloned, set VERL_ROOT
    GroovyScript->>GroovyScript: processShardTestList: Transform verl:: paths
    GroovyScript->>Env: Export environment variables
    GroovyScript->>Env: Execute install_commands (gdrcopy, nvshmem, DeepEP)
    Env-->>GroovyScript: Environment configured
    GroovyScript->>GroovyScript: getMakoArgsFromStageName: backend=verl
    GroovyScript->>TestRunner: Execute Verl tests with MPI orchestration
    TestRunner-->>Jenkins: Test results

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Description check | ⚠️ Warning | PR description is incomplete, missing required sections. Only provides 'Changes' and 'Activated test cases'; lacks PR title format, Description, Test Coverage, and checklist completion. | Add PR title in format [TRTLLM-10695][ci] Add VERL stage in CI, fill Description section, list Test Coverage details, and verify all PR Checklist items are addressed. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly and specifically identifies the main change: adding a Verl stage to the CI pipeline, with proper JIRA ticket reference and infra type notation. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@tests/integration/test_lists/test-db/verl_config.yml`:
- Around line 1-6: Remove the unused test_dir key from the verl_config YAML
since L0_Test.groovy and the codebase only read repo_url, repo_tag,
install_commands and env_vars; open the file, delete the line containing
"test_dir: \"tests\"" so the verl_config block contains only the fields actually
consumed (repo_url and repo_tag), and run a quick grep for "test_dir" to confirm
there are no remaining references.
🧹 Nitpick comments (2)
tests/integration/test_lists/test-db/verl_config.yml (1)

18-21: Hardcoded Python 3.12 path is fragile.

Lines 19 and 39 embed /usr/local/lib/python3.12/dist-packages/.... If the CI container ever moves to a different Python version, these paths will silently break. Consider deriving the path dynamically, e.g.:

- >-
  NVSHMEM_SITE=$(python3 -c "import nvidia.nvshmem; print(nvidia.nvshmem.__path__[0])")

or at minimum add a comment noting the Python 3.12 dependency.

Also applies to: 39-39
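
One way to avoid embedding the interpreter version is to ask Python where a package is installed. The sketch below is a minimal illustration of that idea using only the standard library; the helper name `package_dir` is hypothetical, and `json` stands in for `nvidia.nvshmem` so the example runs on any machine.

```python
import importlib.util


def package_dir(package_name):
    """Return the install directory of a package, or None if it is not installed."""
    try:
        spec = importlib.util.find_spec(package_name)
    except ModuleNotFoundError:
        # Raised when a parent package of a dotted name is itself missing.
        return None
    if spec is None or not spec.submodule_search_locations:
        return None
    return list(spec.submodule_search_locations)[0]


# "json" is a stand-in; in the CI setting the argument would be "nvidia.nvshmem".
print(package_dir("json"))
```

The returned directory follows whichever interpreter runs the script, so a container upgrade to another Python version would not require editing the path.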

jenkins/L0_Test.groovy (1)

2704-2723: Env var resolution is hardcoded for only $LD_LIBRARY_PATH and $PATH — fragile for future additions.

Lines 2717–2718 only resolve two specific bare $VAR references. If a future env_vars entry references a different existing env var (e.g., $HOME, $CUDA_HOME), it will be left as a literal string in the Jenkins environment. The ${VAR} syntax (curly-brace) is handled generically via resolvedVars on lines 2713–2715, but the bare $VAR syntax is not.

Consider a general resolution loop over resolvedVars and env to replace any $KEY pattern, or at minimum, document that only ${...} syntax should be used in verl_config.yml:

♻️ Suggested improvement for more general resolution
-                        // Resolve references to existing env vars
-                        value = value.replace('$LD_LIBRARY_PATH', env.LD_LIBRARY_PATH ?: '')
-                        value = value.replace('$PATH', env.PATH ?: '')
+                        // Resolve any $VAR references to previously resolved vars (bare syntax)
+                        resolvedVars.each { k, v ->
+                            value = value.replace('$' + k, v)
+                        }
+                        // Resolve remaining $VAR references against Jenkins env
+                        def varPattern = /\$([A-Za-z_][A-Za-z0-9_]*)/
+                        value = value.replaceAll(varPattern) { match, varName ->
+                            env."${varName}" ?: match
+                        }
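
The same general resolution idea can be sketched in Python for readers less familiar with Groovy. This is an illustration only, not the code in L0_Test.groovy: the function name `resolve_vars` and the sample values are hypothetical. It handles both `${VAR}` and bare `$VAR`, prefers previously resolved values, falls back to the ambient environment, and leaves unknown references untouched.

```python
import re

# Matches ${NAME} (group 1) or bare $NAME (group 2).
_VAR_PATTERN = re.compile(
    r"\$(?:\{([A-Za-z_][A-Za-z0-9_]*)\}|([A-Za-z_][A-Za-z0-9_]*))"
)


def resolve_vars(value, resolved, env):
    """Expand variable references in value using resolved first, then env."""

    def substitute(match):
        name = match.group(1) or match.group(2)
        if name in resolved:
            return resolved[name]
        if name in env:
            return env[name]
        # Unknown variable: keep the original text rather than emptying it.
        return match.group(0)

    return _VAR_PATTERN.sub(substitute, value)


# Sample values are made up for the illustration.
example = resolve_vars(
    "${NVSHMEM_DIR}/lib:$LD_LIBRARY_PATH:$UNKNOWN",
    resolved={"NVSHMEM_DIR": "/opt/nvshmem"},
    env={"LD_LIBRARY_PATH": "/usr/lib"},
)
print(example)
```

Matching the whole identifier in one pass also avoids prefix collisions, where replacing `$PATH` textually would corrupt a longer name that starts with `PATH`.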

@tensorrt-cicd
Collaborator

PR_Github #35437 [ run ] triggered by Bot. Commit: 36ff776

@tensorrt-cicd
Collaborator

PR_Github #35437 [ run ] completed with state FAILURE. Commit: 36ff776
/LLM/main/L0_MergeRequest_PR pipeline #27371 (Partly Tested) completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

@Superjomn
Collaborator Author

/bot run --stage-list "DGX_B200-4_GPUs-Verl-Post-Merge-1"

@tensorrt-cicd
Collaborator

PR_Github #35498 [ run ] triggered by Bot. Commit: bfbd3bf

@tensorrt-cicd
Collaborator

PR_Github #35498 [ run ] completed with state FAILURE. Commit: bfbd3bf
/LLM/main/L0_MergeRequest_PR pipeline #27408 (Partly Tested) completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

@hchings
Collaborator

hchings commented Feb 10, 2026

/bot run --stage-list "DGX_B200-4_GPUs-Verl-Post-Merge-1"

1 similar comment
@hchings
Collaborator

hchings commented Feb 10, 2026

/bot run --stage-list "DGX_B200-4_GPUs-Verl-Post-Merge-1"

@tensorrt-cicd
Collaborator

PR_Github #35550 [ run ] triggered by Bot. Commit: 0ad1836

@tensorrt-cicd
Collaborator

PR_Github #35550 [ run ] completed with state FAILURE. Commit: 0ad1836
/LLM/main/L0_MergeRequest_PR pipeline #27454 (Partly Tested) completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

@Superjomn Superjomn force-pushed the add-verl-stage branch 2 times, most recently from 1aa18da to 70a215b Compare February 13, 2026 08:49
@Superjomn
Collaborator Author

/bot run --stage-list "DGX_B200-4_GPUs-Verl-Post-Merge-1"

@tensorrt-cicd
Collaborator

PR_Github #35896 [ run ] triggered by Bot. Commit: 70a215b

@tensorrt-cicd
Collaborator

PR_Github #35896 [ run ] completed with state FAILURE. Commit: 70a215b
/LLM/main/L0_MergeRequest_PR pipeline #27721 (Partly Tested) completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

@Superjomn
Collaborator Author

/bot run --stage-list "DGX_B200-4_GPUs-Verl-Post-Merge-1"

Tests with the verl:: prefix live in the external verl repository and
are only available at Jenkins runtime (resolved to ${VERL_ROOT}/ by
L0_Test.groovy). The local pre-merge validation script has no access
to that repo, so these entries were flagged as invalid. Filter them out
before pytest collection so the CI check passes cleanly.

Signed-off-by: Chunwei Yan <yanchunwei@outlook.com>
Made-with: Cursor
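
A minimal sketch of the filtering step described in this commit message, assuming test-list entries are plain strings. The helper name `split_test_entries` and the sample entries are hypothetical; the actual validation script is not shown in this conversation.

```python
VERL_PREFIX = "verl::"


def split_test_entries(entries):
    """Separate locally collectable entries from external verl:: entries."""
    local, external = [], []
    for entry in entries:
        if entry.strip().startswith(VERL_PREFIX):
            external.append(entry)
        else:
            local.append(entry)
    return local, external


# Made-up entries for illustration only.
local_entries, external_entries = split_test_entries(
    ["unittest/test_example.py", "verl::tests/test_example.py"]
)
print(local_entries, external_entries)
```

Only the first list would be passed on to pytest collection; the second can be reported or ignored by the local check.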
Signed-off-by: Chunwei Yan <yanchunwei@outlook.com>
Made-with: Cursor
Signed-off-by: Chunwei Yan <yanchunwei@outlook.com>
Made-with: Cursor
…_config.yml

The verl conftest.py runs install commands via subprocess.run(shell=True),
which uses /bin/sh. pushd/popd are bash builtins and fail with exit code 127
under /bin/sh. Replace with POSIX-compatible (cd dir && ...) subshells.

Signed-off-by: Chunwei Yan <yanchunwei@outlook.com>
Made-with: Cursor
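
The POSIX-compatible form can be illustrated with a short sketch. On POSIX systems `subprocess.run(..., shell=True)` runs the command through `/bin/sh`, so a `(cd dir && ...)` subshell works where `pushd`/`popd` would not. The helper name `run_in_dir` is hypothetical and is not taken from the verl conftest.py.

```python
import os
import shlex
import subprocess
import tempfile


def run_in_dir(directory, command):
    """Run a shell command inside directory using a POSIX subshell."""
    wrapped = f"(cd {shlex.quote(directory)} && {command})"
    return subprocess.run(
        wrapped, shell=True, check=True, capture_output=True, text=True
    )


with tempfile.TemporaryDirectory() as tmp:
    result = run_in_dir(tmp, "pwd -P")
    printed = result.stdout.strip()
    # realpath resolves symlinks so it is comparable with `pwd -P`.
    expected = os.path.realpath(tmp)

print(printed == expected)
```

Because the directory change happens inside the parentheses, it ends with the subshell and does not leak into later commands.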
Replace the verl:: prefix mechanism with a local wrapper test file
that invokes verl repo tests via subprocess, eliminating the need
for special CI infrastructure to handle external test paths.

Signed-off-by: Chunwei Yan <chunweiy@nvidia.com>
Signed-off-by: Chunwei Yan <yanchunwei@outlook.com>
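
The wrapper approach can be sketched as follows. This is an assumption-laden illustration, not the contents of test_verl_cases.py: the helper names `build_pytest_command` and `run_external_test` are hypothetical, and the `-v` flag is an arbitrary choice.

```python
import subprocess
import sys


def build_pytest_command(test_file, extra_args=()):
    """Build the argv for running one external test file in a child pytest."""
    return [sys.executable, "-m", "pytest", str(test_file), "-v", *extra_args]


def run_external_test(test_file, extra_args=(), cwd=None):
    """Run the external test file and fail the wrapper if the child fails."""
    completed = subprocess.run(build_pytest_command(test_file, extra_args), cwd=cwd)
    if completed.returncode != 0:
        raise AssertionError(
            f"{test_file} failed with exit code {completed.returncode}"
        )


# A wrapper test in the local repository would then be a one-liner, e.g.
#     def test_async_server():
#         run_external_test(<path to the file inside the cloned verl repo>)
cmd = build_pytest_command("test_example.py", ["-k", "not slow"])
print(cmd[1:])
```

Running the external file in a child process keeps its imports and fixtures isolated from the local test session, at the cost of reporting one pass/fail result per wrapped file.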
…L_ROOT env

Clone the verl repo into tests/integration/defs/verl/verl_repo/ so the
wrapper test discovers it by relative path (__file__), avoiding Jenkins
env var propagation issues in Docker-on-Slurm execution.

Signed-off-by: Chunwei Yan <chunweiy@nvidia.com>
Signed-off-by: Chunwei Yan <yanchunwei@outlook.com>
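
Locating the clone relative to the wrapper file might look like the sketch below. The function name `find_verl_root` is hypothetical; only the directory name `verl_repo` comes from the commit message.

```python
from pathlib import Path


def find_verl_root(wrapper_file):
    """Return the verl_repo directory that sits next to the wrapper test file."""
    return Path(wrapper_file).resolve().parent / "verl_repo"


# In the wrapper module this would be called as find_verl_root(__file__).
root = find_verl_root("/tmp/example/test_verl_cases.py")
print(root.name, root.parent.name)
```

Since the location is derived from the file system rather than from an environment variable, nothing has to be propagated through Jenkins, Slurm, or Docker for the wrapper to find the repository.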
… fixture

The Verl stage runs via the sbatch path which does not execute
runLLMTestlistOnPlatformImpl, so the Groovy setup block never ran.
Move all setup (env vars, dependency install, repo clone) into a
session-scoped pytest fixture in test_verl_cases.py, following
the triton-server-ci pattern.

Signed-off-by: Chunwei Yan <yanchunwei@outlook.com>
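
A session-scoped fixture of this kind could be structured roughly as below. This is a sketch under assumptions: the helper names, the fixture name `verl_env`, and the empty placeholder values are hypothetical, and the real fixture in test_verl_cases.py may be organized differently.

```python
import os
import subprocess

import pytest


def apply_env_vars(env_vars, environ=os.environ):
    """Export each configured variable into the given environment mapping."""
    for key, value in env_vars.items():
        environ[key] = value


def run_install_commands(commands):
    """Run each install command through the shell, stopping at the first failure."""
    for command in commands:
        subprocess.run(command, shell=True, check=True)


@pytest.fixture(scope="session")
def verl_env():
    # Placeholders: real values would be loaded from verl_config.yml.
    env_vars = {}
    install_commands = []
    apply_env_vars(env_vars)
    run_install_commands(install_commands)
    yield
```

With `scope="session"`, pytest runs the setup once and shares it across every wrapper test that requests the fixture, which matters when the install steps take minutes.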
The verl test_async_server.py imports ray, which was not listed
in verl_config.yml install_commands.

Signed-off-by: Chunwei Yan <yanchunwei@outlook.com>
The verl test imports verl.single_controller which requires the verl
package to be installed. Add pip install -e after cloning the repo.

Signed-off-by: Chunwei Yan <yanchunwei@outlook.com>
Hydra resolves config paths relative to cwd. The verl tests need
cwd=VERL_ROOT so the trainer/config directory is found correctly.

Signed-off-by: Chunwei Yan <yanchunwei@outlook.com>
The verl test uses TRTLLM_TEST_MODEL_PATH_ROOT to locate model
weights (defaults to ~/models). In CI, models are at
/scratch.trt_llm_data/llm-models.

Signed-off-by: Chunwei Yan <yanchunwei@outlook.com>
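
The lookup described here amounts to reading an environment variable with a home-directory default. The sketch below shows that pattern; the function name `model_path_root` is hypothetical and this is not verl's actual code.

```python
import os
from pathlib import Path


def model_path_root(environ=None):
    """Return the model root from TRTLLM_TEST_MODEL_PATH_ROOT, else ~/models."""
    environ = os.environ if environ is None else environ
    default = Path.home() / "models"
    return Path(environ.get("TRTLLM_TEST_MODEL_PATH_ROOT", str(default)))


ci_root = model_path_root(
    {"TRTLLM_TEST_MODEL_PATH_ROOT": "/scratch.trt_llm_data/llm-models"}
)
print(ci_root)
```

Setting the variable in the CI environment before the tests start is therefore enough to redirect model loading to the shared cache.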
The verl test needs Qwen/Qwen2.5-0.5B-Instruct at a local path.
Add model download step using huggingface_hub.snapshot_download
to TRTLLM_TEST_MODEL_PATH_ROOT before running tests.

Signed-off-by: Chunwei Yan <yanchunwei@outlook.com>
The pinned verl commit 4ef45d0 uses OpenAIServer(llm=...) but
TRT-LLM now expects OpenAIServer(generator=...). Update to
4cda6af which has the compatible API call.

Signed-off-by: Chunwei Yan <yanchunwei@outlook.com>
Add wrapper tests for test_adapter.py (HTTP adapter + server init)
and test_trtllm_rollout_utils.py (multimodal rollout + lifecycle).
Unimodal tests requiring Qwen2.5-Math-7B are excluded via -k filter
since the model is not in the CI cache. Use CI model cache paths
with symlinks to bridge HF-style naming to flat CI cache structure.

Signed-off-by: Chunwei Yan <yanchunwei@outlook.com>
The CI model cache at /scratch.trt_llm_data/llm-models is read-only.
Instead of creating symlinks there, use /tmp/verl-models as a writable
staging directory with symlinks pointing back to the read-only cache.

Signed-off-by: Chunwei Yan <yanchunwei@outlook.com>
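
The staging idea can be sketched with temporary directories standing in for the read-only cache and for /tmp/verl-models. The helper name `stage_model` is hypothetical, and the directory layout is an assumption made for the illustration.

```python
import tempfile
from pathlib import Path


def stage_model(cache_dir, staging_root, hf_name):
    """Create staging_root/hf_name as a symlink to a model in the cache."""
    link = Path(staging_root) / hf_name
    # Parent directories (e.g. the "Qwen/" org level) are created in staging only.
    link.parent.mkdir(parents=True, exist_ok=True)
    if not (link.is_symlink() or link.exists()):
        link.symlink_to(Path(cache_dir), target_is_directory=True)
    return link


with tempfile.TemporaryDirectory() as tmp:
    # Flat cache entry, standing in for the read-only CI model cache.
    cache = Path(tmp) / "cache" / "Qwen2.5-0.5B-Instruct"
    cache.mkdir(parents=True)
    (cache / "config.json").write_text("{}")

    link = stage_model(cache, Path(tmp) / "staging", "Qwen/Qwen2.5-0.5B-Instruct")
    is_link = link.is_symlink()
    staged_ok = (link / "config.json").read_text() == "{}"

print(is_link, staged_ok)
```

All writes land in the staging directory, so the cache itself is never modified, and the weights are not copied.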
@Superjomn
Collaborator Author

/bot run --stage-list "DGX_B200-4_GPUs-Verl-Post-Merge-1"

@tensorrt-cicd
Collaborator

PR_Github #38589 [ run ] triggered by Bot. Commit: 828a891 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #38589 [ run ] completed with state SUCCESS. Commit: 828a891
/LLM/main/L0_MergeRequest_PR pipeline #29925 (Partly Tested) completed with status: 'SUCCESS'

CI Report

Link to invocation

Collaborator

@ZhanruiSunCh ZhanruiSunCh left a comment

LGTM for L0_Test.groovy. If you want this stage to be auto-triggered in pre-merge, you need to modify this section: https://github.com/NVIDIA/TensorRT-LLM/blob/main/jenkins/L0_MergeRequest.groovy#L642-L647

Collaborator

@hchings hchings left a comment

LGTM

@Superjomn
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #38676 [ run ] triggered by Bot. Commit: 828a891 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #38676 [ run ] completed with state SUCCESS. Commit: 828a891
/LLM/main/L0_MergeRequest_PR pipeline #29999 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@Superjomn
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #38727 [ run ] triggered by Bot. Commit: 828a891 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #38727 [ run ] completed with state SUCCESS. Commit: 828a891
/LLM/main/L0_MergeRequest_PR pipeline #30045 completed with status: 'SUCCESS'

CI Report

Link to invocation

@Superjomn Superjomn enabled auto-merge (squash) March 13, 2026 00:52
@Superjomn Superjomn merged commit 0507609 into NVIDIA:main Mar 13, 2026
5 checks passed
@Superjomn Superjomn deleted the add-verl-stage branch March 13, 2026 00:55