[TRTLLM-10695][ci] add verl stage in CI#11306
Conversation
d1343f3 to
36ff776
Compare
|
/bot run --stage-list "DGX_B200-4_GPUs-Verl-Post-Merge-1" |
📝 WalkthroughWalkthroughThis change adds Verl backend support to the Jenkins test framework by transforming verl-prefixed test paths to actual paths, extending test configuration to recognize Verl stage names, and implementing Verl environment setup with repository cloning and configuration management. New test shards and configuration files are introduced for Verl integration testing. Changes
Sequence DiagramsequenceDiagram
participant Jenkins as Jenkins Pipeline
participant GroovyScript as L0_Test.groovy
participant Config as verl_config.yml
participant Repo as Verl Repository
participant Env as Environment Setup
participant TestRunner as Test Execution
Jenkins->>GroovyScript: runLLMTestlistOnPlatformImpl(stageName="-Verl-")
GroovyScript->>Config: Read verl_config.yml
Config-->>GroovyScript: repo_url, install_commands
GroovyScript->>Repo: Clone Verl repository
Repo-->>GroovyScript: Repo cloned, set VERL_ROOT
GroovyScript->>GroovyScript: processShardTestList: Transform verl:: paths
GroovyScript->>Env: Export environment variables
GroovyScript->>Env: Execute install_commands (gdrcopy, nvshmem, DeepEP)
Env-->>GroovyScript: Environment configured
GroovyScript->>GroovyScript: getMakoArgsFromStageName: backend=verl
GroovyScript->>TestRunner: Execute Verl tests with MPI orchestration
TestRunner-->>Jenkins: Test results
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes 🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@tests/integration/test_lists/test-db/verl_config.yml`:
- Around line 1-6: Remove the unused test_dir key from the verl_config YAML
since L0_Test.groovy and the codebase only read repo_url, repo_tag,
install_commands and env_vars; open the file, delete the line containing
"test_dir: \"tests\"" so the verl_config block contains only the fields actually
consumed (repo_url and repo_tag), and run a quick grep for "test_dir" to confirm
there are no remaining references.
🧹 Nitpick comments (2)
tests/integration/test_lists/test-db/verl_config.yml (1)
18-21: Hardcoded Python 3.12 path is fragile.Lines 19 and 39 embed
/usr/local/lib/python3.12/dist-packages/.... If the CI container ever moves to a different Python version, these paths will silently break. Consider deriving the path dynamically, e.g.:- >- NVSHMEM_SITE=$(python3 -c "import nvidia.nvshmem; print(nvidia.nvshmem.__path__[0])")or at minimum add a comment noting the Python 3.12 dependency.
Also applies to: 39-39
jenkins/L0_Test.groovy (1)
2704-2723: Env var resolution is hardcoded for only$LD_LIBRARY_PATHand$PATH— fragile for future additions.Lines 2717–2718 only resolve two specific bare
$VARreferences. If a futureenv_varsentry references a different existing env var (e.g.,$HOME,$CUDA_HOME), it will be left as a literal string in the Jenkins environment. The${VAR}syntax (curly-brace) is handled generically viaresolvedVarson lines 2713–2715, but the bare$VARsyntax is not.Consider a general resolution loop over
resolvedVarsandenvto replace any$KEYpattern, or at minimum, document that only${...}syntax should be used inverl_config.yml:♻️ Suggested improvement for more general resolution
- // Resolve references to existing env vars - value = value.replace('$LD_LIBRARY_PATH', env.LD_LIBRARY_PATH ?: '') - value = value.replace('$PATH', env.PATH ?: '') + // Resolve any $VAR references to previously resolved vars (bare syntax) + resolvedVars.each { k, v -> + value = value.replace('$' + k, v) + } + // Resolve remaining $VAR references against Jenkins env + def varPattern = /\$([A-Za-z_][A-Za-z0-9_]*)/ + value = value.replaceAll(varPattern) { match, varName -> + env."${varName}" ?: match + }
|
PR_Github #35437 [ run ] triggered by Bot. Commit: |
|
PR_Github #35437 [ run ] completed with state
|
36ff776 to
bfbd3bf
Compare
|
/bot run --stage-list "DGX_B200-4_GPUs-Verl-Post-Merge-1" |
|
PR_Github #35498 [ run ] triggered by Bot. Commit: |
|
PR_Github #35498 [ run ] completed with state
|
|
/bot run --stage-list "DGX_B200-4_GPUs-Verl-Post-Merge-1" |
1 similar comment
|
/bot run --stage-list "DGX_B200-4_GPUs-Verl-Post-Merge-1" |
|
PR_Github #35550 [ run ] triggered by Bot. Commit: |
|
PR_Github #35550 [ run ] completed with state
|
1aa18da to
70a215b
Compare
|
/bot run --stage-list "DGX_B200-4_GPUs-Verl-Post-Merge-1" |
|
PR_Github #35896 [ run ] triggered by Bot. Commit: |
|
PR_Github #35896 [ run ] completed with state
|
70a215b to
0128576
Compare
|
/bot run --stage-list "DGX_B200-4_GPUs-Verl-Post-Merge-1" |
Tests with the verl:: prefix live in the external verl repository and
are only available at Jenkins runtime (resolved to ${VERL_ROOT}/ by
L0_Test.groovy). The local pre-merge validation script has no access
to that repo, so these entries were flagged as invalid. Filter them out
before pytest collection so the CI check passes cleanly.
Signed-off-by: Chunwei Yan <yanchunwei@outlook.com>
Made-with: Cursor
Signed-off-by: Chunwei Yan <yanchunwei@outlook.com> Made-with: Cursor
Signed-off-by: Chunwei Yan <yanchunwei@outlook.com> Made-with: Cursor
…_config.yml The verl conftest.py runs install commands via subprocess.run(shell=True), which uses /bin/sh. pushd/popd are bash builtins and fail with exit code 127 under /bin/sh. Replace with POSIX-compatible (cd dir && ...) subshells. Signed-off-by: Chunwei Yan <yanchunwei@outlook.com> Made-with: Cursor
Replace the verl:: prefix mechanism with a local wrapper test file that invokes verl repo tests via subprocess, eliminating the need for special CI infrastructure to handle external test paths. Signed-off-by: Chunwei Yan <chunweiy@nvidia.com> Signed-off-by: Chunwei Yan <yanchunwei@outlook.com>
…L_ROOT env Clone the verl repo into tests/integration/defs/verl/verl_repo/ so the wrapper test discovers it by relative path (__file__), avoiding Jenkins env var propagation issues in Docker-on-Slurm execution. Signed-off-by: Chunwei Yan <chunweiy@nvidia.com> Signed-off-by: Chunwei Yan <yanchunwei@outlook.com>
… fixture The Verl stage runs via the sbatch path which does not execute runLLMTestlistOnPlatformImpl, so the Groovy setup block never ran. Move all setup (env vars, dependency install, repo clone) into a session-scoped pytest fixture in test_verl_cases.py, following the triton-server-ci pattern. Signed-off-by: Chunwei Yan <yanchunwei@outlook.com>
The verl test_async_server.py imports ray, which was not listed in verl_config.yml install_commands. Signed-off-by: Chunwei Yan <yanchunwei@outlook.com>
The verl test imports verl.single_controller which requires the verl package to be installed. Add pip install -e after cloning the repo. Signed-off-by: Chunwei Yan <yanchunwei@outlook.com>
Hydra resolves config paths relative to cwd. The verl tests need cwd=VERL_ROOT so the trainer/config directory is found correctly. Signed-off-by: Chunwei Yan <yanchunwei@outlook.com>
The verl test uses TRTLLM_TEST_MODEL_PATH_ROOT to locate model weights (defaults to ~/models). In CI, models are at /scratch.trt_llm_data/llm-models. Signed-off-by: Chunwei Yan <yanchunwei@outlook.com>
The verl test needs Qwen/Qwen2.5-0.5B-Instruct at a local path. Add model download step using huggingface_hub.snapshot_download to TRTLLM_TEST_MODEL_PATH_ROOT before running tests. Signed-off-by: Chunwei Yan <yanchunwei@outlook.com>
The pinned verl commit 4ef45d0 uses OpenAIServer(llm=...) but TRT-LLM now expects OpenAIServer(generator=...). Update to 4cda6af which has the compatible API call. Signed-off-by: Chunwei Yan <yanchunwei@outlook.com>
Add wrapper tests for test_adapter.py (HTTP adapter + server init) and test_trtllm_rollout_utils.py (multimodal rollout + lifecycle). Unimodal tests requiring Qwen2.5-Math-7B are excluded via -k filter since the model is not in the CI cache. Use CI model cache paths with symlinks to bridge HF-style naming to flat CI cache structure. Signed-off-by: Chunwei Yan <yanchunwei@outlook.com>
The CI model cache at /scratch.trt_llm_data/llm-models is read-only. Instead of creating symlinks there, use /tmp/verl-models as a writable staging directory with symlinks pointing back to the read-only cache. Signed-off-by: Chunwei Yan <yanchunwei@outlook.com>
0cd4b27 to
828a891
Compare
|
/bot run --stage-list "DGX_B200-4_GPUs-Verl-Post-Merge-1" |
|
PR_Github #38589 [ run ] triggered by Bot. Commit: |
|
PR_Github #38589 [ run ] completed with state |
ZhanruiSunCh
left a comment
There was a problem hiding this comment.
LGTM for L0_Test.groovy. If you want this stage be auto triggerd in pre-merge, you need modify here: https://github.com/NVIDIA/TensorRT-LLM/blob/main/jenkins/L0_MergeRequest.groovy#L642-L647
|
/bot run |
|
PR_Github #38676 [ run ] triggered by Bot. Commit: |
|
PR_Github #38676 [ run ] completed with state
|
|
/bot run |
|
PR_Github #38727 [ run ] triggered by Bot. Commit: |
|
PR_Github #38727 [ run ] completed with state |
Changes
This PR adds a CI stage for testing trtllm rollout-related tests.
The CI stage
This is modeled after the Triton server CI. Environment setup is consolidated in tests/integration/defs/verl/test_verl_cases.py, and each VERL test file is wrapped with a dedicated test wrapper so the stage can be plugged into TRT-LLM like a standard CI stage.
Activated test cases
test_adaptertest_async_servertest_rollout_utilsHere's the full inventory of verl TRT-LLM tests at tag
4cda6af:test_async_server.py (4 tests) — all enabled
test_placement_group_with_sub_ray_resource_pooltest_placement_group_with_ray_resource_pooltest_async_generatetest_async_memory_managementtest_adapter.py (5 tests) — all enabled
test_make_async_request_get_methodtest_make_async_request_post_methodtest_make_async_request_http_errortest_make_async_request_max_attempts_exceededtest_init_without_device_meshtest_trtllm_rollout_utils.py (8 tests, 23 after parametrize) — partially enabled
test_unimodal_generate(×3 prompts)-k not ...)test_unimodal_batch_generate-k not ...)test_multimodal_generate_with_image(×3)test_multimodal_different_image_sizes(×3)test_multimodal_text_only_fallbacktest_wake_sleep_cycleCurrently excluded:
test_unimodal_generateandtest_unimodal_batch_generate— they requireQwen2.5-Math-7Bwhich isn't in the CI cache.Note:
test_wake_sleep_cyclealso usesQwen2.5-Math-7B. It passed in build #29880 so it may have a fallback, but it could be a potential issue. Want me to check its implementation more closely?Summary by CodeRabbit
Release Notes
New Features
Tests
Description
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...Provide a user friendly way for developers to interact with a Jenkins server.
Run
/bot [-h|--help]to print this help message.See details below for each supported subcommand.
Details
run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]Launch build/test pipelines. All previously running jobs will be killed.
--reuse-test (optional)pipeline-id(OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.--disable-reuse-test(OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.--disable-fail-fast(OPTIONAL) : Disable fail fast on build/tests/infra failures.--skip-test(OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.--stage-list "A10-PyTorch-1, xxx"(OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.--gpu-type "A30, H100_PCIe"(OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.--test-backend "pytorch, cpp"(OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.--only-multi-gpu-test(OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.--disable-multi-gpu-test(OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.--add-multi-gpu-test(OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.--post-merge(OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx"(OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".--detailed-log(OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.--debug(OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in thestage-listparameter to access the appropriate container environment. Note: Does NOT update GitHub check status.For guidance on mapping tests to stage names, see
docs/source/reference/ci-overview.mdand the
scripts/test_to_stage_mapping.pyhelper.kill
killKill all running builds associated with pull request.
skip
skip --comment COMMENTSkip testing for latest commit on pull request.
--comment "Reason for skipping build/test"is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.reuse-pipeline
reuse-pipelineReuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.