
cp: fix: update the custom vllm instructions (1116) into r0.4.0#1377

Merged
terrykong merged 1 commit into r0.4.0 from cherry-pick-1116-r0.4.0
Oct 16, 2025

Conversation


@chtruong814 chtruong814 commented Oct 16, 2025

beep boop [🤖]: Hi @terrykong 👋,

we've cherry-picked #1116 into  for you! 🚀

Please review and approve this cherry-pick at your convenience!

Summary by CodeRabbit

  • New Features

    • Added optional BUILD_CUSTOM_VLLM build flag for Docker to enable custom vLLM installation.
    • Introduced new environment variables (VLLM_PRECOMPILED_WHEEL_LOCATION, NRL_FORCE_REBUILD_VENVS) for custom vLLM workflows.
  • Documentation

    • Updated guides with streamlined custom vLLM build process, new parameter signatures, and Docker rebuild instructions.
    • Added expanded verification steps and example commands for validating custom vLLM installations.

Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
@chtruong814 chtruong814 requested a review from a team as a code owner October 16, 2025 17:02
@chtruong814 chtruong814 requested a review from terrykong October 16, 2025 17:02
@chtruong814 chtruong814 requested review from a team as code owners October 16, 2025 17:02
@github-actions github-actions bot added the documentation Improvements or additions to documentation label Oct 16, 2025

coderabbitai bot commented Oct 16, 2025

📝 Walkthrough

This PR introduces an optional custom vLLM build flow for NeMo RL by adding a conditional Docker build flag (BUILD_CUSTOM_VLLM), substantially rewriting the build script with new parameter handling and pyproject.toml modification capabilities, and providing comprehensive documentation for the workflow.

Changes

  • Git configuration (.gitignore): Updated the pattern from .git to /.git for repository-root specificity; removed the 3rdparty/vllm entry.
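For illustration, the difference between the two patterns (both shown side by side here; a real .gitignore would carry only the anchored form):

```gitignore
# Unanchored: matches a ".git" entry at any depth, e.g. inside vendored checkouts
.git

# Root-anchored: matches only the ".git" directory at the repository root
/.git
```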
  • Docker build system (docker/Dockerfile): Added an optional BUILD_CUSTOM_VLLM build argument to the hermetic stage; copies tools/build-custom-vllm.sh and conditionally runs the custom vLLM build script, sourcing the environment from 3rdparty/vllm/nemo-rl.env when the flag is set.
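A minimal sketch of what such a conditional step can look like (the flag default, stage layout, and paths here are assumptions for illustration, not the merged docker/Dockerfile):

```dockerfile
# Hypothetical sketch; the real docker/Dockerfile may differ in stage names and paths.
ARG BUILD_CUSTOM_VLLM=""
COPY tools/build-custom-vllm.sh tools/build-custom-vllm.sh
# When the flag is set, build the custom vLLM; the script generates
# 3rdparty/vllm/nemo-rl.env, which later install steps can source.
RUN if [ -n "$BUILD_CUSTOM_VLLM" ]; then \
      bash tools/build-custom-vllm.sh; \
    fi
```

A build would then be invoked with something like `docker build --build-arg BUILD_CUSTOM_VLLM=true .` (see the updated guide for the authoritative command).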
  • Build automation (tools/build-custom-vllm.sh): Replaced hard-coded defaults with explicit parameters (GIT_URL, GIT_REF, VLLM_WHEEL_COMMIT); added idempotent pyproject.toml update logic via an embedded Python script using tomlkit to configure the local vLLM path, unpin vLLM dependencies, ensure setuptools_scm inclusion, and set no-build-isolation flags; added repository-root discovery and environment-file generation for Docker integration.
  • Documentation (docs/guides/use-custom-vllm.md): Restructured and expanded the usage guide with an updated script signature, condensed workflow steps, detailed verification procedures, new environment-variable guidance (VLLM_PRECOMPILED_WHEEL_LOCATION, NRL_FORCE_REBUILD_VENVS), and Docker rebuild instructions with a BUILD_CUSTOM_VLLM flag example.
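As a sketch of how the two new environment variables might be set before re-running the installation (the values below are invented; consult docs/guides/use-custom-vllm.md for their exact semantics):

```shell
# Hypothetical values for illustration only; the wheel path is made up.
export VLLM_PRECOMPILED_WHEEL_LOCATION="/tmp/wheels/vllm-custom.whl"
export NRL_FORCE_REBUILD_VENVS=true

echo "Using precompiled wheel: ${VLLM_PRECOMPILED_WHEEL_LOCATION}"
echo "Force venv rebuild: ${NRL_FORCE_REBUILD_VENVS}"
```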

Sequence Diagram

```mermaid
sequenceDiagram
    participant User
    participant Docker as Docker Build
    participant Script as build-custom-vllm.sh
    participant PyProj as pyproject.toml
    participant vLLM as 3rdparty/vllm

    User->>Docker: docker build --build-arg BUILD_CUSTOM_VLLM=true
    Docker->>Script: Copy & conditionally execute
    alt BUILD_CUSTOM_VLLM set
        Script->>vLLM: Clone/verify repository
        Script->>PyProj: Update with local vLLM path
        PyProj->>PyProj: Add setuptools_scm dependency<br/>Unpin vllm<br/>Configure editable source<br/>Set no-build-isolation
        Script->>vLLM: Build custom vLLM
        Script->>Docker: Generate nemo-rl.env
        Docker->>Docker: Source nemo-rl.env<br/>Continue install
    else BUILD_CUSTOM_VLLM not set
        Docker->>Docker: Use default vLLM installation
    end
    Docker-->>User: Build complete
```

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

The build script rewrite introduces substantial logic density with embedded Python configuration via tomlkit, parameter scheme changes, and environmental setup. Docker conditional flow adds moderate complexity. Documentation requires verification for correctness and completeness. The heterogeneous nature of changes across build automation, Docker orchestration, and documentation necessitates separate reasoning for each component.

Possibly related PRs

  • NVIDIA-NeMo/RL#1299: Modifies Docker build flow to adjust vLLM-related build and install steps, complementing the custom vLLM build infrastructure introduced here.

Suggested labels

CI:docs, r0.4.0

Suggested reviewers

  • terrykong
  • yfw

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Title Check | ⚠️ Warning | The title, though referencing the cherry-pick and change, is cluttered with backticks, a PR number, and target-branch details, making it overly verbose rather than a clear, concise summary of the primary update to the custom vLLM instructions. | Simplify the title to state the main change without backticks or branch/PR metadata, for example "Update custom vLLM instructions for r0.4.0 release." |
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped: CodeRabbit's high-level summary is enabled. |
| Docstring Coverage | ✅ Passed | No functions found in the changes; docstring coverage check skipped. |
| Test Results For Major Changes | ✅ Passed | The changes in this PR are limited to build configuration, scripts, and documentation for an optional custom vLLM installation flow and do not modify core algorithmic code, numerical behavior, convergence, or performance of the library. As primarily tooling and documentation updates rather than new features or breaking runtime changes, they are considered minor and do not require performance benchmarks or regression test data in the PR description. |



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9088e55 and 3c0b098.

📒 Files selected for processing (4)
  • .gitignore (1 hunks)
  • docker/Dockerfile (2 hunks)
  • docs/guides/use-custom-vllm.md (1 hunks)
  • tools/build-custom-vllm.sh (2 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.sh

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.sh: Follow the Google Shell Style Guide for all shell scripts
Use uv run to execute Python scripts in shell/driver scripts instead of activating virtualenvs and calling python directly
Add the NVIDIA copyright header (with current year) at the top of all shell scripts, excluding tests/ and test-only scripts

Files:

  • tools/build-custom-vllm.sh
docs/**/*.md

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

When a markdown doc under docs/**/*.md is added or renamed, update docs/index.md to include it in the appropriate section

Files:

  • docs/guides/use-custom-vllm.md
🪛 LanguageTool
docs/guides/use-custom-vllm.md

[grammar] ~31-~31: There might be a mistake here.
Context: ... ## Verify Your Custom vLLM in Isolation Test your setup to ensure your custom vL...

(QB_NEW_EN)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: Lint check
  • GitHub Check: Lint check
  • GitHub Check: Lint check
  • GitHub Check: Post submodule check comment / Comment on PR
  • GitHub Check: Post automodel integration comment / Comment on PR

Comment on lines +51 to +93
```sh
OLD_UV_PROJECT_ENVIRONMENT=$UV_PROJECT_ENVIRONMENT
unset UV_PROJECT_ENVIRONMENT
uv venv

# Remove all comments from requirements files to prevent use_existing_torch.py from incorrectly removing xformers
echo "Removing comments from requirements files..."
find requirements/ -name "*.txt" -type f -exec sed -i 's/#.*$//' {} \; 2>/dev/null || true
find requirements/ -name "*.txt" -type f -exec sed -i '/^[[:space:]]*$/d' {} \; 2>/dev/null || true
# Replace xformers==.* (but preserve any platform markers at the end)
# NOTE: xformers is bumped from 0.0.30 to 0.0.31 to work with torch==2.7.1. This version may need to change when we upgrade torch.
find requirements/ -name "*.txt" -type f -exec sed -i -E 's/^(xformers)==[^;[:space:]]*/\1==0.0.31/' {} \; 2>/dev/null || true

uv run --no-project use_existing_torch.py

# Install dependencies
echo "Installing dependencies..."
uv pip install --upgrade pip
uv pip install numpy setuptools setuptools_scm
uv pip install torch==2.7.0 --torch-backend=cu128
uv pip install torch==2.7.1 --torch-backend=cu128

# Install vLLM using precompiled wheel
echo "Installing vLLM with precompiled wheel..."
uv pip install --no-build-isolation -e .

echo "Build completed successfully!"
echo "The built vLLM is available in: $BUILD_DIR"
echo "You can now update your pyproject.toml to use this local version."
echo "Follow instructions on https://github.com/NVIDIA-NeMo/RL/blob/main/docs/guides/use-custom-vllm.md for how to configure your local NeMo RL environment to use this custom vLLM."

echo "Updating repo pyproject.toml to point vLLM to local clone..."

PYPROJECT_TOML="$REPO_ROOT/pyproject.toml"
if [[ ! -f "$PYPROJECT_TOML" ]]; then
    echo "[ERROR] pyproject.toml not found at $PYPROJECT_TOML. This script must be run from the repo root and pyproject.toml must exist."
    exit 1
fi

cd "$REPO_ROOT"

export UV_PROJECT_ENVIRONMENT=$OLD_UV_PROJECT_ENVIRONMENT
if [[ -n "$UV_PROJECT_ENVIRONMENT" ]]; then
    # We optionally set this if the project environment is outside of the project directory.
    # If we do not set this then uv pip install commands will fail
    export VIRTUAL_ENV=$UV_PROJECT_ENVIRONMENT
fi
```

⚠️ Potential issue | 🔴 Critical

Don't dereference an unset UV_PROJECT_ENVIRONMENT under set -u.

With set -u active, the assignment on Line 51 explodes whenever UV_PROJECT_ENVIRONMENT is not already exported (the common case), so the whole script aborts before cloning vLLM. Wrap the read in safe parameter expansion and only restore the variable when it previously existed.
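The failure mode and the fix can be reproduced in a standalone snippet, independent of the build script (variable names follow the review comment):

```shell
set -u    # referencing an unset variable is now a fatal error

unset UV_PROJECT_ENVIRONMENT || true

# Unsafe: OLD=$UV_PROJECT_ENVIRONMENT would abort here, since the variable is unset.
# Safe: "${VAR-}" expands to the empty string when VAR is unset, even under set -u.
OLD_UV_PROJECT_ENVIRONMENT=${UV_PROJECT_ENVIRONMENT-}

# Restore only if it previously existed, so we never export an empty value.
if [ -n "$OLD_UV_PROJECT_ENVIRONMENT" ]; then
  export UV_PROJECT_ENVIRONMENT=$OLD_UV_PROJECT_ENVIRONMENT
fi
echo "survived: OLD='${OLD_UV_PROJECT_ENVIRONMENT}'"
```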

```diff
-OLD_UV_PROJECT_ENVIRONMENT=$UV_PROJECT_ENVIRONMENT
-unset UV_PROJECT_ENVIRONMENT
+OLD_UV_PROJECT_ENVIRONMENT=${UV_PROJECT_ENVIRONMENT-}
+if [[ -v UV_PROJECT_ENVIRONMENT ]]; then
+  unset UV_PROJECT_ENVIRONMENT
+fi
@@
-export UV_PROJECT_ENVIRONMENT=$OLD_UV_PROJECT_ENVIRONMENT
-if [[ -n "$UV_PROJECT_ENVIRONMENT" ]]; then
-    # We optionally set this if the project environment is outside of the project directory.
-    # If we do not set this then uv pip install commands will fail
-    export VIRTUAL_ENV=$UV_PROJECT_ENVIRONMENT
-fi
+if [[ -n "$OLD_UV_PROJECT_ENVIRONMENT" ]]; then
+    export UV_PROJECT_ENVIRONMENT=$OLD_UV_PROJECT_ENVIRONMENT
+    # We optionally set this if the project environment is outside of the project directory.
+    # If we do not set this then uv pip install commands will fail
+    export VIRTUAL_ENV=$UV_PROJECT_ENVIRONMENT
+fi
```
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

```sh
OLD_UV_PROJECT_ENVIRONMENT=${UV_PROJECT_ENVIRONMENT-}
if [[ -v UV_PROJECT_ENVIRONMENT ]]; then
    unset UV_PROJECT_ENVIRONMENT
fi
uv venv

# Remove all comments from requirements files to prevent use_existing_torch.py from incorrectly removing xformers
echo "Removing comments from requirements files..."
find requirements/ -name "*.txt" -type f -exec sed -i 's/#.*$//' {} \; 2>/dev/null || true
find requirements/ -name "*.txt" -type f -exec sed -i '/^[[:space:]]*$/d' {} \; 2>/dev/null || true
# Replace xformers==.* (but preserve any platform markers at the end)
# NOTE: xformers is bumped from 0.0.30 to 0.0.31 to work with torch==2.7.1. This version may need to change when we upgrade torch.
find requirements/ -name "*.txt" -type f -exec sed -i -E 's/^(xformers)==[^;[:space:]]*/\1==0.0.31/' {} \; 2>/dev/null || true

uv run --no-project use_existing_torch.py

# Install dependencies
echo "Installing dependencies..."
uv pip install --upgrade pip
uv pip install numpy setuptools setuptools_scm
uv pip install torch==2.7.1 --torch-backend=cu128

# Install vLLM using precompiled wheel
echo "Installing vLLM with precompiled wheel..."
uv pip install --no-build-isolation -e .

echo "Build completed successfully!"
echo "The built vLLM is available in: $BUILD_DIR"

echo "Updating repo pyproject.toml to point vLLM to local clone..."

PYPROJECT_TOML="$REPO_ROOT/pyproject.toml"
if [[ ! -f "$PYPROJECT_TOML" ]]; then
    echo "[ERROR] pyproject.toml not found at $PYPROJECT_TOML. This script must be run from the repo root and pyproject.toml must exist."
    exit 1
fi

cd "$REPO_ROOT"

if [[ -n "$OLD_UV_PROJECT_ENVIRONMENT" ]]; then
    export UV_PROJECT_ENVIRONMENT=$OLD_UV_PROJECT_ENVIRONMENT
    # We optionally set this if the project environment is outside of the project directory.
    # If we do not set this then uv pip install commands will fail
    export VIRTUAL_ENV=$UV_PROJECT_ENVIRONMENT
fi
```

@terrykong terrykong added the CI:docs Run doctest label Oct 16, 2025
@terrykong terrykong enabled auto-merge (squash) October 16, 2025 17:14
@terrykong terrykong merged commit 6989bc3 into r0.4.0 Oct 16, 2025
66 of 69 checks passed
@terrykong terrykong deleted the cherry-pick-1116-r0.4.0 branch October 16, 2025 17:36
terrykong added a commit that referenced this pull request Nov 19, 2025
…1377)

Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
Co-authored-by: Terry Kong <terrycurtiskong@gmail.com>

Labels

  • cherry-pick
  • CI:docs (Run doctest)
  • documentation (Improvements or additions to documentation)
  • Run CICD
