[CI] Replace tox with nox, use nemo:26.04 for megatron tests, and simplify CI workflows#1286
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughRenamed CI jobs for Megatron, adjusted ONNX/TensorRT pip extras, removed an uninstall step in the example runner, extended GPU test triggers and changed GPU job matrix/install behavior, replaced local checkpoint helpers with shared test utilities, and updated tox GPU envs and pip invocation. Changes
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes 🚥 Pre-merge checks | ✅ 3 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.github/workflows/gpu_tests.yml:
- Line 79: The container_image value includes the registry prefix causing
double-prefixing when the workflow later prepends nvcr.io/nvidia/ to
matrix.container_image; update the matrix entry named container_image (the value
currently "nvcr.io/nvidia/nemo:26.04") to just "nemo:26.04" so that the later
composition nvcr.io/nvidia/${{ matrix.container_image }} produces a valid image
path.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: 858abd58-c526-4d96-841a-619dad94dcc2
📒 Files selected for processing (9)
.github/workflows/_example_tests_runner.yml.github/workflows/example_tests.yml.github/workflows/gpu_tests.ymlexamples/llm_distill/README.mdexamples/llm_ptq/README.mdexamples/megatron_bridge/README.mdexamples/pruning/README.mdtests/gpu_megatron/torch/peft/plugins/test_megatron_peft.pytox.ini
💤 Files with no reviewable changes (1)
- .github/workflows/_example_tests_runner.yml
|
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #1286 +/- ##
==========================================
- Coverage 76.83% 76.14% -0.70%
==========================================
Files 461 459 -2
Lines 49523 49153 -370
==========================================
- Hits 38052 37428 -624
- Misses 11471 11725 +254
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
8b4ac5f to
32ff674
Compare
32ff674 to
e7616cf
Compare
…iliency-ext dependency (#1285) - `megatron-core==0.17.0` released yesterday which requires nightly version of `nvidia-resiliency-ext` for an import. Pre-installed version in DLFW Pytorch container is `nvidia-resiliency-ext==0.5.0` - Temporarily pin `mcore<0.17.0` to unblock PR from merging. - Pin `pulp<4.0` as it has some breaking changes and release imminent Correct fix is to just use `nemo:26.04` container instead of PyTorch container for megatron-based tests since it always has correct combination of all packages needed for the megatron ecosystem - Done in #1286 --------- Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
e7616cf to
d1faaa7
Compare
There was a problem hiding this comment.
🧹 Nitpick comments (1)
tox.ini (1)
91-91: Keep TRT-LLM container version comment in sync with workflow.Line 91 references
release:1.2.0, while CI config currently points torelease:1.3.0rc10. Updating this comment would avoid confusion.Suggested comment-only fix
-# Container: nvcr.io/nvidia/tensorrt-llm/release:1.2.0 or later +# Container: nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc10 or later🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tox.ini` at line 91, Update the TRT-LLM container version comment in tox.ini (the comment line referencing nvcr.io/nvidia/tensorrt-llm/release:1.2.0) to match the CI workflow's version (release:1.3.0rc10) so the in-file comment stays in sync with the workflow configuration.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@tox.ini`:
- Line 91: Update the TRT-LLM container version comment in tox.ini (the comment
line referencing nvcr.io/nvidia/tensorrt-llm/release:1.2.0) to match the CI
workflow's version (release:1.3.0rc10) so the in-file comment stays in sync with
the workflow configuration.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: e7eac202-9711-4140-bf9f-93d3bfc2f4e3
📒 Files selected for processing (5)
.github/workflows/_example_tests_runner.yml.github/workflows/example_tests.yml.github/workflows/gpu_tests.ymltests/gpu_megatron/torch/peft/plugins/test_megatron_peft.pytox.ini
💤 Files with no reviewable changes (1)
- .github/workflows/_example_tests_runner.yml
🚧 Files skipped from review as they are similar to previous changes (1)
- .github/workflows/gpu_tests.yml
d1faaa7 to
5d62dc3
Compare
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tox.ini`:
- Around line 83-87: Update the container comment in the tox.ini testenv block
so it matches the CI-pinned image: replace "nvcr.io/nvidia/nemo:26.02 or later"
with the exact pinned "nvcr.io/nvidia/nemo:26.02"; this comment is adjacent to
the [testenv:cuda13-gpu-megatron] section (see the commands_pre = python -m pip
install -e .[hf,dev-test]) so you can find and update the line there to avoid
implying newer images are tested.
- Line 64: The VCS pip install line "python -m pip install --no-build-isolation
git+https://github.com/Dao-AILab/fast-hadamard-transform.git" pins to the
default branch and is non-reproducible; change it to a fixed immutable ref by
appending either @<tag> or @<commit-sha> (for example
git+https://github.com/Dao-AILab/fast-hadamard-transform.git@vX.Y.Z or @<sha>)
so the tox env installs a specific, immutable release; update any other GitHub
installs on the same lines similarly to use explicit tags or SHAs.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: 9de23443-0ea3-4609-96ca-609fdc3d29b5
📒 Files selected for processing (5)
.github/workflows/_example_tests_runner.yml.github/workflows/example_tests.yml.github/workflows/gpu_tests.ymltests/gpu_megatron/torch/peft/plugins/test_megatron_peft.pytox.ini
💤 Files with no reviewable changes (1)
- .github/workflows/_example_tests_runner.yml
✅ Files skipped from review due to trivial changes (1)
- tests/gpu_megatron/torch/peft/plugins/test_megatron_peft.py
🚧 Files skipped from review as they are similar to previous changes (2)
- .github/workflows/gpu_tests.yml
- .github/workflows/example_tests.yml
|
@claude review |
4e894ff to
ff11c1c
Compare
c593f82 to
c34ca45
Compare
7e692ed to
7fc1faf
Compare
5cf3b39 to
02aff5e
Compare
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Type of change: New feature / infrastructure improvement Replaces `tox` + `tox-current-env` with `nox` for all test, lint, docs, and wheel build sessions. The primary motivation was that `tox-current-env` is incompatible with uv venvs in NGC containers (e.g. NeMo's `/opt/venv`) — it picks the system Python via `sys._base_executable` instead of the container's venv Python which has megatron packages pre-installed. Key changes: - **`noxfile.py`** replaces `tox.ini` with GPU, CPU unit, partial-install, pre-commit, docs, and wheel sessions - **GPU sessions** use `venv_backend="none"` (run directly in container env) and `python -m pip/pytest` to avoid PATH mismatches - **CPU unit sessions** use 2-level `@nox.parametrize` over torch × transformers versions — any combination is selectable e.g. `nox -s "unit-3.12(torch_211-tf_latest)"` - **uv** is set as the default venv backend for CPU sessions (faster installs); `envdir=/tmp/.nox` avoids permission errors in mounted container directories - All CI workflows updated to use `pip install nox uv && nox -s <session>` ```bash pip install nox uv # install once nox -l # list all sessions nox -s "unit-3.12(torch_211-tf_latest)" # default unit tests nox -s "unit-3.12(torch_28-tf_min)" # torch 2.8 + min transformers nox -s gpu_megatron # run inside NeMo container ``` - Ran `nox -l` to verify all session names - Ran `gpu_megatron` session locally inside NeMo container — confirmed it uses `/opt/venv/bin/python` correctly Make sure you read and follow [Contributor guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md) and your commits are signed (`git commit -s -S`). Make sure you read and follow the [Security Best Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors) (e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(..., weights_only=False)`, `pickle`, etc.). - Is this change backward compatible?: N/A — CI infrastructure only - If you copied code from any other sources or added a new PIP dependency, did you follow guidance in `CONTRIBUTING.md`: ✅ (added `nox` and `uv` to `dev-test`, both Apache-2.0) - Did you write any new necessary tests?: N/A - Did you update [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?: N/A — no user-facing changes Supersedes the tox-current-env workaround in the parent branch. Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
…retrained race condition hang Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
02aff5e to
55c5526
Compare
What does this PR do?
Type of change: New feature / infrastructure improvement
Follow-up to #1285 for correct CI test environment for megatron based tests
Replaces
tox+tox-current-envwithnoxfor all test, lint, docs, and wheel build sessions. The primary motivation was thattox-current-envis incompatible with uv venvs in NGC containers (e.g. NeMo's/opt/venv) — it picks the system Python viasys._base_executableinstead of the container's venv Python which has megatron packages pre-installed.Key changes:
noxfile.pyreplacestox.iniwith GPU, CPU unit, partial-install, pre-commit, docs, and wheel sessionsvenv_backend="none"(run directly in container env) andpython -m pip/pytestto avoid PATH mismatchesAlso includes CI workflow simplifications:
_pr_gate.ymlnew reusable workflow centralizing file-change detection + linux-check wait logic (was duplicated across 3 workflow files)runs-oningpu_tests.yml,example_tests.yml,regression_tests.ymlmulti-py/multi-torch/multi-transformersinto a singlemulti-versionmatrix job inunit_tests.ymlneeds: [pr-gate]were incorrectly skipped when all pr-gate internal jobs were skipped; fixed by making the gate job always runschedule/workflow_dispatchUsage
Testing
nox -lto verify all session namesgpu_megatronsession locally inside NeMo container — confirmed it uses/opt/venv/bin/pythoncorrectlyBefore your PR is "Ready for review"
Make sure you read and follow Contributor guidelines and your commits are signed (
git commit -s -S).Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded
trust_remote_code=True,torch.load(..., weights_only=False),pickle, etc.).CONTRIBUTING.md: ✅ (addednoxanduvtodev-test, both Apache-2.0)Additional Information
Supersedes the tox-current-env workaround in the parent branch.