
[CI] Replace tox with nox, use nemo:26.04 for megatron tests, and simplify CI workflows #1286

Merged
kevalmorabia97 merged 9 commits into main from kmorabia/use-nemo-container-for-megatron-tests on Apr 18, 2026

kevalmorabia97 (Collaborator) commented Apr 17, 2026

What does this PR do?

Type of change: New feature / infrastructure improvement

Follow-up to #1285 that sets up the correct CI test environment for Megatron-based tests.

Replaces tox + tox-current-env with nox for all test, lint, docs, and wheel build sessions. The primary motivation was that tox-current-env is incompatible with uv venvs in NGC containers (e.g. NeMo's /opt/venv) — it picks the system Python via sys._base_executable instead of the container's venv Python which has megatron packages pre-installed.
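
To see the interpreter mismatch described above, it can help to print which Python is running and which one its venv was created from. The following is a minimal diagnostic sketch, not code from this PR. `interpreter_report` is a hypothetical helper name, and `sys._base_executable` is a private CPython attribute, so the sketch falls back to `sys.executable` when it is absent.

```python
import sys


def interpreter_report() -> dict:
    """Collect the interpreter paths that matter when running inside a venv."""
    return {
        # Interpreter actually running this process (the venv's Python, if any).
        "executable": sys.executable,
        # Interpreter the venv was created from (private CPython attribute).
        "base_executable": getattr(sys, "_base_executable", sys.executable),
        "prefix": sys.prefix,
        "base_prefix": sys.base_prefix,
        # Inside a venv, sys.prefix differs from sys.base_prefix.
        "in_venv": sys.prefix != sys.base_prefix,
    }


if __name__ == "__main__":
    for key, value in interpreter_report().items():
        print(f"{key:>16}: {value}")
```

Run inside a container venv such as `/opt/venv`, `executable` would be expected to point at the venv's Python while `base_executable` points at the interpreter the venv was built from. A tool that launches the latter would not see packages installed only in the venv.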

Key changes:

  • noxfile.py replaces tox.ini with GPU, CPU unit, partial-install, pre-commit, docs, and wheel sessions
  • GPU sessions use venv_backend="none" (run directly in container env) and python -m pip/pytest to avoid PATH mismatches
  • uv is set as the default venv backend (if available) for CPU sessions (faster installs)
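
The `python -m pip/pytest` point can be illustrated with a short sketch. It is an assumption-laden illustration rather than an excerpt from `noxfile.py`: `module_command` and `path_tool` are hypothetical helper names. The idea is that `python -m <module>` is tied to a specific interpreter, whereas a bare `pip` or `pytest` is whatever comes first on `PATH`.

```python
from __future__ import annotations

import shutil
import sys


def module_command(module: str, *args: str) -> list[str]:
    """Build an argv that runs `module` with the interpreter executing this script."""
    return [sys.executable, "-m", module, *args]


def path_tool(name: str) -> str | None:
    """Return whichever `name` executable comes first on PATH, if any."""
    return shutil.which(name)


if __name__ == "__main__":
    # Always bound to the current interpreter, regardless of PATH ordering.
    print(module_command("pip", "install", "-e", ".[hf,dev-test]"))
    print(module_command("pytest", "tests/gpu_megatron"))
    # May belong to a different Python installation than sys.executable.
    print("pip on PATH:", path_tool("pip"))
```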

Also includes CI workflow simplifications:

  • _pr_gate.yml: new reusable workflow that centralizes file-change detection and the linux-check wait logic (previously duplicated across 3 workflow files)
  • Collapsed pr/non-pr job pairs into single jobs with conditional runs-on in gpu_tests.yml, example_tests.yml, regression_tests.yml
  • Collapsed multi-py / multi-torch / multi-transformers into a single multi-version matrix job in unit_tests.yml
  • PR path filtering for unit test secondary jobs (multi-version, launcher, partial-install) — skipped if no relevant files changed
  • Fixed schedule/workflow_dispatch skipping — jobs with needs: [pr-gate] were incorrectly skipped when all pr-gate internal jobs were skipped; fixed by making the gate job always run
  • multi-version, launcher, partial-install now also run on schedule / workflow_dispatch
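
The gating behavior described in this list (skip on PRs when no relevant files changed, always run on schedule / workflow_dispatch) can be modeled in a few lines. This is a sketch of the decision only, assuming it reduces to an event name plus a list of changed paths. The real logic lives in the YAML workflows, `should_run` is a hypothetical name, and `fnmatchcase` only approximates GitHub's glob semantics.

```python
from fnmatch import fnmatchcase


def should_run(event_name: str, changed_files: list, patterns: list) -> bool:
    """Decide whether a gated job should run.

    Non-PR events (schedule, workflow_dispatch) always run. Pull requests
    run only if at least one changed file matches one of the patterns.
    """
    if event_name != "pull_request":
        return True
    return any(
        fnmatchcase(path, pattern)
        for path in changed_files
        for pattern in patterns
    )


if __name__ == "__main__":
    patterns = ["tests/gpu_megatron/**", "tests/gpu_trtllm/**"]
    print(should_run("pull_request", ["docs/source/index.rst"], patterns))
    print(should_run("pull_request", ["tests/gpu_megatron/torch/test_x.py"], patterns))
    print(should_run("schedule", [], patterns))
```

Treating the non-PR case as an explicit early return mirrors the fix described above: the gate itself always produces an answer instead of being skipped.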

Usage

python -m pip install nox uv                                                    # install nox and uv (once)
nox -l                                                                          # list all sessions
nox -s gpu_megatron                                                             # run a GPU session (inside container)
nox -s "unit-3.12(torch_211, tf_latest)"                                        # run a specific unit test combination
nox -s "unit-3.12(torch_211, tf_latest)" -R                                     # force-recreate venv (e.g. after dep changes)
COVERAGE_PROCESS_START=pyproject.toml nox -s "unit-3.12(torch_211, tf_latest)"  # with coverage

Testing

Before your PR is "Ready for review"

Make sure you read and follow Contributor guidelines and your commits are signed (git commit -s -S).

Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded trust_remote_code=True, torch.load(..., weights_only=False), pickle, etc.).

  • Is this change backward compatible?: N/A — CI infrastructure only
  • If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: ✅ (added nox and uv to dev-test, both Apache-2.0)
  • Did you write any new necessary tests?: N/A
  • Did you update Changelog?: N/A — no user-facing changes

Additional Information

Supersedes the tox-current-env workaround in the parent branch.

coderabbitai Bot (Contributor) commented Apr 17, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Walkthrough

Renamed CI jobs for Megatron, adjusted ONNX/TensorRT pip extras, removed an uninstall step in the example runner, extended GPU test triggers and changed GPU job matrix/install behavior, replaced local checkpoint helpers with shared test utilities, and updated tox GPU envs and pip invocation.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Workflow job renames & ONNX extras (.github/workflows/example_tests.yml) | Renamed jobs nemo-pr → megatron-pr and nemo-non-pr → megatron-non-pr; updated required-check needs references and changed ONNX/TensorRT runner pip_install_extras from "[all,dev-test]" to "[onnx,hf,dev-test]". |
| Example runner uninstall removal (.github/workflows/_example_tests_runner.yml) | Removed the explicit pip uninstall -y nvidia-modelopt from the "Install dependencies" step; other install flow unchanged. |
| GPU workflow & matrix changes (.github/workflows/gpu_tests.yml) | Added changed-file triggers for tests/gpu_megatron/** and tests/gpu_trtllm/**; renamed matrix examples to use underscores (gpu_regression, gpu_megatron, gpu_trtllm); changed gpu_megatron container to nemo:26.02; altered install/test step: gpu_megatron now does python -m pip install -e .[hf,dev-test] and pytest tests/gpu_megatron --cov, while other matrix entries run tox (installed via python -m pip). |
| Test utilities consolidation (tests/gpu_megatron/torch/peft/plugins/test_megatron_peft.py) | Removed local save_distributed_checkpoint/load_distributed_checkpoint helper wrappers and direct megatron.core.dist_checkpointing usage; now imports save_distributed_checkpoint, load_distributed_checkpoint, and initialize_for_megatron from _test_utils.torch.megatron.utils. |
| Tox GPU envs & pip invocation (tox.ini) | Renamed GPU env headers to use underscores (e.g., cuda13-gpu-regression → cuda13-gpu_regression), introduced/adjusted cuda13-gpu_trtllm and removed the megatron pre-install there; replaced pip ... with python -m pip ... across GPU testenvs; updated comments to reflect container expectations and adjusted pre-install/editable install steps. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Title check | ⚠️ Warning | The PR title mentions replacing tox with nox and using nemo:26.04, but the actual changes use nemo:26.02 and don't fully replace tox with nox across the codebase; tox.ini is modified but not removed. | Update the title to accurately reflect the changes: 'Replace nemo example tests with megatron, use nemo:26.02 for megatron gpu tests' or similar. |

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
|---|---|---|
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Security Anti-Patterns | ✅ Passed | PR introduces no security anti-patterns per SECURITY.md guidelines; checkpoint utilities use megatron's built-in methods without unsafe patterns. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |


coderabbitai Bot (Contributor) left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/workflows/gpu_tests.yml:
- Line 79: The container_image value includes the registry prefix causing
double-prefixing when the workflow later prepends nvcr.io/nvidia/ to
matrix.container_image; update the matrix entry named container_image (the value
currently "nvcr.io/nvidia/nemo:26.04") to just "nemo:26.04" so that the later
composition nvcr.io/nvidia/${{ matrix.container_image }} produces a valid image
path.
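
The double-prefix problem flagged here is easy to reproduce in isolation. The sketch below is illustrative only and assumes the workflow composes the image as prefix plus matrix value; `naive_image` and `full_image` are hypothetical names, and the actual fix in the PR is simply to store `nemo:26.04` in the matrix.

```python
REGISTRY_PREFIX = "nvcr.io/nvidia/"


def naive_image(container_image: str) -> str:
    """Mimic a workflow that always prepends the registry prefix."""
    return REGISTRY_PREFIX + container_image


def full_image(container_image: str) -> str:
    """Prepend the registry prefix only if it is not already present."""
    if container_image.startswith(REGISTRY_PREFIX):
        container_image = container_image[len(REGISTRY_PREFIX):]
    return REGISTRY_PREFIX + container_image


if __name__ == "__main__":
    # Already-prefixed matrix value: naive composition doubles the registry path.
    print(naive_image("nvcr.io/nvidia/nemo:26.04"))
    # Short matrix value, as the review comment suggests.
    print(naive_image("nemo:26.04"))
    # Defensive variant that tolerates either form.
    print(full_image("nvcr.io/nvidia/nemo:26.04"))
```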


📥 Commits

Reviewing files that changed from the base of the PR and between 7e82a5c and 8b4ac5f.

📒 Files selected for processing (9)
  • .github/workflows/_example_tests_runner.yml
  • .github/workflows/example_tests.yml
  • .github/workflows/gpu_tests.yml
  • examples/llm_distill/README.md
  • examples/llm_ptq/README.md
  • examples/megatron_bridge/README.md
  • examples/pruning/README.md
  • tests/gpu_megatron/torch/peft/plugins/test_megatron_peft.py
  • tox.ini
💤 Files with no reviewable changes (1)
  • .github/workflows/_example_tests_runner.yml

Comment thread .github/workflows/gpu_tests.yml Outdated
github-actions Bot (Contributor) commented Apr 17, 2026

PR Preview Action v1.8.1
Preview removed because the pull request was closed.
2026-04-18 16:57 UTC

codecov Bot commented Apr 17, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 76.14%. Comparing base (2b315ed) to head (55c5526).
⚠️ Report is 1 commit behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1286      +/-   ##
==========================================
- Coverage   76.83%   76.14%   -0.70%     
==========================================
  Files         461      459       -2     
  Lines       49523    49153     -370     
==========================================
- Hits        38052    37428     -624     
- Misses      11471    11725     +254     
| Flag | Coverage Δ |
|---|---|
| examples | 41.88% <0.00%> (+0.57%) ⬆️ |
| gpu | 58.89% <100.00%> (-1.82%) ⬇️ |
| regression | 14.98% <0.00%> (?) |
| unit | 52.96% <0.00%> (-0.01%) ⬇️ |


kevalmorabia97 force-pushed the kmorabia/use-nemo-container-for-megatron-tests branch from 8b4ac5f to 32ff674 on April 17, 2026 07:47
kevalmorabia97 changed the title from "Use nemo:26.04 container for megatron gpu tests" to "Use nemo:26.02 container for megatron gpu tests" on Apr 17, 2026
kevalmorabia97 force-pushed the kmorabia/use-nemo-container-for-megatron-tests branch from 32ff674 to e7616cf on April 17, 2026 07:49
kevalmorabia97 added a commit that referenced this pull request Apr 17, 2026
…iliency-ext dependency (#1285)

- `megatron-core==0.17.0` released yesterday which requires nightly
version of `nvidia-resiliency-ext` for an import. Pre-installed version
in DLFW Pytorch container is `nvidia-resiliency-ext==0.5.0`
  - Temporarily pin `mcore<0.17.0` to unblock PR from merging. 
- Pin `pulp<4.0` as it has some breaking changes and release imminent

Correct fix is to just use `nemo:26.04` container instead of PyTorch
container for megatron-based tests since it always has correct
combination of all packages needed for the megatron ecosystem - Done in
#1286

---------

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
kevalmorabia97 force-pushed the kmorabia/use-nemo-container-for-megatron-tests branch from e7616cf to d1faaa7 on April 17, 2026 12:24
coderabbitai Bot (Contributor) left a comment

🧹 Nitpick comments (1)
tox.ini (1)

91-91: Keep TRT-LLM container version comment in sync with workflow.

Line 91 references release:1.2.0, while CI config currently points to release:1.3.0rc10. Updating this comment would avoid confusion.

Suggested comment-only fix
-# Container: nvcr.io/nvidia/tensorrt-llm/release:1.2.0 or later
+# Container: nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc10 or later
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tox.ini` at line 91, Update the TRT-LLM container version comment in tox.ini
(the comment line referencing nvcr.io/nvidia/tensorrt-llm/release:1.2.0) to
match the CI workflow's version (release:1.3.0rc10) so the in-file comment stays
in sync with the workflow configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between e7616cf and d1faaa7.

📒 Files selected for processing (5)
  • .github/workflows/_example_tests_runner.yml
  • .github/workflows/example_tests.yml
  • .github/workflows/gpu_tests.yml
  • tests/gpu_megatron/torch/peft/plugins/test_megatron_peft.py
  • tox.ini
💤 Files with no reviewable changes (1)
  • .github/workflows/_example_tests_runner.yml
🚧 Files skipped from review as they are similar to previous changes (1)
  • .github/workflows/gpu_tests.yml

kevalmorabia97 force-pushed the kmorabia/use-nemo-container-for-megatron-tests branch from d1faaa7 to 5d62dc3 on April 17, 2026 12:31
coderabbitai Bot (Contributor) left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tox.ini`:
- Around line 83-87: Update the container comment in the tox.ini testenv block
so it matches the CI-pinned image: replace "nvcr.io/nvidia/nemo:26.02 or later"
with the exact pinned "nvcr.io/nvidia/nemo:26.02"; this comment is adjacent to
the [testenv:cuda13-gpu-megatron] section (see the commands_pre = python -m pip
install -e .[hf,dev-test]) so you can find and update the line there to avoid
implying newer images are tested.
- Line 64: The VCS pip install line "python -m pip install --no-build-isolation
git+https://github.com/Dao-AILab/fast-hadamard-transform.git" pins to the
default branch and is non-reproducible; change it to a fixed immutable ref by
appending either @<tag> or @<commit-sha> (for example
git+https://github.com/Dao-AILab/fast-hadamard-transform.git@vX.Y.Z or @<sha>)
so the tox env installs a specific, immutable release; update any other GitHub
installs on the same lines similarly to use explicit tags or SHAs.
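
A check along the lines of the second finding could be sketched as follows. This is a hedged illustration, not part of the PR or of pip: `vcs_ref` and `is_commit_sha` are hypothetical helpers, the parsing covers only simple `git+<scheme>://host/path[@ref]` requirements, and the `vX.Y.Z` tag and the SHA used in the demo are placeholders.

```python
from __future__ import annotations

import re


def vcs_ref(requirement: str) -> str | None:
    """Return the ref after '@' in a git+ requirement, or None if unpinned."""
    if not requirement.startswith("git+"):
        return None
    # Drop the "git+" marker and any "#egg=..." style fragment.
    url = requirement[len("git+"):].split("#", 1)[0]
    # Look only at the path so "user@host" in SSH URLs is not mistaken for a ref.
    _, _, rest = url.partition("://")
    _, _, path = rest.partition("/")
    if "@" not in path:
        return None
    return path.rpartition("@")[2]


def is_commit_sha(ref: str) -> bool:
    """True for a full 40-character hexadecimal commit SHA."""
    return re.fullmatch(r"[0-9a-f]{40}", ref) is not None


if __name__ == "__main__":
    base = "git+https://github.com/Dao-AILab/fast-hadamard-transform.git"
    for req in (base, base + "@vX.Y.Z", base + "@" + "0123456789abcdef" * 2 + "01234567"):
        ref = vcs_ref(req)
        print(req, "->", ref, "(sha)" if ref and is_commit_sha(ref) else "")
```

A tag is more readable, while a full commit SHA is the stricter choice because tags can be moved.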

📥 Commits

Reviewing files that changed from the base of the PR and between d1faaa7 and 5d62dc3.

📒 Files selected for processing (5)
  • .github/workflows/_example_tests_runner.yml
  • .github/workflows/example_tests.yml
  • .github/workflows/gpu_tests.yml
  • tests/gpu_megatron/torch/peft/plugins/test_megatron_peft.py
  • tox.ini
💤 Files with no reviewable changes (1)
  • .github/workflows/_example_tests_runner.yml
✅ Files skipped from review due to trivial changes (1)
  • tests/gpu_megatron/torch/peft/plugins/test_megatron_peft.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • .github/workflows/gpu_tests.yml
  • .github/workflows/example_tests.yml

Comment thread tox.ini Outdated
Comment thread tox.ini Outdated
kevalmorabia97 changed the title from "Use nemo:26.02 container for megatron gpu tests" to "[CI] Replace tox with nox and use nemo:26.02 container for megatron gpu tests" on Apr 18, 2026
kevalmorabia97 (Collaborator, Author) commented

@claude review

kevalmorabia97 force-pushed the kmorabia/use-nemo-container-for-megatron-tests branch from 4e894ff to ff11c1c on April 18, 2026 09:50
kevalmorabia97 requested a review from a team as a code owner on April 18, 2026 09:50
kevalmorabia97 requested review from cjluo-nv and removed request for cjluo-nv on April 18, 2026 09:50
kevalmorabia97 force-pushed the kmorabia/use-nemo-container-for-megatron-tests branch 2 times, most recently from c593f82 to c34ca45 on April 18, 2026 09:55
kevalmorabia97 changed the title from "[CI] Replace tox with nox and use nemo:26.02 container for megatron gpu tests" to "[CI] Replace tox with nox and use nemo:26.04 container for megatron gpu tests" on Apr 18, 2026
kevalmorabia97 enabled auto-merge (squash) on April 18, 2026 10:02
kevalmorabia97 force-pushed the kmorabia/use-nemo-container-for-megatron-tests branch from 7e692ed to 7fc1faf on April 18, 2026 13:25
kevalmorabia97 disabled auto-merge on April 18, 2026 14:15
kevalmorabia97 force-pushed the kmorabia/use-nemo-container-for-megatron-tests branch 3 times, most recently from 5cf3b39 to 02aff5e on April 18, 2026 15:36
kevalmorabia97 changed the title from "[CI] Replace tox with nox and use nemo:26.04 container for megatron gpu tests" to "[CI] Replace tox with nox, use nemo:26.04 for megatron tests, and simplify CI workflows" on Apr 18, 2026
kevalmorabia97 and others added 9 commits April 18, 2026 08:54
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Type of change: New feature / infrastructure improvement

Replaces `tox` + `tox-current-env` with `nox` for all test, lint, docs,
and wheel build sessions. The primary motivation was that
`tox-current-env` is incompatible with uv venvs in NGC containers (e.g.
NeMo's `/opt/venv`) — it picks the system Python via
`sys._base_executable` instead of the container's venv Python which has
megatron packages pre-installed.

Key changes:
- **`noxfile.py`** replaces `tox.ini` with GPU, CPU unit,
partial-install, pre-commit, docs, and wheel sessions
- **GPU sessions** use `venv_backend="none"` (run directly in container
env) and `python -m pip/pytest` to avoid PATH mismatches
- **CPU unit sessions** use 2-level `@nox.parametrize` over torch ×
transformers versions — any combination is selectable e.g. `nox -s
"unit-3.12(torch_211-tf_latest)"`
- **uv** is set as the default venv backend for CPU sessions (faster
installs); `envdir=/tmp/.nox` avoids permission errors in mounted
container directories
- All CI workflows updated to use `pip install nox uv && nox -s
<session>`

```bash
pip install nox uv                              # install once
nox -l                                          # list all sessions
nox -s "unit-3.12(torch_211-tf_latest)"         # default unit tests
nox -s "unit-3.12(torch_28-tf_min)"             # torch 2.8 + min transformers
nox -s gpu_megatron                             # run inside NeMo container
```

- Ran `nox -l` to verify all session names
- Ran `gpu_megatron` session locally inside NeMo container — confirmed
it uses `/opt/venv/bin/python` correctly

Make sure you read and follow [Contributor
guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md)
and your commits are signed (`git commit -s -S`).

Make sure you read and follow the [Security Best
Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors)
(e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(...,
weights_only=False)`, `pickle`, etc.).

- Is this change backward compatible?: N/A — CI infrastructure only
- If you copied code from any other sources or added a new PIP
dependency, did you follow guidance in `CONTRIBUTING.md`: ✅ (added `nox`
and `uv` to `dev-test`, both Apache-2.0)
- Did you write any new necessary tests?: N/A
- Did you update
[Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?:
N/A — no user-facing changes

Supersedes the tox-current-env workaround in the parent branch.

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
…retrained race condition hang

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
kevalmorabia97 force-pushed the kmorabia/use-nemo-container-for-megatron-tests branch from 02aff5e to 55c5526 on April 18, 2026 15:54
kevalmorabia97 enabled auto-merge (squash) on April 18, 2026 16:22
kevalmorabia97 merged commit 3d0f0db into main on Apr 18, 2026
61 checks passed
kevalmorabia97 deleted the kmorabia/use-nemo-container-for-megatron-tests branch on April 18, 2026 16:56
3 participants