fix(rm): raise clear error when CP is used with DTensor RM training#2483
Merged
Conversation
Collaborator (Author): /ok to test 01993fb
yuki-97 (Contributor) reviewed May 13, 2026:
One issue: `test_context_parallel_allowed_when_one` uses a broken negative-lookahead regex that will not catch regressions.
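The reviewed test's actual pattern is not shown in this thread, but a minimal illustration (with a hypothetical pattern) shows why a bare negative lookahead is dangerous in a `pytest.raises(match=...)` style assertion: `re.search` tries the zero-width lookahead at every position, so it trivially succeeds somewhere in almost any string.

```python
import re

# Hypothetical pattern meant to assert a message does NOT mention context parallelism.
# It consumes no characters and is not anchored to anything.
pattern = r"(?!context_parallel)"

# re.search slides the lookahead across the string; at the first position where
# the next characters are not literally "context_parallel", it matches (zero-width).
# So the pattern matches ANY string -- including the exact regression it was
# written to catch:
assert re.search(pattern, "context_parallel is supported") is not None
assert re.search(pattern, "anything at all") is not None

# A direct positive assertion on the expected error text is more robust:
msg = "context_parallel_size > 1 is not supported for DTensor RM training"
assert re.search(r"not supported", msg) is not None
```

In short: prefer asserting on text that must be present in the error over asserting on text that must be absent.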
…sor RM training

Context parallelism (`context_parallel_size > 1`) is not supported for reward model training on the DTensor backend because the `log_sigmoid` operator lacks a DTensor sharding strategy for CP meshes. Instead of letting users hit cryptic runtime errors, raise a clear `ValueError` during setup with a link to the tracking issue.

Signed-off-by: Terry Kong <terryk@nvidia.com>
Force-pushed 4157ff0 to d318709 (Compare)
Collaborator (Author): /ok to test d318709
The NeMo Gym docs URL returns 404, causing sphinx-build CI to fail. Add the URL pattern to `linkcheck_ignore` since the external docs site is not under our control.

Signed-off-by: Terry Kong <terryk@nvidia.com>
Collaborator (Author): /ok to test 9c93d8d
Replace the blanket `linkcheck_ignore` for all NeMo Gym docs with pinned v0.2.1 URLs so linkcheck still validates them.

Signed-off-by: Terry Kong <terryk@nvidia.com>
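The idea behind this follow-up commit can be sketched as follows. This is an illustrative reconstruction, not the actual diff: the real NeMo Gym URLs and the exact `conf.py` contents are not shown in this thread, so the URLs below are placeholders.

```python
import re

# Hypothetical URLs standing in for the real NeMo Gym docs links.
unversioned = "https://example.com/nemo-gym/docs/latest/index.html"  # the page that 404'd
pinned = "https://example.com/nemo-gym/docs/v0.2.1/index.html"       # a stable, versioned page

# Before: a blanket linkcheck_ignore entry hid every NeMo Gym link from Sphinx's
# linkcheck builder, so dead links there would go unnoticed.
blanket_ignore = [r"https://example\.com/nemo-gym/docs/.*"]

# After: the docs link to pinned v0.2.1 pages and the blanket entry is dropped,
# so linkcheck actually validates those URLs again.
linkcheck_ignore = []

def is_checked(url: str, ignore_patterns: list[str]) -> bool:
    """True if Sphinx linkcheck would validate this URL (i.e. it is not ignored)."""
    return not any(re.match(p, url) for p in ignore_patterns)

assert not is_checked(pinned, blanket_ignore)   # blanket: never validated
assert is_checked(pinned, linkcheck_ignore)     # pinned + no ignore: validated
```

Sphinx's `linkcheck_ignore` takes a list of regex patterns matched against link URLs; narrowing or removing entries is the standard way to re-enable validation for a site.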
Collaborator (Author): /ok to test 4c24d9e
yuki-97 approved these changes May 15, 2026
Summary

- `rm.setup()` now raises a clear `ValueError` when `context_parallel_size > 1` is set for reward model training on the DTensor backend
- The check is marked `TODO(https://github.com/NVIDIA-NeMo/RL/issues/2482)` for easy cleanup when CP support is implemented

Context

Related to #2482 — this PR does not implement CP support for RM training, it only surfaces a clear error message instead of letting users hit cryptic runtime failures (`Unknown parallel style: local_rowwise` or `NotImplementedError: Operator aten.log_sigmoid_forward.default does not have a sharding strategy registered`).

Test plan

- `uv run --group test pytest tests/unit/algorithms/test_rm.py -v` passes
- `test_context_parallel_rejected_for_dtensor_rm` verifies the `ValueError` fires when CP > 1
- `test_context_parallel_allowed_when_one` verifies CP=1 passes the validation
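The validation described in the summary can be sketched roughly as below. This is a hypothetical reconstruction: the actual config-key names (`dtensor_cfg`, `context_parallel_size`, `enabled`) and where the check lives in NeMo RL's `rm.setup()` are assumptions based on the PR description, not the real code.

```python
def validate_rm_parallelism(policy_cfg: dict) -> None:
    """Reject unsupported context parallelism for DTensor RM training.

    Hypothetical helper illustrating the check this PR adds to rm.setup().
    """
    dtensor_cfg = policy_cfg.get("dtensor_cfg", {})
    cp_size = dtensor_cfg.get("context_parallel_size", 1)
    if dtensor_cfg.get("enabled", False) and cp_size > 1:
        # TODO(https://github.com/NVIDIA-NeMo/RL/issues/2482): remove this guard
        # once log_sigmoid has a DTensor sharding strategy for CP meshes.
        raise ValueError(
            "context_parallel_size > 1 is not supported for reward model "
            "training on the DTensor backend; see "
            "https://github.com/NVIDIA-NeMo/RL/issues/2482"
        )
```

Failing fast here at setup time is preferable to the cryptic mid-training failures quoted in the Context section, and the issue URL in the TODO makes the guard easy to find and delete once CP support lands.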