fix: skip_reference_policy_logprobs_calculation=true crashes training by ShriyaRishab · Pull Request #2174 · NVIDIA-NeMo/RL

ShriyaRishab · 2026-03-30T20:40:00Z

What does this PR do?

Summary

Setting skip_reference_policy_logprobs_calculation=true in the GRPO config crashes training because:

reference_policy_logprobs is never assigned to train_data when skipped
use_reference_model() context manager crashes when no reference state dict exists

Fixes #1968

Root Cause

Three code paths needed fixes:

grpo.py sync path: missing train_data["reference_policy_logprobs"] assignment
grpo.py async path: same
Policy workers: use_reference_model() tries to swap non-existent state dicts

Fix

When skipping is enabled, assign torch.zeros_like(prev_logprobs) to train_data["reference_policy_logprobs"]
Add a _has_reference_model() base method
In get_reference_policy_logprobs(): return zeros if there is no reference model
In all three worker use_reference_model() context managers: yield without swapping if there is no reference state dict
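The worker-side guards described above can be sketched roughly as follows. This is a minimal illustration, not the NeMo RL implementation: `SketchPolicyWorker`, its `model_state` dict, and the toy `"bias"` forward pass are hypothetical stand-ins, and plain Python lists replace tensors so the example runs without PyTorch. Only the names `_has_reference_model`, `use_reference_model`, `get_reference_policy_logprobs`, and `reference_state_dict` come from the PR text.

```python
from contextlib import contextmanager


class SketchPolicyWorker:
    """Hypothetical stand-in for a policy worker; not the NeMo RL class."""

    def __init__(self, policy_state, reference_state_dict=None):
        self.model_state = dict(policy_state)
        # None models the case where no reference state dict was ever created.
        self.reference_state_dict = reference_state_dict

    def _has_reference_model(self):
        return getattr(self, "reference_state_dict", None) is not None

    @contextmanager
    def use_reference_model(self):
        if not self._has_reference_model():
            # Nothing to swap in: run the body against the current weights.
            yield
            return
        saved = self.model_state
        self.model_state = dict(self.reference_state_dict)
        try:
            yield
        finally:
            # Always restore the policy weights, even if the body raised.
            self.model_state = saved

    def get_reference_policy_logprobs(self, prev_logprobs):
        if not self._has_reference_model():
            # Defense in depth: a zeros placeholder with the same length.
            return [0.0 for _ in prev_logprobs]
        with self.use_reference_model():
            # Toy "forward pass": shift each logprob by the active bias.
            return [lp + self.model_state["bias"] for lp in prev_logprobs]
```

The early `yield; return` keeps the context manager usable by callers that do not know whether a reference model exists, which is the behavior the Fix list asks for.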

Issues

List issues that this PR closes (syntax):
#1968

Before your PR is "Ready for review"

Pre checks:

Make sure you read and followed Contributor guidelines
Did you write any new necessary tests?
Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.


Fixes NVIDIA-NeMo#1968: Setting skip_reference_policy_logprobs_calculation=true with reference_policy_kl_penalty=0 crashed training in three ways:

Bug 1: use_reference_model() context manager crash when reference model was never initialized (AttributeError on reference_state_dict).
Fix: Added early-return guard in use_reference_model() for all three worker types (megatron, dtensor v1, dtensor v2) - yields without swapping when reference model is None/missing.

Bug 2: Async GRPO path unconditionally called get_reference_policy_logprobs() without checking the skip flag.
Fix: Added the same skip guard as the sync path, setting zeros_like for reference_policy_logprobs when skipping.

Bug 3: Missing reference_policy_logprobs key in train_data causing shape mismatches downstream in loss computation.
Fix: Both sync and async paths now explicitly set train_data['reference_policy_logprobs'] = zeros_like(prev_logprobs) when skipping.

Also added a _has_reference_model() helper and zeros fallback in base_policy_worker.get_reference_policy_logprobs() as defense-in-depth.
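To see why a zeros placeholder is harmless when the KL penalty is zero, consider the toy sketch below. It assumes a loss of the form "policy term plus kl_penalty times a per-token KL estimate"; the `exp(d) - d - 1` estimator, the function names, and the use of plain lists instead of tensors are all illustrative assumptions, not the NeMo RL loss code.

```python
import math


def k3_kl(curr_lp, ref_lp):
    """Per-token KL estimate exp(d) - d - 1 with d = ref - curr (assumed estimator)."""
    d = ref_lp - curr_lp
    return math.exp(d) - d - 1.0


def attach_reference_logprobs(train_data, skip, compute_reference):
    """Always leave train_data with a 'reference_policy_logprobs' entry."""
    if skip:
        # Placeholder with the same length as prev_logprobs.
        train_data["reference_policy_logprobs"] = [0.0] * len(train_data["prev_logprobs"])
    else:
        train_data["reference_policy_logprobs"] = compute_reference(train_data)
    return train_data


def toy_loss(train_data, curr_logprobs, kl_penalty):
    """Toy loss: mean negative logprob plus a weighted mean KL estimate."""
    n = len(curr_logprobs)
    policy_term = -sum(curr_logprobs) / n
    ref = train_data["reference_policy_logprobs"]
    kl_term = sum(k3_kl(c, r) for c, r in zip(curr_logprobs, ref)) / n
    return policy_term + kl_penalty * kl_term


curr = [-0.2, -0.7]
skipped = attach_reference_logprobs({"prev_logprobs": [-0.25, -0.65]}, True, None)
computed = attach_reference_logprobs(
    {"prev_logprobs": [-0.25, -0.65]}, False, lambda data: [-0.3, -0.6]
)
# With a zero coefficient the reference values never reach the loss.
same_when_zero = toy_loss(skipped, curr, 0.0) == toy_loss(computed, curr, 0.0)
# With a non-zero coefficient the placeholder would change the result.
differs_when_nonzero = toy_loss(skipped, curr, 0.1) != toy_loss(computed, curr, 0.1)
```

The second comparison is the reason the skip flag only makes sense together with reference_policy_kl_penalty=0: with any non-zero penalty, the zeros would be treated as real reference logprobs.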

copy-pr-bot · 2026-03-30T20:40:05Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

guyueh1 · 2026-05-05T23:57:12Z

@ShriyaRishab skip_reference_policy_logprobs_calculation should only skip the computation of reference model logprobs; it should not skip the initialization of the reference model states. Why do we need the changes in base_policy_worker.py?

Cherry-picked PR NVIDIA-NeMo#2174 didn't run ruff format on the worker files it touched. This commit applies the format pass so subsequent diffs stay clean. No functional changes.

Co-authored-by: Cursor <cursoragent@cursor.com>
Signed-off-by: Linglin Jing <linglinj@nvidia.com>

Adds a functional smoke test for the path enabled by PR NVIDIA-NeMo#2178 plus the auto-skip safety net added in response to yuki-97's review:

> and I think it's better to add a functional test (or modify one
> exist functional test) for reference_policy_kl_penalty == 0.

The test runs a 2-step GRPO with reference_policy_kl_penalty=0 and without explicitly setting skip_reference_policy_logprobs_calculation, then asserts:

* the auto-skip log line fires (proves setup() override worked);
* the existing "Reference policy logprob calculation will be skipped" confirmation log fires;
* standard probs_ratio + gen_kl_error metric envelopes pass (PR NVIDIA-NeMo#2174 zeros placeholder keeps loss math valid when KL penalty is zero).

Co-authored-by: Cursor <cursoragent@cursor.com>
Signed-off-by: Linglin Jing <linglinj@nvidia.com>
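The auto-skip override mentioned in this commit message might look roughly like the sketch below. It is a guess at the shape of the logic, not the code in setup(): the function name `apply_auto_skip`, the placement of `reference_policy_kl_penalty` under a `loss_fn` key, and the log message text are all assumptions made for illustration. Only the `grpo.skip_reference_policy_logprobs_calculation` key path appears elsewhere in this thread.

```python
import logging

logger = logging.getLogger("grpo_auto_skip_sketch")


def apply_auto_skip(config):
    """Turn on the skip flag when the KL penalty is zero (hypothetical helper)."""
    grpo_cfg = config["grpo"]
    # Assumed location of the penalty; the real config layout may differ.
    kl_penalty = config["loss_fn"]["reference_policy_kl_penalty"]
    already_skipping = grpo_cfg.get("skip_reference_policy_logprobs_calculation", False)
    if kl_penalty == 0 and not already_skipping:
        # Hypothetical log text; the real auto-skip line is not quoted in the PR.
        logger.info(
            "reference_policy_kl_penalty is 0; enabling "
            "skip_reference_policy_logprobs_calculation automatically"
        )
        grpo_cfg["skip_reference_policy_logprobs_calculation"] = True
    return config


zero_penalty = apply_auto_skip(
    {"grpo": {}, "loss_fn": {"reference_policy_kl_penalty": 0}}
)
nonzero_penalty = apply_auto_skip(
    {"grpo": {}, "loss_fn": {"reference_policy_kl_penalty": 0.01}}
)
```

The override only ever turns the flag on, so a non-zero penalty leaves the user's setting untouched.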

…ratio

Adds two parametrized unit tests in tests/unit/algorithms/test_grpo.py that cover both grpo_train and async_grpo_train:

- test_grpo_train_skips_reference_policy_logprobs_when_configured: guards issue NVIDIA-NeMo#1968 / PRs NVIDIA-NeMo#2174, NVIDIA-NeMo#2178 by asserting that policy.get_reference_policy_logprobs is never called when grpo.skip_reference_policy_logprobs_calculation=True.
- test_grpo_train_skips_prev_logprobs_when_force_on_policy_ratio: guards PR NVIDIA-NeMo#2177 by asserting that policy.get_logprobs is never called when loss_fn.force_on_policy_ratio=True.

Both tests reuse the existing mock_grpo_components fixture and the mock_async_grpo_infrastructure helper so they require no GPU / Ray cluster and run in CI in milliseconds (modulo cold-start import cost).

Co-authored-by: Cursor <cursoragent@cursor.com>
Signed-off-by: Linglin Jing <linglinj@nvidia.com>
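The "never called" assertion pattern these tests rely on can be illustrated with `unittest.mock` alone. In this sketch `toy_train_step` is a hypothetical stand-in for one training step, not grpo_train, and the real tests use the repository's own fixtures rather than a bare `MagicMock`.

```python
from unittest.mock import MagicMock


def toy_train_step(policy, config, batch):
    """Hypothetical stand-in for one training step; not grpo_train itself."""
    prev_logprobs = policy.get_logprobs(batch)
    if config["grpo"]["skip_reference_policy_logprobs_calculation"]:
        # Placeholder with the same length as prev_logprobs.
        reference_logprobs = [0.0] * len(prev_logprobs)
    else:
        reference_logprobs = policy.get_reference_policy_logprobs(batch)
    return {
        "prev_logprobs": prev_logprobs,
        "reference_policy_logprobs": reference_logprobs,
    }


def test_skips_reference_logprobs_when_configured():
    policy = MagicMock()
    policy.get_logprobs.return_value = [-0.1, -0.4]
    config = {"grpo": {"skip_reference_policy_logprobs_calculation": True}}

    out = toy_train_step(policy, config, batch={"input_ids": [1, 2]})

    # The regression guard: the reference pass must not run at all.
    policy.get_reference_policy_logprobs.assert_not_called()
    assert out["reference_policy_logprobs"] == [0.0, 0.0]
```

Because the policy is a mock, the test needs no GPU or cluster, which matches the motivation given in the commit message.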

The two regression tests added in this PR drive `grpo_train` / `async_grpo_train` through code paths that call `torch.zeros_like(prev_logprobs)` (PRs NVIDIA-NeMo#2174 / NVIDIA-NeMo#2178) and `torch.zeros_like(generation_logprobs)` (PR NVIDIA-NeMo#2177). Under the bare `mock_grpo_components` fixture those inputs are `MagicMock` objects, so CI failed with `TypeError: zeros_like(): argument 'input' (position 1) must be Tensor, not MagicMock` at `nemo_rl/algorithms/grpo.py:1801`.

Add a `_patched_logprob_phase` context manager that swaps in real tensors for `policy.get_logprobs`, `policy.get_reference_policy_logprobs`, and `batched_message_log_to_flat_message`, and use it in both the sync and async branches of the two new tests.

Signed-off-by: Linglin Jing <linglinj@nvidia.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
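The failure mode and the patching approach can be reproduced in miniature without PyTorch. The sketch below is an approximation under stated assumptions: `strict_zeros_like` is a hypothetical list-based stand-in that mimics the type check of `torch.zeros_like`, `patched_logprob_phase` only covers the two policy methods (the real `_patched_logprob_phase` also patches `batched_message_log_to_flat_message`), and the `seq_len` parameter is invented for illustration.

```python
from contextlib import contextmanager
from unittest.mock import MagicMock


def strict_zeros_like(values):
    """List-based stand-in for torch.zeros_like that rejects non-list inputs."""
    if not isinstance(values, list):
        raise TypeError(
            f"zeros_like(): argument 'input' must be list, not {type(values).__name__}"
        )
    return [0.0] * len(values)


@contextmanager
def patched_logprob_phase(policy, seq_len=3):
    """Temporarily make the mocked logprob methods return real values."""
    saved = (
        policy.get_logprobs.return_value,
        policy.get_reference_policy_logprobs.return_value,
    )
    policy.get_logprobs.return_value = [-0.5] * seq_len
    policy.get_reference_policy_logprobs.return_value = [-0.6] * seq_len
    try:
        yield policy
    finally:
        # Restore the original mock return values on exit.
        (
            policy.get_logprobs.return_value,
            policy.get_reference_policy_logprobs.return_value,
        ) = saved


policy = MagicMock()

# Bare mock: the "logprobs" are a MagicMock, so the strict helper rejects them.
try:
    strict_zeros_like(policy.get_logprobs(None))
    failed_on_bare_mock = False
except TypeError:
    failed_on_bare_mock = True

# Patched: the same call now receives a real list and succeeds.
with patched_logprob_phase(policy):
    placeholder = strict_zeros_like(policy.get_logprobs(None))
```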

ShriyaRishab requested review from a team as code owners March 30, 2026 20:40

ShriyaRishab mentioned this pull request Mar 30, 2026

fix: skip_reference_policy_logprobs_calculation=true crashes training ShriyaRishab/RL#4

Open

dmvevents mentioned this pull request Apr 12, 2026

skip_reference_policy_logprobs_calculation=true crashes training with RuntimeError / NameError #1968

Closed

terrykong mentioned this pull request May 5, 2026

feat: skip logprob and reference logprob computation under certain conditions #1891

Closed

4 tasks

jinglinglingling mentioned this pull request May 8, 2026

fix: fix skip_reference_policy_logprobs_calculation and skip_prev_logprobs #2443

Merged

ShriyaRishab closed this May 11, 2026

ShriyaRishab wants to merge 1 commit into NVIDIA-NeMo:main from ShriyaRishab:fix/issue-1968
