
fix: update openpipe-art accuracy reward logic#1623

Merged
rapids-bot[bot] merged 1 commit into NVIDIA:develop from
aslanshi:fix/openpipe-art-wrong-acc-eval
Feb 21, 2026

Conversation


@aslanshi aslanshi commented Feb 20, 2026

Description

Fixes a bug in the episode_value_from_states method where the time-decay exponents were computed in reverse order, causing earlier steps to receive less weight instead of more.

Changes

  • Fixed exponent order: Changed np.arange(T, -1, -1) to np.arange(0, T + 1) so that earlier steps (index 0) get weight γ⁰ = 1 (largest) and later steps get weight γᵀ (smallest)
  • Added parameter validation: Added bounds checking for gamma_base and delta_bonus to ensure they fall within the valid range (0, 1]
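The two changes above can be sketched together as follows. This is a minimal illustration, not the actual `accuracy_evaluator.py` code: the function signature, parameter defaults, and the final weighted-average aggregation are assumptions; only the exponent ordering fix and the (0, 1] validation come from this PR.

```python
import numpy as np


def episode_value_from_states(step_rewards, gamma_base=0.9, delta_bonus=0.5):
    """Hypothetical sketch; only the exponent fix and validation mirror the PR."""
    # Parameter validation added by this PR: both must lie in (0, 1].
    if not 0.0 < gamma_base <= 1.0:
        raise ValueError("gamma_base must be in (0, 1]")
    if not 0.0 < delta_bonus <= 1.0:
        raise ValueError("delta_bonus must be in (0, 1]")

    rewards = np.asarray(step_rewards, dtype=float)
    T = len(rewards) - 1
    # Fixed ordering: exponents [0, 1, ..., T], so the earliest step (index 0)
    # gets weight gamma_base**0 = 1 (largest) and the last gets gamma_base**T.
    exponents = np.arange(0, T + 1)
    weights = gamma_base**exponents
    # Assumed aggregation: normalized weighted average of per-step rewards.
    return float(np.sum(weights * rewards) / np.sum(weights))
```

With `gamma_base=0.5` and three steps, the weights are [1, 0.5, 0.25], so a reward earned on the first step now dominates the episode value instead of being discounted the most.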

Closes #1613

By submitting this PR I confirm:

  • I am familiar with the Contributing Guidelines.
  • We require that all contributors "sign-off" on their commits. This certifies that the contribution is your original work, or you have rights to submit it under the same license, or a compatible license.
    • Any contribution which contains commits that are not Signed-Off will not be accepted.
  • When the PR is ready for review, new or existing tests cover these changes.
  • When the PR is ready for review, the documentation is up to date with these changes.

Summary by CodeRabbit

  • Bug Fixes

    • Implemented parameter validation to ensure reinforcement learning configuration values remain within valid ranges, raising errors for invalid inputs.
  • Refactor

    • Modified reward weighting calculation methodology to use forward exponential decay instead of reverse decay, affecting how historical rewards contribute to final episode values.

Signed-off-by: Nanchun Shi <nanchuns@nvidia.com>
@aslanshi aslanshi requested a review from a team as a code owner February 20, 2026 21:30

copy-pr-bot bot commented Feb 20, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.


coderabbitai bot commented Feb 20, 2026

Walkthrough

The change fixes inverted temporal discounting logic in the episode_value_from_states method by adding input validation for decay parameters and reversing exponent ordering to ensure earlier moves receive higher weights instead of lower ones.

Changes

Cohort / File(s) Summary
Temporal Discounting Fix
examples/finetuning/rl_with_openpipe_art/src/rl_with_openpipe_art/accuracy_evaluator.py
Added input validation for gamma_base and delta_bonus parameters (values must be in (0, 1]). Reversed exponent ordering from descending (T, T-1, ..., 0) to ascending (0, 1, ..., T) to correctly prioritize earlier moves with higher weights in temporal discounting calculations.
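The ordering change can be seen directly by comparing the two exponent arrays for a small episode (this snippet only demonstrates the `np.arange` change described above; `T` and `gamma` values are illustrative):

```python
import numpy as np

T = 3  # four steps, indices 0..T
gamma = 0.9

old_exponents = np.arange(T, -1, -1)  # [3, 2, 1, 0]: step 0 got gamma**3 (smallest weight)
new_exponents = np.arange(0, T + 1)   # [0, 1, 2, 3]: step 0 gets gamma**0 = 1 (largest weight)

old_weights = gamma**old_exponents  # increasing over time (the bug)
new_weights = gamma**new_exponents  # decreasing over time (intended behavior)
```

Under the old ordering the last move received the full weight of 1; after the fix that weight belongs to the first move, which is what the temporal discounting was meant to do.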

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'fix: update openpipe-art accuracy reward logic' is concise, descriptive, uses imperative mood, and clearly summarizes the main change to the temporal discounting logic in the accuracy evaluator.
Linked Issues check ✅ Passed The PR fully implements the requirements from issue #1613: it fixes the inverted exponent computation from np.arange(T, -1, -1) to np.arange(0, T+1), adds parameter validation for gamma_base and delta_bonus, and aligns the temporal discounting with intended behavior.
Out of Scope Changes check ✅ Passed All changes are directly aligned with the issue requirements: exponent logic fix, parameter validation, and internal computation adjustments without altering the public API signature.


@willkill07 willkill07 added bug Something isn't working non-breaking Non-breaking change labels Feb 20, 2026

@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (2)
examples/finetuning/rl_with_openpipe_art/src/rl_with_openpipe_art/accuracy_evaluator.py (2)

58-61: Stale "Reverse-discounted" label in the section comment.

After the fix, the exponents run forward [0, 1, …, T], so the section header "Reverse-discounted base" is no longer accurate; it describes the old (buggy) ordering.

♻️ Update comment to match new semantics
-        # 2) Reverse-discounted base in [0,1]
-        #    exponents = [0, 1, ..., T] so that earlier steps (index 0) get
-        #    weight gamma^0 = 1 (largest) and later steps get gamma^T (smallest).
+        # 2) Time-discounted base in [0,1]
+        #    exponents = [0, 1, ..., T] so that earlier steps (index 0) get
+        #    weight gamma^0 = 1 (largest) and later steps get gamma^T (smallest).

45-48: Clarify the misleading comment on line 58.

The comment "Reverse-discounted base" at line 58 contradicts the code on line 61, which uses forward indexing (np.arange(0, T + 1)). While lines 59–60 explain the intent clearly, the label "Reverse-discounted" is confusing. Replace it with a description that matches the actual forward-indexed approach:

-        # 2) Reverse-discounted base in [0,1]
+        # 2) Temporally-discounted base in [0,1]

Note: The ValueError messages on lines 46 and 48 do not trigger Ruff warnings because the TRY rule category is not enabled in pyproject.toml; only E, F, W, I, PL, and UP are selected for linting.


@willkill07
Member

/ok to test 277bc2b

@willkill07
Member

/merge

@rapids-bot rapids-bot bot merged commit c736251 into NVIDIA:develop Feb 21, 2026
17 checks passed

Labels

bug Something isn't working non-breaking Non-breaking change


Development

Successfully merging this pull request may close these issues.

[BUG]: episode_value_from_states weights later moves more heavily than earlier moves

2 participants