
fix: update openpipe-art accuracy reward logic#1623

Merged
rapids-bot[bot] merged 1 commit into NVIDIA:develop from
aslanshi:fix/openpipe-art-wrong-acc-eval
Feb 21, 2026

Conversation


@aslanshi aslanshi commented Feb 20, 2026

Description

Fixes a bug in the episode_value_from_states method where the time-decay exponents were computed in reverse order, causing earlier steps to receive less weight instead of more.

Changes

  • Fixed exponent order: Changed np.arange(T, -1, -1) to np.arange(0, T + 1) so that earlier steps (index 0) get weight γ⁰ = 1 (largest) and later steps get weight γᵀ (smallest)
  • Added parameter validation: Added bounds checking for gamma_base and delta_bonus to ensure they fall within the valid range (0, 1]
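The two changes above can be sketched together as follows. This is a minimal illustration, not the actual `accuracy_evaluator.py` code: the function signature, parameter defaults, and the final weighted-average aggregation are assumptions; only the exponent ordering fix and the (0, 1] validation come from this PR.

```python
import numpy as np


def episode_value_from_states(step_rewards, gamma_base=0.9, delta_bonus=0.5):
    """Hypothetical sketch; only the exponent fix and validation mirror the PR."""
    # Parameter validation added by this PR: both must lie in (0, 1].
    if not 0.0 < gamma_base <= 1.0:
        raise ValueError("gamma_base must be in (0, 1]")
    if not 0.0 < delta_bonus <= 1.0:
        raise ValueError("delta_bonus must be in (0, 1]")

    rewards = np.asarray(step_rewards, dtype=float)
    T = len(rewards) - 1
    # Fixed ordering: exponents [0, 1, ..., T], so the earliest step (index 0)
    # gets weight gamma_base**0 = 1 (largest) and the last gets gamma_base**T.
    exponents = np.arange(0, T + 1)
    weights = gamma_base**exponents
    # Assumed aggregation: normalized weighted average of per-step rewards.
    return float(np.sum(weights * rewards) / np.sum(weights))
```

With `gamma_base=0.5` and three steps, the weights are [1, 0.5, 0.25], so a reward earned on the first step now dominates the episode value instead of being discounted the most.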

Closes #1613

By submitting this PR I confirm:

  • I am familiar with the Contributing Guidelines.
  • We require that all contributors "sign-off" on their commits. This certifies that the contribution is your original work, or you have rights to submit it under the same license, or a compatible license.
    • Any contribution which contains commits that are not Signed-Off will not be accepted.
  • When the PR is ready for review, new or existing tests cover these changes.
  • When the PR is ready for review, the documentation is up to date with these changes.

Summary by CodeRabbit

  • Bug Fixes

    • Implemented parameter validation to ensure reinforcement learning configuration values remain within valid ranges, raising errors for invalid inputs.
  • Refactor

    • Modified reward weighting calculation methodology to use forward exponential decay instead of reverse decay, affecting how historical rewards contribute to final episode values.

Signed-off-by: Nanchun Shi <nanchuns@nvidia.com>
@aslanshi aslanshi requested a review from a team as a code owner February 20, 2026 21:30

copy-pr-bot bot commented Feb 20, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.


coderabbitai bot commented Feb 20, 2026

Walkthrough

The change fixes inverted temporal discounting logic in the episode_value_from_states method by adding input validation for decay parameters and reversing exponent ordering to ensure earlier moves receive higher weights instead of lower ones.

Changes

Cohort / File(s) Summary
Temporal Discounting Fix
examples/finetuning/rl_with_openpipe_art/src/rl_with_openpipe_art/accuracy_evaluator.py
Added input validation for gamma_base and delta_bonus parameters (values must be in (0, 1]). Reversed exponent ordering from descending (T, T-1, ..., 0) to ascending (0, 1, ..., T) to correctly prioritize earlier moves with higher weights in temporal discounting calculations.
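The ordering change can be seen directly by comparing the two exponent arrays for a small episode (this snippet only demonstrates the `np.arange` change described above; `T` and `gamma` values are illustrative):

```python
import numpy as np

T = 3  # four steps, indices 0..T
gamma = 0.9

old_exponents = np.arange(T, -1, -1)  # [3, 2, 1, 0]: step 0 got gamma**3 (smallest weight)
new_exponents = np.arange(0, T + 1)   # [0, 1, 2, 3]: step 0 gets gamma**0 = 1 (largest weight)

old_weights = gamma**old_exponents  # increasing over time (the bug)
new_weights = gamma**new_exponents  # decreasing over time (intended behavior)
```

Under the old ordering the last move received the full weight of 1; after the fix that weight belongs to the first move, which is what the temporal discounting was meant to do.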

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'fix: update openpipe-art accuracy reward logic' is concise, descriptive, uses imperative mood, and clearly summarizes the main change to the temporal discounting logic in the accuracy evaluator.
Linked Issues check ✅ Passed The PR fully implements the requirements from issue #1613: it fixes the inverted exponent computation from np.arange(T, -1, -1) to np.arange(0, T+1), adds parameter validation for gamma_base and delta_bonus, and aligns the temporal discounting with intended behavior.
Out of Scope Changes check ✅ Passed All changes are directly aligned with the issue requirements: exponent logic fix, parameter validation, and internal computation adjustments without altering the public API signature.


@willkill07 willkill07 added bug Something isn't working non-breaking Non-breaking change labels Feb 20, 2026

@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (2)
examples/finetuning/rl_with_openpipe_art/src/rl_with_openpipe_art/accuracy_evaluator.py (2)

58-61: Stale "Reverse-discounted" label in the section comment.

After the fix, the exponents run forward [0, 1, …, T], so the section header "Reverse-discounted base" is no longer accurate; it describes the old (buggy) ordering.

♻️ Update comment to match new semantics
-        # 2) Reverse-discounted base in [0,1]
-        #    exponents = [0, 1, ..., T] so that earlier steps (index 0) get
-        #    weight gamma^0 = 1 (largest) and later steps get gamma^T (smallest).
+        # 2) Time-discounted base in [0,1]
+        #    exponents = [0, 1, ..., T] so that earlier steps (index 0) get
+        #    weight gamma^0 = 1 (largest) and later steps get gamma^T (smallest).

45-48: Clarify the misleading comment on line 58.

The comment "Reverse-discounted base" at line 58 contradicts the code on line 61, which uses forward indexing (np.arange(0, T + 1)). While lines 59–60 explain the intent clearly, the label "Reverse-discounted" is confusing. Replace it with a description that matches the actual forward-indexed approach:

-        # 2) Reverse-discounted base in [0,1]
+        # 2) Temporally-discounted base in [0,1]

Note: The ValueError messages on lines 46 and 48 do not trigger Ruff warnings because the TRY rule category is not enabled in pyproject.toml; only E, F, W, I, PL, and UP are selected for linting.


@willkill07
Member

/ok to test 277bc2b

@willkill07
Member

/merge

@rapids-bot rapids-bot bot merged commit c736251 into NVIDIA:develop Feb 21, 2026
17 checks passed

Labels

bug Something isn't working non-breaking Non-breaking change


Development

Successfully merging this pull request may close these issues.

[BUG]: episode_value_from_states weights later moves more heavily than earlier moves

2 participants