fix: update openpipe-art accuracy reward logic#1623
fix: update openpipe-art accuracy reward logic#1623rapids-bot[bot] merged 1 commit intoNVIDIA:developfrom
Conversation
Signed-off-by: Nanchun Shi <nanchuns@nvidia.com>
WalkthroughThe change fixes inverted temporal discounting logic in the Changes
Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches
🧪 Generate unit tests (beta)
Tip Issue Planner is now in beta. Read the docs and try it out! Share your feedback on Discord. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (2)
examples/finetuning/rl_with_openpipe_art/src/rl_with_openpipe_art/accuracy_evaluator.py (2)
58-61: Stale "Reverse-discounted" label in the section comment.After the fix, the exponents run forward
[0, 1, …, T], so the section header "Reverse-discounted base" is no longer accurate; it describes the old (buggy) ordering.♻️ Update comment to match new semantics
- # 2) Reverse-discounted base in [0,1] - # exponents = [0, 1, ..., T] so that earlier steps (index 0) get - # weight gamma^0 = 1 (largest) and later steps get gamma^T (smallest). + # 2) Time-discounted base in [0,1] + # exponents = [0, 1, ..., T] so that earlier steps (index 0) get + # weight gamma^0 = 1 (largest) and later steps get gamma^T (smallest).🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/finetuning/rl_with_openpipe_art/src/rl_with_openpipe_art/accuracy_evaluator.py` around lines 58 - 61, The comment above the exponents array is outdated: it calls the section "Reverse-discounted base" but the code now sets exponents = np.arange(0, T + 1) (forward order 0..T). Update the comment to reflect the new semantics (e.g., "Discounted/forward-discounted base in [0,1]" or "Exponents 0..T so earlier steps get larger weight gamma^0") so it matches the exponents variable and behavior used in accuracy_evaluator.py.
45-48: Clarify the misleading comment on line 58.The comment "Reverse-discounted base" at line 58 contradicts the code on line 61, which uses forward indexing (
np.arange(0, T + 1)). While lines 59–60 explain the intent clearly, the label "Reverse-discounted" is confusing. Replace it with a description that matches the actual forward-indexed approach:- # 2) Reverse-discounted base in [0,1] + # 2) Temporally-discounted base in [0,1]Note: The ValueError messages on lines 46 and 48 do not trigger Ruff warnings because the TRY rule category is not enabled in
pyproject.toml; only E, F, W, I, PL, and UP are selected for linting.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/finetuning/rl_with_openpipe_art/src/rl_with_openpipe_art/accuracy_evaluator.py` around lines 45 - 48, The comment "Reverse-discounted base" is misleading because the code uses forward indexing (np.arange(0, T + 1)) to build discount factors; update the comment near the calculation that uses np.arange(0, T + 1) to describe a forward-indexed discount vector (e.g., "Forward-indexed discount factors / discount base for timesteps 0..T") so it accurately reflects the implementation that uses gamma_base and delta_bonus to compute per-timestep discounts.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In
`@examples/finetuning/rl_with_openpipe_art/src/rl_with_openpipe_art/accuracy_evaluator.py`:
- Around line 58-61: The comment above the exponents array is outdated: it calls
the section "Reverse-discounted base" but the code now sets exponents =
np.arange(0, T + 1) (forward order 0..T). Update the comment to reflect the new
semantics (e.g., "Discounted/forward-discounted base in [0,1]" or "Exponents
0..T so earlier steps get larger weight gamma^0") so it matches the exponents
variable and behavior used in accuracy_evaluator.py.
- Around line 45-48: The comment "Reverse-discounted base" is misleading because
the code uses forward indexing (np.arange(0, T + 1)) to build discount factors;
update the comment near the calculation that uses np.arange(0, T + 1) to
describe a forward-indexed discount vector (e.g., "Forward-indexed discount
factors / discount base for timesteps 0..T") so it accurately reflects the
implementation that uses gamma_base and delta_bonus to compute per-timestep
discounts.
|
/ok to test 277bc2b |
|
/merge |
Description
Fixes a bug in the episode_value_from_states method where the time-decay exponents were computed in reverse order, causing earlier steps to receive less weight instead of more.
Changes
np.arange(T, -1, -1)tonp.arange(0, T + 1)so that earlier steps (index 0) get weight γ⁰ = 1 (largest) and later steps get weight γᵀ (smallest)gamma_baseanddelta_bonusto ensure they fall within valid ranges (0, 1]Closes #1613
By Submitting this PR I confirm:
Summary by CodeRabbit
Bug Fixes
Refactor