perf: math rubric skip overlong answers #1046

Merged: willccbb merged 5 commits into main from fix/math-rubric-strict-extract on Mar 20, 2026
Conversation

@mikasenghaas (Member) commented Mar 20, 2026

Description

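Two fixes to improve the performance of the math rubric at high concurrency: strict extraction of \boxed{} answers, and a max_verify_chars guard that skips math_verify for overlong responses.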

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Test improvement

Testing

  • All existing tests pass when running uv run pytest locally.
  • New tests have been added to cover the changes

Checklist

  • My code follows the style guidelines of this project as outlined in AGENTS.md
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • Any dependent changes have been merged and published

Additional Notes


Note

Medium Risk
Moderate risk because it changes MathRubric scoring behavior (unboxed answers now score 0 and long parsed responses are skipped), which can affect training/eval metrics; changes are localized and guarded by configurable limits.

Overview
MathRubric now enforces boxed-format answers by default and avoids expensive verification on huge outputs. It switches the default parser extractor to extract_boxed_answer(strict=True), so responses without a \boxed{} final answer no longer get passed through to symbolic parsing.

Adds a configurable max_verify_chars (default 50_000) and skips math_verify when the parsed response exceeds this limit (with a warning), improving throughput under high concurrency.
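
The length guard described above can be sketched roughly as follows. This is a minimal illustration, not the MathRubric implementation: `score_answer` and the injected `verify` callable are hypothetical names, and returning 0.0 for a skipped response is an assumption (the summary only says verification is skipped with a warning).

```python
import logging
from typing import Callable

logger = logging.getLogger(__name__)

MAX_VERIFY_CHARS = 50_000  # default value reported in the PR summary


def score_answer(
    parsed: str,
    expected: str,
    verify: Callable[[str, str], bool],  # hypothetical stand-in for the verifier
    max_verify_chars: int = MAX_VERIFY_CHARS,
) -> float:
    """Score a parsed answer, skipping verification for overlong input."""
    if not parsed:
        # Nothing was extracted (e.g. no boxed answer), so there is nothing to verify.
        return 0.0
    if len(parsed) > max_verify_chars:
        # Assumption: a skipped response scores 0.0.
        logger.warning(
            "Skipping verification: parsed response has %d chars (limit %d)",
            len(parsed),
            max_verify_chars,
        )
        return 0.0
    return 1.0 if verify(parsed, expected) else 0.0
```

Checking the length before calling the verifier keeps the expensive path off the hot loop entirely, which is cheaper than starting verification and relying on a timeout.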

Updates extract_boxed_answer API/docs to support strict mode, and adjusts tests to require boxed completions and to cover the new length limit behavior.

Written by Cursor Bugbot for commit 8d5dbbe.

@mikasenghaas changed the title from "math rubric strict extract boxed answer" to "perf: math rubric skip overlong answers" on Mar 20, 2026
…nses

Add strict_extract_boxed_answer that returns empty string on no \boxed{}
match (instead of returning the full text). Add max_verify_chars guard
to MathRubric to skip math_verify for responses exceeding 50k chars,
preventing thread pool starvation from pathologically long expressions.
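
To illustrate the starvation effect mentioned above, here is a small standalone sketch using only the standard library. It does not reproduce the rubric's actual executor setup; `slow_verify`, `fast_verify`, and the single-worker pool are hypothetical stand-ins chosen to make the effect deterministic.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def demo_starvation() -> tuple[bool, bool]:
    """Show that one long-running task can block queued work in a bounded pool."""
    started = threading.Event()
    release = threading.Event()

    def slow_verify() -> str:
        # Stand-in for verifying a pathologically long expression.
        started.set()
        release.wait()
        return "slow"

    def fast_verify() -> str:
        # Stand-in for an ordinary, cheap verification.
        return "fast"

    with ThreadPoolExecutor(max_workers=1) as pool:
        pool.submit(slow_verify)
        started.wait()  # the only worker is now occupied
        fast = pool.submit(fast_verify)
        blocked = not fast.done()  # queued behind the slow task
        release.set()  # let the slow task finish
        finished = fast.result(timeout=5) == "fast"
    return blocked, finished
```

With more workers the same thing happens once enough slow tasks arrive concurrently, which is why rejecting overlong inputs before they reach the pool helps.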

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@mikasenghaas mikasenghaas force-pushed the fix/math-rubric-strict-extract branch from 7f37a0f to 9aeb520 Compare March 20, 2026 15:44
@mikasenghaas mikasenghaas requested a review from willccbb March 20, 2026 15:46
@mikasenghaas mikasenghaas marked this pull request as ready for review March 20, 2026 15:46
chopratejas and others added 2 commits March 20, 2026 16:32
Problem:
When a completion contains no \boxed{} tag, extract_boxed_answer returns
the entire input text. This is passed to math_verify, which matches any
number in the text — allowing a model to get correct-answer credit by
mentioning the answer anywhere without using \boxed{}.

During RL training, this means a model can skip the \boxed{} format
entirely and still score 1.0 by embedding the correct number in its
reasoning text. The strategy scoreboard from rewardprobe shows the
impact: "correct_lazy" (just outputting the answer) scores 1.0, while
"perfect" (full reasoning + boxed answer) scores only 0.67.
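
The loophole described above can be reproduced with a toy model. Everything here is a hypothetical stand-in: `toy_verify` imitates a permissive verifier that accepts any matching number in the text, and `extract` is a simplified extractor that only handles boxes without nested braces.

```python
import re

NUMBER = re.compile(r"-?\d+(?:\.\d+)?")
BOXED = re.compile(r"\\boxed\{([^{}]*)\}")


def toy_verify(parsed: str, expected: str) -> bool:
    # Permissive stand-in: any number appearing in the text may match.
    return expected in NUMBER.findall(parsed)


def extract(text: str, strict: bool) -> str:
    match = BOXED.search(text)
    if match:
        return match.group(1)
    # Lenient mode falls back to the full text; strict mode returns "".
    return "" if strict else text


def reward(text: str, expected: str, strict: bool) -> float:
    return 1.0 if toy_verify(extract(text, strict), expected) else 0.0


unboxed = "It could be 7, or maybe 42."
boxed = "Therefore the answer is \\boxed{42}."
```

In this toy setup, `reward(unboxed, "42", strict=False)` gives credit even though no final answer was committed to, while `strict=True` only rewards the boxed completion.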

Fix:
Add a `strict` parameter to extract_boxed_answer (default: False).
When strict=True, returns "" on no match instead of the full text.
MathRubric now uses strict=True via functools.partial.

This is backwards compatible:
- extract_boxed_answer(text) still returns text (default strict=False)
- Only MathRubric's parser uses strict=True
- Other callers (rlm_env.py, etc.) are unaffected
- Tests updated to use \boxed{} format in completions
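
A minimal sketch of the behaviour listed above, assuming a brace-matching extractor that takes the last \boxed{...} in the text. This is an illustrative re-implementation, not the library's code; only the `strict` parameter, its default of False, and the use of functools.partial come from the commit message.

```python
from functools import partial


def extract_boxed_answer(text: str, strict: bool = False) -> str:
    # Illustrative re-implementation; choosing the last box is an assumption.
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        # No boxed answer: strict returns "", the default returns the input.
        return "" if strict else text
    begin = start + len(marker)
    depth = 1
    for i in range(begin, len(text)):
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return text[begin:i]
    # Assumption: unbalanced braces are treated like a missing box.
    return "" if strict else text


# How a rubric could bind strict mode without changing other callers.
strict_extract = partial(extract_boxed_answer, strict=True)
```

Binding `strict=True` with `partial` leaves the function's default untouched, so existing callers that rely on the full-text fallback keep their behaviour.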

Found using rewardprobe (https://github.com/chopratejas/rewardprobe).

Wrap the timeout test completion in \boxed{} so the strict parser can
extract it, and raise max_verify_chars to allow the 100k-char string
through the length check to actually exercise the timeout logic.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@mikasenghaas mikasenghaas force-pushed the fix/math-rubric-strict-extract branch from 67eb88b to 823541a Compare March 20, 2026 16:33
Now redundant since extract_boxed_answer supports strict=True directly
via the cherry-picked fix from chopratejas.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 2 potential issues.

The invalid-answer tests were still passing raw completions without
\boxed{}, so the strict parser returned "" before math_verify ran.
Wrapping in \boxed{} ensures the tests exercise actual math verification.

Also update docs/reference.md with the new extract_boxed_answer(strict)
signature.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
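
The testing point above (an unboxed invalid answer never reaches the verifier under strict parsing) can be sketched with a spy verifier. All names here are hypothetical; the real tests exercise MathRubric and math_verify rather than these stand-ins.

```python
import re

BOXED = re.compile(r"\\boxed\{([^{}]*)\}")


def strict_extract(text: str) -> str:
    # Simplified strict extractor: "" when no boxed answer is present.
    match = BOXED.search(text)
    return match.group(1) if match else ""


def make_spy():
    """Return a verifier stand-in plus the list of inputs it was called with."""
    calls = []

    def verify(parsed: str, expected: str) -> bool:
        calls.append(parsed)
        return parsed.strip() == expected

    return verify, calls


def score(completion: str, expected: str, verify) -> float:
    parsed = strict_extract(completion)
    if not parsed:
        return 0.0  # short-circuits before verification
    return 1.0 if verify(parsed, expected) else 0.0


def test_unboxed_invalid_never_reaches_verifier():
    verify, calls = make_spy()
    assert score("wrong", "42", verify) == 0.0
    assert calls == []  # the score is 0.0, but verification was not exercised


def test_boxed_invalid_reaches_verifier():
    verify, calls = make_spy()
    assert score("\\boxed{wrong}", "42", verify) == 0.0
    assert calls == ["wrong"]  # same score, now via the verifier
```

Both cases score 0.0, so only the call record distinguishes a test that exercises verification from one that passes vacuously.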
@willccbb willccbb merged commit fb64a9a into main Mar 20, 2026
6 checks passed

3 participants