feat(env): mask all-tool-failed rollouts (SimpleTIR-style void turn masking) #1416
Open
mvanhorn wants to merge 1 commit into
Summary
`verifiers/envs/tool_env.py`
Extend the `ToolEnv.__init__` signature with `mask_all_failed_tool_calls: bool = False` and store it as `self.mask_all_failed_tool_calls`. Register the metric only when the flag is set: `if mask_all_failed_tool_calls: self.add_metric(self.void_turn_rollouts_metric)`.
Record outcomes inside `env_response`. Relative to the existing function, only the two `tool_messages.append(...)` sites change: one for success, one for the exception path.
Add a `_finalize_state` hook that runs after the rollout loop and sets `state["masked"]` when the flag is on and all outcomes are `"error"`. Override the base `Environment.rollout` post-step OR add this logic at the start of `score_objects` via the rubric (decided in step 4). To keep the blast radius minimal, do it inside `ToolEnv` by overriding `_post_rollout_hook` if it exists; otherwise compute it lazily in the metric callable and in the base rubric mask check.
Concretely, add a `_should_mask` method on `ToolEnv` and wire it into the existing rollout finalization by setting `state["masked"] = self._should_mask(state)` after the `MultiTurnEnv.rollout` loop completes. `verifiers/envs/multiturn_env.py::MultiTurnEnv.rollout` already has a finalization section, and `ToolEnv` overrides nothing today, so add an `async def rollout(self, *args, **kwargs):` that calls `super().rollout(...)` and post-processes `state["masked"]`. (Confirm the signature by reading `multiturn_env.py` before editing.)
`verifiers/rubrics/rubric.py`
In `Rubric.score_objects(state)`, before computing rewards, short-circuit when `state.get("masked")` is truthy. This keeps the JSON output schema stable (downstream parsers will not see missing keys) and sets a single explicit zero reward.
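The environment-side changes described above can be sketched without the library. This is a minimal illustration, not the PR's code: `SketchToolEnv`, its tuple-based tool-call format, and the simplified `rollout` loop are hypothetical stand-ins, and only the names `mask_all_failed_tool_calls`, `tool_call_outcomes`, `masked`, and `_should_mask` come from the description.

```python
import asyncio
from typing import Any, Callable

State = dict[str, Any]  # stand-in for the library's state object (assumption)
ToolCall = tuple[str, dict[str, Any]]  # hypothetical (tool name, kwargs) pair


class SketchToolEnv:
    """Hypothetical stand-in that mimics the described behavior of ToolEnv."""

    def __init__(
        self,
        tools: dict[str, Callable[..., Any]],
        mask_all_failed_tool_calls: bool = False,
    ) -> None:
        self.tools = tools
        self.mask_all_failed_tool_calls = mask_all_failed_tool_calls

    async def env_response(self, tool_calls: list[ToolCall], state: State) -> list[str]:
        """Run each tool call and record "ok" or "error" next to each append."""
        outcomes = state.setdefault("tool_call_outcomes", [])
        tool_messages: list[str] = []
        for name, kwargs in tool_calls:
            try:
                tool_messages.append(str(self.tools[name](**kwargs)))
                outcomes.append("ok")  # success site
            except Exception as exc:
                tool_messages.append(f"error: {exc}")
                outcomes.append("error")  # exception site
        return tool_messages

    def _should_mask(self, state: State) -> bool:
        """True only if the flag is on, a tool was called, and every call failed."""
        outcomes = state.get("tool_call_outcomes", [])
        return (
            self.mask_all_failed_tool_calls
            and len(outcomes) > 0
            and all(outcome == "error" for outcome in outcomes)
        )

    async def rollout(self, turns: list[list[ToolCall]]) -> State:
        """Simplified loop: one env_response per turn, then post-process the mask."""
        state: State = {}
        for tool_calls in turns:
            await self.env_response(tool_calls, state)
        state["masked"] = self._should_mask(state)
        return state
```

Computing the mask once, after the loop, means a late successful tool call can still rescue a rollout whose earlier calls all failed.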
Docs
Add a short paragraph to `verifiers/envs/AGENTS.md` under "Optional flags" describing the new flag and pointing at the SimpleTIR reference.
Why this matters
Issue #315 (filed by @faresobeid, COLLABORATOR) asks for an option to "mask rollouts where all tool calls failed," referencing the SimpleTIR paper (arXiv:2509.02479). In RL training, a trajectory in which every tool invocation failed contributes near-zero signal and can destabilize the gradient. SimpleTIR's contribution is to skip ("mask") these void turns when computing rewards. The verifiers `ToolEnv` and `StatefulToolEnv` already track tool calls, catch exceptions, and reply with an `error_formatter`ed `ToolMessage`, but they do not surface a "this rollout produced only failed tool calls" signal that downstream consumers (rubrics, trainers) can use.
Acceptance:
- `ToolEnv` (and by inheritance `StatefulToolEnv`) tracks per-rollout tool call outcomes in `state` as `state["tool_call_outcomes"]: list[Literal["ok","error"]]`.
- `ToolEnv` gains a constructor flag `mask_all_failed_tool_calls: bool = False` (default off: opt-in, non-breaking).
- When the flag is on and all outcomes are `"error"` (and `len(outcomes) > 0`), the rollout state is marked `state["masked"] = True` and the rubric returns `reward = 0.0` with no per-reward-func contributions, AND a new boolean metric `void_turn_rollouts` is exposed via `add_metric`.
Testing
`tests/envs/test_tool_env_void_mask.py` (new)
`run_two_failing_tool_calls`, `run_one_good_one_bad`, and `run_assistant_only_no_tools` are async helpers that drive a synthetic single rollout through `env.env_response()` directly with a hand-crafted `AssistantMessage` containing `ToolCall` objects, mirroring the style of existing tests under `tests/envs/`.
Run with `uv run pytest tests/envs/test_tool_env_void_mask.py -v`.
Fixes #315
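The scenarios named under Testing can be illustrated with a self-contained sketch. The helper names match the description, but their bodies, the `drive` function, and the two toy tools are assumptions made for illustration. The real tests go through `env.env_response()` with `AssistantMessage` and `ToolCall` objects, which this sketch does not attempt to reproduce.

```python
import asyncio
from typing import Any, Callable

State = dict[str, Any]  # stand-in for the library's state object (assumption)


def failing_tool() -> str:
    """Toy tool that always raises."""
    raise RuntimeError("boom")


def working_tool() -> str:
    """Toy tool that always succeeds."""
    return "done"


async def drive(tools: list[Callable[[], str]], mask_flag: bool) -> State:
    """Hypothetical single rollout: call each tool, record outcomes, then mask."""
    outcomes: list[str] = []
    for tool in tools:
        try:
            tool()
            outcomes.append("ok")
        except Exception:
            outcomes.append("error")
    masked = mask_flag and bool(outcomes) and all(o == "error" for o in outcomes)
    return {"tool_call_outcomes": outcomes, "masked": masked}


async def run_two_failing_tool_calls(mask_flag: bool = True) -> State:
    return await drive([failing_tool, failing_tool], mask_flag)


async def run_one_good_one_bad(mask_flag: bool = True) -> State:
    return await drive([working_tool, failing_tool], mask_flag)


async def run_assistant_only_no_tools(mask_flag: bool = True) -> State:
    return await drive([], mask_flag)


def test_all_failed_is_masked() -> None:
    assert asyncio.run(run_two_failing_tool_calls())["masked"] is True


def test_mixed_outcomes_not_masked() -> None:
    assert asyncio.run(run_one_good_one_bad())["masked"] is False


def test_no_tool_calls_not_masked() -> None:
    assert asyncio.run(run_assistant_only_no_tools())["masked"] is False


def test_flag_off_never_masks() -> None:
    assert asyncio.run(run_two_failing_tool_calls(mask_flag=False))["masked"] is False
```

Wrapping each coroutine in `asyncio.run` keeps the sketch runnable under plain pytest without an async test plugin.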
AI was used for assistance.
Note
Medium Risk
Opt-in masking changes reward computation (zeroing rewards/metrics) and adds new state fields that downstream training/eval pipelines may implicitly rely on; behavior is gated by a new flag but touches core env/rubric scoring paths.
Overview
Adds `mask_all_failed_tool_calls` to `ToolEnv` (and thus `StatefulToolEnv`) to record per-tool-call outcomes in `state["tool_call_outcomes"]`, set `state["masked"]` when at least one tool was called and all outcomes are errors, and emit a new `void_turn_rollouts` metric.
Updates `Rubric.score_rollout` and `Rubric.score_group` to short-circuit masked rollouts to zero reward (and zero out non-metric function contributions), while keeping metric-weight (`weight==0`) functions evaluable. Adds tests covering flag on/off, mixed/no tool calls, and `StatefulToolEnv` behavior, plus docs in `AGENTS.md` describing the new flag and masking semantics.
Reviewed by Cursor Bugbot for commit d54ffee.
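The scoring behavior summarized here can be sketched as follows. This is a rough illustration under stated assumptions: `score_rollout` below is a free function with a hypothetical signature and return shape, not the library's `Rubric.score_rollout`, and `void_turn_rollouts` is written as a plain function of the state.

```python
from typing import Any, Callable

State = dict[str, Any]  # stand-in for the library's state object (assumption)
RewardFunc = Callable[[State], float]


def void_turn_rollouts(state: State) -> float:
    """Metric: 1.0 when the rollout was masked, 0.0 otherwise."""
    return 1.0 if state.get("masked") else 0.0


def score_rollout(
    state: State, funcs: list[RewardFunc], weights: list[float]
) -> dict[str, Any]:
    """Hypothetical scorer that zeroes a masked rollout but keeps every key."""
    masked = bool(state.get("masked"))
    per_func: dict[str, float] = {}
    for func, weight in zip(funcs, weights):
        if masked and weight != 0:
            # Weighted reward functions contribute nothing for a masked rollout.
            per_func[func.__name__] = 0.0
        else:
            # Zero-weight metric functions are still evaluated.
            per_func[func.__name__] = func(state)
    if masked:
        reward = 0.0
    else:
        reward = sum(per_func[f.__name__] * w for f, w in zip(funcs, weights))
    return {"reward": reward, "metrics": per_func}


def correct_answer(state: State) -> float:
    """Toy reward function used only for this example."""
    return 1.0


masked_scores = score_rollout(
    {"masked": True}, [correct_answer, void_turn_rollouts], [1.0, 0.0]
)
unmasked_scores = score_rollout(
    {"masked": False}, [correct_answer, void_turn_rollouts], [1.0, 0.0]
)
```

Both results carry the same keys, so a downstream parser sees the same schema whether or not a rollout was masked.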
Note
Add `mask_all_failed_tool_calls` flag to `ToolEnv` for SimpleTIR-style void turn masking
- `mask_all_failed_tool_calls` constructor parameter to `ToolEnv` and `StatefulToolEnv`; when enabled, rollouts where every tool call fails are marked `state['masked'] = True`.
- Per-tool-call outcomes (`'ok'`/`'error'`) are recorded in `state['tool_call_outcomes']` during `env_response` for both env types.
- `Rubric.score_rollout` and `score_rollouts`; zero-weight metrics still execute.
- `void_turn_rollouts` metric (1.0 when masked, 0.0 otherwise) when the flag is enabled.
- `mask_all_failed_tool_calls=True`.
Macroscope summarized d54ffee.