Merged
…ring building
- zmq_env_server: offload model_dump + msgpack.packb to asyncio.to_thread; serializing large rollout states (O(n_turns^2) token arrays) was blocking the event loop for seconds
- interception_utils: guard _log_request/_log_response with isEnabledFor(DEBUG) to skip O(n_messages) string building on every API call when debug is off

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
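The `isEnabledFor(DEBUG)` guard above follows the standard logging idiom: bail out before doing any expensive message rendering. A minimal sketch (the function and logger names here are illustrative, not the actual interception_utils API):

```python
import logging

logger = logging.getLogger("interception")
logger.setLevel(logging.INFO)

def log_request(messages: list[dict]) -> None:
    # Skip the O(n_messages) string building entirely when DEBUG is off.
    if not logger.isEnabledFor(logging.DEBUG):
        return
    rendered = "\n".join(f"{m['role']}: {m['content']}" for m in messages)
    logger.debug("request:\n%s", rendered)
```

Note the `%s`-style lazy formatting on the `debug` call itself, so even the final string join of the pre-rendered parts stays off the hot path unless a handler actually consumes the record.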
model_dump + msgpack.packb on large rollout states (O(n_turns^2) token arrays) was blocking the event loop for seconds per completion. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
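The offload described here can be sketched with stdlib pieces alone; `pickle` stands in for `msgpack.packb`, and the function name is hypothetical — the point is that CPU-bound serialization of a large state runs in a worker thread instead of blocking the event loop:

```python
import asyncio
import pickle  # stand-in for msgpack.packb in this sketch

async def serialize_state(state: dict) -> bytes:
    # asyncio.to_thread (Python 3.9+) runs the blocking call in the
    # default executor, keeping the event loop free to serve other
    # rollouts while a multi-megabyte state is packed.
    return await asyncio.to_thread(pickle.dumps, state)
```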
… blocks >10s Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ent loop blocks >10s" This reverts commit 0f46882.
CancelledError is a BaseException subclass in Python 3.8+, so `except Exception` missed it when the future was cancelled by unregister_rollout. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Only assign self.logger if not already set, so rubric-mixin hybrid classes keep the logger established by Rubric.__init__. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
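The guard described in this commit is the usual cooperative-mixin pattern: each `__init__` only sets the attribute if no earlier class in the MRO already did. A minimal sketch with hypothetical class names:

```python
import logging

class Rubric:
    def __init__(self):
        self.logger = logging.getLogger("rubric")
        super().__init__()

class EnvMixin:
    def __init__(self):
        # Keep the logger a cooperating base class (e.g. Rubric) already set.
        if getattr(self, "logger", None) is None:
            self.logger = logging.getLogger("env")
        super().__init__()

class HybridEnvRubric(Rubric, EnvMixin):
    # MRO: HybridEnvRubric -> Rubric -> EnvMixin -> object, so
    # Rubric.__init__ runs first and its logger wins.
    pass
```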
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Cursor Bugbot has reviewed your changes and found 2 potential issues.
rasdani reviewed Mar 18, 2026
rasdani approved these changes Mar 18, 2026
snimu added a commit that referenced this pull request Mar 19, 2026
- Use self.logger instead of module-level logger
- Remove stale get_model_response override and _update_main_metrics (main metrics are now computed in the rubric from trajectory)
- Remove unused imports (logging, MessageType, SamplingArgs)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
snimu added a commit that referenced this pull request Mar 19, 2026
* add OpenCodeRLMEnv and smoke-test environment
OpenCodeRLMEnv extends OpenCodeEnv with the snimu/oc RLM plugin for
recursive sub-LLM calls. Sets env vars so the plugin routes llm-subcall
and subagent calls through the interception proxy with model="sub",
enabling concurrent handling and separate token tracking.
Includes opencode-rlm-test environment with 3 tasks exercising basic
bash, llm-subcall, and subagent capabilities.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* add unit tests for OpenCodeRLMEnv (40 tests)
Covers constructor defaults, config generation (including shell
expansion), run command content, env var setup, sub-LLM detection,
state initialization, metrics tracking, and monitor rubric.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: use `set -e` instead of `set -eo pipefail` in RLM run command
SWE-Bench Docker images use sh (dash) as default shell, which doesn't
support the bash-only `pipefail` option.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* remove opencode-rlm-test smoke-test environment
Replaced by opencode-rlm-swe in research-environments.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: prevent sub-LLM tasks from being garbage collected
Store asyncio.create_task references in state["_sub_llm_tasks"] set
and use done callbacks to clean up. Prevents Python from silently
dropping in-flight sub-LLM requests.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
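The garbage-collection hazard this commit fixes is documented for `asyncio.create_task`: the event loop keeps only a weak reference to tasks, so an in-flight task with no other reference can be collected mid-run. The fix pattern (names illustrative) is a strong-reference set plus a done callback:

```python
import asyncio

background_tasks: set[asyncio.Task] = set()

def spawn(coro) -> asyncio.Task:
    # Hold a strong reference so the task can't be garbage collected
    # while in flight; discard it again once the task finishes.
    task = asyncio.create_task(coro)
    background_tasks.add(task)
    task.add_done_callback(background_tasks.discard)
    return task
```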
* fix: catch Exception instead of BaseException in sub-LLM handler
Allows CancelledError and KeyboardInterrupt to propagate for proper
task cancellation during shutdown.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: deliver response before re-raising CancelledError in sub-LLM handler
Catch BaseException to always resolve the HTTP future (preventing
hangs), but re-raise non-Exception types (CancelledError, etc.) after
delivery so task cancellation still propagates correctly.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
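The deliver-then-re-raise shape described above can be sketched as follows. `fake_model_call` is a hypothetical stand-in for the upstream sub-LLM request; the real handler's names differ, but the control flow is the point: resolve the waiting future under `except BaseException`, then re-raise anything that is not a plain `Exception` so cancellation still propagates:

```python
import asyncio

async def fake_model_call() -> str:
    # hypothetical stand-in for the upstream sub-LLM request
    await asyncio.sleep(3600)
    return "ok"

async def handle_sub_llm(fut: asyncio.Future) -> None:
    try:
        result = await fake_model_call()
    except BaseException as exc:
        # Always resolve the waiting HTTP future so the client can't hang...
        if not fut.done():
            fut.set_result({"error": repr(exc)})
        # ...but let CancelledError / KeyboardInterrupt keep propagating.
        if not isinstance(exc, Exception):
            raise
        return
    if not fut.done():
        fut.set_result({"text": result})
```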
* fix: await in-flight sub-LLM tasks before exiting rollout loop
Drain all pending sub-LLM tasks when the agent completes or times out,
ensuring metrics and trajectory updates are finalized before scoring.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
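A drain step like the one described here is typically a gather with `return_exceptions=True`, so one failed sub-LLM task cannot abort finalization of the rest (sketch; the real method name and task-set location are assumptions):

```python
import asyncio

async def drain(tasks: set[asyncio.Task]) -> None:
    # Await every in-flight sub-LLM task so their metrics/trajectory
    # writes land before scoring; return_exceptions keeps a single
    # failure from aborting the drain.
    if tasks:
        await asyncio.gather(*tasks, return_exceptions=True)
```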
* fix: use correct TrajectoryStep schema for sub-LLM trajectory entries
Use `prompt` instead of `prompt_messages`, and include all required
fields (completion, tokens, reward, advantage, is_truncated,
trajectory_id) to match the TrajectoryStep TypedDict.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* docs: add OpenCodeEnv and OpenCodeRLMEnv to environments list
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* use X-RLM-Role header for sub-LLM detection, remove model-name trick
Replace model-name substring matching with an explicit X-RLM-Role: sub
HTTP header set by the OC plugin. The interception server now captures
all request headers in the intercept dict for general-purpose use.
Removes: RLM_SUB_MODEL_ID env var, sub_model_identifier param,
RLM_LLM_SUBCALL_VIA_PROXY env var (llm-subcall now routes through
OPENAI_BASE_URL automatically when set).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* use X-RLM-Role header for sub-LLM detection, remove model-name trick
Replace model-name substring matching with an explicit X-RLM-Role: sub
HTTP header set by the OC plugin. The interception server now captures
all request headers (lowercased) for general-purpose use.
Removes:
- RLM_SUB_MODEL_ID env var and sub_model_identifier param
- RLM_LLM_SUBCALL_VIA_PROXY env var (llm-subcall now routes through
OPENAI_BASE_URL automatically when set)
- Model-name substring matching
Headers are stored with lowercase keys to handle HTTP/2 case
normalization correctly.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
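The lowercase normalization matters because HTTP/2 lowercases header names on the wire while HTTP/1.1 preserves case, so a single lookup key only works if the server normalizes at capture time. A sketch, assuming an ASGI-style list of byte tuples (helper names are illustrative):

```python
def normalize_headers(raw: list[tuple[bytes, bytes]]) -> dict[str, str]:
    # Lowercase keys so headers.get("x-rlm-role") matches regardless of
    # whether the client sent HTTP/1.1 ("X-RLM-Role") or HTTP/2.
    return {k.decode().lower(): v.decode() for k, v in raw}

def is_sub_llm(headers: dict[str, str]) -> bool:
    return headers.get("x-rlm-role") == "sub"
```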
* remove section dividers
* refactor: extract _poll_next_request to share polling logic
Move the tunnel/completion/timeout polling loop from get_prompt_messages
into _poll_next_request on CliAgentEnv. OpenCodeRLMEnv now calls this
helper instead of duplicating the loop.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* simplify OpenCodeRLMEnv: extract polling helper, compute main metrics in rubric
- Extract _poll_next_request into CliAgentEnv so the RLM env reuses
the polling loop instead of duplicating it
- Move main-agent metric computation from get_model_response override
to the rubric (computed from trajectory at scoring time)
- Remove get_model_response override and _update_main_metrics
- Add cleanup handler to cancel in-flight sub-LLM tasks on rollout end
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* use hardcoded intercepted/model in OpenCode config
Replace ${OPENAI_MODEL} shell expansion with a fixed "intercepted/model"
provider/model pair, matching the opencode_harbor pattern. The model
name doesn't matter since all API calls go through the interception
proxy. This fixes the ProviderModelNotFoundError when users pass model
names without a provider/ prefix (e.g. gpt-5-mini instead of
openai/gpt-5-mini).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* align OpenCodeRLMEnv with PR #999 conventions
- Use self.logger instead of module-level logger
- Remove stale get_model_response override and _update_main_metrics
(main metrics are now computed in the rubric from trajectory)
- Remove unused imports (logging, MessageType, SamplingArgs)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: compute main metrics from trajectory in rubric, not state
The get_model_response override and _update_main_metrics were removed
but the rubric was still reading main_* from state (always 0). Now
main_turns/main_prompt_tokens/main_completion_tokens are computed from
state["trajectory"] at scoring time. Sub-LLM metrics remain in state
(accumulated during rollout).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: exclude sub-LLM steps from main metrics in rubric
Filter out trajectory steps with extras.is_sub_llm_call=True when
computing main_turns/main_prompt_tokens/main_completion_tokens.
Prevents double-counting when include_sub_llm_in_trajectory is enabled.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
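The filtering this fix describes can be sketched as below; the step shape (an `extras` dict carrying an `is_sub_llm_call` flag) follows the fields named in these commit messages, while everything else about the trajectory-step layout is an assumption:

```python
def main_agent_metrics(trajectory: list[dict]) -> dict:
    # Drop sub-LLM steps so they aren't double-counted when
    # include_sub_llm_in_trajectory is enabled.
    main_steps = [
        step for step in trajectory
        if not step.get("extras", {}).get("is_sub_llm_call", False)
    ]
    return {"main_turns": len(main_steps)}
```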
* fix: check for first main step, not empty trajectory, for prompt tracking
When include_sub_llm_in_trajectory is enabled, sub-LLM steps can be
appended before the first main step, making len(trajectory) > 0. Use
has_main_step check instead so state["prompt"] is still set correctly
on the first main-agent turn.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: preserve opencode exit code in RLM run command
Replace pipe (cat | opencode run | tee) with redirect + cat so the
script exits with opencode's actual exit code. The pipe masked failures
because set -e only checks the last command in a pipeline (tee).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
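The masking behavior this commit fixes is plain POSIX semantics: a pipeline's exit status is the last command's, and `set -e` only inspects that status. A direct demonstration:

```shell
#!/bin/sh
# A pipeline's status is the LAST command's (tee), so upstream failure
# is invisible to `set -e`:
false | tee /dev/null
echo "pipeline status: $?"    # prints "pipeline status: 0"

# Redirecting instead keeps the command's own status observable:
sh -c 'exit 3' > /tmp/agent.log 2>&1 || echo "command status: $?"
```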
* fix: use set +e around opencode run to capture exit code and logs
set -e would exit the script before _oc_exit capture and log emission.
Temporarily disable with set +e, capture exit code, re-enable, then
cat logs and exit with the real code.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
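The capture sequence described above can be sketched as a POSIX `sh` fragment; `opencode_run` here is a hypothetical stand-in for the real `opencode run` invocation, and the function wrapper exists only to make the pattern testable:

```shell
#!/bin/sh
set -e

# Hypothetical stand-in for the real `opencode run` invocation:
# writes its log, then fails with code 3.
opencode_run() { echo "agent output" > /tmp/oc.log; return 3; }

run_agent() {
  set +e            # keep `set -e` from killing the script here
  opencode_run
  _oc_exit=$?       # capture the real exit code
  set -e
  cat /tmp/oc.log   # emit agent logs regardless of outcome
  return "$_oc_exit"
}
```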
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Summary
- `cli_agent_env.py` with better error handling — agent failures and background job polling errors are now logged/raised properly; per-request timeout increased to 1h
- `opencode_env.py` with adjusted wording, task system prompt, and agent log collection; added new `opencode_qa_env.py` variant
- Sandbox file helpers (`read_file`, file upload to sandbox), simplified `poll_job`, and increased default timeouts from 30s → 1h
- `teardown` lifecycle hook, and `deregister`/`register` flow moved from env → rubric
- `isEnabledFor(DEBUG)` in `interception_utils` to skip O(n_messages) work when debug is off; compact info logs; moved logging utils
- `teardown` support in `rubric.py` and decorator discovery utils for cleanup hooks

Type of Change
Testing
`uv run pytest` locally.

Checklist
Note
Medium Risk
Touches sandbox lifecycle/teardown and scoring flow (including optional deferred sandbox deletion and in-sandbox scoring), which can impact resource cleanup and evaluation correctness. Changes are localized to experimental envs/rubrics but affect long-running timeouts and error propagation.
Overview
Improves the experimental `CliAgentEnv`/OpenCode environments with more robust agent execution handling: better background-job start/poll error surfacing (`AgentError`), simplified tunnel management, longer default request/provider timeouts (to 1h), and richer per-rollout info logging (tool call counts, duration, exit code, errors).

Adds post-rollout data capture and dataset controls: `OpenCodeEnv` now collects and stores agent logs from the sandbox; new `OpenCodeQAEnv` can optionally filter datasets by a difficulty/reward column range.

Extends `SandboxMixin` with sandbox file I/O helpers (`upload_file`, `upload_content`, `read_file`, `upload_bundle`) and explicit `register_sandbox`/`deregister_sandbox` APIs; `CliAgentEnv` gains `keep_sandbox_for_scoring` to defer deletion while still removing the sandbox from active tracking.

Updates `HybridMathRubric` defaults/behavior (fixed math-verify timeouts, optional judge fallback), and introduces `RemoteHybridMathRubric` to run math verification inside the sandbox and delete the sandbox after scoring via a cleanup hook.

Optimizes interception logging by guarding debug log construction and centralizing string truncation in `logging_utils.truncate`.

Written by Cursor Bugbot for commit c37da47.