opencode envs #999

Merged
mikasenghaas merged 55 commits into main from opencode-envs
Mar 18, 2026

Conversation

@mikasenghaas
Member

@mikasenghaas mikasenghaas commented Mar 7, 2026

Summary

  • CLI agent env improvements: Refactored cli_agent_env.py with better error handling — agent failures and background job polling errors are now logged/raised properly, per-request timeout increased to 1h
  • OpenCode env updates: Enhanced opencode_env.py with adjusted wording, task system prompt, and agent log collection; added new opencode_qa_env.py variant
  • Sandbox mixin enhancements: Added file upload helpers (read_file, file upload to sandbox), simplified poll_job, and increased default timeouts from 30s → 1h
  • Hybrid math rubric overhaul: Added sandbox-based scoring support, offline difficulty filtering, remote math verification, rubric teardown lifecycle hook, and deregister/register flow from env → rubric
  • Logging improvements: Guard debug-level log string building behind isEnabledFor(DEBUG) in interception_utils to skip O(n_messages) work when debug is off; compact info logs; moved logging utils
  • Rubric lifecycle: Added teardown support to rubric.py and decorator discovery utils for cleanup hooks
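The debug-guard change in the logging bullet above can be sketched as follows. This is a minimal illustration, not the repo's actual code; the function and logger names are assumptions:

```python
import logging

logger = logging.getLogger("interception")

def log_request(messages: list[dict]) -> None:
    # Building the debug string walks every message (O(n_messages) work);
    # skip it entirely unless DEBUG logging is actually enabled.
    if not logger.isEnabledFor(logging.DEBUG):
        return
    rendered = "\n".join(f"[{m['role']}] {m['content']}" for m in messages)
    logger.debug("request:\n%s", rendered)
```

The guard matters because the expensive part is constructing the log string, which happens before `logger.debug` can discard the record; `isEnabledFor` short-circuits that construction.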

Type of Change

  • New feature (non-breaking change which adds functionality)

Testing

  • All existing tests pass when running uv run pytest locally.
  • New tests have been added to cover the changes

Checklist

  • My code follows the style guidelines of this project as outlined in AGENTS.md
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • Any dependent changes have been merged and published

Note

Medium Risk
Touches sandbox lifecycle/teardown and scoring flow (including optional deferred sandbox deletion and in-sandbox scoring), which can impact resource cleanup and evaluation correctness. Changes are localized to experimental envs/rubrics but affect long-running timeouts and error propagation.

Overview
Improves the experimental CliAgentEnv/OpenCode environments with more robust agent execution handling: better background-job start/poll error surfacing (AgentError), simplified tunnel management, longer default request/provider timeouts (to 1h), and richer per-rollout info logging (tool call counts, duration, exit code, errors).

Adds post-rollout data capture and dataset controls: OpenCodeEnv now collects and stores agent logs from the sandbox; new OpenCodeQAEnv can optionally filter datasets by a difficulty/reward column range.

Extends SandboxMixin with sandbox file I/O helpers (upload_file, upload_content, read_file, upload_bundle) and explicit register_sandbox/deregister_sandbox APIs; CliAgentEnv gains keep_sandbox_for_scoring to defer deletion while still removing the sandbox from active tracking.

Updates HybridMathRubric defaults/behavior (fixed math-verify timeouts, optional judge fallback), and introduces RemoteHybridMathRubric to run math verification inside the sandbox and delete the sandbox after scoring via a cleanup hook.

Optimizes interception logging by guarding debug log construction and centralizing string truncation in logging_utils.truncate.
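A centralized truncation helper of the kind `logging_utils.truncate` describes might look like the sketch below. The signature and marker are assumptions for illustration; the real helper may differ:

```python
def truncate(text: str, max_len: int = 200, marker: str = "…") -> str:
    """Truncate text for log output, appending a marker when content is cut.

    Assumed signature for illustration only; see logging_utils for the
    actual helper.
    """
    if len(text) <= max_len:
        return text
    return text[: max_len - len(marker)] + marker
```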

Written by Cursor Bugbot for commit c37da47.

mikasenghaas and others added 30 commits March 7, 2026 23:51
…ring building

- zmq_env_server: offload model_dump + msgpack.packb to asyncio.to_thread;
  serializing large rollout states (O(n_turns^2) token arrays) was blocking
  the event loop for seconds
- interception_utils: guard _log_request/_log_response with isEnabledFor(DEBUG)
  to skip O(n_messages) string building on every API call when debug is off

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
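The offloading pattern from the commit above can be sketched like this; `serialize` stands in for the real `model_dump` + `msgpack.packb` step so the example stays dependency-free:

```python
import asyncio

def serialize(state: dict) -> bytes:
    # Stand-in for the CPU-bound model_dump + msgpack.packb step over
    # large (O(n_turns^2)) token arrays.
    return repr(state).encode()

async def send_state(state: dict) -> bytes:
    # Run the heavy serialization in a worker thread so the event loop
    # stays responsive for other in-flight rollouts.
    return await asyncio.to_thread(serialize, state)
```

`asyncio.to_thread` only helps here because `msgpack.packb` releases work to C code and the blocking cost is per-call; for truly CPU-saturating workloads a process pool would be the next step.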
model_dump + msgpack.packb on large rollout states (O(n_turns^2) token
arrays) was blocking the event loop for seconds per completion.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… blocks >10s

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@mikasenghaas mikasenghaas requested a review from rasdani March 18, 2026 13:45
@mikasenghaas mikasenghaas marked this pull request as ready for review March 18, 2026 13:47
CancelledError is a BaseException in Python 3.9+, so `except Exception`
missed it when the future was cancelled by unregister_rollout.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
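The failure mode described in the commit above can be reproduced in a few lines: `except Exception` never matches `asyncio.CancelledError`, because it derives from `BaseException`:

```python
import asyncio

async def demo() -> tuple[bool, bool]:
    caught_by_exception = False
    caught_by_cancelled = False

    async def worker() -> None:
        nonlocal caught_by_exception, caught_by_cancelled
        try:
            await asyncio.sleep(10)
        except Exception:
            caught_by_exception = True  # never runs for cancellation
        except asyncio.CancelledError:
            caught_by_cancelled = True
            raise  # re-raise so cancellation still propagates

    task = asyncio.create_task(worker())
    await asyncio.sleep(0)  # let the worker reach its await point
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass
    return caught_by_exception, caught_by_cancelled
```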
Only assign self.logger if not already set, so rubric-mixin hybrid
classes keep the logger established by Rubric.__init__.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
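The logger-assignment guard can be sketched as below; the class names are hypothetical and only illustrate the mixin ordering problem the commit describes:

```python
import logging

class Rubric:
    def __init__(self) -> None:
        self.logger = logging.getLogger("rubric")

class SandboxMixin:
    def __init__(self) -> None:
        # Only assign a logger if one isn't already set, so hybrid classes
        # keep the logger established by Rubric.__init__.
        if not hasattr(self, "logger"):
            self.logger = logging.getLogger("sandbox")

class HybridRubric(Rubric, SandboxMixin):
    def __init__(self) -> None:
        Rubric.__init__(self)
        SandboxMixin.__init__(self)
```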
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 2 potential issues.

@mikasenghaas mikasenghaas merged commit 12a1a91 into main Mar 18, 2026
6 checks passed
snimu added a commit that referenced this pull request Mar 19, 2026
- Use self.logger instead of module-level logger
- Remove stale get_model_response override and _update_main_metrics
  (main metrics are now computed in the rubric from trajectory)
- Remove unused imports (logging, MessageType, SamplingArgs)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
snimu added a commit that referenced this pull request Mar 19, 2026
* add OpenCodeRLMEnv and smoke-test environment

OpenCodeRLMEnv extends OpenCodeEnv with the snimu/oc RLM plugin for
recursive sub-LLM calls. Sets env vars so the plugin routes llm-subcall
and subagent calls through the interception proxy with model="sub",
enabling concurrent handling and separate token tracking.

Includes opencode-rlm-test environment with 3 tasks exercising basic
bash, llm-subcall, and subagent capabilities.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* add unit tests for OpenCodeRLMEnv (40 tests)

Covers constructor defaults, config generation (including shell
expansion), run command content, env var setup, sub-LLM detection,
state initialization, metrics tracking, and monitor rubric.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: use `set -e` instead of `set -eo pipefail` in RLM run command

SWE-Bench Docker images use sh (dash) as default shell, which doesn't
support the bash-only `pipefail` option.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* remove opencode-rlm-test smoke-test environment

Replaced by opencode-rlm-swe in research-environments.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: prevent sub-LLM tasks from being garbage collected

Store asyncio.create_task references in state["_sub_llm_tasks"] set
and use done callbacks to clean up. Prevents Python from silently
dropping in-flight sub-LLM requests.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
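The strong-reference pattern from the commit above can be sketched as follows (the state keys mirror the commit message; the sub-LLM call itself is a stand-in):

```python
import asyncio

async def spawn_sub_llm_calls(state: dict) -> None:
    # asyncio only keeps a weak reference to running tasks, so tasks with no
    # other reference can be garbage-collected mid-flight. Hold strong
    # references in a set and discard them via done callbacks.
    tasks: set[asyncio.Task] = state.setdefault("_sub_llm_tasks", set())

    async def sub_llm_call(i: int) -> None:
        await asyncio.sleep(0)  # stand-in for the real sub-LLM request
        state.setdefault("results", []).append(i)

    for i in range(3):
        task = asyncio.create_task(sub_llm_call(i))
        tasks.add(task)
        task.add_done_callback(tasks.discard)

    await asyncio.gather(*tasks)
```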

* fix: catch Exception instead of BaseException in sub-LLM handler

Allows CancelledError and KeyboardInterrupt to propagate for proper
task cancellation during shutdown.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: deliver response before re-raising CancelledError in sub-LLM handler

Catch BaseException to always resolve the HTTP future (preventing
hangs), but re-raise non-Exception types (CancelledError, etc.) after
delivery so task cancellation still propagates correctly.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
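The deliver-then-re-raise shape from the commit above can be sketched like this; `fake_llm_call` and the future-based delivery are illustrative stand-ins for the real handler:

```python
import asyncio

async def fake_llm_call() -> str:
    return "ok"  # stand-in for the real sub-LLM request

async def handle_sub_llm(request_future: asyncio.Future) -> None:
    try:
        result = await fake_llm_call()
    except BaseException as exc:
        # Always resolve the HTTP future so the waiting caller never hangs...
        if not request_future.done():
            request_future.set_exception(RuntimeError(f"sub-LLM failed: {exc!r}"))
        # ...but re-raise non-Exception types (CancelledError, etc.) so task
        # cancellation still propagates after delivery.
        if not isinstance(exc, Exception):
            raise
        return
    if not request_future.done():
        request_future.set_result(result)
```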

* fix: await in-flight sub-LLM tasks before exiting rollout loop

Drain all pending sub-LLM tasks when the agent completes or times out,
ensuring metrics and trajectory updates are finalized before scoring.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: use correct TrajectoryStep schema for sub-LLM trajectory entries

Use `prompt` instead of `prompt_messages`, and include all required
fields (completion, tokens, reward, advantage, is_truncated,
trajectory_id) to match the TrajectoryStep TypedDict.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: add OpenCodeEnv and OpenCodeRLMEnv to environments list

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* use X-RLM-Role header for sub-LLM detection, remove model-name trick

Replace model-name substring matching with an explicit X-RLM-Role: sub
HTTP header set by the OC plugin. The interception server now captures
all request headers in the intercept dict for general-purpose use.

Removes: RLM_SUB_MODEL_ID env var, sub_model_identifier param,
RLM_LLM_SUBCALL_VIA_PROXY env var (llm-subcall now routes through
OPENAI_BASE_URL automatically when set).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* use X-RLM-Role header for sub-LLM detection, remove model-name trick

Replace model-name substring matching with an explicit X-RLM-Role: sub
HTTP header set by the OC plugin. The interception server now captures
all request headers (lowercased) for general-purpose use.

Removes:
- RLM_SUB_MODEL_ID env var and sub_model_identifier param
- RLM_LLM_SUBCALL_VIA_PROXY env var (llm-subcall now routes through
  OPENAI_BASE_URL automatically when set)
- Model-name substring matching

Headers are stored with lowercase keys to handle HTTP/2 case
normalization correctly.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
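The header-based detection described above reduces to a small lookup; the exact header name comes from the commit, while the function name is illustrative:

```python
def is_sub_llm_request(headers: dict[str, str]) -> bool:
    # Store/compare with lowercase keys so the check is robust to HTTP/2's
    # case normalization (and to HTTP/1.1 clients sending mixed case).
    normalized = {k.lower(): v for k, v in headers.items()}
    return normalized.get("x-rlm-role") == "sub"
```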

* remove section dividers

* refactor: extract _poll_next_request to share polling logic

Move the tunnel/completion/timeout polling loop from get_prompt_messages
into _poll_next_request on CliAgentEnv. OpenCodeRLMEnv now calls this
helper instead of duplicating the loop.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* simplify OpenCodeRLMEnv: extract polling helper, compute main metrics in rubric

- Extract _poll_next_request into CliAgentEnv so the RLM env reuses
  the polling loop instead of duplicating it
- Move main-agent metric computation from get_model_response override
  to the rubric (computed from trajectory at scoring time)
- Remove get_model_response override and _update_main_metrics
- Add cleanup handler to cancel in-flight sub-LLM tasks on rollout end

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* use hardcoded intercepted/model in OpenCode config

Replace ${OPENAI_MODEL} shell expansion with a fixed "intercepted/model"
provider/model pair, matching the opencode_harbor pattern. The model
name doesn't matter since all API calls go through the interception
proxy. This fixes the ProviderModelNotFoundError when users pass model
names without a provider/ prefix (e.g. gpt-5-mini instead of
openai/gpt-5-mini).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* align OpenCodeRLMEnv with PR #999 conventions

- Use self.logger instead of module-level logger
- Remove stale get_model_response override and _update_main_metrics
  (main metrics are now computed in the rubric from trajectory)
- Remove unused imports (logging, MessageType, SamplingArgs)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: compute main metrics from trajectory in rubric, not state

The get_model_response override and _update_main_metrics were removed
but the rubric was still reading main_* from state (always 0). Now
main_turns/main_prompt_tokens/main_completion_tokens are computed from
state["trajectory"] at scoring time. Sub-LLM metrics remain in state
(accumulated during rollout).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: exclude sub-LLM steps from main metrics in rubric

Filter out trajectory steps with extras.is_sub_llm_call=True when
computing main_turns/main_prompt_tokens/main_completion_tokens.
Prevents double-counting when include_sub_llm_in_trajectory is enabled.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
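The metric computation described in the last two commits can be sketched as below. The trajectory field names (`extras`, `is_sub_llm_call`, token counts) follow the commit messages but are otherwise assumptions:

```python
def main_metrics(trajectory: list[dict]) -> dict[str, int]:
    # Exclude sub-LLM steps so they aren't double-counted when
    # include_sub_llm_in_trajectory is enabled.
    main_steps = [
        step for step in trajectory
        if not step.get("extras", {}).get("is_sub_llm_call", False)
    ]
    return {
        "main_turns": len(main_steps),
        "main_prompt_tokens": sum(s.get("prompt_tokens", 0) for s in main_steps),
        "main_completion_tokens": sum(s.get("completion_tokens", 0) for s in main_steps),
    }
```

Computing these from the trajectory at scoring time (rather than accumulating them in state during the rollout) is what let the `get_model_response` override be removed.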

* fix: check for first main step, not empty trajectory, for prompt tracking

When include_sub_llm_in_trajectory is enabled, sub-LLM steps can be
appended before the first main step, making len(trajectory) > 0. Use
has_main_step check instead so state["prompt"] is still set correctly
on the first main-agent turn.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: preserve opencode exit code in RLM run command

Replace pipe (cat | opencode run | tee) with redirect + cat so the
script exits with opencode's actual exit code. The pipe masked failures
because set -e only checks the last command in a pipeline (tee).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: use set +e around opencode run to capture exit code and logs

set -e would exit the script before _oc_exit capture and log emission.
Temporarily disable with set +e, capture exit code, re-enable, then
cat logs and exit with the real code.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>