opencode envs #999

Merged
mikasenghaas merged 55 commits into main from opencode-envs
Mar 18, 2026

Conversation

@mikasenghaas
Member

@mikasenghaas mikasenghaas commented Mar 7, 2026

Summary

  • CLI agent env improvements: Refactored cli_agent_env.py with better error handling — agent failures and background job polling errors are now logged/raised properly, per-request timeout increased to 1h
  • OpenCode env updates: Enhanced opencode_env.py with adjusted wording, task system prompt, and agent log collection; added new opencode_qa_env.py variant
  • Sandbox mixin enhancements: Added file upload helpers (read_file, file upload to sandbox), simplified poll_job, and increased default timeouts from 30s → 1h
  • Hybrid math rubric overhaul: Added sandbox-based scoring support, offline difficulty filtering, remote math verification, rubric teardown lifecycle hook, and deregister/register flow from env → rubric
  • Logging improvements: Guard debug-level log string building behind isEnabledFor(DEBUG) in interception_utils to skip O(n_messages) work when debug is off; compact info logs; moved logging utils
  • Rubric lifecycle: Added teardown support to rubric.py and decorator discovery utils for cleanup hooks
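The debug-guard change in the logging bullet above can be sketched as follows. This is a minimal illustration, not the repo's actual code; the function and logger names are assumptions:

```python
import logging

logger = logging.getLogger("interception")

def log_request(messages: list[dict]) -> None:
    # Building the debug string walks every message (O(n_messages) work);
    # skip it entirely unless DEBUG logging is actually enabled.
    if not logger.isEnabledFor(logging.DEBUG):
        return
    rendered = "\n".join(f"[{m['role']}] {m['content']}" for m in messages)
    logger.debug("request:\n%s", rendered)
```

The guard matters because the expensive part is constructing the log string, which happens before `logger.debug` can discard the record; `isEnabledFor` short-circuits that construction.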

Type of Change

  • New feature (non-breaking change which adds functionality)

Testing

  • All existing tests pass when running uv run pytest locally.
  • New tests have been added to cover the changes

Checklist

  • My code follows the style guidelines of this project as outlined in AGENTS.md
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • Any dependent changes have been merged and published

Note

Medium Risk
Touches sandbox lifecycle/teardown and scoring flow (including optional deferred sandbox deletion and in-sandbox scoring), which can impact resource cleanup and evaluation correctness. Changes are localized to experimental envs/rubrics but affect long-running timeouts and error propagation.

Overview
Improves the experimental CliAgentEnv/OpenCode environments with more robust agent execution handling: better background-job start/poll error surfacing (AgentError), simplified tunnel management, longer default request/provider timeouts (to 1h), and richer per-rollout info logging (tool call counts, duration, exit code, errors).

Adds post-rollout data capture and dataset controls: OpenCodeEnv now collects and stores agent logs from the sandbox; new OpenCodeQAEnv can optionally filter datasets by a difficulty/reward column range.

Extends SandboxMixin with sandbox file I/O helpers (upload_file, upload_content, read_file, upload_bundle) and explicit register_sandbox/deregister_sandbox APIs; CliAgentEnv gains keep_sandbox_for_scoring to defer deletion while still removing the sandbox from active tracking.

Updates HybridMathRubric defaults/behavior (fixed math-verify timeouts, optional judge fallback), and introduces RemoteHybridMathRubric to run math verification inside the sandbox and delete the sandbox after scoring via a cleanup hook.

Optimizes interception logging by guarding debug log construction and centralizing string truncation in logging_utils.truncate.
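A centralized truncation helper of the kind `logging_utils.truncate` describes might look like the sketch below. The signature and marker are assumptions for illustration; the real helper may differ:

```python
def truncate(text: str, max_len: int = 200, marker: str = "…") -> str:
    """Truncate text for log output, appending a marker when content is cut.

    Assumed signature for illustration only; see logging_utils for the
    actual helper.
    """
    if len(text) <= max_len:
        return text
    return text[: max_len - len(marker)] + marker
```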

Written by Cursor Bugbot for commit c37da47.

mikasenghaas and others added 30 commits March 7, 2026 23:51
…ring building

- zmq_env_server: offload model_dump + msgpack.packb to asyncio.to_thread;
  serializing large rollout states (O(n_turns^2) token arrays) was blocking
  the event loop for seconds
- interception_utils: guard _log_request/_log_response with isEnabledFor(DEBUG)
  to skip O(n_messages) string building on every API call when debug is off

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
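The offloading pattern from the commit above can be sketched like this; `serialize` stands in for the real `model_dump` + `msgpack.packb` step so the example stays dependency-free:

```python
import asyncio

def serialize(state: dict) -> bytes:
    # Stand-in for the CPU-bound model_dump + msgpack.packb step over
    # large (O(n_turns^2)) token arrays.
    return repr(state).encode()

async def send_state(state: dict) -> bytes:
    # Run the heavy serialization in a worker thread so the event loop
    # stays responsive for other in-flight rollouts.
    return await asyncio.to_thread(serialize, state)
```

`asyncio.to_thread` only helps here because `msgpack.packb` releases work to C code and the blocking cost is per-call; for truly CPU-saturating workloads a process pool would be the next step.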
model_dump + msgpack.packb on large rollout states (O(n_turns^2) token
arrays) was blocking the event loop for seconds per completion.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… blocks >10s

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@mikasenghaas mikasenghaas requested a review from rasdani March 18, 2026 13:45
@mikasenghaas mikasenghaas marked this pull request as ready for review March 18, 2026 13:47
CancelledError is a BaseException in Python 3.9+, so `except Exception`
missed it when the future was cancelled by unregister_rollout.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
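The failure mode described in the commit above can be reproduced in a few lines: `except Exception` never matches `asyncio.CancelledError`, because it derives from `BaseException`:

```python
import asyncio

async def demo() -> tuple[bool, bool]:
    caught_by_exception = False
    caught_by_cancelled = False

    async def worker() -> None:
        nonlocal caught_by_exception, caught_by_cancelled
        try:
            await asyncio.sleep(10)
        except Exception:
            caught_by_exception = True  # never runs for cancellation
        except asyncio.CancelledError:
            caught_by_cancelled = True
            raise  # re-raise so cancellation still propagates

    task = asyncio.create_task(worker())
    await asyncio.sleep(0)  # let the worker reach its await point
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass
    return caught_by_exception, caught_by_cancelled
```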
Only assign self.logger if not already set, so rubric-mixin hybrid
classes keep the logger established by Rubric.__init__.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
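The logger-assignment guard can be sketched as below; the class names are hypothetical and only illustrate the mixin ordering problem the commit describes:

```python
import logging

class Rubric:
    def __init__(self) -> None:
        self.logger = logging.getLogger("rubric")

class SandboxMixin:
    def __init__(self) -> None:
        # Only assign a logger if one isn't already set, so hybrid classes
        # keep the logger established by Rubric.__init__.
        if not hasattr(self, "logger"):
            self.logger = logging.getLogger("sandbox")

class HybridRubric(Rubric, SandboxMixin):
    def __init__(self) -> None:
        Rubric.__init__(self)
        SandboxMixin.__init__(self)
```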
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 2 potential issues.

@mikasenghaas mikasenghaas merged commit 12a1a91 into main Mar 18, 2026
6 checks passed
snimu added a commit that referenced this pull request Mar 19, 2026
- Use self.logger instead of module-level logger
- Remove stale get_model_response override and _update_main_metrics
  (main metrics are now computed in the rubric from trajectory)
- Remove unused imports (logging, MessageType, SamplingArgs)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
snimu added a commit that referenced this pull request Mar 19, 2026
* add OpenCodeRLMEnv and smoke-test environment

OpenCodeRLMEnv extends OpenCodeEnv with the snimu/oc RLM plugin for
recursive sub-LLM calls. Sets env vars so the plugin routes llm-subcall
and subagent calls through the interception proxy with model="sub",
enabling concurrent handling and separate token tracking.

Includes opencode-rlm-test environment with 3 tasks exercising basic
bash, llm-subcall, and subagent capabilities.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* add unit tests for OpenCodeRLMEnv (40 tests)

Covers constructor defaults, config generation (including shell
expansion), run command content, env var setup, sub-LLM detection,
state initialization, metrics tracking, and monitor rubric.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: use `set -e` instead of `set -eo pipefail` in RLM run command

SWE-Bench Docker images use sh (dash) as default shell, which doesn't
support the bash-only `pipefail` option.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* remove opencode-rlm-test smoke-test environment

Replaced by opencode-rlm-swe in research-environments.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: prevent sub-LLM tasks from being garbage collected

Store asyncio.create_task references in state["_sub_llm_tasks"] set
and use done callbacks to clean up. Prevents Python from silently
dropping in-flight sub-LLM requests.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
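The strong-reference pattern from the commit above can be sketched as follows (the state keys mirror the commit message; the sub-LLM call itself is a stand-in):

```python
import asyncio

async def spawn_sub_llm_calls(state: dict) -> None:
    # asyncio only keeps a weak reference to running tasks, so tasks with no
    # other reference can be garbage-collected mid-flight. Hold strong
    # references in a set and discard them via done callbacks.
    tasks: set[asyncio.Task] = state.setdefault("_sub_llm_tasks", set())

    async def sub_llm_call(i: int) -> None:
        await asyncio.sleep(0)  # stand-in for the real sub-LLM request
        state.setdefault("results", []).append(i)

    for i in range(3):
        task = asyncio.create_task(sub_llm_call(i))
        tasks.add(task)
        task.add_done_callback(tasks.discard)

    await asyncio.gather(*tasks)
```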

* fix: catch Exception instead of BaseException in sub-LLM handler

Allows CancelledError and KeyboardInterrupt to propagate for proper
task cancellation during shutdown.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: deliver response before re-raising CancelledError in sub-LLM handler

Catch BaseException to always resolve the HTTP future (preventing
hangs), but re-raise non-Exception types (CancelledError, etc.) after
delivery so task cancellation still propagates correctly.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
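The deliver-then-re-raise shape from the commit above can be sketched like this; `fake_llm_call` and the future-based delivery are illustrative stand-ins for the real handler:

```python
import asyncio

async def fake_llm_call() -> str:
    return "ok"  # stand-in for the real sub-LLM request

async def handle_sub_llm(request_future: asyncio.Future) -> None:
    try:
        result = await fake_llm_call()
    except BaseException as exc:
        # Always resolve the HTTP future so the waiting caller never hangs...
        if not request_future.done():
            request_future.set_exception(RuntimeError(f"sub-LLM failed: {exc!r}"))
        # ...but re-raise non-Exception types (CancelledError, etc.) so task
        # cancellation still propagates after delivery.
        if not isinstance(exc, Exception):
            raise
        return
    if not request_future.done():
        request_future.set_result(result)
```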

* fix: await in-flight sub-LLM tasks before exiting rollout loop

Drain all pending sub-LLM tasks when the agent completes or times out,
ensuring metrics and trajectory updates are finalized before scoring.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: use correct TrajectoryStep schema for sub-LLM trajectory entries

Use `prompt` instead of `prompt_messages`, and include all required
fields (completion, tokens, reward, advantage, is_truncated,
trajectory_id) to match the TrajectoryStep TypedDict.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: add OpenCodeEnv and OpenCodeRLMEnv to environments list

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* use X-RLM-Role header for sub-LLM detection, remove model-name trick

Replace model-name substring matching with an explicit X-RLM-Role: sub
HTTP header set by the OC plugin. The interception server now captures
all request headers in the intercept dict for general-purpose use.

Removes: RLM_SUB_MODEL_ID env var, sub_model_identifier param,
RLM_LLM_SUBCALL_VIA_PROXY env var (llm-subcall now routes through
OPENAI_BASE_URL automatically when set).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* use X-RLM-Role header for sub-LLM detection, remove model-name trick

Replace model-name substring matching with an explicit X-RLM-Role: sub
HTTP header set by the OC plugin. The interception server now captures
all request headers (lowercased) for general-purpose use.

Removes:
- RLM_SUB_MODEL_ID env var and sub_model_identifier param
- RLM_LLM_SUBCALL_VIA_PROXY env var (llm-subcall now routes through
  OPENAI_BASE_URL automatically when set)
- Model-name substring matching

Headers are stored with lowercase keys to handle HTTP/2 case
normalization correctly.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
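The header-based detection described above reduces to a small lookup; the exact header name comes from the commit, while the function name is illustrative:

```python
def is_sub_llm_request(headers: dict[str, str]) -> bool:
    # Store/compare with lowercase keys so the check is robust to HTTP/2's
    # case normalization (and to HTTP/1.1 clients sending mixed case).
    normalized = {k.lower(): v for k, v in headers.items()}
    return normalized.get("x-rlm-role") == "sub"
```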

* remove section dividers

* refactor: extract _poll_next_request to share polling logic

Move the tunnel/completion/timeout polling loop from get_prompt_messages
into _poll_next_request on CliAgentEnv. OpenCodeRLMEnv now calls this
helper instead of duplicating the loop.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* simplify OpenCodeRLMEnv: extract polling helper, compute main metrics in rubric

- Extract _poll_next_request into CliAgentEnv so the RLM env reuses
  the polling loop instead of duplicating it
- Move main-agent metric computation from get_model_response override
  to the rubric (computed from trajectory at scoring time)
- Remove get_model_response override and _update_main_metrics
- Add cleanup handler to cancel in-flight sub-LLM tasks on rollout end

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* use hardcoded intercepted/model in OpenCode config

Replace ${OPENAI_MODEL} shell expansion with a fixed "intercepted/model"
provider/model pair, matching the opencode_harbor pattern. The model
name doesn't matter since all API calls go through the interception
proxy. This fixes the ProviderModelNotFoundError when users pass model
names without a provider/ prefix (e.g. gpt-5-mini instead of
openai/gpt-5-mini).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* align OpenCodeRLMEnv with PR #999 conventions

- Use self.logger instead of module-level logger
- Remove stale get_model_response override and _update_main_metrics
  (main metrics are now computed in the rubric from trajectory)
- Remove unused imports (logging, MessageType, SamplingArgs)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: compute main metrics from trajectory in rubric, not state

The get_model_response override and _update_main_metrics were removed
but the rubric was still reading main_* from state (always 0). Now
main_turns/main_prompt_tokens/main_completion_tokens are computed from
state["trajectory"] at scoring time. Sub-LLM metrics remain in state
(accumulated during rollout).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: exclude sub-LLM steps from main metrics in rubric

Filter out trajectory steps with extras.is_sub_llm_call=True when
computing main_turns/main_prompt_tokens/main_completion_tokens.
Prevents double-counting when include_sub_llm_in_trajectory is enabled.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
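The metric computation described in the last two commits can be sketched as below. The trajectory field names (`extras`, `is_sub_llm_call`, token counts) follow the commit messages but are otherwise assumptions:

```python
def main_metrics(trajectory: list[dict]) -> dict[str, int]:
    # Exclude sub-LLM steps so they aren't double-counted when
    # include_sub_llm_in_trajectory is enabled.
    main_steps = [
        step for step in trajectory
        if not step.get("extras", {}).get("is_sub_llm_call", False)
    ]
    return {
        "main_turns": len(main_steps),
        "main_prompt_tokens": sum(s.get("prompt_tokens", 0) for s in main_steps),
        "main_completion_tokens": sum(s.get("completion_tokens", 0) for s in main_steps),
    }
```

Computing these from the trajectory at scoring time (rather than accumulating them in state during the rollout) is what let the `get_model_response` override be removed.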

* fix: check for first main step, not empty trajectory, for prompt tracking

When include_sub_llm_in_trajectory is enabled, sub-LLM steps can be
appended before the first main step, making len(trajectory) > 0. Use
has_main_step check instead so state["prompt"] is still set correctly
on the first main-agent turn.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: preserve opencode exit code in RLM run command

Replace pipe (cat | opencode run | tee) with redirect + cat so the
script exits with opencode's actual exit code. The pipe masked failures
because set -e only checks the last command in a pipeline (tee).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: use set +e around opencode run to capture exit code and logs

set -e would exit the script before _oc_exit capture and log emission.
Temporarily disable with set +e, capture exit code, re-enable, then
cat logs and exit with the real code.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>