Add Claude Code and Codex harnesses #1426
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit b3feccf.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b3feccf4fe
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
    EOFMCP

    cd "$CLAUDE_WORKDIR"
    claude -p "$(cat {shlex.quote(instruction_path)})" \
Avoid passing full Claude prompt as a CLI argument
This command inlines the entire instruction file into claude -p "$(cat ...)", which makes rollout success depend on OS argv limits (ARG_MAX). Large benchmark tasks or injected context can exceed that limit and fail with an argument-length error before the model runs, causing avoidable rollout failures. Feeding prompt content via stdin or a file-based option avoids this size ceiling.
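A minimal sketch of the stdin alternative the review describes, written as a plain Python subprocess call rather than the harness's generated shell script. The helper name `run_with_prompt_on_stdin` and the `STAND_IN` command are hypothetical, and whether the real CLI reads its prompt from stdin in this mode is an assumption to confirm against its documentation.

```python
from __future__ import annotations

import subprocess
import sys
from pathlib import Path


def run_with_prompt_on_stdin(argv: list[str], prompt_path: Path) -> subprocess.CompletedProcess:
    """Run argv with the prompt file attached to stdin instead of expanded into argv."""
    with prompt_path.open("rb") as prompt_file:
        # The child reads the file through its stdin descriptor, so the prompt
        # size is never counted against the argv/environment limit (ARG_MAX).
        return subprocess.run(argv, stdin=prompt_file, capture_output=True, check=False)


# Hypothetical stand-in for the real CLI: reports how many bytes arrived on stdin.
STAND_IN = [sys.executable, "-c", "import sys; print(len(sys.stdin.buffer.read()))"]
```

With this shape, a multi-megabyte instruction file reaches the child process intact, whereas the same content expanded through "$(cat ...)" could be rejected before the program starts.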
    --sandbox {shlex.quote(codex_sandbox)} \
    --model "$OPENAI_MODEL" \
    --output-last-message {shlex.quote(final_path)} \
    "$(cat {shlex.quote(prompt_path)})" > {shlex.quote(log_path)} 2>&1
Avoid passing full Codex prompt as a CLI argument
The script builds a prompt file and then expands it into a single argv value via "$(cat ... )" for codex exec. That can hit command-line length limits on larger tasks and fail the rollout with an argument-size error, even though the prompt file already exists on disk. Using stdin or a file-based prompt path keeps behavior stable for long inputs.
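Because the harness generates a shell script, the same idea can be sketched at the template level: redirect the existing prompt file into the tool's stdin instead of expanding it into an argument. The function `build_exec_line` is hypothetical, and it assumes the target CLI accepts its prompt on stdin, which should be checked against the Codex CLI documentation before adopting it.

```python
from __future__ import annotations

import shlex


def build_exec_line(tool_argv: list[str], prompt_path: str, log_path: str) -> str:
    """Return a shell line that feeds prompt_path to the tool on stdin."""
    command = " ".join(shlex.quote(part) for part in tool_argv)
    # `< file` redirects stdin; no prompt content is placed in argv,
    # so the line stays short regardless of how large the prompt file is.
    return f"{command} < {shlex.quote(prompt_path)} > {shlex.quote(log_path)} 2>&1"
```

For example, `build_exec_line(["cat"], "/tmp/prompt file.txt", "/tmp/run.log")` yields a line whose paths are quoted safely and whose length does not depend on the prompt.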
b0256d7 to 163d2de
Approvability
Verdict: Needs human review
This PR introduces two new harness integrations (ClaudeCode and Codex) with new runtime behavior. Additionally, unresolved review comments flag potential issues with command-line argument length limits and missing automation flags that could cause runtime failures.
b3feccf to 86dcf29
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 86dcf29b91
    --skip-git-repo-check \
    --sandbox {shlex.quote(codex_sandbox)} \
    --model "$OPENAI_MODEL" \
    --output-last-message {shlex.quote(final_path)} \
Force non-interactive approval mode for codex exec
This harness launches codex exec in unattended eval mode but never sets an explicit automation approval mode (for example --full-auto), so runs can block on approval prompts and eventually time out instead of completing. I checked the Codex Exec docs ("Approval Modes for Automation" and troubleshooting), which call out --full-auto for automated execution when tasks do not complete automatically; relying on implicit defaults here makes rollout behavior unstable across prompts/configs.
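Separately from choosing an explicit approval mode, an unattended runner can make a blocked prompt fail fast rather than hang. The sketch below is an assumption about how such a guard could look, not the harness's implementation: `run_unattended` is a hypothetical helper that closes stdin and enforces a wall-clock timeout.

```python
from __future__ import annotations

import subprocess
import sys
from typing import Optional, Tuple


def run_unattended(argv: list[str], timeout_s: float) -> Tuple[Optional[int], str]:
    """Run argv with no usable stdin; return (exit code, or None on timeout, output)."""
    try:
        done = subprocess.run(
            argv,
            stdin=subprocess.DEVNULL,  # an interactive prompt sees EOF instead of waiting
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired as exc:
        # The child has been killed; return whatever output was captured so far.
        partial = exc.output or ""
        if isinstance(partial, bytes):
            partial = partial.decode(errors="replace")
        return None, partial
    return done.returncode, done.stdout
```

A `None` exit code then distinguishes "timed out, probably waiting on something" from an ordinary non-zero failure, which makes this class of rollout failure easier to spot in logs.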
163d2de to b3f0633
86dcf29 to d431e08
d431e08 to e0970f6

Summary
- claude, claude-code, codex, and codex-cli aliases through their config classes

Testing
- uv run pytest tests/test_v1_config_extension.py tests/test_v1_harbor_cli.py tests/test_eval_cli.py -q
- uv run pre-commit run --all-files
- git diff --check harness-type-aliases...HEAD

Stacked on #1425.
Note
Medium Risk
Adds new packaged CLI harness implementations that generate/install/run command programs and wire MCP proxying, which can affect sandbox execution behavior and config validation for users selecting these harness types.
Overview
Adds two new bundled v1 command harnesses, ClaudeCode and Codex, including their typed configs and type/alias registration (e.g. claude/claude-code, codex/codex-cli) so TOML harness.type can select them.

ClaudeCode runs the Anthropic Claude Code CLI in non-interactive mode with MCP proxy config generation, log artifact collection, and configurable permission mode/turn limits; Codex similarly runs the OpenAI Codex CLI via a generated .codex/config.toml, supports sandbox mode and optional reasoning-effort tuning, and reads the Responses API key from rollout State (while explicitly rejecting max_turns overrides).

Exports are plumbed through verifiers.v1 and the root verifiers package, docs/examples are updated to reference the new harness names, and tests are extended to cover alias selection, re-exports, and program-building behavior for both harnesses.

Reviewed by Cursor Bugbot for commit e0970f6.
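To illustrate the alias registration described above, here is a minimal sketch of resolving a harness.type string to a harness class. The classes, the `HARNESS_ALIASES` table, and `resolve_harness` are hypothetical stand-ins; the PR itself registers aliases through the config classes, whose actual mechanism is not shown on this page.

```python
from __future__ import annotations


class ClaudeCode:
    """Hypothetical stand-in for the real ClaudeCode harness class."""


class Codex:
    """Hypothetical stand-in for the real Codex harness class."""


# Alias table using the names listed in the PR description.
HARNESS_ALIASES: dict[str, type] = {
    "claude": ClaudeCode,
    "claude-code": ClaudeCode,
    "codex": Codex,
    "codex-cli": Codex,
}


def resolve_harness(type_name: str) -> type:
    """Map a harness.type value to its class, failing loudly on unknown names."""
    key = type_name.strip().lower()
    try:
        return HARNESS_ALIASES[key]
    except KeyError:
        known = ", ".join(sorted(HARNESS_ALIASES))
        raise ValueError(f"unknown harness.type {type_name!r}; expected one of: {known}") from None
```

Listing the accepted names in the error message is a small design choice that helps users who mistype an alias in their TOML.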
Note
Add ClaudeCode and Codex harnesses to the verifiers framework
- Adds ClaudeCode (aliases: claude, claude-code) and Codex (aliases: codex, codex-cli) as new harness types, each selectable via harness.type in config.
- ClaudeCode runs the Claude Code CLI in non-interactive mode, piping instructions with configurable permission_mode and max_turns, and writes logs to a configurable path.
- Codex runs codex exec with configurable sandbox mode and reasoning effort; CODEX_API_KEY is populated dynamically from the active responses endpoint at runtime.
- CodexConfig rejects any attempt to set max_turns with a validation error, as Codex does not support it.

Macroscope summarized e0970f6.
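The max_turns rejection in the last bullet can be sketched with a plain dataclass. This is an illustration under assumptions: `CodexConfigSketch`, its field names, and its default values are hypothetical, and the real CodexConfig may use a different validation framework.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass
class CodexConfigSketch:
    """Hypothetical config; field names and defaults are illustrative only."""

    sandbox: str = "workspace-write"
    reasoning_effort: Optional[str] = None
    max_turns: Optional[int] = None

    def __post_init__(self) -> None:
        # Reject the option up front rather than silently ignoring it.
        if self.max_turns is not None:
            raise ValueError("max_turns is not supported by the Codex harness")
```

Failing at construction time means a config that asks for a turn limit is caught during validation instead of producing a run that quietly ignores the setting.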