Add Claude Code and Codex harnesses #1426

Open
xeophon wants to merge 1 commit into harness-type-aliases from claude-codex-harnesses

@xeophon (Member) commented May 20, 2026

Summary

  • stack on #1425 (Add V1 harness type aliases) to add packaged Claude Code and Codex command harnesses
  • register claude, claude-code, codex, and codex-cli aliases through their config classes
  • export the new harnesses, document their TOML names, and add construction/alias tests

Testing

  • uv run pytest tests/test_v1_config_extension.py tests/test_v1_harbor_cli.py tests/test_eval_cli.py -q
  • uv run pre-commit run --all-files
  • git diff --check harness-type-aliases...HEAD

Stacked on #1425.


Note

Medium Risk
Adds new packaged CLI harness implementations that generate/install/run command programs and wire MCP proxying, which can affect sandbox execution behavior and config validation for users selecting these harness types.

Overview
Adds two new bundled v1 command harnesses, ClaudeCode and Codex, including their typed configs and type/alias registration (e.g. claude/claude-code, codex/codex-cli) so TOML harness.type can select them.

ClaudeCode runs the Anthropic Claude Code CLI in non-interactive mode with MCP proxy config generation, log artifact collection, and configurable permission mode/turn limits; Codex similarly runs the OpenAI Codex CLI via a generated .codex/config.toml, supports sandbox mode and optional reasoning-effort tuning, and reads the Responses API key from rollout State (while explicitly rejecting max_turns overrides).

Exports are plumbed through verifiers.v1 and the root verifiers package, docs/examples are updated to reference the new harness names, and tests are extended to cover alias selection, re-exports, and program-building behavior for both harnesses.

Reviewed by Cursor Bugbot for commit e0970f6.

Note

Add ClaudeCode and Codex harnesses to the verifiers framework

  • Adds ClaudeCode (aliases: claude, claude-code) and Codex (aliases: codex, codex-cli) as new harness types, each selectable via harness.type in config.
  • ClaudeCode runs the Claude Code CLI in non-interactive mode, piping instructions with configurable permission_mode and max_turns, and writes logs to a configurable path.
  • Codex runs codex exec with configurable sandbox mode and reasoning effort; CODEX_API_KEY is populated dynamically from the active responses endpoint at runtime.
  • Both harnesses install their respective npm packages during setup and wire MCP integration to the verifiers proxy via stdio.
  • CodexConfig rejects any attempt to set max_turns with a validation error, as Codex does not support it.

Macroscope summarized e0970f6.
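
The alias selection and the `max_turns` rejection described above can be sketched as follows. This is a minimal illustration, not the PR's implementation: the alias names come from the summary, but `HARNESS_ALIASES`, `resolve_harness_type`, and `SketchCodexConfig` are hypothetical names, and the real config classes may validate differently.

```python
from dataclasses import dataclass
from typing import Optional

# Alias names are taken from the PR summary; the lookup table itself is
# a hypothetical stand-in for the real registration mechanism.
HARNESS_ALIASES = {
    "claude": "ClaudeCode",
    "claude-code": "ClaudeCode",
    "codex": "Codex",
    "codex-cli": "Codex",
}


def resolve_harness_type(name: str) -> str:
    """Map a TOML harness.type value to a harness class name."""
    try:
        return HARNESS_ALIASES[name]
    except KeyError:
        raise ValueError(f"unknown harness type: {name!r}") from None


@dataclass
class SketchCodexConfig:
    """Hypothetical config that refuses max_turns, as the summary describes."""

    max_turns: Optional[int] = None

    def __post_init__(self) -> None:
        # Fail at construction time so a bad config never reaches a rollout.
        if self.max_turns is not None:
            raise ValueError("Codex does not support max_turns")
```

Rejecting the field at construction time, rather than silently ignoring it, surfaces the misconfiguration before any sandbox is started.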


@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit b3feccf.

Comment thread verifiers/v1/packages/harnesses/codex.py

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b3feccf4fe

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

EOFMCP

cd "$CLAUDE_WORKDIR"
claude -p "$(cat {shlex.quote(instruction_path)})" \

P2: Avoid passing the full Claude prompt as a CLI argument

This command inlines the entire instruction file into claude -p "$(cat ...)", which makes rollout success depend on OS argv limits (ARG_MAX). Large benchmark tasks or injected context can exceed that limit and fail with an argument-length error before the model runs, causing avoidable rollout failures. Feeding prompt content via stdin or a file-based option avoids this size ceiling.
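
One way to apply this suggestion is to redirect the instruction file to the CLI's stdin instead of expanding it into argv. The sketch below assumes that `claude -p` reads the prompt from stdin when no prompt argument is given, which should be verified against the installed CLI version; `build_claude_command` is a hypothetical helper, not a function from the PR.

```python
import shlex


def build_claude_command(instruction_path: str) -> str:
    """Return a shell line that feeds the prompt to `claude -p` on stdin.

    Assumption: the CLI accepts the prompt on stdin in print mode.
    """
    # Redirecting the file keeps the prompt out of argv entirely, so its
    # size is no longer bounded by the OS argument-length limit.
    return f"claude -p < {shlex.quote(instruction_path)}"


if __name__ == "__main__":
    print(build_claude_command("/tmp/instruction.md"))
```

Any other flags the harness already passes would be inserted before the redirect; only the way the prompt is delivered changes.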

--sandbox {shlex.quote(codex_sandbox)} \
--model "$OPENAI_MODEL" \
--output-last-message {shlex.quote(final_path)} \
"$(cat {shlex.quote(prompt_path)})" > {shlex.quote(log_path)} 2>&1

P2: Avoid passing the full Codex prompt as a CLI argument

The script builds a prompt file and then expands it into a single argv value via "$(cat ... )" for codex exec. That can hit command-line length limits on larger tasks and fail the rollout with an argument-size error, even though the prompt file already exists on disk. Using stdin or a file-based prompt path keeps behavior stable for long inputs.
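
To see when a prompt would hit this ceiling, the limits can be queried from Python. This is a rough sketch under stated assumptions: `fits_in_single_argument` is a hypothetical helper, the 32-page cap on a single argument is Linux-specific behavior, and the safety margin is an arbitrary choice.

```python
import os


def fits_in_single_argument(prompt: str, safety_margin: int = 4096) -> bool:
    """Estimate whether a prompt can be passed as one argv string."""
    # Assumption: Linux caps each individual argument at 32 pages,
    # which is usually far below the total ARG_MAX budget.
    per_arg_limit = 32 * os.sysconf("SC_PAGE_SIZE")
    # ARG_MAX covers all arguments plus the environment combined.
    total_limit = os.sysconf("SC_ARG_MAX")
    size = len(prompt.encode("utf-8")) + 1  # +1 for the trailing NUL byte
    return size + safety_margin <= min(per_arg_limit, total_limit)


if __name__ == "__main__":
    print(fits_in_single_argument("short prompt"))
```

A check like this only detects the problem; reading the prompt from stdin or a file removes the limit instead of guarding against it.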

@xeophon force-pushed the harness-type-aliases branch from b0256d7 to 163d2de on May 20, 2026, 17:34

@macroscopeapp (bot) commented May 20, 2026

Approvability

Verdict: Needs human review

This PR introduces two new harness integrations (ClaudeCode and Codex) with new runtime behavior. Additionally, unresolved review comments flag potential issues with command-line argument length limits and missing automation flags that could cause runtime failures.

@xeophon force-pushed the claude-codex-harnesses branch from b3feccf to 86dcf29 on May 20, 2026, 17:35

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Reviewed commit: 86dcf29b91

Comment on lines +127 to +130
--skip-git-repo-check \
--sandbox {shlex.quote(codex_sandbox)} \
--model "$OPENAI_MODEL" \
--output-last-message {shlex.quote(final_path)} \

P2: Force non-interactive approval mode for codex exec

This harness launches codex exec in unattended eval mode but never sets an explicit automation approval mode (for example --full-auto), so runs can block on approval prompts and eventually time out instead of completing. I checked the Codex Exec docs (“Approval Modes for Automation” and troubleshooting), which call out --full-auto for automated execution when tasks do not complete automatically; relying on implicit defaults here makes rollout behavior unstable across prompts and configs.

@xeophon force-pushed the harness-type-aliases branch from 163d2de to b3f0633 on May 20, 2026, 17:55
@xeophon force-pushed the claude-codex-harnesses branch from 86dcf29 to d431e08 on May 20, 2026, 18:01
@xeophon force-pushed the claude-codex-harnesses branch from d431e08 to e0970f6 on May 20, 2026, 18:03