Add OpenAI Codex agent support for cross-agent benchmarking #4

Merged
szjanikowski merged 3 commits into main from claude/codex-agent-support on Mar 23, 2026

@szjanikowski szjanikowski commented Mar 23, 2026

Summary

Add OpenAI Codex agent support to nasde-toolkit, enabling cross-agent benchmarking (Claude Code vs Codex on the same tasks).

New agent

  • ConfigurableCodex — subclass of Harbor's built-in Codex agent with sandbox file injection (AGENTS.md, Codex skills)
  • Bridge CODEX_API_KEY → OPENAI_API_KEY (Harbor reads only OPENAI_API_KEY, but OpenAI recommends CODEX_API_KEY)
  • DNS fix for cloud sandboxes (same pattern as ConfigurableClaude)
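
The key bridge in the list above could look roughly like the sketch below. It is an illustration only, not the code in this PR: the helper name `bridge_codex_api_key` and the precedence rule (an already-set OPENAI_API_KEY is left untouched) are assumptions.

```python
from typing import MutableMapping


def bridge_codex_api_key(env: MutableMapping[str, str]) -> None:
    """Copy CODEX_API_KEY into OPENAI_API_KEY when only the former is set.

    Hypothetical helper. The precedence rule (an existing OPENAI_API_KEY
    wins) is an assumption, not something stated in the PR.
    """
    codex_key = env.get("CODEX_API_KEY")
    if codex_key and not env.get("OPENAI_API_KEY"):
        env["OPENAI_API_KEY"] = codex_key


# Usage: pass os.environ (or a copy of it) before the agent is launched.
example_env = {"CODEX_API_KEY": "sk-example"}
bridge_codex_api_key(example_env)
```

Operating on a mapping rather than on `os.environ` directly keeps the helper easy to unit-test without mutating the process environment.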

Explicit variant configuration via variant.toml

  • Every variant must declare its agent type in variant.toml: agent = "claude" or agent = "codex"
  • Missing or invalid variant.toml is a hard error — no guessing, no auto-detection
  • harbor_config.json is auto-generated from variant.toml if absent

Multi-agent variant system

  • _collect_codex_skills() — collects agents_skills/<name>/ and injects into /app/.agents/skills/<name>/ (Codex's skill directory convention)
  • _collect_claude_skills() — refactored from inline code, collects skills/<name>/SKILL.md → /app/.claude/skills/
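
The two collectors above share one idea: walk a per-variant skills directory and map each file to its destination inside the sandbox. The sketch below shows that mapping under stated assumptions; `collect_skill_files` is a hypothetical name, and it copies every file under each skill directory, whereas the PR's Claude collector is described as picking up SKILL.md specifically.

```python
from pathlib import Path


def collect_skill_files(
    variant_dir: Path, source_subdir: str, sandbox_root: str
) -> dict[str, Path]:
    """Map sandbox destination paths to local files for each skill directory.

    Hypothetical helper, not the PR's _collect_codex_skills() or
    _collect_claude_skills().
    """
    source_root = variant_dir / source_subdir
    mapping: dict[str, Path] = {}
    if not source_root.is_dir():
        return mapping  # A variant without skills is valid.
    for skill_dir in sorted(p for p in source_root.iterdir() if p.is_dir()):
        for file_path in sorted(p for p in skill_dir.rglob("*") if p.is_file()):
            # Keep the <name>/... layout below the sandbox skills root.
            relative = file_path.relative_to(source_root).as_posix()
            mapping[f"{sandbox_root}/{relative}"] = file_path
    return mapping
```

With this shape, a Codex variant would call `collect_skill_files(variant, "agents_skills", "/app/.agents/skills")` and a Claude variant `collect_skill_files(variant, "skills", "/app/.claude/skills")`.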

Multi-provider auth

  • _ensure_auth() checks CODEX_API_KEY / OPENAI_API_KEY for Codex variants, Anthropic keys for Claude variants
  • Auth check deferred until after variant config is loaded (so agent type is known)
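
A provider-aware check of this kind might be sketched as follows. This is not the PR's `_ensure_auth()`: the function signature, the error class, and in particular the Claude variable name `ANTHROPIC_API_KEY` are assumptions (the PR only says "Anthropic keys").

```python
from typing import Mapping

# Environment variables that satisfy each agent type.
# The Codex names come from the PR; ANTHROPIC_API_KEY is an assumption.
CREDENTIAL_VARS = {
    "codex": ("CODEX_API_KEY", "OPENAI_API_KEY"),
    "claude": ("ANTHROPIC_API_KEY",),
}


class MissingCredentialsError(Exception):
    """Hypothetical error raised when no usable credential is set."""


def ensure_auth(agent_type: str, env: Mapping[str, str]) -> str:
    """Return the name of the first credential variable set for this agent."""
    candidates = CREDENTIAL_VARS.get(agent_type)
    if candidates is None:
        raise ValueError(f"Unknown agent type: {agent_type!r}")
    for name in candidates:
        if env.get(name):
            return name
    raise MissingCredentialsError(
        f"Agent '{agent_type}' needs one of: {', '.join(candidates)}"
    )
```

Because the agent type is an argument, the check can only run once the variant configuration has been read, which matches the deferred ordering described above.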

CLI

  • Display agent type (Claude Code / Codex) in run header panel

Example benchmarks — Codex variants added

  • DDD benchmark: 3 Codex variants mirroring Claude variants (codex-vanilla, codex-guided, codex-ntcoding-tactical-ddd)
  • Refactoring-skill benchmark: codex-vanilla variant added
  • All variants in both examples have variant.toml

Skills updated

  • nasde-benchmark-runner — Codex docs: supported models, cross-agent comparison, token cost heuristic, troubleshooting
  • nasde-benchmark-creator — variant.toml requirement, Claude/Codex variant structure, cross-agent design pattern
  • nasde-dev — skills added to documentation consistency checklist

Documentation

  • README.md — Quick start with Codex example, Authentication section (API key from platform.openai.com), Prerequisites, Project structure with variant.toml
  • CLAUDE.md — Package structure, variant conventions, harbor_config.json examples for both agents
  • ARCHITECTURE.md — Multi-agent class diagram, variant docs with variant.toml

Tests — 21 new tests (43 total)

  • test_configurable_codex.py — constructor, name, setup orchestration, missing file error
  • test_runner.py: load_variant_agent_type() (valid/missing/invalid), _is_codex_agent(), _generate_harbor_config() for both agents, _ensure_auth() multi-provider

Verified end-to-end

DDD benchmark (ddd-threshold-discount)

| Variant | Agent | Model | Reward | Score |
| --- | --- | --- | --- | --- |
| vanilla | Claude Code | claude-sonnet-4-6 | 1.0 | 82/100 |
| guided | Claude Code | claude-sonnet-4-6 | 1.0 | 86/100 |
| codex-vanilla | Codex | gpt-5-codex | 1.0 | 82/100 |
| codex-ntcoding-tactical-ddd | Codex | gpt-5-codex | 1.0 | 82/100 |

Refactoring benchmark (python-gilded-rose-polymorphism)

| Variant | Agent | Model | Reward | Score |
| --- | --- | --- | --- | --- |
| vanilla | Claude Code | claude-sonnet-4-6 | 1.0 | 73/100 |
| codex-vanilla | Codex | gpt-5-codex | 1.0 | 70/100 |

All Opik traces verified with complete feedback scores.

Note on Codex models: Only gpt-5-codex, gpt-5.3-codex, gpt-5.4, gpt-5.4-mini work via API key. Models like codex-mini-latest are subscription-only; o3-mini doesn't support required Codex tools.

Test plan

  • All 43 tests pass (uv run pytest -v)
  • Codex variant with gpt-5-codex completes tasks on 2 benchmarks
  • Codex variant with skill (codex-ntcoding-tactical-ddd) works end-to-end
  • Missing variant.toml produces clear error
  • _ensure_auth() validates CODEX_API_KEY for Codex agents
  • Opik traces have complete feedback scores for both agent types
  • Rebase on main clean, no conflicts

🤖 Generated with Claude Code

@szjanikowski szjanikowski force-pushed the claude/codex-agent-support branch from ab2355a to 4ef7eb9 on March 23, 2026 10:28
Szymon Janikowski and others added 3 commits March 23, 2026 12:25
Introduce ConfigurableCodex agent (subclass of Harbor's Codex) with
sandbox file injection, multi-provider auth (CODEX_API_KEY bridge),
and explicit agent type declaration via variant.toml.

- ConfigurableCodex: AGENTS.md + .agents/skills/ injection, DNS fix
- variant.toml required in every variant: agent = "claude" or "codex"
- Multi-provider _ensure_auth(): CODEX_API_KEY/OPENAI_API_KEY for
  Codex, Anthropic keys for Claude
- CLI shows agent type in run header
- 21 new tests (43 total)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- README: Quick start, Authentication (CODEX_API_KEY), Prerequisites,
  Project structure with variant.toml
- CLAUDE.md: package structure, variant conventions, harbor_config
  examples for both agents
- ARCHITECTURE.md: multi-agent class diagram, variant.toml docs
- benchmark-runner skill: Codex models, cross-agent comparison,
  token cost heuristic, troubleshooting
- benchmark-creator skill: variant.toml, Claude/Codex variant
  structure, cross-agent design pattern
- nasde-dev skill: skills added to documentation checklist

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

DDD benchmark: 3 Codex variants (codex-vanilla, codex-guided,
codex-ntcoding-tactical-ddd) mirroring existing Claude variants.
Refactoring-skill benchmark: codex-vanilla variant added.
All variants in both examples have variant.toml.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@szjanikowski szjanikowski force-pushed the claude/codex-agent-support branch from 407488a to bb0dbbe on March 23, 2026 11:26
@szjanikowski szjanikowski merged commit d10ddb5 into main Mar 23, 2026
@szjanikowski szjanikowski deleted the claude/codex-agent-support branch April 22, 2026 12:36