Add OpenAI Codex agent support for cross-agent benchmarking by szjanikowski · Pull Request #4 · NoesisVision/nasde-toolkit

szjanikowski · 2026-03-23T09:22:12Z

Summary

Add OpenAI Codex agent support to nasde-toolkit, enabling cross-agent benchmarking (Claude Code vs Codex on the same tasks).

New agent

ConfigurableCodex — subclass of Harbor's built-in Codex agent with sandbox file injection (AGENTS.md, Codex skills)
Bridge CODEX_API_KEY → OPENAI_API_KEY (Harbor reads only OPENAI_API_KEY, but OpenAI recommends CODEX_API_KEY)
DNS fix for cloud sandboxes (same pattern as ConfigurableClaude)
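
The key bridge described above could look roughly like the sketch below. This is not the PR's implementation: the function name `bridge_codex_api_key` is hypothetical, and leaving an already-set OPENAI_API_KEY untouched is an assumption, since the description does not state which variable takes precedence.

```python
import os
from typing import MutableMapping


def bridge_codex_api_key(env: MutableMapping[str, str] = os.environ) -> bool:
    """Copy CODEX_API_KEY into OPENAI_API_KEY when only the former is set.

    Returns True if the bridge was applied. An existing OPENAI_API_KEY is
    left untouched (assumption: the PR does not state precedence).
    """
    codex_key = env.get("CODEX_API_KEY")
    if codex_key and not env.get("OPENAI_API_KEY"):
        env["OPENAI_API_KEY"] = codex_key
        return True
    return False
```

Passing a plain dict instead of `os.environ` keeps the helper easy to unit-test without touching the real process environment.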

Explicit variant configuration via `variant.toml`

Every variant must declare its agent type in variant.toml: agent = "claude" or agent = "codex"
Missing or invalid variant.toml is a hard error — no guessing, no auto-detection
harbor_config.json is auto-generated from variant.toml if absent

Multi-agent variant system

_collect_codex_skills() — collects agents_skills/<name>/ and injects into /app/.agents/skills/<name>/ (Codex's skill directory convention)
_collect_claude_skills() — refactored from inline code, collects skills/<name>/SKILL.md → /app/.claude/skills/
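
To illustrate the Codex directory convention, the sketch below maps every file under a variant's agents_skills/<name>/ to a destination under /app/.agents/skills/<name>/. It is a simplified stand-in, not the PR's `_collect_codex_skills()`: the function name, return type, and the choice to return an empty mapping when the directory is absent are all assumptions.

```python
from pathlib import Path, PurePosixPath

# Sandbox destination named in the PR description.
CODEX_SKILLS_ROOT = PurePosixPath("/app/.agents/skills")


def collect_codex_skills(variant_dir: Path) -> dict[Path, PurePosixPath]:
    """Map each file in agents_skills/<name>/ to its sandbox destination."""
    source_root = variant_dir / "agents_skills"
    mapping: dict[Path, PurePosixPath] = {}
    if not source_root.is_dir():
        # Assumption: a variant without Codex skills is valid.
        return mapping
    for path in sorted(source_root.rglob("*")):
        if path.is_file():
            # Keep the <name>/... layout relative to agents_skills/.
            relative = path.relative_to(source_root)
            mapping[path] = CODEX_SKILLS_ROOT / relative.as_posix()
    return mapping
```

Returning a host-path to sandbox-path mapping keeps collection separate from the actual upload step, which depends on the sandbox API and is not shown.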

Multi-provider auth

_ensure_auth() checks CODEX_API_KEY / OPENAI_API_KEY for Codex variants, Anthropic keys for Claude variants
Auth check deferred until after variant config is loaded (so agent type is known)
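
The per-agent credential check might be sketched as follows. This is an illustration under stated assumptions, not the PR's `_ensure_auth()`: the names `ensure_auth` and `MissingCredentialsError` are hypothetical, and ANTHROPIC_API_KEY is assumed to be one of the "Anthropic keys", since the description does not list them.

```python
import os
from typing import Mapping

# Environment variables accepted per agent type.
CREDENTIAL_VARS = {
    "codex": ("CODEX_API_KEY", "OPENAI_API_KEY"),
    # Assumption: the PR only says "Anthropic keys"; this is one likely name.
    "claude": ("ANTHROPIC_API_KEY",),
}


class MissingCredentialsError(RuntimeError):
    """Raised when no credential for the selected agent is available."""


def ensure_auth(agent: str, env: Mapping[str, str] = os.environ) -> str:
    """Return the first credential variable that is set for `agent`."""
    if agent not in CREDENTIAL_VARS:
        raise ValueError(f"unknown agent type: {agent!r}")
    candidates = CREDENTIAL_VARS[agent]
    for name in candidates:
        if env.get(name):
            return name
    raise MissingCredentialsError(
        f"{agent} variant requires one of: {', '.join(candidates)}"
    )
```

Because the function takes the agent type as input, it can only run after variant.toml has been loaded, which matches the deferred ordering described above.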

CLI

Display agent type (Claude Code / Codex) in run header panel

Example benchmarks — Codex variants added

DDD benchmark: 3 Codex variants mirroring Claude variants (codex-vanilla, codex-guided, codex-ntcoding-tactical-ddd)
Refactoring-skill benchmark: codex-vanilla variant added
All variants in both examples have variant.toml

Skills updated

nasde-benchmark-runner — Codex docs: supported models, cross-agent comparison, token cost heuristic, troubleshooting
nasde-benchmark-creator — variant.toml requirement, Claude/Codex variant structure, cross-agent design pattern
nasde-dev — skills added to documentation consistency checklist

Documentation

README.md — Quick start with Codex example, Authentication section (API key from platform.openai.com), Prerequisites, Project structure with variant.toml
CLAUDE.md — Package structure, variant conventions, harbor_config.json examples for both agents
ARCHITECTURE.md — Multi-agent class diagram, variant docs with variant.toml

Tests — 21 new tests (43 total)

test_configurable_codex.py — constructor, name, setup orchestration, missing file error
test_runner.py — load_variant_agent_type() (valid/missing/invalid), _is_codex_agent(), _generate_harbor_config() for both agents, _ensure_auth() multi-provider

Verified end-to-end

DDD benchmark (`ddd-threshold-discount`)

| Variant | Agent | Model | Reward | Score |
| --- | --- | --- | --- | --- |
| vanilla | Claude Code | claude-sonnet-4-6 | 1.0 | 82/100 |
| guided | Claude Code | claude-sonnet-4-6 | 1.0 | 86/100 |
| codex-vanilla | Codex | gpt-5-codex | 1.0 | 82/100 |
| codex-ntcoding-tactical-ddd | Codex | gpt-5-codex | 1.0 | 82/100 |

Refactoring benchmark (`python-gilded-rose-polymorphism`)

| Variant | Agent | Model | Reward | Score |
| --- | --- | --- | --- | --- |
| vanilla | Claude Code | claude-sonnet-4-6 | 1.0 | 73/100 |
| codex-vanilla | Codex | gpt-5-codex | 1.0 | 70/100 |

All Opik traces verified with complete feedback scores.

Note on Codex models: only gpt-5-codex, gpt-5.3-codex, gpt-5.4, and gpt-5.4-mini work with API-key authentication. Models such as codex-mini-latest are subscription-only, and o3-mini does not support the tools Codex requires.

Test plan

All 43 tests pass (uv run pytest -v)
Codex variant with gpt-5-codex completes tasks on 2 benchmarks
Codex variant with skill (codex-ntcoding-tactical-ddd) works end-to-end
Missing variant.toml produces clear error
_ensure_auth() validates CODEX_API_KEY for Codex agents
Opik traces have complete feedback scores for both agent types
Rebase on main is clean, with no conflicts

🤖 Generated with Claude Code

Introduce ConfigurableCodex agent (subclass of Harbor's Codex) with sandbox file injection, multi-provider auth (CODEX_API_KEY bridge), and explicit agent type declaration via variant.toml.

- ConfigurableCodex: AGENTS.md + .agents/skills/ injection, DNS fix
- variant.toml required in every variant: agent = "claude" or "codex"
- Multi-provider _ensure_auth(): CODEX_API_KEY/OPENAI_API_KEY for Codex, Anthropic keys for Claude
- CLI shows agent type in run header
- 21 new tests (43 total)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- README: Quick start, Authentication (CODEX_API_KEY), Prerequisites, Project structure with variant.toml
- CLAUDE.md: package structure, variant conventions, harbor_config examples for both agents
- ARCHITECTURE.md: multi-agent class diagram, variant.toml docs
- benchmark-runner skill: Codex models, cross-agent comparison, token cost heuristic, troubleshooting
- benchmark-creator skill: variant.toml, Claude/Codex variant structure, cross-agent design pattern
- nasde-dev skill: skills added to documentation checklist

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

DDD benchmark: 3 Codex variants (codex-vanilla, codex-guided, codex-ntcoding-tactical-ddd) mirroring existing Claude variants.
Refactoring-skill benchmark: codex-vanilla variant added.
All variants in both examples have variant.toml.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

szjanikowski force-pushed the claude/codex-agent-support branch from ab2355a to 4ef7eb9 on March 23, 2026 10:28

Szymon Janikowski and others added 3 commits March 23, 2026 12:25

szjanikowski force-pushed the claude/codex-agent-support branch from 407488a to bb0dbbe on March 23, 2026 11:26

szjanikowski merged commit d10ddb5 into main Mar 23, 2026

szjanikowski deleted the claude/codex-agent-support branch April 22, 2026 12:36

szjanikowski merged 3 commits into main from claude/codex-agent-support
