Add OpenAI Codex agent support for cross-agent benchmarking #4
Merged
Introduce `ConfigurableCodex` agent (subclass of Harbor's `Codex`) with sandbox file injection, multi-provider auth (`CODEX_API_KEY` bridge), and explicit agent type declaration via `variant.toml`.

- `ConfigurableCodex`: `AGENTS.md` + `.agents/skills/` injection, DNS fix
- `variant.toml` required in every variant: `agent = "claude"` or `"codex"`
- Multi-provider `_ensure_auth()`: `CODEX_API_KEY`/`OPENAI_API_KEY` for Codex, Anthropic keys for Claude
- CLI shows agent type in run header
- 21 new tests (43 total)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- README: Quick start, Authentication (`CODEX_API_KEY`), Prerequisites, Project structure with `variant.toml`
- CLAUDE.md: package structure, variant conventions, `harbor_config` examples for both agents
- ARCHITECTURE.md: multi-agent class diagram, `variant.toml` docs
- benchmark-runner skill: Codex models, cross-agent comparison, token cost heuristic, troubleshooting
- benchmark-creator skill: `variant.toml`, Claude/Codex variant structure, cross-agent design pattern
- nasde-dev skill: skills added to documentation checklist

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- DDD benchmark: 3 Codex variants (`codex-vanilla`, `codex-guided`, `codex-ntcoding-tactical-ddd`) mirroring existing Claude variants.
- Refactoring-skill benchmark: `codex-vanilla` variant added.
- All variants in both examples have `variant.toml`.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
Add OpenAI Codex agent support to nasde-toolkit, enabling cross-agent benchmarking (Claude Code vs Codex on the same tasks).
New agent

- `ConfigurableCodex`: subclass of Harbor's built-in `Codex` agent with sandbox file injection (`AGENTS.md`, Codex skills)
- Auth bridge: `CODEX_API_KEY` → `OPENAI_API_KEY` (Harbor reads only `OPENAI_API_KEY`, but OpenAI recommends `CODEX_API_KEY`)
- […] `ConfigurableClaude`)

Explicit variant configuration via `variant.toml`

- Every variant requires `variant.toml`: `agent = "claude"` or `agent = "codex"`
- A missing `variant.toml` is a hard error: no guessing, no auto-detection
- `harbor_config.json` is auto-generated from `variant.toml` if absent

Multi-agent variant system
- `_collect_codex_skills()`: collects `agents_skills/<name>/` and injects into `/app/.agents/skills/<name>/` (Codex's skill directory convention)
- `_collect_claude_skills()`: refactored from inline code, collects `skills/<name>/SKILL.md` → `/app/.claude/skills/`

Multi-provider auth
- `_ensure_auth()` checks `CODEX_API_KEY`/`OPENAI_API_KEY` for Codex variants, Anthropic keys for Claude variants

CLI

- Shows agent type in run header
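Example benchmarks: Codex variants added

- DDD benchmark: 3 Codex variants (`codex-vanilla`, `codex-guided`, `codex-ntcoding-tactical-ddd`)
- Refactoring-skill benchmark: `codex-vanilla` variant added
- All variants in both examples have `variant.toml`

Skills updated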
Documentation

- […] `variant.toml`
- […] `variant.toml`

Tests: 21 new tests (43 total)
- `test_configurable_codex.py`: constructor, name, setup orchestration, missing file error
- `test_runner.py`: `load_variant_agent_type()` (valid/missing/invalid), `_is_codex_agent()`, `_generate_harbor_config()` for both agents, `_ensure_auth()` multi-provider

Verified end-to-end
DDD benchmark (`ddd-threshold-discount`)

Refactoring benchmark (`python-gilded-rose-polymorphism`)

All Opik traces verified with complete feedback scores.
Note on Codex models: Only `gpt-5-codex`, `gpt-5.3-codex`, `gpt-5.4`, and `gpt-5.4-mini` work via API key. Models like `codex-mini-latest` are subscription-only; `o3-mini` doesn't support required Codex tools.

Test plan
- […] `uv run pytest -v`)
- `gpt-5-codex` completes tasks on 2 benchmarks
- […] `codex-ntcoding-tactical-ddd`) works end-to-end
- Missing `variant.toml` produces clear error
- `_ensure_auth()` validates `CODEX_API_KEY` for Codex agents

🤖 Generated with Claude Code